The fundamental objective of this Ph.D. thesis is to gain more insight into what is involved in the practical application of a computer vision system when the conditions of use cannot be controlled completely. The basic assumption is that research on isolated aspects of computer vision often leads to `too' general solutions: solutions that lack the robustness and accuracy which can only be achieved by an integral approach to a specific application. Furthermore, an integral approach, and actually trying out a computer vision system in practice, can lead to new insights that determine the direction of future research in computer vision.

The application for the research in this thesis is automatic sign recognition for feedback in active learning with an electronic learning environment for sign language. The goal of this learning environment is to enlarge the vocabulary of deaf and hard-of-hearing children between the ages of 3 and 5, in order to help reduce a delay in language development. The research has focused on a number of aspects that were assumed to have the most influence on the robustness of sign recognition: tracking of movements; the extraction of relevant structural information from an image; skin color detection; including the third dimension of hand locations; dealing with variation in both the timing and the shape of a sign; and reducing the effort required to teach the system to recognize a new sign.

`Particle filtering' is a popular method to track hand movement. However, tests with the CONDENSATION algorithm show contradictory requirements for dealing with different situations. When the motion is unpredictable (as is the case when tracking human hands), a particle filter has difficulty keeping track of the object. It turns out that, under different conditions, different strategies are required to deal with this in the best possible way. Isophote properties can be used as local abstractions of an image.
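To make this concrete, one standard isophote property is the isophote curvature, which can be computed directly from the image derivatives as kappa = -(Ly^2 Lxx - 2 Lx Ly Lxy + Lx^2 Lyy) / (Lx^2 + Ly^2)^(3/2). The sketch below, which is an illustration rather than the exact feature computation used in the thesis, approximates the derivatives with finite differences; the function name and the eps regularization are our own assumptions.

```python
import numpy as np

def isophote_curvature(L, eps=1e-12):
    """Curvature of the iso-intensity contours (isophotes) of image L.

    Uses the standard expression
        kappa = -(Ly^2 Lxx - 2 Lx Ly Lxy + Lx^2 Lyy) / (Lx^2 + Ly^2)^(3/2)
    with derivatives approximated by finite differences.
    Illustrative sketch; not the exact computation from the thesis.
    """
    Ly, Lx = np.gradient(L.astype(float))   # axis 0 is y (rows), axis 1 is x
    Lyy, _ = np.gradient(Ly)
    Lxy, Lxx = np.gradient(Lx)              # Lxy = d(Lx)/dy, the mixed derivative
    num = -(Ly**2 * Lxx - 2.0 * Lx * Ly * Lxy + Lx**2 * Lyy)
    den = (Lx**2 + Ly**2) ** 1.5 + eps      # eps guards flat (zero-gradient) regions
    return num / den
```

Because numerator and denominator are both third order in the derivatives, multiplying the image intensities by a constant leaves the curvature unchanged, which illustrates the contrast independence mentioned below.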
One advantage of isophote properties is that they are independent of image contrast. In experiments with face detection using isophote properties, the results are superior to those obtained with pixels, gradients, or the popular Haar features.

Because face detection incurs a significant computational cost, and the methods involved are less suitable for the detection of hands, it is appealing to detect these body parts on the basis of their color alone. Unfortunately, in practical situations color behaves less predictably than a single light-reflection model can describe. Deviations from physical models of reflection are caused by the properties and settings of the camera that is used, but also by the combination of different light sources and reflections. By capturing these uncertainties in a more general model, robustness can be obtained under unknown circumstances. Unfortunately, this generalization comes at the price of accuracy under more friendly conditions. To combine robustness with accuracy, we have proposed an adaptive chromatic model, which can use a small set of measurements to model the color variation of a face, using a bi-modal piecewise-linear model in red/green/blue space.

Sign language takes place in a three-dimensional space, while images only allow measurements in two dimensions. Therefore, we have used stereometry to convert the hand locations measured in the images from two cameras into three-dimensional positions of the hands in space. The experiments show that this richer information does indeed lead to an improvement in sign recognition. Alternatively, the perspective of a single wide-angle camera at a short distance turned out to achieve a comparable improvement. However, the disadvantage of the latter solution is decreased robustness, because perspective depends strongly on the location of a person relative to the camera.
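For a rectified stereo pair, this stereometric back-projection reduces to the standard disparity relation Z = fB/d. The sketch below is a minimal illustration under that assumption (known focal length f in pixels, baseline B, principal point at the image center); it is not the calibration or reconstruction procedure used in the thesis.

```python
import numpy as np

def triangulate_rectified(xl, yl, xr, f, B):
    """Back-project a matched point pair from a rectified stereo rig
    into 3-D coordinates of the left camera (illustrative sketch).

    xl, yl: pixel coordinates in the left image, principal point at (0, 0)
    xr:     x coordinate of the same point in the right image
    f:      focal length in pixels
    B:      baseline, in the same unit as the returned coordinates
    """
    d = xl - xr            # disparity: larger disparity means a closer point
    Z = f * B / d          # depth from the standard rectified-stereo relation
    X = xl * Z / f         # back-project through the pinhole model
    Y = yl * Z / f
    return np.array([X, Y, Z])
```

In practice the images must first be rectified (and the rig calibrated) before this relation holds, which is where much of the real effort lies.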
Using dynamic recognition methods, such as ``Hidden Markov Models'' (HMM) or ``Statistical Dynamic Time Warping'' (SDTW), a sequence of measured features of a person can be recognized as a specific sign. These models are able to deal with differences in tempo, contrary to conventional methods of pattern recognition, which can only deal with a fixed set of features. However, one of the disadvantages of HMM and SDTW is that they assume that what is important for estimating the time warping is equally important for recognizing the class. Furthermore, they are based on a factorization of probabilities over different time points, which prevents the modeling of dependencies between measurements at different time steps. For these reasons, we have proposed to separate time warping and classification into subsequent processing steps. Experiments show a significant improvement over HMM or SDTW alone.

In practice, it is difficult to obtain many examples of signs from different persons with which to train a recognition system. To make the system robust to small training sets, we let it make use of sign classes that were already trained with many examples. Here, we assumed that, when a part of a new sign is very similar to a part of a learned sign, its variation can be modeled in the same way. With a single example as training material, this generalizing system performed comparably to the regular training method with five examples.

From this thesis, it can be concluded that robustness is not only relevant for practical applications of computer vision, but also deserves a place in fundamental research. Combining vantage points from different disciplines, such as physics, machine learning, neuropsychology, and human-computer interaction, ensures that all aspects of a computer vision process are taken into account integrally. In this way, more robust solutions can be obtained than with any of these disciplines separately.
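As an illustrative aside, the separation of time warping and classification discussed above can be sketched with classic dynamic time warping: a variable-length sequence is first warped onto a fixed-length reference, after which any conventional fixed-length classifier can be applied. The function below shows only the warping step, with an assumed Euclidean frame distance and simple averaging of mapped frames; it is a sketch of the general idea, not the SDTW formulation from the thesis.

```python
import numpy as np

def dtw_align(query, ref):
    """Warp `query` (n x d) onto the length of `ref` (m x d) with classic
    dynamic time warping, so a conventional fixed-length classifier can be
    applied afterwards. Illustrative sketch, not the thesis's exact method."""
    n, m = len(query), len(ref)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(query[i - 1] - ref[j - 1])  # Euclidean frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack the optimal warping path from (n, m) to (1, 1)
    i, j = n, m
    path = [(i, j)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
        i, j = min(steps, key=lambda s: cost[s])
        path.append((i, j))
    # average the query frames mapped onto each reference frame
    warped = np.zeros_like(ref, dtype=float)
    counts = np.zeros(m)
    for i, j in path:
        warped[j - 1] += query[i - 1]
        counts[j - 1] += 1
    return warped / counts[:, None]
```

The point of the two-step scheme is that the features driving the alignment need not be the features driving the classification, which the joint models cannot express.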