International Journal of Electrical and Computer Engineering (IJECE)
Vol. 3, No. 6, December 2013, pp. 770~778
ISSN: 2088-8708

Real-Time Hand Gesture Recognition Based on the Depth Map for Human Robot Interaction

Minoo Hamissi (1), Karim Faez (2)
(1) Department of Electrical and Computer Engineering, Islamic Azad University of Qazvin
(2) Department of Electrical Engineering, Amirkabir University of Technology

Article history: Received Jul 19, 2013; Revised Oct 2, 2013; Accepted Oct 26, 2013

ABSTRACT
In this paper, we propose and implement a novel, real-time method for recognizing hand gestures using the depth map. The depth map contains information about the distance of objects from a viewpoint. Microsoft's Kinect sensor is used as the input device to capture both the color image and its corresponding depth map. We first detect the bare hand in a cluttered background using the distinct gray level of the hand, which is located nearest to the sensor. Then, the scale invariant feature transform (SIFT) algorithm is used to extract feature vectors. Lastly, a vocabulary tree together with k-means clustering is used to partition the hand postures into ten simple sets, the numbers “one”, “two”, “three”, “four”, “five”, “six”, “seven”, “eight”, “nine” and “ten”, based on the number of extended fingers. The vocabulary tree allows a larger and more discriminative vocabulary to be used efficiently, which improves clustering accuracy. The experimental results show the superiority of the proposed method over other available approaches; with this approach, we are able to recognize the 'numbers' gestures with over 90% accuracy.

Keywords: Depth map; Hand gesture recognition; Human robot; Vocabulary tree

Copyright © 2013 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Minoo Hamissi
Department of Electrical and Computer Engineering, Islamic Azad University of Qazvin
Nokhbegan Blvd., Qazvin, Iran
Phone: (+98-281) 3665275-3665276-3665277
Email: hamissi.minoo@gmail.com

1. INTRODUCTION
Nonverbal communication can be used efficiently for sending and receiving messages between people. Gestures and touch, body language and posture, physical distance, facial expression, and eye contact are all types of nonverbal communication. Hand gestures provide a suitable and efficient interface between human and computer. In particular, using hand gestures, simple commands can be transmitted to a computer or a personal robot in real time. In fact, hand gestures are mainly developed for wordless or visual communication in human computer interfaces (HCIs). However, recognizing hand gestures is very challenging in environments with cluttered backgrounds and variable illumination. Furthermore, real-time performance and recognition accuracy, the two main requirements of HCIs, have to be considered in this field. Several gesture recognition techniques have so far been proposed to meet these requirements. Early systems usually required markers or colored gloves to make the task easier, such as [1], which uses a dark glove with color-coded ring markers to detect the fingertips. Nevertheless, markers and gloves limit the user's convenience. Hand gesture recognition systems can be categorized into two classes [2]: 3-D hand model-based methods and appearance-based methods [3]. The 3-D hand model-based techniques [4-7] compare the input frames with the 2-D appearance projected by a 3-D hand model.
These methods have high degrees of freedom and are based on a 3-D kinematic hand model. This class has two major drawbacks. First, the 3-D hand models cover a wide class of hand gestures, so a huge image database is required to deal with all the characteristic shapes under several views. Second, feature extraction is difficult, and singularities arising from ambiguous views cannot be handled. The appearance-based techniques, which use 2-D image features, have attracted extensive interest so far. Real-time performance is the main advantage of this class. In these schemes, image features are extracted to model the hand, and these features are then compared with the video frames. One study [8] reported a method based on skin color in the image, but this method is very sensitive to lighting conditions and requires that no other skin-like object exist in the image. In [9], hand postures are represented in terms of hierarchies of multi-scale color image features at different scales, with qualitative inter-relations in terms of scale, position and orientation. Although the proposed algorithm shows real-time performance, it cannot recognize hand gestures in images where other skin-colored objects exist. In another work, Argyros et al. [10] introduced an algorithm for controlling a computer mouse via 2D and 3D hand gestures; this method is vulnerable to noise and variable illumination in cluttered backgrounds. Some researchers have focused on local invariant features [11-13]. In [11], the AdaBoost learning algorithm with SIFT features leads to rotation-invariant hand detection. SIFT [14] is a robust feature detection method that represents an image by its key-points, which provide rich local information about the image. However, additional features such as a contrast context histogram had to be used to achieve hand gesture recognition in real time. In order to achieve real-time performance and high recognition accuracy, Haar-like features and the AdaBoost learning algorithm were suggested by Chen et al. in [13]. Juan et al. [15] evaluated the performance of SIFT, principal component analysis (PCA)-SIFT, and speeded up robust features (SURF) in many experiments. The SIFT algorithm extracts features that are invariant to rotation and scale. PCA-SIFT, introduced in [16], employs PCA to normalize the gradient patch. SURF [17] relies on a Fast-Hessian detector and image convolutions to obtain robust features.

Here, we focus on bare hand gesture recognition without the help of any markers or gloves. To be robust against cluttered backgrounds and various lighting conditions, we use the depth map, which contains information about the distance of objects from a viewpoint. For this aim, the Kinect sensor is utilized to capture both the color image and its corresponding depth map. Using the depth map, the hand can be accurately detected according to its distinct gray level in our test environment. The detected hand is extracted by replacing the hand area with a black circle, and only the hand area is then saved in a small image, which is used for extracting features with the scale invariant feature transform (SIFT) algorithm. Lowe [14] first proposed the SIFT features, which are invariant to scale and orientation, partially invariant to illumination changes, and extremely distinctive.
Therefore, SIFT features are extracted from the hand-detected images. After this step, a vocabulary tree is trained offline by hierarchical k-means clustering. Next, a vocabulary tree weighted with term frequency-inverse document frequency (TF-IDF) weighting is built to recognize the numbers using k-nearest neighbors and voting. Figure 1 shows an overall picture of the system.

The remainder of the paper is organized as follows. In section 2, hand detection and the SIFT algorithm for feature extraction are explained. Section 3 describes offline training of the vocabulary tree using k-means clustering. Experimental results and a comparison with other state-of-the-art methods are discussed in section 4. Finally, section 5 concludes the paper.

2. HAND DETECTION AND FEATURE EXTRACTION USING SIFT ALGORITHM
2.1. Hand Detection
Having a reliable hand detector under cluttered backgrounds and various lighting conditions is the main requirement of our system. The Kinect 3-D camera [18], with its depth sensing capability, provides the depth image at 640×480 resolution and 30 fps. The depth information, which is captured by an infrared camera, is converted into a gray-scale image. Figure 2 shows an original depth map. In order to accurately extract the hands by judging the depth, the person's hands have to be in front of the body. Owing to the low contrast of the raw depth image, gray-level rescaling is required. By adjusting the scale factor, we make the body and the background invisible while the hand remains visible in gray scale. Therefore, the hand, the gray part, can be extracted from the depth map by thresholding the gray level. Finally, we obtain a hand-shape image for recognition. This procedure is depicted in Figure 3.

Figure 1. Illustrating the overall picture of the system

2.2. Feature Extraction using the Scale Invariant Feature Transform
The main properties of the SIFT algorithm that motivate us to apply it are its invariance to scale and rotation and its real-time extraction for low-resolution images. The SIFT algorithm extracts features in four stages:
First stage: A set of difference-of-Gaussian filters is applied at different scales all over the image, and the locations of potential interest points are computed.
Second stage: The potential points are refined by removing points of low contrast.
Third stage: An orientation is assigned to each key-point based on local image features.
Fourth stage: A local feature descriptor is computed at each key-point based on the local image gradient, transformed according to the orientation of the key-point to provide orientation invariance.
The feature vectors extracted from each hand image in this step are used to train our hand gesture recognition system. The number of key-points depends on the area of the detected hand; in fact, the 'five' gesture, which has the largest area, has the maximum number of key-points. In this work, the number of key-points obtained from the training images ranges from 30 to 78, and each key-point has a 128-dimensional feature vector. To obtain an accurate digit (number) recognition system, we used several training images from different people with different scales, orientations, and illumination conditions. Therefore, the training stage is time consuming; however, this does not affect the speed of the testing stage.
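As an illustration of this section, the sketch below implements the depth-rescaling, median-filtering, thresholding, and contour steps of the hand detector together with the SIFT descriptor extraction, using OpenCV and NumPy. The depth band, threshold value, and function names are assumptions made for the example; the paper does not publish its exact parameters or code.

```python
import cv2
import numpy as np

def segment_hand(depth_frame, near_mm=500, far_mm=900):
    """Depth-based hand segmentation (section 2.1), sketched.

    depth_frame: 640x480 uint16 array of distances in millimetres; how it is
                 grabbed from the Kinect driver is assumed, not shown.
    near_mm, far_mm: illustrative depth band in which the hand is expected,
                     i.e. nearer to the sensor than the body and background.
    Returns the rescaled gray image, the binary mask, and the hand contour."""
    # Rescale the gray levels so that only the near band stays visible (gray)
    # and the body/background become black, as described in the text.
    band = np.clip(depth_frame.astype(np.float32), near_mm, far_mm)
    gray = (255.0 * (far_mm - band) / (far_mm - near_mm)).astype(np.uint8)

    # Median filtering suppresses the speckle noise typical of depth maps.
    gray = cv2.medianBlur(gray, 5)

    # Threshold the gray level to isolate the hand blob.
    _, mask = cv2.threshold(gray, 10, 255, cv2.THRESH_BINARY)

    # Keep the largest contour as the hand.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    hand = max(contours, key=cv2.contourArea) if contours else None
    return gray, mask, hand


def extract_sift_features(hand_patch):
    """SIFT key-points and 128-D descriptors (section 2.2) from a cropped
    gray-scale hand image. Requires OpenCV >= 4.4 for cv2.SIFT_create()."""
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(hand_patch, None)
    return keypoints, descriptors  # descriptors: (num_keypoints, 128) float32
```

In such a setup, a crop of the rescaled gray image around the hand contour's bounding box would be passed to extract_sift_features, yielding key-point counts on the order of the 30 to 78 per training image mentioned above.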
In the next section, we discuss the training of the vocabulary tree using these feature vectors.

Figure 2. (a) Original color image in a cluttered background; (b) depth image; (c) rescaled image; (d) median filtering; (e) thresholding; (f) hand contour

Figure 3. Illustrating the recognition procedure with the vocabulary tree

3. OFFLINE TRAINING OF THE VOCABULARY TREE USING K-MEANS CLUSTERING
In this section, we implement gesture classification for ten finger postures using the vocabulary tree approach. The extracted SIFT features are hierarchically quantized in a vocabulary tree; in this way, each high-dimensional feature vector is quantized into an integer that corresponds to a path in the vocabulary tree. We trained the vocabulary tree using 150 images of each hand gesture, namely the numbers “one” through “ten”. Our training images were captured from different people under various conditions to increase the robustness of the classifier. Figure 4 depicts the training model. We first discuss the k-means clustering used to quantize the feature space. Then, building a weighted vocabulary tree using TF-IDF weighting and gesture recognition using k-nearest neighbors and voting are examined.

3.1. Building the Vocabulary Tree using k-Means Clustering
In the unsupervised training of the tree, the training data are first clustered into k = 4 centers using the k-means method [19]. The training data are then divided into k groups, where each group consists of the descriptor vectors closest to one cluster center. Afterwards, each group is partitioned into k new parts by the same k-means process. The process continues until the maximum number of levels L = 6 is reached. Figure 4 shows the process of building the vocabulary tree. After building the vocabulary tree, we have to train it (assigning node weights) according to the database. This is further detailed in the next section.

Figure 4. Illustrating the process of building a vocabulary tree using k-means (k = 3) with branch factor 3 and three levels

3.2. Setting the TF-IDF Weights of the Tree
Once the vocabulary tree is defined, we need to assign a weight wi to each node i in the vocabulary tree. Here TF-IDF, the product of term frequency and inverse document frequency, is used to assign the weights in the vocabulary tree. The TF-IDF weighting diminishes the weight of nodes that appear often in the database:

    wi = ln(N / Ni)                                  (1)

where N is the total number of images in the database and Ni is the number of images in the database with at least one key-point path through node i (the term frequency of node i can also be used in place of Ni). We define query (qi = ni wi) and database (di = mi wi) vectors, where ni and mi are the numbers of key-point vectors of the query and database image, respectively, with a path through node i. After assigning the weights, the scoring scheme is defined as

    s(q, d) = || q/||q|| - d/||d|| ||                (2)

where || . || denotes the L1-norm.

In Table 1, we investigate the effect of the number of levels (L) and the branch factor (k) on the performance of the proposed algorithm. To this aim, we vary the number of levels from 4 to 7 and the branch factor from 3 to 6. The results for the various modes indicate that L = 6 and k = 4 give the best compromise between the accuracy and the speed of training.

Table 1. Effect of the number of levels and branch factor on the algorithm accuracy and speed (training time)

                 k = 3              k = 4              k = 5              k = 6
Levels     Time(s)  Acc(%)    Time(s)  Acc(%)    Time(s)  Acc(%)    Time(s)  Acc(%)
L = 4        5.29   71.25       5.14   88.24       6.58   78.78       7.96   75.98
L = 5        6.24   73.45       6.69   90.12       7.36   79.25       8.24   78.65
L = 6        6.96   75.25       7.45   97.42       8.45   80.59       9.54   79.33
L = 7        7.47   68.47       8.66   85.97       9.24   74.75      10.23   73.12
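The hierarchical quantization of section 3.1, the TF-IDF weighting of equation (1), and the L1 scoring of equation (2) can be sketched as follows. The VocabularyTree class, the use of scikit-learn's KMeans, and the k-nearest-neighbor voting helper are illustrative assumptions that follow the structure described in this section rather than the authors' exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

class Node:
    """One node of the vocabulary tree."""
    def __init__(self, index):
        self.index = index      # position in the flat node list (for weights)
        self.center = None      # cluster center in 128-D SIFT space
        self.children = []

class VocabularyTree:
    def __init__(self, k=4, levels=6):    # k = branch factor, levels = L
        self.k, self.levels = k, levels
        self.nodes = []
        self.weights = None                # TF-IDF weight w_i per node, eq. (1)

    def fit(self, descriptors):
        """Hierarchical k-means over all training SIFT descriptors (section 3.1)."""
        self.root = self._split(np.asarray(descriptors, dtype=np.float32), 0)
        return self

    def _split(self, descriptors, depth):
        node = Node(len(self.nodes))
        self.nodes.append(node)
        if depth < self.levels and len(descriptors) >= self.k:
            km = KMeans(n_clusters=self.k, n_init=4).fit(descriptors)
            for c in range(self.k):
                child = self._split(descriptors[km.labels_ == c], depth + 1)
                child.center = km.cluster_centers_[c]
                node.children.append(child)
        return node

    def _path(self, descriptor):
        """Quantize one descriptor into the node indices along its tree path."""
        node, path = self.root, []
        while node.children:
            node = min(node.children,
                       key=lambda ch: np.linalg.norm(descriptor - ch.center))
            path.append(node.index)
        return path

    def image_vector(self, descriptors):
        """Per-image node occurrence counts (the n_i or m_i of the text)."""
        counts = np.zeros(len(self.nodes))
        for d in descriptors:
            for idx in self._path(d):
                counts[idx] += 1
        return counts

    def set_tfidf_weights(self, db_vectors):
        """Eq. (1): w_i = ln(N / N_i), computed over the database image vectors."""
        N = len(db_vectors)
        Ni = np.maximum((np.asarray(db_vectors) > 0).sum(axis=0), 1)
        self.weights = np.log(N / Ni)

def l1_score(q_vec, d_vec, weights):
    """Eq. (2): L1 distance between L1-normalized weighted query/database vectors."""
    q, d = q_vec * weights, d_vec * weights
    q, d = q / max(q.sum(), 1e-12), d / max(d.sum(), 1e-12)
    return np.abs(q - d).sum()

def classify(tree, query_descriptors, db_vectors, db_labels, knn=5):
    """Recognition by k-nearest-neighbor voting over the database scores."""
    q = tree.image_vector(query_descriptors)
    scores = [l1_score(q, d, tree.weights) for d in db_vectors]
    nearest = np.argsort(scores)[:knn]
    votes = [db_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```

With k = 4 and L = 6 this matches the configuration that Table 1 identifies as the best trade-off between training time and accuracy; the tree and the weighted database vectors are built once offline, so only quantization, scoring, and voting remain at query time.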
4. EXPERIMENTAL RESULTS
In order to evaluate the performance of the proposed method, it is simulated on both our own image data set and a public image data set. First, we present the experiment on our image data to show the performance of the proposed method in different situations. Then the method is simulated on the public image data set to show its advantages in comparison with other methods. We tested ten hand gestures as database images, which are shown in Figure 5. Figure 5 simply shows the detected hand gestures against a free background. The camera used for recording video files in our experiment is Microsoft's Kinect, which provides video capture at 640×480 resolution and 30 frames per second, which is adequate for real-time image recognition.

Figure 5. Ten detected hand gestures available in our image data set

The Sebastien Marcel database [20], a benchmark database in the field of hand gesture recognition, is also used as the public image database in this paper. This database contains 100×100-pixel color images of six hand postures performed by different people against uniform and complex backgrounds. In the training stage, we captured 100 training images for each new hand posture in the cluttered background.

The first experiment is performed on our image database to find the recognition accuracy of the proposed method. Our method has excellent recognition results on these images, as shown in Table 2. The overall recognition accuracy is 94.42% and the recognition time is about 15 milliseconds.

Figure 6. Our @Home robot in MRL at Qazvin Azad University

Table 2. Performance of the proposed method on our image database

Posture    Recognition accuracy    Recognition time (ms/frame)
One        98.25%                  14.1
Two        95.85%                  14.3
Three      94.12%                  16.2
Four       95.23%                  15.1
Five       95.77%                  15.3
Six        94.42%                  14.2
Seven      96.12%                  16.5
Eight      95.33%                  16.4
Nine       97.85%                  15.3
Ten        96.23%                  14.1

In the second experiment, to examine the robustness of our method against different scales, rotations, and illumination conditions, we test each gesture with several images captured in different situations. The results for 1000 images are summarized in Table 3; recognition time is reported in milliseconds.

Table 3. Performance of the proposed method for some typical gestures against scaling and rotation under different illumination conditions

Gesture    Recognition accuracy    Recognition time (ms)
“two”      86.28%                  86
“five”     83.21%                  95
“seven”    81.33%                  76
“eight”    75.56%                  88
“ten”      76.14%                  89

Finally, to compare our gesture recognition algorithm with other schemes, we selected methods [20]-[24], [2] and [12], which have real-time performance. We report the recognition time and accuracy (%) on the Sebastien Marcel database in Table 4. As can be observed from the table, the proposed method outperforms the other methods in recognition accuracy.

Table 4. Comparison among our method and other real-time methods

Method        Postures   Background      Frame resolution   Recognition time (s)   Recognition accuracy
[21]          15         Wall            160×120            0.4                    94.89%
[22]          6          Cluttered       320×240            0.09-0.11              93.8%
[23]          3          Different       640×480            0.1333                 96.2%
[12]          4          White wall      320×240            0.03                   90.0%
[24]          8          Not discussed   640×480            0.066667               96.9%
[20]          6          Cluttered       100×100            Not discussed          76.1%
[2]           10         Cluttered       640×480            0.017                  96.23%
Our method    10         Cluttered       640×480            0.025                  97.42%
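For completeness, the following is a small sketch of the end-to-end per-frame loop of Figure 1, tying together the hand-segmentation, SIFT, and vocabulary-tree sketches given earlier and timing each frame. The get_depth_frame grabber is an assumed driver interface, and the latency printed by this toy loop is illustrative only; the per-frame figures in Table 2 come from the authors' implementation, not from this sketch.

```python
import time
import cv2

def recognize_stream(get_depth_frame, tree, db_vectors, db_labels):
    """Run the full pipeline on a stream of Kinect depth frames.

    get_depth_frame: callable returning the next 640x480 depth frame in mm,
                     or None when the stream ends (driver interface assumed).
    tree, db_vectors, db_labels: trained VocabularyTree and database image
                                 vectors/labels from the previous sketches."""
    while True:
        depth = get_depth_frame()
        if depth is None:
            break
        start = time.perf_counter()

        gray, mask, hand = segment_hand(depth)           # section 2.1 sketch
        if hand is None:
            continue
        x, y, w, h = cv2.boundingRect(hand)
        patch = gray[y:y + h, x:x + w]                   # small hand-only image

        _, desc = extract_sift_features(patch)           # section 2.2 sketch
        if desc is None or len(desc) == 0:
            continue

        label = classify(tree, desc, db_vectors, db_labels)  # section 3 sketch
        elapsed_ms = 1000.0 * (time.perf_counter() - start)
        print(f"gesture: {label}  ({elapsed_ms:.1f} ms/frame)")
```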
The proposed method is used in the @home robot, which is an autonomous robot. This robot focuses on real-world applications and can assist humans in everyday life. Figure 6 shows our robot in the Mechatronics Research Laboratory (MRL) at Qazvin Azad University. Our experiments with the @home robot verify that our algorithm is suitable for real-time applications. Some typical hand-detected images in the cluttered background with their feature points are shown in Figure 7. These postures are recognized in real time by the vision section of the robot (the Kinect sensor on top of its head) and can be used as commands to control the robot.

Figure 7. Some typical hand-detected images in the cluttered background with their feature points shown as small green circles

5. CONCLUSION
We proposed a method for hand gesture recognition using Microsoft's Kinect sensor. The Kinect captures depth information with an infrared camera and provides a gray-scale depth image. From the depth image, the hand can be detected based on its distinct gray level. We found that the hand detection process is independent of background and illumination when the hand is located nearer to the sensor than other objects. After hand detection, we trained a vocabulary tree to recognize hand gestures. We experimentally obtained the best parameters (number of levels and branch factor) of the vocabulary tree for proper recognition accuracy. Extensive simulations on both our image dataset and the Sebastien Marcel database show the superiority of the scheme in comparison to other state-of-the-art methods. Finally, we implemented our algorithm on the @home robot. The results confirm that the proposed method has significant accuracy and can be used in real-time applications.

REFERENCES
[1] A El-Sawah, N Georganas, E Petriu. “A prototype for 3-D hand tracking and posture estimation”. IEEE Trans. Instrum. Meas. 2008; 57(8): 1627-1636.
[2] NH Dardas, ND Georganas. “Real-Time Hand Gesture Detection and Recognition Using Bag-of-Features and Support Vector Machine Techniques”. IEEE Transactions on Instrumentation and Measurement. 2011; 60(11): 3592-3607.
[3] H Zhou, T Huang. “Tracking articulated hand motion with Eigen dynamics analysis”. Proc. Int. Conf. Comput. Vis. 2003; 2: 1102-1109.
[4] JM Rehg, T Kanade. “Visual tracking of high DOF articulated structures: An application to human hand tracking”. Proc. Eur. Conf. Comput. Vis. 1994: 35-46.
[5] AJ Heap, DC Hogg. “Towards 3-D hand tracking using a deformable model”. Proc. 2nd Int. Face Gesture Recog. Conf., Killington, VT. 1996: 140-145.
[6] Y Wu, JY Lin, TS Huang. “Capturing natural hand articulation”. Proc. 8th Int. Conf. Comput. Vis., Vancouver, BC, Canada. 2001; II: 426-432.
[7] B Stenger, PRS Mendonça, R Cipolla. “Model-based 3D tracking of an articulated hand”. Proc. Brit. Mach. Vis. Conf., Manchester, U.K. 2001; I: 63-72.
[8] B Stenger. “Template based hand pose recognition using multiple cues”. Proc. 7th ACCV. 2006: 551-560.
[9] L Bretzner, I Laptev, T Lindeberg. “Hand gesture recognition using multi-scale colour features, hierarchical models and particle filtering”. Proc. Int. Conf. Autom. Face Gesture Recog. 2002.
[10] A Argyros, M Lourakis. “Vision-based interpretation of hand gestures for remote control of a computer mouse”. Proc. Workshop Comput. Human Interact. 2006: 40-51.
[11] C Wang, K Wang. “Hand Gesture Recognition Using Adaboost With SIFT for Human Robot Interaction”. Springer-Verlag. 2008; 370.
[12] A Barczak, F Dadgostar. “Real-time hand tracking using a set of co-operative classifiers based on Haar-like features”. Res. Lett. Inf. Math. Sci. 2005; 7: 29-42.
[13] Q Chen, N Georganas, E Petriu. “Real-time vision-based hand gesture recognition using Haar-like features”. Proc. IEEE IMTC. 2007: 1-6.
[14] DG Lowe. “Distinctive image features from scale-invariant keypoints”. Int. J. Comput. Vis. 2004; 60(2): 91-110.
[15] L Juan, O Gwun. “A comparison of SIFT, PCA-SIFT and SURF”. Int. J. Image Process. (IJIP). 2009; 3(4): 143-152.
[16] Y Ke, R Sukthankar. “PCA-SIFT: A more distinctive representation for local image descriptors”. Proc. IEEE Conf. Comput. Vis. Pattern Recog. 2004: II-506–II-513.
[17] H Bay, A Ess, T Tuytelaars, L Van Gool. “SURF: Speeded up robust features”. Comput. Vis. Image Understand. (CVIU). 2008; 110(3): 346-359.
[18] Microsoft Corp., Redmond, WA. Kinect for Xbox 360.
[19] DJC MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge, U.K.: Cambridge Univ. Press. 2003.
[20] S Marcel. “Hand posture recognition in a body-face centered space”. Proc. Conf. Human Factors Comput. Syst. (CHI). 1999: 302-303.
[21] W Chung, X Wu, Y Xu. “A real time hand gesture recognition based on Haar wavelet representation”. Proc. IEEE Int. Conf. Robot. Biomimetics. 2009: 336-341.
[22] Y Fang, K Wang, J Cheng, H Lu. “A real-time hand gesture recognition method”. Proc. IEEE Int. Conf. Multimedia Expo. 2007: 995-998.
[23] L Yun, Z Peng. “An automatic hand gesture recognition system based on Viola-Jones method and SVMs”. Proc. 2nd Int. Workshop Comput. Sci. Eng. 2009: 72-76.
[24] Y Ren, C Gu. “Real-time hand gesture recognition based on vision”. Proc. Edutainment. 2010: 468-475.

BIOGRAPHIES OF AUTHORS

Minoo Hamissi received the B.Sc. degree from Qazvin Azad University, Qazvin, Iran, in 2009, where she is currently pursuing the M.Sc. degree, both in computer engineering.

Karim Faez was born in Semnan, Iran. He received his B.Sc. degree in Electrical Engineering from Tehran Polytechnic University, ranked first in his class, in June 1973, and his M.Sc. and Ph.D. degrees in Computer Science from the University of California, Los Angeles (UCLA) in 1977 and 1980, respectively. Professor Faez was with the Iran Telecommunication Research Center (1981-1983) before joining Amirkabir University of Technology (Tehran Polytechnic), Iran, in March 1983, where he holds the rank of Professor in the Electrical Engineering Department. He was the founder of the Computer Engineering Department of Amirkabir University in 1989 and served as its first chairman from April 1989 to September 1992. Professor Faez was the chairman of the planning committee for Computer Engineering and Computer Science of the Ministry of Science, Research and Technology (1988-1996).
His research interests are in biometric recognition and authentication, pattern recognition, image processing, neural networks, signal processing, Farsi handwriting processing, earthquake signal processing, fault-tolerant system design, computer networks, and hardware design. Dr. Faez coauthored a book on logic circuits published by Amirkabir University Press. He also coauthored a chapter in the book Recent Advances in Simulated Evolution and Learning, Advances in Natural Computation, Vol. 2, World Scientific, Aug. 2004. He has published about 300 articles in the above areas. He is a member of IEEE, IEICE, and ACM, and a member of the editorial committees of the Journal of the Iranian Association of Electrical and Electronics Engineers and the International Journal of Communication Engineering. Emails: kfaez@aut.ac.ir, kfaez@ieee.org, kfaez@m.ieice.org