Real-time recognition of Indonesian sign language using recurrent neural network

ABSTRACT


INTRODUCTION
Sign language is a unique form of communication that does not rely on the sounds of human speech or written symbols. Deaf individuals typically use sign language as their primary means of communication. According to data from the World Health Organization (WHO) as of March 15, 2018, there were 466 million deaf individuals, including 34 million children, constituting more than 5% of the global population [1].
In everyday life, deaf individuals often face challenges when communicating with those who do not understand sign language. Therefore, there is a growing need for sign language translators to facilitate communication between deaf individuals and those unfamiliar with sign language. One solution is to convert sign language into written text.
Various methods have been implemented to convert hand gestures to text. For instance, [2] utilized the Sign Language MNIST dataset with a modified Inception V3 model of a convolutional neural network (CNN). Meanwhile, [3] developed continuous sign language recognition using a recurrent neural network as the sequence learning module on multimodal data. Another approach involves using an adapted deep convolutional neural network with the Hand Gesture Dataset LSP [4]. Video-based sign language recognition has been proposed using a hierarchical attention network with latent space [5]. In another study, a lightweight model based on You Only Look Once (YOLO) v3 was developed without any further preprocessing [6]. Yet another study also utilized a CNN for real-time dynamic hand gestures [7].
The methods employed in converting hand gestures into text across various sign languages exhibit weaknesses, as they rely heavily on input accuracy and hand positioning to generate accurate outputs. In addition, research on Indonesian sign language is still limited. Some studies focus only on Sistem Isyarat Bahasa Indonesia (SIBI) [8][9][10], whereas most deaf Indonesians prefer to use BISINDO [11] instead of SIBI. Given the shortcomings identified in previous studies, this research aims to develop a hand gesture-to-text conversion system for BISINDO using the Recurrent Neural Network (RNN) method. The RNN approach is anticipated to address issues related to input accuracy, specifically in the domain of feature extraction, while also facilitating real-time implementation. Previous work by Heyuan Guo, Yang Yang, and Hua Cai [12] employed an RNN to convert diverse gestures into text, albeit with the use of a Kinect for hand movement detection. In contrast, this research utilizes real-time camera input.
The structure of this paper is as follows: Section 2 provides a brief explanation of RNNs. Section 3 then outlines the method, followed by the results and discussion. Finally, the conclusion is summarized in the last section.

RECURRENT NEURAL NETWORK
Recurrent Neural Network (RNN) is a form of Artificial Neural Network (ANN) architecture specifically designed to process sequential data. An RNN does not discard information from the past in its learning process. This is what distinguishes RNNs from ordinary ANNs. An RNN is able to store memories that allow it to recognize data patterns well and then use them to make accurate predictions.
In each process, the output is not only a function of the current sample, but is also based on the internal state that results from processing previous samples. RNNs are able to store information from previous samples through loops within their architecture, which automatically keep information from the past. The general architecture of an RNN can be seen in Fig. 1.

Fig. 1. Architecture of an RNN
In theory, an RNN can process sequential data of any length. The amount of previous data that becomes input feedback for the next stage is as long as the input pattern to be recognized. However, in practice, the recurrent topology is expressed by cloning as many RNN units as the number of backward steps required. These rows of clones form a long chain of neural networks. Because training uses backpropagation, as in neural networks in general, chains that are too long tend to suffer from the vanishing gradient problem, which causes the gradient value to become very small or very large when the input sequence is too long. The equations used to find the hidden state and output values are shown in (1) and (2).
$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1}) \qquad (1)$$
$$y_t = W_{hy} h_t \qquad (2)$$
where $x_t$ is the input at time step $t$, $W_{xh}$ is the weight that maps the input $x_t$ to the state $h_t$, $W_{hh}$ is the weight that maps the previous state $h_{t-1}$ to the current state $h_t$, $y_t$ is the output at time step $t$, $W_{hy}$ is the weight that maps the computed state to the output, and $f$ is an activation function such as tanh, sigmoid, or ReLU.
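To make equations (1) and (2) concrete, the following minimal NumPy sketch (not part of the original implementation) computes one vanilla RNN step; the feature size of 126, the 64 hidden units, and the 36 output classes are illustrative values, not parameters taken from this study.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One step of a vanilla RNN following equations (1) and (2)."""
    # (1) new hidden state from the current input and the previous state
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    # (2) output computed from the new hidden state
    y_t = W_hy @ h_t
    return h_t, y_t

# Illustrative sizes: 126 keypoint features per frame, 64 hidden units, 36 classes
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((64, 126)) * 0.01
W_hh = rng.standard_normal((64, 64)) * 0.01
W_hy = rng.standard_normal((36, 64)) * 0.01

h = np.zeros(64)
for x_t in rng.standard_normal((30, 126)):  # a dummy sequence of 30 frames
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy)
```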
Training an RNN is difficult because of the vanishing/exploding gradient problem. To overcome this, a variant of the RNN called LSTM was introduced [13]. Long Short-Term Memory (LSTM) is a variant of the RNN that can overcome the exploding/vanishing gradient problem, which can otherwise lead to poor accuracy when training a model. Fig. 2 shows the structure of the LSTM. An LSTM has three gates that distinguish it from a regular vanilla RNN, namely the forget gate, the input gate, and the output gate. In the forget gate, information that is not needed is discarded using a sigmoid function so as not to cause exploding/vanishing gradients. The equation for the forget gate can be seen in (3). In the input gate, the information to be written to the cell state is also selected using the sigmoid activation function. The equation for the input gate can be seen in (4). Then, in the output gate, the sigmoid function is applied again to obtain the output value in the hidden state, and the cell state is passed through the tanh activation function. The equations for the output gate, cell state, and hidden state can be seen in (5), (6), and (7).
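For reference, the standard LSTM gate formulation that the description above follows is given below, with the numbering assumed to correspond to equations (3)-(7); here $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \qquad (3)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \qquad (4)$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \qquad (5)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_C [h_{t-1}, x_t] + b_C) \qquad (6)$$
$$h_t = o_t \odot \tanh(C_t) \qquad (7)$$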

METHODS
The method used for sign language recognition is implemented using a Recurrent Neural Network (RNN) programmed in Python. Videos are captured using a Logitech C922 camera, and frames are captured at a rate of 10 frames per second (FPS). The process involved in hand gesture-to-text conversion can be observed in Fig. 4. Testing is conducted by calculating the success rate of hand gesture-to-text conversion. This test employs the confusion matrix method, which compares the system's classification results with the actual classification results. The confusion matrix is shown in Table 1.
True Positive (TP) represents positive data correctly identified by the system, while False Positive (FP) denotes data incorrectly identified as positive. True Negative (TN) represents negative data correctly identified, while False Negative (FN) indicates data incorrectly identified as negative.
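The accuracy referred to later as Equation 8 is assumed here to be the standard confusion-matrix accuracy:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \qquad (8)$$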

Dataset Preprocessing Stage
Before the data can be utilized in RNN training, it undergoes a preprocessing phase to simplify the training process. This preprocessing consists of two steps: converting video data into frames and transforming each frame into an array.
Prior to data processing, the acquired videos are first converted into frames. From the 90 videos captured per class, 30 frames are selected from each video, resulting in a total of 97,200 frames.
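As an illustration of this step, a minimal sketch using OpenCV to sample 30 evenly spaced frames from a recorded video is given below; the file path and directory layout are assumptions, not the layout used in this study.

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=30):
    """Read a video and return num_frames evenly spaced frames as a list of images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

frames = sample_frames("dataset/A/video_001.mp4")  # illustrative path
```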
The dataset, now in the form of frames, needs further conversion into arrays. This step is necessary because feature extraction is performed using MediaPipe (MP) Holistic, resulting in array data containing the hand positions for each frame. The array data are stored in .npy format.
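The following sketch illustrates this feature-extraction step with MediaPipe Holistic, assuming that only the left- and right-hand landmarks are kept (21 points each, giving 126 values per frame) and that one .npy file is saved per frame; the exact feature composition and storage layout used in this study may differ.

```python
import os
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def extract_keypoints(results):
    """Flatten left- and right-hand landmarks into one vector (zeros when a hand is not detected)."""
    lh = (np.array([[p.x, p.y, p.z] for p in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[p.x, p.y, p.z] for p in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([lh, rh])

out_dir = "dataset_npy/A/video_001"   # illustrative output path
os.makedirs(out_dir, exist_ok=True)

with mp_holistic.Holistic(min_detection_confidence=0.5, min_tracking_confidence=0.5) as holistic:
    for i, frame in enumerate(frames):  # frames produced by the sampling step above
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        keypoints = extract_keypoints(results)
        # One .npy file per frame, matching the per-frame array data described above
        np.save(os.path.join(out_dir, f"{i}.npy"), keypoints)
```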

RNN Training
Before commencing the training process, it is essential to determine the RNN architecture to be utilized. In this study, several tests were conducted to identify the architecture with the most optimal parameters.
For this research, the hardware employed during training includes an NVIDIA GeForce 1650 graphics card, an Intel i7 processor, and 8 GB of RAM. The software supporting the training process includes the NVIDIA GPU drivers, TensorFlow 2.3.1, Python 3.7, and other necessary components. Each trained model is saved in .h5 format.
During the training process using the RNN method, the optimizers evaluated are the Adam optimizer and the Stochastic Gradient Descent (SGD) optimizer.
During the RNN training process, achieving the best model is an iterative process involving trial and error. The chosen model should exhibit both the highest accuracy value and the lowest loss value. This stage entails modifying various parameters within the RNN, including the optimizer, learning rate, and the number of epochs.
To begin the RNN training process, the first step is to determine the most optimal optimizer. In this study, two training sessions were conducted, one using the Adam optimizer and one using the Stochastic Gradient Descent (SGD) optimizer. The comparison of the results obtained with these two optimizers is presented in Fig. 7.

Fig. 7. Comparison of loss for the Adam and SGD optimizers
From the figure, it is evident that the Adam optimizer is more effective in minimizing the loss value. Consequently, the Adam optimizer is employed in this research.
The value of the learning rate can also significantly impact the model's loss value and accuracy. In this study, learning rate parameters of 0.01, 0.001, and 0.0001 are used. Table 2 displays the effects of the learning rate on both the loss value and accuracy. From Table 2, it is evident that a learning rate of 0.0001 yields the smallest loss value and the highest accuracy. Consequently, in this study, the optimal learning rate utilized is 0.0001.
In the RNN training process, the number of epochs also has a significant impact on accuracy and loss values. To determine the optimal number of epochs, this research conducted training for both 300 and 500 epochs while maintaining consistent parameters, including a learning rate of 0.0001, 7 hidden layers, a dropout rate of 0.2, and the Adam optimizer. Table 3 illustrates the effects of the number of epochs on both the loss value and accuracy. From the results in Table 3, it is evident that the epoch value influences the accuracy. In the subsequent tests to find the optimal model, two models are evaluated: one trained with 300 epochs and another with 500 epochs.
Testing aims to evaluate the performance of the RNN LSTM model obtained during the training phase for real-time Indonesian sign language recognition. This testing process was repeated five times for each class, with five different respondents participating. It was conducted on the two best models obtained, which were trained for 300 and 500 epochs, respectively. Both models share the same parameters: a learning rate of 0.0001, three LSTM layers, three Dense layers, a dropout rate of 0.2, and the Adam optimizer.
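A Keras sketch of a model matching these stated parameters (three LSTM layers, three Dense layers, a dropout rate of 0.2, and the Adam optimizer with a learning rate of 0.0001) is shown below; the unit counts per layer, the dropout placement, and the input feature size are illustrative assumptions, since they are not specified here.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

num_classes = 36    # 26 letters + 10 numbers
feature_dim = 126   # illustrative: flattened hand keypoints per frame

model = Sequential([
    # Three LSTM layers; the unit counts are illustrative, not taken from this study
    LSTM(64, return_sequences=True, input_shape=(30, feature_dim)),
    LSTM(128, return_sequences=True),
    LSTM(64, return_sequences=False),
    Dropout(0.2),   # dropout rate of 0.2 as stated above; its placement is assumed
    # Three Dense layers, ending in a softmax over the 36 classes
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(num_classes, activation='softmax'),
])

model.compile(optimizer=Adam(learning_rate=0.0001),
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

# Placeholder arrays shaped like the preprocessed dataset (replace with the real .npy data)
X_train = np.zeros((10, 30, feature_dim), dtype=np.float32)
y_train = np.zeros((10, num_classes), dtype=np.float32)
model.fit(X_train, y_train, epochs=5)   # this study trains for 300 or 500 epochs
model.save('bisindo_rnn.h5')            # saved in .h5 format, as described above
```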
The real-time tests showed a test accuracy of 66.89% for the 300-epoch model and 81.67% for the 500-epoch model. Following the selection of the best model based on this comparison of the two epoch settings, the model underwent further testing to assess its performance in recognizing sign language under varying room lighting conditions. The model selected for this testing phase is the one trained with 500 epochs. Two tests were conducted, one in bright lighting and another in dim lighting.
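As an illustration of how such real-time testing can be wired up, the sketch below feeds a rolling 30-frame window of keypoints from the webcam into the trained model; it reuses mp_holistic and extract_keypoints from the preprocessing sketch above, and the model file name, label order, and confidence threshold are illustrative assumptions.

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model('bisindo_rnn.h5')   # illustrative file name for the trained 500-epoch model
labels = [chr(ord('A') + i) for i in range(26)] + [str(n) for n in range(1, 11)]

cap = cv2.VideoCapture(0)              # the Logitech C922 webcam
window = []                            # rolling buffer of keypoints from the last 30 frames

with mp_holistic.Holistic(min_detection_confidence=0.5, min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        window.append(extract_keypoints(results))   # extract_keypoints from the preprocessing sketch
        window = window[-30:]
        if len(window) == 30:
            probs = model.predict(np.expand_dims(window, axis=0), verbose=0)[0]
            if probs.max() > 0.8:                    # illustrative confidence threshold
                print(labels[int(probs.argmax())])
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
cap.release()
```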

Fig. 4. System design flowchart
Fig. 5. Example of the collected data
Fig. 6. Example of the collected data

RESULTS AND DISCUSSION

Data Collection Process
This research begins with the data collection phase, involving the recording of videos that are later processed by MP Holistic to create arrays. These arrays then undergo training using the Recurrent Neural Network (RNN) method. For this study, a Logitech C922 webcam is used to collect the video dataset. During the data collection process, the respondents are positioned 60 cm away from the camera, placed in the center, and facing it. The data collection room is well lit to ensure the camera captures the necessary letter and number gestures. Three Electrical Engineering students from the Faculty of Engineering at Universitas Sriwijaya participated in this data collection. The study incorporates 36 classes, comprising 26 letters of the alphabet (A to Z) and 10 numbers (1 to 10). Each class consists of 90 videos, each containing 30 frames, which are converted into arrays, resulting in a total of 97,200 npy data files. Thirty video samples are collected from each respondent for each class. Examples of the collected data can be viewed in Figs. 5 and 6.

Table 1 contains both the predicted values, provided by the RNN system, and the actual values that have been determined. The accuracy value can be calculated using Equation 8.

Table 2. Loss and Accuracy from the Effect of Learning Rate (300 Epochs)

Table 3. Loss and Accuracy Results from the Effect of Epochs (Learning Rate 0.0001)

Table 4. Test Results from the Model with 500 Epochs
Table 4 presents the test results for the model trained with 500 epochs.