
1 Introduction

Communication is the process of conveying fundamental information, such as emotions and thoughts, to another party through a variety of means. Although communication is a multifaceted process, language is its most effective component, and it allows humans to carry out their daily tasks with relative ease. While language speeds up communication, it is inaccessible to many individuals with hearing impairments. Every country has a sign language shaped by the structure of its spoken language; however, because sign language and standard grammar often differ, it is challenging for hearing-impaired individuals to become literate. According to 2018 data from the World Health Organization, there were 34 million hearing-impaired individuals in Europe, and this number is projected to increase by around 12 million by 2050. The communication challenges experienced by people with hearing impairments have led to a rise in the number of studies aimed at resolving this issue. The advancement of artificial intelligence research, which has gained momentum in recent years, has driven a corresponding increase in sign language research [1, 2], and serious work has been undertaken in the fields of machine learning and deep learning in particular [3, 4]. Convolutional neural networks (CNNs), one of the deep learning methods, are commonly employed in domains such as image classification, similarity-based grouping, and object recognition.

Communication is essential for the continued existence of humans on earth. Any act of communication has two main components: the sender and the recipient [5]. During communication, a channel is formed between the transmitter and the receiver, through which emotions, thoughts, and other content can be conveyed to the other side. Sign language is a visual language, a collection of gestures, facial expressions, and hand and face movements, intended to enable hearing-impaired people to communicate. According to the Turkish Statistical Institute's (TUIK) 2015 figures, there are 406 thousand disabled men and 429 thousand disabled women in Turkey.

Hearing-impaired individuals can communicate effectively through the conventions they have established among themselves, but they cannot interact as efficiently with other individuals or institutions. This difficulty in expressing themselves creates social dysfunction: they cannot convey their thoughts clearly, nor can they fully comprehend what the other party expresses. As a result, individuals with hearing loss tend to withdraw from society [6]. In 2018, there were 34 million hearing-impaired people in Europe alone, according to data released by major health agencies such as the World Health Organization [7]; by 2050, 32 years later, this figure is expected to grow by 35.29%. Several studies show that, even in sports, it is quite difficult for hearing-impaired individuals from diverse groups to interact [8]. There are more than 120 sign languages in the world [9], and although many of them are closely related, communication gaps remain between them. These statistics indicate that digital solutions are needed to enhance the communication of individuals with hearing impairments [10, 11]. This study proposes a CNN-based model for recognizing the numbers of Turkish sign language in order to contribute to addressing this challenge.

2 Related Work

Sign language recognition is a growing branch of gesture recognition research. Studies on sign language recognition have been carried out all around the world using a variety of sign languages, including American Sign Language [29], Chinese Sign Language [28], Japanese Sign Language [27], and Turkish Sign Language [26]. Numerous sign language recognition systems employ machine learning because of its capacity to train useful models from limited and sometimes noisy sensor input. A variety of sensing options exist, including data gloves and other tracker systems, computer vision approaches employing a single camera, multiple cameras, or motion capture systems, and custom-built sensor networks.

Approaches to representing the basic units of signed languages vary significantly across researchers. The simultaneous nature of meaningful left-hand, right-hand, and head gestures in sign languages presents a barrier for many sequential approaches [12]. Some researchers attempt to develop models with a structure resembling phonemes, whereas the majority of studies use the sign itself as the modeling unit. Technological means offer the possibility of finding a solution that removes the bottlenecks currently in place.

Examining the studies in the scientific literature reveals that image processing technologies [13] are commonly utilized to detect human limb motions. Numerous models have been developed in this direction with the contribution of deep learning models [14, 15], which have recently gained prominence in this field, and attention has been drawn to the success of deep learning systems in image processing and classification. In the study of Kemaloğlu and Sevli [16], for instance, convolutional neural networks (CNNs), one of the deep learning techniques, were used to train on and process an image set containing Turkish sign language numbers. A considerable number of strategies and methodologies have been proposed for classifying sign languages. Pigou et al. [15] conducted a deep learning study to recognize 20 Italian sign language hand gestures; by combining an artificial neural network with a CNN model, they attained a 91.7% success rate. Bheda et al. [17] carried out another investigation applying a deep learning model to American sign language, using a small-scale dataset that they had previously developed; by augmenting the dataset and training it with a CNN model, they reached a 97% success rate. Kalam et al. [18] generated a total of 7000 images by rotating 700 American sign language numeral images at ten different angles; by training this dataset with a CNN architecture, they attained a success rate of 97.28%.

3 Material and Method

This section describes the dataset that was used, the preprocessing techniques that were applied, and the CNN and pretrained models that were developed and trained on the dataset.

3.1 Dataset

In this study, Turkish sign language images obtained with the participation of 218 students studying at Ankara Ayrancı Anatolian High School were used as the dataset [19]. The dataset was created in JPEG (RGB) format to represent the numbers 0 through 9 at a resolution of 100 \(\times \) 100 pixels. Each student was asked to sign the ten numbers from 0 to 9, so 2180 images were obtained in total. Figure 1 shows sample sign images for the digits 0 to 9.
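As an illustration, the sketch below shows how such a directory of JPEG images could be loaded with Keras. The folder layout (one subdirectory per digit class), the `dataset/` path, the batch size, and the 80/20 train/validation split are assumptions made for this sketch and are not details specified by the dataset description.

```python
import tensorflow as tf

# Hypothetical layout: dataset/0 ... dataset/9, each folder holding 100x100 JPEG images.
# label_mode="categorical" yields one-hot labels suitable for categorical cross-entropy.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset", image_size=(100, 100), batch_size=32, label_mode="categorical",
    validation_split=0.2, subset="training", seed=42)    # assumed 80/20 split
val_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset", image_size=(100, 100), batch_size=32, label_mode="categorical",
    validation_split=0.2, subset="validation", seed=42)
```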

Fig. 1. Samples from the sign language dataset

3.2 Data Preprocessing

The study's dataset is in red-green-blue (RGB) format. RGB channels allow images to be colored, but working with colored images can be challenging at times. Therefore, the images are examined and analyzed in grayscale, which renders each image two-dimensional, with pixel values ranging from 0 to 255 in a single channel. Because models trained directly on values between 0 and 255 often perform poorly, the images are normalized: normalization rescales the values from their minimum-maximum range to the interval between 0 and 1. In the present study, this means the pixel values are scaled from the range 0-255 down to the range 0-1.
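A minimal sketch of this preprocessing step is given below, continuing from the loading example in Sect. 3.1. Using `tf.image.rgb_to_grayscale` and a plain division by 255 is one possible implementation, not necessarily the exact code used in the study.

```python
import tensorflow as tf

def preprocess(image, label):
    # Collapse the three RGB channels to one grayscale channel (image becomes H x W x 1).
    image = tf.image.rgb_to_grayscale(image)
    # Normalize pixel intensities from the [0, 255] range down to [0, 1].
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

# train_ds / val_ds are the datasets created in the loading sketch above.
train_ds = train_ds.map(preprocess)
val_ds = val_ds.map(preprocess)
```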

3.3 Methods

In deep learning applications, a learning model may be developed from scratch; however, transfer learning has been shown to improve model performance. The weights of a previously trained network can be used to initialize a new model, and when the two approaches are compared, transfer learning proves faster and more efficient. In this investigation, a model was constructed, and transfer learning techniques were utilized to train on the sign language images. A 2D-CNN was developed and fine-tuned against several pretrained models, including VGG16, ResNet50V2, EfficientNetB7, InceptionV3, and MobileNetV2, as depicted in Fig. 2. The Adam optimizer, a learning rate (lr) of 0.001, and categorical cross-entropy loss were chosen for optimizing these models. All models employ the same structure, since this optimizer and learning rate combination yielded the best results.
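The sketch below illustrates one way such a fine-tuning setup could look in Keras, using VGG16 as the backbone. The frozen backbone, the 3-channel input (pretrained ImageNet weights expect RGB input, so the color images would be fed here rather than the grayscale versions), and the width of the intermediate dense layer are assumptions; only the optimizer, learning rate, and loss follow the settings stated above.

```python
import tensorflow as tf

NUM_CLASSES = 10

# Pretrained VGG16 backbone with ImageNet weights; the original classification head is dropped.
base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(100, 100, 3))
base.trainable = False  # freeze the convolutional weights for the transfer-learning stage

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),           # assumed head width
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Settings stated in the paper: Adam optimizer, learning rate 0.001, categorical cross-entropy.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```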

Fig. 2. A graphical representation of the research methodology

4 Experimental Results

The aim of this study is to classify sign language gestures representing numbers and to further improve training performance with transfer learning. In the initial phase of our classification efforts, a two-dimensional CNN model was developed. A max-pooling layer was added after the first two convolution layers, followed by two further convolution layers and a second max-pooling layer, after which the feature maps were flattened. Three dense layers were then traversed to reach the final layer, which uses the softmax activation function to predict the ten classes. After training, the model reached 86% accuracy. The model's confusion matrix is depicted in Fig. 3. For this base model, the rate of distortion in the images between the layers was not significant and was found to be normal.
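A sketch of the described base model is shown below. The layer ordering follows the text, while the filter counts and dense-layer widths are assumptions made for illustration.

```python
import tensorflow as tf

base_cnn = tf.keras.Sequential([
    # Two convolution layers followed by max pooling.
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(100, 100, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    # Two further convolution layers and a second max-pooling layer, then flattening.
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    # Three dense layers, the last of which predicts the ten classes via softmax.
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```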

In the second part of our experiment, the model was fine-tuned using the most prominent pretrained models from the literature [25]. We applied several CNN architectures, namely VGG16, ResNet50V2, EfficientNetB7, InceptionV3, and MobileNetV2, each of which offers distinct capabilities. VGG16 is a convolutional neural network model developed in 2014 by the University of Oxford research group of the same name [20]; as the name suggests, it has sixteen weight layers. When our dataset was trained using the VGG16 architecture's weights, 98% accuracy was attained. Without fine-tuning, the rate of distortion and loss in the images between the layers of the VGG16 model was quite high; once fine-tuning was applied, the distortion rate decreased, as depicted in Fig. 4.

Fig. 3. The confusion matrices of a) the base and b) the fine-tuned VGG16 models

Fig. 4. Visualization of a high-level feature map from the \(conv2d_57\) layer of the fine-tuned VGG16 model, using samples from the dataset

ResNet50V2 is another pretrained model, designed by Microsoft for ImageNet classification [21]; it has fifty layers. Its success rate on the sign language dataset remained at 89%. EfficientNetB7 is a pretrained model built by Google [22] that belongs to a family of eight architectures which evolved consistently from B0 to B7; of these, the EfficientNetB7 architecture enables the most effective training. Its classification success rate on the sign language dataset was 90%. InceptionV3 is a model developed by Google with 50 deep layers [23]. It can classify nearly 1,000 object categories using ImageNet weights, and its default input size is 299 \(\times \) 299 pixels. When our sign language dataset was trained on the pretrained InceptionV3 model, the obtained accuracy was 97%. MobileNetV2 is a convolutional neural network developed by Google for mobile vision applications [24]; it uses limited resources and offers accurate validation for small datasets. However, our experiments made it clear that it is not well suited to this sign language dataset: the obtained accuracy could not exceed 21%.
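For reference, this comparison across backbones can be expressed as a single helper that swaps the pretrained base while keeping the classification head fixed. The global-average-pooling head and the 224 \(\times \) 224 input size (chosen so all backbones can share one resolution) are assumptions of this sketch, not settings reported in the paper.

```python
import tensorflow as tf

# The five pretrained backbones compared in this study, all with ImageNet weights.
BACKBONES = {
    "VGG16": tf.keras.applications.VGG16,
    "ResNet50V2": tf.keras.applications.ResNet50V2,
    "EfficientNetB7": tf.keras.applications.EfficientNetB7,
    "InceptionV3": tf.keras.applications.InceptionV3,
    "MobileNetV2": tf.keras.applications.MobileNetV2,
}

def build_transfer_model(name, num_classes=10, input_shape=(224, 224, 3)):
    """Builds a frozen pretrained backbone topped with a fresh softmax head."""
    base = BACKBONES[name](weights="imagenet", include_top=False, input_shape=input_shape)
    base.trainable = False
    return tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

# Example usage: model = build_transfer_model("ResNet50V2")
```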

We adapted the models to our problem by adding a new fully connected layer covering the 10 classes in our dataset. Backpropagation was then used to fine-tune the original CNN filter weights, learned from natural images, so that they more accurately reflected the modalities of the sign language dataset. Among all the models, VGG16 showed the best performance. The training and validation error over 10 epochs for the fine-tuned and base models are depicted in Fig. 5. The training error of both CNNs follows a consistent trend of steady decline followed by a plateau, and the similarity between the training and validation curves indicates that the proposed fine-tuned model did not overfit the training data.
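Continuing from the VGG16 sketch in Sect. 3.3, this fine-tuning stage could be implemented roughly as follows. How many backbone layers are unfrozen and the reduced learning rate are assumptions of the sketch; only the 10 training epochs correspond to Fig. 5.

```python
# Unfreeze the top of the pretrained backbone so backpropagation can adapt its filters.
base.trainable = True
for layer in base.layers[:-4]:        # keep all but the last few layers frozen (assumed)
    layer.trainable = False

# Recompile with a smaller learning rate to avoid destroying the pretrained weights.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Fine-tune for the 10 epochs reported above; `history` holds the curves plotted in Fig. 5.
# train_ds / val_ds are assumed to be the RGB datasets from Sect. 3.1, since the
# ImageNet backbone expects 3-channel input.
history = model.fit(train_ds, validation_data=val_ds, epochs=10)
```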

Fig. 5. The accuracy and loss curves of a) the base and b) the fine-tuned VGG16 models

Table 1 presents all of the results in a comparable form.

Table 1. Class-based classification performance of the base and fine-tuned models.

5 Conclusions

Although sign language was developed to help hearing-impaired individuals converse with others, it is clear that they continue to struggle with communication in society. To cover all aspects of sign languages, powerful algorithms that reliably extract characteristic features in uncontrolled environments have been developed. In this research, we present a CNN-based architecture for the classification of sign language gestures. The CNN model has a two-dimensional structure. The pretrained VGG16, ResNet50V2, EfficientNetB7, InceptionV3, and MobileNetV2 models were also used, via transfer learning, to improve performance and decrease training time. We have observed that transfer learning allows more reliable systems to be built. The proposed model outperforms prior state-of-the-art classifiers on average, with a recognition rate of 98%. The encouraging results of this study can serve as a starting point for further research into recognizing more complex hand and face movements.