1 Introduction

Image character recognition is one of the most important topics in computer science, and the algorithms for detecting and recognizing characters in images have been extensively explored and improved. Character detection and recognition now reaches into daily life through character recognition systems, business card recognition systems, document recognition systems, vehicle license plate recognition systems and container code recognition in traffic management. These applications have spread not only because they facilitate text entry, document entry and similar work, but also because imaging equipment such as smartphones is now used extensively in daily life. With the rapid progress of science and technology, character detection and recognition systems are no longer limited to office applications. The demand for optical character recognition systems that record and search information is increasing in ordinary people's daily lives (Yang et al. 2017; Feng and Wang 2020; Li et al. 2021; Liu et al. 2021).

The main aim of text detection is to locate the text regions in an image using various image processing methods (Zhaojun and Jun 2019; Xu et al. 2021). Initially, the text is detected and distinguished from the rest of the image using the bounding box technique: a bounding box is drawn around each text object in the image. Text localization follows, determining the location of the text in the image from the drawn bounding box. This step is followed by text enhancement and segmentation. Although the bounding box method provides the text location in the image accurately, segmentation is still required. Text image enhancement is needed to remove background noise and ease the recognition process. The text recognition phase extracts the useful text information from the enhanced region, and the text is then obtained as the output (Pang et al. 2021; Wang et al. 2018). The entire process of text recognition is depicted in Fig. 1.

Fig. 1 Process of text recognition from image

Since the start of the information age, multimedia information in the form of digital images and video has gradually penetrated all aspects of human society. The effective recognition, retrieval and control of multimedia information is a major concern today. As a product of human abstract thinking, text has a strong capacity to summarize and represent the content of images and videos (Sharma et al. 2017; Dogra et al. 2020). Research on image text recognition based on the Canny edge detection algorithm and the k-means algorithm is therefore of great significance.

This article contributes to image text recognition by combining the Canny edge detection algorithm with the k-means algorithm. The main contributions of the proposed method are as follows:

  • It modifies the Canny edge detection and k-means algorithms, and their combination is further optimized to obtain the best text recognition output.

  • The proposed methodology includes preprocessing steps together with optimization of the stroke width of the image text, which improves performance.

  • The outcomes of the proposed method are assessed in terms of recognition rate, error and average recognition time. Recall, precision, F-score and accuracy are also computed.

  • The evaluation parameters are compared with those of other state-of-the-art methodologies in order to assess the proposed technique.

  • The proposed methodology is observed to outperform the other current methodologies in this domain, and it can be used in various smart city applications such as character recognition, business card recognition, document recognition and vehicle license plate recognition.

The rest of this article is organized as follows. A literature review of the traditional methods used for image text recognition is provided in Sect. 2. Section 3 presents the research methodology, including the principles of the Canny edge detection algorithm and the k-means algorithm along with the optimization approach. The results are presented in Sect. 4, followed by concluding remarks in Sect. 5.

2 Literature review

Digital images are one of the largest sources of information in the digital age, and edges are the most essential feature of an image. Although edge pixels occupy only a small part of an image, they carry most of its information, and the contours they form play an important role in describing and identifying images. Most classical edge detection algorithms are based on differential operations and belong to the class of differential algorithms (Wang et al. 2019a). Their biggest advantages are that they are convenient, reliable and easy to operate; they are also mature, can be called directly in MATLAB, and have been used widely and effectively in many fields. For this reason, improving the classical algorithms has become one of the research directions. Some conventional text detection and recognition methodologies are compared in Table 1.

Table 1 Comparative analysis of conventional text detection and recognition approaches

The text recognition methodologies adopted by some recent techniques from 2019–2020 are explained in detail below.

Zheng et al. obtained a gray prediction image with enhanced edges by replacing each pixel value in the original image with the maximum gray prediction value. The original image is subtracted from the gray prediction image to obtain a gray prediction subtraction image in which edge points are separated from non-edge points. The best separation threshold in the gray prediction subtraction image is determined by a global adaptive threshold selection method. A neighborhood search method is then deployed to remove stray points and burrs from the image after the target is separated from the background, thus creating the final edge image. Computer simulation experiments showed that the proposed method is better than several competing methods in both subjective visual effect and objective evaluation criteria. The proposed edge detection algorithm shows excellent edge detection capability and is highly robust to noise (Zheng et al. 2020).

Zhang et al. proposed an improved adaptive Canny edge detection algorithm. Bilateral filtering is used instead of Gaussian filtering to eliminate noise and sharpen image edges. The gradient magnitude of the image is then calculated using gradient templates in four directions: horizontal, vertical, 45° and 135°. The traditional Otsu threshold segmentation algorithm is also improved. The improvement lies in how the threshold that maximizes the class variance is found: reducing the search range of the threshold reduces the computation and achieves fast segmentation. Verification on road marking images shows that the improved Canny algorithm marks locations better, reduces edge disconnection and false edges, and shortens the processing time (Zhang et al. 2020).

Shi et al. proposed a new edge detection method that combines the Canny operator with an improved ant colony optimization algorithm. In this method, the traditional Canny operator first extracts the edges of the image. The endpoints of the edges are then calculated and used as the initial positions of the ants. A triangular fuzzy membership function is applied to the gray values of the neighborhood, and the fuzzy membership values of the pixels between edge endpoints are used as the heuristic matrix of the ant colony. The heuristic matrix prompts the ants to search along real edges so as to detect continuous and complete edge lines. Experimental results show that this method effectively improves the accuracy of contour extraction for target objects in images and that the edge information it extracts is clearer (Shi et al. 2019).

The innovation of this paper is to combine the Canny text edge detection algorithm with the k-means algorithm to identify text. The main steps include image grayscale processing, image binarization using maximally stable extremal regions (MSER), Canny edge detection, pixel clustering, text segmentation and OCR recognition. The image text of the ICDAR2013 data set is recognized by this algorithm, and the results are compared with other text recognition methods.

3 Research methodology

3.1 Principle of the Canny edge detection algorithm

The Canny algorithm is a detection algorithm based on edge features. It keeps the attributes of the text unchanged and does not alter the text itself, while reducing the amount of data in the text image. Many edge detection algorithms exist; this paper uses the Canny algorithm, which handles the edge detection problem well and thereby benefits text recognition. Among the diverse edge detection algorithms, the Canny algorithm seeks the optimal edge detector in the sense of the following three criteria.

  (1) Optimal detection: Edge detection should extract as many features of the text edge as possible while keeping the probability of missed detections as small as possible.

  (2) Edge localization: The detected edge points must not be far from the edge points of the actual text. In short, the deviation between the detected edge position and the actual edge position of the text cannot be too large, which improves the recognition accuracy.

  (3) Correspondence between detected and actual edge points: The algorithm makes each detected point correspond to one actual edge point of the text.

3.2 Principle of k-means algorithm

The k-means algorithm is one of the classical and most widely used methods in cluster analysis; because of its efficiency, it is widely applied to the clustering of large-scale data. It clusters data by dividing the samples into k classes of equal variance (Liu and Zou 2020), and the number of classes must be specified in advance. It scales well to large data sets and has been widely used in practical applications. The algorithm divides the n samples x_j of the data set X into k disjoint classes c_i, each with its own center u_i. K-means is an iterative optimization algorithm that minimizes the following sum of squared errors:

$$\min \sum\limits_{i = 1}^{k} {\sum\limits_{{x_{j} \in c_{i} }} {\left( {\left\| {x_{j} - u_{i} } \right\|^{2} } \right)} }$$
(1)
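
The iteration behind formula (1) can be sketched as follows. This is a minimal illustration in plain Python, not the MATLAB implementation used in the experiments; the function names, the random initialization and the stopping rule are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2 between two tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Lloyd-style k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # assumed initialization: k samples as u_i
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: x_j joins the class c_i with the nearest center u_i.
        labels = [min(range(k), key=lambda i: sq_dist(p, centers[i]))
                  for p in points]
        # Update step: each center u_i becomes the mean of its class.
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])   # keep the center of an empty class
        if new_centers == centers:               # converged: centers stopped moving
            break
        centers = new_centers
    return centers, labels

def objective(points, centers, labels):
    """Value of formula (1): sum of squared distances to the class centers."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
```

For image text, each sample would be a pixel feature vector (for example position, gray value or stroke width); which features are clustered is not specified here and is left to the caller.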

3.3 Image preprocessing

Image grayscale conversion. Grayscale transformation changes the gray value of each pixel in the original image point by point, according to a transformation relation chosen for a given target condition. The purpose is to improve the quality of the picture, remove noise, make the image display more clearly, and ease the subsequent text segmentation and extraction. Noise in image text recognition is mainly interference with the normal information of the image text caused by external illumination, color and other factors. Noise removal is therefore a very important step. We mainly use filtering to remove image noise, including neighborhood average filtering and median filtering (Pei et al. 2008; Kumar et al. 2021; Dhawan et al. 2021; Fan et al. 2020).
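
The grayscale conversion and the two filters named above can be sketched in plain Python as follows. The luma weights (ITU-R BT.601), the 3×3 window and the function names are assumptions for illustration; the text does not state which transformation or window size was used.

```python
from statistics import median_low

def to_gray(rgb_image):
    """Convert rows of (R, G, B) tuples to gray values.
    The BT.601 luma weights are an assumed, commonly used choice."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_image]

def neighborhood(img, y, x, radius=1):
    """Gray values in the window around (y, x), clipped at the image border."""
    h, w = len(img), len(img[0])
    return [img[j][i]
            for j in range(max(0, y - radius), min(h, y + radius + 1))
            for i in range(max(0, x - radius), min(w, x + radius + 1))]

def mean_filter(img, radius=1):
    """Neighborhood average filtering: replace each pixel by the window mean."""
    out = []
    for y in range(len(img)):
        row = []
        for x in range(len(img[0])):
            vals = neighborhood(img, y, x, radius)
            row.append(round(sum(vals) / len(vals)))
        out.append(row)
    return out

def median_filter(img, radius=1):
    """Median filtering: replace each pixel by the window median.
    median_low keeps the result an existing gray value for even-sized windows."""
    return [[median_low(neighborhood(img, y, x, radius))
             for x in range(len(img[0]))]
            for y in range(len(img))]
```

On an isolated bright outlier the median filter removes the outlier entirely, whereas the average filter only spreads it over the window, which is why median filtering is usually preferred for impulse noise.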

$$v\left( i \right) = \frac{{\left| {Q_{{i + {\Delta }}} - Q_{{i - {\Delta }}} } \right|}}{{\left| {Q_{i} } \right|}}$$
(2)

In formula (2), Q_i denotes the extremal region obtained at gray threshold i, Δ is a small change of the threshold, and v(i) is the relative change in the area of the region over that threshold range. A region whose v(i) is small changes little as the threshold varies and is considered a maximally stable extremal region (MSER). However, the results obtained with this formula alone are not always good: in some special cases the calculation fails to yield an MSER, so the region is not detected and further processing is needed. The common practice is to run the test a second time on the inverted image, so that in the other half of the cases the MSER can also be measured accurately.
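
Formula (2) can be illustrated with a small sketch. It assumes that |Q_i| is the pixel count of the connected dark region containing a given seed pixel at threshold i, and that the regions are nested as the threshold grows, so the area of the difference equals the difference of the areas. The function names and the seed-based region growing are assumptions for the example, not the detector used in the experiments.

```python
def region_area(img, seed, t):
    """|Q_t|: size of the 4-connected set of pixels with gray value <= t
    that contains `seed`; 0 if the seed itself is brighter than t."""
    h, w = len(img), len(img[0])
    if img[seed[0]][seed[1]] > t:
        return 0
    seen, stack = {seed}, [seed]
    while stack:
        y, x = stack.pop()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w
                    and (ny, nx) not in seen and img[ny][nx] <= t):
                seen.add((ny, nx))
                stack.append((ny, nx))
    return len(seen)

def stability(areas, delta):
    """v(i) = |Q_{i+delta} - Q_{i-delta}| / |Q_i| for every threshold i
    at which the region exists; `areas[i]` holds |Q_i|."""
    return {i: (areas[i + delta] - areas[i - delta]) / areas[i]
            for i in range(delta, len(areas) - delta)
            if areas[i] > 0}
```

Thresholds at which v(i) is smallest mark the stable regions. Running the same code on the inverted image (255 minus each gray value) corresponds to the second pass described above.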

Text edge detection based on the Canny algorithm. The binarized image is divided into feature regions, and the text blocks are separated from the image background by detecting differences in gray level, color and texture in the digital image. The main steps of the Canny edge detection algorithm are as follows: smoothing the image with Gaussian filtering to remove noise; finding the intensity gradient of the image; applying non-maximum suppression to eliminate false edge detections; applying a double threshold to determine the possible boundaries; and applying hysteresis to obtain the final edges.

  • Image smoothing: The Gaussian filter can be implemented with two one-dimensional Gaussian kernels, that is, a one-dimensional convolution in the X direction followed by a one-dimensional convolution in the Y direction. It can also be implemented directly with a single two-dimensional Gaussian kernel.

  • Non-maximum suppression: The purpose of non-maximum suppression is to thin the edges found by the gradient step, that is, to retain only the pixels whose gradient is a local maximum. Each pixel is processed as follows: (i) the gradient direction of the pixel is determined; (ii) the gradient magnitude of the pixel is compared with those of the two neighboring pixels along the positive and negative gradient directions; (iii) if the gradient magnitude of the pixel is the largest of the three, the pixel is retained, otherwise it is suppressed.

  • Double threshold processing: Some noise still remains after the above processing, so a double threshold is applied. The core of this technique is to set an upper and a lower threshold and to judge each pixel against them. There are three possibilities: the gradient magnitude of the pixel is above the upper bound, below the lower bound, or between the two bounds.

  • Edge detection: With the pixels classified against the two thresholds, the final edge positions are determined by hysteresis. In the standard Canny procedure, pixels above the upper bound are kept as edges, pixels below the lower bound are discarded, and pixels between the two bounds are kept only if they are connected to a pixel above the upper bound.
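
The steps above can be sketched end to end in plain Python. This is a minimal illustration of the standard Canny procedure, not the modified operator evaluated in this article; the Sobel gradient, the border replication, the 8-connected hysteresis and all function names are assumptions made for the example.

```python
import math

def gaussian_kernel(sigma=1.0, radius=2):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    k = [math.exp(-(i * i) / (2.0 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]

def px(img, y, x):
    """Pixel lookup with border replication."""
    return img[min(max(y, 0), len(img) - 1)][min(max(x, 0), len(img[0]) - 1)]

def smooth(img, kernel):
    """Separable Gaussian smoothing: a 1-D pass along X, then one along Y."""
    r, h, w = len(kernel) // 2, len(img), len(img[0])
    tmp = [[sum(kernel[i + r] * px(img, y, x + i) for i in range(-r, r + 1))
            for x in range(w)] for y in range(h)]
    return [[sum(kernel[i + r] * px(tmp, y + i, x) for i in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def gradient(img):
    """Sobel gradient magnitude and direction (radians) at every pixel."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = ((px(img, y - 1, x + 1) - px(img, y - 1, x - 1))
                  + 2 * (px(img, y, x + 1) - px(img, y, x - 1))
                  + (px(img, y + 1, x + 1) - px(img, y + 1, x - 1)))
            gy = ((px(img, y + 1, x - 1) - px(img, y - 1, x - 1))
                  + 2 * (px(img, y + 1, x) - px(img, y - 1, x))
                  + (px(img, y + 1, x + 1) - px(img, y - 1, x + 1)))
            mag[y][x], ang[y][x] = math.hypot(gx, gy), math.atan2(gy, gx)
    return mag, ang

def non_max_suppression(mag, ang):
    """Keep a pixel only if it is the maximum along its gradient direction."""
    h, w = len(mag), len(mag[0])
    steps = [(0, 1), (1, 1), (1, 0), (1, -1)]  # (dy, dx) for 0, 45, 90, 135 deg

    def at(y, x):
        return mag[y][x] if 0 <= y < h and 0 <= x < w else 0.0

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = steps[round(math.degrees(ang[y][x]) / 45.0) % 4]
            # Strict on one side so that a two-pixel tie gives one response.
            if mag[y][x] > at(y + dy, x + dx) and mag[y][x] >= at(y - dy, x - dx):
                out[y][x] = mag[y][x]
    return out

def hysteresis(nms, low, high):
    """Double threshold plus tracking: pixels >= high are edges; pixels
    >= low are edges only if 8-connected to an already accepted edge."""
    h, w = len(nms), len(nms[0])
    edges = [[v >= high for v in row] for row in nms]
    stack = [(y, x) for y in range(h) for x in range(w) if edges[y][x]]
    while stack:
        y, x = stack.pop()
        for ny in range(max(0, y - 1), min(h, y + 2)):
            for nx in range(max(0, x - 1), min(w, x + 2)):
                if not edges[ny][nx] and nms[ny][nx] >= low:
                    edges[ny][nx] = True
                    stack.append((ny, nx))
    return edges

def canny(img, low, high, sigma=1.0):
    """Smoothing, gradient, non-maximum suppression, hysteresis."""
    mag, ang = gradient(smooth(img, gaussian_kernel(sigma)))
    return hysteresis(non_max_suppression(mag, ang), low, high)
```

The thresholds `low` and `high` are compared against Sobel magnitudes, so suitable values depend on the gray range of the input; on a vertical step from 0 to 255 the sketch returns a single-pixel-wide vertical edge.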

3.4 Enhanced processing of stroke width in image text

After the image preprocessing stage, the stroke width of the text is optimized. The width computation again relies on the output of the Canny algorithm, namely the detected edges and their gradient directions, and proceeds as follows.

  (1) When the pixel p is located on the edge of the text, its gradient direction is perpendicular to the stroke direction. Search along the gradient direction for the nearest corresponding edge pixel q, whose gradient direction is roughly opposite to that of p. When a pair p and q satisfying this condition is found, the length of the ray between them is calculated as the Euclidean distance, which gives the stroke width in this direction.

  (2) Repeat step (1) to calculate the stroke width of the pixels on all rays that are not discarded, at which point the algorithm ends.
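
The two steps can be sketched as follows. The sketch assumes that an edge map and a unit gradient vector (dy, dx) per edge pixel are already available, for example from the Canny stage; the angular tolerance of π/6 for "roughly opposite", the maximum ray length and the function names are assumptions for illustration.

```python
import math

def stroke_width(edges, grad, p, max_len=50, tol=math.pi / 6):
    """Step (1): follow the gradient direction from edge pixel p = (y, x)
    until another edge pixel q is met. Return the Euclidean distance
    between p and q, or None if the ray is discarded."""
    h, w = len(edges), len(edges[0])
    py, px = p
    gy, gx = grad[py][px]                      # unit gradient (dy, dx) at p
    for step in range(1, max_len + 1):
        qy = int(round(py + step * gy))
        qx = int(round(px + step * gx))
        if not (0 <= qy < h and 0 <= qx < w):
            return None                        # ray left the image: discard
        if (qy, qx) != (py, px) and edges[qy][qx]:
            qgy, qgx = grad[qy][qx]
            # Roughly opposite directions: dot product close to -1.
            if gy * qgy + gx * qgx <= -math.cos(tol):
                return math.hypot(qy - py, qx - px)
            return None                        # q does not face p: discard
    return None

def stroke_width_map(edges, grad):
    """Step (2): repeat step (1) for every edge pixel and keep the rays
    that were not discarded."""
    widths = {}
    for y, row in enumerate(edges):
        for x, is_edge in enumerate(row):
            if is_edge:
                width = stroke_width(edges, grad, (y, x))
                if width is not None:
                    widths[(y, x)] = width
    return widths
```

The ray follows the gradient as given, which crosses the stroke for bright text on a dark background; for dark text on a bright background the gradient vectors would be negated before the call.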

3.5 Optimization of stroke width for image text

After image preprocessing and stroke width optimization, the target region can be extracted with the k-means clustering algorithm. By clustering the pixels of the target image, the text area to be recognized is extracted from the image. The operation flow is shown in Fig. 2.

Fig. 2 Flow chart of the optimization operation of stroke width for image text

4 Results and discussion

4.1 Canny operator recognition effect before and after improvement

To verify the effect of the improved Canny operator on human behavior recognition, ICanny-RBF and Canny-RBF are compared; the experimental results are shown in Fig. 3. Figure 3 reveals that, compared with Canny-RBF, ICanny-RBF improves the average recognition accuracy by 1.52%, 2.38%, 0.23%, 3.89% and 2.40% for the five human behaviors walking, running, jumping, side and skipping, respectively.

Fig. 3 Comparison of recognition effect before and after Canny operator improvement

The comparison shows that the improved Canny operator can be used to detect the contours of human behavior images and yields a more complete foreground contour. This is highly beneficial to the subsequent feature extraction and to the classification and recognition of human behavior by the RBF neural network.

4.2 Comparison of RBF and BP neural networks

To support the choice of framework, the ICanny-RBF neural network is compared with the ICanny-BP neural network. For the experiments, the network training target error is set to 0.001 and the learning rate to 0.005. The samples are split 70–30% into training and test sets.

To analyze the advantages of the ICanny-RBF neural network, it is compared in simulation with the ICanny-BP neural network; their average recognition accuracies are shown in Fig. 4. Figure 4 shows that the recognition performance of the ICanny-RBF neural network is better than that of the ICanny-BP neural network, so a good recognition effect can be achieved with the RBF neural network. With the network training target error set to 0.001, the training processes and training errors of the BP and RBF neural networks are shown in Fig. 5.

Fig. 4 Comparison of recognition results for BP neural network and RBF neural network

Fig. 5 Training error curve of BP neural network and RBF neural network

According to the results, the RBF neural network reaches the preset precision and stops at iteration 340, whereas the BP neural network reaches the preset precision after 180 iterations. The comparison shows that, compared with the BP neural network, the RBF neural network has a faster learning speed and improves the efficiency of human behavior recognition.

4.3 Identification of recognition speed

To evaluate the recognition speed for human behavior, the average recognition time on the test set is used as the evaluation index; the results are shown in Table 2. Table 2 shows that the recognition time of the RBF neural network is short and below the average recognition time of the BP neural network, which suits the real-time requirement of human behavior recognition.

Table 2 Comparison of Average Recognition Time for Different Models

This paper combines the Canny text edge detection algorithm with the k-means clustering algorithm to identify image text. The preprocessing stage comprises gray level conversion of the image, binarization of the image and edge detection of the image text. Measuring the stroke width of the text narrows the scope of text recognition; the k-means algorithm is then used to cluster and merge the image pixels. The text area is delimited, and the text within it is segmented using the optimized clustering result. Finally, the image text is recognized through an OCR recognition interface (Kaur et al. 2021; Bhuyan et al. 2021; Wang et al. 2019b; Zhang et al. 2018). To analyze the results of this experiment, we use the ICDAR2013 data set, which contains many kinds of text content; images without text and images with blurred text are removed beforehand. With this algorithm, 233 pairs of the ICDAR2013 data set are studied using MATLAB (Mahajan et al. 2021). The recall rate is 72.4% and the precision is 88.3%. The average F-score is 75.9%, and an accuracy of 90.5% is observed for the proposed method. Table 3 and Fig. 6 show the performance results of processing the ICDAR2013 data set by different methods (Sharma et al. 2021a).
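
For reference, the four evaluation parameters can be computed from detection counts as in the sketch below. The standard definitions are assumed; the counts in the usage note are hypothetical and are not taken from the experiments, and the way the reported figures were aggregated over images is not specified in this article.

```python
def evaluate(tp, fp, fn, tn=0):
    """Precision, recall, F-score and accuracy from detection counts.
    tp: correctly detected text items, fp: false detections,
    fn: missed text items, tn: correctly rejected non-text items."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-score as the harmonic mean of precision and recall.
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, f_score, accuracy
```

With hypothetical counts tp=8, fp=2, fn=4 and tn=6, the function returns a precision of 0.8 and an accuracy of 0.7. An F-score averaged per image need not equal the harmonic mean of the averaged precision and recall.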

Table 3 ICDAR2013 text recognition results

Fig. 6 Performance results of processing the ICDAR2013 data set by different methods

The data tabulated in Table 3 and presented graphically in Fig. 6 reveal that, in text recognition on the ICDAR2013 data set, the recall rate and the correct rate are improved because blurred text and images without text have been removed from the data set. Compared with the other algorithms, this algorithm achieves better recognition accuracy on the ICDAR2013 data set (Sharma et al. 2021b).

5 Conclusion

This paper uses maximally stable extremal regions (MSER) for the binarization of the image, the Canny algorithm for edge detection of the text, and the k-means algorithm to cluster the pixels. The standard ICDAR2013 data set is used for the study, and the text recognition outputs are obtained by combining the preprocessing steps with the optimization of stroke width for image text. Recall, precision, F-score and accuracy are evaluated and compared with those of other state-of-the-art methodologies in order to assess the proposed technique. The proposed approach yields a recall rate of 72.4%, a precision of 88.3%, an F-score of 75.9% and an accuracy of 90.5%. The method proposed in this article was observed to outperform the other current methodologies compared in this domain. This work can be used in various smart city applications such as character recognition, business card recognition, document recognition and vehicle license plate recognition. Future research on this algorithm can address the location and extraction of text areas against complex digital image backgrounds, so as to obtain image text that can be recognized by Optical Character Recognition (OCR).