1 Introduction

Thanks to the availability of low-cost image capturing devices, smart phones equipped with cameras are now ubiquitous [7, 20, 25]. Since pictures are often taken on the spur of the moment rather than after elaborate planning, one cannot expect to capture high quality images all the time. There is therefore a high chance of blur, caused by defocus or by camera and scene motion, and images affected by multiple blur types are quite common in real-time applications. It is noted from the literature that blur is one of the key causes of poor results [5, 11]. One such case is text detection in video and scene images, which is an essential part of information retrieval as it helps in extracting high-level features, for example, retrieving information from news or sports videos, finding street information from urban video images, etc. Recently, elegant methods have been developed for text detection by exploring powerful deep learning tools to address the issues of complex background [19], low contrast [10], low resolution [23], uneven illumination, small fonts [17], multi-oriented and multi-lingual text [16, 26], and even blur [21]. However, these methods target particular blur types, use deblurring to remove the blur effect, or classify blurred images from non-blurred ones; when images are affected by multiple blur types as mentioned above, they do not perform well. In addition, the success of text detection in blurred images depends heavily on the success of deblurring and classification. On top of that, arbitrary orientation (curved) of texts in blurred images increases the complexity of the problem. This is evident from the results shown in Fig. 1, where the existing methods, namely, the Connectionist Text Proposal Network (CTPN) [17] and the Efficient and Accurate Scene Text Detector (EAST) [26], both of which use deep learning tools and work well for multi-oriented, multi-script, complex background, small font and low contrast text images, fail to detect curved texts properly in the defocus- and motion-blurred images shown in Fig. 1(a) and Fig. 1(b), respectively. Therefore, this work addresses curved text detection in blurred and non-blurred images without support from classification or deblurring.

Fig. 1

Remaining challenges for text detection in defocus, motion and non-blurred images of curved text lines using different existing methods

2 Related work

As mentioned in the previous section, there are three ways of improving text detection performance in blurred images, namely, detecting texts through deblurring, classifying blurred and non-blurred images, and estimating the degree of blur. Therefore, this section reviews the methods related to these three categories. Note that the proposed method belongs to the third category.

Liu et al. [11] proposed robust text detection via multi-degree sharpening and blurring. The method sharpens and blurs edges, which represent texts, using unsharp masking, and then uses Maximally Stable Extremal Regions (MSER) to extract text components. The aim of this method is to enhance low contrast information by sharpening and smoothing. However, it does not consider images affected by multiple blur types. Cao et al. [1] proposed scene text deblurring using text-specific multi-scale dictionaries. The method uses the Stroke Width Transform (SWT) to obtain a stroke width distance for each pixel in an image, trains the system on text-specific properties at multiple levels, and compares features extracted from stroke width information with training patches to derive a deblurring model. The target of the method is to deblur images for better text detection performance. However, its performance depends on several parameters, and it is unclear whether the same method works for different types of blur. Khare et al. [5] proposed a blind deconvolution based method for deblurring texts in images. The method aims to deblur texts affected by both defocus and motion blur. However, its performance depends on several parameters and prior knowledge of the database. In summary, methods that use deblurring to remove blur usually require parameter tuning according to prior knowledge of the dataset. The main issue is that when the degree and type of blur change from one image to another, the performance of these methods degrades. In addition, treating deblurring as a pre-processing step adds complexity to the text detection pipeline, and the scope of text detection is then limited to blurred images rather than covering non-blurred ones. Therefore, such methods may not be suitable for text detection in images with different types of blur together with non-blurred images.

To overcome the above limitation, methods have been developed for classifying blurred images from non-blurred ones such that an appropriate method can be chosen to improve text detection performance. For instance, Khare et al. [6] proposed a quad-tree based method for classifying blurred and non-blurred video text images through quality metrics. The method studies quality metrics over different degrees of blur to derive the classification condition. However, it is unclear whether the method handles different blur types or only one particular type of blurred images. The success of text detection then depends on how well the classification method works. Therefore, we can conclude from the review of deblurring and classification methods that, although text detection methods use these pre-processing steps, the results may not be consistent across different blur types. Hence, methods have also been developed that address blur directly to improve text detection performance, as reviewed next.

Shi et al. [16] proposed detecting oriented texts in natural images by linking segments. Since the method finds small segments of text information, it can work for blurred images to some extent. To achieve this, it explores the VGG-16 model with different layers. However, its performance depends on training and classifiers when the degree of blur changes from one image to another. Wei et al. [21] proposed text detection in scene images based on exhaustive segmentation. The aim of the method is to address the issues of low resolution, low contrast and blurred images. It extracts features that are invariant to rotation and scaling, and to some extent to blur, for text detection. However, the method does not mention which types of blur it handles, and since it uses binarization, it is hard to obtain good results for images with different degrees of blur. Zhang et al. [24] proposed text detection in natural scene images based on color-prior guided MSER. The method extracts different features based on characteristics of text components using MSER and then explores deep learning for false positive removal. Since the method involves deep learning, it can overcome the issue of blur to some extent; however, blurred text detection is not its primary goal. Liao et al. [9] proposed a fast text detector with a single deep neural network. The main focus of the method is to detect texts efficiently, and it therefore explores a single text detector architecture without any post-processing steps. Overall, although these methods exploit the advantage of deep learning in addressing blur for text detection, one cannot expect consistent performance when the input contains different blur types and non-blurred images simultaneously [18].

In light of the above discussion, it is noted from text detection in natural scene and video images that there are powerful methods for addressing challenges such as multi-orientation, multi-script, low resolution, low contrast, complex background, multi-font, and size variations. However, none of the methods considers the challenges posed by images affected by multiple blur types, even though defocus and motion blur are common when capturing images and videos. If an input image contains blur, rather than detecting texts in the blurred image, the existing methods use deblurring to remove blur before text detection, so text detection performance depends on the success of deblurring. At the same time, deblurring methods cannot be applied to non-blurred images, so the existing methods use classification to separate blurred from non-blurred images before applying deblurring and text detection. This pipeline is cumbersome, and it is hard to ensure that deblurring methods provide good results for blurred images in different situations. This is the key motivation for proposing a method that detects texts in blurred images without deblurring or classification of blurred and non-blurred images.

Motivated by the observation [8] that non-blurred patches contain strong edges in arbitrary directions, defocus-blurred patches consist of weak edges in arbitrary directions, and motion-blurred patches involve strong gradients along the direction perpendicular to the motion, one can estimate the degree of blur irrespective of blur type by studying the spatial information and direction of pixels. Besides, the degree of blur is high where there are strong edges. This means that the degree of blur indicates edges, which represent texts in blurred images, while high gradient values represent text pixels in non-blurred images. Based on this cue, the proposed method combines the degree of blur and the gradient values of pixels to balance blurred and non-blurred text pixels for text detection in this work.

Therefore, the main contribution of the proposed work is the exploration of the combination of degree-of-blur estimation and gradient values to address the challenges of text detection in both blurred and non-blurred images. In addition, the use of the Bhattacharyya distance measure for eliminating false text candidates given by degree-of-blur estimation and for estimating the degree of similarity automatically is new. To the best of our knowledge, this is the first method of its kind that detects texts in images with different blur types without any deblurring pre-processing step.

3 Proposed method

For this work, non-blurred images and blurred ones affected by defocus or motion are the input for text detection. The proposed method first explores a low pass filter based on neighboring gradient values to estimate the degree of blur. Where there are strong edges, there will be more blur, and edges that represent text pixels are considered strong edges [2]. This means that pixels representing texts in blurred images have a higher degree of blur than other pixels. Similarly, in non-blurred images, pixels representing texts have high gradient values [2]. Based on this observation, the proposed method combines the degree of blur and the gradient values of pixels of the input image, which results in a large gap between text and non-text pixels. The proposed method then uses k-means clustering with k = 2 to separate text pixels from non-text ones. Since the input image can have a complex background and variations in the degree of blur, there is a high chance of misclassifying non-text pixels as text pixels in this step. To alleviate this problem, inspired by the observation that text components exhibit a symmetry property, the proposed method divides text candidate images into four sub-parts and extracts symmetry features based on the Bhattacharyya distance measure, a well-known measure for estimating similarity and dissimilarity between histograms of sub-regions of text candidate images [12]. Next, the features are passed to an SVM classifier for text/non-text component classification. It is noted that characters in the same text line have almost constant spacing and that successive characters are nearest neighbors. This is the basis of the proposed grouping procedure for text line extraction: the proposed method finds the nearest neighbor of each text component in a text line and links them to fix the bounding box for the whole line. Since the grouping process fixes a bounding box for each text component based on linking, the proposed method works well for text lines of any orientation, including curved, horizontal and non-horizontal texts. The block diagram of the proposed method is shown in Fig. 2.

Fig. 2

The pipeline of the proposed method to get a complete picture of the proposed work

3.1 Text candidate detection

As discussed in the previous section, in order to estimate the degree of blur in the input image, the proposed method applies vertical and horizontal low pass filters, fv and fh = fv′, performing a convolution over the input image as defined in Eq. (1) and Eq. (2). This yields two filtered images, BV and BH, for the vertical and horizontal low pass filters, respectively. The variations at the neighbors of each pixel provide vital information for estimating the degree of blur in both blurred and non-blurred images. To study the effect on neighboring pixels, the proposed method computes absolute differences horizontally and vertically for the input and filtered images as defined in Eqs. (3)–(6). Variations among neighboring pixels are expected to be high where the image is sharp and moderate where it is blurred. To extract such variations, the proposed method further takes the difference between the absolute differences of the input and filtered images as defined in Eq. (7). To compare the outputs of Eq. (7) with the input image and find the actual variations horizontally and vertically, it computes the sums of the values in the difference images as defined in Eqs. (8)–(9), and the sums are then normalized as defined in Eq. (10). This gives blur estimates with respect to the vertical and horizontal filtered images, and the final degree of blur is obtained as defined in Eq. (11). Since the estimation considers neighboring pixel information in both the horizontal and vertical directions, the proposed method estimates the degree of blur regardless of blur type.

$$ {f}_h= transpose\left({f}_v\right)={f}_v^{\prime } $$
(1)

where \( {f}_v=\frac{1}{9}\left[1\ 1\ 1\ 1\ 1\right] \).

$$ {B}_V={f}_v\ast I $$
(2)

where I is the input gray level image, and BH = fh ∗ I.

$$ DI_V\left(i,j\right)=\left|I\left(i,j\right)-I\left(i-1,j\right)\right|,\quad i=1,\dots,m-1,\; j=0,\dots,n-1 $$
(3)
$$ DI_H\left(i,j\right)=\left|I\left(i,j\right)-I\left(i,j-1\right)\right|,\quad i=0,\dots,m-1,\; j=1,\dots,n-1 $$
(4)
$$ DB_V\left(i,j\right)=\left|B_V\left(i,j\right)-I\left(i-1,j\right)\right|,\quad i=1,\dots,m-1,\; j=0,\dots,n-1 $$
(5)
$$ DB_H\left(i,j\right)=\left|B_H\left(i,j\right)-I\left(i,j-1\right)\right|,\quad i=0,\dots,m-1,\; j=1,\dots,n-1 $$
(6)

where DIV and DIH are the absolute difference images of the input image, and DBV and DBH are those of the filtered images, with respect to the vertical and horizontal directions, respectively.

$$ DV_V=\max\left(0,\, DI_V-DB_V\right),\quad DV_H=\max\left(0,\, DI_H-DB_H\right) $$
(7)

where DVV and DVH denote the variations of the input image that are not captured by the filtered images, with respect to the vertical and horizontal directions, respectively. The sums of the coefficients of DIV, DIH, DVV and DVH are then computed as follows.

$$ sI_V={\sum}_{i,j=1}^{m-1,n-1} DI_V\left(i,j\right),\quad sI_H={\sum}_{i,j=1}^{m-1,n-1} DI_H\left(i,j\right) $$
(8)
$$ sV_V={\sum}_{i,j=1}^{m-1,n-1} DV_V\left(i,j\right),\quad sV_H={\sum}_{i,j=1}^{m-1,n-1} DV_H\left(i,j\right) $$
(9)
$$ bI_{Ver}=\frac{sI_V-sV_V}{sI_V},\quad bI_{Hor}=\frac{sI_H-sV_H}{sI_H} $$
(10)

where bIVer and bIHor are the blur estimations in vertical and horizontal directions, respectively.

$$ blurdegree=\max\left(bI_{Ver}, bI_{Hor}\right) $$
(11)
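
For readers who wish to reproduce Eqs. (1)–(11), the following is a minimal NumPy/SciPy sketch of the blur-degree estimation, not the authors' implementation. The 9-tap averaging kernel is an assumption made to match the 1/9 normalization of Eq. (1), and scipy.ndimage.convolve1d is used as a generic 1-D convolution.

```python
import numpy as np
from scipy.ndimage import convolve1d

def blur_degree(I):
    """Sketch of the blur-degree estimation of Eqs. (1)-(11).
    I: 2-D grayscale image as a NumPy array."""
    I = I.astype(np.float64)
    f = np.ones(9) / 9.0                       # low-pass kernel (assumed 9 taps, Eq. 1)

    B_V = convolve1d(I, f, axis=0)             # vertical filtering, Eq. (2)
    B_H = convolve1d(I, f, axis=1)             # horizontal filtering

    # Absolute differences of neighboring pixels, Eqs. (3)-(6)
    DI_V = np.abs(I[1:, :] - I[:-1, :])
    DI_H = np.abs(I[:, 1:] - I[:, :-1])
    DB_V = np.abs(B_V[1:, :] - I[:-1, :])
    DB_H = np.abs(B_H[:, 1:] - I[:, :-1])

    # Variations of the input not explained by the filtered images, Eq. (7)
    DV_V = np.maximum(0.0, DI_V - DB_V)
    DV_H = np.maximum(0.0, DI_H - DB_H)

    # Sums and normalized blur estimates, Eqs. (8)-(10)
    sI_V, sI_H = DI_V.sum(), DI_H.sum()
    sV_V, sV_H = DV_V.sum(), DV_H.sum()
    bI_Ver = (sI_V - sV_V) / max(sI_V, 1e-12)
    bI_Hor = (sI_H - sV_H) / max(sI_H, 1e-12)

    return max(bI_Ver, bI_Hor)                 # Eq. (11)
```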

The above step gives high values for blurred pixels, which represent text pixels in blurred images regardless of defocus or motion blur. To also capture the high gradient values that indicate text pixels in non-blurred images, the proposed method multiplies the gradient values of the input image by the degree of blur. This yields high values for text pixels compared with non-text pixels in both blurred and non-blurred images, because in a blurred image a low gradient value is multiplied by a high degree of blur, while in a non-blurred image a high gradient value is multiplied by a low degree of blur. This is evident from Fig. 3, where Fig. 3(a) gives the results of the gradient operation for the defocus, motion and non-blurred input images in Fig. 1(a), and Fig. 3(b) shows the result of multiplying the gradient by the degree of blur. It is observed that the text pixels in Fig. 3(b) are sharpened compared with those in Fig. 3(a). However, a few background pixels are also highlighted due to complex backgrounds, where strong edge pixels similar to text edges can be expected.

Fig. 3

The effects of the combination of degree of blur and gradient information for text candidate detection in defocus, motion and non-blurred images

The above steps give a weighted gradient image with high values for text pixels and low values for non-text pixels irrespective of blur type. The weighted gradient image is then passed to k-means clustering with k = 2, which gives two clusters. The cluster with the higher mean is considered the text cluster and the other the non-text cluster, as shown in Fig. 3(c), where the text clusters of the defocus, motion and non-blurred input images give the text candidates.
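
A hedged sketch of this candidate detection step is given below: the gradient magnitude (here a Sobel operator, which the paper does not specify) is weighted by the blur degree from the previous sketch, and k-means with k = 2 separates text from non-text pixels. The OpenCV Sobel and scikit-learn KMeans calls are illustrative choices, and blur_degree is reused from the sketch above.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def text_candidate_mask(I_gray):
    """Sketch of Section 3.1: blur-degree-weighted gradient + k-means (k = 2)."""
    gx = cv2.Sobel(I_gray, cv2.CV_64F, 1, 0)   # gradient operator (assumption)
    gy = cv2.Sobel(I_gray, cv2.CV_64F, 0, 1)
    grad = np.sqrt(gx ** 2 + gy ** 2)

    weighted = grad * blur_degree(I_gray)      # combine degree of blur and gradient

    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = km.fit_predict(weighted.reshape(-1, 1)).reshape(I_gray.shape)

    # the cluster with the higher mean is taken as the text cluster
    text_label = int(km.cluster_centers_.argmax())
    return labels == text_label                # boolean text-candidate mask
```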

3.2 Text component detection

Next, we consider the problem of removing false text candidates from the output of the previous step, because objects in the background may sometimes overlap with texts. Therefore, we propose to study specific characteristics of text components. One such characteristic is symmetry between sub-regions of text components, since pixels within text components share almost the same values. With this cue, the proposed method divides the whole text candidate image into four parts based on the major and minor axes of the ellipse, as shown in Fig. 4(a), which gives four equal sub-regions, namely, A, B, C and D. Motivated by the property of the Bhattacharyya distance defined in Eq. (12) and Eq. (13), which estimates similarity and dissimilarity between the distributions of two regions, the proposed method computes histograms of the sub-regions as shown in Fig. 4(b), where the pixel distributions in the histograms of A-C and B-D have a high degree of similarity. The reason for choosing the Bhattacharyya distance measure is that the degree of similarity must be estimated between histograms, which are pixel distributions rather than pixel values. This is supported by [12], where it is shown that the Bhattacharyya distance measure is good for comparing pixel distributions to estimate the degree of similarity. Furthermore, the Bhattacharyya distance is based on probability distributions, and this tolerance helps to estimate a high degree of similarity between histograms even when the distributions are not exactly the same due to blurred and non-blurred pixels in text candidate images.

Fig. 4

Extracting symmetry properties of text candidate images by estimating the degree of similarity between the histograms of sub-regions using the Bhattacharyya distance measure

The proposed method estimates the degree of similarity between the histograms of the sub-regions using the Bhattacharyya distance measure, which gives 6 features for each text candidate image as illustrated in Fig. 4(a)-(c). For example, let A, B, C and D be the four sub-parts of the text candidate image. The distances between A&B, A&C, A&D, B&C, B&D and C&D are estimated using the Bhattacharyya distance measure. If the input is a text candidate image, one can expect high values for all 6 features; otherwise, low values for all 6 features. This is justifiable because the histograms of sub-regions of text candidate images have a high degree of similarity, while those of non-text candidate images do not. Therefore, if we perform k-means clustering with k = 2, most of the values are expected to fall into the Max cluster for a text candidate image and into the Min cluster for a non-text candidate image. If the Max cluster contains more values than the Min cluster, the candidate is considered a real text component; otherwise it is a non-text component. This condition can be used for classifying text and non-text components, but since it is rigid, it may not be effective for handling images affected by different types of blur together with non-blurred images. Therefore, the extracted features are passed to an SVM classifier [4, 14] for removing false text candidates, which outputs text components. The reason for choosing an SVM classifier is that, unlike CNN and ELM, it does not require a large number of features and samples. In addition, SVM is a linear classifier and is well suited to the two-class classification problem. The effect of false candidate removal can be seen in Fig. 5, where all the non-text components are removed, resulting in text components.
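
As a rough illustration of this classification step, the snippet below trains a linear SVM on the 6 pairwise Bhattacharyya features described above (their computation is sketched after Eq. (13)); scikit-learn's SVC is an assumed stand-in for the classifier of [4, 14], and the data layout is hypothetical.

```python
from sklearn.svm import SVC

def train_text_classifier(features, labels):
    """features: array of shape (n_candidates, 6) holding the pairwise
    Bhattacharyya measures of each text candidate; labels: 1 = text, 0 = non-text."""
    clf = SVC(kernel="linear")   # linear SVM for the two-class problem
    clf.fit(features, labels)
    return clf

def keep_text_components(clf, candidates, features):
    """Return only the candidates classified as text components."""
    predictions = clf.predict(features)
    return [c for c, p in zip(candidates, predictions) if p == 1]
```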

Fig. 5

The effects of distance measure for removing false text candidates from defocus, motion and non-blurred images of curved text lines

The above steps can be formulated as follows. Formally, the histogram of each sub-region is obtained as defined in Eq. (12).

$$ H\left({r}_k\right)={N}_k $$
(12)

where rk denotes the k-th gray level and Nk is the number of pixels with gray level rk in the corresponding sub-region of the gray image I.

Let H1, H2, H3, H4 be the histograms of the four sub-regions. Bhattacharyya distance can be then estimated as defined in Eq. (13).

$$ d\left({H}_1,{H}_2\right)=\sqrt{1-\frac{1}{\sqrt{{\overline{H}}_1\,{\overline{H}}_2\,{N}^2}}\sum_I\sqrt{H_1(I)\,{H}_2(I)}} $$
(13)

where \( {\overline{H}}_k=\frac{1}{N}{\sum}_I{H}_k(I) \) and N is the total number of pixels.
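
The following sketch computes the sub-region histograms of Eq. (12) and the 6 pairwise Bhattacharyya distances of Eq. (13). For simplicity it splits the candidate into image quadrants rather than along the fitted ellipse axes, and it relies on OpenCV's HISTCMP_BHATTACHARYYA, which implements the same distance; both are assumptions rather than the authors' exact procedure.

```python
import cv2
import numpy as np

def symmetry_features(candidate_gray):
    """Sketch of Section 3.2: four sub-regions, their gray-level histograms
    (Eq. 12) and the 6 pairwise Bhattacharyya distances (Eq. 13).
    candidate_gray: uint8 grayscale crop of a text candidate."""
    h, w = candidate_gray.shape
    subregions = [candidate_gray[:h // 2, :w // 2],   # A
                  candidate_gray[:h // 2, w // 2:],   # B
                  candidate_gray[h // 2:, :w // 2],   # C
                  candidate_gray[h // 2:, w // 2:]]   # D

    hists = [cv2.calcHist([r], [0], None, [256], [0, 256]).astype(np.float32)
             for r in subregions]

    feats = []
    for i in range(4):
        for j in range(i + 1, 4):
            feats.append(cv2.compareHist(hists[i], hists[j],
                                         cv2.HISTCMP_BHATTACHARYYA))
    return np.array(feats)                            # 6-dimensional feature vector
```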

3.3 Curved text detection by grouping text components

Figure 5 shows that a text component can be a single character, a whole word, or part of a word. Text components may lie in arbitrary directions, which results in curved texts, so it is not as easy to fix bounding boxes as it is for horizontal text components, and the aim of the proposed method is to fix bounding boxes for any orientation. Therefore, for each text component in the images of Fig. 5, the proposed method finds the nearest neighbor based on the distance between two text components: the Euclidean distance is calculated from the centroid of the current component to the centroid of the candidate neighbor component. Since the distance between two character components is smaller than that between two words or lines, the proposed method finds an adjacent text component as the nearest neighbor. When the nearest neighbor component is found, the method verifies its properties, namely size and direction, against the current text component before grouping them into the same text line. In this way, the proposed method fixes a bounding box for each text component in curved text lines, as shown in Fig. 6(a) for defocus, motion and non-blurred images. Finally, the bounding box and direction information of each text component are used to fix the bounding box for the whole word, as shown in Fig. 6(b), where the proposed method fixes bounding boxes for each word in the images. If any false positives remain due to complex backgrounds and different types of blur, the proposed method uses the same combination of features and SVM classifier as discussed in the previous section to remove them; in this step, the features are extracted at the text-line level rather than the component level. The complete algorithmic steps of the proposed method are presented in Algorithm 1, and a sketch of the grouping step is given after Algorithm 1.

Fig. 6

Text detection results of the proposed method for the curved blurred and non-blurred images

Algorithm 1
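
As a complement to Algorithm 1, here is a minimal sketch of the grouping step: each component is linked to its nearest neighbor by centroid distance, with a simple size-compatibility check standing in for the size and direction verification described above; the thresholds and the omission of the direction test are simplifications, not the authors' exact criteria.

```python
import numpy as np

def group_into_lines(components, max_dist=50.0, size_ratio=2.0):
    """Sketch of Section 3.3. components: list of dicts with a 'centroid'
    (x, y) and a 'height'. Returns lists of component indices, one per line."""
    n = len(components)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, ci in enumerate(components):
        # nearest neighbor of component i by Euclidean centroid distance
        dists = [np.linalg.norm(np.subtract(ci['centroid'], cj['centroid']))
                 if j != i else np.inf for j, cj in enumerate(components)]
        j = int(np.argmin(dists))
        if dists[j] > max_dist:
            continue                       # too far to belong to the same line
        hi, hj = ci['height'], components[j]['height']
        if max(hi, hj) / max(min(hi, hj), 1e-6) <= size_ratio:
            parent[find(i)] = find(j)      # link i and its neighbor into one group

    # components sharing a root form one (possibly curved) text line
    lines = {}
    for i in range(n):
        lines.setdefault(find(i), []).append(i)
    return list(lines.values())
```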

4 Experimental results

Since text detection in blurred and non-blurred images of curved texts is a new problem, unlike text detection in video and natural scene images [7, 20, 25], there are no standard datasets providing blurred images of curved texts for experimentation. Therefore, we create our own dataset by capturing images of different orientations, resolutions, backgrounds, colors, scripts and font sizes. We follow the same capture process as stated in [7], except that we include curved texts, blur created by defocus and camera movement, and the objects on which texts are embedded. Our dataset includes 500 defocus, 500 motion and 500 non-blurred images of horizontal, non-horizontal and curved texts, giving a total of 1500 images for experimentation. Thus, we believe that our dataset has enough variation for evaluating the proposed method. The dataset will be made freely available to researchers upon acceptance of this paper.

There are standard datasets for curved and arbitrarily oriented text detection in natural scene images, namely, CUTE80 [13], which provides 80 curved text line images, and MSRATD-500 [22], which provides 200 multi-oriented test samples of different scripts. Since the proposed method is capable of detecting texts in blurred and non-blurred images, we also use these two standard datasets to show its effectiveness. Note that these two standard datasets do not contain blurred images like our dataset. For measuring the performance of the proposed method, we use the standard measures, namely, Recall (R), Precision (P) and F-measure, as defined in Eqs. (14)–(16).

$$ \mathrm{Precision}=\frac{TP}{TP+ FP} $$
(14)
$$ \mathrm{Recall}=\frac{TP}{TP+ FN} $$
(15)
$$ \mathrm{F}1-\mathrm{score}=2\ast \frac{Precision\ast Recall}{Precision+Recall} $$
(16)

where True Positive (TP) is the number of items correctly labeled as belonging to the positive class, True Negative (TN) is the number of items correctly rejected, False Positive (FP) is the number of items incorrectly labeled as positive, and False Negative (FN) is the number of positive items wrongly rejected.
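
For completeness, a small helper showing how Eqs. (14)–(16) are computed from the detection counts; the guard against empty denominators is an implementation detail, not part of the paper.

```python
def detection_scores(tp, fp, fn):
    """Precision, Recall and F1-score from detection counts, Eqs. (14)-(16)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 85 correct detections, 10 false alarms, 15 misses
# detection_scores(85, 10, 15)  ->  (~0.89, 0.85, ~0.87)
```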

To show the superiority of the proposed method, we compare it with the state-of-the-art methods [9, 16, 17, 26] that explore deep learning for text detection in natural scene images. According to [9, 16, 17, 26], these methods are robust to low contrast, complex background, font size, font, script and orientation variations, and to some extent to blur. Therefore, we believe that these existing methods are relevant for comparative studies with the proposed method. Since the codes of the four existing methods are available online, we use them for experimentation in this work, following the same setup as mentioned in [9, 16, 17, 26] when conducting experiments on both our dataset and the benchmark datasets.

The experimental section is structured as follows. To test the objectiveness of the proposed method with an SVM classifier, we conduct experiments using the proposed method with an Extreme Learning Machine (ELM) [3, 15] for comparative studies. The proposed and existing methods are tested on our own dataset as well as the standard datasets, namely, CUTE80 and MSRATD-500, to validate the performance of the proposed method. Since the CUTE80 dataset contains curved text line images, like our dataset, experiments on it justify the evaluation of curved text detection. Similarly, the MSRATD-500 dataset contains arbitrarily oriented text line images of different languages, and thus experiments on it justify the ability to detect texts irrespective of script.

4.1 Classification of text and non-text components

In this work, classifying text and non-text components is an important step for achieving better results. For this, we explore the Bhattacharyya distance measure to extract the symmetry property of text components, as discussed in Section 3.2. Texts have regular spacing, uniform color, aspect ratio, orientation, etc.; as a result, one can expect symmetry for text and non-symmetry for non-text. With this cue, the proposed method estimates the degree of similarity between sub-regions of text candidate images and fixes an automatic condition on the degree of similarity to classify text and non-text components. Instead of this rigid condition, we can also pass the degree-of-similarity values to classifiers. In order to choose the best option, we conduct experiments using the proposed method with the symmetry condition discussed above, with a Support Vector Machine (SVM) classifier, and with an Extreme Learning Machine (ELM) classifier. The confusion matrices and the Classification Rate (CR), which is the mean of the diagonal elements of the confusion matrix, of the three approaches on our dataset are reported in Table 1. It is observed from Table 1 that the proposed method with an SVM classifier scores the best classification rate compared with the rule-based and ELM variants. This is reasonable because the proposed rule is a rigid condition and thus may not be robust to the challenges posed by blur, while ELM requires a large number of samples to achieve the best results, whereas in this work we provide only 6 features and a few samples. On the other hand, SVM does not require a large number of samples and features compared with ELM, as it is a linear classifier and is good for a two-class problem. However, when we compare the classification rates of all three approaches, the difference is marginal.
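
A small sketch of how the Classification Rate could be computed from a confusion matrix, assuming the matrix is given as raw counts with one row per true class (the paper does not state its normalization):

```python
import numpy as np

def classification_rate(confusion):
    """CR: mean of the diagonal of the row-normalized confusion matrix,
    i.e. per-class accuracy averaged over the classes."""
    cm = np.asarray(confusion, dtype=float)
    cm = cm / cm.sum(axis=1, keepdims=True)    # normalize each true-class row
    return float(np.mean(np.diag(cm)))

# Example: classification_rate([[90, 10], [20, 80]])  ->  0.85
```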

Table 1 Confusion matrix of the proposed method with rule, ELM and SVM classifiers for text/non-text component classification on our dataset

4.2 Experiment on our dataset

Sample qualitative results of the proposed and existing methods on defocus, motion and non-blurred images of our dataset are shown in Figs. 7, 8 and 9, respectively, where the proposed method detects texts well compared with the two existing methods. It is observed from Figs. 7, 8 and 9 that the existing methods do not detect texts properly in defocus and motion blurred images, although they detect texts well in non-blurred images. This is expected because the existing methods are developed for good quality images rather than blurred ones, and they also do not cope with the challenge of curved texts. On the other hand, the proposed method handles different blur types and non-blurred images well, thanks to the estimation of the degree of blur and the use of gradient values for detecting text candidates. Furthermore, since the proposed method uses the nearest neighbor criterion for fixing a bounding box for each text component, it can tackle the challenge of curved and otherwise oriented texts.

Fig. 7

Text detection results of the proposed and the existing methods for defocus-blurred images of our dataset

Fig. 8

Text detection results of the proposed and the existing methods for motion-blurred images of our dataset

Fig. 9

Text detection results of the proposed and the existing methods on non-blurred images of our dataset

Quantitative results of the proposed and existing methods on defocus, motion and non-blurred images are reported in Table 2, where it is noted that the proposed method gives consistent results across all three types, while the two existing methods report inconsistent results. In other words, the existing methods score the best results on non-blurred images compared with the proposed method, but score low results on defocus and motion blurred images, for the reasons discussed above. The proposed method is not the best for non-blurred images compared with the existing methods, mainly because it does not involve training on a large number of samples as the existing methods do. Therefore, there is scope for improvement by exploring deep learning for robust feature selection.

Table 2 Text detection performance of the proposed and the existing methods on different types of blurred images of our dataset

We also compare the proposed method with the existing methods in terms of processing time on our dataset, using a machine with a 3.0 GHz Core i5 CPU, 8 GB RAM and an Nvidia GT730 GPU. The average processing time of the proposed and existing methods is reported in Table 2, where it can be seen that the proposed method consumes more time than the existing methods. This is because the main focus of this work is detecting texts in blurred images, and the implementation has not been optimized for speed. In addition, the method involves k-means clustering, which is iterative, for text candidate detection, and the grouping of text components requires connected component analysis, so the proposed method is more expensive than the existing methods. On the other hand, since we use the pre-trained codes of the existing methods and report the average processing time over the test images, those methods do not consume much time.

4.3 Experiments on benchmark datasets

As mentioned earlier, to test the objectiveness of the proposed method, we conduct experiments on two benchmark datasets, namely, CUTE80 and MSRATD-500, which consist of curved text and arbitrarily oriented text images, respectively. Sample qualitative results of the proposed and existing methods on the CUTE80 and MSRATD-500 datasets are shown in Fig. 10 and Fig. 11, respectively. It is noted from Fig. 10 and Fig. 11 that the existing methods do not detect texts properly in curved images, while they work well on MSRATD-500 images. This shows that the existing methods are good for horizontal and non-horizontal texts but give poor results on curved text images. The proposed method, however, gives better results on both the CUTE80 and MSRATD-500 datasets.

Fig. 10

Text detection results of the proposed and existing methods on CUTE80 dataset

Fig. 11

Text detection results of the proposed and existing methods on MSRATD-500 dataset

Quantitative results of the proposed and existing methods on the CUTE80 and MSRATD-500 datasets are reported in Table 3 and Table 4, respectively. It is noted from Table 3 and Table 4 that the proposed method is better than the existing methods on both datasets. The existing methods report lower results on CUTE80 than on MSRATD-500 because CUTE80 contains many curved texts, whereas MSRATD-500 contains multi-oriented (horizontal and non-horizontal) texts. Furthermore, Table 2, Table 3 and Table 4 show that the proposed method reports almost uniform results on defocus, motion and non-blurred curved images and on the two benchmark datasets. Therefore, we can conclude that the proposed method is consistent and works well for different blur types and non-blurred images.

Table 3 Text detection performance of the proposed and the existing methods on CUTE80 dataset
Table 4 Text detection performance of the proposed and the existing methods on MSRATD-500 dataset

Sometimes, when images are affected by different blur types and contain texts with non-uniform spacing or fancy fonts, the proposed method fails to detect texts accurately, as shown in Fig. 12, where the images suffer from multiple factors. In addition, when images contain single characters along with blur, it is hard to extract distinct features for separating text from non-text. To determine the upper limit of the degree of defocus and motion blur that the method can handle, we also conduct experiments on our dataset by increasing the degree of defocus and motion blur using built-in blurring functions, varying the HSIZE parameter for defocus blur and the LEN parameter for motion blur to obtain increasingly blurred versions of our images. The results are illustrated in Fig. 13(a) and Fig. 13(b) for defocus and motion blur, respectively, where it is noted that as the degree of blur increases along the X axis, the performance in terms of recall, precision and F-measure decreases, gradually at the initial stages and then with significant drops at the later stages. This shows that one cannot expect consistent results as the degree of blur increases without bound; as long as the content is visible regardless of the degree of blur, the method can perform well. Therefore, there is scope for further improvement of the proposed method.
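
The HSIZE and LEN parameters mentioned above suggest MATLAB-style blurring functions; as a rough, hypothetical equivalent, increasingly blurred test images could be generated in Python/OpenCV as sketched below (the disk and linear kernels are assumptions, not the exact functions used in the experiments).

```python
import cv2
import numpy as np

def simulate_defocus(img, hsize=9):
    """Approximate defocus blur with a normalized disk kernel of diameter hsize."""
    kernel = np.zeros((hsize, hsize), np.float32)
    cv2.circle(kernel, (hsize // 2, hsize // 2), hsize // 2, 1, -1)
    kernel /= kernel.sum()
    return cv2.filter2D(img, -1, kernel)

def simulate_motion(img, length=9):
    """Approximate horizontal motion blur with a linear kernel of the given length."""
    kernel = np.zeros((length, length), np.float32)
    kernel[length // 2, :] = 1.0 / length
    return cv2.filter2D(img, -1, kernel)
```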

Fig. 12

Erroneous results/limitations of the proposed method

Fig. 13

Performance of the proposed method for different degrees of defocus and motion blur

5 Conclusions and future work

In this work, we have proposed a new method for detecting texts in blurred and non-blurred images. The proposed method explores degree-of-blur estimation based on the variations of neighboring pixels. At the same time, to detect text pixels in non-blurred images, it obtains the gradient of each pixel and weights it by the degree of blur, which widens the gap between text and non-text pixels regardless of defocus, motion or non-blurred content. The proposed method uses k-means clustering on these values to obtain text candidates. To reduce the number of false text candidates, it explores the Bhattacharyya distance for estimating the similarity between the histograms of pairs of sub-regions. Further, the proposed method uses the nearest neighbor criterion for fixing a bounding box for each text line of any orientation and blur type. Experimental results on our own dataset, which includes defocus, motion and non-blurred curved texts, and on the benchmark datasets CUTE80 and MSRATD-500 show that the proposed method scores consistent results across all the datasets compared with the existing methods. As noted in the experimental section, a few issues remain to be fixed in the future to improve the results in different situations. Furthermore, we plan to derive theoretical support for the proposed hypothesis to validate its generality for text detection in blurred and non-blurred images.