1 Introduction

With the development of multimedia, internet, and electronic technologies, images can easily be captured and transmitted at high speed. Images contain rich information that is usually embedded in complex structures, so effectively organizing, retrieving and understanding them has become an imperative task. Many images contain meaningful text that is rich in semantics, for example, the name on the front cover of a book, the trademark on a computer, or the instructions on a road sign. Text such as a program name or the time is also often superimposed on video frames. However, text contained in images is usually hard to acquire other than by text extraction and recognition. Text information from images also enables useful real-world applications such as image annotation and retrieval [6, 32].

Some researchers have managed to utilize text information to organize and retrieve images or videos [10, 18, 21, 22, 25, 26, 31]. There are several advantages to doing so. On one hand, most text contained in images is relevant to the semantic content of those images, and text needs only limited space for storage. On the other hand, text in images is likely to be recognized with high accuracy. Besides, text is naturally a means of content expression. Text detection aims to find the accurate areas of text pixels in an image or video frame [5, 22, 31]. Text in images can be classified into two main categories as in Fig. 1: (1) scene text [29]: text that appears in natural scenes, and (2) artificial text [4, 12]: text that is added in the post-production stage. Generally speaking, owing to variations in background, viewing perspective, illumination, text location and other aspects, scene text is more challenging to detect than artificial text. In this paper, we mainly focus on the detection of scene text.

Fig. 1 Examples for: a Scene text, b Artificial text

Text detection methods can be classified into three main categories [11]: gradient based methods [13, 18, 24, 28, 31, 33], texture based methods [1, 23] and color clustering based methods [7, 30].

Gradient based methods assume that text exhibits strong edges against the background, so pixels with high gradient values are regarded as good candidates for text regions. [31] and [18] both detect text strokes by searching stroke paths, in which two end points on the edge mask have approximately opposite gradient directions, and then use clustering and other heuristic rules to group those strokes into text lines. Both share the advantage that they can detect text lines with arbitrary but consistent orientation. However, being based on stroke detection, they do not work well on images with complex backgrounds. [24] employed the Fourier-Laplacian filter to enhance the difference between text and non-text regions, used K-means clustering to identify candidate text regions based on the maximum difference, and finally used skeletons to split candidate regions and heuristic rules to identify text regions. While their method can also detect non-horizontal text lines, it sometimes produces broken text regions. [13] used contour and temporal features to enhance the accuracy of caption localization in videos, but the method cannot be applied directly to scene text detection for lack of temporal features. Gradient based methods become less reliable when the scene also contains many edges in the background.

Texture based methods extract texture by Gabor filters, wavelet transforms, fast Fourier transforms (FFT), spatial variance multichannel processing and so on. Based on texture features, text can be detected by machine learning methods such as neural networks and support vector machines (SVM). [1] used the discrete wavelet transform to detect three kinds of edges and a neural network to obtain text regions. The method can detect text embedded in complex backgrounds, but the neural network takes a long time to train and it can only detect horizontal text. [23] presented new statistical features based on the Fourier transform in RGB space (FSF), used K-means clustering to separate text pixels from the background, analyzed the projection profiles of text pixels, and finally used some heuristics to identify text blocks. Texture based methods usually have two shortcomings. On one hand, they need a set of representative training images, which is generally not easy to obtain; on the other hand, because of the signal response in only a few directions, they can only detect text in horizontal or vertical orientation. This is not sufficient for text detection in natural images.

Color based approaches assume that the text in images possesses a uniform color. [30] first detected the accumulated edge map of an image, and then colorized and decomposed it using the affinity propagation (AP) clustering algorithm [3]. Finally, a projection algorithm was employed to detect text regions in each decomposed edge map. Their method can make text detection and recognition more accurate; however, it is rarely true that text appears in a uniform color in images. [7] proposed a split-and-merge segmentation method using colour perception to extract text from web images. Their method can detect text lines in arbitrary orientation, but it does not work well on scene images, which usually have a more complex color distribution than web images.
There are also some recent methods that use other features to detect text in images. For example, [34] proposed a corner based approach to detect text and captions. However, corners alone are not sufficient for text detection in natural images because of the low contrast between text and non-text regions in many of them. [8] used the transient colors between inserted text and its adjacent background to generate a transition map. Since the transient color is usually hard to detect, especially in images containing scene text only, the method performs poorly on natural images.

In this paper, we propose a novel text line detection method for scene text in natural images. Firstly, maximally stable extremal regions (MSERs) [9, 17] are detected. An MSER is a region that remains stable under image binarization when the threshold is varied within a certain range. To prevent unwanted regions from being merged, the Canny edges of the image serve as dam lines. Connected components (CCs) are identified using 4-neighbour connectivity on the MSER mask image with dam lines. We then define an enhanced geometry distance (similarity) and a color distance (similarity) between CC pairs. Based on this distance (similarity) measurement, CCs whose centre points lie on the same line are organized into candidate text lines. Finally, all candidate text lines are transformed into horizontal or vertical ones by a rotation operation, and a sparse classifier is used to identify real text lines. Our method uses edge constrained MSERs to detect text regions, so it can detect more stable regions than general MSERs while overcoming the shortcomings of gradient based methods. By using text line detection and a rotation transformation, our method can detect text lines aligned with a straight line in any direction. By using the sparse classifier as a filter, our method obtains a higher accuracy than existing methods.

The contributions of this paper can be summarized in the following three aspects:

  1. (1)

    A similarity measurement between any two connected components is developed. It integrates region sizes, absolute distance, relative distance, contextual information and color information into a single value. It is powerful in characterizing text regions.

  2. (2)

    A text line identification algorithm is proposed. Based on the similarity measurement, the text line identification algorithm firstly searches three CCs as the seed of a line, and then expands to obtain all other CCs in the line. The method is effective for candidate text line extraction.

  3. (3)

    A new filter is developed to remove non-text lines. To this end, a sparse classifier is developed, based on two dictionaries learned from feature vectors extracted from the morphological skeletons of the MSERs. With this sparse filter, our method obtains a higher precision than other methods.

To validate our proposed method for text line detection in natural images, two datasets were used. For a comparative study, several state-of-the-art methods were used. Experimental results show that our method significantly outperforms the selected state-of-the-art ones.

The remainder of this paper is organized as follows: Section 2 briefly describes the framework of our text line detection method. Candidate text region detection and similarity measurement are described in Section 3. Section 4 details the proposed text line detection method. The sparse filter is developed in Section 5. Experimental results and analyses are presented in Section 6. Finally, we draw conclusions and indicate future work in Section 7.

2 The framework of text detection

Our text line detection method consists of 3 steps as shown in Fig. 2. The first step mainly uses the MSER detector to detect all candidate text regions. Though each detected MSER has consistent intensity within itself, the MSERs are isolated from each other and unstructured. In the second step, nearby regions are merged into candidate text lines based on the similarity and angles among them. Candidate text lines contain not only text lines but also some non-text lines. Finally, a sparse filter is used to get rid of the non-text lines. The sparse filter works on the reconstruction errors produced by the learned dictionaries of the sparse classifier.

Fig. 2 The procedure of text line detection operation

3 Region detection and similarity definition

In this section we detail the candidate text region detection and measurement of similarity between two candidate text regions.

3.1 Maximally stable extremal regions

Definition [17]

Image I is a mapping I : D ⊂ ℤ² → S. Extremal regions are well defined if the following two conditions are met:

  1. (1)

    S is totally ordered.

  2. (2)

    An adjacency (neighbourhood) relation A ⊂ D × D is defined.

Region Q is a contiguous subset of D

(Outer) Region Boundary ∂Q = {q ∈ D \ Q : ∃p ∈ Q : qAp}

Extremal Region Q ⊂ D is a region such that for all p ∈ Q, q ∈ ∂Q : I(p) > I(q) (maximum intensity region) or I(p) < I(q) (minimum intensity region).

Maximally Stable Extremal Region (MSER). Let Q1, Q2, …, Qi−1, Qi, … be a sequence of nested extremal regions (Qi ⊂ Qi+1). Extremal region Qi* is maximally stable iff \( q(i)=\left|{Q}_{i+\Delta}\setminus {Q}_{i-\Delta}\right|/\left|{Q}_i\right| \) has a local minimum at i* (|·| denotes cardinality). Δ ∈ S is a parameter of the method.

There are mainly three reasons for us to select MSER as our candidate region detector: firstly it is invariant to affine transformation of image intensities, secondly it is very stable and lastly it can detect regions at different scales. Readers can refer to [17] for more details about MSER.

3.2 Connected components collection

Although MSERs are very stable, their detection pays particular attention to consistent gray intensities while neglecting gradient features to some extent. Under some conditions, it may link noisy pixels to text pixels of similar gray intensity. For example, in Fig. 3b many noisy pixels are connected to pixels of the character “N” in the first text row, and the noisy pixels even connect characters in the first row to those in the second row. These incorrect connections cause difficulties for later operations. To overcome this shortcoming, we use the Canny edges in the image as dam lines for the MSERs. Pixels that lie both on the Canny edge mask and in a detected MSER are removed, and connected components (CCs) are collected from the detected MSERs with the Canny edges as dam lines. In the collection process, a pixel can only connect to its four directly adjacent pixels (above, below, left, and right). This prevents two pixels on different sides of an edge pixel from being connected. It can also be seen from Fig. 3 that our text detection method indeed benefits from the removal of pixels on the edges.
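
As a rough illustration of this step (not the authors' exact implementation), the following Python sketch uses OpenCV's MSER and Canny detectors and SciPy's component labelling; the Canny thresholds and the function name collect_candidate_ccs are our own choices.

```python
import cv2
import numpy as np
from scipy import ndimage

def collect_candidate_ccs(gray, delta=6, canny_low=100, canny_high=200):
    """Detect MSERs, knock out Canny edge pixels as dam lines, and label
    4-connected components on the remaining mask."""
    mser = cv2.MSER_create(delta)                 # delta is the MSER step parameter (see Section 6.2)
    regions, _ = mser.detectRegions(gray)

    mask = np.zeros(gray.shape, dtype=np.uint8)
    for pts in regions:                           # each region is an (N, 2) array of (x, y) pixels
        mask[pts[:, 1], pts[:, 0]] = 1

    edges = cv2.Canny(gray, canny_low, canny_high) > 0
    mask[edges] = 0                               # edge pixels act as dam lines between regions

    # 4-connectivity: a pixel only joins its top/bottom/left/right neighbours,
    # so pixels on opposite sides of an edge pixel stay in separate components.
    four_conn = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])
    labels, n_ccs = ndimage.label(mask, structure=four_conn)
    return labels, n_ccs
```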

Fig. 3 a The original image. b Detected MSERs. c Detected text lines by using MSERs as in (b). d Image mask integrating MSERs and canny edges. e Magnification of part of (d). f Detected text lines by using MSERs as in (d)

3.3 Similarity measurement

Next we want to merge similar text regions into text lines. To this end, we need to define a similarity between any two connected components. A good similarity measurement should satisfy two properties: the similarity between CCs in the same text line should be large enough, and the similarity between CCs in different text lines should be small enough. Suppose V = {CC1, CC2, …, CCn} contains all CCs in an image and y = [y1, y2, …, yn], where yi denotes the index of the text line that CCi belongs to; we want to define a similarity function such that simi(CCi, CCj) > simi(CCi, CCk) whenever yi = yj and yi ≠ yk. Because CCs in the same text line usually have similar size and colour, we define two different similarities: geometry similarity and colour similarity.

3.3.1 Geometry similarity

We first introduce the geometry distance between two CCs, and then describe the geometry similarity based on this distance.

The geometry distance integrates the normalized absolute distance, the relative distance and the size ratio between two CCs into a single measurement. The normalized absolute distance between CCi and CCj is defined as follows:

$$ \mathrm{di}{s}_{\mathrm{ij}}^1=\frac{\left|{C}_{x_i}-{C}_{x_j}\right|+{k}_1\cdot \left|{C}_{y_i}-{C}_{y_j}\right|}{k_1\cdot {h}_{\mathrm{im}}+{w}_{\mathrm{im}}-\left({k}_1\cdot \left({h}_i+{h}_j\right)+{w}_i+{w}_j\right)/2} $$
(1)

where \( {C}_{x_i} \), \( {C}_{y_i} \), \( {C}_{x_j} \) and \( {C}_{y_j} \) respectively denote the horizontal and vertical coordinates of the centroids of CCi and CCj; hi, hj, wi, and wj respectively denote the heights and widths of CCi and CCj; and him and wim denote the height and width of the current image. k1 is a constant that controls the weight of the vertical distance relative to the horizontal distance; in this paper its value is 2. The numerator in Eq. 1 is the weighted L1 distance between the two centroids, while the denominator mainly serves as a normalization factor. The value of \( \mathrm{dis}_{ij}^1 \) ranges from 0 to 1 and reaches its largest value of 1 when the two regions are at opposite corners of the image. Text lines are generally aligned horizontally, so the vertical distance plays a more important role than the horizontal distance in distinguishing different text lines. This observation is reflected in Eq. 1: if the centre points of CCi and CCj are equally far from the centre point of a third component CCk, but the centres of CCj and CCk share the same horizontal coordinate while those of CCi and CCk share the same vertical coordinate, then CCi is considered farther from CCk than CCj is. It can also be seen from Fig. 4 that characters “A” and “B” are in the same row, so the distance between them should be shorter than that between “A” and “D”. However, if k1 is set to a large value, it will prevent text in a vertical line from being clustered into the same text line, so 2 is a compromise value.
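
The following is a direct transcription of Eq. 1 in Python, with k1 = 2 as in the paper; the CC attributes (cx, cy, h, w) are assumed names, not the authors' data structure.

```python
def dis1(cc_i, cc_j, im_h, im_w, k1=2.0):
    """Normalized absolute distance of Eq. 1 between two connected components.
    Each CC is assumed to expose its centroid (cx, cy) and its height h / width w."""
    num = abs(cc_i.cx - cc_j.cx) + k1 * abs(cc_i.cy - cc_j.cy)
    den = k1 * im_h + im_w - (k1 * (cc_i.h + cc_j.h) + cc_i.w + cc_j.w) / 2.0
    return num / den
```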

Fig. 4 Distance measurement

To enlarge the difference of distances between different CC pairs, the distance measurement in Eq. 1 can be modified as follows:

$$ \mathrm{di}{s}_{\mathrm{ij}}^2=\mathrm{di}{s}_{\mathrm{ij}}^1\sqrt{\aleph_i(j)} $$
(2)

where \( \aleph_i(j) \) denotes the rank of \( \mathrm{dis}_{ij}^1 \) among all distances \( \mathrm{dis}_{i\cdot}^1 \) measured from the centre of CCi. For example, if the distances between the centre of CCi and all other CC centres are 1.5, 1.7, 2.3, 4.5, and 5.6, and the distance from CCj to the centre of CCi is 4.5, then \( \aleph_i(j) = 4 \). It can be seen from Eq. 2 that \( \mathrm{dis}_{ij}^2 \) has stronger distinguishing ability because it assigns larger distances from CCi to distant CCs. Note also that \( \mathrm{dis}_{ij}^2 \) is generally not equal to \( \mathrm{dis}_{ji}^2 \). The motivation of Eq. 2 is that the rank in the sorted distance list brings contextual information into the distance calculation.

When there are many characters in a text line, the distance between the characters at its two ends calculated by Eq. 2 will be large, which may prevent them from merging into a line. To solve this problem, we modify Eq. 2 as follows:

$$ {\mathrm{dis}}_{\mathrm{ij}}^3={ \min}_{{\mathrm{p}}_{\mathrm{k}}}\left\{{\mathrm{dis}}_{{\mathrm{ip}}_{\mathrm{k}}\mathrm{j}}^2,\left|{\mathrm{p}}_{\mathrm{k}}\right|\in \left\{0,1,2,\dots \dots \mathrm{n}-2\right\}\right\} $$
(3)

where pk denotes a path from CCi to CCj whose length is between 0 and n−2. \( \mathrm{dis}_{ij}^3 \) is the shortest-path distance between CCi and CCj and can be obtained by the Floyd–Warshall algorithm or any algorithm with a similar function. The motivation of Eq. 3 is that a good distance measurement between two objects should describe their relationship in the context of all objects. Moreover, the distance defined in Eq. 3 fulfils the triangle inequality, which is very important for a distance measurement. An example is shown in Fig. 4. Suppose \( \mathrm{dis}_{ik}^2 = 0.5 \), \( \mathrm{dis}_{is}^2 = 0.2 \), \( \mathrm{dis}_{st}^2 = 0.2 \) and \( \mathrm{dis}_{it}^2 = 0.9 \); then \( \mathrm{dis}_{it}^2 > \mathrm{dis}_{ik}^2 \). However, CCi and CCt are in the same text line while CCi and CCk are not, so using only dis² may result in wrong text line detection. By using dis³, we get \( \mathrm{dis}_{it}^3 = 0.4 \) (the shortest path between CCi and CCt is CCi–CCs–CCt) while \( \mathrm{dis}_{ik}^3 = 0.5 \), so \( \mathrm{dis}_{it}^3 < \mathrm{dis}_{ik}^3 \), which correctly reflects the relationship between the two pairs of CCs. Because the distance between adjacent CCs in the same text line is usually small, the distances among CCs in the same line are further reduced by Eq. 3.
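
The following sketch illustrates Eqs. 2 and 3 on a full pairwise distance matrix, assuming d1 is a float matrix of Eq. 1 distances with zeros on the diagonal; the Floyd–Warshall step is a generic implementation, not necessarily the authors' choice.

```python
import numpy as np

def context_distance(d1):
    """Eq. 2: scale each dis1 by the square root of its rank among the distances
    measured from the same CC (the nearest other CC gets rank 1, matching the
    example in the text)."""
    n = d1.shape[0]
    d2 = np.zeros_like(d1)
    for i in range(n):
        rank = np.empty(n)
        rank[np.argsort(d1[i])] = np.arange(n)    # the zero self-distance gets rank 0
        d2[i] = d1[i] * np.sqrt(rank)
    return d2

def shortest_path_distance(d2):
    """Eq. 3: Floyd-Warshall over the (possibly asymmetric) dis2 matrix."""
    d3 = d2.copy()
    for k in range(d3.shape[0]):
        d3 = np.minimum(d3, d3[:, k:k + 1] + d3[k:k + 1, :])
    return d3
```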

The shape distance between two regions is defined as follows:

$$ \mathrm{di}{s}_{\mathrm{ij}}^4=\sqrt{\frac{ \max \left({h}_i,{h}_j\right)\cdot \max \left({w}_i,{w}_j\right)}{ \min \left({h}_i,{h}_j\right)\cdot \min \left({w}_i,{w}_j\right)}} $$
(4)

It can be seen from Eq. 4 that two CCs with similar size obtain a small value of \( \mathrm{dis}_{ij}^4 \). CCs in the same text line usually share similar height and width, so they produce a small dissimilarity value.

Finally, all the distance measurements defined above are integrated into a single similarity measurement:

$$ \mathrm{sim}{i}_{\mathrm{geometry}}\left(i,j\right)= \exp \left(-\sqrt{ \max \left(\mathrm{di}{s}_{\mathrm{ij}}^3,\mathrm{di}{s}_{\mathrm{ji}}^3\right)\cdot \mathrm{di}{s}_{\mathrm{ij}}^4}\right) $$
(5)

The similarity lies strictly between 0 and 1; the greater the distance values, the smaller the similarity. It can be seen from the above five equations that our geometry similarity combines region sizes, normalized absolute distance, relative distance and contextual information, so it is expected to be powerful in characterizing text regions.
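
A minimal sketch of Eqs. 4 and 5 follows, assuming CC objects with h and w attributes and the dis³ matrix d3 produced by the previous sketch.

```python
import numpy as np

def dis4(cc_i, cc_j):
    """Eq. 4: shape (size-ratio) distance between two CCs."""
    return np.sqrt((max(cc_i.h, cc_j.h) * max(cc_i.w, cc_j.w)) /
                   (min(cc_i.h, cc_j.h) * min(cc_i.w, cc_j.w)))

def simi_geometry(i, j, d3, ccs):
    """Eq. 5: fold the (asymmetric) path distance dis3 and the shape distance
    dis4 into a single similarity strictly between 0 and 1."""
    return float(np.exp(-np.sqrt(max(d3[i, j], d3[j, i]) * dis4(ccs[i], ccs[j]))))
```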

3.3.2 Colour similarity

Text CCs in the same text line usually share the same colour, so colour is also an important factor to take into account. In this paper, we first convert images from the RGB colour space into the HSV colour space, and then quantize the H, S, and V components into 8, 3, and 3 bins respectively, which gives a colour histogram of dimension 72. Supposing that the colour feature vectors of CCi and CCj are Ci = [Ci,1, Ci,2, …, Ci,t, …, Ci,n] and Cj = [Cj,1, Cj,2, …, Cj,t, …, Cj,n] respectively, the colour similarity is calculated as:

$$ \mathrm{sim}{i}_{\mathrm{color}}\left(i,j\right)={\displaystyle \sum_{t=1}^n \min \left({C}_{i,t},{C}_{j,t}\right)} $$
(6)

where n = 72 in this paper.
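
A sketch of the colour similarity with OpenCV follows, assuming BGR input images and per-CC binary masks; normalizing the 72-bin histogram so that the intersection of Eq. 6 lies in [0, 1] is our assumption.

```python
import cv2
import numpy as np

def colour_histogram(bgr_image, cc_mask):
    """72-bin HSV histogram (8 x 3 x 3) of the pixels inside one CC, normalized
    so that the histogram intersection of Eq. 6 lies in [0, 1]."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], cc_mask.astype(np.uint8),
                        [8, 3, 3], [0, 180, 0, 256, 0, 256])
    hist = hist.flatten()
    return hist / max(hist.sum(), 1e-9)

def simi_colour(hist_i, hist_j):
    """Eq. 6: histogram intersection of two CC colour histograms."""
    return float(np.minimum(hist_i, hist_j).sum())
```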

The similarity between two CCs can be finally estimated as:

$$ \mathrm{simi}\left(i,j\right)=\left(\mathrm{sim}{i}_{\mathrm{geometry}}\left(i,j\right)+\mathrm{sim}{i}_{\mathrm{color}}\left(i,j\right)\right)/2 $$
(7)

4 Candidate text line detection

Since text is almost always written in lines, we can use the contextual information of CCs to merge similar CCs into text lines. Text line detection is divided into two steps: sibling identification and text line identification. Sibling identification uses heuristic rules to decide whether two adjacent CCs can be merged; if they can, we call them siblings. The heuristic rules mainly check whether their sizes are similar and whether their absolute distance is small enough for merging. If two CCs are siblings, text line identification then decides whether they are in the same line. We detail the two steps in the following sections.

4.1 Sibling identification

Partly based on [31], three constraints are defined to decide whether two connected components are siblings of each other.

  1. (1)

    The height ratio and width ratio of two adjacent CCs should be bounded by two thresholds T 1 and T 2 .

  2. (2)

    Two adjacent characters should not be too far away from each other in spite of various heights and widths, so the distance between two connected components should not be greater than T 3 times the width or height of the larger one.

  3. (3)

    Two adjacent characters should share similar colour feature, so their colour similarity should be above a threshold T 4 .

The three constraints can be formalized as follows:

$$ {\mathrm{S}}_{\mathrm{ij}}={\mathrm{S}}_{\mathrm{ij}}^1\wedge {\mathrm{S}}_{\mathrm{ij}}^2\wedge {\mathrm{S}}_{\mathrm{ij}}^3 $$
(8)

Sij denotes whether the two connected components CCi and CCj are siblings of each other. If Sij is 1, CCi and CCj are siblings and may lie in the same text line; otherwise they cannot lie in the same text line. S 1ij , S 2ij and S 3ij represent the above three constraints respectively. In this paper, we set T1 = 2, T2 = 4, T3 = 3 and T4 = 0.4. Note that although our constraints are similar to those in [31], they differ in several aspects. The constraint rules in [31] generally assume that text lines are horizontal and are only used in the adjacent character grouping stage, whereas ours handle text lines in arbitrary directions; to this end, our rules first estimate the text line orientation. Although [31] also proposed a text line grouping method for text lines in arbitrary directions, that method does not use the constraint rules. The first constraint is formalized in Eqs. 9 and 10:

$$ \begin{array}{l}{\mathrm{h}}_{\mathrm{r}}= \max \left({\mathrm{h}}_{\mathrm{i}},{\mathrm{h}}_{\mathrm{j}}\right)/ \min \left({\mathrm{h}}_{\mathrm{i}},{\mathrm{h}}_{\mathrm{j}}\right)\hfill \\ {}{\mathrm{w}}_{\mathrm{r}}= \max \left({\mathrm{w}}_{\mathrm{i}},{\mathrm{w}}_{\mathrm{j}}\right)/ \min \left({\mathrm{w}}_{\mathrm{i}},{\mathrm{w}}_{\mathrm{j}}\right)\hfill \end{array} $$
(9)
$$ {S}_{\mathrm{ij}}^1=\left\{\begin{array}{ll}1\hfill & \mathrm{if}\left({h}_r\le {T}_1\&\&{w}_r\le {T}_2\right)\mathrm{and}\left|\mathrm{tg}\uptheta \right|\le 1\hfill \\ {}1\hfill & \mathrm{if}\left({h}_r\le {T}_2\&\&{w}_r\le {T}_1\right)\mathrm{and}\left|\mathrm{tg}\uptheta \right|>1\hfill \\ {}\kern2.25em 0\hfill & \begin{array}{llllll}\hfill & \hfill & \hfill & \hfill & \hfill & \mathrm{otherwise}\hfill \end{array}\hfill \end{array}\right. $$
(10)

In Eq. 10, θ denotes the angle between the positive X axis and the line segment that connects centres of the two connected components CCi and CCj.

If the absolute value of the line slope is smaller than 1, the orientation of the candidate text line is treated to be roughly horizontal, otherwise it is roughly vertical. For the horizontal text line, the height ratio should be less than T1 and width ratio less than T2. In contrast, for the vertical text line, the height ratio should be less than T2 and width ratio less than T1.

The second constraint is formalized in Eq. 11:

$$ {S}_{\mathrm{ij}}^2=\left\{\begin{array}{ll}1\hfill & \mathrm{if}\left(\left|\mathrm{tg}\uptheta \right|>1\&\&\mathrm{di}{s}_i{}_j\le {T}_3\cdot \max \left({h}_i,{h}_j\right)\right)\hfill \\ {}1\hfill & \mathrm{if}\left(\left|\mathrm{tg}\uptheta \right|\le 1\&\&\mathrm{di}{s}_i{}_j\le {T}_3\cdot \max \left({w}_i,{w}_j\right)\right)\hfill \\ {}\hfill & \begin{array}{llllll}\hfill & \hfill & \hfill & \hfill & 0\hfill & \mathrm{otherwise}\hfill \end{array}\hfill \end{array}\right. $$
(11)

It means that if CCi and CCj are in a horizontal text line, their distance should be shorter than T3 times the larger width, otherwise it should be shorter than T3 times the larger height.
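
The three constraints can be combined into a single sibling test as in the following sketch; the CC attributes and the use of the centroid distance in the second constraint are assumptions of this sketch, with the thresholds taken from the paper.

```python
import math

def are_siblings(cc_i, cc_j, colour_simi_ij, T1=2.0, T2=4.0, T3=3.0, T4=0.4):
    """Sibling test of Eq. 8: size constraint (Eqs. 9-10), distance constraint
    (Eq. 11) and colour constraint. The CC attributes (cx, cy, h, w) and the
    centroid distance used in the second constraint are assumptions."""
    dx, dy = cc_j.cx - cc_i.cx, cc_j.cy - cc_i.cy
    roughly_horizontal = abs(dy) <= abs(dx)                 # |tg(theta)| <= 1

    h_r = max(cc_i.h, cc_j.h) / min(cc_i.h, cc_j.h)         # Eq. 9
    w_r = max(cc_i.w, cc_j.w) / min(cc_i.w, cc_j.w)
    if roughly_horizontal:                                  # Eq. 10
        s1 = h_r <= T1 and w_r <= T2
    else:
        s1 = h_r <= T2 and w_r <= T1

    dist = math.hypot(dx, dy)                               # Eq. 11
    if roughly_horizontal:
        s2 = dist <= T3 * max(cc_i.w, cc_j.w)
    else:
        s2 = dist <= T3 * max(cc_i.h, cc_j.h)

    s3 = colour_simi_ij >= T4                               # colour constraint
    return s1 and s2 and s3
```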

4.2 Text line identification

If two connected components are siblings, it only means that the two CCs have similar size and colour and a small distance between them; it remains ambiguous whether they lie in the same text line. So the next step is to find all candidate text lines. Because the sibling identification step has already considered the size, distance and colour features of connected components, in this step we mainly consider the central locations of CCs. If a set of points p1, p2, …, pn lies on the same line l, then any line l t created by linking two points pi, pj ∈ {p1, p2, …, pn} has the same slope as the whole line l, and the reverse is also true. We use this property to detect candidate text lines. Given a set of centroids of connected components Scc, groups of collinear character centroids are computed as:

$$ \mathrm{C}=\left\{\mathrm{c}\Big|\mathrm{c}=\mathrm{centroid}\left(\mathrm{CC}\right)\wedge \mathrm{C}\mathrm{C}\in {\mathrm{S}}_{\mathrm{c}\mathrm{c}}\right\} $$
(12)
$$ \mathrm{L}={\mathrm{L}}_1\cup {\mathrm{L}}_2 $$
(13)
$$ {L}_1=\left\{G\Big|G\subseteq C,\left|G\right|\ge 3,\forall {c}_i,{c}_j,{c}_k\in G,l\left({c}_i,{c}_j\right)=l\left({c}_i,{c}_k\right)=l\left({c}_j,{c}_k\right)\right\} $$
(14)
$$ {L}_2=\left\{G\prime \Big|G\prime =\left({c}_i,{c}_j\right),\exists G\in {L}_1, slope(G)= slope\left(G\prime \right)\right\} $$
(15)

In Eq. 12, C denotes the set of centroids of all the connected components in Scc. In Eq. 13, L denotes the set of all candidate text lines. It includes lines from two categories: lines that contain at least three CCs and lines that contain only two CCs. L1 in Eq. 14 denotes the set of text lines composed of at least three CCs, where l(ci, cj) denotes the line passing through ci and cj. L2 in Eq. 15 denotes the set of text lines each of which is composed of two CCs but must be parallel to at least one line in set L1. slope(G) denotes the slope of the line identified by the points in set G.

To identify L1, for every line we first search for its three seed CCs, and then expand the line to contain more CCs. In the seed searching phase, if the three lines created by linking any two of the three components CCi, CCj, and CCk have the same slope, we consider them to be on the same line and they constitute the seed CCs of the current line. After obtaining the seed CCs of a line l i , we maintain its average angle. Then, for every remaining CCu, we obtain its K-NN (K nearest neighbouring) CCs CCSK in the current line. If the slope angle of the line segment through CCu and any CCv ∈ CCSK is close enough to the average angle of the current line, CCu also belongs to the current line. Note that when calculating the K-NN CCs of a CC we use the L2 distance instead of the geometry distance mentioned before. The angle between two line segments c i c j and c j c k is calculated as follows:

$$ \varDelta {\theta}_{ijk}= \min \left\{ \arccos \left(\frac{v\left({c}_i{c}_j\right)\cdot v\left({c}_j{c}_k\right)}{\left\Vert v\left({c}_i{c}_j\right)\right\Vert \cdot \left\Vert v\left({c}_j{c}_k\right)\right\Vert}\right),\pi - \arccos \left(\frac{v\left({c}_i{c}_j\right)\cdot v\left({c}_j{c}_k\right)}{\left\Vert v\left({c}_i{c}_j\right)\right\Vert \cdot \left\Vert v\left({c}_j{c}_k\right)\right\Vert}\right)\right\} $$
(16)

where v(c i c j ) and v(c j c k ) denote the vectors of c i c j and c j c k respectively.

The average slope angle of any two line segments c i c j and c m c n is defined as follows:

$$ \overline{\uptheta}=\left\{\begin{array}{ll}\frac{\theta_{\mathrm{ij}}+{\theta}_{\mathrm{mn}}+\pi }{2}\hfill & \mathrm{if}\ {\theta}_{\mathrm{ij}}\cdot {\theta}_{\mathrm{mn}}\le 0\ \&\&\ \max \left(\left|{\theta}_{\mathrm{ij}}\right|,\left|{\theta}_{\mathrm{mn}}\right|\right)\ge \frac{\pi }{4}\hfill \\ {}\frac{\theta_{\mathrm{ij}}+{\theta}_{\mathrm{mn}}}{2}\hfill & \mathrm{otherwise}\hfill \end{array}\right. $$
(17)

In the above equation, θij and θmn range in \( \left[-\frac{\uppi}{2},\frac{\uppi}{2}\right] \). The angle difference between a line segment c i c j and a line with an average slope angle of \( \overline{\uptheta} \) is defined as follows:

$$ \varDelta \uptheta = \min \left\{\left|-{\uptheta}_{\mathrm{ij}}-\overline{\uptheta}\right|,\uppi -\left|-{\uptheta}_{\mathrm{ij}}-\overline{\uptheta}\right|\right\} $$
(18)

The minus sign in front of θij appears because the positive orientation of the Y axis on the screen is opposite to that of the geometric coordinate system. The three centroids are approximately collinear if Δθ ≤ T5. The value of T5 is determined as follows:

$$ {T}_5= \max \left(\frac{\pi }{36}, \min \left(\frac{\pi }{10},\frac{\pi }{12}/\sqrt{\frac{{\mathrm{d}}_{\mathrm{ij}}}{{\overline{\mathrm{d}}}_{\mathrm{l}}}}\right)\right) $$
(19)

where dij denotes the distance between ci and cj and \( {\overline{\mathrm{d}}}_l \) denotes the average distance between adjacent centroids of the CCs on line l (on which all centroids are ordered from left to right or top to bottom). It can be concluded from Eq. 19 that if two CCs are far apart compared with the average interval distance in the line, the corresponding angle threshold should be smaller than \( \frac{\uppi}{12} \), while if they are very close, the threshold can be larger than \( \frac{\uppi}{12} \). This is also illustrated in Fig. 5, where the upper line denotes the average angle \( \overline{\uptheta} \) of the text line. Although θ3 is smaller than θ2 (θ2 and θ3 are represented as ∠2 and ∠3 in Fig. 5 respectively), it corresponds to a smaller threshold than that of θ2, so it can be correctly removed from the text line “jungle”.
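
A small sketch of the angle difference of Eq. 16 and the adaptive threshold of Eq. 19 follows, with centroids given as (x, y) tuples:

```python
import math

def segment_angle_diff(ci, cj, ck):
    """Eq. 16: smallest angle between the segments c_i c_j and c_j c_k."""
    v1 = (cj[0] - ci[0], cj[1] - ci[1])
    v2 = (ck[0] - cj[0], ck[1] - cj[1])
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (
        math.hypot(*v1) * math.hypot(*v2) + 1e-12)
    t = math.acos(max(-1.0, min(1.0, cos_t)))
    return min(t, math.pi - t)

def adaptive_angle_threshold(d_ij, d_avg):
    """Eq. 19: tighten the collinearity threshold T5 for far-apart CCs and
    relax it for close ones, clamped to [pi/36, pi/10]."""
    t = (math.pi / 12) / math.sqrt(d_ij / d_avg)
    return max(math.pi / 36, min(math.pi / 10, t))
```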

Fig. 5 Angle threshold decision

The framework of text line identification is shown in Fig. 6; we describe it here in detail. We initialize the set UL cc with all connected components, and a label array LA cc with all values set to 0, in which LA i cc denotes whether CC i has been merged into a line. Every element of UL cc is a CC that does not yet belong to any text line. For every connected component CC i , we calculate the similarity simi(i,*) between it and all other CCs using Eq. 7, pick the two largest similarities and record their sum as partSimi(CC i ). All partSimi values are then sorted into a list PSL in descending order. A CC CC i is picked sequentially from PSL, and all CCs with label value 0 are sorted into a list SCL in descending order according to their distance from CC i . For any two CCs CCj and CCk (j ≠ k) picked from SCL that fulfil Sij = 1 ∧ Sjk = 1, we calculate the angle differences Δθijk, Δθjik and Δθikj. If \( \varDelta {\uptheta}_{\mathrm{ijk}}\le \frac{\pi }{12}\wedge \varDelta {\uptheta}_{\mathrm{jik}}\le \frac{\pi }{12}\wedge \varDelta {\uptheta}_{\mathrm{ikj}}\le \frac{\pi }{12} \), we create a new text line L t , record its components S cc (L t ) = {CC i , CC j , CC k }, calculate the average angle \( \overline{\uptheta} \) using Eq. 17, and remove the three components from UL cc . The three CCs then serve as the seed of the current line and their label values in LA cc are set to 1. For every remaining CC m in UL cc , we calculate its similarity to line L t by \( \mathrm{s} imi\left(C{C}_m,{L}_t\right)={\displaystyle {\sum}_{C{C}_n\in {S}_{cc}\left({L}_t\right)} simi\left(C{C}_m,C{C}_n\right)} \) and sort the values in descending order. A CC (CC t ) is then picked from UL cc in this order, and CCt is added to S cc (L t ) only if it meets the following three conditions:

Fig. 6 The framework of text line identification

  1. (1)

    There exists at least one CC k ∈ S cc (L t ) among the K-NN CCs CC 1 (L t ), CC 2 (L t ), …, CC R (L t ) ∈ S cc (L t ) of CCt that is a sibling of CC t and for which the angle difference ∆θ between the segment connecting the two CCs and the average angle \( \overline{\uptheta} \) of the current line L t is below T 5 .

  2. (2)

    CCt is also among the K-NN CCs of the CC k found in Condition 1.

  3. (3)

    The distance between the center of CCt and that of the current line L t is less than T6.

In this paper we set the value of K as 3. T6 is obtained as follows:

$$ {T}_6=\left\{\begin{array}{ll}k\prime \cdot {h}_t\hfill & \left| tg\theta \right|\le 1\hfill \\ {}k\prime \cdot {w}_t\hfill & \left| tg\theta \right|>1\hfill \end{array}\right. $$
(20)

In the above equation, ht and wt denote the height and width of CCt respectively, θ is the angle between the positive X axis and the line that links the centres of CCt and CCk, and k′ = 1/3. The first and third conditions ensure that CCt is in the current line, while the second condition ensures that CCs in the same line are mutually similar. If the current CC is added to S cc (L t ), then the set UL cc , the array LA cc and the average angle of the current line are updated. We repeat this procedure until all components in UL cc have been processed, and then repeat the search for another group of line seeds until no further group of line seeds exists. A condensed sketch of the seed search is given below.
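
The following sketch covers only the seed search of the procedure above: it enumerates triplets of CCs that satisfy the chain sibling condition Sij = 1 ∧ Sjk = 1 and whose centroids are approximately collinear under the π/12 tolerance. The greedy expansion step and the ordering by partSimi are omitted, and the helper names are our own.

```python
import math
from itertools import combinations

def _angle_diff(a, b, c):
    """Eq. 16 for centroids a, b, c given as (x, y) tuples."""
    v1 = (b[0] - a[0], b[1] - a[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (
        math.hypot(*v1) * math.hypot(*v2) + 1e-12)
    t = math.acos(max(-1.0, min(1.0, cos_t)))
    return min(t, math.pi - t)

def find_seed_triplets(centroids, sibling, tol=math.pi / 12):
    """Enumerate candidate line seeds: triplets (i, j, k) whose CCs satisfy the
    chain sibling condition S_ij = 1 and S_jk = 1 and whose centroids are
    approximately collinear (all three angle differences below tol)."""
    seeds = []
    for i, j, k in combinations(range(len(centroids)), 3):
        if not (sibling[i][j] and sibling[j][k]):
            continue
        ci, cj, ck = centroids[i], centroids[j], centroids[k]
        if all(d <= tol for d in (_angle_diff(ci, cj, ck),    # delta theta_ijk
                                  _angle_diff(cj, ci, ck),    # delta theta_jik
                                  _angle_diff(ci, ck, cj))):  # delta theta_ikj
            seeds.append((i, j, k))
    return seeds
```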

For the detection of lines in set L2 in Eq. 15, if the line connecting the centres of two CCs CCi and CCj fulfils the following two conditions, we consider it a candidate text line and add it to L2: (1) it is parallel to at least one line detected in the former operation; (2) CCi is among the K-NN CCs of CCj and CCj is also among the K-NN CCs of CCi. Eventually, we obtain all candidate text lines. An example of the full candidate text line identification procedure is illustrated in Fig. 7.

Fig. 7 Candidate text line identification. a The original image. b Detected MSERs. c The selected line seeds. d The detected candidate text lines

5 Text line filtering

Candidate text lines obtained by the former operation may not all be real text lines, so we need to filter out the false ones. To this end, we use a sparse filter, detailed below.

[33] uses the sparse classifier of [16] to classify regular image blocks into two categories, text blocks and non-text blocks, and non-text blocks are then removed. However, that method suffers from the following shortcomings:

  1. (1)

    They divided images into blocks of fixed size, so their method does not work well when text sizes and intensities vary over a large range.

  2. (2)

    Their method can only detect text lines in the horizontal direction because the slant angle of a text line cannot be obtained from a single regular block.

  3. (3)

    The classifier of [16] they used is not convex and does not explore the discrimination capability of the sparse coding coefficients, so their method does not work well for text fonts whose edge properties are similar to those of non-text blocks, for example the Mistral font.

In this paper, we develop a more powerful sparse classifier to overcome all these shortcomings: (1) we convert MSERs instead of the original image blocks into regular blocks to overcome the first shortcoming; (2) to overcome the second shortcoming, we transform text lines of arbitrary direction into horizontal or vertical ones; and (3) to overcome the third shortcoming, we use skeletons instead of edges and a more powerful classifier. We detail these techniques in the following sections.

5.1 Sparse classifier

We here use the Fisher discrimination dictionary learning (FDDL) scheme [27] as our sparse classifier. We select it because both the reconstruction error and the coding coefficients are discriminative, and it has competitive performance in various pattern recognition tasks compared with other state-of-the-art methods.

Given the learned structured dictionary D = [D1,D2,…,Dc], where Di is the class-specified sub-dictionary associated with class i, and c is the total number of classes. Denote by A = [A1,A2,…Ac] the set of training samples, where Ai is the sub-set of the training samples from class i. Denote by X the coding coefficient matrix of A over D, i.e. A ≈ DX. We can write X as X = [X1, X2, …, Xc], where Xi is the sub-matrix containing the coding coefficients of Ai over D. Apart from requiring that D should have powerful reconstruction capability of A, we also require that D should have powerful discriminative capability of images in A.

The FDDL model is defined as follows:

$$ {J}_{\left(D,X\right)}= \arg { \min}_{\left(D,X\right)}\left\{{\displaystyle \sum_{i=1}^cr\left({A}_i,D,{X}_i\right)}+{\lambda}_1{\left\Vert X\right\Vert}_1+{\lambda}_2\,\mathrm{tr}\left({S}_W(X)-{S}_B(X)\right)+\eta {\left\Vert X\right\Vert}_F^2\right\} $$
(21)

where the first term describes the discriminative fidelity; the second and fourth terms impose the sparsity constraint, and the third term is a discrimination constraint imposed on the coefficient matrix X; and λ1, λ2 and η are scalar parameters. SW (X) and SB (X) in the above equation are calculated as follows:

$$ \begin{array}{l}{S}_W(X)={\displaystyle \sum_{i=1}^c{\displaystyle \sum_{{\mathbf{x}}_{\mathrm{k}}\in {X}_i}\left({\mathbf{x}}_{\mathrm{k}}-{\mathbf{m}}_{\mathrm{i}}\right){\left({\mathbf{x}}_{\mathrm{k}}-{\mathbf{m}}_{\mathrm{i}}\right)}^{\mathrm{T}}}}\hfill \\ {}{S}_B(X)={\displaystyle \sum_{i=1}^c{n}_i\left({\mathbf{m}}_{\mathrm{i}}-\mathbf{m}\right){\left({\mathbf{m}}_{\mathrm{i}}-\mathbf{m}\right)}^{\mathrm{T}}}\hfill \end{array} $$
(22)

where m i and m are the mean vectors of X i and X respectively, and n i is the number of samples in class A i .

In [27], two classification schemes were proposed: the global classifier (GC) and the local classifier (LC). Because there are only two categories (text region and non-text region) and we have plenty of training samples for each class, we use the local classifier here.

Denote by \( {m}_i=\left[{m}_i^1;\dots ;{m}_i^k;\dots ;{m}_i^c\right] \), where \( {m}_i^k \) is the sub-vector associated with sub-dictionary Dk. Denote by y a testing sample. The coding coefficients α associated with Di are obtained by minimizing:

$$ \widehat{\alpha}={\mathrm{argmin}}_{\upalpha}\left\{{\left|\left|\boldsymbol{y}-{\mathrm{D}}_{\mathrm{i}}\alpha \right|\right|}_2^2+{\upgamma}_1{\left|\left|\alpha \right|\right|}_1+{\upgamma}_2{\left|\left|\alpha -{\mathbf{m}}_{\mathrm{i}}^{\mathrm{i}}\right|\right|}_2^2\right\} $$
(23)

where γ1 and γ2 are constants. The final classification rule is shown as follows:

$$ C(y)= argmi{n}_i\left({\left\Vert y-{D}_i\widehat{\alpha}\right\Vert}_2^2+{\upgamma}_1{\left\Vert \widehat{\alpha}\right\Vert}_1+{\upgamma}_2{\left\Vert \widehat{\alpha}-{m}_i^i\right\Vert}_2^2\right) $$
(24)

where C(y) denotes the category label of y.
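
As a generic sketch of the local classification rule (Eqs. 23–24), the ℓ1-regularized coding problem is solved here with a plain ISTA loop, which is our choice rather than the optimizer used in [27]; the γ values are illustrative.

```python
import numpy as np

def solve_coding(y, D, m, gamma1, gamma2, n_iter=200):
    """ISTA for Eq. 23: min_a ||y - D a||^2 + gamma1 ||a||_1 + gamma2 ||a - m||^2."""
    L = 2.0 * (np.linalg.norm(D, 2) ** 2 + gamma2)   # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - y) + 2.0 * gamma2 * (a - m)
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - gamma1 / L, 0.0)   # soft threshold
    return a

def classify(y, sub_dicts, class_means, gamma1=0.01, gamma2=0.01):
    """Eq. 24: code y over every class-specific sub-dictionary D_i with its mean
    vector m_i^i and return the class with the smallest penalized residual."""
    costs = []
    for D_i, m_i in zip(sub_dicts, class_means):
        a = solve_coding(y, D_i, m_i, gamma1, gamma2)
        costs.append(np.linalg.norm(y - D_i @ a) ** 2
                     + gamma1 * np.abs(a).sum()
                     + gamma2 * np.linalg.norm(a - m_i) ** 2)
    return int(np.argmin(costs))
```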

5.2 Feature extraction and dictionary learning

To deal with text lines in arbitrary directions, before extracting features for sparse filtering we first uniformly convert text lines of arbitrary direction into horizontal or vertical ones.

For a candidate text line, its slope angle was estimated in the last section. We then rotate the text line about its centre by an angle θr so that it becomes aligned with either the horizontal or the vertical axis. The value of θr is set as follows:

$$ {\theta}_r=\left\{\begin{array}{ll}-\overline{\uptheta}\hfill & \left|\ \overline{\uptheta}\right|\le \frac{\uppi}{4}\hfill \\ {}\mathrm{sign}\left(\overline{\uptheta}\right)\cdot \left(\frac{\uppi}{2}-\left|\overline{\uptheta}\right|\right)\hfill & \left|\ \overline{\uptheta}\right|>\frac{\uppi}{4}\hfill \end{array}\right. $$
(25)

where \( \overline{\uptheta} \) denotes the average slope angle of the current line as in Eq. 17. That is to say, if the absolute value of the slant angle is at most 45°, the text line is rotated to a horizontal one, otherwise it is rotated to a vertical one. The idea is illustrated in Fig. 8: Fig. 8a shows the original text line mask, Fig. 8b the rotated result, and Fig. 8c the skeleton of Fig. 8a. The redundant blank rows and columns around the text line in the rotated image are then removed, and we finally obtain a horizontal or vertical candidate text line.
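
A sketch of the rotation normalization of Eq. 25 using SciPy follows; the rotation sign may need to be flipped depending on the image coordinate convention, and a non-empty line mask is assumed.

```python
import math
import numpy as np
from scipy import ndimage

def rotation_angle(avg_theta):
    """Eq. 25: rotate to horizontal when the slant is at most 45 degrees,
    otherwise rotate to vertical (angles in radians)."""
    if abs(avg_theta) <= math.pi / 4:
        return -avg_theta
    return math.copysign(math.pi / 2 - abs(avg_theta), avg_theta)

def normalize_line_mask(line_mask, avg_theta):
    """Rotate a binary text line mask and crop the blank border (non-empty mask assumed)."""
    deg = math.degrees(rotation_angle(avg_theta))   # sign convention may need flipping for screen coordinates
    rotated = ndimage.rotate(line_mask.astype(np.uint8), deg, order=0, reshape=True)
    ys, xs = np.nonzero(rotated)
    return rotated[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```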

Fig. 8 Text line transformation. a The original candidate text line mask. b Rotated text line. c Extracted skeletons

Then, for every MSER CC, we resize it to srh × srw with max(srh, srw) = srg; in this paper, srg = 32. That is, we resize the CC so that the length of its longer side is srg while keeping the height/width ratio unchanged. Its skeleton is then extracted and enlarged so that the longer side is again srg, the skeleton is extracted once more from the enlarged skeleton block, and the result is placed at the centre of the regular block. Finally, the regular block is transformed into a vector of dimensionality 32 × 32 = 1024, which serves as the input of the sparse filter. The skeleton extraction procedure is illustrated in Fig. 9. By converting an MSER into a regular block while keeping the layout of its skeleton and the alignment of its centre, our method is robust to various text sizes.
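
A sketch of the skeleton feature with scikit-image follows; the exact cropping and centring details are our reading of the text and Fig. 9, and a non-degenerate CC mask is assumed.

```python
import numpy as np
from skimage.morphology import skeletonize
from skimage.transform import resize

def _fit_longer_side(mask, srg):
    """Resize a binary mask so that its longer side equals srg (nearest-neighbour)."""
    h, w = mask.shape
    s = srg / max(h, w)
    out = resize(mask.astype(float), (max(1, round(h * s)), max(1, round(w * s))),
                 order=0, anti_aliasing=False)
    return out > 0.5

def skeleton_feature(cc_mask, srg=32):
    """Resize the CC, skeletonize, enlarge the skeleton's bounding box back to srg,
    skeletonize again, centre the result in an srg x srg block and flatten it
    into an srg*srg-dimensional feature vector (Fig. 9)."""
    skel = skeletonize(_fit_longer_side(cc_mask, srg))
    ys, xs = np.nonzero(skel)
    skel = skeletonize(_fit_longer_side(skel[ys.min():ys.max() + 1,
                                             xs.min():xs.max() + 1], srg))
    block = np.zeros((srg, srg), dtype=float)
    h, w = skel.shape
    y0, x0 = (srg - h) // 2, (srg - w) // 2
    block[y0:y0 + h, x0:x0 + w] = skel
    return block.flatten()                         # 32 x 32 = 1024 dimensions for srg = 32
```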

Fig. 9 Feature extraction from skeleton of MSER. a The original binary image. b Converted binary image with the maximum side resized as 32. c The skeleton of (b). d Enlarged skeleton. e The skeleton of (d)

In this paper, we train two discriminative dictionaries. The first dictionary provides a sparse representation for text while the second provides a sparse representation for the background; each is composed of 512 base vectors. To train the text dictionary, we choose as training samples mainly isolated machine-printed characters extracted from 36 synthesized document images. These images contain the 26 English letters and 10 Arabic numerals in various fonts. We also take Chinese characters into account. For Chinese characters, the components of a character are not always connected, so the problem is more complex. To deal with this, we select 412 Chinese character components and 1,500 common Chinese characters for training. The 412 components contain 212 Chinese character components and radicals and another 200 commonly used components, none of which has isolated parts. An example is shown in Fig. 10. The components in Fig. 10b, c and f are also Chinese characters themselves. Note that even though the character in Fig. 10b is the same as the left part of that in Fig. 10a, their layouts are not the same, and our training set contains both of them.

Fig. 10 Chinese characters and its components. a, d Chinese characters. b, c The components of (a). e, f The components of (d)

Even though the orientation of a candidate text line can be obtained, the orientation of the text within the line is challenging to estimate. As can be seen in Fig. 11a, both of the two left text lines can be rotated into horizontal ones and the letters in them are the same, yet the orientations of the text in them are opposite. The orientation of the text in the right text line in Fig. 11a cannot be estimated easily either. To overcome this difficulty, we rotate the original training images by 90, 180, or 270° as in Fig. 11b. We only rotate the training images that contain English letters or Arabic numerals, while keeping the training images with Chinese characters unrotated for simplicity. That is, in this paper we assume that all Chinese characters stand upright.

Fig. 11 English letter G in different directions

In total, we collect 15,375 images containing English letters or Arabic numerals, 15,000 images containing Chinese characters and components, and 30,354 non-text images, each of which is the skeleton of an MSER. For Chinese characters, an MSER may correspond to only part of a character, so we use a projection method to obtain the whole character. The accuracy of the learned sparse classifier on the training set is 97.4 %.

5.3 Filter design

The simplest filtering approach would be to apply the sparse classifier to every candidate CC and remove those identified as non-text regions. But there are three main difficulties in doing so: first, the direction of a CC cannot be obtained before text line identification, which may seriously affect the accuracy of the sparse classifier; second, a false classification of a CC would seriously affect the later sibling identification and text line identification; and third, it is very time-consuming to classify every CC when many MSERs are detected. So in this paper we apply the sparse classifier to the identified candidate text lines only. For a candidate text line L t with component CCs S cc (L t ) = {CC t1 , CC t2 , …, CC t n }, supposing C(y ti ) = identity(f ti ) denotes the category label of the feature vector f ti of component CC t i , the label of the whole candidate text line is identified as:

$$ C\left({L}_t\right)=\left\{\begin{array}{ll}1\hfill & {\displaystyle \sum_i\mathrm{C}\left({\mathrm{y}}_{\mathrm{i}}^{\mathrm{t}}\right)\ge {\mathrm{C}}_{\mathrm{T}}}\hfill \\ {}0\hfill & \begin{array}{cc}\hfill \hfill & \hfill \mathrm{otherwise}\hfill \end{array}\hfill \end{array}\right. $$
(26)
$$ {C}_T={k}_2\cdot n $$
(27)

where k 2 is a controlling parameter and n is the number of CCs in the current candidate text line L t . In the ideal case the label of every CC would be correctly identified and k 2 could be set to 1. However, in most cases the accuracy of the sparse classifier cannot reach 100 %, so the value of k 2 is generally less than 1 and can be interpreted as a loosening coefficient. The assignment of k 2 is discussed in the next section.
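
A minimal sketch of the voting rule of Eqs. 26–27, assuming the per-CC labels from the sparse classifier are given as a list of 0/1 values:

```python
def keep_text_line(cc_labels, k2=0.7):
    """Eqs. 26-27: keep a candidate line as text when the number of its CCs
    classified as text (label 1) reaches k2 * n."""
    return sum(cc_labels) >= k2 * len(cc_labels)
```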

6 Experimental results

In this section, we validate our proposed text line detection method experimentally. To evaluate it comprehensively, we use two datasets: ICDAR and the Oriented Scene Text Dataset (OSTD) collected by Yi et al. [31]; the OSTD dataset can be accessed from the website of [19]. The first dataset contains 509 images in total, 258 for training and 251 for testing. The second dataset contains 89 scene images with text lines of arbitrary orientations.

For a comparative study, we use several state-of-the-art methods. All the experiments were implemented in MATLAB 2007 on a general purpose PC (Core 2 Duo 2.66 GHz, 4 GB memory).

6.1 Performance evaluation

To evaluate the performance, we use two metrics, precision p and recall r, as in [14, 15, 20]. Precision is the ratio of the area of the successfully extracted text regions to the area of all detected text regions, and recall is the ratio of the area of the successfully extracted text regions to the area of the ground truth text regions. The area of a region is the number of pixels inside it; a region is not necessarily a rectangle, but we use a rectangle to describe it. A low precision indicates over-detection while a low recall indicates under-detection. To combine p and r into a single measurement, we use the standard f-measure:

$$ f=\frac{2\cdot p\cdot r}{p+r} $$
(28)

For non-horizontal text lines, we use the measures in Eqs. 12 and 13 of [31]. The basic unit in these two datasets is the text line rather than the word, because it is non-trivial to perform word partition without high-level information.

6.2 Parameter selection

Two parameters significantly influence the performance: the parameter Δ in MSER detection and k 2 in Eq. 27. Because there is no straightforward relationship between them, we vary one parameter while keeping the other fixed.

In MSER detection, Δ is a very important parameter that controls the step of the gray threshold variation. We performed experiments on the ICDAR dataset with different values of Δ while keeping k 2 = 0.7. The results are shown in Fig. 12, where the horizontal axis denotes the value of Δ. The f-measure reaches its maximum when Δ is set to 6, so in the later experiments we set Δ = 6. This can be explained as follows: when Δ takes a smaller value, more candidate text regions are detected, so the recall is higher, but more non-text regions are also collected, which reduces the precision. On the contrary, if Δ takes a larger value, fewer text regions are detected and more non-text regions are removed, which reduces the recall while increasing the precision. Thus the value of 6 balances the tradeoff between precision and recall. Generally speaking, if the contrast between text and non-text pixels is low, a smaller value of Δ is preferable; high contrast corresponds to a higher value of Δ.

To find the best value of k 2, we also performed experiments on the ICDAR dataset with different values of k 2 while keeping Δ = 6. As shown in Fig. 13, when k 2 is assigned a higher value, more text lines are removed, which reduces the recall while increasing the precision. On the contrary, when k 2 is smaller, more non-text regions remain, so the precision is reduced while the recall is increased. However, if k 2 is too large, for example 0.8 here, too many text lines are removed, which decreases both precision and recall. So k 2 = 0.7 is a reasonable value.

To compare the edge feature with the skeleton feature, we also learned a sparse dictionary from the edge features of MSERs. The results are shown in Fig. 14: the proposed method using the skeleton feature obtains higher precision and recall than the edge feature, so the skeleton feature suits the sparse classifier better. There are two reasons for this: first, for the same character, the edge varies more than the skeleton across different font styles; second, the skeleton is more robust than the edge against the same noise.

Fig. 12 Variation of p, r, and f-measure obtained under different values of Δ for MSER

Fig. 13 Variation of p, r, and f-measure obtained under different values of k 2 for text line filtering

Fig. 14 Variation of p, r, and f-measure obtained under different values of k 2 by using edge feature

6.3 Results and discussions

The quantitative performance of different text detection methods is presented in Table 1; all methods are evaluated on the standard ICDAR benchmark. Table 1 shows that Becker's method [14] achieves the best recall, while our method obtains the best precision and F-measure among all methods. Two reasons account for the high precision of our method. On the one hand, by using only structure or texture features, most methods do not work well when non-text and text lines have similar structures. On the other hand, our method uses both the structure among CCs in a line and the intrinsic features of every CC, so it works well on both non-text and text lines with varying structures. Because the executables of the other methods are not available, we evaluate only three methods on the OSTD dataset, using the experimental results reported in their original papers: our method, Yi's method [31] and TD-Mixture [28]. Their performances on the OSTD dataset are shown in Table 2. TD-Mixture [28] achieves the best recall while our method again achieves the best precision. Generally speaking, on the one hand, with several restrictive conditions our method may discard some text lines, so it cannot obtain a relatively high recall; on the other hand, for the reason explained above, it achieves a relatively higher precision than the other methods. We plan to develop more flexible constraints to improve the recall in the future.

Table 1 Performances of different text detection methods evaluated on the ICDAR test set
Table 2 The performance of our method on the OSTD dataset

Some example results of text line detection with our method on the ICDAR dataset are shown in Fig. 15. We use the minimum surrounding rectangle to cover each detected text region. It can be seen from Fig. 15 that our method can detect text varying in size, color, intensity, and orientation. In images with strong highlights, mirror reflections or other strong disturbing factors, our method can still detect partial text lines. Our method can also detect blurred text, which is challenging for edge-based methods such as [18].

Fig. 15 Some example results of text line detection by using our method on ICDAR2003 dataset. The detected regions of text lines are marked in red or blue

For a better comparison, we also show some results obtained with Yi's method [31] on the ICDAR and OSTD datasets.

Some example results of text line detection with Yi's method [31] on the same images as in Fig. 15 are shown in Fig. 16. It can be seen that their method sometimes fails to remove non-text lines, for example the brick in Fig. 16c, the toys in (d), the stone in (g), the pillars in (h), the white blob in (k), the patches of the door in (n) and the tin top in (o). Their method also fails to detect some text lines or to distinguish text regions from non-text regions, for example HSBC in Fig. 16i, SHER in (j), the spider web in (l) and PEPSI in (o), due to similar colors in text and non-text regions, blurred text or other disturbing factors. The reason is that their structure analysis in conjunction with color clustering is not powerful enough to remove all the non-text linear alignments of noisy components of similar sizes.

Fig. 16 Some example results of text line detection by using method in [31] on ICDAR2003 dataset. The detected regions of text lines are marked in cyan or red

Some example results of text line detection with our method on the OSTD dataset are shown in Fig. 17, and those with Yi's method [31] are shown in Fig. 18. The images in Figs. 17 and 18 contain text lines with arbitrary orientations, and the minimum surrounding rectangle is again used to cover the detected text regions. It can be seen from Fig. 17 that even though the orientations of the text lines vary over a large range, our method still identifies them correctly in most situations. Under some conditions our method may miss a text line, for example in the last image of the first row, due to low contrast and blurred text. The method in [31] sometimes mis-detects non-text lines as text, for example the line between two timbers in Fig. 18a, the lines between two road signs in (c), and the sealing line of the chewing gum in (g). In Fig. 18b, h and l it detects a larger area than necessary.

Fig. 17 Some example results of text line detection on OSTD dataset by using our method. The detected regions of text lines are marked in red or blue

Fig. 18 Some example results of text line detection on OSTD dataset by using CT in [31]. The detected regions of text lines are marked in cyan or blue

Some failure cases of our method are shown in Fig. 19. In Fig. 19a and b, because of transparent text or severe lighting variation, our method misses many characters. In Fig. 19c, the detected characters are arranged into false text lines; the reason is that the text lines are aligned vertically, the interval between them is not large enough to separate them, and their colours are very similar. In Fig. 19d and e, our method fails to find the text lines because there are too few characters to detect seed CCs and the contrast between the text and its background is low. Clearly, these images are also challenging for existing algorithms.

Fig. 19 Some example results of challenging text line detection. a Images with transparent foreground. b Images with highlights and low contrast. c Images with low contrast and dark appearance. d Images with few characters. e Images with low contrast between text and background in dark appearance

7 Conclusion

Text information in natural images is very informative for image content understanding, which has many real-world applications. However, due to unpredictable text appearance and complex backgrounds, text detection remains a challenging problem, especially in natural images. This paper proposes a novel text line detection method based mainly on a similarity measurement and a sparse filter. Firstly, all connected components (CCs) are obtained by the MSER detector with Canny edges as dam lines, and a similarity measurement is proposed to estimate the similarity between two CCs. Based on this similarity measurement, a method is developed to find all candidate text lines: it first finds three connected components as the seed CCs of a line, and then expands the line to contain all other CCs on the same line as the three seeds. Finally, a sparse classifier based on dictionaries learned from feature vectors extracted from the morphological skeletons of the MSERs is used to remove non-text lines. Our method can detect text lines aligned with a straight line in any direction in natural images. A comparative study has shown that the proposed method significantly outperforms the other selected state-of-the-art methods for text line detection.

Further research will consider three different directions: (1) text lines composed of fewer than three text regions; (2) text lines not aligned along a straight line; and (3) text extraction from the detected text regions. In this respect, the colour consistency theory may be investigated.