
1 Introduction

At present, the tourism sector has expanded to include activities such as games, dining, and marriage rituals conducted undersea, in rivers, and in other bodies of water. Therefore, indexing, retrieval, and understanding of underwater images have received increasing attention from researchers in image processing and computer vision [1, 2]. For example, in the case of scuba diving, a text detection approach can be used to trace a diver through the text that appears on equipment and cameras, so that the diver can be prevented from reaching dangerous areas. At the same time, a guide can monitor the diver and teach different skills. In a similar way, text detection can be used to retrieve underwater activities in the ocean. Therefore, text detection in underwater images is useful and significant for understanding underwater images and videos. Compared to text detection in natural scene images, detection in underwater images is more challenging due to various distortions caused by the refraction of light, the surface of the water, the depth of the water, and particles in the water. Many methods have been proposed for text detection in natural scene images [3, 4], but text detection in underwater images has received little attention. Existing natural scene text detection methods do not perform well on underwater images. Figure 1 shows the results of two state-of-the-art methods, ContourNet [3] and FDTA [4], where tiny text lines in underwater images are missed despite the excellent performance of these methods on natural scene images. This is because underwater images lose quality due to the distortions caused by the nature of water. The proposed method, in contrast, detects such text properly in both underwater and natural scene images. This is the key contribution of the proposed model compared to the state-of-the-art methods.

Fig. 1. Example of text detection by the proposed and existing methods for underwater and natural scene images. Results of ContourNet, FDTA, and the proposed method are shown in (a), (b), and (c), respectively.

In order to address the challenges of underwater images, inspired by the work [5], which combines different frequency domain analysis methods to improve image quality for face anti-spoofing detection, we explore different combinations of the Discrete Cosine Transform (DCT), Discrete Wavelet Transform (DWT), and Fast Fourier Transform (FFT) to enhance text details in underwater images. The intuition behind this idea is that text pixels share higher energy, spatial resolution, and brightness than non-text pixels in any image. It is noted that energy can be captured by the DCT, fine scaling by the DWT, and brightness by the FFT. The combinations of DCT, DWT, and FFT produce six enhanced images. The enhanced images are supplied to a modified Character Region Awareness for Text Detection (CRAFT) [6] model in such a way that the model works for both underwater and natural scene images.

In summary, the main contributions of the proposed method are as follows. (i) This is the first work addressing the challenges of text detection in underwater images. (ii) The combination of DCT-DWT-FFT is introduced for enhancing low contrast text information in underwater images. (iii) The proposed text detection method performs well on both underwater and natural scene images and outperforms state-of-the-art methods on multiple datasets.

2 Related Work

The existing methods for text detection in natural scene images can be categorized into top-down (regression-based), bottom-up (segmentation-based), and hybrid models. We review recent methods in each category below.

The regression-based methods consider the whole text as an object for text detection. For instance, Cao et al. [4] proposed FDTA, a fully convolutional scene text detector with text attention. Liu et al. [7] proposed GCCNet, a grouped channel composition network for scene text detection in natural scene images. The model optimizes anchor functions rather than handcrafted features to tackle the challenges of text detection. Shi et al. [8] explored iterative polynomial parameter regression for accurate arbitrarily shaped scene text detection. However, these methods are not robust for handling arbitrarily shaped text.

To overcome the problems of arbitrarily oriented and arbitrarily shaped text, segmentation-based methods were developed. These methods use character- and pixel-level information as local cues for accurate text detection. For instance, Qin et al. [9] explored a soft attention mechanism and dilated convolution for detecting arbitrarily shaped text in natural scene images. Baek et al. [6] developed CRAFT, which performs character region awareness for text detection in natural scene images. Dai et al. [10] proposed scale-aware data augmentation and a shape similarity constraint for accurate text detection in natural scene images. Hu et al. [11] proposed TATD, which uses text contour attention for scene text detection; the model relies on text center intensity maps and text kernel maps for accurate results. Liao et al. [12] proposed MaskTextSpotter, an end-to-end trainable neural network for spotting text of arbitrary shapes, built on a sequence-to-sequence network. Deng et al. [13] developed RFRN, a recurrent features refinement network for accurate and efficient text detection in natural scene images. However, these methods are sensitive to distortion and complex backgrounds.

To overcome the limitations of regression- and segmentation-based methods, hybrid methods were developed. These methods combine the merits of both approaches to improve text detection performance. For example, Wang et al. [3] proposed ContourNet for detecting text in natural scene images, which uses the advantages of both regression-based and segmentation-based models to tackle the challenge of arbitrarily shaped text detection. Liu et al. [14] proposed semi-supervised learning for text detection in natural scene images.

In summary, most of the models target the challenges of text detection in natural scene images but not other image types, such as low-light images, deformable text in sports images, tattoo text, and underwater images. There are methods [15,16,17] that combine enhancement and deep learning for low-light images, perform deformable text detection in sports images based on episodic learning, and detect tattoo text based on a deformable convolutional inception neural network. However, none of the existing methods consider underwater images for text detection. Hence, this work aims to develop an enhancement model to improve text detection in underwater as well as natural scene images.

3 Proposed Method

To detect text in underwater and natural scene images, we observe that pixel properties such as energy, fine scaling (spatial resolution), and brightness are common to all text pixels regardless of image type and quality. Inspired by the method [5], where the combination of DWT-LBP-FFT was used to separate pixels affected by a face-spoofing attack from genuine pixels, we explore the combination in a different way to enhance text pixels in underwater images. In addition, it is noted that the transforms used here involve combinations of low-pass and high-pass filters that capture fine details, namely edge information, for reconstructing images. This observation motivated us to propose different combinations of the above transforms such that the fine details in poor-quality underwater images, degraded by multiple adverse factors, can be enhanced.

Fig. 2. The proposed block diagram for text detection in underwater images.

To find the optimal configuration of the image enhancement model, we evaluate different combinations of three image transforms: DCT, DWT, and FFT. The combinations of transforms result in six versions of enhanced images, namely DCT-DWT-FFT, DCT-FFT-DWT, DWT-DCT-FFT, DWT-FFT-DCT, FFT-DCT-DWT, and FFT-DWT-DCT. After image enhancement, we adopt the state-of-the-art text detection method Character Region Awareness for Text Detection (CRAFT) [6], which works well for good-quality images by studying character shape. We modify this model so that it can withstand the challenges of text detection in underwater images by taking the six enhanced images as input. The schematic diagram of the proposed work is shown in Fig. 2.

3.1 DCT-DWT-FFT Images for Enhancement

For the input image, the proposed method obtains the reconstructed IDCT, IDWT, and IFFT images. In order to take advantage of DCT, DWT, and FFT, the proposed approach combines them by passing reconstructed images from one transform to the next. For example, the reconstructed image of the DCT is supplied to the DWT, and the reconstructed image of the DWT is supplied to the FFT, which outputs the final reconstructed image of the first combination. The same process is repeated to obtain all six reconstructed images for the six combinations, namely DCT-DWT-FFT, DCT-FFT-DWT, DWT-DCT-FFT, DWT-FFT-DCT, FFT-DCT-DWT, and FFT-DWT-DCT. The DCT, DWT, and FFT are computed as defined in Eqs. (1)–(3), respectively.

$$ G\left( {r,s} \right) = \beta_{r} \beta_{s} \sum\nolimits_{x = 0}^{P - 1} {\sum\nolimits_{y = 0}^{Q - 1} {g\left( {x,y} \right)cos\left[ {\frac{{\pi \left( {2x + 1} \right)r}}{2P}} \right]cos\left[ {\frac{{\pi \left( {2y + 1} \right)s}}{2Q}} \right]} } , $$
(1)
$$ \begin{aligned} g\left( {x,y} \right) = & \;\frac{1}{{\sqrt {PQ} }}\sum\limits_{u} {\sum\limits_{v} {W_{\delta } (k_{0} ,u,v)\delta_{{k_{0} ,u,v}} \left( {x,y} \right)} } \\ & + \frac{1}{{\sqrt {PQ} }}\sum\limits_{i = H,V,D} {\sum\limits_{{k = k_{0} }} {\sum\limits_{u} {\sum\limits_{v} {W_{\omega }^{i} (k,u,v)\omega_{k,u,v}^{i} \left( {x,y} \right)} } } } , \\ \end{aligned} $$
(2)
$$ B_{j} = \sum\limits_{u = 0}^{U - 1} {e^{{ - i\frac{2\pi ju}{U}}} b_{u} } , $$
(3)

where \(\beta_{r}\) and \(\beta_{s}\) are the DCT normalization factors, and P and Q are the width and height of the image, respectively. In Eq. (2), \(W_{\delta }\) and \(W_{\omega }^{i}\) are the approximation and detail wavelet coefficients, k denotes the scale, u and v denote the translations, and \(i = H, V, D\) indexes the horizontal, vertical, and diagonal detail sub-bands, respectively. In Eq. (3), \(b_{u}\) denotes the pixel value before the DFT. The Fast Fourier Transform (FFT) is a fast way of computing the Discrete Fourier Transform by taking 2-point and 4-point DFTs and generalizing them to 8-point, 16-point, …, \(2^{r}\)-point DFTs.
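To make the chaining of reconstructions concrete, the following is a minimal sketch of how the six combinations could be produced. It is an illustrative assumption rather than the authors' implementation, using NumPy, SciPy, and PyWavelets; since the paper does not specify any coefficient-level filtering, each pass here simply performs a forward transform followed by its inverse, and in practice a weighting of the coefficients would be inserted between the two steps to emphasize fine details.

```python
import itertools
import numpy as np
import pywt
from scipy.fft import dctn, idctn

def dct_pass(img):
    # 2D DCT (Eq. 1) followed by its inverse reconstruction.
    return idctn(dctn(img, norm='ortho'), norm='ortho')

def dwt_pass(img, wavelet='haar'):
    # One-level 2D DWT and inverse reconstruction (Eq. 2);
    # trim in case the inverse adds a padded row/column.
    coeffs = pywt.dwt2(img, wavelet)
    rec = pywt.idwt2(coeffs, wavelet)
    return rec[:img.shape[0], :img.shape[1]]

def fft_pass(img):
    # 2D FFT (Eq. 3) and inverse reconstruction.
    return np.real(np.fft.ifft2(np.fft.fft2(img)))

PASSES = {'DCT': dct_pass, 'DWT': dwt_pass, 'FFT': fft_pass}

def enhanced_images(gray):
    """Return the six chained reconstructions, e.g. 'DCT-DWT-FFT'."""
    outputs = {}
    for order in itertools.permutations(PASSES):
        out = gray.astype(np.float64)
        for name in order:
            out = PASSES[name](out)
        outputs['-'.join(order)] = out
    return outputs
```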

Fig. 3. Six combinations of enhanced images and their pixel distributions.

The results of each combination for the input image can be seen in Fig. 3, where the brightness increases for all the enhanced images. We believe that each combination helps to enhance the fine details of text pixels in underwater images. As a result, the contrast between text and non-text pixels increases. This is evident from the plots of the six enhanced images shown in Fig. 3, where the X-axis represents the normalized pixel value and the Y-axis represents the pixel frequency downscaled by 100. A higher pixel frequency indicates a higher degree of enhancement, which helps better detection. It can be seen that the pixel frequency of the six enhanced images increases compared to that of the input image. This observation motivated us to explore the CRAFT model for text detection on the enhanced images. Therefore, we modify CRAFT such that it accepts all six enhanced images as input for accurate text detection irrespective of image type and quality.
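As a rough illustration of how the distributions in Fig. 3 can be obtained (an assumption about the plotting, not the authors' script), the pixel values of each image are normalized to [0, 1] and their frequencies divided by 100 before plotting:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_distribution(img, label, bins=64):
    # Normalize pixel values to [0, 1] (x-axis of Fig. 3).
    values = img.astype(np.float64).ravel()
    values = (values - values.min()) / (np.ptp(values) + 1e-8)
    hist, edges = np.histogram(values, bins=bins)
    # Pixel frequency downscaled by 100 (y-axis of Fig. 3).
    plt.plot(edges[:-1], hist / 100.0, label=label)

# Example usage with the enhancement sketch above:
# for name, img in enhanced_images(gray).items():
#     plot_distribution(img, name)
# plt.legend(); plt.show()
```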

3.2 Proposed Modified CRAFT for Text Detection in Underwater Images

It is noted that the existing CRAFT can address most of the challenges, such as arbitrarily shaped and arbitrarily oriented text, which are also common in underwater images. However, it works well for images with good quality and contrast, but not for underwater images, which generally suffer from poor quality. Therefore, we modify CRAFT such that it performs well for underwater images by taking all six enhanced images obtained in the previous section as input. The modified architecture can be seen in Fig. 4. ResNet-50 is used as the backbone. A feature pyramid network (FPN) fuses the feature maps produced by the different stages of the backbone in a top-down manner. Using the fused feature maps, the attention module further enhances their discriminative parts by generating the corresponding attention weights. The enhanced feature maps are then used by the rectification module and the recognition network to produce the character groups. The proposed network is end-to-end trainable, and the recognition network used is equivalent to the attention-based encoder-decoder in [18].
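The following is a simplified PyTorch sketch of the top-down FPN fusion described above; it is our own illustrative assumption (the channel sizes correspond to the standard ResNet-50 stage outputs C2–C5), not the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down fusion of ResNet-50 stage outputs C2..C5."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        # Lateral 1x1 convolutions, then merge from the top (C5) downwards
        # by upsampling and addition.
        feats = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        for i in range(len(feats) - 2, -1, -1):
            feats[i] = feats[i] + F.interpolate(
                feats[i + 1], size=feats[i].shape[-2:], mode='nearest')
        return [sm(f) for sm, f in zip(self.smooth, feats)]
```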

Fig. 4. Proposed modified CRAFT for text detection in underwater images.

The 2D character detection module comprises an attention head and an element-wise multiplication operation. Given the input feature maps \(I_{in}\), the attention head produces the attention map \(CM\) through three convolution layers, each followed by a batch normalization layer and a ReLU activation layer. The output feature maps \(I_{out}\) are then computed as defined in Eq. (4).

$$ I_{out} = I_{in} \otimes CM $$
(4)

where \(\otimes\) denotes element-wise multiplication. Note that \(CM\) has one channel; therefore, the proposed model broadcasts it to the same shape as \(I_{in}\) to perform the element-wise multiplication. Although the attention module is very simple, it brings considerable improvement because of the supervision of the attention map during training.
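A minimal PyTorch sketch of this attention step is shown below; it is an assumption consistent with the description (three conv-BN-ReLU layers producing a one-channel map \(CM\), broadcast-multiplied with \(I_{in}\) as in Eq. (4)), not the authors' exact code.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class AttentionHead(nn.Module):
    def __init__(self, in_channels, mid_channels=64):
        super().__init__()
        # Three convolution layers, each followed by BN and ReLU;
        # the last one outputs the single-channel attention map CM.
        self.head = nn.Sequential(
            conv_bn_relu(in_channels, mid_channels),
            conv_bn_relu(mid_channels, mid_channels),
            conv_bn_relu(mid_channels, 1),
        )

    def forward(self, feat_in):
        cm = self.head(feat_in)      # shape (N, 1, H, W)
        # Broadcasting expands CM over the channels of feat_in (Eq. 4).
        feat_out = feat_in * cm
        return feat_out, cm
```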

We use the rectification module to rectify the feature maps of arbitrarily shaped samples into regular ones. It comprises a rectification head network and a sampler. First, the rectification head predicts the coordinates of K control points; then the sampler uses a Thin-Plate-Spline (TPS) transformation to generate the rectified feature maps. Unlike ASTER, which directly rectifies the input images, we rectify the enhanced feature maps. Since sufficient discriminative information has already been extracted, we do not require as many convolution layers as ASTER. The configuration of our modified CRAFT architecture is shown in Table 1.

Table 1. Configuration of the proposed modified CRAFT architecture

During training, the attention module and the recognition module are supervised. The whole framework can be trained in an end-to-end manner with the loss function defined in Eq. (5).

$$ L = \sum\nolimits_{I \in textImage} {\lambda \times L_{att} + L_{rec} } , $$
(5)

where \(L_{att}\) denotes the deviation between the predicted attention map and the ground truth, which is measured by the Smooth \(L_{1}\) loss as defined in Eq. (6).

$$ Smooth_{{L_{1} }} \left( a \right) = \begin{cases} 0.5a^{2} , & {\text{if}}\;\left| a \right| < 1 \\ \left| a \right| - 0.5, & {\text{otherwise}} \end{cases} $$
(6)

So, \(L_{att}\) can be defined using Eq. (7).

$$ L_{att} = Smooth_{{L_{1} }} \left( {CM - CM^{*} } \right), $$
(7)

where \(CM\) is the predicted attention map and \(CM^{*}\) is the corresponding ground truth. Moreover, \(L_{rec}\) denotes the recognition loss, which is formulated as defined in Eq. (8).

$$ L_{rec} = - \frac{1}{N}\sum\nolimits_{i = 1}^{N} {\log p\left( {x_{i} {|}I,\emptyset } \right)} , $$
(8)

where \(x_{1} ,x_{2} , \ldots ,x_{N}\) is the ground-truth label sequence, \(\emptyset\) denotes all trainable parameters of our network, and \(I\) is the input image. The hyper-parameter λ balances the two losses. Samples from SynthText and Synth90K are used during training. For samples from SynthText, we use the bounding-box annotations of each character to generate the ground truth of the attention map. Since Synth90K does not provide character-level annotations, we ignore \(L_{att}\) for samples from Synth90K, i.e., only the recognition loss is used to optimize the model. λ is empirically set to 1000 in our experiments.
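A compact sketch of this training objective (Eqs. (5)–(8)) is given below; the tensor shapes and the use of `smooth_l1_loss`/`cross_entropy` are our assumptions for illustration, not the authors' released training code.

```python
import torch
import torch.nn.functional as F

LAMBDA = 1000.0  # empirically chosen weight between the two losses

def training_loss(cm_pred, cm_gt, char_logits, char_targets, has_char_gt):
    """cm_pred/cm_gt: (N, 1, H, W) attention maps;
    char_logits: (T, num_classes) per-step scores; char_targets: (T,) labels."""
    # L_att (Eqs. 6-7): Smooth L1 between predicted and ground-truth maps,
    # skipped for Synth90K-style samples without character-level annotations.
    if has_char_gt:
        l_att = F.smooth_l1_loss(cm_pred, cm_gt)
    else:
        l_att = cm_pred.new_zeros(())
    # L_rec (Eq. 8): mean negative log-likelihood of the label sequence.
    l_rec = F.cross_entropy(char_logits, char_targets)
    return LAMBDA * l_att + l_rec  # Eq. (5)
```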

The results of the proposed modified CRAFT and the existing CRAFT are shown in Fig. 5(a)–(b). The proposed modified CRAFT detects all the text in all three images, including tiny and large text, as shown in Fig. 5(b), while the existing CRAFT misses text in underwater images, especially tiny text, as shown in Fig. 5(a). This shows that the modifications to the existing CRAFT are effective for accurate text detection in underwater images.

Fig. 5. Effectiveness of the proposed modified CRAFT for different water images.

4 Experimental Results

Creating a dataset for text detection in underwater images is not an easy task, and at present there is no dataset of underwater text images. Therefore, we created our own dataset by immersing objects containing text in water at different depths. Both clear and polluted water were used for data collection. To make the dataset more complex, we added dust and mud to the water. In addition, we used different objects, such as papers, bottles with labels, and packet covers, to create the dataset. The text in these images can contain arbitrarily shaped characters, arbitrary orientations, and dense text, in addition to the adverse effects of the water. We believe that the way the dataset was created matches the real scenario of underwater images. Our dataset consists of 500 images with large variations.

To show that the proposed model works well for natural scene images, we consider six benchmark datasets as follows. MSRA-TD500 [19]: This dataset tests the ability to detect multi-oriented and multi-lingual text. It provides 300 images for training and 200 images for testing. CTW1500 [19]: This dataset tests the curved text detection ability of the methods. It provides 1000 images for training and 500 images for testing. Total-Text [19]: This is also a curved text dataset, with more variation in the images, for evaluating the performance of the methods. It provides 1255 images for training and 300 for testing. ICDAR 2017 MLT [19]: This dataset tests the multi-lingual ability of the methods and includes 9 different languages. It provides 7200 images for training, 1800 images for validation, and 9000 images for testing. ICDAR 2019 ArT [20]: This dataset combines the images of Total-Text, CTW1500, and Baidu Curved Scene Text, and is huge compared to the other datasets. In total, there are 10,166 images, of which 5603 are used for training and 4563 for testing. COCO-Text [19]: This dataset was not created specifically to evaluate text detection methods; the images were collected for other objectives. As a result, one can expect larger variations in the images compared to the other benchmark datasets. It provides 43,686 images for training, 20,000 images for validation, and 900 images for testing.

To show the superiority of the proposed method over existing methods, we compare the results of the proposed method with those of the state-of-the-art methods [3, 4, 6, 8, 9, 14]. These methods are chosen for comparison because their objective is the same as that of the proposed work, and they address challenges similar to those of text detection in underwater images. To test the above-mentioned existing methods on our underwater image dataset, we retrain them with the training samples of the respective datasets. The standard measures, namely Precision (P), Recall (R), and F-Score (F), are used to evaluate the performance of the methods as defined in [3,4,5,6,7,8].
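For completeness, the three measures reduce to the usual definitions from match counts; a small helper (illustrative only) is:

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Standard Precision, Recall, and F-Score from match counts."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

# Example: 90 correctly detected text boxes, 10 false detections, 5 missed.
# precision_recall_f(90, 10, 5) -> (0.9, ~0.947, ~0.923)
```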

For the experiments, we use a 70:30 split for training and testing in the case of our underwater image dataset, while for all other benchmark datasets we use the numbers of training and testing samples provided with the respective datasets. The evaluation scheme followed in this study is the same for all experiments.

4.1 Ablation Study

The key steps of the proposed method are the combination of DCT-DWT-FFT and the modifications to the existing CRAFT for achieving the best text detection performance. To validate the contribution of each transform, the enhanced images obtained by the six combinations, and the effectiveness of the proposed modified CRAFT, the following experiments are conducted.

(i) The existing CRAFT, without any modifications, is applied to the underwater image dataset, and the measures reported in Table 2 are considered baseline results for comparison with the other steps of the proposed method. In this experiment, the input underwater images are passed directly to the existing CRAFT for text detection. (ii)–(iv) The reconstructed images given by DCT, DWT, and FFT are fed to the proposed modified CRAFT for text detection. This tests the contribution of DCT alone, DWT alone, and FFT alone to the best detection results of the proposed method. (v)–(x) The enhanced images given by each combination are supplied to the proposed modified CRAFT for text detection. This tests the effectiveness of each combination. (xi) The input images are passed to the proposed modified CRAFT without enhancement. This tests the contribution of the modifications made to the existing CRAFT. (xii) All six enhanced images are fed to the proposed modified CRAFT for text detection in underwater images.

It is observed from Table 2 that the Precision, Recall, and F-Score of experiments (ii)–(xi) improve over the baseline results of experiment (i). Therefore, one can infer that the individual transforms, the combinations of different transforms, and the modifications to the existing CRAFT are all effective and contribute to achieving the best results for text detection in underwater images by the proposed method, as reported in experiment (xii).

Table 2. Ablation study using our underwater image dataset.
Fig. 6. Text detection results of the proposed method for underwater images.

4.2 Experiments on Our Underwater Images Dataset

Sample results of the proposed method for text detection in underwater images are shown in Fig. 6, where it can be seen that the proposed method is capable of detecting different types of text in underwater images, including tiny, dense, and arbitrarily oriented text with complex backgrounds. Therefore, one can argue that the proposed model is robust to underwater images of different qualities. For quantitative results, to show that the enhancement step presented in Sect. 3.1 is effective in improving text detection performance, we calculate the measures by feeding the original images as input to the proposed and existing methods, which we call the before-enhancement experiments. Similarly, the measures are calculated by feeding the six enhanced images obtained by the enhancement step as input to the proposed and existing methods, which we call the after-enhancement experiments.

In the before-enhancement experiments, the input images are fed to the proposed modified CRAFT without enhancement for text detection. In the after-enhancement experiments, the six enhanced images are fed to the proposed modified CRAFT for text detection. It is observed from Table 3 that all the methods report better Precision, Recall, and F-Score after enhancement compared to before enhancement. This indicates that the enhancement step is effective and contributes to achieving better detection results for underwater images. Similarly, when we compare the results of the proposed and existing methods before and after enhancement, the proposed method achieves the best Precision, Recall, and F-Score compared to the existing methods. Therefore, one can conclude that the proposed method is capable of addressing the challenges of underwater images. On the other hand, since the existing methods were developed for detecting text in natural scene images, they are not effective for underwater images, which are affected by distortions caused by the depth and purity of the water, light refraction, light absorption, and the variety of immersed objects, such as labeled bottles, covers, and papers.

Table 3. Performance of the proposed and existing methods on the underwater image dataset

4.3 Experiments on Benchmark Dataset of Natural Scene Images

To show that the proposed method has the ability to detect text in natural scene images, the measures are calculated on the images of six standard natural scene text datasets, namely MSRA-TD500, ICDAR 2019 MLT, CTW1500, Total-Text, ICDAR 2019 ArT, and COCO-Text. Sample results of the proposed method are shown in Fig. 7, where we can observe that the proposed method detects text well for the images of all the different datasets. This shows that although the proposed model is developed for text detection in underwater images, it also detects text well in natural scene images, and hence the proposed method is robust.

Quantitative results of the proposed and existing methods before and after enhancement for all the aforementioned datasets are reported in Tables 4 and 5. Tables 4 and 5 show that all the methods, including the proposed method, report higher results after enhancement (After) compared to before enhancement (Before) in terms of all three measures. Therefore, we can confirm that the proposed enhancement is useful for improving text detection performance even for natural scene images. Similarly, the after-enhancement results show that the proposed method is better than the existing methods. This indicates that the proposed method is independent of image type, text type, and image quality. It is evident from the results of the proposed method that they are almost the same across all the datasets. This is the advantage of obtaining the enhanced images through the six combinations of DCT-DWT-FFT and the modified CRAFT. The reason for the poorer results of the existing methods is that, although they are robust to low contrast and low resolution and take advantage of deep learning, they are not consistent and stable when the images suffer from poor quality caused by the multiple adverse factors of underwater imaging. Overall, the proposed enhancement is effective in improving text detection performance for both underwater and natural scene images. In addition, the proposed model is generic because it performs well for images of different complexities.

Fig. 7. Example of text detection of the proposed model for images of different benchmark natural scene text datasets.

Table 4. Text detection performance of the proposed and existing methods on MSRA-TD500, ICDAR 2019 MLT and CTW1500.
Table 5. Text detection performance of the proposed and existing methods on Total-Text, ICDAR 2019 ArT and COCO-Text datasets.

Sometimes, when the text is too tiny and the water contains a lot of dust, as shown in Fig. 8, the proposed model does not detect text accurately. It can be seen from the examples in Fig. 8 that the proposed model misses text and does not fit proper bounding boxes for each text line. Therefore, there is scope for improving the proposed model further. In these cases, enhancement using image information alone is not sufficient. We need to extract object information to define the context, and then the context information can be used to enhance the whole region rather than focusing only on the text. Furthermore, to improve the quality of the enhanced region, one can consider super-resolution, which enhances the fine details of the text.

Fig. 8. Limitation of the proposed model.

5 Conclusion

We have proposed a novel method for text detection in underwater images through a new enhancement approach and modifications to the existing CRAFT. The main objective of the proposed work is to address the challenges of text detection in both underwater and natural scene images. To the best of our knowledge, this is the first work of its kind, unlike existing methods that focus only on text detection in natural scene images. For the enhancement, the proposed approach explores the combinations of DCT, DWT, and FFT, which generate six enhanced images for each input image. For text detection, we have modified the existing CRAFT to take the six enhanced images as input and detect text in underwater images irrespective of image type, text type, and quality affected by multiple adverse factors. Experimental results of the proposed and existing methods on the underwater image dataset and six standard natural scene text datasets show that the proposed model is superior to the existing methods in terms of consistency, stability, and robustness across different datasets.