
1 Overview

As the all-media era progresses, people increasingly interact with trending social topics and share personal opinions through online platforms such as Weibo, Zhihu, Douyin, and Kuaishou. Beyond traditional text narratives, short videos and photos are becoming increasingly important carriers of personal emotion because of their vivid and rich content. However, some organizations, such as paid posters ("water armies") and automated bot accounts, exploit the ability of images and videos to conceal important details in order to spread false information. If such activity is not stopped in time, it can damage public perception and cause significant societal losses.

Traditional approaches to monitoring online public opinion rely on a single data source, mostly text crawled from web pages, so the ability to recognize text embedded in images and videos lags behind the growth of multimodal data. In 2018, for the problem of distorted text detection, Liu et al. [1] proposed a deep character embedding network (CENet), which maps the image data within each bounding box into multiscale feature maps characterized as embedding vectors, turning text detection into a clustering task in the character embedding space. In response to the underwhelming performance of most existing approaches for curved text detection, Chen et al. [2] proposed an instance-aware, segmentation-based text detection method for atypical scenarios in 2019; its core is an attention-guided semantic segmentation model that correctly labels the weighted boundaries of text regions [3], and the approach has shown promise on curved-text datasets. In 2020, to address the high cost of training a text recognition model, which requires a huge amount of data spanning as much variation as possible, Luo et al. [4] presented a text image augmentation method that starts from a set of custom base points and uses joint learning to close the gap between the two separate processes of data augmentation and network optimization; experiments show that the method produces training samples better suited to the recognition network. In 2021, Fang et al. [5] proposed the Autonomous, Bidirectional and Iterative Network (ABINet) for scene text recognition, whose autonomous design blocks the gradient flow between the vision and language models to enforce explicit language modeling, followed by a bidirectional feature representation; this addresses the limitations of implicit language modeling, one-way feature representation, and noisy input in existing language models. In the same year, Yong et al. [6] used a crawler to obtain text data on a "popular event" from the Baidu index, preprocessed the data to build a logistic differential equation model of online public opinion, and solved it with the Sine Cosine Algorithm (SCA) on the processed data. At around the same time, Zhang et al. [7] proposed an oracle-bone character recognition method based on cross-modal deep metric learning to address the difficulty of acquiring rubbing oracle-bone character samples; by modeling a common feature space for imitation and rubbing oracle-bone characters and applying nearest-neighbor classification, they were able to recognize rubbing oracle-bone characters across modalities.

Current web opinion monitoring techniques depend primarily on routinely crawling text content from web pages, which makes it difficult to rapidly obtain and identify text contained in images and videos. Accordingly, drawing on the latest developments in text recognition and the characteristics of public opinion data, this paper improves the PP-OCR [8, 9] model deployed with OpenVINO [10], starting by upgrading the Collaborative Mutual Learning (CML) distillation strategy of the original PP-OCR V2 in the text detection module. A Large Kernel Pixel Aggregation Network (LK-PAN), a PAN module with a large receptive field, is proposed to address previous shortcomings in detecting text at multiple scales and with extreme aspect ratios [11]. To effectively mine the contextual information of text-line images and increase the error-correction capacity of the text recognition module, a Transformer network is introduced [12]. Finally, the experimental results demonstrate that the proposed model improves both the accuracy and the speed of text recognition in images and videos.

2 Related Work

2.1 The Overall Framework of Traditional Opinion Text Recognition

The three modules of the PP-OCR model (text detection, text detection box correction, and text recognition) are deployed and accelerated using OpenVINO. The deployed model can swiftly recognize and extract text from images or video sources. The model's initial structure is depicted in Fig. 1.

Fig. 1 Structure diagram of the model before improvement

In the first stage (Stage 1), the image or video dataset is prepared and used as the input for the following stage. In the second stage (Stage 2), the images or videos are pre-processed using the OpenVINO Toolkit. The input data are then passed in turn through the three PP-OCR sub-modules (text detection, text detection box correction, and text recognition) for inference: the text detection module marks the regions containing text and produces bounding boxes; the detection box correction module corrects the orientation of the text inside each bounding box; and the text recognition module recognizes the text in the corrected regions. Finally, in Stage 4, the extracted text information is visualized.

2.2 PP-OCR Model

Optical Character Recognition (OCR) turns a handwritten or printed image of text into computer-encoded text [13, 14]. The process determines character shapes by detecting light and dark patterns in the image and then translates the detected shapes into text with a recognition algorithm.

PP-OCR is a text recognition system built on Baidu's PaddlePaddle deep learning framework, combining recent advances in deep learning, model compression, and OCR. In addition to the conventional pipeline of text detection followed by text recognition, it offers end-to-end recognition. The PP-OCR model consists of three primary modules, text detection, detection bounding box correction, and text recognition, as shown in Stage 3 of Fig. 1.

The text detection module uses Differentiable Binarization (DB) [13], a segmentation-based scene text detection method that converts the heatmap produced by the segmentation network into text regions and bounding boxes: a binarization operation marks the regions containing text with bounding boxes. Traditional binarization relies on preset thresholds that cannot adapt to complex and changing detection scenarios, so DB inserts the binarization operation into the segmentation network for joint optimization, which allows the threshold to adapt in each region of the heatmap. The input height in the original code is 32. We first analyzed the aspect-ratio distribution of the training sample images; for the image data, ICDAR 2015, whose images have the largest aspect ratios, was chosen as the training dataset (1000 training images and 500 test images).

To handle bounding boxes that are skewed in orientation and to correct the detected text, the detection bounding box correction module employs a text orientation classifier. The classifier is mainly applied when the text in the image is not at 0°, in which case the detected text lines must undergo a rotation step. However, when testing a large number of images, some overly long text lines showed obvious errors, so the improved model pre-processes these images: overly long text lines are truncated, and overly short ones are copied and expanded to the size of the input image. The results on overly long text improved significantly after this modification. Rotating the detected regions to the correct orientation helps to improve the accuracy of text recognition.
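As a hedged illustration of the truncate-or-copy preprocessing described above, the sketch below resizes a colour text-line crop to the fixed input height of 32 pixels and then either truncates or tiles its width; the target width `max_w` and the tiling strategy are assumptions for illustration, not the exact PP-OCR implementation.

```python
import cv2
import numpy as np

def preprocess_text_line(img, target_h=32, max_w=320):
    """Resize a colour text-line crop to height 32, then truncate or copy-expand its width.

    Minimal sketch of the strategy described above (truncate overly long lines,
    copy/expand overly short ones); padding value and max_w are illustrative.
    """
    h, w = img.shape[:2]
    new_w = max(1, int(round(w * target_h / h)))      # keep aspect ratio at height 32
    resized = cv2.resize(img, (new_w, target_h))

    if new_w >= max_w:                                # overly long text: truncate
        return resized[:, :max_w]

    reps = int(np.ceil(max_w / new_w))                # overly short text: copy and expand
    return np.tile(resized, (1, reps, 1))[:, :max_w]
```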

The text recognition module uses the Convolutional Recurrent Neural Network (CRNN) [15, 16]. The technical challenge of end-to-end OCR is aligning sequences of indeterminate length. Borrowing the idea used in speech recognition for variable-length speech sequences, CRNN models recognition as a time-dependent word or phrase recognition problem: a Convolutional Neural Network (CNN) first extracts image features, and a Bidirectional Long Short-Term Memory (Bi-LSTM) network then handles the variable-length text sequence [17, 18]. Finally, the Connectionist Temporal Classification (CTC) model [19] resolves the segmentation alignment of samples, which increases the model's applicability and avoids a significant amount of segmentation annotation work.

2.2.1 DB Algorithm

The main text detection algorithms are shown in Table 1. Text detection methods fall into two categories: regression-based and segmentation-based.

Table 1 Main algorithms for text detection [20, 21]

Because regression-based algorithms have difficulty producing smooth wrap-around curves for curved text, researchers have proposed segmentation-based text detection algorithms. The segmentation procedure is as follows: first, each pixel is classified as text or non-text to obtain a probability map of the text region; the probability map is then processed to obtain the wrap-around curve of the text segmentation region, which is advantageous for irregular text detection.

To address the time-consuming post-processing caused by fixed-threshold binarization, PP-OCR uses the DB algorithm with a learnable threshold. A binarization function that closely approximates a step function is constructed [22], as defined in Eq. (1):

$$ \hat{B}_{i,j} = \frac{1}{{1 + e^{{ - k(P_{i,j} - T_{i,j} )}} }} $$
(1)

where \(\hat{B}\) is the approximate binary map, P is the probability map produced by the segmentation network, T is the adaptive threshold map learned by the network, k is the amplification factor, and (i, j) indexes the pixel coordinates of the image. This design allows the segmentation network to learn the text segmentation threshold end-to-end during training and to adjust it automatically, which improves both accuracy and text detection performance. The DB architecture is shown in Fig. 2. The image or video dataset is first prepared and used as the model's input. Second, the Backbone module uses a ResNet-18 network to extract feature information from the input. The Fusion module then upsamples the 1/8, 1/16, and 1/32 scale feature maps (relative to the input image or video) with 3*3 convolutions, aggregates them with the 1/4-scale feature layer, and describes the text by predicting the threshold map and probability map from the fused features. Finally, in the post-processing stage, binarization with a fixed threshold yields the approximate binarization map, from which the text boxes are obtained.
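The learnable binarization in Eq. (1) can be sketched directly in PyTorch; the snippet below assumes the probability map `P` and threshold map `T` are outputs of the segmentation head (random tensors stand in for them here) and uses k = 50, the amplification factor reported in the DB paper.

```python
import torch

def differentiable_binarization(prob_map: torch.Tensor,
                                thresh_map: torch.Tensor,
                                k: float = 50.0) -> torch.Tensor:
    """Approximate binary map from Eq. (1): B_hat = 1 / (1 + exp(-k (P - T))).

    Because the function is smooth, gradients flow back into both the
    probability map and the adaptive threshold map during training.
    """
    return torch.sigmoid(k * (prob_map - thresh_map))

# Minimal usage sketch with random maps standing in for network outputs.
P = torch.rand(1, 1, 160, 160, requires_grad=True)   # probability map
T = torch.rand(1, 1, 160, 160)                        # adaptive threshold map
B_hat = differentiable_binarization(P, T)
B_hat.mean().backward()                               # gradients reach P
```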

Fig. 2 DB overall architecture diagram

2.2.2 CRNN Network

Text recognition is one of the key subtasks of PP-OCR; it recognizes the text content within each bounding box and outputs the text in the image together with its confidence. As depicted in Fig. 3, a single test image is first used as the input to the CRNN network. It then passes through the image correction pre-processing module, which rectifies skewed images and distorted text to reduce the difficulty of visual feature extraction. The visual feature extraction module employs a convolutional neural network to extract feature information from the input image and produce the visual feature V. The sequence feature extraction module then takes V as input and extracts the image's contextual information to produce the sequence feature L. Finally, the prediction module processes L and outputs the recognized text.

Fig. 3 Text recognition flow chart
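To make the CNN + Bi-LSTM + CTC pipeline concrete, the following is a minimal CRNN-style sketch; the class name, layer sizes, and character-set size are illustrative assumptions, not the PP-OCR configuration. A small convolutional backbone collapses the image height, a bidirectional LSTM models the sequence, and a linear layer outputs per-time-step character logits for a CTC decoder.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Illustrative CRNN: CNN features -> Bi-LSTM sequence model -> CTC logits."""

    def __init__(self, num_classes: int, img_h: int = 32):
        super().__init__()
        self.cnn = nn.Sequential(                       # collapses height, keeps width/4
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        feat_h = img_h // 4
        self.rnn = nn.LSTM(128 * feat_h, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, num_classes)           # num_classes includes the CTC blank

    def forward(self, x):                               # x: (N, 3, 32, W)
        f = self.cnn(x)                                 # (N, 128, 8, W/4)
        n, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(n, w, c * h)  # (N, T=W/4, 128*8)
        seq, _ = self.rnn(seq)                          # contextual sequence features
        return self.fc(seq)                             # (N, T, num_classes)

logits = TinyCRNN(num_classes=37)(torch.rand(2, 3, 32, 128))  # -> (2, 32, 37)
```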

A CRNN network is used by PP-OCR to recognize text. Assuming input x and output y, we want the posterior probability p(y|x) to be as large as possible [23], and the CTC Loss function (assuming that the outputs of the RNN are independent of each other at each moment) is defined as shown in Eqs. (2) and (3).

$$ p(\pi |x) = \prod\limits_{t = 1}^{T} {y_{{\pi_{t} }}^{t} } ,\;\;\;\;\;\forall \pi \in L^{T} $$
(2)
$$ p(y|x) = \sum\limits_{{\pi \in B^{ - 1} (y)}} {p(\pi |x)} $$
(3)

where B^{-1}(y) is the set of all paths π that map to y, \(y_{{\pi_{t} }}^{t}\) represents the probability of emitting character \(\pi_{t}\) at time step t, and \(\pi_{t}\) represents the character output by path π at time step t. The CTC loss function is then obtained as shown in Eq. (4).

$$ {\mathcal{L}}(S) = - \sum\limits_{(x,y) \in S} {\ln \sum\limits_{{\pi \in B^{ - 1} (y)}} {\prod\limits_{t = 1}^{T} {y_{{\pi_{t} }}^{t} } } } ,\;\;\;\;\forall \pi \in L^{T} $$
(4)

The sample set is denoted by S in the equation above. The loss can be optimized efficiently with a dynamic programming approach based on the Forward–Backward algorithm of the HMM (Hidden Markov Model) [24], as shown in Eqs. (5) and (6).

$$ p(l|x) = \sum\limits_{s = 1}^{|l|} {\frac{{\alpha_{t} (s)\beta_{t} (s)}}{{y_{{l_{s} }}^{t} }}} $$
(5)
$$ - \ln (p(l|x)) = - \ln \left[ {\sum\limits_{s = 1}^{|l|} {\frac{{\alpha_{t} (s)\beta_{t} (s)}}{{y_{{l_{s} }}^{t} }}} } \right] $$
(6)

In Eq. (5), αt(s) denotes the sum of the probabilities of all partial paths (from time step 1 to t) that pass through character s at time step t, and βt(s) denotes the sum of the probabilities of all partial paths (from time step t to T) that pass through character s at time step t. CRNN networks use mainstream convolutional backbones such as ResNet and MobileNet [25, 26]. Because the input data carry a large amount of contextual information, CRNN introduces Bi-LSTM to strengthen contextual modeling, and the resulting sequence is fed to CTC for decoding, which avoids the problem of misaligned predictions and labels.
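In practice the CTC objective in Eq. (4) does not need to be coded by hand: PyTorch's built-in `nn.CTCLoss` implements the same forward–backward dynamic programming. The shapes, blank index, and label lengths below are illustrative assumptions.

```python
import torch
import torch.nn as nn

T, N, C = 32, 2, 37                      # time steps, batch size, classes (incl. blank=0)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)  # per-step log-probs

targets = torch.randint(1, C, (N, 10))   # label sequences (blank index 0 excluded)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                # sums over all alignments in B^{-1}(y)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                          # gradients flow back to the recognizer
```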

2.3 Inference Acceleration Engine

Because computing power and storage at the edge are limited, models trained in the cloud must be pruned, quantized, and otherwise compressed before deployment to the edge [27]. OpenVINO was created to keep the accuracy loss after model compression within tolerable bounds and to avoid incompatibility issues when models are delivered to heterogeneous edge devices. It is a toolkit developed by Intel for its own hardware platforms that accelerates the development of computer vision and deep learning applications, and it includes inference libraries, a model optimizer, and other deep learning resources. It can deploy algorithmic models online and is interoperable with models trained in a variety of open-source frameworks. Figure 4 shows the workflow. First, a model structure is chosen and trained in a framework such as TensorFlow or Caffe. The model optimizer then converts the trained neural network into an intermediate representation that the inference engine can understand. In the subsequent inference acceleration phase, the model's inference computation is sped up. The model optimizer and the inference engine are the core elements of OpenVINO for producing user-ready cloud applications: the inference engine controls the loading and compilation of optimized neural network models and supports asynchronous operation, while the Model Optimizer, a cross-platform command-line tool [28], converts the trained network from its source framework into an open-source intermediate representation compatible with nGraph for inference.

Fig. 4 Inference engine workflow diagram
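As a minimal sketch of the deployment flow in Fig. 4, the OpenVINO runtime API reads, compiles, and runs a converted model on a chosen device. The file name `det.xml`, the input shape, and the device string are assumptions; a real detection model exported by the Model Optimizer defines its own shapes.

```python
import numpy as np
from openvino.runtime import Core          # OpenVINO inference runtime

core = Core()
model = core.read_model("det.xml")          # IR produced by the Model Optimizer (hypothetical file)
compiled = core.compile_model(model, "CPU") # compile for the target device

# Dummy input standing in for a preprocessed frame (N, C, H, W); real shapes
# depend on the exported detection model.
frame = np.random.rand(1, 3, 640, 640).astype(np.float32)
result = compiled([frame])[compiled.output(0)]
print(result.shape)
```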

3 Improved Model

This paper addresses the technical problem that existing web opinion monitoring methods cannot rapidly acquire and identify textual information in visual multimodal scenes. Combining the characteristics of data in opinion analysis, the PP-OCR model deployed in the OpenVINO environment is improved and adapted based on the latest achievements in text recognition. Before making improvements, we first evaluate the pre-trained DB + CRNN combination model to determine whether errors on dense-text images stem from detection or from recognition. If an image contains small, dense text, increasing the image resolution and stretching the image within a reasonable range can sparsify the text and improve recognition. The improvements fall into two areas: (1) in the detection module, the LK-PAN network with a large receptive field is proposed to upgrade the CML distillation strategy; (2) in the recognition module, a Transformer is introduced to mine the contextual information between text lines, and the original CTC decoder is replaced with the Guided Training of CTC (GTC) method. Figure 5 shows the structure of the improved model.

Fig. 5 Improved overall framework diagram

As shown in Fig. 5, the image and video datasets are first prepared in Stage 1 as the model's input. Stage 2, the inference acceleration engine, consists mainly of the model optimizer and the inference engine and speeds up the reading and inference of the input images and videos by the text detection and recognition model. Stage 3 further pre-processes the input data from the first stage to enhance the quality of the images and videos. In Stage 4, strategies such as LK-PAN and DML are added to the text detection module, and the Transformer network and GTC are added to the text recognition module. Finally, the experimental results demonstrate that the enhanced model outperforms the unenhanced model on the dataset.

3.1 Detection of LK-PAN Networks

In this paper, we improve on the CML distillation strategy of the original PP-OCR V2. CML combines standard teacher-student distillation with Deep Mutual Learning (DML) between the student networks [29]: the teacher network guides the student networks while they also learn from each other. The LK-PAN module with a large receptive field is used here to optimize the teacher model; Fig. 6 shows the LK-PAN framework. First, the input features are extracted by a ResNet-50 network, which stacks several similarly structured residual blocks, the basic units of a residual network. This backbone improves training effectiveness and speeds up training, and it alleviates the gradient and feature degradation problems as the network deepens, yielding rich features. Second, the core of the LK-PAN module is to enlarge the convolution kernel in the PAN path from 3*3 to 9*9, expanding the receptive field covered by each position of the feature map. On image and video data this gives good results for detecting text with large fonts and extreme aspect ratios, and the LK-PAN network is combined with the DML distillation strategy [30]. Finally, the features produced by the LK-PAN network are concatenated; a sketch of one fusion step is given after Fig. 6.

Fig. 6 LK-PAN framework diagram
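The key change in LK-PAN is the kernel size used when fusing pyramid features. The sketch below shows one PAN-style fusion step with a 9*9 convolution in place of the usual 3*3; channel counts, layer placement, and the class name are illustrative assumptions, not the exact PP-OCR configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LargeKernelFuse(nn.Module):
    """One PAN-style fusion step with an enlarged (9x9) kernel.

    Illustrative only: a 1x1 conv aligns channels, the upsampled coarser
    feature is added, and a 9x9 conv (padding 4) widens the receptive field.
    """

    def __init__(self, in_ch: int, out_ch: int = 256):
        super().__init__()
        self.lateral = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=9, padding=4)

    def forward(self, fine, coarse):
        fused = self.lateral(fine) + F.interpolate(coarse, scale_factor=2, mode="nearest")
        return self.smooth(fused)

fine = torch.rand(1, 512, 80, 80)         # e.g. 1/8-scale backbone feature
coarse = torch.rand(1, 256, 40, 40)       # e.g. 1/16-scale fused feature
out = LargeKernelFuse(512)(fine, coarse)  # -> (1, 256, 80, 80)
```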

In this paper, a Residual Squeeze-and-Excitation Feature Pyramid Network (RSE-FPN) with a residual attention mechanism is used for the student model [31]. The RSE-FPN framework is depicted in Fig. 7. The MobileNetV3 network first processes the feature information and automatically learns the relative importance of each feature channel; the results are then used to boost helpful features and suppress features that are less useful for the task at hand. The feature maps from this stage are fed into the RSEConv layers, which replace the convolutional layers in the FPN with a channel-attention structure augmented by a residual connection, characterizing the feature maps more effectively. The feature maps that have passed through the RSEConv layers are finally concatenated.

Fig. 7 RSE-FPN framework diagram
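A hedged sketch of the residual squeeze-and-excitation idea behind RSEConv is given below; the reduction ratio, channel count, and class layout are illustrative assumptions. Channel-wise weights learned from global pooling rescale the convolved features, and a residual shortcut preserves the original signal.

```python
import torch
import torch.nn as nn

class RSEConv(nn.Module):
    """Conv layer followed by squeeze-and-excitation, with a residual shortcut (illustrative)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                   # squeeze: global context
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.conv(x)
        y = y * self.se(y)          # excitation: reweight channels by learned importance
        return x + y                # residual shortcut keeps the original features

out = RSEConv(96)(torch.rand(1, 96, 40, 40))   # -> (1, 96, 40, 40)
```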

3.2 Improvement Strategy in Identification Module

To efficiently mine the contextual information of text-line images, the recognition module replaces the RNN structure with a Transformer network. The original CTC decoder offers fast inference but limited accuracy, so the GTC technique is adopted. Because the attention mechanism is more sensitive to spatial information and can precisely optimize the Spatial Transformer Network (STN) [32], it is used to guide CTC training and encourage the fusion of richer features; the attention module is removed at inference time, so inference time does not increase. During training, a Gated Recurrent Unit (GRU) + attention module is added and the computational graph is partitioned to guide the learning of CTC. The computation proceeds as follows:

A GRU is adopted to learn the attention dependency. At time step t, xt is computed as shown in Eq. (7):

$$ x_{t} = Softmax(W^{T} m_{t} ) $$
(7)

where mt is the hidden state of the GRU cell and W is a weight matrix. The hidden state mt is updated through the recurrent process of the GRU, as shown in Eq. (8):

$$ m_{t} = GRU(y_{prev} ,g_{t} ,m_{t - 1} ) $$
(8)

where yprev is the embedding vector of the previous output yt−1; during training, yt−1 is replaced by the ground-truth sequence. gt is the glimpse vector, calculated as shown in Eq. (9):

$$ g_{t} = \sum\limits_{i = 1}^{T} {(\alpha_{ti} z_{i} )} $$
(9)

where zi is the feature sequence vector of z1:T at time step i, and αt is the attention weight vector, computed as shown in Eq. (10):

$$ \alpha_{t} = Attention(m_{t - 1} ,z_{i} ) $$
(10)
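A minimal PyTorch sketch of the attention-guided recurrent step in Eqs. (7)-(10) follows; the additive attention form, the dimension sizes, and the class name are assumptions for illustration. In GTC this branch only guides training and is dropped at inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnGRUStep(nn.Module):
    """One decoding step: attention weights -> glimpse -> GRU update -> char distribution."""

    def __init__(self, feat_dim: int, hid_dim: int, emb_dim: int, num_classes: int):
        super().__init__()
        self.score = nn.Linear(feat_dim + hid_dim, 1)          # additive attention (assumed form)
        self.gru = nn.GRUCell(emb_dim + feat_dim, hid_dim)     # consumes [y_prev ; glimpse]
        self.out = nn.Linear(hid_dim, num_classes)             # W in Eq. (7)

    def forward(self, z, y_prev, m_prev):
        # Eq. (10): attention weights from the previous hidden state and features z_1..z_T
        scores = self.score(torch.cat([z, m_prev.unsqueeze(1).expand(-1, z.size(1), -1)], dim=-1))
        alpha = F.softmax(scores, dim=1)                       # (N, T, 1)
        # Eq. (9): glimpse vector g_t = sum_i alpha_ti * z_i
        g = (alpha * z).sum(dim=1)                             # (N, feat_dim)
        # Eq. (8): recurrent update of the hidden state
        m = self.gru(torch.cat([y_prev, g], dim=-1), m_prev)   # (N, hid_dim)
        # Eq. (7): per-step character distribution x_t = softmax(W^T m_t)
        return F.softmax(self.out(m), dim=-1), m

step = AttnGRUStep(feat_dim=256, hid_dim=256, emb_dim=64, num_classes=37)
z = torch.rand(2, 32, 256)                                     # feature sequence z_1..z_T
x_t, m_t = step(z, torch.rand(2, 64), torch.zeros(2, 256))
```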

Our improved model uses the TextConAug module [33] to mine textual contextual information and thereby enrich the contextual information of the training data. Conventional data augmentation techniques include random flipping, cropping, noise injection, and color scrambling, but their effectiveness is limited by the amount of original data. Earlier annotation efforts, which frequently relied on experts in related fields, yielded high-quality labels but were inefficient and expensive. The network therefore employs TextRotNet [34], which is trained on a large quantity of unlabeled data in a self-supervised manner [35], lowering the labeling workload and significantly shortening the model training period without sacrificing recognition accuracy.
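As a small hedged illustration of the conventional augmentations listed above, torchvision can compose flipping, cropping, additive noise, and color scrambling for text-line tensors; the specific transforms and parameter values are arbitrary choices for illustration, and the TextConAug/TextRotNet strategies go beyond this by exploiting unlabeled data.

```python
import torch
from torchvision import transforms

# Conventional augmentation pipeline for text-line tensors (C, H, W); values are arbitrary.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                       # random flipping
    transforms.RandomCrop((32, 300), pad_if_needed=True),         # random cropping
    transforms.ColorJitter(brightness=0.3, contrast=0.3),         # color scrambling
    transforms.Lambda(lambda x: x + 0.02 * torch.randn_like(x)),  # additive noise
])

aug_img = augment(torch.rand(3, 32, 320))   # -> augmented (3, 32, 300) tensor
```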

Finally, the text recognition module introduces the Unlabeled Images Mining (UIM) scheme, which uses a high-accuracy recognition model to predict unlabeled data, obtains pseudo-labels, and keeps the high-confidence predictions as additional training data.
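A hedged sketch of this confidence-filtered pseudo-labelling idea is shown below; `recognizer` is a stand-in for a trained high-accuracy recognition model returning (text, confidence) pairs, and the 0.95 threshold is an assumption.

```python
def mine_unlabeled_images(unlabeled_images, recognizer, conf_threshold=0.95):
    """Return (image, pseudo_label) pairs whose prediction confidence is high enough.

    `recognizer` is assumed to map an image to a (text, confidence) tuple; only
    high-confidence predictions are kept as extra training data.
    """
    pseudo_labeled = []
    for img in unlabeled_images:
        text, confidence = recognizer(img)
        if confidence >= conf_threshold:          # keep only trustworthy pseudo labels
            pseudo_labeled.append((img, text))
    return pseudo_labeled
```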

4 Systematic Experimental Process and Results Analysis

4.1 Experimental Environment

The experiments were run under Windows 10 with the following hardware configuration: an Intel(R) Core(TM) i5-7200U CPU @ 2.50 GHz with 12 GB of memory, and an NVIDIA Tesla P100 GPU with 24 GB of memory. Python is the development language, and PyTorch is the framework.

4.2 Experimental Dataset

The model is evaluated on images using the freely available ICDAR 2015 dataset [13] and on videos using recorded video data. From ICDAR 2015, 1000 images serve as training samples and 500 test images are randomly chosen. The recorded video, 1 min 27 s long, is used only for testing. Taking one image from ICDAR 2015 as an example, the original image and its annotation format are shown in Fig. 8 and Table 2.

Fig. 8 A sample image from ICDAR 2015

Table 2 The formatted text annotation of the ICDAR 2015 sample image corresponding to Fig. 8

4.3 Evaluation Metrics

This experiment uses precision P, recall R, and the F-value as its evaluation metrics. Precision is the proportion of words predicted by the model that are correct; recall is the proportion of ground-truth words that are correctly recognized [36]; the F-value is a weighted combination of P and R. The formulae are given in Eqs. (11), (12) and (13).

$$ P = N{\prime} /N $$
(11)
$$ R = N{\prime} /M $$
(12)
$$ F = \frac{{(\alpha^{2} + 1) \cdot P \cdot R}}{{\alpha^{2} \cdot P + R}} $$
(13)

where N is the total number of words predicted by the model, M is the total number of ground-truth words, and N′ is the number of words correctly recognized by the model. α is a weighting factor: when α > 1, the F-value is influenced more strongly by recall, and when 0 < α < 1, precision has a stronger impact.
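The metrics in Eqs. (11)-(13) reduce to simple counting. The sketch below assumes word-level matching between predicted and ground-truth word lists and uses α = 1, which weights precision and recall equally; the word lists are invented for illustration.

```python
from collections import Counter

def word_prf(predicted, ground_truth, alpha=1.0):
    """Precision, recall and F-value from word lists (Eqs. 11-13)."""
    n_correct = sum((Counter(predicted) & Counter(ground_truth)).values())  # N'
    p = n_correct / len(predicted) if predicted else 0.0                    # P = N'/N
    r = n_correct / len(ground_truth) if ground_truth else 0.0              # R = N'/M
    f = (alpha**2 + 1) * p * r / (alpha**2 * p + r) if (p + r) else 0.0     # Eq. (13)
    return p, r, f

print(word_prf(["EXIT", "SALE", "OPEN"], ["EXIT", "OPEN", "CAFE", "SALE"]))
# -> (1.0, 0.75, ~0.857)
```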

4.4 Experimental Results

4.4.1 Comparison of Experimental Effects

Table 3 compares the detection speed of the network model on the video dataset before and after the enhancement. It shows that the model's average recognition speed for text in video has increased from 19.2 frames per second to 24.9 frames per second, a 29.7% improvement over the model's previous performance.

Table 3 Comparison of the effect of model detection speed before and after improvement

Figure 9a, b compares the precision, recall, and F-value of the network models before and after the modification on the image and video datasets. It shows that after the Transformer network and the UIM unlabeled data mining scheme are added, the detection effect surpasses that of the model before improvement: the F-value of text detection on the ICDAR 2015 image dataset improves by 10.11%, and the F-value of text detection in video improves by 17.97%.

Fig. 9 a, b Comparison of the effect of network models before and after improvement

Table 4 compares the text detection effect before and after the enhancement on a single image from ICDAR 2015, namely Fig. 8. The results in Table 4 show that the upgraded model not only achieves a higher identification rate for the "PPINGS" text but also a higher recognition confidence; the enhanced model's recognition capacity has therefore increased significantly.

Table 4 Comparison of image text detection effect before and after improvement

Figure 10a, b compares the text detection effect on a gambling-related video before and after the enhancement, and Fig. 11a, b shows the same comparison for a violence-related video. The results in Figs. 10 and 11 demonstrate that the enhanced approach recognizes more of the text in the videos correctly and therefore has a stronger identification capability.

Fig. 10 a Experiment before improvement. b Experiment after improvement

Fig. 11 a Experiment before improvement. b Experiment after improvement

5 Conclusion

In this paper, we proposed an improved method for visual multimodal text recognition in scenarios such as internet opinion analysis, and its validity has been experimentally verified. This paper makes three major contributions. First, to improve the learning effect of the overall model, the text detection module upgrades the traditional distillation approach by integrating it with the DML mutual learning technique. Second, the PAN module with a large receptive field is proposed in response to previous shortcomings in detecting multi-scale and extreme-aspect-ratio text. Third, the Transformer network is introduced in the text recognition module to efficiently mine the contextual information of text-line images and increase the recognizer's error-correction capability. The improved model enhances text recognition in images and videos, and in future work the extracted text will be combined with natural language processing methods. Multimodal sentiment analysis has become a research hotspot; we will further integrate the features of different modalities, explore the relationships between them, improve existing sentiment analysis models, and enhance the accuracy of sentiment analysis, so as to provide technical support for the field of multimodal opinion analysis.