Introduction

Detection and recognition of objects in a static image or video frame are basic and challenging tasks in computer vision, and they have been the subject of much research over the last two decades (Liu et al. 2020; Zhang et al. 2013). Object detection is the process of determining the existence of different individual objects in an image. The challenge of object detection and recognition has largely been addressed in controlled environments, but it remains unsolved in uncontrolled environments, particularly when items are placed in arbitrary poses in cluttered and occluded scenes (Zhao et al. 2019). The recent growth of mobile devices with high-resolution cameras has enabled applications that support daily tasks in various contexts. In this work, we focus on detecting and recognizing grocery products on shelves around the user in a grocery store. Product recognition is closer to a complex instance recognition problem than to a classification problem: it involves many nearly identical objects that differ only in minor aspects, for example, different flavors of the same brand of Lays chips (Tsai et al. 2010). Common challenges of automatic grocery product recognition are shown in Fig. 1. Automatic product detection and recognition in a video frame have many applications, ranging from recognizing specific products to providing review and price information to assisting navigation inside the grocery store.

Fig. 1 Challenges in on-the-shelf retail product recognition

Furthermore, automatic grocery product detection and recognition can assist the visually impaired during shopping (George et al. 2015). Product detection/recognition in grocery shops is complicated because product appearance varies significantly with pose, perspective, size, occlusion, and lighting conditions. Additional peculiar issues are that a product’s packaging can change over time and that different products can look remarkably similar, distinguishable only by small packaging cues such as slight differences in the descriptive text or the background color of the package. Detecting/recognizing specific products is therefore complex, unlike classifying products into macro-categories such as shampoo, chips, and detergent. Another notable issue in this scenario is the constant arrival of new products: whenever a new product is introduced, the product recognition system must scale to it with no or minimal retraining.

Acquisition and manual annotation of training images is time-consuming, and because products change frequently over time, repeatedly collecting and annotating new in-store images and retraining the system is not viable. The system must endure cross-domain scenarios in which testing images are obtained from different stores and under varying imaging conditions. Since training and testing images come from different imaging conditions, it is vital to build a system that only needs to be trained once and can then be used across stores and scenarios. López et al. (López-de-Ipiña et al. 2011) developed an automatic product recognition system using radio frequency identification, sensors, or barcodes. Most sensor-based systems require fabrication at the manufacturing stage, increasing the product’s cost and requiring massive investment, and sensor-based methods cannot resolve planogram compliance problems. Compared to sensor-based approaches, computer vision is cost-effective and efficient in terms of real-time implementation.

This work proposes a novel framework to detect and recognize multiple on-the-shelf and off-the-shelf grocery products from shelf images and video frames. We divided our proposed framework into three steps, as shown in Figs. 2 and 3: (1) Grocery product detection—we incorporated the YOLO (You Only Look Once) algorithm (Redmon et al. 2016) to perform the grocery product detection task; we trained and tested the YOLO algorithm on benchmark grocery product datasets to detect multiple objects in shelf images and video frames. (2) Object text detection—in this step, the output image from the product detection step is given as input to product text detection to obtain the corresponding product information such as product name, brand name, price, and expiry date. (3) Object text recognition—detected texts are recognized using an existing text recognition algorithm (Litman et al. 2020). The recognized texts contain complete information about the corresponding product.

Fig. 2 Block diagram of the proposed grocery product detection and recognition

Fig. 3 Illustration of the YOLOv5 architecture

In summary, the significant contributions of our work are as follows:

  • We incorporated the YOLOv5 algorithm to perform the grocery product detection task and increased the number of object classes from 80 to more than 120. Although YOLOv5 is designed for general object detection, we trained it specifically for grocery product detection and compared the detection results with existing methods.

  • We improved the performance of the state-of-the-art text detection algorithm by changing the backbone network from VGG16 to ResNet50. We also proposed an algorithm to select the center point on the text center line instead of picking a random point.

  • We proposed an accurate post-processing technique to capture corner characters by combining the Graham Scan algorithm and the rotating calipers technique. The modified backbone network, striding algorithm, and post-processing technique greatly enhanced the performance of the state-of-the-art algorithm. The robustness of the text detection method is evaluated using standard benchmark text detection datasets.

  • We converted the videos of the complex GroZi-120 public dataset into frames and then performed grocery product detection on them. Only for GroZi-120 did we use both videos and static images; the remaining datasets contain only on-shelf product and individual product images, so recognition on them was performed with static images alone.

The organization of the paper is as follows: the “Related Work” section describes the literature review on object detection, text detection, and retail product detection and recognition. The “Proposed Framework” section explains the proposed framework and Width Height based Bounding Box Reconstruction (WHBBR) technique with the help of schematic diagrams. The “Experiment” section presents a description of datasets and implementation details. The “Results and Discussion” section presents experimental results and a brief discussion of the research outcomes. Finally, the “Conclusion and Future Work” section draws a conclusion and future work.

Related Work

This section describes different works carried out by various authors on general object detection algorithms, text detection, and recognition algorithms, then a short literature review of grocery product detection and recognition methods.

General Object Detection

Object detection has been a trending area approached by many researchers in recent years. The primary aim of object detection is to identify and locate instances of semantic objects of a specific class, such as a building, a human, a dog, a bicycle, or a cat, in an image or video frame. Generally, object detection methods fall into one of two categories: neural network–based or non-neural network methods. In the non-neural approach, object detectors extract features of the objects from either grayscale or color images and match them to detect the object. Viola–Jones detectors, histogram of oriented gradients (HOG), principal component analysis (PCA), and Haar-like wavelet transforms are the techniques used for object detection in the early stages.

Viola and Jones presented an approach for object detection that minimizes computation time while achieving high detection accuracy, and used it to construct a face detection system. The system was evaluated on the real-world MIT + CMU test set, which consists of 130 images with 507 labeled frontal faces, and yields detection rates comparable to the best previous systems. The detector runs at 15 frames per second without resorting to image differencing or skin color detection; hence, it is called the first real-time face detector. Viola and Jones introduced the Haar-like feature in their system, a basic rectangular feature used for object identification, also called the Haar wavelet. The Haar-like feature effectively reflects local gray-level changes in the image and can be computed quickly through the integral image. It was later extended by adding rectangular features rotated by 45°. The extended features are roughly divided into three types: edge features, line features, and center-surround features.

The scale-invariant feature transform (SIFT) is a feature detection algorithm in computer vision that detects and describes local features in digital images; it locates certain key points and then furnishes quantitative information about them. SIFT was used for object detection in early research. The major advantage of SIFT features over HOG is invariance to rotation, translation, scaling, viewpoint, and illumination changes. SIFT has also been used for human detection; it obtains features from images through a substantial computation process comprising scale-space extrema detection, keypoint localization, orientation assignment, and keypoint descriptor generation. The HOG method partitions the video frame or static image into several blocks and then looks for the object based on the extracted features. The PCA approach extracts object features using eigenvectors.

Over the past two decades, the emergence of deep learning has accelerated the development of a rich set of object detection methods. Object detection approaches based on deep learning have yielded significant advancements and outstanding results. These methods are classified into two types: one-stage and two-stage methods. One-stage methods perform detection in a single step; YOLO (Redmon et al. 2016), Single Shot MultiBox Detector (SSD), DetectNet, and SqueezeDet are typical one-stage algorithms. One-stage methods localize objects directly by performing bounding box regression and classification, without computing region proposals. Two-stage methods follow two steps: the input image is used to generate region proposals in the first step, and the region proposals are classified and their locations fine-tuned in the second step, which involves classification and regression tasks.

The Regional Convolutional Neural Network (R-CNN) series is popularly known for detecting region proposals. R-CNN performs an external selective search over the image to generate region proposals and feeds them into a convolutional neural network (CNN) to perform classification and bounding box regression. Training and detection are quite slow with R-CNN since it involves forward computation over many, possibly overlapping, object regions. Instead of extracting region proposals from each image multiple times, Fast R-CNN uses a feature extractor to compute the features of the entire image once for object detection, which reduces the processing time. Faster R-CNN is based on the same architecture as Fast R-CNN, but a Region Proposal Network replaces the selective search approach, overcoming the significant time overhead of producing regions of interest. SSD offers considerably better performance than Faster R-CNN in detecting more prominent objects. The network creates feature maps of various sizes, and classification and bounding box regression are performed concurrently on these multi-scale feature maps. YOLO is a prevalent object detection technique based on the one-stage method.

YOLO detects multiple objects simultaneously by predicting class probability values and bounding boxes. YOLO does not employ multi-scale feature maps, so compared to SSD its generalization is inferior for large-scale changes in an object, and it suffers from poor recognition accuracy and a high missed detection rate. YOLOv2 uses an anchor mechanism to predict bounding boxes, so the feature map’s spatial information is substantially maintained. Unlike YOLO, which employs a fully connected layer to predict bounding boxes, YOLOv2 uses convolutional layers; when a fully connected layer is used to predict bounding boxes, spatial information in the feature maps can be lost. To overcome this, the YOLOv3 algorithm adopts multi-scale feature maps and uses FPN (Feature Pyramid Networks) to predict bounding boxes. The FPN technique merges the middle layers’ outputs with the latter layers’, and smaller objects present in the low-level features can be spotted by passing high-level features down to the bottom layers. The detection speed and accuracy of YOLOv3 are considerably improved over the earlier versions. The YOLOv4 algorithm adapts the architecture of YOLOv3 with modifications in the backbone and neck; the major difference is the backbone, where YOLOv4 uses CSPDarknet53 instead of YOLOv3’s Darknet53. The YOLOv4 architecture comprises mainly three parts: CSPDarknet53, Bag of Specials, and Bag of Freebies. Bag of Specials methods slightly increase inference cost but also improve object detection accuracy.

Many researchers have widely explored the major problems of object detection in videos and scene images, and many solutions have been suggested (Liu et al. 2020; Zhang et al. 2013; Zhao et al. 2019). Most video object detection algorithms have two networks: a CNN used as the backbone network for feature extraction, and a detection network that classifies the objects and predicts the bounding boxes. Deep learning–based video object detection methods can be classified into LSTM (Long Short-Term Memory)–based, tracking-based, flow-based, attention-based, and other methods. Wang et al. (Wang et al. 2018) proposed MANet, which estimates optical flow between adjacent frames and exploits it together with global image features; the method performs instance-level calibration across frames using optical flow and pixel-level feature calibration to improve video object detection performance. D&T (Feichtenhofer et al. 2017) proposed a ConvNet architecture that improves object detection and tracking performance through a multi-task objective combining frame-level detection, cross-frame tracking, regression, and correlation features. Seq-NMS (Han et al. 2016) is a heuristic method composed of sequence selection, re-scoring, and suppression for re-ranking bounding boxes (Qazi et al. 2017) in a video sequence. The TSSD (Temporal Single-Shot Detector) method integrates ConvLSTM-based attention for background and scale suppression.

Text Detection

A novel method was proposed by Shivakumara et al. (Shivakumara et al. 2013) to detect text in video frames based on neighbor component grouping and GVF (Gradient Vector Flow), which uses dominant edge pixels to extract TC (Text Candidates). They also presented two grouping schemes: the first scheme tries to find the nearest neighbors, while the second extracts neighbors and restores missing CTC to detect arbitrary text in video frames (Neumann and Matas 2013; Huang et al. 2014; Yin et al. 2014; Jaderberg et al. 2016). A hybrid text detection and text tracking method (Zhong et al. 2016) based on MSER (Maximally Stable Extremal Regions) uses Delaunay triangulation and multi-scale integration to identify text candidates; it also convolves a Laplacian with wavelet sub-bands to enhance low-resolution text pixels. MSERs and the SWT (Stroke Width Transform) were combined (Jiang et al. 2017) to obtain arbitrarily shaped text regions from video frames. To detect text in complex video frames, Ye et al. (Ye et al. 2009) proposed a texture-based method using LBP (Local Binary Patterns) to extract features of text candidates, and a PNN (Polynomial Neural Network) was developed to classify text and non-text regions.

A three-stage text detection method was proposed by Zhu et al. (Zhu and Du 2018). In the first stage, video frame features are extracted. Then, text candidates are detected by optimizing an RBFNN (Radial Basis Function Neural Network) model, which transforms the input signal into another form that can be fed into the network to obtain linear separability. An RBFNN is composed of an input, a hidden, and an output layer and is strictly limited to exactly one hidden layer, otherwise called the feature vector. A non-linear transfer function can be applied to the feature vector before performing the classification task; when the dimension of the feature vector is increased, its linear separability also increases. An RBFNN is structurally similar to a multilayer perceptron (MLP) but can be distinguished from other neural networks by its universal approximation capability and faster learning speed. Radial basis function networks have many applications, including function approximation, time series prediction, classification, and system control, and they offer easy design, good generalization, strong tolerance to input noise, and online learning ability. Finally, a post-processing step was applied to remove falsely detected text candidates. To classify textual and non-textual components, He et al. (He et al. 2016) presented a framework for text detection called the Text-Attentional Convolutional Neural Network and introduced a novel training mechanism to increase robustness against complex backgrounds.

Liao et al. (Liao et al. 2017) developed a contrast-enhanced maximally stable extremal regions detector to improve the video frame’s intensity; this method detects complex-shaped text with a high accuracy rate. The false detection rate was reduced using the loss function, and non-maximum suppression (NMS) (Liao et al. 2019) produces the final bounding-boxed text regions. To detect text regions in low-quality images, a script identification task was introduced. This method extracts low- and high-level features using a CNN-LSTM framework, and attention-based patch weights are calculated. Initially, the video frames are converted into patches and fed into the CNN-LSTM. Local features are extracted as the patch-wise product of features and patch weights, and global features are extracted from the final LSTM cell; the local and global feature weights are fused dynamically to perform script identification. Coarse candidate region detection and fine text line detection are effectively used for detecting multi-scale candidate text areas. Candidate text regions are segmented and fed into a CNN, which generates a confidence map for each frame’s text regions; finally, projection analysis refines the text candidates and divides them into text lines. The performance of the video text detection technique was enhanced using a novel refined block structure developed by Jianqi et al. (Jianqi et al. 2018), which was constructed using a fully convolutional network.

A multi-scale deformable convolution structure was introduced by Mohanty et al. (Mohanty et al. 2018) to extract additional features and spatiotemporal information using a bipartite graph model and the random walk algorithm. Firstly, text candidates and background regions are extracted; then, shape, motion, and spatial relations between text and background are exploited to refine the text candidates. The correlations between text and background regions greatly improve the accuracy of text detection. Zhang et al. (Zhang et al. 2019) proposed a fusion-based detection method that extracts text regions and locates characters, using tracking trajectories to refine detection results. Liu et al. (Zhang et al. 2019) developed a polygon-based curved text detector by combining R-CNN with transverse and longitudinal offset connections for the precise detection of irregular texts. Post-processing methods, such as non-polygon suppression and polygonal non-maximum suppression, produce accurate text detection results.

Grocery Product Detection and Recognition

In 1999, the first significant effort was made to recognize retail products in isolation. Naturally, the problem of localization is not addressed. Merler et al. (Merler et al. 2007) introduced a retail product detection problem with a new dataset consisting of rack and product images. It took almost eight years to develop a more comprehensive method to detect and recognize multiple retail products. Marder et al. (Marder et al. 2015) designed two successive layers for a multi-product detection scheme. In the first layer, they followed three different techniques to detect retail products in the rack, i.e., (i) vote map, (ii) HOG, and (iii) BoW (bag of words), based on a sliding-window approach. A saliency map was also used for product recognition and to address the second layer’s planogram compliance problem. Beis et al. (Beis and Lowe 1997) used a k-d tree representation and Speeded Up Robust Features (SURF) descriptors for retail product recognition. The products in the rack images are recognized using a previously constructed k-d tree and the Best-Bin-First search algorithm. In addition, a pose-class histogram in high dimensional space was used to perform fine-grained recognition.

George et al. (George and Floerkemeier 2014) proposed a three-phase detection and recognition method. In the first phase, they developed a non-parametric probabilistic model based on SIFT features. Fine-grained product categorization is performed in the second phase. The first and second phases are coupled with the Karhunen–Loeve transform in the final phase to track the detected boxes in a video. Geng et al. (Geng et al. 2018) developed a product detection system that identifies the locations of products in video frames by creating a saliency map for the shelf images; the saliency map is constructed using SURF key points and Attention and Information Maximization, and a CNN is finally used to recognize the products. Ray et al. (Ray et al. 2018) presented a conditional random field (CRF)–based method for classifying structured objects: a CNN extracts the visual features, which are then fed into a CRF model, and the Viterbi and forward–backward algorithms generate the labels of the product sequence.

Franco et al. (Franco and Maltoni 2017) divided the product detection and recognition task into three steps: (i) candidate pre-selection, in which they segmented the foreground from the background using fixed-threshold binarization; (ii) fine selection, in which they utilized a customized deep neural network (DNN) and a BoW to select the most robust features; and (iii) a post-processing technique (Kumar et al. 2021; Georgiadis et al. 2021) that reduces false positives by eliminating multiple overlapping detections of the same product. Karlinsky et al. (Karlinsky et al. 2017) and Zientara et al. (Advani et al. 2017) calculated a homography matrix to identify grocery products in shelf images by matching SURF key points of product images with the corresponding rack images. Goldman et al. (Goldman and Goldberger 2020) use a Hough voting scheme based on matched SURF key points to estimate the pose of products and, from the estimated pose, determine their locations.

Bukhari et al. (Bukhari et al. 2021) developed a vision-based Automatic Retail Checkout system that uses a CNN for object detection. A Canny edge detector and hysteresis thresholding are used to perform NMS and to generate a binary edge image, respectively, and morphological operations fill in holes and gaps; the method depends heavily on a motor-powered conveyor-belt mechanism. Ciocca et al. (Ciocca et al. 2021) introduced a multi-task learning network to extract features from the images and performed product classification with both supervised and unsupervised learning methods. Yilmazer and Birant (Yilmazer and Birant 2021) combined two concepts, semi-supervised learning and on-shelf availability, to identify empty shelves. Similarly, Santra et al. (Santra et al. 2022) use a graph convolutional network (GCN) and a Siamese network architecture (SNA) for feature extraction and for capturing the similarity of neighboring superpixels, respectively; the features extracted from the GCN and SNA are fed to an SSVM to identify gaps on the rack. Leo et al. (Leo et al. 2021) assessed the performance of different classification models. Olóndriz et al. (Olóndriz et al. 2021) introduced the FooDI-ML dataset and the Glovo application to recognize retail product information. Machado et al. (Lima Machado et al. 2021) developed a product recognition system for visually impaired people; the authors found that the ResNet-50-based approach achieves better results than other deep learning–based models. Domingo et al. (Domingo et al. 2022) use a cross-validation-voting scheme to classify retail products.

The existing methods have certain limitations. Santra et al. (Santra et al. 2022) works only on bright images, and dark products on the shelves are mistaken for empty shelf regions. Ray et al. (Ray et al. 2018) cannot differentiate non-identical objects. Geng et al. (Geng et al. 2018) suffers from more partial detections. Franco et al. (Franco and Maltoni 2017) and Karlinsky et al. (Karlinsky et al. 2017) face labeling problems (inaccurate boxes). Olóndriz et al. (Olóndriz et al. 2021) requires more than 2.73 million parameters and is computationally complex.

Proposed Framework

The overall architecture of our proposed framework is shown in Fig. 2; it consists of three important modules for grocery product recognition. The first module detects grocery products by product class using a single-shot object detection algorithm, YOLOv5. The second module uses a text detection algorithm to detect the text on grocery product packaging (brand name, product name, quantity, and other information). Finally, the third module recognizes the text using a current state-of-the-art text recognition algorithm. The recognized text carries unique information about the corresponding product.

Pre-processing

In the pre-processing step, the input videos of the GroZi-120 dataset are converted into video frames to perform object text detection and recognition. The video frames are captured every 0.5 s, i.e., two frames are extracted per second, using the predefined cv2.VideoCapture() and read() functions.
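A minimal sketch of this frame-extraction step is given below; the 0.5-s sampling interval follows the description above, while the file paths and naming scheme are purely illustrative.

```python
# Sketch of the pre-processing frame extraction with OpenCV; paths are hypothetical.
import cv2

def extract_frames(video_path, out_dir, interval_s=0.5):
    vidcap = cv2.VideoCapture(video_path)
    fps = vidcap.get(cv2.CAP_PROP_FPS) or 25.0     # fall back if FPS is unreported
    step = max(int(round(fps * interval_s)), 1)    # frames between two samples
    idx, saved = 0, 0
    success, frame = vidcap.read()
    while success:
        if idx % step == 0:                        # keep two frames per second
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        success, frame = vidcap.read()
        idx += 1
    vidcap.release()
    return saved
```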

Grocery Product Detection using YOLOv5

YOLOv5 incorporates CSPDarknet and a Path Aggregation Network (PANet), which make the object detection model easier to train and reduce the computation cost, respectively. Compared to other models, YOLOv5 detects small or distant objects better, and its inference speed is higher than that of Faster R-CNN, Fast R-CNN, and SSD. Unlike R-CNN and SPP-net, it does not produce overlapping boxes around the objects.

First, CSPNet is incorporated into Darknet to create CSPDarknet. CSPNet successfully addresses the issue of repeated gradient information, which often occurs in large-scale backbones: the gradient changes are integrated into the feature map, which significantly improves the CNN’s learning ability. Being lightweight, it can lag slightly in accuracy, but it reduces needless computation and energy usage by spreading the workload across the layers of the CNN. CSPDarknet also reduces the model’s size by compressing the feature maps during the feature pyramid generation step via cross-channel pooling. Detection speed, model size, and accuracy are all imperative in our grocery product detection task, and the efficiency of product detection on low-resource edge devices is largely determined by model size.

Second, the YOLOv5 algorithm incorporates PANet as its neck to increase the flow of information. PANet adopts bottom-up path augmentation and a new FPN to enhance the localization capability of the entire feature hierarchy. Adaptive Feature Pooling allows high-level features to access the fine details and precise localization carried by low-level features, while large receptive fields capture richer context information at high levels to produce accurate predictions. A fully connected fusion is used for mask prediction, which differentiates instances and recognizes different portions of the same object. PANet helps identify smaller products in our grocery product detection through the shared pooling feature and ensures that products are not missed.

Finally, to achieve multi-scale prediction, the head of the YOLOv5 algorithm adapts the YOLO layer and produces feature maps of different sizes, such as 19 × 19, 38 × 38, and 76 × 76, helping the model detect small, medium, and large objects. It also predicts anchor boxes for the feature maps. Grocery products come in different sizes (small, medium, and large), and the multi-scale detection mechanism in the YOLO layer ensures that the model can detect a grocery product even if its size changes during detection.
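For illustration, the snippet below shows how such a YOLOv5 detector could be run on a shelf image through the public ultralytics/yolov5 hub entry point; the fine-tuned grocery checkpoint, image path, and confidence threshold are assumptions, not the exact configuration used in this work.

```python
# Illustrative YOLOv5 inference; "grocery_yolov5.pt" and "shelf_frame.jpg" are hypothetical.
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="grocery_yolov5.pt")
model.conf = 0.25                                  # assumed confidence threshold

results = model("shelf_frame.jpg")                 # hypothetical shelf image
for *box, conf, cls in results.xyxy[0].tolist():   # (x1, y1, x2, y2, conf, class)
    print(int(cls), round(conf, 3), box)           # each box is later cropped for text detection
```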

Text Detection

We propose a text detection method that detects text information on grocery items, such as product name, brand name, and quantity, using an efficient post-processing technique. Figure 4 describes the overall architecture of the text detection model. Conventional text detection algorithms generally assume that text instances are linear, and a linear representation cannot hold the geometric properties of curved text instances. To address this problem, we propose a curve-shaped text detection method: text instances are represented as a sequence of overlapping disks, each centered on the text center line and associated with an orientation and radius, so that the various transformations of text instances, such as rotation, bending, and scaling, are captured.

Fig. 4 The overall architecture of the text detection model

A text instance t is represented by an ordered list O(t) = {W0, W1, …, Wi, …, Wn} of multiple disks, where Wi denotes the ith disk and n the total number of disks. Each disk W in the ordered list O(t) is associated with a set of geometric attributes, i.e., W = (c, r, θ): the radius r equals half of the text instance t’s local width at the center c, and the orientation θ is the tangential direction of the text center line at c. The geometric attributes in O(t) are mainly used to rectify irregularly shaped text instances and transform them into rectangular image regions, and the text area t can be readily reconstructed by taking the union of the disks in O(t). The proposed FCN model predicts text regions (TR), the text center line (TCL), and its geometric attributes radius (r), sin θ, and cos θ. The masked TCL is then computed from the TR, since the TCL is a component of the TR, and instance segmentation with disjoint sets is used to keep TCL regions from overlapping. The central axis point lists are obtained using a striding method, and a proposed post-processing technique is used for text instance reconstruction and to mitigate false text detections.
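As a concrete illustration of this disk representation, the following sketch rebuilds a text-region mask as the union of ordered disks; the disk list and the image size are made up for the example.

```python
# Sketch of reconstructing a text region as the union of its ordered (c, r, theta) disks.
import numpy as np
import cv2

def reconstruct_text_region(disks, image_shape):
    """disks: list of ((cx, cy), r, theta); theta is not needed for the mask itself."""
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    for (cx, cy), r, _theta in disks:
        cv2.circle(mask, (int(cx), int(cy)), int(r), 255, thickness=-1)
    return mask                                    # union of disks approximates the text area t

# three overlapping disks along a slightly curved center line
disks = [((40, 50), 12, 0.0), ((60, 55), 12, 0.1), ((80, 62), 12, 0.2)]
text_mask = reconstruct_text_region(disks, (120, 160))
```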

Backbone Network

According to recent studies, ResNet50 captures well-defined feature representations and is used very frequently in many computer vision tasks. ResNet allows us to train extremely deep neural networks with more than 150 layers through skip connections, which address the vanishing gradient problem by providing an alternate path for the gradient to flow and allow the model to learn an identity function, ensuring that a higher layer performs at least as well as a lower layer. We therefore adopt ResNet50 with batch normalization as our backbone network to extract features from an image. The block diagram of our backbone network is illustrated in Fig. 5. Similar to U-Net, we use ResNet’s skip connections in the decoding stage to aggregate low-level features. The network is divided into five convolution stages, and the fully connected (FC) layers are replaced by a feature-merging network built by grouping the feature maps of each stage. In the merging network, several stages are stacked one on top of the other, and each stage has its own merging unit that takes feature maps from the previous stage. Equations (1)–(3) describe the merging branch.

$$e_{5}=d_{1}$$
(1)
$$d_{i}={\mathrm{Conv}}_{3\times3}\left({\mathrm{Conv}}_{1\times1}\left[e_{i-1};\ {\mathrm{UpSample}}_{\times2}\left(d_{i-1}\right)\right]\right),\quad \mathrm{for}\; 2\le i\le 4$$
(2)
$$d_{5}={\mathrm{Conv}}_{3\times3}\left({\mathrm{Conv}}_{3\times3}\left({\mathrm{Conv}}_{3\times3}\left({\mathrm{Conv}}_{1\times1}\left[e_{i-1};\ {\mathrm{UpSample}}_{\times2}\left(d_{i-1}\right)\right]\right)\right)\right),\quad \mathrm{for}\; i=5$$
(3)

where ei and di represent the feature maps of the ith stage and the corresponding upsampling and merging units, respectively. After merging, the final detection output has the same size as the input image. The final output has four channels: the TR/TCL prediction and the three geometric attributes of text instances, r, sin θ, and cos θ. The backbone network generates the TCL, TR, and geometry maps after feed-forwarding. TR is a binary mask with 1 for foreground pixels (those inside the polygon annotation) and 0 for background pixels, and TCL is computed using the sequencing process.
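A hedged PyTorch sketch of one merging unit corresponding to Eq. (2) is shown below; the channel sizes are illustrative and not the exact configuration of our network.

```python
# One merging unit: upsample the previous decoder map x2, concatenate with the
# matching encoder feature map, then apply 1x1 and 3x3 convolutions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MergeUnit(nn.Module):
    def __init__(self, enc_ch, dec_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(enc_ch + dec_ch, out_ch, kernel_size=1)
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, e_i, d_prev):
        d_up = F.interpolate(d_prev, scale_factor=2, mode="nearest")
        x = torch.cat([e_i, d_up], dim=1)          # [e_i ; UpSample_x2(d_{i-1})]
        return self.smooth(self.reduce(x))         # Conv_3x3(Conv_1x1(...))

# merge a 512-channel decoder map with a 256-channel encoder map (illustrative sizes)
merge = MergeUnit(enc_ch=256, dec_ch=512, out_ch=256)
d_next = merge(torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32))
```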

Fig. 5 Schematic overview of the text detection backbone network

TCL and TR Generation

The masked TCL is extracted by intersecting TR and TCL, and a disjoint-set operation accurately divides the TCL pixels into discrete text instances. The enhanced striding algorithm then predicts the shape and course of the text instances; it consists of three essential tasks: centralizing, striding, and sliding. Firstly, centralizing selects a pixel as the starting point. Striding and centralizing are then performed recursively in the two opposite directions from the starting point until the ends of the instance are reached. The search produces two ordered point lists, which are combined to construct the final central axis list; this list precisely describes the text flow and the text shape.

Centralizing

As given in Table 1, we follow three steps to calculate the center point coordinate using the instance segmented TCL as shown in Fig. 6: (i) calculate the x-axis center point (xcp) by finding the leftmost point (x1) and rightmost point (x2) of segmented TCL. (ii) Likewise, calculate the y-axis center point (ycp) by finding the topmost point (y1) and bottommost point (y2) of segmented TCL. (iii) Find the center point coordinates (xcp, ycp).

Table 1 Algorithm to compute an initial center point in the TCL
Fig. 6 Center point computation used to initialize the striding operation
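A minimal sketch of this centralizing step (Table 1), assuming a binary mask of one instance-segmented TCL, is given below.

```python
# Centralizing: initial point is the midpoint of the TCL's horizontal and vertical extents.
import numpy as np

def initial_center_point(tcl_mask):
    ys, xs = np.nonzero(tcl_mask)
    x_cp = (xs.min() + xs.max()) // 2      # midpoint of leftmost (x1) and rightmost (x2) pixels
    y_cp = (ys.min() + ys.max()) // 2      # midpoint of topmost (y1) and bottommost (y2) pixels
    return int(x_cp), int(y_cp)

# example on a small synthetic TCL mask
tcl_mask = np.zeros((20, 40), dtype=np.uint8)
tcl_mask[10, 5:35] = 1
print(initial_center_point(tcl_mask))      # -> (19, 10)
```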

Striding

Once a center point is obtained, the next step is to perform a striding operation. This technique looks for points by taking a stride in two opposite directions within the TCL area.

$$\mathrm{Disp}_{1}=\left(\frac{1}{4}r\cos\theta,\ \frac{1}{4}r\sin\theta\right)$$
(4)
$$\mathrm{Disp}_{2}=\left(-\frac{1}{4}r\cos\theta,\ -\frac{1}{4}r\sin\theta\right)$$
(5)

Equations (4) and (5) give the offset values for each stride in the two opposite directions. If a point moves out of the text area, the stride offset is decremented gradually until the point falls back inside the text area or the end of the instance is reached.
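The sketch below illustrates one striding step under these rules; the mask convention and the halving schedule used to shrink the offset near the ends are assumptions for the example.

```python
# One striding step: move by (r/4)(cos theta, sin theta) in the chosen direction
# (Eqs. (4)-(5)), shrinking the offset whenever the next point would leave the TCL area.
import math

def stride(point, r, theta, tcl_mask, direction=+1):
    x, y = point
    step = r / 4.0
    while step >= 1.0:
        nx = x + direction * step * math.cos(theta)
        ny = y + direction * step * math.sin(theta)
        xi, yi = int(round(nx)), int(round(ny))
        inside = (0 <= yi < tcl_mask.shape[0] and
                  0 <= xi < tcl_mask.shape[1] and tcl_mask[yi, xi] > 0)
        if inside:
            return (nx, ny)              # next central-axis point
        step /= 2.0                      # gradually reduce the offset near the ends
    return None                          # reached the end of the text instance
```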

Sliding

Finally, the sliding procedure iteratively moves along the central text line, drawing circles on the predicted text instance with a radius r taken from the r map. For each point on the TCL, the distance between the two corresponding points on the sidelines determines the radius r, and the orientation θ is determined by drawing a straight line across the TCL points in the text area. When the TCL is a straight line, it is simple to compute using algebraic triangles and quadrangles, but a generic algebraic technique is not easy to apply to polygons with more than four sides. An illustration of TCL extraction and TCL expansion is shown in Fig. 7, and the mask-to-TCL conversion is given in Table 2.

Fig. 7 Schematic overview of TCL extraction and TCL expansion

Table 2 Procedure for mask-to-TCL conversion and the proposed WHBBR algorithm

A text instance t is represented as a set of vertices (v0, v1, v2, …, vn). We assume that a text instance has two long edges, one at the top and one at the bottom, and that the two edges connected to the head or tail are roughly parallel and traverse in opposite directions. Each edge is measured as M(ei,i+1) = cos⟨ei−1,i, ei+1,i+2⟩, and the head and tail edge measurements M are set to − 1. Possible text control points are then sampled on the text sidelines, and the TCL is extracted by computing the midpoints of corresponding text control points. The head and tail of the TCL are shrunk by ¼ of the radius of the end control points so that most of the TCL pixels remain within the TR; if we shrink by ½ of the radius, we lose the heads and tails of the text areas. Finally, the TCL area is expanded by 5 pixels.

Width Height Based Bounding Box Reconstruction Algorithm

Table 2 and Fig. 8 show the algorithm and working mechanism of the proposed post-processing technique, respectively. The bounding box resulting from the text reconstruction step has an arbitrary shape that fails to capture the corner characters (starting and ending characters) of the processed image. We propose an accurate post-processing technique named the Width Height based Bounding Box Reconstruction (WHBBR) algorithm to enclose the starting and ending characters. Firstly, the set of arbitrary-shaped bounding box coordinates is given as input to the Graham Scan algorithm, which we adopt to search for the topmost, leftmost, rightmost, and bottommost coordinates; sorting all the points in this way lets us choose the outermost extreme points. Secondly, two antipodal points (ai) and (aj) are selected along the sidelines, and two directed tangent lines of support are drawn at them. Next, at each antipodal point position, these two parallel lines are used to compute the distance between the antipodal points of the arbitrary bounding box; the two parallel lines visit all pairs of antipodal vertices by rotating clockwise, and the process continues until all vertices have been visited at least once. At each iteration, the height and width are computed. Finally, the peak height and width, with their respective coordinates, are used to enclose the detected arbitrary-shaped text with an accurate bounding box.

Fig. 8 Schematic overview of post-processing using the Width Height Based Bounding Box Reconstruction (WHBBR) algorithm

The WHBBR algorithm identifies the corner characters accurately and converts the arbitrarily shaped bounding box into a rectangular bounding box.
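The snippet below sketches this post-processing idea; OpenCV's convexHull and minAreaRect (a rotating-calipers implementation) stand in for the Graham Scan and rotating-calipers combination described above, so it should be read as an approximation of the WHBBR procedure rather than the exact algorithm in Table 2.

```python
# WHBBR-style post-processing sketch: replace an arbitrary text polygon with a
# rectangle that encloses the corner characters. The example outline is made up.
import numpy as np
import cv2

def whbbr_box(polygon_pts):
    """polygon_pts: (N, 2) array of arbitrary-shaped bounding-box vertices."""
    pts = np.asarray(polygon_pts, dtype=np.float32)
    hull = cv2.convexHull(pts)                 # outermost extreme points
    rect = cv2.minAreaRect(hull)               # ((cx, cy), (w, h), angle)
    corners = cv2.boxPoints(rect)              # four corners of the rectangular box
    return corners, rect[1]                    # corner coordinates and (width, height)

# a slightly curved text outline that would otherwise clip its corner characters
outline = [(10, 40), (30, 35), (60, 32), (90, 36), (92, 52), (60, 50), (30, 54), (12, 56)]
corners, (w, h) = whbbr_box(outline)
```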

Text Recognition

We adapt the context attentional network (Litman et al. 2020) as our text recognizer: cropped text images are fed into the text recognition model, which processes each detected text image in four steps. Firstly, the cropped text image is transformed into a normalized image using a thin-plate spline transformation to reduce the burden on the subsequent feature extraction stage. Secondly, a 29-layer ResNet is used as the convolutional backbone to extract essential features from the input image; the final feature map has 512 channels. Thirdly, CTC-attention-based decoding embeds characters for each feature-map column and passes the embedded sequence to the CTC decoder to generate output. Finally, a selective contextual refinement block is employed to mitigate the lack of contextual information: to overcome long-term dependency problems, a two-layer Bi-LSTM is run over the feature map, and its output is combined with the visual feature map to generate a new feature map. In the selective decoder, a two-step attention mechanism is employed: in the first step, 1D attention operates on the feature map produced by the Bi-LSTM, an attention map is generated from these features by a fully connected layer, and the element-wise product between the attentional features and the attention map is computed; in the second step, a separate encoder–decoder decodes the attention map and an LSTM generates the text characters. The recognized text information, such as product name, quantity, expiration date, and price, is written to a separate file. This text recognizer enables our framework to obtain on-shelf and off-shelf product information, which in turn can help visually impaired people hear product information through an audio device.
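As a simplified illustration (not SCATTER's exact implementation), the sketch below shows the contextual-refinement idea of running a two-layer Bi-LSTM over the column-wise visual features and fusing its output back with them; the hidden size and the fusion layer are assumptions.

```python
# Contextual refinement sketch: Bi-LSTM over column features, fused with the visual map.
import torch
import torch.nn as nn

class ContextualRefinement(nn.Module):
    def __init__(self, channels=512, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(channels, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fuse = nn.Linear(channels + 2 * hidden, channels)

    def forward(self, visual):                     # visual: (B, T, 512) column features
        contextual, _ = self.rnn(visual)           # (B, T, 2*hidden)
        fused = self.fuse(torch.cat([visual, contextual], dim=-1))
        return fused                               # refined feature map (B, T, 512)

refine = ContextualRefinement()
features = refine(torch.randn(2, 25, 512))         # e.g., 25 feature-map columns
```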

Experiment

This paper uses the synthetic SynthText dataset (Gupta et al. 2016) to train our text detection model. We evaluate the model on five standard benchmarks: two "regular" datasets (IC11, IC13) and three "irregular" datasets (IC15, Total-Text, SCUT-CTW1500).

Regular Text Datasets

The performance of our proposed framework has been evaluated using standard benchmark datasets such as ICDAR 2011 (Shahab et al. 2011) and ICDAR 2013 (Karatzas et al. 2013). Most of the text in these datasets is nearly horizontal:

  • The ICDAR2011 (IC11) (Shahab et al. 2011) dataset is inherited from previous ICDAR contest benchmarks; some of the prior datasets' flaws, such as inconsistent definitions and inaccurate bounding boxes, have been resolved. This dataset contains 484 images, 229 for training and 255 for testing.

  • ICDAR2013 (IC13) (Karatzas et al. 2013) contains 462 images. Most of the text images are inherited from IC03, 229 images for training, and 233 for testing. There are 849 text instances in the training set, whereas the testing set contains 1095 text instances.

Irregular Text Datasets

ICDAR 2015 Incidental Text (Karatzas et al. 2015), Total-Text (Ch'ng and Chan 2017), and SCUT-CTW1500 (Liu et al. 2019) are the benchmark datasets used to evaluate the performance of our framework. In these datasets, most text instances are curved, rotated, or of low quality:

  • ICDAR 2015 (IC15) (Karatzas et al. 2015) dataset is from ICDAR 2015 Robust Reading Competition. Images in this dataset are captured using Google Glasses without proper positioning and focusing. It includes more than 200 irregular text images. This dataset contains 1500 images, 1000 images for training, and 500 images for testing. It provides word-level annotations. Notably, it contains 17,548 text instances.

  • Total-Text (Ch'ng and Chan 2017) contains 1555 images, 1255 images for training, and 300 for testing. Images in this dataset are collected from various locations, including business-related locations, tourist sites, club logos, and formal information. At the word level, this dataset contains 11,459 cropped word images with more than three different text orientations: horizontal, multi-oriented, and curved text. Total-text provides polygon-shaped ground truths.

  • SCUT-CTW1500 (Liu et al. 2019) contains 1500 images, 1000 images for training, and 500 images for testing. Images in this dataset are collected from various sources such as Google’s open-Image, Internet, and mobile phone cameras. Notably, it contains 10,751 cropped word images for testing. At least one curved text appears in each image. There are primarily arbitrary-shaped texts in text-line instances, but horizontal or multi-oriented text lines also exist in the text images.

Grocery Datasets

We use four publicly available datasets, namely GroZi-120 (Zhang et al. 2007), WebMarket (Jund et al. 2016), Grocery Products (Girshick et al. 2014), and the Freiburg Groceries Dataset (Santra and Shaw 2020), to train and test our proposed grocery product detection and recognition framework.

  • GroZi-120 (Zhang et al. 2007) is the first publicly released grocery product benchmark dataset. The product images are acquired from grocery web stores such as Froogle; the text in the product images differs in size and style, and the images have complex backgrounds. There are 120 product categories and 676 product images in the GroZi-120 dataset. Images are of dimension 256 × 256 at 96 dpi.

  • WebMarket (Jund et al. 2016) consists of 3153 shelf images of size 2272 × 1704 at 96 dpi, collected from 18 different product shelves. There are 100 product categories, with products captured both on and off the shelf. Rack images are captured at various scales, poses, and illumination levels, so they differ from the product images. As with GroZi-120, the ground truth is manually identified and annotated for each product located in the rack images.

  • The Grocery Products (Girshick et al. 2014) dataset is designed to assist with fine-grained object classification and localization. The product images were obtained from the Internet, the template images were recorded in studio-like conditions, and the rack photos were taken with a mobile phone in a real-world retail setting under various viewing angles, lighting conditions, and magnification settings. A rack image can include anywhere from 6 to 30 products. The ground truth is produced by manually annotating product categories and locations in the rack images. There are 80 broad product categories in the dataset, but only 27 of them contain ground truth, including 3235 fine-grained product templates. Images are of dimension 348 × 348 at 96 dpi.

  • The Freiburg Groceries Dataset (Santra and Shaw 2020) collects pictures of real products and shelves. It comprises 4947 pictures divided into 25 grocery classes with 97 to 370 images each. The products were captured using four different cameras at various locations in Freiburg, including residences, grocery shops, and offices. The text characters present in the product images exhibit various illumination levels and complicated backgrounds. Images are of dimension 256 × 256 at 96 dpi.

Implementation

The implementation of our proposed framework is done using PyTorch. All three experiments (retail product detection, text detection, and text recognition) were carried out on a DELL Precision Tower 7810 workstation with dual Intel(R) Xeon(R) E5-2620 v3 CPUs, 96 GB RAM, and an NVIDIA Quadro K2200 graphics card. We use the YOLOv5 object detector to perform grocery product detection and trained and tested the detection network on the benchmark grocery datasets. These datasets do not come with annotations in the required format, so we manually annotated them using LabelImg and then placed all the annotated images and text files (as shown in Fig. 9) in the same directory.

Fig. 9 Illustration of bounding box annotation and its format

The unified annotated format is given below:

$$<\mathrm{object}-\mathrm{class}>\;<x>\;<y>\;<\mathrm{width}>\;<\mathrm{height}>$$

where < object-class > is the object class index, an integer from 0 to (total number of classes − 1), and < x > and < y > represent the center of the bounding box rectangle, normalized by the image width and height, respectively:

$$<\mathrm{width}>\;=\;<\mathrm{absolute}\_\mathrm{width}>/<\mathrm{image}\_\mathrm{width}>$$
$$<\mathrm{height}>\;=\;<\mathrm{absolute}\_\mathrm{height}>/<\mathrm{image}\_\mathrm{height}>$$
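A small helper illustrating this conversion from a pixel-space box to a YOLO annotation line is shown below; the class index and image size in the example are arbitrary.

```python
# Convert a pixel-space box (x1, y1, x2, y2) into the YOLO annotation line above.
def to_yolo_line(cls_id, x1, y1, x2, y2, img_w, img_h):
    x_center = ((x1 + x2) / 2.0) / img_w          # <x>: normalized box center
    y_center = ((y1 + y2) / 2.0) / img_h          # <y>
    width = (x2 - x1) / img_w                     # <width>  = absolute_width / image_width
    height = (y2 - y1) / img_h                    # <height> = absolute_height / image_height
    return f"{cls_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# a 100 x 150 px product box in a 640 x 480 frame, class 7 (illustrative values)
print(to_yolo_line(7, 200, 120, 300, 270, 640, 480))
```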

Table 3 shows the parameters used to train the proposed framework. For both the object and text detection tasks, we split the data into 80% for training, 10% for validation, and 10% for testing (80:10:10). The learning rate is set to 1 × 10−2, and the object detector is optimized with the Adam algorithm using a batch size of 64. Adam combines the best properties of the AdaGrad and RMSProp algorithms, has a faster computation time, requires fewer parameters to tune, and is relatively easy to configure, with default configuration parameters that do well on complex problems. In our experiments, Adam was also more stable than the other optimizers and did not suffer any major drop in accuracy, so we employed it for both the object detection and text detection tasks. We used the same system specification to implement the text detection and recognition algorithms. We trained the text detection model on SynthText under full supervision and used it as our baseline model. For the text detection algorithm, we again used the Adam optimizer, with the learning rate set to 1 × 10−3 and a batch size of 64. We chose 229 training images from ICDAR 2011, 229 training images from ICDAR 2013, 1000 training images from ICDAR 2015, 1255 training images from Total-Text, 1000 training images from SCUT-CTW1500, and 4000 images from SynthText for text detection model training. For a fair comparison, single-scale testing is performed, and a polygonal NMS eliminates redundant detections.

Table 3 Parameters used to tune the proposed framework

Performance Metrics

The performance of the framework can be evaluated based on the confusion matrix. The performance metrics are accuracy, precision, specificity, recall (sensitivity), and F1 score. Accuracy is a critical measure for classification models; it is straightforward to compute and to interpret for binary and multi-class classification problems. The percentage of true results over the total number of records examined is termed accuracy. It is useful for evaluating a classification model built only from balanced datasets; if the dataset is skewed or unbalanced, accuracy may give misleading results. Precision is the percentage of objects or texts detected correctly over the total number of detected objects or texts. Another essential metric is recall, which is more informative when all possible positives must be captured: recall is the percentage of objects or texts detected correctly over the total ground truth, and it equals one if all positive samples are predicted to be positive. If the best combination of precision and recall is needed, the two metrics can be merged into the F1 score, the harmonic mean of precision and recall, which ranges from 0 to 1. Equations (6) to (9) provide the formulae for all of these metrics.

$$\mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$$
(6)
$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(7)
$$\mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(8)
$$\mathrm{F1\;score}=2\times\frac{\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
(9)

Ideally, a model would achieve precision and recall of 1, giving an F1 score of 1 (i.e., 100%), which is difficult to achieve in a practical classification problem. Consequently, the classification model should be built to obtain precision and recall values that are as high as possible.
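A small helper that evaluates Eqs. (6)–(9) from raw confusion-matrix counts is sketched below; the example counts are illustrative only.

```python
# Compute accuracy, precision, recall, and F1 score from confusion-matrix counts.
def detection_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

print(detection_metrics(tp=78, tn=10, fp=12, fn=22))   # illustrative counts
```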

Results and Discussion

We propose a framework to perform three main tasks: object detection, object text detection, and object text recognition. The YOLOv5 algorithm is used as our object detector, the object text detection algorithm detects the text present on the detected grocery products, and the object text recognition algorithm recognizes the detected text. Once the YOLOv5 object detector has detected the grocery products, the resulting image is fed into the text detection model. The text detection algorithm is specially designed to capture regular and irregular text; it can also detect text with complex backgrounds, affine-distorted text, text with non-uniform spacing, and multiple different texts in a single image. The text present in the grocery datasets has complex shapes, sizes, and orientations, yet the text detection model can detect complex curved texts and multiple texts in an image (Fig. 13). The text recognition algorithm uses the CTC-attention mechanism to recognize arbitrarily shaped text in the cropped word images, and it can recognize curved text, text with non-uniform spacing, and multiple words in an image (Fig. 14). Together, the proposed text detection model and the accurate text recognition model enable the framework to detect and recognize on- and off-the-shelf grocery products.

Grocery Product Detection Result

The performance of grocery product detection is given in Table 4 and Table 5. We used YOLOv5 to train and test on the four different benchmark datasets, with pre-trained models greatly supporting feature extraction. In the GroZi-120 dataset, the number of images is minimal; four images per class are not sufficient for object detection, so we performed data augmentation to increase the number of images by scaling, rotating, adding noise, skewing, etc. For the GroZi-120 video dataset, the videos are converted into frames and fed to the YOLOv5 model. The GroZi-120 dataset contains 120 classes, and we performed training and testing for all 120 classes. YOLOv5 improves the detection performance dramatically, by more than 10% (see Table 4), and it can detect both small and large grocery products (see Fig. 10). YOLOv5 also provides the most promising results on the other datasets, WebMarket, Grocery Products, and the Freiburg Groceries Dataset, as shown in Table 4 and Table 5.

Table 4 Comparisons of retail product detection performance with existing methods on the GroZi-120 and WebMarket datasets
Table 5 Comparisons of retail product detection performance with existing methods on the Grocery Products and Freiburg Groceries datasets
Fig. 10 Visualization results of the text detection model on the GroZi-120 dataset

Geng et al. (Geng et al. 2018) use the GroZi-120 dataset to assess the performance of the BRISK and SIFT techniques, with VGG16 and an attention map used for feature extraction and classification, respectively. The SIFT algorithm is not efficient for many computer vision tasks; hence, we used a deep learning–based object detector, YOLOv5, which completely outperforms the BRISK and SIFT techniques in precision (86.3% vs 46.3% and 49.05%), recall (77.8% vs 29.50% and 29.37%), and F1 score (77.04% vs 36.04% and 36.74%). On the GroZi-120 dataset, the YOLOv5 model outperforms the other existing models by a large margin (+ 30). Franco et al. (Franco and Maltoni 2017) and Marder et al. (Marder et al. 2015) use DNN and HOG approaches, respectively, to detect the products of the WebMarket dataset and achieved F1 scores of 46% and 28.33%.

Santra et al. (Santra and Shaw 2020) achieved an F1 score of 80.21%, the second highest on the Grocery Products dataset. Ray et al. (Ray et al. 2018) and Karlinsky et al. (Karlinsky et al. 2017) achieved satisfactory results with F1 scores of 76.20% and 79.05%, respectively. Franco et al. (Franco and Maltoni 2017) and Marder et al. (Marder et al. 2015) use the BoW approach for product recognition, achieving F1 scores of 69.30% and 59.91%. Girshick et al. (Girshick et al. 2014) performed semantic segmentation to separate products from the background and achieved an F1 score of 78.99%. However, YOLOv5 achieves the best performance on the Grocery Products dataset.

Text Detection Result

The performance of our text detection model is examined in this section on ICDAR 2011, ICDAR 2013, ICDAR 2015, Total-Text, and CTW1500, and the results are shown in Table 6. We adopt a powerful backbone network, ResNet50-FPN, to enhance text detection performance on different text styles such as horizontal, vertical, and curved text. In addition, we cropped some of the text from the grocery datasets for the text detection task and used these images for training and testing as well. The post-processing algorithm, WHBBR, significantly reduces the false detection rate. Compared with the VGG16-FPN backbone of Long et al. (Long et al. 2018), our model achieves the best F1 scores of 87.8% on SCUT-CTW1500 and 87.7% on Total-Text.

Table 6 Comparisons with existing methods on text detection datasets

The ICDAR 2011 and ICDAR 2013 datasets focus on horizontal text, so we use them to assess the robustness of our text detection model on horizontal text; the corresponding results are reported in Table 6. Similarly, the performance of our model on the ICDAR 2015 dataset is compared with existing methods for detecting oriented text. The proposed text detection model based on the WHBBR technique (F1 score: 90.0%) performs better than Long et al. (Long et al. 2018) (F1 score: 82.6%), an improvement of 7.4%, and matches the current state-of-the-art performance on ICDAR 2015. Table 6 also compares our performance with existing methods for detecting oriented and curved text on Total-Text and CTW1500. We evaluated the efficiency of the proposed method by detecting arbitrarily shaped text in Total-Text, where horizontal, oriented, and curved text appear simultaneously in most images.

Our detection model’s performance (F1 score: 89.1%) improves dramatically when the fully annotated ICDAR 2011 training set is used. As on ICDAR 2013 and ICDAR 2015, our text detection model outperforms Long et al. (Long et al. 2018), by 13.3%, and achieves current state-of-the-art performance on the Total-Text dataset. Our text detector achieves an 87.8% F1 score and the best recall of 87.2%, outperforming most previous state-of-the-art methods. Our model also performs better than most existing methods, including Zhang et al. (Zhang et al. 2019) (precision: 89.2%) and Zhong et al. (Zhong et al. 2016) (recall: 83.6%), which are among the top performers on irregular text, although these methods have higher computational costs. The CTW1500 dataset has complex backgrounds and contains a variety of multi-oriented text. The proposed framework can detect multiple on-the-shelf retail objects (see Fig. 11), detect text both in standard word images (see Fig. 12) and on product packaging (see Fig. 13), and recognize irregular text accurately (see Fig. 14). The experimental results demonstrate that the proposed method is robust to both regular and irregular text.

Fig. 11

Visualization results of grocery product detection by the YOLOv5 algorithm on the public benchmark retail product datasets

Fig. 12

Visualization results of text detection by the proposed model on the public benchmark text detection datasets

Fig. 13

Visualization results of text detection by the proposed model on the public benchmark retail product datasets

Fig. 14

The text recognition results on the public benchmark retail product dataset

Significance of WHBBR

The contribution of the modified backbone, centralizing, and the WHBBR technique is presented in Table 7. The proposed technique enhances the detection rate by an average of + 2.3% compared to state-of-the-art methods.

Table 7 Significance of the WHBBR technique with state-of-the-art methods

On the irregular-text datasets, the modified backbone, centralizing, and the WHBBR technique achieved better performance than the baseline model: precision (+ 7.3%), recall (+ 7.5%), and F1 score (+ 7.4%) on the ICDAR 2015 dataset; precision (+ 6.4%), recall (+ 11.8%), and F1 score (+ 9.3%) on the Total-Text dataset; and precision (+ 17.5%), recall (+ 11.9%), and F1 score (+ 14.8%) on the SCUT-CTW1500 dataset.
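The exact WHBBR formulation is not reproduced in this section, so the sketch below only illustrates the general idea as we read it: the arbitrarily shaped region predicted by the text detector is replaced by an axis-aligned rectangle derived from the region’s width and height extent, padded slightly so that the first and last characters are not clipped. The padding ratio and the function name are hypothetical.

```python
import numpy as np

def whbbr_rectangle(polygon, pad_ratio=0.05):
    """Simplified width/height-based bounding-box reconstruction (illustrative):
    collapse an arbitrarily shaped text polygon into an axis-aligned rectangle,
    expanded in proportion to the text's own width and height.

    polygon: iterable of (x, y) vertices predicted by the text detector.
    """
    pts = np.asarray(polygon, dtype=np.float32)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    width, height = x_max - x_min, y_max - y_min
    # Pad so leading/trailing characters near the polygon ends stay inside the box.
    return [
        x_min - pad_ratio * width,
        y_min - pad_ratio * height,
        x_max + pad_ratio * width,
        y_max + pad_ratio * height,
    ]

# Example: a curved-text polygon collapsed to a padded rectangle.
poly = [(12, 40), (60, 30), (110, 38), (112, 70), (60, 80), (10, 72)]
print(whbbr_rectangle(poly))   # [x1, y1, x2, y2]
```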

Proposed Framework Summary

The proposed framework is designed to detect and recognize on-shelf and off-shelf retail products by extracting text information, including the product name, price, quantity, and expiry date, from the product’s packaging. To do so, we require an object detection model, a text detection model, and a text recognition model. We use the popular object detection model YOLOv5 to perform retail product detection. The coordinates of each detected product region, such as (x1, y1) and (x2, y2), are passed to the text detection model to obtain product information. However, the text detection model follows an arbitrarily shaped bounding box construction approach to draw boxes over the detected text, which can fail to capture the starting and ending characters of the text. To address this problem, we proposed the WHBBR technique, which draws an accurate rectangular bounding box around the text and thereby helps the text recognition algorithm predict the correct text. The WHBBR algorithm greatly improves the performance of state-of-the-art methods that follow arbitrary-shaped text reconstruction. The detected text regions are cropped and passed to the text recognition model SCATTER (Litman et al. 2020), which recognizes the text in the cropped word images. The proposed framework has many practical advantages, such as assisting visually impaired people, reducing checkout time, and identifying the number of on-shelf products, misplaced products, out-of-stock products, and so on. Figure 15 shows the overall result obtained after performing this series of steps. The proposed framework is a generalized deep learning model and can be used in other text-based applications, such as vehicle license plate recognition and road sign recognition. The object text detection and recognition models are trained only with English text, so the proposed framework is currently limited to the English language.
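The detect, crop, and read flow described above can be outlined as follows. YOLOv5 via torch.hub and SCATTER as the recognizer are named in the text, but the specific interfaces here, including the generic yolov5s checkpoint and the detect_text and recognize_text callables, are placeholders for illustration rather than the authors’ actual code.

```python
import cv2
import torch

# Generic pretrained YOLOv5 checkpoint via torch.hub; in the actual framework
# this would be the model fine-tuned on the grocery product datasets.
product_detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

def read_shelf(image_path, detect_text, recognize_text):
    """Illustrative detect -> crop -> read flow; detect_text and recognize_text
    are placeholder callables standing in for the WHBBR-based text detector
    and the SCATTER recognizer."""
    frame = cv2.imread(image_path)
    results = product_detector(frame[:, :, ::-1])      # BGR -> RGB for YOLOv5
    products = []
    for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
        crop = frame[int(y1):int(y2), int(x1):int(x2)]
        words = []
        for bx1, by1, bx2, by2 in detect_text(crop):   # rectangular text boxes
            word_img = crop[int(by1):int(by2), int(bx1):int(bx2)]
            words.append(recognize_text(word_img))     # predicted word string
        products.append({"box": (x1, y1, x2, y2), "confidence": conf, "text": words})
    return products
```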

Fig. 15

Product detection and product recognition are based on product text information

Conclusion and Future Work

We proposed a new framework composed of three models, product detection, product text detection, and product text recognition, to detect and recognize retail products on supermarket shelves. Generally, the text present on retail products (e.g., product name, brand name, price, expiry date) carries unique information about the corresponding product. To acquire this valuable text information from retail products, we enhanced a state-of-the-art text detection model by adding an accurate post-processing technique named Width Height based Bounding Box Reconstruction (WHBBR). The text detection model’s modified backbone and post-processing technique greatly reduce false detections and inaccurate bounding boxes, and the attention-based text recognition model can accurately recognize arbitrarily shaped text. The proposed framework has the practical application of assisting visually impaired people during shopping. Our framework is computationally expensive during training but can detect and recognize objects promptly and accurately during testing. Our product recognition model depends entirely on the text present on the retail products: if the text is occluded or missing, the product recognition model fails, although our product detection model based on the YOLOv5 algorithm can still detect the product. These limitations can be addressed in future work, for example by detecting and recognizing retail products based on shape and color features. In addition, we intend to address the out-of-stock problem, product counting, and misplaced items.