1 Introduction

Some key sectors in the manufacturing industry, such as instrumentation, quality improvement and tracking, employ advanced automation owing to its high accuracy and precision. The most widely used technology among these is machine vision [13]. Such applications employ application-specific high-resolution cameras and lenses. Lens selection depends on the geometry and dimensions of the object [15]. Using a wide-angle or telephoto lens introduces distortions that might deform the captured image of the object. Thus, the selection of the lens affects both the field of view (FOV) and the object-to-camera distance.

Limited FOV issues of cameras can be addressed using one of two techniques. The first involves omnidirectional cameras [30] arranged with reflective mirrors to enhance the FOV. The drawback of this technique is the non-uniformity of the output image. The second employs a predefined arrangement of multiple cameras known as a distributed aperture system (DAS) [22]. Such systems produce wider FOV outputs with a dynamic view of the neighbouring surroundings.

Numerous DAS-based inventions [8] are reported in the literature. The DAS invented by Northrop-Grumman Corporation employs mid-wave infrared (IR) sensors, each covering \(30^{\circ }\), thereby requiring six sensors to cover the complete \(360^{\circ }\) area around a tactical environment. An advanced DAS (ADAS) invented by Raytheon Company [40] uses high-resolution IR sensors for \(360^{\circ }\) view for situational awareness. Another invention of DAS by Sarnoff Corporation [14] provides a \(180^{\circ }\) view of the surroundings inside military vehicles, details of which are confidential. Other inventions based on robotic vehicles [34] use numerous visible and IR cameras to produce seamless broader FOV views. The underlying common factor among all these implementations is image stitching of the acquired images to create a seamless output.

To understand the importance of image stitching in industrial DAS implementations, consider continuous rolling processes such as fabric manufacturing [20] and steel manufacturing [27, 38], in which surface aberrations and their locations are captured in real time. Expanded FOVs are created by image stitching algorithms using a three-stage process [39]. In the first stage, the relationships between the pixels in the overlapping areas are defined [18, 37]. Alternatively, optical flow algorithms are employed to estimate the pixel-wise motion model [31] and thereby the relationship between the corresponding pixels of the images [16, 24, 45]. Feature matching is a further technique used to estimate the pixel relationships [5, 6, 9]. In the second stage, the images are projected onto a common plane, so that the output view appears to have been created by a single camera. The projection onto the common plane is derived by transforming and registering the images from the different cameras. The transformed images are aligned in the common plane without ghosting effects or other artefacts. Finally, in the third stage, the pixels in both images are blended to ensure that the seam between the images is invisible.
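The third stage can be illustrated with a minimal sketch. It assumes that the two images are already registered, that they overlap by a known number of pixels along each row, and that a simple linear (feather) weighting is used; the function name and the sample intensities are hypothetical and are not taken from the stitching pipeline evaluated in this paper.

```python
def feather_blend_rows(left_row, right_row, overlap):
    """Blend two registered image rows that share `overlap` pixels.

    The last `overlap` pixels of left_row and the first `overlap` pixels of
    right_row are assumed to show the same part of the scene.
    """
    if not 0 < overlap <= min(len(left_row), len(right_row)):
        raise ValueError("overlap must be positive and fit inside both rows")
    start = len(left_row) - overlap
    blended = []
    for i in range(overlap):
        # Weight of the right image rises linearly across the overlap.
        weight = (i + 1) / (overlap + 1)
        left_value = left_row[start + i]
        right_value = right_row[i]
        blended.append((1 - weight) * left_value + weight * right_value)
    # Non-overlapping parts are copied unchanged on either side of the seam.
    return left_row[:start] + blended + right_row[overlap:]


# Hypothetical rows from two cameras with slightly different brightness.
left = [100, 100, 100, 100]
right = [120, 120, 120, 120]
stitched = feather_blend_rows(left, right, overlap=2)
```

The blended pixels step gradually from the brightness of the left row to that of the right row, so no abrupt seam appears at the junction.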

Feature descriptor algorithms generally identify and match the common features in the overlapping regions. The eight most popular feature descriptor algorithms are Scale Invariant Feature Transform (SIFT), Speeded Up Robust Feature (SURF), Binary Robust Independent Elementary Features (BRIEF), Oriented FAST and Rotated BRIEF (ORB), Binary Robust Invariant Scalable Keypoints (BRISK), KAZE, Accelerated-KAZE (AKAZE) and Features from Accelerated Segment Test (FAST). This paper evaluates the qualitative and quantitative performance of five of these algorithms (SIFT, SURF, ORB, BRISK and AKAZE) on low-textural images from steel surface inspection systems.

The main contributions of this paper are as follows:

  • A motivating application from the steel industry is presented, in which information must be captured from low-textural images of steel sheets.

  • The experimental setup for capturing the dataset of real-time low-texture images is presented.

  • Modern feature descriptor algorithms are evaluated for performance and efficiency using the low-textural image dataset. In particular:

    • The efficiency of the algorithms is compared in terms of sensitivity and specificity of the low-textural features extracted.

    • The execution times of the algorithms are analysed with respect to usage in low-textural applications.

  • The limitations of feature descriptor algorithms for low-textural images are presented, paving the way for newer feature extraction strategies that can be used for low-textural images in real-time industrial scenarios.

2 Motivation

This section presents the motivating problem and illustrates the advantages of stitching algorithms in machine vision systems. In steel rolling, surface inspection for defect detection and classification uses high-resolution machine vision (MV) cameras, as shown in Fig. 1.

Fig. 1. Illustration of the motivation. Steel sheets are rolled continuously and the machine vision cameras continuously inspect the sheet for detection of surface defects

Steel rolling can be classified into hot and cold rolling processes. In hot rolling, a slab of steel is heated and flattened into thin steel strips. Cold rolling, in contrast, involves drawing hot-rolled coils into thinner sections at variable speeds. Generally, the thickness of the strip is inversely proportional to the rolling speed. The minimum and maximum speeds in cold rolling mills are 4 and 10 m per second, respectively. Consider, hypothetically, inspecting the strip for surface defects with a machine vision system while it is being rolled. The cameras used are high-resolution line-scan cameras with appropriate lenses. The strip width is generally about 2 m, which requires multiple cameras to capture the full width of the strip for inspection.

Uncaptured defects or inaccurate location information may lead to catastrophic degradation of product quality. For example, suppose sheets of steel are rolled at speeds of approximately 20 m per second, and two cameras arranged across the width of the sheet each capture 30 frames per second. The images captured by the synchronized cameras are stitched together to create a digital twin of the steel sheet. Images of the strip as captured by the cameras are shown in Fig. 2; the textural information available in these images is minimal. If the images are stitched improperly, defects are missed or assigned incorrect locations. The feature descriptor and extraction algorithms therefore play a crucial role in ensuring accurate stitching without ghosting effects or other artefacts. This scenario illustrates both the importance of feature extraction and image stitching for industrial applications and how critically the stitching quality depends on the performance of these algorithms.
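The figures quoted in this example imply how much strip passes the cameras between two consecutive frames. The short calculation below is a sketch under the simplifying assumption that consecutive frames must cover the strip without gaps; the function name is hypothetical, and the frame-by-frame view ignores the line-by-line acquisition of a line-scan camera.

```python
def strip_length_per_frame(speed_m_per_s, frames_per_s):
    """Length of strip (in metres) that passes the camera between two frames."""
    return speed_m_per_s / frames_per_s


# Values quoted in the example: about 20 m per second and 30 frames per second.
length_m = strip_length_per_frame(20.0, 30.0)

# The same calculation at the cold-rolling speed limits quoted earlier.
length_at_min_speed_m = strip_length_per_frame(4.0, 30.0)
length_at_max_speed_m = strip_length_per_frame(10.0, 30.0)
```

At 20 m per second each frame has to account for roughly two thirds of a metre of strip, so a stitching error in a single frame pair misplaces a substantial length of material.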

Fig. 2. Images of steel strips captured with cameras for surface inspection system

3 Related and Relevant Work

Understanding and analysing the performance of feature descriptors has always been crucial to research in image registration, and numerous studies have addressed it. The performance of feature descriptors such as SIFT, steerable filters, PCA-SIFT, complex filters and GLOH has been evaluated [28] over various image transformations such as rotation, varying illumination, compression and blur on multiple datasets. However, that study considers scale variation only between 200 and 250%. In [41], various feature-detector-descriptor combinations such as SIFT, ORB, SURF, SURF-BinBoost and AKAZE-MSURF were compared for the registration of point clouds obtained using terrestrial laser scanning. Various feature detectors and feature descriptors were also evaluated for visual tracking applications [10]. A similar quantitative comparison applied multiple feature detectors, such as BRISK, FAST, ORB, SIFT, STAR, SURF, AGAST and AKAZE, to different image data sequences, but did not compare the computational times of these algorithms [32]. Another study compares SIFT, SURF, ORB and AKAZE features in a monocular visual odometry application [7] using the KITTI benchmark dataset. A further study [19] compares the performance of SIFT and SURF under different image deformations using feature matching.

In the comparisons of feature descriptors available in the literature, the basic assumption is that there are enough features to be detected. In general, the scenes or applications used for analysing feature descriptors for image registration have many distinct textures that can be detected as features. Low-textural images present a distinctive challenge for feature descriptors, and in turn for image registration algorithms: finding and matching features. There are many real-time industrial applications, such as paper processing and steel manufacturing, in which image registration and stitching fail due to a dearth of features. In this paper, we analyse the performance and efficiency of various feature descriptor algorithms on images that carry very little textural information. This analysis also helps in understanding the shortcomings of feature descriptor algorithms on low-textural images, which make image registration difficult in industrial applications.

4 Hardware Configuration and Experimental Setup

The hardware configuration and experimental setup used in the evaluation of the feature descriptors are shown in Fig. 3a. The primary elements of this experimental setup are:

  • Sensing element: consists of monochrome line-scan cameras (Dalsa Spyder3), each combined with a 30 mm focal length F-mount lens.

  • Interfacing element: consists of a four-port Peripheral Component Interconnect Express (PCIe) Power-over-Ethernet (PoE) card (Adlink PCIe-GIE64+).

  • Processing element: consists of a Windows-based workstation with an i7 processor, 16 gigabytes of RAM, and a 4-gigabyte Nvidia graphics card.

  • Application software: comprises a Python environment (Python being an interpreted, high-level, general-purpose programming language).

The final hardware configuration and setup are shown in Fig. 3b.

Fig. 3. Hardware Schematic and setup with Illumination system at Cold Rolling Mill, Tata Steel

5 Dataset Preparation

Using the hardware setup described in Sect. 4, images are captured in real time for multiple rolling sections at a frame rate of 30 frames per second. Image capture is triggered whenever the rolling of a coil starts, and each captured image becomes part of the dataset. To prepare the dataset, images are collected for different types of coils rolled at varying speeds. The ambient lighting conditions are also varied by collecting images in the morning, at noon and at night. In these experiments, we captured images of 12 types of coils, rolled at different speeds, during the morning, noon and night. Hence, a total of 36 coils were captured and mapped in one day, and images were captured in this way over a period of 10 days. This yields images of 360 coils with approximately 200 images per coil, so the total number of images in the dataset is approximately 72,000.
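The dataset size follows directly from the counts given above. The snippet below simply reproduces that arithmetic; the variable names are hypothetical, and the figure of 200 images per coil is the approximate value stated in the text.

```python
coil_types = 12          # types of coils captured
sessions_per_day = 3     # morning, noon and night
days = 10                # duration of the capture campaign
images_per_coil = 200    # approximate number of images per coil

coils_per_day = coil_types * sessions_per_day
total_coils = coils_per_day * days
total_images = total_coils * images_per_coil
```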

6 Understanding Feature Descriptors

The important phases of image stitching are feature description, registration, and blending [44]. Feature descriptors are algorithms that encode unique information in images into matrices that enable differentiation between images and features. Feature descriptor algorithms can be organized into direct, deep learning-based, and feature-based methods.

Direct methods compare pixel intensities, so that the properties represented by each pixel intensity are compared with those of the other pixels in the overlapping areas of the images [17]. These techniques generally minimize the variations in pixel intensities and make optimal use of the image details. Moreover, they evaluate every pixel intensity in the image, which makes them extremely complex. Some homography parameters are first assessed using phase correlation [4, 29], after which the homography matrix is updated to minimize a specific cost function. The major drawback of this class of techniques is that they are limited to flat scenes without parallax, making them unsuitable for real-time industrial applications such as surface inspection systems.
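The idea of minimizing an intensity-based cost over the overlap can be sketched in a heavily simplified setting. The example assumes a single image row and a purely horizontal shift, and uses the mean squared intensity difference as the cost; it is not the phase-correlation and homography refinement of the cited methods, and the function names and pixel values are hypothetical.

```python
def mean_squared_difference(a, b):
    """Mean squared intensity difference between two equally long pixel runs."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_overlap(left_row, right_row, min_overlap=2):
    """Find the overlap width that minimizes the intensity cost.

    Every candidate width compares the tail of left_row with the head of
    right_row, so all pixels of the candidate overlap contribute to the cost.
    """
    best_width, best_cost = None, float("inf")
    max_overlap = min(len(left_row), len(right_row))
    for width in range(min_overlap, max_overlap + 1):
        cost = mean_squared_difference(left_row[-width:], right_row[:width])
        if cost < best_cost:
            best_width, best_cost = width, cost
    return best_width, best_cost


# Hypothetical textured rows: the last three pixels of `left` reappear in `right`.
left = [10, 12, 15, 40, 42, 45, 80]
right = [42, 45, 80, 81, 83, 90]
width, cost = best_overlap(left, right)

# Hypothetical uniform rows: every candidate overlap has the same cost.
flat_width, flat_cost = best_overlap([50] * 6, [50] * 6)
```

For the textured rows the cost has a single clear minimum at the true overlap. For the uniform rows all candidates are equally good, so the returned width is merely the first one tried, which mirrors the difficulty that low-textural images pose.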

Feature descriptor algorithms based on deep learning have also been developed recently. Some attempts [12, 36, 42] use convolutional neural networks (CNNs) for feature detection in place of traditional feature descriptor algorithms. Other variants use neural networks for feature matching [2] and for estimating transformation parameters from the detected features [43]. Furthermore, a few papers design for specific conditions such as fixed views [21, 35] or wide-angle views using fisheye lenses [25]. The major drawback of such deep learning-based methodologies is that they require very high processing times, in the order of seconds, which makes them unusable for real-time applications. Hence, we do not consider deep learning-based methodologies for evaluation and comparison in this paper.

Feature-based techniques focus on ascertaining the relationships between the overlapping areas in the images by comparing a few key feature descriptors extracted from the images [11]. This class of techniques has no restrictions in terms of the scenes and is highly reliable and fast. One of the basic requirements of these techniques is the presence and accurate detection of feature descriptors that are usually textural features in the images. Feature-based techniques operate on matching different features and ensure invariance to noise, scale, translation, and rotation. Some popular feature descriptors in the literature include SIFT, SURF, ORB, BRISK, and AKAZE.

6.1 Scale Invariant Feature Transform (SIFT)

SIFT, proposed by Lowe, extracts keypoints and computes local descriptors using the image gradient and direction information [26]. We applied the SIFT algorithm to the low-textural images shown in Fig. 2. The results in Fig. 4a show that the images do not contain the number of distinct features that SIFT requires; hence, the algorithm detects keypoints that are not unique, and the matching is inaccurate.

6.2 Speeded Up Robust Feature (SURF)

The SURF algorithm uses multi-dimensional space theory. Furthermore, the keypoint detection is improved by monitoring keypoint quality, and keypoint matching is improved by using the Hessian matrix [3]. The keypoint detection and matching using SURF for the set of low-textural images is shown in Fig. 4b. The detection and matching accuracy is much better than that of SIFT, but some points are still erroneously detected and matched, which affects the quality of image stitching.

6.3 Oriented FAST and Rotated BRIEF (ORB)

ORB is a fast binary descriptor that combines the salient features of the FAST and BRIEF algorithms, using the keypoint detector of FAST and the descriptor of BRIEF [33]. ORB performs much better than SIFT and SURF in terms of speed and efficiency. The performance of ORB is shown in Fig. 4c. It can be observed that, due to the lack of textural features, the detection accuracy of ORB is too poor for image stitching.

6.4 Binary Robust Invariant Scalable Keypoints (BRISK)

The BRISK algorithm employs the grayscale relationship between random pairs of points in the image, resulting in a binary descriptor [23]. This algorithm is faster than other algorithms and also requires less storage memory, at the cost of reduced robustness. The results of BRISK, shown in Fig. 4d, show that the feature points are detected accurately, but the matching is not effective, which makes the stitching process inefficient.
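How pairwise grayscale comparisons yield a binary descriptor, and how such descriptors are compared, can be sketched as follows. This is a simplified illustration in the spirit of binary descriptors, not BRISK's actual sampling pattern or smoothing; the point pairs in `PAIRS`, the 3 x 3 patches and the function names are hypothetical. Binary descriptors are conventionally compared by Hamming distance, which is what the second function computes.

```python
def binary_descriptor(patch, pairs):
    """Encode a patch as an integer bit string.

    Bit k is set when the first point of pair k is brighter than the second.
    """
    bits = 0
    for k, ((r1, c1), (r2, c2)) in enumerate(pairs):
        if patch[r1][c1] > patch[r2][c2]:
            bits |= 1 << k
    return bits


def hamming_distance(d1, d2):
    """Number of differing bits between two binary descriptors."""
    return bin(d1 ^ d2).count("1")


# Hypothetical comparison pattern: four (row, col) point pairs in a 3 x 3 patch.
PAIRS = [((0, 0), (2, 2)), ((0, 2), (2, 0)), ((1, 0), (1, 2)), ((0, 1), (2, 1))]

patch_a = [[90, 20, 30],
           [40, 50, 10],
           [70, 80, 15]]
patch_b = [[88, 22, 29],   # same structure as patch_a, slightly different gains
           [41, 50, 12],
           [69, 79, 17]]
patch_flat = [[50, 50, 50],  # texture-free patch
              [50, 50, 50],
              [50, 50, 50]]

desc_a = binary_descriptor(patch_a, PAIRS)
desc_b = binary_descriptor(patch_b, PAIRS)
desc_flat = binary_descriptor(patch_flat, PAIRS)
```

The two textured patches produce the same bit pattern despite their intensity differences, whereas a texture-free patch yields an all-zero descriptor that carries no distinguishing information, which is one way to see why matching degrades on low-textural images.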

6.5 Accelerated-KAZE (AKAZE)

AKAZE uses non-linear diffusion in multi-scale feature detection, resulting in better repeatability and performance [1]. The major drawback of this algorithm is that it is computationally expensive. The results of applying the AKAZE algorithm to low-textural images are shown in Fig. 4e. The matching accuracy is poor even though the keypoint detection accuracy is high.

Fig. 4. Keypoint Matching using various feature descriptor algorithms on Low-Textural Images

7 Performance Comparison of Feature Descriptor Algorithms

This section analyses the performance and efficiency of the feature descriptor algorithms in terms of their usability in low-textural real-time applications. The fundamental requirement for real-time applications is the ability to process at least 15 to 20 frames per second, so that the digital twin covers the maximum area at the speed of rolling. Higher frame rates are achieved by identifying the steps that consume the most computational time during processing. Hence, the computational time of each processing step of the various algorithms is evaluated and shown in Table 1.

The feature descriptor consumes the largest share (approximately 50%) of the computational time of the image stitching algorithm. Feature descriptors have also been evolving, and each improved algorithm brings a considerable reduction in computation time.

The fastest existing feature descriptor algorithm consumes 39 ms out of a total computational time of 79 ms, which means that at most around 12 frames can be processed per second, below the 15 to 20 frames per second required for real-time operation.
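The frame-rate figure follows from the per-frame processing time. The calculation below is a small sketch of that arithmetic, assuming frames are processed strictly one after another; the function and variable names are hypothetical.

```python
def max_frames_per_second(total_ms_per_frame):
    """Upper bound on the frame rate when frames are processed sequentially."""
    return 1000.0 / total_ms_per_frame


descriptor_ms = 39.0   # time spent in feature description, as quoted above
total_ms = 79.0        # total stitching time per frame, as quoted above

fps = max_frames_per_second(total_ms)
descriptor_share = descriptor_ms / total_ms

# Time available per frame at the lower real-time limit of 15 frames per second.
budget_ms_at_15_fps = 1000.0 / 15
```

The per-frame time exceeds the budget at 15 frames per second, and the feature description step alone accounts for roughly half of it.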

Table 1. Comparison of step-wise computational time of Stitching Approaches

Feature descriptor algorithms were applied to nearly 3000 images captured from multiple cameras in Monte-Carlo trials. In these trials, the computation times of the five feature descriptor algorithms were analysed statistically to obtain the mean, median, and standard deviation shown in Table 2.

Table 2. Performance Comparison of Feature Descriptor Algorithms

The mean and median computation times of the AKAZE algorithm are the lowest among all the feature descriptor algorithms, followed by BRISK, ORB, SURF, and SIFT. The standard deviation is lowest for AKAZE, followed by SIFT, ORB, SURF, and BRISK; the standard deviation values therefore follow a different order from the mean and median. Overall, AKAZE is the best-performing feature descriptor algorithm for low-textural images, which is further confirmed by visual inspection of the box plots shown in Fig. 5.
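The summary statistics used in this comparison can be computed with Python's standard `statistics` module. The sketch below uses made-up timing samples for two unnamed algorithms purely for illustration; these numbers are not measurements from this study, and the sample standard deviation is an assumption about which estimator was used.

```python
import statistics


def summarise(samples_ms):
    """Mean, median and sample standard deviation of timing samples (in ms)."""
    return {
        "mean": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        "stdev": statistics.stdev(samples_ms),
    }


# Hypothetical timing samples in milliseconds (not data from this paper).
timings_ms = {
    "hypothetical_A": [40.0, 42.0, 41.0, 39.0, 43.0],
    "hypothetical_B": [55.0, 70.0, 52.0, 90.0, 58.0],
}

summary = {name: summarise(samples) for name, samples in timings_ms.items()}
```

In the second sample set a few slow runs pull the mean above the median, which is why the mean, median and spread can rank algorithms differently.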

Fig. 5. Box Plot showing the performance of the various feature descriptor algorithms

The efficiency of the algorithms is compared using the sensitivity and specificity of the matched low-textural features extracted from the images. Sensitivity is reported as the True Positive Rate (TPR), defined here as the ratio of true positives to the total number of matched low-textural features extracted from the images. Specificity is characterised through the False Positive Rate (FPR), defined here as the ratio of false positives to the total number of matched low-textural features; a lower FPR corresponds to a higher specificity. Table 3 shows the comparison of TPR and FPR for the matched low-textural features.
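Following the definitions given above, in which both ratios are taken over the total number of matched features, the two rates can be computed as in the sketch below. The match counts are hypothetical, and the labelling of each match as true or false is assumed to come from a ground-truth check that is not shown here.

```python
def match_rates(true_matches, false_matches):
    """Return (TPR, FPR) as fractions of the total number of matched features."""
    total = true_matches + false_matches
    if total == 0:
        raise ValueError("no matched features to evaluate")
    return true_matches / total, false_matches / total


# Hypothetical counts for one image pair (not results from this paper).
tpr, fpr = match_rates(true_matches=45, false_matches=15)
```

With these definitions the two rates sum to one whenever every match is labelled either true or false, so a rising TPR necessarily goes together with a falling FPR.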

Table 3. Efficiency Comparison of Feature Descriptor Algorithms

Fig. 6. Efficiency Comparison between feature descriptor algorithms

Fig. 7. Registered Images based on the various feature descriptor algorithms

The TPR increases with every new generation of feature descriptor algorithm, while the FPR decreases. TPR is highest and FPR lowest for the AKAZE algorithm, whereas TPR is lowest and FPR highest for the SIFT algorithm. Another important inference is that, across the SIFT, SURF, ORB, and BRISK algorithms, the total number of matched features keeps decreasing even as the TPR increases, which indicates that the higher TPR comes at the cost of fewer matched features. Hence, the main advantage of the AKAZE algorithm is that it has the highest TPR together with a 780% increase in the number of features compared with the BRISK algorithm, as the bar graphs in Fig. 6 also show.

8 Comparison of Registration Based on the Various Feature Descriptors

Based on the matched features from the feature descriptors, the next step in the algorithm is registration of the images, that is, the second image is transformed onto the plane of the first image. To this end, the homography matrix is computed from the matched features and the second image is warped with it to register the images. The registered images make the effectiveness, accuracy and efficiency of the feature descriptors clear. Figure 7 shows that the currently available feature descriptors do not perform effectively for low-textural images. The major reason is that feature descriptor algorithms are designed on the assumption that the images contain numerous features that can be matched and registered.
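The warping step can be illustrated by applying a 3 x 3 homography to individual pixel coordinates. The sketch below assumes the matrix is already known; the matrix `H_shift` is a hypothetical pure translation chosen for readability, whereas in practice the matrix would be estimated from the matched keypoints. The function name and image size are likewise illustrative.

```python
def apply_homography(H, point):
    """Map an (x, y) pixel coordinate through a 3 x 3 homography matrix."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to pixel coordinates.
    return xh / w, yh / w


# Hypothetical homography: shift the second image 100 px right and 5 px down.
H_shift = [[1.0, 0.0, 100.0],
           [0.0, 1.0, 5.0],
           [0.0, 0.0, 1.0]]

# Corners of a hypothetical 640 x 480 second image, mapped into the first image's plane.
corners = [(0, 0), (639, 0), (639, 479), (0, 479)]
warped_corners = [apply_homography(H_shift, p) for p in corners]
```

When the matched features are few or wrong, the estimated matrix deviates from the true mapping and every warped pixel inherits that error, which is consistent with the registration failures observed on low-textural images.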

9 Conclusions

This paper presents a detailed comparative analysis of five popular feature descriptors for low-textural images. The primary motivation for this study is the application of these algorithms to real-time image and video stitching involving low-textural images captured from multiple cameras. The performance and efficiency comparison also helps in understanding the improvements made in each generation of feature descriptor algorithms.

A comparison of the registration results obtained with the various feature descriptor algorithms makes clear that the currently available descriptors fail to register low-textural images effectively. This paves the way for future work, namely the design of feature descriptors specifically for stitching low-textural images.