1 Introduction

In the automotive industry, light alloy aluminum castings (e.g., wheels, knuckles, and gear boxes) are important elements for determining roadworthiness. During the production process, heterogeneous regions can form inside the workpiece. These manifest themselves, for example, as cracks, bubbles, slags, or inclusions. In the quality control of aluminum castings, every detail must be thoroughly checked using X-rays, and 100% of the parts must be reviewed. The goal of X-ray testing of castings is to identify discontinuities (defects) located inside the test object that are not visible to the naked eye (see, for example, Fig. 1).

Over the past 35 years, several computer vision methods for the automated inspection of castings have been described, with many promising results. Automated X-ray systems have improved quality through repeated, objective inspections, and have increased productivity and consistency while reducing labor costs [41].

The general trend today in computer vision is to use methods based on deep learning. Deep learning has been established as the state of the art in many areas of computer vision. The key idea of deep learning is to replace handcrafted features with features that are learned efficiently using a hierarchical feature extraction approach [2, 22, 23, 41]. Usually, the learned features are so discriminative that no sophisticated classifiers are required. In recent years, we have witnessed tremendous improvements in many fields of computer vision that involve complex deep neural network architectures trained with thousands or millions of images (e.g., facial recognition [7], object recognition and detection [31, 63], diagnosis of prostate cancer [44], and classification of skin cancer [9], among others). Methods based on deep learning have become fundamental in these fields. Nevertheless, the use of deep learning is still limited in aluminum casting inspection.

Fig. 1 X-ray images of a casting with real defects (images of series C0001 of the \({\mathbb {GDX}}\text {ray}\) dataset [42])

In comparison with other computer vision applications, the introduction of techniques based on deep learning in computer vision for X-ray testing in industrial applications has been rather slow. We believe that there are three reasons for this. (i) The first has to do with the availability of public databases that can be used for these purposes. While in some areas of computer vision (e.g., facial recognition), hundreds of databases have been created since the 1990s, there is only one public database for X-ray testing in casting inspection, \({\mathbb {GDX}}\text {ray}\) [42]. This database was created five years ago and contains around 2700 X-ray images. The rest of the datasets used in the experiments reported by industry and academia are private. (ii) The second reason is related to the number of experts working in this field. While almost anyone can be an expert in other areas of computer vision (such as object recognition), in non-destructive testing of castings, the relative number of people working on these subjects is rather low and their work is usually expensive. In this kind of computer vision application, experts must be hired to label the data (make annotations, define bounding boxes, etc.). It is very simple to find people who can detect cats and dogs in photographs, but not so easy to find human operators who can identify the discontinuities in a casting by inspecting an X-ray image. (iii) The last reason is that in other applications of computer vision, color photographs can be acquired using inexpensive equipment (often a cell phone), whereas X-ray testing requires expensive equipment. For these reasons, it is likely that fewer people are working on X-ray images than on color images.

This research simplifies the use of deep learning in the inspection of castings. It is our hope that these techniques can be used easily and effectively in the quality control of die castings in the near future. Our contributions are fourfold:

  1. We developed a simple, effective, and fast deep learning strategy that can be used in the inspection of aluminum castings. The training stage requires a relatively small number of X-ray images and no manual annotations because a simulation model is used to superimpose defects onto the X-ray images. In our experiments, the training of the model was completed in just 2.5 h; on the testing dataset (with real defects), performance was very high (average precision was around 0.90 and the \(F_1\) score was 0.91), and the computational time was very low (only 11 ms per image, i.e., it can be used in real-time inspection to aid human operators).

  2. We propose a training/testing strategy in which defects that are used in training are not used in testing. Due to the low number of real defects, in many cases a defect is captured in different X-ray images from different points of view. Thus, it is very common to use some captures of a defect in the training dataset and other captures of the same defect in the testing dataset. This practice, which could lead to overfitting, is avoided in our work.

  3. We use well-established deep object detection methods with a good level of maturity in computer vision. This means that there are many examples available in public repositories that have already been successfully tested and evaluated. The use of these methods in aluminum inspection does not pose practical problems at the training or testing stages. Moreover, training and testing can be executed in Python in any browser, with no intricate configuration and with free access to GPUs, using Google Colab\(^{1}\).

  4. The code and datasets used in this paper are available in a public repository\(^{2}\). We believe that this practice, which is very common in other fields of computer vision, should be more common in X-ray testing. Thus, anyone can (i) reproduce the results reported in this paper and (ii) re-use the code in other inspection tasks.

In the field of computer vision, modern object-detection methods (like YOLO [50] and RetinaNet [28]) have been developed over the past five years with very promising results, but it has been very difficult to use them to identify defects in aluminum castings because there is not enough data for training. In order to overcome this problem, synthetic data were used in our work. Training deep learning models with synthetic data is nothing new (see, for example, [53, 57, 58]). However, to the best of our knowledge, this is the first time that modern object-detection methods have been applied to the inspection of aluminum castings using simulated defects as training data. We tested ten object-detection models: eight modern models, chosen because they are the best performing and most representative deep learning-based object detection models in computer vision, as stated in [63], and, for comparison purposes, two baseline classic models based on handcrafted features [34] and convolutional neural networks [37] in conjunction with a sliding-window strategy. As we show in our experiments, the modern strategies significantly outperform the classic ones.

2 Related work

In this Section, we cover the most important progress that has been made in the automated detection of defects in aluminum castings. The review is divided into two sections: the first (Sect. 2.1) is dedicated to the specific methods used in the inspection of aluminum castings. The second (Sect. 2.2) is related to computer vision methods that deal with object detection.

Table 1 Published methods of defect recognition on castings

2.1 Defect detection in castings

Over the past 35 years, the literature has described various computer vision methods applied to the automated detection of casting defects. The first contributions were likely made in the 1980s [4, 12]. Today, we can identify four different families of methods that are being used by industry or academia: (i) classic methods, (ii) methods based on multiple views, (iii) methods based on computed tomography, and (iv) methods based on deep learning. They are summarized in Table 1.

\(\bullet \) Classic Methods: These methods correspond to approaches based on classic image processing and pattern recognition techniques. In these approaches, handcrafted features are used for the automated inspection of castings. This family of methods consists of two main groups [35]:

  • Reference methods: In reference methods, still images must be taken from select inspection positions. A test image is then compared to the reference image. If a significant difference is identified, the test piece is classified as defective.

  • Methods without a priori knowledge of the structure: These approaches use pattern recognition, expert systems, artificial neural networks, or general filters to make them independent of the position and structure of the test piece. An example of these methods is given in our experiments.

The fundamental disadvantages of the first group of methods include the complexity of their configuration and inflexibility to changes in the design of the workpiece. However, they are much more effective than the second group because the automated adaptive processes used to accommodate design modifications are far from perfect.

\(\bullet \) Methods based on multiple views: Over the past two decades, approaches based on multiple views have been proposed because they can be very effective when examining complex objects where the uncertainty of only one view can lead to misinterpretation. These approaches typically have two main steps:

  • First, potential defects are identified in each view. Handcrafted or learned features can be used in this effort.

  • Second, the potential defects are matched and tracked across multiple views.

The key idea of this approach is to consider potential defects which cannot be tracked to be false alarms.

\(\bullet \) Methods based on computed tomography: In contrast to the rest of the methods, computed tomography produces a volumetric reconstruction of the test object: a 3D volume, i.e., a set of 2D images of slices of the object under test, is estimated using reconstruction approaches. Computed tomography can be a very time-intensive process, requiring a minimum measurement time for adequate signal-to-noise ratios as well as a minimum number of projections for the desired local resolution.

\(\bullet \) Methods based on deep learning: The key idea of deep learning is to replace handcrafted features with features that are learned efficiently using a hierarchical approach. This family of methods has made contributions to defect detection in terms of object classification and object detection, as seen in Table 1.

Multiple views and computed tomography are rarely used to inspect castings. It is clear that in the coming years, deep learning-based methods will be used more frequently by the industry due to their high level of effectiveness. This paper uses state-of-the-art object detection methods with deep learning to offer an academic contribution to such efforts.

2.2 Computer vision for object detection

In computer vision, we distinguish between image classification and image detection. The purpose of image classification is to assign an X-ray image to one class. Image classification is typically used when there is only one object per image to be recognized. In the inspection of castings, we use this approach when a small sub-image of an X-ray image, e.g., a patch of 32 \(\times \) 32 pixels as reported in [38], is to be classified as ‘defective’ or ‘not defective.’ In image classification using deep learning, the input image is fed into a convolutional neural network (CNN) that classifies it. In image detection, more than one object can be recognized in an X-ray image, and the location of each recognized object is identified using a bounding box that encloses the detected object (see an example in Fig. 2).

A simple strategy based on the sliding-window methodology has been used to perform image detection by means of image classification. In this approach, a detection window is moved over the input image in both the horizontal and vertical directions. For each position of the detection window, a classifier (e.g., a CNN) decides which class the corresponding portion of the image belongs to based on its representation. An example of this approach for defect detection in aluminum castings is given in [37] and in our experiments. It is worth mentioning that this approach requires the classification of a huge number of patches. In addition, if the size of the objects to be detected varies, the sliding-window approach must be repeated for different patch sizes, which may make the computational time prohibitive. New approaches that avoid this problem have been developed over the past few years. We address them in this Section. They can be subdivided into two groups [19]: (i) detection in two stages, and (ii) detection in one stage. In many object detection experiments, the methods of the second group outperform those of the first in terms of both accuracy and speed [50, 63]. As such, we provide only a brief overview of the first group in this paper.
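To give a sense of the number of patches involved, the following minimal sketch enumerates the window positions of a sliding-window detector. The patch size of 32 pixels follows the example of [38], and the 640 \(\times \) 640 image size matches the windows used later in our experiments; the stride of 4 pixels and the function names are assumptions made only for illustration.

```python
def sliding_window_positions(height, width, patch, stride):
    """Yield the top-left corner (row, col) of every detection window."""
    for row in range(0, height - patch + 1, stride):
        for col in range(0, width - patch + 1, stride):
            yield row, col


def count_patches(height, width, patch, stride):
    """Number of patches a classifier must evaluate for one patch size."""
    return sum(1 for _ in sliding_window_positions(height, width, patch, stride))


# 32x32 patches on a 640x640 image with a (hypothetical) stride of 4 pixels:
# 153 positions per axis, i.e. 153 * 153 = 23409 classifications per image.
print(count_patches(640, 640, 32, 4))
```

Every additional patch size multiplies this cost again, which is the motivation for the one-pass detectors described next.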

Fig. 2 YOLO strategy for object detection

2.2.1 Detection in two stages

The detection methods that use two stages are called region-based methods. The first stage is the region proposal and the second is the final classification (of the proposed regions).

In the region proposal, a method is used to identify regions of the input image that may contain an object. In the sliding-window method explained above, this step involves an exhaustive search. However, there are other methods, e.g., R-CNN [14], that propose regions instead of analyzing all possible matches for the input image.

In the final classification, a CNN is used to classify the regions that have been proposed through the first step.

The most relevant methods that use two-stage detection are R-CNN [14], Fast R-CNN [13], and Faster R-CNN [52]. For a more detailed explanation of the use of these methods in X-ray testing, see [41].

2.3 Detection in one stage

In these approaches, a single CNN is trained on both localization and classification, i.e., the prediction of bounding boxes and the estimation of the class probabilities of the detected bounding boxes. This group of approaches is the state of the art in detection methods because they are very effective and very fast. They are the best performing and most representative deep learning-based object detection models, as stated in [63]. In this Section, we address the most representative methods, namely, YOLO [3, 48, 49, 50]\(^{3}\), EfficientDet [54], and RetinaNet [28]. We offer a brief description of these detection models and their principal differences. We pay special attention to YOLO because it performed best in our work.

\(\bullet \) YOLO: In region-based approaches, as explained in Sect. 2.2.1, object detection is performed in two stages: region proposal and final classification. This means that the classification is not performed by looking at the complete image, but by viewing selected regions of the image. In order to overcome this disadvantage, a new method called YOLO (You Only Look Once) was proposed [48]. YOLO is a single (and powerful) convolutional neural network that looks at the image once, i.e., the input image is fed into a single CNN and the output is the simultaneous prediction of both the bounding boxes (localization) and the category probabilities (classification) of the detected objects. It is very fast because the input image is processed in a single pass by the CNN. Over the past few years, many versions of YOLO have been developed: YOLOv1 [48], YOLOv2 [49], YOLOv3 [50], YOLOv4 [3] and YOLOv5\(^{3}\). They use different approaches in terms of subdivision, scales, anchors, and architectures to improve performance. The main strategy of YOLO is described below.

The main idea behind YOLO is very simple: The input image is divided into a grid of \(S \times S\) cells, and YOLO can detect B objects for each cell. For each detected bounding box, YOLO computes:

  • (x, y, w, h): variables that define the detected bounding box, i.e., location (x, y) and dimensions (width, height),

  • p: confidence score that gives the probability that the bounding box contains an object (P(Object)),

  • \(p_i\), for \(i = 1, \ldots , K\): probability distribution over all K possible classes, i.e., \(p_i\) is a conditional class probability (P(Class\(_i\vert \)Object)).

Fig. 3 Block diagram of the proposed method. See examples of images a, b, c, and d in Fig. 4

Fig. 4 Simulation of defects: a Original X-ray image with no defect. b Simulated projections of ellipsoids. c X-ray image with simulated defects (superimposition of simulated ellipsoids onto the original X-ray image). d Bounding boxes of the simulated defects. Using this method, it is very simple to generate training data, i.e., X-ray images with (simulated) defects and the location of the defects, where the bounding boxes are obtained with no manual annotation

This means that for each bounding box, YOLO provides an array of \(R = 4+1+K\) elements: (x, y, w, h, p, \(p_1, \ldots , p_K\)), as illustrated in Fig. 2\(^{4}\). At the testing stage, an object of class i is detected if P(Object) \(\times \) P(Class\(_i \vert \)Object) is greater than a confidence threshold. Since B bounding boxes can be detected in a grid cell, an array of \(Q = B \times R\) elements is computed for each cell. The simplicity of YOLO (see Fig. 2) is based on the fact that (i) the architecture has only standard convolution layers with 3 \(\times \) 3 kernels and max-pooling layers with 2 \(\times \) 2 kernels, and (ii) the output of the CNN is a tensor of \(S \times S \times Q\) elements. This means that, for each grid cell, we have \(5+K\) elements per bounding box that give us information about the localization of the bounding box and the category probabilities.
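The structure of this output can be illustrated with a minimal NumPy sketch that reads an \(S \times S \times Q\) array and applies the detection rule P(Object) \(\times \) P(Class\(_i \vert \)Object) > threshold. The function name, the box-major ordering of the Q elements within a cell, and the threshold value are assumptions made for illustration; actual YOLO implementations add further steps (anchors, multiple scales, non-maximum suppression) that are omitted here.

```python
import numpy as np


def decode_yolo_output(output, num_boxes, num_classes, conf_threshold=0.5):
    """Return (row, col, class_id, score, (x, y, w, h)) for each detection.

    `output` has shape (S, S, Q) with Q = B * R and R = 4 + 1 + K.
    Assumed layout: the R values of box 0, then the R values of box 1, ...
    """
    S = output.shape[0]
    R = 4 + 1 + num_classes
    cells = output.reshape(S, S, num_boxes, R)
    detections = []
    for row in range(S):
        for col in range(S):
            for b in range(num_boxes):
                x, y, w, h, p = cells[row, col, b, :5]
                class_probs = cells[row, col, b, 5:]
                scores = p * class_probs  # P(Object) * P(Class_i | Object)
                i = int(np.argmax(scores))
                if scores[i] > conf_threshold:
                    box = (float(x), float(y), float(w), float(h))
                    detections.append((row, col, i, float(scores[i]), box))
    return detections


# Toy example: S = 7, B = 2, K = 1 (a single 'defect' class), so Q = 12.
toy_output = np.zeros((7, 7, 12))
toy_output[3, 4, 6:] = [0.5, 0.5, 0.1, 0.1, 0.9, 1.0]  # second box of cell (3, 4)
toy_detections = decode_yolo_output(toy_output, num_boxes=2, num_classes=1)
```

In the defect detection task considered in this paper there is a single class, so the class probability plays a minor role and the score is dominated by the confidence p.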

\(\bullet \) RetinaNet: The RetinaNet architecture [28] is another recent object detection model. It combines the pyramidal feature extraction structure [29] with a residual architecture (ResNet) [16] that has yielded promising results for image classification. The pyramidal approach consists of decreasing the size of the image several times and making predictions for each of those sizes. Another novelty of this architecture is the shift from cross-entropy to a ‘focal loss’-based objective that reduces the penalty for well-classified examples while punishing misclassifications more aggressively.

\(\bullet \) EfficientDet: In EfficientDet [54], improvements to the architecture design are introduced and analyzed: (i) BiFPN, a weighted bidirectional feature pyramid network, is used for fast multi-scale feature fusion. (ii) A scaling method is used to scale depth, width, resolution, and prediction networks simultaneously. This method achieves better performance than RetinaNet and YOLOv3 in terms of accuracy and speed on the COCO dataset\(^{5}\).
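The effect of the focal loss mentioned for RetinaNet can be seen in a small sketch for the binary case. It uses the commonly cited form \(-\alpha _t (1-p_t)^{\gamma } \log (p_t)\); the function name and the default values \(\gamma = 2\) and \(\alpha = 0.25\) are assumptions for illustration and are not necessarily the settings used in our experiments.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, strictly between 0 and 1.
    y: ground-truth label (1 = positive, 0 = negative).
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A well-classified positive (p = 0.9) contributes far less than a
# misclassified one (p = 0.1); with gamma = 0 and alpha = 1 the expression
# reduces to the ordinary cross-entropy -log(p).
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

The factor \((1-p_t)^{\gamma }\) is what down-weights the many easy background examples that a dense detector sees.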

2.4 Overview

An overview of our proposed method is presented in Fig. 3. The recognition approach has two stages: training and testing. In our method, training is performed using real X-ray images of aluminum castings with simulated defects only (see Fig. 4). Thus, no real defect is used for training purposes because the number of real defects is very low. Testing is achieved using real X-ray images of aluminum castings with real defects. Thus, the reported performance on the testing dataset corresponds to a real scenario.

2.5 Simulation of ellipsoidal defects

We use simulated ellipsoidal defects to train the detection model\(^{6}\). In this Section, we summarize the simulation approach. For further details and more examples, see [33, 37, 40].

Fig. 5 Simulation of an ellipsoidal defect: for every pixel of the X-ray image, \((x,y) \in \Pi \), the corresponding X-ray beam is estimated. The two intersection points of the X-ray beam with the modeled ellipsoid surface are computed, and the length of the X-ray beam in the bubble d is calculated as the distance between the intersection points

Fig. 6 Similarity between real and simulated defects: grayscale (top) and 3D (bottom) representations

X-ray imaging can typically be modeled using the absorption law, which characterizes the intensity distribution of X-rays through matter [32] and a linear model [33]:

$$\begin{aligned} I = I(z) = A \varphi _0 e^{-\mu z} +B, \end{aligned}$$
(1)

where A and B are the constant parameters of the linear model, \(\mu \) is the absorption coefficient, z is the thickness of the irradiated matter, and \(\varphi _0\) is the incident energy flux density. In this case, \(\varphi _0 \exp (-\mu z)\) corresponds to the energy flux density after passage through matter of thickness z. If the material has a bubble of thickness d (with null absorption coefficient), the output energy flux density can be written as \(\varphi _0 \exp (-\mu (z-d))\) and the new X-ray image can be modeled from (1) by:

$$\begin{aligned} I' = I(z-d) = A \varphi _0 e^{-\mu (z-d)} + B \end{aligned}$$
(2)

where \(I(z-d)\) is the new gray-value of the X-ray image with the simulated defect, which can be rewritten as:

$$\begin{aligned} I' = I(z-d) = (I(z)-B)e^{\mu d} + B. \end{aligned}$$
(3)

Thus, it is possible to model the X-ray image of a casting with a simulated defect ‘\(I(z-d)\)’ from the X-ray image of a casting with no defect ‘I(z)’ and a 3D model of the defect. In our approach, a 3D defect is modeled as an ellipsoidal cavity, which is projected and superimposed onto real X-ray images of a homogeneous casting with absorption coefficient \(\mu \), as shown in Fig. 5.
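The step from (2) to (3) can be made explicit. Solving (1) for the attenuated flux term and substituting it into (2) gives the following short derivation, which uses only the symbols defined above:

$$\begin{aligned} A \varphi _0 e^{-\mu z}&= I(z) - B, \\ I' = A \varphi _0 e^{-\mu z} e^{\mu d} + B&= (I(z)-B)e^{\mu d} + B. \end{aligned}$$

Note that A and \(\varphi _0\) cancel out, so neither of them has to be known in order to superimpose a defect onto an existing X-ray image.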

The synthetic image simulation process is described below:

  • Step 1: A real X-ray image \(\mathbf{I}\) of an aluminum casting is acquired. The size of the image is \(N \times M\) pixels. Image \(\mathbf{I}\) typically has no defect.

  • Step 2: An ellipsoid is defined in 3D space. The size (length of each axis), location in 3D space (3D coordinates of the center of the ellipsoid), and orientation (three rotations, one about each axis) are determined.

  • Step 3: For each pixel (x, y) of image \(\mathbf{I}\), the two intersections of the X-ray beam with the ellipsoid surface are calculated, and the distance between them (d) is computed as illustrated in Fig. 5. If there is no intersection, d is set to zero. As output, we obtain a matrix \(\mathbf{d}\) of \(N \times M\) elements, in which element d(x, y) is the corresponding length d for pixel (x, y).

  • Step 4: A new image \(\mathbf{I}'\) of \(N \times M\) elements is defined. For each pixel (x, y) of \(\mathbf{I}'\), the corresponding gray-value is computed according to (3) as:

    $$\begin{aligned} I'(x,y) = (I(x,y)-B)e^{\mu d(x,y)} + B. \end{aligned}$$
    (4)

The approach simulates only the defect and not the whole X-ray image of the casting, because for \(d=0\), \(I'(x,y) = I(x,y)\).
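Steps 3 and 4 can be sketched in a few lines of NumPy under strong simplifying assumptions: parallel X-ray beams along the z axis (instead of the perspective geometry of Fig. 5), an ellipsoid whose axes are aligned with the image axes, and lengths expressed in pixels. The function names and all numeric values are hypothetical; this is not the function provided in our repository.

```python
import numpy as np


def chord_lengths(shape, center, semi_axes):
    """Matrix d: length of each (parallel) beam inside an axis-aligned ellipsoid.

    shape: (N, M) image size; center: (row, col) of the projected center;
    semi_axes: (a, b, c) along columns, rows, and the beam direction.
    """
    rows, cols = np.indices(shape)
    cy, cx = center
    a, b, c = semi_axes
    q = 1.0 - ((cols - cx) / a) ** 2 - ((rows - cy) / b) ** 2
    # q <= 0 means the beam misses the ellipsoid, so d is set to zero there.
    return 2.0 * c * np.sqrt(np.clip(q, 0.0, None))


def superimpose_defect(image, d, mu, B):
    """Apply Eq. (4): I'(x, y) = (I(x, y) - B) * exp(mu * d(x, y)) + B."""
    return (image - B) * np.exp(mu * d) + B


# Toy example with hypothetical values for the gray level, mu (1/pixel) and B.
I = np.full((64, 64), 100.0)
d = chord_lengths(I.shape, center=(32, 32), semi_axes=(8.0, 5.0, 4.0))
I_new = superimpose_defect(I, d, mu=0.05, B=10.0)
```

Pixels with d = 0 keep their original gray-value, which is why only the defect, and not the casting itself, has to be simulated.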

The new gray-value of a pixel, where the 3D defect is projected, depends on just four parameters:

  1. The original gray-value ‘I(x, y)’,

  2. The linear absorption coefficient of the examined material ‘\(\mu \)’,

  3. The calibration parameter ‘B’, and

  4. The length ‘d’ of the intersection of the 3D flaw with the modeled X-ray beam, which is projected into the pixel (see Fig. 5).

A simulation of an ellipsoidal defect of any size and orientation can be performed in any position of the casting. In the simulation, we must consider the fact that the size of the ellipsoid should not be larger than the thickness of the casting where the simulated defect is projected. Some examples are illustrated in Fig. 4.

In our work, we use ellipsoidal defects because the ellipsoid is a very simple model with known geometry. Furthermore, the similarity between real and simulated defects is good enough for our purposes, as we can see in Fig. 6.

2.6 Training

The detection model is trained using real X-ray images with simulated ellipsoidal defects only as follows (see Figs. 3, 4):

  1. Representative X-ray images of the casting object with no defects are selected. The idea is to have X-ray images of every part of the object being tested.

  2. In each representative X-ray image, random ellipsoidal defects are simulated. The idea is to superimpose many ellipsoidal defects onto the defect-free X-ray images. Here, the length of each axis of the ellipsoid, the orientation, and the 3D location are set randomly. It is worth mentioning that the simulated defects must be located in the object, i.e., no simulated defect may be located in the holes of the regular structure of the casting. As we can see in Fig. 4c, all simulated defects are located on the object, and none are located outside of it (in the white areas).

  3. For each simulated defect, a bounding box is defined as a rectangle that encloses the projected ellipsoid. Using these three steps, it is very simple to generate training data: we now have X-ray images with many (simulated) defects and their locations, and the bounding boxes are obtained with no manual annotation.

  4. We split the X-ray images (with simulated defects) into a set for training purposes and a set for validation purposes.

  5. The detection model is trained using the training and validation sets.

Details of numbers of images and simulated defects per image are provided in Sect. 3.
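The reason why no manual annotation is needed in step 3 can be sketched as follows: the bounding box follows directly from the pixels in which the simulated beam length d is non-zero. The function name and the (x_min, y_min, x_max, y_max) output convention are assumptions made for illustration.

```python
import numpy as np


def bounding_box(d):
    """Bounding box (x_min, y_min, x_max, y_max) of the projected defect.

    d is the matrix of beam lengths inside the simulated ellipsoid; pixels
    with d > 0 belong to the projection. Returns None if there is no defect.
    """
    rows, cols = np.nonzero(d > 0)
    if rows.size == 0:
        return None
    return int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max())


# Toy example: a 'defect' occupying rows 2..4 and columns 3..7.
d_toy = np.zeros((10, 10))
d_toy[2:5, 3:8] = 1.0
box = bounding_box(d_toy)
```

Converting such a box to the label format expected by a given detector (e.g., normalized center and size) is a separate, detector-specific step.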

2.7 Testing

The trained model is tested on X-ray images with no defects and with real defects. The idea is to estimate the performance in a real scenario. Thus, no simulated defects are used in the testing dataset.

To build the testing dataset, we need X-ray images of the same casting type with real defects that are manually annotated by human operators.

Performance and computational time must also be measured in the testing stage.

Fig. 7 An example of X-ray images from series C0001 and series C0021 of \({\mathbb {GDX}}\text {ray}\). Each series contains X-ray images of a specific wheel type

Fig. 8 Results for seven testing images of C0001 (one per column). The first row is the original testing image. The following eight rows are the results obtained using YOLOv3-Tiny, YOLOv3-SPP, YOLOv5s, YOLOv5l, YOLOv5m, YOLOv5x, RetinaNet and EfficientDet, respectively (ground truth (GT) in red, and detection (DT) in green). Baseline methods are not shown due to low performance and due to the space limitations of the article format

3 Experimental results

In this Section, we present the experiments and results obtained using the proposed method. We used eight modern algorithms for object detection: YOLOv3 [50] (versions SPP and Tiny), YOLOv5\(^{3}\) (versions ‘s’, ‘l’, ‘m’ and ‘x’), RetinaNet [28] and EfficientDet [54] (see details of implementation in Sect. 3.3). YOLOv5 and EfficientDet were released in 2020 and the others were released over the past three years. As a baseline, we included two additional methods that were developed earlier: (i) Xnet, based on a convolutional neural network [37], and (ii) CLP-SVM [34], based on handcrafted features and an SVM classifier. Both methods use the sliding-window strategy. According to Table 1, Xnet is a deep learning method and CLP-SVM is a classic method.

The Section is subdivided into four main parts. The first, Sect. 3.1, focuses on the construction of the datasets using real and simulated defects. The second, Sect. 3.2, shows the results obtained. The third, Sect. 3.3, describes the implementation in detail. Finally, Sect. 3.4 offers a discussion of our results.

3.1 Datasets

In our experiments, we used the \({\mathbb {GDX}}\text {ray}\) dataset [42]\(^{7}\). From \({\mathbb {GDX}}\text {ray}\), we used series C0001 (72 X-ray images) for the main experiments, and series C0021 (37 X-ray images) for an additional experiment. Each series contains X-ray images of a specific casting type, as illustrated in Fig. 7. Both series have an annotated ground truth that includes real defects. In each series, there is a unique casting piece that is radiographed from different points of view. The following procedure must be performed for each casting type. The explanation and main experiments are provided below for series C0001. In the discussion, we include experiments on series C0021 to validate the proposed method.

Series C0001 corresponds to an aluminum wheel commonly used for testing purposes (see, for example, [5, 37, 38, 39, 47]). These castings present two types of defects. The first is a group of blow hole defects (with \(\emptyset = \) 2.0–7.5 mm) which were already present in the castings. They were initially detected during (human) visual inspection (see, for example, Fig. 1). The remaining defects in these castings were produced by drilling small holes (with \(\emptyset = \) 2.0–4.0 mm) in parts of the casting where defects were known to be difficult to detect (for example, on the edges of regular structures).

3.1.1 Training and validation subsets

We used the following procedure to build the training and validation subsets:

\(\bullet \) Pre-processing: The size of the images of series C0001 is 572 \(\times \) 768 pixels. We resized them by a factor of two to 1,144 \(\times \) 1,536 pixels.

\(\bullet \) Selection: Series C0001 has 72 X-ray images. For each resized X-ray image, we randomly select 100 windows of 640 \(\times \) 640 pixels in locations where there is no real defect. An example is given in Fig. 4a. Of the 100 windows selected per image, 90 are assigned to the training subset and 10 to the validation subset.

\(\bullet \) Simulation: In each window selected in the previous step, we simulate defects using the ellipsoidal model explained in Sect. 2.5. A Python function for this purpose is provided in our repository\(^{2}\). The input variables of this function are the lengths of the three axes of the ellipsoid, the location and orientation of the ellipsoid in 3D space, the linear absorption coefficient \(\mu \) of the casting, and a parameter called \(x_{\max }\), defined as the maximum thickness where gray values are minimal (it is used to compute parameter B in (3)). We used the same configuration reported in [40]. In our experiments, we set randomly the number of simulated defects per image (from 2 to 20), the length of each of the three axes of the ellipsoid (from 1 to 9 mm), and its orientation (from 0 to \(2\pi \)). Examples are shown in Fig. 4. The defects have an elliptical shape of different sizes, and they are located and oriented in different ways. In these examples, we show the entire image and the simulated defects. We store the coordinates of the bounding box that encloses each simulated defect.

In summary, for series C0001, we have 7,200 X-ray images of 640 \(\times \)640 pixels with around 80,000 simulated defects.
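The selection step and the resulting subset sizes can be sketched as follows. The rejection-sampling scheme, the function names, the treatment of 1,144 \(\times \) 1,536 as height \(\times \) width, and the example defect box are assumptions for illustration; the code in our repository may select windows differently.

```python
import random


def overlaps(window, box):
    """True if a square window (x, y, size) intersects box (x1, y1, x2, y2)."""
    wx, wy, size = window
    x1, y1, x2, y2 = box
    return not (x2 < wx or x1 >= wx + size or y2 < wy or y1 >= wy + size)


def sample_windows(width, height, size, n, defect_boxes, rng, max_tries=100000):
    """Randomly pick up to n windows that contain no annotated real defect."""
    windows = []
    tries = 0
    while len(windows) < n and tries < max_tries:
        tries += 1
        wx = rng.randrange(0, width - size + 1)
        wy = rng.randrange(0, height - size + 1)
        if not any(overlaps((wx, wy, size), b) for b in defect_boxes):
            windows.append((wx, wy))
    return windows


rng = random.Random(0)                       # fixed seed for repeatability
real_defects = [(700, 500, 720, 520)]        # hypothetical annotated box
windows = sample_windows(1536, 1144, 640, 100, real_defects, rng)
train, val = windows[:90], windows[90:]      # 90 / 10 split per X-ray image

# With 72 X-ray images: 72 * 90 = 6480 training and 72 * 10 = 720 validation
# windows, i.e. 7200 windows in total.
```

With between 2 and 20 simulated defects per window (11 on average if drawn uniformly), 7,200 windows yield roughly 80,000 simulated defects, in line with the figure reported above.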

3.1.2 Testing subset

We use the following steps to define the testing subset.

\(\bullet \) Pre-processing: The size of the images of series C0001 is 572 \(\times \) 768 pixels. We resized them by a factor of two to 1,144 \(\times \) 1,536 pixels.

\(\bullet \) Selection: Series C0001 has 72 X-ray images. For each resized X-ray image, we randomly selected 10 windows of 640 \(\times \) 640 pixels, which may contain real defects. An example is given in the first row of Fig. 8.

In short, for series C0001, we have 720 X-ray images of 640 \(\times \)640 pixels with around 650 real defects. It is worth mentioning that there is no simulated defect in the testing dataset.

Fig. 9
figure 9

Different good detections (ground truth (GT) in red, and detection (DT) in green). The size of the cropped images is 140 \(\times \)140 pixels. The almost perfect detection of the first row (the difference is a couple of pixels) yields an IoU score of 0.5 \(\sim \) 0.6. IoU criterion in small defects depends on very accurate ground truth definition. Clearly, in these examples, the detection is more accurate than the ground truth. As such, we set the IoU-threshold in our work at \(\alpha =0.25\)

Fig. 10

Precision and recall curves for IoU-threshold \(\alpha = 0.25\)

Table 2 Evaluation metrics for \(\alpha = 0.25\)

3.2 Results

After training, the models were tested on the testing X-ray images. Some of them are illustrated in Fig. 8. Some models performed well visually, especially the YOLO-based detectors. As we will see in the next experiments, the baseline methods (Xnet and CLP-SVM) did not perform well.

An evaluation metric based on the Intersection over Union (IoU) score is used [30] to evaluate the performance of the detectors. In this definition, an existing defect is considered to be detected if the overlap between defect and detection is high enough. Two bounding boxes are used to measure the overlap: one for the defect, called the ground truth GT (see red rectangles in Fig. 9), and one for the detection DT (see green rectangles in Fig. 9). Using these bounding boxes, the normalized overlap is defined as:

$$\begin{aligned} \mathsf{IoU} = \frac{\text{ area }(\mathsf{GT} \cap \mathsf{DT})}{\text{ area }(\mathsf{GT} \cup \mathsf{DT})}. \end{aligned}$$
(5)

The criterion establishes that a defect is detected if \(\mathsf{IoU} > \alpha \), where \(\alpha \) is called the IoU-threshold. In general, \(\alpha \) is set to 0.5. Nevertheless, as we can see in Fig. 9, for very small defects this IoU-threshold would be so tight that half of the detections would likely be counted as wrong. A more reasonable IoU-threshold for our experiments should be lower: with \(\alpha =\frac{1}{4}\), for example, all defects of Fig. 9 would be correctly detected. In our experiments, we evaluate the performance for \(\alpha = \frac{1}{10}, \frac{1}{5}, \frac{1}{4}, \frac{1}{3}, \frac{1}{2}\).
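To make the effect of the IoU-threshold on small defects concrete, the following minimal Python sketch evaluates (5) for axis-aligned boxes. The box format `(x1, y1, x2, y2)` and the example coordinates are assumptions chosen for illustration. They show that, for a 16 \(\times \)16 pixel defect, shifting the detection by only two or three pixels already brings the IoU down to about 0.62 or 0.49.

```python
def iou(gt, dt):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(gt[0], dt[0]), max(gt[1], dt[1])
    ix2, iy2 = min(gt[2], dt[2]), min(gt[3], dt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    area_dt = (dt[2] - dt[0]) * (dt[3] - dt[1])
    union = area_gt + area_dt - inter
    return inter / union if union > 0 else 0.0

gt = (100, 100, 116, 116)                     # 16 x 16 pixel ground truth
score_2px = iou(gt, (102, 102, 118, 118))     # 196 / 316, about 0.62
score_3px = iou(gt, (103, 103, 119, 119))     # 169 / 343, about 0.49

for alpha in (0.25, 0.5):
    print(alpha, score_2px > alpha, score_3px > alpha)
```

With \(\alpha = 0.5\) the detection shifted by three pixels would be counted as a miss, whereas with \(\alpha = 0.25\) both detections are accepted.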

Fig. 11

Average precision (AP) depending on IoU-threshold (\(\alpha \))

Table 3 Computational time

Using this IoU-criterion, the statistics of true positives (TP), false positives (FP) and false negatives (FN) can be computed, and with them, the precision, recall, and \(F_1\) values are calculated as:

$$\begin{aligned} Pr = \frac{TP}{TP+FP}\ , \ Re = \frac{TP}{TP + FN}\ , \ F_1 = 2\,\frac{Pr \cdot Re}{Pr+Re}. \end{aligned}$$
(6)

By varying the confidence threshold of the detectors, we obtain different \((Pr, Re)\) values that can be plotted as a precision-recall curve. In Fig. 10, we show the precision-recall curves for \(\alpha = 0.25\). In Table 2, we report the \((Pr^*, Re^*)\) values at the maximum \(F_1\) value (\(F_1^* = \max (F_1)\)). Finally, in our experiments, we use the average precision (AP), computed as the area under the precision-recall curve, as an evaluation metric of the detector’s performance. Table 2 gives the AP value for \(\alpha =0.25\). In Fig. 11, we show the AP values for different IoU-thresholds (\(\alpha \)).
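The evaluation procedure can be sketched in Python as follows. This is a minimal illustration and not the evaluation code used in the experiments: the greedy one-to-one matching in order of decreasing confidence and the rectangle-rule integration of the precision-recall curve are assumptions, as are all function names and the toy boxes.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_detections(gts, dets, alpha=0.25):
    """Flag each (confidence, box) detection as TP or FP (greedy matching)."""
    matched, flags = set(), []
    for conf, box in sorted(dets, key=lambda d: d[0], reverse=True):
        best_j, best_iou = None, alpha
        for j, gt in enumerate(gts):
            score = iou(gt, box)
            if j not in matched and score > best_iou:   # criterion IoU > alpha
                best_j, best_iou = j, score
        if best_j is not None:
            matched.add(best_j)
        flags.append(best_j is not None)
    return flags

def pr_points(flags, n_gt):
    """(Pr, Re) pairs obtained by lowering the confidence threshold step by step."""
    tp = fp = 0
    points = []
    for is_tp in flags:
        tp += is_tp
        fp += not is_tp
        points.append((tp / (tp + fp), tp / n_gt))
    return points

def average_precision(points):
    """Area under the precision-recall curve (rectangle rule, no interpolation)."""
    ap, prev_re = 0.0, 0.0
    for pr, re in points:
        ap += pr * (re - prev_re)
        prev_re = re
    return ap

def best_f1(points):
    """Maximum F1 value over all points of the curve."""
    return max(2 * pr * re / (pr + re) if pr + re > 0 else 0.0 for pr, re in points)

# Toy example: two real defects and three detections, one of them spurious.
gts = [(10, 10, 26, 26), (100, 100, 116, 116)]
dets = [(0.9, (11, 11, 27, 27)), (0.8, (300, 300, 316, 316)), (0.6, (102, 102, 118, 118))]
flags = match_detections(gts, dets, alpha=0.25)
points = pr_points(flags, n_gt=len(gts))
ap, f1_star = average_precision(points), best_f1(points)
```

Detection toolkits often use interpolated variants of AP, so the values produced by this sketch are not expected to reproduce Table 2 exactly.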

3.3 Implementation

All modern methods were implemented in Python in Google Colab notebooks\(^{1}\). The following well-known implementations were adapted to our task:

The use of these methods in aluminum inspection does not present any practical problems at the training or testing stages, such as configuration, versions, etc.

All object detection methods and the X-ray images used for training, validation and testing in this work for aluminum defect detection are available on our public repository\(^{2}\). Details of computational time are given in Table 3.

3.4 Discussion

In our experiments, we implemented and tested eight modern object detectors based on YOLO, RetinaNet, and EfficientDet, and two baseline methods (Xnet and CLP-SVM) based on CNNs and handcrafted features, respectively. The main results are given in terms of performance (Figs. 10 and 11, and Table 2) and computational time (Table 3). In terms of performance, we have identified three groups according to average precision (AP): (i) YOLO-based methods with AP \(> 0.88\), (ii) RetinaNet, EfficientDet, and Xnet with AP = 0.5–0.65, and (iii) CLP-SVM with AP \(<0.15\). In addition, we identified two groups according to the computational time of the testing stage: (i) modern object detection methods with only a few tens of milliseconds per image, and (ii) baseline methods with more than 1 s per image.

In defect detection in aluminum castings, the defects are very small (some of the defect diameters are only 16 pixels) and the manually defined bounding boxes of the real defects are very inaccurate. For these reasons, we relaxed the IoU-threshold to \(\alpha =0.25\), so that we can consider those small detections as correct (see Fig. 9).

In the reported results (see, for example, Table 2), it is evident that the baseline methods Xnet and CLP-SVM did not perform very well. On the other hand, the YOLO-based detectors performed very well (and at very similar levels), and the YOLOv5 methods performed best (better than the YOLOv3 methods at IoU-thresholds \(\alpha > 0.25\)). Moreover, RetinaNet and EfficientDet did not perform well, probably because they were designed for larger objects. To overcome this problem, we could increase the resolution of the training images, but that would increase the training time considerably.

To explain what the results mean, we provide more details about YOLOv5s because it achieved one of the highest evaluation metrics. As shown in Table 2, for \(\alpha =0.25\), 97.54% of all existing defects were correctly detected (recall), and 85.12% of all detections were true positives (precision). These metrics correspond to \(F_1 = 0.9091\). We observe in Fig. 11 that the performance decreases as the IoU-threshold increases: the larger \(\alpha \), the lower the average precision (AP). For instance, for YOLOv5s, AP is 0.8962 for \(\alpha = 0.25\), whereas for \(\alpha = 0.33\) and 0.5, AP is 0.8075 and 0.3174, respectively. Finally, according to Table 3, YOLOv5s was trained in just 2.5 hours, and the computational time in the testing stage was only 11 ms per testing image, i.e., about 90 images per second.
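As a quick consistency check of these figures, substituting the reported precision and recall into (6), and inverting the reported time per image, gives:

$$\begin{aligned} F_1 = 2\,\frac{0.8512 \times 0.9754}{0.8512 + 0.9754} \approx \frac{1.6605}{1.8266} \approx 0.9091, \qquad \frac{1}{0.011\ \text{s}} \approx 90.9\ \text{images per second}. \end{aligned}$$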

For these reasons, we believe that the proposed methodology based on YOLO object detectors could satisfy the requirements of the industry according to the following attributes:

\(\bullet \) Simplicity: The construction of the training dataset is very simple because we only need a small number of defect-free X-ray images\(^{11}\) and a simulation process for including simulated defects in the dataset. This means that no manual annotation is required. Furthermore, we used well-established deep object detection methods that were easily adapted to our task. The code is implemented in Python and executed in Google Colab with no intricate configuration and free access to GPUs.

\(\bullet \) Effectiveness: The performance of the object detectors based on YOLOv5 was very high. The average precision was 0.89–0.90 and the \(F_1\) factor was 0.91 as shown in Table 2.

\(\bullet \) Speed: The models were trained in a matter of hours. In addition, the computational time for one testing image is only a few tens of milliseconds, as we can see in Table 3, i.e., the method can be used in real-time inspection to aid human operators. It is worth mentioning that the computational time of the baseline methods is extremely high because they use the sliding-window strategy.

Although the YOLO methods performed very well, one disadvantage of the proposed method is that each casting type needs an ad-hoc trained model. That means that if we train the model on images of a specific wheel type (e.g., series C0001 of \({\mathbb {GDX}}\text {ray}\), as we did in our experiments), and we use this trained model on X-ray images of another wheel type (e.g., series C0021 of \({\mathbb {GDX}}\text {ray}\)), it may not perform well (in this example \(AP=0\) because no real defect could be detected). The reader can see the differences between these two wheel types in Fig. 7. This result was expected because the model has learned the details of the regular structures of one casting type that are not present in the other (and, conversely, the model has not learned the details of the other casting type). However, if we train a new model for this specific wheel type (using the methodology reported in this work), the performance increases to a very high level. In the aforementioned example, for wheel C0021 with 37 X-ray images, we obtain a performance similar to that obtained for wheel C0001 using YOLOv5s, as shown in Table 4.

To overcome the disadvantage of having one trained model per casting type, we could train a single model with X-ray images from both casting types. In our example, if we train and test the model using images of both casting types, the object detector can recognize defects in both (the performance is, however, slightly lower, as shown in Table 4). However, this solution does not avoid retraining when a new casting type is introduced. Moreover, the computational time of this retraining is higher (because we have more training images), and after retraining, the individual performances must be evaluated again to ensure the effectiveness of the model on all included casting types. For these reasons, we recommend that a specific model be learned for each casting type.

It could be interesting to analyze how many casting types can use a single model. We believe, however, that it is ambitious to have one model for all wheel types. It would be best to have one model for each type. This should not be a problem for the industry, as the training process can be completed in just 2.5 hours.

Table 4 Evaluation of YOLOv5s on two wheel-types

4 Conclusions

In this article, we proposed a training strategy that uses defect-free X-ray images of a casting with the superimposition of simulated defects. No manual annotations are required because the locations of all simulated defects are known. In addition, tests are performed using real X-ray images of aluminum castings with real defects. Thus, the reported testing dataset corresponds to a real scenario.

We used well-established object detection methods (YOLO, RetinaNet, and EfficientDet) to detect defects in aluminum castings. All of them were developed in the past three years, and many examples are available in public repositories that could be adapted to our task. The strategies implemented are simple, effective, and fast. The training stage requires a relatively small number of X-ray images. In our experiments, YOLO-based detectors perform best. One of the models, YOLOv5s, was trained in just 2.5 hours. In addition, it performed very well on the testing dataset with real defects (average precision was 0.90 and the \(F_1\) factor was 0.91), and the computational time is very low (the method is able to process 90 X-ray images per second, i.e., this solution can be used in real-time inspection to aid human operators).

The code and the datasets used in this paper have been uploaded to a public repository so that anyone can reproduce (and improve upon) the reported results or re-use the code in other inspection tasks.

In the coming years, deep learning-based methods will be used more frequently by the aluminum castings industry due to their high effectiveness. This paper offers an academic contribution to such efforts.

In the future, we will model random shapes that can be used to simulate other kinds of defects, such as cracks. This feature can be very useful in the automated inspection of welds.