Introduction

Non-metallic inclusions are an inevitable product of chemical reactions occurring during steel processing and they play an important role in defining the properties of steel.[1] If not controlled, they can often reduce ductility, fatigue resistance, and toughness of steels.[2,3,4,5,6,7] There has been a significant amount of research on inclusion control during liquid steel processing[8] with intentional efforts to control inclusion populations referred to as “inclusion engineering” with the resulting product called “clean steel.”

An enabling technology for inclusion engineering efforts on both the laboratory scale and the production scale is automated analysis in a scanning electron microscope equipped with energy dispersive X-ray spectroscopy (SEM/EDS).[9] While one of several analysis and quantification methods,[10] automated SEM/EDS can measure hundreds or thousands of individual inclusions per sample and obtain chemistry, size, shape, and spatial distribution information in times on the order of hours or less. Several recent studies have advanced in back-scattered electron (BSE) imaging and EDS measurement parameters,[11,12] to improve the speed and accuracy of the method. Automated analysis is now a common tool among companies and researchers in the steel industry and it has enabled many developments in the scientific understanding of inclusions and in industrial process control.[10,13,14,15,16,17,18]

Inclusion chemistry and chemistry changes have been of primary interest. Considering many different grades and process routes, inclusions could be combinations of Al, Ca, Si, Mg, Mn, Ti, Zr, Ce, La, O, S, and N although the total number of possible elements is usually constrained by grade or processing conditions. Despite the constraints, often more than three variables are needed to fully represent inclusion populations and this creates a visualization challenge. Display of inclusion chemistry results is most typically done by Gibbs triangle ternary plots with inclusion compositions represented by cation or anion mole or mass fraction. An example plot is shown in Figure 1(a), where each inclusion chemistry in mole fraction is represented by a single data point. A proportional symbol plot (Figure 1(b)) shows the same data but with the symbol size proportional to the number of inclusions of that chemistry. The assumption is frequently made that there is little chemistry change during solidification, so the phases that comprise an inclusion in the liquid steel can be inferred from its composition by overlaying the ternary phase boundaries for the system at steelmaking temperatures, as shown in Figure 1(c).

Fig. 1
figure 1

Two representations of the same inclusion chemistry distribution (a) each measured chemistry (cation fraction) plotted as one data point, (b) the same data with symbol size proportional to the number of inclusions with that chemistry. (c) The ternary phase diagram of the MgO-CaO-Al2O3 system at 1873 K (1600 °C)

Representing inclusion chemistry by ternary diagram provides an easily interpretable view of inclusion chemistry distribution. However, they are limited in that only three variables can be plotted. For example, the Ca-containing inclusions in Figure 1 are CaS and this can only be verified by plotting the distribution on an Al-Ca-S diagram. Other data representation schemes have been proposed,[19] but still a diagram (or diagrams) must be produced for each sample. These diagrams provide engineers easily interpreted visual representations of inclusion chemistry distributions that are typically used to diagnose specific process issues.[18] Comparisons of many samples (e.g., looking at trends in behavior over many heats of steel or sequential samples) are difficult because one or more diagrams must be compared. Overall averages do not capture the multi-phase nature of inclusions. User-defined classification rules have been developed based on expected chemistry ranges.[18] However, this requires assumptions about the expected inclusion chemistries and they must be consistently applied.

This study applied methods of classifying, learning from, and representing large, multivariate data sets to inclusion chemistry data measured by automated SEM/EDS analysis. The methods investigated were (1) principal component analysis (PCA), and (2) cluster analysis. Method (1) is a technique for dimensionality reduction, i.e., representing multivariate chemistries in a two-dimensional plot, and method (2) automatically groups inclusions by chemistry. These methods were applied to a laboratory- and an industrial-scale data sets, to illustrate how these methods might be used to condense inclusion chemistry presentation, i.e., not require one or more ternary diagrams per sample. The industrial data set was gathered from analysis of two semi-finished components that exhibited differences in machinability. The objective of the analysis was to compare the inclusion populations to examine if they might be the cause of the machinability differences, as inclusions are well known to influence machinability.[20]

Analysis Methods

Principal Component Analysis (PCA)

PCA was investigated as a dimensionality reduction technique. It is a common exploratory data analysis tool that is used to provide a visual relationship between observations for multivariate data sets. The aim of PCA is to extract the important information from multivariate data sets and express the output as a set of new orthogonal variables called principal components (PC).[21] The new PC variables are linear combinations of the original variables. The principal components are generated such that the greatest variance (i.e., the greatest spread in the data set) between the observations lies on the first principal component, PC1, and the second greatest variance on PC2, and so on; in addition, the succeeding PC is computed on the constraint that it is orthogonal to the previous PC. The number of PCs computed is equal to the number of variables in the initial data set, but in most cases more than 80 pct of the variance between the observations is contained in the first two PCs, thus enabling the plotting of multivariate data sets on 2D scatter plots with minimal information loss.[21] Figure 2 shows an example of PCA, where high dimensional data, 3D, is transformed to 2D using the first 2 PCs, as a result the observations are viewed in simple 2D plots while losing minimal information, keeping in mind that PCs are linear combinations of the original variables.

Fig. 2
figure 2

Transformation of 3D data to 2D using PCA, reprinted from Ref. [22]

Cluster Analysis

Not to be confused with the physical clustering of inclusions, clustering is an automated method of arranging data into groups (referred to as clusters). The analysis breaks down a data set into a set of clusters, such that observations in one cluster are related to each other, in one way or another, more than observations in other clusters. Accordingly, the analysis referred to in this study pertains to the cluster of data, where a cluster is composed of inclusions with similar chemistries. Clustering itself is not one specific algorithm that can be applied to any set of data, rather it is a general term used that serves a specific task. There are numerous clustering models, each with its own algorithm on how to define and identify a cluster; as a result, the different models vary significantly depending on the input parameters and desired output.[23]

Because there are numerous models, an understanding of the various cluster models and the data sample is essential for selecting the appropriate clustering algorithm. The most basic method is k-means clustering,[23] a centroid-based clustering technique where the number of clusters is predetermined and the initial cluster centroids are randomly assigned. Then observations are assigned to the nearest cluster by minimizing the distance between the observations and the centroids of the various clusters. Distance here is a generic term. The physical distance between any two points in 2D space is \( \sqrt {\left( {x_{2} - x_{1} } \right)^{2} + \left( {y_{2} - y_{1} } \right)^{2} } \) but this can be generalized to any variables (not just positions).

While fast and conceptually straightforward, two drawbacks of k-means clustering are that the number of clusters for a data set must be preselected and that equal size clusters are often produced. Other algorithms do not have these restrictions. One such algorithm is the expectation-maximization (EM) algorithm and this method was employed in this study. The EM algorithm is a soft clustering method. In contrast to hard clustering, where each observation is assigned to one specific cluster, soft clustering assigns probabilistic mix proportions for an observation to belong to each cluster. For example, if sample X has 2 clusters, the ith observation has certain probabilities of belonging to clusters 1 and 2, and the algorithm assigns the observation to the cluster with the highest probability. If the ith observation has a 95 pct chance of belonging to cluster 1, and therefore a 5 pct chance of being in cluster 2, it will be assigned to cluster 1. This form of clustering is beneficial for assessing the quality of cluster classification, and when observations are midway between clusters. The main assumption behind the EM algorithm is that each cluster has its own Gaussian distribution, and the whole data set is a Gaussian Mixture Model (GMM).

The EM algorithm is an iterative clustering technique, involving two main steps, the expectation (E) and maximization (M) steps. First the initial parameters, the means and covariances of each cluster, are estimated, either randomly or using other statistical techniques such as utilizing the k-means algorithm to initialize the EM. Then, using these parameters, the E step calculates the probabilistic mix proportions for observations to belong to each cluster. The M step then re-estimates the parameters using the values computed in the E step. The E and M steps are reiterated until convergence.

The EM algorithm requires the number of clusters to be an input. Determination of the number of clusters was made in the following way. First, the Bayesian Information Criterion (BIC)[25,26] is calculated assuming the data consists of one to nine clusters, and the evaluation is made to verify if a sample is composed of one cluster or more than one cluster. The BIC is the value for the maximized log likelihood, considering the observations, parameters, data dimensions, and number of clusters (Eq. [1]). It is computed for a range of numbers, and the optimal number of clusters should be given by the largest BIC value.

$$ {\text{BIC}} = 2{ \log }{\mathcal{L}} - df - { \log }\left( n \right), $$
(1)

where \( {\mathcal{L}} \) is the likelihood function, \( {\mathcal{L}}\left( {\theta |x} \right) = P\left( {x |\theta } \right) \), a function of the parameters given the outcomes, which is equal to the probability of the outcomes given the parameters, \( \theta \) is the parameters of the model, the mean and covariance of each cluster, \( x \) is the outcomes of the model, \( df \) is the degrees of freedom, and \( n \) is the number of observations.

Usually, the BIC tends to over fit the number of clusters when several clusters exist in inclusion data sets. Therefore, if the optimal number of clusters evaluated using the BIC is one, then the data are assumed to be single clustered, and if it was found to consist of more than one cluster, the Silhouette index[24] was employed to determine the optimal number of clusters. This index measures how similar an observation is to its own cluster compared to observations in other clusters. The silhouette index is calculated using Eq. [2]:

$$ {\text{Silhouette}} = \frac{{\mathop \sum \nolimits_{i = 1}^{n} S(i)}}{n}, \quad {\text{Silhouette}} \in \left[ { - 1,1} \right], $$
(2)

where n is number of observations, \( S\left( i \right) = \frac{b\left( i \right) - a(i)}{{\hbox{max} \left\{ {a\left( i \right);b(i)} \right\}}} \) the Silhouette coefficient for observation i, a(i) is the average distance between the ith observation and observations within the same cluster, and b(i) is the average distance between the ith observation and observations in its nearest neighboring cluster.

The index is computed for a range of number of clusters, and the maximum value defines the optimal number of clusters. This procedure was employed because the Silhouette index cannot be defined for one cluster.

Materials and Methods

Materials

The laboratory-scale samples were taken from a series of experiments studying calcium treatment and reoxidation. Details of the experimental methods can be found in Reference 27. The samples, referred to in this paper as sample 1 and sample 2, were prepared in an induction furnace. Electrolytic iron was melted in an MgO crucible enclosed in a graphite crucible with no slag addition. The iron was deoxidized with Al, Ca treated, and reoxidized with Fe2O3 powder. The first sample, 1, was taken just before reoxidation, and the second, 2, right after reoxidation. The purpose of these lab samples was to create samples with controlled inclusion distributions to demonstrate the capabilities of PCA and clustering.

The industrial-scale data set was composed of four (4) samples, referred to in this paper as A, B, C, and D. All samples were taken from the same location on a near-finished component. The components were machined from continuously cast 4140 bar stock that was melted via an electric arc furnace, Al deoxidized, and Ca treated. Table I presents an overview of the sample chemistries, analyzed by spark optical emission spectroscopy (OES). Samples A and B were noted by the supplier to have better machinability compared to that of samples C and D. It should be noted that this ranking of machinability was relative, but sufficient to suggest there were differences between the samples that could be identified.

Table I Relevant Steel Chemistry (Mn, S, Al, and Ca in Weight Percent) for Industrial Samples

Methods

Automated inclusions analysis was carried out on mechanically polished samples using an SEM, utilizing a back-scattered electron detector equipped with an EDS analyzer. For the laboratory samples, a Phillips XL-30 SEM was utilized, with the accelerating voltage set to 10 kV, and a working distance of 10 mm. For the industrial samples, an FEI/Aspex Explorer SEM was used, the accelerating voltage was held at 20 kV, and the working distance ranged from 13.7 to 14.3 mm. The output raw data were prescreened to remove obvious outliers. A filter excluded any readings with Fe content larger than 75 pct, and more than 90 pct of the original data were retained for all samples. For the industrial samples, at least 1140 inclusions were measured per sample and the minimum detected inclusion size was 0.3 μm2. A summary of the number of inclusions measured and scanned area per sample is shown for the industrial samples in Table II.

Table II Industrial Sample Inclusion Count and Scanned Area

The variables considered for PCA and clustering were the inclusion chemistries only, which were limited to the Mg, Ca, and Al contents for the laboratory samples, and the Mg, Ca, S, Al, Mn, Si, and Ti contents for the industrial samples. Other variables investigated in this study included the areas, maximum diameters, and aspect ratios, which represent the size and morphology of inclusions; however, these variables were examined post-clustering.

All the statistical analysis preformed in this study was carried out in RStudio,[28] an interactive interface for R, a free and open-source programming language for statistical computing and graphics.[29] The “MCLUST” and “NBCLUST”[30,31] packages were utilized for the EM algorithm and the Silhouette index, respectively. In addition, the “GGTERN” package[32] was used to plot the ternary diagrams.

Results

The results of the inclusion analyses are presented below, first for the laboratory samples and then for the industrial samples. Initially, inclusion chemistry distributions on ternary diagrams are given. All ternary diagrams presented in this study pertain to the mole fraction of the respective elements. Thereafter, PCA was best used to help compare data from several samples on a 2D scatter plot, while clustering was utilized to automatically define groups in each sample. For PCA, principal components were computed for all samples combined in a single data set, to enable the plotting of observations from all samples on a single 2D plot.

It should be noted that the methods of PCA and clustering are not examined here simply to differentiate between samples. Differences in inclusion populations would be expected based on the processing history of these samples (as was known for the laboratory-scale samples and can be inferred from the chemistry of the industrial samples). This work presents the methods of PCA and clustering in the context of automated inclusion analysis and connects them to relatively small data sets that can be interpreted via conventional means. Possible further applications of these techniques are presented in Section V.

Laboratory Samples

The laboratory samples investigated in this study had controlled chemistries, and the variables considered for the analyses were confined to the Mg, Ca, and Al contents only. An overview of the inclusion chemistry distributions is presented as ternary diagrams in Figure 3. Sample 1 inclusions were composed of MgO-Al2O3-CaO inclusions, while sample 2 displayed two distinct inclusion groups—spinel (MgO-Al2O3) and solid calcium aluminates.

Fig. 3
figure 3

Inclusion chemistry distribution of laboratory samples. Sample 1 on the left, and sample 2 on the right

To get a visual comparison between samples on a single plot using PCA, both samples were combined into one data set. The transformation of the original data points to principal components was performed using the following equations:

$$ {\text{PC}}1 = - 0.585{\text{Mg}} - 0.201{\text{Al}} + 0.786{\text{Ca}} $$
(3)
$$ {\text{PC}}2 = 0.570{\text{Mg}} - 0.791{\text{Al}} + 0.222{\text{Ca}}, $$
(4)

where the coefficients represent the relative loading of each of the original variables to each respective PC, which are unique to this data set. Figure 4 displays a 2D scatter plot of all the observations, where the horizontal axis is represented by the first principal component, PC1, and the vertical axis by the second, PC2, and observations are colored according to sample. As shown on the axes, PC1 represents 69.8 pct of the variance between the observations, and PC2 30.2 pct. Since only three variables were considered for PCA, the first two PCs were enough to represent the entire variance in this data set (i.e., the variance represented in PC3 was zero); however, this will not be the case when larger number of input variables are used. Observations from sample 1 are all relatively condensed in the same area, while sample 2 can be broadly broken down into 2 groups; in addition, overlap between samples is very minimal. This is expected from the lab samples, since the samples are composed of different inclusion types. Figure 5 provides an illustration of how the original variables contribute to the principal components. The arrows shown on the plot represent the contribution of the original variables to PC1 (by the magnitude in the horizontal direction) and PC2 (by the magnitude in the vertical direction).

Fig. 4
figure 4

Scatter plot of inclusions in laboratory samples, transformed to 2D using PCA

Fig. 5
figure 5

Variable contribution to each PC, for lab samples

To understand the distribution of data points, both Figures 4 and 5 need to be considered simultaneously. The red data points of sample 2 can be broadly broken down into 2 groups, one group with positive PC1 values (on the right), and the other with negative PC1 values (to the left of the plot). From Figure 5, it is clear that the former group contains higher Ca contents and the latter group has higher Mg contents, corresponding to solid calcium aluminate and spinel inclusions, respectively. Due to the nature of this data set, composed of only three input variables with some inclusions (e.g., the spinels) comprising only two of the input variables, some of the data points form a straight line, as shown in Figure 4. These specific data points, or inclusions, having similar chemistries with slight deviations in cation mole fractions, correspond to the inclusions of lab sample 2 lying on the Mg/Al and Ca/Al axes of the ternary diagram (Figure 3); thus, they are not influenced by the third variable, and as a result they form a straight line on the scatter plot.

A summary of the cluster analysis, performed using the EM algorithm, is presented in Table III, for the lab samples. The table displays the average chemistry, the number and percentage of inclusions, and the average inclusion area, for each cluster. The table is ordered in decreasing cluster size, quantified by the number of inclusions per cluster.

Table III Summary of Cluster Analysis, for Lab Samples

As stated earlier, initially the BIC was evaluated for each sample; if the BIC identified the optimal number of clusters as one, then the data were assumed to be single clustered; otherwise, the optimal number of clusters was computed using the Silhouette index. For the laboratory data set, the clustering algorithm identified one cluster was optimal for sample 1 and two clusters were optimal for sample 2.

The assignment of clusters can help simplify ternary diagrams, as shown in Figure 6. Here the average chemistry of each cluster is plotted as a semi-transparent circle, where the size of each circle is proportional to the cluster’s total inclusion area. This was useful when comparing the relative amounts of inclusions in each cluster. Considering Figure 6(b) along with Table III, the two clusters identified in sample 2 pertain to calcium aluminate and spinel inclusions, with calcium aluminate inclusions dominating the inclusion population.

Fig. 6
figure 6

Mg-Ca-Al cluster average ternary plots for laboratory samples. The center of the circles represents the cluster centroids, and the size of each circle is proportional to the total area fraction of inclusions in that particular cluster. (a) Sample 1, (b) Sample 2

Industrial Samples

The inclusion chemistry distributions for the industrial samples are presented in Figure 7. Inclusions comprised Mn, S, Al, and Ca, with traces of Mg, Ti, and Si. Thus, analysis should strictly have accounted for all seven chemistry variables, but there could be a reasonable reduction to the four major components—Mn, S, Al, and Ca. This still required two ternary diagrams per sample, and in this case S-Ca-Al and S-Mn-Si ternaries were presented. Samples A and B were similar to each other and were predominately (Ca,Mn)S; inclusions in samples C and D were similar to each other and had more calcium aluminate oxide inclusions.

Fig. 7
figure 7

Inclusion chemistry distributions of industrial samples, in the form of S-Ca-Al and S-Mn-Si Ternary Diagrams. Diagrams for Sample A shown on the top-left, sample B on the top-right, sample C on the bottom-left, and sample D on the bottom-right

As with the laboratory-scale samples, all four samples were consolidated into one large data set, for PCA. PC1 represents 88.1 pct of the variance, and PC2 represents 6.8 pct, and thus both PCs combined retain 94.9 pct of the original data’s variance. The PCA transformation of the industrial data set was performed using the following equations:

$$ {\text{PC1}} = - 0.027{\text{Mg}} - 0.716{\text{Al}} + 0.002{\text{Si}} + 0.468{\text{S}} - 0.198{\text{Ca}} - 0.007{\text{Ti}} + 0.478{\text{Mn}} $$
(5)
$$ {\text{PC2}} = 0.007{\text{Mg}} + 0.331{\text{Al}} - 0.008{\text{Si}} - 0.391{\text{S}} - 0.569{\text{Ca}} - 0.014{\text{Ti}} + 0.643{\text{Mn}}. $$
(6)

The coefficients represent the loading of each of the original variables to the PCs. PCA results are presented in Figure 8.

Fig. 8
figure 8

PCA scatter plots of industrial samples. Sample A shown on the top-left panel, sample B on the top-right panel, sample C on the bottom-left panel, and sample D on the bottom-right panel

Figure 8 shows that samples A and B have similar chemistry distributions, and samples C and D display a similar trend, with sample C skewing more to the negative values of PC1. Samples C and D appear to differ from A and B because of the large number of data points at PC1 < 0. Figure 9 provides an illustration of how the original variables contribute to the principal components. As expected, inclusion chemistries were dominated by Al, Mn, S, and Ca, while the Mg, Ti, and Si contents are negligible.

Fig. 9
figure 9

Variable contribution to each principal component, for industrial samples

Thus, from the results presented in PCA, it can be inferred that samples A and B were similar to each other, and C and D were similar to each other. Inclusions in samples A and B contained primarily Mn and S, and Ca and Al contents were higher in samples C and D.

The cluster analysis results for the industrial data set are presented in Table IV and Figure 10. Two clusters were identified for both samples A and B: one cluster with predominately MnS inclusions and the other with complex inclusions, (Ca,Mn)S + calcium aluminate oxides. The largest cluster in both samples corresponded to the MnS inclusions. In samples C and D, four and three clusters were identified, respectively. Again MnS and complex (Ca,Mn)S + calcium aluminate oxide inclusions were identified. However, in this case the (Ca,Mn)S + oxide inclusion clusters were larger. There was also some overlap of clusters in these samples, suggesting there was opportunity for additional optimization of the analysis method.

Table IV Summary of Cluster Analysis, for Industrial Samples
Fig. 10
figure 10

S-Ca-Al cluster average ternary plots for industrial samples. The center of the circles represents the cluster centroids, and the size of each circle is proportional to the total area fraction of inclusions in that particular cluster. (a) Sample A, (b) Sample B, (c) Sample C, (d) Sample D

Discussion

In this work, PCA and cluster analysis were employed to analyze a controlled laboratory-scale data set and a relatively large industrial inclusion data set. The lab-scale data set was limited to two samples, where both samples had well-defined inclusion compositions. The analyses were carried out on the controlled samples in order to examine their applicability. PCA enabled the visualization of the entire inclusion population from both samples on a single plot, and it is clear from this plot that inclusions in sample 1 have relatively similar chemistry, while inclusions in sample 2 can be roughly divided into two inclusion types. This was also asserted by the cluster analysis results, identifying one cluster in sample 1 and two clusters in sample 2. Based on average cluster composition, sample 1 formed predominantly liquid MgO-Al2O3-CaO inclusions, and the two clusters in sample 2 were identified as spinel and solid calcium aluminate inclusion clusters. The samples were specifically selected to demonstrate whether the cluster analysis would identify one and two clusters for samples 1 and 2, respectively, using the BIC and Silhouette index, and the results presented verified this presumption.

With respect to the industrial samples, the data were measured from four samples, A and B, which exhibited good machinability, and C and D, which exhibited poor machinability. A complete representation of the inclusion population required numerous ternary diagrams, while PCA and clustering could condense this information. The results showed that the two sets of samples were indeed different. Samples A and B contained predominately MnS inclusions and C and D had larger amounts of complex (Ca,Mn)S + calcium aluminate oxide inclusions. Below some observations on the data analysis methods and the results are discussed.

Using PCA, an initial comparison between samples and inclusion populations was visualized in one 2D plot. The ability to visualize high dimensional data is a useful application of PCA. PCA has also been used as a method to explore data sets and reveal the existence of potential clusters of data points. For the purpose of this study, observations from all samples were consolidated into a single data set, since in both cases, the laboratory and industrial data sets, samples were processed to produce the same steel chemistry. Otherwise, using PCA to compare disparate samples on the same plot would not necessarily be beneficial, since a scatter plot of the first 2 PCs will only display a dissimilarity between observations from each sample, which was already expected. For example, performing PCA on an Al-killed sample and a Si-Mn killed sample would be trivial since these samples are known to be different and the PCA results will only highlight the expected difference. PCA would be most useful in reducing the number of diagrams needed for visualization, particularly if sulfide or nitride inclusions would be considered along with oxide inclusions. If sufficient data are available, PCA could be a process monitoring tool,[33,34] where new inclusion populations could be benchmarked against an existing baseline.

The cluster analysis performed was successful in condensing the large data sets into a few points that were representative of the inclusion chemistries. In some cases, clusters were not uniquely defined in a sample and sometimes two different clusters had similar chemistries, such as the red and green clusters of sample C and the black and red clusters of sample D. This was a consequence of the automated nature of the analysis. The clustering algorithm was generally successful at recognizing and classifying the inclusion chemistry groups. In addition, the algorithm performed well in assessing the possibility of having single-clustered data, as shown in the results presented for lab sample 1.

Once the industrial data were grouped by chemistry, additional analyses were performed to examine the relationship between chemistry, average inclusion area, and average inclusion aspect ratio. To examine these correlations, each cluster from Table IV was classified as either a MnS cluster or complex ((Ca,Mn)S + oxide) inclusion cluster based on its chemical composition. A summary of the average inclusion area and aspect ratio per cluster is provided in Figure 11, along with representative BSE images of inclusions in each cluster.

Fig. 11
figure 11

Relationship between average inclusion areas and aspect ratios in each cluster. The BSE images represent what an average inclusion would look like in the respective cluster

The MnS inclusions in all samples had similar average areas. The average inclusion area of the complex oxides was greater for samples C and D. The average aspect ratios (ratio of the longest 2D dimension to its perpendicular width) of the MnS inclusions were higher in samples C and D, indicating that they were slightly elongated. Although elongated, these inclusions were still small and so would not likely have the detrimental effect of stringer MnS on mechanical properties.

In summary, all four samples contained essentially two different inclusion populations—MnS and complex (Ca,Mn)S + calcium aluminate oxide inclusions. In samples C and D, the complex inclusions were more numerous and they were larger than the complex inclusions in samples A and B. The MnS inclusions were globular in samples A and B, while they were more elongated in samples C and D. These differences in inclusion population likely affected the part machinability, as it is well known that inclusions play an important role in machinability.[35]

Numerous studies have been conducted on the relationship between inclusions, steel processing, and machinability (for a recent review see Ref. [20]). It was not possible to develop many connections between the inclusions, processing, and machinability in this work. The steels in this work were 4140 alloy steel with approximately 300 ppm S and all samples were Al deoxidized and Ca treated. Generally, this composition and processing should lead to good castability, machinability, and properties.[36,37] The samples for this study were provided in the partially finished state and details of the refining and Ca treatment processes were not available; furthermore, significant changes to inclusion populations can occur upon the solidification of the samples.[38] The main purpose of this study was to introduce the use of PCA and clustering methods to analysis of inclusion distributions.

The objective of this study was to demonstrate alternative methods for classifying and analyzing inclusion populations. The method of PCA was shown to produce a 2D representation of multivariate data, from which differences between samples could be noted. Cluster analysis provides an automated method for classifying inclusion compositions without user-defined rules. The clusters can then be analyzed individually. They are amenable to much larger data sets than the four samples investigated here and future efforts will be devoted to application of these and other techniques to improve the insights that can be gained from automated non-metallic inclusion analysis data.

Conclusions

In this study, the differences in non-metallic inclusion population between two sets of samples were analyzed with PCA and cluster analysis. The first data set, a pair of laboratory-prepared samples with well-defined and controlled inclusion chemistries, was utilized to validate the statistical techniques. And the second data set, four industrial-scale samples, grade 4140, Al deoxidized, and Ca treated: two samples (A and B) were noted to have better machinability, while the other two samples (C and D) were noted to have poorer machinability. The following observations were noted:

  • PCA reduced several chemistry variables to a 2D plot where differences were noted between samples.

  • Cluster analysis, using the BIC and Silhouette index to define the number of clusters and the expectation–maximization algorithm to perform the clustering, automatically classified inclusions into groups. Essentially two groups: MnS and (Ca,Mn)S + calcium aluminate oxide inclusions, for the industrial samples. This was done without any user-specified composition ranges.

  • For the laboratory-scale samples, both PCA and clustering generated promising results. PCA enabled the visualization of the multivariate inclusion data set on simple 2D plots, providing an initial comparison between inclusion distributions of various samples. And cluster analysis was successful in identifying the inclusion groups present in each sample.

  • Considering the industrial data set, once the inclusions were clustered it was found that samples C and D contained a higher number of larger (Ca,Mn)S + oxide inclusions compared to samples A and B, which contained higher numbers of smaller MnS inclusions. The differences in machinability were attributed to the differences in the inclusion populations.