1 Introduction

Since McKinsey published the report “Big Data: The Next Frontier for Innovation, Competition, and Productivity” in 2011, big data technology has received extensive attention across industries. The report describes big data as data sets whose size exceeds the ability of typical database software tools to capture, store, manage, and analyze [1].

In recent years, the civil aviation industry has continued to grow rapidly: its transportation scale and route network have expanded, and its transport capacity has increased significantly. The growth in air traffic flow also produces large volumes of complex air traffic control (ATC) operational data. Safety-related information can be mined from these data to identify hazard sources and rank their importance accurately, which bears on whether civil aviation can continue to develop safely and efficiently.

Air traffic management is a multi-level, dynamic process, and air traffic control involves many types of massive data. The storage, classification, analysis, and application of air traffic control data has therefore become a new research direction. Existing ATC data mining focuses on digital system communication and voice data, with the aim of avoiding congestion of voice channels and eliminating misunderstandings between pilots and controllers during communication, so that the data can be exploited more deeply in the future and air traffic forecast accuracy can be improved. The System Wide Information Management (SWIM) system can integrate flight, performance, geography, weather, and other types of data in a three-layer deployment to achieve open, flexible, and secure information management [2]. Tasha et al. [3, 4] applied cluster analysis to weather and aeronautical data to obtain a possible distribution of aircraft abilities. Ning [5] studied delay data among three busy airports in the United States by establishing a Bayesian network model and obtained the law of delay propagation between airports. In the wider transportation field, Anderson [6] used the K-means method to study traffic data in London and thereby identify traffic accidents. Wang [7] improved the Apriori algorithm, reduced the traffic accident data from the mobile phone, and then used association rule mining to investigate the causes of road traffic accidents in terms of accident relations and accident attributes.

At present, data mining technology has also been applied to the operational risk and safety management of ATC. The emphasis has been on the post-event analysis of ATC operations, and certain results have been achieved [8, 9]. Investigations into the causes of unsafe control incidents have found that ATC operational safety incidents tend to occur when several factors coexist. Since the Apriori algorithm was put forward, its application has been widely verified in various fields [10,11,12,13,14]. Therefore, this paper studies the application of data mining based on the Apriori algorithm to ATC operational safety.

2 Selection of Research Variables

With the growth of air traffic flow in China, the operational safety of air traffic control has become increasingly important and has turned into a research hotspot in this field [15]. Air traffic controllers need to handle all kinds of dynamic information with limited resources and make proper air traffic control decisions. In recent years, there have been many unsafe air traffic incidents. It is therefore increasingly important to identify the factors that affect the safety of air traffic management and to analyze the impact of each factor on unsafe incidents [16].

The factors influencing the safety of ATC operations fall mainly into three groups: human factors, environment, and equipment (see Fig. 1).

Fig. 1. Influence factors of air traffic management safety

However, different factors influence the operational safety of ATC to different degrees, and the factors also affect one another. For example, a deterioration of the airspace environment or of the control equipment leads to an increase in control workload, and an increased control workload is well known to be an important factor affecting the safety of control operations. It is therefore necessary to examine the relationships among these factors, which will help to find the key points for preventing unsafe air traffic management events at their source.

3 ATC Operational Data Mining Analysis

3.1 Analysis of Association Rules

The purpose of association rule mining is to find all the strong association rules in a database, which is done by mining frequent itemsets. A rule is written \( A \Rightarrow B \), where \( A \) is the "if" part and \( B \) is the "then" part.

Assume that \( I = \left\{ {I_{1} ,I_{2} ,I_{3} , \cdots ,I_{k} } \right\} \) is the set of all items. Any \( X \subset I \) or \( Y \subset I \) is called an itemset. An itemset that contains \( k \) items is called a \( k \)-item set.

Assume that the set of all items in the database is \( I = \left\{ {I_{1} ,I_{2} ,I_{3} , \cdots ,I_{k} } \right\} \) and that \( D = \left\{ {T_{1} ,T_{2} ,T_{3} , \cdots ,T_{n} } \right\} \) is the database, where \( T_{i} = \left\{ {I_{i1} ,I_{i2} ,I_{i3} , \cdots ,I_{ik} } \right\} \). If every element \( I_{ij} \) (\( j \in \left[ {1,k} \right] \)) of \( T_{i} \) satisfies \( I_{ij} \in I \), then \( T_{i} \) is called a transaction in the database.

\( A \Rightarrow B \) is an association rule, where \( A \) and \( B \) must satisfy \( A \subset I \), \( B \subset I \), and \( A \cap B = \varnothing \) at the same time. \( A \) is the premise of the rule, and \( B \) is the result.

Many association rules can be obtained from the sample data, but not all of them are valid: some describe only a weak association and are of little use. It is therefore necessary to judge the validity of each rule using measures of association. The most commonly used measures are confidence and support.

(1) Rule Confidence. Confidence measures the accuracy of a simple association rule. It is the probability that \( Y \) appears in a transaction given that \( X \) appears in it. The mathematical expression is as follows:

$$ C_{x \to Y} = \frac{{\left| {T\left( {X \cap Y} \right)} \right|}}{{\left| {T\left( X \right)} \right|}} $$
(1)

\( \left| {T\left( X \right)} \right| \) represents the number of transactions that contain \( X \), and \( \left| {T(X \cap Y)} \right| \) represents the number of transactions that contain both \( X \) and \( Y \). The larger the confidence \( C_{x \to Y} \), the more likely \( Y \) is to appear when \( X \) appears.

(2) Rule Support. Support measures the generality of a simple association rule. It is the probability that \( X \) and \( Y \) appear together in a transaction. The mathematical expression is as follows:

$$ S_{x \to Y} = \frac{{\left| {T\left( {X \cap Y} \right)} \right|}}{\left| T \right|} $$
(2)

\( \left| T \right| \) represents the total number of transactions, and if the support degree \( S_{x \to Y} \) is low, the rules are not universal.

The ideal association rules need high confidence and support. If the support degree is high and the confidence level is low, the credibility of the rules is low. If the confidence of the rules is high but the support is low, the application scope of the rules is small.

Assume that \( D \) is a database and \( X \), \( Y \) are item sets. If the support \( s \) and confidence \( c \) of \( X \Rightarrow Y \) are not less than the minimum support \( \hbox{min} \_s \) and the minimum confidence \( \hbox{min} \_c \), respectively, then \( X \Rightarrow Y \) is a strong association rule.
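As an illustration of Eqs. (1) and (2) and of the strong-rule test, the following minimal Python sketch computes support and confidence over a small made-up transaction database. The function names (`count_containing`, `support`, `confidence`, `is_strong_rule`), the items, and the thresholds are assumptions introduced here for illustration; they are not taken from the data used in this paper.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def count_containing(transactions: List[Transaction], items: Iterable[str]) -> int:
    """Number of transactions that contain every item in `items`."""
    wanted = frozenset(items)
    return sum(1 for t in transactions if wanted <= t)


def support(transactions: List[Transaction], x: Iterable[str], y: Iterable[str]) -> float:
    """Eq. (2): transactions containing both X and Y, divided by |T|."""
    both = set(x) | set(y)  # a transaction must contain all items of X and of Y
    return count_containing(transactions, both) / len(transactions)


def confidence(transactions: List[Transaction], x: Iterable[str], y: Iterable[str]) -> float:
    """Eq. (1): transactions containing both X and Y, divided by |T(X)|."""
    n_x = count_containing(transactions, x)
    if n_x == 0:
        return 0.0  # X never occurs, so the rule cannot be evaluated
    return count_containing(transactions, set(x) | set(y)) / n_x


def is_strong_rule(transactions, x, y, min_s: float, min_c: float) -> bool:
    """Strong rule: support and confidence are not less than the thresholds."""
    return (support(transactions, x, y) >= min_s
            and confidence(transactions, x, y) >= min_c)


# Hypothetical toy database (illustrative only, not real ATC data).
TOY_DB: List[Transaction] = [
    frozenset({"A", "B", "E"}),
    frozenset({"B", "E"}),
    frozenset({"A", "B"}),
    frozenset({"B", "C", "E"}),
    frozenset({"A", "D"}),
]

print(support(TOY_DB, {"B"}, {"E"}), confidence(TOY_DB, {"B"}, {"E"}))
```

On this toy database the rule B ⇒ E has support 3/5 and confidence 3/4, so it would count as strong for assumed thresholds of 0.5 and 0.7.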

Therefore, in order to select rules with a certain degree of confidence and support from the many simple association rules, we need to set a minimum confidence threshold and a minimum support threshold; only rules whose confidence and support reach these thresholds are considered effective. The thresholds should also be set reasonably: if they are too small, the generated rules may not be representative, and if they are too large, no rules may be found that meet them.

In general, a mined simple association rule that meets the preset thresholds is considered effective. In practice, however, such a rule may still not be useful: confidence and support measure the validity of an association rule, but they cannot measure whether it is practical or meaningful. We therefore also need to consider the rule lifting degree (lift). Its mathematical expression is as follows:

$$ L_{x \to Y} { = }\frac{{C_{x \to Y} }}{{S_{Y} }} = \frac{{\left| {T\left( {X \cap Y} \right)} \right|}}{{\left| {T\left( X \right)} \right|}}/\frac{{\left| {T\left( Y \right)} \right|}}{\left| T \right|} $$
(3)

The rule lifting degree reflects the impact of the appearance of \( X \) on the probability that \( Y \) appears. A rule is meaningful when \( L_{x \to Y} > 1 \), which shows that \( X \) has a promoting effect on \( Y \).
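To make Eq. (3) concrete, the sketch below computes the lifting degree on a small made-up transaction database. The function name `lift` and the toy data are assumptions for illustration only.

```python
from typing import FrozenSet, List


def lift(transactions: List[FrozenSet[str]], x: FrozenSet[str], y: FrozenSet[str]) -> float:
    """Eq. (3): confidence of X -> Y divided by the support of Y alone."""
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)
    n_y = sum(1 for t in transactions if y <= t)
    n_xy = sum(1 for t in transactions if (x | y) <= t)  # contains both X and Y
    if n_x == 0 or n_y == 0:
        return 0.0  # the rule cannot be evaluated if X or Y never occurs
    return (n_xy / n_x) / (n_y / n)


# Hypothetical toy database (illustrative only).
TOY_DB = [
    frozenset({"A", "B", "E"}),
    frozenset({"B", "E"}),
    frozenset({"A", "B"}),
    frozenset({"B", "C", "E"}),
    frozenset({"A", "D"}),
]

lift_b_e = lift(TOY_DB, frozenset({"B"}), frozenset({"E"}))
lift_a_e = lift(TOY_DB, frozenset({"A"}), frozenset({"E"}))
print(lift_b_e > 1, lift_a_e > 1)
```

Here B ⇒ E has confidence 3/4 against a baseline support of 3/5 for E, so its lifting degree exceeds 1, whereas A ⇒ E has confidence 1/3 and a lifting degree below 1.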

3.2 Association Data Mining Method

The common association rule algorithms mainly include the Apriori algorithm based on frequent itemset mining, the Decision Tree algorithm based on mutual information computation, and the Rough Set algorithm based on equivalence class partition. Because the Apriori algorithm has outstanding advantages in mining the intrinsic meaning of data and the relationships among unknown data, and because scholars have continually refined and improved it, it has become the core algorithm for simple association rules in data mining.

The basic idea of the Apriori algorithm is repeated iteration. Starting from the 1-item sets, item sets that do not reach the given support threshold are pruned to obtain the frequent 1-item sets \( L_{1} \). According to the Apriori principle, if an item set is frequent, all of its subsets are frequent. Therefore, the candidate 2-item sets \( C_{2} \) can be generated directly from the frequent 1-item sets \( L_{1} \). The candidate 2-item sets \( C_{2} \) are then pruned according to the support threshold to obtain the frequent 2-item sets \( L_{2} \), and so on, until the largest frequent item sets \( L_{k} \) are generated. The data mining process of the Apriori algorithm can therefore be divided into two steps:

(1) Generating Frequent Item Sets

(a) Generate all candidate k-item sets \( C_{k} \) from \( L_{k - 1} \), the set of frequent (k-1)-item sets. Let p and q be two different item sets in \( L_{k - 1} \). If the first k-2 items of p and q are the same and their last items differ, the last item of q is appended to p to form a candidate k-item set (an ordering condition on the two last items ensures that each candidate is generated only once). All candidate k-item sets found in this way make up \( C_{k} \).

(b) Prune \( C_{k} \). For each item set w in \( C_{k} \), check whether all of its (k-1)-item subsets are frequent item sets. If any subset does not belong to the frequent (k-1)-item sets, w is removed from \( C_{k} \).

(c) Calculate the support of each remaining item set w in \( C_{k} \):

$$ Support = \frac{{N_{i} }}{N} $$
(4)

Where \( N_{i} \) is the number of transactions that contain an item set, and \( N \) is the number of all transactions.

(d) Add the item sets that satisfy Support > minsup to the set of frequent k-item sets, denoted \( L_{k} \).

(e) If frequent k-item sets were found and k < kmax, repeat the steps above to look for frequent (k + 1)-item sets.
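Steps (a) to (e) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SPSS Modeler implementation: the function name `apriori_frequent_itemsets`, the optional `k_max` argument, and the toy database are introduced here for illustration, and the sketch keeps item sets whose support is not less than the threshold, following the strong-rule definition in Sect. 3.1 (replace `>=` with `>` for a strict threshold).

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional


def apriori_frequent_itemsets(
    transactions: List[FrozenSet[str]],
    min_support: float,
    k_max: Optional[int] = None,
) -> Dict[FrozenSet[str], float]:
    """Return every frequent item set together with its support (Eq. 4)."""
    n = len(transactions)

    def sup(itemset: FrozenSet[str]) -> float:
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-item sets L1.
    items = sorted({i for t in transactions for i in t})
    current = {}
    for i in items:
        s = sup(frozenset([i]))
        if s >= min_support:
            current[frozenset([i])] = s
    frequent = dict(current)

    k = 2
    while current and (k_max is None or k <= k_max):
        prev = sorted(tuple(sorted(s)) for s in current)
        candidates = set()
        for p, q in combinations(prev, 2):
            # (a) join: p and q share their first k-2 items
            if p[:-1] == q[:-1]:
                cand = frozenset(p) | frozenset(q)
                # (b) prune: every (k-1)-item subset must be frequent
                if all(frozenset(sub) in current for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
        # (c) + (d) compute support and keep candidates that reach the threshold
        current = {}
        for c in candidates:
            s = sup(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1  # (e) continue with the next size
    return frequent


# Hypothetical toy database (illustrative only).
TOY_DB = [
    frozenset({"A", "B", "E"}),
    frozenset({"B", "E"}),
    frozenset({"A", "B"}),
    frozenset({"B", "C", "E"}),
    frozenset({"A", "D"}),
]
result = apriori_frequent_itemsets(TOY_DB, min_support=0.4)
print(sorted("".join(sorted(s)) for s in result))
```

With an assumed minimum support of 0.4, the toy database yields the frequent item sets {A}, {B}, {E}, {A, B}, and {B, E}; no 3-item candidate is generated because the two frequent 2-item sets do not share a first item.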

(2) Generate Association Rules Based on Frequent Item Sets

From all the simple association rules that can be generated from the frequent item sets, select those whose confidence is not less than the preset minimum confidence \( \hbox{min} \_conf \); these are the valid association rules. The steps are as follows:

For every frequent item set l in L, generate all non-empty proper subsets of l.

For each such subset A of l, check whether the following condition is met:

$$ \frac{Support(l)}{Support(A)} \ge \hbox{min} \_conf $$
(5)

Here Support(l) and Support(A) are the supports of the item set l and of its non-empty subset A, respectively. If the condition holds, output the rule \( A \to \bar{A} \), where \( \bar{A} = l - A \).
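The rule-generation step and the test in Eq. (5) can be sketched as follows. The function name `generate_rules` and the support values in `FREQUENT` are hypothetical and serve only as an example; the sketch assumes the input maps every frequent item set to its support, so that the support of each subset A can be looked up.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float]  # (A, l - A, confidence)


def generate_rules(frequent: Dict[FrozenSet[str], float], min_conf: float) -> List[Rule]:
    """Apply Eq. (5) to every non-empty proper subset A of each frequent set l."""
    rules: List[Rule] = []
    for l, sup_l in frequent.items():
        if len(l) < 2:
            continue  # a rule needs a non-empty premise and a non-empty result
        for size in range(1, len(l)):
            for subset in combinations(sorted(l), size):
                a = frozenset(subset)
                conf = sup_l / frequent[a]  # Support(l) / Support(A)
                if conf >= min_conf:
                    rules.append((a, l - a, conf))
    return rules


# Hypothetical supports of frequent item sets (illustrative values only).
FREQUENT = {
    frozenset({"A"}): 0.6,
    frozenset({"B"}): 0.8,
    frozenset({"E"}): 0.6,
    frozenset({"A", "B"}): 0.4,
    frozenset({"B", "E"}): 0.6,
}
rules = generate_rules(FREQUENT, min_conf=0.7)
for premise, result, conf in rules:
    print(sorted(premise), "->", sorted(result), round(conf, 2))
```

With an assumed minimum confidence of 0.7, only B → E and E → B pass the test; both rules derived from {A, B} fall below the threshold.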

The flow chart of the process described above is shown in Fig. 2:

Fig. 2. Apriori algorithm flow chart

3.3 Algorithm Performance

If any \( (k - 1) \)-dimensional subset of a \( k \)-dimensional itemset \( x \) is not a frequent itemset, then \( x \) is not a frequent itemset either. Some elements \( c \) of \( C_{k} \) can therefore be eliminated by checking whether all \( k \) of the \( (k - 1) \)-dimensional subsets of \( c \) are in \( L_{k - 1} \). In the best case, an eliminated \( c \) requires only one scan of \( L_{k - 1} \); in the worst case, scanning continues until the \( k \)-th \( (k - 1) \)-dimensional subset is found not to be in \( L_{k - 1} \). On average, each element therefore requires \( \left| {L_{k - 1} } \right| \times k/2 \) inspections, the pruning of all candidates requires \( \left| {C_{k} } \right| \times \left| {L_{k - 1} } \right| \times k/2 \) operations, and the procedure for generating frequent itemsets requires \( \left| {C_{k} } \right| \times \left| {L_{k - 1} } \right| \times k/2 + \left| D \right| \) operations.
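As a small worked example of these estimates, the sketch below evaluates the two expressions for assumed sizes. The function names and the numbers are hypothetical and only illustrate the arithmetic.

```python
def pruning_cost(num_candidates: int, num_prev_frequent: int, k: int) -> float:
    """Average pruning cost |C_k| * |L_{k-1}| * k / 2."""
    return num_candidates * num_prev_frequent * k / 2


def frequent_itemset_cost(num_candidates: int, num_prev_frequent: int,
                          k: int, num_transactions: int) -> float:
    """Average cost of generating frequent itemsets: pruning cost plus |D|."""
    return pruning_cost(num_candidates, num_prev_frequent, k) + num_transactions


# Assumed sizes: |C_3| = 10, |L_2| = 20, k = 3, |D| = 1000.
prune = pruning_cost(10, 20, 3)
total = frequent_itemset_cost(10, 20, 3, 1000)
print(prune, total)
```

For these assumed sizes the pruning term is 10 × 20 × 3/2 = 300 and the total is 300 + 1000 = 1300, which shows that for a large database the \( \left| D \right| \) term can dominate.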

3.4 An Example of Data Mining for Safe Operation of Air Management Based on Apriori Algorithm

This paper uses the Apriori model in the SPSS Modeler software to mine association rules for unsafe events in air traffic management operations, in order to explore which factors, when present at the same time, lead to a higher probability of unsafe events. Because suitable values of rule support and rule confidence depend on the nature of the actual problem, we select a support of 20% as minsup to analyze the implied association rules in the data on air traffic management unsafe events.

Following the selection of research variables in Sect. 2, we set six antecedent factors: A: Control equipment, B: Workload of the controller, C: Psychological quality of the controller, D: Physical quality of the controller, E: Airspace environment, F: Indoor environment of the control room. The level of the air traffic management unsafe event is taken as the consequent. This paper uses general air traffic management unsafe events as the example.
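One possible way to encode such records as transactions for association mining is sketched below. The record layout, the `level=` label, the function name `to_transaction`, and the three sample records are all hypothetical; the actual input format used with SPSS Modeler is not described in this paper.

```python
from typing import FrozenSet, Iterable

# Antecedent factor codes as listed in the text.
FACTORS = {
    "A": "Control equipment",
    "B": "Workload of the controller",
    "C": "Psychological quality of the controller",
    "D": "Physical quality of the controller",
    "E": "Airspace environment",
    "F": "Indoor environment of the control room",
}


def to_transaction(present_factors: Iterable[str], event_level: str) -> FrozenSet[str]:
    """Encode one incident as an item set: factor codes plus a level item."""
    present = set(present_factors)
    unknown = present - set(FACTORS)
    if unknown:
        raise ValueError(f"unknown factor codes: {sorted(unknown)}")
    # The event level is stored as an extra item so it can act as the consequent.
    return frozenset(present | {f"level={event_level}"})


# Hypothetical incident records (made up for illustration, not real ATC data).
HYPOTHETICAL_RECORDS = [
    (["B", "E"], "general"),
    (["A", "B"], "general"),
    (["B", "C", "E"], "general"),
]
transactions = [to_transaction(f, lvl) for f, lvl in HYPOTHETICAL_RECORDS]
print(sorted(transactions[0]))
```

Storing the event level as an ordinary item keeps the data in the transaction form defined in Sect. 3.1, so rules whose result is the level item can be filtered after mining.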

With a rule support of 20%, a rule confidence of 80%, and a rule lifting degree > 1, the first ten mined association rules, ordered by support, are listed in Table 1:

Table 1. Unsafe event association rules for air traffic management (Support 20%, Confidence 85%, Rule lifting degree > 1)

It can be seen from the table that association analysis based on the Apriori algorithm is suitable for research on the operational safety of air traffic management. Taking general air traffic control incidents as an example, the main unsafe factors are the heavy workload of the controllers, the complexity of the airspace environment, and relatively old air traffic control equipment. With a rule support of 20%, each of the antecedent factors may influence the operational safety of air traffic management, but different factors and different combinations of factors affect air traffic control incidents to different degrees.

In the future, this research method can be used to conduct association analysis on other levels of ATC operational incidents, so that potential safety hazards can be identified and rectified in time and ATC operations can be carried out efficiently and safely over the long term.

4 Conclusions

Building on previous research, this paper gives a detailed interpretation of the Apriori algorithm and extends its field of application. It combines the algorithm with existing air traffic management data, applies data mining technology to the analysis of air traffic control safety management, and identifies the main factors affecting the safety of air traffic control operations and their impact. The following conclusions can be drawn:

(1) Data mining technology is feasible in the field of air traffic control operational safety. The Apriori association analysis algorithm can effectively analyze the factors influencing ATC operational safety, including the importance of each factor and the associations among the factors.

(2) With a rule support of 20%, each of the antecedent factors may influence the operational safety of air traffic management, but the degree of influence differs from factor to factor.

(3) In general, the main causes of unsafe incidents are the heavy workload of controllers, the complexity of the airspace environment, and aging control equipment. The physical and psychological quality of controllers also has a certain influence on air traffic control operations. This may be because controllers with better physical quality are more resistant to fatigue, and controllers with good psychological quality can adapt to greater work pressure, which allows them to handle complex tasks more calmly in control work.

(4) In future research, this algorithm or an improved version of it can be applied to analyze other levels of air traffic control incidents, identify important influencing factors in time, and propose improvement measures to ensure that air traffic management can run efficiently and safely over the long term.