
1 Introduction

Sensors are located in environments that change dynamically and are required not only to detect the values of all measured parameters, but also to assess the situation and alert accordingly, based on predefined rules managed by a “Classifier-engine”. Since the environment changes dynamically, new situations that were not known in advance may arise, reflected in new combinations of parameter values; hence the sensor’s classifier must be updated dynamically.

One exemplary application of such a dynamic classification unit (DCU) is a screening gate with biometric sensors that screens travelers entering a high-security area such as an airport. The sensors may be configured to test multiple parameters of a traveler, such as heart rate, blood pressure, perspiration, etc. The classification system may be set to distinguish two classes of travelers: the bulk of travelers, who have “normal” parameters and should pass the biometric screening without interference, and those who should be checked by security personnel. Upon receiving a new case, the sensing system determines whether it can be classified into one of the existing classes or whether it should trigger a change in the classification scheme. Thus, over the course of a day, environmental conditions may change; ranges of values that have not been observed before may appear, causing dynamic updates of the sensor’s classifier.

Changes in the sensor’s classifier (i.e., the classification scheme) are triggered by a threshold on a segmentation-error parameter. The classifier is based on small data buffers and remembers a limited set of “representative” cases for each class (case-buffers). When the trigger fires, one or more case-buffers are dynamically regrouped into a new composition of buffers, according to segmentation quality criteria.

The novelty of this real-time mechanism lies in the fact that the entire process relies on limited memory buffers. In addition, each DCU, which is a remote autonomous sensing system, can communicate with multiple additional remote autonomous sensing systems. In such situations, the case-buffers, as well as the case history, can be synchronized and managed via a central controller. Furthermore, in a distributed environment, regardless of the existence of a central controller, the contents of the case-buffers and the classifier scheme of each DCU can be synchronized among the multiple remote autonomous agents (sensing systems). Synchronization may be performed after each regrouping process; that is, each incremental update at any local DCU may initiate synchronization among all connected autonomous agents.

2 Related Work

In today’s dynamic data environment, where huge amounts of raw data and information flow ceaselessly, the main goal of individuals and organizations is to discover the optimal way to uncover the hidden potential in the data, through constant cooperation between human intelligence and machine capabilities. Techniques and models that functioned successfully in a stable data environment are outdated and must be adapted to the dynamic one. “Databases are growing in size to a stage where traditional techniques for analysis and visualization of the data are breaking down” [1, 2]. Because of the constant increase in data volume, interpreting the similarities of different sub-populations becomes a new dimension of the data mining goal. Data usually flows from different sources and has to be handled and processed simultaneously [3]. The development of new and advanced data mining techniques for dynamic data environments covers more and more fields, for instance computer science, medicine [4], security systems [5], and social networks [6, 7]. This is not merely an application of existing algorithmic tools in these fields, but the inclusion of elements, logic, and even tools created purposefully for them.

As a result of the constant need for real-time solutions, research has naturally turned to a new field: incremental data processing. The motivation is to maximize the quality of solutions while minimizing the processing cost [8,9,10]. The algorithmic tools have to be adjusted to the dynamic data environment and be capable of absorbing significant amounts of data, possibly coping with a Big Data environment. The main idea of incremental techniques is to use small segments of data rather than the whole historical data [11,12,13].

One of the commonly used directions in data mining is the classification process, in which objects are assigned to homogeneous groups, with maximal diversity between groups and minimal diversity within them. The proximity of an object to a group centroid is usually measured by similarity measures, such as the RMSE (root mean square error) used in the current paper [14]. Classification tasks are usually divided into two main types: if the target attribute is known in advance, the process is called “classification”; if it is not, it is called “clustering” [2, 10, 15, 16]. In clustering problems, the interpretation of the obtained clusters is one of the main challenges. For example, if a higher-education consultant has to recommend the best faculty for a prospective student, he will probably pick the faculty name from the university’s existing list. On the other hand, if a bank’s security controller needs to identify a new type of financial fraud, he must be open-minded enough to classify the fraudster’s action into an entirely different type and give it an appropriate description. In some cases, a set of groups/classes must be created from the items/customers/actions to be classified, without any information about the target attribute. Different kinds of classification/clustering tasks in a dynamic data environment, combining existing and new techniques, have become the basis for extensive research [17,18,19].

The current research presents a dynamic classifier based on an incremental dynamic clustering process. It permits the use of small data buffers that represent the existing groups. This approach differs significantly from other approaches and methods considered in the literature.

The dynamic classifier proposed in the current paper functions as a sensor. It works as a screening gate that distinguishes “regular” items, which are close enough to at least one of the existing groups, and alerts when a relatively “different” item occurs. The model provides not just an alert in such a situation, but also the action required to classify the item. Many studies combine different sensors in decision-making processes [20,21,22,23].

3 Model Architecture

Figure 1 presents a schematic architecture of the proposed system, showing one DCU connected to a central controller as well as to other remote autonomous sensors-agents.

Fig. 1. Schematic architecture of the system.

The real-time data flows through the “Sensor” (Fig. 1, first component) to the “Classifier-engine”. The “Classifier-engine” performs a decision process based on a classification scheme, as described in the flow diagram in Fig. 2. The “Gate” component represents the output, i.e., the decision regarding the classification of the object. Based on a threshold on the segmentation-error parameter and on segmentation quality criteria, the DCU incrementally updates the population of the relevant “Case-buffers”. The mechanism that manages the population (i.e., the cases) stored in each buffer can use diverse policies, such as a FIFO (First-In First-Out) policy, or a selection policy that stores the extreme, farthest cases of each group (“outliers” that are still classified to that group).
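To make the buffer policies concrete, the following Python fragment is a minimal sketch rather than the system’s actual implementation; the class and method names are ours. It shows a bounded case-buffer with the two policies mentioned above: FIFO eviction, and a selection policy that keeps the cases farthest from the group centroid.

    from collections import deque
    import numpy as np

    class CaseBuffer:
        """A bounded buffer of representative cases for one group (sketch)."""

        def __init__(self, z_max, policy="fifo"):
            self.z_max = z_max    # maximal buffer size (Zmax)
            self.policy = policy  # "fifo" or "farthest"
            self.cases = deque()

        def add(self, case, centroid=None):
            self.cases.append(np.asarray(case, dtype=float))
            if len(self.cases) <= self.z_max:
                return
            if self.policy == "fifo":
                self.cases.popleft()  # drop the oldest case
            else:
                # Keep the Zmax farthest cases: evict the case closest
                # to the centroid, so the "outliers" remain buffered.
                dists = [np.linalg.norm(c - centroid) for c in self.cases]
                del self.cases[int(np.argmin(dists))]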

Fig. 2. A flow diagram of the process for real-time data classification based on dynamic updating of the sensor’s classifiers.

The term “Sensor” represents the “funnel” through which the data stream flows. Thus, a sensor can be a physical object, as well as a logical handshake through which the data flows into the system. The flow chart, illustrated in Fig. 2, presents the real-time decision-making process for each new sensing data element (i.e. each new case).

Version control is managed by a sequential numbering approach.

Table 1 presents the notations used in the flowchart:

Table 1. Notations

    Xi      a new incoming case (item)
    e       minimal distance between Xi and the centroids of the existing groups
    δ       threshold level on the distance measure
    Zmax    maximal buffer size
    Zmin    minimal number of cases required in a buffer

The mechanism presented in Fig. 2 works as follows. The new item Xi passes through the sensor, the distances between the new item and the centroids of the existing groups are calculated, and the minimal value e is registered. The threshold level δ, the maximal buffer size Zmax, and the rest of the parameters have to be determined at this point in time. If the minimal distance e is less than the threshold, no rearrangement is needed and the new case Xi joins the closest group (completion). If e exceeds the threshold level, the number of items in the buffer is checked. If there are not enough cases (fewer than Zmin), the new case creates an entirely new group. If there are enough cases in the buffer, the oldest case is removed in favor of the new one and a new distribution is created (by splitting or merging the existing groups).
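The following Python sketch mirrors this flow under the notation of Table 1; it is an illustration under our own naming, and the regrouping step is only indicated, since the splitting/merging procedure belongs to the “Incremental-updater” and is not detailed here.

    import numpy as np

    def rmse(x, centroid):
        # RMSE distance between a case and a group centroid
        x, centroid = np.asarray(x, float), np.asarray(centroid, float)
        return np.sqrt(np.mean((x - centroid) ** 2))

    def process_case(x, centroids, buffer, delta, z_min):
        """One pass of the Fig. 2 decision flow for a new case x (sketch)."""
        distances = [rmse(x, c) for c in centroids]
        e = min(distances)                  # minimal distance measure
        if e < delta:                       # completion: join the closest group
            buffer.append(x)
            return "join", int(np.argmin(distances))
        if len(buffer) < z_min:             # too few cases: open a new group
            centroids.append(np.asarray(x, float))
            buffer.append(x)
            return "new_group", len(centroids) - 1
        buffer.pop(0)                       # remove the oldest case
        buffer.append(x)
        # Regrouping (splitting/merging of the existing groups) would be
        # triggered here by the "Incremental-updater".
        return "regroup", None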

4 Model Validation

Since the model deals with a stream of real-time data, i.e., a continuous flow of new cases, the validation was based on datasets of classification problems. The model is implemented in Python and uses a k-means algorithm package [24].

The following datasets were used: (1) the “ERA” dataset, donated by Prof. Ben-David [25]. This dataset was originally gathered during an academic decision-making experiment. The input attributes are candidates’ characteristics (such as past experience, verbal skills, etc.); the output attribute is the subjective judgement of a decision-maker as to the degree to which he/she tends to accept or reject the applicant for the job. All input and output attributes have ordinal values. The dataset contains 1000 instances, four input attributes, and one output attribute. (2) The “Car Evaluation” dataset, retrieved from the UCI Machine Learning Repository [26,27,28]. The input attributes are car properties and the output attribute is a class value (unacceptable, acceptable, good, very good). The dataset contains 1728 instances.

4.1 Optimal Situation as a Baseline

The theoretical optimum in this case is the situation in which the algorithm runs over the entire dataset. Thus, based on the results obtained by the k-means clustering algorithm when analyzing all the records in the dataset, we can find the best set of rules, and the required total number of rules, that achieve the best classification accuracy.
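As an illustration, this baseline can be computed with an off-the-shelf k-means implementation. The snippet below assumes scikit-learn, which may differ from the package actually used in [24], and takes the number of groups k as given.

    from sklearn.cluster import KMeans

    def baseline_centroids(X, k, random_state=0):
        # Fit k-means over the ENTIRE dataset; the resulting centroids
        # serve as the "best set of rules" baseline.
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        km.fit(X)
        return km.cluster_centers_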

4.2 The Initial Stage

Following the widely used methodology in machine learning, each dataset was divided into a training set (about two thirds of the data) and a test set (about one third). The training set provides the initial groups and the test set simulates a new data stream. It is worth mentioning that the initial stage is mainly used to shorten the “reset-cycle” of the decision-making process. In cases where there is no urgency, the system can start with no decision rule at all and with totally empty “Case-buffers”.
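A minimal sketch of this initial stage, again assuming scikit-learn (the function and parameter names are ours):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.model_selection import train_test_split

    def initialize(X, n_groups, z_max, random_state=0):
        # Split roughly 2/3 train, 1/3 test; the test part later
        # simulates the real-time data stream.
        X_train, X_test = train_test_split(X, test_size=1/3,
                                           random_state=random_state)
        km = KMeans(n_clusters=n_groups, n_init=10,
                    random_state=random_state).fit(X_train)
        # Fill each case-buffer with the training cases of its group,
        # truncated to the buffer size Zmax.
        buffers = [X_train[km.labels_ == g][:z_max].tolist()
                   for g in range(n_groups)]
        return km.cluster_centers_, buffers, X_test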

4.3 The “Dynamic-Flow” Stage

The test set was used in an unsupervised mode (hiding the target-label field). The records flowed through the “Sensor” to the “Classifier-engine” without any information regarding the correct classification.

  • The “Classifier-engine” and the “Incremental-updater” used the flow diagram mechanism described in Fig. 2.

  • The delta symbol (δ) in Fig. 2 is the threshold applied to the root-mean-square error (RMSE) distance measure; a formal definition follows this list.

  • The parameters in each experiment were set as follows:

    ERA dataset: three threshold levels: 2, 2.25, 2.5; initial number of groups: 10; buffer size: 25; training set = 600, test set = 400.

    Car Evaluation dataset: three threshold levels: 0.8, 0.9, 1; initial number of groups: 15; buffer size: 25; training set = 1400, test set = 328.
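For completeness, the distance measure can be written explicitly. The following formulation is our reconstruction from the description above, where Xi = (x_{i1}, …, x_{id}) is a d-dimensional case and C_k is the centroid of group k:

    \mathrm{RMSE}(X_i, C_k) = \sqrt{\frac{1}{d}\sum_{j=1}^{d}\left(x_{ij} - c_{kj}\right)^{2}},
    \qquad e = \min_{k}\ \mathrm{RMSE}(X_i, C_k),

and a case is classified directly whenever e < δ.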

In accordance with the schematic architecture of the system (Fig. 1), a case is either directly classified or initiates an incremental reevaluation (supported by the “Incremental-updater” component) until the threshold is satisfied; then the “Case-buffers” and the “Classifier-engine” are dynamically updated.

5 Results and Discussion

As shown in Figs. 3 and 4 and in Table 2, although the learning mechanism uses only small data increments, it achieves good and consistent results. Figures 3 and 4 present the dynamics of the group-set updating for different threshold levels. The process converges in both datasets for all sensitivity levels.

Fig. 3. Rules convergence using k-means with the “ERA” dataset.

Fig. 4. Rules convergence using k-means with the “Car Evaluation” dataset.

Table 2. The dynamic incremental updating of group set, according to threshold level.

We can see that for all three threshold values the classification process converges. In order to trace the dependence of the growth rate of the total number of groups on the sensitivity level, we chose three threshold levels for each dataset. Convergence is faster when the threshold corresponds to a lower accuracy requirement, but even at a high-accuracy threshold, relatively rapid convergence was achieved. The application of this result is very practical: on the one hand, the dynamic data environment forces us to act in real time, which is why we use small increments of data in order to classify objects immediately; on the other hand, we need to provide good classification results and identify new trends or significant changes in the data distribution. The convergence of the classification process shows the ability of the proposed model to catch the critical moments when an update is needed, without too much computational effort. The updated group set becomes more and more representative, which is why the periods between consecutive updates lengthen.

Table 2 presents the numerical results of all experiments on the two datasets. The RMSE distance measure was calculated for each classified item (in most cases the distance between the item and at least one of the existing groups is less than the threshold level, so the item joins the existing group; sometimes the threshold is exceeded and an update is needed). The average and standard deviation of all minimal RMSE values were calculated for each experiment, and the total number of groups at the end of each experiment is also reported. As the sensitivity of the threshold level decreases (higher values of δ), the average distance measure grows. This result is expected: if the threshold level is relatively high, fewer items are flagged as “far” or “non-similar” and more items succeed in joining existing groups. Their minimal RMSE values are weighted into the calculation of the average RMSE, yielding a larger result. The same effect usually appears in the standard deviation.
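The aggregation behind Table 2 amounts to simple summary statistics over the per-item minimal RMSE values; a small sketch (the function name is ours):

    import numpy as np

    def summarize_experiment(min_rmse_values, n_groups_final):
        # Average and standard deviation of the minimal RMSE values
        # observed over one experiment, plus the final group count.
        v = np.asarray(min_rmse_values, dtype=float)
        return {"avg_rmse": v.mean(),
                "std_rmse": v.std(ddof=1),
                "total_groups": n_groups_final}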

In light of the above, we can see that the proposed incremental dynamic mechanism achieves good results that can be adopted in industry as well as in academic research.

6 Conclusions

The dynamic incremental classifier presented in this paper is designed to improve the classification process in a dynamic data environment. The constant changes in data characteristics and preferences require immediate responses from the mechanism. In addition to this obligatory condition, the process has to be economical. There is no dispute that the highest-quality solution would be obtained by updating over the whole relevant data, but this is not possible in a dynamic data environment. Since we assume it is not possible to revise all previous data, we chose to demonstrate an incremental mechanism that functions using small data buffers.

Experiments with different datasets showed that the loss of quality in the classification results is not significant, and that the mechanism succeeds in identifying the important changes in the data stream and converges during the process.

Further research is planned in several directions: dealing with larger datasets that simulate a big data environment; new-trend and outlier detection; text data processing; etc.