1 Introduction

The rapid development of the Internet of Things (IoT), the affordability of sensor equipment, and maturing connectivity technologies allow us to collect large volumes of useful data. These sensing data feed applications designed to improve the quality of everyday life. However, this kind of data has distinctive characteristics. First, the data collected by sensors are usually numerical. Second, sensors sample at a relatively high frequency, and data may be collected sporadically, a few seconds at a time, depending on the sampling rate. A large amount of data is gathered in each period, but the readings may be nearly identical across successive periods because changes in the environment or in outdoor activities are slow (compared with those in body activity recognition). Furthermore, much of the data collected over a given period is repetitive and uninteresting. For example, crowd-sensing and security-oriented sensor applications collect huge amounts of normal data in the hope of detecting something that deviates from normal. The data collected in a given period may therefore contain many irrelevant instances that come from uninteresting, repetitive activities. In short, the data collected by these sensors are time series, arrive in large quantities per period, are highly repetitive, and contain some noise. As a consequence, classifiers trained directly on such data often fail to classify effectively. In our data analysis, we found that classification performance often deteriorates because the data are partitioned at too fine a temporal resolution. In daily life, we usually care more about the outcomes and phenomena of abstracted time periods. An extreme example of such abstract periods is morning, afternoon and evening.
In our new model, the data are partitioned at a coarse-grained level to reveal prominent features while keeping the data in an effective structure. The model is proposed to solve the problem of “diluted data” that arises from the operational nature of the IoT. It is empowered by a new probability-evaluation classification method that treats the input data as a sequence. The proposed mechanisms improve the robustness of the model and reduce its sensitivity to noisy training data in the streams that come from IoT sensors. As a result, the machine learning model aligns better with real-life prediction objectives, and the classification model induced with the proposed learning method is expected to be more useful than one obtained by applying the classification algorithms directly.

2 Related Work

With the advancement of hardware technology, the number of sensor devices is increasing, and the sensor-centric Internet of Things has developed rapidly. According to statistics [1], by 2017 the total value of the IoT had reached 29 billion U.S. dollars. Such a huge market has attracted the attention of both industry and academia. Over the years, people have been investigating and building smart systems such as smart homes, smart transportation, and smart security [2,3,4]. The massive increase in IoT devices provides large amounts of sensory data. How to extract valuable information from this vast amount of data and turn it into knowledge that serves life more effectively is an important issue. Some researchers have applied data mining technology to the IoT to make it more intelligent [5,6,7,8]. Clustering is commonly used in IoT data mining, and the most common clustering method is K-means [9], which is very mature in traditional data mining. Analysing the distribution of IoT data is in some cases a clustering problem [10, 11], but clustering only groups similar data together and does not say what the groups mean. If people are unfamiliar with the collected data, they cannot rely on the clustering results to extract useful knowledge. In supervised learning, decision tree algorithms are often used for data mining in the IoT [12, 13]. Probabilistic models such as the Naïve Bayesian model are also widely used [14, 15]. Another simple and efficient classification method is the SVM [16], which, combined with a kernel function, can linearly separate data in a high-dimensional space. However, traditional classification methods perform unstably on real sensor data because of the special nature of IoT data.
Through analysis and observation, we find that collected sensor data are highly repetitive and often contain noise, which may be caused by sensor detection errors. This leads to too many negative instances in the classification process, and the accuracy of the trained model decreases. In real life, people pay more attention to what happens over a period of time, which is inconsistent with the frequency at which sensors collect data. To solve this problem, this paper proposes a new type of data mining model with a pre-processing step that constructs subspaces from the initial dataset and then applies a traditional classification method. In our experiments, this model greatly improves classification accuracy. Moreover, the model is robust and can be combined with various classification methods.

3 Our Proposed Model

This section gives an overview of the model with an example and formulates the high-level mathematical model of the new pre-processing process (PP). Combined with classical classification algorithms, the pre-processing method predicts the major label of a small group of consecutive, time-ordered data instances. The method changes the unit of information from a single time point to a period of time. Each dataset tested in this paper comes from a typical wearable sensor, and its prediction class contains labels such as "walking" and "running". In daily life, be it walking or running, it is well known that adjacent instances along the sensed data stream are similar. Adjacent instances carry the same label, except for noisy instances and those located at the boundary between two different actions. When the goal is to classify which action the monitored subject is performing, it is more effective to train the machine learning model on groups of consecutive instances that share a common target label than on single instances with precise but nearly identical values. Our proposed pre-processing method generates new training and testing datasets from the original training and testing datasets. An instance in a new dataset does not merely carry the information of a single instance in the original dataset; it carries the information of a continuous period of instances. How to convert the relevant information effectively is the most important part of the design and is reported in this paper.
We first outline how the algorithm works as a whole; the details are given in the remainder of this section, followed by the experimental results. When the training dataset arrives, we extract information from it, especially from instances that share the same label. We collect sample instances from each group of instances with the same label (for example, we sample from the set of instances labelled “walking” and then do the same for the “running” set). We assume that these sample instances represent most of the traits of their class label. These sample subsets are called “standard sets”; every class label has one standard set, so if there are n labels in the dataset there are n standard sets. This is just the beginning of the transformation. In the next step, in order to simulate a normal period of instances in the dataset, we use the sampling method to collect data from each class and form n label sets. The label sets are intended to be smaller than the standard sets, because we want to compute a similarity index between standard sets and label sets using the isolation forest (iForest) detection algorithm [17], which we use to measure the non-isolation rate between two datasets.
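To make the notion of a non-isolation rate concrete, the following is a minimal pure-Python sketch of an isolation forest in the spirit of [17]. It is not the implementation used in this paper. The function names, the number of trees, the score threshold of 0.55 and the toy one-dimensional data are all illustrative assumptions.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Average path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Grow one isolation tree with random axis-parallel splits."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split on a constant attribute
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(x, node, depth=0):
    """Depth at which x lands, plus the expected extra depth of an unsplit leaf."""
    if node[0] == "leaf":
        return depth + c_factor(node[1])
    _, dim, split, left, right = node
    return path_length(x, left if x[dim] < split else right, depth + 1)

def non_isolation_rate(label_set, standard_set, n_trees=100, threshold=0.55, seed=0):
    """Share of label_set points that a forest trained on standard_set does not isolate."""
    if len(standard_set) < 2 or not label_set:
        raise ValueError("need at least two standard points and one label point")
    rng = random.Random(seed)
    n = len(standard_set)
    max_depth = math.ceil(math.log2(n))
    trees = [build_tree(standard_set, 0, max_depth, rng) for _ in range(n_trees)]
    normal = 0
    for x in label_set:
        mean_depth = sum(path_length(x, t) for t in trees) / n_trees
        score = 2.0 ** (-mean_depth / c_factor(n))  # close to 1 means easily isolated
        if score <= threshold:  # assumed cut-off, to be tuned in practice
            normal += 1
    return normal / len(label_set)

# Toy 1-D data (hypothetical): the standard set covers [0, 1).
standard = [(i / 64,) for i in range(64)]
inside = [(0.30 + 0.01 * i,) for i in range(40)]  # lies within the standard range
outside = [(1000.0,)] * 10                        # far from every standard point
rate_inside = non_isolation_rate(inside, standard)
rate_outside = non_isolation_rate(outside, standard)
```

Points that resemble the standard set need many random splits before they are separated, so they receive low scores and count as non-isolated. In practice a library implementation such as scikit-learn's IsolationForest, whose predict method returns 1 for inliers and -1 for outliers, could replace this sketch.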

The workflow is divided into three main steps: (a) the original sequential training dataset (T0) is transformed by PP into a new training dataset (T1), whose attributes and length differ from those of T0; (b) the original testing dataset (D0) is transformed by PP into a new testing dataset (D1); and (c) the user applies the new training and testing datasets for prediction with known classification algorithms. Note that the core of the PP method is the Isolation Forest algorithm. Abstract flow charts are shown in Figs. 1 and 2, which indicate how the concept works overall for reconstructing the training and testing datasets, respectively, using subspace division.

A set of formulae is developed to explain each step in the aforementioned figures, describing how the data are processed and converted between successive steps. Let S = {\({\mathrm{x}}_{1}\), …, \({\mathrm{x}}_{\mathrm{t}}\), \({\mathrm{x}}_{\mathrm{t}+1}\), …} be the original dataset, with \({\mathrm{x}}_{\mathrm{t}}\in \mathrm{S}\) for t = 1, 2, …, and let the length of S be fixed. SP denotes the collection of all subsets of S. Every instance \({\mathrm{x}}_{\mathrm{t}}\in \mathrm{S}\) is characterized by m attributes. The attribute space is defined as A = {(\({\mathrm{a}}_{1}\), \({\mathrm{a}}_{2}\), …, \({\mathrm{a}}_{\mathrm{i}}\), …, \({\mathrm{a}}_{\mathrm{m}}\)) | \({\mathrm{a}}_{\mathrm{i}}\) is the value of the ith attribute, for i = 1, …, m}, and the attributes may be a mix of numeric and nominal types. All instances are labelled; the label is called the class, and the collection of classes is C = {\({\mathrm{C}}_{1}\), …, \({\mathrm{C}}_{\mathrm{n}}\)}. Each instance is then \({\mathrm{x}}_{\mathrm{t}}=({\mathrm{x}}_{\mathrm{t}}^{\prime},\mathrm{c})\), where \({\mathrm{x}}_{\mathrm{t}}^{\prime}=({\mathrm{x}}_{1,\mathrm{t}}^{\prime},\dots ,{\mathrm{x}}_{\mathrm{i},\mathrm{t}}^{\prime},\dots ,{\mathrm{x}}_{\mathrm{m},\mathrm{t}}^{\prime})\in \mathrm{A}\) and c ∈ C. In this experiment, cross-validation is applied to split the original dataset S into training and testing datasets. We denote one of the training datasets as T0 and the corresponding testing dataset as D0, whose instances are treated as unlabelled. Before pre-processing the training data, we define some functions and notation.

  • Formulae #1

    Define a function Class that returns the class \({\mathrm{c}}_{\mathrm{h}}\) of an instance h, where \(\mathrm{h}\in \mathrm{S}\) and \({\mathrm{c}}_{\mathrm{h}}\in \mathrm{C}\).

    $$\mathrm{Class}:\mathrm{S}\to \mathrm{C },\mathrm{Class}\left(\mathrm{h}\right)={\mathrm{c}}_{\mathrm{h}}$$
    (1)
  • Formulae #2

    The function Maj is defined on SP and returns the major (most frequent) class \({\mathrm{c}}_{\mathrm{maj}}\) of a dataset \({\mathrm{w}}_{\mathrm{x}}\); the operation |·| gives the number of elements of a set.

    $$\mathrm{Maj}:\mathrm{SP}\to \mathrm{C},\quad \mathrm{Maj}\left({\mathrm{w}}_{\mathrm{x}}\right)=\underset{\mathrm{c}\in \mathrm{C}}{\mathrm{arg\,max}}\,\frac{\left|\left\{\mathrm{h}\in {\mathrm{w}}_{\mathrm{x}}\mid \mathrm{Class}\left(\mathrm{h}\right)=\mathrm{c}\right\}\right|}{\left|{\mathrm{w}}_{\mathrm{x}}\right|}={\mathrm{c}}_{\mathrm{maj}}$$
    (2)
    Fig. 1. Block diagram that shows how a new training dataset is reconstructed by our proposed preprocessing method.

    Fig. 2. Block diagram that shows how a new testing dataset is reconstructed by our proposed preprocessing method.

  • Formulae #3

    Define a function Div(·) to collect the instances whose class is c from \({\mathrm{T}}_{0}\) then create a subset \({\mathrm{T}}_{0}^{\mathrm{c}}\) of \({\mathrm{T}}_{0}\).

    $$\mathrm{Div}:\mathrm{ SP}\times \mathrm{C}\to \mathrm{SP },\mathrm{ Div}\left({\mathrm{T}}_{0},\mathrm{ c}\right)=\left\{\mathrm{x}\in {\mathrm{T}}_{0}\right|\mathrm{ Class}\left(\mathrm{x}\right)=\mathrm{c }\}= {\mathrm{T}}_{0}^{\mathrm{c}}$$
    (3)
  • Formulae #4

    \({\mathrm{Sam}}_{\mathrm{md}}\)(·) is a function that takes z samples from \({\mathrm{T}}_{0}^{\mathrm{c}}\) using a certain sampling method md; here md can be, for example, simple random sampling or stratified random sampling. The resulting sample set is named \({\mathrm{ST}}_{0}^{\mathrm{c}}\).

    $$\mathrm{Sam}_{\mathrm{md}}:\mathrm{SP}\times \mathbb{R}\to \mathrm{SP},\quad \mathrm{Sam}_{\mathrm{md}}\left({\mathrm{T}}_{0}^{\mathrm{c}},\mathrm{z}\right)={\mathrm{ST}}_{0}^{\mathrm{c}}$$
    (4)
  • Formulae #5

    The function \(\mathrm{ITR}(\cdot )\) uses \({\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\) as the standard case to train an Isolation Forest, an algorithm co-developed by Prof. Zhihua Zhou [17] for detecting isolated points. The sample set \({\mathrm{L}}_{0}^{{\mathrm{C}}_{\mathrm{i}}}\) is then fed into the trained Isolation Forest model to decide which of its instances are isolated points. Finally, the function computes the proportion of instances in \({\mathrm{L}}_{0}^{{\mathrm{C}}_{\mathrm{i}}}\) that follow the distribution of \({\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\). In other words, this function computes the non-isolation rate \({\mathrm{P}}_{\mathrm{i},\mathrm{j}}\).

    $$\mathrm{ITR}:\mathrm{SP}\times \mathrm{SP}\to \left[0,1\right],\quad \mathrm{ITR}\left({\mathrm{L}}_{0}^{{\mathrm{C}}_{\mathrm{i}}},{\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\right)={\mathrm{P}}_{\mathrm{i},\mathrm{j}}$$
    (5)
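The set-level functions defined in Formulae #1 to #4 can be sketched in a few lines of Python. This is an illustrative reading of the definitions, assuming that each instance is stored as a (features, label) pair and that md is simple random sampling without replacement; the function names and the toy data are hypothetical.

```python
import random
from collections import Counter

# Each instance is a (features, label) pair, mirroring x_t = (x'_t, c).

def class_of(instance):
    """Class(h): return the label c_h of one instance."""
    return instance[1]

def maj(window):
    """Maj(w): return the most frequent label in a non-empty set of instances."""
    counts = Counter(class_of(h) for h in window)
    return counts.most_common(1)[0][0]

def div(t0, c):
    """Div(T0, c): keep the instances of T0 whose label is c."""
    return [x for x in t0 if class_of(x) == c]

def sam(subset, z, rng):
    """Sam_md(T0^c, z): draw z instances by simple random sampling (assumed md)."""
    return rng.sample(subset, z)

# Toy training data (hypothetical values).
t0 = [((0.1,), "walking"), ((0.2,), "walking"), ((5.0,), "running"),
      ((0.3,), "walking"), ((5.2,), "running")]
walking = div(t0, "walking")                 # T0^c for c = "walking"
standard = sam(walking, 2, random.Random(0)) # ST0^c with z = 2
```

Here maj(t0) returns "walking" because three of the five toy instances carry that label.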

Step 1: Reconstruct Training Data Table

With the definitions and notation above, the algorithm proceeds as follows (see also Figs. 1 and 2). Given an original dataset S, cross-validation yields a training dataset T0, which is divided into a collection of sub-datasets \({\{{\mathrm{T}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\}}_{\mathrm{j}=1}^{\mathrm{n}}\) where

$${\mathrm{T}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}=\mathrm{Div}\left({\mathrm{T}}_{0}, {\mathrm{C}}_{\mathrm{j}}\right)$$
(6)

The algorithm then performs the first-time sampling (the reason it is called "first-time" is explained at the end of this step) with a specific sampling method md to capture the traits of \({\{{\mathrm{T}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\}}_{\mathrm{j}=1}^{\mathrm{n}}\), yielding a series of sampled datasets \({\{{\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\}}_{\mathrm{j}=1}^{\mathrm{n}}\) where

$${\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}={\mathrm{Sam}}_{\mathrm{md}}\left({\mathrm{T}}_{0}^{{\mathrm{C}}_{\mathrm{j}}} , {\mathrm{z}}_{2}\right) $$
(7)

In order to simulate an arbitrary test sliding window w (described in Step 2) with \(\mathrm{Maj}(\mathrm{w})={\mathrm{C}}_{\mathrm{i}}\), the pre-processing method also performs first-time sampling with the sampling method md on \({\{{\mathrm{T}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\}}_{\mathrm{j}=1}^{\mathrm{n}}\), yielding a series of label datasets \({\{{\mathrm{L}}_{0}^{{\mathrm{C}}_{\mathrm{i}}}\}}_{\mathrm{i}=1}^{\mathrm{n}}\) where

$${\mathrm{L}}_{0}^{{\mathrm{C}}_{\mathrm{i}}}={\mathrm{Sam}}_{\mathrm{md}}\left({\mathrm{T}}_{0}^{{\mathrm{C}}_{\mathrm{i}}}, {\mathrm{z}}_{1}\right)\cup {\mathrm{N}}_{\mathrm{i}}$$
(8)

and \({\mathrm{N}}_{\mathrm{i}}\) is a special noise set that simulates the noise in an arbitrary test sliding window w with \(\mathrm{Maj}(\mathrm{w})={\mathrm{C}}_{\mathrm{i}}\). The collections \({\{{\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\}}_{\mathrm{j}=1}^{\mathrm{n}}\) and \({\{{\mathrm{L}}_{0}^{{\mathrm{C}}_{\mathrm{i}}}\}}_{\mathrm{i}=1}^{\mathrm{n}}\) each contain n sets, so there are n × n combinations \({\left\{\left({\mathrm{L}}_{0}^{{\mathrm{C}}_{\mathrm{i}}},{\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\right)\right\}}_{\mathrm{i},\mathrm{j}=1}^{\mathrm{n}}\). For each combination, the pre-processing passes \({\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\) as the second input (standard case) of ITR and \({\mathrm{L}}_{0}^{{\mathrm{C}}_{\mathrm{i}}}\) as the first input (detection case). At the end of the whole process, a training table TDT is generated.

  • Formulae #6

    Construct a matrix \(\mathrm{TDT}\in \left[\,{[0,1]}^{\mathrm{n}\times \mathrm{n}}\;\big|\;{\mathrm{C}}^{\mathrm{n}\times 1}\right]\): the element in the ith row and jth column of TDT is computed by the function \(\mathrm{ITR}\left({\mathrm{L}}_{0}^{{\mathrm{C}}_{\mathrm{i}}},{\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\right)\), and the element in the kth row and (n + 1)th column is \({\mathrm{C}}_{\mathrm{k}}\):

    $${\left[\mathrm{TDT}\right]}_{\mathrm{i},\mathrm{j}}=\mathrm{ITR}\left({\mathrm{L}}_{0}^{{\mathrm{C}}_{\mathrm{i}}},{\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\right)\quad \mathrm{for}\ 1\le \mathrm{i}\le \mathrm{n}\ \mathrm{and}\ 1\le \mathrm{j}\le \mathrm{n}$$
    $${\left[\mathrm{TDT}\right]}_{\mathrm{k},\mathrm{n}+1}={\mathrm{C}}_{\mathrm{k}}\quad \mathrm{for}\ 1\le \mathrm{k}\le \mathrm{n}$$
    (9)

    Finally, the function PP1 is defined from TDT.

    Definition: Function PP1 is defined to generate a new training dataset T1 from T0 given z1 (the length of \({\mathrm{L}}_{0}^{{\mathrm{C}}_{\mathrm{i}}}\)) and z2 (the length of \({\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\)).

    $$\mathrm{PP1}:\mathrm{SP}\times \mathbb{R}\times \mathbb{R}\to {\mathrm{SP}}^{\prime},\quad \mathrm{PP1}\left({\mathrm{T}}_{0},{\mathrm{z}}_{1},{\mathrm{z}}_{2}\right)={\mathrm{T}}_{1}$$

    where \({\mathrm{T}}_{1}=\{{\mathrm{xx}}_{1},\dots ,{\mathrm{xx}}_{\mathrm{n}}\}\) and

    $${\mathrm{xx}}_{\mathrm{i}}={\left({\left[\mathrm{TDT}\right]}_{\mathrm{i},\mathrm{j}}\right)}_{\mathrm{j}=1}^{\mathrm{n}+1}=\left({\left(\mathrm{ITR}\left({\mathrm{Sam}}_{\mathrm{md}}\left(\mathrm{Div}\left({\mathrm{T}}_{0},{\mathrm{C}}_{\mathrm{i}}\right),{\mathrm{z}}_{1}\right)\cup {\mathrm{N}}_{\mathrm{i}},\ {\mathrm{Sam}}_{\mathrm{md}}\left(\mathrm{Div}\left({\mathrm{T}}_{0},{\mathrm{C}}_{\mathrm{j}}\right),{\mathrm{z}}_{2}\right)\right)\right)}_{\mathrm{j}=1}^{\mathrm{n}},\ {\mathrm{C}}_{\mathrm{i}}\right)$$
    (10)

We now explain why the sampling above is called "first-time". If the algorithm sampled only once, TDT would contain just n instances, which is too few. To increase the size of the dataset, the algorithm repeats Step 1 many times; the number of repetitions is chosen by the user. Finally, all the training tables are combined into a new training dataset. The same applies to the test data table. In the following step, only the first pass of the loop is described because the principle is the same.
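Step 1, including the repetition just described, can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: the ITR function is passed in as a callable, the standard sets and label sets are both resampled on every repetition, and the helper range_rate is a deliberately simple stand-in that is not an isolation forest. All names and toy values are hypothetical.

```python
import random

def range_rate(label_set, standard_set):
    """Stand-in for ITR (NOT an isolation forest): share of values inside the standard range."""
    lo, hi = min(standard_set), max(standard_set)
    return sum(lo <= x <= hi for x in label_set) / len(label_set)

def build_training_table(t0_by_class, itr, z1, z2, repeats, rng, noise=None):
    """Stack `repeats` TDT tables; each row is ([P_i1, ..., P_in], C_i)."""
    classes = sorted(t0_by_class)
    rows = []
    for _ in range(repeats):
        # Standard sets ST0^{Cj}: z2 samples per class.
        standard = {c: rng.sample(t0_by_class[c], z2) for c in classes}
        for ci in classes:
            # Label set L0^{Ci}: z1 samples of class Ci, optionally joined with noise N_i.
            label_set = rng.sample(t0_by_class[ci], z1)
            if noise:
                label_set = label_set + list(noise.get(ci, []))
            probs = [itr(label_set, standard[cj]) for cj in classes]
            rows.append((probs, ci))
    return rows

# Toy 1-D data per class (hypothetical values), i.e. the sets T0^{Cj}.
t0_by_class = {
    "walking": [i / 100 for i in range(100)],
    "running": [5 + i / 100 for i in range(100)],
}
t1 = build_training_table(t0_by_class, range_rate, z1=20, z2=30,
                          repeats=5, rng=random.Random(0))
```

With n = 2 classes and 5 repetitions the sketch yields 10 rows, each holding n rates followed by the class of the label set that produced them.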

Step 2: Reconstruct Testing Data Table

This step transforms the testing dataset D0 into a new testing dataset D1, where \({\mathrm{D}}_{0}=\left\{{\mathrm{y}}_{1},\dots ,{\mathrm{y}}_{\mathrm{t}},\dots \right\}\) and \({\mathrm{D}}_{1}=\{{\mathrm{yy}}_{1},\dots ,{\mathrm{yy}}_{\mathrm{t}},\dots ,{\mathrm{yy}}_{\mathrm{r}}\}\). A sliding window w is used as the instrument, and its length p can be set by the user, e.g. p = z2. Let \({\mathrm{w}}_{1}=\{{\mathrm{y}}_{1},\dots ,{\mathrm{y}}_{\mathrm{p}}\}\), \({\mathrm{w}}_{\mathrm{t}}=\{{\mathrm{y}}_{\mathrm{t}},\dots ,{\mathrm{y}}_{\mathrm{t}+\mathrm{p}-1}\}\) and \(\mathrm{W}=\{{\mathrm{w}}_{1},\dots ,{\mathrm{w}}_{\mathrm{t}},\dots ,{\mathrm{w}}_{\mathrm{r}}\}\), where r is determined by the length of the sliding window and the length of the testing dataset. As in Step 1, the new testing dataset is computed by ITR. However, the input is no longer the n × n combinations used for TDT but the r × n combinations \({\left\{\left({\mathrm{w}}_{\mathrm{t}},{\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\right)\right\}}\) for t = 1, …, r and j = 1, …, n.

  • Formulae #7

    Construct a Matrix \({\mathrm{TDT}}_{1}\in {[\mathrm{0,1}]}^{\mathrm{r}\times \mathrm{n}}\):

    $${[{\mathrm{TDT}}_{1}]}_{\mathrm{t},\mathrm{j}}=\mathrm{ITR}( {\mathrm{w}}_{\mathrm{t}} ,{\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}})$$
    (11)

    where \({\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}={\mathrm{Sam}}_{\mathrm{md}}\left({\mathrm{T}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}, {\mathrm{z}}_{2}\right)\) for j = 1, …, n, and \({\mathrm{w}}_{\mathrm{t}}\) is the tth sliding window for t = 1, …, r. The element in the tth row and jth column of \({\mathrm{TDT}}_{1}\) is thus computed by the function \(\mathrm{ITR}\left({\mathrm{w}}_{\mathrm{t}},{\mathrm{ST}}_{0}^{{\mathrm{C}}_{\mathrm{j}}}\right)\).

    With the help of TDT1 the final high-level function of step two can be defined.

  • Formulae #8

    Function PP2 transforms the testing dataset D0 into a new testing dataset D1 that has the same attributes as T1. Part of the input changes: instead of sampling T0 to obtain the label set \({\mathrm{L}}_{0}^{{\mathrm{C}}_{\mathrm{i}}}\) of size z1, PP2 uses the sliding windows of D0, and the class of each new testing instance is the major class of the corresponding sliding window.

    $$\mathrm{PP2}:\mathrm{SP}\times \mathrm{SP}\times \mathbb{R}\to {\mathrm{SP}}^{\prime\prime},\quad \mathrm{PP2}\left({\mathrm{T}}_{0},{\mathrm{D}}_{0},{\mathrm{z}}_{2}\right)={\mathrm{D}}_{1}$$

    where \({\mathrm{D}}_{1}=\{{\mathrm{yy}}_{1},\dots ,{\mathrm{yy}}_{\mathrm{t}},\dots ,{\mathrm{yy}}_{\mathrm{r}}\}\), and

    $${\mathrm{yy}}_{\mathrm{t}}=\left({\left({\left[{\mathrm{TDT}}_{1}\right]}_{\mathrm{t},\mathrm{j}}\right)}_{\mathrm{j}=1}^{\mathrm{n}},\ \mathrm{Maj}\left({\mathrm{w}}_{\mathrm{t}}\right)\right)=\left({\left(\mathrm{ITR}\left({\mathrm{w}}_{\mathrm{t}},\ {\mathrm{Sam}}_{\mathrm{md}}\left(\mathrm{Div}\left({\mathrm{T}}_{0},{\mathrm{C}}_{\mathrm{j}}\right),{\mathrm{z}}_{2}\right)\right)\right)}_{\mathrm{j}=1}^{\mathrm{n}},\ \mathrm{Maj}\left({\mathrm{w}}_{\mathrm{t}}\right)\right)$$
    (12)
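Step 2 can be sketched in the same spirit. The code below assumes a window stride of one instance, uses the labels of D0 only to compute Maj(w_t) for later evaluation, and again plugs in a simple range-based stand-in for ITR rather than an isolation forest. The names and toy values are hypothetical.

```python
from collections import Counter

def range_rate(window_values, standard_set):
    """Stand-in for ITR (NOT an isolation forest): share of values inside the standard range."""
    lo, hi = min(standard_set), max(standard_set)
    return sum(lo <= x <= hi for x in window_values) / len(window_values)

def sliding_windows(d0, p, step=1):
    """Return the windows w_1, ..., w_r of length p over d0 (stride is an assumption)."""
    return [d0[s:s + p] for s in range(0, len(d0) - p + 1, step)]

def build_test_table(d0, standard, itr, p, step=1):
    """Each row is ([ITR(w_t, ST0^{C1}), ..., ITR(w_t, ST0^{Cn})], Maj(w_t))."""
    classes = sorted(standard)
    rows = []
    for w in sliding_windows(d0, p, step):
        values = [x for x, _ in w]
        probs = [itr(values, standard[c]) for c in classes]
        major = Counter(label for _, label in w).most_common(1)[0][0]
        rows.append((probs, major))
    return rows

# Toy test stream of (value, label) pairs and toy standard sets (hypothetical values).
d0 = [(0.1, "walking"), (0.2, "walking"), (0.3, "walking"),
      (5.1, "running"), (5.2, "running"), (5.3, "running")]
standard = {"walking": [0.0, 0.5], "running": [5.0, 5.5]}
d1 = build_test_table(d0, standard, range_rate, p=3)
```

With six instances and p = 3 the sketch produces r = 4 rows. The columns follow the sorted class order, so the first entry of each row is the rate against the "running" standard set.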

Step 3: Model Learning

After the pre-processing method has generated the new training dataset T1 and the new testing dataset D1, the user can apply different algorithms to make predictions with T1 and D1. A group of high-level functions \({\{{\mathrm{algo}}_{\mathrm{i}}\}}_{\mathrm{i}=1}^{5}\) that represent these processes is defined as follows:

  • Formulae #9

    \({\{{\mathrm{algo}}_{\mathrm{i}}\}}_{\mathrm{i}=1}^{5}\) is a group of high-level functions that take a training dataset (such as \({\mathrm{T}}_{1}\)), a testing dataset (such as \({\mathrm{D}}_{1}\)) and a group of parameters, train and test the model, and return the performance evaluation index pf.

    $${\mathrm{algo}}_{\mathrm{i}}:{\mathrm{SP}}^{\prime}\times {\mathrm{SP}}^{\prime\prime}\times \mathrm{Pre}\to \mathbb{R},\quad {\mathrm{algo}}_{\mathrm{i}}\left({\mathrm{T}}_{1},{\mathrm{D}}_{1},\mathrm{P}\right)=\mathrm{pf}$$
    (13)

where Pre is the collection of all possible parameter settings that the user may request, \(\mathrm{P}\in \mathrm{Pre}\), and pf is a performance evaluation index that combines several statistics of the model, such as accuracy, recall and F1-score. Five algorithms are involved in this paper: SVM, logistic regression, C4.5, a Bayes classifier and KNN. Comparing the performance before and after pre-processing, we found that the new method improves prediction accuracy. The following section presents the experimental results in detail.
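As an illustration of one such function, the sketch below trains and tests a small k-nearest-neighbour classifier on rows of the form (rates, class) and returns accuracy together with macro-averaged precision, recall and F1. It is a simplified example and not the experimental code of this paper: the parameter set is reduced to k, macro-averaging is an assumption, and the toy rows are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_rows, x, k=3):
    """Predict the majority class among the k training rows closest to x."""
    nearest = sorted(train_rows, key=lambda row: math.dist(row[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def evaluate(train_rows, test_rows, k=3):
    """algo(T1, D1, P) -> pf with P = {k}; pf holds accuracy and macro-averaged scores."""
    truth = [label for _, label in test_rows]
    pred = [knn_predict(train_rows, x, k) for x, _ in test_rows]
    classes = sorted(set(truth) | set(pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(truth, pred))
        fp = sum(t != c and p == c for t, p in zip(truth, pred))
        fn = sum(t == c and p != c for t, p in zip(truth, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(truth, pred)) / len(truth),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy rows in the (rates, class) layout of T1 and D1 (hypothetical values).
t1 = [([0.9, 0.1], "a"), ([0.8, 0.2], "a"), ([0.95, 0.0], "a"),
      ([0.1, 0.9], "b"), ([0.2, 0.8], "b"), ([0.0, 1.0], "b")]
d1 = [([0.85, 0.1], "a"), ([0.1, 0.85], "b")]
pf = evaluate(t1, d1, k=3)
```

Any of the five classifiers named above could take the place of the k-nearest-neighbour rule, since the tables produced by the pre-processing are ordinary labelled feature vectors.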

4 Experiment

In this section we evaluate the performance of the proposed model through extensive experiments. We choose a sensor dataset that is a typical data stream in mobile IoT applications and compare the performance of the algorithms after model optimization with that of the original algorithms. The Heterogeneity Human Activity Recognition (HHAR) dataset, collected from smartphones and smartwatches, was devised to benchmark human activity recognition algorithms (classification, automatic data segmentation, sensor fusion, feature extraction, etc.) in real-world contexts; specifically, the dataset was gathered with a variety of device models and use scenarios in order to reflect the sensing heterogeneities to be expected in real deployments [18, 19]. The data can be obtained free of charge from the public archive at the UCI repository.

We choose five traditional methods and three evaluation indicators (recall, precision and F1-score). The methods are the K-Neighbors classifier, logistic regression, Gaussian Naïve Bayes, decision tree and support vector machine. The parameters of the machine learning models are left at their default values. When constructing a new training set, the length of the selected window is 100, and 20 serialized instances are constructed from the serialized samples in each category. Formula #4 is then applied to select the instances of \({\mathrm{T}}_{0}^{\mathrm{c}}\). Their number should be greater than the number selected by the window. Therefore, 30 instances are selected as \({\mathrm{T}}_{0}^{\mathrm{c}}\) in each category, and the iForest algorithm is then used to calculate the category probability. Here the category probability is computed as the percentage of similar instances. The new test set is built with the same steps: the length of the selected window is 20, and 20 serialized instances are constructed from random serialized samples in each category of the test set. In each category, 30 instances are again selected as \({\mathrm{T}}_{0}^{\mathrm{c}}\), and the similarity probability is calculated using Eq. 5, so that a new test set is formed. A preliminary experiment on sensor dataset two shows that the model performs well, with a large improvement over the classification models built without the proposed pre-processing. The comparison results are tabulated in Table 1. The last two columns, ‘Score’ and ‘Our Model’, give the performance indicator values of the classification models without and with pre-processing, respectively. The running time is only a few minutes for every classifier. From the perspective of machine learning on time series, we are more concerned with what happens over time. Therefore, the granularity of the partition and the length of the window are particularly important, and we further design granularity experiments and iterative experiments.
If the selected window length is appropriate, more iterations produce more training instances. The same parameters are used for building the training and testing datasets. In the additional experiment, W is set to 20, 30, 40 and 50 and ST to 30, 40, 50 and 60. The number of training-set iterations is set to 100. As can be seen in Fig. 3, the size of the sliding window, which decides how many instances per pass enter the pre-processing and the training of the classifier, matters. When the window size exceeds 60 and approaches 100, an almost perfect score is obtained in the cross-validation testing of the classifier. This implies that a sufficient amount of data per pass helps in framing the subspaces. However, too large a sliding window may incur high latency. Using a large sliding window is like reverting from incremental learning, which is fast because it learns online, to traditional batch learning, where the full dataset is used for model induction. The appropriate window size for balancing latency against the highest possible accuracy is worth in-depth investigation in future work. Although the sliding window size can be tuned to suit both accuracy and latency, what if only a small number of training samples is available? To test such an extreme situation, another experiment is simulated in which only relatively little training and testing data are assumed to be available and used in the pre-processing. The objective is to test the correlation between accuracy and the volume of the training dataset.

Table 1. Comparison results between classifiers built without and with pre-processing

Fig. 3. Averaged performance of classifiers with various W sizes.

Table 2 shows that, for an equal number of epochs, the more data are available, the better our pre-processing approach performs. Note that the performance indicator values are averaged over the five classifiers used in our experiments. As the training and testing sizes gradually increase, precision, recall and F1-score fluctuate, but the overall trend is rising. This suggests that, to a certain extent, the larger the probabilistic sample and the more data available, the higher the accuracy of detection. The Pearson coefficients between the training data size and the precision, the recall and the balanced F-score are 0.620456148, 0.828351883 and 0.728648368, respectively, which indicates a fairly high correlation.
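For reference, a Pearson coefficient of the kind reported above can be computed as in the following sketch. The sizes and scores in the example are placeholders chosen for illustration and are not the values of Table 2.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)

# Placeholder data: training sizes and an averaged score per size (NOT from Table 2).
sizes = [100, 200, 300, 400, 500]
scores = [0.60, 0.72, 0.70, 0.81, 0.85]
r = pearson(sizes, scores)
```

A value of r close to 1 indicates that the score tends to rise with the training size, which is the kind of relationship examined in this experiment.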

Table 2. Performance results from various dataset sizes.

5 Conclusions

This paper describes a subspace probabilistic detection pre-processing model based on subspace-attribute probability calculation. The proposed model is used as a pre-processing method that transforms a time-diluted dataset into one that is better characterized by the temporal information in the data, which leads to better classification model training and prediction results. Five popular classification algorithms are tested with the pre-processing method by classifying sensor data that characterize certain human activities. Such sensor data represent a kind of big data stream that poses new data mining challenges due to its sheer volume and sequential nature. The model effectively mitigates the repetitiveness and noise present in the sensor data. Our experiments show that the model improves the performance of traditional machine learning classification algorithms on sensor data by a large margin.