
1 Introduction

With the development of the global supply chain, the safe and stable operation of the industrial chain plays a vital role in a country's economic development. The industrial chain trading market is currently complex and exposed to various trading risks [1], one of which is insider trading. Insider trading refers to traders directly or indirectly using inside information to buy and sell commodities and obtain improper economic benefits. Inside information is non-public information that staff of financial institutions or regulatory departments of the industrial chain obtain by virtue of their positions. Because inside information can affect prices in the trading market, users with access to it can also pass it on to others [2] and thereby obtain excess profits indirectly. This behavior undermines the fairness of the industrial chain trading market and threatens its healthy development. Insider trading is also highly concealed, which makes it difficult to supervise [3]. Therefore, this paper designs an insider trading detection algorithm for the industrial chain based on logistics time interval characteristics, in order to uncover hidden insider trading behavior.

At present, insider trading detection in the industrial chain usually relies on either rule-based abnormal indicators of dealer trading patterns or data-driven models [4] to identify abnormal dealer behavior. For rule-based abnormal indicators, because industrial chain commodities are physically delivered, they can be traded offline as well as online, so the whole circulation process cannot be supervised, and income measurement indicators based on informed trading become inapplicable. For data-driven models, using logistics features directly yields high-dimensional inputs that are difficult to learn from, which degrades detection performance. Therefore, to help regulators supervise insider trading in the industrial chain trading market efficiently, this paper studies insider trading detection based on logistics characteristics in that market.

The main contributions of this paper are summarized as follows:

1) First, to address the low efficiency of raw industrial chain data features, we propose a method for constructing logistics indicators that describe the whole process of insider trading behavior along three dimensions: the dealer's own trading mode, the commodity trading mode, and the dealer mode;

2) Second, to address the long time span of transactions, we propose a dynamic sliding window that extracts the logistics characteristics within each time interval and then judges whether that interval is abnormal;

3) Finally, we improve the isolation forest algorithm to identify the abnormal data within the abnormal windows. The experimental results show that, compared with the isolation forest method, the proposed algorithm increases the F1 value by 20.68% when identifying the insider trading behavior of traders in the industrial chain trading market.

2 Related Work

In terms of methods, insider trading detection in the industrial chain market can be divided into model-based and data-driven insider trading detection methods [4].

For the model-based insider trading detection methods, Fama et al. proposed the event study method, which measures normal and abnormal returns before and after a given event [5] and can be used to judge insider trading. Easley et al. mathematically described the trading process of informed and uninformed traders and proposed a model that estimates the probability that a transaction is made by an informed trader [6]. In view of the complex trading strategies of insider traders and the difficulty of calculating additional returns with traditional econometric models, Mienna proposed a statistical model based on long time series data sets [7]. Cline et al. considered the partial observability of insider trading and proposed a bivariate probability model to detect illegal insider trading [8].

For data-driven insider trading detection methods, Deng et al. proposed a gradient boosted decision tree (GBDT) based approach for insider trading detection, with differential evolution (DE) for parameter initialization [9, 10]. Esen et al. proposed a clustering-based insider trading detection method, which uses K-means and hierarchical clustering to take the outlier value of trading behavior as the degree of suspicion of insider trading and verifies it with the event study method [11]. Islam proposed using a Long Short-Term Memory (LSTM) network to learn the structured and unstructured features of illegal insider trading events [12]. Seth et al. proposed a multi-stage insider trading detection method that combines a deep neural network, a consensus model and statistical methods to identify illegal insider trading through event analysis and detection on unstructured and structured data [13]. Lauar et al. built a training data set from news events preceding insider trading events, proposed an augmentation method to expand the data set, and used XGBoost to predict insider trading events [14].

However, the above methods do not consider the inefficiency of industrial chain data features or the long time span of transactions, so the earlier index modeling cannot be applied and the resulting algorithms perform poorly. Therefore, this paper designs an insider trading detection algorithm for the industrial chain based on logistics time interval characteristics to uncover hidden insider trading behavior.

3 Problem Formulation and Analysis

In this section, the problem of insider trading detection in the industrial chain trading market is formally defined. Industrial chain trading differs from other electronic transactions in that it involves the logistics transportation of goods and therefore has a large time span, so logistics characteristics must be taken into account when detecting insider trading behavior. Next, the insider trading detection problem based on logistics time interval characteristics in the industrial chain is defined in detail.

Suppose a number of dealers traded during the time interval \(\left[ t_{a}, t_{b} \right)\), and let the set \(A = \left\{ a_{1}, \ldots, a_{i}, \ldots, a_{N} \right\}\) denote their trading accounts. A trader \(a_{i} \in A\) trades M types of commodities, denoted \(C^{i} = \left\{ c_{1}^{i}, \ldots, c_{j}^{i}, \ldots, c_{M}^{i} \right\}\). The K transactions of this trader on the \(j\)-th commodity are denoted \(R_{j}^{i} = \left\{ r_{j,1}^{i}, \ldots, r_{j,k}^{i}, \ldots, r_{j,K}^{i} \right\}\), and the \(k\)-th transaction is represented as \(r_{j,k}^{i} = \left\langle s_{j,k}^{i}, e_{j,k}^{i}, p_{j,k}^{i}, v_{j,k}^{i}, f_{j,k}^{i} \right\rangle\), where \(f_{j,k}^{i} = 1\) represents the selling transaction behavior and \(f_{j,k}^{i} = -1\) represents the buying transaction behavior.

The problem of insider trading behavior detection [15] is defined as: judging whether the trading behavior \(r_{j,k}^{i}\) in the trading behavior data \( R_{j}^{i}\) of dealers is insider trading behavior. The variables involved are listed in Table 1.

Table 1. Variable description

4 Algorithm Design

4.1 Interval Logistics Index Construction Algorithm

The main function of interval logistics characteristic index construction is to use indicators [16] under different dimensions to describe insider trading behavior and normal trading behavior in the industrial chain trading market. The specific steps of interval logistics characteristic index construction algorithm are as follows:

Using a Sliding Window to Divide the Logistics Behavior. The logistics behavior of the trading account \(a_{i}\) on the \(j\)-th commodity within the detection time interval \(\left[ t_{a}, t_{b} \right)\) is divided into N segments of unit time interval length T, where T is given by Eq. 1:

$$ T = \frac{{t_{b} - t_{a} }}{N} $$
(1)

Therefore, the time interval sequence obtained by partitioning is expressed in Eq. 2:

$$ S = \left\{ [t_{a}, t_{a} + T), [t_{a} + T, t_{a} + 2T), \ldots, [t_{a} + (N - 1)T, t_{b}) \right\} $$
(2)

Because a logistics interval may span multiple units of time T, a sliding window of length \(L\) is used to extract the logistics characteristics of this logistics behavior, and the time intervals of the sliding window are given by Eq. 3:

$$ SS = \left\{ [t_{a}, t_{a} + L), [t_{a} + T, t_{a} + T + L), \ldots, [t_{a} + NT - L, t_{b}) \right\} $$
(3)
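As a minimal sketch of Eqs. 1 to 3 (not the authors' implementation), the following Python function enumerates the sliding windows. The function name `sliding_windows`, its parameter names and the small numeric tolerance are assumptions introduced only for illustration.

```python
from typing import List, Tuple


def sliding_windows(t_a: float, t_b: float, n_segments: int,
                    window_len: float) -> List[Tuple[float, float]]:
    """Return half-open windows [t_x, t_x + L) over [t_a, t_b).

    Window starts advance by the unit length T = (t_b - t_a) / N (Eq. 1),
    and the last window ends at t_b (Eq. 3).
    """
    if n_segments <= 0:
        raise ValueError("n_segments must be positive")
    if window_len <= 0 or window_len > t_b - t_a:
        raise ValueError("window_len must lie in (0, t_b - t_a]")

    step = (t_b - t_a) / n_segments  # unit time interval length T
    eps = 1e-9                       # assumed tolerance for float rounding
    windows = []
    k = 0
    while t_a + k * step + window_len <= t_b + eps:
        start = t_a + k * step
        windows.append((start, start + window_len))
        k += 1
    return windows
```

For example, with \(t_{a} = 0\), \(t_{b} = 10\), \(N = 10\) and \(L = 3\), the windows start at 0, 1, ..., 7, so consecutive windows overlap by \(L - T = 2\) time units.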

According to the type of logistics behavior of a trader, the logistics behaviors within the sliding window \([t_{x}, t_{x} + L)\) can be divided into buying and selling logistics behaviors: logistics behaviors entering the warehouse are represented by \(B_{j,t_{x}}^{i}\) in Eq. 4, and logistics behaviors exiting the warehouse are represented by \(S_{j,t_{x}}^{i}\) in Eq. 5.

$$ B_{j,t_{x}}^{i} = \left\{ \left\langle s_{j,k}^{i}, e_{j,k}^{i}, p_{j,k}^{i}, v_{j,k}^{i}, f_{j,k}^{i} \right\rangle \,\middle|\, t_{x} \le e_{j,k}^{i} < t_{x} + L,\ f_{j,k}^{i} = 1 \right\} $$
(4)
$$ S_{j,t_{x}}^{i} = \left\{ \left\langle s_{j,k}^{i}, e_{j,k}^{i}, p_{j,k}^{i}, v_{j,k}^{i}, f_{j,k}^{i} \right\rangle \,\middle|\, t_{x} \le s_{j,k}^{i} < t_{x} + L,\ f_{j,k}^{i} = -1 \right\} $$
(5)
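The window membership tests in Eqs. 4 and 5 can be sketched in Python as follows. This is an illustrative reading, not the authors' code: the `Transaction` type and the function `split_window` are hypothetical names, the fields mirror the tuple \(\langle s, e, p, v, f \rangle\), and the conditions on \(e\), \(s\) and \(f\) are copied from the equations as printed.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Transaction(NamedTuple):
    """One record r = <s, e, p, v, f> (hypothetical container type)."""
    s: float  # timestamp s of the record
    e: float  # timestamp e of the record
    p: float  # commodity price
    v: float  # commodity quantity
    f: int    # direction flag, +1 or -1


def split_window(records: Sequence[Transaction], t_x: float,
                 window_len: float) -> Tuple[List[Transaction], List[Transaction]]:
    """Split records into the sets B (Eq. 4) and S (Eq. 5) for [t_x, t_x + L)."""
    t_end = t_x + window_len
    # Eq. 4: f = 1 and timestamp e inside the half-open window.
    incoming = [r for r in records if r.f == 1 and t_x <= r.e < t_end]
    # Eq. 5: f = -1 and timestamp s inside the half-open window.
    outgoing = [r for r in records if r.f == -1 and t_x <= r.s < t_end]
    return incoming, outgoing
```

A record that lies outside the window on the relevant timestamp is simply left out of both sets, so the two lists can be fed directly into the value sums of the next step.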

Calculating Logistics Indicators of Incoming Warehouse and Outgoing Warehouse. The total values of the goods corresponding to the incoming and outgoing warehouse logistics behaviors are given by Eqs. 6 and 7:

$$ U_{j,t_{x}}^{i} = \sum\limits_{r_{j,k}^{i} \in B_{j,t_{x}}^{i}} p_{j,k}^{i} \, v_{j,k}^{i} $$
(6)
$$ V_{j,t_{x}}^{i} = \sum\limits_{r_{j,k}^{i} \in S_{j,t_{x}}^{i}} p_{j,k}^{i} \, v_{j,k}^{i} $$
(7)

Different types of logistics behavior contribute differently to the logistics index. Therefore, the shares of the incoming and outgoing warehouse commodity values in the total commodity value of all logistics behaviors are defined as Eqs. 8 and 9, respectively.

$$ \alpha_{{j,t_{x} }}^{i} = \frac{{U_{{j,t_{x} }}^{i} }}{{U_{{j,t_{x} }}^{i} + V_{{j,t_{x} }}^{i} }} $$
(8)
$$ \beta_{{j,t_{x} }}^{i} = \frac{{V_{{j,t_{x} }}^{i} }}{{U_{{j,t_{x} }}^{i} + V_{{j,t_{x} }}^{i} }} $$
(9)
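The sums and shares above can be illustrated with a short Python sketch. It assumes each logistics behavior is reduced to a `(price, quantity)` pair, and it returns zero shares for a window with no logistics behavior; the source does not state how that case is handled, so this choice and the function name `window_values` are assumptions.

```python
from typing import Iterable, Tuple

PriceQty = Tuple[float, float]  # (p, v) of one logistics behavior


def window_values(incoming: Iterable[PriceQty],
                  outgoing: Iterable[PriceQty]) -> Tuple[float, float, float, float]:
    """Return (U, V, alpha, beta) for one trader, commodity and window."""
    u = sum(p * v for p, v in incoming)  # total incoming value U
    v_out = sum(p * v for p, v in outgoing)  # total outgoing value V
    total = u + v_out
    if total == 0:
        # Assumed convention: an empty window has no incoming/outgoing share.
        return u, v_out, 0.0, 0.0
    return u, v_out, u / total, v_out / total  # alpha (Eq. 8), beta (Eq. 9)
```

By construction \(\alpha + \beta = 1\) whenever the window contains at least one logistics behavior with non-zero value.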

Calculating the Characteristic Indicators of the Three Dimensions. To describe the proposed indicators precisely, we make the following definitions:

Definition 1. Total value of goods in the interval: the total value of goods moved into and out of the warehouse within the sliding window, which is used to capture outliers in the timing of the trader's logistics behavior on this commodity. For the logistics behavior in this time window, the total value of the goods shipped into and out of the warehouse is given by Eq. 10:

$$ O_{{j,t_{x} }}^{i} = U_{{j,t_{x} }}^{i} + V_{{j,t_{x} }}^{i} $$
(10)

In the formula, the first term is the total value of all goods entering the warehouse in the time range \([t_{x}, t_{x} + L)\), computed as the commodity price \(p_{j,k}^{i}\) multiplied by the quantity \(v_{j,k}^{i}\) and summed over all purchasing logistics behaviors, and the second term is the total value of all goods exiting the warehouse.

Definition 2. Ratio of interval commodity trading to all traded commodities: the ratio of the value of this commodity in the sliding window to the total value of all commodities traded in the sliding window, given by Eq. 11:

$$ P_{{j,t_{x} }}^{i} = \alpha_{{j,t_{x} }}^{i} \frac{{U_{{j,t_{x} }}^{i} }}{{\mathop \sum \nolimits_{j} U_{{j,t_{x} }}^{i} }} + \beta_{{j,t_{x} }}^{i} \frac{{V_{{j,t_{x} }}^{i} }}{{\mathop \sum \nolimits_{j} V_{{j,t_{x} }}^{i} }} $$
(11)

In the formula, \(\alpha_{j,t_{x}}^{i}\) and \(\beta_{j,t_{x}}^{i}\) represent the shares of incoming and outgoing warehouse logistics activities, respectively. In the first term, the numerator is the total value of commodity \(j\) entering the warehouse in the time range \([t_{x}, t_{x} + L)\), and the denominator is the total value of all commodities entering the warehouse in that time range. In the second term, the numerator is the total value of commodity \(j\) exiting the warehouse in the time range \([t_{x}, t_{x} + L)\), and the denominator is the total value of all commodities exiting the warehouse in that time range.

Definition 3. Ratio of interval commodity trading to all dealers: the ratio of the commodity value of this dealer's logistics behavior on the commodity to the total commodity value of all dealers' logistics behavior on the same commodity, given by Eq. 12.

$$ Q_{{j,t_{x} }}^{i} = \alpha_{{j,t_{x} }}^{i} \frac{{U_{{j,t_{x} }}^{i} }}{{\mathop \sum \nolimits_{i} U_{{j,t_{x} }}^{i} }} + \beta_{{j,t_{x} }}^{i} \frac{{V_{{j,t_{x} }}^{i} }}{{\mathop \sum \nolimits_{i} V_{{j,t_{x} }}^{i} }} $$
(12)

In each term of the formula, the numerator is the total value of commodity \(j\) that the trader moved into (first term) or out of (second term) the warehouse within the time range, and the denominator is the corresponding total over all traders who traded that commodity during the same period.
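To make Definitions 1 to 3 concrete, the sketch below evaluates \(O\), \(P\) and \(Q\) (Eqs. 10 to 12) for one trader and commodity in a single window. The nested dictionaries `U[trader][commodity]` and `V[trader][commodity]`, the helper `_share` and the zero-denominator convention are assumptions made for illustration only.

```python
from typing import Dict, Tuple

Values = Dict[str, Dict[str, float]]  # values[trader][commodity]


def _share(part: float, whole: float) -> float:
    """Ratio part / whole, with 0 for an empty denominator (assumed convention)."""
    return part / whole if whole else 0.0


def indicators(U: Values, V: Values, i: str, j: str) -> Tuple[float, float, float]:
    """Return (O, P, Q) for trader i and commodity j in one sliding window."""
    u, v = U[i][j], V[i][j]
    o = u + v                                   # Eq. 10
    alpha, beta = _share(u, o), _share(v, o)    # Eqs. 8 and 9

    # Eq. 11: compare commodity j with all commodities of the same trader.
    p = (alpha * _share(u, sum(U[i].values()))
         + beta * _share(v, sum(V[i].values())))

    # Eq. 12: compare trader i with all traders on the same commodity.
    q = (alpha * _share(u, sum(U[a].get(j, 0.0) for a in U))
         + beta * _share(v, sum(V[a].get(j, 0.0) for a in V)))
    return o, p, q
```

With `U = {"a1": {"c1": 6, "c2": 4}, "a2": {"c1": 2, "c2": 0}}` and `V = {"a1": {"c1": 14, "c2": 6}, "a2": {"c1": 6, "c2": 0}}`, trader `a1` on commodity `c1` has \(\alpha = 0.3\) and \(\beta = 0.7\), giving \(O = 20\), \(P = 0.3 \cdot 6/10 + 0.7 \cdot 14/20 = 0.67\) and \(Q = 0.3 \cdot 6/8 + 0.7 \cdot 14/20 = 0.715\).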

4.2 Interval Anomaly Detection Algorithm for Logistics Characteristics

The main function of abnormal interval detection based on interval logistics characteristic indexes is to detect abnormal intervals by comparing the logistics characteristic indexes of a window with the normal indexes around it. The specific steps of the algorithm are given in Algorithm 1, and we describe its detailed process as follows:

1) Construct the index sequence composed of the sliding window and its surrounding time intervals. For the r-th index \(F_{j,r,t_{x}}^{i}\) of the j-th commodity logistics of the i-th trader in the time range \([t_{x}, t_{x} + L)\), the sequence of logistics indexes within the detection length \(H\) is given by Eq. 13:

$$ FS_{j,r,t_{x}}^{i} = \left\{ F_{j,r,t_{x} - \frac{H}{2}L}^{i}, F_{j,r,t_{x} - \left( \frac{H}{2} - 1 \right)L}^{i}, \ldots, F_{j,r,t_{x}}^{i}, \ldots, F_{j,r,t_{x} + \left( \frac{H}{2} - 1 \right)L}^{i}, F_{j,r,t_{x} + \frac{H}{2}L}^{i} \right\} $$
(13)

When checking whether a sliding window is abnormal relative to its surrounding windows, the surrounding windows are selected in steps of length L, which ensures that the windows being compared do not overlap. When building the sequence, the indices must not exceed the boundaries of the original indicator sequence [17].

2) Calculate the outlier score of the sliding window. The outlier score of the r-th index of the j-th commodity logistics of the i-th trader in the time range \([t_{x}, t_{x} + L)\) is given by Eq. 14:

$$ S_{{j,r,t_{x} }}^{i} = \frac{{F_{{j,r,t_{x} }}^{i} - median\left( {FS_{{j,r,t_{x} }}^{i} } \right)}}{{MAD_{{FS_{{j,r,t_{x} }}^{i} }} }} $$
(14)

Here \(median\) denotes the median of the samples in the sequence, and \(MAD\) denotes the Median Absolute Deviation of the samples [18].
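The score in Eq. 14 can be sketched with Python's standard `statistics` module. The helper `neighbourhood` reflects one reading of step 1, in which up to \(H/2\) windows on each side are picked at steps of \(L\) and indices outside the sequence are dropped; that reading, the function names and the zero-MAD fallback are assumptions, not details confirmed by the source.

```python
from statistics import median
from typing import List, Sequence


def neighbourhood(series: Sequence[float], x: int, step: int, h: int) -> List[float]:
    """Values at x - (h/2)*step, ..., x, ..., x + (h/2)*step, clipped to bounds.

    `series` holds one index value per sliding window; `step` is the window
    length L expressed in numbers of windows (an assumed discretization).
    """
    half = h // 2
    positions = [x + k * step for k in range(-half, half + 1)]
    return [series[pos] for pos in positions if 0 <= pos < len(series)]


def mad_score(value: float, samples: Sequence[float]) -> float:
    """Outlier score (value - median) / MAD of Eq. 14."""
    med = median(samples)
    mad = median([abs(s - med) for s in samples])
    if mad == 0:
        # Assumed fallback: identical samples give no scale to compare against.
        return 0.0
    return (value - med) / mad
```

For the samples `[1, 2, 3, 4, 100]` the median is 3 and the MAD is 1, so the value 100 receives a score of 97, while the value 3 receives a score of 0. Using the median and MAD instead of the mean and standard deviation keeps a single extreme window from inflating the scale it is compared against.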

3) Determine whether the sliding window is abnormal. The outlier score of each index is calculated, and the total outlier score is given by Eq. 15, where \(\rho_{r}\) is the weight parameter of the r-th index:

$$ S_{{j,t_{x} }}^{i} = \mathop \sum \limits_{r} \rho_{r} S_{{j,r,t_{x} }}^{i} $$
(15)

Whether the interval is abnormal is judged by comparing the total outlier score with the threshold \(\delta_{s}\). Weighting and fusing the outlier scores of indicators from multiple dimensions makes the anomaly detection algorithm more robust [19].
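A small Python sketch of the fusion in Eq. 15 follows. The source only says that the total score is compared with \(\delta_{s}\); treating a score strictly greater than the threshold as abnormal is an assumption, as are the function names.

```python
from typing import Sequence


def fused_outlier_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of per-index outlier scores (Eq. 15)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


def is_abnormal_window(scores: Sequence[float], weights: Sequence[float],
                       delta_s: float) -> bool:
    """Flag the window when the fused score exceeds delta_s (assumed direction)."""
    return fused_outlier_score(scores, weights) > delta_s
```

For instance, per-index scores `[97.0, 0.5, -1.0]` with weights `[0.5, 0.25, 0.25]` fuse to 48.375, which would be flagged under a threshold of 3.0, whereas scores `[0.5, 0.5, -1.0]` fuse to 0.125 and would not.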

4.3 Abnormal Logistics Feature Detection Based on Isolation Forest

In this paper, the method based on isolation forest [20] is used to perform anomaly detection on the data in the detected anomaly interval, which can improve the effectiveness of the whole anomaly detection algorithm and enhance the interpretability of the algorithm [21]. The specific steps of the abnormal logistics feature detection algorithm based on isolation forest are given in Algorithm 2, and we present its specific description as follows:

1) The newly constructed feature vector of each transaction is expressed as \(\left\langle p_{j,k}^{i}, v_{j,k}^{i}, f_{j,k}^{i}, An_{j,k}^{i} \right\rangle\);

2) For each transaction \(r_{j,k}^{i} = \left\langle s_{j,k}^{i}, e_{j,k}^{i}, p_{j,k}^{i}, v_{j,k}^{i}, f_{j,k}^{i} \right\rangle\), determine whether it falls within the m-th anomaly interval \([t_{s}^{m}, t_{e}^{m}]\), and mark it as:

$$ An_{j,k}^{i} = \begin{cases} 1, & \left( t_{s}^{m} \le s_{j,k}^{i} < t_{e}^{m} \text{ and } f_{j,k}^{i} = 1 \right) \text{ or } \left( t_{s}^{m} \le e_{j,k}^{i} < t_{e}^{m} \text{ and } f_{j,k}^{i} = -1 \right) \\ 0, & \text{otherwise} \end{cases} $$
(16)

The feature vector is then input into the isolation forest algorithm to determine whether each logistics behavior of the user is abnormal.
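The marking rule of Eq. 16 and the hand-off to an isolation forest can be sketched as follows. This is not the improved isolation forest of this paper: `detect` simply calls the stock `IsolationForest` from scikit-learn (a third-party package that must be installed separately), and the function names, the `contamination` value and the random seed are illustrative assumptions. The conditions on \(s\), \(e\) and \(f\) follow Eq. 16 as printed.

```python
from typing import List, Sequence, Tuple


def mark_anomalous(s: float, e: float, f: int,
                   intervals: Sequence[Tuple[float, float]]) -> int:
    """Return An = 1 if the transaction falls in any anomaly interval (Eq. 16)."""
    for t_s, t_e in intervals:
        if (f == 1 and t_s <= s < t_e) or (f == -1 and t_s <= e < t_e):
            return 1
    return 0


def feature_vector(p: float, v: float, f: int, an: int) -> List[float]:
    """Feature vector <p, v, f, An> of one transaction."""
    return [p, v, float(f), float(an)]


def detect(features: Sequence[Sequence[float]], contamination: float = 0.1,
           seed: int = 0):
    """Fit a stock isolation forest; returns -1 for outliers and 1 for inliers."""
    from sklearn.ensemble import IsolationForest  # imported lazily on purpose

    model = IsolationForest(n_estimators=100, contamination=contamination,
                            random_state=seed)
    return model.fit_predict(features)
```

Appending \(An\) to the feature vector lets the tree splits separate transactions inside flagged intervals from the rest, which is one plausible way the interval detection stage can inform the final per-transaction decision.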

5 Experiments

5.1 Experimental Environment

The experiments were run on a PC with the Windows 10 operating system and an AMD Ryzen 7 3800X 8-Core Processor at 3.89 GHz. The experimental code was implemented in Python 3.8.13.

The experimental data come from real industrial chain data. Because logistics data could leak traders' private information, all the data provided were desensitized according to the regulations on desensitization and encryption of sensitive information in industrial chain electronic transactions. All traded commodity names and dealer names in this paper are represented by desensitized strings.

5.2 Performance Indices and Benchmark Algorithms

Experimental Indicators. For classification problems, the following indicators are usually considered [21]:

1) Accuracy rate: the proportion of all transactions whose predicted label is consistent with the actual label.

2) Precision rate: the proportion of transactions that are actually abnormal among all transactions predicted to be abnormal.

3) Recall rate: the proportion of transactions predicted to be abnormal among all transactions that are actually abnormal.

4) F1 value: the harmonic mean of the precision rate and the recall rate.

5) Running time.
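For reference, the first four indicators can be computed from predicted and actual labels as in the sketch below, where label 1 denotes an abnormal transaction and 0 a normal one. The function name and the convention of returning 0 when a denominator is empty are assumptions for illustration.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = abnormal)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Because abnormal transactions are typically far rarer than normal ones, the accuracy rate alone can look high even for a detector that misses most anomalies, which is why precision, recall and F1 are reported alongside it.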

Comparison Algorithm. We chose the following three algorithms for comparison:

1) The Isolation Forest algorithm (IForest) [20], which randomly selects features from the original features to construct isolation trees and calculates the outlier value of a sample from its path length. The shorter the path length, the greater the outlier value;

2) The Histogram Based Outlier Score (HBOS) method [22], which divides each dimension of the data into intervals and calculates the outlier value of a sample from the density of the interval in which the sample is located. The lower the density, the greater the outlier value;

3) The K Nearest Neighbor (KNN) anomaly detection method [23], which calculates the outlier value of a sample from the distance between the sample and its surrounding points. The larger the distance, the larger the outlier value.

Parameter Settings. To verify the effectiveness of the algorithm on data sets of different sizes, the original data set was randomly sampled at proportions from 50% to 100%. 50 abnormal transactions are generated per day, where the number of items in each abnormal transaction is set to 100. The sequence detection length is H = 6.

5.3 Experimental Result

Table 2. Experimental index results under different transaction sample numbers

As shown in Table 2 and Fig. 1(a)(b)(c), the insider trading detection method based on time interval characteristics proposed in this paper outperforms the comparison algorithms in precision, recall rate and F1 value on transaction data sets of all sizes. Compared with the isolation forest algorithm, the proposed algorithm improves the accuracy index by 9.94%, 15.74%, 4.89%, 11.84%, 11.37% and 21.37%, respectively; the recall rate by 7.39%, 15.38%, 3.69%, 15.34%, 11.12% and 19.98%, respectively; and the F1 value by 8.65%, 15.55%, 4.29%, 13.59%, 11.24% and 20.68%, respectively.

Fig. 1. Variation of experimental indicators with the proportion of dataset size

As shown in Table 2 and Fig. 1(d), the running time of the proposed insider trading detection algorithm based on logistics time interval characteristics is not significantly different from that of the isolation forest method, and its execution time increases with the size of the transaction data set. The histogram-based anomaly detection algorithm takes the least time, and its execution time is barely affected by the size of the transaction data set. The anomaly detection method based on k-nearest neighbors takes the longest time, and its execution time increases significantly with the size of the transaction data set.

Results Analysis. In terms of precision, recall rate and F1 value, the proposed insider trading detection algorithm based on logistics time interval characteristics is superior to the comparison algorithms, but these indicators show no obvious trend as the data set grows. This is due to the randomness of sampling: transaction data sets of different sizes differ in their distribution of normal trading behavior. As for running time, the proposed algorithm differs little from the isolation forest method. Although the proposed algorithm needs extra time to detect the abnormal time intervals, the isolation forest method has a higher time dimension and also requires more time, so the running times of the two algorithms are not greatly different. Histogram-based anomaly detection takes the least time because its detection method is simple and fast to compute, so the increase in dataset size has no noticeable effect on it. The k-nearest neighbors algorithm takes the longest time because computing the distances to the K neighboring points is expensive, and as the proportion of the dataset size increases, it needs more running time for transaction detection.

6 Conclusions

This paper studies the detection of insider trading behavior based on logistics characteristics in the industrial chain trading market and proposes an insider trading detection algorithm based on logistics time interval characteristics. Firstly, because logistics has a long time span, a dynamic sliding window is used to extract the logistics transaction behavior within each time interval. Secondly, three characteristics, under the dimensions of own trading mode, commodity trading mode and dealer mode, are used to describe this behavior. Then, after the abnormal logistics time intervals are identified, overlapping abnormal time intervals are optimized into non-overlapping abnormal intervals by reducing the sliding window, which improves the accuracy of the algorithm. Finally, a method based on isolation forest is used to judge abnormal transaction behavior from the extracted effective logistics features, which improves the effect of the algorithm. Experiments on logistics data sets of real dealer trading behavior show that the proposed algorithm improves the F1 value by 20.68% compared with using isolation forest directly.