1 Introduction

Insurance companies are launching new policies with additional benefits to increase their customer base. Profit margins are under pressure, especially in vehicle insurance, because of heavy competition in the industry. Traditionally, vehicle insurance companies have charged premiums based on accumulated data such as age, gender, and type of vehicle. Customers now expect insurance policies that match their specific requirements. At the same time, companies want to charge higher premiums to customers with risky driving behaviour and lower premiums to those who drive safely, and to offer premium discounts as a reward for safe driving. Insurance companies are monitoring the real-time vehicle movement of their customers through the Internet of Things (IoT). By applying advanced analytics to this information, companies can derive better insights into their customers’ driving patterns.

The objective of this study is to provide an analytical approach to classifying driving behaviour, which can support risk-based insurance premiums. Studying driving patterns will also help insurance companies quickly determine fault when accidents take place, which is a key factor in claims settlement. In our study, driving patterns are categorized into three groups: Risky, Potential Risky, and Safe. The analytical approach used to categorize the driving patterns is outlined below.

Vehicle movement information is captured through sensors connected to the cars. The K-means clustering technique (Zaki and Meira 2014) is applied to form homogeneous groups of similar driving patterns. The number of clusters in K-means is decided using the Elbow rule and Hierarchical clustering (Zaki and Meira 2014). We studied the properties of each cluster through a Decision Tree (Zaki and Meira 2014) and labelled each cluster with the help of domain experts. The labelled data was divided into Train and Test sets using Stratified Random Sampling (Cochran 1977). The feature selection technique Fuzzy Forest (Conn et al. 2015) is used to select the important features for model building. A classifier is built using a Support Vector Machine (SVM) (Hastie et al. 2009) to classify driving behaviour, and the performance of the classification model is monitored post-deployment.

2 Our Approach

2.1 Data

The data comprises vehicle movement information captured through multiple sensors connected to the cars. A total of 23 features related to vehicle movement, engine condition, driving speed, weather condition, and other engine performance measures are collected along with a timestamp. The captured second-wise data is aggregated to minute level, using the mean for continuous variables and the mode for categorical variables, so that all 23 features are available minute-wise. We applied two pre-processing steps, outlier removal and data normalization, to clean the data.
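The aggregation step can be illustrated with a minimal sketch using only the Python standard library. The record layout (a `timestamp` in whole seconds, one continuous feature `speed`, one categorical feature `weather`) and the sample values are hypothetical. Min-max scaling is shown as one possible form of normalization; the study does not state which form was used.

```python
from collections import defaultdict
from statistics import mean, mode


def aggregate_to_minutes(records, continuous, categorical):
    """Aggregate second-wise records into one row per minute."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["timestamp"] // 60].append(rec)

    rows = []
    for minute in sorted(buckets):
        group = buckets[minute]
        row = {"minute": minute}
        for name in continuous:
            row[name] = mean([r[name] for r in group])  # mean for continuous
        for name in categorical:
            row[name] = mode([r[name] for r in group])  # mode for categorical
        rows.append(row)
    return rows


def min_max_normalize(rows, name):
    """Rescale one continuous feature to [0, 1] in place (assumed scheme)."""
    values = [r[name] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant feature
    for r in rows:
        r[name] = (r[name] - lo) / span


# Hypothetical second-wise readings covering two minutes.
records = [
    {"timestamp": 0, "speed": 40.0, "weather": "dry"},
    {"timestamp": 1, "speed": 50.0, "weather": "dry"},
    {"timestamp": 2, "speed": 60.0, "weather": "wet"},
    {"timestamp": 60, "speed": 80.0, "weather": "wet"},
    {"timestamp": 61, "speed": 100.0, "weather": "wet"},
]
minute_rows = aggregate_to_minutes(records, ["speed"], ["weather"])
min_max_normalize(minute_rows, "speed")
```

Outlier removal is omitted from the sketch because the study does not describe the rule that was applied.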

2.2 Clustering

The captured data is not labelled, so we need to derive categories from the data. We can study the inherent patterns of the data by applying an unsupervised machine learning technique such as clustering. To form homogeneous groups of similar driving patterns, the K-means clustering technique (Zaki and Meira 2014) is applied. The number of clusters (K) is decided through the Elbow rule and by inspecting the dendrogram of Hierarchical clustering. The Elbow method considers the total Within-Cluster Sum of Squares (WSS) as a function of the number of clusters: the number of clusters (K) is plotted on the X-axis and the total WSS on the Y-axis. The location of a bend (elbow) in the plot is taken as the optimal number of clusters. Three clusters were formed (K = 3). Figure 1 shows a two-dimensional view of the clusters formed through K-means. Each colour and shape represents a cluster.
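The Elbow rule can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the study's implementation: the two-dimensional toy points are invented, centroids are initialised with a deterministic farthest-point rule, and the elbow is located numerically by the largest second difference of the WSS curve, whereas the study reads the bend from a plot.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans_wss(points, k, iterations=50):
    """Run Lloyd's K-means and return the total within-cluster sum of squares."""
    # Farthest-point initialisation (an assumption made for determinism).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids)))

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_index(p, centroids)].append(p)
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]  # keep the centroid of an empty cluster
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated

    return sum(squared_distance(p, centroids[nearest_index(p, centroids)])
               for p in points)


def elbow_k(wss_by_k):
    """Pick K with the largest second difference of WSS (heuristic)."""
    ks = sorted(wss_by_k)  # assumes consecutive integer values of K
    curvature = {k: wss_by_k[k - 1] - 2 * wss_by_k[k] + wss_by_k[k + 1]
                 for k in ks[1:-1]}
    return max(curvature, key=curvature.get)


# Hypothetical data: three well-separated groups of three points each.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
    (5.0, 5.0), (5.2, 4.8), (4.8, 5.2),
    (9.0, 1.0), (9.2, 0.8), (8.8, 1.2),
]
wss_by_k = {k: kmeans_wss(points, k) for k in range(1, 7)}
best_k = elbow_k(wss_by_k)
```

On this toy data the WSS drops sharply up to K = 3 and flattens afterwards, which is the pattern the Elbow rule looks for.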

Fig. 1 Two-dimensional view of the clusters formed through K-means clustering

Here, Dim1 and Dim2 are the two-dimensional coordinates of the multidimensional data, derived using a distance matrix and classical Multidimensional Scaling (MDS) (Gower 1966). Figure 1 shows that the three clusters are well formed and that the observations in each cluster do not overlap with members of the other clusters.

2.3 Cluster Labelling

To derive the target variable, we need to study the cluster properties and label each cluster. We applied a rule-based technique, the Decision Tree (Zaki and Meira 2014), with the cluster number as the response variable and all other features as independent variables. The resulting rules provide concrete information about the behaviour of each cluster. Table 1 lists some of the rules extracted for each cluster.

Table 1 Rules extracted through Decision Tree to study properties of each cluster

After reviewing the rules formed for each cluster, we labelled the clusters, with the help of domain experts, as Risky, Potential Risky, and Safe zones. Observations falling in the ‘Risky’ cluster represent risky driving behaviour, and records falling in the ‘Safe’ cluster follow a safe driving pattern. Each record is labelled with its cluster name, which becomes the target variable with three levels: Risky, Potential Risky, and Safe.

2.4 Sampling

The data is divided into Train and Test sets in a 70:30 ratio using stratified random sampling (Cochran 1977). The Train data set is used for model building and the Test data set for model evaluation. Stratified sampling is used so that each category is fairly represented in the training data.
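A 70:30 stratified split can be sketched as follows, using only the Python standard library. The class counts and the random seed are hypothetical and chosen only to make the proportions easy to check. Each class is shuffled and cut separately, so the class mix of the Train set matches that of the full data.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])   # 70% of this class
        test.extend(indices[cut:])    # remaining 30% of this class
    return sorted(train), sorted(test)


# Hypothetical labelled records: 50 Safe, 30 Potential Risky, 20 Risky.
labels = ["Safe"] * 50 + ["Potential Risky"] * 30 + ["Risky"] * 20
train_idx, test_idx = stratified_split(labels)
train_risky = sum(1 for i in train_idx if labels[i] == "Risky")
```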

2.5 Feature Selection

Not all features may be relevant for model building, and the presence of irrelevant features can reduce the model’s predictive power and lead to low accuracy. The feature selection (Guyon and Elisseeff 2003) technique ‘Fuzzy Forest’ (Conn et al. 2015) was applied to the Train data to select a subset of important features. Fuzzy Forest is designed to provide relatively unbiased variable importance rankings when the features are correlated. Fuzzy Forest, run in the ‘R’ software, identified five variables (V3, V9, V14, V16, and V21) as important for building a classifier. These key variables are related to Speed, Torque, Pedal Position, Brake, and Weather.

2.6 Building Classifier

Using the features selected through Fuzzy Forest, we built classification models based on Support Vector Machine (SVM) (Hastie et al. 2009) and Random Forest (Breiman 2001) using the Train data. Model performance was evaluated on the Test data. The Support Vector Machine with a Radial kernel and tuned parameters (gamma, cost, and degree) achieved an accuracy of 98.2%, slightly higher than the 96% achieved by Random Forest. The SVM classifier is therefore used to categorize the driving patterns of new data as Safe, Potential Risky, or Risky.
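The evaluation on Test data can be sketched independently of any particular classifier. The snippet below computes overall accuracy, the measure reported above, from actual and predicted labels. It also computes per-class recall, which the study does not report and which is included here only as an assumed complementary check. The ten labels are invented for illustration.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Share of records whose predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def recall_per_class(actual, predicted):
    """For each actual class, the share of its records predicted correctly."""
    totals = Counter(actual)
    pairs = Counter(zip(actual, predicted))  # (actual, predicted) counts
    return {label: pairs[(label, label)] / totals[label] for label in totals}


# Hypothetical Test labels and classifier output.
actual = ["Safe", "Safe", "Safe",
          "Potential Risky", "Potential Risky", "Potential Risky",
          "Risky", "Risky", "Risky", "Risky"]
predicted = ["Safe", "Safe", "Potential Risky",
             "Potential Risky", "Potential Risky", "Potential Risky",
             "Risky", "Risky", "Risky", "Potential Risky"]

overall = accuracy(actual, predicted)
recalls = recall_per_class(actual, predicted)
```

Comparing two classifiers then amounts to running the same functions on each model's predictions for the same Test set.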

2.7 Analytical Results

The classifier based on Support Vector Machine is used to predict the driving patterns of new data. Figure 2 shows the period-wise predicted results for a single vehicle’s driving pattern.

Fig. 2 Period-wise predicted driving pattern of a vehicle in Risky, Potential Risky, and Safe zones

Here, a period represents a week, and each bar shows the percentage of time the vehicle was driven in the Safe, Potential Risky, and Risky categories. In Period3, the vehicle was driven in the Risky zone 27.2% of the time, and in Period4 it was driven more in the Potential Risky zone and less in the Safe zone.

Figure 3 represents the cumulative percentage of time a vehicle was driven in each category of Safe, Potential Risky, and Risky.

Fig. 3 Cumulative percentage of time a vehicle was driven in Potential Risky, Risky, and Safe zones

Across all periods, the vehicle was driven most often in the Potential Risky zone (35.3%) and least often in the Risky zone (29.9%).
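The period-wise and cumulative percentages can be derived from minute-level predictions as in the sketch below. The input layout, a list of `(period, zone)` pairs with one pair per driven minute, and the sample counts are hypothetical. They are not the values behind Figs. 2 and 3.

```python
from collections import Counter, defaultdict

ZONES = ("Safe", "Potential Risky", "Risky")


def zone_percentages(zones):
    """Percentage of minutes spent in each zone, rounded to one decimal."""
    counts = Counter(zones)
    total = len(zones)
    return {zone: round(100.0 * counts[zone] / total, 1) for zone in ZONES}


def period_wise_percentages(predictions):
    """Group (period, zone) pairs by period and summarise each period."""
    by_period = defaultdict(list)
    for period, zone in predictions:
        by_period[period].append(zone)
    return {period: zone_percentages(zones)
            for period, zones in sorted(by_period.items())}


# Hypothetical predictions: ten driven minutes in each of two periods.
predictions = (
    [("Period1", "Safe")] * 6
    + [("Period1", "Potential Risky")] * 3
    + [("Period1", "Risky")] * 1
    + [("Period2", "Safe")] * 2
    + [("Period2", "Potential Risky")] * 5
    + [("Period2", "Risky")] * 3
)
per_period = period_wise_percentages(predictions)
cumulative = zone_percentages([zone for _, zone in predictions])
```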

2.8 Performance Monitoring

Once the model has been deployed, we need to monitor its performance continuously to make sure that it performs as expected and that its predictive power does not degrade. To do this, we collect the model’s predicted results at regular intervals and examine whether the rules extracted earlier are still valid for each category. If the predicted results show any variation, we need to refit or rebuild the model using the latest data.
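One simple way to detect variation in the predicted results is to compare the share of each predicted category in a recent window against a baseline window. The sketch below uses the total variation distance and a threshold of 0.10. Both choices are assumptions made for illustration and are not taken from the study. Re-checking the extracted rules would be a separate step.

```python
from collections import Counter

ZONES = ("Safe", "Potential Risky", "Risky")


def class_shares(labels):
    """Share of predictions falling in each zone."""
    counts = Counter(labels)
    return {zone: counts[zone] / len(labels) for zone in ZONES}


def distribution_shift(baseline, recent):
    """Total variation distance between two predicted-class distributions."""
    p = class_shares(baseline)
    q = class_shares(recent)
    return 0.5 * sum(abs(p[zone] - q[zone]) for zone in ZONES)


def needs_review(baseline, recent, threshold=0.10):
    """Flag the model for refitting when the shift exceeds the threshold."""
    return distribution_shift(baseline, recent) > threshold


# Hypothetical predictions at deployment time and in a later window.
baseline = ["Safe"] * 50 + ["Potential Risky"] * 30 + ["Risky"] * 20
recent = ["Safe"] * 30 + ["Potential Risky"] * 30 + ["Risky"] * 40

shift = distribution_shift(baseline, recent)
flagged = needs_review(baseline, recent)
```

A shift can also reflect a genuine change in driving behaviour rather than model degradation, so a flag is a prompt for review and not proof that the model has degraded.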

3 Conclusion

In this study, we presented an approach to classify drivers’ driving patterns as Risky, Potential Risky, or Safe using advanced analytical techniques. Along with Support Vector Machine, we tried another classification technique, Random Forest; in comparison, SVM gave better results.

By quantifying the percentage of time a vehicle is driven in the Risky, Potential Risky, and Safe zones, insurance companies can set the premium based on the risk associated with each category. At the time of accident claim approval, this analysis will assist insurance companies by providing a clear picture of the driver’s driving behaviour.

This analysis will aid insurance companies in providing services tailored to their customers, which can ultimately lead to customer loyalty and healthy revenues.