Abstract
Insurance companies are seeing a significant drop in profit margins, particularly in vehicle insurance, because of heavy competition in the industry. They are trying to grow their customer base by retaining existing customers and launching new policies with additional benefits. Customers expect policies that match their requirements, while companies want to charge higher premiums for risky driving behaviour and lower premiums for safe driving. Insurers already use historical risk data and advanced analytics to reduce costs and improve profits, and they now capture real-time vehicle movement data through IoT to monitor the driving behaviour of their customers. By applying advanced analytics to this data, they can study customers' driving patterns and assess the risk involved. In this study, we present an analytical approach that categorizes driving patterns using machine learning techniques, leading to risk-based insurance premiums. The approach will help insurance companies provide personalized services to their customers and will assist them in approving claims when an accident takes place.
1 Introduction
Insurance companies are launching new policies with additional benefits to increase their customer base. Their profit margins are under pressure, especially in vehicle insurance, because of heavy competition in the industry. For years, vehicle insurers have charged premiums based on accumulated data such as age, gender, and type of vehicle. Customers expect policies that match their specific requirements, while companies want to charge higher premiums for risky driving behaviour, lower premiums for safe driving, and offer premium discounts as a reward for safe driving. Insurance companies now monitor the real-time vehicle movement of their customers through IoT. By analysing this information with advanced analytics, they can derive better insights into their customers' driving patterns.
The objective of the study is to provide an analytical approach to classify driving behaviour, which will lead to risk-based insurance premiums. Studying the driving pattern will also help insurance companies determine fault quickly when an accident takes place, which is a key factor in claims settlement. In our study, driving patterns are categorized into three groups: Risky, Potential Risky, and Safe. The approach is outlined below.
Vehicle movement information is captured through sensors connected to the cars. The K-means clustering technique (Zaki and Meira 2014) is applied to form homogeneous groups of similar driving patterns. The number of clusters in K-means is decided using the Elbow rule and Hierarchical clustering (Zaki and Meira 2014). We studied the properties of each cluster through a Decision Tree (Zaki and Meira 2014) and labelled each cluster with the help of domain experts. The labelled data was divided into Train and Test sets using Stratified Random Sampling (Cochran 1977). The feature selection technique Fuzzy Forest (Conn et al. 2015) is used to select the important features for model building. A classifier is built using a Support Vector Machine (SVM) (Hastie et al. 2009) to classify driving behaviour, and the performance of the classification model is monitored after deployment.
2 Our Approach
2.1 Data
The data comprise vehicle movement information captured through multiple sensors connected to the cars. A total of 23 features related to vehicle movement, engine condition, driving speed, weather condition, and other engine performance measures are collected along with a timestamp. The second-wise data is aggregated to minute level, using the mean for continuous variables and the mode for categorical variables, so that all 23 features are available for every minute. Outlier removal and data normalization were then applied as pre-processing steps to clean the data.
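The aggregation and normalization steps can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the feature names (`speed`, `weather`), the sample values, and the choice of min-max scaling are assumptions, since the text does not name the features or the normalization method.

```python
from collections import defaultdict
from statistics import mean, mode

# Hypothetical second-level records: (timestamp in seconds, speed, weather).
# Feature names and values are illustrative, not taken from the study's data.
records = [
    (0, 40.0, "dry"), (1, 42.0, "dry"), (2, 44.0, "wet"),
    (60, 80.0, "wet"), (61, 82.0, "wet"), (62, 90.0, "dry"),
]

def aggregate_by_minute(rows):
    """Mean for the continuous feature, mode for the categorical one."""
    buckets = defaultdict(list)
    for ts, speed, weather in rows:
        buckets[ts // 60].append((speed, weather))
    out = []
    for minute in sorted(buckets):
        speeds = [s for s, _ in buckets[minute]]
        weathers = [w for _, w in buckets[minute]]
        out.append({"minute": minute,
                    "speed": mean(speeds),
                    "weather": mode(weathers)})
    return out

def min_max_normalize(values):
    """Rescale values linearly to [0, 1] (an assumed normalization choice)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

minute_rows = aggregate_by_minute(records)
normalized_speed = min_max_normalize([r["speed"] for r in minute_rows])
```

Each minute ends up with one row, so continuous and categorical features share the same time grid before clustering.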
2.2 Clustering
The captured data is not labelled, so we need to derive categories from the data itself. We can study its inherent patterns by applying an unsupervised machine learning technique such as clustering. To form homogeneous groups of similar driving patterns, the K-means clustering technique (Zaki and Meira 2014) is applied. The number of clusters (K) is decided using the Elbow rule and by inspecting the dendrogram of Hierarchical clustering. The Elbow method looks at the total Within-Cluster Sum of Squares (WSS) as a function of the number of clusters: the WSS is plotted on the Y-axis against the number of clusters (K) on the X-axis, and the location of the bend (elbow) in the plot is taken as the optimal number of clusters. For this data, three clusters are formed (K = 3). Figure 1 represents the two-dimensional view of the clusters formed through K-means. Each colour and shape represents a cluster.
Here, Dim1 and Dim2 are the two-dimensional coordinates of the multidimensional data, derived using a distance matrix and classical Multidimensional Scaling (MDS) (Gower 1966). Figure 1 shows that the three clusters are well formed and that the observations in each cluster do not overlap with members of the other clusters.
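The Elbow rule described above can be illustrated with a small, self-contained sketch. It is not the code used in the study: the two-dimensional toy points, the farthest-first seeding, and the use of the largest second difference of the WSS curve to locate the bend are all assumptions made for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first_init(points, k):
    """Deterministic seeding: start from the first point, then repeatedly
    add the point farthest from the centres chosen so far."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    return centres

def kmeans_wss(points, k, max_iter=100):
    """Run Lloyd's algorithm; return the total within-cluster sum of squares."""
    centres = farthest_first_init(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda j: sq_dist(p, centres[j]))
                          for p in points]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(sq_dist(p, centres[a]) for p, a in zip(points, assignment))

# Toy data: three tight groups of nine points each (illustrative only).
offsets = (-0.5, 0.0, 0.5)
blob_centres = [(0.0, 0.0), (10.0, 1.0), (4.0, 8.0)]
points = [(cx + dx, cy + dy)
          for cx, cy in blob_centres for dx in offsets for dy in offsets]

wss = {k: kmeans_wss(points, k) for k in range(1, 6)}

# Locate the bend as the K with the largest second difference of the curve.
elbow_k = max(range(2, 5),
              key=lambda k: (wss[k - 1] - wss[k]) - (wss[k] - wss[k + 1]))
```

On real data the bend is usually judged visually from the plot, alongside the dendrogram, rather than by a single numeric criterion.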
2.3 Cluster Labelling
To derive the target variable, we need to study the cluster properties and label each cluster. We applied the rule-based Decision Tree technique (Zaki and Meira 2014), with the cluster number as the response variable and all other features as independent variables. The rules formed provide concrete information about the behaviour of each cluster. Table 1 represents some of the rules extracted from each cluster.
After reviewing the rules formed for each cluster, we labelled the clusters as Risky, Potential Risky, and Safe with the help of domain experts. Observations falling under the Risky cluster show risky driving behaviour, and records falling under the Safe cluster follow a safe driving pattern. Each record is labelled with its cluster name, which becomes the target variable with three levels: Risky, Potential Risky, and Safe.
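To show how a tree turns cluster membership into a readable rule, the sketch below finds the single best split by Gini impurity, which is what a decision tree does at its root node. It is a simplified stand-in for the full Decision Tree used in the study, and the features (`speed`, `brake`), values, and cluster numbers are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, clusters, feature_names):
    """Return (feature, threshold, weighted_gini) of the best single split."""
    best = None
    for f, name in enumerate(feature_names):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [c for row, c in zip(rows, clusters) if row[f] <= t]
            right = [c for row, c in zip(rows, clusters) if row[f] > t]
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (name, t, score)
    return best

# Hypothetical minute-level rows (speed, brake) with their cluster numbers.
feature_names = ["speed", "brake"]
rows = [(30.0, 0.1), (35.0, 0.3), (40.0, 0.2),
        (95.0, 0.2), (100.0, 0.1), (110.0, 0.3)]
clusters = [1, 1, 1, 2, 2, 2]

rule = best_split(rows, clusters, feature_names)
print(f"cluster 1 if {rule[0]} <= {rule[1]}, else cluster 2")
```

A full tree repeats this search inside each branch, and the path from the root to a leaf is the kind of rule that domain experts can review when naming a cluster.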
2.4 Sampling
The data is divided into Train and Test sets in the ratio 70:30 using stratified random sampling (Cochran 1977). The Train set is used for model building and the Test set for model evaluation. Stratified sampling keeps the proportion of each category the same in both sets, so every category is fairly represented in the training data.
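A stratified 70:30 split can be sketched as below. The class counts and the fixed random seed are illustrative assumptions; only the ratio and the stratification come from the text.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)  # random sampling within the stratum
        cut = round(len(idx) * train_frac)
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)

# Hypothetical labelled data set with unequal class sizes.
labels = ["Safe"] * 50 + ["Potential Risky"] * 30 + ["Risky"] * 20
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
```

Because the split is made inside each category, a small category such as Risky keeps the same 70:30 proportion as the larger ones.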
2.5 Feature Selection
Not all features are relevant for model building, and irrelevant features can reduce the model's predictive power and ultimately lower its accuracy. The feature selection (Guyon and Elisseeff 2003) technique Fuzzy Forest (Conn et al. 2015) was applied to the Train data to select a subset of important features. Fuzzy Forest is designed to give unbiased variable importance rankings when the features are correlated. Run in R, Fuzzy Forest identified five variables, V3, V9, V14, V16, and V21, as important for building a classifier. These key variables relate to Speed, Torque, Pedal Position, Brake, and Weather.
2.6 Building Classifier
Using the features selected through Fuzzy Forest, we built two classification models on the Train data: a Support Vector Machine (SVM) (Hastie et al. 2009) and a Random Forest (Breiman 2001). Model performance was evaluated on the Test data. The SVM with a radial kernel and tuned parameters (gamma and cost) reached an accuracy of 98.2%, slightly higher than the 96% of the Random Forest. The SVM classifier is therefore used to categorize the driving patterns of new data as Safe, Potential Risky, or Risky.
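Two ingredients of this step can be made concrete with a short sketch: the radial kernel whose width is controlled by gamma, and the accuracy measure used to compare models on the Test data. The sketch does not train an SVM, and the label vectors are made-up examples rather than results from the study.

```python
import math
from collections import Counter

CLASSES = ["Safe", "Potential Risky", "Risky"]

def rbf_kernel(x, z, gamma):
    """Radial kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def accuracy(actual, predicted):
    """Share of records whose predicted class matches the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def confusion_matrix(actual, predicted):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in CLASSES] for a in CLASSES]

# Hypothetical Test labels and model predictions (illustrative only).
actual = ["Safe", "Safe", "Safe", "Potential Risky", "Potential Risky",
          "Risky", "Risky", "Risky"]
predicted = ["Safe", "Safe", "Potential Risky", "Potential Risky",
             "Potential Risky", "Risky", "Risky", "Potential Risky"]

test_accuracy = accuracy(actual, predicted)
matrix = confusion_matrix(actual, predicted)

# A larger gamma makes similarity fall off faster with distance.
near = rbf_kernel((0.2, 0.4), (0.3, 0.5), gamma=0.5)
sharp = rbf_kernel((0.2, 0.4), (0.3, 0.5), gamma=2.0)
```

The confusion matrix shows which categories are mistaken for each other, which overall accuracy alone does not reveal.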
2.7 Analytical Results
The SVM-based classifier is used to predict the driving patterns of new data. Figure 2 represents the period-wise predicted results of a single vehicle's driving pattern.
Here, a period represents a week, and each bar shows the percentage of time the vehicle was driven in the Safe, Potential Risky, and Risky categories. In Period3, the vehicle was driven in the Risky zone 27.2% of the time, and in Period4 it was driven more in the Potential Risky zone and less in the Safe zone.
Figure 3 represents the cumulative percentage of time a vehicle was driven in each category of Safe, Potential Risky, and Risky.
Across all periods, the vehicle was driven most in the Potential Risky zone (35.3%) and least in the Risky zone (29.9%).
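The period-wise and cumulative percentages shown in the figures can be computed from minute-level predictions as in the following sketch. The period names and prediction counts are hypothetical and do not reproduce the figures.

```python
from collections import Counter

CATEGORIES = ["Safe", "Potential Risky", "Risky"]

def share_by_category(labels):
    """Percentage of minutes spent in each category."""
    counts = Counter(labels)
    total = len(labels)
    return {c: 100.0 * counts[c] / total for c in CATEGORIES}

# Hypothetical minute-level predictions for one vehicle, grouped by period.
predictions = {
    "Period1": ["Safe"] * 6 + ["Potential Risky"] * 3 + ["Risky"] * 1,
    "Period2": ["Safe"] * 2 + ["Potential Risky"] * 4 + ["Risky"] * 4,
}

# One set of percentages per period (the bars of a period-wise chart).
period_share = {p: share_by_category(m) for p, m in predictions.items()}

# Cumulative view: pool the minutes of all periods before computing shares.
all_minutes = [label for m in predictions.values() for label in m]
overall_share = share_by_category(all_minutes)
```

Pooling the minutes weights each period by the time actually driven, which a plain average of the period percentages would not do.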
2.8 Performance Monitoring
Once the model has been deployed, we need to monitor its performance continuously to make sure that it performs as expected and that its predictive power does not degrade. To do this, we collect the predicted results of the model at regular intervals and examine whether the rules extracted earlier still hold for each category. If the predicted results deviate from those rules, we need to refit or rebuild the model using the latest data.
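One way to make this check operational is sketched below: for each category, measure the fraction of newly predicted records that still satisfy a rule extracted for that category, and flag the model when that fraction falls well below its value at deployment. The rule (`speed > 90`), the baseline value, and the 10-point tolerance are assumptions for illustration, not values from the study.

```python
def rule_support(records, predicted, category, rule):
    """Fraction of records predicted as `category` that satisfy `rule`."""
    hits = [rule(r) for r, p in zip(records, predicted) if p == category]
    return sum(hits) / len(hits) if hits else None

def needs_refit(baseline_support, current_support, tolerance=0.10):
    """Flag the model when rule support drops by more than `tolerance`."""
    if current_support is None:
        return True  # the category no longer appears in the predictions
    return baseline_support - current_support > tolerance

# Hypothetical rule for the Risky cluster and a recent batch of predictions.
risky_rule = lambda r: r["speed"] > 90
records = [{"speed": 110}, {"speed": 95}, {"speed": 100},
           {"speed": 60}, {"speed": 45}]
predicted = ["Risky", "Risky", "Risky", "Risky", "Safe"]

current = rule_support(records, predicted, "Risky", risky_rule)
refit = needs_refit(baseline_support=0.95, current_support=current)
```

In practice the same check would run for every category and rule at each monitoring interval, with the tolerance agreed with domain experts.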
3 Conclusion
In this study, we presented an approach to classify drivers' driving patterns as Risky, Potential Risky, and Safe using advanced analytical techniques. Along with the Support Vector Machine, we tried another classification technique, Random Forest, and the SVM gave better results.
By quantifying the percentage of time a vehicle is driven in the Risky, Potential Risky, and Safe zones, insurance companies can set the premium based on the risk associated with each category. When an accident claim is being approved, this analysis will assist insurance companies by giving a clear picture of the driver's driving behaviour.
This analysis will also help insurance companies provide services tailored to individual customers, which should ultimately lead to customer loyalty and healthy revenues.
References
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Cochran, W. G. (1977). Sampling techniques (3rd ed.). Wiley.
Conn, D., Ngun, T., Li, G., et al. (2015). Fuzzy forests: extending random forests for correlated, high-dimensional data (Research Rep.). Department of Biostatistics, UCLA.
Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 325–338.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). Springer.
Tan, P. N., Steinbach, M., & Kumar, V. (2014). Introduction to data mining (2nd ed.). Pearson Education.
Zaki, M. J., & Meira, W. (2014). Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press.
Acknowledgements
I sincerely thank Mr. Kumar G. N. for his continuous support and timely inputs. I would also like to thank Mr. Kamesh J. V. and my colleagues who encouraged me during this journey.
© 2019 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Valluru, S. (2019). Connected Cars and Driving Pattern: An Analytical Approach to Risk-Based Insurance. In: Laha, A. (eds) Advances in Analytics and Applications. Springer Proceedings in Business and Economics. Springer, Singapore. https://doi.org/10.1007/978-981-13-1208-3_12
DOI: https://doi.org/10.1007/978-981-13-1208-3_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1207-6
Online ISBN: 978-981-13-1208-3
eBook Packages: Business and Management (R0)