1 Introduction

According to the World Health Organization (WHO) [1], approximately 1.4 million people died in traffic accidents in 2016, and it is estimated that more than 50 million people suffered severe injuries. In 2016, traffic accidents were the eighth leading cause of death overall and the leading cause of death among people between 15 and 29 years old. Vulnerable road users are heavily affected: more than half of road deaths occur among motorcyclists (28%), pedestrians (23%), and cyclists (3%). Mortality rates in low-income countries are 3 times higher than in high-income countries. Although only 1% of motor vehicles are in emerging countries, 13% of deaths occur in these nations [1].

In 2016, Colombia recorded a rate of 18.5 fatalities per 100,000 inhabitants. This figure is close to the global average (18.2) and the average for middle-income countries (18.8). Between 2012 and 2018, motorcyclists accounted for 50% of the victims, pedestrians for 24%, vehicle occupants for 17%, and cyclists for 5%.

The objective of this research was to analyze road crashes in Cartagena (Colombia) and the factors associated with collision and severity. Cartagena has 1.2 million inhabitants and more than 120,000 motor vehicles. In the last two years (2017–2018), Cartagena has remained among the capital cities with the most fatal accidents. In the last 8 years, it has been considered the fifth most dangerous city for road safety, after Medellin, Cali, Bogota, and Barranquilla.

2 Method

The method is based on official information from the control entities, to which data mining and machine learning techniques were applied. Five classifiers were used through the WEKA software: Decision Tree, Rule Induction, Support Vector Machines (SVM), Naïve Bayes, and Multilayer Perceptron. The decision tree constructs classification models in the form of trees. Rule Induction is an iterative process that follows a divide-and-conquer approach. Naïve Bayes is a classification algorithm based on Bayes' theorem. Multilayer Perceptron creates a feed-forward artificial neural network. Support vector machines are learning algorithms for classification and regression analysis. Data mining techniques previously applied to the prediction of road accidents include regression models [2], neural networks [3], artificial intelligence [4], decision trees [5], Bayesian networks [6], SVM [7], and combined methods. The aim is to establish a set of decision rules for defining countermeasures to improve road safety. The effectiveness of each algorithm was evaluated using 10-fold cross-validation. The methodological process of this investigation is presented in four steps: (a) pre-processing of the accident dataset; (b) application of data mining techniques through the WEKA software; (c) analysis of results and metrics; (d) definition of decision rules and analysis of associated factors.
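As an illustration of the evaluation scheme described above, the following is a minimal sketch of 10-fold cross-validation around a categorical Naïve Bayes classifier, written in plain Python. It is not the WEKA implementation used in the study. The class `CategoricalNaiveBayes`, the helper names, the Laplace smoothing, and the toy records are all assumptions made for this example.

```python
import math
import random
from collections import Counter, defaultdict


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


class CategoricalNaiveBayes:
    """Naive Bayes for categorical features with Laplace (add-one) smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_counts = Counter(y)
        self.n = len(y)
        self.counts = defaultdict(Counter)  # (feature index, class) -> value counts
        self.values = defaultdict(set)      # feature index -> values seen in training
        for row, label in zip(X, y):
            for j, v in enumerate(row):
                self.counts[(j, label)][v] += 1
                self.values[j].add(v)
        return self

    def predict_one(self, row):
        best, best_score = None, float("-inf")
        for c in self.classes:
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.class_counts[c] / self.n)
            for j, v in enumerate(row):
                num = self.counts[(j, c)][v] + 1
                den = self.class_counts[c] + len(self.values[j])
                score += math.log(num / den)
            if score > best_score:
                best, best_score = c, score
        return best


def cross_validate(X, y, k=10, seed=0):
    """Return the mean accuracy over k folds, each held out once for testing."""
    accuracies = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = CategoricalNaiveBayes().fit(
            [X[i] for i in train_idx], [y[i] for i in train_idx]
        )
        hits = sum(model.predict_one(X[i]) == y[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k


# Hypothetical toy records: (motorcycle involved, crash type) -> severity.
X = [("yes", "collision")] * 20 + [("no", "collision")] * 20
y = ["high"] * 20 + ["low"] * 20
mean_accuracy = cross_validate(X, y, k=10)
print(f"mean 10-fold accuracy: {mean_accuracy:.2f}")
```

Each record is used for testing exactly once and for training in the other nine folds, so the averaged score does not depend on a single train/test split.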

2.1 Data Sources

The traffic accident records come from the Cartagena database and cover January 2016 to December 2017. The dataset corresponds to the reports of the Administrative Department of Traffic and Transport (DATT). In total, 10,053 traffic accidents were reported by agents and police. The records include information on timing, road users, gender, and age. For the pre-processing of the dataset, 27 categorical variables were defined (see Table 1). The variables were classified into four categories: (1) road actors involved in the crash, (2) individuals involved, (3) weather conditions and timing, and (4) accident characteristics. Injury severity was coded at two levels: low-level-of-injury (material damages, minor, non-incapacitating injury) or high-level-of-injury (injured victims and fatality).
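The two-level severity coding can be sketched as a simple lookup. The outcome labels below are hypothetical stand-ins for the categories named in the text; the actual codes in the DATT records are not given in this paper, and the split of "minor" and "non-incapacitating injury" into two labels is an assumption.

```python
LOW = "low-level-of-injury"
HIGH = "high-level-of-injury"

# Hypothetical outcome labels; the real DATT category codes are not listed here.
SEVERITY_LEVEL = {
    "material_damage": LOW,
    "minor_injury": LOW,
    "non_incapacitating_injury": LOW,
    "injured_victim": HIGH,
    "fatality": HIGH,
}


def severity_level(outcome):
    """Map a reported outcome to the binary target; unknown labels raise an error."""
    try:
        return SEVERITY_LEVEL[outcome]
    except KeyError:
        raise ValueError(f"unrecognized outcome: {outcome!r}") from None


print(severity_level("material_damage"))
print(severity_level("fatality"))
```

Raising on unknown labels, rather than defaulting to one level, keeps coding errors in the raw reports visible during pre-processing.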

Table 1. Definitions and values of traffic accident variables from four categories in the dataset.

3 Results

The WEKA software was used to clean the data by means of the RemoveMisclassified filter and the removal of duplicated instances. The dataset was reduced to 7,894 instances. Table 2 summarizes the variables used and their relationship to severity. Of the analyzed instances, 16% were low-level-of-injury and 84% high-level-of-injury. In the descriptive analysis of the data, the greatest number of accidents occurs between cars (45%), followed by cars and heavy vehicles (28%), and finally cars and motorcycles (14%). Accidents between private and public vehicles are prevalent (44%). Accidents involving motorcyclists (76%) and bicycles (88%) are more severe. The most frequent type of crash is the collision (99%), and the most severe are being run over (100%) and falling off the vehicle (93%).
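The duplicate-removal step and the class split can be sketched as follows. This is a plain-Python illustration under assumed field names (`crash_type`, `motorcycle`, `severity`) and toy records. It does not reproduce WEKA's filters, and the misclassified-instance filtering is not covered.

```python
from collections import Counter


def drop_duplicates(records):
    """Remove exact duplicate records, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def class_shares(records, target="severity"):
    """Return the fraction of records in each class of the target variable."""
    counts = Counter(rec[target] for rec in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical toy records; field names are assumptions for this sketch.
records = [
    {"crash_type": "collision", "motorcycle": "yes", "severity": "high"},
    {"crash_type": "collision", "motorcycle": "yes", "severity": "high"},  # duplicate
    {"crash_type": "collision", "motorcycle": "no", "severity": "low"},
    {"crash_type": "run_over", "motorcycle": "no", "severity": "high"},
    {"crash_type": "fall", "motorcycle": "yes", "severity": "high"},
]
clean = drop_duplicates(records)
print(len(records), "->", len(clean))
print(class_shares(clean))
```

For scale, going from the 10,053 reported accidents to 7,894 instances removes 2,159 records, about 21.5% of the original dataset.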

Table 2. Statistics of road crashes for Cartagena in 2016–2017.

After the descriptive statistical analysis, an inferential and correlational analysis was performed on the variables proposed for the prediction of severity. The analyses applied were the Spearman correlation and the Friedman ANOVA. Table 3 summarizes the results.

Table 3. Statistical analysis of the results for the prediction of the severity.

The variables NOT, YB, YHV, YC, GHV, DW, TD, and TCD show no statistically significant association with the severity of the accident (p-value > 0.05). The variables with a statistically significant association (p-value < 0.05) are represented in the definition of the rules.
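For readers unfamiliar with the Spearman correlation mentioned above, the coefficient can be sketched as the Pearson correlation computed on average ranks, which handles the ties that coded categorical variables produce. The function names and the toy 0/1 coding below are assumptions for this example. The sketch computes only the coefficient, not the p-value, and it does not reproduce the values in Table 3.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman coefficient: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical coding: 1 = motorcycle involved / high severity, 0 = otherwise.
motorcycle = [1, 1, 1, 0, 0, 0, 1, 0]
severity = [1, 1, 1, 0, 0, 1, 1, 0]
print(round(spearman_rho(motorcycle, severity), 3))
```

A variable would then be kept for rule definition only if the associated p-value falls below 0.05, as stated above.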

After the data pre-processing, the selected data mining techniques were parameterized (see Table 4) and applied with 10-fold cross-validation. The results were compared using four metrics: precision, accuracy, recall, and area under the ROC curve (see Table 5). The prediction metrics are highly consistent across the applied techniques.

Table 4. Parameters settings for all classifiers.
Table 5. Comparison of all prediction results and performance metrics based on 10-fold cross-validation.
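The four metrics named above can be sketched for the binary severity target as follows, with high-level-of-injury taken as the positive class. The function names, the toy labels, and the scores are assumptions for this example. WEKA may report class-weighted averages, which this sketch does not reproduce.

```python
def binary_metrics(y_true, y_pred, positive="high"):
    """Accuracy, precision, and recall computed from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def roc_auc(y_true, scores, positive="high"):
    """Area under the ROC curve as the probability that a positive case
    receives a higher score than a negative case (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predictions and scores for five crashes.
y_true = ["high", "high", "high", "low", "low"]
y_pred = ["high", "high", "low", "high", "low"]
scores = [0.9, 0.8, 0.4, 0.6, 0.2]  # assumed predicted probability of "high"

print(binary_metrics(y_true, y_pred))
print(round(roc_auc(y_true, scores), 3))
```

With an 84% share of high-severity instances, accuracy alone can look strong for a classifier that always predicts "high", which is one reason to read it together with precision, recall, and the area under the ROC curve.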

Based on the best results of each technique, 12 priority decision rules for road safety were defined (see Table 6).

Table 6. List of rules identified with all methods applied.

4 Discussion

The results show cyclists and motorcyclists as the most vulnerable road users. Male and female motorcyclists between 20 and 39 years of age are predictive of high-severity accidents. When no motorcycles or cyclists are involved in the accident, the probable severity is low. A collision between two motorcycles is also considered to be of high severity. If the crash is a run-over, the severity is high, and it is inferred that the victim is a pedestrian. Finally, if the crash is between vehicles with rigid protection systems, such as cars, buses, or carriages, the severity of the accident tends to be lower.
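The findings above can be read as an ordered list of if-then rules, which is the form that rule induction produces. The sketch below encodes three of the statements in this paragraph as a decision list. It is a simplification for illustration and does not reproduce the 12 rules of Table 6; the field names (`crash_type`, `motorcycles`, `bicycles`) and rule labels are assumptions.

```python
# Ordered decision list paraphrasing the paragraph above, not Table 6.
# Each entry: (description, condition on a crash record, predicted severity).
RULES = [
    ("run-over crash",
     lambda c: c["crash_type"] == "run_over", "high"),
    ("collision between two motorcycles",
     lambda c: c["motorcycles"] >= 2, "high"),
    ("no motorcycle or bicycle involved",
     lambda c: c["motorcycles"] == 0 and c["bicycles"] == 0, "low"),
]


def apply_rules(crash):
    """Return (severity, rule description) for the first rule that fires,
    or (None, None) when no rule covers the record."""
    for description, condition, severity in RULES:
        if condition(crash):
            return severity, description
    return None, None


# Hypothetical crash records.
print(apply_rules({"crash_type": "run_over", "motorcycles": 0, "bicycles": 0}))
print(apply_rules({"crash_type": "collision", "motorcycles": 2, "bicycles": 0}))
print(apply_rules({"crash_type": "collision", "motorcycles": 0, "bicycles": 0}))
print(apply_rules({"crash_type": "collision", "motorcycles": 1, "bicycles": 0}))
```

The order matters: the run-over rule is checked first so that a pedestrian crash without motorcycles or bicycles is not labeled low by the third rule.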

This investigation analyzed the accident records of Cartagena (Colombia) with data mining techniques. The rules contribute to the definition of strategies for reducing the severity of accidents. The presence of vulnerable road users (motorcyclists, cyclists, and pedestrians) was a predictive variable for the severity of the accidents, consistent with the results of [4, 8].

Rules 1, 2, 3, 4, 6, 8 and 12 show that more than 50% of the rules are related to the users of motorcycles. Motorcyclists are a population with significant growth due to the conditions for mobility, transportation, sports and other economic activities [9,10,11]. The defined rules point to several countermeasures. These countermeasures are based on findings in the United States and the European Union on vulnerable users and can be replicated to reduce and eliminate road accidents. The policies, strategies and countermeasures to improve the road safety of motorcyclists, together with the recommendations of WHO [12], include: promoting culture and education in road safety [13]; analyzing and monitoring accident reports [14]; running road safety campaigns aimed at the most vulnerable users, aged 15 to 30 years [15]; promoting the use of protective elements [16]; restricting and punishing driving under the influence of drugs and alcohol [17]; controlling speed according to the road type [18]; improving the quality of roads or designing exclusive lanes [19]; improving mechanical conditions and maintenance [20]; improving the visibility of the motorcyclist [21]; improving road safety conditions such as lighting and infrastructure [22]; forbidding the transport and exposure of children on motorcycles [23]; penalizing violations and risky behaviors [24]; and restricting the manipulation of electronic devices while driving [25].

Finally, some additional countermeasures follow from the rules. First, road safety control plans can be defined according to the season: rules 3 and 4 indicate that some months demand more road control than others. Second, speed limits can be conditioned on the intensity of rainfall: rules 6 and 12 show that moderate and intense rains increase the possibility of accidents. Third, motorcyclist accidents can be avoided if interaction with other users is reduced, for example through exclusive circulation lanes: rules 1, 3, 4, 6, 8 and 12 involve a motorcycle interacting with another vehicle (e.g., trucks and cars). Rules 7 and 8 confirm that the age of the motorcyclist influences the severity of the accident, which supports licensing plans defined according to age. For example, in Europe, there are restrictions on circulation, speed, engine displacement, and violations according to age. Rule 11 is closely related to the high severity of run-over crashes. Defining effective countermeasures from these rules would benefit from new variables covering additional aspects such as accident location, intersections, road location, type of road, signaling, lighting, and time, among others.

5 Conclusion

In this study, young adult motorcyclists were associated with the factors that predict high severity (injury or fatality). Global records in 2016 placed Colombia tenth worldwide, third in the region, and second in South America for motorcyclist accidents. In 2018, Colombia had 8.3 million registered motorcycles. In the last 7 years (2012–2018), motorcyclists accounted for close to 50% of the road users killed or injured. Future research on the accident rate of motorcyclists in Colombia and its causes is essential to improve road safety. The main limitation of the study is the under-reporting of data by traffic control entities. If more causal variables are included (for example, state and road conditions), significantly more strategic rules for road safety can be created.

This investigation analyzed the accident records of Cartagena (Colombia) with data mining techniques. The rules contribute to the definition of countermeasures focused on vulnerable users to reduce the severity of accidents. Defining rules through data mining is more effective than analyzing the information with simple descriptive statistics. Because the information is analyzed in a correlational way, the results are easier to understand and apply. By contrast, techniques such as multivariate analysis or black-box models require additional steps to interpret the information.