
1 Introduction

Distracted driving has raised significant safety concerns in the automotive industry in recent years, with increases of up to 22% in road accidents, including both fatal crashes and near-crashes [1]. Recent statistics also reveal an 8% surge in distraction-related road accidents over the course of five years, encompassing both fatal collisions and near-crash scenarios [2]. This trend translates to an average of approximately 2,924 distraction-affected fatal traffic crashes per year [2]. The issue not only affects road safety but also has economic implications, resulting in substantial losses in developing countries [3]. Moreover, distracted driving leads to higher rates of road fatalities, injuries, and emotional distress for individuals and their loved ones [4].

Despite the deployment of Advanced Driver Assistance Systems (ADAS), distracted driving remains an ongoing challenge [5]. A recent study has shown that drivers engage in distracting activities even with ADAS features enabled: 68% of drivers using lane-keeping assistance were still prone to distractions such as texting or adjusting the radio, resulting in near-crashes or crashes [6]. Similarly, another study found that 45% of drivers involved in crashes with vehicles equipped with lane departure warning and forward collision warning systems were distracted by activities such as using a mobile phone or reaching for an object, leading to failures in responding to warnings [7].

Additionally, ADAS features are often limited to premium vehicles, leaving a significant number of drivers without access to these safety features [8]. Even in vehicles equipped with high levels of driving autonomy, such as the autonomous Uber and Tesla Model S vehicles, human drivers are susceptible to distraction, resulting in accidents [9]. There is therefore a critical need to detect and mitigate driver distraction in real-time to ensure road safety. To address this overarching problem, a hybrid approach is adopted that integrates Convolutional Neural Networks (CNNs) and eye movement tracking. By leveraging these technologies, this study aims to detect and mitigate manual, visual, and cognitive distractions in real-time through a proposed prototype, while attempting to surpass the limitations of previous approaches. As such, the outcome of this study is expected to advance driver safety, reduce the impact of distracted driving, and bring benefits to individuals and society.

This paper is structured as follows: In the next section, background on distracted driving is provided and prior research is critically reviewed. Subsequently, the implementation process of the proposed prototype is thoroughly explained in Sect. 3 prior to discussing the evaluation method used to rigorously assess the prototype’s effectiveness. The paper then concludes with a concise summary of findings and discussions.

2 Background and Related Works

Distracted driving is a prevalent issue, with research indicating that a significant number of drivers engage in secondary tasks while driving [10]. There are different types of distraction, and these are:

  • Manual distraction

    Drivers are manually distracted when they take their hands off the steering wheel to perform secondary tasks, such as using vehicle controls, drinking, eating, smoking, reaching for in-vehicle objects, or manipulating mobile phones while driving.

  • Visual distraction

    Visual distraction occurs when a driver takes their eyes off the road, often due to the presence of attention-grabbing stimuli. For example, a study sponsored by the AAA Foundation, which used in-vehicle video recordings of teen drivers, found that glancing away from the forward roadway for more than 2 s increased the risk of a crash or near-crash to over two times that of normal driving [11].

  • Cognitive distraction

    Cognitive distraction occurs when the focus required for safe driving is redirected toward other secondary tasks, resulting in divided attention [12]. An example of cognitive distraction during driving is when a driver’s mind is preoccupied with thoughts, daydreaming, or engaging in complex mental tasks unrelated to driving. This can include being deep in thought about personal issues, work-related stress, planning future events, or even being engrossed in intense conversations or arguments with passengers. Cognitive distractions divert the driver’s attention away from the road and can impair their ability to react to sudden changes or hazards, increasing the risk of accidents.

As related works, existing studies have utilized various physiological sensors to detect cognitive distractions while driving. For example, Electroencephalography (EEG) signals have been employed to measure brain activity [13], Electrocardiogram (ECG) signals have recorded heart activity [14], and physical eye-tracking sensors have monitored eye pupil diameter [15], providing valuable insights such as fluctuations in these readings during the detection process. However, these sensors can be intrusive and potentially cause additional distractions for the driver [16]. In contrast, computer vision techniques, particularly deep learning, have gained popularity for non-invasive, accurate, and fast real-time detection of distracted driving [17].

Recent research has focused on utilizing deep learning algorithms to detect driver distraction using the State Farm Distracted Driver Detection Dataset, a comprehensive and widely used dataset in the field [16, 18]. The dataset encompasses many driving scenarios, providing valuable insights into real-world distracted driving situations [19]. Among the tested models in existing literature, the MobileNetV2-tiny has demonstrated exceptional accuracy (99.88%) in detecting distracted driving [20].

This highlights the effectiveness of deep learning models in addressing the challenges of distracted driving detection. Although previous studies have primarily focused on detecting distracted driving, there exist gaps in simultaneously addressing mitigation strategies and exploring cognitive distraction detection. Additionally, the utilization of deep learning models to detect driver cognitive distractions using eye-tracking features requires further investigation [21].

3 Implementation of the Distracker

To address the gaps discussed in the previous section, this study aims to develop an in-vehicle smart agent prototype capable of accurately detecting and mitigating manual, visual, and cognitive distractions in real-time using deep learning and eye-tracking algorithms. By incorporating multimodal audio and visual alerts, the proposed system aims to effectively reduce distracted driving instances, thus enhancing road safety, reducing accidents, and improving driver focus and attention. By addressing these research gaps, the study intends to contribute to the advancement of techniques for detecting and mitigating distracted driving, ultimately making driving a safer and more secure experience for everyone on the road.

In order to fulfil the key objective of this paper, an in-vehicle assistant called ‘Distracker’ was designed and implemented. Distracker uses an eye-tracking algorithm and a deep learning model to detect distracted driving and issue real-time multi-modal warnings to the driver in order to encourage safe driving. The architecture of the system is depicted in Fig. 1. The Distracker utilizes a Raspberry Pi 3B+ for data processing, with two webcams capturing images of the driver from the front and the side. The Driver Distraction Detection Algorithm analyses frames and triggers multi-modal warnings (audio and LED) for distractions. Components include Logitech cameras, speakers, a Max7219 LED matrix, and a Xiaomi power supply. The Distracker prototype was developed using the Python programming language and various open-source libraries such as Dlib, OpenCV2, TensorFlow Lite, Pygame, and the Luma matrix library. Google Colab Pro and Google Drive were also used for training the model and storing the dataset, respectively. The two tables below provide a comprehensive insight into the development process of the Distracker prototype (Tables 1 and 2), and a sketch of the overall processing loop is given after Fig. 1.

Table 1. Libraries used in the development of the Distracker prototype and their corresponding functions.
Table 2. Components used in the development of the Distracker prototype and their corresponding functions.
Fig. 1. Architecture of the Distracker system.
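To make the data flow concrete, the following is a minimal sketch of how such a capture-classify-warn loop could be organized with the libraries listed in Table 1. The helper objects (`detector`, `eye_tracker`), the audio file name, and the SPI wiring are illustrative assumptions, not the authors' actual code.

```python
import cv2
import pygame
from luma.core.interface.serial import spi, noop
from luma.core.render import canvas
from luma.led_matrix.device import max7219

front_cam = cv2.VideoCapture(0)   # front view: driver's face and eyes
side_cam = cv2.VideoCapture(1)    # side view: posture, hands, and controls
pygame.mixer.init()
alert_sound = pygame.mixer.Sound("warning.wav")    # hypothetical audio asset
led = max7219(spi(port=0, device=0, gpio=noop()))  # Max7219 LED matrix

def warn():
    """Issue a multi-modal (audio + LED) warning."""
    alert_sound.play()
    with canvas(led) as draw:
        draw.rectangle(led.bounding_box, outline="white")  # light the matrix

def run(detector, eye_tracker):
    """detector: distraction classifier; eye_tracker: gaze/fixation module."""
    while True:
        ok_front, front_frame = front_cam.read()
        ok_side, side_frame = side_cam.read()
        if not (ok_front and ok_side):
            continue
        # Manual/visual distraction from the side view; cognitive from the front
        if detector.is_distracted(side_frame) or eye_tracker.is_fixated(front_frame):
            warn()
```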

To assemble this setup, the Raspberry Pi 3B+ was utilized, accompanied by its Luma LED matrix, power bank, speakers, and dual cameras. These elements were securely installed on the car’s dashboard, as showcased in Fig. 2. The incorporation of flex tape ensured stable attachment, preventing any risk of components detaching. Notably, the two cameras were strategically positioned to effectively capture complete frames of the driver.

Fig. 2. Positioning of the various components within the setup.

In terms of the underlying mechanism to detect distractions, a hybrid approach was adopted by integrating Convolutional Neural Networks (CNNs) and eye movement tracking. As part of the integrated approach, Distracker uses the State Farm Distracted Driver Detection (SFDDD) Dataset, which consists of 10 classes representing distracted driving scenarios [19]. The dataset includes 22,424 labelled training images and 79,726 unlabelled test images. After the dataset was downloaded from Kaggle, pre-processing was done. Images were horizontally flipped so that the driver appears on the right-hand side, matching the right-hand-drive vehicles used in Mauritius. The labelled training set was then split into 70% for training, 15% for validation, and 15% for testing, yielding train, validation, and test folders of 15,696, 3,363, and 3,363 images, respectively. A sketch of this pre-processing step is given below.
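The following minimal sketch illustrates the flip-and-split step under the assumptions of a per-class Kaggle folder layout and illustrative paths; the random seed and exact directory names are not from the paper.

```python
import os
import random
import cv2

SRC = "state-farm/train"   # labelled SFDDD images (one folder per class)
DST = "dataset"            # output root for train/val/test folders

random.seed(42)            # reproducible split (seed is an assumption)
for class_dir in sorted(os.listdir(SRC)):
    images = sorted(os.listdir(os.path.join(SRC, class_dir)))
    random.shuffle(images)
    n = len(images)
    splits = {
        "train": images[: int(0.70 * n)],               # 70%
        "val":   images[int(0.70 * n): int(0.85 * n)],  # 15%
        "test":  images[int(0.85 * n):],                # 15%
    }
    for split, names in splits.items():
        out_dir = os.path.join(DST, split, class_dir)
        os.makedirs(out_dir, exist_ok=True)
        for name in names:
            img = cv2.imread(os.path.join(SRC, class_dir, name))
            img = cv2.flip(img, 1)   # flag 1 = horizontal flip (mirror image)
            cv2.imwrite(os.path.join(out_dir, name), img)
```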

In addition, Distracker implements the MobileNetV2 architecture, a deep-learning model that processes images of fixed dimensions. A representation of the architecture is illustrated in Fig. 3; it was applied to the SFDDD dataset by resizing the images to meet the required input dimensions. The architecture includes bottleneck layers and an output dense activation layer. Transfer learning was employed by training the MobileNetV2 model with the dataset and modifying it for improved performance. This involved removing the original model’s last layer and adding layers such as GlobalAveragePooling2D, Dense with ReLU activation, a Dropout layer, and a Softmax layer. The modified architecture had ten classification nodes, which was found to be suitable for transfer learning in this study. A sketch of this setup is shown below.
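A minimal Keras sketch of the transfer-learning head described above follows. The 224×224 input size is MobileNetV2's standard dimension, and the dense width and dropout rate are assumptions since the paper does not state them.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),  # MobileNetV2's standard input size (assumed)
    include_top=False,          # drop the original classification layer
    weights="imagenet",
)
base.trainable = False          # freeze the pre-trained weights initially

model = models.Sequential([
    layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1] inputs
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),      # dense width is an assumption
    layers.Dropout(0.5),                       # dropout rate is an assumption
    layers.Dense(10, activation="softmax"),    # ten SFDDD classification nodes
])
```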

Fig. 3. The MobileNetV2 model architecture applied to Distracker.

After preprocessing, the model was trained using high-performance GPU runtimes on Google Colab Pro. Transfer learning, together with metrics used in recent existing research, was employed to train the additional layers integrated into the MobileNetV2 pre-trained model [20]. Categorical cross-entropy was chosen as the loss function, and model checkpointing was utilized to save the best model based on validation loss. During the training process, there was a notable improvement in accuracy: by the 30th epoch, the model’s accuracy had increased substantially from 78.70% to 99.66%. The validation accuracy began at a high level of 92.47% and reached 98.12% in the final epoch. Both training and validation losses consistently decreased, with the validation loss reducing from 0.2344 to 0.0521. Overall, the model demonstrated exceptional performance in effectively generalizing distracted driving scenarios (Fig. 4). A sketch of this training configuration follows.
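The training configuration can be sketched as below, using the categorical cross-entropy loss and validation-loss checkpointing described above; the optimizer, learning rate, and folder paths are assumptions.

```python
# Build datasets from the split folders produced during pre-processing.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset/train", image_size=(224, 224), label_mode="categorical")
val_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset/val", image_size=(224, 224), label_mode="categorical")

# Keep only the weights that minimise validation loss.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.h5", monitor="val_loss", save_best_only=True)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # assumed optimizer/rate
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
history = model.fit(train_ds, validation_data=val_ds,
                    epochs=30, callbacks=[checkpoint])
```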

Fig. 4. a) Training and validation loss curves for the trained MobileNetV2 model; b) training and validation accuracy curves for the trained MobileNetV2 model.

To further enhance the model’s performance, the last 8 layers of the pre-trained model were unfrozen and fine-tuned for 20 epochs, with the validation dataset used to monitor progress. The fine-tuned model achieved high accuracy on both the training set (99.77%) and the validation set (99.45%), accompanied by low losses (0.0092 and 0.0244, respectively) (Fig. 5). A sketch of this fine-tuning step is given below.
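Continuing the earlier sketch, the fine-tuning step unfreezes only the last 8 layers of the base; the reduced learning rate is an assumption, as fine-tuning conventionally uses a much smaller rate than initial training.

```python
# Unfreeze only the last 8 layers of the pre-trained base for fine-tuning.
base.trainable = True
for layer in base.layers[:-8]:
    layer.trainable = False

# Recompile at a much lower learning rate (the exact rate is an assumption).
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_ds, validation_data=val_ds, epochs=20, callbacks=[checkpoint])
```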

Fig. 5. a) Training and validation loss curves for the fine-tuned MobileNetV2 model; b) training and validation accuracy curves for the fine-tuned MobileNetV2 model.

The fine-tuned model was converted to TensorFlow Lite format to enable deployment on the Raspberry Pi 3B+. During the conversion process, a technique called quantization was applied. This involved removing unnecessary layers and optimizing the model to achieve a favorable balance between accuracy and speed. The quantization process ensured that the model maintained an appropriate average frame rate of 3.78 fps and a memory size of 13.819 megabytes (MB), making it well-suited for deployment on the Raspberry Pi 3B+. A sketch of the conversion is shown below.
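The conversion can be sketched as follows, assuming TensorFlow's default post-training (dynamic-range) quantization, since the paper does not specify the exact quantization scheme.

```python
# Convert the fine-tuned Keras model to TensorFlow Lite with post-training
# quantization (dynamic-range quantization shown; the exact scheme used by
# the authors is not specified).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("distracker.tflite", "wb") as f:
    f.write(tflite_model)

# On the Raspberry Pi, the model is loaded with the TFLite interpreter.
interpreter = tf.lite.Interpreter(model_path="distracker.tflite")
interpreter.allocate_tensors()
```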

In the integrated approach utilized by Distracker, an eye-tracking algorithm is also involved, and the schema is depicted in Fig. 6. In terms of the process involved, the algorithm analyses the input image frames to identify faces and eyes using a face detector and facial landmarks shape predictor. The algorithm then converts the frames to grayscale, focuses on a single face, and extracts the eye region by identifying specific points within the facial landmarks. A calibration process determines the optimal threshold for converting the extracted eye frame into a binary image, ensuring accurate pupil extraction under varying lighting conditions. Then, binarization is applied to enhance pupil recognition by converting the eye frames into monochrome binary images, making the pupil stand out as a distinguishable blob. Following this process, contours are identified in the iris, and the centroid of the pupil is calculated using contour recognition and image moments, providing information about the pupil’s position. Red crosshair marks are also added to the original image frame, accurately indicating the positions of the left and right pupils. In the final step, the algorithm calculates horizontal and vertical eye ratios to determine the user’s gaze direction, comparing them with predefined thresholds. By analyzing the ratios, the algorithm can establish whether the user is looking in different directions or at the centre, ensuring reliable and precise gaze direction determination.

Fig. 6. The eye-tracking algorithm schema.
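A condensed sketch of this pipeline, simplified to one eye, is given below using Dlib's face detector and 68-point landmark predictor. The fixed binarization threshold stands in for the per-user calibration described above, and the crosshair rendering is illustrative.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
EYE_POINTS = list(range(36, 42))  # 68-point landmark indices for one eye

def locate_pupil(frame, threshold=40):
    """Return the pupil centre (in frame coordinates) and its horizontal ratio."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    landmarks = predictor(gray, faces[0])          # focus on a single face
    xs = [landmarks.part(i).x for i in EYE_POINTS]
    ys = [landmarks.part(i).y for i in EYE_POINTS]
    eye = gray[min(ys):max(ys), min(xs):max(xs)]   # crop the eye region
    # Binarize so the dark pupil stands out as a distinguishable blob
    _, binary = cv2.threshold(eye, threshold, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    m = cv2.moments(max(contours, key=cv2.contourArea))   # image moments
    if m["m00"] == 0:
        return None
    cx, cy = int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])
    px, py = cx + min(xs), cy + min(ys)            # back to full-frame coordinates
    cv2.drawMarker(frame, (px, py), (0, 0, 255), cv2.MARKER_CROSS)  # red crosshair
    h_ratio = cx / eye.shape[1]                    # 0 = left edge, 1 = right edge
    return px, py, h_ratio
```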

The eye-tracking algorithm monitors the driver’s pupils to detect cognitive distractions by analysing eye fixations. A continuous fixation duration exceeding 300 ms indicates cognitive distraction [21]. The algorithm tracks the position of the driver’s pupils using red crosshair marks, enabling the analysis of pupil metrics for the detection of cognitive distraction, as shown in Fig. 7a) and b) below; a sketch of the fixation check is given after the figure.

Fig. 7. a) User looking at the centre; b) user cognitively distracted, exhibiting prolonged eye fixations.
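The 300 ms fixation criterion can be expressed as a small state machine over successive pupil positions; the pixel jitter tolerance below is an assumption, as the paper does not quantify how much pupil movement still counts as a fixation.

```python
import time

class FixationDetector:
    """Flags a cognitive-distraction fixation when the pupil stays nearly
    stationary for longer than 300 ms [21]. The pixel jitter tolerance is
    an assumption."""

    def __init__(self, max_jitter_px=5, fixation_ms=300):
        self.max_jitter = max_jitter_px
        self.fixation_s = fixation_ms / 1000.0
        self.anchor = None   # pupil position at the start of the current fixation
        self.start = None

    def update(self, pupil_xy):
        now = time.monotonic()
        if (self.anchor is None
                or abs(pupil_xy[0] - self.anchor[0]) > self.max_jitter
                or abs(pupil_xy[1] - self.anchor[1]) > self.max_jitter):
            self.anchor, self.start = pupil_xy, now   # gaze moved: restart timer
            return False
        return (now - self.start) > self.fixation_s   # True once 300 ms exceeded
```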

The Distracker underwent hardware integration to seamlessly fit into the car and was tested for its ability to generate real-time multimodal warnings, addressing the functional requirements. For the non-functional requirements, the prototype was optimized for portability, real-time response, light weight, and accuracy. The accuracy and responsiveness of distraction detection were quantitatively assessed and compared with ground truth data to validate distracted driving detection. Moreover, the model’s accuracy and generalization capabilities were rigorously tested against an independent validation dataset.

4 Evaluation Method

As part of the evaluation, the accuracy of the Distracker prototype in detecting and mitigating manual, visual, and cognitive distractions in real-time was assessed. For this, an experiment was conducted in which participants performed distracted driving tasks on a straight rural road, and data on true/false positives/negatives were collected based on the warnings produced by Distracker in order to derive a confusion matrix. For the experiment, ethical clearance was obtained from the Mauritius IT REC of the Middlesex University Mauritius campus to prioritize the safety of the participants involved. Participants were recruited via email and asked to complete a consent form signifying their interest in participating. Subsequently, they were provided with a health screening form and a driving license form. This comprehensive approach ensured that participants met the necessary criteria for involvement.

A total of 20 participants, including 10 members of the public and 10 students from the Middlesex University Mauritius Campus, took part in the prototype evaluation, meeting the sample-size requirements of a past study [22]. The evaluation experiment was conducted in Flic-en-Flac, Mauritius, specifically on straight roads at Morcellement Ramiah, The Waterway Residence, and Jardin d’Anna. Health screenings were conducted to ensure participants’ suitability for the study, and their driving licenses were checked to confirm their eligibility to participate in the experiment.

As key procedures of the experiment, participants received detailed information about the experiment, which entailed driving a passenger car equipped with the Distracker prototype on an actual road. Each participant then performed a series of distracted driving tasks while following clear instructions. These tasks are listed as follows:

  • C0: Safe Driving

  • C1: Texting right

  • C2: Talking on the phone right

  • C3: Texting left

  • C4: Talking on the phone left

  • C5: Operating radio

  • C6: Drinking

  • C7: Reaching behind

  • C8: Hair and makeup

  • C9: Talking to a passenger

  • Cognitive distraction

The set of tasks was performed twice, resulting in a total of 20 tasks (10 tasks for evaluating advisory multimodal warnings and 10 tasks for assessing cautionary multimodal warnings). When assessing advisory multimodal warnings, successful task completion required participants to engage the brakes after the initial multimodal alert. If braking took place before the alert, the attempt was considered a failure, and the task had to be performed again.

Regarding the evaluation of cautionary multimodal warnings, successful task completion required participants to apply the brakes after the intensified multimodal alert. Conversely, if braking occurred before this heightened alert, the attempt constituted a failure, leading to the task being repeated. During the evaluation, the prototype’s deep learning performance and the warnings it generated across the different modes were captured. The collected questionnaire data underwent careful examination to ensure reliability and validity, and were analysed using ANOVA in SPSS to compare the means of the measured dependent variables. The Bonferroni test was used to identify groups with significantly different means, guided by the ANOVA results. The p-value was employed to test the null hypothesis and determine whether there were any significant differences between the means. Eventually, accuracy was determined using the following formula:

$$Accuracy= \frac{(True\,Positives+True\,Negatives)}{(True\,Positives+False\,Positives+True\,Negatives+False\,Negatives)}$$
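The per-class F1 score, recall, and precision reported in the results were similarly derived from the confusion-matrix counts, using the standard definitions:

$$Precision= \frac{True\,Positives}{(True\,Positives+False\,Positives)} \qquad Recall= \frac{True\,Positives}{(True\,Positives+False\,Negatives)}$$

$$F1= \frac{2\times Precision\times Recall}{(Precision+Recall)}$$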

5 Results and Discussions

5.1 Performance and Accuracy Assessment in Real-Life Context

The accuracy of the built-in distraction detection algorithm within Distracker was assessed by analyzing the warnings it produced during real-life experiments, using the evaluation method discussed in the previous section. A total of 20 distracted driving tasks were performed, with 10 tasks used to generate advisory multi-modal warnings and another 10 tasks to generate cautionary multi-modal warnings. Additionally, the eye-tracking algorithm’s performance was evaluated using a cognitive distraction detection task called the N-back task, as described in a previous study by Biondi et al. (2017) [23]. In this evaluation, participants were required to memorize a sequence of digits presented audibly through a speaker and spell out the second-to-last digit while driving. The evaluation results are presented in a confusion matrix, which indicates the true positives and true negatives for the various distracted activities. These findings offer valuable insights into the effectiveness of the integrated deep learning and eye-tracking algorithm for accurately classifying distracted driving tasks in real-life situations (Fig. 8).

Fig. 8. The classification confusion matrix for all distracted driving tasks evaluated in the real-life experiment.

Table 3. F1 score, recall, and precision results following the experiment.

The real-life evaluation achieved an impressive overall accuracy of 93.63%. Various categories, including texting right, drinking, reaching behind, hair and makeup, talking to a passenger, and cognitive distraction, performed exceptionally well, with precision and recall scores surpassing 0.9.

5.2 Perceptual Challenges and User Feedback

However, some anomalies in precision and recall were observed, as shown in Table 3 above. Differences were noted among groups regarding “warning perceptivity” and “warning timeliness and accuracy” in the ANOVA test. Around 3.4% of the multi-modal warnings were rated as “barely noticeable” and “not so effective”. 15% of participants primarily perceived the audio warnings but found the visual display inadequate, particularly in daylight conditions. Because earlier evaluations took place under well-lit laboratory conditions [24], visual warning perception differed in real-life scenarios, leading to challenges in accurately perceiving the visual warnings. Another 2.7% of the multi-modal warnings were considered “unnoticeable” and “not at all effective”, primarily due to device issues and misclassifications caused by changes in lighting conditions.

5.3 Comparative Performance and Significance

In comparison to Biondi et al. (2017) [23], this study achieved significantly higher mean ratings (20.35 vs. 14.58) for warnings being “highly noticeable” during N-back tasks. Similarly, when compared to Maltz and Shinar (2007) [25], this study achieved higher warning perceptivity (95% vs. 66%) for cognitive distraction. The mean rating for multi-modal warning perceptivity in cognitively distracted tasks was also higher by 10% compared to the previous study by Roberts, Ghazizadeh and Lee (2012) [26]. Although this study’s overall classification accuracy in real-life scenarios (93.63%) was lower than that of existing non-real-life experiments conducted by previous studies [16, 18, 20], it demonstrated high accuracy even in real-life situations.

5.4 Quantized MobileNetV2 Model Outperforms

Furthermore, in comparison to Li et al. (2022) [16], whose modified YOLOv5s model achieved a mean average precision of 95.60%, the quantized MobileNetV2 model in this study surpassed their performance with a mean average precision of 99.46%. Despite Li et al.’s higher frame rate of 70 fps, obtained using a more powerful NVIDIA GeForce RTX 3080 GPU, this study aimed to balance accuracy, detection speed, and lightweight design; as a result, the quantized MobileNetV2 model was 13% lighter and more accurate in detection. Additionally, the quantized MobileNetV2 outperformed the unmodified MobileNetV2 used by Hossain et al. (2022) [18] on the SFDDD dataset, achieving an overall accuracy of 99.46% compared to their 98.12% for detecting distracted driving activities, surpassing it by roughly 1.3 percentage points. Furthermore, the quantized MobileNetV2 model in this study outperformed the modified EfficientDet-D3 model developed by Sajid et al. (2021) [27], which achieved a high mAP of 99.16% on the SFDDD dataset, by 0.3% in mean average precision (mAP). Nevertheless, it is worth noting that the modified MobileNetV2 model by Wang and Wu (2022) [20] achieved a higher accuracy of 99.88%, outperforming the present study by 0.42 percentage points. This difference can be attributed to limited implementation details and the comparatively limited training of the MobileNetV2 model in this study: the model was trained for only 50 epochs due to resource constraints and limited GPU availability for intensive training. Nonetheless, the findings highlight the outstanding performance of the quantized MobileNetV2 model in accurately classifying distracted driving behaviors.

5.5 Challenges and Limitations

Nevertheless, a few challenges and limitations were also identified during the evaluation process. The hardware used in the Distracker had relatively low performance, leading to unexpected shutdowns during the experiment, particularly due to high temperatures. Moreover, 20% of participants found the warnings difficult to understand due to unfamiliarity with the audio alerts, and half of this group disliked the audio sound and found it annoying. As such, further usability studies of the solution could yield additional insights into the application of Distracker in practice.

6 Conclusion

To conclude, this study has developed an in-vehicle smart agent prototype that combines deep learning and eye movement tracking to detect and address manual, visual, and cognitive distractions in real-time. The analysis of distracted driving tasks showcased an overall classification accuracy of 93.63% in real-life scenarios, including the detection of cognitive distractions, which is a significant contribution of this research. Moreover, the Distracker prototype is equipped with non-intrusive multi-modal mitigation warnings, enhancing its potential for practical implementation in vehicles.

Moving forward, future endeavours should prioritize addressing the identified limitations. This entails utilizing embedded devices with enhanced processing power, implementing advanced cooling systems, and exploring the possibility of powering the embedded device through the car’s power supply to reduce reliance on external power sources. Involving end-users in the design process and improving the visibility of warnings through visual cues and adjustable audio volume are vital aspects to consider. Augmented datasets, deep learning techniques, and strategic device placement should be explored to optimize the performance of the eye-tracking algorithm. By pursuing these future improvements, the field of detecting and mitigating distracted driving can progress, ultimately ensuring a safer and more secure driving experience for all individuals on the road.