1 Introduction

Laboratory rats and mice are used in several fields of biomedical research, including the study of animal behaviour in the neurosciences [47]. Their usefulness is particularly important in translational medicine, since the efficacy of candidate drugs for different human brain diseases can be assessed in animal models [3, 19]. Cognitive states can be evaluated in paradigms such as the open field test, the Morris water maze, and the elevated plus-maze, to name only a few [66]. However, the most widely used behavioural test to evaluate candidate drugs for pathologies that involve motor processes, natural exploration, or anxiety is the open field test [30, 58].

In the beginning, the Open Field Maze (OFM) was used in a rudimentary way: the rat was placed in a box divided into sections by grids, and the researcher quantified classic parameters such as the number of quadrants visited and grooming by direct observation [15, 48, 68]. Writing down these annotations was repetitive, since it required watching the test video multiple times; the resulting fatigue led to high variability in the results. Reducing this labour by involving more personnel may be attractive, but it is expensive.

Currently, commercial solutions use computerised video-analysis systems or infrared beam grids to track the rat or to measure the time it spends in specific areas of the test arena [17, 60, 64]. However, most commercial systems, as well as the research proposals found in the literature, require controlled conditions such as a fixed camera position, constant lighting, or high contrast between the rodent and the background. These demanding requirements make it difficult to define and quantify the behaviours of interest for a given experiment; likewise, they do not allow the researcher to adapt to changing experimental needs.

Nevertheless, beyond environmental conditions, previous work has shown that rat tracking and some (albeit limited) behaviour classification are possible using traditional computer vision techniques based on geometrical features [6]. Motivated by the latter, we propose a novel and robust system based on Deep Learning (DL), a recently successful artificial intelligence technique, to monitor the locomotive behaviour of rats in real time in the open field maze test. This system aims to reduce the researcher's manual annotation effort by automatically creating ethograms, a top-view plot of the rat's position in the maze, and a heatmap highlighting the most visited locations. These data are obtained by analysing images recorded with an inexpensive camera placed above the maze (see Fig. 1).

Fig. 1

Our approach can detect the rat in the maze while simultaneously classifying its behaviour, using an inexpensive camera placed in a top view above the maze

The rest of this paper is organised as follows: Section 2 reviews related work underpinning our experimental design; Section 3 describes our DL-based automatic monitoring method for the open field maze, the animals used in our experiments, the behaviours of interest, and the hardware design and configuration; Section 4 presents our experimental results; Section 5 discusses what has been achieved with the proposed system. Finally, our conclusions and future perspectives are outlined in Section 6.

2 Related work

Designing a system for rat detection and behaviour classification involves computer vision challenges such as object recognition. Object recognition is an essential task that requires knowledge of the scene and impacts many applications, such as autonomous navigation, pedestrian detection, facial expression recognition, and human activity recognition, to name a few. Recognition generally requires some pre-processing of the image, followed by feature extraction and classification.

Depending on the nature of the image, a variety of pre-processing steps can be applied, such as colour normalisation, deblurring, and brightness and contrast correction. For challenging images, such as underwater images, the work in [36] proposes Contrast-Limited Adaptive Histogram Equalisation (CLAHE) and percentile methodologies to enhance them.

After pre-processing, several works in the literature propose visual descriptors as feature extractors combined with classifiers (such as k-Nearest Neighbours or random forests) to recognise objects in the image. The work in [33] uses a combination of Shi-Tomasi, Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF) extractors followed by a random forest classifier to obtain an accuracy of 86.4% on the 10-class Wang dataset. Other works propose schemes that combine SIFT and Oriented FAST and Rotated BRIEF (ORB) extractors with classifiers such as BayesNet or k-NN [34] for content-based image retrieval, obtaining a precision rate of 88.9% on the Wang dataset. The work in [35] increased precision up to 99.53% on the Corel dataset using decision trees, random forests, and multi-layer perceptrons.

Efforts to outperform the state of the art (SOTA) have intensified with the growth of Deep Learning (DL) in recent years. The work in [32] presents a complete review of SOTA methods for 2D object recognition, comparing the performance of feature-based methods against DL-based approaches and showing that DL can score better on some datasets. DL methods have proved robust to image changes such as uncontrolled light conditions and dynamic objects in the environment, to mention some. DL has also achieved high scores on challenging multi-class datasets such as ImageNet; this is the case of CoAtNet [9] (currently the top-1 SOTA accuracy for image classification), which scores 90.88% on the 1000-class ImageNet dataset using a combination of convolutional layers and an attention model (Transformer).

In addition to image classification, object detection in images is another challenging problem, and several works have been proposed to achieve the best detection results. Some of the most relevant are the You Only Look Once (YOLO) network [53] and the Single Shot Detector (SSD) network [41], which offer reliable detection performance in terms of AP (average precision). Although recent works such as SwinV2-G [42], an architecture based on Swin Transformers, substantially outperform the AP on the Common Objects in Context (COCO) dataset, the SSD offers promising predictions with fewer classes and has the advantage of fast inference due to its compact architecture.

Additionally, works in the literature report approaches to problems such as hand gesture recognition [28] that use temporal information. Networks handle temporal information (in the form of data sequences) using long short-term memory (LSTM) modules or gated recurrent units (GRU), which take advantage of temporal data to provide feedback to the input. Other work uses temporal information for facial expression recognition (FER) [51] in the same way, as a sequence of stacked images, but with 3D convolutions instead of 2D convolutional layers to handle the image sequence. Not all works use recurrent modules or 3D convolutions to handle temporal information. In the context of autonomous drone racing (ADR), [55] shows that presenting temporal information as a mosaic image provides the information needed to learn flight commands successfully. Furthermore, the work in [5] shows that temporal information can be handled as a sequence of grey-scale images with only 2D convolutions to estimate the camera pose w.r.t. an object with low error in the ADR context.

Within the domain of neurosciences, where detecting a rodent in an apparatus such as the OFM is an essential task, the literature contains many efforts that use both classical computer vision algorithms and deep learning-based algorithms for rat detection and behaviour classification; moreover, some works add special devices to aid detection. Several works detect a rat in a test using traditional computer vision [4, 11, 13, 16, 37, 63, 69, 71]. However, these works do not classify any behaviour; furthermore, they require controlled lighting conditions and high contrast between the animal and the scenario [7, 40].

Marker-based strategies, such as painting the animal or attaching lights, are used in [14, 49, 56] to detect the rodent efficiently. Another invasive option is surgical implants [21, 22, 44, 59]. To enhance rodent detection and behaviour classification, the approaches in [70, 71] use additional hardware (sensors) to classify specific behaviours of the rodent.

Using depth/infrared cameras, the works in [7, 8, 20, 45, 50, 67] can identify the rodent's position and orientation, and can even analyse more than one rodent; however, they classify no behaviour other than rearing. With the growth of Deep Learning, several works propose different convolutional neural networks (CNNs) to detect and classify rodent behaviours in various scenarios.

Detection systems are the most conventional systems, applicable to many types of scenarios. For this purpose, the works in [1, 10, 11, 23, 43] use CNN architectures reported in the literature to detect the rodent; the most commonly used network in these works is YOLO [53].

The authors in [11] adapt the YOLO network to detect one to three rats in a test box; they also apply an Extended Kalman Filter to correct missed detections, achieving a high detection accuracy of 95%. Despite its good detection, the proposed system cannot classify any behaviour.

The approach presented in [10] reported a detection precision of 90% across three different scenarios, but it requires controlled illumination.

The comprehensive work presented in [24] performs both rat detection and behaviour classification under constant illumination; it can classify five behaviours, although grooming is not among them.

Furthermore, several works focus their attention on rodent behaviour identification for different types of tests.

The work in [54] fine-tunes AlexNet [31] to identify five behaviours associated with the Object Location Memory test. Higaki et al. [18] use a CNN to classify behaviours in the Morris water maze test.

For specific behaviours, the work in [12] uses an extensive dataset (over 2 million images) to classify rodent grooming, while [39] proposes feeding the network with a stack of optical-flow images. Scratching is another behaviour classified by the DL-based approach in [29], where a sequence of 21 images feeds the network. Another work centres its efforts on segmenting rats in thermal images with a CNN [46].

One of the most relevant proposals is the work in [65], in which the authors propose a system that classifies nine behaviours (including grooming) with an average precision of 65%. Nevertheless, achieving this result requires controlled conditions; without controlled lighting, their results fall below 60%.

In addition to the proposals described above, there are commercial and free systems that work with different apparatuses and offer various tools.

EthoVision XT is a complete system on the market; it can detect different animals in many tests and classify behaviours in the home cage. Its principal disadvantage is its high price, which makes it inaccessible to many researchers.

Another capable system that can detect animals in several apparatuses is the ANY-Maze software, which works on six different scenes. Despite being a solid solution for animal tracking, ANY-Maze does not offer behaviour classification, and its price may also be high for many researchers.

In sum, works that perform detection and tracking mostly do not classify behaviours, whereas works that classify behaviours provide neither detection nor any additional information from the test. None of these systems provides visual information that helps researchers interpret the data obtained.

3 Methodology

Simultaneous rodent detection and behaviour classification can be challenging; for this reason, our methodology is divided into two main tasks: rodent detection and behaviour classification (see Fig. 2). This section gives a general overview of the open field test and its setting, describes the network architecture used in each step of the methodology, explains the dataset generation, and details the configuration needed to train the networks.

Fig. 2

General overview of our proposed system, which performs two main tasks connected by three modules. For the first task, a model trained with the Single Shot Detector (SSD) network detects the rodent in the frame. The Data Processing Module (DPM) uses the SSD's output to generate a sequence that feeds the Rat Behaviours Network, which performs the second task of behaviour classification. The latter's output is fed back to the DPM to generate all the output data shown on the right

3.1 Animals

Five male Wistar rats weighing 250-300 g were obtained from the Bioterio Claude Bernard of the Benemérita Universidad Autónoma de Puebla (BUAP). Animals were housed under temperature- and humidity-controlled conditions in the vivarium of the Laboratorio de Neurofarmacología-BUAP, with a 12-12 h light-dark cycle and free access to food and water. All procedures followed the Guide for the Care and Use of Laboratory Animals of the Mexican Council for Animal Care, NOM-062-ZOO-1999. We also obtained the approval of the Use of Laboratory Animals and Ethics Committee of BUAP.

3.2 Open field test

The open field maze was used to determine the spontaneous motor activity of the rats. The apparatus consists of a wooden box measuring 1.2 m x 1.2 m x 1.2 m, with the arena divided into nine quadrants of 40 cm x 40 cm each. The test consists of placing the rat in the central quadrant of the arena and letting it explore for 15 min. A camera placed on a tripod above the open field maze video-records the spontaneous exploratory movements of the rat over the whole arena. After the test, the rat is removed and returned to the laboratory vivarium.

3.3 Rodent detection

The first task in our methodology is rodent detection. To detect the rat during the test, we extracted each frame from the video and used it as input for detection with the Single Shot Detector (SSD) network [41]. The SSD can identify multiple objects in an image, delimiting the area that contains each object. Since we only need to identify one object (a rodent), we selected a reduced version of the SSD named SSD7, which has only seven convolutional layers as its base network. This reduced architecture allows detection at a faster frame rate, making our approach more efficient in computational terms.
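As a minimal sketch, this detection step reduces to a frame-by-frame prediction loop. The file names are hypothetical, and we assume the saved model already includes a decoding layer that returns boxes as (class, confidence, corner coordinates); the actual decoding depends on the SSD7 implementation used:

```python
import cv2
import numpy as np
from tensorflow import keras

# Hypothetical file names; we assume the saved SSD7 model already decodes
# its raw output into (class_id, confidence, xmin, ymin, xmax, ymax) rows.
model = keras.models.load_model("ssd7_rat.h5", compile=False)
cap = cv2.VideoCapture("open_field_video.mp4")

boxes = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Resize to the network's fixed input size and add a batch dimension.
    x = cv2.resize(frame, (300, 300))[np.newaxis].astype(np.float32)
    detections = model.predict(x)[0]
    # Keep the most confident detection: there is only one rat in the arena.
    best = max(detections, key=lambda d: d[1])
    boxes.append(best[2:6])
cap.release()
```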

3.4 Behaviours classification

For the second task, we propose an additional CNN to predict rodent behaviours. We decided on a second CNN after a preliminary test in which we trained the SSD network to detect the rodent and predict its behaviour simultaneously; unfortunately, we did not obtain satisfactory results. Thus, we designed a compact network based on inception modules [62] to predict rodent behaviours only.

The network takes as input a stack of six consecutive grey-scale images; the stack passes through a combination of convolutional layers and one inception module, which extract the features needed to predict behaviours through a multi-layer perceptron whose output layer has four neurons (one per behaviour). Figure 3 shows the architecture of the proposed network.

Fig. 3

BehavioursNet architecture based on an inception module. This is a small architecture used to train a model to classify the rat’s behaviours in real-time

The stacked input consists of cropped images containing the rat; we use a sequence to give the CNN more information about the motion of each behaviour. The sequence is essential for classifying grooming and for differentiating walking from resting, behaviours that differ mainly in the movement performed.

To reduce the amount of input data, we use grey-scale images instead of RGB images, because we focus on extracting motion features rather than complex texture features.

We did not change the input size proposed in [62]; the size of each image is 224 x 224.
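The following is a minimal Keras sketch of such an inception-based classifier. The number of layers and the filter sizes are illustrative assumptions (the actual architecture is the one shown in Fig. 3), but the input shape (a 224 x 224 stack of six grey-scale frames) and the four-way softmax output follow the description above:

```python
from tensorflow import keras
from tensorflow.keras import layers

def inception_module(x, f1, f3, f5, fp):
    """Inception block in the spirit of [62]: parallel 1x1, 3x3 and 5x5
    convolutions plus a pooled branch, concatenated along channels."""
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(x)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])

# Input: six consecutive grey-scale frames stacked along the channel axis.
inp = layers.Input(shape=(224, 224, 6))
x = layers.Conv2D(32, 7, strides=2, activation="relu")(inp)
x = layers.MaxPooling2D(3, strides=2)(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = inception_module(x, 32, 64, 16, 16)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(128, activation="relu")(x)
# Four outputs: walking, rearing, resting, grooming.
out = layers.Dense(4, activation="softmax")(x)
model = keras.Model(inp, out, name="BehavioursNet_sketch")
```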

3.5 Data processing module

Each step of our methodology requires preparing the input frames and processing each network's output. The Data Processing Module (DPM) performs this pre-processing and post-processing of the data in our system.

First, the DPM takes the output of the SSD network and uses the detected bounding boxes to crop the frame, converting it to grey-scale. The cropped frames are used to build a sequence of six consecutive frames (a stack of images) that serves as input for the Rat Behaviours Classification network.
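A minimal sketch of this buffering step, assuming OpenCV for cropping and colour conversion (the resize target matches the 224 x 224 input discussed above; all names are illustrative):

```python
import collections
import cv2
import numpy as np

SEQ_LEN = 6
buffer = collections.deque(maxlen=SEQ_LEN)  # sliding window of crops

def push_detection(frame, box):
    """Crop the detected rat, convert it to grey-scale, resize, and keep a
    rolling window of the last six crops (sizes are illustrative)."""
    xmin, ymin, xmax, ymax = [int(v) for v in box]
    crop = frame[ymin:ymax, xmin:xmax]
    grey = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    buffer.append(cv2.resize(grey, (224, 224)))
    if len(buffer) == SEQ_LEN:
        # Stack along the channel axis -> (224, 224, 6), batch of one.
        return np.stack(buffer, axis=-1)[np.newaxis].astype(np.float32)
    return None  # not enough frames yet
```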

The output from the second network enables the DPM to generate all the graphics associated with the test, i.e., video with detection, ethogram, detection plot, and heatmap plot for most visited areas.

The SSD output is used to generate the rat's detection plot. The DPM plots each detection on the x- and y-axes, keeping the same origin as in the image, i.e., the top-left corner with coordinates (0,0). In addition to the detection plot, the DPM stores all the detection centres. At the end of the video processing, some regions will have accumulated more detections, depending on the rat's behaviour, indicating the rat's preference; this information is plotted as a heatmap.
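As a sketch, accumulating the stored centres into a 2D histogram is enough to produce such a heatmap; the bin count and colour map below are our assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_heatmap(centres, width, height, bins=9):
    """centres: list of (x, y) detection centres in image coordinates.
    Accumulates them into a 2D histogram; darker cells = more visits."""
    xs = [c[0] for c in centres]
    ys = [c[1] for c in centres]
    grid, _, _ = np.histogram2d(ys, xs, bins=bins,
                                range=[[0, height], [0, width]])
    # Flip the y extent so the origin stays at the top-left image corner.
    plt.imshow(grid, cmap="Blues", extent=[0, width, height, 0])
    plt.colorbar(label="visits")
    plt.title("Most visited zones")
    plt.savefig("heatmap.png")
```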

To analyse the behaviours performed in the test, the DPM creates an ethogram showing the four behaviours, each with a different colour. The prediction for every frame is stored in a vector and plotted at the end as a timeline.
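A minimal sketch of such a timeline plot (the colour assignment follows the convention used in Section 4; the frame rate value is an assumption):

```python
import matplotlib.pyplot as plt

COLOURS = {"rearing": "tab:blue", "walking": "tab:orange",
           "grooming": "tab:purple", "resting": "tab:green"}

def plot_ethogram(behaviours, fps=23.0):
    """behaviours: one predicted label per frame. Draws each frame as a
    thin coloured segment on a single timeline."""
    fig, ax = plt.subplots(figsize=(10, 1.5))
    for i, b in enumerate(behaviours):
        ax.axvspan(i / fps, (i + 1) / fps, color=COLOURS[b])
    ax.set_yticks([])
    ax.set_xlabel("time (s)")
    fig.savefig("ethogram.png")
```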

3.6 Dataset generation

As described in previous sections, our methodology consists of two main tasks, rodent detection and behaviour classification, so it was necessary to prepare a dataset for each task. Inspired by the work in [6], we implemented a system that detects the rat's position using a filtering algorithm (a Kalman filter). The detected image points were used as labels, with the complete image as the training example for the SSD network. We have five recorded videos with an average length of 15 minutes; we used Video 2 for dataset creation, yielding about 27 thousand frames with their respective bounding-box labels. For detection, we used RGB frames, as shown in Fig. 4a).

Fig. 4

Dataset examples for the training of the SSD, used for the rat’s detection task in the image: a) Top view images captured with an inexpensive camera; b) Cropped images from the images in a) obtained with our automatic detection labelling system based on stochastic filtering

The bounding boxes were also used to crop the rat from the image. Behaviour labels had to be assigned manually; given the number of frames per video, labelling every frame is arduous. Instead, we set a label only at the start of each behaviour, reducing the manual labels to about 150 per video. With these marks, we used the detection system to generate the labels for every frame within each time range.

Figure 4b) shows an example of the cropped images generated and used to train the behaviour classification network. Because the rat is more active in the first minutes of the test and then tends to rest, the classes were unbalanced. To prevent biased classification, we took the number of labels of the least frequent behaviour as a threshold and kept only that number of labelled frames for each behaviour in the training set.

Thus, we obtained a semi-automatic labelling system for bounding boxes and behaviours, combining the manual marks with the automatic detection system for the rat's position. A sketch of this label propagation and class balancing is shown below.
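The two steps reduce to the following minimal sketch, under our own naming assumptions (marks are the roughly 150 hand-set behaviour onsets per video):

```python
import collections
import random

def propagate_labels(marks, n_frames):
    """marks: sorted list of (start_frame, behaviour) pairs set by hand at
    the onset of each behaviour. Every frame inherits the most recent mark,
    turning sparse labels into per-frame labels."""
    labels, j = [], 0
    for f in range(n_frames):
        while j + 1 < len(marks) and f >= marks[j + 1][0]:
            j += 1
        labels.append(marks[j][1])
    return labels

def balance(frames, labels):
    """Undersample every class to the size of the rarest one."""
    by_class = collections.defaultdict(list)
    for fr, lb in zip(frames, labels):
        by_class[lb].append(fr)
    n = min(len(v) for v in by_class.values())
    return {lb: random.sample(v, n) for lb, v in by_class.items()}
```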

3.7 Training process

We used Python 2.7 with the Keras API 2.2.4 and the TensorFlow 1.14 framework as the backend to implement our networks. We trained for 100 epochs with a batch size of 64, using the Adam optimiser with a learning rate of 0.001 for both the SSD and the Rat Behaviour networks.

We used the loss described in [41] for the SSD network and a categorical cross-entropy loss for the behaviour classification network.
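For the classifier, this training setup reduces to the following Keras sketch (the model and the data arrays are assumed to be defined as in the previous sections; the SSD is trained analogously but with its own multibox loss from [41]):

```python
from tensorflow import keras

# Training configuration as reported: Adam, learning rate 0.001,
# batch size 64, 100 epochs, categorical cross-entropy loss.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train,            # assumed: stacks and one-hot labels
          batch_size=64, epochs=100,
          validation_data=(x_val, y_val))
```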

Table 1 summarises the setup of the parameters used for training both networks.

Table 1 Parameters used to train the system for the rat’s detection and classification

The detection dataset consisted of 27 thousand labelled frames; however, we took only 8 thousand images for training, split into 80% for training and the remaining 20% for validation. For the Rat Behaviour Network, the dataset contained approximately 12 thousand images, split into 8952 images for training and 2984 for validation. We wrote the DPM, the CNNs, and the entire system in Python.

4 Results

This section presents the results obtained for each module of our two-step system and the results produced by the DPM.

4.1 SSD network

As described in Section 3.7, we used a small dataset for SSD training. This dataset was enough for the SSD to learn a model that detects the rat in all the video frames without losing it. Figure 5a) shows an example of the rat's detection in the video: the bounding box containing the rat and its centre, with the corresponding label and the confidence of the detection. We generated this sequence only to illustrate the SSD detection; we remind the reader that our system also performs a second task, behaviour classification.

Fig. 5

Examples of our system's output for the two main tasks: a) Detection of the rat in the image, indicated with a green bounding box, showing the confidence level obtained by the SSD network; b) Behaviour classification obtained after passing cropped images from the detection task to the BehavioursNet architecture; c) Failure cases where the SSD was trained to detect and classify behaviours simultaneously. These images show that the network cannot detect the rat correctly; hence the behaviour classification is also incorrect

4.2 Rat behaviour classification network

The network at this step predicts one of four possible behaviours: walking, rearing, resting, and grooming. This prediction is performed for every frame in the video. To show the network's output, we attach the behaviour classification to the cropped image; an example is shown in Fig. 5b).

Additionally, as argued in Section 3.4, we also tested the SSD network for behaviour classification, but the network could not perform both tasks correctly. Figure 5c) shows that when we combine detection with classification in the same CNN, the network does not correctly detect the bounding box that contains the rat.

4.3 Plots and data generated by the DPM

As emphasised in the methodology section, the Data Processing Module is the essential module for pre- and post-processing the data and generates the statistics for the system output. These outputs are the video, the ethogram, the detection plot, the heatmap plot, and the count of visited cells.

By combining the outputs of the SSD and the Rat Behaviours Classification network, the system creates a video in which the detected rat (indicated by a green bounding box) and its classified behaviour are shown.

Because the classification runs frame by frame, we have an annotation of the rat's behaviour for each instant of time. We compared the system's classification against the ground truth frame by frame and calculated the classification precision for each behaviour; these results are presented in Table 2.
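Such a frame-by-frame evaluation can be reproduced in a few lines; as a sketch, assuming y_true and y_pred hold one behaviour label per frame:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall and F-score, as summarised in Table 2;
# y_true and y_pred are assumed to be defined (one label per frame).
print(classification_report(
    y_true, y_pred,
    labels=["walking", "rearing", "resting", "grooming"],
    digits=3))
```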

Table 2 Metrics for our system evaluation: Accuracy, Precision, Recall and the F-score. Note that walking is the behaviour with the highest score

With the predictions generated by the network, the DPM created an ethogram for the video. The ethogram shows the behaviour at each instant: blue indicates rearing, orange walking, purple grooming, and green resting. Figure 6 shows the ground-truth ethogram corresponding to video 3. Note that in the first half of the ethogram the rat explores the box, walking and taking some time to groom itself; resting is then the predominant behaviour from the second half of the test onwards.

Fig. 6

Ethograms corresponding to video 3 used in our experiments: a) Ethogram generated with the ground truth showing that in the first minutes, the rat walks to explore the box, and it rears many times in tandem with a long period of grooming. After half the video, the rat decreases its activity and rests for a long time; b) Ethogram obtained with the behaviour classification obtained with our BehavioursNet architecture. Note that rearing and grooming are the behaviours with more misclassification. However, this ethogram produced automatically with our system may be enough to detect unusual behaviour

Although the precision for the behaviour classification is not high, we can still use the information in the ethogram to interpret the general behaviour during the entire test.

Moreover, since the centre of the rat is estimated in each frame throughout the test video, the DPM generates a detection plot using these centre estimations (see Fig. 7). The figure shows the rat's detections for video 3; the system generates a detection plot for every video.

Fig. 7

A plot of the rat's position in video 3, used in our experiments, showing the ground truth in red and the pixel positions detected by our system in green. Our system tracks the ground truth closely, with a low average error of 6 pixels. Note that the error is not significant compared with the size of the rat in pixels

We compared the detection points generated by the CNN against the ground truth. Figure 7 shows the comparison: the red points represent the ground truth and the green points the SSD output. As observed, the estimated points are close to the ground-truth ones; the global mean error (Euclidean distance) between ground-truth and estimated detection points is 6.34 pixels. Per axis, the RMSE is 3.8 pixels on the x-axis and 6.1 pixels on the y-axis. These errors are low enough to detect the rat and correctly generate a cropped image containing it. In addition to the detection plot, the DPM produced a heatmap indicating the most visited zones of the test box: lighter colours indicate less frequently visited areas, while the most visited zones are painted darker, as shown in Fig. 8. The heatmap, also from video 3, shows that the rat preferred the top-right and bottom-right boxes. To estimate the time our system requires to process each result, we measured the processing time between frames in all the experiments, obtaining an average of 42 ms (\(\sim 23\) Hz).
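The reported error figures correspond to the following computation; as a sketch, assuming per-frame arrays of ground-truth and estimated centres:

```python
import numpy as np

gt = np.asarray(gt_centres, dtype=float)    # ground-truth (x, y) per frame
est = np.asarray(est_centres, dtype=float)  # SSD-estimated (x, y) per frame

# Global mean Euclidean distance between estimated and true centres.
mean_err = np.mean(np.linalg.norm(est - gt, axis=1))  # reported: 6.34 px

# Per-axis root-mean-square error.
rmse = np.sqrt(np.mean((est - gt) ** 2, axis=0))      # ~3.8 px (x), 6.1 px (y)
```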

Fig. 8

Heatmap plot generated for video 3 used in our experiments. For this test, in the first minutes, the rat explores the whole box, walking and rearing in some places. Then, for the second half of the video, the rat reduces its activity, resting most of the time as the environment becomes familiar to it

In addition to the detailed description of the video 3 results, Fig. 9 presents the plots generated by our system for each video (including the one used for training). We can observe that rats tend to explore and stay in corners, while their behaviours differ in each case.

Fig. 9

Example of the plots generated by our system for all videos in our experiments. Notice from the detection and heatmap plots that rats prefer to explore the left corners more. Besides this, the behaviours differ for each rat in the test, as can be seen in the ethogram plots

Section 2 presented a review of related work developed in the last decade. Based on this review, Table 3 compares the most closely related works of recent years. The Score column (column 2) shows that the precision or accuracy obtained by our system is comparable with the proposals reported in the SOTA. Nevertheless, our proposal does not need controlled conditions, such as particular lighting or high contrast, to achieve these results. Our system also offers a variety of visual results that help provide a broader view of what happens in the open field test.

Table 3 Comparison with the most closely related works in the literature. The Score column shows the result reported by each work; some works report only precision (P) and others only accuracy (Acc)

5 Discussion

This work aimed to develop a Deep Learning-based two-step methodology to track and detect a rat in the arena of the open field maze and subsequently classify the animal's behaviour. Since the psychomotor process is heavily researched in the neurosciences [2, 52, 57], and experimenters have struggled with the precision of the data obtained both by software and by themselves, we designed our system to generate an ethogram of the rat's behaviours in the analysed video; this allows the researcher to evaluate highly relevant behavioural parameters depending on the study's objective. The results of the present work show that the detection performed with the Single Shot Detector network is efficient and enables the system to automatically perceive locomotor behaviour in freely moving rats in the open field maze model. Likewise, we compared traditional computer vision algorithms against the approach proposed in this work, showing that our system can simultaneously detect the animal and classify its behaviour, something not achieved by those traditional methods.

Behaviour classification has been, and remains, a complex challenge. Detection of the subjects is possible, as described in our related work section, but there is still room for improvement in the behaviour classification task [61]. In this regard, we have proposed an approach based on the SSD network and our novel CNN architecture called BehavioursNet, using the former for detection and the latter for behaviour classification. In our experience, attempting both tasks with a single CNN performs poorly; see, for instance, Fig. 5c), where the SSD trained to detect the rat and classify its behaviour did not even detect the rat when tested on the images.

According to our experiments with the SSD network, its architecture tries to classify small regions of the image as the object, later joining all the regions (anchor boxes). Some of the rat's behaviours look similar: in resting and grooming the rat's shape appears alike, and when the rat is walking, its elongated shape can be confused with rearing if the image is rotated. Providing only one image to the network may therefore not carry enough information to classify behaviours, causing the network to fail in detection as well. On the other hand, a network that processes more than one image receives more information, which can improve classification, at the cost of some false positives among behaviours that look similar. Thus, our approach both detects the rat and classifies its behaviour, as shown in Figs. 5a) and 5b).

Behaviour tests require the researcher's constant observation in real time and, later, review of the video recordings, which is tiring and error-prone. The automatic behaviour classification provided by our system can facilitate the locomotor study of experimental subjects such as the rats shown in Fig. 5. The detection of the rat in the image, as shown in Fig. 8, is useful for analysing the rat's activity, reflecting any condition derived from a drug or pathology. The frame-by-frame behaviour classification and its report in the ethogram can speed up the behavioural analysis and evaluation for various pathologies, including Parkinson's disease (PD) and anxiety. All the data generated in real time by the system permits the user to skip observation time, paying attention only to those time slots with relevant motion activity or behaviour classification.

Table 2 shows the evaluations of each behaviour’s classification: rearing, walking, grooming, and resting. The values indicate a high score for accuracy in both rearing and grooming; however, particularly for grooming, the score decreases in precision, recall and F-score (Table 2). This situation is caused by the similarity of grooming and resting behaviours when seen from a top view. From this perspective, essential body parts of the rat, such as the paws, are not visible, which may be crucial to classifying grooming.

Figure 6a) shows an ethogram produced with ground-truth data, and Fig. 6b) shows the ethogram obtained from our network's classification; note that in the first half of the latter, the classification closely resembles the ground truth. Despite the margin of error between grooming and resting, such ethograms, interpreted with our network, can be helpful when studying anxiety processes. This is advantageous for researchers, who will not need to spend a significant part of their time corroborating the ethogram data against observations taken in real time.

Grooming is an innate behaviour in rats related to the hygiene of the animal and to other physiological processes such as thermoregulation, socialisation, and excitement [26]. However, in highly anxious animals, hyperactivity and increased grooming are commonly observed [25, 27]. In addition, thigmotaxis, related to the amount of time the experimental subject remains adjacent to the maze wall, is a typical behaviour observed in the open field maze [38]. In contrast, when anxiolytic drugs are evaluated, these activities and behaviours diminish. Therefore, when assessing the four behaviours with our system, their measurements could support the user's interpretation of a specific pathology, even with a small margin of error between grooming and resting. Additionally, our system could be useful when a large number of videos from various experimental groups of rodents in a given project need to be analysed.

As Table 3 shows, the accuracy and precision obtained by our system are comparable with those found in the literature. Nevertheless, our system has an advantage over those with greater accuracy or precision, since they only track the rat or detect one or more behaviours excluding grooming, a significant behaviour in many experiments, as described earlier. Our proposal can track the rat and detect behaviours, including grooming. Furthermore, the work presented in [65] scores an AP of 65% under constant light conditions, dropping below 50% when the illumination changes; in these cases, our work is not only comparable with the highest precision but also maintains it under changing lighting conditions. In the additional material, it can be seen that when researchers observe the rat, their shadows cause variations in the light; additionally, there are slight variations in light due to the physical placement of the lights.

Finally, we emphasise again that our system processes data with two CNN architectures, a reduced version of the SSD (SSD7) followed by our small BehavioursNet, and still performs at an average time of 42 ms (\(\sim \)23 Hz). This makes it possible to use standard, inexpensive cameras recording at 15 to 25 fps without requiring much more time than the duration of the video itself. The possibility of processing videos in real time benefits the user by reducing the time needed for analyses when experimenting with the effects of a drug. Moreover, viewing all the graphs generated by our system together gives a quick overview of each rat's activity during the test, allowing us to observe differences in activity and behaviour between trials. Therefore, the automatic generation of these results offers researchers the opportunity to spend less time watching the recordings, focusing only on those videos whose ethograms and trajectory plots exhibit distinctive data worth analysing more carefully.

6 Conclusions

This work has described a system for automatically detecting a rat in an open field maze while simultaneously classifying its behaviours. We have shown that Deep Learning techniques such as convolutional neural networks can perform these tasks efficiently at an average frequency of 23 Hz. Moreover, despite the difficulties of using top-view images, it is possible to classify behaviours with a precision and recall of around 60%, comparable with the works reported in the literature, with the advantage of not requiring special setups or controlled environments. The results achieved with our proposal are promising, considering that they were obtained with a low-resolution, inexpensive video camera and a budget PC.

For future work, we will explore 3D data to enhance the classification of similar behaviours.