1 Introduction

Honey bees are traditional pollinating insects, and they also produce honey. In particular, Manuka honey is an export product exclusive to New Zealand. Many scientists and beekeepers are concerned about the health of bees, and they observe and analyse bee behaviour. Observation platforms inside the beehive are popular, but there is little monitoring of bees outside the beehive, especially from a close-up view. Campbell et al. [1] tracked bees outside the hive, but the shadows of the bees affected the outcome. Kimura et al. [2] tracked honey bees on a flat surface, but did not mention whether their method could be used in the natural environment. This paper tracks the flight of bees outside the beehive. As the camera is close to the beehive entrance, the video shows the bees' bodies clearly.

In multiple object tracking, the common approach is filter-based tracking. Dearden et al. [3] introduced the particle filter to track soccer players on the field. Weng et al. [4] and Shantaiya et al. [5] used the Kalman filter to track people moving in a video. All of the above [3, 4, 5] used filter-based tracking: the filter predicts the position assuming the velocity is constant, and then combines the prediction and the detection to track the object.

In this paper, the frame rate of the camera is 50 Hz, but the bees fly quickly, so their velocity is far from constant between frames. Therefore the situation is different from the people tracking in [3, 4, 5].

Bee detection is implemented by a combination of foreground detection [6, 7] and colour-based segmentation [8]. Detection produces bee blobs from the video image and displays them as a binary image. These blobs can be analysed to obtain the bee positions, and then bounding boxes can be drawn around the bees. Lu et al. [9] and Thou-Ho et al. [10] used blob analysis to obtain the same kind of information.

The position information is used for bee tracking with the Kalman filter. Normally, the Kalman filter is used to track people and vehicles, because their velocity is almost constant; the filter relies on this steady velocity. However, in this research, there is no such steady value available for tracking. The Kalman filter predicts the bee positions for the next frame, and then combines the predictions and measurements (detections) to calculate the Kalman result used for the next prediction.

The research in this paper is also concerned with multiple bee tracking. The Hungarian assignment method [11, 12] is used to assign predictions and detections.

If bees appear individually in the video frame, they can easily be tracked. However, because the video is only 2D, when two bees cross over each other the detection produces only one blob for both bees. Therefore, there are two predicted positions corresponding to one merged blob in the frame. In this case, the Hough transform is used to calculate the positions of the two bees. Ballard [13] introduced the Hough transform to detect arbitrary shapes. Prasad and Leung [14] detected ellipses from edge information. Maji and Malik [15] used the Hough transform to detect objects; their voting space could find the possible locations of objects. In this paper, the Hough transform is applied to the detection of merged bees using information about the edges and orientations of their blobs in the immediately previous frames. This is a new application of the Hough transform. The Kalman filter then uses those positions as measurements (detections) to track the bees. This technique solves the problem of merging bees. The experiments are performed using MATLAB.

2 Bee Detection

2.1 Bee Detection with Image Segmentation

The image segmentation model uses a combination of foreground detection and colour-based segmentation to detect bees.

Foreground detection uses a Gaussian Mixture Model (GMM) [7] to detect moving bees and ignores the unchanging background. This produces a binary image that is called the foreground mask image (FMI) in this paper. However, it also detects bee shadows and vegetation movement, which affects the result. Global colour thresholding in the Hue-Saturation-Value (HSV) colour space [16] detects the orange and black colours of the bees, and removes the shadows and vegetation movement. This colour segmentation produces two binary images that we call the orange mask image (OMI) and the black mask image (BMI). These three binary mask images are combined to detect the bee blobs. The logic is displayed in Fig. 1. More details are provided in [17].

Figure 1. The combination logic.

Each blob can be analysed to estimate the position of its centre and to draw a rectangular bounding box (bbox) around it.
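
To make the detection stage concrete, the following is a minimal MATLAB sketch of the mask combination and blob analysis. It assumes the combination logic of Fig. 1 is FMI AND (OMI OR BMI); the HSV thresholds, the minimum blob size and the variable names (frame, fgDetector) are illustrative assumptions rather than the exact values used in this paper.

```matlab
% Minimal sketch of the detection stage (illustrative, not the exact
% implementation). frame is an RGB image; fgDetector is assumed to be a
% vision.ForegroundDetector (GMM) object from the Computer Vision Toolbox.
fmi = step(fgDetector, frame);            % foreground mask image (FMI)

hsv = rgb2hsv(frame);
% Hypothetical HSV thresholds for the orange and black parts of a bee.
omi = (hsv(:,:,1) > 0.02 & hsv(:,:,1) < 0.15) & hsv(:,:,2) > 0.4;  % orange mask (OMI)
bmi = hsv(:,:,3) < 0.25;                                           % black mask (BMI)

% Assumed combination logic (Fig. 1): moving pixels that are bee-coloured.
beeMask = fmi & (omi | bmi);
beeMask = bwareaopen(beeMask, 50);        % remove small noise blobs

% Blob analysis: centroid and bounding box of each bee blob.
stats   = regionprops(beeMask, 'Centroid', 'BoundingBox');
centres = cat(1, stats.Centroid);         % one [x y] row per blob
bboxes  = cat(1, stats.BoundingBox);      % one [x y w h] row per blob
```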

2.2 Hough Transform for Merged Bee Detection

A difficulty occurs when two bees pass across each other in the image. Because we use 2D video, when the crossing happens, bee detection produces one big blob rather than two separate bee blobs.

After the merging has been identified (a procedure discussed in section 3), the next step is to detect bees in the merged blob. This is a new application of the Hough transform method, because the shapes of the merged bees are arbitrary.

It is a challenge to detect and separate two bees when they are merged together, because the bees have the same colour and patterns. Figure 2 displays the situation. Figure 2a shows the first frame and Fig. 2b its segmentation, which detects two separate bee blobs. Figure 2c shows the following frame; its segmentation in Fig. 2d produces only one merged blob. The solution is to use the two individual bee shapes in Fig. 2b to detect each bee in the merged blob in Fig. 2d.

Figure 2. The segmentation of bee blobs. (a) The first frame with two bees. (b) The blob segmentation of (a). (c) The next frame of the two bees. (d) The merged blob segmentation of (c).

Although the bee shapes appear different in the merged detection, the single bee shapes in the Fig. 2b frame are partly observable in the merged shape of the Fig. 2d frame. Therefore, the shape parameters from the Fig. 2b frame can be used to detect the individual bees in the merged frame of Fig. 2d. However, this method can only find an approximate area of pixels for each bee’s position, because of the changes in the shapes.

In Fig. 2b, the blob of each single bee can be cropped out. This blob image can be used to produce the edge image. Using the bee blob near the bottom right corner in Fig. 2b as an example, Fig. 3 displays the edge of this blob.

Figure 3. The edge image of a single bee blob.

It is assumed that the edge image (I_s) has width I and height J. The image pixel at (i, j) has the value:

$$ I_s(i,j)=\begin{cases}1, & \text{on the edge}\\ 0, & \text{not on the edge}\end{cases}, $$
(1)

where 1 ≤ i ≤ I and 1 ≤ j ≤ J.

The centre of the image is at (i_c, j_c). The pixels with value “1” define the edge curve on the image I_s. The set of pixels (S) on the edge of the single bee blob is:

$$ S=\{(i,j)\mid I_s(i,j)=1\}. $$
(2)

It is assumed that the set of edge pixels has N members. Each member of the set S is denoted by s_n = (i_n, j_n), representing the coordinates of edge pixel n, where n ∈ [1, N].

In the edge image, the origin is at the top left corner. To detect a single bee in the merged blob, it is convenient to shift the origin to the centre of the edge image. The transformation is:

$$ c_n=[u_n,v_n]=[i_n-i_c,\; j_n-j_c], $$
(3)

then the transformed edge pixel set (C) is:

$$ C=\{c_n \mid n\in[1,N]\}. $$
(4)
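
A short MATLAB sketch of Eqs. (1)–(4) is given below. It assumes singleBlob is the cropped binary image of one bee blob (as in Fig. 2b); the use of bwperim for the edge and the rounding of the centre are illustrative choices, not necessarily those of the original implementation.

```matlab
% Sketch of Eqs. (1)-(4): edge pixels of a cropped single bee blob,
% expressed relative to the centre of the edge image.
Is = bwperim(singleBlob);          % binary edge image I_s (Eq. 1)
[J, I] = size(Is);                 % height J and width I
ic = round(I / 2);                 % centre of the image (i_c, j_c)
jc = round(J / 2);

[jIdx, iIdx] = find(Is);           % set S of edge pixel coordinates (Eq. 2)
u = iIdx - ic;                     % centred coordinates u_n (Eq. 3)
v = jIdx - jc;                     % centred coordinates v_n (Eq. 3)
C = [u, v];                        % transformed edge pixel set C (Eq. 4)
```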

Figure 4 shows the merged blob edge of the two bees in Fig. 2c. This blob can also be cropped. The edge image of the merged blob (I_m) has width X and height Y, and its pixels are:

$$ I_m(x,y)=\begin{cases}1, & \text{on the edge}\\ 0, & \text{not on the edge}\end{cases}, $$
(5)

where 1 ≤ x ≤ X and 1 ≤ y ≤ Y.

Figure 4. The edge image of the merged bee blob.

The pixels with value “1” are on the edge of the blob on the image I_m. The set of pixels (M) on the edge of the merged blob is:

$$ M=\{(x,y)\mid I_m(x,y)=1\}. $$
(6)

This set of pixels has K members. Each member is denoted by m_k = (x_k, y_k), representing the coordinates of edge pixel k, where k ∈ [1, K].

In the Hough transform technique, the edge pixels of the single bee blob are drawn on the merged blob edge image, with the centre of the single bee blob positioned on each pixel on the edge of the merged blob. Therefore, if the merged blob edge has K pixels, the single blob edge is drawn K times on the merged blob edge image. The coordinates of the resulting pixels drawn on the image are

$$ {t}_{nk}=\left[{u}_n+{x}_k,{v}_n+{y}_k\right]. $$
(7)

This transform set is known as T, where:

$$ T=\{t_{nk}\mid n\in[1,N],\; k\in[1,K]\}. $$
(8)

These are the Hough transform pixels. Some of the pixels t_nk are in the same position as each other. Figure 5 is an example of the single bee blob edge drawn four times on the merged blob edge image, with each drawing shown in a different colour. The centre of each drawing is at a different position on the edge of the merged blob. All of the pixels on these single bee edge drawings belong to T.

Figure 5. An example of drawing the single bee blob edge on the merged blob edge image.

Each pixel in the merged blob edge image (I_m) may match none, one or several of the Hough transform pixels (T). This creates a voting map V(x, y), whose size is the same as that of the merged blob edge image. Each element of the voting map array corresponds to a pixel of the merged edge image. These elements are bins that record the number of Hough transform pixels located at position (x, y).

For (x, y) inside the merged blob edge image, V(x, y) is the number of t_nk values equal to (x, y), where x ∈ [1, X] and y ∈ [1, Y]. V(x, y) is also called the voting value.

The voting map indicates the possible positions of the single bee on the merged image. If the single bee blob edge fits part of the merged blob edge, a corresponding element of the voting map is expected to have a peak value. This peak point is a candidate for the single bee's position in the merged blob.
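
The voting map construction can be sketched in MATLAB as follows. It assumes mergedBlob is the cropped binary merged blob (as in Fig. 2d) and C is the centred single-bee edge set from the previous sketch; the clipping of drawn pixels at the image border is an implementation detail added here for safety.

```matlab
% Sketch of the voting map (Eqs. 5-8). mergedBlob is the cropped binary
% merged blob; C = [u v] is the centred single-bee edge set from above.
Im = bwperim(mergedBlob);              % merged blob edge image I_m (Eq. 5)
[Y, X] = size(Im);
[yk, xk] = find(Im);                   % set M of merged edge pixels (Eq. 6)

V = zeros(Y, X);                       % voting map V(x, y)
for k = 1:numel(xk)
    % Draw the single-bee edge with its centre at (x_k, y_k): t_nk (Eq. 7).
    tx = C(:,1) + xk(k);
    ty = C(:,2) + yk(k);
    % Keep only drawn pixels that fall inside the merged edge image.
    in  = tx >= 1 & tx <= X & ty >= 1 & ty <= Y;
    idx = sub2ind([Y, X], ty(in), tx(in));
    % Each drawn pixel adds one vote at its position.
    V(idx) = V(idx) + 1;
end
[peakVal, peakIdx] = max(V(:));        % candidate position = peak of V
[yp, xp] = ind2sub([Y, X], peakIdx);
```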

However, it was observed that the motion of a bee can change its orientation by up to ±10° between successive video frames. If the bee changes orientation, the voting map may produce a lower-valued peak point. Conversely, the fitting orientation gives the voting map a sharp, high peak point. The rotation of the bee's blob is described as:

$$ \alpha=\{\alpha_l \mid l\in 1,2,\dots,L\}, $$
(9)

where L = 21 and

$$ [\alpha_1,\dots,\alpha_{21}]=[-10^{\circ},-9^{\circ},\dots,0^{\circ},\dots,9^{\circ},10^{\circ}] $$

are the angles used to rotate the single bee blob edge from its original orientation. A positive angle indicates an anticlockwise rotation and a negative angle indicates a clockwise rotation. If the single bee blob edge is rotated by α_l, the edge pixel coordinates relative to the centre of the single bee blob are:

$$ c_{n\alpha_l}=[\,u_n\cos\alpha_l+v_n\sin\alpha_l,\; v_n\cos\alpha_l-u_n\sin\alpha_l\,]. $$
(10)

Each rotated single bee blob edge curve can be drawn on the merged edge image to generate the voting map for the α_l rotation. In this case, there are 21 different voting maps corresponding to the 21 different rotations, and the different voting maps display differently valued peak points. If a rotation makes the single bee edge fit the merged edge, one peak point in the voting map is high and sharp, with the other peak points being much lower. If the rotation does not fit the merged edge, there may be two or three peak points with similar, but lower, values. Figure 6 shows an example. In this figure, the X and Y values are the coordinates of the peak point, and the Z value is the voting value of that point. A rotation of 0° means no change in the orientation of the single bee blob edge (Fig. 6a–c); in this case, there are three peak points at the same level. The rotation of 10° in Fig. 6d–f does not fit the merged edge; the three peak points have similar levels of 35, 36 and 37. Figure 6g–i shows the fitting rotation, in which the peak point in the voting map is a sharp 53, while the second highest peak is only 39. It can be seen that the fitting rotation produces the correct detection and usually has the highest peak value.

Figure 6. The final detection results with different rotations of the single bee edge. The −10° rotation fits the detection, and its voting map has the highest peak point.
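
The rotation of Eqs. (9)–(10) and the resulting 21 voting maps can be sketched as below. It assumes a hypothetical helper voteMap that wraps the voting loop from the previous sketch, taking a centred edge set and the merged edge image and returning the voting map.

```matlab
% Sketch of Eqs. (9)-(10): one voting map per rotation angle alpha_l.
% voteMap is a hypothetical helper wrapping the voting loop shown earlier.
alphas = -10:10;                   % alpha_1 ... alpha_21 in degrees (L = 21)
L = numel(alphas);
Vmaps = cell(1, L);
for l = 1:L
    a = alphas(l);
    % Rotate the centred edge pixels c_n by alpha_l (Eq. 10).
    uR = round(C(:,1) * cosd(a) + C(:,2) * sind(a));
    vR = round(C(:,2) * cosd(a) - C(:,1) * sind(a));
    Vmaps{l} = voteMap([uR, vR], Im);
end
```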

In summary, the 21 voting maps produce 21 candidate positions, since each voting map produces exactly one candidate position corresponding to its peak value. However, the fitting rotation may not produce the highest peak value. This is because the single bee shape sometimes fits another part of the merged shape rather than the correct part, so a wrong position can also have a high peak point. If the peak point is near the correct position, the region around this point includes many high voting values. This is described as the correct region. Conversely, the region around a wrong position contains some low values, so that its peak point is sharp and isolated; this is known as the incorrect region.

Figure 7 indicates the situation. It can be seen that the top point of the wrong rotation (Fig. 7a–b) has a similar level (94) to the top point of the fitting rotation (level 95 in Fig. 7c–d). However, the top point in Fig. 7a–b is more isolated than the top point in Fig. 7c–d. The problem is solved by using the sum of the region values around the top points. In each of the 21 voting maps, all the values at voting points within ±5 pixels of the top point are added together. It is assumed the top point in the voting map for orientation index l is V_l(x_p, y_p), and the total value of the region around this top point is:

$$ {r}_l={\sum}_{x={x}_p-5}^{x_p+5}{\sum}_{y={y}_p-5}^{y_p+5}{V}_l\left(x,y\right), $$
(11)

where l ∈ [1, 21]. The largest of the 21 region totals identifies the correct region, which includes the actual position of the bee. In addition, the corresponding l gives the orientation of the bee. Take Fig. 7 as an example: after the calculation, the region around the top point of Fig. 7b has a total value of 3950, but the region around the top point of Fig. 7d has a total value of 5501. Therefore, the correct region is the one in Fig. 7c–d, in which the fitting rotation is 10°.

Figure 7. The peak point problem with different rotations. The fitting rotation is 10°.
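
A sketch of the region-total selection of Eq. (11) over the 21 voting maps follows; the clamping of the ±5 pixel window at the image border is an added safeguard, not part of the original formula.

```matlab
% Sketch of Eq. (11): pick the rotation whose peak has the largest
% +/-5 pixel region total around it. Vmaps is the cell array built above.
r = zeros(1, L);
peaks = zeros(L, 2);                       % [x_p, y_p] for each rotation
for l = 1:L
    Vl = Vmaps{l};
    [~, idx] = max(Vl(:));
    [yp, xp] = ind2sub(size(Vl), idx);
    peaks(l, :) = [xp, yp];
    xs = max(1, xp-5):min(size(Vl, 2), xp+5);
    ys = max(1, yp-5):min(size(Vl, 1), yp+5);
    r(l) = sum(sum(Vl(ys, xs)));           % region total r_l
end
[~, lBest] = max(r);                       % correct region; lBest gives the orientation
```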

The top point in the correct region might be expected to be the candidate position. However, this point is only near the actual position, because of shape noise. The shape noise comes from the segmentation: it is difficult to obtain accurate shape detail during segmentation. The detection can be improved by a weighted average calculation around the correct region. The correct region itself is too small to be used to modify the position calculation, so it is extended to an area of ±20 pixels from the peak point, and a weighted average of the voting map is calculated as below:

$$ {w}_x=\frac{\sum_{x={x}_p-20}^{x_p+20}{\sum}_{y={y}_p-20}^{y_p+20} xV\left(x,y\right)}{\sum_{x={x}_p-20}^{x_p+20}{\sum}_{y={y}_p-20}^{y_p+20}V\left(x,y\right)}, $$
(12)
$$ {w}_y=\frac{\sum_{x={x}_p-20}^{x_p+20}{\sum}_{y={y}_p-20}^{y_p+20} yV\left(x,y\right)}{\sum_{x={x}_p-20}^{x_p+20}{\sum}_{y={y}_p-20}^{y_p+20}V\left(x,y\right)}. $$
(13)

In formulas (12) and (13), the point (w_x, w_y) replaces the peak point as the candidate position of the bee.

This is the final detection of the bee on the merged image. The other merged bee is detected in the same way.
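
The weighted average of Eqs. (12)–(13) can be sketched as below, reusing Vmaps, peaks and lBest from the previous sketches; the border clamping of the ±20 pixel window is again an added safeguard.

```matlab
% Sketch of Eqs. (12)-(13): weighted average over a +/-20 pixel window
% around the peak of the selected voting map.
Vbest = Vmaps{lBest};
xp = peaks(lBest, 1);
yp = peaks(lBest, 2);
xs = max(1, xp-20):min(size(Vbest, 2), xp+20);
ys = max(1, yp-20):min(size(Vbest, 1), yp+20);
W  = Vbest(ys, xs);
[Xg, Yg] = meshgrid(xs, ys);               % pixel coordinates over the window
wx = sum(sum(Xg .* W)) / sum(sum(W));      % w_x (Eq. 12)
wy = sum(sum(Yg .* W)) / sum(sum(W));      % w_y (Eq. 13)
% (wx, wy) is the candidate position of this bee in the merged blob image.
```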

In the following merged frames, the fitting rotation is still measured relative to the original orientation, and the single bee shape detected just before the first merged frame is still used to detect the bee in the merged blob. This method is applied until the two bees separate.

3 Tracking Bees with the Kalman Filter

The video has a 50 Hz frame rate and its view is very close to the bees. Therefore, a difficulty is that a bee can move a large and unpredictable distance from one frame to the next. In this case, the normal Kalman filter for object tracking [5] may not work well, because the bees are changing positions, velocities and accelerations; there is no steady variable for the normal Kalman filter. However, from another point of view, the Kalman filter can combine the prediction and the measurement to produce a correction used for the prediction of the next frame. Therefore, if the position of a bee is the state of the Kalman filter, a reasonable correction is close to the actual position of the bee. In this research, the measurement from bee detection is more reliable than the prediction when tracking a single bee, so the correction is close to the detection. Conversely, in the merged bee situation, the detection is not as reliable as in single bee tracking, so the correction depends on both the prediction and the detection. This method requires the estimation of the covariance matrices of the noises.

3.1 The Kalman Filter Principle for Tracking

In object tracking, the Kalman filter operates on discrete frames indexed by k. The Kalman filter equations are:

$$ \mathrm{Prediction:}\quad p_k = A s_{k-1}, $$
(14)
$$ P_k = A Q_{k-1} A^T + Q_0. $$
(15)
$$ \mathrm{Measurement:}\quad m_k = H S_k. $$
(16)
$$ \mathrm{Correction:}\quad K_k = P_k (P_k + R_k)^{-1}, $$
(17)
$$ s_k = p_k + K_k (m_k - p_k), $$
(18)
$$ Q_k = (I - K_k) P_k. $$
(19)

In Eqs. (14) to (19), the matrix P_k is the covariance matrix of the prediction noise, the matrix R_k is the covariance matrix of the measurement noise, the matrix S_k is the actual state from the detection measurement, and the matrix Q_k is the covariance matrix of the estimated state s_k. The matrix H is the measurement model matrix, which is the identity matrix in this case. The correction is controlled by P_k and R_k. If the prediction p_k is unreliable but the measurement from the bee detection is accurate, P_k is greater than R_k, so the Kalman gain K_k is nearer to one; then the estimated state s_k is closer to the measurement than to the prediction. In the opposite situation, the state s_k is closer to the prediction than to the measurement.

In this research, Q_0 is the variance arising from the fact that the bee velocity is not constant; it adds to the noise of the prediction. Q_0 is not added only once at the beginning of tracking, it is added in every Kalman filter prediction.

To apply the Kalman filter formulas, the state of a bee is its position (x_k, y_k), and the tracking model in frame k is:

$$ {x}_k=2{x}_{k-1}-{x}_{k-2}, $$
(20)
$$ {y}_k=2{y}_{k-1}-{y}_{k-2}. $$
(21)

If the prediction of the Kalman filter is:

$$ {p}_k={A}_{k-1}{s}_{k-1}, $$
(22)

then the predicted state vector is:

$$ {p}_k={\left[{x}_k,{y}_k\right]}^T. $$
(23)

The state vector in the frame k is:

$$ {s}_k={\left[{x}_k,{y}_k,{x}_{k-1},{y}_{k-1}\right]}^T. $$
(24)

The transition matrix is defined as:

$$ {A}_{k-1}=\left[\begin{array}{cccc}2& 0& -1& 0\\ {}0& 2& 0& -1\end{array}\right]=A. $$
(25)

The measurement model of the Kalman filter is:

$$ {m}_k={H}_{k-1}{S}_k. $$
(26)

The measurement state vector is:

$$ {S}_k={\left[{x}_k,{y}_k\right]}^T. $$
(27)

The measurement vector is:

$$ {m}_k={\left[{x}_k,{y}_k\right]}^T. $$
(28)

The measurement transition matrix is:

$$ {H}_{k-1}=\left[\begin{array}{cc}1& 0\\ {}0& 1\end{array}\right]=H. $$
(29)

The covariance matrix of the corrected (estimated) state is:

$$ {Q}_k=\left[\begin{array}{cccc}{q}_{11k}& 0& 0& 0\\ {}0& {q}_{22k}& 0& 0\\ {}0& 0& {q}_{11k-1}& 0\\ {}0& 0& 0& {q}_{22k-1}\end{array}\right]. $$
(30)
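
One Kalman filter cycle with the matrices defined above can be sketched in MATLAB as follows. The state and covariance carry-over in the last two lines follows Eqs. (24) and (30); the variable names are illustrative, and Q0 and R stand for the prediction and measurement noise covariances estimated in section 3.2.

```matlab
% Sketch of one Kalman filter cycle (Eqs. 14-19) with the matrices of
% Eqs. (24)-(30). s is the state [x_k; y_k; x_{k-1}; y_{k-1}], Q its 4x4
% covariance, m the detected position [x; y]; Q0 and R are the prediction
% and measurement noise covariances (section 3.2).
A = [2 0 -1  0;
     0 2  0 -1];                   % transition matrix (Eq. 25)
H = eye(2);                        % measurement model matrix (Eq. 29)

p   = A * s;                       % prediction p_k (Eq. 14)
P   = A * Q * A' + Q0;             % prediction noise covariance P_k (Eq. 15)
K   = P / (P + R);                 % Kalman gain K_k (Eq. 17)
est = p + K * (m - p);             % corrected position (Eq. 18)
Qxy = (eye(2) - K) * P;            % corrected covariance (Eq. 19)

% Shift the state and its covariance for the next frame, keeping the
% previous position in the last two components (Eqs. 24 and 30).
s = [est; s(1:2)];
Q = blkdiag(Qxy, Q(1:2, 1:2));
```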

3.2 Estimation of the Covariance Matrices

The covariance values are estimated from the differences between the bee positions calculated by the system and the actual bee positions determined manually. The measurement error is calculated by:

$$ {e}_m=\left[{x}_d-{x}_a,{y}_d-{y}_a\right], $$
(31)

where e_m is the measurement error, [x_d, y_d] is the detected position and [x_a, y_a] is the actual position. In a similar way, the prediction error is calculated from the difference between the system prediction and the actual position:

$$ {e}_p=\left[{x}_p-{x}_a,{y}_p-{y}_a\right], $$
(32)

where e_p is the prediction error and [x_p, y_p] is the predicted position.

It is assumed that the x and y values are independent of each other, so both covariance matrices are diagonal. For single bee tracking, 172 single bee images were used to estimate the errors and calculate the variances. The measurement covariance matrix was:

$$ {R}_k=\left[\begin{array}{cc}13& 0\\ {}0& 10\end{array}\right]. $$
(33)

The x and y variances could be different because x is parallel to the front of the beehive and y is perpendicular to the front of the beehive. Both x and y are parallel to the ground surface below the beehive.

In addition, the covariance matrix Q_0 was calculated from the bee positions in the images:

$$ {Q}_0=\left[\begin{array}{cc}124& 0\\ {}0& 78\end{array}\right] $$
(34)

These two covariance matrices are assumed to be constant during single bee tracking. However, because bees can merge with each other in the image, the prediction and measurement covariance matrices for merged tracking are different from those for single tracking. A set of 91 merged bee images was used to obtain the following matrices for the merged situation:

$$ {R}_k=\left[\begin{array}{cc}419& 0\\ {}0& 717\end{array}\right], $$
(35)
$$ {Q}_0=\left[\begin{array}{cc}435& 0\\ {}0& 825\end{array}\right]. $$
(36)

These two covariance matrices above are also assumed to be constant during merged bee tracking.
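
A sketch of how the diagonal covariance matrices could be estimated from the labelled images is given below. The arrays detected, predicted and actual are hypothetical N-by-2 lists of [x y] positions, with the actual positions labelled manually; whether the paper used exactly the sample variance in this way is an assumption.

```matlab
% Sketch of Eqs. (31)-(32) and the covariance estimation. detected,
% predicted and actual are hypothetical N-by-2 arrays of [x y] positions.
em = detected  - actual;           % measurement errors e_m (Eq. 31)
ep = predicted - actual;           % prediction errors e_p (Eq. 32)

% x and y are assumed independent, so only the diagonal terms are kept.
Rk = diag(var(em));                % measurement noise covariance R_k
Q0 = diag(var(ep));                % prediction noise covariance Q_0
```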

3.3 The Tracking Model

In the frame where a bee first appears, there is no previous position, so the predicted position for the next frame is taken to be the same as the current measured position. (This applies to all bees in the first frame of a new video.) Therefore, the first prediction will not be accurate, but the next prediction should agree with the detection. In the following frames, bees are tracked using the Hungarian assignment method [12]. This method assigns predictions to detections in an optimal way, minimizing the sum of the distances between the assigned predictions and detections. As a constraint, a prediction will not be assigned to a detection when the distance is more than 80 pixels. This usually results in some unassigned predictions and detections. When a detection is unassigned, it is probably a bee appearing for the first time. Unassigned predictions can be caused by a bee disappearing or by bees moving across each other in the video.
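
The gated assignment can be sketched as follows. The arrays predictions and detections are hypothetical P-by-2 and D-by-2 lists of [x y] positions; matchpairs from base MATLAB is used here as a stand-in for the Hungarian assignment, and the unassigned cost of 40 is chosen so that pairs further apart than roughly 80 pixels are left unassigned, approximating the paper's gate.

```matlab
% Sketch of the gated assignment between predictions and detections
% (illustrative, using matchpairs as the assignment solver).
dx = predictions(:,1) - detections(:,1)';      % P-by-D coordinate differences
dy = predictions(:,2) - detections(:,2)';
cost = sqrt(dx.^2 + dy.^2);                    % Euclidean distance matrix

% With an unassigned cost of 40 per row/column, matching a pair is only
% cheaper than leaving both unassigned when the distance is below ~80 px.
[pairs, unassignedPred, unassignedDet] = matchpairs(cost, 40);
% pairs(:,1) indexes predictions; pairs(:,2) the detections assigned to them.
```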

Figure 8 shows the whole tracking model process. When a bee appears for the first time, a Kalman filter model is created to track it. After the creation of the tracking Kalman filter, predictions and detections are assigned to each other for the correction of the Kalman filter results, and the correction produces the next prediction. If a prediction is unassigned, it must be determined whether it corresponds to a merged situation or not.

Figure 8. The tracking model process.

In the merged situation, two predicted bounding boxes (bboxes) overlap one detected bounding box (bbox). Following the Hungarian assignment method, there is an unassigned prediction. The bbox information from the blob analysis is used to identify the merge occurrence.

For example, Fig. 9 displays a merged situation, where the two predicted bboxes (red) overlap the merged detected bbox (blue). In this situation, one prediction is assigned to this merged detection. The other prediction is unassigned, even though its bbox overlaps the merged detected bbox. This is how the merged situation is recognised.

Figure 9. The merged bees situation.
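
The merge test for one unassigned prediction can be sketched as below, using rectint (base MATLAB) to measure bounding-box overlap; the variable names are illustrative and the exact overlap criterion used in the paper is an assumption.

```matlab
% Sketch of the merged-situation test for one unassigned prediction.
% predBbox is its predicted bounding box and detBboxes the detected boxes,
% all as [x y width height] rows (illustrative names).
overlaps = rectint(predBbox, detBboxes);   % intersection areas, 1-by-D
if any(overlaps > 0)
    % The overlapped detection is treated as a merged blob; the Hough
    % transform of section 2.2 is then used to locate each bee inside it.
    [~, mergedIdx] = max(overlaps);
else
    % Otherwise the bee is assumed to have left the view and its tracking
    % model is deleted.
end
```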

In Fig. 8, if a merged situation is recognised, the Hough transform is used to locate each bee (as outlined in section 2). If it is not a merged situation, the tracking model will be deleted, because it is assumed the bee has disappeared, so it does not need to be tracked any more.

4 Implementation and Results

4.1 Implementation Background

The camera is attached to the front wall of the beehive, facing down from about 30 cm above the entrance. A single white colour board is placed on the ground in front of the entrance of the beehive to simplify the image background. This is necessary because the natural background includes withered grass with a similar colour to the bees, and this interferes with the bee detection. Figure 10a shows the set-up. The video has a resolution of 1920 × 1080 at a rate of 50 frames per second. The videos are recorded and copied to a computer for MATLAB processing. The computer is a 64-bit Intel Core i7 at 3.40 GHz with 32 GB RAM. MATLAB with the Computer Vision Toolbox is used to run a program following Fig. 8. The program tracks bees in the videos automatically.

Figure 10. The camera position (a) and the camera view (b).

A MATLAB function performs the blob analysis. It not only calculates the blob information, but also outputs bounding boxes around each bee blob. The tracking is indicated by adding an index to each bounding box and by drawing curves marking each bee's trajectory in different colours on the video. A tracking boundary is set up just in front of the hive entrance, shown by the red line in Fig. 10b, so that the system only tracks bees in front of the white background area. This is because the area below the boundary line has complex colour information which affects the bee detection. Figure 10b shows an example of tracking output over 40 frames. The blue and red bounding boxes mark the detected and predicted bee positions respectively.

Tracking in the merged bee situation is shown in Fig. 11. The green outline shapes show the detections produced by the Hough transform when the bees are merged. The blue bounding boxes show the detection of merged blobs. The images (a) to (d) are successive frames. These examples demonstrate successful tracking.

Figure 11. The result of merged situation tracking. The images are successive frames from the video.

4.2 Evaluation of the Merged Detection and Tracking Results

There are two main situations that need to be evaluated: single bee tracking and merged situation tracking. There are several factors that affect the tracking accuracy.

First, merged detection failures. In the frames, bees continually overlap each other when they fly actively. It is common to see more than two bees merge, sometimes up to five bees merging into a single blob. Sometimes bees fly close to the camera, so that the image has one large bee overlapping many smaller bees. The merged detection method may not be able to track bees in these situations. In addition, if bees stay merged over a long period of time, they may change direction and shape, so the detection may fail, because it is based only on the shape information from before the merging. This factor is the main problem of this tracking method.

Second, boundary errors. Most bees come from beyond the image boundary. Some of them are only partly visible on their first appearance in a frame. In the single bee situation, tracking starts with these partly visible bodies. For merged bees on the boundary, tracking will not start if a bee is merged with another bee when it first becomes visible. In addition, bees may stay merged as they disappear beyond the frame boundary, so they are not seen separately. Then it is hard to know whether the merged tracking is correct or not. Therefore, as mistakes are unpredictable on the boundary, they will not be included in the evaluation of merged tracking.

Third, bees flying close to the camera. As the video is recorded in the natural environment, the video may show some bees as being very big, because they may fly close to the camera. These big blobs of bees may lead to unpredictable mistakes, which are ignored in the evaluation.

Fourth, bee shadows. Although the single-colour (white) board simplifies the background, shadows can still appear in the video. The shadows do not have an orange colour, but they may be detected as moving objects. If the bee detection method (section 2.1) cannot successfully remove shadows, the system may recognise them as bees. Therefore, shadows may affect the bee tracking accuracy. A video that includes many shadows has been chosen to test whether shadows affect the tracking or not.

Considering these four factors, two videos have been chosen to demonstrate the results. Both were taken in a sunny environment, when detection is better than on a cloudy day. The first video is an active video (Fig. 12a): there are about eight bees in the frame view. This video was taken in the middle of the day, when the sunlight was clear and bees were flying actively. It captures not only single bees but also many merged situations, so it is used to estimate results for both single and merged situation tracking. The second video is also an active one, but there are many shadows on the board (Fig. 12b); this is the shadow video. It also captures bees flying in single and merged situations, and in addition it tests the method when there are many shadows in the background.

Figure 12. Example frames from the two videos. (a) The active video. (b) The shadow video.

The evaluation results for single bee tracking are listed in Table 1. The active and shadow videos were evaluated over 300 frames to capture single tracking situations. In the table, the single bee tracking columns record the total number of bees tracked in these frames. If a bee index was assigned to the same bee between two consecutive frames in the video, this was counted as one correct tracking; otherwise, it was incorrect. After processing, the videos were all examined manually to determine whether each bee had been tracked correctly. In the active video, single tracking accuracy is over 99%, and in the shadow video accuracy is also over 99%. It can be seen that the shadows have little effect on tracking.

Table 1. The evaluation of single bee tracking.

The merged tracking with the Hough transform is compared to tracking without it. Both tracking models use the Kalman filter described in section 3. In the tracking model without the Hough transform, the merged tracking uses the centre of the merged blob as the measurement for the Kalman filter.

Table 2 reports the accuracy of merged tracking. The Correct column records cases where a bee after merging was correctly identified as the same bee as before merging. For example, if three bees merged and then separated, and one of them was correctly matched with itself before merging but the other two were not, then one result is Correct and the other two are Incorrect. The Incorrect results do not include boundary errors.

Table 2. The tracking evaluation and comparison.

The table shows that during 1000 frames of the active video, there were a total of 190 bees in merged situations. When using only Kalman filter tracking, 128 were correctly tracked and 62 were incorrectly tracked, giving an accuracy of 67.4%. In the model combining the Kalman filter with the Hough transform, 152 bees were tracked correctly and 38 incorrectly; the accuracy increased to 80.0%. The merged tracking on the shadow video was somewhat more accurate, because the errors were caused by factors other than the shadows. Using only the Kalman filter, 136 bees were tracked correctly and 42 incorrectly, an accuracy of 76.4%. Using the Hough transform, 156 bees were tracked correctly and 22 incorrectly, so the accuracy increased by 11.2% to 87.6%. Visual inspection of the frames also showed that shadows had little effect on the tracking. The variation between the results for the active and shadow videos appears not to be related to the presence of shadows.

5 Conclusion

This paper used a combination method for image segmentation to detect individual bees. In addition, it introduced a new way to apply the Hough transform to detect merged bees. The Kalman filter in this paper is utilized in a situation where no steady variable is available for tracking. The tracking model with the Kalman filter correctly tracked over 99% of single flying bees, and correctly tracked bees in the merged situation 72% of the time. With the addition of the Hough transform, merged bee tracking success increased to 84%. This method helps to solve the problem of the merged situation in 2D video.

However, some future work is still required. Firstly, the Hough transform is too slow to find the bees' positions in the merged situation: at present, this takes 8 to 33 s per frame, depending on the size of the blob and the number of bees. It is necessary to improve the speed of this procedure to achieve real-time tracking. Secondly, as the bee image segmentation relies on the orange and black colours of the bees, there is a problem of differentiation when the natural background contains black (soil) and orange (withered grass) colours.