1 Introduction

Ego-motion technology is widely used today for autonomous navigation, video conferencing, remote surveillance, visual odometry, human-computer interaction and short-term control applications. Its effectiveness is perhaps best illustrated by its use in NASA's famous Mars rover project [18, 60], where ego-motion estimation supported robot navigation on another planet.

The term “ego-motion” can be traced back to work in psychology [30], which set out views on theories of visual space. Six years later, Warren [106] presented an account of ego-motion perception based on visual signals. These two studies marked the entry of ego-motion into psychology, and many other researchers, for instance Neisser [70], Brandt, et al. [9], and Prazdny [75, 76], followed with their own concepts of ego-motion.

In the computer vision domain, the first computational model for ego-motion was developed by Prazdny [76]. Prazdny observed that optical flow fields are generated at an observer’s retina as the observer moves through a 3D world. The ego-motion parameters of the observer and the relative depth map of the stationary environment can be computed from the Instantaneous Positional Velocity Fields (IPVF), and only local properties of the optical flow are needed for this computation. Prazdny developed a computer model for investigation, analysis and performance evaluation of the proposed method; it simulated a 3D world with an observer moving through the environment. His results confirmed that ego-motion estimation from optical flow is reasonable and feasible.

Many computer vision researchers have proposed solutions to the ego-motion problem. These solutions are based on diverse algorithms [12, 36, 49, 59, 76, 79, 97, 98, 116] and on ego-motion computation using stereo [4, 20, 22, 23, 26, 42, 55, 62, 65, 72-74, 82, 84, 90] and monocular [48, 66, 80, 90, 110, 111, 115] sequences. Other researchers have focused on methods that learn visual representations from videos, where knowledge of ego-motion is used for feature learning [1, 44, 89]; such videos are normally captured using vehicle-mounted cameras. Most notably, the method suggested in [44] predicts feature responses to a distinct collection of observer motions, and discrete ego-motion in a single image is estimated using linear transformations. Olson et al. [72] used a probabilistic approach to develop an algorithm that improves the robustness of ego-motion estimation for long-distance navigation; they claim an error below 1 % for long-distance travel. Ess et al. [25] proposed a method for tracking pedestrians in the presence of occlusions and outliers, using a novel feedback connection from the object detector to visual odometry that exploits the semantic knowledge of detections to stabilize localization.

1.1 Outline of the research paper

Our goal with this review paper is threefold: (1) for computer vision readers, we intend the paper to serve as an introduction to ego-motion and its related concepts; (2) for ego-motion researchers, we believe it provides a convenient opportunity to study ego-motion estimation methods and algorithms in diverse scenarios; and (3) for visual odometry readers, we hope it extends the significance of ego-motion technology in the computer vision domain.

The paper is structured as follows: Section 2 introduces motion estimation in general and ego-motion technology, and explains basic motion estimation concepts such as independent moving objects, focus of expansion, motion field and optical flow. Section 3 discusses three major applications of ego-motion technology: autonomous vehicles, visual odometry and visual SLAM (Simultaneous Localization and Mapping). Section 4 critically reviews the previous literature, describes the main algorithms that have been successfully used for ego-motion estimation, and surveys several types of camera setups along with their strengths and weaknesses. Section 5 examines the major ego-motion estimation complexities of occlusion, noise and camera calibration. Section 6 lists some open problems found during this literature review. Finally, Section 7 summarizes and concludes the paper and provides some future directions. Figure 1 summarizes the contents of this article.

Fig. 1 Ego-motion framework

2 Motion estimation concepts

An image sequence is composed of a chain of images acquired at discrete instants of time [99]. When sub-sampling image sequences, the time interval must be small enough to guarantee that the discrete sequence represents the actual continuous image sequence evolving over time [99]. There are two motions associated with any image sequence: the camera motion and the motion of objects in the scene [40]. Further, there are three possibilities for relative motion within an image sequence: the camera can move in front of a static scene, objects can move in front of a fixed camera, or the camera and the objects can move simultaneously [99]. 3D motion estimation is a challenging task, since a total of seven parameters are required to compute motion at each and every pixel [40].

Understanding motion estimation concepts is of great importance in computer vision. The sub-sections below discuss the basic concepts involved in estimating motion in a scene. Section 2.1 discusses ego-motion, the true motion of the moving camera, also known as observer motion; ego-motion has three groups of parameters: translation, rotation and depth. From the perspective of the camera, objects moving in the scene are known as independent moving objects; Section 2.2 discusses them and the complexities they pose for ego-motion estimation. The focus of expansion (FOE) is another important concept: using the FOE, one can determine whether the camera (observer) is moving towards objects or away from them (see Section 2.3). Sections 2.4 and 2.5 discuss the motion field, the true motion of scene points, in which a motion vector is assigned to every point in the image; these motion vectors can be purely translational or rotational. Finally, Section 2.6 discusses optical flow, the motion of brightness (luminance) patterns in an image.

2.1 Ego-motion

Raudies and Neumann [78] define ego-motion estimation in terms of a sequence of images recorded by a moving camera, from which the depth of the pictured environment and the 3D movement of the camera are recovered. Similarly, Baik, et al. [6] define visual ego-motion estimation as a continuous process in which 2D image sequences captured by a camera are used to estimate the 3D camera pose. Generally speaking, ego-motion is the camera’s own 3D motion within an environment, relative to a rigid scene.

Visual motion fields have been used as the primary model for ego-motion estimation over the past 30 years [78]. Cameras provide rich information at low power consumption, but all of this information must be handled before ego-motion can be inferred. Ego-motion estimation remains a challenging task, and a fully reliable and accurate technique has yet to be established. Image motion, ego-motion and scene depth are tightly coupled with each other [91]. The rotation, translation and depth parameters for ego-motion estimation are given below [69, 108].

$$ T = \left({T}_x, {T}_y, {T}_z\right)^T $$
(1)
$$ R = \left({R}_x, {R}_y, {R}_z\right)^T $$
(2)
$$ Z\left(x, y\right) $$
(3)

Here T represents the observer translation, R the observer rotation, and Z(x, y) the depth at image location (x, y). See Fig. 2 for the translational parameters and Fig. 3 for the rotational parameters.

Fig. 2 Translational parameter for ego-motion estimation

Fig. 3 Rotational parameter for ego-motion estimation

Visual ego-motion estimation is a long-standing problem in the computer vision literature. Raudies and Neumann [78] emphasize three estimation problems: first, how to compute the optical flow; second, how to estimate ego-motion from the optical flow combined with a model of visual image motion; and third, how to recover the translation speed of the observer for relative depth estimation. According to Gluckman and Nayar [32], the core problem of ego-motion is recovering the translation direction and observer rotation as the camera moves through the environment, and numerous algorithms have been developed by vision researchers to solve it [32]. Baik, et al. [6] cast ego-motion estimation as a state estimation problem that can be solved effectively using Bayesian and particle filtering methods; camera projection is inherently nonlinear, and Bayesian filtering methods are able to handle this nonlinearity. Beyond these open problems, visual ego-motion is of great significance for computer vision applications, robotics, augmented reality and vSLAM (Visual Simultaneous Localization and Mapping) [60, 71, 74].
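To make the state-estimation view concrete, the following is a minimal particle-filter sketch, not the method of [6]: the state is an assumed 6-DOF motion increment, and `flow_likelihood` is a hypothetical placeholder for a model that scores how well a candidate motion explains the observed optical flow.

```python
import numpy as np

def flow_likelihood(state, observed_flow):
    # Placeholder likelihood: in practice one would predict image motion
    # from `state` (e.g. via Eqs. (10)-(11)) and compare with `observed_flow`.
    predicted = np.zeros_like(observed_flow)
    err = np.sum((observed_flow - predicted) ** 2)
    return np.exp(-err / (2 * 0.1 ** 2))

def particle_filter_step(particles, weights, observed_flow, motion_noise=0.01):
    # 1. Predict: diffuse particles with process noise (constant-motion prior).
    particles = particles + np.random.normal(0.0, motion_noise, particles.shape)
    # 2. Update: reweight each particle by the observation likelihood.
    weights = np.array([flow_likelihood(p, observed_flow) for p in particles])
    weights += 1e-300                  # guard against all-zero weights
    weights /= weights.sum()
    # 3. Resample: draw particles in proportion to their weights.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

# Usage: 500 particles over a 6-DOF motion increment (tx, ty, tz, wx, wy, wz).
particles = np.random.normal(0.0, 0.05, size=(500, 6))
weights = np.full(500, 1.0 / 500)
observed_flow = np.zeros((10, 2))      # dummy flow measurements
particles, weights = particle_filter_step(particles, weights, observed_flow)
estimate = particles.mean(axis=0)      # posterior mean as the ego-motion estimate
```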

2.2 Independent moving objects

As an observer moves through the environment, the scene changes. If the scene is static and no objects are moving, essential information about the motion of the observer can be inferred from such image sequences [78]. Objects in the scene that are not static are referred to as “moving independently” [78]. Moving object detection in videos is a key step for information extraction in several vision-based applications such as semantic annotation, video surveillance, traffic monitoring and people (or, more generally, object) tracking. Extracting observer information from scenes with moving objects is challenging, because it becomes difficult to decide whether the objects in the scene are moving or the observer is.

When building any vision system, visual motion must be addressed as well [78]. Identifying independent moving objects (IMOs) from a moving observer is of remarkable significance and is effortlessly achieved by humans, yet developing algorithms that achieve the same task remains challenging for computer vision researchers. Even for navigational purposes, the observer must be able to identify moving objects in the scene in order to avoid collisions [59].

The fastest algorithm for moving object detection with a static camera is frame differencing [47]: the difference between two consecutive frames is computed, and moving objects appear where the difference is large. However, if both camera motion (ego-motion) and object motion are present in a scene, simple differencing is not applicable, since two independent motions are involved [47].
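As an illustration, a minimal frame-differencing sketch follows (assuming OpenCV and a static camera; the file name and threshold value are arbitrary choices):

```python
import cv2

# Minimal frame-differencing sketch for a static camera.
# Moving objects show up where consecutive frames differ strongly.
cap = cv2.VideoCapture("video.mp4")        # hypothetical input file
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev_gray)                        # |I_t - I_{t-1}|
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)  # arbitrary threshold
    # `mask` is non-zero where motion occurred; with a moving camera this mask
    # is dominated by ego-motion and is no longer meaningful on its own.
    prev_gray = gray
cap.release()
```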

Independent movement of one or more objects gives rise to several issues in ego-motion estimation. Well-known work on joint object and ego-motion estimation was done by MacLean, et al. [59], who used the Heeger and Jepson subspace method together with the expectation maximization (EM) algorithm. Several constraints were modified so that a spherical camera could be used instead of a pinhole camera, and a bilinear constraint with a Gaussian distribution was suggested [59].

2.3 Focus of expansion

Under pure camera translation with the camera looking forward, all image features appear to diverge from a specific image location known as the focus of expansion (FOE) [13, 14]. For backward camera motion, image features converge towards a specific image location called the focus of contraction. In other words, the FOE can be described as the intersection point of the translation vector with the image plane [60, 64]. If the scene is stationary, all velocity vectors emanate from the FOE; in the presence of independently moving objects, the velocity vectors of those objects have different image-flow directions.

Finding the focus of expansion from the frames is important for computing the 3D camera translation [40, 60]. With traditional cameras the FOE may lie outside the limited field of view, whereas wide-angle imaging systems keep it in view [32]. Optical flow can also be used to find the FOE. Once the FOE is determined, it can be used to estimate distances to points in the scene, and these distances can be used to find the time to impact (collision) between objects and the camera: the rate of expansion gives the estimated time to collision. This feature is used in applications for the control of moving vehicles and robots, collision warning systems and obstacle avoidance [64].
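Because every flow vector under (approximately) pure translation lies on a line through the FOE, the FOE can be recovered from sparse optical flow by least squares. The following sketch assumes pre-computed flow vectors and is an illustration, not a specific published algorithm:

```python
import numpy as np

def estimate_foe(points, flows):
    """Least-squares FOE estimate from 2D image points and their flow vectors.

    Assumes (approximately) pure translation, so every flow vector lies on a
    line through the focus of expansion. Each flow (u, v) at (x, y) gives one
    linear constraint:  v * x_f - u * y_f = v * x - u * y.
    """
    u, v = flows[:, 0], flows[:, 1]
    x, y = points[:, 0], points[:, 1]
    A = np.stack([v, -u], axis=1)
    b = v * x - u * y
    foe, *_ = np.linalg.lstsq(A, b, rcond=None)
    return foe  # (x_f, y_f) in image coordinates

# Toy usage: radial flow diverging from (320, 240) is recovered exactly.
foe_true = np.array([320.0, 240.0])
pts = np.random.uniform(0, 640, size=(100, 2))
flows = 0.05 * (pts - foe_true)          # expansion away from the FOE
print(estimate_foe(pts, flows))          # ~ [320. 240.]
```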

2.4 Motion field

The visual motion field describes the projection of the motion of 3D scene points onto the camera’s image surface [78]. It is used to observe movements in a scene, whether rotational or translational. The motion field can also be used to determine relative depth in a scene, which, when combined with ego-motion, yields information beneficial for visual navigation [78]. The motion field is central to problems such as recovering the 3D velocity field and segmenting images for moving object identification or reconstruction [104].

The motion field thus illustrates how 3D motion projects onto the camera image [99]. In the motion field, each pixel in the image is assigned a velocity vector; these velocities are induced by the relative motion between the 3D scene and the camera. Each point of a 3D scene is projected to a point in the image, but the location of the projection of a fixed scene point may vary with time.

The spatial and temporal variations of the image brightness pattern are known as ‘optical flow’ and can be used to estimate the motion field [104]. The motion field can also be represented as a mapping from image coordinates to 2D vectors. The image of a scene point P is the point p given by the equation below [99].

$$ p=f\left(\boldsymbol{P}/Z\right) $$
(4)

where P = [X, Y, Z]^T is a 3D point in the standard camera reference frame, the Z axis is the optical axis, and f denotes the focal length. The projection of the scene point is p = [x, y, f]^T; since the third coordinate is always equal to f, the projection can be written as p = [x, y]^T. The relative motion between the camera and the scene point P can be expressed using Eq. (5) [99].

$$ V=-T-\omega \times P $$
(5)

where T = (T_x, T_y, T_z)^T is the translational velocity and ω = (ω_x, ω_y, ω_z)^T is the angular velocity. In expanded form, Eq. (5) can be written as

$$ {V}_x=-{T}_x-{\omega}_yZ+{\omega}_zY $$
(6)
$$ {V}_y=-{T}_y-{\omega}_zX+{\omega}_xZ $$
(7)
$$ {V}_z=-{T}_z-{\omega}_xY+{\omega}_yX $$
(8)

The time derivative of both sides of the projection Eq. (4) is taken in order to obtain the relation between the velocity of P and the corresponding velocity of p.

$$ v=f\frac{ZV-{V}_zP}{Z^2} $$
(9)

In expanded form the Eq. (9) can be written as

$$ {v}_x=\frac{T_zx-{T}_x\ f}{Z}-{\omega}_yf+{\omega}_zy+\frac{\omega_xxy}{f}-\frac{\omega_y{x}^2}{f} $$
(10)
$$ {v}_y=\frac{T_zy-{T}_y\ f}{Z}+{\omega}_xf-{\omega}_zx-\frac{\omega_yxy}{f}+\frac{\omega_x{y}^2}{f} $$
(11)

Finally, the motion field is the sum of two main components, one due to translation and one due to rotation. The part of the motion field that depends on the angular velocity does not carry any depth information [99].
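As a numerical illustration, the sketch below evaluates Eqs. (10) and (11) on a grid of image points; the function and variable names are illustrative, and the depth map, translation and rotation values are arbitrary.

```python
import numpy as np

def motion_field(x, y, Z, T, omega, f=1.0):
    """Image motion (v_x, v_y) predicted by Eqs. (10)-(11).

    x, y  : image coordinates (arrays), Z : depth at those pixels,
    T     : (T_x, T_y, T_z) camera translation,
    omega : (w_x, w_y, w_z) camera angular velocity, f : focal length.
    """
    Tx, Ty, Tz = T
    wx, wy, wz = omega
    # Translational component (depends on depth Z).
    vx_t = (Tz * x - Tx * f) / Z
    vy_t = (Tz * y - Ty * f) / Z
    # Rotational component (independent of depth).
    vx_r = -wy * f + wz * y + wx * x * y / f - wy * x ** 2 / f
    vy_r = wx * f - wz * x - wy * x * y / f + wx * y ** 2 / f
    return vx_t + vx_r, vy_t + vy_r

# Toy usage: pure translation along the optical axis gives a radial field
# (cf. Eqs. (12)-(13) below).
x, y = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
vx, vy = motion_field(x, y, Z=np.full_like(x, 10.0),
                      T=(0.0, 0.0, 1.0), omega=(0.0, 0.0, 0.0), f=1.0)
```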

2.5 Pure translation and rotation motion field

The motion field is the sum of two components, one of which depends on the translation and the other on the rotation [99]. For pure translation, the rotational component ω of the motion field is zero.

Since ω = 0, Eqs. (10) and (11) become

$$ {v}_x=\frac{T_zx-{T}_x\ f}{Z} $$
(12)
$$ {v}_y=\frac{T_zy-{T}_y\ f}{Z} $$
(13)

Three cases are associated with the pure translational field. Let p0 be the point where the translation direction intersects the image plane. If Tz < 0, the motion field vectors point away from p0, which is called the focus of expansion. If Tz > 0, the motion field vectors point towards p0, which is called the focus of contraction. In both cases with Tz ≠ 0 the motion field is radial. If Tz = 0, the motion field vectors are parallel.
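Setting v_x = v_y = 0 in Eqs. (12) and (13) makes the image location of p0 explicit (a short check, assuming Tz ≠ 0):

$$ \frac{T_z x - T_x f}{Z} = 0, \quad \frac{T_z y - T_y f}{Z} = 0 \quad \Rightarrow \quad p_0 = \left(\frac{f\, T_x}{T_z},\ \frac{f\, T_y}{T_z}\right) $$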

For pure rotation, the translational component T of the motion field is zero. In this case, the motion field follows from Eqs. (10) and (11) as

$$ {v}_x=-{\omega}_yf+{\omega}_zy+\frac{\omega_xxy}{f}-\frac{\omega_y{x}^2}{f} $$
(14)
$$ {v}_y={\omega}_xf-{\omega}_zx-\frac{\omega_yxy}{f}+\frac{\omega_x{y}^2}{f} $$
(15)

2.6 Optical flow

Recovering the motion field, the perspective projection onto the image plane of the true 3D velocity field, is a crucial step. The only data available are the spatial and temporal variations of the image brightness pattern [104]. From these variations it is possible to derive an estimate of the motion field called the optical flow [104].

Optical flow is the distribution of apparent velocities of brightness patterns in an image [37]. It can be generated by the relative motion of the viewer or of the objects. Optical flow provides information about the spatial arrangement of objects and the rate of change of this arrangement. For inferring ego-motion, the velocity measurements must be accurate and dense [7].

Many methods have been proposed for the computation of optical flow [7]. More than thirty years on, most recent methods still resemble the original technique of Horn and Schunck, which can therefore rightly be termed classical [96]. Horn and Schunck’s formulation relies on both spatial smoothness and brightness constancy, but it is not robust to outliers [96]. Black and Anandan [8] formulated a robust framework to deal with outliers in both the spatial and temporal terms. To generate piecewise smooth results, the quadratic regulariser in the Horn and Schunck model was replaced by alternative smoothness constraints [2, 11, 21, 67, 83, 107], and an L1 penalty was used by Shulman and Herve [88] to preserve flow discontinuities. Practical components of the best-performing methods include coarse-to-fine estimation, texture decomposition, incremental warping, warping with bi-cubic interpolation, graduated non-convexity, temporal averaging of image derivatives and median filtering [96].
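For reference, a minimal sketch of the original Horn and Schunck iteration is given below (not one of the robust variants discussed above); the derivative filters, smoothness weight and iteration count are simplified choices.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(I1, I2, alpha=1.0, n_iter=100):
    """Minimal Horn-Schunck optical flow sketch for two grayscale images.

    alpha weights the spatial smoothness term; larger alpha gives smoother flow.
    """
    I1, I2 = I1.astype(np.float64), I2.astype(np.float64)
    # Simple image derivatives (forward differences; many variants exist).
    Ix = convolve(I1, np.array([[-1.0, 1.0]]))
    Iy = convolve(I1, np.array([[-1.0], [1.0]]))
    It = I2 - I1
    # Classic Horn-Schunck local-average kernel.
    avg = np.array([[1, 2, 1], [2, 0, 2], [1, 2, 1]], dtype=np.float64) / 12.0
    u = np.zeros_like(I1)
    v = np.zeros_like(I1)
    for _ in range(n_iter):
        u_bar = convolve(u, avg)
        v_bar = convolve(v, avg)
        # Update derived from the linearised brightness-constancy
        # plus smoothness objective.
        num = Ix * u_bar + Iy * v_bar + It
        den = alpha ** 2 + Ix ** 2 + Iy ** 2
        u = u_bar - Ix * num / den
        v = v_bar - Iy * num / den
    return u, v
```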

3 Significance of ego-motion estimation

Ego-motion technology is beneficial for many applications such as autonomous navigation, video conferencing and remote surveillance [20, 29, 32, 71, 93, 100]. According to Tsao, et al. [100], ego-motion is extremely useful for human computer interaction and short term control applications.

For several years, a primary goal of computer vision researchers has been the effective use of video sensors for navigation and obstacle detection [71]. Using an image sensor for motion estimation is attractive: it is compact, low-cost and self-contained [54], and such sensors can be important components of navigation systems. Various types of sensors are used for navigation and motion estimation, the primary ones being GPS and inertial measurement units (IMUs) [54].

There are several drawbacks associated with these sensors, which make ego-motion estimation an attractive alternative [54]. For high-precision applications the sensors can be quite costly and are prone to error; low-cost IMUs degrade quickly unless corrected, and GPS receivers do not work indoors or under tree cover. The practical conclusion is to use visual ego-motion estimation in combination with the traditional methods [54].

3.1 Autonomous vehicles

Estimating a vehicle’s ego-motion enables advanced driving assistance systems and mobile robot localization [51]. Autonomous driving and vision-based driving assistance require accurate ego-motion estimation [93].

Vehicle analysis systems can be divided into two groups: offline and online applications [29]. Offline applications process recorded video sequences to study the behavioral patterns of the driver; these patterns are used to develop tools and applications for driver assistance. Online systems, on the other hand, are of great significance for intelligent driver support.

Omnidirectional cameras have long been preferred for ego-motion estimation because of their panoramic view and their ability to deal with ambiguities [29]. When mounted on an automobile, an omnidirectional camera provides a complete panoramic view that is well suited to both offline and online vehicle analysis applications [29].

Several researchers [16, 23, 29, 31, 41, 86, 87, 92-94, 102, 114] have applied ego-motion technology to autonomous vehicles. These investigations discuss several challenges associated with autonomous vehicles, including, but not limited to, uneven terrain, translational and relative rotational motion, the low number of feature points on a typical road, and obstacle and lane detection [93]. Various sensors have been proposed to tackle these issues, such as radar, laser or GPS-based systems. Radar and laser are particularly important in extreme conditions with no light, while GPS-based systems are unreliable in environments without a direct line of sight to satellites. Using vision-based ego-motion estimation rather than such sensors reduces sensor maintenance and cost, eliminates the need to calibrate between sensors, and avoids their other drawbacks [33, 93].

3.2 Visual odometry

Visual odometry deals with the motion estimation of a robot [65], where the motion is estimated from visual input only [71]. It has proved to be an extremely effective tool for vehicle safety when driving near obstacles or on inclines, performing slip checks, executing difficult drive maneuvers and ensuring precise imaging [60]. Visual odometry provides position knowledge, and this position information enables additional autonomous vehicle capability and better analysis during planetary maneuvers [60].

As mentioned earlier, one of the most famous and successful experiments in stereo visual odometry is the Mars Exploration Rovers (MER) mission, described by Cheng, et al. [18] and Maimone, et al. [60]. NASA sent two MERs, and each rover had accurate knowledge of its position, which allowed autonomous detection of, and compensation for, unforeseen slip during a drive [60]. Several computer vision researchers [18, 54, 57, 60, 71, 105] have worked on visual odometry.

Decades of research in visual odometry have shown that accurate localization and motion estimation are extremely important for navigation [18, 54, 57, 60, 71, 105]. Active beacons and GPS-based navigation provide absolute positioning, but at high cost. Visual odometry, by contrast, reduces sensor maintenance and cost by replacing sensors with an inexpensive camera, and it enables a more reliable and safer navigation system that can be used in close proximity to human beings [15]. The camera also provides a broad field of view and simultaneous sensing of range and appearance [65].

3.3 Visual SLAM

In localization, a robot estimates its own position relative to its surroundings [113]. In a pure localization setting, the robot must be provided with a precise map and cannot adapt to environmental changes; building such a map involves manual labor, specific expertise and cost. To achieve a higher degree of independence, the robot needs to construct a map using its own sensors and then recognize where it is within that map; this is called Simultaneous Localization and Mapping (SLAM) [113]. When a robot can travel across an environment without user involvement, build a map, and localize itself in that map, it is considered fully autonomous [50].

SLAM has typically been used with sparse landmarks, and such sparse map representations are insufficient for tasks like path planning, obstacle avoidance and autonomous navigation. SLAM for ground vehicles has mostly been performed with range sensors rather than visual cameras [71]. If GPS or external beacons become unavailable, the robot must still be able to find its position from reference points and build a map [50]. One of the main approaches to moving away from costly sensors is to use a camera and optical encoders to locate suitable reference points; this is referred to as Visual SLAM (vSLAM) [50, 113]. Recent improvements in sensors and hardware have made vision-based processing more practical and mature [71].

Visual SLAM is particularly cost-efficient for consumer robotics [50]. It improves the overall performance of localization and mapping, and its algorithms are robust to dynamic changes in the environment caused by moving objects, people or lighting changes [113]. The primary purpose of visual SLAM is to combine image and odometry data so as to enable robust map-building and localization; this robustness allows difficult-to-model noise in the sensor data acquired from the mobile robot to be handled [50].

4 Algorithms and hardware comparisons

In Section 4.1 we briefly explain some popular algorithms used for ego-motion estimation. In Section 4.2 we overview different camera setups and specify their drawbacks in the context of ego-motion estimation.

4.1 Critical review of major algorithms

This section primarily focuses on instantaneous-time algorithms for ego-motion estimation. The extensions and modifications to these algorithms by other renowned computer vision researchers are also studied.

Over the past 30 years, extensive research has been devoted to ego-motion estimation. The most common and notable approach is based on optical flow. Over time, point correspondence methods were also introduced. The main difference between optical flow and point correspondence methods lies in the motion representation: point correspondence methods can represent large motions, while optical flow represents small motions.

Overall, algorithms for ego-motion computation can be classified as discrete-time or instantaneous-time algorithms [97]. In instantaneous-time algorithms the input is the image velocity, while discrete-time methods operate on image displacements [97]. In a rigid scene under perspective projection, image velocity is produced by camera motion and, as described in [97], can be expressed using Eq. (16).

$$ u(X)=\left[\begin{array}{ccc} 1 & 0 & -{x}_1 \\ 0 & 1 & -{x}_2 \end{array}\right]\left(\frac{T}{Z(X)}+\varOmega \times X\right) $$
(16)

The image velocity is u(X), the image position is X = (x_1, x_2, 1), T is the translational velocity, Ω is the rotational velocity and Z is the depth. The focal length is taken as 1.

Prominent work on instantaneous-time algorithms was done by Bruss & Horn, Jepson & Heeger, Tomasi & Shi, Kanatani A, Kanatani B and Prazdny (see Fig. 4). These algorithms differ from one another along four dimensions [97]: whether rotation is estimated first, whether translation is estimated first, whether iterative numerical optimization is required, and whether they rely on motion parallax or on the epipolar constraint.

Fig. 4 Timeline for major ego-motion algorithms

4.1.1 Prazdny

In the computer vision domain, the first computational model for ego-motion was developed by Prazdny [76], who showed that optical flow fields generated at an observer’s retina during movement through a 3D world can be used to compute the observer’s ego-motion parameters. He proposed an algorithm that estimates rotation first, independently of translation and depth. Prazdny’s implementation used triples of image points and a rotation constraint. Later, Tian, et al. [97] combined all constraints across the image; they found that when each triple of points came from a Delaunay triangulation, different triangulations led to inconsistent estimates, and therefore used a fixed uniform sampling grid to fix the triangulation. The algorithms discussed in sections 4.1.2, 4.1.3, 4.1.4 and 4.1.5 all begin by estimating the translation first (see Table 1).

Table 1 Ego-motion estimation algorithms facts

Prazdny’s method assumes that surfaces in the scene are smooth. The model is very sensitive to noise and works best when there is little or no noise in the optical flow field (e.g., no falling leaves, swaying trees or blowing snow); no mechanism is provided to overcome external or internal noise.

4.1.2 Bruss and Horn

Bruss and Horn eliminated depth and obtained a bilinear constraint that can be applied at each individual image pixel [12]. The same bilinear constraint was later derived by MacLean and Jepson using a different algebraic manipulation [59]. The Bruss and Horn algorithm performed quite well in many different simulations; its main drawback is the need for numerical optimization [97]. Bruss and Horn also worked on direct methods for motion estimation [38], in which neither point correspondences nor optical flow estimation are required: in the cases of pure translation and pure rotation, the first derivatives of brightness in an image region are used directly for motion estimation. The Bruss and Horn algorithm works best in planar environments where no depth parameter is involved.

4.1.3 Jepson and Heeger

Rieger and Lawton [79] suggested a technique for ego-motion estimation based on motion parallax. Two 3D points at different depths may project to the same image location, and the difference of their flow vectors is oriented toward the FOE; the Rieger and Lawton [79] algorithm therefore locates the focus of expansion from differences of flow vectors. Hildreth [36] later revised the algorithm to improve its performance. In both algorithms, measuring flow vectors near occlusion boundaries is challenging [97]. Jepson and Heeger studied these earlier efforts and proposed their own solutions, including several subspace approaches for ego-motion estimation [35, 45, 46]. The main benefit of the linear subspace method is that no iterative numerical optimization is required. The Jepson and Heeger algorithm is best suited to scenes with rich depth structure and little to no noise.

4.1.4 Tomasi and Shi

Tomasi and Shi [98] use motion parallax information differently: translation is estimated from image deformations, i.e. variations in the angular distance between pairs of image points, which are independent of camera rotation. The Tomasi and Shi algorithm works best for scenes with straight-ahead motion and some sideways motion.

4.1.5 Kanatani A and B

The epipolar constraint is the foundation of several linear discrete-time algorithms. An instantaneous-time form of the epipolar constraint for ego-motion estimation was proposed by Zhuang, et al. [116] and later reformulated by Kanatani [49] in terms of flow parameters and twisted flow. Kanatani’s and Zhuang’s algorithms are essentially equivalent [97]; this algorithm is referred to as Kanatani A.

Least-squares estimates of the translation vector are statistically biased [97]. Kanatani introduced a renormalization method that automatically removes this bias under unknown noise, analyzing it with a simple Gaussian noise model; this variant is named Kanatani B [49]. Kanatani A and B are both used for scenes where non-rigid and small motions are involved.

4.1.6 Other methods

The algorithms discussed in sections 4.1.1 to 4.1.5 are the best-known methods for ego-motion estimation, and several extensions and modifications have been proposed. Kanatani’s method was further extended to study its effect on the brightness of moving objects [61], and, as another extension, orthogonal subspace decomposition was proposed by Wu, et al. [109].

The Horn and Schunck model is one of the most classical models for optical flow estimation, and several researchers have adapted and modified it [2, 8, 11, 21, 67, 83, 88, 107]. As noted in section 2.6, Black and Anandan [8] extended it with a robust framework for handling outliers in both the spatial and temporal terms, others replaced its quadratic regulariser with smoothness constraints that produce piecewise smooth results [2, 11, 21, 67, 83, 107], and Shulman and Herve [88] used an L1 penalty to preserve flow discontinuities.

Optical flow methods have also been used for ego-motion estimation by Raudies and Neumann [77] and Briod, et al. [27]. A combination of optical flow and Random Sample Consensus (RANSAC) was used for ego-motion estimation and for efficient separation of the translation and rotation parameters [77]. More recently, Briod, et al. [10] proposed an ego-motion estimation algorithm that uses only the directions of optical flow, not its scale, making the method immune to inertial sensor drift.
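To illustrate how RANSAC can be combined with optical flow (a generic sketch, not the specific algorithm of [77]), one can hypothesize a focus of expansion from a minimal set of two flow vectors and count as inliers the flows whose directions are consistent with radial expansion from it:

```python
import numpy as np

def intersect_flow_lines(pts, fls):
    # Intersection of the two lines through pts[k] with directions fls[k].
    A = np.stack([fls[:, 1], -fls[:, 0]], axis=1)
    b = fls[:, 1] * pts[:, 0] - fls[:, 0] * pts[:, 1]
    if abs(np.linalg.det(A)) < 1e-9:
        return None          # (nearly) parallel flow vectors, no stable FOE
    return np.linalg.solve(A, b)

def ransac_foe(points, flows, n_iter=500, thresh_deg=5.0, seed=None):
    """Illustrative RANSAC loop over flow vectors.

    Each hypothesis is a candidate FOE from two randomly chosen flow vectors;
    inliers are flows whose direction agrees with the radial direction from
    that FOE to within `thresh_deg` degrees.
    """
    rng = np.random.default_rng(seed)
    best_foe, best_inliers = None, np.zeros(len(points), dtype=bool)
    cos_thresh = np.cos(np.deg2rad(thresh_deg))
    for _ in range(n_iter):
        i, j = rng.choice(len(points), size=2, replace=False)
        foe = intersect_flow_lines(points[[i, j]], flows[[i, j]])
        if foe is None:
            continue
        radial = points - foe
        cosang = np.sum(radial * flows, axis=1) / (
            np.linalg.norm(radial, axis=1) * np.linalg.norm(flows, axis=1) + 1e-12)
        inliers = np.abs(cosang) > cos_thresh
        if inliers.sum() > best_inliers.sum():
            best_foe, best_inliers = foe, inliers
    return best_foe, best_inliers
```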

Inspired by recent developments in deep learning, convolutional neural networks have been used to learn feature representations from images for frame-to-frame motion estimation [19]; the proposed method is extremely robust to image anomalies and imperfections. Branch-and-bound methods have also been proposed [27, 28, 52]; they handle outliers robustly but deal only with purely translational camera motion.

4.2 Types of camera

The main equipment required for ego-motion estimation is the camera. Over the years, numerous cameras have been developed; some provide narrow fields of view, while others provide wide or panoramic views. In this section we first review the traditional camera setup and its drawbacks in the context of ego-motion estimation, and then consider the omnidirectional, stereo and monocular setups that have been widely used for ego-motion estimation.

4.2.1 Traditional camera

Traditional cameras have a limited field of view, typically a cone of about 45 degrees [32, 68]. They consist of a video camera attached to a lens, with a perspective projection model and a single center of projection. Because these conventional imaging systems are of finite size, incoming rays are occluded by the camera lens, and instead of a hemisphere the lens sees only a small cone [68]. Traditional cameras make the computation of camera motion sensitive to noise, because the direction of translation may lie outside the field of view [32].

Using multiple traditional cameras to overcome the limited field of view is infeasible because the centers of projection reside inside their respective lenses [68]. Another way to enlarge the field of view is to use rotating imaging systems, but these require precise positioning and moving parts, need more time to compute a large field of view, and can therefore only be used with static scenes [68].

4.2.2 Omnidirectional and panoramic camera

Omnidirectional cameras provide a 360° panoramic view of the scene [29, 63, 81] and have a much larger field of view than traditional cameras [17, 63]. They are well suited to 3D vision tasks such as obstacle detection and motion estimation [95]. A wide field of view can be attained by combining a camera and a mirror, using a camera with a wide-angle lens, or using several synchronized panoramic cameras [63]. The panoramic view of an omnidirectional camera helps in dealing with the ambiguities associated with ego-motion estimation [29].

Omnidirectional cameras show great potential for intelligent vehicle applications, autonomous navigation, remote surveillance, video conferencing and scene recovery [29, 68]. Another advantage of the 360° horizontal panoramic view is that feature points can be tracked over longer distances with fewer constraints [17]. Omnidirectional cameras are also advantageous for robot vision systems, e.g. when the surrounding scene is stored as a single image frame [17, 29, 63], and the 360° view ensures that no object escapes the camera’s view [63].

Various algorithms have been proposed by vision researchers to solve the ego-motion problem [32]. Typically, ego-motion is solved in two steps: first the optical flow is computed, and then the camera translation and rotation are extracted from the resulting motion field. The main difficulty is the sensitivity of optical flow to noise, which omnidirectional camera systems seek to overcome [32]. Omnidirectional cameras also have drawbacks, one being lower image resolution for a larger field of view [17]. Many researchers have nevertheless preferred omnidirectional cameras for ego-motion estimation [29, 32, 34, 55, 56, 63, 81, 103].

4.2.3 Stereo and monocular camera

Many methods have been proposed for ego-motion computation using stereo [4, 20, 22, 23, 26, 42, 55, 62, 65, 72-74, 82, 84, 90] and monocular [48, 66, 80, 90, 110, 111, 115] sequences. The main differences among these methods lie in the feature tracking and in the transformation applied for ego-motion estimation [65].

A monocular camera setup has low runtime cost and benefits from well-developed machine learning paradigms [90]. However, because of the translation scale ambiguity, a monocular setup can recover the direction of translation but not its absolute magnitude [20], and monocular cameras do not directly provide 3D locations for detected vehicles [90].

Whenever possible, a stereo camera setup is preferred over a monocular one [20]. Its main advantage is that it allows explicit computation of depth and of locations in real-world coordinates [90]. One of the most outstanding works on stereo visual odometry, the Mars Exploration Rovers, was reported by Cheng, et al. [18] and Maimone, et al. [60]; NASA’s MER project successfully demonstrated the usefulness of visual odometry on another planet. The MER system used its navigation cameras, mounted at 45 degrees, to obtain stereo pairs that were compared using on-board software [60], and it was able to determine the full 6-degrees-of-freedom rover pose (roll, pitch, yaw, x, y, z). The main drawbacks of a stereo setup are the additional specialized hardware, precise calibration and extra computational cost it requires [90].
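As a brief illustration of why stereo yields metric depth directly (this is the standard rectified-pinhole relation, not the MER pipeline), depth follows from disparity, focal length and baseline:

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Standard pinhole stereo relation Z = f * B / d (a sketch).

    disparity_px : disparity in pixels (same row in rectified left/right images)
    focal_px     : focal length in pixels
    baseline_m   : camera separation in metres
    """
    d = np.asarray(disparity_px, dtype=np.float64)
    return np.where(d > 0, focal_px * baseline_m / d, np.inf)

# Example: 20 px disparity, 700 px focal length, 0.3 m baseline -> 10.5 m depth.
print(depth_from_disparity(20.0, 700.0, 0.3))
```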

5 Ego-motion estimation challenges

Difficulties and ambiguities in ego-motion estimation arise from unwanted camera motion, occlusion, noise, lack of image texture, illumination changes and the aperture problem [20, 85]. In this section we discuss these challenges and describe methods for avoiding or recovering from them.

5.1 Occlusion

Occlusions arise in video streams when a portion of the scene is visible in one image but not in another [3]; depth discontinuities may also introduce occlusion [3]. Image formation is often limited by occlusion [20].

Determining the various occlusion regions is challenging [39]. Occlusion can be detected easily if the motion field is known [3], but the computation of the motion field may itself be disrupted when parts of the scene become occluded and dis-occluded over the course of a video [39].

Self-occlusions are produced when the shape of the scene changes and undergoes significant deformation [101]. In forward-motion scenarios the apparent shape of the scene is affected: the scale changes in the image domain and produces self-occlusions [101]. Tsotsos, et al. [101] described an ego-motion estimation system for humanoid robots with specific emphasis on the challenges of scale changes due to forward motion.

Determining the occluded regions is important for establishing one-to-one correspondences of image scene points [39], which are required by video segmentation, tracking and reconstruction algorithms. Some applications simply treat occlusion regions as outliers [39]. Outliers themselves are problematic and arise for two reasons: incorrect matches and correct matches that violate the model [5]. Incorrect matches are produced by erroneous spatial or temporal point-to-point correspondences, while correct matches become outliers when the motion model is misrepresented or the assumption that a point is static is violated [5].

Occlusion boundaries are very informative about motion direction, depth ordering and scene context [39]. Recently, Yamaguchi, et al. [112] proposed a method for recovering occlusion boundaries of a static scene from two frames of a stereo pair captured from a moving vehicle.

5.2 Noise

Vision applications are inherently noisy, yet cameras are rich sources of information [58]. One of the primary challenges in ego-motion computation is the noisiness of optical flow estimates [32]. Whenever the direction of translation lies outside the field of view, the signal is seriously corrupted, making camera motion computation highly sensitive to noise [32, 100]; computing the translation parameters in the presence of noise is particularly challenging [100]. Noise likewise affects stereo and feature-tracking parameters [5].

Visual odometry is typically computed incrementally from frame to frame, and with each step small errors are introduced by noise [82]. Many visual odometry applications, such as robot navigation, autonomous driving and high-speed traffic scenes, are extremely challenging [58]; for these real-time applications to work accurately, robustness to noise must be attained in constant time [58].

Ego-motion estimation requires some prior analysis of the noise involved in the imaging process [5]. One of the most widely used models in simulation studies is the Gaussian noise model [78], and it is useful to know how the noise propagates as the data are processed [5].

5.3 Camera calibration

Accurate camera calibration is necessary for any computer vision task, such as ego-motion estimation, that requires extraction of metric information from 2D images [81]. The camera calibration problem comprises both the interior and the exterior orientation problems [43]: to relate image-plane coordinates to absolute coordinates, the orientation, camera position and camera constant must be determined [43].

In ego-motion estimation, each pixel is formed by perspective projection [43], and the calibration problem is to relate the positions of pixels in the image array to points in the scene [43]. Lens distortions, the aspect ratio and the location of the principal point must be determined to establish this relation [43].

Intrinsic camera calibration is the calibration of an individual camera and involves parameters such as the principal point, focal length and distortion model [24]. Extrinsic camera calibration, by contrast, determines the relative offset between two sensors [24].
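As a practical illustration of intrinsic calibration, the following is a minimal sketch using OpenCV's planar-pattern routine; the chessboard size and image paths are assumptions.

```python
import glob
import cv2
import numpy as np

# Minimal intrinsic-calibration sketch with a planar chessboard (assumed 9x6
# inner corners and images under "calib/*.png"; both are illustrative choices).
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for fname in glob.glob("calib/*.png"):
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Returns the reprojection error, the camera matrix (focal length, principal
# point) and the lens distortion coefficients discussed in the text.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection error:", rms)
print("intrinsics:\n", K)
```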

Far more methods have been developed for calibrating planar (conventional) cameras than omnidirectional ones [81]. Methods for omnidirectional camera calibration can be divided into two groups. The first group exploits prior knowledge about the scene, for example the presence of plumb lines or calibration patterns. The second group comprises techniques that do not exploit such prior knowledge, such as self-calibration procedures using point correspondences, the epipolar constraint for pure rotation, or planar motion of the camera [81].

6 Open problems

In the literature that we have reviewed, we have come across the following open problems for ego-motion estimation.

  1. Ego-motion, Object and Background Motion: How is the user supposed to determine both the camera and object(s) motion captured in extreme constraints and conditions?

  2. Scalability: How does each algorithm behave for extreme camera movements? How are the accuracy and algorithm running time affected?

  3. Evaluation: How to decide which algorithm and camera setup work best when used together for ego-motion estimation?

  4. Application: How to know that a vision system is good enough to be used for a non-trivial robotic application?

  5. Proportion: How many motion parameters need to be considered to detect a moving object reliably?

Independent movement of one or more objects under extreme conditions gives rise to several issues in ego-motion estimation, and even some state-of-the-art algorithms fail to deliver quality results; tackling this problem efficiently may require deep, structured learning algorithms for feature analysis in combination with optical flow. The scalability of algorithms under extreme camera movements can be tested using computer simulations or models. Evaluation can be eased by creating a benchmark that allows comparison between different algorithms and camera setups. Applications using ego-motion will need extensive tests on real-world images; the most important consideration is installing and calibrating experimental equipment capable of moving the camera in a controlled fashion. Finally, the last open problem concerns the number of motion parameters: estimation results are more reliable when the number of motion parameters is reduced, but this raises the further problem of detecting moving objects effectively from a reduced set of motion parameters.

7 Conclusions and future directions

We have reviewed the ego-motion estimation literature and its relevance to a range of computer vision problems. The reviewed literature can be grouped by algorithms, stereo camera setup, monocular camera setup, omnidirectional camera, autonomous vehicles, visual odometry and visual SLAM, as shown in Table 2.

Table 2 Motion estimation classified according to different domains

We have also pointed out the main challenges and issues that need to be taken into account when estimating camera motion. Noise, occlusion and calibration problems need to be addressed for proper and effective ego-motion estimation.

The literature reviewed shows that this research area remains wide open to new research and development. Here we offer a few possible suggestions for future work.

  • In future research, one should try to implement 6-DOF ego-motion estimation for autonomous vehicles. This approach could be fruitful for ego-motion estimation over large-scale rough terrain.

  • For vehicle collision detection and safety systems, a fusion of classifier-based and ego-motion-based approaches should be used to improve the detection rate. To help avoid collisions, object learning can also be carried out from different viewing angles.

  • Optical flow holds great significance in motion estimation. Researchers should investigate the problem of dynamic backgrounds, i.e. objects that have fixed locations but add a great deal of noise, such as windblown trees.

  • The ego-motion pipeline can also be accelerated with hardware resources such as field-programmable gate arrays (FPGAs), i.e. by moving part of the ego-motion algorithm onto dedicated hardware.

  • The use of a car speedometer for estimating the magnitude of forward motion could be investigated for long stretches of new highway, where there is enough vertical texture to compute rotation estimates but no horizontal texture for computing the forward magnitude.