1 Introduction

Distilling useful information from prior experience is one of the primary research problems in computer science. In machine learning, the information contained in the training data is distilled into a model that is then used to predict future outcomes. In the past few years, the advent of deep learning techniques has greatly benefited the areas of computer vision, speech, and Natural Language Processing (NLP). However, supervised deep learning-based techniques require a large amount of human-annotated training data to learn an adequate model. Although data have been painstakingly collected and annotated for problems such as image classification120,186, image captioning115, instance segmentation134, visual question answering81, and other tasks, it is not viable to do so for every domain and task. In particular, for problems in health care and autonomous navigation, collecting an exhaustive data set is either very expensive or all but impossible.

Even though supervised methods excel at learning from a large quantity of data, results show that they are particularly poor at generalizing the learned knowledge to a new task or domain221. This is because a majority of learning techniques assume that the training and test data are sampled from the same distribution. However, when the two distributions differ, the performance of the model is known to degrade significantly201,221. For instance, take the example of autonomous driving. The roadside environment of a city in Europe is significantly different from that of a city in South Asia. Hence, a model trained with input video frames from the former suffers a significant degradation in performance when tested on the latter. This is in direct contrast to living organisms, which perform a wide variety of tasks in different settings without receiving direct supervision168,237.

This survey is targeted towards summarizing the recent literature that addresses two bottlenecks of fully supervised deep learning methods: (1) the lack of labeled data in a particular domain; (2) the unavailability of direct supervision for a particular task in a given domain. Broadly, we can categorize the methods which aim to tackle these problems into three sets: (1) data-centric techniques, which solve the problem by generating a large amount of data similar to that in the original data set; (2) algorithm-centric techniques, which tweak the learning method to harness the limited data efficiently through techniques such as on-demand human intervention, exploiting the inherent structure of the data, capitalizing on freely available data on the web, or solving an easier but related surrogate task; (3) hybrid techniques, which combine ideas from both the data- and algorithm-centric methods.

Figure 1:

Learning paradigms arranged in decreasing order of supervision signal. Semantic segmentation of an outdoor scene is taken as the example task. (1) Fully supervised learning requires a large amount of annotated data to learn a viable model35. (2) Synthetically generated instances can be used to compensate for the lack of real-world data179. (3) Knowledge from one real-world data set can be transferred to another data set which does not contain a sufficient number of instances. For instance, a model trained on Cityscapes can be fine-tuned with data from the Indian Driving Data set (IDD)231. (4) In case pixel-level labels are expensive to obtain, inexact supervision from polygon labels can be exploited to accomplish the task. (5) If only a few instances are available along with their labels, few-shot learning techniques can be employed to learn a generalizable model. (6) Finally, unsupervised learning exploits the inherent structure of the unlabelled data instances.

Data-centric techniques include data augmentation, which involves tweaking the data samples with some pre-defined transformations to increase the overall size of the data set. For images, this involves affine transformations such as shifting, rotation, shearing, flipping, and distortion of the original image116. Some recent papers also advocate adding Gaussian noise to augment the images in the data set. Ratner et al.171 recommend learning these transforms instead of hard-coding them before training. Another method is to use techniques borrowed from computer graphics to generate synthetic data, which is then used along with the original data to train the model. When the data are in the form of time series, window slicing and window warping can be used for augmentation purposes126.
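To make the idea concrete, the following is a minimal sketch in Python/NumPy (the noise standard deviation and shift range are illustrative choices, not taken from any cited work) that applies a random flip, a random translation, and additive Gaussian noise to an image; real pipelines typically compose many more transforms or, as suggested above, learn them.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly augmented copy of an HxWxC image (illustrative sketch)."""
    out = image.copy()
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Random translation of up to ~10% of the image size.
    h, w = out.shape[:2]
    dy = rng.integers(-h // 10, h // 10 + 1)
    dx = rng.integers(-w // 10, w // 10 + 1)
    out = np.roll(out, shift=(dy, dx), axis=(0, 1))
    # Additive Gaussian noise, as advocated by some recent papers.
    out = out + rng.normal(0.0, 5.0, size=out.shape)
    return np.clip(out, 0, 255).astype(image.dtype)

# Usage: generate several augmented copies of each training image.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3)).astype(np.uint8)
augmented = [augment(image, rng) for _ in range(4)]
```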

Algorithm-centric techniques try to relax the need for perfectly labeled data by altering the model requirements to acquire supervision through inexact248, inaccurate148, and incomplete labels24. For most tasks, these labels are cheaper and easier to obtain than full-fledged task-pertinent annotations. Techniques involving on-demand human supervision have also been used to label selective instances from the data set220. Another set of methods exploits the knowledge gained while learning from a related domain or task by efficiently transferring it to the test environment189.

Hybrid methods incorporate techniques which improve the performance of the model at both the data and the algorithm level. For instance, in the urban scene understanding task, researchers often use a synthetically generated data set along with the real data for training. This proves to be greatly beneficial, as a real-world data set may not cover all the variations encountered at test time, e.g., different lighting conditions, seasons, camera angles, etc. However, a model trained using synthetic images suffers a significant decrease in performance when tested on real images due to domain shift. This issue is algorithmically addressed by making the model “adapt” to the real-world scenario259. Most of the methods discussed in this survey fall under this category.

In this paper, we discuss some of these methods along with their qualitative results. We use tasks associated with autonomous navigation as a case study to explain each paradigm. As a preliminary step, we introduce some common notations used in the paper. We then briefly review the radical improvement brought by supervised deep learning methods to computer vision tasks in Sect. 1.2. Section 2 contains an overview of work which involves the use of synthetic data for training. Various techniques for transfer learning are compared in Sect. 3. Methods for weak and self-supervision are discussed in Sects. 4 and 6, respectively. Methods which address the task of learning an adequate model from a few instances are discussed in Sect. 5. Finally, we conclude the paper by discussing the promises, challenges, and open research frontiers beyond supervised learning in Sect. 7. Figure 1 gives a brief overview of the survey in the context of the semantic segmentation task for autonomous navigation.

1.1 Notations and Definitions

In this section, we introduce some notations which aid the explanation of the paradigms surveyed in the paper. Let \({\mathcal{X}}\) and \({\mathcal{Y}}\) be the input and label space, respectively. In any machine learning problem, we assume we have N objects from which we wish to learn a representation of the data set. We extract features \(X = (x_{1}, x_{2}, \ldots , x_{N})\) from these objects to train our model. Let P(X) be the marginal probability over X. In a fully supervised setting, we also assume we have labels \(Y = (y_{1}, y_{2}, \ldots , y_{N})\) corresponding to each of these feature sets. A learning algorithm seeks to find a function \(f: {\mathcal{X}} \longrightarrow {\mathcal{Y}}\) in the hypothesis space \({\mathcal{F}}\). To measure the suitability of the function f, a loss function \(l: {\mathcal{Y}} \times {\mathcal{Y}} \longrightarrow {\mathbb {R}}^{\ge 0}\) is defined over the space \(\mathcal {L}\). A machine learning algorithm tries to minimize the empirical risk R associated with wrong predictions:

$$\begin{aligned} R = \frac{1}{N} \sum _{i=1}^{N} l(y_{i}, f(x_{i})). \end{aligned}$$
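As a concrete illustration, the snippet below is a minimal Python sketch that evaluates this empirical risk on a toy data set; the linear predictor and squared-error loss are arbitrary stand-ins for f and l.

```python
import numpy as np

def empirical_risk(f, X, Y, loss):
    """R = (1/N) * sum_i loss(y_i, f(x_i))."""
    return np.mean([loss(y, f(x)) for x, y in zip(X, Y)])

# Illustrative choices: a fixed linear predictor and a squared-error loss.
squared_loss = lambda y, y_hat: (y - y_hat) ** 2
f = lambda x: 2.0 * x + 1.0

X = np.array([0.0, 1.0, 2.0])
Y = np.array([1.0, 3.2, 4.9])
print(empirical_risk(f, X, Y, squared_loss))
```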

The use of synthetic data has become mainstream in the computer vision literature. Note that even though synthetic data may appear to contain the same entities, we cannot assume that they have been generated from the same distribution. Hence, we denote its input space as \({\mathcal{X}}_{\mathrm {synth}}\) instead of \({\mathcal{X}}\). However, the label space remains the same. To elaborate, we have a new domain \(\mathcal {D}_{\mathrm {synth}} = \{X_{\mathrm {synth}}, P(X_{\mathrm {synth}})\}\) which is different from the real domain \(\mathcal {D} = \{X, P(X)\}\), as both their input feature spaces and marginal distributions are different. Hence, we cannot use the objective predicting function \(f_{\mathrm {synth}}: {\mathcal{X}}_{\mathrm {synth}} \longrightarrow {\mathcal{Y}}\) for mapping \({\mathcal{X}}\) to \({\mathcal{Y}}\).

Transfer learning, a term often used interchangeably with domain adaptation (DA), aims to solve this problem. However, the term is used not only for transferring knowledge between different domains but also between distinct tasks. We define a task as consisting of the label space \({\mathcal{Y}}\) and the conditional distribution P(Y|X), i.e., \({\mathcal{T}} = \{{\mathcal{Y}}, P(Y|X)\}\). Building on the above notations, we define domain shift (\({\mathcal{D}}_{s} \ne {\mathcal{D}}_{t}\)) and task shift (\({\mathcal{T}}_{s} \ne {\mathcal{T}}_{t}\)), where \({\mathcal{D}}_{s}\) and \({\mathcal{D}}_{t}\) are the source and target domains, respectively. As an example, using synthetic data and then adapting the learned objective to the real domain falls under domain shift, as \({\mathcal {D}}\ne {\mathcal {D}}_{\mathrm {synth}}\). Within the domain adaptation literature, methods have been categorized into homogeneous and heterogeneous settings. Homogeneous domain adaptation methods assume that the input feature space for both the source and target distributions is the same, i.e., \({\mathcal{X}}_{s}={\mathcal{X}}_{t}\). Heterogeneous domain adaptation techniques relax this assumption. As a result, heterogeneous DA is considered a more challenging problem than homogeneous DA.

Although supervised learning assumes that every feature set \(x_{i}\) has a corresponding label \(y_{i}\) available at the time of training, the labels can be inaccurate, inexact, or incomplete in a real-world scenario. These scenarios collectively fall under the paradigm of weakly supervised learning. These conditions are particularly common if the training data have been obtained from the web. Formally, we define the feature set for the incomplete-label scenario as \(X = (x_{1}, x_{2}, \ldots , x_{l}, x_{l+1}, \ldots , x_{n})\), where \(X_{\mathrm {labeled}} =(x_{1}, x_{2}, \ldots , x_{l})\) have corresponding labels \(Y_{\mathrm {labeled}} = (y_{1}, y_{2}, \ldots , y_{l})\) available while training, but the remaining feature sets \( X_{\mathrm {unlabeled}} = (x_{l+1}, \ldots , x_{n})\) do not have any labels associated with them.

Other interesting weakly supervised models encompass cases where each instance has multiple labels or a bag of instances has a single label assigned to it. To formalize the multiple-instance single-label scenario, we assume that each feature set \(x_{i}\) is composed of many sub-feature sets \((x_{i,1}, x_{i,2},\ldots , x_{i,m})\). Here, \(x_{i}\) is called a “bag” of features, and the paradigm is known as multiple-instance learning. A bag is labeled positive if at least one item \(x_{i,j}\) is positive, and negative otherwise. Although the above paradigms correspond to varying amounts of supervision, they always assume that a large number of instances X is available at the time of training the model. This assumption breaks down when some classes do not have sufficient instances.

Few-shot learning entails the scenario where only a few (usually not more than 10) instances per class are available at the time of training. Zero-shot learning (ZSL) is an extreme scenario which arises when no instance is available for some classes during training. Given the training set with features \(X = (x_{1}, x_{2}, \ldots , x_{n})\) and labels \(Y_{\mathrm {train}} = (y_{1}, y_{2}, \ldots , y_{n})\), the test instances belong to previously unseen classes \(Y_{\mathrm {test}} = (y_{n+1}, y_{n+2}, \ldots , y_{m})\). Recently, some papers address a generalized ZSL scenario where the test instances may belong to both seen and unseen classes.

When no supervision signal is available, the inherent structure of the instances is utilized to train the model. Let X and Y be the feature and label sets, respectively; as we do not have P(Y|X), we cannot define the task \({\mathcal{T}}=\{{\mathcal{Y}}, P(Y|X)\}\). Instead, we define a proxy task \({\mathcal{T}}_{\mathrm {proxy}} = \{Z, P(Z|X)\}\), whose label set Z can be extracted from the data itself. For computer vision problems, proxy tasks have been defined based on spatial and temporal alignment, color, and motion cues.

1.2 Success of Supervised Learning

Over the past few years, supervised learning methods have enabled computer vision researchers to train increasingly accurate models. For several tasks, these models have achieved state-of-the-art performance which is comparable to humans. In the visual domain, accuracy for both structured and unstructured prediction tasks such as image classification91,96,116,203,214, object detection75,76,136,174,178, semantic segmentation12,27,92,133,138,182,260, pose estimation23,222, action recognition46,58,74,104,223, video classification110, and optical flow estimation47 has consistently increased, allowing for their large-scale deployment. Apart from computer vision, problems in other domains such as speech recognition82,83,190, speech synthesis229, machine translation13,84,213,244, and machine reading170 have also seen a significant improvement in their performance metrics.

Despite their success, supervised learning-based models have their fair share of issues. First of all, they are data hungry, requiring a huge number of instance-label pairs. In addition, a majority of the large data sets required to train these models are proprietary, as they provide an advantage to the owner in training a supervised model for a particular task and domain. Second, when a machine learning model is applied in the wild, it encounters a multitude of conditions which are not observed in the training data. In these situations, fully supervised methods, despite super-human-level performance on a particular domain, suffer drastic degradation in performance on a real-world test set as they are biased towards the training data set.

2 Effectiveness of Synthetic Data

A much better degree of photo-realism, easy-to-use graphics tools such as game engines, large libraries of 3D models, and appropriate hardware have made it possible to simulate virtual visual environments which can be used to construct synthetic data sets that are exponentially larger than real-world data sets. One primary advantage of using synthetic data is that precise ground truth is often available for free. On the other hand, collecting and annotating data for a large number of problems is not only a tedious process but also prone to human error. In addition, one can easily vary factors such as viewpoint, lighting, and material properties, gaining full control over the configurations and visual challenges to be introduced in the data set. This presents a major advantage for computer vision researchers, as real-world data sets tend to be non-exhaustive, redundant, heavily biased, and only partly representative of the complexity of natural images221. Moreover, some situations are impossible to arrange in a real-world setting because of safety issues, e.g., a head-on collision in an urban scene understanding data set. Last but not least, having a few high-profile real-world data sets biases the research community towards the tasks for which annotations have been provided with these data sets. Thus, graphically generated synthetic data sets have become a norm in the computer vision community, particularly for tasks such as medical imaging and autonomous navigation.

Figure 2:

Data collected in a real-world setting may not have sufficient diversity in terms of illumination, viewpoints, etc. Synthetic data produced through virtual visual models help to get around this bottleneck. Another way to create additional training data is to paste real or virtual objects into real scenes. One advantage of this approach is that the domain gap between real and synthetically generated data is smaller, leading to better performance on the real data set.

In the visual domain, synthetic data have been used mainly for two purposes: (1) evaluation of the generalizability of the model due to the large variability of synthetic test examples, and (2) aiding the training through data augmentation for tasks where it is difficult to obtain ground truth, e.g., optical flow or depth perception. A virtual test bed for the design and evaluation of surveillance systems is proposed by Taylor et al.217. Kaneva et al.108 and Aubry and Russell10 use synthetic data to evaluate hand-crafted and deep features, respectively. Butler et al.22 propose the MPI Sintel Flow data set, a synthetic benchmark for optical flow estimation. Handa et al.88 introduce ICL-NUIM, a data set for the evaluation of visual odometry.

More significantly, synthetic data are utilized for gathering additional training instances, which is mainly beneficial due to the availability of precise ground truth. There are various data generation strategies, from real-world images combined with 3D models to full rendering of dynamic visual scenes. Figure 2 illustrates two common methods for synthetic data generation. Vaquez et al.232 learn appearance models of pedestrians in a virtual world and use the learned model for detection in the real-world scenario. Similar techniques are described for pose estimation11,162, indoor scene understanding89, action recognition41, and a variety of other tasks. Instead of rendering the entire scene, Gupta et al.85 overlay text on natural images consistent with the local 3D scene geometry to generate data for the text localization task. A similar method is used for object detection52 and semantic segmentation177, where real images of both the objects and the backgrounds are composed to synthetically generate a new scene. One drawback of using synthetic data for training a model is that it gives rise to a “sim2real” domain gap. Recently, a stream of works in domain randomization188,219,224 claims to generate synthetic data with sufficient variations, such that the model views real data as just another variation of the synthetic data set.
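The cut-and-paste style of data generation described above can be sketched in a few lines. The NumPy example below is purely illustrative (the arrays stand in for a real background image, an object crop, and its segmentation mask) and shows how the pixel-level label comes for free with the composition.

```python
import numpy as np

def paste_object(background, obj, mask, top, left):
    """Composite `obj` (HxWx3) onto `background` where `mask` (HxW, bool) is True.

    Returns the composited image and a pixel-level label map obtained for free.
    """
    image = background.copy()
    label = np.zeros(background.shape[:2], dtype=np.uint8)
    h, w = obj.shape[:2]
    region = image[top:top + h, left:left + w]
    region[mask] = obj[mask]                      # overwrite only the object pixels
    label[top:top + h, left:left + w][mask] = 1   # ground-truth mask comes with the paste
    return image, label

# Usage with toy arrays standing in for a real background and object crop.
rng = np.random.default_rng(0)
background = rng.integers(0, 256, (128, 128, 3), dtype=np.uint8)
obj = rng.integers(0, 256, (32, 32, 3), dtype=np.uint8)
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True
image, label = paste_object(background, obj, mask, top=40, left=60)
```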

Modern game engines are a popular means of extracting synthetic data along with annotations, owing to their photo-realism and realistic physics simulation. Gaidon et al.64 present the Virtual KITTI data set and conduct experiments on multi-object tracking. SYNTHIA183 and GTA179 provide urban scene understanding data along with semantic segmentation benchmarks. UnrealCV167 provides a simple interface for researchers to build a virtual world without worrying about the game’s API.

Synthetic data for Autonomous Navigation

Autonomous navigation has greatly benefited from the use of synthetic data sets, as pixel-level ground truth can be obtained easily and cheaply using label propagation from frame to frame. As a result, several synthetic data sets have been curated particularly for visual tasks pertaining to autonomous navigation64,129,179,180,183,191. Alhaija et al.5 propose a method to augment real road scenes with virtual objects to create additional data for training the model. Apart from training models, racing simulators have also been used to evaluate the performance of different approaches to autonomous navigation26,48. Janai et al.102 offer a comprehensive survey of the literature pertinent to autonomous driving.

One of the major challenges in using synthetic data for training is the domain gap between real and synthetic data sets. Transfer learning, discussed in Sect. 3, offers a solution to this problem. Eventually, through the use of synthetic data, we would like to replace the expensive data acquisition process and manual labeling of ground truth with the generic problem of training on unlimited computer-generated data and testing in the real-world scenario without any degradation in performance.

3 Domain Adaptation and Transfer Learning

Figure 3:

Conventional techniques for domain adaptation. The original model is trained to classify instances from the source domain; it is able to classify the corresponding target-domain instances only after applying appropriate DA techniques.

As stated in Sect. 2, a model trained on a source domain does not perform well on a target domain with a different distribution. Domain adaptation (DA) is a technique which addresses this issue by reusing the knowledge gained from the source domain in the target domain. DA techniques have been categorized according to three criteria: (1) the distance between domains; (2) the presence of supervision in the source and target domains; (3) the type of domain divergence. Most DA techniques assume that the source and target domains are “near” each other, in the sense that the instances are directly related. In these cases, single-step adaptation is sufficient to align the two domains. However, if this assumption does not hold, multi-step adaptation is used, where a set of intermediate domains is employed to align the source and target domains. The prevalent literature also classifies DA into supervised, semi-supervised, and unsupervised settings according to the presence of labels in the source and target domains. Nevertheless, there are inconsistencies in these definitions within the literature; while some papers refer to the absence of target labels as unsupervised DA, others define it as the absence of both source and target labels. Hence, in this section, we categorize the DA techniques with respect to the type of domain divergence. Section 1.1 gives the formal notation and formulations for the DA setting.

Earlier works categorized the domain adaptation problem into homogeneous and heterogeneous settings. Homogeneous domain adaptation deals with the situation where both the source and target domains share a common feature space \({\mathcal{X}}\) but have different data distributions P(X) or P(Y|X). Some traditional methods for homogeneous domain adaptation include instance re-weighting25, feature transformations39,97, or kernel-based techniques that learn an explicit transform from the source to the target domain50,78,154. Figure 3 pictorially presents the traditional domain adaptation methods. All the techniques addressing this problem aim to correct the differences between the conditional and marginal distributions of the source and target domains. Heterogeneous domain adaptation pertains to the condition where the source and target domains are represented in different feature spaces. This is particularly important for problems in the visual domain such as image recognition80,117,263, object detection, semantic segmentation128, and face recognition, as different environments, backgrounds, illumination, viewpoints, sensors, or post-processing can cause a shift between the train and test distributions. Moreover, a difference between the tasks also demands that the model be adapted to the target-domain task. Manifold alignment238 and feature augmentation49,130 are some of the techniques used for aligning feature spaces in heterogeneous adaptation. A detailed survey of traditional adaptation techniques is out of the scope of this survey. We direct readers to Ben-David et al.16 and Pan et al.153 for a summary of homogeneous techniques, and to Day and Khoshgoftaar40 and Weiss et al.242 for a detailed overview of heterogeneous adaptation techniques. Patel et al.158, Shao et al.199, and Csurka37 provide an overview of shallow domain adaptation methods on visual tasks. In this paper, we briefly state recent advances in deep domain adaptation techniques pertaining to computer vision tasks.

Taking a cue from the success of deep neural networks in learning feature representations, recent DA methods use them to learn representations invariant to the domain, thus inserting the DA framework within the deep learning pipeline. Earlier work only used features extracted from a deep network for feature augmentation149 or subspace alignment139,169 of two distinct visual domains. Although these methods perform better than state-of-the-art traditional DA techniques, they do not leverage neural networks to directly learn a semantically meaningful and domain-invariant representation.

Contemporary methods use discrepancy-based or adversarial approaches for domain adaptation. Discrepancy-based methods posit that fine-tuning a deep network with target-domain data can alleviate the shift between domain distributions45,151,253. Label or attribute information70,227, Maximum Mean Discrepancy (MMD)226,249, correlation alignment212, statistical associations87, and batch normalization131 are some of the criteria used while fine-tuning the model.
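As an illustration of the discrepancy criteria listed above, the following sketch computes a (biased) estimate of the squared MMD between source and target features with an RBF kernel; the bandwidth and the toy features are placeholders, and in practice this quantity would be minimized as an additional loss while fine-tuning.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd2(source, target, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy between two samples."""
    k_ss = rbf_kernel(source, source, sigma)
    k_tt = rbf_kernel(target, target, sigma)
    k_st = rbf_kernel(source, target, sigma)
    return k_ss.mean() + k_tt.mean() - 2 * k_st.mean()

# Toy features: the target distribution is shifted, so the MMD is noticeably positive.
rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(100, 16))
tgt = rng.normal(0.5, 1.0, size=(100, 16))
print(mmd2(src, tgt))
```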

Adversarial methods encompass a framework which consists of a label classifier trained adversarially against a domain classifier. This formulation aids the network in learning features which are discriminative with respect to the learning task but indiscriminate with respect to the domain. Ganin et al.68 introduced the DANN architecture, which uses a gradient reversal layer to ensure that the feature distributions over the two domains are aligned. Liu and Tuzel136 introduce a GAN-based framework in which the generator tries to convert source-domain instances into target-domain instances and the discriminator tries to distinguish between transformed source and target instances. Bousmalis et al.20, Hoffman et al.95, Shrivastava et al.202, and Yoo et al.252 also focus on generating synthetic target data using an adversarial loss, albeit applying it in pixel space instead of the embedding space. Sankaranarayanan et al.193 use a GAN only to obtain the gradient information for learning a domain-invariant embedding, noting that successful domain alignment does not strictly depend on image generation. Tzeng et al.228 propose a unified framework for adversarial methods summarizing the type of adversary, loss function, and weight-sharing constraint to be used during training.
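The gradient reversal idea can be sketched in a few lines of PyTorch, as shown below. The feature extractor and the two heads are placeholder networks, and the snippet only illustrates the forward wiring, not a full DANN training loop.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Placeholder networks: a shared feature extractor, a label head, and a domain head.
features = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
label_head = nn.Linear(64, 10)
domain_head = nn.Linear(64, 2)

x = torch.randn(8, 32)
z = features(x)
label_logits = label_head(z)                   # trained to be discriminative for the task
domain_logits = domain_head(grad_reverse(z))   # reversed gradients push z towards domain invariance
```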

Generative Adversarial Network (GAN)

A GAN79 consists of two neural networks: a generator that creates samples from noise and a discriminator which receives samples from both the generator and the real data set and classifies them. The two networks are trained simultaneously with the intention that, at equilibrium, the generated samples are indistinguishable from real data. Apart from producing images, text, sound, and other forms of structured data, GANs have been instrumental in driving research in machine learning, particularly in cases where data availability is limited. Data augmentation7,62 using GANs has resulted in higher performing models than those which use affine transformations. Adversarial adaptation, a paradigm inspired by the GAN framework, is used to transfer data from the source to the target domain. Other notable applications of GANs include data manipulation140, adversarial training119, anomaly detection195, and adversarial cryptography1.

Reconstruction-based techniques try to construct a shared representation between the source and target domains while keeping the individual characteristics of both domains intact. Ghifary et al.72 use an encoder which is trained simultaneously for source-label prediction and target-data reconstruction. Bousmalis et al.19 train separate encoders to account for domain-specific and domain-invariant features; the domain-invariant features are used for classification, while both kinds of features are used for reconstruction. Methods based on adversarial reconstruction are proposed in Kim et al.112, Russo et al.187, Yi et al.251, and Zhu et al.262, which use a cyclic consistency loss as the reconstruction loss along with the adversarial loss to align two different domains.

Optimal transport is yet another technique used for deep DA38,173. Courty et al.36 assign pseudo-labels to the target data using the source classifier. They then transport the source data points to the target distribution, minimizing the distance traveled and the change in labels while moving the points.

Visual adaptation has been studied for problems such as cross-modal face recognition137,207, object detection31,94, semantic segmentation32,225,259, person re-identification42, and image captioning29. Although deep DA has achieved considerable improvement over traditional techniques, much of the work in the visual domain has focused on addressing homogeneous DA problems. Recently, heterogeneous domain adaptation problems such as face-to-emoji215 and text-to-image synthesis176,254 have also been addressed using adversarial adaptation techniques. Another interesting direction of work pertains to open-set DA21,23,255, which loosens the assumption that the label sets of the source and target domains must be exactly the same. Tan et al.216 address the problem of distant domain supervision, transferring the knowledge from source to target via intermediate domains. An in-depth survey of deep domain adaptation techniques is presented in Wang and Deng239.

4 Weakly Supervised Learning

Weakly supervised learning is an umbrella term covering predictive models which are trained under incomplete, inexact, or inaccurate labels. Incomplete supervision encompasses the situation where annotation is only available for a subset of the training data. As an example, take the problem of image classification with the ground truth provided through human annotation. Although it is possible to get a huge number of images from the internet, only a subset of these images can be annotated due to the cost associated with labeling. Inexact supervision pertains to the use of related, often coarse-level annotations. For instance, fully supervised object localization requires delineating bounding boxes; however, usually, we only have image-level labels. Finally, noisy or non-ground-truth labels can be categorized as inaccurate supervision. Collaborative image tags on social media websites can be considered as noisy supervision. Apart from saving annotation cost and time, weakly supervised methods have proven to be robust to changes in the domain during testing.

Figure 4:

An example of the varying degrees of supervision for the semantic segmentation problem. Although pixel-level labels provide strong supervision, they are relatively expensive to obtain. Thus, the recent literature suggests techniques which exploit polygon labels, scribbles, image-level labels, or even collaborative image tags from social media platforms (note that hashtags are not only an inexact but also an inaccurate form of supervision).

4.1 Incomplete Supervision

Weakly supervised techniques pertaining to incomplete labels make use of either semi-supervised or active learning methods. The conventional semi-supervised approaches include self-training, co-training18,165, and graph-based methods51. A discussion of these is out of the scope of this survey. Interested readers are directed to Chapelle et al.24 for a detailed overview of semi-supervised learning.

Active learning methods are used in computer vision to reduce labeling effort in problems such as image annotation109, recognition235, object detection250, segmentation234, and pose estimation135. In this paradigm, unlabeled observations are optimally selected from the data set to be queried at training time. For instance, localizing a car occluded by a tree is more difficult than localizing a non-occluded car. Thus, the human annotator could be asked to assign ground truth for the former case, which may lead to improved performance on the latter case. A typical active learning pipeline alternates between picking the most relevant unlabeled examples as queries to the oracle and updating the prior on the data distribution with the response34. Some common query formulation strategies include maximizing the label change61, maximizing the diversity of the selected samples53, reducing the expected error of the classifier184, and uncertainty sampling194. A survey by Settles198 gives insight into various active learning techniques.
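A minimal sketch of one query-selection round with least-confidence uncertainty sampling is given below; the class-probability matrix is simulated, and in a real pipeline it would come from the current model's predictions on the unlabeled pool.

```python
import numpy as np

def least_confident(probs, k):
    """Indices of the k unlabeled samples the model is least confident about.

    probs: (N, C) array of predicted class probabilities for the unlabeled pool.
    """
    confidence = probs.max(axis=1)          # probability of the predicted class
    return np.argsort(confidence)[:k]       # lowest confidence first

# One round of a typical active-learning loop (the oracle and model are stand-ins).
rng = np.random.default_rng(0)
pool_probs = rng.dirichlet(np.ones(5), size=1000)   # simulated posteriors over 5 classes
query_idx = least_confident(pool_probs, k=10)
# `query_idx` would be sent to a human annotator; the returned labels are then
# added to the training set and the model is retrained before the next round.
```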

Although both semi-supervised and active learning techniques have been used to address different problems in the visual domain, there has been an increased interest in the latter after the emergence of deep learning-based methods. Sener and Savarese197 and Gal et al.65 present effective methods to train a CNN using active learning heuristics. An approach to synthesize query examples using a GAN has also been proposed261. Fang et al.55 reframe active learning as a reinforcement learning problem. In addition, deep active learning methods have been used to address vision tasks such as object detection in Roy et al.185.

4.2 Inexact Supervision

Apart from dealing with partially labeled data sets, weakly supervised techniques also help relax the degree of annotation needed to solve a structured prediction problem. Full annotation is a tedious and time-consuming process, and contemporary vision data sets reflect this fact. For example, in ImageNet186, while 14 million images are provided with image-level labels and 500,000 are annotated with bounding boxes, only 4460 images have pixel-level object category labels. Thus, the development of training regimes which learn complex concepts from light labels is instrumental in improving the performance of several tasks.

A popular approach to harnessing inexact labels is to formulate the problem in the multiple-instance learning (MIL) framework. In MIL, the image is interpreted as a bag of patches. If one of the patches within the image contains the object of interest, the image is labeled as a positive instance, and otherwise as negative. The learning scheme alternates between estimating the object appearance model and predicting the positive patches within positive images. As this setup results in a non-convex optimization objective, several works suggest initialization209, regularization208, and curriculum learning118 techniques to alleviate the issue. Recent works100,243 embed the MIL framework within a deep neural network to exploit the weak supervision signal.
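The following toy sketch illustrates the MIL bag-labeling rule with max pooling over instance scores; the patch scores and the 0.5 threshold are illustrative stand-ins for the output of an appearance model.

```python
import numpy as np

def bag_score(instance_scores):
    """A bag is positive if its most confident instance is positive (max pooling)."""
    return np.max(instance_scores)

# Each image is treated as a bag of patch scores in [0, 1]; an assumed threshold of 0.5 is used.
bags = {
    "image_1": np.array([0.1, 0.05, 0.92, 0.3]),   # one strongly scored patch -> positive bag
    "image_2": np.array([0.2, 0.15, 0.1]),         # no positive patch -> negative bag
}
for name, scores in bags.items():
    print(name, "positive" if bag_score(scores) > 0.5 else "negative")
```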

Structured prediction problems such as weakly supervised object detection (WSOD) and semantic segmentation have garnered a lot of attention in recent years. Bilen and Vedaldi17 propose an end-to-end WSOD framework for object detection using image-level labels. Several other cues have been employed as supervision signals for WSOD, such as object size200 and count69, click supervision156,157, and human verification155. Similar methods have also been proposed for weakly supervised semantic segmentation15,98,111,132,142,163. Figure 4 depicts some weak supervision signals used for the semantic segmentation problem.

4.3 Inaccurate Supervision

As curating large-scale data sets is an expensive process, building machine learning models which use web data sets such as YouTube8m2, YFCC100M218, and Sports-1M110 is one of the pragmatic ways to leverage the almost infinite amount of visual data. However, labels in these data sets are noisy and pose a challenge for the learning algorithm. Several studies have investigated the effect of noisy instances or labels on the performance of machine learning algorithms. Broadly, we categorize the techniques into two sets. The first approach resorts to treating the noisy instances as outliers and discarding them during training54,211. Nevertheless, noisy instances may not be outliers and can occupy a significant portion of the training data. Moreover, algorithms pursuing this approach find it difficult to distinguish between noisily labeled and hard training examples. Hence, methods in this set often use a small set of perfectly labeled data. Another stream of methods focuses on building algorithms robust to noise107,146,175,230 by devising noise-tolerant loss functions73 or adding appropriate regularization terms9. For a comprehensive overview of learning algorithms robust to noise, we refer to Frénay and Verleysen60.
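As a small illustration of why the choice of loss matters under label noise, the sketch below compares cross-entropy with the bounded mean absolute error on a confidently wrong prediction; bounded losses of this kind are among those reported to be more tolerant to label noise.

```python
import numpy as np

def cross_entropy(p, y_onehot, eps=1e-12):
    return -np.sum(y_onehot * np.log(p + eps))

def mean_absolute_error(p, y_onehot):
    return np.sum(np.abs(p - y_onehot))

# A confidently wrong prediction under a (possibly noisy) label:
p = np.array([0.9, 0.05, 0.05])
noisy_label = np.array([0.0, 1.0, 0.0])
print(cross_entropy(p, noisy_label))        # ~3.0, and it grows without bound as p becomes one-hot
print(mean_absolute_error(p, noisy_label))  # 1.9, bounded by 2, limiting the noisy label's pull
```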

Consequently, a plethora of techniques have been proposed to harness deep neural networks in a “webly” supervised scenario. As most of the data on the web is contributed by non-experts, it is bound to be inaccurately labeled. Hence, techniques used to address noisy annotations can be applied if the training data are collected from the web. Chen and Gupta30 propose a two-stage curriculum learning technique that first trains on easier examples before adapting the model to web images. Xiao et al.247 predict the type of noise in each of the instances and attempt to remove it. Webly supervised methods have been proposed for many tasks in the visual domain, such as learning visual concepts43,67, image classification233, video recognition66, and object localization264.

5 k-Shot Learning

One of the distinguishing characteristics of human visual intelligence is the ability to acquire an understanding of novel concepts from very few examples. Conversely, a majority of current machine learning techniques show a precipitous decrease in performance if there is an insufficient number of labeled examples for a certain class. Few-shot learning techniques attempt to adapt current machine learning methods to perform well in a scenario where only a few training instances are available per class. This is of immense practical importance; for instance, collecting a traffic data set might result in only a few instances of auto-rickshaws. However, during testing, we would like the model to recognize auto-rickshaws at various scales, angles, and other variations which might not be present in the training set. Earlier methods such as Fei-Fei et al.57 use a Bayesian learning-based generative framework with the assumption that the prior built from previously learned classes can be used to bootstrap learning for novel categories. Lake et al.121 build a Hierarchical Bayesian model which performs similarly to humans on few-shot alphabet recognition tasks. However, their method is shown to work only for simple data sets such as Omniglot122. Wang and Hebert241 learn to regress from the parameters of a classifier trained on a few images to the parameters of a classifier trained on a large number of images. More recent efforts in few-shot learning can be broadly categorized into metric-learning and meta-learning-based methods.

Metric learning aims to design techniques for embedding the input instances into a feature space beneficial to few-shot settings. A common approach is to find a good similarity metric in the new feature space that is applicable to novel categories. Koch et al.114 use a Siamese network to compute the pairwise distance between samples, following which the learned distance is used to solve few-shot problems through k-nearest-neighbor classification. Vinyals et al.236 propose an end-to-end trainable one-shot learning technique based on cosine distance. Other loss functions used for deep metric learning include the triplet loss of Schroff et al.196 and the adaptive density estimation of Rippel et al.181. Mehrotra and Dukkipati143 approximate the pairwise distance by training a deep residual network in conjunction with a generative model.
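A minimal sketch of metric-based few-shot classification is given below: a query embedding is assigned the label of its nearest support embedding under cosine similarity. The embeddings here are random stand-ins for the output of a learned encoder such as a Siamese network, and the class names are purely illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def few_shot_classify(query_emb, support_embs, support_labels):
    """Assign the query the label of its most similar support embedding."""
    sims = [cosine_similarity(query_emb, s) for s in support_embs]
    return support_labels[int(np.argmax(sims))]

# Stand-in embeddings (in practice produced by a learned encoder).
rng = np.random.default_rng(0)
support_embs = rng.normal(size=(5, 64))                 # one shot per class, 5-way episode
support_labels = ["car", "bus", "rickshaw", "bike", "truck"]
query = support_embs[2] + 0.1 * rng.normal(size=64)     # noisy copy of the "rickshaw" shot
print(few_shot_classify(query, support_embs, support_labels))
```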

Meta-learning entails a class of approaches which quickly adapt to a new task using only a few data instances and training iterations. To achieve this, the model is trained on a set of tasks such that it transfers its “learning ability” to a novel task. In other words, meta-learners treat tasks as training examples. Finn et al.59 propose a model-agnostic meta-learning technique which uses gradient descent to train a classification model such that it is able to generalize well to any novel task given very few instances and training steps. Ravi and Larochelle172 also introduce a meta-learning framework, employing LSTM updates for a given episode. Recently, a method proposed by Mishra et al.145 also exploits contextual information within the tasks using temporal convolutions.
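The sketch below illustrates the meta-learning mechanics on a toy one-parameter regression family using the first-order approximation (gradients are not propagated through the inner update); the task distribution, learning rates, and step counts are illustrative choices, not those of any particular paper.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of the mean squared error 0.5*(w*x - y)^2 w.r.t. the scalar weight w."""
    return np.mean((w * x - y) * x)

def first_order_maml(meta_w=0.0, inner_lr=0.05, outer_lr=0.01, meta_steps=2000, seed=0):
    """Learn an initialization that adapts to a new task y = a*x in a single inner step."""
    rng = np.random.default_rng(seed)
    for _ in range(meta_steps):
        a = rng.uniform(0.5, 2.0)                       # sample a training task
        x_support, x_query = rng.normal(size=10), rng.normal(size=10)
        # Inner update: one gradient step on the support set of the sampled task.
        adapted_w = meta_w - inner_lr * mse_grad(meta_w, x_support, a * x_support)
        # Outer (first-order) update: query-set gradient at the adapted weight
        # is applied directly to the meta-initialization.
        meta_w = meta_w - outer_lr * mse_grad(adapted_w, x_query, a * x_query)
    return meta_w

print(first_order_maml())   # an initialization from which one gradient step fits a new task
```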

Another set of methods for few-shot learning relies on efficient regularization techniques to avoid over-fitting on the small number of instances. Hariharan and Girshick90 suggest a gradient magnitude regularization technique for training a classifier along with a method to hallucinate additional examples for few-shot classes. Yoo et al.252 also regularize the dimensionality of the parameter search space by efficiently clustering the parameters, ensuring intra-cluster similarity and inter-cluster diversity.

The literature on Zero-Shot Learning (ZSL) focuses on finding the representation of a novel category without any instances. Although it bears a strong semblance to the few-shot learning paradigm, the methods used to address ZSL are distinct from few-shot learning. A major assumption taken in this setting is that the classes observed by the model during training are semantically related to the unseen classes encountered during testing. This semantic relationship is often captured through class attributes describing the shape, color, pose, etc., of the object, which are either labeled by experts or obtained from knowledge sources such as Wikipedia, Flickr, etc. Lampert et al.123 were the first to propose a zero-shot recognition model, which assumes independence between different attributes and estimates the test class by combining the attribute prediction probabilities. However, most of the subsequent work takes attributes as the semantic embedding of classes and tackles ZSL as a visual-semantic embedding problem4,56,124,245. More recently, word embeddings206,256 and image captions176 have also been used in place of attributes as the semantic space. Figure 5 compares the two common approaches to ZSL with supervised learning.

Figure 5:

Comparison of supervised learning with ZSL. Features are not available for \({\mathrm {C}}_{3}\) and \({\mathrm {C}}_{4}\) at the time of training. However, the availability of attributes or semantic embeddings for both the train and test classes aids the training of the ZSL framework.

In ZSL, a joint embedding space is learned during training, onto which both the visual features and the semantic vectors are projected. During testing on unseen classes, a nearest-neighbor search is performed in this embedding space to match the projection of the visual feature vector against a novel object type. A pairwise ranking loss is used to learn the parameters of a bi-linear model in Akata et al.4 and Frome et al.63. Recently, Zhang et al.256 argue for using the visual space as the embedding space to alleviate the hubness problem that arises when performing nearest-neighbor search in the semantic space. We refer the readers to Xian et al.246 for a detailed evaluation and comparison of contemporary ZSL methods.
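The inference step of such an embedding-based ZSL model can be sketched as below: a visual feature is projected into the attribute space by a linear map (here randomly initialized, in practice learned on seen classes) and matched to the nearest class attribute vector; the class names and attributes are purely illustrative.

```python
import numpy as np

def zsl_predict(visual_feature, W, class_attributes, class_names):
    """Project a visual feature into attribute space and match it to the nearest class."""
    projected = W @ visual_feature                    # visual-to-semantic map (stand-in)
    dists = np.linalg.norm(class_attributes - projected, axis=1)
    return class_names[int(np.argmin(dists))]

# Unseen classes described only by attribute vectors (e.g., shape, color, pose ...).
class_names = ["zebra", "tiger"]
class_attributes = np.array([[1.0, 0.0, 1.0],    # striped, not orange, four-legged
                             [1.0, 1.0, 1.0]])   # striped, orange, four-legged
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 64))                      # in practice learned on seen classes
feature = rng.normal(size=64)
print(zsl_predict(feature, W, class_attributes, class_names))
```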

Some other tasks which have shown promising results in a zero-shot setting are video event detection86, object detection14, action recognition166, and machine translation106.

6 Self-supervised Learning

Figure 6:

Strong supervision vs. weak supervision vs. self-supervision. The two block types in the figure depict fully connected and convolutional layers, respectively.

In self-supervised learning, we obtain feature representations for semantic understanding tasks such as classification, detection, and segmentation without any external supervision. Explicit annotation pertaining to the main task is avoided by defining an auxiliary task that provides the supervisory signal. The assumption is that successful training of the model on the auxiliary task will inherently make it learn semantic concepts such as object classes and boundaries, which makes it possible to share knowledge between the two tasks. Self-supervision bears a semblance to transfer learning, where knowledge is shared between two different but related domains. However, unlike transfer learning, it does not require a large amount of annotated data from another domain or task. Figure 6 illustrates the difference between the two paradigms in the context of vehicle detection.

Before the advent of deep learning-driven self-supervision models, significant work was carried out on unsupervised learning of image representations using hand-crafted205 or mid-level features204. This was followed by deep learning-based methods like autoencoders93, Boltzmann machines192, and variational methods113, which learn by estimating latent variables that help to reconstruct the data.

The existing literature pertaining to self-supervision relies on using the spatial and temporal context of an entity as a “free” supervision signal. A prime example of this is Word2Vec144, which predicts the semantic embedding of a particular word based on the surrounding words. In the visual domain, context is efficiently used by Doersch et al.44, who predict the relative location of two image patches as a pretext task. The same notion is extended in Noroozi and Favaro150 by predicting the order of shuffled image patches. Apart from spatial-context-based auxiliary tasks, predicting color channels from luminance values125,257 and regressing a missing patch in an image using generative models159 have also been used to learn useful semantic information in images. Other modalities used for feature learning in images include text77, motion160,164, and cross-channel prediction258. Recently, Huh et al.99 take advantage of the EXIF metadata embedded in an image as a supervisory signal to determine if it has been formed by splicing different images.
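To illustrate how such “free” labels are constructed, the sketch below samples a training pair for a relative-patch-location style pretext task: a center patch, one of its eight neighbors, and the neighbor's position index, which serves as the label. The patch size and sampling scheme are simplified stand-ins for the original setup.

```python
import numpy as np

# Offsets of the eight neighbors around a center patch; the index is the free label.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    ( 0, -1),          ( 0, 1),
                    ( 1, -1), ( 1, 0), ( 1, 1)]

def relative_patch_pair(image, patch=16, rng=None):
    """Sample (center patch, neighbor patch, position label) from a single image."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    cy = rng.integers(patch, h - 2 * patch)
    cx = rng.integers(patch, w - 2 * patch)
    label = int(rng.integers(len(NEIGHBOR_OFFSETS)))
    dy, dx = NEIGHBOR_OFFSETS[label]
    center = image[cy:cy + patch, cx:cx + patch]
    neighbor = image[cy + dy * patch:cy + (dy + 1) * patch,
                     cx + dx * patch:cx + (dx + 1) * patch]
    return center, neighbor, label   # a network is trained to predict `label` from the pair

rng = np.random.default_rng(0)
image = rng.integers(0, 256, (96, 96, 3), dtype=np.uint8)
center, neighbor, label = relative_patch_pair(image, rng=rng)
```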

For videos, temporal coherence serves as an intrinsic underlying structure: two consecutive image frames are likely to contain semantically similar content. Each object within a frame is expected to undergo some transformation in the subsequent frames. Wang and Gupta240 use relationships between triplets of image patches obtained from tracking. Misra et al.147 train a network to guess whether a given sequence of frames from a video is in chronological order. Lee et al.127 make the network predict the correct sequence of frames given a shuffled set. Apart from temporal context, estimating camera motion103 and ego-motion3 and predicting the statistics of ambient sound8,152 have also been used as proxy tasks for video representation learning.

Self-supervision for Urban Scene Understanding

As solving autonomous navigation takes centre stage in both the vision and robotics communities, urban scene understanding has become a problem of utmost interest. More often than not, annotating each frame for training is a tedious job. As self-supervision gives the flexibility to define an implicit proxy task which may or may not require annotation, it is one of the preferred methods for addressing problems such as urban scene understanding. Earlier work in this area includes Stavens and Thrun210, where the authors estimate terrain roughness based on the “shocks” which the vehicle receives while passing over it. Jiang et al.105 show that predicting relative depth is an effective proxy task for learning visual representations. Ma et al.141 propose a multi-modal self-supervised algorithm for depth completion using LiDAR data along with a monocular camera.

7 Conclusion and Discussions

In the past decade, computer vision has benefited greatly from the fact that neural networks act as universal approximators of functions. Integrating these networks into pre-existing machine learning paradigms and optimizing them through backpropagation has consistently improved performance on different visual tasks. In this survey paper, we reviewed recent work pertaining to the paradigms which fall between fully supervised and unsupervised learning. Although most of our references lie in the visual domain, the same paradigms have been prevalent in related fields such as NLP, speech, and robotics.

The space between fully supervised and unsupervised learning can be qualitatively divided on the basis of the degree of supervision needed to learn the model. While synthetic data are a cost-effective and flexible alternative to real-world data sets, the models learned from them still need to be adapted to the real-world setting. Transfer learning techniques address this issue by explicitly aligning different domains through discrepancy-based or adversarial approaches. However, both of these techniques require “strict” annotation pertaining to the task, which hinders the generalization capability of the model. Weakly supervised algorithms relax the need for exact supervision by making the learning model tolerant of incomplete, inexact, and inaccurate supervision. This helps the model to harness the huge amount of data available on the web. Even when a particular domain contains an insufficient number of instances, methods in k-shot learning try to build a reasonable model using parameter regularization or meta-learning techniques. Finally, self-supervised techniques completely eliminate the need for annotation, as they define a proxy task for which the annotation is implicit within the data instances.

These techniques have been successfully applied in both structured and unstructured computer vision applications such as image classification, object localization, semantic segmentation, action recognition, image super-resolution, image caption generation, and visual question answering. Despite their success, recent models weigh heavily on deep neural networks for their performance. Hence, they carry both the pros and cons of these models, the cons being a lack of interpretability and outcomes which depend heavily on hyperparameters. Addressing these topics may attract increasingly more attention in the future.

Some very recent work combines ideas from two or more paradigms to obtain results in a very specialized setting. Peng et al.161 address the domain adaptation problem when no task-relevant data are present in the target domain. Inoue et al.101 leverage full supervision in the source domain and inaccurate supervision in the target domain to perform transfer learning for the object localization task.

In the coming years, other learning paradigms inspired by human reasoning and abstraction, such as meta-learning6,59, lifelong learning33, and evolutionary methods, may also provide interesting avenues for research. We hope that this survey helps researchers by easing the understanding of the field and encourages further research in the area.