1 Introduction

The human eye sees the world in an interesting way. We feel as if we see the entire scene at once, but this is an illusion created by the subconscious part of our brain [1]. According to the Scanpath theory [2, 3], when the human eye looks at an image, it sees only a small patch in high resolution, thanks to a region of the retina called the fovea; the rest of the image is seen in low resolution by the periphery. To recognize the entire scene, the eye performs feature extraction based on the fovea and is moved to different parts of the image until the information obtained from the fovea is sufficient for recognition [4]. These eye movements are called saccades. The eye makes successive fixations until the recognition task is complete. This sequential process happens so quickly that we feel as if it happens all at once.

Biologically, this is called the visual attention system. Visual attention is defined as the ability to dynamically restrict processing to a subset of the visual field [5]. It seeks answers to two main questions: what to look at, and where to look. Visual attention has been extensively studied in psychology and neuroscience; for reviews see [6,7,8,9,10]. In addition, there is a large body of literature on modeling eye movements [11,12,13,14]. These studies have been a source of inspiration for many artificial intelligence tasks: the attention idea has proved useful in problems ranging from image recognition to machine translation. Therefore, different types of attention mechanisms inspired by the human visual system have been developed over the years. Since deep neural networks have been at the forefront of these artificial intelligence tasks, such mechanisms have long been integrated into neural networks.

This survey is about the journey of attention mechanisms used with neural networks. Researchers have been investigating ways to strengthen neural network architectures with attention mechanisms for many years. The primary aim of these studies is to reduce the computational burden and, at the same time, to improve model performance. Previous work has reviewed attention mechanisms from different perspectives [15] or examined them in the context of natural language processing (NLP) [16, 17]. In this study, by contrast, we examine the development of attention mechanisms over the years, together with recent trends. We begin with the first attempts to integrate the visual attention idea into neural networks, and continue up to the most modern neural networks armed with attention mechanisms. One of them is the Transformer [19], which underlies many studies including the GPT-3 language model [18] and goes beyond convolution and recurrence by replacing them with attention layers alone. Finally, we discuss how much further this line of work can be pushed, and what comes next.

2 From the late 1980s to early 2010s: the attention awakens

The first attempts at adapting attention mechanisms to neural networks go back to the late 1980s. One of the early studies is the improved version of the Neocognitron [20] with selective attention [21]. This model was later modified to recognize and segment connected characters in cursive handwriting [22]. Another study describes VISIT, a novel model that focuses on its relationship to a number of visual areas of the brain [5]. Also, a novel architecture named the Signal Channelling Attentional Network (SCAN) is presented for attentional scanning [23].

Early work on applying the attention idea in neural networks covers a variety of tasks such as target detection [24]. In another study, a visual attention system extracts regions of interest by combining bottom-up and top-down information from the image [25]. A recognition model based on selective attention, which analyses only a small part of the image at each step and combines the results over time, is described in [4]. In addition, a model based on the concept of selective tuning is proposed [26]. Over the following years, several studies using the attention idea in different ways were presented for visual perception and recognition [27,28,29,30].

In the 2000s, work on making attention mechanisms more useful for neural networks continued. In the early years of the decade, a model that integrates an attentional orienting ('where') pathway and an object recognition ('what') pathway is presented [31]. A computational model of human eye movements is proposed for an object class detection task [32]. A serial model for visual pattern recognition, combining Markov models and neural networks with selective attention, is presented for the handwritten digit recognition and face recognition problems [33]. In that study, a neural network analyses image parts and generates posterior probabilities as observations for the Markov model. The attention idea is also used for object recognition [34] and for the analysis of a scene [35]. An interesting study proposes learning sequential attention in real-world visual object recognition using a Q-learner [36]. Moreover, a computational model of visual selective attention is described that automatically detects the most relevant parts of a color picture displayed on a television screen [37]. The attention idea is also used for identifying and tracking objects in multi-resolution digital video of partially cluttered environments [38].

In 2010, the first implemented system inspired by the fovea of the human retina was presented for image classification [39]. This system jointly trains a restricted Boltzmann machine (RBM) and an attentional component called the fixation controller. Similarly, a novel attentional model driven by gaze data is implemented for simultaneous object tracking and recognition [40]. Taking advantage of reinforcement learning, a novel recurrent neural network (RNN) is described for image classification [41]. The Deep Attention Selective Network (DasNet), a deep neural network with feedback connections that are learned through reinforcement learning to direct selective attention to certain features extracted from images, is also presented [42]. Additionally, a deep learning-based framework using attention has been proposed for generative modeling [43].

3 2015: the rise of attention

It can be said that 2015 was the golden year of attention mechanisms, because the number of attention studies grew like an avalanche after three key studies presented in that year. The first proposed a novel approach for neural machine translation (NMT) [44]. As is well known, most NMT models belong to a family of encoder-decoders [45, 46], with an encoder and a decoder for each language. However, compressing all the necessary information of a source sentence into a fixed-length vector is an important disadvantage of this encoder-decoder approach: it usually makes it difficult for the neural network to capture all the semantic details of a very long sentence [1].

Fig. 1 The extension to the conventional NMT models that is proposed by [44]. It generates the t-th target word \(y_t\) given a source sentence \((x_1, x_2, ..., x_T)\)

The idea introduced in [44] is an extension of the conventional NMT models. This extension is composed of an encoder and a decoder, as shown in Fig. 1. The first part, the encoder, is a bidirectional RNN (BiRNN) [47] that takes word vectors as input. The forward and backward states of the BiRNN are computed. Then, an annotation \(a_j\) for each word \(x_j\) is obtained by concatenating these forward and backward hidden states. Thus, the encoder maps the input sentence to a sequence of annotations \((a_1,...,a_{T_x})\). By using a BiRNN rather than a conventional RNN, the annotation of each word can summarize both the preceding and the following words. Moreover, the annotation \(a_j\) tends to focus on the words around \(x_j\), because of the inherent tendency of RNNs to represent recent inputs better.

In the decoder, a weight \(\alpha _{ij}\) for each annotation \(a_j\) is obtained from its associated energy \(e_{ij}\), which is computed by a feedforward neural network f as in Eq. (1). This neural network f is defined as an alignment model that can be jointly trained with the proposed architecture. In order to reduce the computational burden, a multilayer perceptron (MLP) with a single hidden layer is used as f. This alignment model captures the relation between the inputs around position j and the output at position i. In this way, the decoder applies an attention mechanism. As seen in Eq. (2), \(\alpha _{ij}\) is the output of a softmax function:

$$\begin{aligned}&e_{ij} = f(h_{i-1},a_j) \end{aligned}$$
(1)
$$\begin{aligned}&\alpha _{ij} = \frac{\exp (e_{ij})}{\sum _{k=1}^{T_x}\exp (e_{ik})} \end{aligned}$$
(2)

Here, the probability \(\alpha _{ij}\) determines the importance of annotation \(a_j\) with respect to the previous hidden state \(h_{i-1}\). Finally, the context vector \(c_i\) is computed as a weighted sum of these annotations as follows [44]:

$$\begin{aligned} c_i = \sum _{j=1}^{T_x} \alpha _{ij} a_j \end{aligned}$$
(3)

Based on the decoder state, the context vector and the last generated word, the target word \(y_t\) is predicted. In order to generate a word in the translation, the model searches for the most relevant information in the source sentence to concentrate on. When it finds the appropriate source positions, it makes the prediction. In this way, the input sentence is encoded into a sequence of vectors, and the decoder adaptively selects the subset of these vectors that is relevant to predicting the target [44]. Thus, it is no longer necessary to compress all the information of a source sentence into a fixed-length vector.
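
To make the computation in Eqs. (1)-(3) concrete, the following minimal NumPy sketch (illustrative only, not code from [44]) performs one decoder step of this additive attention. The matrices `W_h`, `W_a` and the vector `v` are random stand-ins for the parameters of the single-hidden-layer alignment model f, which are learned jointly with the rest of the network in practice.

```python
import numpy as np

def softmax(x):
    x = x - x.max()                      # for numerical stability
    e = np.exp(x)
    return e / e.sum()

def additive_attention(h_prev, annotations, W_h, W_a, v):
    """One decoder step of the attention in [44] (Eqs. 1-3).

    h_prev      : previous decoder hidden state, shape (n,)
    annotations : encoder annotations a_1..a_Tx, shape (Tx, 2n)
    W_h, W_a, v : stand-ins for the learned alignment model f
    """
    # Eq. (1): energies e_ij from an MLP with a single hidden layer
    energies = np.array([v @ np.tanh(W_h @ h_prev + W_a @ a_j)
                         for a_j in annotations])
    alpha = softmax(energies)            # Eq. (2): attention weights
    context = alpha @ annotations        # Eq. (3): weighted sum of annotations
    return context, alpha

# Toy example: 5 source words, decoder state size 4, annotation size 8
rng = np.random.default_rng(0)
h_prev = rng.normal(size=4)
annotations = rng.normal(size=(5, 8))
W_h, W_a, v = rng.normal(size=(6, 4)), rng.normal(size=(6, 8)), rng.normal(size=6)
context, alpha = additive_attention(h_prev, annotations, W_h, W_a, v)
print(alpha.round(3), context.shape)     # weights sum to 1; context has shape (8,)
```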

The second study introduces the first visual attention model for image captioning [48]. Different from the previous study [44], it uses a deep convolutional neural network (CNN) as the encoder. This architecture is an extension of the network in [49], which encodes an image into a compact representation, followed by an RNN that generates a corresponding sentence. Here, the annotation vectors \(a_i \in R^D\) are extracted from a lower convolutional layer, each being a D-dimensional representation corresponding to a part of the image. Thus, the decoder selectively focuses on certain parts of the image by weighting a subset of all the feature vectors [48]. This extended architecture uses attention so that salient features dynamically come to the forefront, instead of compressing the entire image into a static representation.

The context vector \(c_t\) represents the relevant part of the input image at time t. The weight \(\alpha _i\) of each annotation vector is computed similarly to Eq. (2), whereas its associated energy is computed similarly to Eq. (1), using an MLP conditioned on the previous hidden state \(h_{t-1}\). The remarkable point of this study is a new mechanism \(\phi\) that computes \(c_t\) from the annotation vectors \(a_i\) corresponding to the features extracted at different image locations:

$$\begin{aligned} c_t = \phi (\big \{a_i\big \},\big \{\alpha _i\big \}) \end{aligned}$$
(4)

The definition of the \(\phi\) function gives rise to two variants of the attention mechanism: the hard (stochastic) attention mechanism is trainable by maximizing an approximate variational lower bound or, equivalently, by REINFORCE [50], while the soft (deterministic) attention mechanism is trainable by standard backpropagation methods. The hard attention defines a location variable \(s_t\) and uses it to decide where to focus attention when generating the t-th word. When hard attention is applied, the attention locations are treated as intermediate latent variables: a multinoulli distribution parametrized by \({\alpha _i}\) is assigned, and \(c_t\) becomes a random variable. Here, \(s_{t,i}\) is defined as a one-hot variable which is set to 1 if the i-th location is used to extract visual features [48]:

$$\begin{aligned}&p(s_{t,i} = 1 | s_{j<t}, a) = \alpha _{t,i} \end{aligned}$$
(5)
$$\begin{aligned}&\quad c_t = \sum _i s_{t,i} a_i \end{aligned}$$
(6)

Whereas learning hard attention requires sampling the attention location \(s_t\) each time, the soft attention mechanism computes a weighted annotation vector, as in [44], and takes the expectation of the context vector \(c_t\) directly:

$$\begin{aligned} E_{p(s_t|\alpha )}[c_t] = \sum _{i=1}^L \alpha _{t,i} a_i \end{aligned}$$
(7)

Furthermore, for training the deterministic version of the model, an alternative method, namely doubly stochastic attention, is proposed: an additional constraint is added to the training objective to encourage the model to pay equal attention to every part of the image.
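
The difference between the two variants can be illustrated with a short sketch (not code from [48]); it assumes the attention weights \(\alpha _{t,i}\) have already been computed, samples one location from the multinoulli distribution for the hard case (Eqs. 5-6), and takes the expectation for the soft case (Eq. 7).

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 6, 8                            # L image locations, D-dimensional annotations
a = rng.normal(size=(L, D))            # annotation vectors a_i
alpha = rng.dirichlet(np.ones(L))      # attention weights for this time step (sum to 1)

# Hard (stochastic) attention, Eqs. (5)-(6): sample a single location s_t
s_t = rng.choice(L, p=alpha)           # multinoulli sample
c_hard = a[s_t]                        # context = annotation at the sampled location

# Soft (deterministic) attention, Eq. (7): expectation over all locations
c_soft = alpha @ a                     # weighted sum of all annotations

print(s_t, c_hard.shape, c_soft.shape)
```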

The third study that should be emphasized presents two classes of attention mechanisms for NMT: global attention, which always attends to all source words, and local attention, which only looks at a subset of source words at a time [51]. These mechanisms derive the context vector \(c_t\) in different ways: whereas global attention considers all the hidden states of the encoder, the local one selectively focuses on a small window of context. In global attention, a variable-length alignment vector is derived similarly to Eq. (2). Here, the current target hidden state \(h_t\) is compared with each source hidden state \({\bar{h}}_s\) by using a score function instead of the associated energy \(e_{ij}\); thus, an alignment vector whose size equals the number of time steps on the source side is obtained. Given the alignment vector as weights, the context vector \(c_t\) is computed as the weighted average over all the source hidden states. Here, score is referred to as a content-based function, and three different alternatives are considered [51].

On the other hand, the local attention is differentiable. First, an aligned position \(p_t\) is generated for each target word at time t. Then, a window centered around the source position \(p_t\) is used to compute the context vector as a weighted average of the source hidden states within the window. In this way, the local attention selectively focuses on a small window of context, and obtains the alignment vector from the current target state \(h_t\) and the source states \({\bar{h}}_s\) in the window [51].
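
As an illustration (a sketch under simplifying assumptions, not code from [51]), global attention with the simplest 'dot' score function can be written as below; the 'general' and 'concat' scores only change how the score is computed, and the local variant would additionally restrict \({\bar{h}}_s\) to the window around the predicted position \(p_t\).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(h_t, h_bar):
    """Global attention with the 'dot' score of [51].
    h_t   : current target hidden state, shape (n,)
    h_bar : all source hidden states, shape (S, n)"""
    scores = h_bar @ h_t               # score(h_t, h_bar_s) = h_t . h_bar_s for every source s
    align = softmax(scores)            # alignment vector over the S source positions
    c_t = align @ h_bar                # context = weighted average of the source states
    return c_t, align

rng = np.random.default_rng(0)
h_t, h_bar = rng.normal(size=4), rng.normal(size=(7, 4))
c_t, align = global_attention(h_t, h_bar)
print(align.round(3), c_t.shape)       # 7 alignment weights; context of shape (4,)
```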

The introduction of these novel mechanisms in 2015 triggered the rise of attention for neural networks, and significant research has since been conducted on a variety of tasks. To better visualize the attention idea in neural networks, two examples are shown in Fig. 2. The top row shows a neural image caption generation task that implements an attention mechanism [48]. The second example shows how attention mechanisms can be used for visual question answering [52]. Both examples demonstrate how attention mechanisms focus on parts of the input images.

Fig. 2 Examples of the attention mechanism in the visual domain. (Top) Attending to the correct object in neural image caption generation [48]. (Bottom) Visualization of original image and question pairs, and co-attention maps, namely word-level, phrase-level and question-level, respectively [52]

4 2015-2016: attack of the attention

In the two years from 2015, attention mechanisms were used for different tasks, and novel neural network architectures applying these mechanisms were presented. After memory networks [53], which require a supervision signal instructing them how to use their memory cells, the introduction of the neural Turing machine [54] allowed end-to-end training without this supervision signal, via the use of a content-based soft attention mechanism [1]. Then, the end-to-end memory network [55], a form of memory network based on a recurrent attention mechanism, was proposed.

In these years, an attention mechanism called self-attention (sometimes called intra-attention) was successfully implemented within a neural network architecture, namely the Long Short-Term Memory-Network (LSTMN) [56]. It modifies the standard LSTM structure by replacing the memory cell with a memory network [53], because memory networks have a set of key vectors and a set of value vectors, whereas LSTMs maintain only a hidden vector and a memory vector [56]. In contrast to the attention idea in [44], memory and attention are added within the sequence encoder in LSTMN. Self-attention is described as relating different positions of a sequence in order to compute a representation of that sequence [19]. One of the first applications of self-attention was to natural language inference [57].

Many attention-based models have been proposed for neural image captioning [58], abstractive sentence summarization [59], speech recognition [60, 61], automatic video captioning [62], neural machine translation [63], and recognizing textual entailment [64]. Different attention-based models perform visual question answering [65,66,67]. An attention-based CNN is presented for modeling sentence pairs [68]. A recurrent soft-attention-based model learns to focus selectively on parts of the video frames and classifies videos [69].

In parallel, several neural network architectures have been presented for a variety of tasks. For instance, the Stacked Attention Network (SAN) is described for image question answering [70]. The Deep Attention Recurrent Q-Network (DARQN) integrates soft and hard attention mechanisms into the structure of the Deep Q-Network (DQN) [71]. The Wake-Sleep Recurrent Attention Model (WS-RAM) speeds up training for image classification and caption generation tasks [72]. The alignDRAW model, an extension of the Deep Recurrent Attention Writer (DRAW) [73], is a generative model producing images from captions using a soft attention mechanism [74]. The Generative Adversarial What-Where Network (GAWWN) synthesizes images given instructions describing what content to draw in which location [75].

5 The Transformer: return of the attention

After the attention mechanisms proposed in 2015, researchers published studies that mostly modified them or applied them to different tasks. However, in 2017, a novel neural network architecture based entirely on self-attention, namely the Transformer, was presented [19]. The Transformer achieved great results on two machine translation tasks, in addition to English constituency parsing. The most impressive point about this architecture is that it contains neither recurrence nor convolution: the Transformer performs well by replacing the conventional recurrent layers of the encoder-decoder architecture used for NMT with self-attention.

The Transformer is composed of encoder and decoder stacks, each of which consists of six identical layers. In Fig. 3, one encoder-decoder stack is shown to illustrate the model [19]. Each stack includes only attention mechanisms and feedforward neural networks. As this architecture does not include any recurrent or convolutional layer, information about the relative or absolute positions in the input sequence is supplied at the input of both the encoder and the decoder using positional encodings.

Fig. 3 The Transformer architecture and the attention mechanisms it uses in detail [19]. (Left) The Transformer with one encoder-decoder stack. (Center) Multi-head attention. (Right) Scaled dot-product attention

The self-attention calculations differ slightly from the mechanisms described so far in this paper. For each word, three vectors are used, namely the query, the key and the value. These vectors are computed by multiplying the input with weight matrices \(W_q\), \(W_k\) and \(W_v\), which are learned during training. In general, each value is weighted by a function of the query with the corresponding key, and the output is computed as a weighted sum of the values. Based on this idea, two attention mechanisms are proposed. In the first one, called scaled dot-product attention, the dot products of the query with all keys are computed, as shown on the right side of Fig. 3. Each result is divided by the square root of the dimension of the keys to obtain more stable gradients. The results then pass through the softmax function, which yields the weights for the values, and each softmax score is finally multiplied with the corresponding value, as given in Eq. (8). The authors propose computing the attention on a set of queries simultaneously, taking queries and keys of dimension \(d_k\) and values of dimension \(d_v\) as inputs. The keys, queries and values are packed together into matrices K, Q and V, and the output matrix is obtained as follows [19]:

$$\begin{aligned} Attention(Q,K,V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V \end{aligned}$$
(8)

This calculation is performed for every word against all the other words, so each word is valued relative to the others. For instance, if the word \(x_2\) is not relevant to the word \(x_1\), the softmax assigns it a low probability and its value vector contributes little; the contributions of relevant words increase while those of the others decrease. In the end, every word obtains a new representation of itself.
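
Eq. (8) itself is only a few lines of code. The following NumPy sketch is purely illustrative: the projection matrices standing in for \(W_q\), \(W_k\) and \(W_v\) are random here, whereas in the Transformer they are learned.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. (8): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights                      # weighted sum of the values

rng = np.random.default_rng(0)
T, d_model, d_k, d_v = 5, 16, 8, 8                   # a toy sequence of 5 tokens
X = rng.normal(size=(T, d_model))                    # token representations
W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for d in (d_k, d_k, d_v))
Q, K, V = X @ W_q, X @ W_k, X @ W_v                  # learned projections in practice
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)                      # (5, 8) and (5, 5)
```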

As seen in Fig. 3, the Transformer does not use scaled dot-product attention directly, but the attention mechanism it does use is based on these calculations. The second proposed mechanism, called multi-head attention, linearly projects the queries, keys and values h times with different, learned linear projections to \(d_q\), \(d_k\) and \(d_v\) dimensions, respectively [19]. The attention function is then performed in parallel on each of these projected versions of the queries, keys and values, i.e., the heads, producing \(d_v\)-dimensional output values. To obtain the final values, the head outputs are concatenated and projected one last time, as shown in the center of Fig. 3. In this way, self-attention is calculated multiple times using different sets of query, key and value vectors, so the model can jointly attend to information at different positions [19]:

$$\begin{aligned} MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O \\ \nonumber where \; head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) \end{aligned}$$
(9)
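
A sketch of Eq. (9), again with random matrices standing in for the learned per-head projections and \(W^O\), may help to see how the heads are computed in parallel and then recombined; it is illustrative rather than an actual Transformer implementation.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention of Eq. (8)."""
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(X, heads, d_k, d_v, rng):
    """Eq. (9): project X to per-head Q, K, V, attend, concatenate, project with W^O."""
    d_model = X.shape[-1]
    head_outputs = []
    for _ in range(heads):
        W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for d in (d_k, d_k, d_v))
        head_outputs.append(attention(X @ W_q, X @ W_k, X @ W_v))   # head_i
    W_o = rng.normal(size=(heads * d_v, d_model))
    return np.concatenate(head_outputs, axis=-1) @ W_o              # Concat(head_1..head_h) W^O

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 512))                        # 5 tokens, d_model = 512
out = multi_head_attention(X, heads=8, d_k=64, d_v=64, rng=rng)
print(out.shape)                                     # (5, 512)
```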

In the decoder part of the Transformer, masked multi-head attention is applied first to ensure that only the previous word embeddings are used when predicting the next word in the sentence. Accordingly, the positions that should not be seen by the decoder are masked out by setting their attention scores to \(-\infty\) before the softmax, so that they receive zero weight.
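
In code, this is commonly realized by adding a large negative value to the scores of the future positions before the softmax; the sketch below is illustrative of this standard practice rather than the reference implementation.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Masked (causal) scaled dot-product attention: position i may only
    attend to positions j <= i."""
    T, d_k = Q.shape[0], K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -1e9, scores)              # block attention to future tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
out, w = causal_self_attention(Q, K, V)
print(np.round(w, 2))     # the upper triangle of the weight matrix is (numerically) zero
```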

An interesting study examines the contribution made by individual attention heads in the encoder [76]. There is also an evaluation of the effects of self-attention on gradient propagation in recurrent networks [77]. For a deeper analysis of the multi-head self-attention mechanism from a theoretical perspective, see [78].

Self-attention has been used successfully in a variety of tasks including sentence embedding [79] and abstractive summarization [80]. It has been shown that self-attention can lead to improvements in discriminative constituency parsing [81] and in speech recognition as well [82, 83]. Also, the listen-attend-spell model [84] has been improved with self-attention for acoustic modeling [85].

Soon after these self-attention mechanisms were proposed, they were incorporated into deep neural networks for a wide range of tasks. For instance, a deep learning model learned a number of large-scale tasks from multiple domains with the aid of a self-attention mechanism [86]. Novel self-attention neural models have been proposed for cross-target stance classification [87] and NMT [88]. Another study shows that a fully self-attentional model can reach competitive predictive performance on ImageNet classification and COCO object detection tasks [89]. In addition, novel attention mechanisms continue to be developed, such as area attention, a mechanism that can be used alongside multi-head attention [90]. It attends to areas in the memory by defining the key of an area as the mean of the keys of its items, and the value of an area as the sum of all the value vectors in the area.

When a novel mechanism is proposed, it is perhaps inevitable that it will be incorporated into the GAN framework [91]. Self-Attention Generative Adversarial Networks (SAGANs) [92] introduce a self-attention mechanism into convolutional GANs. Different from traditional convolutional GANs, SAGAN generates high-resolution details using cues from all feature locations. Similarly, the Attentional Generative Adversarial Network (AttnGAN) is presented for text-to-image generation [93]. On the other hand, a machine reading and question answering architecture called QANet [94] is proposed without any recurrent networks; it uses self-attention to learn the global interaction between each pair of words, whereas convolutions capture the local structure of the text. In another study, the Gated Attention Network (GaAN) controls the importance of each attention head's output by introducing gates [95]. Another interesting study introduces attentive group convolutions with a generalization of visual self-attention [96]. A deep Transformer model has also been implemented for language modeling over long sequences [97].


5.1 Self-attention variants

In recent years, self-attention has become an important research direction within the deep learning community, and the idea has been examined from different angles. For example, self-attention has been handled in a multi-instance learning framework [98]. The idea of Sparse Adaptive Connection (SAC) has been presented for accelerating and structuring self-attention [99]. Research on improving self-attention continues as well [100,101,102]. In addition, important studies that modify the self-attention mechanisms proposed in the Transformer have been presented. Some of the most recent and prominent ones are summarized below.

5.1.1 Relation-aware self-attention

It extends the self-attention mechanism by taking into account representations of the relative positions, or distances, between sequence elements [103]. Thus, it can consider the pairwise relationships between input elements. This type of attention mechanism defines vectors that represent the edge between two inputs. It learns two distinct edge representations, one added to the keys and one added to the values, which can be shared across attention heads without requiring additional linear transformations.
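
A rough single-head sketch of this idea, under our reading of [103], is given below: learned edge vectors, indexed by the clipped relative distance between positions, are added to the keys and to the values before the usual attention computation. The clipping distance `max_dist` and the random parameter matrices are illustrative assumptions.

```python
import numpy as np

def relative_self_attention(X, W_q, W_k, W_v, rel_k, rel_v, max_dist):
    """Relation-aware self-attention in the spirit of [103] (single head).
    rel_k, rel_v hold one edge vector per clipped relative distance j - i."""
    T, d = X.shape[0], W_q.shape[1]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    out = np.zeros((T, d))
    for i in range(T):
        idx = np.clip(np.arange(T) - i, -max_dist, max_dist) + max_dist  # edge index per j
        e = (Q[i] * (K + rel_k[idx])).sum(axis=-1) / np.sqrt(d)          # keys shifted by edge vectors
        a = np.exp(e - e.max()); a /= a.sum()
        out[i] = a @ (V + rel_v[idx])                                    # values shifted by edge vectors
    return out

rng = np.random.default_rng(0)
T, d_model, d, k = 6, 16, 8, 2
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for _ in range(3))
rel_k, rel_v = rng.normal(size=(2 * k + 1, d)), rng.normal(size=(2 * k + 1, d))
print(relative_self_attention(X, W_q, W_k, W_v, rel_k, rel_v, k).shape)  # (6, 8)
```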

5.1.2 Directional self-attention (DiSA)

A novel neural network architecture for learning sentence embeddings, named the Directional Self-Attention Network (DiSAN) [104], uses directional self-attention followed by a multi-dimensional attention mechanism. Instead of computing a single importance score for each word based on the word embedding, multi-dimensional attention computes a feature-wise score vector for each token. To extend this mechanism to self-attention, two variants are presented: the first, called multi-dimensional 'token2token' self-attention, generates a context-aware coding for each element; the second, called multi-dimensional 'source2token' self-attention, compresses the sequence into a vector [104]. Directional self-attention, in turn, produces context-aware representations with temporal information encoded by using positional masks; in this way, directional information is encoded. First, the input sequence is transformed into a sequence of hidden states by a fully connected layer. Then, multi-dimensional token2token self-attention is applied to these hidden states. Hence, context-aware vector representations are generated for all elements of the input sequence.

5.1.3 Reinforced self-attention (ReSA)

A sentence-encoding model named the Reinforced Self-Attention Network (ReSAN) uses reinforced self-attention (ReSA), which integrates soft and hard attention mechanisms into a single model. ReSA selects a subset of head tokens, and relates each head token to a small subset of dependent tokens to generate their context-aware representations [105]. For this purpose, a novel hard attention mechanism called reinforced sequence sampling (RSS) is proposed, which selects tokens from an input sequence in parallel and is trained via policy gradient. Given an input sequence, RSS generates an equal-length sequence of binary random variables indicating which tokens are selected and which are discarded. In turn, the soft attention provides reward signals for training the hard attention. The proposed RSS provides a sparse mask to self-attention, and ReSA uses two RSS modules to extract the sparse dependencies between each pair of selected tokens.

5.1.4 Outer product attention (OPA)

Self-Attentive Associative Memory (SAM) is a novel operator based upon outer product attention (OPA) [106]. This attention mechanism is an extension of the dot-product attention in [19]. OPA differs in that it uses element-wise multiplication, an outer product and a tanh function instead of the softmax.

5.1.5 Bidirectional block self-attention (Bi-BloSA)

Another mechanism, bidirectional block self-attention (Bi-BloSA), which is simply a masked block self-attention (mBloSA) with forward and backward masks to encode temporal order information, has also been presented [107]. Here, mBloSA is composed of three parts, from bottom to top: intra-block self-attention, inter-block self-attention and context fusion. It splits a sequence into several equal-length blocks and applies intra-block self-attention to each block independently. Then, inter-block self-attention processes the outputs of all the blocks. This stacked self-attention model requires less memory than a single self-attention applied to the whole sequence. Finally, a feature fusion gate combines the outputs of the intra-block and inter-block self-attention with the original input to produce the final context-aware representations of all tokens.

5.1.6 Fixed multi-head attention

Fixed multi-head attention proposes fixing the head size of the Transformer with the aim of improving its representational power [108]. The study emphasizes the importance of head size by setting the head size of the attention units to the input sequence length.

5.1.7 Sparse sinkhorn attention

It is based on the idea of differentiable sorting of internal representations within the self-attention module [109]. Instead of allowing tokens to attend only to tokens within the same block, it operates on block-sorted sequences: each token attends to tokens in the sorted block, so tokens that may be far apart in the unsorted sequence can still be considered. Additionally, a variant of this mechanism named SortCut sinkhorn attention applies a post-sorting truncation of the input sequence.

5.1.8 Adaptive attention span

Adaptive attention span is proposed as an alternative form of self-attention [110]. It learns the attention span of each head independently; to this end, a masking function inspired by [111] is used to control the span of each head. The purpose of this mechanism is to reduce the computational burden of the Transformer. Additionally, a dynamic attention span approach, which changes the attention span based on the current input, is presented as an extension [51, 112].
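
The following sketch illustrates the kind of soft masking function we understand [110] to build on: the span parameter z would be learned per head, R controls the softness of the ramp, and the mask multiplies the exponentiated scores before normalization. Details such as restricting attention to past positions and the exact parametrization are omitted, so this is only an approximate illustration.

```python
import numpy as np

def span_mask(distance, z, R):
    """Soft mask over token distances: 1 within the span z, ramping to 0 over R steps."""
    return np.clip((R + z - distance) / R, 0.0, 1.0)

def adaptive_span_attention(scores, z, R=4):
    """scores[t, r] is the raw attention score between positions t and r;
    the mask depends only on their distance |t - r|."""
    T = scores.shape[0]
    dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
    m = span_mask(dist, z, R)
    w = m * np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / (w.sum(axis=-1, keepdims=True) + 1e-9)    # masked, normalized weights

rng = np.random.default_rng(0)
scores = rng.normal(size=(10, 10))
print(adaptive_span_attention(scores, z=3.0).round(2))   # weights decay to 0 beyond the span
```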

5.2 Transformer variants

Apart from developing novel self-attention mechanisms, several studies have been published with the aim of improving the performance of the Transformer itself, mostly by modifying the model architecture. For instance, an additional recurrence encoder has been used to model recurrence for the Transformer directly [113]. In another study, a new weight initialization scheme is applied to improve Transformer optimization [114]. A novel positional encoding scheme is used to extend the Transformer to tree-structured data [115]. Investigating model size by adjusting Transformer width and depth for efficient training is also an active research area [116]. The Transformer has also been used in reinforcement learning settings [117,118,119] and for time-series forecasting in an adversarial training setting [120].

Besides, many Transformer variants have been presented in the recent past. COMmonsEnse Transformer (COMET) is introduced for automatic construction of commonsense knowledge bases [121]. Evolved Transformer applies neural architecture search for a better Transformer model [122]. Transformer Autoencoder is a sequential autoencoder for conditional music generation [123]. CrossTransformer takes a small number of labeled images and an unlabeled query, and computes distances between spatially-corresponding features to infer class membership [124]. DEtection TRansformer (DETR) is a new design for object detection systems [125], and Deformable DETR is an improved version that achieves better performance in less time [126]. FLOw-bAsed TransformER (FLOATER) emphasizes the importance of position encoding in the Transformer, and models the position information via a continuous dynamical model [127]. Disentangled Context (DisCo) Transformer simultaneously generates all tokens given different contexts by predicting every word in a sentence conditioned on an arbitrary subset of the rest of the words [128]. Generative Adversarial Transformer (GANsformer) is presented for visual generative modeling [129].

Recent work has demonstrated significant performance on NLP tasks. OpenAI GPT uses a left-to-right architecture, in which every token can only attend to previous tokens in the self-attention layers of the Transformer [130]. The GPT-2 [131] and GPT-3 [18] models have extended this progress. In addition to these variants, some prominent Transformer-based models are summarized below.

5.2.1 Universal transformer

A generalization of the Transformer model named the Universal Transformer [132] iteratively computes representations \(H^t\) at step t for all positions in the sequence in parallel. To this end, it uses the scaled dot-product attention in Eq. (8), where d is the number of columns of Q, K and V. The Universal Transformer uses multi-head self-attention with k heads: the representations \(H^t\) are mapped to queries, keys and values with affine projections using learned parameter matrices \(W^Q \in \Re ^{d\times d/k}\), \(W^K \in \Re ^{d\times d/k}\), \(W^V \in \Re ^{d\times d/k}\) and \(W^O \in \Re ^{d\times d}\) [132]:

$$\begin{aligned} MultiHead(H^t) = Concat(head_1,...,head_k)W^O \\ \nonumber where \; head_i = Attention(H^tW_i^Q, H^tW_i^K, H^tW_i^V) \end{aligned}$$
(10)

5.2.2 Image transformer

The Image Transformer [133] demonstrates that self-attention-based models can be well suited to images as well as text. This Transformer variant restricts the self-attention mechanism to attend to local neighborhoods, which increases the size of the images the model can process. Its larger receptive fields allow the Image Transformer to significantly improve performance on image generation as well as image super-resolution.

5.2.3 Transformer-XL

This study aims to overcome the fixed-length context of the Transformer [19] for language modeling. Transformer-XL [134] makes modeling very long-term dependencies possible by reusing the hidden states obtained for previous segments, so that information can be propagated through these recurrent connections. In order to reuse the hidden states without causing temporal confusion, Transformer-XL uses relative positional encodings. Based on this architecture, a modified version named the Gated Transformer-XL (GTrXL) has been presented for the reinforcement learning setting [135].
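
Conceptually, this reuse can be pictured as caching the previous segment's hidden states and concatenating them, with gradients stopped, to the current segment before computing keys and values, while queries come only from the current segment. The rough sketch below illustrates just this caching step; it omits the relative positional encodings and is not the actual Transformer-XL implementation.

```python
import numpy as np

def xl_attention_step(h_curr, memory, W_q, W_k, W_v):
    """Segment-level recurrence in the spirit of Transformer-XL [134]:
    keys/values see [memory; current segment], queries only the current segment."""
    h_ext = np.concatenate([memory, h_curr], axis=0)     # memory treated as constant (no gradient)
    Q, K, V = h_curr @ W_q, h_ext @ W_k, h_ext @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    new_memory = h_curr                                  # cached for the next segment
    return w @ V, new_memory

rng = np.random.default_rng(0)
d = 16
memory = rng.normal(size=(8, d))                         # hidden states of the previous segment
h_curr = rng.normal(size=(4, d))                         # current segment
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, new_memory = xl_attention_step(h_curr, memory, W_q, W_k, W_v)
print(out.shape, new_memory.shape)                       # (4, 16) and (4, 16)
```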

5.2.4 Tensorized Transformer

The Tensorized Transformer [136] compresses the multi-head attention of the Transformer. To this end, it uses a novel self-attention model, multi-linear attention, with Block-Term Tensor Decomposition (BTD) [137]. It builds single-block attention based on the Tucker decomposition [138], and then uses a multi-linear attention constructed by a BTD to compress the multi-head attention mechanism. In the Tensorized Transformer, the factor matrices are shared across multiple blocks.

5.2.5 BERT

Bidirectional Encoder Representations from Transformers (BERT) aims to pre-train deep bidirectional representations from unlabeled text [139]. BERT uses a multilayer bidirectional Transformer as the encoder. Inspired by the Cloze task [140], it has a masked language model pre-training objective: BERT randomly masks some of the tokens in the input and predicts the original vocabulary id of each masked word based only on its context. This allows pre-training a deep bidirectional Transformer in which all layers are jointly conditioned on both left and right context, which distinguishes BERT from left-to-right language model pre-training.
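
As an illustration of this masking procedure, the following toy sketch applies the selection rule described in [139]: roughly 15% of the tokens are chosen, and of those, 80% are replaced by a [MASK] token, 10% by a random token, and 10% are left unchanged. The vocabulary and sentence are made up for the example.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", select_prob=0.15, seed=0):
    """BERT-style input corruption [139]. Returns the corrupted sequence and the
    prediction targets (the original tokens at the selected positions)."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < select_prob:
            targets[i] = tok                      # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token         # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
            # else: 10%: keep the original token
    return corrupted, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(mask_tokens(sentence, vocab))
```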

Recently, the BERT model has been examined in detail. For instance, the behavior of its attention heads has been analysed [141]. Various methods have been investigated for compressing [142, 143], pruning [144] and quantizing [145] the model. BERT has also been applied to further tasks such as coreference resolution [146], and a novel method has been proposed to accelerate BERT training [147].

Furthermore, various BERT variants have been presented. ALBERT aims to increase the training speed of BERT and presents two parameter reduction techniques [148]. Similarly, PoWER-BERT [149] was developed to improve the inference time of BERT; this scheme has also been used to accelerate ALBERT. TinyBERT is proposed to accelerate inference and reduce model size while maintaining accuracy [150]. In order to obtain better representations, SpanBERT is proposed as a pre-training method [151]. As a robustly optimized BERT approach, RoBERTa shows that BERT was significantly undertrained [152]. DeBERTa improves on RoBERTa using a disentangled attention mechanism [153]. On the other hand, DistilBERT shows that it is possible to reach similar performance with much smaller language models pre-trained using knowledge distillation [154]. StructBERT proposes two novel linearization strategies [155]. Q-BERT is introduced for quantizing BERT models [156], BioBERT for biomedical text mining [157], and RareBERT for rare disease diagnosis [158].

Since 2017, when the Transformer was presented, research has generally focused on novel self-attention mechanisms, on adapting the Transformer to various tasks, or on making these models more understandable. In one of the most recent studies, NLP becomes possible in the mobile setting with the Lite Transformer, which applies long-short range attention in which some heads specialize in local context modeling while the others specialize in long-distance relationship modeling [159]. A deep and lightweight Transformer, DeLighT [160], and a hypernetwork-based model, HyperGrid Transformers [161], perform well with fewer parameters. The Graph Transformer Network is introduced for learning node representations on heterogeneous graphs [162], and further applications target molecular data [163] and textual graph representation [164]. Transformer-XH applies eXtra Hop attention for structured text data [165]. AttentionXML is a tree-based model for extreme multi-label text classification [166]. The attention mechanism has also been handled in a Bayesian framework [167]. For a better understanding of Transformers, an identifiability analysis of self-attention weights has been conducted, in addition to presenting effective attention to improve explanatory interpretations [168]. Lastly, the Vision Transformer (ViT) processes an image with a standard Transformer encoder, as used in NLP, by interpreting the image as a sequence of patches, and performs well on image classification tasks [169].

5.3 What about complexity?

All these aforementioned studies undoubtedly demonstrate significant success. But success not make one great. The Transformer also brings a very high computational complexity and memory cost: the need to store the attention matrix in order to compute the gradients with respect to the queries, keys and values leads to memory requirements that are quadratic in the sequence length, and the time complexity is quadratic as well, which makes training the Transformer slow for very long sequences. Recent studies have therefore tried to improve the Transformer in this respect. One of them is the Linear Transformer, which expresses self-attention as a linear dot-product of kernel feature maps [170]. The Linear Transformer reduces both memory and time complexity by changing the self-attention from the softmax form in Eq. (8) to a feature-map-based dot-product attention. Its performance is competitive with the vanilla Transformer architecture on image generation and automatic speech recognition tasks, while being faster at inference. On the other hand, FMMformers, which use the idea of the fast multipole method (FMM) [171], outperform the Linear Transformer by decomposing the attention matrix into near-field and far-field attention with linear time and memory complexity [172].
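
The core trick of the Linear Transformer can be illustrated as follows, using the \(\phi (x) = elu(x) + 1\) feature map proposed in [170]: once the softmax is replaced by a kernel feature map, the sums over keys and values can be computed once and reused for every query, so no \(T \times T\) attention matrix is ever formed. The sketch below is illustrative rather than the reference implementation.

```python
import numpy as np

def elu_feature_map(x):
    """phi(x) = elu(x) + 1, the feature map used in [170]; keeps all entries positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """Linearized attention: phi(Q) (phi(K)^T V) / (phi(Q) sum_j phi(K_j)).
    The (d_k x d_v) summary phi(K)^T V is built once, so the cost is linear in T."""
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)
    kv = Kf.T @ V                                    # summary of keys and values
    z = Qf @ Kf.sum(axis=0)                          # per-query normalizer
    return (Qf @ kv) / (z[:, None] + eps)

rng = np.random.default_rng(0)
T, d = 1000, 32                                      # a long sequence; no T x T matrix is formed
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)               # (1000, 32)
```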

Another suggestion made in response to the Transformer's quadratic nature is the Reformer, which replaces dot-product attention with one that uses locality-sensitive hashing [173]. It reduces the complexity, but one limitation of the Reformer is its requirement that the queries and keys be identical. The Set Transformer aims to reduce the computation time of self-attention from quadratic to linear by using an attention mechanism based on the sparse Gaussian process literature [174]. The Routing Transformer aims to reduce the overall complexity of attention by learning dynamic sparse attention patterns with routing attention based on clustering [175]. It applies k-means clustering to model sparse attention matrices: queries and keys are first assigned to clusters, the attention scheme then considers only queries and keys from the same cluster, and thus queries are routed to keys belonging to the same cluster [175].

The Sparse Transformer introduces sparse factorizations of the attention matrix by using factorized self-attention, and avoids the quadratic growth of the computational burden [176]. It also shows that modeling sequences of length one million or more with self-attention is possible in theory. In the Transformer, all the attention heads with softmax attention assign a nonzero weight to every context word. The Adaptively Sparse Transformer replaces softmax with \(\alpha\)-entmax, a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight [177]. By means of context-dependent sparsity patterns, the attention heads of the Adaptively Sparse Transformer become more flexible. Random feature attention approximates softmax attention with random feature methods [178]. Skyformer replaces the softmax with a Gaussian kernel and adapts the Nyström method [179]. A sparse attention mechanism named BIGBIRD aims to reduce the quadratic dependency of Transformer-based models to linear [180]. Different from similar studies, BIGBIRD performs well on genomics data alongside NLP tasks such as question answering.

Music Transformer [181] shows that self-attention can also be useful for modeling music. This study points out that the relative position representations introduced by [103] are infeasible for long sequences, because the intermediate relative information grows quadratically in the sequence length. Therefore, it presents an extended version of relative attention, named relative local attention, which makes relative attention practical for longer musical compositions by reducing its intermediate memory requirement to linear in the sequence length. A softmax-free Transformer (SOFT) has also been presented to improve the computational efficiency of ViT; it uses a Gaussian kernel function instead of the dot-product similarity [182].

Additionally, various approaches have been presented in the Hierarchical Visual Transformer [183], the Long-Short Transformer (Transformer-LS) [184], the Perceiver [185] and the Performer [186]. An image Transformer based on the cross-covariance matrix between keys and queries has been applied [187], and a new vision Transformer has been proposed [188]. Furthermore, a Bernoulli sampling attention mechanism decreases the quadratic complexity to linear [189]. A novel linearized attention mechanism performs well on object detection, instance segmentation and stereo depth estimation [190]. Another study shows that kernelized attention with relative positional encoding can be computed using the Fast Fourier Transform, which removes the quadratic complexity for long sequences [191]. A linear unified nested attention mechanism, namely Luna, uses two nested attention functions to approximate the softmax attention of the Transformer, achieving linear time and space complexity [192].

6 Concluding remarks: a new hope

Inspired by the human visual system, attention mechanisms in neural networks have been developing for a long time. In this study, we have traced this development from its roots up to the present time. Along the way, some mechanisms have been modified and novel mechanisms have emerged. Today, this journey has reached a very important stage: the idea of incorporating attention mechanisms into deep neural networks has led to state-of-the-art results for a large variety of tasks. Self-attention mechanisms and the GPT-n family of models have become a new hope for more advanced models. This promising progress raises the questions of whether attention can drive further development, whether it will replace the popular neural network layers, and whether an even better idea than the existing attention mechanisms will emerge. It is still an active research area and much to learn we still have, but it is obvious that more powerful systems are awaiting when neural networks and attention mechanisms join forces.