
1 Introduction

Internationalisation is a major trend in e-commerce: businesses are increasingly keen to expand to other countries, and language is an important barrier in doing so. E-retailers struggle to efficiently translate their product descriptions and websites into a variety of languages; currently, this is still largely done manually. Yet consumers prefer to read product descriptions in their native language in order to fully understand the product specifications and to compare products.

Neural machine translation (NMT) is an approach to machine translation which uses an artificial neural network to predict a sequence of words in the target language given a sequence of words in the source language. In multimodal neural machine translation (MNMT), the source sequence is paired with an image and the target sequence is generated with the aid of the information in that image. The fashion e-commerce domain, where product descriptions refer to fine-grained product attributes shown somewhere in the image (e.g., V-neck, floral print), is a challenging but interesting domain for MNMT, as it requires efficient integration of visual and textual information. State-of-the-art NMT systems are sequence-to-sequence networks with an attention-based encoder-decoder architecture. The encoder encodes each source word with a vector representation which captures the word's semantics. At each timestep, the decoder outputs the most likely target word by looking at the source word representations and the target words generated in previous timesteps. In this work, we propose an MNMT model which jointly learns to align semantically related source words, target words and image regions, and to translate. Hence, it infers a multimodal, multilingual space in which a source word, target word and image region that refer to the same fashion attribute have vector representations that are close together. This way the source word representations become visually contextualised, or visually grounded, which informs the decoder about the visual context in an efficient way.

The main contributions of our paper are:

  • We infer a multimodal, multilingual space in which we embed an image region, source word and target word that refer to the same fashion attribute close together. In this space, they are aligned through an attention-based alignment model which uses cosine similarity to measure semantic relatedness. Next, the decoder attends to the inferred visually grounded representations of our source words.

  • We propose a new, natural setting for multimodal translation, namely fashion e-commerce, which is challenging because of its references to fine-grained fashion attributes and the limited amount of training data.

  • We show state-of-the-art multimodal translation results on a real-world fashion e-commerce dataset.

The remainder of this paper is structured as follows. In Sect. 2 we review work related to this paper. Next, we present our model architecture in Sect. 3. In Sect. 4 we describe our experimental setup. The results of the conducted experiments can be found in Sect. 5. Finally, we present our conclusions and provide directions for future work in Sect. 6.

2 Related Work

Unimodal machine translation models are trained with pairs of sentences, where the target language sentence is the translation of the source language sentence. Currently, neural machine translation is the most popular and successful technique, typically implemented as sequence-to-sequence networks with an attention-based encoder-decoder architecture. [2] were the first to introduce an attention mechanism in the decoder; the intuition behind it is to compute the expected alignment of every source word with the next target word while jointly learning to translate. The pure text-based model of [2] serves as our unimodal neural machine translation (UNMT) baseline.

There is currently considerable interest in MNMT, and more specifically in using additional visual information to aid the translation [3,4,5, 7, 9, 19]. Although these works achieve promising results, they indicate that further exploration of how to best benefit from the visual context is needed. One approach in MNMT is to use a double attention mechanism, one over the source words and another over different regions of the image [3, 5]. However, this approach neglects to exploit the semantic relatedness between the image regions, source words and target words, which is an important indicator for the relevance of the visual information. Our approach uses an additional alignment model to align the image regions, source words and target words and so infer visually grounded source word representations. This is different from [9], who project the visual features to the space of source word embeddings and append these visual words to the head/tail of the source sentence. The encoder then encodes both these visual words and the source words. In contrast to our work, they do not use an alignment model to infer their multimodal space and do not attempt to include the target language in this space. Most closely related to our work is the work of [19], where the visual context is grounded into the encoder through the joint learning of a multimodal space and of a translation model. More precisely, they embed images close to their attended source sentence representations in a multimodal shared space. Additionally, they initialise the decoder hidden state in such a way that the source words most closely related to the visual context have more influence during decoding. In contrast, we do not embed full images and sentences in our shared space, but instead work at a finer level to find the latent alignment of image regions and words, which proves to be valuable especially for fashion data. Moreover, we also include the target language to obtain a space which is both multimodal and multilingual. [19] report the state-of-the-art results for MNMT and therefore we use their model as our MNMT baseline.

In order to find the semantic correspondences between the image regions, source words and target words, we make use of an alignment model. Alignment models have already proven useful for other tasks that require joint reasoning over vision and language, such as image captioning [10], visual question answering [1, 17], multimodal search [11] and image-text matching [12, 18].

Neural networks and deep learning models have become an essential item in the toolbox of fashion-related businesses (e.g., in apparel recognition, fashion search, product recommendation and outfit combination). Closer to this work is the work of [13] who generate persuasive textual descriptions of fashion items given a number of key terms that describe the item in order to encourage an online buyer towards a successful purchase. However, their neural architecture ignores the image when generating the persuasive descriptions. The neural architecture proposed in this paper could expand the work of [13] in multimodal and multilingual settings.

3 Methodology

First, we describe the baseline models for UNMT and MNMT in Sects. 3.1 and 3.2, respectively. Next, we elaborate our proposed MNMT architecture, which aligns the image regions, source words and target words with stacked cross-attention, in Sect. 3.3. In all formulas, matrices are written with capital letters and vectors are bolded. We use the letters \( W \) and \( \mathbf{b} \) to refer to the weights and bias, respectively, in linear and non-linear transformations.

During the training phase, all models learn from a training set of examples of paired descriptions in source and target language. The MNMT models also have access to a corresponding image. During the testing phase, the models only have access to the source sentence and image.

3.1 UNMT Baseline

In UNMT the goal is to translate a source sentence \( X = (x_{1}, x_{2}, \ldots, x_{M}) \) consisting of \( M \) words into the correct target sentence \( Y = (y_{1}, y_{2}, \ldots, y_{N}) \) consisting of \( N \) words. Our UNMT baseline is the attention-based encoder-decoder architecture of [2]. For more details, the reader is referred to [2].

3.2 MNMT Baseline

In MNMT, a source sentence \( X = (x_{1}, x_{2}, \ldots, x_{M}) \) is translated into a target sentence \( Y = (y_{1}, y_{2}, \ldots, y_{N}) \) aided by the visual information in the image \( I \) paired with source sentence \( X \). Our MNMT baseline is the model of [19]. The model obtains visually grounded source word representations by sharing the encoder between the translation task and a multimodal space inference task.

Encoder.

The encoder is a bidirectional recurrent neural network (BRNN) [15] with gated recurrent units (GRUs) [6]. It produces a source word representation \( \mathbf{s}_{j} \in \mathbb{R}^{2d_{x}} \) for each word \( x_{j} \) of source sentence \( X \) by concatenating the forward and backward hidden states.
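
The following is a minimal PyTorch sketch of such a bidirectional GRU encoder; the paper does not state its implementation framework, and all names and sizes (e.g., \( d_x = 256 \) from Sect. 4.3) are illustrative.

```python
# Minimal sketch of the BRNN encoder with GRUs (PyTorch is an assumption; the
# paper does not name its framework). Each source word x_j is embedded and the
# forward and backward hidden states are concatenated into s_j in R^{2*d_x}.
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    def __init__(self, vocab_size, d_x=256):          # d_x = 256 as in Sect. 4.3
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_x)
        self.birnn = nn.GRU(d_x, d_x, batch_first=True, bidirectional=True)

    def forward(self, src_tokens):                     # src_tokens: (batch, M)
        emb = self.embed(src_tokens)                   # (batch, M, d_x)
        s, _ = self.birnn(emb)                         # (batch, M, 2*d_x)
        return s

# Toy usage: a batch of 2 sentences of length 7 over a hypothetical vocabulary.
enc = BiGRUEncoder(vocab_size=1000)
s_j = enc(torch.randint(0, 1000, (2, 7)))              # -> torch.Size([2, 7, 512])
```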

Shared Space Inference Task.

The objective is to infer a shared space for images and source sentences which captures the semantic meaning across the two modalities. Each image is represented with vector \( \mathbf{v} \in \mathbb{R}^{2048} \) obtained from the pool5 layer of the convolutional neural network ResNet50 [8] pre-trained on ImageNet [14]. The representation of the source sentence \( \mathbf{s}_{att} \) is obtained by applying attention to each source word representation \( \mathbf{s}_{j} \) with image representation \( \mathbf{v} \). This produces attention scores \( z_{j} \) which measure how well the source word at position \( j \) corresponds with the image. Next, the attention scores \( z_{j} \) are normalized with the softmax function and used to weight the source words \( \mathbf{s}_{j} \). This way, the words which are more related to the image content get a higher weight in the generated source sentence representation:

$$ z_{j} = \tanh(W_{s} \mathbf{s}_{j}) \cdot \tanh(W_{v} \mathbf{v}) $$
(1)
$$ \mathbf{s}_{att} = \sum_{j=1}^{M} \beta_{j} \mathbf{s}_{j}, \quad \text{with } \beta_{j} = \mathrm{softmax}([z_{1}, z_{2}, \ldots, z_{M}])_{j} $$
(2)
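
The attention of Eqs. 1-2 can be sketched as follows in PyTorch; the projection dimension and the absence of bias terms are assumptions, as the paper does not specify them, and all names are illustrative.

```python
# Sketch of Eqs. (1)-(2). W_s and W_v are learned projection matrices; the
# projection size d_att is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_s, d_img, d_att = 512, 2048, 512       # 2*d_x, pool5 size, assumed projection size
W_s = nn.Linear(d_s, d_att, bias=False)
W_v = nn.Linear(d_img, d_att, bias=False)

def attend_source(s, v):
    # s: (M, d_s) source word representations, v: (d_img,) pool5 image vector
    z = (torch.tanh(W_s(s)) * torch.tanh(W_v(v))).sum(dim=1)   # Eq. (1): dot products z_j
    beta = F.softmax(z, dim=0)                                  # Eq. (2): normalised scores
    s_att = (beta.unsqueeze(1) * s).sum(dim=0)                  # weighted sum -> (d_s,)
    return s_att, beta

s_att, beta = attend_source(torch.randn(7, d_s), torch.randn(d_img))
```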

Next, image \( \mathbf{v} \) and source sentence \( \mathbf{s}_{att} \) are projected to their representations \( \hat{\mathbf{v}} \) and \( \hat{\mathbf{s}} \) in the multimodal space:

$$ \hat{\mathbf{v}} = \tanh(W_{v\,emb} \mathbf{v} + \mathbf{b}_{v\,emb}) $$
(3)
$$ \hat{\mathbf{s}} = \tanh(W_{s\,emb} \mathbf{s}_{att} + \mathbf{b}_{s\,emb}) $$
(4)

with \( \hat{\mathbf{v}}, \hat{\mathbf{s}} \in \mathbb{R}^{d} \). The projection to the multimodal space is learned by minimizing a triplet loss which enforces that a corresponding image-sentence pair should be closer than a non-corresponding pair:

$$ \begin{aligned} \mathcal{L}_{triplet1} & = \sum_{e}^{E} \sum_{e' \ne e}^{E} \max\left(0,\, m - f(\hat{\mathbf{v}}_{e}, \hat{\mathbf{s}}_{e}) + f(\hat{\mathbf{v}}_{e}, \hat{\mathbf{s}}_{e'})\right) \\ & \quad + \sum_{e}^{E} \sum_{e' \ne e}^{E} \max\left(0,\, m - f(\hat{\mathbf{v}}_{e}, \hat{\mathbf{s}}_{e}) + f(\hat{\mathbf{v}}_{e'}, \hat{\mathbf{s}}_{e})\right) \end{aligned} $$
(5)

where index \( e \) ranges over the number of training examples and \( m \) is the margin. In the multimodal space, cosine similarity \( f(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^{T} \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert} \) measures semantic relatedness.
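
A possible mini-batch implementation of the triplet loss in Eq. 5 is sketched below; treating the other examples in the batch as the non-corresponding pairs is an assumption, since the paper only states that \( e' \) ranges over non-matching examples.

```python
# Sketch of Eq. (5) over a mini-batch (PyTorch), assuming the other in-batch
# examples act as the non-corresponding pairs.
import torch
import torch.nn.functional as F

def cosine_matrix(x, y):
    # pairwise cosine similarities f(x_i, y_j) between rows of x and rows of y
    return F.normalize(x, dim=1) @ F.normalize(y, dim=1).t()

def triplet1(v_hat, s_hat, margin=0.1):                 # m = 0.1 as in Sect. 4.3
    # v_hat, s_hat: (E, d) projected images and attended sentences (Eqs. 3-4)
    sim = cosine_matrix(v_hat, s_hat)                   # diagonal = matching pairs
    pos = sim.diag().unsqueeze(1)                       # f(v_e, s_e)
    mask = 1.0 - torch.eye(sim.size(0))                 # exclude the matching pair
    cost_s = (margin - pos + sim).clamp(min=0) * mask       # negative sentences s_{e'}
    cost_v = (margin - pos.t() + sim).clamp(min=0) * mask   # negative images v_{e'}
    return cost_s.sum() + cost_v.sum()

loss = triplet1(torch.randn(8, 512), torch.randn(8, 512))
```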

Translation Task.

The visually grounded source word representations \( \mathbf{s}_{j} \) are used by the decoder, which is a conditional GRU [16] consisting of two stacked GRUs. At each timestep \( t \), the decoder produces the next target word \( y_{t} \) starting from the previously emitted word \( y_{t-1} \), the previous decoder hidden state \( \mathbf{h}_{t-1} \) and the source context vector \( \mathbf{c}_{t}^{att} \):

$$ \mathbf{o}_{t} = \tanh(E_{y} y_{t-1} + W_{h} \mathbf{h}_{t} + W_{c} \mathbf{c}_{t}^{att}) $$
(6)
$$ P(y_{t} \mid y_{t-1}, \mathbf{h}_{t}, \mathbf{c}_{t}^{att}) = \mathrm{softmax}(W_{out} \mathbf{o}_{t}) $$
(7)

where \( E_{y} y_{t-1} \in \mathbb{R}^{d_{y}} \) is the vector representation of the previously emitted word and the context vector \( \mathbf{c}_{t}^{att} \) is acquired by applying Bahdanau's attention [2] on the source word representations \( \mathbf{s}_{j} \) based on the decoder hidden state proposal \( \mathbf{h}_{t}' \) from the first GRU. At timestep \( t = 0 \) the decoder hidden state \( \mathbf{h}_{0} \) is initialized such that the source words most closely related to the image have a bigger influence during translation decoding. More precisely, \( \mathbf{h}_{0} \) is computed as a weighted sum of the attended source sentence representation \( \mathbf{s}_{att} \) and the mean of the source word representations \( \mathbf{s}_{j} \):

$$ \mathbf{h}_{0} = \tanh\left(W_{init}\left(\lambda \mathbf{s}_{att} + (1 - \lambda)\,\frac{1}{M} \sum_{j=1}^{M} \mathbf{s}_{j}\right)\right) $$
(8)

with weight \( \lambda \) a hyperparameter. During training, we quantify the quality of the translation with the cross entropy loss:

$$ \mathcal{L}_{cross\text{-}entropy} = - \sum_{e}^{E} \sum_{t}^{T} \mathbf{y}_{et} \cdot \log(\bar{\mathbf{y}}_{et}) $$
(9)

where indices \( e \) and \( t \) range over respectively the number of training examples and the number of timesteps, \( \mathbf{y}_{et} \) is the one-hot encoded ground truth vector for training example \( e \) at timestep \( t \), and \( \bar{\mathbf{y}}_{et} \) is the vector of predicted probabilities output by the softmax layer for training example \( e \) at timestep \( t \). Therefore, the complete loss function for the MNMT baseline is:

$$ \mathcal{L} = \alpha \mathcal{L}_{cross\text{-}entropy} + (1 - \alpha)\, \mathcal{L}_{triplet1} $$
(10)

where \( \alpha \) determines the contribution of the translation loss versus the visual grounding loss.
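
The decoder initialisation of Eq. 8 and the combined objective of Eq. 10 can be sketched as follows; \( \lambda = 0.5 \) and \( \alpha = 0.99 \) are the values reported in Sect. 4.3, while \( W_{init} \) and the other names are illustrative.

```python
# Sketch of the decoder initialisation (Eq. 8) and combined objective (Eq. 10)
# in PyTorch; W_init is a learned matrix, names and sizes are illustrative.
import torch
import torch.nn as nn

d_s, d_h = 512, 512
W_init = nn.Linear(d_s, d_h, bias=False)

def init_decoder_state(s, s_att, lam=0.5):
    # s: (M, d_s) source word representations, s_att: (d_s,) attended sentence (Eq. 2)
    mixed = lam * s_att + (1.0 - lam) * s.mean(dim=0)
    return torch.tanh(W_init(mixed))                         # h_0, Eq. (8)

def total_loss(cross_entropy, triplet, alpha=0.99):
    return alpha * cross_entropy + (1.0 - alpha) * triplet   # Eq. (10)
```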

3.3 MNMT with Alignment Model Based on Stacked Cross-Attention

Similar to the MNMT baseline, our model learns a shared space jointly with the translation task to obtain visually grounded source word representations. In this shared space, we align source words, target words and image regions which refer to the same fashion attribute. Hence, in contrast with the MNMT baseline, our space is both multimodal and multilingual and our alignment is finer, resulting in a space which captures fine-grained semantics across the visual and textual modalities. Note that the alignment at the region and word level is latent: we know which sentence corresponds with which image, but which words and image regions correspond is unknown. Therefore, we use an alignment model to learn these correspondences from frequent combinations of words and visual patterns in our training set. The alignment model is based on stacked cross-attention [12]. We will further refer to our model as the MNMT SCA model.

Encoder.

The encoder is identical to the one of the MNMT baseline in Sect. 3.2.

Shared Space Inference Task.

We obtain image regions by representing the image with the res4f features \( \mathbf{v}_{k} \in \mathbb{R}^{1024} \) \( (k = 1, \ldots, 196) \) of ResNet50 [8] pre-trained on ImageNet [14]. The image regions \( \mathbf{v}_{k} \), source words \( \mathbf{s}_{j} \) and target words \( E_{y} y_{t} \) are projected to \( \hat{\mathbf{v}}_{k} \), \( \hat{\mathbf{s}}_{j} \) and \( \hat{\mathbf{y}}_{t} \) in the multimodal, multilingual space:

$$ \hat{\mathbf{v}}_{k} = W_{vk\,emb} \mathbf{v}_{k} + \mathbf{b}_{vk\,emb} $$
(11)
$$ \hat{\mathbf{s}}_{j} = W_{sj\,emb} \mathbf{s}_{j} + \mathbf{b}_{sj\,emb} $$
(12)
$$ \hat{\mathbf{y}}_{t} = W_{yt\,emb} E_{y} y_{t} + \mathbf{b}_{yt\,emb} $$
(13)

with \( \hat{\mathbf{v}}_{k}, \hat{\mathbf{s}}_{j}, \hat{\mathbf{y}}_{t} \in \mathbb{R}^{d} \). The projections to the multimodal, multilingual space are learned by minimizing a triplet loss which enforces that corresponding image regions, source words and target words should be closer than non-corresponding ones:

$$ \mathcal{L}_{triplet2} = \frac{\ell(\hat{V}, \hat{S}) + \ell(\hat{V}, \hat{T}) + \ell(\hat{S}, \hat{T})}{3} $$
(14)
$$ \begin{aligned} \text{with}\quad \hat{V} & = \{\hat{\mathbf{v}}_{1}, \ldots, \hat{\mathbf{v}}_{196}\}, \quad \hat{S} = \{\hat{\mathbf{s}}_{1}, \ldots, \hat{\mathbf{s}}_{M}\}, \quad \hat{T} = \{\hat{\mathbf{y}}_{1}, \ldots, \hat{\mathbf{y}}_{T}\} \\ \ell(Q, K) & = \max\left(0,\, m - SCA(Q, K) + SCA(Q, K_{hard})\right) \end{aligned} $$
(15)
$$ \qquad\qquad + \max\left(0,\, m - SCA(Q, K) + SCA(Q_{hard}, K)\right) $$
(16)

where \( m \) is the margin and \( SCA(Q, K) \) is the similarity score of two sets of features \( Q \) and \( K \). Note that we use hard negative sampling here, i.e., \( Q_{hard} \) and \( K_{hard} \) are the hardest negatives for the corresponding feature sets \( (Q, K) \) and are given by \( Q_{hard} = \mathrm{argmax}_{Q' \ne Q}\, SCA(Q', K) \) and \( K_{hard} = \mathrm{argmax}_{K' \ne K}\, SCA(Q, K') \). Similarity score \( SCA(Q, K) \) of feature set \( Q = \{\mathbf{q}_{1}, \mathbf{q}_{2}, \ldots, \mathbf{q}_{Q_{tot}}\}, \mathbf{q}_{i} \in \mathbb{R}^{d} \) and feature set \( K = \{\mathbf{k}_{1}, \mathbf{k}_{2}, \ldots, \mathbf{k}_{K_{tot}}\}, \mathbf{k}_{i} \in \mathbb{R}^{d} \) is computed with stacked cross-attention. Stacked cross-attention works in two stages of attention. In the first stage, we compute the cosine similarities \( f(\mathbf{q}_{i}, \mathbf{k}_{j}) \) of all pairs of \( \mathbf{q}_{i} \) and \( \mathbf{k}_{j} \). These cosine similarities are thresholded at zero and normalized to get attention scores \( c_{ij} \) for each \( \mathbf{q}_{i} \) and \( \mathbf{k}_{j} \):

$$ c_{ij} = \frac{\max\left(0, f(\mathbf{q}_{i}, \mathbf{k}_{j})\right)}{\sqrt{\sum_{i=1}^{Q_{tot}} \max\left(0, f(\mathbf{q}_{i}, \mathbf{k}_{j})\right)^{2}}} $$
(17)

Next, a context vector \( \mathbf{c}_{i}^{att} \) is computed for each \( \mathbf{q}_{i} \) as a weighted combination of the \( \mathbf{k}_{j} \):

$$ \mathbf{c}_{i}^{att} = \sum_{j=1}^{K_{tot}} \gamma_{ij} \mathbf{k}_{j}, \quad \text{with } \gamma_{ij} = \mathrm{softmax}\left([\eta c_{i1}, \eta c_{i2}, \ldots, \eta c_{iK_{tot}}]\right)_{j} $$
(18)

with \( \eta \) a hyperparameter. If \( \mathbf{q}_{i} \) corresponds with some \( \mathbf{k}_{j} \), then \( \mathbf{c}_{i}^{att} \) will be highly correlated with this \( \mathbf{k}_{j} \). Otherwise, \( \mathbf{c}_{i}^{att} \) will not be correlated with any of the \( \mathbf{k}_{j} \). In the second stage, the similarity score of the two feature sets is calculated as the average cosine similarity \( f \) between feature \( \mathbf{q}_{i} \) and its context vector \( \mathbf{c}_{i}^{att} \):

$$ SCA(Q, K) = \frac{\sum_{i=1}^{Q_{tot}} f(\mathbf{q}_{i}, \mathbf{c}_{i}^{att})}{Q_{tot}} $$
(19)
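
A compact sketch of the stacked cross-attention score of Eqs. 17-19 is given below; \( \eta = 4 \) follows Sect. 4.3, while the small epsilon guard and the tensor shapes are illustrative assumptions.

```python
# Sketch of the stacked cross-attention score SCA(Q, K) of Eqs. (17)-(19),
# following [12]. PyTorch; names and shapes are illustrative.
import torch
import torch.nn.functional as F

def sca(Q, K, eta=4.0, eps=1e-8):
    # Q: (Q_tot, d) and K: (K_tot, d), two feature sets in the shared space
    sim = F.normalize(Q, dim=1) @ F.normalize(K, dim=1).t()       # f(q_i, k_j)
    sim = sim.clamp(min=0)                                        # threshold at zero
    c = sim / (sim.pow(2).sum(dim=0, keepdim=True).sqrt() + eps)  # Eq. (17)
    gamma = F.softmax(eta * c, dim=1)                             # Eq. (18): attention over K
    context = gamma @ K                                           # c_i^att for every q_i
    return F.cosine_similarity(Q, context, dim=1).mean()          # Eq. (19)

# e.g., source words (M = 7) against image regions (196) in a d = 512 shared space
score = sca(torch.randn(7, 512), torch.randn(196, 512))
```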

Translation Task.

Aligning the image regions, source words and target words in the multimodal, multilingual space ensures that the source word representations \( \hat{\mathbf{s}}_{j} \) become visually grounded. Therefore, we feed these \( \hat{\mathbf{s}}_{j} \) to the decoder (instead of the \( \mathbf{s}_{j} \)) to let the decoder benefit from the visual context. The decoder hidden state is initialized with Eq. 8 but with \( \mathbf{s}_{att} \) computed as:

$$ z_{j} = \max\left(0,\, \max_{k} f(\hat{\mathbf{s}}_{j}, \hat{\mathbf{v}}_{k})\right) $$
(20)
$$ \mathbf{s}_{att} = \sum_{j=1}^{M} \beta_{j} \hat{\mathbf{s}}_{j}, \quad \text{with } \beta_{j} = \mathrm{softmax}\left([z_{1}, z_{2}, \ldots, z_{M}]\right)_{j} $$
(21)

with \( f \) the cosine similarity. The complete loss function for our MNMT SCA model is the same as in Eq. 10, but with the triplet loss \( \mathcal{L}_{triplet2} \) of Eq. 14 instead.
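
Eqs. 20-21 can be sketched as follows (PyTorch, illustrative names): each grounded source word is scored by its best-matching image region, and the resulting weights produce the \( \mathbf{s}_{att} \) used in Eq. 8.

```python
# Sketch of Eqs. (20)-(21): each grounded source word s_hat_j is scored by its
# best-matching image region, and the scores weight the grounded words into the
# s_att fed to Eq. (8). Names and shapes are illustrative.
import torch
import torch.nn.functional as F

def grounded_s_att(s_hat, v_hat):
    # s_hat: (M, d) grounded source words, v_hat: (196, d) projected image regions
    sim = F.normalize(s_hat, dim=1) @ F.normalize(v_hat, dim=1).t()  # cosine f
    z = sim.max(dim=1).values.clamp(min=0)        # Eq. (20): best region per word
    beta = F.softmax(z, dim=0)                    # Eq. (21): normalised scores
    return (beta.unsqueeze(1) * s_hat).sum(dim=0)

s_att = grounded_s_att(torch.randn(7, 512), torch.randn(196, 512))
```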

4 Experimental Setup

4.1 Dataset

For this task we acquired a new, real-world e-commerce dataset from the company e5 mode, with product descriptions in English, French and Dutch and images of fashion products. The product descriptions describe the main features of a product, but do not provide an exhaustive description. Moreover, not all described product features are visible in the image, e.g., they might apply to the back of the product. The English and Dutch descriptions are sentence-aligned, i.e., they are exact parallel translations. The English and French descriptions are comparable, i.e., they have similar content but are not translations of each other. Each product description is associated with one image that displays the fashion product on a clear, white background. A fashion product can either be a clothing item such as a dress, blouse, pants or underwear, or a clothing accessory like a necklace, belt, scarf or tie. The dataset consists of 3082 product images with associated descriptions in the three languages. The number of products in this dataset is realistic for most e-retailers. Of these, 2460 (~80%) are used for training, 314 (~10%) for testing and 308 (~10%) for validation. The validation set is used for hyperparameter tuning during training.

4.2 Experiments and Evaluation

We train the UNMT baseline, MNMT baseline and our MNMT SCA model on the e5 fashion dataset for English→Dutch and English→French. We evaluate the translation quality of the resulting models with the BLEU score. The BLEU score has a high correlation with human judgements of translation quality and is one of the most popular metrics to evaluate translation systems. It computes the number of matching N-grams (with \( N = 1, \ldots, 4 \)) between the generated translation and the ground truth reference translation. We use beam search with a beam size of 12 for translation decoding.
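
For reference, corpus-level BLEU can be computed with, for example, the sacrebleu package; the paper does not state which BLEU implementation was used, so the snippet below and its sentences are purely illustrative.

```python
# Illustrative corpus-level BLEU with sacrebleu; the hypothesis and reference
# sentences are made up and do not come from the e5 dataset.
import sacrebleu

hypotheses = ["losjes gebreide pullover met v-hals"]          # system outputs
references = [["losjes gebreide pullover met een v-hals"]]    # one reference stream
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # matches 1- to 4-grams and applies a brevity penalty
```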

4.3 Training Details

All hyperparameters are set based on our validation set. For models trained with both the cross-entropy loss and the triplet loss, a factor \( \alpha \) of 0.99 and a margin \( m \) of 0.1 were found to work well. The dimensions \( d_{x} \) and \( d_{y} \) of the source and target word representations are set to 256. The dimension \( d \) of the shared spaces is set to 512. The hidden state of the decoder is 512-dimensional. The decoder initialization weight \( \lambda \) is set to 0.5 and the inverse temperature of the softmax function \( \eta \) to 4. We stop training if there is no improvement in BLEU score on the validation set for 10 consecutive evaluation steps.
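
For convenience, the reported hyperparameters can be gathered into a single configuration; this is only a summary of the values above, not the authors' actual configuration file.

```python
# The hyperparameters reported in Sects. 4.2-4.3, collected into one dict.
config = {
    "alpha": 0.99,           # weight of the cross-entropy loss in Eq. (10)
    "margin": 0.1,           # triplet-loss margin m
    "d_x": 256, "d_y": 256,  # source / target word representation sizes
    "d_shared": 512,         # dimension d of the shared space(s)
    "d_decoder": 512,        # decoder hidden state size
    "lambda": 0.5,           # decoder initialisation weight in Eq. (8)
    "eta": 4,                # inverse softmax temperature in Eq. (18)
    "beam_size": 12,         # beam width at decoding time (Sect. 4.2)
    "patience": 10,          # evaluation steps without BLEU improvement before stopping
}
```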

5 Results

Table 1 shows the BLEU scores obtained by all models on the e5 fashion dataset. These results indicate that our MNMT SCA model outperforms the MNMT baseline on both language pairs. Hence, a multimodal, multilingual space which aligns images and sentences at the level of regions and words is better suited for visually contextualizing the source word representations. Figure 1 compares some of the translations generated by our MNMT SCA model with those of the MNMT baseline. In the first example, the MNMT baseline incorrectly interprets "loosely" as referring to the shape of the pullover, whereas it refers to the knit. Both the MNMT baseline and our MNMT SCA model generate a wrong translation for "flattering", but while the word "rounded" (afgeronde) generated by our MNMT SCA model still applies to the neckline, the word "yellow" (geel) generated by the MNMT baseline does not.

Table 1. Translation results for English→Dutch and English→French in terms of BLEU score on the e5 fashion test set.
Fig. 1. Comparison of translations generated by the MNMT baseline and our MNMT SCA model for English→Dutch (best viewed in color).

In the second example, the MNMT baseline generates a description of a sleeveless dress (mouwloos kleedje) instead of a printed skirt. Moreover, it misidentifies the shape as being easy to combine (makkelijk te combineren) and the zipper as being light brown (lichtbruin). Our MNMT SCA model does not make these mistakes.

As also confirmed in previous works [3, 4], the MNMT models are still surpassed by the pure text-based UNMT baseline, which achieves a BLEU score of 74.38 for English→Dutch and of 48.05 for English→French. This is because the signal coming from the text in an MNMT model is stronger than the one coming from the vision side, and distilling the relevant fine-grained details from an image is a difficult task. However, even though the UNMT baseline performs better overall, we can also compare the BLEU scores of the individual test examples. Table 2 reports the percentage of test examples where the associated image helps generate a better translation. These results show that in a third of the test examples, supplying an image with the source sentence results in an improved translation when using our MNMT SCA model. One of the test examples for English→Dutch for which this is the case is shown in Fig. 2.

Table 2. Percentage of test examples where the model outperforms the UNMT baseline for English→Dutch and English→French in terms of BLEU score.
Fig. 2. Example for which our MNMT SCA model outperforms the UNMT baseline for English→Dutch (best viewed in color).

While the BLEU score is a good metric to determine translation quality, it has some disadvantages. For instance, a translation which is significantly different from the reference translation will get a low BLEU score, even if it is still valid and acceptable to the human reader. Moreover, a translation which does not sound smooth or contains a rather unexpected word may not be penalised much by BLEU if it still closely resembles the reference translation, even though a human reader will notice that it was generated by a machine. Nevertheless, e-retailers might prefer having consumers find and buy desired products through machine-generated translations instead of not at all, or through human translations, which are much more expensive to obtain.

6 Conclusion

In this paper, we have proposed a novel neural architecture for MNMT, which learns a multimodal, multilingual space jointly with a translation model to obtain visually grounded source word representations. By attending to the visually grounded source word representations, we can jointly reason over vision and language in a way that is effective for producing the translation in the target language. We introduced this model in the context of fashion e-commerce, where the product descriptions refer to fine-grained product attributes shown somewhere in the associated image. Moreover, we have improved the state-of-the-art multimodal translation results on a real-world fashion e-commerce dataset.

As future work, and to further improve the results, we would like to extend our model to multiple languages and to investigate neural architectures that better recognise fine-grained fashion attributes in images. We would also like to further explore training on comparable data, as this forms a realistic setting when dealing with product descriptions in different languages. Finally, the model proposed in this paper offers opportunities to automatically generate different types of fashion item descriptions (in one or multiple languages) that are adapted to the user, the targeted country or culture, or marketing strategies, while taking into account images of the fashion item.