1 Introduction

Clothing and style play a critical role in people's lives, which explains the enduring popularity of the fashion field. A compatible outfit helps people downplay their weaknesses and express their status, which is why people prefer to wear matching clothes. Combined with the ease of access to data brought by digitalization, this has increased researchers' interest in the field. In the last few years, there have been studies on clothing detection (Liu et al. 2016; Sidnev et al. 2021), clothing segmentation (Zhang et al. 2020), clothing image retrieval (Kang et al. 2020a, b; Ji et al. 2020) and outfit compatibility (Han et al. 2017; Sun et al. 2020a; Li et al. 2016; Song et al. 2017; Kavitha et al. 2020; Vasileva et al. 2018; Wang et al. 2019).

Outfit compatibility is the task of measuring how well several clothing items go together. In the literature, the approaches can be divided into two main types: outfit compatibility prediction and outfit compatibility recommendation. Outfit compatibility prediction is a binary classification problem that predicts the compatibility score of an outfit (Han et al. 2017; Li et al. 2016; Vasileva et al. 2018; McAuley et al. 2015). Outfit compatibility recommendation first estimates the compatibility of a given outfit and then, if compatibility is poor, identifies the incompatible clothes and determines how to modify the outfit to improve it (Wang et al. 2019; Yang et al. 2021; Chen et al. 2019a; Han 2022). Most previous work on outfit compatibility prediction measures the distance between visual features of clothing using Euclidean distance, Mahalanobis distance or Siamese networks (Vasileva et al. 2018; McAuley et al. 2015; He et al. 2016; Veit et al. 2015). Unlike these metric learning-based methods, other studies model fashion compatibility as a sequence problem (Li et al. 2016). To diagnose outfit compatibility, there are systems that use Bayesian Personalized Ranking, multilayer convolutional networks and contextual metadata (Wang et al. 2019; Han 2022; Lin et al. 2020).

On the other hand, image captioning is a field that has gained popularity in the literature in recent years. Image captioning belongs to the family of sequence-to-sequence problems. Solving such problems requires finding the dependencies and connections between the sequential elements of the input. Although theoretically capable of this, Recurrent Neural Networks (RNNs) suffer from long-term dependency problems due to vanishing gradients. Different RNN variants with minor modifications have been proposed to overcome this. Long Short-Term Memory (LSTM) can remember or forget sequence elements using input, output and forget gates. Cho et al. (2014) and Sutskever et al. (2014) proposed encoder-decoder models for machine translation based entirely on RNNs and LSTMs. However, encoding the entire input into a single vector leads to the same problems. To address this, researchers developed the attention mechanism: instead of encoding the entire input into one vector, the decoder utilizes a generated attention vector at each decoding step. Although LSTM with attention achieves better results than LSTM alone, the sequential operation of RNNs makes it impossible to process the inputs in parallel, so the training time of RNN models is very high. To solve this parallelization problem, the Transformer was proposed, which tackles sequence-to-sequence problems using only the attention mechanism (Vaswani et al. 2017).

In this study, an AI fashioner is developed to solve the outfit compatibility problem with deep learning techniques. The proposed model first predicts outfit compatibility; if there are incompatible items among the clothing items that make up the outfit, it helps the user create a more compatible outfit by suggesting compatible replacements for those items. A summary of the whole study is shown in Fig. 1. First, the dataset is analyzed and the incompatible clothing items are identified. Next, the outfit images are segmented using Mask R-CNN. The mismatched items are removed from the outfit image set one by one, and the images of the remaining items are compared with compatible outfits in the dataset that contain the same clothing categories. In the comparison phase, low- and high-level visual features are extracted and combined to identify the compatible outfits whose item correlations are most similar. The candidate items that will replace the incompatible pieces are then passed to the attribute detection network. Based on its outputs, different compatibility-improvement suggestions are generated for each incompatible outfit using a Convolutional Neural Network (CNN) and a Transformer.

Fig. 1 Overview. This paper presents an AI fashioner model that learns the relationships between compatible clothes and gives suggestions for reorganizing incompatible outfits

The main contributions of the proposed work are as follows:

  1. We develop an end-to-end framework to make the outfit compatibility task more user-friendly. This framework not only predicts and diagnoses outfit compatibility but also gives suggestions to improve compatibility.

  2. We combine high-level and low-level features to generate reviews of outfit combinations and propose a joint attribute detection module to understand real-world clothing images.

  3. Extensive simulations are conducted on two different datasets that contain real-world images. Rather than the classic top–bottom compatibility, reviews and suggestions are created over twelve different clothing categories.

The structure of this work is as follows. In Sect. 2, we briefly review relevant works on fashion compatibility, fashion attribute detection, and image captioning. In Sect. 3, we provide a detailed description of the proposed work. In Sect. 4, we present the results of our simulations, showing the effectiveness of the proposed approach. Finally, we present our conclusions in Sect. 5.

2 Related work

Many previous studies treat fashion compatibility as a metric learning problem. Veit et al. (2015) train a Siamese CNN to learn compatibility from co-purchased items. McAuley et al. (2015) utilize parameterized distance metrics to learn relationships between co-purchased item pairings, using CNNs for feature extraction. Chen and He (2018) propose a deep mixed-category metric learning framework based on triplet loss to recommend complementary fashion items.

There have also been alternatives to metric learning that model outfits sequentially. Han et al. (2017) use a bidirectional LSTM to model outfit generation. Li et al. (2016) use RNNs to predict fashion outfit compatibility. Other methods estimate clothing compatibility using transformation vectors derived from clothing feature representations (Li et al. 2019; Lu et al. 2021).

Image captioning is a challenging area of work involving both image processing and natural language processing (NLP). Image captioning studies often employ NLP techniques and deep learning models such as CNNs and RNNs to analyze the visual content of images and generate descriptive captions. Beyond these techniques, state-of-the-art approaches mainly use attention to generate captions (Anderson et al. 2017; Herdade et al. 2019; Xu et al. 2015). Recent comparisons of image captioning methods show that Transformer models, which rely only on the attention mechanism, yield the most successful results. The Transformer was introduced in 2017 in the article "Attention Is All You Need" as an encoder-decoder architecture built from attention layers (Vaswani et al. 2017).

There are also studies that address fashion image captioning, using machine learning to automatically generate textual descriptions of fashion images, which is helpful for applications such as automatic title creation, image search, and style recommendation. Chen et al. (2019b) proposed a Personalized Outfit Generation (POG) model that connects user preferences and individual items with a transformer architecture. Chen et al. proposed a fashion recommendation method that combines attention-weighted visual features and GRU-based weakly supervised learning; their approach uses image region-level features and user review information to make recommendations (Chen et al. 2019a). Park et al. (2022) proposed an improved Transformer model for a conversational system that recommends a desired fashion item based on the conversation between the user and the system and on fashion image information. Li et al. proposed a clothes image captioning framework that combines attribute detection, visual attention, and LSTM; the attention mechanism focuses on relevant image regions, allowing more accurate caption generation (Li et al. 2021). Balim and Ozkan (2021) proposed a system for automatically generating product titles for fashion images using a CNN, LSTM, and Global Vectors. Goenka et al. proposed a pre-trained transformer model called FashionVLP for fashion image retrieval; the model uses prior knowledge from large image-text corpora to improve retrieval performance (FashionVLP 2022). Yang et al. (2020) proposed a set of strategies to improve captioning performance for clothing and created the FACAD dataset for fashion captioning studies.

There are studies that aim to explain compatibility (Han 2022; Lin et al. 2020; Sun et al. 2020b; Tangseng and Okatani 2019; VisQu 2022; Kaicheng et al. 2021). Lin et al. (2020) introduced the ExpFashion dataset, which contains contextual metadata for top and bottom fashion items, and proposed a neural network-based approach for generating outfit recommendations with abstractive comments; the approach uses this metadata to generate personalized recommendations for individual users. Han et al. proposed a Bayesian Personalized Ranking (BPR) framework named PAICM that also suggests how to modify incompatible outfits to make them appealing to the user (Han 2022). Wang et al. studied the problem of diagnosing outfit compatibility and proposed a multi-layered comparison network for predicting the compatibility of different fashion items; the network uses gradient information in its predictions, allowing it to account for the relationships between the items in an outfit (Wang et al. 2019). Mo et al. (2022) presented a fashion compatibility assessment model that combines low- and high-level features from multilayer convolutional networks with a Transformer for explainable evaluation and recommendation. Yang et al. (2021) use the attribute information of fashion items to explain their compatibility. Balim and Ozkan (2023) used image processing techniques and transformers to diagnose fashion compatibility and generated explanations over body images.

While previous studies have mostly focused on the compatibility of tops and bottoms, this study proposes a system that checks the compatibility of twelve different clothing categories. In addition, while existing studies focus on a single recommendation, here a correlation matrix is built from low- and high-level features and three recommendations are generated for each mismatched outfit.

3 The proposed method

In this section, we present the proposed model, which consists of three main stages: data preparation, fashion comment generation, and fashion captioning. The overall structure of the proposed system is shown in Algorithm 1.

Algorithm 1

3.1 Data preparation

3.1.1 Detection of incompatible items

In this study, we use the ModAI dataset, which was produced for use in fashion studies (Balim and Özkan 2023). ModAI consists of real-world outfit images and comments about the compatibility of each outfit. In the dataset, the comments refer to the compatibility or incompatibility of paired clothing categories, such as footwear-top: compatible or top-scarf: incompatible. The ModAI dataset contains 11,010 outfit images and 25,916 compatibility comments about the clothing pairs of each outfit. It covers twelve fine-grained clothing categories inspired by the ModaNet dataset (Zheng et al. 2018), which is used with Mask R-CNN in the segmentation stage. These categories are shown in Table 1. In the ModaNet dataset, footwear and boots are categorized separately, while in the ModAI dataset these are grouped under the single category footwear.

Table 1 Categorical information from the ModaNet dataset

In the ModAI dataset, the compatibility comments of each outfit contain the keywords compatible or incompatible. We use these keywords to detect incompatible parts. If none of the comments on an outfit mention incompatible parts, we consider the outfit compatible. Otherwise, we list the clothes mentioned as incompatible and identify those mentioned most often. If fewer than three incompatible clothing categories are found, we add more than one item from the same category to the incompatible clothing list, since our goal is to generate three recommendation comments for each incompatible outfit. The whole process is presented in Algorithm 2.

Algorithm 2
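As a concrete illustration, the selection logic of Algorithm 2 can be sketched in a few lines of Python. The tuple format and helper names below are our own illustration of the procedure described above, not the actual implementation.

```python
from collections import Counter

def find_incompatible_items(comments, min_suggestions=3):
    """Identify the clothing categories most often flagged as incompatible.

    `comments` is a list of (category_a, category_b, verdict) tuples parsed
    from one outfit's compatibility comments, where verdict is either
    "compatible" or "incompatible" (an assumed parsed representation).
    """
    mentions = Counter()
    for cat_a, cat_b, verdict in comments:
        if verdict == "incompatible":
            mentions[cat_a] += 1
            mentions[cat_b] += 1

    if not mentions:
        return []  # every comment is positive: the outfit is compatible

    # Rank categories by how often they are blamed; if fewer than three
    # categories are flagged, repeat the worst offenders so that three
    # replacement suggestions can still be generated.
    ranked = [cat for cat, _ in mentions.most_common()]
    while len(ranked) < min_suggestions:
        ranked.append(ranked[len(ranked) % len(mentions)])
    return ranked[:min_suggestions]
```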

3.1.2 Outfit segmentation

In real-life images taken for outfit compatibility problems, the clothes are worn on a human body. The proposed model uses the Mask R-CNN segmentation technique to extract the relevant clothing items from the body. Mask R-CNN is built on Faster Region-Based Convolutional Neural Networks (Faster R-CNN) (Ren et al. 2016; He et al. 2018). Faster R-CNN is a popular object detection algorithm that uses a CNN to perform both object classification and bounding box regression. Given an input image, it first generates a set of proposals for potential object locations, then processes these proposals with the CNN to predict the class label and bounding box coordinates of each object. Faster R-CNN performs the learning process in four steps:

  • A CNN architecture is used to extract vectors of features from images. (ResNet-101 architecture is preferred for this study.)

  • These feature vectors are sent to the Region Proposal Network to generate candidate frames.

  • The Region of Interest pooling layer resizes the candidate regions to a fixed size.

  • Finally, the class label and the bounding box coordinates of each object are determined from the features extracted from the proposed regions.

Like Faster R-CNN, Mask R-CNN uses a CNN to perform object classification and bounding box regression. However, it also includes an additional branch in the network that is used to predict the mask for each object. This branch takes the features extracted by the CNN and processes them using additional layers to generate a binary mask that indicates the pixels belonging to the object. The mask is then combined with the bounding box prediction to generate the final object detection result, which includes the class label, bounding box coordinates, and mask for each detected object in the image.
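For illustration, the following sketch runs inference with torchvision's off-the-shelf Mask R-CNN. Note that the stock torchvision model uses a ResNet-50 FPN backbone and COCO weights, whereas this work uses ResNet-101 and fine-tunes on ModaNet, so the snippet is a stand-in rather than the trained segmentation model; the score threshold is likewise an assumed value.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Stock Mask R-CNN (ResNet-50 FPN, COCO weights); a stand-in for the
# ResNet-101 model fine-tuned on ModaNet described in the text.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("outfit.jpg").convert("RGB"))
with torch.no_grad():
    prediction = model([image])[0]

# Keep confident detections; each entry has a label, a box and a soft mask.
keep = prediction["scores"] > 0.7
masks = prediction["masks"][keep]          # (N, 1, H, W), values in [0, 1]
labels = prediction["labels"][keep]
binary_masks = masks.squeeze(1) > 0.5      # threshold to pixel-level masks
```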

In this paper, we use the ModaNet dataset to train the Mask R-CNN. This dataset includes 55,176 images annotated with pixel-level segments, polygons, and bounding boxes, covering thirteen fine-grained categories introduced for clothing segmentation and feature estimation research (Zheng et al. 2018). After the training phase, the network learns the pixel information for clothing items in the fine-grained categories given in Table 1. To use the ModAI and ModaNet datasets jointly, we relabel garments segmented as boots as footwear.

In this work, we use the segmentation model in three different stages. Firstly, we use this model in the similarity detection stage to separate clothes from outfit images. Then, we also use this model in the attribute detection module to extract the correct clothing items from e-commerce images. Lastly, we use this model to generate fashion compatibility comments for element-wise feature extraction.

3.2 Fashion comment generation

Fashion comment generation aims to create fashioner-like reviews that address incompatible clothes and make suggestions to improve compatibility for each incompatible outfit. This phase has two components: the similarity detection module and the fashion attribute detection module. More details of each component are described in the following sections.

3.2.1 The similarity detection module

In the similarity detection module, we aim to find clothes that can be recommended in place of the mismatched items in incompatible outfits. For this purpose, we first segment all the clothes in the outfit images using the Mask R-CNN technique described in Sect. 3.1.2. Then, we remove the mismatched items from the incompatible outfits and measure the similarity against all compatible outfits in the dataset. Our goal is to identify the compatible outfits that are most similar to each incompatible outfit once its mismatched items are removed; this makes it possible to recommend compatible clothing items of the same category in place of the mismatched items.

In this module, correlation matrices are created for each outfit, using low-level and high-level features together, to detect similar combinations of clothes via the latent structure between outfits. First, high-level features are extracted for all segmented clothing items in the dataset using a ResNet-101 trained on ImageNet. The high-level features mostly reflect properties such as fashion style and overall compatibility (Wang et al. 2019). Color information, which is very important for compatibility, is then added to these features as the low-level component to create clothing representation vectors. Although RGB is the most widely used color space in the digital environment, pixels of the same physical color can take different RGB values depending on the light brightness, especially in real-world images. Since real-world images are taken in various poses, places and lighting conditions, it is difficult to extract the color features of clothes correctly from RGB. In this study, the HSV (hue, saturation, value) color space is therefore preferred over RGB to make the system less sensitive to the brightness of the environment. The HSV space consists of hue (H), saturation (S) and brightness (V) components. Images are converted from RGB to HSV, and only the H and S values are used; the V value is excluded to minimize the effect of brightness.

Given an outfit with \(l\) clothing items, each from a different category, we denote the outfit as \(X=\left[{x}_{1},{x}_{2},\dots ,{x}_{l}\right]\in {R}^{l\times d}\), where \(d\) is the dimensionality of the fashion images. For each clothing item in an outfit, we compute the image embedding \({\varvec{F}}\), the hue histogram \({\varvec{h}}\) and the saturation histogram \({\varvec{s}}\). By concatenating \({\varvec{F}}\), \({\varvec{h}}\) and \({\varvec{s}}\), we obtain \({\varvec{v}}\):

$${v}_{i}=c\left({h}_{i},{s}_{i},{F}_{i}\right)$$
(1)

where \({h}_{i}\) and \({s}_{i}\) are the color features and \({F}_{i}\) is the feature vector of the \(i\)th fashion item in an outfit, and \(c\left(\cdot \right)\) denotes concatenation. As a result, we have a representation of each clothing item as \({v}_{i} \in {R}^{d}\).
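A minimal sketch of this feature extraction is given below, using OpenCV for the HSV histograms. The bin count and the normalization are illustrative choices of ours, not values reported here; `resnet_features` stands for the pooled ResNet-101 embedding of the segmented item.

```python
import cv2
import numpy as np

def color_features(segmented_item, bins=32):
    """Hue/saturation histograms for one segmented clothing item (BGR image).

    The V channel is deliberately discarded to reduce sensitivity to scene
    brightness, as described above; the bin count is an assumed value.
    """
    hsv = cv2.cvtColor(segmented_item, cv2.COLOR_BGR2HSV)
    h = cv2.calcHist([hsv], [0], None, [bins], [0, 180]).flatten()
    s = cv2.calcHist([hsv], [1], None, [bins], [0, 256]).flatten()
    # Normalize so that items covering different pixel areas are comparable.
    h /= h.sum() + 1e-8
    s /= s.sum() + 1e-8
    return h, s

def item_representation(h, s, resnet_features):
    """Eq. (1): concatenate the color histograms with the CNN features."""
    return np.concatenate([h, s, resnet_features])
```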

The Pearson correlation coefficient (PCC) is a statistical measure of the linear correlation between two variables. It is commonly used in research to identify positive or negative relationships between variables and is calculated using the means and standard deviations of the two variables. The PCC ranges from − 1 to 1, where − 1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation. In the present study, the PCC is used to detect the similarity between the items in the same outfit. To calculate the degree of correlation between the representation vectors \({v}_{i}\) and \({v}_{j}\) of two clothes, the PCC is computed as follows:

$${\rho }_{{v}_{i},{v}_{j}}=\frac{\mathrm{Cov}\left({v}_{i},{v}_{j}\right)}{{\sigma }_{{v}_{i}}{\sigma }_{{v}_{j}}}=\frac{\sum \left({v}_{i}-{\mu }_{{v}_{i}}\right)\left({v}_{j}-{\mu }_{{v}_{j}}\right)}{{\sigma }_{{v}_{i}}{\sigma }_{{v}_{j}}}$$
(2)

Using this method, the PCC values between the compatible clothes in each outfit are calculated and a correlation matrix is created to represent the compatible clothes. Finally, by examining the similarities of these correlation matrices, it is determined which clothes can be recommended in place of the incompatible ones.
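The sketch below shows how Eq. (2) yields an outfit-level correlation matrix with NumPy. The Frobenius-distance comparison of two matrices is one plausible reading of the matrix-similarity step; the text does not pin down that choice, so it is stated here as an assumption.

```python
import numpy as np

def outfit_correlation_matrix(item_vectors):
    """Pairwise Pearson correlations between the items of one outfit.

    `item_vectors` is an (l, d) array whose rows are the representation
    vectors v_i from Eq. (1); np.corrcoef applies Eq. (2) to every row pair.
    """
    return np.corrcoef(np.asarray(item_vectors))

def matrix_similarity(matrix_a, matrix_b):
    """Assumed comparison of two outfits' correlation matrices:
    Frobenius distance, where smaller means more similar."""
    return np.linalg.norm(matrix_a - matrix_b)
```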

3.2.2 The fashion attribute detection module (FAD-NET)

An AI fashioner model needs to make recommendations for replacing mismatched pieces, as real-life fashioners do. To generate a fashion recommendation, the model needs to identify the characteristics of the clothes to be recommended. At this stage, a neural network is developed that takes clothing images (the clothes identified in the previous section) as input and outputs basic clothing characteristics that are important for compatibility. The whole process is shown in Fig. 2.

Fig. 2 The creation of the attribute detection module

With the growth of e-commerce systems, the amount of tagged fashion data has increased, and these data are used in many studies in the fashion area. To build the attribute detection module, e-commerce website data are used to extract the garment features frequently used by fashioners. Approximately 50k clothing images and their clothing characteristics were downloaded from e-commerce sites using web scraping techniques. After the data cleaning steps, 40k clothing images and their characteristics remain for use in the attribute detection module. The downloaded images mostly contain people, fashion products from different clothing categories, or varied backgrounds. To overcome this problem and isolate the correct garment image, the segmentation technique described in Sect. 3.1.2 and a number of pre-processing techniques are used.

After these steps, the characteristic detection phase starts. At this stage, following a popular approach in the literature, a fully connected network is developed on top of weights trained on ImageNet. This network takes a clothing image as input and produces its color and usage type as output. The details of the network outputs are shown in Table 2.

Table 2 Outputs of the attribute detection network

In their study on outfit compatibility evaluation, Wang et al. assumed that the first layers of a deep neural network tend to learn low-level features such as color from fashion images (Wang et al. 2019). Mo et al. noted that the last layers capture abstract characteristics such as style and usage (Kaicheng et al. 2021). These observations are consistent with the general principle that early CNN layers learn low-level features while later layers learn higher-level abstractions. Based on these studies, a feed-forward neural network is developed to detect the base color and usage, the characteristics most often used by fashioners. The network is a multi-output CNN fusion network that concatenates features from different convolutional layers and outputs attribute tags. The feature learning layers consist of a ResNet-101 pre-trained on ImageNet, whose outputs from different convolutional blocks feed the fully connected layers. The final outputs are passed through the sigmoid function, which maps them to the range [0, 1] so they can be interpreted as probabilities. This allows the network to generate a set of predicted attribute tags for each input image. The whole attribute detection process is shown in Fig. 3; a code sketch follows Fig. 3.

Fig. 3 FAD-NET attribute detection process. We concatenate the last four blocks of the ResNet-101 network as the feature vector. Next, we use two fully connected layers to extract clothing attributes
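A hedged PyTorch sketch of a FAD-NET-style network follows. The class counts, the 512-unit hidden layer, and the pooling strategy are placeholders, since the exact layer shapes are not specified above; the channel widths of the four ResNet-101 blocks (256, 512, 1024, 2048) are standard.

```python
import torch
import torch.nn as nn
import torchvision

class FADNet(nn.Module):
    """Sketch of the multi-output attribute network described above.

    Pooled outputs of the last four ResNet-101 blocks are concatenated
    and fed to two fully connected heads (base color and usage). The
    class counts are placeholders for the tag sets in Table 2.
    """

    def __init__(self, n_colors=12, n_usages=6):
        super().__init__()
        backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.blocks = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.pool = nn.AdaptiveAvgPool2d(1)
        fused = 256 + 512 + 1024 + 2048   # channel widths of the four blocks
        self.fc = nn.Sequential(nn.Linear(fused, 512), nn.ReLU())
        self.color_head = nn.Linear(512, n_colors)
        self.usage_head = nn.Linear(512, n_usages)

    def forward(self, x):
        x = self.stem(x)
        pooled = []
        for block in self.blocks:
            x = block(x)
            pooled.append(self.pool(x).flatten(1))
        features = self.fc(torch.cat(pooled, dim=1))
        # Sigmoid maps each tag score to [0, 1], read as a probability.
        return (torch.sigmoid(self.color_head(features)),
                torch.sigmoid(self.usage_head(features)))
```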

3.3 Fashion captioning

Fashion captioning is the stage where the fashioner's recommendations are generated over the images. In this study, transformer neural networks are used to generate fashion recommendations. The transformer architecture was originally proposed for NLP tasks such as machine translation, but it has since been successfully applied to a wide range of other tasks. One of its key advantages is its flexibility, which allows it to be easily adapted to different problem domains. In the case of image captioning, the encoder processes the visual input, while the decoder generates a natural language description of the image. Transformers consist of \(N\) stacked encoder and decoder modules. Most elements of these modules are composed of multi-head attention and a position-wise feed-forward network. The core component of multi-head attention is the scaled dot-product attention, which operates on queries \(Q= \left\{{q}_{1},{q}_{2},{q}_{3},\dots ,{q}_{{l}_{q}}\right\},{q}_{i} \in {\mathbb{R}}^{{d}_{q}}\), keys \(K= \left\{{k}_{1},{k}_{2},{k}_{3},\dots ,{k}_{{l}_{k}}\right\},{k}_{i} \in {\mathbb{R}}^{{d}_{k}}\) and values \(V= \left\{{v}_{1},{v}_{2},{v}_{3},\dots ,{v}_{{l}_{v}}\right\},{v}_{i} \in {\mathbb{R}}^{{d}_{v}}\), and is formulated as follows:

$$Attention \left(Q,K,V\right)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V$$
(3)

In Eq. (3), Q (query) is a matrix containing the queries (the vector representations of the words being compared against the others), K (key) is a matrix containing the keys (the vector representations of all other words) and V (value) is a matrix containing the word values (the whole sentence) weighted by the computed attention. The denominator \(\sqrt{{d}_{k}}\) is a scaling factor, where \({d}_{k}\) is the dimensionality of the keys; it stabilizes the gradients and thus helps the learning process. The multi-head attention mechanism is shown in Eqs. (4) and (5).

$$MultiHead\left(Q,K,V\right)= Concat \left( {head }_{1},\dots , {head }_{h}\right){W}^{O}$$
(4)
$${head}_{i}= Attention \left(Q{W}_{i}^{Q},K{W}_{i}^{K},V{W}_{i}^{V}\right)$$
(5)

The multi-head attention mechanism uses multiple heads with the same architecture, each of which applies attention independently to the input data. These heads are repeated h times. By applying self-attention multiple times, with each head attending to different parts of the input sequence, the transformer is able to capture a wide range of relationships between the input tokens. Because each head has its own set of weight matrices, it can learn different aspects of the input data, such as grammar or semantics. This allows the transformer to represent the input data in a richer and more diverse way, improving its ability to perform a variety of NLP tasks.
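Eq. (3) translates almost directly into PyTorch; in practice, `torch.nn.MultiheadAttention` packages Eqs. (4) and (5), so the snippet below is only meant to make the computation explicit.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Eq. (3): attention weights from query-key similarity, applied to V.

    Q, K, V are tensors of shape (..., length, d); the scaling by sqrt(d_k)
    keeps the dot products in a range where softmax gradients stay healthy.
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V
```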

The position-wise feed-forward network is another important component of the transformer architecture. It is typically applied after the self-attention or multi-head attention layers and further processes the representations they generate, producing a more abstract representation of the input. It is defined as follows:

$$\mathrm{FFN}\left(x\right)={W}_{2}max\left(0,{W}_{1}x+{b}_{1}\right)+{b}_{2}$$
(6)

In this equation, \(x\) is the input to the position-wise feed-forward network, and \({W}_{1}\) and \({b}_{1}\) are the weight matrix and bias vector of the first linear transformation. The non-linear activation is a rectified linear unit, which applies \(max\left(0,\cdot \right)\) element-wise to its input. The output of the activation is then transformed by a second linear map, with weight matrix \({W}_{2}\) and bias vector \({b}_{2}\), to produce the final output of the position-wise feed-forward network. This layer helps capture higher-level features in the input data, improving the performance of the transformer on a variety of tasks.
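Eq. (6) is equally compact in code; the default dimensions below follow Vaswani et al. (2017) and are assumptions here, not values reported for this system.

```python
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """Eq. (6): two linear maps with a ReLU in between, applied to every
    position independently."""

    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff),
                                 nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)
```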

One of the key differences between the transformer architecture and other popular deep learning models, such as RNNs, is that the transformer does not inherently incorporate information about the order of the input sequence. To address this, the architecture includes a mechanism for injecting positional information into the input data. This is typically done by adding a set of fixed positional encoding vectors to the input sequence, which encode the relative positions of the input tokens.
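The standard sinusoidal formulation from Vaswani et al. (2017), sketched below, is one common way to compute these vectors; the text does not state which variant is used here, so this is an assumption.

```python
import math
import torch

def positional_encoding(max_len, d_model):
    """Fixed sinusoidal positional encodings (Vaswani et al. 2017);
    added to the input embeddings to inject token order."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cosine
    return pe
```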

The original Transformer consists of 6 encoder blocks and 6 decoder blocks (Vaswani et al. 2017). In this study, the features of the segmented clothing images are extracted using the ResNet-101 architecture, and the fashioner comments are generated using the Transformer. The whole process is shown in Fig. 4.

Fig. 4 Fashioner comment generation process

4 Experiments

4.1 Dataset

The performance of the proposed method is evaluated on two datasets. The ModAI dataset was described in Sect. 3. To evaluate the proposed work in a different domain, the Polyvore-T dataset is used, which is widely used in fashion compatibility studies (Wang et al. 2019). In their study, Wang et al. generated incompatible examples by randomly replacing some items in the Polyvore-T dataset. We follow the same procedure: negative examples are generated by randomly replacing an item in a positive outfit with an item of the same category from a different outfit, preserving the clothing categories. The statistics of the ModAI and Polyvore-T datasets are listed in Table 3. The datasets are divided into training, validation, and testing sets with a ratio of 7:1:2 for use in the experiments.

Table 3 The statistics of the ModAI and the Polyvore-T datasets

4.2 Training details

All experiments are conducted on an NVIDIA Quadro RTX 5000 graphics card. A ResNet-101 pre-trained on ImageNet is used as the backbone. Different hyper-parameters, such as the number of heads and the number of layers, are explored for the transformer. The batch size is 32 for the experiments on ModAI and 64 for those on Polyvore-T. The adaptive moment estimation optimizer (Adam) is used during training with a learning rate of 0.0001. Dropout, a regularization technique commonly used in neural networks to prevent overfitting, is applied to the input sequences and to the intermediate representations computed by the encoder and decoder blocks. The models are trained for a maximum of 50 epochs, and training is stopped early if performance on the validation set starts to deteriorate. Finally, during inference, the models use a beam search strategy with a beam size of 3 to generate output sequences: at each step the model keeps the 3 most likely partial sequences and extends them based on the predicted probabilities of the next tokens. The results are reported on standard machine translation and image captioning evaluation metrics, including CIDEr (Vedantam et al. 2014), BLEU (Papineni et al. 2001), ROUGE-L (Lin 2004) and METEOR (Banerjee and Lavie 2005).
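For clarity, beam search with beam size 3 can be sketched as follows; `decode_step` is a hypothetical stand-in for the trained transformer decoder conditioned on the image features, and the maximum length is an assumed value.

```python
import torch

def beam_search(decode_step, start_token, end_token, beam_size=3, max_len=30):
    """Minimal beam search, as used at inference time with beam size 3.

    `decode_step(tokens)` is assumed to return a 1-D tensor of
    log-probabilities over the vocabulary for the next token.
    """
    beams = [([start_token], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_token:
                candidates.append((tokens, score))  # finished hypothesis
                continue
            log_probs = decode_step(tokens)                 # (vocab,)
            top_lp, top_ix = log_probs.topk(beam_size)
            for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                candidates.append((tokens + [ix], score + lp))
        # Keep only the `beam_size` highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(t[-1] == end_token for t, _ in beams):
            break
    return beams[0][0]
```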

4.3 Performance analysis

4.3.1 Quantitative results

The results of the proposed method with different hyper-parameters are shown in Table 4. The best results are highlighted in bold. Compatibility comments are generated based on correlation-based similarity, and experiments are conducted with different fine-tuning settings, numbers of heads, and numbers of layers. There is a clear difference between the variants with and without fine-tuning. It is also observed that increasing the number of heads and layers improves system performance up to a point.

Table 4 Results of different hyperparameters

The results of the different techniques used to generate similarity-based recommendations, and their comparison across the datasets, are shown in Table 5. The best results are highlighted in bold. Correlation-based similarity detection is more successful than distance-only techniques, and the more successful results are obtained on the ModAI dataset.

Table 5 Results according to different methods and datasets

It can also be observed from Table 5 that using only high-level attributes or only low-level attributes in the fashion recommendation generation phase produces less successful results than using both together.

4.3.2 Qualitative results

The qualitative results of the proposed method are shown in Table 6. The results show that the proposed model performs quite adequately. If the system finds the outfit compatible, it directly generates the result "your combination is compatible". If some items make the outfit incompatible, the proposed model suggests new items that can replace them to make the outfit compatible. When only one item breaks the outfit compatibility, the model suggests different pieces from the same category that will harmonize the outfit. It is noteworthy that the system sometimes confuses items located at similar coordinates in the images: for example, it may suggest a skirt when it should suggest an item from the shorts category, or a jacket instead of a top. These errors can be reduced by improving the segmentation.

Table 6 Example results from the proposed AI fashioner

5 Conclusion

This paper aims to create an artificial fashioner using different levels of visual features, transformers, and the relationships between clothes. The proposed work not only detects compatibility but also identifies the elements that break it and describes the characteristics of items that can replace the incompatible ones to harmonize the outfit. It then provides multiple suggestions to improve outfit compatibility, like a real fashion designer. Experiments on two real-world datasets demonstrate the success of the proposed system. For future work, we plan to explore different aspects of fashion to better diagnose and explain the compatibility of clothes. We also want to address the creation of a personal artificial fashioner, since fashion is highly subjective.