1 Introduction

Generative Adversarial Networks (GANs) [1] are frequently called upon to produce a wide variety of high-quality images, and image generation has advanced rapidly. Today, a pre-trained StyleGAN can quickly create diverse, high-fidelity faces and supports several downstream tasks such as face stylization and facial attribute editing. Full-body photographs of people are another kind of human-related media, with richer, more varied, and finer-grained content. The vast majority of existing techniques in this field build their editing models on examples of the intended clothing, which are typically warped and stitched into the given input image. Text-conditioned fashion-image editing, which modifies images through natural, high-level language descriptions of the intended apparel, is therefore attractive because it offers a convincing alternative to example-based editing techniques. Human image generation also has several applications, such as human pose transfer, virtual try-on, and animation [2]. In terms of applications and interaction, it is desirable to let lay users easily control the synthesized full-body images. Current methods for creating human body images do not produce diverse clothing: they frequently generate garments with basic patterns, such as solid colors, and offer no fine-grained control over garment textures. Producing clothing under textual control also requires additional fine-grained annotations. Because human body images are so complex, it is difficult to handle all the involved aspects in a single generative model. We therefore split the task into two stages: the first stage builds a human parsing mask with a variety of clothing shapes from the provided human pose and user-specified phrases describing the clothing shapes, and the second stage enriches the human parsing mask with a variety of clothing textures based on texts that describe those textures. In addition, 3D models can be created with the aid of 1) a 3D human representation and 2) generative network training techniques. We adopt the Pixel-aligned Implicit Function (PIFu) representation for 3D deep learning from a single or multiple input photographs to tackle the difficult task of textured surface inference of clothed 3D people. Most efficient deep learning methods for processing 2D images (such as semantic segmentation and 2D joint detection) use fully convolutional network designs that keep the input and output spatially aligned, but this is particularly challenging in 3D [3]. Voxel representations can be made fully convolutional, but because of their intrinsic memory requirements they cannot produce fine-scale detailed surfaces.

2 Related work

GANs are highly effective at creating high-fidelity images, and since the first generative adversarial model was proposed in 2014, many variants have been developed [4]. As an alternative to unconditional generation, conditional GANs [5] were proposed to create images from conditions such as segmentation masks [6,7,8] and natural language. Earlier systems use human poses and language as inputs to produce conditional visuals. The Variational Autoencoder (VAE) is an alternative image-generation paradigm to the GAN. 3D representations of humans: 3D human representations are essential tools for human-centric tasks. Parametric models were developed by [9] for the explicit modeling of 3D humans and for simulating human appearance. Although less realistic, parametric modeling offers reliable control over the human model. The number of publications on human Neural Radiance Fields (NeRF) has also grown rapidly alongside NeRF itself; several works suggest learning modality-invariant human representations for a variety of downstream tasks, supported by large-scale multi-modal 4D human datasets. Human image manipulation and synthesis: the goal of pose transfer is to preserve a person's appearance across different poses. A pose-conditioned StyleGAN framework was proposed by [Albahar et al. 2021], in which the original image's details, warped to the target pose, spatially modulate the features used for synthesis. ADGAN was proposed by [10] for controllable person image generation.

2.1 Human generation

Despite the enormous progress made in generating human faces, the complexity of human poses and appearances makes it difficult to create human images. Some works rely on 3D human datasets to build 3D human geometry, while others try to train 3D human GANs from 2D human image collections alone. The Convolutional Neural Network (CNN)-based neural renderers used by [11,12,13] cannot ensure 3D consistency. Human NeRF approaches, which train only on low-resolution images, address this by boosting the resolution with super-resolution, although this still does not yield excellent results. 3D-aware GANs: GANs have achieved remarkable success in generating 2D images, and 3D-aware generation has also received considerable attention [14,15]. Researchers have used voxels and meshes to support 3D-aware generation, and recent advances in NeRF have led to many NeRF-based 3D-aware GANs that employ 2D decoders for super-resolution to boost the generation resolution. For more accurate geometry and better 3D consistency, it is preferable to increase the raw rendering resolution by improving rendering efficiency [16]. We propose a powerful 3D human representation to enable training at high resolution.

3 Proposed work

The proposed system generates 3D models from text so that a wide range of poses can be obtained along with high-quality images. The datasets used in this work are as follows.

3.1 Dataset used

Deepfashion-multimodal can be used for text-driven human image creation, text-guided human image manipulation, skeleton-guided human image creation, human posture estimation, human image captioning, multi-modal learning for human images, recognition of human attributes, and prediction of human parsing. Link: drive.google.com/drive/folders/1An2c_ZCkeGmhJg0zUjtZF46vyJgQwIr2. Table 1 shows the description of the dataset.

Table 1 Deepfashion-multimodal dataset overview

BUFF (Bodies Under Flowing Fashion, a 4D dataset) (Zhang et al. [5]) is a high-quality 4D dataset containing ground-truth 3D shapes of humans wearing apparel. It comprises five subjects, three men and two women, each wearing two different outfits. They move their hips, bend their heads to the left, and twist and roll their shoulders. Link: https://buff.is.tue.mpg.de/. Table 2 shows the description of the BUFF dataset.

Table 2 BUFF dataset overview

3.1.1 SMPL

SMPL (Skinned Multi-Person Linear Model) is a detailed 3D model of the human body that was created from thousands of 3D body scans. It is built on skinning and blend shapes. The project website offers learning materials for SMPL, including code for using SMPL in Python, Maya, and Unity, as well as sample FBX files containing animated SMPL models. Link: https://star.is.tue.mpg.de/. Table 3 shows the SMPL dataset description.

Table 3 SMPL dataset overview

FashionMNIST is an image dataset of 70,000 28 × 28 grayscale images of fashion articles across 10 categories, released by Zalando Research. Link: https://github.com/zalandoresearch/fashion-mnist. Table 4 shows the FashionMNIST dataset overview.

Table 4 FashionMNIST dataset

Saeur et al. [17] provide a multi-view human dancing video dataset with rich poses and accurate SMPL (Skinned Multi-Person Linear Model) fits. If further new, high-quality images of various designs, styles, and patterns are needed, we can still use the Text2Human technique [18], which is the most optimized to date, as Nvidia Tesla V100 GPUs were used to train its models. The only variation we implement is 3D model generation, which yields many poses and accurately increases the number of available viewpoints. To achieve this, images are first generated from text; the generated images can then be fed directly into 3D model generation [19]. The 3D model generation is performed with methods from inverse graphics, since it aims to recover 3D models from 2D observations. Table 5 shows the description of all datasets used in this work. Figure 1 shows the system architecture for text-to-fashion-image generation. Figure 2 denotes the overall system design.

Table 5 Overall dataset overview
Fig. 1
figure 1

System architecture diagram for text to fashion image generation

Fig. 2
figure 2

Overall system design

The various modules involved in the work are:

  1. Data pre-processing

  2. Text to clothes texture, shape analyzer

  3. 2D Image to 3D human model generator

3.2 Data preprocessing

Refine the dataset by removing any missing, damaged, or irrelevant photos and annotations. Augment the dataset by applying transformations like rotation, flipping, scaling, and shearing to the images, enhancing its size and diversity to better reflect real-world scenarios. Standardize the photographs to a uniform scale, such as 256 × 256, and crop them to eliminate any extraneous or distracting elements, thereby improving the effectiveness and efficiency of machine learning models. Normalize and standardize the pixel values of the photos to minimize variations in lighting and color, enhancing the data's comparability and consistency. Additionally, remove incomplete and half-body images from the Dense-Pose dataset.
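The steps above can be expressed as a short preprocessing routine. The following is a minimal sketch, assuming torchvision is used for the transformations (the paper does not name a specific library); the function name and exact augmentation parameters are illustrative.

```python
# Minimal preprocessing sketch: resize/crop to 256 x 256, apply light augmentation,
# and normalize pixel values, as described above. Parameters are assumptions.
from PIL import Image
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(288),                      # slightly larger than the target size
    transforms.RandomCrop(256),                  # crop away distracting borders
    transforms.RandomHorizontalFlip(p=0.5),      # flipping augmentation
    transforms.RandomAffine(degrees=10, scale=(0.9, 1.1), shear=5),  # rotation/scale/shear
    transforms.ToTensor(),                       # scale pixels to [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # standardize channels
])

def load_example(path: str):
    """Load one image, skipping damaged or unreadable files (returns None on failure)."""
    try:
        img = Image.open(path).convert("RGB")
    except (OSError, ValueError):
        return None                              # drop damaged/irrelevant photos
    return train_transform(img)
```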

3.3 Text to clothes texture, shape analyzer

The majority of techniques currently employed in this field construct their editing models using examples of the desired clothing, typically manipulating and fitting them onto the input image. While these methods enable image modification through more natural language-based descriptions of the intended attire, text-conditioned fashion-image editing remains popular due to its ability to provide a compelling alternative to example-based editing approaches. Moreover, there are numerous applications for generating human images, including human pose transfer, virtual try-on, and animations. Particularly in terms of applications and user interactions, facilitating the easy control of synthesized full-body human images is desirable, as they represent a form of human-related media with richer, more varied, and finely detailed material. However, current methods for generating human body images often lack diversity in clothing, frequently producing items with basic patterns and limited control over garment textures. To address this, additional detailed annotations are required for producing clothing based on textual specifications. Given the complexity of human body images, it is challenging to encompass all relevant aspects within a single generative model. Stage I of our approach involves constructing a human parsing mask with diverse clothing shapes based on the provided human pose and user-specified descriptions of clothing shapes. Subsequently, Stage II enhances this mask by incorporating a variety of clothing textures derived from text descriptions. Figure 3 denotes the flow diagram and Fig. 4 denotes the System Design of Text-driven clothed human image synthesis with 3D human model estimation. Figure 5 shows the System Design of user input text to Human image synthesis. Figure 6 denotes the GAN Architecture.
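The two-stage flow described above can be summarized by the following structural sketch. The names (`stage1`, `stage2`, `synthesize_human`) are hypothetical placeholders, not an existing API; the sketch only shows how the stages chain together.

```python
# Structural sketch of the two-stage pipeline: Stage I maps pose + shape text to a
# parsing mask; Stage II maps the mask + texture text to the final image.
import torch

def synthesize_human(pose_map: torch.Tensor,
                     shape_text: str,
                     texture_text: str,
                     stage1,   # Stage I model: (pose, shape text) -> human parsing mask
                     stage2):  # Stage II model: (parsing mask, texture text) -> RGB image
    parsing_mask = stage1(pose_map, shape_text)   # diverse clothing shapes
    image = stage2(parsing_mask, texture_text)    # diverse clothing textures
    return parsing_mask, image
```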

Fig. 3
figure 3

Overall flow diagram

Fig. 4
figure 4

System design of text-driven clothed human image synthesis with 3D human model estimation

Fig. 5
figure 5

System design of user input text to human image synthesis

Fig. 6
figure 6

GAN architecture

3.4 2D image to 3D human model generator

The creation of 3D models can be achieved through the utilization of both generative network training techniques and 3D human representation. We employed the Pixel-aligned Implicit Function (PIFu) representation for deep learning in 3D, utilizing either a single photograph or a multitude of images to tackle the complex task of inferring textured surfaces of clothed 3D individuals. While many effective deep learning methods for analyzing 2D images, such as semantic segmentation and 2D joint detection, utilize "fully convolutional" network architectures that maintain spatial alignment between input and output, this presents a significant challenge in the context of 3D image processing. Voxel representations can indeed be fully convolutionalized, but due to their inherent memory requirements, they often fail to produce detailed surfaces at fine scales.

Figure 2 shows the system design. The block diagram describes how the clothing descriptions are fed into the system: the text is analyzed by BERT, and the words are one-hot encoded and mapped to the textual descriptions in the dataset's annotations. The Dense-Pose data, which contains various human pose images rendered in a heat-map-like format, is then passed in after the textual descriptions; this determines both the shape of the clothes to be placed on the human and the clothing texture to be applied. Once the required image is generated, we proceed to the conversion of the 2D image into a 3D model. The module that generates 3D models is composed of two parts, shape construction and texture/surface construction; once the image is passed in, the space required to build the 3D model is estimated as a probability field rather than an exact spatial view. When all these procedures are completed, the required 3D model is finalized and produced as the final result.

3.5 3D-generative adversarial network

The 3D GAN transforms the 2D image input into a 3D model. As shown in the following figure, the input \({Z}_{img}\) is first attached to a \(4\times4\times4\) latent space, then passes through feature volumes of size \(512\times4\times4\times4\), \(256\times8\times8\times8\), \(128\times16\times16\times16\), and \(64\times32\times32\times32\), finally forming the \(64\times64\times64\times64\) volume. Figure 7 describes the working of the 3D-GAN.

Fig. 7
figure 7

3D-GAN implementation
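A minimal generator sketch following the stated channel/resolution progression is given below. The kernel sizes, the latent dimensionality, and the single-channel occupancy output are assumptions, not details specified in the text.

```python
# Sketch of a 3D-GAN generator: 512x4^3 -> 256x8^3 -> 128x16^3 -> 64x32^3 -> 64^3 volume.
import torch
import torch.nn as nn

class Generator3D(nn.Module):
    def __init__(self, z_dim: int = 200, out_channels: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            # project z_img into a 512 x 4 x 4 x 4 volume
            nn.ConvTranspose3d(z_dim, 512, kernel_size=4, stride=1, padding=0),
            nn.BatchNorm3d(512), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(512, 256, 4, stride=2, padding=1),   # 256 x 8^3
            nn.BatchNorm3d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1),   # 128 x 16^3
            nn.BatchNorm3d(128), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1),    # 64 x 32^3
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(64, out_channels, 4, stride=2, padding=1),  # 64^3 output volume
            nn.Sigmoid(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, z_dim) -> reshape to (batch, z_dim, 1, 1, 1) before upsampling
        return self.net(z.view(z.size(0), -1, 1, 1, 1))
```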

Before getting into VQ-GAN, it is important to understand GANs and their algorithms (Algorithms 1 and 2). Figure 8 shows the loss curve for each epoch for the Curvature-Regularized VAE: as the number of steps (epochs) increases, the loss decreases. Figure 9 shows the DCGAN loss curve for each epoch, which likewise decreases as the number of epochs increases.

Fig. 8
figure 8

Loss curve for each epoch (Curvature Regularized VAE) graph obtained 50 epochs

Fig. 9
figure 9

DCGAN loss curve for each epoch

figure a

3.6 Vector quantized variational autoencoder (VQ-VAE)

In VQ-VAE, the encoder transforms input data into continuous latent vectors, which are then quantized to discrete representations using a predefined codebook. The key point is that the images must somehow be expressed as sequences in order to use the computationally expensive self-attention process for high-resolution synthesis. Rather than using pixels or patches as tokens, the encoder extracts an encoding \(\widehat{z}\) and quantizes it to \(z_q\) using the closest codebook entry, as in VQ-VAE [20]. The decoder can then reconstruct the image from the quantization. Apart from two key modifications in the loss, this part of the training required to learn the codebook representation is nearly identical to that of VQ-VAE. As stated in the previous section, the loss used in VQ-VAE consists of three terms: the Mean Squared Error (MSE) plus two alignment losses [21]. Here, the MSE is replaced by a perceptual loss, which is essentially the MSE computed on internal representations of the two images rather than on their raw pixels; for instance, the two images are passed through a CNN, the n-th layer features are extracted, and the MSE between them is computed [22]. To overcome the VQ-VAE blurring problem, an adversarial loss is also included (the traditional minimax loss used in GANs, played between a generator and a discriminator), with real/fake predicted not for the entire image but for individual patches [23]. Figure 10 shows the VQGAN implementation and Fig. 11 shows the detailed VQGAN implementation. Figure 12 shows the stacked-hourglass structure. Figure 13 shows the cloth texture construction and shape construction.
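The loss composition just described can be sketched as follows. This is a hedged illustration, not the exact training objective of the paper: `feature_extractor` (a pre-trained CNN) and `patch_discriminator` are assumed components, and the weighting factors are placeholders.

```python
# Sketch of the VQGAN-style loss: perceptual (feature-space) reconstruction instead of
# pixel MSE, plus codebook/commitment alignment terms and a patch-level adversarial term.
import torch
import torch.nn.functional as F

def vqgan_losses(x, x_rec, z_e, z_q, feature_extractor, patch_discriminator,
                 beta: float = 0.25, adv_weight: float = 0.1):
    # Perceptual loss: MSE on internal CNN features rather than on raw pixels.
    perceptual = F.mse_loss(feature_extractor(x_rec), feature_extractor(x))

    # Codebook and commitment (alignment) losses, as in VQ-VAE.
    codebook = F.mse_loss(z_q, z_e.detach())
    commitment = F.mse_loss(z_e, z_q.detach())

    # Patch-based adversarial loss: the discriminator predicts real/fake per patch.
    logits_fake = patch_discriminator(x_rec)           # (B, 1, H', W') patch logits
    adversarial = -logits_fake.mean()                  # non-saturating generator loss

    return perceptual + codebook + beta * commitment + adv_weight * adversarial
```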

Fig. 10
figure 10

VQGAN implementation

Fig. 11
figure 11

VQGAN implementation detail

figure b
Fig. 12
figure 12

Stack-hour-glassed structure

Fig. 13
figure 13

Cloth texture construction and shape construction

3.7 Unified fashion generative adversarial network(uFashGAN)

The uFashGAN architecture consists of a GAN that takes an input image and generates a corresponding image, which is then fed into another GAN layer. This generated image is compared to the original input image to compute the loss, and the first generated image is sent back to the discriminator along with other parts of the GAN module. This process is repeated several times to produce a diverse set of results.
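A minimal, hypothetical sketch of this iterative scheme is shown below. All names are placeholders, and the loss choices (L1 reconstruction against the input, adversarial feedback on the first generated image) are assumptions made to mirror the description above, not the paper's exact formulation.

```python
# Sketch of the uFashGAN-style loop: each pass re-generates from the previous output,
# the final output is compared against the original input, and the first generated
# image is fed back to the discriminator.
import torch
import torch.nn.functional as F

def ufashgan_step(x, generator, discriminator, num_passes: int = 3):
    outputs, current = [], x
    for _ in range(num_passes):
        current = generator(current)               # feed the output into the next GAN pass
        outputs.append(current)
    recon_loss = F.l1_loss(outputs[-1], x)         # compare the final output to the input
    adv_loss = -discriminator(outputs[0]).mean()   # first generated image goes back to D
    return outputs, recon_loss + adv_loss
```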

figure c

3.8 Stacked-hourglass

The hourglass module is a symmetrical encoder-decoder architecture with skip connections that help preserve spatial information. It uses residual connections to improve the flow of gradients during training, allowing the network to learn deeper features without the vanishing-gradient problem. The Stacked Hourglass Network (SHN) architecture is designed as a multi-stage process, where the output of one hourglass module is used as input to the next. This multi-stage approach refines the human-pose estimates over time and improves the overall accuracy, and the SHN uses intermediate supervision to train the network at each stage. Overall, the SHN is a powerful architecture for human pose estimation; Fig. 12 shows the stacked hourglass architecture. The stacked-hourglass structure here represents how the input image is turned into a real human model: each hourglass contains \(1\times1\) convolution networks that help in the overall reconstruction, so the first block reconstructs the head, followed by the hands, then the body, and finally the legs, yielding the whole body. The difficulty the stacked hourglass faces while developing the 3D model is that it must infer all features by estimating what they would look like and then generate the complete model. The stacked hourglass is composed of the following components, and a minimal module sketch is given after the list below.

  • Input: The Stack Hourglass architecture is commonly used for image segmentation, with the input usually being an image that requires segmentation. The size of the input image can differ based on the specific task and the model's needs.

  • Down-sampling: In the Stack Hourglass architecture, the input image is typically passed through multiple down-sampling layers, which are commonly composed of convolutional layers followed by max pooling. This process is used to decrease the spatial resolution of the image, allowing the network to capture high-level features and coarse details more effectively.

  • Up-sampling: The Stack Hourglass architecture incorporates several up-sampling layers, which often come in the form of deconvolutional or transposed convolutional layers, to boost the spatial resolution of the feature maps that were generated by the down-sampling layers. This is crucial for the network to capture small-scale details and preserve spatial information.

  • Intermediate output: The Stack Hourglass architecture produces multiple intermediate outputs, which are segmentation maps with different spatial resolutions. These intermediate outputs are generated at different stages of the network, typically after each down-sampling and up-sampling block. These intermediate outputs capture features at different scales and provide different levels of detail.

  • Intermediate supervision: To ensure effective training of the Stack Hourglass architecture, intermediate supervision is applied at each stage to provide feedback to the network. This is achieved by computing the loss between the intermediate output and the ground truth segmentation map, and then backpropagating the error through the network. By doing this, the network can learn and improve the accuracy and detail of the segmentation maps at each stage of the architecture. In summary, the Stack Hourglass architecture is a highly effective and adaptable technique for image segmentation that incorporates down-sampling, up-sampling, and intermediate supervision to capture features at various scales and generate precise segmentation maps.
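As referenced above, the following is a compact sketch of a single hourglass module: a symmetric encoder-decoder with a skip connection at each scale. For brevity, the residual blocks are approximated here by plain convolutions, which is an assumption rather than the exact block design.

```python
# Sketch of one hourglass module: down-sampling, recursive inner hourglass, up-sampling,
# and a skip connection that preserves spatial information at each scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hourglass(nn.Module):
    def __init__(self, channels: int = 256, depth: int = 4):
        super().__init__()
        self.skip = nn.Conv2d(channels, channels, 3, padding=1)   # skip branch at this scale
        self.down = nn.Conv2d(channels, channels, 3, padding=1)
        self.inner = (Hourglass(channels, depth - 1) if depth > 1
                      else nn.Conv2d(channels, channels, 3, padding=1))
        self.up = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        skip = self.skip(x)                                       # preserve fine detail
        y = F.max_pool2d(x, 2)                                    # down-sampling
        y = self.down(y)
        y = self.inner(y)                                         # recurse to coarser scales
        y = self.up(y)
        y = F.interpolate(y, scale_factor=2, mode="nearest")      # up-sampling
        return skip + y                                           # residual-style merge
```

In a stacked configuration, several such modules are chained, with a small prediction head after each module providing the intermediate supervision described above.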

Additionally, the architecture may use a Squeeze-and-Excitation (SE) block to produce the final output (Fig. 13). The entire module is divided into two parts that handle the texture of the image and the shape construction obtained from the input image. The image encoder passes on the relevant information: the cloth texture together with its color is extracted, resulting in the generated texture \({TG}_{fct}\), while the shape of the human in the image is passed to shape construction, where the 3D model for that given pose is generated. The cloth texture construction and shape construction are then combined to form the complete 3D clothed human model. Figure 14 describes the overview of image-to-3D model conversion based on the Z space.

  • cloth texture construction

  • shape construction

Fig. 14
figure 14

Overview of the image to 3D model conversion based on Z space

3.9 Single and multi-viewpoint

To begin with, single-view surface reconstruction computes a probability field over the 3D space and extracts the iso-surface of that field with the help of the Marching Cubes algorithm. Multiple views of the person can provide additional coverage of that person; the pixel-aligned implicit function allows more views to be used so that the shape and texture can be enhanced and the detailing made much more precise. Figure 15 shows the single and multi-view points.

Fig. 15
figure 15

Single and multi-view of point

Our goal is to build 3D human images, but before we can do so, we must first construct human images based on words that describe the characteristics of clothing (clothes shapes and clothes textures) [24]. The appropriate human image \(I \in \mathbb{R}^{H \times W \times 3}\) should be produced from a human pose \(P \in \mathbb{R}^{H \times W}\), texts for clothing shapes \(T_{shape}\), and texts for clothing textures \(T_{texture}\). We first want to synthesize the human parsing map \(S \in \mathbb{R}^{H \times W}\) from the human pose \(P\) and the texts on clothing shapes. The texts are converted to a series of clothing shape attributes \(\{a_1, \dots, a_i, \dots, a_k\}\), where \(a_i\) is an integer label indexing one of the \(C_i\) classes of the \(i\)-th attribute. The attributes are fed to the Attribute Embedding Module, which produces a shape attribute embedding.

$$f_{sh}=Fusion\left(\left[E_1\left(a_1\right), E_2\left(a_2\right),\dots ,E_i\left(a_i\right),\dots ,E_k\left(a_k\right)\right]\right)$$
(4)
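A hedged sketch of Eq. (4) follows: each shape attribute is embedded by its own embedding table and the embeddings are fused. The fusion operator (concatenation followed by a linear layer) and all dimensions are assumptions for illustration.

```python
# Sketch of the attribute embedding and fusion in Eq. (4).
import torch
import torch.nn as nn

class AttributeEmbedder(nn.Module):
    def __init__(self, class_counts, embed_dim: int = 32, out_dim: int = 128):
        super().__init__()
        # one embedding table E_i per attribute, with C_i classes each
        self.embeddings = nn.ModuleList(nn.Embedding(c, embed_dim) for c in class_counts)
        self.fusion = nn.Linear(embed_dim * len(class_counts), out_dim)

    def forward(self, attributes: torch.Tensor) -> torch.Tensor:
        # attributes: (batch, k) integer class labels a_1 ... a_k
        embedded = [emb(attributes[:, i]) for i, emb in enumerate(self.embeddings)]
        return self.fusion(torch.cat(embedded, dim=-1))   # f_sh in Eq. (4)
```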

Once this is complete, we move on to the implementation using VQGAN. VQGAN is a GAN architecture that can learn from prior data and produce new images. The feature map of the image data is initially fed directly to a GAN to encode the feature map of the visual sections of the images [25]. A codebook, or dictionary of codes, is created from the vector-quantized data and stored.

The loss produced in GAN is LGAN:

$$L_{GAN}\left(N,D\right)=\left[\log D\left(x\right)+\log \left(1-D\left(\widehat{x}\right)\right)\right]$$
(5)

Vector quantization also happens between the encoder and decoder networks. After encoding the input x into \(\widehat{z}\), i.e., \(\widehat{z}\)=E(x), we perform an element-wise operation q to obtain a discrete version of the input:

$$z_q=q\left(\widehat z\right):=\underset{z_k\in \mathcal{Z}}{\text{argmin}}\;\left\|\widehat z_{ij}-z_k\right\|$$
(6)
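The quantization step in Eq. (6) can be sketched as a nearest-codebook lookup. The straight-through gradient trick in the last line is a standard VQ-VAE detail and an assumption here rather than something stated in the text.

```python
# Sketch of Eq. (6): replace each spatial encoding vector z_hat_ij with its
# nearest codebook entry z_k.
import torch

def quantize(z_hat: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """z_hat: (B, H, W, D) encoder output; codebook: (K, D) learned entries."""
    flat = z_hat.reshape(-1, z_hat.shape[-1])                 # (B*H*W, D)
    dists = torch.cdist(flat, codebook)                       # pairwise ||z_hat - z_k||
    indices = dists.argmin(dim=1)                             # nearest entry per vector
    z_q = codebook[indices].reshape(z_hat.shape)              # discrete representation
    # straight-through estimator so gradients still flow to the encoder
    return z_hat + (z_q - z_hat).detach()
```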

Once this is done, the result is sent to a mixture of experts and feed-forward refinement, where high-quality images are obtained; the next step is to send these images to the 3D modeling stage.

4 Generation of 3D models

To construct the 3D models, the pixels of the 2D images must be aligned with the global context of their corresponding 3D object using a stacked-hourglass approach, while image construction uses CycleGAN as the image-encoding method. A collection of local bounding boxes B is used to define the compositional human NeRF representation \(F_\Phi\) [26]. For each body part k, we employ a subnetwork \(f_k \in F_\Phi\) to model the local bounding box \(\left\{b_{min}^{k}, b_{max}^{k}\right\}\), as seen in Fig. 16. For a specific point \(x_i\) in the canonical coordinate system,

Fig. 16
figure 16

Clockwise: Adam, RMSProp, SGD, Adagrad, and Adamax optimizers

the corresponding radiance \(c_i^k\) and density \(\sigma_i^k\) along direction \(d_i\), for points falling inside the k-th bounding box, are computed by

$$\left\{{c}_{i}^{k} , {\sigma }_{i}^{k}\right\}= {F}_{k}\left({x}_{i}^{k} , {d}_{i}\right),\quad \text{where } {x}_{i}^{k}= \frac{2{x}_{i}-\left\{{b}_{min}^{k}+ {b}_{max}^{k}\right\}}{{b}_{max}^{k}- {b}_{min}^{k}}$$
(7)
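A small sketch of Eq. (7) follows: a canonical point is normalized into its part's bounding box before the part subnetwork is queried. The interface of `f_k` and the inside-box check are assumptions for illustration.

```python
# Sketch of Eq. (7): normalize x_i into the k-th bounding box and query subnetwork f_k.
import torch

def query_part(x_i: torch.Tensor, d_i: torch.Tensor, b_min: torch.Tensor,
               b_max: torch.Tensor, f_k):
    """x_i, b_min, b_max: (3,) tensors; d_i: viewing direction; f_k: part subnetwork."""
    x_norm = (2.0 * x_i - (b_min + b_max)) / (b_max - b_min)   # maps the box to [-1, 1]^3
    inside = bool(((x_norm >= -1.0) & (x_norm <= 1.0)).all())  # only valid inside the box
    if not inside:
        return None
    color, density = f_k(x_norm, d_i)                          # radiance c_i^k, density sigma_i^k
    return color, density
```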

All of this leads to highly detailed clothed humans that accurately replicate the texture and geometry from a single image; compared with existing 3D deep learning models, it provides a highly detailed result [27]. The system is developed using the DeepFashion dataset, which contains a huge range of clothing types. The important implementation and enhancement phase follows: the input image of a clothed human is fed into the model, which reconstructs the given image as a 3D model while replicating and preserving the geometry and texture details present in the image [28]. The result is a clothed human that is an accurate representation of the given 2D clothed human image. For this, we adopt the pixel-aligned implicit function, which is highly memory efficient. The proposed function consists of a fully convolutional image encoder g combined with a continuous implicit function f represented by multi-layer perceptrons (MLPs). All of this is implemented to convert a 2D clothed human image into a 3D model. Here, X is a 3D point,

$$f\left(F(x);z(X)\right)=s,\quad s\in \mathbb{R}$$
(8)

Here, \(s\) is the surface representation, \(x= \pi (X)\) is the 2D projection of the 3D point X onto the clothed human image, and \(z(X)\) is the depth of X in the camera coordinate space, which allows the same point X to be related across multiple views. Further, \(F(x) = g(I(x))\) acts as the image feature at point x [29]. Because the entire module is continuous and not broken into several discrete parts, it enables the reconstruction of shape and texture with a high level of detail along with memory efficiency.
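The pixel-aligned query in Eq. (8) can be sketched as follows: the 3D point is projected to the image plane, the encoder feature is sampled at that location, concatenated with the depth, and passed through an MLP. Layer sizes are assumptions.

```python
# Sketch of Eq. (8): f(F(x), z(X)) -> occupancy s, with F(x) = g(I(x)) sampled at x = pi(X).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAlignedQuery(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(                      # the implicit function f
            nn.Linear(feat_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),           # s in [0, 1]
        )

    def forward(self, feat_map, xy, z):
        """feat_map: (B, C, H, W) encoder output g(I); xy: (B, N, 2) projected points
        in [-1, 1]; z: (B, N, 1) depth of each 3D point in camera coordinates."""
        grid = xy.unsqueeze(2)                                        # (B, N, 1, 2)
        sampled = F.grid_sample(feat_map, grid, align_corners=True)   # (B, C, N, 1)
        sampled = sampled.squeeze(-1).permute(0, 2, 1)                # (B, N, C)
        return self.mlp(torch.cat([sampled, z], dim=-1))              # occupancy s per point
```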

4.1 Single-view and multiple-view surface reconstruction

To begin with, single-view surface reconstruction computes a probability field over the 3D space and obtains the iso-surface of the probability field with the help of the Marching Cubes algorithm. Multiple views of the person provide additional coverage, and the pixel-aligned implicit function allows these extra views to be used so that the shape and texture detailing can be refined further. [30,31,32,33,34,35,36,37] describe different methodologies for processing video. Figure 16 shows, clockwise, the Adam, RMSProp, SGD, Adagrad, and Adamax optimizers. With the help of this model, we can directly predict the RGB colors of the surface geometry, which supports self-occlusion and arbitrary topology in the texturing of shapes. It is challenging to extend the model to predict colors, since RGB colors are defined only on the surface in 3D space, unlike the 3D occupancy field, which is defined throughout the entire 3D volume. Here, we emphasize the evolution of the model's network architecture and training methodology.

4.2 Marching cube algorithm

The Marching Cubes algorithm extracts a surface mesh from the probability (occupancy) field computed over the 3D space by finding the iso-surface of that field. Multiple views of the person can provide additional coverage, and the pixel-aligned implicit function allows these views to be combined so that the shape and texture can be enhanced much more effectively and the detailing made precise.
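The mesh-extraction step can be sketched with scikit-image's Marching Cubes implementation. The grid resolution, threshold, and the `query_fn` callable (mapping (N, 3) points to occupancy values) are assumptions for illustration.

```python
# Sketch: evaluate the implicit function on a regular grid and extract the iso-surface.
import numpy as np
from skimage import measure

def extract_mesh(query_fn, resolution: int = 128, threshold: float = 0.5):
    coords = np.linspace(-1.0, 1.0, resolution)
    grid = np.stack(np.meshgrid(coords, coords, coords, indexing="ij"), axis=-1)
    occupancy = query_fn(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    # marching cubes returns vertices, faces, normals and values of the iso-surface
    verts, faces, normals, _ = measure.marching_cubes(occupancy, level=threshold)
    return verts, faces, normals
```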

figure d

5 Experimental evaluation

The performance metrics used are the Inception score, Fréchet Inception Distance, and Structural Similarity Index.

5.1 Inception score

The Inception Score (IS) is a metric for evaluating the quality of generated images; it measures the balance between the diversity and quality of the generated images. The IS is defined as the exponential of the expected KL divergence between the conditional class distribution \(p\left(y|X\right)\) (i.e., the predicted class probabilities given the generated image) and the marginal class distribution \(p(y)\). In mathematical terms, the IS is given in Eq. (9):

$$IS\left(G\right)=\text{exp}\left({\mathbb{E}}_{x\sim {P}_{g}}\, {D}_{KL}\left( p\left(y|X\right) \,\|\, p\left(y\right) \right)\right)$$
(9)
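Eq. (9) can be computed directly from the classifier probabilities, as sketched below. The sketch assumes the \(p(y|x)\) values have already been obtained upstream from a pre-trained Inception-v3 network.

```python
# Sketch of Eq. (9): Inception Score from pre-computed class probabilities p(y|x).
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """probs: (N, num_classes) array of classifier probabilities for generated images."""
    p_y = probs.mean(axis=0, keepdims=True)                  # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))   # KL(p(y|x) || p(y)) per image
    return float(np.exp(kl.sum(axis=1).mean()))              # exp of the expected KL
```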

5.2 Fréchet inception distance score

The Fréchet Inception Distance (FID) is another statistic for assessing the quality of generated images. It measures the distance in feature space between the distributions of the generated images and the real images. Specifically, the FID compares the means and covariances of the activations of an Inception-v3 network pre-trained on ImageNet, computed on the real and generated images. The FID is defined as:

$$FID={\left\|{\mu }_{r}- {\mu }_{g}\right\|}^{2}+\mathrm{Tr}\left({\Sigma }_{r}+{\Sigma }_{g}-2{\left({\Sigma }_{r}{\Sigma }_{g}\right)}^{1/2}\right)$$
(10)

where \({\mu }_{r}\) and \({\mu }_{g}\) represent the means of the real and generated image feature representations, \({\left\|{\mu }_{r}- {\mu }_{g}\right\|}^{2}\) is the squared Euclidean distance (L2 norm) between these means, \(\mathrm{Tr}\) denotes the trace of a matrix (the sum of the elements on its main diagonal), \({\Sigma }_{r}\) and \({\Sigma }_{g}\) are the covariance matrices of the real and generated image feature representations, and \({\left({\Sigma }_{r}{\Sigma }_{g}\right)}^{1/2}\) is the matrix square root of the product of the two covariance matrices.
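Eq. (10) can be computed as sketched below, assuming the Inception-v3 activations for the real and generated images have already been extracted as two (N, D) arrays.

```python
# Sketch of Eq. (10): FID from pre-computed Inception activations.
import numpy as np
from scipy import linalg

def frechet_inception_distance(act_real: np.ndarray, act_gen: np.ndarray) -> float:
    mu_r, mu_g = act_real.mean(axis=0), act_gen.mean(axis=0)
    sigma_r = np.cov(act_real, rowvar=False)
    sigma_g = np.cov(act_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                                # drop tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```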

5.3 Structural similarity index SSIM

The Structural Similarity Index (SSIM) is used as a metric to measure the similarity between two given images. It is given in Eq. (11):

$$\text{SSIM}\left(\text{x},\text{y}\right)=\frac{\left(2{\mu }_{x}{\mu }_{y}+{c}_{1}\right)\left(2{\sigma }_{xy}+{c}_{2}\right)}{\left({\mu }_{x}^{2}+{\mu }_{y}^{2}+{c}_{1}\right)\left({\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{c}_{2}\right)}$$
(11)

with:

\({\mu }_{x}\): the average of x;

\({\mu }_{y}\): the average of y;

\({\sigma }_{x}^{2}\): the variance of x;

\({\sigma }_{y}^{2}\): the variance of y;

\({\sigma }_{xy}\): the covariance of x and y;

\({c}_{1}={\left({k}_{1}L\right)}^{2}\) and \({c}_{2}={\left({k}_{2}L\right)}^{2}\): two variables that stabilize the division with a weak denominator, where \(L\) is the dynamic range of the pixel values (typically \({2}^{\text{bits per pixel}}-1\)), with \({k}_{1}=0.01\) and \({k}_{2}=0.03\) by default.

Peak Signal-to-Noise Ratio (PSNR) measures the ratio between a signal's maximum possible value (power) and the power of the noise that distorts it and lowers the quality of its representation.

5.4 MSE of X channel

$$MSE_X=\frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}{\left[X\left(i,j\right)-\widehat{X}\left(i,j\right)\right]}^{2}$$
(12)

5.5 Total MSE

$$MSE_t=MSE_R+MSE_G+MSE_B$$
(13)

5.6 Calculate PSNR

$$\text{PSNR}=10\cdot {\log }_{10}\left(\frac{MAX_I^{2}}{MSE_t}\right)$$
(14)
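SSIM and PSNR (Eqs. (11)-(14)) can be computed with scikit-image, whose implementations follow these same definitions. The sketch below assumes uint8 RGB inputs of equal size and scikit-image >= 0.19 for the `channel_axis` argument.

```python
# Sketch: SSIM and PSNR between a real and a generated image using scikit-image.
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def image_quality(real: np.ndarray, generated: np.ndarray):
    """real, generated: (H, W, 3) uint8 arrays of the same size."""
    ssim = structural_similarity(real, generated, channel_axis=-1, data_range=255)
    psnr = peak_signal_noise_ratio(real, generated, data_range=255)  # 10*log10(MAX^2/MSE)
    return ssim, psnr
```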

5.7 Fréchet inception distance: FID

$${d}^{2}={\left\|{\mu }_{1}-{\mu }_{2}\right\|}^{2}+\mathrm{Tr}\left({C}_{1}+{C}_{2}-2\sqrt{{C}_{1}{C}_{2}}\right)$$
(15)

\({\mu }_{1}\) and \({\mu }_{2}\) refer to the means of the individual features of the real and generated images, and \({C}_{1}\) and \({C}_{2}\) are the covariance matrices of the real and generated feature vectors (also represented as \(\Sigma\)).

5.8 Chamfer distance

The Chamfer Distance is a metric for measuring the similarity between two sets of points in a metric space. It calculates the average distance between the points in one set and their nearest neighbor in the other set. The Chamfer Distance can be written as:

$${D}_{chamfer}\left(T,I\right)= \frac{1}{|T|}\sum_{t\in T}\underset{i\in I}{\min }\left\|t-i\right\|$$
(16)
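The nearest-neighbour averaging in Eq. (16) can be sketched with a KD-tree, as below. Whether the distance is taken one-sided or symmetrized is an assumption left as an option.

```python
# Sketch of Eq. (16): Chamfer distance between two point sets via nearest-neighbour queries.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(T: np.ndarray, I: np.ndarray, symmetric: bool = True) -> float:
    """T, I: (N, 3) and (M, 3) arrays of 3D points."""
    d_t = cKDTree(I).query(T)[0].mean()          # mean distance from T to nearest point in I
    if not symmetric:
        return float(d_t)
    d_i = cKDTree(T).query(I)[0].mean()          # and back from I to T
    return float(d_t + d_i)
```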

5.9 SSIM index

SSIM is a metric for determining how similar two photographs are; it gauges similarity in terms of structure, luminance, and contrast. The luminance term is computed by averaging over all of the pixel values; it is denoted by \(\mu\) and given by the formula below.

$${\mu }_{x}= \frac{1}{N}\sum_{i=1}^{N}{x}_{i}$$
(17)

5.10 Earth mover’s distance

The Earth Mover's Distance (EMD), normalized by the total weight of the lighter distribution, is the smallest amount of work required to match x and y. However, because both distributions have the same total weight in this situation, there is no lighter distribution, and the work is simply divided by the total weight of either distribution.

$$EMD=\sum_{i=1}^{m}\sum_{j=1}^{n}{M}_{ij}{d}_{ij}$$
(18)

The performance evaluation is carried out for VQ-SEG using the FashionMNIST dataset. VQ-SEG is a modified version of the VQ-VAE method created specifically for semantic segmentation tasks. The primary distinction between VQ-SEG and the original VQ-VAE is the addition of a segmentation head to the network, which enables it to predict the class label of each pixel in an image. The VQ-SEG technique first uses the encoder network to encode an input image into a lower-dimensional feature space. A vector quantization module then quantizes the feature space into discrete vectors known as embeddings, and the decoder network reconstructs the original image using these embeddings as input. VQ-SEG has performed strongly on semantic segmentation tasks and is a promising methodology for other computer vision problems requiring high-resolution predictions. Table 6 denotes the comparison of various methods for generating 2D images, Table 7 the comparison of various methods for generating 3D models, Table 8 the score table for GAN models and accuracy at various epochs, and Table 9 the overall loss measurement. The GAN models are compared using various optimizers, and the resulting performances are reported. Table 4 shows the various outputs. Figure 17 shows the output of various inputs from FashionMNIST.

Table 6 Comparison of various methods generated 2D images
Table 7 Comparison of various methods generated 3D models
Table 8 Performance comparison of GAN models at various Epochs
Table 9 Overall loss measurement
Fig. 17
figure 17

Output of various inputs FashionMNIST

Table 10 shows the output images for various poses using the Dense-Pose input. Once the input text is given, it is matched with the image name and the parsing image is generated. The evaluation metrics applied to the images and the 3D models were described earlier; how the proposed model performs can be inferred from the tables below, along with a side-by-side comparison of 3D models generated from the input 2D clothed images and newly introduced scores that have not been compared previously.

Table 10 Output of various inputs

MISC is a causal discovery algorithm that integrates the maximal information coefficient (MIC) and Bayesian network structure learning to identify causal relationships between variables in complex systems. The algorithm uses a search strategy that combines greedy hill-climbing and tabu search to identify the optimal causal network structure. MISC outperforms other state-of-the-art causal discovery algorithms and has been applied to real-world problems in various fields. Human-GAN, a pose-conditioned VAE-based technique, generates a variety of human appearances by sampling from a given distribution.

Table 11 denotes the comparison of various methods for generated images, and Table 12 the output of generated 3D models. Body-Net is a deep learning-based method for creating an approximate 3D mesh representation of the human body from a single RGB photograph; the model, developed on a sizable dataset of 3D body scans, can precisely predict body shape, pose, and garment deformation from a single image. The scores are given in Table 13. SiCloPe is a technique for 3D reconstruction from a single RGB picture; it relies on a mix of generative and discriminative models, where the generative model generates a 3D mesh and the discriminative model checks that the produced mesh is compatible with the input image. The scores are given in Table 13. VRN (Volumetric Regression Network) is a neural network architecture created to regress 3D shapes from 2D photographs; it determines the 3D form of an object from a single 2D image by regressing a 3D volume representation of the object, which is processed using 3D convolutional neural networks (CNNs). The scores are given in Table 13.

Table 11 Comparison of various methods for generated images
Table 12 Output of generated 3D models
Table 13 Comparison of various methods for generated 3D models

The above result shows the reconstruction of the 3D model from the given 2D clothed human image that was generated from the textual descriptions. The mask-generated column refers to the inverse of the given image and indicates the region within which the 3D model must be generated. The application of the Marching Cubes algorithm and the stacked hourglass aids in the development of the 3D model. The only outliers are half-body images and low-resolution images. Figure 18 shows the Adam optimizer's reconstruction loss and vq_vae loss at epoch = 100. Figure 19 shows the SGD optimizer's reconstruction loss and vq_vae loss at epoch = 50. Figure 20 shows the RMSProp optimizer's reconstruction loss and vq_vae loss at epoch = 50. Figure 21 shows the AdaGrad optimizer's reconstruction loss and vq_vae loss at epoch = 50.

Fig. 18
figure 18

Adam optimizer displaying reconstruction loss, vq_vae loss where epoch = 100

Fig. 19
figure 19

SGD optimizer displaying reconstruction loss, vq_vae loss where epoch = 50

Fig. 20
figure 20

RMSProp optimizer displaying reconstruction loss, vq_vae loss where epoch = 50

Fig. 21
figure 21

AdaGrad optimizer displaying reconstruction loss, vq_vae loss where epoch = 50

The proposed approach combines several state-of-the-art techniques, including text-driven image synthesis, 3D model estimation, and dataset utilization, to advance the field of clothed human image synthesis and 3D model estimation. While the individual components of the system, such as VQGAN for image generation, the pixel-aligned implicit function for 3D model reconstruction, and the Marching Cubes algorithm for mesh creation, are well established, the novelty lies in the seamless integration of these techniques to enable text-driven synthesis of highly detailed 3D human models from textual descriptions, enhancing the realism and personalization of the shopping experience. Additionally, the comprehensive evaluation of the system's performance using diverse datasets and evaluation metrics highlights its effectiveness and potential impact in real-world applications.

6 Results and discussion

The results of our study demonstrate the effectiveness of the proposed text-driven clothed human image synthesis with 3D human model estimation using VQ-VAE for assistance in shopping. We evaluated the system on a dataset of clothing items with textual descriptions and assessed its performance in terms of image quality, 3D model estimation accuracy, and user satisfaction. The generated images showed remarkable quality and realism. Users found it challenging to distinguish between synthesized images and actual product photographs. This indicates that the VQ-VAE architecture effectively captures fine details, textures, and color variations, resulting in visually convincing images.

The 3D human model estimation component of the system performed admirably. It accurately estimated body shapes and sizes based on textual descriptions and allowed for realistic clothing draping. This feature added an invaluable layer of personalization to the shopping experience, helping users visualize how the clothing items would fit them or others. User feedback and surveys revealed a high level of satisfaction with the system. Users reported that the technology improved their online shopping experience by providing a more immersive and informative means of exploring clothing options. They expressed increased confidence in making purchasing decisions and a reduced likelihood of returning items due to inaccurate fit or appearance.

7 Ethical considerations

The ethical implications of text-driven image synthesis, including privacy, consent, and potential misuse, are paramount considerations in responsible AI development. Generating highly realistic images based on textual descriptions raises concerns regarding individuals' privacy rights, necessitating clear guidelines for safeguarding against unauthorized use and ensuring explicit consent whenever feasible. Moreover, developers must anticipate and mitigate potential misuse of synthesized images, such as spreading disinformation or perpetuating biases, through techniques like watermarking, detection methods, and collaboration with policymakers. Addressing biases in both training data and model architecture is essential to promoting fairness and equity in image synthesis outcomes. Ultimately, regulatory frameworks and governance structures are needed to ensure responsible development and deployment of these technologies, engaging experts across disciplines to establish ethical guidelines and promote societal well-being.

8 Conclusion

In conclusion, the integration of text-driven clothed human image synthesis with 3D human model estimation through the use of VQ-VAE represents a significant advancement in the field of assistance in shopping. This innovative approach offers a promising solution to several challenges in the online shopping experience, providing users with a more immersive and informative way to explore clothing options. By enabling the generation of realistic clothed human images based on textual descriptions, this technology bridges the gap between the textual information available on e-commerce platforms and the visual understanding required for shoppers to make informed decisions.

Shoppers can now obtain a clearer representation of how a specific garment might look on themselves or others, which can lead to more confident purchasing decisions and reduced returns. The incorporation of 3D human model estimation adds another layer of realism and utility to the system. It allows users to visualize how clothing items will fit and drape on the body, taking into account individual body shapes and sizes. This personalized aspect enhances the overall shopping experience, promoting customer satisfaction and reducing the likelihood of mismatched expectations. Moreover, the use of VQ-VAE for image generation ensures high-quality, diverse, and coherent image synthesis, which is crucial for a convincing shopping experience. The model's ability to capture fine details and textures contributes to the realism of the generated images, making them visually indistinguishable from actual photographs. While the technology presented in this study holds great promise, it is essential to acknowledge potential challenges and areas for improvement. Further research and development are needed to optimize the model's performance, especially in handling a wide range of clothing styles and text descriptions. Additionally, addressing ethical concerns such as privacy and the potential for misuse of synthesized images is paramount.

In summary, text-driven clothed human image synthesis with 3D human model estimation using VQ-VAE represents a remarkable advancement in assisting consumers with their online shopping decisions. This technology has the potential to revolutionize the e-commerce landscape by providing a more immersive, informative, and personalized shopping experience. With continued research and refinement, it is poised to become an invaluable tool for both shoppers and retailers, enhancing convenience, reducing returns, and ultimately reshaping the way we shop online.

9 Future work

In the future, the addition of natural language processing (NLP) to the chatbot function of virtual try-on might be another component of enhancement for the current paradigm. As a result, consumers would be able to interact with the chatbot in a way that seems more natural and approachable, improving the usability and accessibility of the virtual try-on experience. The virtual try-on functionality might be enhanced to provide consumers with more customization choices in addition to real-time viewing. Users may, for instance, be given the option to change the fit, color, and material of virtual clothing items to better suit their tastes and requirements. Additionally, IoT applications might go beyond merely real-time data and measurements. To detect user motions and offer individualized feedback on posture and movement patterns, for instance, sensors might be included in clothing items. Users may be able to enhance their general health and well-being thanks to this. The virtual try-on features might be utilized to promote more environmentally friendly and sustainable practices as the fashion industry places a growing amount of emphasis on sustainability. A virtual try-on, for instance, might be utilized to show customers how different clothing items would match up with current things in their closets, assisting them in making better educated and environmentally responsible shopping selections. The improvements to the current paradigm may also incorporate the adoption of blockchain technology to build a system that is more transparent and secure for the fashion business. This may lessen the occurrence of fake items and enhance supply chain management procedures.