1 Introduction

In computer science literature, the handling of characters and fonts has long been studied in the field of Optical Character Recognition (OCR), whose goal is to recognize either printed characters (Satirapiwong and Siriborvornratanakul 2021; Lertsawatwicha et al. 2023) or handwritten characters (Obaidullah et al. 2018, 2019; Santosh et al. 2012; Ghosh et al. 2022) in an image and transform them into machine-readable formats. With the emergence of generative artificial intelligence (AI), tasks involving the design of new character styles and fonts have become attractive as a novel way to harness the power of generative AI. Traditionally, font design is labor-intensive: because of the complexity of a typeface, designers may need a long time to complete one. Font generation with deep learning, which creates a new typeface automatically and efficiently, is therefore advantageous in terms of both time and labor costs. Examples of deep learning-based font generation for Chinese and English are the works of Liu et al. (2022) and Park et al. (2022). However, only a handful of papers have been published on Thai font generation using deep learning. The objective of this work is therefore to study and develop a Thai font generation system using deep learning in order to fill this research gap.

One of the most popular methods for creating fonts is the Generative Adversarial Network (GAN) (Liu et al. 2022; Park et al. 2022; Miyato and Koyama 2018; Karras et al. 2020a). Style-based GAN architecture 2 (StyleGAN2) (Karras et al. 2020a) improved upon earlier GAN models and produces high-quality synthetic images. However, a large dataset is required to achieve the best results. Collecting Thai fonts is challenging due to copyright restrictions, which leaves us with a small training dataset. The work of Feng et al. (2021) showed that with a smaller dataset, a GAN suffers more from overfitting, which limits the model's ability to generalize to new situations. Data augmentation has been suggested as a way to overcome this issue (Zhang and Khoreva 2019; Karras et al. 2020b; Bowles et al. 2018; Tran et al. 2021) and includes rotation, flipping, cropping, translation (Tran et al. 2021), scaling, color transformations, image-space filtering, and image-space corruptions. However, Karras et al. (2020b) pointed out that the leaky augmentation problem must be considered: leaky augmentation occurs when augmentation artifacts appear in the generated samples even though the dataset contains none. Font generation is particularly vulnerable to leaky augmentation, especially with techniques such as rotation and flipping, which change the appearance of a character into an incorrect one.

The work of Karras et al. (2020b) also demonstrated that StyleGAN2 with adaptive discriminator augmentation (StyleGAN2-ADA), a model utilizing a technique known as stochastic discriminator augmentation, can reduce leaky augmentation and overfitting issues. As a result, we chose the StyleGAN2-ADA model to generate Thai fonts in this paper. However, font generation is sensitive to some augmentation techniques; for example, the output characters observed in the work of Tran et al. (2021) could not be read when augmentation was applied. Hence, we ran the model both with and without augmentation to see how it performed.

Although StyleGAN2-ADA can produce better and higher-quality images, it requires more computational resources during training. We therefore additionally utilize Real Enhanced Super-Resolution Generative Adversarial Networks (Real-ESRGAN) (Wang et al. 2021), a super-resolution model that increases image quality. Combining StyleGAN2-ADA and Real-ESRGAN allows us to train the model at small image dimensions, significantly reducing resource consumption. The contributions of this paper can be summarized as follows:

  1. We propose a new Thai font generation system that combines StyleGAN2-ADA (both with and without augmentation) with Real-ESRGAN.

  2. We combine human evaluation and the Fréchet Inception Distance (FID) to evaluate the Thai fonts newly generated by our model and by other comparative models.

2 Related works

2.1 Previous works on font generation

Generative Adversarial Network (GAN) (Goodfellow et al. 2014) is one of the most popular approaches to generating fonts. GAN is an unsupervised learning algorithm with two components. The first is a generator, whose function is to generate data from noise. The second is a discriminator, which determines whether the data produced by the generator are fabricated or genuine (Zeng et al. 2021). Even though GAN is an effective model for generating fonts (Hassan et al. 2023), it still faces problems during training, such as mode collapse, which greatly reduces the variety and quality of generated outcomes (Zeng et al. 2021). Furthermore, GAN is difficult to train due to the large number of glyphs required in the task of font generation (Tang et al. 2022).
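Formally, the generator \(G\) and discriminator \(D\) are trained against each other with the minimax objective introduced by Goodfellow et al. (2014):

$$\underset{G}{\mathrm{min}}\;\underset{D}{\mathrm{max}}\;{\mathbb{E}}_{x\sim {p}_{\mathrm{data}}}\left[\mathrm{log}\,D\left(x\right)\right]+{\mathbb{E}}_{z\sim {p}_{z}}\left[\mathrm{log}\left(1-D\left(G\left(z\right)\right)\right)\right]$$

where \(x\) is drawn from the real data distribution and \(z\) is the input noise vector.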

The paper of Hayashi et al. (2019) proposed GlyphGAN, a model for generating English fonts. GlyphGAN has three distinctive features: the style of each generated font is consistent, the generated fonts are legible, and they are distinct from the training images. However, there are limitations. The training dataset of Hayashi et al. (2019) covers only the English alphabet; for a language with a considerably larger number of letters, the character-class vector must be extended. The second limitation is imperfect font legibility, which can be improved by enlarging the training dataset. Lastly, GlyphGAN still cannot generate a font in a specific, user-chosen style. Another paper, StrokeGAN (Zeng et al. 2021), was proposed in 2021 and focuses on solving GAN's mode collapse by using a one-bit stroke encoding with CycleGAN. As a result, StrokeGAN numerically yielded state-of-the-art results in many font generation tasks. However, StrokeGAN was designed for Chinese, Japanese, and Korean fonts, whose characteristics are totally different from Thai fonts. While GANs have been widely used for font generation, some works, like FontRNN (Tang et al. 2019) in 2019, proposed an alternative encoder–decoder recurrent neural network (RNN), which is said to handle the stroke-order information of Chinese characters effectively.

2.2 Image super-resolution

Super-resolution of images refers to enhancing the resolution of an image using various techniques. One milestone among these techniques is the Super-Resolution Generative Adversarial Network (SRGAN) (Ledig et al. 2017). SRGAN can improve an image's texture to be more realistic and closer to the ground truth. However, the algorithm can be sensitive to noise and artifacts, resulting in poor image quality. To address this issue, the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) (Wang et al. 2018) was introduced. It first improves the network structure of SRGAN by employing residual-in-residual dense blocks (RRDB). The discriminator is then changed to a relativistic average GAN (RaGAN), which learns to judge whether one image is more realistic than another. Lastly, the perceptual loss is improved by using VGG features before activation.
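For reference, ESRGAN's relativistic average discriminator estimates how much more realistic a real image \({x}_{r}\) is than the average generated image \({x}_{f}\):

$${D}_{Ra}\left({x}_{r},{x}_{f}\right)=\sigma \left(C\left({x}_{r}\right)-{\mathbb{E}}_{{x}_{f}}\left[C\left({x}_{f}\right)\right]\right)$$

where \(C(\cdot )\) is the raw discriminator output and \(\sigma\) is the sigmoid function (Wang et al. 2018).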

The work of Wang et al. (2021) later proposed an improved method called Real-ESRGAN. They implemented multiple adaptations, such as a U-Net discriminator with spectral normalization, and trained the model with purely synthetic data. According to Wang et al. (2021), Real-ESRGAN produces sharper and more natural textures than its predecessor and reduces ringing and overshooting artifacts. Accordingly, we hypothesize that using this super-resolution model should help raise the generated fonts in our work to a higher resolution.
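As an illustration of how such a model can be applied to a generated glyph image, the following is a minimal sketch using the public realesrgan package; the weight file, scale factor, and file paths are illustrative assumptions and do not reflect the exact configuration used later in this paper.

```python
# Hedged sketch: upscaling a generated font image with the public Real-ESRGAN package.
# Assumes `pip install realesrgan` and a locally downloaded RealESRGAN_x4plus checkpoint.
import cv2
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer

# Standard RRDB backbone matching the x4plus weights (illustrative settings).
backbone = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64,
                   num_block=23, num_grow_ch=32, scale=4)
upsampler = RealESRGANer(scale=4,
                         model_path="weights/RealESRGAN_x4plus.pth",
                         model=backbone, tile=0, half=False)

img = cv2.imread("generated_font.png", cv2.IMREAD_COLOR)
upscaled, _ = upsampler.enhance(img, outscale=4)   # returns the upscaled image array
cv2.imwrite("generated_font_sr.png", upscaled)
```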

3 Proposed method

3.1 Dataset

For dataset preparation, we collected 300 Thai fonts in the TrueType font file format (TTF) from https://www.f0nt.com/. Each font contains Thai characters (letters, vowels, tone marks), English characters, and other symbols. To construct our image dataset from these 300 Thai fonts, we selected 50 Thai letters and used the Pillow library to render them into a single 512 \(\times\) 512 image per font, as shown in Fig. 1.

Fig. 1 Examples of the 512 \(\times\) 512 image that represents one Thai font in our image dataset
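As a rough illustration of this rendering step, the sketch below lays a subset of Thai letters out on a fixed grid with Pillow; the exact letter list, grid layout, and font size used to produce Fig. 1 are not specified here, so the values below are assumptions.

```python
# Hedged sketch: rendering one TTF font into a single 512x512 grayscale image.
from PIL import Image, ImageDraw, ImageFont

# Illustrative subset of Thai letters (the paper uses 50 letters per font).
THAI_LETTERS = list("กขคงจฉชซญดตถทนบปผพฟภมยรลวศษสหอฮ")

def render_font_sheet(ttf_path, letters=THAI_LETTERS, size=512, cols=8, font_size=56):
    img = Image.new("L", (size, size), color=255)      # white background
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(ttf_path, font_size)
    cell = size // cols                                # width/height of one grid cell
    for i, ch in enumerate(letters):
        x = (i % cols) * cell + cell // 4
        y = (i // cols) * cell + cell // 8
        draw.text((x, y), ch, font=font, fill=0)       # draw the glyph in black
    return img

# Example (hypothetical path):
# render_font_sheet("fonts/ExampleThaiFont.ttf").save("dataset/example_font.png")
```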

3.2 Font generation by StyleGAN2-ADA

StyleGAN2 with adaptive discriminator augmentation, or StyleGAN2-ADA (Karras et al. 2020b), is a variant of StyleGAN2, a state-of-the-art GAN for synthesizing high-quality images. Continuing from the success of ProGAN (Karras et al. 2018) in 2018, the original StyleGAN (Karras et al. 2019) was introduced in 2019 by NVIDIA researchers as a way to generate realistic images of faces and other objects. In the following year (2020), StyleGAN2 (Karras et al. 2020a) was introduced as an improved version with several new features and modifications. StyleGAN2-ADA (Karras et al. 2020b) is an extension of StyleGAN2 that aims to further improve the quality and diversity of the generated images. The main contribution of StyleGAN2-ADA is adaptive discriminator augmentation (ADA), which adaptively adjusts the strength of the augmentations applied to the discriminator's inputs according to how much the discriminator overfits the training data. This helps StyleGAN2-ADA better capture the features and characteristics of the training data, leading to more realistic and diverse generated images.

One of the key advantages of StyleGAN2-ADA is its ability to train with limited data, because the adaptive augmentation keeps the discriminator from overfitting when only a small number of training images is available. This allows the model to learn more effectively from small datasets, making it a useful tool for scenarios like ours, where large amounts of data are not available. Overall, StyleGAN2-ADA is a powerful and effective GAN for synthesizing high-quality images and has the potential to be used in a wide range of applications, including computer graphics, image editing, and machine learning research.

From previous works, training a font generator usually requires a large dataset. Because our dataset is small, we decided to use StyleGAN2-ADA (Karras et al. 2020b), as it allows the model to learn effectively on limited data. We then compared the results of StyleGAN2-ADA with and without augmentation, because fonts may be sensitive to augmentation. Finally, we pick the better of the two and apply Real-ESRGAN for blind super-resolution to attain better visual quality. Table 1 summarizes the three related works that have contributed most significantly to this paper.

Table 1 Summary table of related works which are major parts of our paper

3.3 Baseline model

For experimental purposes, we generate new Thai fonts with our model architecture as illustrated in Fig. 2a and compare them with other generative models, namely GlyphGAN (Hayashi et al. 2019) and Variational AutoEncoder (VAE) models, using the same 3500 training epochs for all experiments. We begin by comparing StyleGAN2-ADA with augmentation, as illustrated in Fig. 2b, and without augmentation, as illustrated in Fig. 2c, using FID and the human score to determine which is better for the Thai font dataset, and then apply the Real-ESRGAN technique to improve image resolution.

Fig. 2 a Our model architecture, b StyleGAN2-ADA with augmentation architecture (Karras et al. 2020b), and c StyleGAN2-ADA without augmentation architecture (Karras et al. 2020b)

According to Karras et al. (2020b), augmentation with pixel blitting, general geometric transformations, and color transformations is the most effective. However, our dataset does not need color transformations because the images are grayscale. Therefore, we used only pixel blitting and general geometric transformations for the augmented StyleGAN2-ADA model. Pixel blitting involves horizontal flipping, 90° rotations, and integer translation, while the general geometric transformations involve isotropic and anisotropic scaling, arbitrary rotation, and fractional translation; the strength of both categories is controlled by the augmentation probability. This probability depends on the amount of overfitting and changes dynamically during training. If the probability of augmentation is zero, no augmentation is applied.
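The sketch below illustrates this adaptive behavior, following the overfitting heuristic of Karras et al. (2020b) that tracks the sign of the discriminator outputs on real images; the target value and adjustment step are illustrative assumptions rather than the exact settings of our training runs.

```python
# Hedged sketch of the adaptive augmentation probability used by StyleGAN2-ADA.
import numpy as np

class AdaController:
    def __init__(self, target=0.6, step=0.01):
        self.target = target   # desired value of the overfitting heuristic r_t
        self.step = step       # how fast p is nudged per update (illustrative)
        self.p = 0.0           # current augmentation probability

    def update(self, real_logits):
        # r_t = E[sign(D(x_real))]: close to 1 when the discriminator is overfitting.
        r_t = float(np.mean(np.sign(real_logits)))
        # Raise p when overfitting, lower it otherwise, clamped to [0, 1].
        self.p = float(np.clip(self.p + np.sign(r_t - self.target) * self.step, 0.0, 1.0))
        return self.p

# During training, each augmentation (blit, geometric) would be applied to the
# discriminator's inputs independently with probability controller.p.
```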

3.4 Training

We trained the StyleGAN2-ADA model with our Thai-font image dataset. Two versions of StyleGAN2-ADA were trained for 3500 epochs, one with augmentation (pixel blitting and general geometric transformations, without horizontal flip) and the other without augmentation. Then, the pre-trained RealESRGAN_x8plus.pth model (Wang et al. 2021) was used to generate images at a higher resolution. For comparison purposes, we also trained GlyphGAN and VAE on the same dataset for the same number of training epochs; the GlyphGAN implementation was retrieved from Moritz Salla (2021), whereas the VAE implementation was from Yigit Atay (2020). After obtaining the image results from each experiment, we calculated a similarity score (using the ImageChops difference method) between each output image and every image in the dataset. The dataset image with the highest similarity score to an output is the sample image used in the next step of human evaluation.
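A minimal sketch of this similarity check is shown below, assuming it is based on Pillow's ImageChops.difference; normalizing the mean pixel difference into a 0–1 similarity score is our illustrative choice, not necessarily the exact formula used.

```python
# Hedged sketch: similarity between a generated image and a dataset image.
from PIL import Image, ImageChops, ImageStat

def similarity(generated_path, dataset_path, size=(512, 512)):
    a = Image.open(generated_path).convert("L").resize(size)
    b = Image.open(dataset_path).convert("L").resize(size)
    diff = ImageChops.difference(a, b)            # per-pixel absolute difference
    mean_diff = ImageStat.Stat(diff).mean[0]      # average difference in [0, 255]
    return 1.0 - mean_diff / 255.0                # 1.0 means the images are identical

# The dataset image with the highest similarity to an output becomes the
# "sample image" shown to participants in the diversity section:
# best_match = max(dataset_paths, key=lambda p: similarity("output.png", p))
```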

For each experiment, we used the following device specifications: an NVIDIA® T4 GPU, an Intel(R) Xeon(R) CPU @ 2.20 GHz (2 vCPUs), 32 GB of RAM, and 16 GB of GPU RAM. Our programming environment includes Python 3.8.16, PyTorch 1.9.1, CUDA toolkit 11.1, and imageio-ffmpeg 0.4.3.

3.5 Evaluation metrics

3.5.1 Fréchet inception distance (FID)

Fréchet Inception Distance (FID) (Heusel et al. 2017) is one of the quantitative metrics used to evaluate the performance of a generative model according to Obukhov and Krasnyanskiy (2020). It measures the distance between the distribution of images produced by the model and the distribution of real images. The idea is that if the model can produce images that are indistinguishable from real images, the FID score will be low.
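Concretely, FID fits a multivariate Gaussian to the Inception features of the real and generated image sets and computes

$$\mathrm{FID}={\Vert {\mu }_{r}-{\mu }_{g}\Vert }_{2}^{2}+\mathrm{Tr}\left({\Sigma }_{r}+{\Sigma }_{g}-2{\left({\Sigma }_{r}{\Sigma }_{g}\right)}^{1/2}\right)$$

where \(({\mu }_{r},{\Sigma }_{r})\) and \(({\mu }_{g},{\Sigma }_{g})\) are the feature means and covariances of the real and generated images, respectively.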

3.5.2 Human score

To qualitatively evaluate the generated Thai fonts, we conducted a survey of 50 participants, excluding anyone involved in the experiments in order to prevent bias. Participants were asked to rate each result on a scale of 0–5, where 5 refers to the best image. The survey is divided into three sections that evaluate legibility (Can you read the generated Thai font characters? If not, could you please identify the unreadable character?), visual appeal (How do the results look? Blurry? Sharp edges? Free of smudges?), and diversity (How does the result differ from the sample image?).

3.5.3 Total score

A total score is calculated as a weighted average of the FID and human scores, as in Eqs. (1)–(3). Because generating new Thai fonts is the focus of this work, we concentrate most heavily on the "diversity" section and set its weight at 70%, with weights of 20% for "legibility" and 10% for "visual appeal", as shown in Eq. (1). Both the human score in Eq. (1) and the FID score in Eq. (2) are rescaled to the range 0–1, where 1 refers to the best result. Finally, both scores are equally weighted to yield the final total score, ranging from 0 to 10, in Eq. (3).

$${H}_{i}=(0.2{x}_{1i}+0.1{x}_{2i}+ 0.7{x}_{3i})/5$$
(1)
$${F}_{i}=\left(\left(\sum\nolimits_{j=1}^{n}{y}_{j}\right)-{y}_{i}\right)/\left(\sum\nolimits_{j=1}^{n}{y}_{j}\right)$$
(2)
$${T}_{i}=10(0.5{H}_{i}+0.5{F}_{i})$$
(3)

\({x}_{1i}\): human score for the legibility section for model \(i\). \({x}_{2i}\): human score for the visual appeal section for model \(i\). \({x}_{3i}\): human score for the diversity section for model \(i\). \({y}_{i}\): FID score for model \(i\). \(n\): number of models. \({H}_{i}\): average human score for model \(i\). \({F}_{i}\): average FID score for model \(i\). \({T}_{i}\): total score for model \(i\).
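For clarity, the sketch below reproduces Eqs. (1)–(3) in code; the survey averages and FID values in the example call are placeholders, not the numbers reported in our experiments.

```python
# Hedged sketch: computing the total score of Eqs. (1)-(3) for several models.
def total_scores(human_scores, fids):
    """human_scores: list of (legibility, visual appeal, diversity) averages on a 0-5 scale.
    fids: list of FID values, one per model (lower is better)."""
    fid_sum = sum(fids)
    totals = []
    for (x1, x2, x3), y in zip(human_scores, fids):
        h = (0.2 * x1 + 0.1 * x2 + 0.7 * x3) / 5   # Eq. (1): human score in [0, 1]
        f = (fid_sum - y) / fid_sum                # Eq. (2): rescaled FID in [0, 1]
        totals.append(10 * (0.5 * h + 0.5 * f))    # Eq. (3): total score in [0, 10]
    return totals

# Example with three hypothetical models (placeholder numbers):
print(total_scores([(4.0, 3.5, 3.6), (4.5, 4.2, 2.0), (2.5, 2.0, 3.6)],
                   [14.0, 25.0, 60.0]))
```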

4 Experimental results

4.1 Comparison between StyleGAN2-ADA with and without augmentation

Examples of Thai fonts generated by the two models are shown in Fig. 3. Figure 4 plots the FID scores during training: StyleGAN2-ADA without augmentation has a lower (better) FID score in the early epochs. The FID scores of both models then gradually decrease over time, and the model without augmentation becomes more stable around the 2000th epoch, whereas StyleGAN2-ADA with augmentation fluctuates more in the late epochs, indicating that the model without augmentation produces fonts more similar to the ground truth. According to Table 2, the average training times per epoch of StyleGAN2-ADA with and without augmentation are similar.

Fig. 3 Example results of the two StyleGAN2-ADA models. For each pair of images, the sample image and the image generated by the model are placed on the left and right, respectively

Fig. 4 FID scores of StyleGAN2-ADA with and without augmentation during 3500 training epochs

Table 2 Training time per epoch of StyleGAN2-ADA with and without augmentation

In the human evaluation, the results of StyleGAN2-ADA with and without augmentation are shown in Table 3. StyleGAN2-ADA without augmentation outperforms StyleGAN2-ADA with augmentation on two parameters: legibility and visual appeal. In terms of diversity, although StyleGAN2-ADA with augmentation numerically outperforms the model without augmentation (Table 3), many Thai characters produced by it suffer from poor legibility, such as those with a two-pronged head (group 3) and those with a round head starting from the middle of the line and turning towards the left (group 7), as shown in Table 4. We therefore conclude that augmentation helps StyleGAN2-ADA obtain diverse fonts while sacrificing legibility and visual appeal. Note that, for each model, the best image results were chosen from 10 random seeds before the similarity score between the output image and every image in the dataset was calculated. The image used in the diversity section of our human evaluation is the dataset image with the highest similarity score, called the sample image.

Table 3 Score comparison between our two models of StyleGAN2-ADA with and without augmentation
Table 4 Thai fonts as generated by our StyleGAN2-ADA models with and without augmentation

For the total scores shown in Table 3, StyleGAN2-ADA with augmentation scores higher (better) than the model without augmentation (8.23 vs. 7.97) because a large weight is assigned to the diversity score. StyleGAN2-ADA without augmentation is legible and visually pleasing but not diverse, which we attribute to mode collapse, a phenomenon that can occur when training generative models such as GANs and that causes the generator to produce outputs that are all similar to one another. This is a problem because it reduces the overall diversity of the generated data and can lead to poor performance on downstream tasks.

4.2 Comparison between using and not using super-resolution

As shown in Fig. 5, the model with Real-ESRGAN clearly achieves better resolution: the edges of the font are sharper, and even at 200 × magnification the font remains unblurred, unlike the version without Real-ESRGAN.

Fig. 5 Example of fonts (200 × zoomed) generated from StyleGAN2-ADA without (left) and with (right) Real-ESRGAN

4.3 Comparison between our model (StyleGAN2-ADA with augmentation + Real-ESRGAN) and other models

In Fig. 6, the FID scores of the VAEs are the lowest from the early epochs onwards. On the other hand, the FID scores of GlyphGAN fluctuate the most and constantly increase over time. Our StyleGAN2-ADA + Real-ESRGAN model has the highest (worst) FID scores at first, but its scores then decrease and become more stable around the 1900th epoch. In the end, our model yields the lowest FID score (13.82 ± 0.43) around the 3500th epoch, as shown in Table 5. Examples of fonts generated by each model are shown in Fig. 7.

Fig. 6 FID scores of our model compared to GlyphGAN and VAE

Table 5 Comparing our models (StyleGAN2-ADA with augmentation and Real-ESRGAN) with other deep learning models proposed for font generation
Fig. 7 a Original dataset, b our model architecture, c VAEs, d GlyphGAN without augmentation

For the human evaluation, the results of our model, the VAEs, and GlyphGAN are shown in Table 5. The VAEs outperform our model and GlyphGAN on legibility and visual appeal. In terms of diversity, our model and GlyphGAN have comparable scores, both outperforming the VAEs. Overall, the total score of our model is higher than those of the other models (ours = 8.37, VAEs = 7.57, GlyphGAN = 4.05). The fonts generated by our model are comparatively legible, visually pleasing, and diverse. However, for the inference times in Table 6, the VAE is the fastest.

Table 6 The average and standard deviation of test time per epoch

5 Discussion and future works

Compared with the VAEs and GlyphGAN, our model is more effective in terms of diversity, FID, and total score, matching the purpose of this paper, which is to generate new fonts with diverse styles. However, the diversity of GlyphGAN (3.60 ± 1.35) is not much different from ours (3.62 ± 1.08). As for the VAEs, even though their results outperformed ours on the visual-appeal and test-time parameters, their generated fonts are too similar to the fonts in the original dataset, contradicting our goal of achieving a wide range of newly generated fonts. For legibility, our model appears to underperform, as some generated characters are hard to read. Therefore, the fonts generated by our model are not yet ready for actual use.

In future work, we aim to enhance font legibility by fine-tuning the model parameters. Additionally, the distinctive characteristics of the Thai language will be explored, wherein a single word combines letters, vowels, and tone marks. Incorporating supplementary characters, such as the 16 vowel characters, the four tone marks, and other symbols, has the potential to further refine future models. This refinement would facilitate the generation of coherent Thai words rather than focusing solely on individual characters. Moreover, a more exhaustive analysis should be conducted to better establish the efficacy and technical intricacies of the proposed methods, particularly in terms of novelty and technical depth.

6 Conclusion

In this paper, we used StyleGAN2-ADA and Real-ESRGAN to generate Thai fonts. Our results show that StyleGAN2-ADA with augmentation is superior to the other methods we tested (i.e., StyleGAN2-ADA without augmentation, VAEs, and GlyphGAN) because it produces fonts that are significantly different from the original training set. Additionally, using Real-ESRGAN enhances the legibility and visual appeal of the generated fonts. This paper presents experimental evidence that StyleGAN2-ADA is effective in reducing leaky augmentation and overfitting problems. It is worth noting that our model is designed to achieve a goal different from that of previous works: generating fonts that noticeably differ from the original training set. Finally, applying Real-ESRGAN resulted in a noticeable improvement in the sharpness of the edges of the generated fonts.