
1 Introduction

Font style is the visual form given to text and plays a crucial role in conveying information. It can even carry deeper meaning, such as whether the content is delightful or horrifying. Designing a font is time-consuming and demands a high level of professional skill: the designer must craft appropriate artistic effects for strokes so that the font conveys the intended style while preserving the original content of the character. Moreover, when building a large multilingual font library, the designer must spend considerable time and effort keeping characters of different languages in a consistent style, which requires not only professional knowledge and skills but also proficiency in those languages.

Therefore, automatic font generation via neural networks has attracted the attention of researchers, and many GAN [5]-based models for automatic font generation have been proposed. Early models [13, 14, 24, 25] need to be pre-trained on large datasets and then fine-tuned for specific tasks, which requires substantial computational resources and considerable effort to collect training samples. Recently, many few-shot learning methods [3, 9, 11, 19, 20, 23, 31, 32] have been proposed specifically for the font generation task; these models can generate complete font libraries in the same language from a small number of reference samples.

Nevertheless, in many scenarios, such as designing novel covers for different translations, movie promotional posters for different countries, and user interfaces for international users, characters of different languages must share the same font style. At the same time, characters of different languages vary greatly in glyph structure; for example, the strokes and structures of English letters are very different from those of Chinese characters, and many components of Chinese characters have no counterparts in English letters. This makes learning the style of characters from another language difficult and requires the model to capture high-level style characteristics. Some efforts [15] therefore attempt to use the self-attention mechanism to capture style patterns in another language. However, they ignore the potential inter-image correlations between different reference images and thus fail to learn sufficiently essential features in the other language. We therefore propose to learn a better style representation in another language by analyzing the inter-image relationships among all reference images rather than only considering intra-image connections.

In this paper, we propose a novel model named CLF-Net. Its core idea is to learn essential style features in another language by modeling the inter-image relationships among all reference images. Specifically, we design a Multi-scale External Attention Module (MEAM) to capture style features at different scales. The MEAM not only considers the intra-image connections between different regions of a single reference image but also implicitly explores the potential inter-image correlations across the style images as a whole, which makes it possible to extract the geometric and structural patterns consistently present in the style images and thus learn unified essential style information at different scales in another language. In addition, considering that boundary pixels play a key role in determining the overall style of Chinese characters, we define an Edge Loss that compels the model to preserve more edge information, so the generated characters have sharper edges and less blur. Combining these components, CLF-Net achieves high-quality cross-language font generation.

Our contributions can be summarized as follows:

1) We are the first to implicitly consider inter-image associations rather than only intra-image connections, and we propose a novel few-shot cross-language font generation network called CLF-Net.

2) We design a Multi-scale External Attention Module (MEAM) to learn the unified essential style information at different scales from characters of another language, addressing the inability of existing font generation models to fully exploit style information in another language.

3) We introduce an Edge Loss function that makes the model generate characters with sharper edges.

4) By modeling the inter-image relationships, our approach achieves significantly better results than state-of-the-art methods.

2 Related Works

2.1 Image-to-Image Translation

Image-to-image (I2I) translation aims to learn a mapping function from the source domain to the target domain. Pix2pix [12] uses a conditional GAN-based network that requires a large amount of paired data for training. To alleviate the difficulty of obtaining paired data, CycleGAN [33] introduces cycle-consistency constraints, which allow I2I methods to learn cross-domain translations without paired data. FUNIT [16] proposes a few-shot unsupervised image generation method that accomplishes the I2I translation task by encoding content images and style images separately and combining them with Adaptive Instance Normalization (AdaIN) [10]. Intuitively, font generation is a typical I2I translation task that maps a source font to a target font while preserving the original character structure. Therefore, many font generation methods are based on I2I translation methods.

2.2 Automatic Font Generation

We categorize automatic font generation methods into two classes: many-shot and few-shot methods. Many-shot font generation methods [13, 14, 24, 25] aim to learn the mapping function between source fonts and target fonts. Although effective, these methods are impractical because they typically first train a translation model and then fine-tune it with many reference glyphs, e.g., 775 glyphs for [13, 14].

Based on the kind of feature representation used, few-shot font generation methods can be divided into two main categories: global feature representation [1, 4, 27, 31] and component-based feature representation [3, 9, 11, 19, 20, 28, 32]. Global feature representation methods, such as EMD [31] and AGIS-Net [4], synthesize a new glyph by combining a style vector and a content vector, but they show poor synthesis quality for unseen font styles. Since glyph styles are highly complex and fine-grained, it is very difficult to generate fonts from global feature statistics alone. Instead, component-based methods focus on designing a feature representation associated with glyph components or localized features. LF-Font [19] designs a component-based style encoder that extracts component-wise features from reference images. MX-Font [20] designs multiple localized encoders and utilizes component labels as weak supervision to guide each encoder toward different local style patterns. DFS [32] proposes the Deep Feature Similarity architecture, which computes the feature similarity between the input content images and style images to generate the target images.

In addition, some efforts [9, 11, 15, 21, 23] attempt to use the attention mechanism [26] for the font generation task. RD-GAN [11] utilizes the attention mechanism to extract rough radicals from content images. FTransGAN [15] captures local and global style features based on the self-attention mechanism [29]. Our Multi-scale External Attention Module (MEAM), motivated by the external attention mechanism [7], extracts essential style features at different scales for cross-language font generation.

Fig. 1. Architecture overview of the CLF-Net. \(z_c\)/\(z_s\) denotes the content/style latent feature. Conv denotes a convolutional layer. BN denotes BatchNorm. MEAM denotes the Multi-scale External Attention Module. ConvT denotes a transposed convolutional layer.

3 Method Description

This section describes our method for few-shot cross-language font generation, named CLF-Net. Given a content image and several stylized images, our model aims to generate the character of the content image with the font of the style images. The general structure of CLF-Net is shown in Fig. 1. Like other few-shot font generation methods, CLF-Net adopts the framework of GAN, including a Generator G and two discriminators: content discriminator \(D_c\) and style discriminator \(D_s\). Moreover, to make the model show enough generalization ability to learn both local and global essential style features in another language, we propose a Multi-scale External Attention Module (MEAM). More details are given in Sect. 3.2.

3.1 Network Overview

We regard the few-shot font generation task as modeling the conditional probability \(p_{gt}(x|I_c,I_s)\), where \(I_c\) is a content image in a standard style (e.g., Microsoft YaHei), \(I_s\) is a small set of style images sharing the same style but depicting different contents, and x denotes the target image with the same character as \(I_c\) and a style similar to \(I_s\). Since our task is cross-language font generation, \(I_c\) and \(I_s\) come from different languages; we therefore choose a Chinese character as the content image and a few English letters as the style images to train CLF-Net. The generator G consists of two encoders and a decoder. The content encoder \(e_c\) captures the structural features of the character content, and the style encoder \(e_s\) learns the style features of the given stylized font; the two encoders thus extract the content latent feature and the style latent feature, respectively. The decoder d then takes the extracted information and generates the target image \(\hat{x}\). The generation process can be formulated as:

$$\begin{aligned} z_c=e_c\left( I_c\right) , z_s=e_s\left( I_s\right) , \end{aligned}$$
(1)
$$\begin{aligned} \hat{x}=G\left( I_c,I_s\right) =d\left( z_c,z_s\right) , \end{aligned}$$
(2)

where \(z_c\) and \(z_s\) represent the content latent feature and style latent feature.

The content encoder consists of three convolutional blocks, each of which includes a convolutional layer followed by BatchNorm and ReLU. The kernel sizes of the convolutional layers are 7, 3, and 3, respectively.
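
As a concrete illustration, the following is a minimal PyTorch sketch of such a convolutional block and the content encoder, assuming a grayscale glyph input; the channel widths and strides are hypothetical, since the text only specifies the kernel sizes and the Conv-BatchNorm-ReLU structure.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel_size, stride):
    # Conv -> BatchNorm -> ReLU, as described above
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride, padding=kernel_size // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Kernel sizes 7, 3, 3 come from the text; channel widths and strides are assumptions.
content_encoder = nn.Sequential(
    conv_block(1, 64, kernel_size=7, stride=1),    # grayscale glyph image in
    conv_block(64, 128, kernel_size=3, stride=2),
    conv_block(128, 256, kernel_size=3, stride=2),
)
```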

The style encoder has the same structure as the content encoder, including three convolutional blocks. Moreover, inspired by FTransGAN [15] and external attention [7], we design a Multi-scale External Attention Module (MEAM) after the above layers to capture essential style features at different scales. More details are given in Sect. 3.2.

The decoder takes the content feature \(z_c\) and style feature \(z_s\) as input and outputs the generated image \(\hat{x}\). The decoder consists of six ResNet blocks [8] and two transposed convolutional layers that upsample the spatial dimensions of the feature maps. Each transposed convolutional layer is followed by BatchNorm and ReLU.
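
A companion sketch of the decoder and the generator forward pass of Eqs. (1)-(2), under the same caveats: six residual blocks followed by two transposed-convolution blocks (ConvT -> BatchNorm -> ReLU) come from the text, while the concatenation of \(z_c\) and \(z_s\), the output convolution with tanh, and all channel widths are assumptions.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block [8] used inside the decoder."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)

def up_block(in_ch, out_ch):
    # ConvT -> BatchNorm -> ReLU, doubling the spatial resolution
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2, padding=1, output_padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Decoder(nn.Module):
    def __init__(self, in_ch=512):           # channels of [z_c; z_s] after concatenation
        super().__init__()
        self.res = nn.Sequential(*[ResBlock(in_ch) for _ in range(6)])
        self.up = nn.Sequential(up_block(in_ch, 128), up_block(128, 64))
        self.out = nn.Conv2d(64, 1, 7, padding=3)

    def forward(self, z_c, z_s):
        z = torch.cat([z_c, z_s], dim=1)      # Eq. (2): x_hat = d(z_c, z_s)
        return torch.tanh(self.out(self.up(self.res(z))))
```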

The discriminators include a content discriminator and a style discriminator, which check the matching degree from the content and style perspectives, respectively. Following the design of PatchGAN [12], the two patch discriminators evaluate image patches of the real and generated images to assess their features both locally and globally.

3.2 Multi-scale External Attention Module

Since the self-attention mechanism [29] is applicable to the GAN [5] framework, both generators and discriminators can model relationships between widely separated spatial regions. However, self-attention only considers the relationships between elements within a single data sample and ignores potential relationships between elements in different references, which may limit its ability and flexibility. Intuitively, incorporating correlations between different style reference images of the same font contributes to a better feature representation for cross-language font generation.

External attention [7] has linear complexity and implicitly considers the correlations between all references. As shown in Fig. 2a, external attention calculates an attention map between the input feature F (the flattened input pixels) and an external memory unit \( M\in \mathbb {R}^{S\times d}\) by:

$$\begin{aligned} A =(\alpha _{i,j})=\textrm{Norm}(FM^T), \end{aligned}$$
(3)
$$\begin{aligned} F_{out} =AM, \end{aligned}$$
(4)

Here, \(\alpha _{i,j}\) in Eq. (3) is the similarity between the i-th pixel of F and the j-th row of M, where M is an input-independent learnable parameter that serves as a memory of the whole training dataset, and A is the attention map inferred from this learned dataset-level prior knowledge.

External attention separately normalizes columns and rows using the double-normalization method proposed in [6]. The formula for this double-normalization is:

$$\begin{aligned} (\tilde{\alpha }_{i,j}) =FM_k^T, \end{aligned}$$
(5)
$$\begin{aligned} \hat{\alpha }_{i,j} =\exp {(\tilde{\alpha }_{i,j})}/\sum _k\exp {(\tilde{\alpha }_{k,j})}, \end{aligned}$$
(6)
$$\begin{aligned} \alpha _{i,j} =\hat{\alpha }_{i,j}/\sum _k\hat{\alpha }_{i,k}. \end{aligned}$$
(7)

Finally, it updates the input features from M according to the similarities in A. In practice, external attention uses two different memory units, \(M_k\) and \(M_v\), as the key and value to increase the capacity of the network. This slightly alters the computation of external attention to

$$\begin{aligned} A=\textrm{Norm}(FM_k^T), \end{aligned}$$
(8)
$$\begin{aligned} F_{out}=AM_v. \end{aligned}$$
(9)
Fig. 2. The architecture of the External Attention Block and the style encoder with the Multi-scale External Attention Module. NN denotes a single-layer neural network. ConvBlock-1, ConvBlock-2, ConvBlock-3, ConvBlock-4, and ConvBlock-5 denote convolutional blocks, each of which includes a convolutional layer followed by BatchNorm and ReLU. \(\mathrm {F_3}\), \(\mathrm {F_4}\), and \(\mathrm {F_5}\) denote the feature maps with receptive fields of 13\(\,\times \,\)13, 21\(\,\times \,\)21, and 37\(\,\times \,\)37, respectively.
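
To make Eqs. (5)-(9) concrete, here is a minimal sketch of an External Attention block with two memory units \(M_k\) and \(M_v\) and double normalization, implemented as linear layers as in [7]; the memory size S is an assumption.

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """External attention with memories M_k, M_v and double normalization."""
    def __init__(self, d_model, S=64):
        super().__init__()
        self.m_k = nn.Linear(d_model, S, bias=False)   # rows of M_k, shared across all inputs
        self.m_v = nn.Linear(S, d_model, bias=False)   # rows of M_v

    def forward(self, f):                              # f: (B, N, d) flattened pixels
        attn = self.m_k(f)                             # Eq. (5): F M_k^T  -> (B, N, S)
        attn = torch.softmax(attn, dim=1)              # Eq. (6): softmax over pixels i
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)  # Eq. (7): l1-norm over memory j
        return self.m_v(attn)                          # Eq. (9): F_out = A M_v
```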

As mentioned above, the style of glyphs is complex and delicate. When designing the fonts, experts need to consider multiple levels of styles, such as component-level, radical-level, stroke-level, and even edge-level. Therefore, to improve the attention modules in FTransGAN [15], we design a Multi-scale External Attention Module (MEAM) to capture style features at different scales.

With the MEAM, our method can model relationships among all style reference images from another language and obtain high-quality essential style features at different scales. Specifically, when the style reference images enter the style encoder, whose architecture is shown in Fig. 2b, they first pass through three convolutional blocks. The feature map produced by the last of these blocks is then fed into the MEAM. The MEAM first derives two further feature maps through two consecutive convolutional blocks, each consisting of a convolutional layer with a kernel size of 3 followed by BatchNorm and ReLU. The MEAM then uses three juxtaposed External Attention Blocks to process the resulting three feature maps, whose receptive fields are 13\(\,\times \,\)13, 21\(\,\times \,\)21, and 37\(\,\times \,\)37, respectively; the feature maps with different receptive fields thus carry multi-scale features. Context information is obtained and incorporated into each feature map through an External Attention Block, which is computed as:

$$\begin{aligned} h_r=EA(v_r), \end{aligned}$$
(10)

where EA denotes the External Attention Block, \(\{v_r\}_{r=1}^{H\times W}\) denotes the regions of the feature map, and the new feature vector \(h_r\) contains not only the information within its own receptive field but also context information from other regions and from other reference images.

Then, considering that not all regions contribute equally, we assign a score to each region. Specifically,

$$\begin{aligned} u_r=S_1(h_r), \end{aligned}$$
(11)
$$\begin{aligned} a_r=\textrm{softmax}(u_r^Tu_c), \end{aligned}$$
(12)
$$\begin{aligned} f=\sum _{r=1}^{H\times W}a_rv_r. \end{aligned}$$
(13)

That is, we feed the feature vector \(h_r\) into a single-layer neural network \(S_1\) to obtain \(u_r\) as the latent representation of \(h_r\). Next, the importance of the current region is measured with the context vector \(u_c\), which is randomly initialized and co-trained with the whole model. The normalized score is then obtained by a softmax layer. Finally, we compute a feature vector f as the weighted sum over the regions \(v_r\).
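
A sketch of this region-weighting step (Eqs. (10)-(13)), reusing the ExternalAttention module sketched above; the tanh nonlinearity inside \(S_1\) and the tensor dimensions are assumptions.

```python
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.ea = ExternalAttention(d_model)           # Eq. (10), sketched earlier
        self.s1 = nn.Linear(d_model, d_model)          # single-layer network S_1
        self.u_c = nn.Parameter(torch.randn(d_model))  # context vector, co-trained

    def forward(self, feat):                           # feat: (B*K, C, H, W)
        v = feat.flatten(2).transpose(1, 2)            # regions v_r: (B*K, H*W, C)
        h = self.ea(v)                                 # Eq. (10): h_r = EA(v_r)
        u = torch.tanh(self.s1(h))                     # Eq. (11): u_r = S_1(h_r), tanh assumed
        a = torch.softmax(u @ self.u_c, dim=1)         # Eq. (12): a_r = softmax(u_r^T u_c)
        return (a.unsqueeze(-1) * v).sum(dim=1)        # Eq. (13): f = sum_r a_r v_r
```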

We also consider that features at different scales should be given different weights. Therefore, we flatten the feature map produced by the last convolutional block to obtain a feature vector \(f_m\), feed it into a single-layer neural network \(S_2\) to generate three weights, and assign these weights to the three scale-specific feature vectors \(f_1\), \(f_2\), and \(f_3\), respectively. These scores explicitly indicate which feature scale the model should focus on. Specifically,

$$\begin{aligned} w_1,w_2,w_3=S_2(f_m), \end{aligned}$$
(14)
$$\begin{aligned} z=\sum _{i=1}^3w_if_i, \end{aligned}$$
(15)

where \(w_1\), \(w_2\), and \(w_3\) are the three normalized scores given by the neural network and z is the weighted sum of the three feature vectors. Note that the style encoder accepts K style images at a time. Thus, the final latent feature \(z_s\) is the average over all K vectors:

$$\begin{aligned} z_s=\frac{1}{K}\sum _Kz^k. \end{aligned}$$
(16)

Besides, we copy the style latent feature \(z_s\) seven times to match the size of the content latent feature \(z_c\).
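
A sketch of the scale-weighting and averaging steps (Eqs. (14)-(16)); the softmax used to normalize the three weights and the tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class ScaleAttention(nn.Module):
    def __init__(self, d_flat):
        super().__init__()
        self.s2 = nn.Linear(d_flat, 3)                 # single-layer network S_2

    def forward(self, f_m, f1, f2, f3):                # f_m: (B*K, d_flat); f_i: (B*K, C)
        w = torch.softmax(self.s2(f_m), dim=1)         # Eq. (14): normalized weights w_1..w_3
        return w[:, 0:1] * f1 + w[:, 1:2] * f2 + w[:, 2:3] * f3   # Eq. (15)

def aggregate_style(z, batch, K):                      # z: (batch*K, C)
    return z.view(batch, K, -1).mean(dim=1)            # Eq. (16): average over K references
```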

3.3 Loss Function

To achieve few-shot cross-language font generation, our CLF-Net employs three kinds of losses: 1) Pixel-level loss to measure the pixel-wise mismatch between generated images and the ground-truth images. 2) Edge Loss to make the model pay more attention to the edge pixels of characters and make the edges of generated images sharper. 3) Adversarial loss to solve the minimax game in the GAN framework.

Pixel-Level Loss: To learn pixel-level consistency, we use L1 loss between generated images and the ground truth images:

$$\begin{aligned} \mathcal {L}_{L1}=\mathbb {E}_{x,\hat{x}\in P_{(x,\hat{x})}}[\parallel x-\hat{x}\parallel _1]. \end{aligned}$$
(17)

Edge Loss: Pixel-level losses are widely used in existing font generation models. They estimate the consistency between the two domains based on the per-pixel difference between generated and real characters. However, in the font generation task, not all pixels of a Chinese character image are equally important: unlike background or fill pixels, boundary pixels play a key role in the overall style of Chinese characters. Therefore, our model needs to pay more attention to the edges of each character. To preserve more edge information, and inspired by [21], we define an Edge Loss that constrains our model to generate results with sharper edges. We use the Canny algorithm [2] to extract the edges of the generated and target images and measure the pixel distance between the two edge maps with the L1 loss:

$$\begin{aligned} \mathcal {L}_{edge}=\mathbb {E}_{x,\hat{x}\in P_{(x,\hat{x})}}[\parallel Canny(x)-Canny(\hat{x})\parallel _1]. \end{aligned}$$
(18)
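
As an illustration, here is a sketch of the pixel-level and edge losses. The paper specifies the Canny detector [2]; kornia.filters.canny is used below as one possible stand-in, and its differentiable edge magnitude is compared so that gradients can reach the generator. Both choices are implementation assumptions, not details stated in the paper.

```python
import torch.nn.functional as F
import kornia  # kornia.filters.canny returns (magnitude, edges)

def pixel_and_edge_loss(x, x_hat):
    # Eq. (17): L1 distance between ground truth and generated image
    l1 = F.l1_loss(x_hat, x)
    # Eq. (18): L1 distance between the two edge maps (edge magnitude used here)
    mag_real, _ = kornia.filters.canny(x)
    mag_fake, _ = kornia.filters.canny(x_hat)
    edge = F.l1_loss(mag_fake, mag_real)
    return l1, edge
```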

Adversarial Loss: Our proposed method is built on the GAN framework. The optimization of a GAN is essentially a minimax game whose goal is for the generator G to produce examples indistinguishable from the real data, thereby deceiving the discriminator D. In CLF-Net, the generator G extracts information from the style images \(I_s\) and the content image \(I_c\) and generates an image with the same content as \(I_c\) and a style similar to \(I_s\); the discriminators \(D_c\) and \(D_s\) then judge whether the generated image is indistinguishable from the reference images in terms of content and style, respectively. We use the hinge loss [18] to compute the adversarial loss:

$$\begin{aligned} \mathcal {L}_{adv} =\mathcal {L}_{adv_C}+\mathcal {L}_{adv_S}, \end{aligned}$$
(19)
$$\begin{aligned} \mathcal {L}_{adv_C} =\max _{D_c}\min _G\mathbb {E}_{I_c\in P_c,I_s\in P_s}[\log D_c(I_c)+\log (1-D_c(\hat{x}))], \end{aligned}$$
(20)
$$\begin{aligned} \mathcal {L}_{adv_S} =\max _{D_s}\min _G\mathbb {E}_{I_c\in P_c,I_s\in P_s}[\log D_s(I_s)+\log (1-D_s(\hat{x}))], \end{aligned}$$
(21)

where \(D_c(\cdot )\) and \(D_s(\cdot )\) represent the output from the content discriminator and style discriminator respectively.
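
Since the paper states that the hinge loss [18] is used, the sketch below shows the standard hinge formulation for the two discriminators; the exact pairing of real and generated inputs for each discriminator is an assumption.

```python
import torch.nn.functional as F

def d_hinge(real_logits, fake_logits):
    # discriminator side: push real scores above +1 and fake scores below -1
    return F.relu(1.0 - real_logits).mean() + F.relu(1.0 + fake_logits).mean()

def g_hinge(fake_logits):
    # generator side: raise the discriminator scores of generated patches
    return -fake_logits.mean()

def adversarial_losses(D_c, D_s, I_c, I_s, x_hat):
    loss_D = d_hinge(D_c(I_c), D_c(x_hat.detach())) + d_hinge(D_s(I_s), D_s(x_hat.detach()))
    loss_G = g_hinge(D_c(x_hat)) + g_hinge(D_s(x_hat))
    return loss_D, loss_G
```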

Combining all losses mentioned above, we train the whole model by the following objective:

$$\begin{aligned} \mathcal {L}=\lambda _{L1}\mathcal {L}_{L1}+\lambda _{edge}\mathcal {L}_{edge}+\lambda _{adv}\mathcal {L}_{adv}, \end{aligned}$$
(22)

where \(\lambda _{L1}\), \(\lambda _{edge}\), and \(\lambda _{adv}\) are the weights for controlling these terms.
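
Putting the terms together, a minimal sketch of the full objective of Eq. (22) for the generator update, using the weights reported in Sect. 4.2:

```python
# Weights from Sect. 4.2: lambda_L1 = 100, lambda_edge = 10, lambda_adv = 1.
LAMBDA_L1, LAMBDA_EDGE, LAMBDA_ADV = 100.0, 10.0, 1.0

def total_generator_loss(l1, edge, adv_g):
    return LAMBDA_L1 * l1 + LAMBDA_EDGE * edge + LAMBDA_ADV * adv_g
```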

4 Experiments

4.1 Datasets

For a fair comparison, our experiments use the public dataset of FTransGAN [15], which contains 847 grayscale fonts (stylized inputs), each with about 1000 commonly used Chinese characters and 52 English letters in the same style. The test set consists of two parts: images with known contents but unknown styles, and images with known styles but unknown contents. Following FTransGAN, 29 characters and 29 fonts are randomly selected as the unknown contents and styles, and the rest are used as training data.

4.2 Training Details

We trained CLF-Net on an Nvidia RTX 3090 GPU using the above dataset with the following settings: Chinese characters serve as the content input and English letters as the style input, and we set \(\lambda _{L1}\) = 100, \(\lambda _{edge}\) = 10, \(\lambda _{adv}\) = 1, and K = 6.

4.3 Competitors

To comprehensively evaluate the model, we chose three models as competitors: EMD [31], DFS [32], and FTransGAN [15]. As mentioned above, previous works usually focus on font generation for a single language, and there are few works on cross-language font generation. Among the competitors, only FTransGAN is specifically designed for the cross-language font generation task; EMD and DFS are designed for monolingual font generation, and we adapt them to the cross-language task following the modifications made by the authors of FTransGAN.

Table 1. Quantitative evaluation on the test set. The bold numbers indicate the best, and the underlined numbers represent the second best.

4.4 Quantitative Evaluation

Quantitative evaluation of generative models is inherently difficult because there are no generalized rules for comparing ground truths and generated images. Recently, several evaluation metrics [30] based on different assumptions have been proposed to measure the performance of generative models, but they remain controversial. In this paper, we evaluate the models using various similarity metrics from pixel-level to perceptual-level. As shown in Table 1, our model outperforms existing methods in most metrics.

Pixel-Level Evaluation. A simple way to quantitatively evaluate the model is to calculate the distance between generated images and the ground truths. The pixel-wise assessment is to compare the pixels that are at the same position in the ground truths and generated images. Here, we use the following two metrics: mean absolute error (MAE) and structural similarity (SSIM).

Fig. 3. Visual comparison of our proposed model and its competitors.

Table 2. Effect of different components in our method. The bold numbers indicate the best, and the underlined numbers represent the second best.

Perceptual-Level Evaluation. However, pixel-level evaluation metrics often go against human intuition. Therefore, we also adopt perceptual-level evaluation metrics to comprehensively evaluate all models. Drawing on FTransGAN [15], we use the Fréchet Inception Distance (FID) proposed in [22] to compute the feature map distance between generated images and the ground truths. This metric evaluates the performance of the network rather than simply comparing generated results. In this way, we can evaluate the performance of the content encoder and the style encoder separately. The score is calculated from the top-1 accuracy and the mean Fréchet Inception Distance (mFID) proposed by [17].

4.5 Visual Quality Evaluation

In this section, we qualitatively compare our method with the methods above; the results are shown in Fig. 3. We randomly selected outputs of our model and its competitors from three groups: the first group contains handwriting fonts, the second printed fonts, and the third highly artistic fonts. We can see that EMD [31] erases thin strokes in some fonts and performs worse on highly artistic fonts. DFS [32] performs poorly on most fonts. FTransGAN [15] ignores fine-grained local styles and does not handle the style patterns of stroke ends on highly artistic fonts in sufficient detail, which causes artifacts and black spots in the generated images. Our approach generates high-quality images for various fonts and achieves satisfactory results.

4.6 Ablation Study

Edge Loss. As shown in Table 2, after removing the Edge Loss from the full model (FM), we find that the Edge Loss significantly improves the classification accuracy of style and content labels. From the mFID [17] scores, we observe that the feature distribution of images generated by the model trained with the Edge Loss is closer to that of the real images.

Multi-scale External Attention Module. When the Multi-scale External Attention Module (MEAM) is further removed, Table 2 shows that both pixel-level and perceptual-level metrics drop sharply.

5 Conclusion

In this paper, we propose an effective few-shot cross-language font generation method called CLF-Net by learning the inter-image relationships between all style reference images. In CLF-Net, we design a Multi-scale External Attention Module for extracting essential style features at different scales in another language and introduce an Edge Loss function that produces results with less blur and sharper edges. Experimental results show that our proposed CLF-Net is highly capable of cross-language font generation and achieves superior performance compared to state-of-the-art methods. In the future, we plan to extend the model to the task of font generation across multiple languages.