
1 Introduction

Digital comic content is mainly produced to facilitate distribution, to reduce cost and to allow reading on the screens of devices such as computers, tablets, and mobile phones. To give users accurate and convenient access to digital comics on all these mediums, it is necessary to extract and identify comic book elements [4]. The relations between these elements can then be investigated further to help a computer understand the digital form of comic books. This strategy helps the user retrieve information very precisely in the image corpus.

Comic book images are composed of different elements such as panels, balloons, text and comic characters, together with their relations (e.g., read before, said by, thought by, addressed to). One research field focuses on analyzing these elements automatically, aiming at automatic comics understanding. Early studies addressed each element separately; more recent studies combine them all to get a deeper understanding of the story.

The image analysis community has investigated comic element extraction for almost ten years, with works ranging from low-level analysis such as text recognition [3] to high-level analysis such as style recognition [5]. Text detection and recognition in comic book images is one of the most studied tasks. The authors of [3, 28] proposed several methods based on image processing techniques. In [17], the authors introduced techniques based on deep learning models to recognize text without a character segmentation step. Comic character (protagonist) detection is one of the most challenging tasks because comic book creators are entirely free in the drawing of their characters; hence their appearance can change a lot from one comic book to another and even within a single comic book. Several methods have been proposed to recognize comic characters based on deep neural networks or hand-crafted features [6, 15, 16, 32]. The speech balloon is a key element in comics and can have various shapes and contours. Existing methods for extracting balloons are based on conventional image processing techniques such as contour and region detection [3, 14, 18, 25]. The method presented in [23] can also associate speech balloons with comic characters. Panel extraction has been studied for a long time [33], and the evolution of the screen quality and size of mobile devices such as smartphones and tablets has recently placed higher demands on the accuracy of panel extraction. The methods in [2, 12, 13, 29] rely on white line cutting, connected component labeling, morphological analysis or region growing. More recently, new methods based on the watershed transform [21], line segmentation using the Canny operator and polygon detection [13], region of interest detection [31], and recursive binary splitting [20] have been proposed.

All the existing methods, whether based on deep learning models or conventional techniques, treat each comic element separately. This is because the elements of comic book images are very different from each other, and hence there is hardly an algorithm that can extract all of them at the same time. For more details about existing methods for comic book analysis, the reader is referred to the surveys in [4, 17].

In our work, we investigate a deep-learning-based approach which processes multiple elements simultaneously. Our approach helps to reduce the processing time of the comic analysis pipeline. Moreover, we propose a new neural network architecture which can detect the relationship between balloons and comic characters; in other words, our model can associate a balloon with its speaker(s).

2 Related Works

2.1 Multi-task Learning

With the success of deep learning approaches, recent works, mostly based on neural network models, have been proposed to extract comic elements [6, 8, 16, 24, 30]. These works detect each element separately through tasks such as balloon segmentation, panel detection, text recognition or character detection. Hence, the processing pipeline from global image analysis to precise content extraction takes much time. In order to reduce the overall processing time of comic book image analysis, we investigate an approach which can handle multiple elements in one deep learning model. Our approach can be considered a multi-task learning (MTL) model. MTL models aim at learning multiple related tasks jointly to improve the generalization performance of all the tasks [35].

The work in [35] gives a detailed survey of MTL models. We summarize below some popular MTL works based on convolutional neural networks (CNN) in the computer vision domain. [36] proposed a deep CNN model which jointly learns different tasks such as facial landmark detection, head pose estimation, gender classification, age estimation, facial expression recognition, and facial attribute inference using shared CNN layers. [1] proposed a CNN model to predict attributes in images using an individual CNN for each task and fusing the different CNNs in a common layer via a sparse transformation. In [34], the authors proposed a multi-task model to learn the face rotation task, with the reconstruction of the original images from the generated images as an auxiliary task. Another popular model is Mask R-CNN [10], which jointly learns the object detection task and the object segmentation task. This model achieves state-of-the-art performance for both segmentation and detection. However, the Mask R-CNN model requires the segmentation masks of the objects.

In order to accomplish the training in our work, we have classes without object masks (the annotations for panels, comic characters and text are bounding boxes) and classes with segmentation masks (balloons). Hence, we extend the Mask R-CNN model to learn both the detection task for panels and comic characters and the segmentation task for balloons, using both segmentation masks and bounding boxes.

2.2 Relationship Balloon-Character

The association between balloons and comic characters can create annotations useful for story understanding (dialog analysis, situation retrieval). However, whether scanned or digital-born, these relations are not directly encoded in the image; the reader infers them from other information present in the image. There are few papers in the literature on the topic of relation analysis among comic elements. To our knowledge, only [23] has proposed a method to associate a balloon with its speaker(s) (comic characters). The authors first detected panels, balloons and their tails, and comic characters; then they built a geometric graph for each panel where the vertices are the spatial positions of the tail and character centroids and the edges are straight-line segments (candidate associations). They formulated an optimization problem searching for the best pairs (2-tuples) of tail and character corresponding to true associations.

In our work, we integrate the character-balloon association into the multi-task CNN model. While the method in [23] requires prior knowledge about the positions of panels, balloons and characters, our method does not require any prior information about characters or balloons. Finally, our new model, called Comic-MTL, can learn from different kinds of annotations (balloon masks, panel and character bounding boxes, and balloon-character associations) to detect panels and characters, segment balloons, and detect the associations between the detected characters and the segmented balloons.

3 Proposed Model: Comic-MTL

In this section, we present our proposed Comic-MTL model for comic image analysis, which aims at extracting characters, panels, balloons and the associations between characters and balloons. Our model is an extension of the state-of-the-art Mask R-CNN model for instance segmentation [10]. First, we modify the loss function of the mask branch in Mask R-CNN to take into account the origin of the annotations (masks or bounding boxes); then we add an additional branch which contains a PairPool layer and a binary classifier to detect associations among all possible pairs (a pair contains a balloon and a comic character). The classifier outputs the probabilities of a balloon-character pair being “has-a-link” or “has-not-a-link”. The additional branch requires a step that extracts relevant features for the balloon-character pairs. We describe these two modifications in the next sub-sections and illustrate them in Fig. 1.

3.1 Multi-task Learning from Bounding Boxes and Object Masks

We consider a detection/segmentation problem where some classes have bounding box annotations and others have segmentation mask annotations. In our work, we have to deal with a comic dataset where panels and characters are annotated with bounding boxes and balloons are annotated with masks. While the state-of-the-art model Mask R-CNN [10] can predict both the bounding boxes and the masks of the objects, it requires mask annotations for all classes at training time. In order to jointly learn the detection task and the segmentation task from bounding boxes and object masks, we modify Mask R-CNN so that only the objects from classes with mask annotations contribute to the loss of the mask branch.

In the Mask R-CNN model, the multi-task loss on each sampled RoI (region of interest) among the N sampled RoIs is defined as \(L = L_{cls} + L_{box} + L_{mask}\), where \(L_{cls}\) and \(L_{box}\) are the losses of the detection branch as in Faster R-CNN [22], and \(L_{mask}\) is the binary cross-entropy loss of the mask branch of Mask R-CNN [10]. In our model, we simply apply \(L_{mask}\) only to the M RoIs associated with ground-truth classes which have mask annotations. For the remaining \(K = N - M\) RoIs, associated with ground-truth classes which have only bounding box annotations, we optimize the detection branch only.
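As a concrete illustration, the following minimal PyTorch-style sketch shows how the mask loss can be restricted to RoIs from classes with mask annotations; the function and tensor names are ours, not taken from the original implementation.

```python
import torch
import torch.nn.functional as F

def partial_mask_loss(mask_logits, mask_targets, gt_classes, classes_with_masks):
    """L_mask restricted to RoIs whose ground-truth class has mask annotations
    (balloons); RoIs from box-only classes (panels, characters) contribute zero.

    mask_logits:  (N, H, W) predicted mask logits for the N sampled RoIs
    mask_targets: (N, H, W) float binary ground-truth masks (zeros where unused)
    gt_classes:   (N,) ground-truth class index of each RoI
    classes_with_masks: set of class indices that carry mask annotations
    """
    keep = torch.tensor([int(c) in classes_with_masks for c in gt_classes],
                        dtype=torch.bool, device=mask_logits.device)
    if not keep.any():
        # No mask-annotated RoI in this batch: return a zero tied to the graph.
        return mask_logits.sum() * 0.0
    return F.binary_cross_entropy_with_logits(mask_logits[keep], mask_targets[keep])
```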

Fig. 1. The Comic MTL framework for comic book image analysis.

3.2 Relation Analysis for Balloon-Character Pairs

We address the relationship analysis between balloons and comic characters by considering it as a binary classification problem. Each balloon-character pair needs to be classified as “has-a-link” or “has-not-a-link”. A pair has a link if the character speaks the text in the balloon. In order to build a binary classifier for this relation analysis, we add a new branch to the Mask R-CNN model.

Mask R-CNN: Similar to Faster R-CNN, in the first stage, Mask R-CNN uses the region proposal network (RPN) [22] to get candidate object bounding boxes. In the second stage, it uses RoIAlign to extract the features of these candidate boxes and feeds them in parallel into the categorical classifier, the bounding-box regressor, and the mask predictor.

Comic MTL: We add to Mask R-CNN an additional branch that takes as input all pair combinations of the top N candidate anchors from the RPN, and we optimize this branch with the binary cross-entropy loss. The output of this additional classifier is distinct from the class, box, and mask outputs, and it requires a pair combination step and a feature construction step for all pairs. We present these two important steps below.

Balloon-Character Pairs Combination - PairPool: Instead of taking into account all combinations of candidate bounding boxes from the RPN stage, we sample the combinations between the candidate boxes which best overlap (above a threshold \(\alpha \)) the ground-truth balloons and the candidate boxes which best overlap the ground-truth characters. This step greatly reduces the number of possible pairs. We keep all positive pairs (pairs where the character is the speaker of the balloon) from the ground truth and randomly keep the same number of negative pairs. Note that in most cases, the total number of negative pairs is much larger than the total number of positive pairs. Next, we pad the set of pairs with zeros or trim it to N pairs, where N is configurable. We need to fix the number of pairs because it is the input size of the additional branch.
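The following sketch illustrates this PairPool sampling under our own assumptions about data formats (boxes as (x1, y1, x2, y2) tuples, ground-truth links as a set of index pairs); it is illustrative, not our exact implementation.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def pair_pool(candidates, gt_balloons, gt_characters, gt_links,
              n_pairs=150, alpha=0.6):
    """Sample balloon-character pairs from RPN candidates.

    candidates:    candidate boxes from the RPN stage
    gt_balloons:   ground-truth balloon boxes
    gt_characters: ground-truth character boxes
    gt_links:      set of (balloon_index, character_index) associations
    """
    def matches(gts):
        # Candidates overlapping some ground-truth box by more than alpha,
        # paired with the index of their best-matching ground-truth box.
        out = []
        for c in candidates:
            overlaps = [iou(c, g) for g in gts]
            if overlaps and max(overlaps) > alpha:
                out.append((c, int(np.argmax(overlaps))))
        return out

    pairs = [(b, c, (bi, ci) in gt_links)
             for b, bi in matches(gt_balloons)
             for c, ci in matches(gt_characters)]
    positives = [p for p in pairs if p[2]]
    negatives = [p for p in pairs if not p[2]]
    np.random.shuffle(negatives)
    sampled = positives + negatives[:len(positives)]  # balanced sampling

    zero_box = (0.0, 0.0, 0.0, 0.0)
    while len(sampled) < n_pairs:                      # pad with zero pairs...
        sampled.append((zero_box, zero_box, False))
    return sampled[:n_pairs]                           # ...or trim to n_pairs
```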

In our proposed model, we define a multi-task loss on each sampled RoI as \(L = L_{cls} + L_{box} + L_{mask} + L_{rel}\). The first three components were presented in the previous section. The new branch (\(L_{rel}\)) is optimized with the binary cross-entropy loss. Simply reusing the shared features of the other branches to feed this new relation branch is not enough. Indeed, the relation between a balloon and a character does not depend on the individual features of each bounding box alone, but rather on multiple features: the individual visual features, the visual features of the union of the two boxes, and features related to the positions of the two boxes.

Features Construction: We need to encode the visual layout of the balloon-character pair as well as the spatial layout of the two elements. Thus, unlike the other branches, which use the shared features, we propose to use a combination of (1) the visual features of the union of the two bounding boxes and (2) the spatial features.

For the visual features of the union of the two bounding boxes, we reuse the shared features as in Mask R-CNN; instead of each individual bounding box, we take into account the box equal to the union of the two bounding boxes of the balloon-character pair. This feature preserves the global visual information of a balloon and its speaker.
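The union box itself is trivial to compute, as in the sketch below (assuming boxes in (x1, y1, x2, y2) format); its RoIAlign features then serve as the pair's visual features.

```python
def union_box(balloon, character):
    """Smallest box enclosing both the balloon box and the character box."""
    return (min(balloon[0], character[0]), min(balloon[1], character[1]),
            max(balloon[2], character[2]), max(balloon[3], character[3]))
```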

For the spatial features, let \(b = [x_{b}, y_{b}, w_{b}, h_{b}]\) and \(c = [x_{c}, y_{c}, w_{c}, h_{c}]\) denote the two bounding boxes of a pair, where (x, y) are the coordinates of the center of the box, and (w, h) are the width and height of the box, respectively. We encode the spatial features with a 5-dimensional vector which is invariant to translation and scale transformations:

$$\begin{aligned}{}[\frac{x_{b}-x_{c}}{w_{b}}, \frac{y_{b}-y_{c}}{h_{b}}, \frac{x_{b}-x_{c}}{w_{c}}, \frac{y_{b}-y_{c}}{h_{c}}, \frac{b \cap c}{b \cup c}] \end{aligned}$$
(1)

The first four features represent the normalized translation between the two boxes, and the fifth feature is the overlap (IoU) between the boxes. In this paper, these features are used directly and concatenated with the visual features to form the final features. In future work, we will investigate feature embedding methods to map this 5-d feature to a higher-dimensional representation.
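A minimal sketch of this feature construction follows, assuming boxes given in (x, y, w, h) center format; the helper names are ours.

```python
def iou_center(b, c):
    """IoU for boxes in (x_center, y_center, w, h) format."""
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    cx1, cy1 = c[0] - c[2] / 2, c[1] - c[3] / 2
    cx2, cy2 = c[0] + c[2] / 2, c[1] + c[3] / 2
    iw = max(0.0, min(bx2, cx2) - max(bx1, cx1))
    ih = max(0.0, min(by2, cy2) - max(by1, cy1))
    inter = iw * ih
    union = b[2] * b[3] + c[2] * c[3] - inter
    return inter / union if union > 0 else 0.0

def spatial_features(b, c):
    """5-d spatial feature of Eq. (1) for a balloon box b and a character box c,
    each given as (x, y, w, h) with (x, y) the center of the box."""
    xb, yb, wb, hb = b
    xc, yc, wc, hc = c
    return [(xb - xc) / wb, (yb - yc) / hb,    # translation normalized by b
            (xb - xc) / wc, (yb - yc) / hc,    # translation normalized by c
            iou_center(b, c)]                  # overlap between the two boxes
```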

Network Architecture: To experiment with our approach, we instantiate Mask R-CNN with its default architecture. The architecture has four parts: (1) the convolutional backbone used for feature extraction over the entire image, (2) the network head for bounding-box recognition (classification and regression), (3) the mask prediction head applied separately to each RoI, and (4) the network head for the relation classifier. We use ResNet-50 together with the Feature Pyramid Network (FPN) as the backbone; this combination is a common choice in many works [10, 22]. For the network heads, we closely follow the architecture of Mask R-CNN and add the additional relation branch (see Fig. 1).

3.3 Implementation Details

Training: As in Fast R-CNN, an RoI is considered positive if its IoU (Intersection over Union) with a ground-truth box is at least 0.5, and negative otherwise. We train for 100 epochs with a learning rate of 0.001, decreased by a factor of 10 at the 80\(^{th}\) epoch. We use a weight decay of 0.0001 and a momentum of 0.9. The parameter \(\alpha \) is set to 0.6. In a comic page, the numbers of balloons and characters are at most around 15 and 10, respectively (see Sect. 4), so we set the parameter N to 150, which covers almost all possible combinations while keeping the computation cost of the relation branch low.

Inference: At test time, instead of considering the outputs of the RPN, we take the outputs of the detection branch to construct the pairs of detected balloons and characters, and we run the relation prediction branch on these pairs. This is similar to the approach used for the mask branch in Mask R-CNN; although it differs from the parallel computation used in training, it speeds up inference and improves accuracy (due to the use of fewer, more accurate RoIs). Because we only classify the pairs formed from detected balloons and characters (with a maximum of \(N=150\) pairs per comic book image), Comic MTL adds only a marginal runtime to its Mask R-CNN counterpart (about 5% on typical models).
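This inference step can be sketched as follows; `relation_head` stands for a hypothetical callable scoring one pair, and the class names are illustrative.

```python
def relation_inference(detections, relation_head, n_pairs=150):
    """Build balloon-character pairs from the detection branch outputs and
    score each pair with the relation classifier.

    detections:    list of (box, class_name) from the detection branch
    relation_head: callable returning P(has-a-link) for one balloon-character pair
    """
    balloons = [box for box, cls in detections if cls == "balloon"]
    characters = [box for box, cls in detections if cls == "character"]
    pairs = [(b, c) for b in balloons for c in characters][:n_pairs]
    return [(b, c, relation_head(b, c)) for b, c in pairs]
```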

4 Experiments

We experiment with the Comic MTL model on the public eBDtheque dataset [9]. This is the only dataset to date which has bounding box annotations for comic characters, segmentation masks for panels and balloons, and the associations between balloons and their speakers (characters).

This dataset is composed of one hundred comic book images containing in total 850 panels, 1550 comic characters, 1092 balloons and 4691 text lines. More details can be found in the original paper [9]. Because of the limited size of the eBDtheque dataset, we run cross-validation tests on five different training and testing sets. Each training set contains 90 images and each testing set contains the 10 remaining images. The reported results are the average over these five validations.
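One simple way to draw such splits, assuming disjoint 10-image test sets (an illustrative assumption; the exact split procedure is an implementation detail), is sketched below.

```python
import numpy as np

def five_fold_splits(num_images=100, seed=0):
    """Yield five (train, test) index splits: 90 training images and the 10
    remaining images for testing, with disjoint test sets."""
    order = np.random.RandomState(seed).permutation(num_images)
    folds = np.array_split(order, 10)  # ten disjoint groups of 10 images
    for k in range(5):
        test = folds[k]
        train = np.concatenate([f for i, f in enumerate(folds) if i != k])
        yield train, test
```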

Balloon Segmentation. In addition to the comparison between Comic MTL and the existing balloon segmentation methods presented in [25], we train the state-of-the-art Mask R-CNN model for balloon segmentation and compare it to the Comic MTL model. We used the same configuration and the same train/test sets, without any augmentation, for both Mask R-CNN and Comic MTL. Both models use ImageNet pre-trained weights. To compare with existing methods, we use the recall (R), the precision (P), and the F-measure (F1) at the pixel level as metrics; we therefore chose the threshold that maximizes the F-measure for the Mask R-CNN and Comic MTL models. Details are given in Table 1.

Table 1. Speech balloon segmentation performance in percent.

The neural network models Mask R-CNN and Comic MTL outperform all existing methods by a large margin (19.4% in F-measure). The Mask R-CNN and Comic MTL models perform similarly, with a slightly lower value for Comic MTL of about 0.08% in F-measure. Note, however, that Mask R-CNN can only perform balloon segmentation here (because we do not have mask annotations for panels and characters), while balloon segmentation is only one of the four tasks that a single Comic MTL model performs.
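For reference, the pixel-level metrics used above can be computed for a single mask pair as in this minimal sketch.

```python
import numpy as np

def pixel_prf(pred_mask, gt_mask):
    """Pixel-level precision, recall and F-measure for binary masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()          # true-positive pixels
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```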

Panel and Character Detection. We compare our Comic MTL to other existing methods for panel and character detection. To compare with the existing methods [3, 17, 19, 26, 29], we use the recall (R), the precision (P), and the F-measure (F1), and chose the threshold that maximizes the F-measure for the Comic MTL model. We followed the PASCAL VOC criteria and used IoU \(\ge \) 0.5 as the threshold for good detections [7]. Details are given in Table 2.

Table 2. Panels detection performance in percent.

Table 2 shows the scores of the existing methods (copied from [19]) and of the Comic MTL model for panel detection. Comic MTL comes in second place, 8.5% lower than [26].

Table 3. Characters detection performance in mAP.

Table 3 shows the scores of the existing methods (copied from [19]) and of the Comic MTL model for character detection. For this task, Comic MTL outperforms the existing methods by a large margin, about 17.83% compared to [19]. Note, however, that in [19] the authors test a model trained on a dataset other than eBDtheque.

Relation Analysis. In this section, we evaluate the relation analysis between balloons and characters. Comic MTL proposes a number of balloon-character pairs and classifies them into two classes: “has-a-link” or “has-not-a-link”. A pair is in the class “has-a-link” if the association between the corresponding balloon and character exists in the ground truth. We compare the Comic MTL model to the work in [23].

The work of [23] requires panels, balloons, and characters. The authors use two different settings: (1) panels, balloons, and characters are extracted automatically by conventional extraction methods; (2) ground-truth panels, balloons, and characters are given as prior knowledge.

Table 4. Balloon-character association performance in percent.
Fig. 2. Balloon segmentation, character detection and their relation analysis by the Comic MTL model.

Table 4 shows the performance of the Comic MTL model compared to [23]. With the same setting where panels, balloons, and characters are extracted automatically (Setting 1), the Comic MTL model gives better performance. Compared to the performance of [23] when ground-truth panels, balloons, and characters are given (Setting 2), Comic MTL is behind. One reason is that the measured error for Comic MTL and for Setting 1 of [23] combines the errors of the association method itself with those of the other element extractions (e.g., missed speech balloons, missed comic characters or over-segmented panels for [23]). However, we believe that the feature extraction step of the proposed Comic MTL model can be investigated further to improve the relation analysis results. Some useful information has not yet been integrated into the model: for example, a balloon and a character should be in the same panel to be linked, and learning the direction of the balloon tail may help to learn its association with characters. These features will be included in the next version of our model.

We visualize some results of the Comic MTL model in Fig. 2, including character detection, balloon segmentation and relation analysis.

5 Conclusion

In this paper, we proposed the Comic MTL model, which can handle multiple tasks in one CNN model: character and panel detection, balloon segmentation, and balloon-character association analysis for comic book images. We compared the Comic MTL model with the Mask R-CNN model and other existing methods on the public eBDtheque dataset. Experiments confirm that Comic MTL can handle multiple tasks in comic book image analysis (three, compared to one for existing models) with promising results relative to the state of the art. Further investigation could improve the current performance of the Comic MTL model, for example detecting the balloon tail and its direction to improve the learning of the association between balloons and characters.