
1 Introduction

When a natural language processing task is performed, the training and test data usually come from the same domain. However, they sometimes come from different domains. Recently, studies of domain adaptation have adapted a classifier trained on the data of one domain (the source domain) to the test data of another domain (the target domain) [5, 7, 11].

When domain adaptation is problematic due to a lack of labels in the target domain, active learning [8, 10] and semi-supervised learning [1] are effective. In this paper, we use active learning for domain adaptation for Word Sense Disambiguation (WSD).

Generally, active learning is an approach that gradually increases the precision of a classifier by selecting data with a high learning effect from an unlabeled data set, labeling the data, and adding it to the training data, thereby increasing the amount of training data monotonically. However, in domain adaptation, some data in the source domain training data have a negative influence on classification in the target domain. Here we refer to such data as “misleading data” [3]. In this paper, we detect such data in the source domain training data and delete it during active learning to construct training data suitable for the target domain.

In the experiment, we use three domains from the Balanced Corpus of Contemporary Written Japanese (BCCWJ [4]): Yahoo! Answers (OC), books (PB), and newspapers (PN). The data set, which was provided by the Japanese WSD task of SemEval-2 [6], has word sense tags attached to parts of these corpora. There are 16 multi-sense words with a certain frequency across all domains, and six patterns of domain adaptation (OC→PB, PB→PN, PN→OC, OC→PN, PN→PB, and PB→OC). We investigate domain adaptation for WSD using the proposed active learning method for the resulting \( 16 \times 6 = 96 \) patterns and show the effectiveness of the proposed method.

2 Active Learning with Deleted Misleading Data

2.1 Active Learning

Active learning is an approach that reduces the amount of manual labeling required to build effective training data. Using a classifier trained on the current training data, we select data with as high a learning effect as possible from an unlabeled data set. We then manually assign correct labels to the selected data and add them to the training data. Consequently, the amount of labeled data increases and the classifier improves.

The key question in active learning is how to choose data with a high learning effect. There are many active learning methods [10]; however, one particularly effective and widely used method selects the data with the lowest classification reliability, as determined by a powerful classifier such as a support vector machine (SVM) [9].
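As an illustration, the following is a minimal sketch of this selection criterion, assuming scikit-learn's SVC (which wraps libsvm) as the classifier; all variable and function names are illustrative, not taken from the original experiments.

```python
import numpy as np
from sklearn.svm import SVC

def select_least_confident(X_labeled, y_labeled, X_pool):
    """Return the index of the pool instance with the lowest
    classification reliability under the current classifier."""
    clf = SVC(probability=True)          # enable probability estimates
    clf.fit(X_labeled, y_labeled)
    proba = clf.predict_proba(X_pool)    # per-class probabilities
    reliability = proba.max(axis=1)      # reliability = top-class probability
    return int(np.argmin(reliability))   # least reliable = highest learning effect
```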

2.2 Detecting and Deleting Misleading Data

The initial labeled data in general active learning is fixed. This is not problematic because all labeled data is useful. However, the initial pool of labeled data for domain adaptation, i.e., the labeled data in the source domain, can include harmful data. Here we refer to such data as “misleading data.” When general active learning is applied to domain adaptation, misleading data in the source domain prevents active learning from improving the classifier. Therefore, as we add labeled data to the training data, we also detect misleading data and delete it from the labeled training data in the source domain.

Fig. 1. Our proposed active learning

Figure 1 shows the algorithm of our method. The initial labeled data in the source domain is denoted \( D_0 \), and the labeled data added to the training data during the active learning process is denoted \( A \), where \( A \) is initially empty. \( D_1 \) is the union of \( D_0 \) and \( A \), and \( h_1 \) is the classifier learned from \( D_1 \). Using \( h_1 \), we classify \( D_0 \); the classification result is denoted \( L_1 \). As in general active learning, we classify the unlabeled data set \( U \) in the target domain using \( h_1 \), identify the data \( {\varvec{b}} \) with the lowest classification reliability, and manually assign it a correct label. Data \( {\varvec{b}} \) is then added to \( A \). \( D_2 \) is the union of \( D_0 \) and \( A \), and \( h_2 \) is the classifier learned from \( D_2 \). We use \( h_2 \) to classify \( D_0 \) and denote the classification result as \( L_2 \). We then detect misleading data \( {\varvec{z}} \) from \( L_1 \) and \( L_2 \) according to the following three cases. (a) There are false classifications in \( L_2 \): we identify the data with the highest classification reliability among the false classifications. (b) There are no false classifications: comparing \( L_1 \) with \( L_2 \), we identify the data with the greatest decrease in reliability from \( L_1 \) to \( L_2 \). (c) There are no false classifications and no data with decreased reliability: no misleading data is identified. The detected data \( {\varvec{z}} \) is deleted from \( D_0 \). As shown in Fig. 1, this procedure is repeated 10 times.
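The following Python sketch mirrors the procedure in Fig. 1 under the same assumptions as above; `oracle` stands in for the human annotator and is hypothetical, as is every other name.

```python
import numpy as np
from sklearn.svm import SVC

def fit_svm(X, y):
    return SVC(probability=True).fit(X, y)

def proposed_active_learning(X0, y0, X_pool, oracle, rounds=10):
    """Each round adds one labeled target instance to A and deletes
    at most one misleading instance from the source data D0."""
    keep = np.ones(len(X0), dtype=bool)         # surviving part of D0
    XA, yA = [], []                             # A: data added from the pool
    in_pool = np.ones(len(X_pool), dtype=bool)  # U: unlabeled target data

    def training_data():                        # D = (kept part of D0) union A
        if XA:
            return (np.vstack([X0[keep], np.asarray(XA)]),
                    np.concatenate([y0[keep], yA]))
        return X0[keep], y0[keep]

    for _ in range(rounds):
        h1 = fit_svm(*training_data())              # h1 learned from D1
        rel1 = h1.predict_proba(X0).max(axis=1)     # L1: reliability on D0

        cand = np.where(in_pool)[0]                 # select b: least reliable in U
        conf = h1.predict_proba(X_pool[cand]).max(axis=1)
        b = cand[conf.argmin()]
        XA.append(X_pool[b]); yA.append(oracle(b)); in_pool[b] = False

        h2 = fit_svm(*training_data())              # h2 learned from D2
        pred2 = h2.predict(X0)
        rel2 = h2.predict_proba(X0).max(axis=1)     # L2: reliability on D0

        wrong = keep & (pred2 != y0)                # case (a): false classifications
        if wrong.any():
            z = np.where(wrong)[0][rel2[wrong].argmax()]
            keep[z] = False
        else:
            worse = np.where(keep & (rel2 < rel1))[0]   # case (b): reliability fell
            if len(worse):                              # case (c): nothing to delete
                z = worse[(rel1 - rel2)[worse].argmax()]
                keep[z] = False

    return fit_svm(*training_data())
```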

In this study, active learning terminates when 10 data have been added to the labeled training data set. The only difference between general active learning and active learning for domain adaptation is the distribution of the initial labeled data set; as more labeled data is added through active learning, this difference shrinks. Therefore, we evaluate the proposed method with 10 repetitions of active learning.

Table 1. Target words of experiment

3 Experiment

In the experiment, we use three domains from the Balanced Corpus of Contemporary Written Japanese (BCCWJ [4]): OC, PB, and PN. As mentioned previously, the data set, which was provided by the Japanese WSD task of SemEval-2 [6], has word sense tags attached to parts of these corpora. There are 16 multi-sense words with a certain frequency across all domains. These 16 target words are shown in Table 1.Footnote 1 There are six directions of domain adaptation (OC→PB, PB→PN, PN→OC, OC→PN, PN→PB, and PB→OC). Consequently, \( 16 \times 6 = 96 \) types of domain adaptation for WSD are used in the experiment.

In each direction of domain adaptation (e.g., OC→PB), we conducted active learning for the 16 target words. We evaluated each active learning method for domain adaptation using the average of these 16 precision values.

We compared three methods. The first is active learning that selects the added data at random (Random), the second is standard active learning (AL), and the third is our proposed active learning (Our AL). For all methods, the classifier is an SVM. We use the SVM tool ‘libsvm’ Footnote 2 to train the classifier. Using the -b option, we can obtain the reliability of each classification.
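For concreteness, a minimal sketch using libsvm's bundled Python interface (the svmutil module shipped with the libsvm distribution); the data variables are placeholders, and '-b 1' corresponds to the -b option mentioned above.

```python
from svmutil import svm_train, svm_predict

# x_* are lists of {feature_index: value} dicts (libsvm sparse format),
# y_* are the corresponding label lists; all are placeholders here.
model = svm_train(y_train, x_train, '-b 1')          # -b 1: probability estimates
labels, acc, probs = svm_predict(y_test, x_test, model, '-b 1')
# probs[i] holds the per-class probabilities of the i-th instance;
# max(probs[i]) is the classification reliability used for data selection.
```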

Table 2. Average precision of the final classifier (%)
Fig. 2. Comparison of average precisions

We show the results of the experiment in Figs. 3, 4, 5, 6, 7 and 8; each figure shows the result of one direction of domain adaptation. In this experiment, active learning stops after 10 repetitions, and the precision of the classifier at that point is presented in Table 2 and Fig. 2. Our proposed active learning method outperforms standard active learning in every domain adaptation type.

Fig. 3. Active learning for “OC→PB”

Fig. 4. Active learning for “PB→PN”

Fig. 5. Active learning for “PN→OC”

Fig. 6. Active learning for “OC→PN”

Fig. 7. Active learning for “PN→PB”

Fig. 8. Active learning for “PB→OC”

4 Discussion

4.1 Existence and Detection of Misleading Data

We do not know whether the data detected as misleading data in the experiment are actually misleading. Here, we use the data labels to determine whether the detected data are in fact misleading data, and thereby examine whether the method for detecting misleading data is effective.

First, we identify the misleading data individually following a previously proposed method [13]. Consider domain adaptation from source domain \( S \) to target domain \( T \), and let \( D \) be the labeled data of target word \( w \) in \( S \). We measure the correct answer rate \( p_0 \) on \( T \) of the classifier learned from \( D \), then delete a data instance \( x \) from \( D \) and measure the correct answer rate \( p_1 \) on \( T \) of the classifier learned from \( D - \{x\} \). When \( p_1 > p_0 \), we consider data \( x \) to be misleading data. We perform this procedure for all data in \( D \) and find the misleading data of target word \( w \). Table 3 shows the amount of misleading data found by this process. The numerical values in parentheses are the total amounts of data.
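A minimal sketch of this leave-one-out test, again assuming scikit-learn's SVC; X_tgt and y_tgt stand for the labeled evaluation data in the target domain, and all names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def true_misleading_data(X_src, y_src, X_tgt, y_tgt):
    """x is misleading if deleting it from D raises the correct
    answer rate on the target domain, i.e., p1 > p0."""
    def rate(mask):
        clf = SVC().fit(X_src[mask], y_src[mask])
        return clf.score(X_tgt, y_tgt)    # correct answer rate on T

    all_data = np.ones(len(X_src), dtype=bool)
    p0 = rate(all_data)                   # baseline with the full D
    misleading = []
    for i in range(len(X_src)):
        mask = all_data.copy()
        mask[i] = False                   # D - {x}
        if rate(mask) > p0:               # p1 > p0
            misleading.append(i)
    return misleading
```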

Using the data presented in Table 3, we investigate whether the misleading data detected by the experimental procedure are true misleading data. The results are shown in Table 4. The numerical values in parentheses are the amounts of detected data, and the numerical values next to the parentheses are the amounts of true misleading data. From Table 4, it is evident that 959 data were detected, of which 121 were true misleading data, giving a precision of \( 121/959 = 0.1262 \). This value is low. However, deleting falsely detected data does not necessarily reduce precision; therefore, we believe that the falsely detected data were largely unrelated to classification.

4.2 Instance Weight

In domain adaptation tasks, labeled data in the target domain are more important than labeled data in the source domain. Therefore, instance weight learning is effective in domain adaptation [3]. Generally, the weight of an instance is defined by the probability density ratio [12]. Here, we investigate weighting the target domain data selected during active learning. We simply weight these data by doubling their frequency in the training data. Table 5 shows the average precision of the final classifier obtained by this active learning.
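A minimal sketch of this doubling scheme, assuming scikit-learn's SVC; passing sample_weight=2 is equivalent to duplicating those instances in the training data, and all variable names are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

# Source data gets weight 1; the target data added by active learning
# gets weight 2, which is equivalent to doubling its frequency.
X_train = np.vstack([X_source, X_added])
y_train = np.concatenate([y_source, y_added])
weights = np.concatenate([np.ones(len(X_source)),
                          np.full(len(X_added), 2.0)])
clf = SVC(probability=True).fit(X_train, y_train, sample_weight=weights)
```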

Table 3. Misleading data
Table 4. Correct answer rates of detection of misleading data
Table 5. Active learning with instance weight (%)
Table 6. Use of Daumé’s method in active learning (%)

From Table 5, we can confirm the effect of weighting the target domain labeled data. In this experiment, we simply doubled the weight; we intend to investigate better weighting schemes in future work.

4.3 Feature Weight

Because labeled data in the target domain are added by active learning, we can also apply a supervised domain adaptation method.

Here, we combine Daumé’s method [2] with active learning. Using Daumé’s method, we convert vector \( {\varvec{x}_s} \) of the source domain into the triple-length vector \( ({\varvec{x}_s},{\varvec{x}_s},{\varvec{0}}) \), and vector \( {\varvec{x}_t} \) of the target domain into the triple-length vector \( ({\varvec{0}},{\varvec{x}_t},{\varvec{x}_t}) \). We then classify the target domain data with a standard classifier using the augmented vectors. This method gives extra weight to the common (overlapping) features of the source domain and the target domain.

When Daumé’s method is combined with active learning, we only have to convert source domain data \( {\varvec{x}_s} \) into \( ({\varvec{x}_s},{\varvec{x}_s},{\varvec{0}}) \), and target domain data \( {\varvec{x}_t} \) into \( ({\varvec{0}},{\varvec{x}_t},{\varvec{x}_t}) \). The results for ten repetitions are shown in Table 6.
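A minimal sketch of this conversion, following the vector layout given above, in which the middle block is the one shared by both domains.

```python
import numpy as np

def augment_source(x_s):
    """Source instance: (x_s, x_s, 0) -- source-specific copy,
    shared copy, zeroed target-specific block."""
    return np.concatenate([x_s, x_s, np.zeros_like(x_s)])

def augment_target(x_t):
    """Target instance: (0, x_t, x_t) -- zeroed source-specific block,
    shared copy, target-specific copy."""
    return np.concatenate([np.zeros_like(x_t), x_t, x_t])

# Features that behave identically in both domains can rely on the shared
# (middle) block, so the classifier effectively upweights overlapping features.
```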

From Table 6, it is evident that combining the proposed method with Daumé’s method is not effective; however, standard active learning combined with Daumé’s method is effective. We conjecture that Daumé’s method reduces the influence of misleading data, so that deleting such data brings little additional benefit; consequently, the proposed method combined with Daumé’s method was not effective. We intend to investigate this possibility in future work.

5 Conclusion

In this paper, we proposed a new active learning method for domain adaptation for WSD. In standard active learning, labeled training data increases monotonically. However, in domain adaptation, some data in the source domain can deteriorate classification precision in the target domain; we call such data misleading data. Our proposed method detects and deletes misleading data in the source domain during the standard active learning process. Through an experiment using three domains (OC, PB, and PN) in the BCCWJ and 16 common target words, we showed that the proposed method outperforms standard active learning. In future work, we intend to investigate methods to detect misleading data more accurately and to assign proper weights to instances and features during the active learning process.