
1 Introduction

Healthcare text mining has attracted increasing research interest as electronic health record (EHR) and healthcare claim data have skyrocketed over the past decade. Recently, deep pre-trained language models such as BERT [2] and GPT [3] have significantly improved many natural language processing tasks. However, directly applying these deep pre-trained language models to healthcare text mining does not give satisfactory results. One important reason is that these models are trained on generic-domain corpora, whose word distribution differs from that of healthcare corpora. Moreover, deep pre-trained language models are difficult to deploy on resource-restricted devices because of their computational complexity and memory consumption. Embedded models that can run inference directly on mobile devices are important for healthcare-related apps in the US because: 1) they provide a better user experience in locations with poor cell phone signal, and 2) they do not require users to upload their sensitive health information to the cloud. In the US, mobile apps may upload health-related data to the cloud only if they are developed by certified institutions, which greatly suppresses individual developers' enthusiasm for building healthcare mobile apps. Some model compression techniques have recently been developed for generic BERT [6,7,8], but there is no sufficiently small and efficient pre-trained language model in the healthcare domain. In this work, we developed SqueezeBioBERT. SqueezeBioBERT has 3 transformer layers and runs inference much faster while remaining accurate on healthcare natural language processing tasks. Our contributions are summarized below:

  • We designed a novel knowledge distillation method, which is very effective for compressing Transformer-based models without losing accuracy.

  • We applied this knowledge distillation method to BioBERT [5], and experiments show that knowledge encoded in the large BioBERT can be effectively transferred to the compressed SqueezeBioBERT.

  • We evaluated SqueezeBioBERT on three healthcare text mining tasks: named entity recognition, relation extraction and question answering. The results show that SqueezeBioBERT achieves more than 95% of the performance of the teacher BioBERT on these three tasks while being 4.2X smaller.

2 Transformer Layer

As the foundation of modern pre-trained language models [2,3,4], the transformer layer [1] captures long-range dependencies among input tokens with an attention mechanism. A typical transformer layer contains two major components: multi-head attention (MHA) and a feed-forward network (FFN).

2.1 Multi-head Attention

In practice, the attention function is computed on a set of queries Q together with keys K and values V, and can be defined as below:

$$\begin{aligned} \mathbf{A} = \frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_k}} \end{aligned}$$
(1)
$$\begin{aligned} Attention(\mathbf{Q},\mathbf{K},\mathbf{V}) = softmax(\mathbf{A})\mathbf{V} \end{aligned}$$
(2)

where \(d_k\) denotes the dimension of \( \mathbf{K} \).

Multi-head attention allows the model to jointly attend to information from different representation subspaces. It is denoted as below:

$$\begin{aligned} MultiHead(\mathbf{Q},\mathbf{K},\mathbf{V}) = Concat(head_1, ..., head_h)\mathbf{W} \end{aligned}$$
(3)

where h denotes the number of attention heads, \(head_i\) is computed by Eq. (2), and \(\mathbf{W}\) is the linear projection weight matrix.
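
For concreteness, a minimal PyTorch sketch of Eqs. (1)–(3) is given below; the tensor shapes and the per-head splitting scheme are our own illustrative assumptions, and the learned input projections that BERT applies to Q, K and V are omitted.

```python
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Eq. (1): A = Q K^T / sqrt(d_k)
    d_k = K.size(-1)
    A = Q @ K.transpose(-2, -1) / d_k ** 0.5
    # Eq. (2): Attention(Q, K, V) = softmax(A) V
    return F.softmax(A, dim=-1) @ V

def multi_head_attention(Q, K, V, W, h):
    # Split the model dimension into h heads, apply Eq. (2) per head,
    # then concatenate the heads and project with W (Eq. (3)).
    B, T, d_model = Q.shape
    d_head = d_model // h

    def split(x):  # (B, T, d_model) -> (B, h, T, d_head)
        return x.view(B, T, h, d_head).transpose(1, 2)

    heads = scaled_dot_product_attention(split(Q), split(K), split(V))
    concat = heads.transpose(1, 2).reshape(B, T, d_model)  # Concat(head_1, ..., head_h)
    return concat @ W  # final linear projection
```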

2.2 Feed-Forward Network

The multi-head attention output is then passed through a fully connected feed-forward network, which is denoted as below:

$$\begin{aligned} FFN(x) = max(0, x\mathbf{W}_1 + b_1)\mathbf{W}_2 + b_2 \end{aligned}$$
(4)
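
A corresponding sketch of Eq. (4), again in PyTorch and with the parameter shapes left to the caller, is:

```python
import torch

def feed_forward_network(x, W1, b1, W2, b2):
    # Eq. (4): FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
    return torch.relu(x @ W1 + b1) @ W2 + b2
```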

3 Knowledge Distillation

A common way to boost the performance of a machine learning algorithm is to train several models and then ensemble them. However, deep learning models are generally heavy neural networks, so deploying an ensemble of deep neural networks in a production environment is normally considered too computationally expensive and inefficient. [9] first proposed knowledge distillation and showed that the function learned by a large, complex model can be compressed into a much smaller and faster model without significant accuracy loss [10]. As deep learning models become more and more complex, knowledge distillation has shown its power in transferring the knowledge of a group of specialist networks into a single model [10,11,12].

Formally, the knowledge distillation process can be defined as minimizing the loss function between a large teacher network T and a small student network S as below:

$$\begin{aligned} \mathcal{L}_{KD} = \sum_{x \in X} L(f^T(x), f^S(x)) \end{aligned}$$
(5)

where L denotes the loss function to evaluate the difference between T and S, x is the token input, X is the training set, \(f^T\) denotes the output of the teacher network T and \(f^S\) denotes the output of the student network S.
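
The following PyTorch-style sketch illustrates Eq. (5); the concrete choice of the loss function L and of the outputs being compared depends on which layer is distilled, as Section 5 describes.

```python
def kd_loss(teacher_outputs, student_outputs, loss_fn):
    # Eq. (5): accumulate L(f^T(x), f^S(x)) over the training inputs x in X.
    # teacher_outputs / student_outputs hold f^T(x) and f^S(x) for each x.
    return sum(loss_fn(t, s) for t, s in zip(teacher_outputs, student_outputs))
```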

4 BioBERT

BioBERT [5], which has almost the same architecture as BERT and is pre-trained on biomedical domain corpora such as PubMed abstracts and PMC full-text articles, significantly outperforms BERT on biomedical text mining tasks.

BioBERT has been fine-tuned on the following three tasks: Named Entity Recognition (NER), Relation Extraction (RE) and Question Answering (QA). NER recognizes domain-specific named entities in a corpus; precision, recall and F1 score are used for evaluation on the datasets listed in Table 1. RE classifies the relationships between named entities; precision, recall and F1 score are used for evaluation on the datasets listed in Table 2. QA answers a specific question given a text passage; strict accuracy, lenient accuracy and mean reciprocal rank are used for evaluation on the BioASQ factoid dataset [24].

Table 1. BioBERT Named Entity Recognition evaluation datasets
Table 2. BioBERT Relation Extraction evaluation datasets

5 BioBERT Distillation

In this section, we describe a novel distillation method for BioBERT. Experiments show that knowledge encoded in the large BioBERT can be effectively transferred to the compressed SqueezeBioBERT.

Figure 1 shows an overview of the proposed knowledge distillation method. Supposing that the teacher BioBERT has M transformer layers and the student SqueezeBioBERT has N transformer layers, we distill BioBERT on both the transformer layers and the task-specific prediction layers.

Fig. 1. The overview of distillation from BioBERT to SqueezeBioBERT

Transformer layer distillation consists of multi-head attention distillation and feed-forward network distillation. For multi-head attention distillation, we combine Eqs. (2), (3) and (5) and use the mean squared error (MSE) as the loss function, since it is well suited to regression targets. Thus, the multi-head attention distillation process is denoted as below:

$$\begin{aligned} \mathcal{L}_{MHA} = \frac{1}{h} \sum_{i=1}^{h} MSE(M^T_i, M^S_i) \end{aligned}$$
(6)

where h denotes the number of attention heads, \(M^S_i\) denotes the output of i-th student attention head, and \(M^T_i\) denotes the output of i-th teacher attention head.
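
A minimal PyTorch sketch of Eq. (6) follows; it assumes the teacher and student attention heads have matching shapes, which is an illustrative simplification.

```python
import torch.nn.functional as F

def mha_distill_loss(teacher_heads, student_heads):
    # Eq. (6): average the MSE over the h attention heads.
    # teacher_heads / student_heads: per-head outputs M^T_i and M^S_i.
    h = len(teacher_heads)
    return sum(F.mse_loss(s, t.detach())
               for s, t in zip(student_heads, teacher_heads)) / h
```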

For feed-forward network distillation, we use a single linear transformation \(W_{FFN}\) to project the output of the teacher network into the space of the student network. Thus, the feed-forward network distillation process is denoted as below:

$$\begin{aligned} \mathcal{L}_{FFN} = MSE(O^T_{MHA}W_{FFN}, O^S_{MHA}) \end{aligned}$$
(7)
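
Eq. (7) can be sketched in PyTorch as below; making \(W_{FFN}\) bias-free and projecting the teacher output (rather than the student output) are assumptions chosen to mirror the equation.

```python
import torch.nn as nn
import torch.nn.functional as F

class FFNDistillLoss(nn.Module):
    # Eq. (7): a single linear map W_FFN projects the teacher's MHA output
    # into the student's hidden space before the MSE is computed.
    def __init__(self, teacher_dim, student_dim):
        super().__init__()
        self.w_ffn = nn.Linear(teacher_dim, student_dim, bias=False)

    def forward(self, teacher_mha_out, student_mha_out):
        return F.mse_loss(self.w_ffn(teacher_mha_out.detach()), student_mha_out)
```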

For task-specific prediction layer distillation, we use softmax cross-entropy as the loss function, since it’s more suitable for classification tasks. Thus, the task-specific prediction layer distillation is denoted as below:

$$\begin{aligned} \mathcal{L}_{pred} = -softmax(O^T_{FFN}) \cdot \log \left( softmax(O^S_{FFN}) \right) \end{aligned}$$
(8)
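
A PyTorch sketch of Eq. (8) is given below; averaging over the batch dimension is an assumption, since the equation leaves the reduction implicit.

```python
import torch.nn.functional as F

def pred_distill_loss(teacher_logits, student_logits):
    # Eq. (8): soft cross-entropy between the teacher's and the student's
    # task-specific prediction distributions.
    teacher_probs = F.softmax(teacher_logits.detach(), dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # Batch averaging is an assumption not spelled out in Eq. (8).
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```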

In summary, Eqs. (6), (7) and (8) describe the overall BioBERT distillation procedure.

Table 3. Named Entity Recognition metrics comparison
Table 4. Relation extraction metrics comparison

6 Experiments

We use BioBERT-Base v1.1 [25] as our source model and distilled it into SqueezeBioBERT on the same three healthcare NLP tasks. BioBERT-Base v1.1 has 12 transformer layers and 109M weights. SqueezeBioBERT has 3 transformer layers and 26M weights.
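
For illustration, the parameter counts can be reproduced approximately with the Hugging Face transformers library; apart from the layer counts, the student hyperparameters below are our own assumptions, since this section reports only the depth and the total number of weights.

```python
from transformers import BertConfig, BertModel

# Teacher: the default BertConfig matches BERT-Base (12 layers, ~109M weights).
teacher = BertModel(BertConfig())

# Student: only the 3-layer depth is stated here; hidden size, head count and
# intermediate size are illustrative guesses that land near the reported 26M weights.
student = BertModel(BertConfig(num_hidden_layers=3,
                               hidden_size=512,
                               num_attention_heads=8,
                               intermediate_size=2048))

def count_params(model):
    return sum(p.numel() for p in model.parameters())

print(count_params(teacher), count_params(student))
```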

Table 5. Question answering metrics comparison

NER results are shown in Table 3, RE results in Table 4, and QA results in Table 5. From the results, we can see that SqueezeBioBERT is 4.2X smaller than BioBERT while still achieving more than 95% of the teacher BioBERT's performance on the three NLP tasks. This demonstrates the effectiveness of the proposed method in transferring knowledge encoded in the large BioBERT to the compressed SqueezeBioBERT.

7 Conclusion

Although recent deep pre-trained language models have greatly improved many natural language processing tasks, they are generally computationally expensive and memory intensive, which makes them very difficult to use on resource-restricted mobile or IoT devices. Embedded models that can run inference directly on mobile devices are important for healthcare-related apps in the US because: 1) they provide a better user experience in locations with poor cell phone signal, and 2) they do not require users to upload their sensitive health information to the cloud. In this paper, we designed a novel knowledge distillation method that is very effective for compressing Transformer-based models without losing accuracy. We applied this knowledge distillation method to BioBERT, and experiments show that knowledge encoded in the large BioBERT can be effectively transferred to the compressed SqueezeBioBERT. We evaluated SqueezeBioBERT on three healthcare text mining tasks: named entity recognition, relation extraction and question answering. The results show that SqueezeBioBERT achieves more than 95% of the performance of the teacher BioBERT on these three tasks while being 4.2X smaller.