
1 Introduction

Machine reading comprehension (MRC) is a frontier field in natural language processing (NLP), which requires that a machine read, understand, and answer questions about a text. Benefiting from the rapid development of deep learning techniques (Hermann et al. 2015; Rajpurkar et al. 2016), end-to-end neural methods have achieved promising results on the MRC task (Seo et al. 2016; Huang et al. 2017; Chen et al. 2016; Clark and Gardner 2017; Hu et al. 2017; Devlin et al. 2018; Rajpurkar et al. 2018). LSTMs, CNNs, and attention mechanisms are the common structures used in MRC. With the introduction of a series of larger and more systematic text representation models, such as Bidirectional Encoder Representations from Transformers (BERT), the status of sequence representation models has been challenged. Compared with sequence representation models, pre-training models offer a better understanding of semantics and more adequate training on the text. After pre-training, simple fine-tuning can handle problems that sequence representation models need a lot of time to solve. In this paper, we combine BERT with end-to-end network models and apply them to the question answering task.

SQuAD (Rajpurkar et al. 2018), DuReader (He et al. 2017), and CoQA (Reddy et al. 2019) are large-scale reading comprehension datasets of different kinds, which require answering questions about a given passage. Beyond these general-purpose datasets, the prospects for specific industry applications are now very promising. In this paper, we focus on the CAIL dataset (Xiao et al. 2018), a Chinese judicial reading comprehension dataset. The law is closely related to people's daily lives. Almost every country in the world has laws; everyone must abide by them to enjoy their rights and perform their duties. Every day, tens of thousands of traffic accidents, private loans, and divorce disputes occur, and in the process of handling these cases, many judgments are made. A judgment document is usually a summary of the entire case, covering the description of the events, the opinion of the court, the verdict, and so on. However, legal staff are relatively few, and factors such as uneven judicial expertise can lead to wrong decisions; even in similar cases, the judgment results can sometimes differ greatly. In addition, the sheer volume of documents makes extracting information from them extremely challenging. Therefore, introducing artificial intelligence into the legal field can help judges make better decisions and work more effectively. CAIL requires answering questions about civil and criminal judgment documents. These documents contain a wealth of case information, such as time, place, and relationships; by intelligently reading and understanding them, the results can help judges, lawyers, and the general public obtain the required information more quickly and conveniently. This dataset is the first reading comprehension dataset based on Chinese judgment documents, and it belongs to span-extraction machine reading comprehension. To increase the diversity of questions, the dataset follows SQuAD and CoQA and adds unanswerable and YES/NO questions. Given that civil and criminal judgment documents differ greatly in their factual descriptions and the corresponding question types are not the same, CAIL provides separate civil and criminal test sets so that both types of documents are taken into account. An example from the CAIL dataset is shown in Fig. 1.

Fig. 1. An example item from the CAIL dataset.

To understand the properties of CAIL, we analyze the questions and answers in the development set. Specifically, we explore the numbers of the two types of judgment documents, the proportion of different answer types, and the distribution of document length (Figs. 2 and 3).

Fig. 2. Analysis of the dataset.

Fig. 3. Distribution of document length.

The CAIL training set consists mainly of span-extraction questions; it also contains 13% YES/NO questions and 3% unanswerable questions, so a reasonable solution is needed to deal with the different question types. CAIL documents are generally long, with more than 50% of the documents exceeding a length of 500, so long-text issues must be considered in the model design.

To do well on MRC with unanswerable questions, a model needs to comprehend the question, reason over the passage, judge the answerability, and then identify the answer span. Beyond answering the answerable questions, the main challenge of this task lies in how to reliably determine from the passage whether a question is unanswerable.

There are two kinds of approaches to modeling the answerability of a question. One approach directly extends previous MRC models by introducing a no-answer score into the score vector of the answer span (Levy et al. 2017; Clark and Gardner 2017). However, this kind of approach is relatively simple and cannot effectively model answerability. Another approach introduces an answer verifier to determine whether the question is unanswerable (Hu et al. 2018; Tan et al. 2018). However, such approaches usually have a pipeline structure: the answer pointer and the answer verifier are separate models trained separately. Intuitively, this is unnecessary, since the underlying comprehension and reasoning of language is the same for both components.

In this paper, we divide the questions into three categories: answerable questions, unanswerable questions, and YES/NO questions. If a question is judged to be YES/NO, it is turned into a classification problem. Otherwise, we first judge whether it can be answered; if so, we output the start and end points of the answer span, as sketched below.
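As an illustration, this three-way decision flow could be decoded as follows; the classifier head, score names, and no-answer threshold here are hypothetical stand-ins for the trained model's outputs, not an interface defined in this paper.

```python
import torch

def decode_answer(cls_logits, start_logits, end_logits, tokens, na_threshold=0.0):
    """Hypothetical decoding routine for the three question categories.

    cls_logits:   tensor of shape (3,) scoring (SPAN, YES, NO) -- assumed head
    start_logits: tensor of shape (seq_len,) with answer-start scores
    end_logits:   tensor of shape (seq_len,) with answer-end scores
    tokens:       passage tokens aligned with the logits
    """
    q_type = torch.argmax(cls_logits).item()
    if q_type == 1:
        return "YES"
    if q_type == 2:
        return "NO"
    # Span question: pick the best start, then the best end at or after it.
    start = torch.argmax(start_logits).item()
    end = start + torch.argmax(end_logits[start:]).item()
    span_score = (start_logits[start] + end_logits[end]).item()
    if span_score < na_threshold:          # below threshold -> unanswerable
        return ""
    return "".join(tokens[start:end + 1])  # Chinese text: join without spaces
```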

We propose a model called LBNet (Long-term recurrent attention network from BERT) that incorporates these three sub-tasks into a unified model: (1) an answer pointer to predict a candidate answer span for a question; (2) a no-answer pointer to avoid selecting any text span when a question has no answer; and (3) an answer verifier to determine the probability of "YES/NO" for a question given the candidate answer information. Our experimental results on the CAIL dataset show that LBNet effectively predicts the unanswerability of questions and achieves an F1 score of 83.5.

2 LBNet Model

For reading comprehension style question answering, a passage P and a question Q are given, and our task is to predict an answer A to question Q based on information found in P. The CAIL dataset further constrains answer A either to be a contiguous sub-span of passage P or to be YES/NO. Answer A often includes non-entities and can be a much longer phrase. This setup challenges us to understand and reason about both the question and the passage in order to infer the answer.

The BERT model is based on the powerful Transformer, which has itself broken the records set by deep neural network models in many natural language processing directions. In general, it can deal with many problems and achieve good results. However, the traditional long short-term memory network also has its advantages, because it can handle contextual relationships well and retain key information. One may therefore hope to achieve better results by combining the two models. Accordingly, we made some changes based on the original BERT model and explored a new model, LBNet, which can handle the machine reading task better.

LBNet is a contextual attention-based deep neural network for the task of conversational question answering. Its bottom layer is the input vector, constructed in the same way as in BERT as a combination of position embeddings, token embeddings, and segment embeddings. LBNet shares common stems with existing machine reading comprehension models, but it also has several unique characteristics for tackling contextual understanding during conversation. Firstly, LBNet applies self-attention on the passage and question to obtain a more effective understanding of the passage and dialogue history. Secondly, LBNet leverages the latest breakthrough in contextual embedding, BERT (Devlin et al. 2018). Different from the canonical way of appending a thin layer after the BERT structure (Devlin et al. 2018), we innovatively employ BiLSTM layers over the BERT outputs, with locked BERT parameters. Empirical results show that each of these components brings substantial gains in prediction accuracy. An illustration of the LBNet model is shown in Fig. 4.

Fig. 4. LBNet model for the CAIL dataset.

Formally, we can represent the MRC problem as follows: given a tuple \( (Q, P, A) \), where \( Q = (q_1, q_2, \ldots, q_m) \) is the question with m words, \( P = (p_1, p_2, \ldots, p_n) \) is the context passage with n words, and \( A = p_{r_s : r_e} \) is the answer, with \( r_s \) and \( r_e \) indicating the start and end points, the task is to estimate the conditional probability \( P(A \mid Q, P) \). LBNet consists of four major blocks: BERT & BiLSTM Encoding, Multi-Level Attention, Final Fusion, and Prediction.

We first combine the embedded representations of the question and passage with a universal node u and pass them through BERT and a BiLSTM to encode the whole text. We then use the encoded representation to model the information interaction. Next, we fuse the encoded and interacted representations into a full representation and feed it into the final prediction layers to conduct the prediction. We describe our model in detail in the following.

2.1 BERT and BiLSTM Encoding

  • Embedding

We first segment the Chinese sentences into words, then embed both the question and the passage with the following features. GloVe embeddings (Pennington et al. 2014) and ELMo embeddings (Peters et al. 2018) are used as basic embeddings. Besides, we use POS and NER embeddings (Luo et al. 2019): 12 dimensions to embed POS tags and 8 for NER tags, together with a feature embedding that includes the exact match, lower-case match, lemma match, and a TF-IDF feature. We now write the question as \( Q = \{w_t^Q\}_{t=1}^{m} \) and the passage as \( P = \{w_t^P\}_{t=1}^{n} \).

Consider the question \( Q = \{w_t^Q\}_{t=1}^{m} \) and the passage \( P = \{w_t^P\}_{t=1}^{n} \). We first convert the words to their respective word-level embeddings (\( \{e_t^Q\}_{t=1}^{m} \) and \( \{e_t^P\}_{t=1}^{n} \)) and character-level embeddings (\( \{c_t^Q\}_{t=1}^{m} \) and \( \{c_t^P\}_{t=1}^{n} \)). The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network (RNN) applied to the embeddings of the characters in each token. \( E_Q \) denotes Q's segment embeddings, \( E_P \) denotes P's segment embeddings, and \( E_i^{m+n+1} \) denotes the position embeddings. The input embeddings are the sum of the token embeddings (word-level and character-level), the segment embeddings, and the position embeddings. We thus obtain the question representation \( Q = \{q_i\}_{i=1}^{m} \) and the passage representation \( P = \{p_i\}_{i=1}^{n} \), where each word is represented as a d-dim embedding combining the features/embeddings described above.

The universal node \( u \) is first represented by a d-dim randomly-initialized vector; it connects the passage and the question. We concatenate the question representation Q, the universal node representation u, and the passage representation P as:

$$ V = \left[ Q, u, P \right] = \left[ q_1, q_2, \ldots, q_m, u, p_1, p_2, \ldots, p_n \right] $$
(1)

\( V \in \mathbb{R}^{d \times (m+n+1)} \) is a joint representation of the question, universal node, and passage. A minimal construction is sketched below.
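A minimal sketch of Eq. (1), under assumptions: the dimensions are illustrative, and we store V in the transposed, sequence-first layout \( (m+n+1) \times d \) rather than the paper's \( d \times (m+n+1) \) matrix.

```python
import torch
import torch.nn as nn

d, m, n = 250, 20, 400               # illustrative dimensions
Q = torch.randn(m, d)                # question embeddings q_1..q_m
P = torch.randn(n, d)                # passage embeddings p_1..p_n
u = nn.Parameter(torch.randn(1, d))  # trainable universal node

# Eq. (1): V = [Q, u, P], stored sequence-first as (m + n + 1, d)
V = torch.cat([Q, u, P], dim=0)
assert V.shape == (m + n + 1, d)
```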

  • Word-Level Fusion

We first use the BERT model (Devlin et al. 2018) and a bidirectional LSTM (BiLSTM) to fuse the joint representation of the question, universal node, and passage.

$$ H^1 = \mathrm{BERT}\left( V \right) $$
(2)

We then pass it through a BiLSTM and obtain a full representation \( H^f \):

$$ H^f = \mathrm{BiLSTM}\left( H^1 \right) $$
(3)

We concatenate \( H^1 \) and \( H^f \), so that \( H = [H^1; H^f] \) represents the deep word-level fusion of the question and passage. When a BiLSTM is applied to encode representations, it learns the semantic information bi-directionally. A sketch of this encoding step follows.
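A sketch of Eqs. (2)-(3) under stated assumptions: we assume a HuggingFace-style `bert` module whose forward call returns the sequence output first, and illustrative hidden sizes; the BERT parameters are frozen ("locked"), as described above.

```python
import torch
import torch.nn as nn

class WordLevelFusion(nn.Module):
    """Sketch of Eqs. (2)-(3): BERT gives H^1, a BiLSTM gives H^f,
    and their concatenation is H = [H^1; H^f]."""
    def __init__(self, bert, d_bert=768, d_lstm=125):
        super().__init__()
        self.bert = bert
        for p in self.bert.parameters():   # locked BERT parameters
            p.requires_grad = False
        self.bilstm = nn.LSTM(d_bert, d_lstm, bidirectional=True,
                              batch_first=True)

    def forward(self, input_ids, token_type_ids, attention_mask):
        h1 = self.bert(input_ids, token_type_ids=token_type_ids,
                       attention_mask=attention_mask)[0]   # H^1, Eq. (2)
        hf, _ = self.bilstm(h1)                            # H^f, Eq. (3)
        return torch.cat([h1, hf], dim=-1)                 # H = [H^1; H^f]
```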

2.2 Multi-level Attention

To fully fuse the semantic representation of the question and passage, we use the attention mechanism (Bahdanau et al. 2014) to capture their interactions on different levels.

We first divide H into two parts, the question representation \( H_q \) and the passage representation \( H_p \), and attach the universal node representation \( h_{m+1} \) to both the passage and the question, i.e.

$$ H_q = \left[ h_1, h_2, \ldots, h_{m+1} \right] $$
(4)
$$ H_p = \left[ h_{m+1}, h_{m+2}, \ldots, h_{m+n+1} \right] $$
(5)

Since both \( H_q = [H_q^l, H_q^f] \) and \( H_p = [H_p^l, H_p^f] \) are concatenations of multi-level representations, we follow the previous work FusionNet (Huang et al. 2017) to construct their interactions level by level. Take the first level as an example. We first compute the affine matrix of \( H_q^l \) and \( H_p^l \) by

$$ S = \left( \mathrm{ReLU}\left( W_1 H_q^l \right) \right)^{\mathrm{T}} \mathrm{ReLU}\left( W_2 H_p^l \right) $$
(6)

where \( S \in \mathbb{R}^{(m+1) \times (n+1)} \), and \( W_1 \) and \( W_2 \) are learnable parameters. Next, a bi-directional attention is used to compute the interacted representations \( \widetilde{H}_q^l \) and \( \widetilde{H}_p^l \).

$$ \widetilde{H}_q^l = H_p^l \times \mathrm{softmax}\left( S^{\mathrm{T}} \right) $$
(7)
$$ \widetilde{H}_p^l = H_q^l \times \mathrm{softmax}\left( S \right) $$
(8)

where softmax(·) is a column-wise normalization function. We use the same attention layer to model the interactions for all the levels, and obtain the final fused representations \( \widetilde{H}_q \) and \( \widetilde{H}_p \) for the question and passage respectively. A sketch of this co-attention step follows.
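A sketch of the level-wise co-attention of Eqs. (6)-(8), written in sequence-first layout (the paper writes the matrices with the feature dimension first); `W1` and `W2` are the learnable projections of Eq. (6), with an assumed projection width k.

```python
import torch
import torch.nn.functional as F

def co_attention(Hq, Hp, W1, W2):
    """Hq: (m+1, d) question side; Hp: (n+1, d) passage side;
    W1, W2: (k, d) learnable projections."""
    S = torch.relu(Hq @ W1.T) @ torch.relu(Hp @ W2.T).T   # (m+1, n+1), Eq. (6)
    # Normalize over the attended side, matching the column-wise softmax.
    Hq_t = F.softmax(S, dim=1) @ Hp    # passage-aware question, Eq. (7)
    Hp_t = F.softmax(S, dim=0).T @ Hq  # question-aware passage, Eq. (8)
    return Hq_t, Hp_t
```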

2.3 Final Fusion

After the multi-level attentive interaction, we generate the final fused information for the question and passage. Following the work of Sun (2018), we concatenate all the history information: we first concatenate the encoded representation \( H \) and the representation after attention \( \widetilde{H} \) (again, we use \( H^l, H^f \) and \( \widetilde{H}^l, \widetilde{H}^f \) to denote the two levels of representation from the two previous steps respectively).

First, we pass the concatenated representation H through a BiLSTM to get \( {\text{H}}^{\text{A}} \).

$$ H^A = \mathrm{BiLSTM}\left( \left[ H^l; H^f; \widetilde{H}^l; \widetilde{H}^f \right] \right) $$
(9)

where the representation \( H^{A} \) is a fusion of information from different levels.

Then we concatenate the original embedded representation V and \( H^A \) for a better representation of the fused information of the passage, universal node, and question:

$$ A = \left[ V; H^A \right] $$
(10)

Finally, we use a self-attention layer to capture the attention information within the fused representation.

$$ \widetilde{A} = A \times \mathrm{softmax}\left( A^{\mathrm{T}} A \right) $$
(11)

Next we concatenate \( {\text{H}}^{\text{A}} \) and \( {\tilde{\text{A}}} \) and pass them through another BiLSTM layer.

$$ O = \mathrm{BiLSTM}\left( \left[ H^A; \widetilde{A} \right] \right) $$
(12)

We divide O into two parts, \( O^Q \) and \( O^P \), which denote the fused information of the question and passage respectively:

$$ O^Q = \left[ o_1; o_2; \ldots; o_m \right] $$
(13)
$$ O^P = \left[ o_{m+1}; o_{m+2}; \ldots; o_{m+n+1} \right] $$
(14)
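A sketch of the final fusion block of Eqs. (9)-(14), under assumed, illustrative widths (each of the four level representations is taken to have the same width `d_rep`, and inputs use batch-first (B, L, d) layout).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FinalFusion(nn.Module):
    """Sketch of Eqs. (9)-(14); widths are illustrative assumptions."""
    def __init__(self, d_rep, d_v, d_lstm=125):
        super().__init__()
        # Eq. (9): BiLSTM over [H^l; H^f; H~^l; H~^f], each of width d_rep
        self.fuse = nn.LSTM(4 * d_rep, d_lstm, bidirectional=True,
                            batch_first=True)
        d_a = d_v + 2 * d_lstm                  # width of A = [V; H^A]
        # Eq. (12): BiLSTM over [H^A; A~]
        self.out = nn.LSTM(2 * d_lstm + d_a, d_lstm, bidirectional=True,
                           batch_first=True)

    def forward(self, Hl, Hf, Hl_t, Hf_t, V, m):
        HA, _ = self.fuse(torch.cat([Hl, Hf, Hl_t, Hf_t], -1))  # Eq. (9)
        A = torch.cat([V, HA], -1)                              # Eq. (10)
        A_t = F.softmax(A @ A.transpose(1, 2), -1) @ A          # Eq. (11)
        O, _ = self.out(torch.cat([HA, A_t], -1))               # Eq. (12)
        return O[:, :m], O[:, m:]            # O^Q, O^P, Eqs. (13)-(14)
```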

2.4 Prediction

We follow the work of Wang and Jiang (2015) and use pointer networks (Vinyals et al. 2015) to predict the start and end position of the answer.

First, we use a function shown below to summarize the question information \( {\text{O}}^{\text{Q}} \) into a fixed-dim representation \( {\text{c}}_{\text{q}} \).

$$ c_q = \sum\nolimits_{i} \frac{\exp \left( W^{\mathrm{T}} o_i^Q \right)}{\sum\nolimits_{j} \exp \left( W^{\mathrm{T}} o_j^Q \right)} \, o_i^Q $$
(15)

We use two trainable matrices \( W_s \) and \( W_e \) to estimate the probabilities \( \alpha_i \) and \( \beta_i \) that the answer starts and ends at the i-th word of the passage.

$$ \alpha_i \propto \exp \left( c_q W_s o_i^P \right) $$
(16)
$$ \beta_i \propto \exp \left( c_q W_e o_i^P \right) $$
(17)

We then use the weights obtained from the answer pointer to compute two summary representations of the passage.

$$ c_s = \sum\nolimits_{i} \alpha_i \cdot o_i^P $$
(18)
$$ c_e = \sum\nolimits_{i} \beta_i \cdot o_i^P $$
(19)

To train the network, we minimize the sum of the negative log probabilities of the ground-truth start and end positions under the predicted distributions. A sketch of the pointer layer and its loss follows.
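A sketch of the pointer layer and training loss; the bilinear parameterization and initialization are our reading of Eqs. (15)-(17), with an illustrative dimension d.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerPointer(nn.Module):
    """Sketch of Eqs. (15)-(17) plus the span loss."""
    def __init__(self, d):
        super().__init__()
        self.w = nn.Linear(d, 1, bias=False)    # W in Eq. (15)
        self.Ws = nn.Parameter(torch.empty(d, d).uniform_(-0.1, 0.1))
        self.We = nn.Parameter(torch.empty(d, d).uniform_(-0.1, 0.1))

    def forward(self, OQ, OP, start=None, end=None):
        # Eq. (15): attention-pooled question summary c_q, shape (B, d)
        a = F.softmax(self.w(OQ).squeeze(-1), dim=-1)
        cq = torch.einsum('bm,bmd->bd', a, OQ)
        # Eqs. (16)-(17): bilinear boundary scores over passage words
        s_logits = torch.einsum('bd,de,bne->bn', cq, self.Ws, OP)
        e_logits = torch.einsum('bd,de,bne->bn', cq, self.We, OP)
        if start is None:
            return F.softmax(s_logits, -1), F.softmax(e_logits, -1)
        # Sum of negative log-likelihoods of the gold start/end positions
        return F.cross_entropy(s_logits, start) + F.cross_entropy(e_logits, end)
```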

3 Experiment

3.1 Dataset

The dataset used in the technical evaluation of this task is provided by HKUST Xunfei. The dataset mainly comes from the judgment documents published on China Judgments Online, and includes criminal and civil first-instance judgment documents.

The training set contains about 40,000 questions, and the development set and test set each contain about 5,000 questions. For the development and test sets, each question has three manually labeled reference answers.

In view of the large differences in the factual descriptions of civil and criminal judgment documents, and since the corresponding question types are not the same, the data are divided into civil and criminal test sets so as to take both types of judgment documents into account and thereby cover most judgment documents.

3.2 Metrics

This task is evaluated using a macro-average F1 that is consistent with the CoQA competition. For each question, the prediction is compared with the N reference answers to obtain N F1 scores, and the maximum is taken as the question's F1 value. However, when assessing human performance, each reference answer is scored against the other N-1 references. To compare the two more fairly, the N references are divided into N groups of N-1 answers each; the F1 value of each question is then the average of the N group F1 scores, and the F1 value of the entire dataset is the average over all questions.

$$ L_g = \mathrm{len}\left( gold \right) $$
(20)
$$ L_p = \mathrm{len}\left( pred \right) $$
(21)
$$ L_c = \mathrm{InterSec}\left( gold, pred \right) $$
(22)
$$ precision = \frac{L_c}{L_p} $$
(23)
$$ recall = \frac{L_c}{L_g} $$
(24)
$$ f1\left( gold, pred \right) = \frac{2 \times precision \times recall}{precision + recall} $$
(25)
$$ Avef1 = \frac{\sum\nolimits_{i = 1}^{Count_{ref}} \max \left( f1\left( gold_i, pred \right) \right)}{Count_{ref}} $$
(26)
$$ F1_{macro} = \frac{\sum\nolimits_{i = 1}^{N} Avef1_i}{N} $$
(27)

InterSec computes the word-level intersection of the predicted answer and a reference answer; \( Count_{ref} \) denotes the number of reference answers (three); the max takes the maximum F1 value between the predicted answer and each reference answer. The final score is the average of the average F1 values over the criminal and civil test sets. A sketch of this metric follows.
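A minimal sketch of the metric in Eqs. (20)-(27); token lists stand in for the word-level comparison, and the human-performance grouping described above is omitted.

```python
from collections import Counter

def f1(gold, pred):
    """Word-level F1 of Eqs. (20)-(25); gold and pred are token lists."""
    common = Counter(gold) & Counter(pred)
    lc = sum(common.values())                 # L_c, Eq. (22)
    if lc == 0:
        return 0.0
    precision = lc / len(pred)                # Eq. (23)
    recall = lc / len(gold)                   # Eq. (24)
    return 2 * precision * recall / (precision + recall)   # Eq. (25)

def question_f1(golds, pred):
    """Maximum over the N reference answers for one question (cf. Eq. (26))."""
    return max(f1(g, pred) for g in golds)

def macro_f1(examples):
    """Macro-average over all questions (Eq. (27)).
    examples: iterable of (reference_token_lists, predicted_tokens)."""
    scores = [question_f1(golds, pred) for golds, pred in examples]
    return sum(scores) / len(scores)
```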

3.3 Implementation Details

We use spaCy to process each question and passage to obtain the tokens, POS tags, and NER tags of each text. We use 10 dimensions to embed POS tags and 10 for NER tags (Luo et al. 2019). We use 100-dim GloVe pretrained word embeddings and 1024-dim ELMo embeddings. All the LSTM blocks are bi-directional with a single layer. We set the hidden layer dimension to 125 and the attention layer dimension to 250. We add a dropout layer over all the modeling layers, including the embedding layer, at a dropout rate of 0.3. We use the Adam optimizer with a learning rate of 0.002.

3.4 Experimental Results and Analysis

  • Baseline Models and Metrics

We compare LBNet with the following baseline models: LibSVM (Chang et al. 2011), BiDAF (Seo et al. 2016), BERT (Devlin et al. 2018), and ERNIE (Zhang et al. 2019). The dataset is randomly partitioned into a training set (80%) and a development set (20%). We use F1 as the evaluation metric, which is the harmonic mean of word-level precision and recall between the predicted answer and the ground truth.

4 Results

Table 1 shows the experimental results of LBNet and the baseline models on the CAIL datasets. As shown in Table 1, LBNet achieves better results than all baseline models. In detail, LBNet improves F1 by 19.8, 16.4, 7.8, and 6.7 on the civil dataset and by 17.3, 14.8, 7, and 4.2 on the criminal dataset compared with LibSVM, BiDAF, ERNIE, and BERT, respectively. Note that we use the pretrained models of BERT and ERNIE. BERT uses a masked language model (MLM) to obtain context-dependent bidirectional feature representations. ERNIE introduces knowledge, combining entity vectors with the textual representation. Different from the previous models, we use a unified representation to encode the question and passage simultaneously, introduce a universal node that plays an important role in predicting the unanswerability of a question, and use a BiLSTM to encode the embedded representation, which is very effective in fusing the information of the question and passage.

Table 1. Experimental results (F1) on the CAIL dataset

5 Conclusions

In this paper, we propose a novel contextual attention-based model, LBNet, to tackle judicial reading comprehension tasks. For the joint learning of different question types, we design an end-to-end model with three sub-tasks: answer-span extraction, YES/NO classification, and unanswerable-question detection. In this way, the different question types are learned in a unified manner. For the long-text problem, we draw on the preprocessing idea of the fine-tuning solutions for the SQuAD dataset: a sliding window cuts each long text into multiple doc_spans during preprocessing, and for words that appear in multiple spans, the doc_span in which the word has the "maximum context" prevails when the score is subsequently calculated. Following an in-depth analysis of the dataset, we found that some questions follow regular patterns and some predicted answers can be further corrected, so a post-processing module was added to the overall model structure to further improve performance. By leveraging inter-attention and self-attention and using a BiLSTM over the passage and conversation history, the model is able to comprehend the dialogue flow and fuse it with its digestion of the passage content. Furthermore, we incorporate the latest breakthrough in NLP, BERT, and leverage it in an innovative way. LBNet achieves good results compared with previous approaches: on the CAIL dataset, it achieves an F1 score of 83.5 and an accuracy of 81.3. In the future, we will further optimize the network structure and parameters to obtain more accurate results.