
1 Introduction

Industry classification is the problem of assigning companies to specific industry categories according to their primary business sectors, market performance, and major products [1]. It is essential to research in the financial field, as dividing companies into homogeneous groups helps academic researchers narrow the scope of their investigations, identify comparable companies, and set performance benchmarks [2]. It also reflects the industry characteristics of companies and provides investors with insight into market trends.

Unlike A-share companies with persistent main business sectors, small and medium-sized enterprises (SMEs), especially newly founded startups, usually react to the ever-evolving demands of the market by changing their main businesses frequently. For startups that aim to go public, industry classification can help them benchmark against existing A-share companies and identify potential competitors. Although there are already mature industry classification schemes for A-share companies, such as the Global Industry Classification Standard (GICS), there is still a lack of datasets on startup companies for further study.

Industry classification can be cast as a financial text classification task, but common deep neural networks do not perform well on such domain-specific tasks. Text classification is a fundamental task in natural language processing, and numerous methods have been proposed, such as TextCNN [6] and BERT [4]. However, professional terms carry special meanings that require additional explanation to be understood. Recent studies have attempted to integrate knowledge graphs into these basic models. Zhang et al. [15] propose an enhanced language representation model, but the model ignores the relations between entities. W. Liu et al. [10] transform the input sentence into a knowledge-rich sentence tree and introduce soft-position embeddings and a visible matrix; still, it only considers triples relevant to the entities present in the sentence and does not expand relations in the knowledge graph.

Table 1. The annual business reports of one company and their corresponding classification labels.

To address the problems mentioned above, we focus in this work on integrating word representation and knowledge. As a first step, we construct a dataset on startup companies for the industry classification task. The dataset contains the annual business reports of companies listed on NEEQ and their corresponding labels. These companies are typically SMEs, and their classifications may waver over the years. For instance, as listed in Table 1, a firm renamed its security from Daocong Technology to Gaiya Entertainment, with its leading business sector changing from transportation to mobile games. Second, we propose a Knowledge Graph Enriched BERT (KGEB), which can load any pre-trained BERT model and be fine-tuned for classification. It makes full use of the structure of the knowledge graph extracted from texts by entity linking and node expansion. Finally, experiments are conducted on the dataset, and the results demonstrate that KGEB achieves superior performance.

The contribution of this work is threefold: (1) A large dataset is constructed for industry classification based on the companies listed on NEEQ, consisting of companies' descriptions of their business models and corresponding labels. (2) A Knowledge Graph Enriched BERT (KGEB), which understands domain-specific texts by integrating both word and knowledge representations, is proposed and demonstrated to be beneficial. (3) KGEB obtains 0.9198 Accuracy and 0.9089 F1, outperforming the competitive baselines and demonstrating that the proposed approach improves classification quality.

2 NEEQ Industry Classification Dataset

NEEQ is the third national securities trading venue in China, after the Shanghai Stock Exchange and the Shenzhen Stock Exchange. We construct the industry classification dataset from the NEEQ website as follows: 1) we acquire 20,040 business model descriptions from 2014 to 2017 from the open-source dataset [1]; 2) for each business model description, we obtain the release time of the report and retrieve the investment-oriented industry classification result published immediately after that time; 3) by filtering and cleaning repeated descriptions, we obtain the final dataset, which consists of 17,604 pairs of business model descriptions and industry classification labels. We split the dataset into a training set (80%), a dev set (10%), and a test set (10%). The maximum length of a business model description is 13,308, the minimum is 38, and the median is 630. On average, each company contributes 1.79 different business model descriptions, reflecting the wavering nature of startup companies. Table 2 summarizes preliminary information about the industry classification dataset. The dataset is freely available at https://github.com/theDyingofLight/neeq_dataset.

Table 2. Overview of the NEEQ industry classification dataset.
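A minimal sketch of this preparation is shown below; the file and column names are hypothetical, and only the deduplication step and the 80/10/10 split follow the description above.

```python
# Hypothetical sketch of the dataset preparation; file and column names are
# illustrative, only deduplication and the 80/10/10 split follow the paper.
import pandas as pd

df = pd.read_csv("neeq_business_models.csv")        # one business model description per row
df = df.drop_duplicates(subset=["description"])     # filter repeated descriptions

df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)   # shuffle
n = len(df)
train = df.iloc[: int(0.8 * n)]                     # 80% training set
dev = df.iloc[int(0.8 * n): int(0.9 * n)]           # 10% dev set
test = df.iloc[int(0.9 * n):]                       # 10% test set
```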

3 Methodology

The text classification task can be defined as follows. Given a passage denoted as \(X=\{x_1,x_2,...,x_n\}\), where n is the length of the passage and Chinese tokens are at the character level, the model's target is to predict the classification label \(\hat{Y}=\arg \max _{Y} P(Y|X,\theta )\), where \(\theta \) denotes the model parameters. Our overall approach is depicted in Fig. 1.

Fig. 1. The overview of our approach. It contains three steps: (1) Build the local knowledge graphs. (2) Transform the graphs into node representations. (3) Combine the input passage representation with the node representations for classification.

Local Knowledge Graph. Given an input passage, a set of triples can be retrieved from the knowledge base by linking the mentions parsed from the passage to entities in the knowledge base and expanding relation paths. We define this set of triples as a Local Knowledge Graph. Formally, a knowledge base is represented as \(K=(V,E)\), where \(V=\{v_j\}\) is the set of vertices and \(E=\{e_j\}\) is the set of edges between vertices, and each triple (head entity, relation, tail entity) in the knowledge base is denoted as \(\tau =(v_h,e_{hs},v_s)\). The local knowledge graph is defined as \(G=\{\tau _1,\tau _2,...,\tau _{g}\}\), where g is the number of triples. The local knowledge graph is constructed as follows. First, for each passage X, we conduct mention parsing to obtain mentions and entity disambiguation to obtain pairs of entities and nodes from the knowledge base XLore [13] with the entity linking system XLink [5]. We rank all candidate nodes by their cosine similarity with the word embedding [9] and keep the entities whose similarity exceeds a threshold of 0.4, retaining at most the top 10; a sketch of this procedure is given below.
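The sketch is a minimal illustration rather than the authors' implementation: the mention parser, candidate lookup, and neighbor expansion are supplied by the caller because the real XLink and XLore APIs are not reproduced here; only the 0.4 similarity threshold and the top-10 cutoff come from the text.

```python
# Minimal sketch of local knowledge graph construction under the assumptions above.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def build_local_kg(passage, parse_mentions, get_candidates, expand_neighbors,
                   embed, threshold=0.4, top_k=10):
    """Return the local knowledge graph G as a list of (head, relation, tail) triples."""
    triples = []
    for mention in parse_mentions(passage):                      # mention parsing
        candidates = get_candidates(mention)                     # candidate nodes from the KB
        scored = [(c, cosine(embed(mention), embed(c))) for c in candidates]
        scored = [s for s in scored if s[1] > threshold]         # keep similarity > 0.4
        scored = sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]  # at most top 10
        for node, _ in scored:
            triples.extend(expand_neighbors(node))               # expand relation paths
    return triples
```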

Node Representation. After obtaining the local knowledge graph G with g nodes, we feed G into an \(L\)-layer GCN model [14] to obtain a representation of each node, where \(h^{(l-1)}_i\) denotes the input vector and \(h^{(l)}_i\) the output vector of node i at the \(l\)-th layer. The calculation is: \(h^{(l)}_i=\sigma {(\sum ^{g}_{j=1}{\widetilde{A}_{ij}W^{(l)}h^{(l-1)}_j/d_{i}+b^{(l)}})}\), where \(\boldsymbol{\widetilde{A}}=\boldsymbol{A}+\boldsymbol{I}\) is the sum of the adjacency matrix \(\boldsymbol{A}\) and the identity matrix \(\boldsymbol{I}\), \(d_{i}=\sum ^{g}_{j=1}{\widetilde{A}_{ij}}\) is the degree of entity i in the local knowledge graph, \(W^{(l)}\) is a trainable linear transformation, and \(\sigma \) is a nonlinear function. We initialize each node embedding with the output of a pre-trained model, which takes all the words in the node as input and outputs a fixed-length vector. The output of the last GCN layer is used as the node representation \(H={\{h^{(L)}_1,h^{(L)}_2,...,h^{(L)}_g\}}\).
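A minimal PyTorch sketch of one such GCN layer is shown below; the degree-normalized aggregation follows the formula above, while the class name and the choice of ReLU for the unspecified nonlinearity \(\sigma \) are illustrative.

```python
# Minimal PyTorch sketch of one GCN layer as defined above; ReLU stands in for sigma.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.linear = nn.Linear(dim, dim)   # W^(l) and b^(l)

    def forward(self, h, adj):
        # h: (g, dim) node vectors h^(l-1);  adj: (g, g) adjacency matrix A
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)   # A~ = A + I
        deg = a_tilde.sum(dim=1, keepdim=True)                      # d_i
        out = a_tilde @ self.linear(h) / deg                        # sum_j A~_ij W h_j / d_i + b
        return torch.relu(out)                                      # sigma(.)

# Stacking L = 4 such layers and taking the last output gives H = {h_1^(L), ..., h_g^(L)}.
```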

Knowledge Graph Enriched BERT. Knowledge Graph Enriched BERT is proposed to enrich the representation of a long passage with node representations from the local knowledge graph. As a multi-layer bidirectional Transformer encoder, BERT maps an input sequence of characters X to a sequence of representations \(Z=\{z_1,z_2,...,z_n\}\). To fuse the node representation into the word embedding layer, we utilize an attention mechanism to integrate the word embeddings \(W=\{w_1,w_2,...,w_n\}\) and the node representations \(H={\{h^{(L)}_1,h^{(L)}_2,...,h^{(L)}_g\}}\), formulated as: \(\alpha _t=softmax(H^{T}W^{P}w_t)\), \(w'_t=H\cdot \alpha _t\), where \(W^{P}\) is a trainable parameter matrix and \(W'=\{w'_1,w'_2,...,w'_n\}\) is the output of the fusion. We then add a residual connection to the original word embedding to avoid vanishing gradients. We also adopt position embeddings and token type embeddings consistent with BERT and sum the three embeddings as the output of the embedding layer. The output is then fed into a stack of identical layers containing a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [12]. We utilize the final hidden vector \(z_1\in \mathbb {R}^H\) corresponding to the first input token ([CLS]) to represent the entire sequence and introduce classification layer weights \(W\in \mathbb {R}^{K\times H}\), where K is the number of labels.
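The attention-based fusion and the residual connection can be sketched as follows; the shapes follow the notation above (n word embeddings and g node representations, both of dimension 768), while the module and parameter names are illustrative.

```python
# Sketch of fusing node representations into the word embedding layer of KGEB.
import torch
import torch.nn as nn

class NodeFusion(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)      # W^P

    def forward(self, word_emb, node_repr):
        # word_emb: (n, dim) word embeddings W;  node_repr: (g, dim) node representations H
        scores = node_repr @ self.proj(word_emb).t()     # (g, n): H^T W^P w_t for each token t
        alpha = torch.softmax(scores, dim=0)             # alpha_t over the g nodes
        fused = alpha.t() @ node_repr                    # (n, dim): w'_t = H . alpha_t
        return word_emb + fused                          # residual connection on word embeddings
```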

We compute a standard classification loss with \(z_1\) and W, where \(I^*\) denotes the target category: \(O=softmax(z_1W^T)\), \(\mathcal {L}=-\log O_{I^*}\).
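This is the standard cross-entropy loss over the [CLS] vector; a minimal sketch is shown below, with an illustrative number of labels K.

```python
# Cross-entropy over the [CLS] representation z_1; K is illustrative.
import torch
import torch.nn as nn

K, hidden = 18, 768                              # K labels (illustrative), hidden size H
classifier = nn.Linear(hidden, K, bias=False)    # weights W in R^{K x H}
criterion = nn.CrossEntropyLoss()                # equals -log softmax(z_1 W^T)[I*]

z1 = torch.randn(1, hidden)                      # final hidden vector of [CLS]
target = torch.tensor([3])                       # target category I*
loss = criterion(classifier(z1), target)
```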

4 Experiments

Our experiments study the proposed model on the NEEQ industry classification dataset, compare the model with existing approaches and analyze the results.

4.1 Experimental Settings

We compare against five models, all implemented from open-source code. (1) GCN [8]: a fundamental GNN model for the classification task. (2) Logistic Regression [11]: a basic linear model for classification. (3) TextCNN [6]: a CNN with one convolution layer on top of word vectors. (4) BERT [4]: a language model pre-trained on a large-scale corpus to obtain deep bidirectional representations, which has set new records on many downstream tasks. (5) K-BERT [10]: it enhances a language representation model with knowledge graphs by injecting relevant triples into the input sentence, which is then fed into the embedding layer, the seeing layer, and the mask-transformer.

In all our experiments, we initialize the word embedding with the parameters of bert-base-chinese, which has a hidden size of 768 and 12 hidden layers. In the fine-tuning process, we use a batch size of 8, 8 gradient accumulation steps, and a learning rate of 5e-5. In the embedding layer, we use bert-as-service equipped with bert-base-chinese to obtain the initial node embeddings of dimension 768; the similarity threshold for entities in XLore is 0.4, the GCN has 4 layers, and the dropout in the GCN layers is 0.5. Models are trained with the Adam optimizer [7]. We select the model based on the dev set.
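For reference, these settings can be collected into a short configuration sketch; the model construction below is schematic, and only the hyperparameter values listed above are taken from the paper.

```python
# Schematic fine-tuning configuration; only the values below follow the paper.
import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-chinese")   # hidden size 768, 12 layers

config = {
    "batch_size": 8,
    "grad_accumulation_steps": 8,
    "learning_rate": 5e-5,
    "node_embedding_dim": 768,      # from bert-as-service with bert-base-chinese
    "similarity_threshold": 0.4,    # entity filtering against XLore
    "gcn_layers": 4,
    "gcn_dropout": 0.5,
}
optimizer = torch.optim.Adam(bert.parameters(), lr=config["learning_rate"])
```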

We conduct automatic evaluation with the following metrics: Accuracy measures the proportion of correctly predicted samples among all samples. F1 is the macro average of Precision and Recall, which measures correctness across all categories.
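Both metrics can be computed with, for example, scikit-learn, as in the short sketch below; the labels are purely illustrative.

```python
# Accuracy and macro-averaged F1 as used for evaluation.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 2, 1, 2, 1]      # illustrative gold labels
y_pred = [0, 2, 2, 2, 1]      # illustrative predictions
accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")   # macro average over categories
```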

4.2 Results and Analysis

Table 3 shows the experimental results against the competitor methods. GCN achieves the worst performance since it only utilizes the local knowledge graphs and misses information from the passages. Logistic Regression relies heavily on the statistical information of words and can benefit from long texts such as those in our dataset. TextCNN is a CNN model designed for text classification; it can represent text sequences with a deep neural network, but it does not model long passages well and lacks domain-specific knowledge. Likewise, BERT is initialized with pre-trained parameters and improves significantly, but it still has trouble understanding domain-specific texts. Although the result of K-BERT is worse than that of our model, it demonstrates the influence of knowledge graphs. Compared with CN-DBpedia [3], from which K-BERT extracts knowledge triples, XLore contains 3.6 times as many entities, which contributes to a deeper comprehension of domain information.

Table 3. The experimental results (%) on the NEEQ industry classification dataset.

Compared with the competitor methods, our model takes advantage of knowledge. Complementing words with node representations is helpful because it provides additional information and accounts for the structure of the graphs, allowing the word embeddings to select more useful information from the nodes. Our model achieves absolute improvements of at least +0.5 in Accuracy and +0.94 in F1.

Table 4. The experimental results (%) for each category on the NEEQ industry classification dataset.

As Table 4 shows, the experimental results on each category support the effectiveness of KGEB. TextCNN performs worst on almost all classes. BERT performs better, but in most categories KGEB achieves the highest Precision, Recall, and F1 score, demonstrating that the additional knowledge not only helps distinguish terms and draw attention to domain-specific words but also helps the model understand the meaning of decisive words. However, KGEB is still far from a perfect classifier, and there are descriptions our current model cannot classify well. These labels could be predicted correctly with a better entity linking system and better node expanding strategies. Taking a text labeled "Telecommunication Service" as an example, the local knowledge graph is constructed from the entities "digital television", "health", and "care" in the text "In addition to retaining the original digital television business, the future will be based on the field of health care", which can introduce noise.

4.3 Case Study

Comparing the predicted labels of TextCNN, BERT, and KGEB, our model improves the recall of the labels with respect to the baselines. In cases where all of them predict correctly, the keywords in the sentence help the models increase the probability of the correct label. In cases where BERT and TextCNN predict wrong labels, such as the sentence "The company is committed to the production and sales of core equipment for water treatment and recycling of domestic sewage", BERT and TextCNN label the description "Industrials", but owing to the additional information about "sewage disposal", KGEB obtains the correct prediction. In some cases, all of the models assign the wrong label; if we removed the noise introduced by the entity linking system and the node expanding strategies, we could obtain the correct classification, as in the last example.

5 Conclusions

In this paper, we construct a large dataset for the industry classification task and propose a novel Knowledge Graph Enriched BERT, which extracts a local knowledge graph from business descriptions and integrates word and knowledge representations. The experimental results outperform the competitor methods and demonstrate the effectiveness of our proposed model. In the future, we will continue to improve this work and extend the method to more applications.