Abstract
Industry classification for startup companies is useful not only for guiding investment strategies but also for finding potential competitors. It is essentially a challenging domain-specific text classification task. Because no suitable dataset exists, in this paper we first construct a dataset for industry classification based on the companies listed on the Chinese National Equities Exchange and Quotations (NEEQ), which consists of 17,604 annual business reports and their corresponding industry labels. Second, we introduce a novel Knowledge Graph Enriched BERT model (KGEB), which can understand domain-specific text by enhancing the word representation with external knowledge and can make full use of the local knowledge graph without pre-training. Experimental results show the promising performance of the proposed model and demonstrate its effectiveness on the domain-specific classification task.
1 Introduction
Industry classification is the problem of assigning companies to specific industry categories according to their primary business sectors, market performance and major products [1]. It is essential to research in the financial field, as dividing companies into homogeneous groups helps academic researchers narrow the scope of their investigation, identify comparable companies and set performance benchmarks [2]. It can also reflect the industry characteristics of companies and inform investors about market trends.
Unlike A-share companies, whose main business sectors are persistent, small and medium-sized enterprises (SMEs), especially new startup companies, usually react to the ever-evolving demand of the market by changing their main businesses frequently. For startup companies that aim at public trading, classification can help them catch up with the existing A-share companies and find potential competitors. Industry classification schemes such as the Global Industry Classification Standard (GICS) are already widely applied to A-share companies, but there is still a lack of datasets on startup companies for further study.
Industry classification can be framed as a financial text classification task, and common deep neural networks do not perform well on such domain-specific tasks. Text classification is a fundamental task in natural language processing, and numerous methods have been proposed for it, such as TextCNN [6] and BERT [4]. However, professional terms carry special meanings that require additional explanation to be understood. Recent studies have attempted to integrate knowledge graphs into base models. Zhang et al. [15] propose an enhanced language representation model, but the model ignores the relations between entities. Liu et al. [10] transform the input sentence into a knowledge-rich sentence tree and introduce soft positions and a visible matrix. Still, their model considers only the triples directly relevant to the entities present in the sentence and disregards the expanded relations in the knowledge graph.
To address the problems mentioned above, in this work we focus on the integration of word representation and knowledge. As an effort towards this goal, we first construct a dataset on startup companies for the industry classification task. The dataset contains the annual business reports of companies on NEEQ and their corresponding labels. These companies are typically SMEs, and their classifications can change over the years. For instance, as listed in Table 1, one firm renamed its security from Daocong Technology to Gaiya Entertainment, with its leading business sector changing from transportation to mobile games. Second, we propose a Knowledge Graph Enriched BERT (KGEB), which can load any pre-trained BERT model and be fine-tuned for classification. It makes full use of the structure of the knowledge graphs extracted from texts by entity linking and node expansion. Finally, experiments are conducted on the dataset, and the results demonstrate that KGEB achieves superior performance.
The contribution of this work is threefold: (1) A large dataset is constructed for industry classification based on the companies listed on NEEQ, consisting of companies’ descriptions of their business models and the corresponding labels. (2) A Knowledge Graph Enriched BERT (KGEB), which can understand domain-specific texts by integrating both word and knowledge representations, is proposed and shown to be beneficial. (3) KGEB obtains 0.9198 in Accuracy and 0.9089 in F1, outperforming the competing methods and demonstrating that the proposed approach improves classification quality.
2 NEEQ Industry Classification Dataset
NEEQ is the third national securities trading venue after the Shanghai Stock Exchange and the Shenzhen Stock Exchange. We construct the industry classification dataset based on the NEEQ website, and the process is summarized as follows: (1) We acquire 20,040 business model descriptions from 2014 to 2017 from the open-source dataset [1]. (2) For each business model description, we obtain the release time of the report and look up the investment-based industry classification result issued immediately after the release time. (3) By filtering and cleaning repeated descriptions, we obtain the final dataset, which consists of 17,604 pairs of business model descriptions and their industry classification labels. We split the dataset into a training set (80%), a dev set (10%), and a test set (10%). Across the dataset, the maximum length of a business model description is 13,308, the minimum is 38, and the median is 630. On average, each company contributes 1.79 different business model descriptions, which reflects the changing nature of startup companies. Table 2 summarizes the preliminary information about the industry classification dataset. The dataset is freely available at https://github.com/theDyingofLight/neeq_dataset.
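The deduplication and 80/10/10 split described above could be sketched as follows. This is only an illustration: the paper does not state how the split was drawn, so the seeded random shuffle, the `(description, label)` tuple layout and the function name `dedupe_and_split` are all assumptions.

```python
import random

def dedupe_and_split(pairs, seed=0):
    """Split (description, label) pairs into train/dev/test (80/10/10).

    Assumption: repeated descriptions are dropped by exact text match and
    the split is a seeded random shuffle; the paper does not specify either.
    """
    seen = set()
    unique = []
    for text, label in pairs:
        if text not in seen:  # keep only the first copy of each description
            seen.add(text)
            unique.append((text, label))
    random.Random(seed).shuffle(unique)
    n_train = len(unique) * 8 // 10
    n_dev = len(unique) // 10
    train = unique[:n_train]
    dev = unique[n_train:n_train + n_dev]
    test = unique[n_train + n_dev:]
    return train, dev, test

# Toy data: 100 distinct descriptions plus one exact repeat.
toy = [(f"description {i}", "Industrials") for i in range(100)]
toy.append(("description 0", "Industrials"))
train, dev, test = dedupe_and_split(toy)
print(len(train), len(dev), len(test))
```

Integer arithmetic is used for the split sizes so that rounding never depends on floating-point behavior; any remainder falls into the test set.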
3 Methodology
The text classification task can be defined as follows. Given a passage denoted as \(X=\{x_1,x_2,...,x_n\}\), where n is the length of the passage, the model’s target is to predict the classification label Y defined as \(Y=\arg\max P(Y|X,\theta )\), where \(\theta \) denotes the model parameters. In this paper, Chinese tokens are at the character level. Our overall approach is depicted in Fig. 1.
Local Knowledge Graph. Given an input passage, a set of triples can be retrieved from the knowledge base by linking the mentions parsed from the passage to entities in the knowledge base and expanding relation paths. We define this set of triples as a Local Knowledge Graph. Formally, a knowledge base (KB) is represented as a graph \(K=(V,E)\), where \(V=\{v_j\}\) is the set of vertices and \(E=\{e_j\}\) is the set of edges, and each triple (head entity, relation, tail entity) in the KB is denoted as \(\tau =(e_h,v_{hs},e_s)\). The local knowledge graph is denoted as \(G=\{\tau _1,\tau _2,...,\tau _{g}\}\), where g is the number of triples. The local knowledge graph is constructed as follows. First, for each passage X, we conduct mention parsing to obtain mentions and entity disambiguation to obtain pairs of entities and nodes from the knowledge base XLore [13], using the entity linking system XLink [5]. We rank all candidate nodes by their cosine similarity with the word embedding [9], keep the entities whose similarity exceeds the threshold of 0.4, and retain only the top 10 if more than 10 remain.
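The ranking and filtering step at the end of this paragraph can be illustrated with a minimal sketch. It assumes that the similarity is computed between a mention's word embedding and each candidate node's embedding, which the text does not spell out; the function names and the toy two-dimensional vectors are hypothetical, and no XLink or XLore API is used.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_entities(mention_vec, candidates, threshold=0.4, top_k=10):
    """Keep candidates with similarity above `threshold`, at most `top_k`.

    `candidates` maps an entity name to its embedding (hypothetical layout).
    """
    scored = [(cosine(mention_vec, vec), name) for name, vec in candidates.items()]
    kept = [(score, name) for score, name in scored if score > threshold]
    kept.sort(reverse=True)  # highest similarity first
    return [name for _, name in kept[:top_k]]

# Toy embeddings, invented for illustration only.
mention = [1.0, 0.0]
candidates = {
    "sewage disposal": [0.9, 0.1],
    "mobile game": [0.1, 0.9],
    "water treatment": [0.7, 0.7],
}
selected = select_entities(mention, candidates)
print(selected)
```

With these toy vectors, "mobile game" falls below the 0.4 threshold and is dropped, while the two remaining entities are returned in order of decreasing similarity.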
Node Representation. After obtaining the local knowledge graph G with g nodes, we feed G into an \(L\)-layer GCN model [14] to obtain the representation of each node, where we denote \(h^{(l-1)}_i\) as the input vector and \(h^{(l)}_i\) as the output vector of node i at the \(l\)-th layer. The computation is: \(h^{(l)}_i=\sigma {(\sum ^{n}_{j=1}{\widetilde{A}_{ij}W^{(l)}h^{(l-1)}_j/d_{i}+b^{(l)}})}\), where \(\boldsymbol{\widetilde{A}}=\boldsymbol{A}+\boldsymbol{I}\) is the sum of the adjacency matrix \(\boldsymbol{A}\) and the identity matrix \(\boldsymbol{I}\), \(d_{i}=\sum ^{n}_{j=1}{\widetilde{A}_{ij}}\) is the degree of entity i in the local knowledge graph, \(W^{(l)}\) is a trainable linear transformation and \(\sigma \) is a nonlinear function. We initialize the node embedding with the output of a pre-trained model, which takes all the words in the node as input and outputs a fixed-length vector. The output of the last GCN layer is used as the node representation \(H={\{h^{(L)}_1,h^{(L)}_2,...,h^{(L)}_g\}}\).
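A single GCN layer following the formula above can be sketched in plain Python. Two points are assumptions rather than statements from the paper: \(\sigma \) is taken to be ReLU, and the bias \(b^{(l)}\) is added once after the degree-normalized aggregation. Because \(W^{(l)}\) is linear, applying it after the aggregation is equivalent to applying it to each neighbor first.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def gcn_layer(adj, feats, weight, bias):
    """One GCN layer: h_i = ReLU(W * (sum_j A~_ij h_j / d_i) + b), A~ = A + I.

    adj:    n x n adjacency matrix (lists of 0/1)
    feats:  n x in_dim node features
    weight: out_dim x in_dim matrix W
    bias:   out_dim vector b
    """
    n = len(adj)
    in_dim = len(feats[0])
    out = []
    for i in range(n):
        # Row i of A~: neighbors of node i plus a self-loop.
        a_row = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        d_i = sum(a_row)  # degree of node i, counting the self-loop
        # Degree-normalized sum of neighbor features.
        agg = [sum(a_row[j] * feats[j][k] for j in range(n)) / d_i
               for k in range(in_dim)]
        # Linear map, bias, then the nonlinearity.
        out.append([relu(sum(w * x for w, x in zip(w_row, agg)) + b)
                    for w_row, b in zip(weight, bias)])
    return out

# Two connected nodes with one-hot features and an identity weight matrix.
adj = [[0, 1], [1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = gcn_layer(adj, feats, identity, [0.0, 0.0])
print(hidden)
```

Stacking \(L\) such layers, each with its own weight and bias, and taking the output of the last one gives the node representation H.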
Knowledge Graph Enriched BERT. Knowledge Graph Enriched BERT is proposed to enrich the representation of a long passage with node representations from local knowledge graphs. As a multi-layer bidirectional Transformer encoder, BERT maps an input sequence of characters X to a sequence of representations \(Z=\{z_1,z_2,...,z_n\}\). To fuse the node representation into the word embedding layer, we use an attention mechanism to integrate the word embedding \(W=\{w_1,w_2,...,w_n\}\) and the node representation \(H={\{h^{(L)}_1,h^{(L)}_2,...,h^{(L)}_g\}}\), formulated as: \(\alpha _t=softmax(H^{T}W^{P}w_t)\), \(w'_t=H\cdot \alpha _t\), where \(W^{P}\) is a trainable parameter matrix and \(W'=\{w'_1,w'_2,...,w'_n\}\) is the output of the fusion. We then add a residual connection to the original word embedding to avoid vanishing gradients. We also adopt the same position embedding and token type embedding as BERT, and we sum the three embeddings as the output of the embedding layer. The output is then fed into a stack of identical layers, each of which contains a multi-head self-attention mechanism and a position-wise fully connected feed-forward network [12]. We use the final hidden vector \(z_1\in \mathbb {R}^H\) corresponding to the first input token ([CLS]) to represent the entire sequence. We introduce classification layer weights \(W\in \mathbb {R}^{K\times H}\), where K is the number of labels. Note that here H denotes the hidden size and W the classification weights, which are distinct from the node representation H and the word embedding W used in the fusion step.
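The attention-based fusion of one word embedding \(w_t\) with the node representations, as formulated above, can be sketched as follows. The residual connection is assumed to be a plain element-wise sum \(w_t + w'_t\), and the node and word vectors are assumed to share one dimension; the function names and toy values are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(word_vec, node_vecs, w_p):
    """Return (w_t + H alpha_t, alpha_t) with alpha_t = softmax(H^T W^P w_t).

    word_vec:  embedding w_t of one token (length d)
    node_vecs: list of g node vectors, the columns of H (each length d)
    w_p:       d x d matrix W^P (trainable in the model, fixed here)
    """
    # W^P w_t
    projected = [sum(w * x for w, x in zip(row, word_vec)) for row in w_p]
    # H^T (W^P w_t): one score per node
    scores = [sum(h * p for h, p in zip(node, projected)) for node in node_vecs]
    alpha = softmax(scores)
    # H alpha_t: attention-weighted sum of node vectors
    attended = [sum(a * node[k] for a, node in zip(alpha, node_vecs))
                for k in range(len(word_vec))]
    # Residual connection on the original word embedding (assumed to be a sum).
    fused = [w + a for w, a in zip(word_vec, attended)]
    return fused, alpha

word = [1.0, 0.0]
nodes = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
fused, alpha = fuse(word, nodes, identity)
print(fused, alpha)
```

In this toy case the first node is aligned with the word vector, so it receives the larger attention weight and contributes more to the fused embedding.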
We compute a standard classification loss with \(z_1\) and W, where \(I^*\) denotes the target category: \(O=softmax(z_1W^T)\), \(\mathcal {L}=-\log (O(I^*))\).
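As a worked instance of this loss, the sketch below computes \(-\log O(I^*)\) for a single example in plain Python. The log-sum-exp form is a standard numerically stable way to evaluate the same quantity; the function name and the toy inputs are illustrative only.

```python
import math

def classification_loss(z1, weights, target):
    """Cross-entropy loss -log softmax(z1 W^T)[target].

    z1:      [CLS] vector (length H)
    weights: K x H classification matrix W
    target:  index I* of the gold category
    """
    logits = [sum(w * z for w, z in zip(row, z1)) for row in weights]
    m = max(logits)
    # log of the softmax normalizer, computed stably
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[target]

# With all-zero weights every one of the K = 3 classes is equally likely.
uniform_loss = classification_loss([1.0, 0.0], [[0.0, 0.0]] * 3, target=0)
print(uniform_loss)
```

For a uniform prediction over K classes the loss equals \(\log K\), which is a convenient sanity check for an untrained classifier.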
4 Experiments
Our experiments study the proposed model on the NEEQ industry classification dataset, compare the model with existing approaches and analyze the results.
4.1 Experimental Settings
We compare against five models, all implemented with open-source code. (1) GCN [8]: a fundamental GNN model for the classification task. (2) Logistic Regression [11]: a basic linear model for classification. (3) TextCNN [6]: a CNN with one convolution layer on top of word vectors. (4) BERT [4]: a language model pre-trained on a large-scale corpus to obtain deep bidirectional representations, which set new records on many downstream tasks. (5) K-BERT [10]: a model that equips the language representation model with knowledge graphs by first injecting relevant triples into the input sentence and then feeding the result into the embedding layer, the seeing layer and the mask-transformer.
In all our experiments, we initialize the word embedding with the parameters of bert-base-chinese, which has a hidden size of 768 and 12 hidden layers. In the fine-tuning process, we use a batch size of 8, 8 gradient accumulation steps and a learning rate of 5e−5. In the embedding layer, we use bert-as-service equipped with bert-base-chinese to get the initial node embedding, whose dimension is 768; the similarity threshold for entities in XLore is 0.4, and the layer size of the GCN is 4. The dropout in the GCN layers is 0.5. Models are trained with the Adam optimizer [7]. We select the model based on the dev set.
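For reference, the reported settings can be collected in one configuration object, as in the sketch below. The class and field names are hypothetical and not taken from the authors' code, and reading "layer size of GCN is 4" as four GCN layers is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported in Sect. 4.1 (field names are illustrative)."""
    pretrained_model: str = "bert-base-chinese"
    hidden_size: int = 768
    num_hidden_layers: int = 12
    batch_size: int = 8
    gradient_accumulation_steps: int = 8
    learning_rate: float = 5e-5
    node_embedding_dim: int = 768
    similarity_threshold: float = 0.4
    gcn_layers: int = 4  # assumption: "layer size" means number of layers
    gcn_dropout: float = 0.5

    @property
    def effective_batch_size(self) -> int:
        # Gradients of several mini-batches are accumulated before each
        # optimizer step, so one update sees this many examples.
        return self.batch_size * self.gradient_accumulation_steps

config = FineTuneConfig()
print(config.effective_batch_size)
```

With a batch size of 8 and 8 accumulation steps, each parameter update is based on 64 examples, which is the usual motivation for gradient accumulation when memory limits the per-step batch size.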
We conduct automatic evaluation with the following metrics. Accuracy measures the proportion of correctly predicted samples among all samples. F1 is the macro-averaged F1 score, computed from Precision and Recall, which measures the correctness over all categories.
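The two metrics can be made concrete with a short sketch. It assumes that macro F1 is the unweighted mean of per-category F1 scores over every label that appears in the gold or predicted labels; the paper does not state which macro-averaging variant it uses, and the toy labels are invented.

```python
def accuracy(gold, pred):
    """Fraction of samples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-category F1 (assumed macro-averaging variant)."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)

gold = ["A", "A", "B", "B"]
pred = ["A", "B", "B", "B"]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
print(acc, f1)
```

Because every category contributes equally to the mean, macro F1 penalizes poor performance on rare industry categories more than Accuracy does.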
4.2 Results and Analysis
Table 3 shows the experimental results against the competitor methods. GCN achieves the worst performance since it only uses the local knowledge graphs and misses the information in the passages. Logistic Regression focuses on the statistical information of words and can benefit from long texts such as those in our dataset. TextCNN is a CNN model designed for text classification. It can represent text sequences with a deep neural network, but it does not model long passages well and lacks domain-specific knowledge. BERT, which is initialized with pre-trained parameters, achieves a significant improvement, but it still has problems understanding domain-specific texts. Although K-BERT performs worse than our model, its result demonstrates the influence of knowledge graphs. Compared with CN-DBpedia [3], from which K-BERT extracts knowledge triples, XLore contains 3.6 times as many entities, which contributes to a deeper comprehension of domain information.
Compared with the competitor methods, our model takes advantage of the knowledge. Complementing words with node representations is helpful because it provides additional information and considers the structure of the graphs, allowing the word embedding to select more helpful information from the nodes. KGEB achieves absolute improvements of at least +0.5 in Accuracy and +0.94 in F1.
As Table 4 shows, the experimental results on each category support the effectiveness of KGEB. TextCNN performs worst on almost all of the classes. BERT performs better, but in most categories KGEB achieves the highest Precision, Recall and F1 score, demonstrating that the additional knowledge not only helps distinguish terms and draw attention to domain-specific words but also helps the model understand the meaning of decisive words. However, KGEB is still far from a perfect classifier, and there are descriptions that our current model cannot classify well. These labels could be predicted correctly with a better entity linking system and better node expansion strategies. Taking a text labeled “Telecommunication Service” as an example, the local knowledge graph is constructed from entities including “digital television”, “health” and “care” extracted from the text “In addition to retaining the original digital television business, the future will be based on the field of health care”, which can introduce noise.
4.3 Case Study
Comparing the predicted labels of TextCNN, BERT and KGEB, our model improves the recall of the labels with respect to the baselines. In the cases where all of them predict correctly, the keywords in the sentence help the models increase the probability of the correct labels. In the cases where BERT and TextCNN predict wrong labels, such as the sentence “The company is committed to the production and sales of core equipment for water treatment and recycling of domestic sewage”, BERT and TextCNN label the description “Industrials”, but owing to the additional information about “sewage disposal”, KGEB obtains the correct prediction. In some cases, all of the models classify the description with the wrong label. If the noise introduced by entity linking and node expansion were removed, we could obtain the correct classification results, as in the last example.
5 Conclusions
In this paper, we construct a large dataset for the industry classification task and propose a novel knowledge-enriched BERT, which can extract the local knowledge graph from business descriptions and integrate word and knowledge representations. In the experiments, the proposed model outperforms the competitor methods, which demonstrates its effectiveness. In the future, we will continue to improve this work and extend the method to more applications.
References
1. Bai, H., Xing, F.Z., Cambria, E., Huang, W.B.: Business taxonomy construction using concept-level hierarchical clustering. Papers (2019)
2. Bhojraj, S., Lee, C., Oler, D.K.: What’s my line? A comparison of industry classification schemes for capital market research. J. Acc. Res. 41(5), 745–774 (2003)
3. Bo, X., et al.: CN-DBpedia: a never-ending Chinese knowledge extraction system. In: International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems (2017)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
5. Zhang, J., Cao, Y., Hou, L., Li, J., Zheng, H.-T.: XLink: an unsupervised bilingual entity linking system. In: Sun, M., Wang, X., Chang, B., Xiong, D. (eds.) CCL/NLP-NABD 2017. LNCS (LNAI), vol. 10565, pp. 172–183. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69005-6_15
6. Kim, Y.: Convolutional neural networks for sentence classification. Eprint Arxiv (2014)
7. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
8. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
9. Li, S., Zhao, Z., Hu, R., Li, W., Liu, T., Du, X.: Analogical reasoning on Chinese morphological and semantic relations. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 2: Short Papers (2018)
10. Liu, W., Zhou, P., Zhao, Z., Wang, Z., Wang, P.: K-BERT: enabling language representation with knowledge graph. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020)
11. Menard, S.: Logistic regression. American Statistician (2004)
12. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
13. Wang, Z., et al.: XLore: a large-scale English-Chinese bilingual knowledge graph. In: Proceedings of the 12th International Semantic Web Conference (2013)
14. Zhang, Y., Qi, P., Manning, C.D.: Graph convolution over pruned dependency trees improves relation extraction. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018)
15. Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., Liu, Q.: ERNIE: enhanced language representation with informative entities. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019)
Acknowledgements
We appreciate the insightful feedback from the anonymous reviewers. This work is jointly supported by grants: Natural Science Foundation of China (No. 62006061), Strategic Emerging Industry Development Special Funds of Shenzhen (No. JCYJ20200109113441941) and Stable Support Program for Higher Education Institutions of Shenzhen (No. GXWD20201230155427003-20200824155011001).
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, S., Pan, Y., Xu, Z., Hu, B., Wang, X. (2021). Enriching BERT With Knowledge Graph Embedding For Industry Classification. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Communications in Computer and Information Science, vol 1517. Springer, Cham. https://doi.org/10.1007/978-3-030-92310-5_82
DOI: https://doi.org/10.1007/978-3-030-92310-5_82
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92309-9
Online ISBN: 978-3-030-92310-5
eBook Packages: Computer Science (R0)