1 Introduction

Initial public offerings (IPOs) have become one of the core strategic means of raising capital for cash-hungry companies (Grover & Bhullar, 2021). In recent years, IPO risk has severely disrupted the operation of China’s capital market (Saci & Jasimuddin, 2021). Research on information disclosure plays an essential role in promoting the effective operation of the capital market (Michelon et al., 2019). The exposure of fraud scandals (e.g., the Luckin fraud in China) challenges the bottom line of supervision and shows that the existing information disclosure system struggles to contain fraud in the financial market. In fact, in the year of the IPO fraud, the company attracted more media attention and more positive media attitudes (Sun et al., 2021). The existing research on risk disclosure is mainly divided into two perspectives, i.e., the risk effect and the information effect (Adam-Müller & Erkens, 2020; Hope et al., 2016; Hussein et al., 2020; Li et al., 2019a, 2019b; Xia et al., 2022). According to the information effect view, risk information can explain known risk factors and emergencies, increase information transparency, and reduce the risk perception of information users (Ada et al., 2021; Benamati et al., 2021; Hope et al., 2016). That is, investors are willing to see more risks disclosed. Under the risk effect perspective, it is argued instead that the more risk is disclosed, the greater investors’ risk perception; in this case, investors prefer to see less risk disclosure. Information disclosure can also enhance communication with the market and reduce information asymmetry between listed companies and investors (Adam-Müller & Erkens, 2020; Li et al., 2019a, 2019b).
Although regulators have been improving the information disclosure system (Hoque & Mu, 2019; Zimmer et al., 2010), there are still problems such as vague information disclosure, false statements, and misleading fraud (Lobo & Zhao, 2013; Que & Zhang, 2019). This paper intends to explore the characteristics of enterprise risk disclosure and the kind of market reaction that different risk disclosure could have.

Several scholars (Hussein et al., 2020; Wasiuzzaman et al., 2018) have examined the link between IPO initial returns and information disclosure on the IPO prospectus. For the IPO opening day, the initial returns are the IPO underpricing rate. In similar literature, academics and researchers use the terms initial returns and IPO undervaluation rate interchangeably (Ritter & Welch, 2002). A large number of studies have studied the factors influencing the efficiency of the IPO capital market from the perspectives of IPO price limit policy (Kim et al., 2013), IPO venture capital (Que & Zhang, 2019), investor attention (Chang & Kwon, 2020), and directed share program (DSP) (Chong & Liu, 2020); however, they have not taken the risk information text into account. Although the existing literature has focused on the information value of risk disclosure text to some extent (e.g., Wei et al., 2019), there are still some deficiencies.

First, the risk information disclosed can provide accurate and valuable information to estimate the company’s position and forecast future development (Alshirah et al., 2022). The existing literature on risk disclosure is still lacking, especially the analysis of risk disclosure text from investors’ perspectives. Through manual semantic analysis of the content, we can find out and compare the norms of intangible information disclosure (Catalfo & Wulf, 2016). At the same time, the content of information disclosure also impacts decision-making (Zhang & Liu, 2020). Some studies have evaluated the degree of risk disclosure from the risk scope, simplicity, and uniqueness of the text (Sheng et al., 2021). Therefore, this paper analyzes the degree of information disclosure from a new perspective, that is, content and semantics.

Second, the existing research measures risk disclosure either by manual reading and labeling or by exchange rating, which is time-consuming, labor-intensive, and subjective. Many scholars have confirmed the importance of data mining and analysis methods for crisis management (Akter & Wamba, 2017). Due to the unpredictability of global events such as the COVID-19 pandemic, governments are using data mining techniques to prepare more innovative and proactive crisis management strategies for future uncertain crises (Park, 2021). Data mining technology has the advantages of being fast and accurate, which has opened up a new development direction for enterprise financial analysis and played an important role in financial crisis management (Shang et al., 2021). This paper uses data mining technology, specifically an unsupervised machine learning algorithm, to analyze the prospectus and mine the risk disclosure it contains.

Thirdly, studies have shown that improving enterprise information transparency can increase the stock market’s liquidity (Choi & Jung, 2021). At the same time, the transparency of enterprise information also affects the trust of investors (Han et al., 2021). Investors have increased their demand for information transparency because they can make better decisions based on the disclosures provided by their companies (Zia-ur-Rehman et al., 2021). This paper takes the moderating role of enterprise information transparency into consideration.

Finally, this paper further analyzes the characteristics of risk disclosure of illegal enterprises in the future from the perspective of risk aversion. Risk aversion arises from a perceptual bias that, given the limitations of the mental representation of the situation, represents the optimal decision rule (Khaw et al., 2021). However, most decision-makers tend to avoid risks, and the risk-neutral approach cannot meet their needs (Huang et al., 2021a, b). In the case of risk aversion, a preference function that is higher than the expected return and risk is usually required. The level of enterprise risk aversion impacts the optimal decision and its profit (Kouvelis et al., 2021). Therefore, the appeal of safer options depends on decision-makers’ risk aversion (Calsamiglia et al., 2021).

As risk disclosure serves as the only official information channel for IPO companies to open to investors, this study explores the following research questions:

  1. What are the characteristics of the risk disclosure text of the prospectus?

  2. Do they follow the information effect or the risk effect, respectively?

  3. What are the characteristics of the risk disclosure of illegal enterprises in the future?

To answer these questions, this paper uses a text analysis method to analyze the risk information of the prospectus and empirically tests the influence of the text content and semantic features of risk disclosure on the efficiency of the capital market.

This study contributes to the literature by empirically examining the role of risk disclosure texts in improving capital market efficiency and guiding value investment. Firstly, this paper empirically analyzes the impact of prospectus risk disclosure on capital market efficiency by calculating IPO pricing efficiency and regression of independent variables from two aspects of the semantic and content characteristics of the prospectus. This is conducive to a more comprehensive estimation of investors’ response to the IPO enterprise risk disclosure text and deeply analyzes the effect of this response (information or risk effect). Secondly, as a supplementary channel of information resources, enterprise information transparency plays a moderating role in the “information effect” and “risk effect” of risk disclosure. We find that risk disclosure is cross-influenced by information effect and risk effect, and enterprise information transparency directly affects enterprise risk disclosure degree. Thirdly, from the perspective of risk measurement, this study breaks through the existing manual reading, labeling, or complex machine learning classification methods (Cheng et al., 2021; Srivastava & Eachempati, 2021). Based on an unsupervised machine learning algorithm, this paper constructs an unsupervised feature extraction model of risk disclosure text, which provides a new method for the research of feature extraction of risk disclosure text. Fourth, this paper suggests that regulators should adopt different ways to require the risk disclosure degree of newly listed enterprises according to their specific information environment. According to the IPO performance of future illegal enterprises, a regression model of independent variables is constructed. We find that the richness of information disclosure of future illegal enterprises can be reduced to avoid the inspection of regulators.

The remainder of this paper is organized as follows. Section 2 is the literature review and research hypotheses. Section 3 outlines the research design, including sample selection measures, variable design, and the establishment of the measurement model. Section 4 introduces the detailed analysis of empirical results. Section 5 shows the results of the robustness test. Section 6 discusses the empirical results. Section 7 highlights the practical and theoretical contributions of the paper as well as its limitations.

2 Literature review and research hypothesis

2.1 IPO pricing efficiency

As an important part of IPO, pricing has always been one of the key issues of the capital market (Gao et al., 2019; He et al., 2019). The existing literature mainly has studied the reasons for IPO pricing efficiency from the perspectives of information asymmetry, enterprise nature, behavioral finance, and government regulation (Huang et al., 2019a, 2019b; Jog et al., 2019; Liu et al., 2019; Rathnayake et al., 2019; Xuan et al., 2019). According to the view of information asymmetry, poor information disclosure may interfere with the information environment, resulting in a decline in pricing efficiency (Kao et al., 2020); and good information disclosure is conducive to pricing efficiency in the IPO market (Zhou & Sadeghi, 2019).

Although the previous studies have explored the causes of the poor pricing efficiency from the perspectives of investor attention (Chang & Kwon, 2020), regulatory interactions (He & Fang, 2019), and market maker competition (Farooq & Hamouda, 2016), there still exists a lack of in-depth analysis of risk disclosure. Yao and Zhao (2016) argue that the underpricing rate on the first day of listing can effectively reflect the efficiency of asset pricing. The efficiency of good asset pricing represents positive feedback from investors on information disclosure. The lower the underpricing rate is, the higher the pricing efficiency is, and the more positive the market reaction is. Therefore, from the perspective of prospectus risk disclosure, we can understand the internal influence mechanism of IPO pricing efficiency.

2.2 Risk disclosure

In the capital market, information disclosure is an important way to mitigate information asymmetry. Careful investigation of risk disclosure may help weaken investors’ risk choices (McGuinness, 2019). Güçbilmez and Briain (2020) contend that investors with more information should get higher returns. But different information disclosures have different effects. There is no consistent conclusion on the role of risk information disclosure in the existing literature. The current research is mainly divided into two perspectives: risk effect and information effect (Adam-Müller & Erkens, 2020; Hope et al., 2016; Hussein et al., 2020; Li et al., 2019a, 2019b). To explore the true effect of risk disclosure, this article extracts and analyzes the characteristics of risk disclosure text from the perspectives of semantics and content.

The risk effect view is that the disclosure of “bad news” could bring vicious feedback. However, if companies disclose information honestly, but the gains are not worth the loss, listed companies would choose to hide “bad news” when disclosing information (Jin et al., 2021), which could lead to “adverse selection” problems in the IPO market and exacerbate risk information. This is why many listed companies use “good news, but not bad news” to whitewash and disclose information (Lo et al., 2017). Besides, hiding risk information can also effectively avoid the punishment of the third-party supervision mechanism (Nefedova & Pratobevera, 2020). Risk disclosure in the prospectus impacts initial IPO returns (Hussein et al., 2020). Therefore, this paper studies the impact of risk disclosure on IPO underpricing from the perspective of risk effect. This paper puts forward the following hypothesis for risk disclosure from the perspective of risk effect, as shown in Fig. 1.

H1A: When the semantic novelty of prospectus risk disclosure is higher, the risk perceived by investors is lower, and the underpricing rate will be lower.

H1B: When the risk disclosure content of the prospectus is richer, the risk perceived by investors is lower, and the underpricing rate will be lower.

Fig. 1 Research model

The information effect view holds that the disclosure of risk information will bring positive feedback (Kamal, 2021). According to the information effect viewpoint, the content of risk disclosure helps reveal known risk factors, alleviating information asymmetry and enabling investors to have specific risk estimations (Li et al., 2019a, 2019b). The view of the information effect of risk disclosure holds that risk disclosure increases the supply of information, reduces the asymmetry of information, and easily wins the trust of investors, which may trigger a positive market response (Huo et al., 2022). The risk effect view of risk disclosure holds that risk information can reveal unknown risk factors, enhance investors’ risk perception, and trigger their fear of unknown risks (Campbell et al., 2014). Regulators such as the Securities and Exchange Commission have also shown interest in the quality of security risk disclosure (Cheong et al., 2021). Theoretical work usually predicts a negative correlation between disclosure and risk premium, where additional disclosure reduces estimated risk or information asymmetry (Ellahie et al., 2022). Therefore, this paper studies the impact of risk disclosure on IPO underpricing from the perspective of information effect. This paper puts forward the following hypothesis for risk disclosure from the perspective of information effect, as shown in Fig. 1.

H2A: When the semantic novelty of prospectus risk disclosure is higher, the degree of information asymmetry will be lower, and the underpricing rate will be lower.

H2B: When the risk disclosure content of the prospectus is richer, the degree of information asymmetry will be lower, and the underpricing rate will be lower.

2.3 Information environment

Transparency of the information environment refers to the extent to which external investors have access to the company’s information. To respond to the concerns of external investors, enterprises disclose information about themselves, thus affecting the company’s information environment (Xue et al., 2020). Loh and Stulz (2018) show that investors are more dependent on analysts’ research when the market is uncertain. It has been found that risk information is often obscure and needs to be interpreted by professional analysts (Spence et al., 2020). Analysts’ predictions of some quantitative indicators of the enterprises they track can bring additional information to investors (Gu et al., 2019; Call et al., 2013). When more analysts track an enterprise, it receives more attention. As a result, more information about the company is revealed to the outside world, and the company’s information transparency is higher. Therefore, this paper uses the number of analysts tracking an enterprise as an indicator of the transparency of enterprise information.

Many studies have shown that the information environment of a company is the main cause of information transmission (Farooq & Hamouda, 2016), and a good information environment is useful in stabilizing the financial market (Papadamou et al., 2017). As a matter of fact, there are certain prerequisites for the impact of information disclosure on the capital market (Albring et al., 2020). Huang et al. (2019a, 2019b) find that the high transparency of the market information environment is conducive to easing information asymmetry. When the information environment is examined specifically, the negative impact of ESG ratings on IPO underpricing is more pronounced in countries with more transparent financial disclosure, higher liability standards, and stronger shareholder protection (Baker et al., 2021). Therefore, we assume that corporate information transparency affects the effect of information disclosure. We treat enterprise information transparency as a moderating variable and examine how it changes the response of the underpricing rate to risk disclosure. Accordingly, the following hypotheses are proposed.

H3A: When the transparency of enterprise information is low, the underpricing rate reaction to the semantic novelty of risk disclosure in the prospectus will be significantly strengthened.

H3B: When the transparency of enterprise information is low, the underpricing rate reaction to the richness of risk disclosure’s content in the prospectus will be significantly strengthened.

Based on the above assumptions, this paper discusses how the information and risk effects of risk disclosure play their respective roles, reveals the relationship between initial IPO returns and semantic novelty and content richness, and the impact of corporate transparency on semantic novelty and content richness. The research model is shown in Fig. 1.

3 Research design

3.1 Sample selection and data source

Since the reform and opening up, China’s economy has witnessed sustained and rapid development, national wealth has grown rapidly, and its GDP ranks second in the world. China’s capital market has developed considerably over the past 40 years. In terms of China’s share of the global capital market, its stock, bond, and asset management markets accounted for 13%, 15%, and 9%, respectively, at the end of 2020, second only to the United States. The international status of the capital market has risen rapidly and now gradually matches China’s economic status. Because China has the second largest stock market in the world, it is reasonable to select its data as a research sample. In terms of representativeness, China’s stock market has initially formed a multi-level capital market with a variety of trading platforms, including the small and medium-sized board, the main board, the equity trading market, and the GEM. Although it started late, it has developed rapidly and has become relatively mature. In terms of uniqueness, owing to the late start, the construction of the legal system still lags. The credit system is, to a certain extent, imperfect and the economic systems are different, which makes China’s stock market unique. On these two grounds, the Chinese stock market data selected in this study has research value.

The prospectuses and related variables are obtained from the websites “Oriental Fortune” and “Financial Circle”. This paper uses prospectuses from 2009 to 2019 as risk disclosure samples. The prospectuses available on Oriental Fortune date from 2010 to the present, and those on Financial Circle date from 2009 to the present. Corporate governance-related data and corporate financial data are extracted from the China Center for Economic Research (CCER) financial database. The company characteristic data comes from the China Stock Market & Accounting Research (CSMAR) database and is cross-checked against the CCER database to ensure the accuracy of IPO enterprise data.

Referring to the existing literature, we deal with the initial samples according to the following principles: (1) eliminating the missing samples after data matching, (2) removing samples of outliers, (3) eliminating ST-listed companies because the financial data of ST listed companies can only be disclosed after certain processing, which has no reference value, (4) eliminating A-share listed companies that issue H shares to avoid the impact of various regulatory rules, referring to the research of Chen et al. (2018), (5) eliminating backdoor listed companies as Lee et al. (2019) found that the performance of backdoor listed companies in China was significantly better than that of other IPO companies before and after listing, and (6) defining the sample industry as the manufacturing industry, considering the impact of different industry characteristics and information disclosure regulations. After the above treatment, 1297 observations are obtained. Considering the influence of plate differences, 826 observations are obtained. Considering the characteristics of the unsupervised model, 101 companies’ prospectus risk disclosure text data is selected. There are 21 companies listed in the Shenzhen A-share market and 80 companies listed in the Shanghai A-share market.

3.2 Measurement of risk disclosure level

Information disclosure of listed companies can be divided into quantitative data and text. Unstructured text is of great significance to the analysis of the stock market and financial decision-making (Chan & Chong, 2017). There are five ways to measure the quality of information disclosure. The first uses text length: when companies discuss more content in the financial statements, the quality of information disclosure is taken to be relatively high. However, risk disclosure text in today’s capital market is often formulaic and whitewashed (Lo et al., 2017), so text length has become a poor measure of the level of information disclosure. The second builds an indicator system based on specific data to calculate an information disclosure index (Al-Hadi et al., 2019). Because the specific indicators are complex, measuring the level of information disclosure indirectly through them loses the accuracy and effectiveness of the disclosure text itself. The third measures the quality of information disclosure by the evaluation scores published by platforms such as exchanges (Adam-Müller & Erkens, 2020). On the one hand, third-party scoring plays an important role in information valuation (Grassa et al., 2020). On the other hand, it ignores the subtle differences between listed companies. Besides, the market environment changes constantly, and the quantitative index system of information disclosure still needs to be verified, so this approach lacks accuracy. The fourth is to read the risk disclosure of the prospectus or annual report manually and to measure and label the risk information by hand (Shivaani et al., 2020; Yao & Zhao, 2016). Although this approach removes the barriers that the other approaches face in understanding the semantics of risk disclosure, it is labor-intensive, time-consuming, and subjective.
Fifth, many researchers use computer software to extract risk keywords (Ibrahim & Hussainey, 2019), mood and nature (Shivaani & Agarwal, 2020), semantic tone (Gonzalez et al., 2019), and other features to measure the intensity of risk information disclosure. In practice, a sentence may contain a lot of subjective information or intentions (Mai & Le, 2020), and different risk characteristics require different risk management strategies (DuHadway et al., 2017). Therefore, it is challenging to express risk information with a single feature. This paper follows the fifth way, using unsupervised machine learning to process risk disclosure. The text’s semantic novelty and content richness are measured to explore their impact on the efficiency of the capital market.

3.3 Related algorithm design

3.3.1 Text vectorization model based on Neural Network–Word2vec

Word2vec (Mikolov et al., 2013) and Doc2vec (Le & Mikolov, 2014) models represent deep text representation models. Compared with the traditional text representation model, it can transform words, sentences, or even paragraphs into fixed dimension vectors more fully combined with text features. This method overcomes the limitations of traditional methods in mining single text features and has been widely used in abnormal comment detection (Chang et al., 2018), candidate recommendation (Kim et al., 2019), and other fields. Word2vec, through training, can simplify the processing of text content into vector operation in k-dimensional vector space, and the similarity in vector space can be used to represent the semantic similarity of text. Therefore, the word vector output from Word2Vec can be used to do a lot of NLP-related work, such as clustering, finding synonyms, part of speech analysis, and so on. This section will explain the principle of the basic model word2vec.

The Word2vec model is a word vector mapping model with two main network structures: CBOW (Continuous Bag of Words) and skip-gram. CBOW predicts the center word from its context; skip-gram predicts the context from the center word. Thus, the two approaches differ only in how inputs and outputs are arranged; the network itself does not change, and training always occurs between single pairs of words (represented as one-hot vectors in the inputs and outputs) (Di et al., 2021). The Word2Vec model is then trained on the corpus with different hyperparameters, including vector dimension, context window size, and number of training iterations. The purpose of training with different hyperparameters is to fine-tune the model and determine the best embedding for synonym extraction (Al-Matham & Al-Khalifa, 2021). The principle is shown in Fig. 2.
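The difference between the two structures is easiest to see in how training examples are formed. The following is a minimal sketch in plain Python, not the implementation used in this paper; the function names `cbow_examples` and `skipgram_examples`, the toy sentence, and the window size are illustrative assumptions.

```python
from typing import List, Tuple


def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """Return (context words, center word) examples.

    CBOW predicts the center word from the words around it.
    """
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            examples.append((context, center))
    return examples


def skipgram_examples(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (center word, context word) pairs.

    Skip-gram predicts each context word from the center word.
    """
    pairs = []
    for context, center in cbow_examples(tokens, window):
        pairs.extend((center, word) for word in context)
    return pairs


# Hypothetical tokenized fragment of a risk disclosure sentence.
sentence = ["raw", "material", "price", "fluctuation", "risk"]
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)
```

Both functions walk the same sliding window; only the direction of prediction differs, which mirrors the statement that the two approaches change how inputs and outputs are arranged rather than the network.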

Fig. 2 CBOW and Skip-gram network structures of Word2vec model

3.3.2 Doc2vec

Based on Word2vec, Le and Mikolov (2014) developed an unsupervised method to map sentences or complete paragraphs to vector spaces of corresponding dimensions, namely the Doc2vec model. Doc2Vec has been used for sentiment analysis, though not as often as Word2Vec. We used Gensim’s Doc2Vec implementation with the default hyperparameters (Mishra et al., 2019). The Doc2Vec model is a common neural network embedding method in natural language processing, which is used to vectorize words and documents from context. Compared with the TF-IDF, LDA, and Word2Vec models, the Doc2Vec model has the highest accuracy in functional area classification (Niu & Silva, 2021a, b). The model includes PV-DM (Distributed Memory Model of Paragraph Vectors) and PV-DBOW (Distributed Bag Of Words Model of Paragraph Vectors), similar to CBOW and Skip-gram in Word2vec, respectively.

The PV-DM model uses the context words and the corresponding paragraph vector to predict the probability of the central word. The schematic diagram of the PV-DM model is shown in Fig. 3a. The objective function of the PV-DM model is the maximum mean log-likelihood function, as shown in formula 1:

$$\frac{1}{T}\sum\limits_{t = k}^{T - k} \log p(w_{t} \mid d_{t}, w_{t - k}, \ldots, w_{t + k}), \tag{1}$$

where T is the number of words, k specifies the size of the sliding window, d_t is the paragraph vector of the current sentence, and the prediction task is completed by a softmax classifier, as shown in formulas 2 and 3:

$$p(w_{t} \mid d_{t}, w_{t - k}, \ldots, w_{t + k}) = \frac{e^{y_{w_{t}}}}{\sum\nolimits_{i} e^{y_{i}}} \tag{2}$$
$$y = Uh(d_{t}, w_{t - k}, \ldots, w_{t + k}; W, D) + b, \tag{3}$$

where U and b are the softmax parameters (b is the intercept), and the h function is the concatenation or average of the paragraph vector taken from D and the context word vectors taken from W.
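Formulas 2 and 3 can be illustrated with a small numerical sketch. It assumes the averaging variant of h and uses made-up toy values for W, the paragraph vector, U, and b; it is meant only to show how one term of the log-likelihood in formula 1 is evaluated, not to reproduce the trained model.

```python
import math
from typing import Dict, List

Vector = List[float]


def average(vectors: List[Vector]) -> Vector:
    """h(.): element-wise average of the paragraph vector and context word vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def softmax_probabilities(h: Vector, U: List[Vector], b: Vector) -> List[float]:
    """Formula 3 (y = U h + b) followed by formula 2 (softmax over y)."""
    y = [sum(u * x for u, x in zip(row, h)) + bias for row, bias in zip(U, b)]
    shift = max(y)  # subtracting the maximum avoids overflow; the result is unchanged
    exps = [math.exp(value - shift) for value in y]
    total = sum(exps)
    return [value / total for value in exps]


# Toy parameters; every value here is an assumption made for illustration.
vocab = ["credit", "market", "risk"]
W: Dict[str, Vector] = {"credit": [1.0, 0.0], "market": [0.0, 1.0], "risk": [1.0, 1.0]}
d_t: Vector = [0.5, 0.5]                   # paragraph vector
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one output row per vocabulary word
b = [0.0, 0.0, 0.0]

# Predict the center word from the paragraph vector and two context words.
h = average([d_t, W["credit"], W["market"]])
probs = softmax_probabilities(h, U, b)

# One summand of the objective in formula 1, for the true center word "risk".
log_likelihood_term = math.log(probs[vocab.index("risk")])
```

Training adjusts W, the paragraph vectors, U, and b so that the average of such log terms over all positions is as large as possible.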

The PV-DBOW model is the other method of training paragraph vectors in Doc2vec. It samples text windows at random from the paragraph and takes words from each window as the words to be predicted. The purpose is to complete this prediction task using only the known paragraph vector. The schematic diagram of the model is shown in Fig. 3b. The objective function of the model is shown in formula 4:

$$\frac{1}{T}\sum\limits_{t = k}^{T - k} \log p(w_{t - k}, \ldots, w_{t + k} \mid d_{t}). \tag{4}$$

In general, the Doc2vec model can effectively extract text features and abstract the text content into a fixed-dimension vector so that the similarity of texts on the semantic level can be calculated and expressed in the vector space. Therefore, we use the Doc2Vec embedding algorithm as the learning technique. The algorithm builds word and document embeddings in an unsupervised manner (Chen & Sokolova, 2021).

Fig. 3 PV-DM and PV-DBOW network structures of Doc2vec model

3.3.3 Cosine similarity

Common distance measurement methods include Euclidean distance and cosine distance. Considering the characteristics of the model studied in this paper, the cosine similarity still maintains the property of “1 when the same, 0 when orthogonal, and -1 when the opposite” in high dimensions (Alshammeri et al., 2021). The value of Euclidean distance is influenced by dimension. In addition, Euclidean distance represents the absolute difference in numerical value, while cosine distance represents the relative difference in direction. By comparing different similarity measurement methods, it is found that neural network technologies (Word2vec, Doc2vec, Law2Vec) can learn the embedding effect in the task as well as other technologies (Mandal et al., 2021). When calculating document similarity, we should consider the semantic similarity of the text, not the text itself. Therefore, this paper uses a Doc2vec model to calculate the similarity between risk disclosure documents (Niu & Silva, 2021a, b). The specific calculation method is shown in Formula 5 as follows:

$$\mathrm{Similarity}(\mathrm{firm}A, \mathrm{firm}B) = \frac{\overrightarrow{\mathrm{firm}A} \cdot \overrightarrow{\mathrm{firm}B}}{\left|\overrightarrow{\mathrm{firm}A}\right| \left|\overrightarrow{\mathrm{firm}B}\right|} \tag{5}$$
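As a brief illustration of Formula 5 and of the contrast with Euclidean distance, the sketch below evaluates both measures on toy vectors. The vectors and variable names are assumptions chosen for illustration; in the paper the inputs would be Doc2vec document vectors.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Formula 5: dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Absolute difference in numerical value between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


firm_a = [1.0, 2.0, 3.0]
firm_b = [2.0, 4.0, 6.0]     # same direction as firm_a, twice the length
firm_c = [-1.0, -2.0, -3.0]  # opposite direction to firm_a

same_direction = cosine_similarity(firm_a, firm_b)
opposite = cosine_similarity(firm_a, firm_c)
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
distance_ab = euclidean_distance(firm_a, firm_b)
```

Here `firm_a` and `firm_b` point in the same direction, so their cosine similarity is 1 up to rounding even though their Euclidean distance is well above zero, which is the sense in which cosine measures direction rather than magnitude.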

3.4 Variable design and model

3.4.1 Dependent variable: IPO pricing efficiency

Referring to the research of Yao and Zhao (2016), this paper uses the IPO underpricing rate on the first day as an index to measure IPO pricing efficiency. The first-day IPO underpricing rate is defined as the relative difference between the closing price of the stock on the first day and its issue price. If the index is positive, the IPO is underpriced; if it is negative, the IPO is issued at a premium. Some scholars have pointed out that information asymmetry is an important reason for IPO underpricing (Kao et al., 2020). Therefore, this paper uses the IPO underpricing rate as the explained variable to study the impact of risk information disclosure on the securities market. We compute it as the initial return IR = (P1 − P0)/P0 (Chivianti & Sukamulja, 2021), where P1 is the closing price of the shares on the first day of listing and P0 is the issue price.
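The calculation of IR is simple enough to state as a short sketch. The function name `initial_return` and the prices below are hypothetical and serve only to show the sign convention.

```python
def initial_return(closing_price: float, issue_price: float) -> float:
    """IR = (P1 - P0) / P0, the first-day underpricing rate."""
    if issue_price <= 0:
        raise ValueError("issue price must be positive")
    return (closing_price - issue_price) / issue_price


# Hypothetical IPO issued at 10.00 that closes at 14.40 on the first day.
underpriced = initial_return(14.40, 10.00)

# Hypothetical IPO issued at 10.00 that closes at 9.50 on the first day.
premium = initial_return(9.50, 10.00)
```

The first case gives a positive value of about 0.44 (underpricing), and the second gives a negative value of about -0.05 (premium).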

3.4.2 Independent variable: semantic novelty and content richness of risk disclosure texts

3.4.2.1 Semantic novelty

Based on the vector space model and the Gensim library for Python, this paper uses the doc2vec model to extract semantic vectors from risk disclosure texts after preprocessing and word segmentation. Compared with traditional ways of extracting text vectors, doc2vec brings word-order information into the model, which makes the results more accurate. We then calculate the semantic cosine similarity between the risk disclosure texts of the prospectuses to measure their semantic novelty (Alshammeri et al., 2021). For a risk disclosure document set M and a test document set N, this paper uses an unsupervised machine learning algorithm to calculate the semantic similarity between N and M. A semantic similarity list is constructed, and the overall similarity of the document set is obtained by averaging the values in the list, which describes the semantic novelty of the risk disclosure text. The process is shown in Fig. 4. The semantic similarity between any two risk disclosure documents can be expressed by the cosine similarity of their two vectors.

Fig. 4 Flow chart of the semantic novelty of text

3.4.2.2 Content richness

In this paper, the content richness of the risk disclosure in the prospectus is measured by the number of risk types in the risk disclosure text (Elshandidy & Zeng, 2022). Compared with the manual reading used by Yao and Zhao (2016), counting the risk types in the disclosure text describes the richness of the risk disclosure content more intuitively and objectively, and the data are more convenient to obtain.
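A count of this kind could be sketched as below. The paper does not spell out how risk types are identified in the text, so the sketch assumes the risk-type labels of one prospectus have already been extracted; the function name and the labels are hypothetical.

```python
def content_richness(risk_types):
    """Number of distinct risk types found in one risk disclosure section."""
    # Normalize case and surrounding whitespace so repeated labels count once;
    # empty labels are ignored.
    normalized = {label.strip().lower() for label in risk_types if label.strip()}
    return len(normalized)


# Hypothetical risk-type labels extracted from one prospectus.
labels = ["Market risk", "Technology risk", "market risk ", "Policy risk", ""]
print(content_richness(labels))
```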

3.4.2.3 Control variable

We control the company characteristic variables that may affect the pricing efficiency on the first day of listing, such as enterprise age, market system environment, registered capital, governance structure, loss situation, irrational emotions of investors, etc. Among them, the company size is the logarithm of total assets in the year of listing. The cost loss of large-scale institutions after financial fraud is far greater than that of small-scale institutions, so it may have an impact on the information disclosure effect of listed companies (Kabir et al., 2020).

In terms of governance structure, the nature of enterprise equity (State) is selected as the characteristic variable, and the state-owned holding is 1, while the non-state-owned holding is 0. Previous studies have shown significant economic differences in underpricing in different financial market environments, proving the importance of location selection in the listing process (Marcato et al., 2018). Bhardwaj and Imam (2019) also verify the importance of the external market environment for information disclosure. Therefore, for the institutional market environment, this paper chooses the “market-oriented index of China’s regions” in the 2013 report of China’s provincial enterprise operating environment index compiled by Wang et al. (2013) as the alternative variable of the institutional market environment. All variables are defined and shown in Table 10 in Appendix.

After controlling the company characteristic variables that may affect the pricing efficiency on the first day of listing, such as the age of the enterprise, the market system environment, the registered capital, the governance structure, and the loss situation, this paper explores the cross-action mechanism of information effect and risk effect on the IPO market from the perspective of semantics and content. Therefore, the following models are built, as shown in formulas 6, 7, and 8:

$$\begin{aligned} & Pricing\_Efficiency\left( {Semantic} \right) = \alpha + \beta_{1} Asset + \beta_{2} Age + \beta_{3} SOE \\ &\quad + \beta_{4} Market\_Index + \beta_{5} Registered\_Capital \\ & \quad + \beta_{6} Turnover\_rate + \beta_{7} LOSS + \beta_{8} Semantic\_Novelty + \beta_{9} block + \sum {Control} + \varepsilon , \\ \end{aligned}$$
(6)
$$\begin{aligned} & Pricing\_Efficiency\left( {Content} \right) = \alpha + \beta_{1} Asset + \beta_{2} Age + \beta_{3} SOE \\ &\quad + \beta_{4} Market\_Index + \beta_{5} Registered\_Capital \\ & \quad + \beta_{6} Turnover\_rate + \beta_{7} LOSS + \beta_{8} Content\_Richness + \beta_{9} block + \sum {Control} + \varepsilon . \\ \end{aligned}$$
(7)
$$\begin{aligned} & Pricing\_Efficiency\left( {Semantic*Content} \right) = \alpha + \beta_{1} Asset + \beta_{2} Age + \beta_{3} SOE \\ &\quad + \beta_{4} Market\_Index + \beta_{5} Registered\_Capital \\ & \quad + \beta_{6} Turnover\_rate + \beta_{7} LOSS + \beta_{8} Semantic\_Novelty + \beta_{9} Content\_Richness \\ &\quad + \beta_{10} Semantic\_Novelty*Content\_Richness + \beta_{11} block + \sum {Control} + \varepsilon . \\ \end{aligned}$$
(8)
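Because formula 8 contains a product term, the coefficient on either disclosure variable alone no longer gives its full effect. As a short worked step (a standard property of linear models with an interaction term, not an additional result of this paper), differentiating formula 8 with respect to each disclosure variable gives:

$$\begin{aligned} \frac{\partial Pricing\_Efficiency\left( {Semantic*Content} \right)}{\partial Semantic\_Novelty} &= \beta_{8} + \beta_{10} Content\_Richness, \\ \frac{\partial Pricing\_Efficiency\left( {Semantic*Content} \right)}{\partial Content\_Richness} &= \beta_{9} + \beta_{10} Semantic\_Novelty. \\ \end{aligned}$$

Under this reading, the sign of $\beta_{10}$ indicates whether a higher value of one disclosure characteristic strengthens or offsets the effect of the other.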

4 Analysis and results

4.1 Descriptive statistical analysis and text heterogeneity evaluation

From the perspective of semantics and content, this paper explores the cross-influence mechanism of information and risk effects on the IPO market. First, this paper makes descriptive statistics on some statistical variables. According to the statistical results in Table 1, the semantic novelty of the risk disclosure text of the IPO prospectus is generally high. In terms of text content, there are about 14 risk disclosure statements in the IPO prospectus, which is significantly higher than the seven risk categories required in the standards for the content and format of information disclosure by companies offering securities to the public No. 1—prospectus issued by CSRC in 2015. It can be seen that with the increasing demand of public investors for the openness of IPO enterprises, the willingness for risk disclosure of IPO enterprises gradually increases. We also classified according to different stock plates. Table 2 shows descriptive statistics of each variable. We found that there were differences in each variable of enterprises in different stock plates.

Table 1 Descriptive statistics of statistical variables
Table 2 Descriptive statistics classified by the stock plates

This paper uses an unsupervised machine learning algorithm to extract the semantic novelty features of the risk disclosure document set. The distribution of the semantic novelty of risk disclosure is depicted in Fig. 5, and the distribution of the content richness of risk disclosure is shown in Fig. 6. According to Fig. 5, the semantic novelty values are concentrated in the range 0.70–0.75, which indicates that the semantic novelty of the risk disclosure text of IPO prospectuses is generally high. Fig. 6 shows that the number of risk categories disclosed by IPO enterprises is concentrated in the range 13–16, which is significantly higher than the seven risk categories required in the Standards for the Content and Format of Information Disclosure by Companies Offering Securities to the Public No. 1: Prospectus, issued by the CSRC in 2015. It can be seen that the willingness of IPO companies to disclose risks has gradually increased.

Fig. 5 Distribution of semantic novelty

Fig. 6 Distribution of content richness

4.2 Empirical results

Table 3 shows the effect of risk disclosure on the IPO’s first-day market performance. At the semantic level, model 1 shows that semantic novelty (Semantic_Novelty, coefficient = −1.003**) is negatively related to the underpricing rate. In other words, the higher the semantic novelty of the IPO prospectus risk disclosure, the lower the risk perception of investors, the lower the information asymmetry between enterprises and investors, the lower the underpricing rate, and the lower the first-day market returns (Hussein et al., 2020). This result is consistent with the risk effect and information effect of risk disclosure (Adam-Müller & Erkens, 2020; Hope et al., 2016; Hussein et al., 2020; Li et al., 2019a, 2019b). This validates hypotheses 1A and 2A. However, according to Models 1 and 2 in Table 4, which group the sample by stock plate, semantic novelty has a significant negative impact on the IPO underpricing rate mainly for Shenzhen A-share listed enterprises, and no significant impact for Shanghai A-share listed enterprises.

Table 3 Regression analysis of the novelty and richness of information disclosure to IPO underpricing rate
Table 4 Grouping regression analysis of the novelty and richness of information disclosure to IPO underpricing rate

The above results may be caused by the different trading systems of the two stock markets. The Shenzhen Stock Market trades by collective bidding, while the Shanghai Stock Market trades by continuous bidding. The Shenzhen Stock Market has a bidding period in the last 15 min of trading, whereas the Shanghai market has none, so investors in the Shenzhen market can still operate in the last 15 min. Investors become fatigued by prolonged decision-making, which affects the accuracy of their investment decisions (Ma et al., 2021); this makes closing prices unstable and increases the volatility of the underpricing rate in the IPO market. As a result, the semantic novelty of Shenzhen A-share prospectuses has no significant effect on IPO underpricing.

At the content level, the results are consistent with those of Yao and Zhao (2016). Model 2 in Table 3 shows that content richness (Content_Richness, coefficient = −0.009***) is negatively related to the underpricing rate. In other words, the higher the richness of the risk disclosure in the prospectus, the lower the risk perception of investors, the lower the information asymmetry between enterprises and investors, the lower the underpricing rate, and the lower the first-day market returns (Ellahie et al., 2022). This validates hypotheses 1B and 2B. The more types of risks disclosed in the IPO prospectus, the less information asymmetry between enterprises and investors (Elshandidy & Zeng, 2022). Hence, it is easier to gain investor trust, the underpricing rate is lower (Li et al., 2019), and the market returns on the first day are higher. This result is consistent with the information effect of risk disclosure. According to Models 3 and 4 in Table 4, which group the sample by stock plate, the richness of the risk disclosure has a significant negative effect on the IPO underpricing rate for companies listed on both the Shenzhen and Shanghai A-share markets. This further confirms that the richness of risk disclosure is negatively correlated with the IPO underpricing rate.

According to Models 5 and 6 in Table 4, the interaction between semantic novelty and content richness has a significant effect on IPO underpricing in each group after grouping by stock plate. However, the direction of the interaction effect differs between stock plates. In Shenzhen A-share listed companies, the interaction between semantic novelty and content richness is positively correlated with the IPO underpricing rate. In Shanghai A-share listed companies, it is negatively correlated with the IPO underpricing rate.

Moreover, this paper considers enterprise information transparency. We divide the sample into high and low transparency groups at the median of enterprise information transparency and, according to Table 5, explore its moderating effect. When corporate information transparency is low, the market response to the semantic novelty and richness of risk disclosure in the prospectus is significantly enhanced. When corporate information transparency is high, the market reaction to the semantic novelty and richness of risk disclosure in the prospectus is not significant. This validates hypothesis 3. We explain this as follows: corporate information transparency has a moderating effect on trust, and investors are significantly less interested in enterprises with high transparency than in those with low transparency (Zu et al., 2018). Therefore, the lower the transparency of corporate information, the less trust investors have in enterprises and in their risk disclosure documents. As a result, when enterprises go public, investors keep a conservative attitude when investing, which restrains the IPO underpricing rate.
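The median split used for this grouping can be sketched as follows. The function name and the transparency scores are hypothetical, and the treatment of firms exactly at the median is an assumption, since the paper does not state how ties are handled.

```python
from statistics import median


def split_by_median(transparency_by_firm):
    """Split firms into high and low groups at the sample median.

    Assumption: firms exactly at the median go to the low group; the
    paper does not state how ties are handled.
    """
    cutoff = median(transparency_by_firm.values())
    high = sorted(f for f, t in transparency_by_firm.items() if t > cutoff)
    low = sorted(f for f, t in transparency_by_firm.items() if t <= cutoff)
    return cutoff, high, low


# Hypothetical transparency scores for four firms.
scores = {"firm_a": 0.2, "firm_b": 0.5, "firm_c": 0.7, "firm_d": 0.9}
cutoff, high, low = split_by_median(scores)
print(cutoff, high, low)
```

Each group would then be passed to the same regression specification separately, which is what a grouped regression amounts to.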

Table 5 Grouping regression results of the moderating effect of enterprise information transparency

At the semantic level, risk disclosure is negatively correlated with IPO underpricing rate. At this point, the risk effect is significant. This means that when risk disclosure is improved, investors have more channels to obtain enterprise information and have certain predictions of enterprise risk. On the contrary, whitewash in semantics brings about positive effects (Lo et al., 2017). At the content level, content richness has a negative correlation with the first-day underpricing rate, and the information effect is obvious. This means that when the richness of information disclosure is increased, investors are more inclined to make investment decisions according to the types of risks disclosed when there is no other way to obtain information.

4.3 Further analysis

The above part analyzes the internal mechanism through which prospectus risk disclosure influences IPO underpricing from the perspective of the cross-influence of the risk effect and the information effect. But why are companies willing to make full disclosure of risks before the relevant research results are available? Are well-performing enterprises displaying their good performance simply to avoid being undervalued in a lemon market? Or is full disclosure an expedient measure adopted by poorly performing companies to avoid the regulatory risks caused by a decline in future performance? This paper studies these questions next.

4.3.1 Risk disclosure and future violation risk

Existing research on financial market supervision focuses almost exclusively on punishment after the event. However, from the perspective of risk prevention, preventive supervision should play an important role. This paper examines the economic consequences of the opportunistic behavior of management from the perspective of the risk of corporate violations. Because of the delays between the date a violation occurs, the date of the punishment judgment, and the punishment result, this paper defines the variable fraud by whether the IPO company commits violations within three years after listing. In this way, we can find the possibility of fraud in the prospectus (Sun et al., 2021).

If the company’s violations in that year, including false records, delayed disclosure, major omissions, fraudulent listing, insider trading, etc., are punished by the CSRC, fraud is 1; otherwise, it is 0. According to the results in Table 6, for enterprises with future irregularities, neither semantic novelty (Semantic_Novelty, coefficient = 6.595) nor content richness (Content_Richness, coefficient = −0.023) has a significant effect on the IPO underpricing rate. For enterprises with no future irregularities, both the semantic novelty (Semantic_Novelty, coefficient = −0.724*) and the content richness (Content_Richness, coefficient = −0.007***) of the prospectus significantly affect the IPO underpricing rate. According to the regression results in Table 6, it is easier for future illegal enterprises to gain investor trust, improve pricing efficiency, and obtain market returns by manipulating risk disclosure information. This shows that future illegal enterprises are more likely to obtain the reward of honest disclosure by manipulating the semantic novelty and the risk types (content richness) disclosed in the prospectus to avoid risk and cover up violations. The semantic novelty and risk types (content richness) of the prospectus risk disclosure of excellent companies have a significant influence on the IPO underpricing rate. This shows that, on the basis of honest disclosure, good enterprises also improve semantic novelty, enhance the value content of information disclosure, and generate market returns in order to prevent the malignant market effect brought by semantic inflexibility.
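The construction of the fraud indicator can be sketched as below. This is an illustration only: the function name and dates are hypothetical, and it is an assumption that the window is measured from the listing date to the same calendar day three years later and applied to the dates of punished violations.

```python
from datetime import date


def fraud_indicator(listing_date, violation_dates, window_years=3):
    """Return 1 if any punished violation falls within the window after listing, else 0."""
    try:
        window_end = listing_date.replace(year=listing_date.year + window_years)
    except ValueError:
        # Listing on 29 February: fall back to 28 February of the end year.
        window_end = listing_date.replace(
            year=listing_date.year + window_years, day=28
        )
    return int(any(listing_date <= d <= window_end for d in violation_dates))


# Hypothetical listing and violation dates, for illustration only.
listed = date(2015, 6, 1)
print(fraud_indicator(listed, [date(2017, 3, 15)]))  # inside the window
print(fraud_indicator(listed, [date(2019, 1, 1)]))   # after the window
```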

Table 6 Grouping regression results of the classification of enterprises in violation of regulations in the future

5 Robustness test

5.1 Changing the measurement of market pricing efficiency

To control for the influence of the industry and the market, we need to redefine IPO pricing efficiency. Studies have shown that the return rate of the Shenzhen Stock Exchange component index from the IPO pricing date to the listing date differs significantly before and after the reform (Zhou et al., 2021). The P/E ratio is calculated using the closing price of the IPO day and earnings variables prior to the IPO (Makrominas & Yiannoulis, 2021). This paper therefore adjusts the measure of IPO pricing efficiency as shown in Formula 9, where RM is the weighted return rate of the Shanghai and Shenzhen stock indexes on the first day of listing. According to Tables 7 and 8, the results remain significant after replacing the measure of market pricing efficiency, which shows the robustness of the results.

$${\text{AdjIR}} = \frac{{1 + {\text{IR}}}}{{1 + {\text{RM}}}}$$
(9)
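A small numerical sketch of formula 9 in Python is given below. The function name and the input values are hypothetical; IR is the first-day underpricing rate and RM the market return on the first day of listing, as defined in the text.

```python
def adjusted_underpricing_rate(ir: float, rm: float) -> float:
    """Market-adjusted underpricing rate AdjIR = (1 + IR) / (1 + RM)."""
    if rm == -1.0:
        raise ValueError("a market return of -100% makes AdjIR undefined")
    return (1.0 + ir) / (1.0 + rm)


# Hypothetical values: 44% first-day underpricing, 2% market return.
adj = adjusted_underpricing_rate(ir=0.44, rm=0.02)
print(round(adj, 4))
```

When the market return is zero, AdjIR reduces to 1 + IR, so the adjustment only changes the measure on days when the market itself moves.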
Table 7 Robustness regression analysis of IPO pricing efficiency
Table 8 Robust grouped regression analysis of IPO pricing efficiency

5.2 Other robustness tests

In addition to changing the measure of new share pricing efficiency, this paper also conducted several other robustness tests, for example changing the semantic training set of the unsupervised model (Niu & Silva, 2021a, b), changing the IPO observations (Massa & Zhang, 2021), and using multiple regression models (Rashid et al., 2014). The verification results are consistent with our main results and are not described here. This paper finds that the semantics and content of the prospectus risk disclosure have a significant impact on the IPO market performance on the first day.

Because corporate information transparency plays an important moderating role as a channel through which investors obtain other information, studying its moderating effect on the IPO underpricing rate is meaningful (He & Fang, 2019). We re-classified the level of enterprise information transparency using the 75% level as the dividing line, as shown in Table 9 (Li & Zhu, 2021). After replacing the classification method, we found that the results were still significant, indicating the robustness of the results.

Table 9 Robust grouped regression results of the moderating effect of enterprise information transparency

Discussion

The paper examines whether the structure of the risk factor disclosure in an IPO prospectus helps explain the cross-section of first-day returns in a sample of Chinese initial public offerings (Grover & Bhullar, 2021). By constructing the IPO pricing efficiency model, the grouping regression analysis of risk disclosure and IPO pricing efficiency is carried out. The study finds that the structure of risk disclosure helps explain the first-day exchange returns of IPO (Sheng et al., 2021). The paper uses textual analysis to extract two aspects (semantic novelty and information richness) of language used in the risk disclosure segment of Chinese IPO prospectuses. From the perspective of risk effect, at the semantic level, the higher the semantic novelty of IPO prospectus risk disclosure, the lower the risk perception of investors, the lower the underpricing rate, and the lower the first-day market return. At the content level, the higher the richness of the risk disclosed in the IPO prospectus, the lower the risk perception of investors, the lower the underpricing rate, and the lower the first-day market return (Li et al., 2021a, b). From the perspective of the information effect, at the semantic level, the higher the semantic novelty of prospectus risk disclosure, the lower the degree of information asymmetry between enterprises and investors, the lower the underpricing rate, and the lower the first-day market return. In terms of content stratification, the richer the risk disclosure content of the prospectus is, the lower the information asymmetry between enterprises and investors, the lower the underpricing rate, and the lower the first-day market return (Hussein et al., 2020).

The paper examines the association of these two aspects with initial IPO returns. We also consider the interaction between the risk effect and the information effect. Within the same stock plate, the interaction between semantic novelty and content richness in prospectus risk disclosure has a significant effect on the IPO underpricing rate, but the direction of the interaction differs between stock plates (Baker et al., 2021; Peng et al., 2021). This fills a gap in previous research on the interaction between the risk effect and the information effect on IPO underpricing. These results are interpreted as consistent with a risk effect and an information effect, respectively.

Based on an unsupervised machine learning algorithm, this paper explores the mechanism of the information effect and the risk effect on the IPO market in the short term and the long term from the perspective of semantics and content (Engelen et al., 2020). We mark whether enterprises commit violations in the future and explore the factors influencing the IPO underpricing rate of such enterprises. The prospectus is open to fraud (Sun et al., 2021). This paper constructs a regression model of the IPO underpricing rate of future illegal enterprises. It is found that the structure of risk disclosure helps to explain the characteristics of the IPO prospectuses of future offending companies (Sheng et al., 2021). For enterprises with no future irregularities, the semantic novelty and content richness of the prospectus each significantly influence the IPO underpricing rate. For enterprises with future irregularities, neither the semantic novelty nor the content richness of the prospectus has a significant influence on the IPO underpricing rate. Therefore, the IPO underpricing rate is affected by the prospectus’s semantic novelty and content richness in the short term. In the long run, whether enterprises violate rules is also influenced by semantic novelty and content richness. Semantic novelty and content richness affect the IPO underpricing rate of enterprises that do not offend in the future, but not that of enterprises that do (Chintya et al., 2020; Engelen et al., 2020).

First of all, based on the summary of the existing research on the risk effect and information effect of risk disclosure, this paper finds that these two views do not exist independently and then explores the cross-action mechanism of the two effects (Adam-Müller & Erkens, 2020; Hope et al., 2016; Hussein et al., 2020; Li et al., 2019a, 2019b). The results show that at the semantic level, the risk disclosure text information of the IPO prospectus follows the risk effect of risk disclosure. At the content level, the results follow the information effect of risk disclosure. We find that risk effect and information effect interact with risk disclosure. Second, this paper considers the regulatory mechanism of enterprise information transparency (He & Fang, 2019). Under the same stock block nature, it is found that corporate information transparency can regulate the impact of semantic novelty and content richness of prospectus on IPO underpricing. This means that when the state regulates IPO underpricing, not only the structure of the prospectus can be reformed, but also the transparency of corporate information can be changed. The lower the transparency of the enterprise is, the higher the semantic novelty and content richness of the prospectus will be for investors, which reduces IPO underpricing. Investors are more inclined to make investment decisions according to the risk types of risk disclosure when there is no other access to information (Evans & Sun, 2021). Third, this paper classifies the samples to explore the risk disclosure characteristics of future illegal enterprises. It is found that the semantic and content of risk disclosure of future offending enterprises have no significant effect on IPO underpricing rate (Chintya et al., 2020; Engelen et al., 2020). 
This shows that future illegal enterprises are more likely to obtain the reward of honest disclosure by manipulating the content of the risk disclosure text of the prospectus to avoid the risk, cover up the problem of the violation, and then improve the market return. Based on honest disclosure, a good IPO company will also improve the level of semantic disclosure, increase the value content, and generate market returns to prevent the malignant market effect caused by a semantic template. Good companies generally disclose risk information voluntarily. Voluntary disclosure of enterprises has a better fund-raising capacity, so its own issue price is high, and the underpricing rate will be reduced (Bourveau et al., 2022).

The findings of this paper enrich the relevant research on agency theory, information asymmetry theory, and risk disclosure text analysis. First, from the short-term and long-term perspectives, this paper empirically studies the impact of prospectus risk disclosure on the efficiency of the capital market and complements the relevant research on risk disclosure, which is conducive to a more comprehensive and accurate estimation of the impact of IPO enterprise risk disclosure text on the efficiency of the capital market in China (Engelen et al., 2020). Second, it summarizes the two viewpoints of existing scholars (Adam-Müller & Erkens, 2020; Hope et al., 2016) and further explores the internal mechanism through which risk disclosure text influences the efficiency of the capital market, that is, how the risk effect and the information effect operate, which helps to alleviate the contradiction between frank communication and risk aversion. Third, as a supplementary channel of information resources, enterprise information transparency plays a moderating role in the information effect and risk effect of risk disclosure (He & Fang, 2019). Fourth, from the perspective of risk measurement, it moves beyond the existing manual reading, labeling, or complex machine learning classification (Catalfo & Wulf, 2016). Based on an unsupervised machine learning algorithm, this paper complements the research on quantitative feature extraction of risk disclosure text (Niu & Silva, 2021a, b).

The results of this study have some practical significance. Our research finds that the risk information in prospectuses in China is uneven. This paper considers the implications for the IPO listing system and its management from four aspects. For policymakers, the conclusion of this paper suggests that they need to readjust the structure of the prospectus, adopting a structure with high semantic novelty and high content richness, so as to reduce the IPO underpricing rate of enterprises (Grover & Bhullar, 2021). At the same time, enterprises should be required to submit more information on corporate information transparency in the prospectus (Choi & Jung, 2021). Although the willingness of IPO companies to disclose risks is gradually increasing, the manipulation of risk disclosure text by whitewashing and other means still exists (Lo et al., 2017). This impairs the capital market’s efficiency, especially in the case of future illegal enterprises, and leads to the “lemon market” effect. Therefore, effective third-party audit and supervision should be continued while the content and form of risk disclosure in the IPO prospectus are further standardized. For market regulation, the structure of the prospectus can be used to analyze whether an enterprise has been listed and whether it has the risk of future violations, so as to prevent the short-term payment of huge funds by enterprises, to eliminate arbitrage opportunities in the market, and to reduce the market instability and investors’ property losses caused by the bankruptcy of enterprises (Sun et al., 2021). 
Future illegal enterprises can hide their problems by manipulating the risk text, which shows that the practice of risk disclosure in China does not reflect the original intention of the system (Jin et al., 2021); this provides empirical evidence for the regulatory authorities to strengthen the supervision of prospectus risk disclosure. The results of this paper can serve as a reference for the regulatory authorities in supervising the information disclosure of the prospectus. Also, according to the specific information environment of newly listed enterprises, they should be encouraged in different ways to reduce the degree to which their risk disclosure follows a fixed model and to transmit more information, which is conducive to the operating efficiency of the capital market (Baker et al., 2021). For portfolio managers, it is necessary to help investors reasonably analyze the future development of enterprises from corporate prospectuses, strengthen the promotion of new share issuance, reduce the degree of information asymmetry between new share issuers and investors, reasonably select companies with high semantic novelty and content richness in the prospectus, and avoid pursuing companies with high IPO underpricing rates. For investors, from the perspective of the risk effect and the information effect, the semantic novelty and content richness of the prospectus can be used to select enterprises with a low underpricing rate for investment (Wang & Song, 2021). This avoids speculative bubbles and improves the efficiency of capital allocation across society as a whole. At the same time, investors should also consider whether the enterprise may commit violations in the future and how this would affect their investment, so as to prevent property losses (Sun et al., 2021).

6 Conclusion and future research directions

This paper examines the content of risk disclosure of Chinese firms and how it affects IPO underpricing. The study uses an unsupervised machine learning approach to generate two major variables (Engelen et al., 2020): Semantic_Novelty and Content_Richness. Regression results indicate that Semantic_Novelty is negatively related to IPO underpricing, and Content_Richness is negatively associated with IPO underpricing. Therefore, the semantic novelty and richness of a good prospectus risk disclosure text are high. From the perspective of semantics and content, this paper explores the cross-influence mechanism of the risk effect and the information effect on the IPO market. At the semantic level, the risk disclosure text information of the IPO prospectus follows the risk effect of risk disclosure. At the content level, the results follow the information effect of risk disclosure. The risk effect and the information effect of risk disclosure interact within the same stock plate (Adam-Müller & Erkens, 2020; Hope et al., 2016; Hussein et al., 2020; Li et al., 2019a, 2019b). Future illegal enterprises can hide their own problems by manipulating the risk text. They can avoid punishment by improving the semantic novelty and richness of their prospectuses (Nefedova & Pratobevera, 2020).

In particular, this paper explores the economic effect of the risk disclosure text of the prospectus on China’s capital market. The overall development of China’s capital market still follows a government-led model. According to the specific information environment of newly listed enterprises, different ways are adopted to encourage them to reduce the degree to which their risk disclosure follows a fixed model, so as to alleviate the information asymmetry between enterprises and investors (Huang et al., 2019a, 2019b). Our research can provide a theoretical basis for the Chinese government to reduce the IPO underpricing rate and stabilize the stock market, and for the sustainable and healthy development of China’s capital market and even the national economy.

There are still limitations in this research that are worthy of further study. First, this paper overcomes the subjectivity and high cost of manual reading and machine learning manual annotation and adopts an unsupervised machine learning algorithm to analyze the text, which is more objective and convenient in method. However, due to the sensitivity of the doc2vec model to the number of samples, there may be some errors in the results. Future research can explore the way of unsupervised text processing. Secondly, this paper reveals the important impact of unstructured text information on the market, which is helpful for us to fully and objectively understand the impact of risk disclosure text on the efficiency of market resource allocation. However, it is not enough to measure only novelty and risk content, which may be affected by many factors, and more factors can be taken into consideration in the future (Boroon et al., 2021; Changchit et al., 2021; Fu et al., 2021; Huang et al., 2021a, b; Islam et al., 2021; Li et al., 2021a, b; Vali et al., 2021; Wang et al., 2021; Zhang et al., 2021a, b; Zhang, Ye et al., 2021). Third, different market participants’ perception levels of risk and investment decision-making ability may have an impact on the results. Therefore, the follow-up study can explore the impact of investors’ risk perception level and investment decision-making ability on the capital market.