
1 Introduction

In today’s world, the stock market is becoming increasingly popular. This is due to several factors, including the increased accessibility of the market for ordinary users. Nowadays, almost anyone can purchase shares of a company; it is only necessary to open an account with a brokerage firm. Currently, some of the most popular brokerage firms for regular users are Interactive Brokers [1] and Robinhood [1]. Both applications provide broad access to the stock market and other financial instruments, including ETFs, bonds, and more. Statistics on Robinhood usage [2] show that over 15 million people use the trading platform.

Increased access to the stock market leads to a growing interest in investing and an increase in the number of new investors. However, investing in the stock market is not without risks and requires good preparation and analysis. To protect and grow their investments, investors must carefully analyze the company they plan to invest in. In the US market, currently one of the largest markets in the global economy, all public companies are regulated by the Securities and Exchange Commission (SEC), which requires them to file periodic reports on their business. The annual report is called Form 10-K [3], and its quarterly counterpart is Form 10-Q. These reports are public and available to investors for analysis. Based on the data they provide, one can assess the state of affairs in the company and how well its business is going, and make decisions accordingly.

In this paper, we discuss the use of Python and its tools for automating the extraction of key data from companies’ periodic reports (10-K) and methods of predicting these data from historical indicators. The goal of the paper is to find the relationship between stock prices and critical indicators from business reports in order to predict stock prices for investment decisions.

The paper has the following structure: Section 2 reviews the literature; Section 3 describes the general financial ratios used in the analysis model; Section 4 demonstrates the data analysis methods and their implementation; the last section concludes.

2 Related Works

In the field of financial data analysis and financial indicator forecasting using machine learning and data analysis methods, numerous studies have been conducted. We will consider some of them in this section.

Zanc et al. [4] investigated the application of deep neural networks to financial market forecasting. They used stock index and exchange rate data and demonstrated that deep neural networks allow for high forecasting accuracy. Wasserbacher et al. [5] explored the application of machine learning methods for analyzing and predicting financial data. Their research used company stock data and showed that machine learning methods could be effectively used for financial indicator forecasting. Doryab et al. [6] examined the application of regression analysis for predicting investment returns. They used company stock return data and demonstrated that regression analysis could be effectively used for investment return forecasting. Snihovyi et al. [7] predicted cryptocurrency prices using different ML algorithms. All these studies show that machine learning and data analysis methods can be effectively used for financial indicator forecasting. In contrast to previous research, in this paper we apply these methods to analyzing financial data and predicting financial indicators based on business reports.

Mushtaq et al. [8] employ Natural Language Processing (NLP), a subdomain of AI, to predict sentiment in 3729 annual 10-K financial reports of S&P 500 companies over 2002–2019. They found no significant association between the firm’s financial performance indicators and the positivity of its 10-K reports [8]. We believe that the more reports are analyzed, the weaker the correlation between 10-K reports and financial indicators, because stock markets react more strongly to negative results in reports than to positive ones, as described by Huang et al. [9].

A firm’s annual reports help investors make decisions about the company’s stock. As a rule, investors analyze financial data to predict stock prices and future returns, volatility, and risks. At the same time, financial performance indicators may shape the extensive text of the company’s 10-K report.

10-K reports are a signal that can disclose the positive performance of the company, using complex and obfuscating narratives [10]. At the same time, financial indicators can reveal the actual state of the company without manipulation by agents (executives) who try to preserve the company’s positive image and their own positions in it. Cohen et al. [11] revealed that 10-K reports are relevant for the firm’s financial indicators, such as future earnings, profitability, and news announcements, and can predict firm-level bankruptcies.

A regression model with explanatory financial and control variables can measure their impact on dependent variables (e.g., positive or negative news as a qualitative variable, or the stock price as a quantitative variable) [8]. Among financial indicators, existing research considers return on assets (ROA), return on equity (ROE), Tobin’s Q (TQ), and return on invested capital (ROIC). Control variables of the company consist of firm size, asset tangibility, liquidity, financing needs or deficit, and financial leverage. Investors have different risk profiles that define their propensity to different FI [12]. Clusters can group investors with identical preferences who are interested in the same business reports [13]. Neural networks are used to predict stock prices based on big data [14].

Our model includes quantitative variables, both dependent and explanatory. In contrast to existing studies, ours identifies only the statistically significant explanatory variables for stock price changes, and the direction of those changes, using 10-K reports and API-based data analysis. After this analysis, stock price forecasts can be based only on explanatory variables from the 10-K report that are free of multicollinearity, rather than on all financial metrics in the report.

3 Financial Ratios of the 10-K Report

A 10-K report is an annual report that companies registered in the United States and traded on American stock exchanges must submit to the United States Securities and Exchange Commission (SEC). The 10-K report contains detailed information about the company’s financial condition, operations, management, risks, and strategy.

The 10-K report includes the following information:

  1. Financial statements: Financial indicators such as the balance sheet, income statement, cash flow statement, and statement of changes in equity.

  2. Business overview: A description of the main aspects of the business, including products, services, industry, geographic markets, and major competitors.

  3. Risk factors: An analysis of potential risks that could negatively affect the company’s performance, including competition, regulatory risks, and technological changes.

  4. Management’s Discussion and Analysis (MD&A): The company’s analysis of its financial results, strategy, plans, and factors that could impact future results.

  5. Corporate governance: Information on the company’s board of directors and executive officers, as well as information on their compensation, stock holdings, and stock options.

  6. Security ownership and additional expenses: A description of the shareholder capital structure, including various classes of shares and shareholder rights.

  7. Legal proceedings: Information on any significant legal proceedings in which the company is involved.

  8. Tax matters: A description of the company’s tax obligations and any potential tax issues that may arise.

  9. Significant agreements and contracts: A description of significant contracts and agreements that may be material to the company’s business.

In the 10-K Report, vital financial indicators provide information about the company’s financial condition and performance. These indicators are divided into several financial statements, such as the balance sheet, income statement, and cash flow statement.

The Income Statement, also known as the earnings report or the statement of performance, is a summary of a company’s revenues and expenses for a specific period of time, usually a year or a quarter. It shows how the company converts revenue from the sale of goods and services into net profit, considering all expenses.

Here are the main sections and items of the Income Statement:

  1. Revenue: Income from the sale of goods or services. Revenue can also be referred to as sales or turnover.

  2. Cost of Goods Sold (COGS): The costs of producing or purchasing the goods or services that the company sells. COGS includes materials, labor, and overhead costs for production.

  3. Gross Profit: The difference between revenue and cost of goods sold. Gross profit shows how much a company earns after paying its direct costs for producing goods or services.

  4. Operating Expenses: Costs of managing the company that are not related to producing goods or services. Operating expenses include employee salaries, rent, advertising, depreciation, research and development, and other non-production costs.

  5. Operating Income: The difference between gross profit and operating expenses. Operating income shows how much a company earns from its core business, excluding interest and taxes.

  6. Interest and Other Financial Expenses: Costs of interest on debts and other financial expenses, such as fees for servicing debts.

  7. Income Tax: The amount of taxes the company must pay on its income.

  8. Net Income: The final profit after accounting for all revenues, expenses, interest, and taxes for a specific period of time, usually a year or a quarter. Net income is used to determine the success of a company and its ability to generate profits for shareholders.
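The way these items build on one another can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in the list: the function name `income_statement_summary`, its parameter names, and the figures in the example are assumptions chosen for readability, not values from any real report.

```python
def income_statement_summary(revenue, cogs, operating_expenses,
                             interest_expense, income_tax):
    """Derive the subtotal lines of an income statement from its inputs.

    All amounts are assumed to be in the same currency unit and period.
    """
    # Gross profit: revenue minus the direct cost of goods sold.
    gross_profit = revenue - cogs
    # Operating income: gross profit minus non-production operating expenses.
    operating_income = gross_profit - operating_expenses
    # Net income: operating income minus interest and income tax.
    net_income = operating_income - interest_expense - income_tax
    return {
        "gross_profit": gross_profit,
        "operating_income": operating_income,
        "net_income": net_income,
    }


# Hypothetical figures (in millions), used only for illustration.
summary = income_statement_summary(
    revenue=1000.0,
    cogs=600.0,
    operating_expenses=250.0,
    interest_expense=20.0,
    income_tax=30.0,
)
print(summary)
```

With these illustrative inputs, gross profit is 400, operating income is 150, and net income is 100.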

The balance sheet is a snapshot of a company’s financial position at a specific date. It includes assets (what the company owns), liabilities (what the company owes), and equity (the difference between assets and liabilities). The balance sheet consists of the following sections:

  1. Assets: These are divided into current assets (e.g., cash, accounts receivable, inventories) and long-term assets (e.g., equipment, real estate, intellectual property).

  2. Liabilities: These include current liabilities (e.g., accounts payable, short-term debt) and long-term liabilities (e.g., borrowed funds, pension obligations).

  3. Equity: This is the sum of funds invested by shareholders and the company’s accumulated earnings.

The Cash Flow Statement shows how a company generates and uses cash over a specific period of time, typically a year. It details changes in cash and cash equivalents, divided into three main categories:

  1. Operating Cash Flow: reflects the net cash flow generated from the company’s core activities, such as selling goods and services, paying suppliers, employee salaries, and taxes. Positive operating cash flow indicates that the company is successfully converting its profits into cash.

  2. Investing Cash Flow: reflects cash flows related to investments in long-term assets, such as the purchase or sale of equipment, real estate, intellectual property, shares of other companies, and debt instruments. Negative investing cash flow may be associated with investments in the company’s growth and development.

  3. Financing Cash Flow: reflects cash flows related to the company’s financing, including the issuance and repurchase of shares, the issuance and repayment of debts, payment of dividends to shareholders, and other financing-related operations. Negative financing cash flow may indicate the repayment of debts or the payment of dividends.

Based on the data in these statements, we can calculate the following important financial ratios (Table 1):

Table 1. Financial metrics of companies’ 10-K reports
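As a rough sketch of how such ratios can be computed from statement items, the Python function below uses common textbook definitions. The ratio names follow the variable names used later in Sect. 4; the exact formulas are assumptions, since conventions differ between sources, and the input figures are hypothetical.

```python
def compute_ratios(items):
    """Compute common financial ratios from a dict of statement items.

    The formulas are standard textbook definitions and are assumptions
    here; individual data providers may define some ratios differently.
    """
    revenue = items["revenue"]
    equity = items["total_equity"]
    shares = items["outstanding_shares"]
    price = items["market_price_per_share"]

    earning_per_share = items["net_income"] / shares
    book_value_per_share = equity / shares
    return {
        "gross_margin": (revenue - items["cost_of_revenue"]) / revenue,
        "operating_margin": items["operating_income"] / revenue,
        "net_profit_margin": items["net_income"] / revenue,
        "current_ratio": items["current_assets"] / items["current_liabilities"],
        "debt_to_equity_ratio": items["total_debt"] / equity,
        "return_on_equity": items["net_income"] / equity,
        "return_on_assets": items["net_income"] / items["total_assets"],
        "earning_per_share": earning_per_share,
        "price_to_earning_ratio": price / earning_per_share,
        "price_to_book_ratio": price / book_value_per_share,
    }


# Hypothetical statement items (in millions, except per-share price).
example_items = {
    "revenue": 1000.0,
    "cost_of_revenue": 600.0,
    "operating_income": 150.0,
    "net_income": 100.0,
    "current_assets": 500.0,
    "current_liabilities": 250.0,
    "total_debt": 300.0,
    "total_equity": 600.0,
    "total_assets": 1200.0,
    "outstanding_shares": 50.0,
    "market_price_per_share": 40.0,
}
ratios = compute_ratios(example_items)
for name, value in ratios.items():
    print(f"{name}: {value:.4f}")
```

A real pipeline would also need to guard against zero or negative denominators (for example, negative equity or zero earnings), which this sketch leaves out.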

4 Data Analysis

We obtained historical data for the past 25 years (1997–2022) for Amazon, Inc. [15] using a third-party service called Financial Modeling [16], which stores and provides annual and quarterly reports of companies through an API (Fig. 1). From this data, we extracted essential metrics such as Revenue, Cost of Revenue, Operating Income, Net Income, Current Assets, Current Liabilities, Total Debt, Total Equity, Shareholders Equity, Outstanding Shares, Market Price per Share, Earnings per Share, and Book Value per Share.

Using Python, we processed this data and created a quarterly dataset (Table 2). We analyzed this dataset and identified relationships between the stock price and the financial indicators extracted from the quarterly reports using random forest and multiple linear regression algorithms.
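A minimal sketch of how such a quarterly dataset might be assembled with pandas is shown below. The record layout, the field names, and the helper `build_quarterly_dataset` are hypothetical; they stand in for whatever the data service returns and do not reproduce its schema.

```python
import pandas as pd

# Hypothetical records shaped like rows a financial-statements API might
# return. Field names and values are illustrative assumptions only.
statement_rows = [
    {"date": "2021-06-30", "revenue": 120.0, "net_income": 12.0},
    {"date": "2021-03-31", "revenue": 100.0, "net_income": 8.0},
    {"date": "2021-09-30", "revenue": 110.0, "net_income": 9.0},
]
price_rows = [
    {"date": "2021-03-31", "market_price_per_share": 50.0},
    {"date": "2021-06-30", "market_price_per_share": 55.0},
    {"date": "2021-09-30", "market_price_per_share": 53.0},
]


def build_quarterly_dataset(statements, prices):
    """Join statement items and prices on the reporting date."""
    fin = pd.DataFrame(statements)
    px = pd.DataFrame(prices)
    # Parse the date strings so that sorting is chronological.
    fin["date"] = pd.to_datetime(fin["date"])
    px["date"] = pd.to_datetime(px["date"])
    # Keep only quarters present in both sources.
    merged = fin.merge(px, on="date", how="inner")
    return merged.sort_values("date").set_index("date")


dataset = build_quarterly_dataset(statement_rows, price_rows)
# Derived indicators can be added as new columns.
dataset["net_profit_margin"] = dataset["net_income"] / dataset["revenue"]
print(dataset)
```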

We selected two algorithms to compare their performance in predicting stock prices based on financial indicators extracted from business reports. Random Forest combines multiple decision trees and averages their results, which reduces overfitting. Multiple linear regression estimates the impact of each explanatory variable on stock prices. Comparing these algorithms will help us assess each variable’s contribution to explaining stock price changes (Fig. 2).
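The averaging behavior of Random Forest can be checked directly in scikit-learn. The sketch below uses synthetic data and illustrative hyperparameters (`n_estimators=25`, `random_state=0`), none of which come from the paper. It compares the forest’s prediction with the mean of its individual trees’ predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data used only for illustration (not the paper's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=60)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# Predictions of every individual tree, shape (n_trees, n_samples).
per_tree = np.array([tree.predict(X) for tree in forest.estimators_])

# Average the trees' predictions and compare with the forest's own output.
averaged = per_tree.mean(axis=0)
forest_prediction = forest.predict(X)
print(np.allclose(averaged, forest_prediction))
```

Each tree is trained on a bootstrap sample of the data, so individual trees may overfit in different ways; averaging them smooths out part of that variance.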

Fig. 1. Dynamics of market price per share for Amazon, Inc., 1997–2022

Table 2. Summary statistics

Multiple linear regression assumes that the explanatory variables are not strongly correlated with one another, that is, that there is no multicollinearity. Therefore, before using it to analyze the relationship between stock prices and the explanatory variables, it is necessary to check the data for multicollinearity. If multicollinearity is detected, measures such as removing one of the correlated explanatory variables or applying the Farrar–Glauber algorithm may be taken to address it. To remove correlated variables, we calculated the correlation coefficients between the independent variables, as shown in Fig. 3. We then set a correlation threshold of 0.8 and removed variables that were highly correlated.
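One simple way to apply such a threshold is sketched below. The helper `drop_highly_correlated` and its keep-first policy are assumptions, not necessarily the procedure used in this study: columns are visited in order, and a column is kept only if its absolute correlation with every column already kept does not exceed the threshold. The data is synthetic.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(features, threshold=0.8):
    """Greedily keep columns whose |correlation| with kept columns <= threshold."""
    corr = features.corr().abs()
    kept = []
    for column in features.columns:
        if all(corr.loc[column, other] <= threshold for other in kept):
            kept.append(column)
    return features[kept]


# Synthetic example: earning_per_share is built to track net_income closely,
# while gross_margin is generated independently.
rng = np.random.default_rng(0)
net_income = rng.normal(size=200)
features = pd.DataFrame({
    "net_income": net_income,
    "earning_per_share": 2.0 * net_income + 0.1 * rng.normal(size=200),
    "gross_margin": rng.normal(size=200),
})

reduced = drop_highly_correlated(features, threshold=0.8)
print(list(reduced.columns))
```

Which variable of a correlated pair is dropped depends on column order under this policy, so in practice the order should reflect which indicators are considered more important to retain.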

Fig. 2. Histogram of market price per share for Amazon, Inc., 1997–2022

After the correlation analysis, we removed the correlated explanatory variables: earning_per_share, total_debt, total_assets, net_profit_margin, return_on_assets, and price_to_book_ratio. We also removed three additional variables: debt_to_equity_ratio, current_ratio, and return_on_equity. After removing these variables, the deviation percentage of our model improved significantly, indicating that the model’s predictions were closer to the actual stock prices. This left five independent variables: net_income, price_to_earning_ratio, total_equity, operating_margin, and gross_margin. To build the multiple regression and random forest models, we used the scikit-learn (sklearn) and pandas libraries to create a dataset based on the previously processed data.

Fig. 3. Correlation analysis for explanatory variables

We split the original dataset into training and testing subsets to train the model and evaluate its performance. We followed a commonly used approach in which the data is split in an 80:20 ratio: 80% of the data was used for training the model, and the remaining 20% was used for evaluating its performance. This helped us estimate the accuracy and robustness of the model on new data and detect overfitting. The split was performed using the “train_test_split” function in the scikit-learn library. The code is shown in Fig. 4.

Fig. 4. Estimation of selected algorithms
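A minimal sketch of such a split and model fit is given below. It uses synthetic data in place of the Amazon dataset, and the hyperparameters (`n_estimators=100`, `random_state=42`) are illustrative assumptions, not values taken from this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 100 quarters, 5 explanatory variables.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=100)

# 80:20 split; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns R-squared on the held-out test subset.
    scores[name] = model.score(X_test, y_test)
    print(name, round(scores[name], 3))
```

By default `train_test_split` shuffles the rows. For time-ordered data such as quarterly reports, passing `shuffle=False` keeps the test subset strictly later than the training subset, which may give a more conservative estimate of forecasting performance.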

We used RMSE, Deviation Percentage, and R-squared to compare our models. RMSE measures the absolute prediction error, Deviation Percentage measures the relative error, and R-squared measures how well the model fits the data. Together, these metrics provide a comprehensive assessment of the model’s performance, because they are interpretable and capture different aspects of prediction accuracy and model fit. The results of our models are shown in Table 3.

Table 3. Coefficient comparison
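The three metrics can be sketched in plain Python as follows. RMSE and R-squared use their standard definitions. The text does not give a formula for Deviation Percentage, so the sketch assumes it is the mean absolute percentage error; this is an assumption and may differ from the definition used to produce Table 3. The sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def deviation_percentage(actual, predicted):
    """Assumed definition: mean absolute percentage error, in percent."""
    relative = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(relative) / len(relative)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical prices, used only to exercise the functions.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(round(rmse(actual, predicted), 4))                  # 2.5495
print(round(deviation_percentage(actual, predicted), 3))  # 11.875
print(round(r_squared(actual, predicted), 3))             # 0.948
```

Under this assumed definition, a relative metric can become very large when actual prices are close to zero, because each error is divided by the actual value.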

Based on the evaluation results, the Random Forest model predicts stock prices from financial indicators better than the Linear Regression model. The Random Forest model has a lower RMSE of 9.54 (compared to 27.87 for Linear Regression) and a lower Deviation Percentage of 34.61% (compared to 394.24% for Linear Regression), which means it has better predictive accuracy and smaller prediction errors. Additionally, the Random Forest model has a higher R-squared value of 97.22% (compared to 76.22% for Linear Regression), indicating that it has better overall explanatory power in capturing the variability in stock prices. However, caution should be exercised against overestimating the model, as additional factors may also affect the result. We can distinguish direct (Fig. 5) and polynomial (Figs. 6 and 7) dependences between market price per share and the significant explanatory variables.

Fig. 5. Market price per share and total equity (graph)
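One way to compare a direct (straight-line) fit with a polynomial fit is sketched below using NumPy. The data is synthetic and exactly quadratic by construction, and the helper `fit_polynomial` is hypothetical; the sketch only illustrates the comparison and does not reproduce the fits shown in the figures.

```python
import numpy as np


def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit; returns coefficients and R-squared."""
    coeffs = np.polyfit(x, y, degree)  # highest power first
    fitted = np.polyval(coeffs, x)
    ss_res = float(np.sum((y - fitted) ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    return coeffs, 1.0 - ss_res / ss_tot


# Synthetic explanatory variable and response with a quadratic relationship.
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + 0.3 * x ** 2

_, r2_linear = fit_polynomial(x, y, degree=1)
coeffs_quadratic, r2_quadratic = fit_polynomial(x, y, degree=2)

print("linear R^2:", round(r2_linear, 4))
print("quadratic R^2:", round(r2_quadratic, 4))
print("quadratic coefficients:", np.round(coeffs_quadratic, 3))
```

On real data, a higher-degree polynomial fits the training points at least as well as a lower-degree one, so the comparison is more informative when R-squared is evaluated on held-out data.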

5 Conclusions

Based on the study, we can conclude that several factors, including total equity, gross margin, operating margin, price-to-earnings ratio, and net income, influence the market price per share. In the comparison of the two models, Random Forest exhibited superior performance in predicting stock prices compared to Multiple Linear Regression. It achieved significantly lower values for both RMSE (9.54) and Deviation Percentage (34.61%), indicating improved accuracy and reduced prediction errors.

Fig. 6. Market price per share and gross margin (polynomial dependence)

Fig. 7. Market price per share and net income (direct and polynomial dependences)

Furthermore, the Random Forest model demonstrated a higher R-squared value of 97.22%, highlighting its explanatory power. This means that the financial indicators used in the model can explain around 97.22% of the variability in stock prices, making it a more reliable and effective approach for understanding the underlying patterns influencing stock prices. The Deviation Percentage of 34.61% is still relatively high, suggesting that there is room for improvement in the model’s precision. While Random Forest outperformed Multiple Linear Regression, it is essential to consider other factors that might affect stock prices beyond the selected financial indicators.

Overall, the study results showed that Random Forest is an effective tool for analyzing the dependence of stock prices on various variables. However, our model does not account for all factors affecting stock prices, so investors should also use broader market analysis when making investment decisions.