Keywords

1 Introduction

In recent years, the emergence of generative AI and large language models (LLMs) such as OpenAI’s ChatGPT have led to significant advancements in NLP. Many of these models provide the ability to be fine-tuned on custom datasets [1,2,3] and achieve the state-of-the-art (SOTA) performance across various tasks. A few of the LLMs such as GPT-3 [4] have demonstrated in-context-learning capability without requiring any fine-tuning on task-specific data. The impressive performance of ChatGPT and other LLMs [5,6,7,8,9, 79] in zero-shot and few-shot learning scenarios is a major finding as this helps LLMs to be more efficient [74,75,76,77,78]. With such learning methodologies, the LLMs can be used as a service [10] to empower a set of new real-world applications.

Despite the impressive capability of ChatGPT in performing a wide range of challenging tasks, there remain some major concerns about it in solving real-world problems like log analysis [93]. Log analysis is a vast area, and much research has been done. It mainly comprises three major categories, namely, log parsing, log analytics, and log summarization. Log parsing is an important initial step of system diagnostic tasks. Through log parsing, the raw log messages are converted into a structured format while extracting the template [11,12,13,14]. Log analytics can be used to identify the system events and dynamic runtime information, which can help the subject matter experts to understand system behavior and perform system diagnostic tasks, such as anomaly detection [15,16,17,18], log classification [19], error prediction [20, 21], and root cause analysis [22, 23]. Log analytics can further be used to perform advanced operations e.g., identify user activities, and security analysis e.g., detect logged-in users, API/service calls, malicious URLs, etc. As logs are huge in volume, log summarization enables the operators to provide a gist of the overall activities in logs and empowers the subject matter experts to read and/or understand logs faster. Recent studies leverage pre-trained language models [17, 24, 25] for representing log data. However, these methods still require either training the models from scratch [26] or tuning a pre-trained language model with labeled data [17, 24], which could be impractical due to the lack of computing resources and labeled data.

Fig. 1.
figure 1

An example of log code, log message, and structured log from [34]

More recently, LLMs such as ChatGPT [93] have been applied to a variety of software engineering tasks and achieved satisfactory performance [27, 28]. With a lack of studies to analyze ChatGPT’s capabilities on log processing, it is unclear whether it can be performed well on the logs. Although many papers have performed the evaluation of ChatGPT on software engineering tasks [29, 30, 33], specific research is required to investigate its capabilities in system log area. We are aware that the LLMs are fast evolving, with new models, versions, and tools being released frequently, and each one is improved over the previous ones. However, our goal is to assess the current situation and to provide a set of experiments that can enable the researchers to identify possible shortcomings of the current version for analyzing logs and provide a variety of specific tasks to measure the improvement of future versions. Hence, in this paper, we conduct an initial level of evaluation of ChatGPT on log data. Specifically, we divide the log processing [32] into three subsections: log parsing, log analytics, and log summarization. We design appropriate prompts for each of these tasks and analyze ChatGPT’s capabilities in these areas. Our analysis shows that ChatGPT achieves promising results in some areas, but limited outcomes in others and contains several real-world challenges in terms of scalability. In summary, the major contributions of our work are as follows:

  • To the best of our knowledge, we are the first to study and analyze ChatGPT’s ability to analyze the log data in multiple detailed aspects.

  • We design the prompts for multiple scenarios in log processing and record ChatGPT’s response.

  • Based on the findings, we outline several challenges and prospects for ChatGPT-based log processing.

Fig. 2.
figure 2

Various prompt designs to address the research questions.

2 Related Work

2.1 Log Data

With the increasing scale of software systems, it is complex to manage and maintain them. To tackle this challenge, engineers enhance the system observability [31, 99] with logs.

Logs capture multiple system run-time information such as events, transactions, and messages. A typical piece of log message is a time-stamped record that captures the activity that happened over time (e.g., software update events or received messages). Logs are usually generated when a system executes the corresponding logging code snippets. An example of the code snippet and generated code is shown in Fig. 1. A system with mature logs essentially facilitates the system behavior understanding, health monitoring, failure diagnosis, etc. Generally, there are three standard log formats, i.e., structured, semi-structured, and unstructured logs [72]. These formats share the same components: a timestamp and a payload content.

Structured logs usually keep a consistent format within the log data and are easy to manage. Specifically, the well-structured format allows easy storing, indexing, searching, and aggregation in a relational database. The unstructured log data achieves its high flexibility at the expense of the ease of machine processing. The characteristic of free-form text becomes a major obstacle for efficient query and analysis on unstructured or semi-structured logs. For instance, to count how often an API version appears in unstructured logs, engineers need to design a complex query with ad-hoc regular expressions to extract the desired information. The manual process takes lots of time and effort and is not scalable.

2.2 Log Processing

Logs have been widely adopted in software system development and maintenance. In industry, it is a common practice to record detailed software runtime information into logs, allowing developers and support engineers to track system behaviors and perform postmortem analysis. On a high level, log processing can be categorized in three types as discussed below.

Log Parsing. Log parsing is generally the first step toward automated log analytics. It aims at parsing each log message into a specific log event/template and extracting the corresponding parameters. Although there are many traditional regular expression-based log parsers, but, they require a predefined knowledge about the log template. To achieve better performance in comparison to traditional log parsers, many data-driven [12, 37,38,39,40,41,42] and deep learning based approaches [24, 26] have been proposed to automatically distinguish template and parameter parts.

Log Analytics. Modern software development and operations rely on log monitoring to understand how systems behave in production. There is an increasing trend to adopt artificial intelligence to automate operations. Gartner [97] refers to this movement as AIOps. The research community, including practitioners, has been actively working to address the challenges related to extracting insights from log data also being referred to as “Log Analysis” [96]. Various insights that can be gained are in terms of log mining [85], error detection and root cause analysis, security and privacy, anomaly detection, and event prediction.

Log Mining. Log mining seeks to support understanding and analysis utilizing abstraction and extracting useful insights. However, building such models is a challenging and expensive task. In our study, we confine ourselves to posing specific questions in terms of most API/service calls that can be extracted out of raw log messages. This area is well studied from a deep learning aspect and most of those approaches [49,50,51,52,53,54,55,56] require to first parse the logs and then process them to extract the detailed level of knowledge.

Error Detection and Root Cause Analysis. Automatic error detection from logs is an important part of monitoring solutions. Maintainers need to investigate what caused that unexpected behavior. Several studies [22, 43, 45,46,47,48] attempt to provide their useful contribution to root cause analysis, accurate error identification, and impact analysis.

Security and Privacy. Logs can be leveraged for security purposes, such as malicious behaviour and attack detection, URLs, and IP detection, logged-in user detection, etc. Several researchers have worked towards detecting early-stage malware and advanced persistence threat infections to identify malicious activities based on log data [57,58,59,60,61].

Anomaly Detection. Anomaly detection techniques addresses to identify the anomalous or undesired patterns in logs. The manual analysis of logs is time-consuming, error-prone, and unfeasible in many cases. Researchers have been trying several different techniques for automated anomaly detection, such as deep learning [62,63,64,65] and data mining, statistical learning methods, and machine learning [23, 66,67,68,69,70,71].

Event Prediction. The knowledge about the correlation of multiple events, when combined to predict the critical or interesting event is useful in preventive maintenance or predictive analytics that can reduce the unexpected system downtime and result in cost saving [80,81,82]. Thus, the event prediction method is highly valuable in real-time applications. In recent years, many rule-based and deep learning based approaches [83, 88,89,90,91,92] have evolved and performing significantly.

Log Summarization. Log statements are inserted in the source code to capture normal and abnormal behaviors. However, with the growing volume of logs, it becomes a time-consuming task to summarize the logs. There are multiple deep learning-based approaches [19, 44, 96, 98] that perform the summarization, but they require time and compute resources for training the models.

2.3 ChatGPT

ChatGPT is a large language model which is developed by OpenAI [93, 94]. ChatGPT is trained on a huge dataset containing massive amount of internet text. It offers the capability to generate text responses in natural language that are based on a wide range of topics. The fundamental of ChatGPT is generative pre-training transformer (GPT) architecture. GPT architecture is highly effective for natural language processing tasks such as translation in multiple languages, summarization, and question answering (Q & A). It offers the capability to be fine-tuned on specific tasks with a smaller dataset with specific examples. ChatGPT can be adopted in a variety of use cases including chatbots, language translation, and language understanding. It is a powerful tool and possesses the potential to be used across wide range of industries and applications.

2.4 ChatGPT Evaluation

Several recent works on ChatGPT evaluation have been done, but most of the papers target the evaluations on general tasks [33, 73], code generation [27], deep learning-based program repair [28], benchmark datasets from various domains [29], software modeling tasks [30], information extraction [87], sentiment analysis of social media and research papers [84] or even assessment of evaluation methods [86]. The closest to our work is [35], but they focus only on log parsing.

We believe that the log processing area is huge and a large-level evaluation of ChatGPT on log data would be useful for the research community. Hence, in our work, we focus on evaluating ChatGPT by conducting an in-depth and wider analysis of log data in terms of log parsing, log analytics, and log summarization.

3 Context

In this paper, our primary focus is to assess the capability of ChatGPT on log data. In line with this, we aim to answer several research questions through experimental evaluation.

3.1 Research Questions

Log Parsing

RQ1. How does ChatGPT perform on log parsing?

Log Analytics

RQ2. Can ChatGPT extract the errors and identify the root cause from raw log messages?

RQ3. How does ChatGPT perform on advanced analytics tasks e.g., most called APIs/services?

RQ4. Can ChatGPT be used to extract security information from log messages?

RQ5. Is ChatGPT able to detect anomalies from log data?

RQ6. Can ChatGPT predict the next events based on previous log messages?

Log Summarization

RQ7. Can ChatGPT summarize a single raw log messages?

RQ8. Can ChatGPT summarize multiple log messages?

General

RQ9. Can ChatGPT process bulk log messages?

RQ10. What length of log messages can ChatGPT process at once?

To examine the effectiveness of ChatGPT in answering the research questions, we design specific prompts as shown in Fig. 2. We append the log messages in each of the prompts (in place of the slot ‘[LOG]’).

3.2 Dataset

To perform our experiments, we use the datasets provided from the Loghub benchmark [13, 34]. This benchmark covers log data from various systems, including, windows and linux operating systems, distributed systems, mobile systems, server applications, and standalone software. Each system dataset contains 2,000 manually labeled and raw log messages.

Fig. 3.
figure 3

Flow Diagram.

3.3 Experimental Setup

For our experiments, we are using the ChatGPT API based on the gpt-3.5-turbo model to generate the responses for different prompts [93]. As shown in Fig. 3, we send the prompts appended with log messages to ChatGPT from our system with Intel® Xeon® E3-1200 v5 processor and Intel® Xeon® E3-1500 v5 processor and receive the response. To avoid bias from model updates, we use a snapshot of gpt3.5-turbo from March 2023 [95].

3.4 Evaluation Metrics

As our study demands a detailed evaluation and in some cases, there was no state-of-the-art tool, we evaluated the output by our manual evaluation.

4 Experiments and Results

Each of the subsections below describes the individual evaluation of ChatGPT in different areas of log processing.

Fig. 4.
figure 4

Log parsing of raw log message.

4.1 Log Parsing

In this experiment, we assess the capability of ChatGPT in parsing a raw log message and a preprocessed log message and find the answer to RQ1. For the first experiment, we provide a single raw log message from each of the sixteen publicly available datasets [34] and ask ChatGPT to extract the log template. We refer to it as first-level log parsing. ChatGPT performs well in extracting the specific parts of log messages for all sixteen log messages. One of the examples of ChatGPT’s response for first-level log parsing is shown in Fig. 4. Next, we preprocess the log message, extract the content, and ask chatGPT to further extract the template from the log message. ChatGPT can extract the template and variables from the log message successfully on all sixteen log messages with a simple prompt. One of the examples of ChatGPT’s response is shown in Fig. 5.

4.2 Log Analytics

To evaluate ChatGPT’s capability in log analytics, we perform several experiments in each of the categories described in Sect. 2.2.

Log Mining. In this experiment, we are seeking the answer of RQ2 by investigating if ChatGPT can skim out the knowledge from raw logs without building an explicit parsing pipeline. We perform our experiments in several parts. We provide a subset of log messages containing 5, 10, 20, and 50 log messages from Loghub benchmark [34] and ask ChatGPT to identify the APIs. Figure 6 shows an example of ChatGPT response when a smaller set of log messages were passed. We notice that ChatGPT consistently missed identifying some APIs from the log messages irrespective of the count of log messages, but still shows 75% or more accuracy in all cases. Results are reported in Table 1.

Fig. 5.
figure 5

Log parsing of preprocessed log message.

Table 1. ChatGPT’s performance to identify the APIs, errors and root cause from Loghub dataset [34].
Fig. 6.
figure 6

ChatGPT response to extract the APIs from log messages.

Fig. 7.
figure 7

ChatGPT response to identify the errors and root cause from set of 5 log messages from Loghub dataset [34].

Error Detection and Root Cause Analysis. In this experiment, we explicitly ask ChatGPT [95] to identify the errors, warnings, and possible root causes of those in the provided log messages and address RQ3. Aligning towards our study structure, we first provide five log messages from the Loghub dataset [34] and later increase the size of log messages to ten, twenty, and fifty. Figure 7 shows the identified errors from five log messages and a detailed report for all the combinations with their response time is being reported in Table 1. It is evident from Table 1 that ChatGPT successfully identifies the errors and warnings on a smaller set of log messages than a larger set.

Fig. 8.
figure 8

ChatGPT response to extract urls, IPs, and users from set of 5 log messages from Loghub dataset [100].

Security and Privacy. In this experiment, we focus on addressing RQ4 and investigate if ChatGPT can identify the URLs, IPs, and logged users from the logs and extract knowledge about malicious activities. We use the open source dataset from Loghub [100] and follow the same approach of sending the set of five, ten, twenty, and fifty log messages to chatGPT to detect the URLs, IPs, and users from them. We use the ‘Prompt 4’ from Fig. 2 to ask if there are any malicious activities present in the logs. As shown in Table 2, ChatGPT extracts out the IPs and logged-in users with high accuracy irrespective of the length of log messages. An example of ChatGPT’s response is shown in Fig. 8. The detailed report is shown in Table 2.

Table 2. ChatGPT performance to extract urls, IPs, and users from the log messages from Loghub dataset [34].
Fig. 9.
figure 9

ChatGPT response for anomaly detection for a sample from Loghub dataset [34].

Anomaly Detection

To evaluate ChatGPT’s capability to detect anomalies in logs and to address RQ5, we use ‘Prompt 5’ from Fig. 2. As detecting anomalies through log messages would require context, we append 200 log message entries and ask ChatGPT to detect anomalies from it. Without showing any examples to ChatGPT of how an anomaly might look like, it still tries to identify the possible anomalies and provide its analysis in the end. One of the examples is shown in Fig. 9.

Fig. 10.
figure 10

ChatGPT response for event prediction from Loghub dataset [34].

Event Prediction

It is interesting to evaluate ChatGPT’s performance in predicting future events in log messages. Typically, for future event prediction, a context of past event is required, hence, we append 200 log messages to ‘Prompt 6’ from Fig. 2 and ask ChatGPT to predict the next 10 messages for simplicity. This experiment addresses the RQ6. While ChatGPT predicts the next 10 events in log format, it fails to predict even a single log message correctly when compared with the ground truth. ChatGPT’s response is shown in Fig. 10.

4.3 Log Summarization

This experiment is designed to understand if ChatGPT could succinctly summarize logs. We perform this study in two steps. First, To address the RQ7, we provide a single log message from each of the sixteen datasets of opensource benchmark [34] to ChatGPT to understand its mechanics. This is useful to understand the log message in natural language. Figure 11 shows one of the log messages from the Android subset of the Loghub dataset [34] and ChatGPT response. It is evident from the response that ChatGPT provides a detailed explanation of the log message. Next, to address the RQ8, we provide a set of ten log messages from each of the sixteen subsets of the Loghub dataset [34] to ChatGPT and ask to summarize the logs. ChatGPT generates a concrete summary collectively from the provided log messages as shown in Fig. 12. In Fig. 12, we only show a few log messages for visual clarity. ChatGPT generates an understandable summary for all the sixteen subsets.

Fig. 11.
figure 11

Summary generated by ChatGPT for single log message from Loghub dataset [34].

Fig. 12.
figure 12

Collective summary generated by ChatGPT for ten log messages from Loghub dataset [34].

5 Discussion

Based on our study, we highlight a few challenges and prospects for ChatGPT on log data analysis.

5.1 Handling Unstructured Log Data

For our experiments, we send the unstructured raw log messages to ChatGPT to analyze its capabilities on various log-specific tasks. Our study indicates that ChatGPT shows promising performance in processing the raw log messages. It is excellent in log parsing and identifying security and privacy information, but encounters difficulty in case of API detection, event prediction, and summarizing. It misses out on several APIs and events from raw log messages.

5.2 Performance with Zero-Shot Learning

We perform our experiments with zero-shot learning. Our experimental results show that ChatGPT exhibits good performance in the areas of log parsing, security, and privacy, and average performance in the case of API detection, incident detection, and root cause identification. As ChatGPT supports few-shot learning, it remains an important future work to select important guidelines to set effective examples and evaluate ChatGPT’s performance with them.

5.3 Scalability - Message Cap for GPT

Most of the intelligent knowledge extraction from logs depends on processing a large amount of the logs in a short period. As ChatGPT 3.5 can only process limited tokens at once, it poses a major limitation in feeding the bigger chunk of log data. For our experiments, we could only send 190 to 200 log messages appended (addressing RQ9 and RQ10) with the appropriate prompt at once. As most of the real-time applications would require to continuously send larger chunks of log messages to a system for processing, this limitation of ChatGPT 3.5 may pose a major hindrance in terms of scalability making them less suitable for tasks that require up-to-date knowledge or rapid adaptation to changing contexts. With the newer versions of ChatGPT, the number of tokens may be increased which would make it more suitable for its application in the log processing area.

5.4 Latency

The response time of ChatGPT ranges from a few seconds to minutes when the number of log messages is increased in the prompt. The details about response time are shown in Table 1 and 2. Most of the intelligent knowledge extraction from logs depends on the processing time of the large amount of the logs. With the current state of response time, ChatGPT would face a major challenge in real-time applications, where a response is required in a shorter period. As currently, we have to call openAI API to get ChatGPT’s response, with the newer versions of ChatGPT, it may be possible to deploy these models close to applications and reduce the latency significantly.

5.5 Privacy

Log data often contains sensitive information that requires protection. It is crucial to ensure that log data is stored and processed securely to safeguard sensitive information. It is also important to consider appropriate measures to mitigate any potential risks.

6 Conclusion

This paper presents the first evaluation to give a comprehensive overview of ChatGPT’s capability on log data from three major areas: log parsing, log analytics and log summarization. We have designed specific prompts for ChatGPT to reveal its capabilities in the area of log processing. Our evaluations reveal that the current state of ChatGPT exhibits excellent performance in the areas of log parsing, but poses certain limitations in other areas i.e., API detection, anomaly detection, log summarization, etc. We identify several grand challenges and opportunities that future research should address to improve the current capabilities of ChatGPT.

7 Disclaimer

The goal of this paper is mainly to summarize and discuss existing evaluation efforts on ChatGPT along with some limitations. The only intention is to foster a better understanding of the existing framework. Additionally, due to the swift evolution of LLMs especially ChatGPT, they would likely become more robust, and some of their limitations described in this paper are remediated. We encourage interested readers to take this survey as a reference for future research and conduct real experiments in current systems when performing evaluations. Finally, with continuous evaluation of LLMs, we may miss some new papers or benchmarks. We welcome all constructive feedback and suggestions to help make this evaluation better.