1 Introduction

Ambiguity has historically been a challenge in Natural Language Processing (NLP) and continues to present obstacles for modern systems. Large Language Models (LLMs) generate output by calculating a “most likely” response to any given input, with inputs usually given in the form of a prompt from a human user. However, prompts are often ambiguous, and even the best possible prediction cannot fully resolve underspecified prompts [1]. Even in a conversation between two humans, both speaking the same language and communicating clearly, misunderstandings are common. While there are many ways to resolve ambiguity in a human conversation, perhaps the most obvious way is to ask for clarification. However, commonly used LLM systems such as ChatGPT, Bard, and Bing do not ask clarifying questions in response to ambiguous prompts.

This is a problem in fields where precision is important. Detailed discussions, including follow-up questions, are a necessary part of human communication in such fields. Even for simple requests, a lack of follow-up questions can lead to suboptimal answers or cause the LLM to misunderstand the user’s needs.

LLMs are capable of identifying ambiguity in user prompts and forming questions in response to ambiguity when prompted to do so. This has already been demonstrated with AIs answering simple ambiguous questions [2, 3], but has not yet been demonstrated with AIs intended to generate longer-form responses such as letters or documents. We propose that an LLM-based system that asks clarifying questions when needed will produce content that is more closely aligned to the desires of the human user than a comparable system which asks no clarifying questions.

2 Background

Ambiguity has historically been a challenge for NLP tasks including parsing [4,5,6], Named Entity Recognition (NER) [7,8,9], story understanding [10,11,12,13], and numerous others. In the past decade, neural models have been employed to create software that can exhibit reading comprehension-like behavior on ambiguous, natural language text. Recurrent Neural Networks (RNNs) [14, 15] and Long Short-Term Memory networks (LSTMs) [10, 16,17,18] achieved success in specialist systems dedicated to specific NLP tasks.

Large Language Models (LLMs) offer a more general solution than specialized LSTM systems. Over the past several years, LLMs have improved dramatically, demonstrating state-of-the-art performance in multiple areas of NLP and matching or outperforming specialized LSTM-based systems on several benchmarks [19,20,21], including answering questions about children’s stories [22], common-sense reasoning [23], reading comprehension [24], and translation and summarization [21].

Output from LLMs can be further improved by the introduction of “chain-of-thought reasoning” [25], in which the LLM is prompted to write out a full logical argument for its conclusion in small steps, rather than skipping straight ahead to the final conclusion. Chain-of-thought reasoning in LLMs leads to fewer hallucinations, more factually correct responses, more advanced reasoning, and improved ability to solve puzzles or trick questions. LLM systems operating on a chain-of-thought model also have the potential to explain the reasoning that led them to a given conclusion, which is considered to be a desirable trait in both logical and ethical reasoning, and a necessary step for humans to trust the results of an analysis [26, 27].
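
As a brief illustration (not a prompt used in this work), the sketch below contrasts a direct prompt with a chain-of-thought prompt using the OpenAI Python client; the prompt wording and example question are our own.

```python
# Illustrative sketch only: contrasting a direct prompt with a chain-of-thought prompt.
# Assumes the openai Python client (>= 1.0) and an OPENAI_API_KEY in the environment;
# the prompt wording and example question are illustrative, not taken from this study.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
            "than the ball. How much does the ball cost?")

direct_answer = ask(question)  # the model may jump straight to a (possibly wrong) answer
cot_answer = ask(question + "\nReason through the problem step by step, writing out "
                            "each step, before stating the final answer.")
```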

2.1 Context, Ambiguity, and User Needs

Understanding context is necessary for accurate reasoning and communication. When precision is needed in human communication, a wide variety of methods are used to clarify what would otherwise be inherently ambiguous language. For example, when gathering requirements for new software, a high degree of precision is needed, usually far more than is initially provided, which is the motive for the phase of ‘requirements gathering’ within software engineering. Requirements gathering has been researched at length, and often employs a wide variety of techniques including questionnaires, face-to-face dialogs between customer and developer, and various exercises designed to improve user engagement with the requirements gathering process [28, 29]. None of this would be necessary if software engineers could reliably get good results by simply asking users to “please state your requirements clearly!” One of the key goals of requirements gathering is to understand the context of the desired software, for example, what problem the software is needed to solve, and what specific change or improvement it is hoped the software will achieve.

Prior research in conversational interfaces has shown that better results can be achieved when the full context of a conversation is considered, not just the immediate prompt [30]. Systems such as ChatGPT are at a disadvantage with regard to context: users can enter any prompt on any topic, and the LLM must provide a response with no knowledge of who the user is or why they are asking. It is unreasonable to expect an LLM to interpret the user’s intended meaning in the absence of context when even humans, using the same inherently ambiguous and context-dependent language, cannot reliably do so. We believe that asking clarifying questions is a skill which LLMs must master if they are to communicate clearly and precisely with humans.

2.2 Prior Work

Several recent works have addressed the concept of LLMs using clarifying questions. The CLAM architecture [2] presents a method for using an LLM to assess ambiguity, generate a clarifying question if needed, and then generate an answer based on the user’s response to the question. CLARA [3] showed that a similar framework could be used to interpret user commands given to a robotic arm. ClarifyDelphi [31] uses clarifying questions to assist in context-sensitive ethical reasoning. Zhang et al. 2023 present a framework for asking clarifying questions before retrieving data from a database [32]. ClarifyGPT [33] demonstrates the benefits of asking clarifying questions for LLM code-creation tasks. Follow-up questions have been shown to be effective at steering the conversation in automated surveys conducted by LLM chatbots [34].

Our research differs from these prior works in several ways:

  1. We examine the overall quality of generated text documents. The ability to create and modify original documents is a key strength of LLMs over earlier AI systems.

  2. In our research, human users rate the quality of documents based on their own needs and subjective judgements. This is a realistic scenario for documents generated for human use. Prior research has relied heavily on “simulated humans” modeled by AI, or on automated metrics which may not reflect users’ subjective experience of document quality.

  3. We also examine users’ willingness to engage in question-and-answer dialog with the AI. It is hoped that the AI asking follow-up questions will encourage users to engage more deeply with their own prompts and with the document creation process. However, it is also possible that users will find the process annoying or arduous.

2.3 Existing Benchmarks and Evaluation Methods

There are many benchmarks currently in use for the evaluation of LLMs. Many of the common benchmarks, such as BLEU [35] and BERTScore [36], measure the overall quality of the generated text. However, they do not measure how well the output corresponds to the initial prompt or to a user-desired outcome. Other benchmarks test the LLM’s ability to give the correct answer to questions with previously established correct answers, including numerous question-answer (QA) datasets [37]. Some QA datasets target specific types of questions, including CoQA for Conversational Question Answering [24], TruthfulQA for misleading questions [38], and the Children’s Book Test for reading comprehension of short stories [22]. These styles of benchmark are poorly suited to determining whether generated content has fulfilled a user’s needs. Measuring the overall quality of the text, as BLEU and BERTScore do, does not tell us whether the high-quality text has solved the user’s problem or merely provided elegant but irrelevant prose. QA datasets only measure the LLM’s ability to produce short, accurate responses to questions with objectively right and wrong answers, which is not suitable for the evaluation of longer-form content. A letter, essay, or short story cannot be objectively classified as “correct” or “incorrect.” The overall quality of such a document can only be measured subjectively, by the evaluation of the reader.
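
To make this limitation concrete, the sketch below (using the third-party sacrebleu implementation, which is not part of this study’s tooling) shows that a reference-based metric such as BLEU is only defined relative to a gold reference text; for a personalized, long-form document no such reference exists, so the score cannot tell us whether the user’s need was met.

```python
# Illustrative sketch only: BLEU scores a candidate against a gold reference text.
# Assumes the third-party sacrebleu package (pip install sacrebleu); the example
# sentences are our own. For a user-specific letter or essay there is no gold
# reference, so a score like this says nothing about whether the user's need was met.
import sacrebleu

candidates = ["Dear Dr. Smith, I am writing to request an extension on my thesis deadline."]
references = [["Dear Professor Smith, I would like to ask for more time to finish my thesis."]]

bleu = sacrebleu.corpus_bleu(candidates, references)
print(f"BLEU = {bleu.score:.1f}")  # measures n-gram overlap with the reference only
```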

Validation of generative models for visual art and music may offer some guidance here. As with long-form textual content, the quality of visual art and music cannot generally be evaluated objectively. Furthermore, such systems are most often employed to generate content (art or music) from a short textual prompt, and their quality must be evaluated on how closely the output matches the intent behind the prompt given to the model. Despite the difficulties associated with subjective analysis by human evaluators, including higher costs and challenges with methodology and sample size, it is often the only way to gain reliable feedback on the quality of output from creative systems [39, 40]. For instance, the experiments which validated the quality of DALL-E had human evaluators rate images for both realism and accuracy relative to each image’s corresponding prompt [41].

3 System Architecture

For this study we created a web-based application called the Clarifying Questions Document Generator (CQDG). The key components of CQDG are:

  • A user-facing front-end.

  • A back-end powered by OpenAI’s GPT3.5 API.

  • A database for logging results from the use of the system.

CQDG was designed as an interface between the user and the OpenAI API. It applies prompt-engineering templates to the user’s questions, prompts, and responses to induce GPT3.5 to identify ambiguity, generate follow-up questions, and ultimately produce a final output that considers both the original user prompt and the additional information from the ensuing conversation. In some cases, the API is prompted several times in sequence to perform multi-step ambiguity analysis, and the user is shown only the final response of that sequence. In other cases, the API is given a modified version of the user’s original prompt, decorated with prompt engineering to steer the response. To the user, each input appears to produce a single output in direct response to what they wrote, just as when chatting with ChatGPT directly, even though several interactions between the web page and the GPT API take place during each step of the process without being shown to the user. The process of generating follow-up questions is similar to that of the CLAM model [2].

The baseline document is generated by providing the OpenAI API with an unmodified version of the user’s original prompt. The experimental output, hereafter referred to as the QA Document, is generated by providing GPT3.5 with the full context of the original prompt, the follow-up questions, and all user responses. An example of a complete log of prompts and responses sent to and received from GPT3.5 is provided in the appendix.
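
A condensed sketch of this pipeline is given below in Python, using the OpenAI chat API. The prompt templates shown are simplified placeholders written for illustration; they are not the exact templates used by CQDG (a complete interaction log appears in the appendix).

```python
# Condensed sketch of the CQDG back-end flow. The prompt templates here are
# simplified placeholders, not the exact templates used by CQDG; see the
# appendix for a complete interaction log. Assumes the openai Python client
# (>= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo"

def chat(prompt: str) -> str:
    """Single round trip to the GPT3.5 API; hidden from the CQDG user."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def clarifying_questions(user_prompt: str, n: int = 3) -> list[str]:
    """Ask the model to identify ambiguity and pose n follow-up questions."""
    template = (
        f'A user has requested the following document:\n"{user_prompt}"\n'
        f"Identify anything ambiguous or underspecified in this request, then write "
        f"{n} clarifying questions, one per line, whose answers would help produce the document."
    )
    lines = [line.strip("-• ").strip() for line in chat(template).splitlines()]
    return [q for q in lines if q][:n]

def baseline_document(user_prompt: str) -> str:
    """Baseline: the unmodified user prompt is sent directly to the model."""
    return chat(user_prompt)

def qa_document(user_prompt: str, questions: list[str], answers: list[str]) -> str:
    """QA Document: the original prompt plus the full question-and-answer context."""
    qa_context = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    template = (
        f'A user has requested the following document:\n"{user_prompt}"\n'
        f"The user answered these clarifying questions:\n{qa_context}\n"
        f"Write the requested document, taking all of this information into account."
    )
    return chat(template)
```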

4 Methodology

4.1 Experiment Design

Since CQDG relies on user interaction in the form of asking and answering questions, large static databases of question-answer sets are insufficient to test this design; direct interaction between CQDG and human users is necessary. Participants are therefore directed to a public website hosting CQDG, which guides them through the experiment. Participants complete the study either on a Zoom call with a researcher or in person with the researcher in the room. The participant is asked to narrate their thought process out loud, along with any challenges or difficulties they encounter using the system, and the researcher takes notes on any feedback given. On Zoom calls, participants are asked to share their screens so the researcher can observe their interactions with CQDG.

Step 1: Explanation and Consent. CQDG shows the user an explanation of the experiment and then asks for the user’s consent to participate in the study, with an explanation of what data will be collected and how it will be used.

Step 2: Demographic Questions. The user is asked a small set of demographic questions. Given the small sample size of this pilot study, we were not able to draw conclusions about how different groups respond to the system; however, we hope that this data will be valuable in the full study. The demographic questions are:

  • Age

    • [Numerical Input]

  • Gender

    • “Female”

    • “Male”

    • “Other/Nonbinary”

  • “What is your prior experience with generative AI such as ChatGPT, Bard, or similar programs?”

    • “I use generative AI regularly.”

    • “I have used generative AI before, but not often.”

    • “I have never used generative AI before.”

  • “Is English your primary spoken language?”

    • “Yes”

    • “No”

Step 3: Instructions. The user is shown the following instructions: “Think of a writing task you would like the AI to help you produce. This can be a document you actually need (you will have the opportunity to keep the output) or something you only think up for the sake of the experiment. Either way, please think in detail about what you want the AI to write for you before proceeding to the next step. When you have a clear idea of what you want to ask the AI to write, enter a 1-sentence or 2-sentence prompt in the textbox below, asking the AI to write your document for you. The AI will ask you a series of questions, and you will then be given two versions of the document you requested, and asked for feedback on which version you prefer.” A text-entry area is provided for the user to enter their prompt.

Step 4: Follow-up Questions. After the user enters their initial prompt, CQDG presents the user with three clarifying follow-up questions generated by GPT3.5 based on the user’s prompt, along with a text-entry field for the user to enter their response.

Step 5: Document Output. After all questions have been answered, CQDG uses GPT3.5 to generate two versions of the requested document. One version uses only the user’s original prompt to generate the document (baseline). The other version additionally uses the responses to the follow-up questions (QA Document). The outputs are presented to the user in random order, one at a time (an illustrative sketch of this presentation and rating step follows the rating scales below). When the user is shown each document, they are asked to rate it according to three metrics, each evaluated on a scale of 1–5:

  • How close is this document to what you hoped for when you made your initial request?

    • (5) Very close to what I was hoping for.

    • (4) Somewhat close to what I was hoping for.

    • (3) A little bit like what I was hoping for.

    • (2) Not very close to what I was hoping for.

    • (1) Not at all what I wanted.

  • How useful would this document be to you?

    • (5) I could use this document as-is.

    • (4) I could use this document with minimal modification.

    • (3) I could use this document with substantial modification.

    • (2) This document could be used as a general starting point but requires major revisions to be usable.

    • (1) This document is not usable at all.

  • How would you rate the overall quality of this document?

    • (5) Excellent quality.

    • (4) Above average quality.

    • (3) Average quality.

    • (2) Below average quality.

    • (1) Poor quality.
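
Below is a minimal sketch of this blinded, randomized presentation and rating step in Python. The record layout and console-based input are our own illustration, not the interface or database schema used by CQDG.

```python
# Illustrative sketch of Step 5: show the baseline and QA documents in random
# order, blind to condition, and record the three 1-5 ratings for each.
# The record layout and console input are illustrative, not CQDG's actual
# front-end or logging schema.
import random

RATING_QUESTIONS = [
    "How close is this document to what you hoped for when you made your initial request?",
    "How useful would this document be to you?",
    "How would you rate the overall quality of this document?",
]

def collect_ratings(baseline_doc: str, qa_doc: str) -> list[dict]:
    documents = [("baseline", baseline_doc), ("qa", qa_doc)]
    random.shuffle(documents)  # the participant is not told which document is which
    records = []
    for condition, text in documents:
        print(text)
        ratings = {question: int(input(f"{question} (1-5): ")) for question in RATING_QUESTIONS}
        records.append({"condition": condition, "ratings": ratings})
    return records
```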

Step 6: Optional Continuation and Exit Questionnaire. After ranking each output with the three questions listed above, the user is shown an exit questionnaire with the following questions:

  • Please rate the following statements on a scale of “Strongly Agree” to “Strongly Disagree” (each statement is shown with 5 options and analyzed as a scale score of 1–5: 5-Strongly Agree, 4-Slightly Agree, 3-Neutral, 2-Slightly Disagree, 1-Strongly Disagree)

    • It was annoying to have to answer questions even though I had already explained what I wanted the AI to do.

    • I felt like the AI was more engaged with my problem because it asked follow-up questions.

    • I would be willing to answer follow-up questions from an AI if answering questions led to better results.

    • I liked that the AI showed me two options to pick between, instead of only picking the option it thought was best.

  • Do you have any additional feedback or comments (optional)?

    • A free-text entry is provided.

5 Results

A total of eight participants completed the pilot study. Although participants were not prompted to complete the study multiple times, several specifically requested to run it again with different prompts immediately after completing it for the first time. This was allowed, and the eight participants completed the study a total of fourteen times. This is not a sufficient sample size to draw statistically significant conclusions about the overall effectiveness of CQDG. However, since this was a pilot study, the primary goal was to inform the design of a follow-up study with a much larger sample of participants completing the study without direct supervision from the researchers.

5.1 Participant Responses

Document Ratings.

As shown in Fig. 1, participant ratings for the document resulting from the question-and-answer process were similar to the ratings given for the baseline output which was generated using only the original prompts.

Exit Survey.

As shown in Fig. 2, participants responded positively to the question-and-answer process overall. Participants did not express annoyance at being asked additional questions before receiving their output, and overall felt positively about the question-answering experience.

Completion Time.

The average time to complete the study, measured from acceptance of the consent form to completion of the exit survey, was 16 min 46 s. However, there was substantial variation in completion time, with the shortest being 6 min 19 s and the longest 41 min 35 s. This is to be expected, since participants were free to enter their own prompts and to give as brief or as detailed answers as they desired during the question-answering phase. Most of the variation is explained by the time spent entering answers of differing levels of detail. The longest completion time was for a user who requested a complete resume of a long musical career and gave substantial detail in their prompt and answers; the shortest was for a user who requested a haiku and gave very short and general guidance in their prompt and answers.

Fig. 1. Average document ratings given by participants. Note that the sample size is not large enough to support tests of statistical significance, so no error bars are included.

Fig. 2. Exit survey results. Overall, participants responded positively to answering questions from the AI and did not find the process annoying. Note that the sample size is not large enough to support tests of statistical significance, so no error bars are included.

6 Discussion

Absolute vs Relative Measures.

Participants in this study were asked to rate the quality of each produced document on an absolute scale with five options. For both the baseline and QA documents, most participants felt positively overall but were not completely satisfied with either document, which led to most responses falling in the upper half of the scale (3–5), leaving little room to differentiate the documents. Even in cases where participants expressed verbally or in written feedback that they liked or disliked some aspect of a document, this was often not reflected in the scores.

Engagement and Insight from Follow-Up Questions.

Several participants expressed that the follow-up questions themselves introduced new ideas or caused them to think about aspects of their request that they had not previously considered. Participants indicated that this was a benefit of the question-answering process. Conversely, questions which asked for simple information, such as the user’s name or organization, were not considered helpful. Baseline documents often included placeholders such as [your name] and [name of organization], and participants saw no benefit to providing this information in the interactive phase rather than entering it into the document later.

Novel Ideas in Baseline Documents.

The baseline documents included a greater variety of content that did not come directly from the participants’ prompts or responses. Given limited information to work with, GPT often produced plausible output that surprised participants or took a direction they had not previously thought of. While this was undesired in some cases, in others participants found the originality useful and insightful. This is in line with previous findings that LLMs often perform surprisingly well at underspecified tasks [42].

Rigid Outputs from QA Documents.

Conversely, the outputs generated using both the original prompts and the questions and answers typically included far less original content and often copied pieces of the participants’ answers verbatim, resulting in documents that closely adhered to the participants’ stated needs but offered little originality. Participants expressed valuing the insight from the questions themselves, which often contained ideas they had not thought of, but this insight and originality did not carry forward into the final output.

7 Future Work

This study was designed as a pilot for a larger study, which will include more participants and allow them to complete the survey without the direct supervision of a researcher. Based on the results of the pilot, the larger study will:

  1. Allow users to read both documents and then indicate preference for one document or the other, rather than asking users to rate the documents one at a time.

  2. Use higher-resolution rating scales. The 1–5 scale proved to be insufficiently sensitive.

  3. Refine the prompt engineering of the sequences input to GPT. Ideally, the final output should take participants’ responses into account while retaining a degree of originality, without copying participants’ answers verbatim.

  4. Gear questions towards encouraging users to think about their needs in ways they had not previously considered or proposing expansions or alternatives, rather than gathering information that the user could easily enter into a template form (e.g. the name of their organization).

  5. Provide a way to continue refining the documents after their initial creation. Several participants, especially those with prior experience with generative AI, specifically requested the ability to continue refining the outputs they were given with new instructions.

  6. Compare GPT3.5, GPT4, and other LLMs. GPT3.5 was used in this pilot for simplicity, given the small number of participants.

8 Conclusion

We have proposed that using LLMs to generate follow-up questions can lead to superior LLM-generated text documents. However, initial results do not show an obvious advantage of the QA documents over the baseline. The primary disadvantage faced by CQDG was that the QA documents focused heavily on the users’ answers and did not contain as much original content as documents generated from the prompt alone. This issue could be addressed by modifying the prompt engineering in the templates that present the users’ prompts and responses to the LLM. The intent behind this pilot study was to investigate users’ responses to CQDG, and these insights will inform the design of a larger study to be conducted later this year.