1 Introduction

Spoken Dialogue Systems (SDS), or Conversational Agents, are ever more common in home and work environments, and the market is only expected to grow. This has prompted industry and academia to create platforms for the rapid development of SDS, with interfaces designed to make this process easier and more accessible to those without expert knowledge of this multi-disciplinary research area.

One of the key SDS components for which several such platforms are now available is the Natural Language Understanding (NLU) component, which maps individual utterances to structured, abstract representations, often called Dialogue Acts (DAs) or Intents, together with their respective arguments, which are usually Named Entities within the utterance. Taken together, this representation specifies the semantic content of the utterance as a whole in a particular dialogue domain.
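To make this concrete, the following is a minimal sketch (in Python) of what such a representation might look like for a single utterance; the field names and Entity Types are illustrative and not tied to any particular platform's schema.

```python
# Illustrative structured representation of a single utterance.
# Field names and Entity Types here are hypothetical, not any platform's schema.
utterance = "Schedule a chat with Adam on Thursday afternoon"

def span(text, value):
    """Character span (start, end) of the first occurrence of value in text."""
    start = text.index(value)
    return start, start + len(value)

parsed = {
    "text": utterance,
    "intent": "set_event",  # the Dialogue Act / Intent label
    "entities": [           # Named Entities acting as the Intent's arguments
        {"type": "person", "value": "Adam", "span": span(utterance, "Adam")},
        {"type": "date", "value": "Thursday afternoon",
         "span": span(utterance, "Thursday afternoon")},
    ],
}
```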

In the absence of reliable third-party (and thus unbiased) evaluations of NLU toolkits, it is difficult for users (often conversational AI companies) to choose between these platforms. In this paper, our goal is to provide just such an evaluation: we present the first systematic, wide-coverage evaluation of some of the most commonly used NLU services, namely Rasa, Watson, LUIS and Dialogflow. The evaluation uses a new dataset of 25K user utterances which we annotated with Intent and Named Entity specifications. The dataset, as well as our evaluation toolkit, will be released for public use.

2 Related Work

To our knowledge, this is the first wide-coverage comparative evaluation of NLU services; those evaluations that exist tend to lack breadth in Intent types, Entity types, and the domains studied. For example, recent blog posts [3, 5] summarise benchmarking results for 4 domains, with only 4 to 7 Intents each. The closest published work to the results presented here is by [1], who evaluate 6 NLU services in terms of their accuracy (as measured by precision, recall and F-score, as we do here) on 3 domains with 2, 4, and 7 Intents and 5, 3, and 3 Entities respectively. In contrast, we consider the 4 currently most commonly used NLU services on a large, new dataset, which contains 21 domains of different complexities, covering 64 Intents and 54 Entity Types in total. In addition, [2] describe an analysis of NLU engines in terms of their usability, language coverage, price, etc., which is complementary to the work presented here.

3 Natural Language Understanding Services

There are several options for building the NLU component of a conversational system. NLU typically performs the following tasks: (1) classifying the user Intent or Dialogue Act type; and (2) recognising Named Entities (henceforth NER) in an utterance. There are currently a number of service platforms that perform (1) and (2): commercial ones, such as Google’s Dialogflow (formerly Api.ai), Microsoft’s LUIS, IBM’s Watson Assistant (henceforth Watson), Facebook’s Wit.ai, Amazon Lex, Recast.ai and Botfuel.io; and open source ones, such as Snips.ai and Rasa. As mentioned above, we focus on four of these: Rasa, IBM’s Watson, Microsoft’s LUIS and Google’s Dialogflow. In the following, we briefly summarise and discuss their various features. Table 1 provides a summary of the input/output formats of each platform.

Table 1 Input requirements and output of NLU services

(1) All four platforms support Intent classification and NER; (2) None of them supports Multiple Intents, where a single utterance expresses more than one Intent, i.e. performs more than one action. This is potentially a significant limitation, because such utterances are very common in spoken dialogue; (3) Particular Entities and Entity Types tend to depend on particular Intent types, e.g. with a ‘set_alarm’ Intent one would expect a time stamp as its argument. We therefore expect that joint models, or models that treat Intent and Entity classification together, would perform better. We were unable to ascertain this for any of the commercial systems, but Rasa treats them independently (as of Dec 2018); (4) None of the platforms uses dialogue context for Intent classification and NER. This is another significant limitation, e.g. in understanding elliptical or fragment utterances, which depend on the context for their interpretation.
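As a concrete illustration of the output side of Table 1, the sketch below queries a locally running Rasa NLU server and reduces its response to an (Intent, Entities) pair. It assumes the /parse HTTP endpoint and response fields of the Rasa NLU 0.x server; the URL and example labels are ours, and the other platforms return analogous structures under their own field names.

```python
# Minimal sketch: query a local Rasa NLU server (0.x-era /parse endpoint)
# and normalise its JSON response into an (intent, entities) pair.
import requests

def parse_utterance(text, url="http://localhost:5000/parse"):
    response = requests.post(url, json={"q": text})
    response.raise_for_status()
    result = response.json()
    intent = result.get("intent", {}).get("name")
    entities = [(e.get("entity"), e.get("value"))
                for e in result.get("entities", [])]
    return intent, entities

# Hypothetical example:
# parse_utterance("wake me up at seven am")
#   -> ("set_alarm", [("time", "seven am")])
```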

4 Data Collection and Annotation

The evaluation of NLU services was performed in the context of building an SDS, i.e. a Conversational Interface, for a home assistant robot. The home robot is expected to perform a wide variety of tasks, ranging from setting alarms, playing music and search to movie recommendation, much like existing commercial systems such as Microsoft’s Cortana, Apple’s Siri, Google Home or Amazon Alexa. The NLU component in an SDS for such a robot therefore has to understand and be able to respond to a very wide range of user requests and questions, spanning multiple domains, unlike a single-domain SDS which only understands and responds to the user in a specific domain.

4.1 Data Collection: Crowdsourcing Setup

To build the NLU component we collected real user data via Amazon Mechanical Turk (AMT). We designed tasks where the Turker’s goal was to answer questions about how people would interact with the home robot, in a wide range of scenarios designed in advance, namely: alarm, audio, audiobook, calendar, cooking, datetime, email, game, general, IoT, lists, music, news, podcasts, general Q&A, radio, recommendations, social, food takeaway, transport, and weather.

The questions put to the Turkers were designed to capture the different requests within each given scenario. In the ‘calendar’ scenario, for example, the pre-designed Intents included ‘set_event’, ‘delete_event’ and ‘query_event’. An example question for the ‘set_event’ Intent is: “How would you ask your PDA to schedule a meeting with someone?”, to which an example user answer was “Schedule a chat with Adam on Thursday afternoon”. The Turkers would then type in their answers to these questions and, for each answer, select possible entities from a pre-designed list of suggested entities. The Turkers did not always follow the instructions fully, e.g. for the specified ‘delete_event’ Intent, one answer was “PDA what is my next event?”, which clearly belongs to the ‘query_event’ Intent. We manually corrected all such errors, either during post-processing or during the subsequent annotation.

The data is organised in CSV format and includes information such as the scenario, Intent, user answer and annotated user answer for each item (see Table 4 in the Appendix). The training and test splits were then converted into the different JSON formats required by each platform (see Table 1).
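For illustration, the following sketch shows how one such CSV row might be converted into Rasa's NLU training JSON; the column names ('answer', 'intent', 'entity_type', 'entity_value') are hypothetical stand-ins for our actual schema, and LUIS, Watson and Dialogflow each require a different target format.

```python
# Sketch: convert (hypothetical) CSV rows into Rasa-NLU-style training JSON.
# Column names are illustrative; see Table 4 for the actual annotation layout.
import csv
import json

def row_to_example(row):
    text = row["answer"]
    entities = []
    value = row.get("entity_value", "")
    if value and value in text:
        start = text.index(value)
        entities.append({
            "start": start,
            "end": start + len(value),
            "value": value,
            "entity": row["entity_type"],
        })
    return {"text": text, "intent": row["intent"], "entities": entities}

with open("train.csv", newline="") as csv_file:
    examples = [row_to_example(row) for row in csv.DictReader(csv_file)]

# Rasa NLU 0.x wraps training examples as below; the other platforms
# each expect their own JSON structure instead.
with open("rasa_train.json", "w") as json_file:
    json.dump({"rasa_nlu_data": {"common_examples": examples}}, json_file, indent=2)
```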

Our final annotated corpus contains 25,716 utterances, annotated for 64 Intents and 54 Entity Types.

4.2 Annotation and Inter-annotator Agreement

Since there was a predetermined set of Intents for which we collected data, there was no need for separate Intent annotation (although some Intent corrections were needed). We therefore only annotated the data for Entity Tokens and Entity Types. Three students were recruited to do the annotation. To calculate inter-annotator agreement, each student first annotated the same set of 300 randomly selected utterances. Each student then annotated a third of the whole dataset, i.e. about 8K utterances. We used Fleiss’ Kappa, which is suitable for multiple annotators. A match was defined as follows: the Entity Tokens overlap at least partially (i.e. partial token matching) and the annotated Entity Types match exactly. We achieved moderate agreement (\(\kappa = 0.69\)) for this task.
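The sketch below spells out this matching criterion; the span representation is an assumption of ours, and the agreement statistic itself can then be computed with, e.g., statsmodels' fleiss_kappa once matches are mapped onto categories.

```python
# Sketch of the entity matching criterion used for inter-annotator agreement:
# partial overlap of Entity Token spans plus an exact Entity Type match.
def spans_overlap(span_a, span_b):
    """True if two (start, end) character spans share at least one character."""
    (a_start, a_end), (b_start, b_end) = span_a, span_b
    return a_start < b_end and b_start < a_end

def annotations_match(ann_a, ann_b):
    """Partial token overlap AND identical Entity Type counts as a match."""
    return (ann_a["type"] == ann_b["type"]
            and spans_overlap(ann_a["span"], ann_b["span"]))

# e.g. [game_name: "space invaders"] spanning chars (5, 19) matches
# [game_name: "invaders"] spanning chars (11, 19): types agree, spans overlap.
```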

5 Evaluation Experiments

In this section we describe our evaluation experiments, comparing the performance of the four systems outlined above.

5.1 Train and Test Sets

Since LUIS caps the size of the training set at 10K utterances, we chose 190 instances of each of the 64 Intents at random; some Intents had slightly fewer than 190 instances. This resulted in a sub-corpus of 11,036 utterances covering all 64 Intents and 54 Entity Types. The Appendix provides more details: Table 5 shows the number of sentences for each Intent, and Table 6 lists the number of entity samples for each Entity Type. For the evaluation experiments reported below, we performed 10-fold cross-validation, with 90% of the sub-corpus used for training and 10% for testing in each fold.
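A sketch of this sampling and cross-validation setup, using scikit-learn's KFold, is given below; it is an illustrative reconstruction rather than our exact evaluation script.

```python
# Sketch: cap each Intent at 190 random utterances, then run 10-fold CV
# (90% train / 10% test per fold). Illustrative, not the exact script used.
import random
from collections import defaultdict
from sklearn.model_selection import KFold

def subsample_per_intent(examples, per_intent=190, seed=0):
    rng = random.Random(seed)
    by_intent = defaultdict(list)
    for example in examples:
        by_intent[example["intent"]].append(example)
    subcorpus = []
    for items in by_intent.values():
        rng.shuffle(items)
        subcorpus.extend(items[:per_intent])  # Intents with fewer keep them all
    return subcorpus

def folds(examples, n_splits=10, seed=0):
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kfold.split(examples):
        yield ([examples[i] for i in train_idx],
               [examples[i] for i in test_idx])
```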

5.2 System Versions and Configurations

Our latest evaluation runs were completed by the end of March 2018. The service API versions used were V1.0 for Dialogflow and V2.0 for LUIS. Watson API requests require a date as a version parameter, which is automatically matched to the closest internal version; we specified 2017/04/21. In our conversational system we run the open source Rasa as our main NLU component, because it gives us more control over further developments and extensions. The Rasa evaluation was done on Version 0.10.5, using its spacy_sklearn pipeline, which uses Conditional Random Fields for NER and scikit-learn for Intent classification. Rasa also provides other built-in components for the processing pipeline, e.g. MITIE, or the more recent tensorflow_embedding pipeline.
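For reference, a rough sketch of training the spacy_sklearn pipeline with the Rasa NLU Python API of that era is shown below; exact module paths and config handling varied between Rasa versions, so this should be read as an approximation rather than the definitive invocation.

```python
# Approximate Rasa NLU 0.10.x training sketch (module paths changed in later
# versions); assumes a config file containing {"pipeline": "spacy_sklearn"}.
from rasa_nlu.converters import load_data
from rasa_nlu.config import RasaNLUConfig
from rasa_nlu.model import Trainer

training_data = load_data("rasa_train.json")           # training examples (see Sect. 4)
trainer = Trainer(RasaNLUConfig("config_spacy.json"))  # CRF for NER, sklearn for Intents
trainer.train(training_data)
model_directory = trainer.persist("./models/")         # path to the trained model
```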

6 Results and Discussion

We performed 10-fold cross-validation for each of the platforms, and pairwise t-tests to compare the mean F-scores of every pair of platforms. The results in Table 2 show the micro-averaged scores for Intent and Entity Type classification over the 10 folds. Table 3 shows the micro-averaged F-scores of each platform after combining the results for Intents and Entity Types. Tables 7 and 8 in the Appendix show the detailed confusion matrices used to calculate Precision, Recall and F1 for Intents and Entities.
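The sketch below shows one way to compute these statistics: micro-averaged F1 from pooled counts, and a paired t-test with Cohen's d over the ten per-fold F-scores of two platforms. It is illustrative rather than our exact analysis code.

```python
# Sketch: micro-averaged F1 from pooled counts, plus a paired t-test and
# Cohen's d (on paired differences) over per-fold F1 scores of two platforms.
import numpy as np
from scipy.stats import ttest_rel

def micro_f1(tp, fp, fn):
    """Micro-averaged F1 from pooled true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def compare_platforms(fold_f1_a, fold_f1_b):
    """Paired t-test over two platforms' per-fold F1 scores."""
    a, b = np.asarray(fold_f1_a), np.asarray(fold_f1_b)
    t_stat, p_value = ttest_rel(a, b)
    diff = a - b
    cohens_d = diff.mean() / diff.std(ddof=1)  # effect size on paired differences
    return t_stat, p_value, cohens_d
```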

Table 2 Overall scores for intent and entity
Table 3 Combined overall scores

Performing significance tests on the separate Intent and Entity scores in Table 2 revealed the following. For Intents, there is no significant difference between Dialogflow, LUIS and Rasa, but Watson’s F1 score (0.882) is significantly higher than those of the other platforms (\(p<0.05\), with large or very large effect sizes, Cohen’s d). However, for Entities, Watson achieves significantly lower F1 scores (\(p<0.05\), with large or very large effect sizes, Cohen’s d) due to its very low Precision. One explanation for this is the high number of Entity candidates produced in its predictions, leading to a high number of False Positives. There are also significant differences in Entity F1 scores between Dialogflow, LUIS and Rasa; LUIS achieved the top Entity F1 score (0.777).

Table 3 shows that all the NLU services have quite similar combined F1 scores, except for Watson, which has a significantly lower score (\(p<0.05\), with large or very large effect sizes, Cohen’s d) due to its lower Entity score, as discussed above. The significance tests show no significant differences between Dialogflow, LUIS and Rasa.

The detailed data analysis in the Appendix (see Tables 5 and 6) for fold 1 reveals that the distributions of Intents and Entities in the datasets are imbalanced. Also, our data contains some noisy Entity annotations, often caused by ambiguities which our simplified annotation scheme was not able to capture. For example, in an utterance of the pattern “play xxx please”, xxx could be an entity of type song_name, audiobook_name, radio_name, podcasts_name or game_name: in “play space invaders please”, the entity could be annotated as [song_name: space invaders] or as [game_name: space invaders]. This type of ambiguity can only be resolved by more sophisticated approaches that incorporate domain knowledge and dialogue context. Nevertheless, despite the noisiness of the data, we believe that it represents a real-world use case for NLU engines.

7 Conclusion

The contributions of this paper are twofold. First, we present and release a large NLU dataset in the context of a real-world use case of a home robot, covering 21 domains with 64 Intents and 54 Entity Types. Second, we perform a comparative evaluation on this data of some of the most popular NLU services, namely the commercial platforms Dialogflow, LUIS and Watson, and the open source Rasa.

The results show that they all have similar functions/features and achieve similar performance in terms of combined F-scores. However, when separating the results for Intent and Entity Type recognition, we find that Watson has significantly higher F-scores for Intents, but significantly lower scores for Entity Types. This was due to the high number of false positives in its Entity predictions. As noted earlier, we have not evaluated Watson’s recent ‘Contextual Entity’ annotation tool here.

In future work, we hope to continuously improve the data quality and observe its impact on NLU performance. However, we do believe that noisy data presents an interesting real-world use case for testing current NLU services. We are also working on extending the dataset with spoken, rather than typed, user utterances. This will allow us to investigate the impact of ASR errors on NLU performance.

Table 4 Data annotation example snippet