Introduction

In 1950, Alan Turing posed the question, "Can machines think?" [1]. Since then, Artificial Intelligence practitioners have faced the challenge of making machines think or, in simpler words, of making them pass as human. Chatbots came into the picture as a utility program, an advisor, or simply a friend to talk to. Various design techniques emerged during their evolution. This paper deals particularly with the techniques used to build chatbots, each illustrated with a representative chatbot.

The primary task of a chatbot is to produce a suitable response to natural language input provided by humans. There are several ways to generate that response, which define the modeling mechanism of a chatbot as shown in Fig. 1. One is the rule-based method, wherein clever parsing of user input against hardcoded phrases and premade templates is used to generate the reply. The other, the neural-network-based approach, was made possible by the rise of deep learning. The neural network is trained on large datasets so that it can generate relevant and grammatically correct responses. Input can be of any form: text, images, or speech. Models have therefore been introduced to convert speech to text [2], and Convolutional Neural Network (CNN) [3] models enable chatbots to derive useful information from images [4].

Fig. 1 Classification of chatbot models

The neural-network-based approach can be broadly classified as retrieval-based or generative. Retrieval-based methods generate a reply by selecting the most relevant response, either with a scoring function such as a conditional probability computed by a neural network [5], or by evaluating the relationship between the context and candidate replies in a reinforced co-ranking manner [6]. In contrast, the generative method produces the response one word at a time, computing probabilities over the whole vocabulary for the given input [7]. A procedure combining retrieval and generation has also been introduced, wherein the retrieved reply is fed to the generative model and the final response is decided by reranking the retrieved and generated replies [8].

In terms of functionality, chatbots are mainly of two types. The first is task-oriented: these are not the best conversational agents, but they are very robust when it comes to executing specific tasks and handling domain-specific requests. Their applications range from making restaurant reservations and booking flight tickets to promoting movies. The second type is the open-domain chatbot. These are the typical conversational agents that try to mimic humans; their aim is to generate human-like responses and make the other person believe they are talking to one. Every year, the Loebner Prize is awarded to the chatbot that does this best. Chatbots thus have far-reaching applications and the potential to be integrated into various domains. This work is motivated by the need to understand, analyze, and catalog the existing work on chatbots in academia and industry alike. It can be postulated that chatbots can become virtual personal assistants offering enhanced perception, easily available to humankind on a daily basis. However, present-day chatbots are far from passing the Turing test, the challenge for which the Loebner competition was introduced.

Related Work

Chatbots have numerous real-world applications. Specifically, e-learning, marketing, medical diagnosis, cultural heritage, e-customer care services, and task organization are domains which make extensive use of chatbots. Moreover, their usefulness is further pronounced when the activity takes place over the internet.

E-learning chatbots can contribute significantly to providing an interactive learning experience as well as individual attention to each student's improvement. One e-learning chatbot [9] starts with a basic NLP algorithm, Latent Dirichlet Allocation (LDA), which helps to process the user's query, remove stop words, and extract keywords. An ontology is then built between learning concepts, depending on the course, lesson, topic, user, etc.

A chatbot for recommending tires has been implemented using a Petri net [10]. It builds an ontology based on the knowledge domain of tires. The Petri net is formed from the user's responses about the type of vehicle (car, scooter, etc.), model number, year of manufacture, and so on. Each response is given a weight and the weighted sum of each context is calculated. If the weighted sum of a given context is less than the information the system already knows, context switching takes place so that repeated questions are not asked in a loop.

The usefulness of chatbots has also been explored in medical diagnosis [11], where the need for the patient's previous clinical data in any such interaction is emphasized. Such a system has to be trained regularly to stay up to date on diseases and their diagnosis. The medical field is critical and still has a long way to go in terms of chatbot-based technology.

Based on the tourist's profile and the concept of context, a chatbot has been developed which suggests tourist places, information about each place, and related services. It can also suggest nearby hotels and famous dining spots, effectively working as a tourist guide. The architecture has an inference engine at its core, which first analyzes the text provided by the user and then generates a useful reply using a Context Dimension Tree (CDT) [12].

Knowledge acquisition is an important aspect of building a chatbot: data filtered with respect to context, person, place, etc. has to be acquired for the chatbot to perform efficiently. Curious Cat [13] was designed to collect data from users through crowdsourcing, by finding the right crowd and attending to the quality of conversations and the consistency of replies. This chatbot is further used as a personal conversation assistant.

Academia as well as industry has significantly advanced the usage and development of chatbots. These virtual assistants are set to become an indispensable tool in the foreseeable future.

Classical/Rule-Based Approaches

Classical approaches can be called rule-based approaches, as they use predefined rules to generate responses. These rules have grown more complex and sophisticated over time. They work very well when the domain of the conversation is closed, i.e. the conversation is centered on a particular topic or task. But as the input becomes more natural or the domain opens up, the efficiency of rule-based approaches deteriorates. Rules can be written in a language designed for classical chatbots, AIML (Artificial Intelligence Markup Language), which is based on XML (eXtensible Markup Language).

Pattern Matching

Pattern matching is one of the fundamental techniques for designing chatbots and is used in almost every chatbot to some extent. This method uses a prewritten set of rules and predefined templates to produce the response. ELIZA [14] was the first chatbot designed with this technique. It first identifies keywords in the text, scanning from left to right; each keyword has a rank/precedence associated with it. Then the input string is decomposed against a predefined template. For example, given the input "I am sad", it takes 'sad' as the keyword and forms the reply "How long have you been sad?". ELIZA was a psychotherapist program, and many people grew attached to it after its release. The challenges associated with this approach include identifying the most important keyword and the appropriate transformation rule. In addition, it does not take into account the history or context of the conversation, which makes it look less natural.
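To make the idea concrete, the following minimal Python sketch (hypothetical rules, not Weizenbaum's original script) pairs regular-expression patterns with response templates in the ELIZA style:

import re

# Each rule pairs a regex pattern with a response template that reuses the captured keyword.
RULES = [
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bi need (.+)", re.IGNORECASE), "Why do you need {0}?"),
]

def respond(text):
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(match.group(1).rstrip(".!?"))
    return "Please tell me more."          # canned fallback when no rule matches

print(respond("I am sad"))                  # -> How long have you been sad?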

Parsing

Parsing is the process of breaking down the input string to reveal its syntactic structure. The string is first divided into a noun phrase and a verb phrase; then adjectives, articles, and nouns are recognized and a syntax tree is formed. Parsing helps validate a sentence's grammatical structure with respect to a language. Earlier, simple parsers were used which could only identify keywords: for instance, 'take the food' and 'can you get the food' would both be reduced to 'take food'. This enables a chatbot with limited templates and patterns to generate responses for polymorphic input strings. Later chatbots use the complete parsing techniques involved in processing natural language, consisting of three levels: syntactic, semantic, and pragmatic parsing [15]. Jabberwacky uses this technique for business circumstances where more control over the conversational flow is required [16].
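A toy illustration of syntactic parsing, assuming the NLTK library is available; the grammar below is invented for the example and only covers the phrase discussed above:

import nltk

# A toy grammar that parses a short imperative sentence into verb and noun phrases,
# mirroring the syntax-tree idea described in the text.
grammar = nltk.CFG.fromstring("""
S  -> VP
VP -> V NP
NP -> Det N
V  -> 'take'
Det -> 'the'
N  -> 'food'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("take the food".split()):
    print(tree)   # (S (VP (V take) (NP (Det the) (N food))))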

Markov Chain Models

A Markov chain model, described mathematically, is a model that gives the probability of the present event on the basis of the state of previous events. It considers the probability with which a letter or word occurs within a dataset and uses this probability distribution to choose the most likely words for a reply. The order of the Markov chain determines how many preceding letters/words are taken as input. For an order-0 Markov chain, given the string khdddkhhdd, the letter k occurs with a probability of 2/10, h with 3/10, and d with 5/10. An order-1 Markov chain also considers the previous element when computing these probabilities.

Consider the string "the black dog jumped into the pool". For an order-2 Markov chain, 'the black' will result in 'dog', 'black dog' will result in 'jumped', and so on for the remaining words. If the same input state is followed by two different words, a probability of 0.5 is assigned to each continuation.

A chatbot built on this method, HeX, generated nonsense sentences that nevertheless sounded right, as a fallback method [17]. Another chatbot by the same author, MegaHAL, used entropy to determine the most likely word for a response out of many probable candidates [18]:

$$I\left(w \mid s\right)= -{\log}_{2}P\left(w \mid s\right)$$
(1)

where w is the word following the symbol sequence s.
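The following Python sketch builds an order-2 word-level Markov chain over the example sentence and computes the surprise I(w|s) from Eq. (1); it is an illustrative toy, not HeX's or MegaHAL's actual implementation:

import random
from collections import defaultdict
from math import log2

corpus = "the black dog jumped into the pool".split()

chain = defaultdict(list)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    chain[(w1, w2)].append(w3)           # state = previous two words

def next_word(state):
    return random.choice(chain[state])   # sample proportionally to observed counts

def surprise(word, state):
    follow = chain[state]
    return -log2(follow.count(word) / len(follow))   # I(w|s) = -log2 P(w|s)

print(next_word(("the", "black")))        # -> 'dog'
print(surprise("dog", ("the", "black")))  # -> 0.0 (only one continuation observed)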

Semantic Nets (Ontologies)

Ontologies are hierarchical structures of real-world concepts. Concepts are also called classes, which are the focus of most ontologies. Instances of various classes, combined with the ontology, form the knowledge base. For example, a class bread represents all bread and is further divided into subclasses such as white bread and brown bread [19]. These classes can be connected to each other, forming a hierarchy graph in which white and brown bread are subclasses of the bread superclass. The classes can be connected on the basis of their logical relationships with each other. The properties of a class are defined by 'slots', which may include the bread's texture, color, company, etc. Various 'facets' of the slots can also be defined; these describe the value type, cardinality, range of the slot, etc. The advantage of this representation is that the nodes can be searched, and special reasoning rules can infer new responses. The OpenCyc [20] and WordNet [21] ontologies have been used in chatbots.
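As an illustration only (plain Python dictionaries, not a formal ontology language such as OWL), the bread example above could be encoded as classes with slots and subclass links; the instance values are invented:

# Classes with slots and subclass links; instances plus the ontology form a tiny knowledge base.
ontology = {
    "bread":       {"slots": ["texture", "color", "company"], "subclass_of": None},
    "white bread": {"slots": [], "subclass_of": "bread"},
    "brown bread": {"slots": [], "subclass_of": "bread"},
}

instances = [
    {"class": "white bread", "texture": "soft", "color": "white", "company": "Acme"},
]

def inherited_slots(cls):
    # Walk up the subclass hierarchy collecting slot definitions.
    slots = []
    while cls is not None:
        slots += ontology[cls]["slots"]
        cls = ontology[cls]["subclass_of"]
    return slots

print(inherited_slots("white bread"))   # -> ['texture', 'color', 'company']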

AIML

Artificial Intelligence Markup Language [22] is one of the technological advances dedicated to the development of chatbots. It is used for dialog modeling between a chatbot and a human, following a stimulus-response methodology. Pattern recognition and matching techniques form the basis of AIML. It is easy to implement as it is closely related to XML (eXtensible Markup Language), and its tags make the task of dialog modeling much simpler. The Graphmaster, which implements the pattern matching algorithm, manages the tree formed by storing the AIML patterns; it provides efficient utilization of space as well as time. AIML is also highly reusable because of its simplicity and the availability of source code along with documentation.

Structure of an AIML tag is:

<command>ParametersList</command>

where <command> is the start tag and </command> is the closing tag. The most used tags are category, pattern, and template. The knowledge-base unit, commonly called a dialogue, is defined by category. The pattern defines the user's probable input, and the chatbot's response is defined in the template.

<category>
<pattern>how are you?</pattern>
<template>
I am absolutely fine!
</template>
</category>

AIML also defines the wildcards '_' and '*', which stand in for a string or part of a string. AIML gives higher priority to categories that contain a wildcard, and they are analyzed first.

<category>
<pattern>I love *</pattern>
<template>I too love <star/>.</template>
</category>

The <srai> tag is another powerful AIML tag: it can submit its own output as input to itself. This is useful when the user recursively talks about a particular topic, and it gives the chatbot a chance to respond in the most natural way.

Wallace [23] created this XML dialect.

A.L.I.C.E. [24] was the first chatbot based on AIML. The learning model used in ALICE is a supervised one, i.e. it is supervised by a person, the botmaster. After the initial design of ALICE, many other chatbots were built using AIML with further improvements.
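As a hedged illustration, assuming the third-party python-aiml package is installed and the category shown above is saved in a hypothetical file greeting.aiml, an AIML interpreter can be loaded and queried as follows (note that most interpreters normalize the user input, e.g. uppercasing and stripping punctuation, before matching it against patterns):

import aiml   # third-party 'python-aiml' package (assumed installed)

# Load the category file and query the interpreter.
kernel = aiml.Kernel()
kernel.learn("greeting.aiml")            # parses the <category>/<pattern>/<template> tree
print(kernel.respond("how are you"))     # -> I am absolutely fine!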

Chatscript

Chatscript is a chatbot scripting language developed by Bruce Wilcox in 2010; his chatbot Suzette [25] won the 2010 Loebner Prize. Chatscript is essentially an improved version of AIML. Instead of searching for a matching category among thousands, Chatscript searches for a related context. Such a context is called a 'concept', and rules are defined within each concept. Concepts are simply sets of synonyms or words that are similar in some way; for instance, a concept of all pronouns or all nouns can be created. Each user input is matched against the concepts preloaded into Chatscript. WordNet ontologies can be combined with Chatscript to give better responses. Wildcards are also present in Chatscript, as in AIML. Apart from that, it introduces variables, which can store user-specific local information and make the conversation more natural and effective. Facts such as subject-verb-object triples can be created by Chatscript and stored in tabular format; this table comes in handy when answering user input, simply by querying it.

concept: ~food (bread juice vegetable fruits pizza burger cold-drink)

s: (I love ~food) Are you a foodie?

Structured Query Language (SQL) and Relational Database

A relational database (RDB) management system can also be used in the development of a chatbot. The primary objective of using a database is to remember previous conversations and generate different replies even to the same question posed at different points in time. The most widely used RDB language is SQL. ViDi (Virtual Diabetes physician) [26] was developed using this technique; this chatbot is specifically associated with knowledge of the diabetes disease. In this approach, forward and backward pointers, also called extension and prerequisite variables, are maintained within the database. Whenever a response is generated, it is linked to other responses based on the underlying knowledge base, and these links are then used to generate new responses for each user input.
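A minimal sketch of this idea using SQLite from Python; the schema and rows are invented for illustration and do not reproduce ViDi's actual design:

import sqlite3

# Responses carry extension/prerequisite pointers so follow-up replies can be chained.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE responses (
    id INTEGER PRIMARY KEY,
    pattern TEXT,
    reply TEXT,
    extension_id INTEGER,          -- forward pointer to a follow-up response
    prerequisite_id INTEGER        -- back pointer to what must have been said before
)""")
con.execute("INSERT INTO responses VALUES (1, 'what is diabetes', 'Diabetes is a chronic condition...', 2, NULL)")
con.execute("INSERT INTO responses VALUES (2, 'how is it treated', 'Treatment depends on the type...', NULL, 1)")

# Answer a user question, then follow the extension pointer on the next turn.
row = con.execute("SELECT reply, extension_id FROM responses WHERE pattern = ?",
                  ("what is diabetes",)).fetchone()
print(row[0])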

Language Tricks

It often looks more natural to introduce human-like behaviors into a chatbot. These may include deliberately committing spelling mistakes or presenting a persona of its own. Language tricks are often an additional technique used in the development of a chatbot. Some of the common language tricks are:

  1. Typing errors and fake keystrokes: When a user types an input, he/she usually watches the chatbot as it types the reply. Faking backspaces and committing some spelling mistakes, which are natural tendencies of humans, looks very human-like.

  2. Canned responses: There are some patterns which the chatbot is unable to cover in its pattern matching algorithm. Responses to these are hard-coded by the developer.

  3. Personal history: Developers provide an identity to the chatbot to make it more convincing. Details about its birth, age, parents, preferences, and stories are inculcated into it [17].

Neural-Network Based

Neural-network-based chatbots have done away with the monotonous task of writing rules for each utterance-response pair. There are two ways in which a neural network can output a reply: either by producing it from scratch (generative) or by retrieving it from a large dataset (retrieval-based). Hybrid approaches combining the two have also been introduced. The basic underlying structures/models employed in all approaches are discussed below, followed by a table (Table 1) which covers the most recent work done on top of these basic structures/models.

Table 1 Design Techniques for chatbots along with evaluation metrics, corpus used and possible enhancement areas

Recurrent Neural Network

The ability to consider previous conversation and context while generating a response is desirable for any conversational agent; responses become monotonous if only the current input is taken into account. A Recurrent Neural Network (RNN) allows the chatbot to take its previous output as input and come up with a more sensible reply. In other words, an RNN allows data to persist, unlike a normal neural network.

In Fig. 2, A is a small part of the neural network, \({x}_{t}\) is its input, and \({h}_{t}\) is its output. The loop in the network signifies the flow of data from the output back to the input. This loop may make the idea of an RNN look unclear, but unrolling it reveals a simple chain of networks, each passing information to the next.

Fig. 2 Recurrent Neural Network

RNNs have been used extensively for various purposes such as language translation, speech recognition, and image captioning. However, the unmodified version of the RNN is not used much because it suffers from the vanishing gradient problem [27].
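The recurrence itself is only a few lines; the following NumPy sketch (toy dimensions, random untrained weights) shows how the loop in Fig. 2 unrolls into repeated applications of the same cell:

import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(8, 4))    # input-to-hidden weights
W_hh = rng.normal(size=(8, 8))    # hidden-to-hidden weights (the recurrent loop)
b_h  = np.zeros(8)

def rnn_step(x_t, h_prev):
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(8)
for x_t in rng.normal(size=(5, 4)):   # a sequence of five 4-dimensional inputs
    h = rnn_step(x_t, h)              # h now summarizes everything seen so far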

Long Short Term Memory (LSTM) [28]

It becomes difficult for a simple RNN to remember information seen many steps ago when the number of unrolling steps grows too large. This is because the value of the gradient depends mainly on two factors: the weights and the activation function (more precisely, their derivatives). When either of them approaches 0, the gradient vanishes over time. Activation functions such as tanh and sigmoid make the condition even worse, as their derivative values are mostly close to 0. This is where LSTM comes into the picture and solves the problem of the vanishing gradient. The LSTM cell state is carried forward through an effectively identity (linear) connection whose derivative is 1, which prevents the backpropagated gradient from vanishing. LSTM does this task with the help of 'gates'. Gates are the components which decide what information is allowed to pass through: they output values between 0 and 1, deciding how much of each component should be let through. A value of 0 means let nothing pass and 1 means let everything pass. LSTM uses the input and forget gates to control the flow of information from one network unit to its successor; these gates determine the network's state update mechanism. The output gate determines the output from the hidden layer.

The three gates together form a memory cell of the LSTM. The LSTM does the task of remembering, for instance, the gender of the subject, so that the chatbot can use 'his/her' depending on previously seen input. LSTM thus overcomes the problem of long-term dependencies. However, not all LSTMs share the exact same structure; many variations have been proposed [29]. Another popular variant is the Gated Recurrent Unit (GRU) [5], in which the input and forget gates are combined into a single "update gate".
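A gate-by-gate sketch of one LSTM step in NumPy (toy dimensions, random untrained weights) makes the roles of the forget, input, and output gates explicit; it follows the standard formulation rather than any particular variant:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = {g: rng.normal(size=(d_h, d_in + d_h)) for g in ("f", "i", "o", "g")}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W["f"] @ z)           # forget gate: what to erase from the cell
    i = sigmoid(W["i"] @ z)           # input gate: what new information to write
    o = sigmoid(W["o"] @ z)           # output gate: what to expose as h_t
    g = np.tanh(W["g"] @ z)           # candidate cell contents
    c_t = f * c_prev + i * g          # cell state update (near-linear path for gradients)
    h_t = o * np.tanh(c_t)            # hidden state / output
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c)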

Seq2seq

One of the most effective techniques for machine translation [7], seq2seq can also be applied effectively to conversational modeling. The basic structure of a seq2seq model consists of two RNNs, as shown in Fig. 3; the RNNs are generally LSTMs or GRUs. The objective is to compute the conditional probability \(p(y_1, y_2, \ldots, y_{n'} \mid x_1, x_2, \ldots, x_n)\), where x and y represent the input and output sequences, respectively. The lengths n and n' can differ; seq2seq easily allows this because separate RNNs are used for the input and output sequences. An encoder-decoder mechanism is used: first, the input sequence is fed to the encoder RNN, which produces a vector as its output; the second (decoder) RNN sets its initial state according to this vector, and its output is passed through a suitable probability function. The two networks are trained together, with back-propagation adjusting the weights accordingly.

Fig. 3 Sequence to sequence model [7]

Since seq2seq was developed primarily for machine translation, it takes a source-language sentence and converts it to vectors representing word embeddings; the target language is decoded by the second RNN. In the case of chatbots, this technique can be used with only a slight change of view: the input sentence is treated as the source-language string and its response as the target-language string.
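The following PyTorch sketch (toy vocabulary and dimensions, untrained) mirrors the structure in Fig. 3: a GRU encoder summarizes the input, its final state initializes a GRU decoder, and a linear layer projects decoder states to vocabulary logits:

import torch
import torch.nn as nn

vocab, emb, hid = 1000, 64, 128
embed   = nn.Embedding(vocab, emb)
encoder = nn.GRU(emb, hid, batch_first=True)
decoder = nn.GRU(emb, hid, batch_first=True)
project = nn.Linear(hid, vocab)        # maps decoder states to vocabulary logits

src = torch.randint(0, vocab, (1, 7))  # source utterance: batch of 1, 7 token ids
tgt = torch.randint(0, vocab, (1, 5))  # target response: 5 token ids

_, context = encoder(embed(src))       # context vector summarizing the input
out, _ = decoder(embed(tgt), context)  # decoder starts from the encoder's final state
logits = project(out)                  # scores over the vocabulary for each output step
print(logits.shape)                    # torch.Size([1, 5, 1000])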

Deep Seq2seq

Further improvement over seq2seq models can be achieved by stacking multiple LSTMs rather than just two; better performance is expected from such a model with deeper layers [30]. The simplest way to design such a model is to keep forwarding the output from the previous layer to the next layer. The first encoder layer is fed with the input string, and the encoder LSTM converts each word to a vector. This output is fed to the next encoder LSTM layer. Finally, the output from the last encoder LSTM is passed to the first decoder LSTM layer, where again the output is forwarded from one layer to the next. A suitable probability function is then applied to obtain the target string.
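In frameworks such as PyTorch, one way to realize this stacking is simply to request several layers; the sketch below (toy dimensions) makes each layer feed its output sequence to the next, as described above:

import torch.nn as nn

# Stacked encoder and decoder LSTMs; num_layers controls the depth.
deep_encoder = nn.LSTM(input_size=64, hidden_size=128, num_layers=4, batch_first=True)
deep_decoder = nn.LSTM(input_size=64, hidden_size=128, num_layers=4, batch_first=True)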

Evaluation Methods

The most commonly used metrics for evaluating a chatbot are BiLingual Evaluation Understudy (BLEU) [31], perplexity [32], and METEOR (Metric for Evaluation of Translation with Explicit ORdering) [33], all of which were originally meant for machine translation. These measures have been used for conversational modeling in several works [30, 34, 35].

BLEU measures the similarity between the generated text and the expected response. A score of 1.0 represents a perfect match, whereas 0.0 represents a complete mismatch. It measures the adequacy and fluency of a generated text by counting the words which match the expected response. Matching takes place over single words, pairs, triplets, and so on, i.e. over n-grams: for n = 1 single tokens (unigrams) are considered, for n = 2 word pairs (bigrams), and so on. The position of the matching n-grams within the sentence is not significant in this method.

E.g. He is the only son of Great Odin. (Expected response).

Great Odin has only one child. (Generated response).

'Only' is a unigram and 'Great Odin' a bigram that match in both sentences.
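For reference, the BLEU score of this example can be computed with NLTK (assuming it is installed); smoothing is applied because such short sentences share few higher-order n-grams:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

expected  = "he is the only son of great odin".split()
generated = "great odin has only one child".split()

score = sentence_bleu([expected], generated,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))   # a low score: only a few unigrams/bigrams overlap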

To overcome some of the limitations of the BLEU metric, authors have used METEOR, which is similar to BLEU but adds synonym matching and an explicit mapping between the generated and expected responses. It first matches exact words in the two sentences, mapping each word in the expected response to a word in the generated response; synonyms are then sought for the mismatched words. After the unigrams are matched, the score is computed from unigram precision and recall.

Perplexity measures how well a probability model predicts test data; it is the exponentiation of entropy. After the model is trained, the test set can be used to compute the perplexity. For a model q, perplexity is given by:

$${2}^{\frac{-1}{N}{\sum }_{i=1}^{N}{log}_{2}q({x}_{i})}$$
(2)

where \({x}_{i}\) are the test-set tokens and N is the length of the sentence. The model performs better when the perplexity is lower.
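As a small worked example, Eq. (2) can be computed directly from per-token probabilities \(q({x}_{i})\) that a trained model would normally supply (the values below are invented for illustration):

from math import log2

q = [0.20, 0.05, 0.10, 0.25]                  # model probabilities for each test token
N = len(q)
perplexity = 2 ** (-sum(log2(p) for p in q) / N)
print(round(perplexity, 2))                   # lower is better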

Evaluation methods for open-domain conversational bots remain an open question, because the effectiveness of a chatbot can only be evaluated in a real-time setting. Evaluating a chatbot is a subjective task that depends closely on human judgement. Metrics like BLEU, METEOR, and perplexity have been used extensively, but the general consensus is that traditional mathematical indicators cannot completely capture user experience. User experience has instead been measured with metrics such as user engagement, coherence, domain coverage, and depth of conversation [36].

User engagement is measured by the duration of the chat between the human and the chatbot: a larger number of turns in a conversation may mean that the chatbot is able to provide answers that keep the user engaged. Coherence is measured by the relevancy of the generated reply; this is generally a hard objective in an open-ended conversation but is extremely important. For example, if a user is talking about politics and gets an unrelated response about sports, it would be considered weakly coherent. A task-oriented chatbot is domain-specific, whereas an open-domain conversational agent is expected to handle multiple domains. In multi-turn conversation, it is important that the chatbot is able to converse about a topic in some depth, as happens between humans.

So, the best method to evaluate a chatbot is to have it rated by a human being, who can decide whether the generated responses were meaningful and natural. The grammar, effectiveness, and naturalness of a chatbot can only truly be judged by a human. For a task-oriented chatbot, users can be asked whether they feel satisfied with the responses or whether the chatbot was able to answer their queries.

This survey was conducted systematically by bifurcating the paper search space into relevant domains: the first domain was taken as rule-based and the other as neural-network based. Within each domain, chronological ordering was followed to build a stronger understanding of the works with respect to the evolution of chatbots.

The table presented outlines recent developments in the field. Many variations of encoder-decoder networks have been used, and deep learning models such as HRED, GAN, and VAE have been employed extensively.

Discussion and Future Work

The backbone of conversation modeling is the encoder-decoder model, which was originally designed for Neural Machine Translation (NMT). Conversation modeling, however, is an altogether more complex task for this model, because the encoder-decoder model assumes a single reply for a given input. This does not hold for conversational agents, as a natural response to the same input can vary with time and conditions. The encoder-decoder model averages over the utterance-response pairs, which is why many papers report that different models produce generic responses such as 'I don't know'. Evaluating these models has also long been a challenge for AI practitioners, since quantitative evaluation metrics such as BLEU and perplexity are far from human-judge evaluation, especially for chatbots. Other metrics have been introduced in several papers, but no standard method for chatbots exists so far.

As for future work, there are various areas which still need to be explored in the field of conversation modeling.

  • The objective function: log likelihood and MMI have mainly been used as objective functions. Log likelihood selects the most probable response for a given utterance. To take previous conversations and context into account, experiments can be done on optimizing the objective function.

  • Persona development: Many authors agree that conversations look more natural when the speaker and addressee have consistent personas. This task, however, ought to be done by the model itself, by understanding the speaking style and mood of the person. Work has been done in this area, but various other parameters still need to be considered.

  • Two-sided conversation: Most chatbots are built only to reply to the given utterance, which makes the conversation one-sided. It is therefore important for the chatbot to come up with topics that interest the person it is talking to. This, again, can be done by encoding large amounts of conversation history and by persona building.

Conclusion

Chatbots have become an integral part of our day-to-day life, and a great deal of effort goes into making them talk like humans. Nowadays, a chatbot is part of almost every application which deals with activities like ordering clothes, food, or electronic appliances, as well as booking tickets, appointments, shows, or other transactional activities. Businesses use chatbots to solve customers' problems by suggesting frequently asked questions and by making the conversation interactive; if the customer is not satisfied, human intervention takes place in most cases. The review presented here gives a clear picture of the approaches that can be deployed in the development of a chatbot. Mostly the vanilla versions are presented, which can be further adapted and improved. Approaches ranging from fundamental ones like pattern matching, parsing, and semantic nets to deep neural-network-based ones such as RNN and LSTM have been cited with their respective chatbot examples. While going through the review, the reader gets an idea of how chatbots evolved with time. Modern-day chatbots still use those well-worn but powerful techniques, and more and more chatbots are making use of neural-network-based approaches while keeping the advantageous elements of non-AI-based methods. This observation is visible in the review, which includes the latest work in the field of conversational agents. However, it is quite clear that conversational bots, as of now, are far from passing the Turing test. Still, on the road to improvement, various quantitative and qualitative metrics to determine the efficiency of a chatbot have been discussed, and the paper concludes with the most recent work in the field of conversational modeling. It is hoped that this work will provide the research community with a better understanding of chatbots.