What is AI?


I was meeting with a non-technical business colleague/manager recently, and he asked me to explain some of these AI terms like machine learning and deep learning. “Is machine learning a type of deep learning, or is it the other way around?” So, I drew this picture on the whiteboard and launched into a somewhat rambling lecture on AI and its history. It then occurred to me that many business types are asking themselves (and their colleagues) similar questions, and I have seen more than one manager mangle these distinctions. So, this article is an attempt at a quick non-technical overview.

In the spirit of the hastily hand-drawn picture above, this article is not intended to be a thorough taxonomical categorization of the sub-fields of AI. Nor do I make any claims of accuracy, so go easy on me if you disagree, or think I am wrong on some detail. If you Google “What is AI?”, you will get tons of in-depth articles/blogs (including Wikipedia entries), books, and many images that are substantially more comprehensive. But many non-technical managers have neither the time nor the inclination to dive into details; this article is aimed at helping some of these folks be a little better informed on AI.

AI can be broadly classified into symbolic AI and Machine Learning. These days, the term AI is synonymous with Machine Learning, and more recently with Deep Learning. But the origins of AI were mostly symbolic, with hand-coded rules meant to capture expert knowledge. Typically, an AI software engineer, well-versed in an “AI language” like Lisp or Prolog, would be paired with a domain expert (say, a medical doctor) to represent the relevant knowledge in the form of IF-THEN rules. There are many other symbolic knowledge representation mechanisms besides rules, such as frames. To this day, you will find rules and frames used in many products, such as state-of-the-art chatbot frameworks that use AIML rules or intent frames to author dialogs as scripted conversations. These products, while somewhat successful, still suffer from the limitations noted below.
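
To make the IF-THEN idea concrete, here is a toy forward-chaining rule engine in Python (rather than Lisp or Prolog). The “medical” rules below are invented purely for illustration, not taken from any real expert system:

```python
# Toy forward-chaining rule engine in the spirit of classic expert systems.
# Each rule is (set of conditions, conclusion); the rules are invented.
rules = [
    ({"fever", "cough"}, "possible_flu"),
    ({"possible_flu", "short_of_breath"}, "refer_to_doctor"),
]

def infer(facts, rules):
    """Repeatedly fire any rule whose conditions are all known facts."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(infer({"fever", "cough", "short_of_breath"}, rules))
```

Even this tiny sketch hints at the maintenance problem: once you have hundreds of interacting rules, tracing which rule fired, and why, becomes painful.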

After a couple of AI hype cycles during the 1960s and the 1980s, the field of AI entered a long “AI Winter” that lasted until the mid-2000s. Why? Well, rules and frames are fragile from a software engineering standpoint, and it is difficult to manage and maintain a symbolic AI system once we get past a few hundred rules. Rules start conflicting with each other, and it becomes nearly impossible to trace the sequence of rule triggers and to debug these systems. Rules have to be authored by hand with extensive input from expensive and busy domain experts. The “learning” in these systems is mostly “supervised” and off-line. Attempts were made to create rules automatically through “unsupervised” and online learning based on feedback from user interactions. However, most of these attempts remained academic efforts with few commercially successful implementations.

Machine Learning got started in the mid-1990s, when computer scientists and statisticians began collaborating and learning from each other. Algorithms such as decision trees and support vector machines were used in the early 2000s to mine increasingly large databases for patterns that could be used for prediction, classification and other advanced analytical tasks. The emergence of faster computers and “big data” software tools such as Hadoop ignited interest in data-driven pattern recognition that enables computers to learn from historical data. The main difference is that the new AI engineers, now called data scientists, do not engage in traditional software engineering. Rather, their job is to extract features from the raw data and use these features to create supervised learning models that enable the machine to classify and predict based on historical data. The data scientist provides labelled data that identifies the combination of features that points to each distinct class/label. This “model engineering” is far more robust than “rules engineering” and benefits from a virtuous cycle of faster computers, more data, and online feedback from users. Unsupervised machine learning methods such as clustering are often used in combination with supervised methods.
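
To illustrate supervised learning from labelled features without pulling in a machine learning library, here is a bare-bones nearest-centroid classifier; the feature vectors and labels are made up, and real work would use algorithms like the decision trees or SVMs mentioned above:

```python
# Minimal supervised learning: a nearest-centroid classifier.
# The labelled (features, label) pairs are invented for illustration.
training_data = [
    ([1.0, 1.2], "small"), ([0.8, 1.0], "small"),
    ([5.0, 5.5], "large"), ([5.2, 4.8], "large"),
]

def train(data):
    """Compute the mean feature vector (centroid) for each label."""
    sums, counts = {}, {}
    for features, label in data:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc]
            for label, acc in sums.items()}

def predict(model, features):
    """Assign the label whose centroid is closest (squared Euclidean)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(model, key=lambda label: dist(model[label]))

model = train(training_data)
print(predict(model, [0.9, 1.1]))  # a point near the "small" cluster
```

Note that the features here were chosen by hand, which is exactly the “feature engineering” work that deep learning (discussed below) aims to automate.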

Deep Learning has its origins in Artificial Neural Networks (ANNs), which were part of “Connectionist AI”, also dating back to the 1960s. Many algorithmic advances such as backpropagation, multi-layer perceptrons, convolutional networks and recurrent networks were progressively discovered in the 1980s, 1990s and 2000s. But deep learning, which gets its name from the multitude of neural layers (ranging from 5 to 100 or more), only became commercially viable about 5 years ago with the emergence of GPUs as the computational workhorses. These faster GPU-based computers, along with the availability of massive amounts of unstructured data such as images, audio, video and text, are key to the current success of AI. In addition, the pace of innovation in deep learning algorithms and architectures over the last 5 years has been incredible. Today, deep learning systems can perform image recognition, speech recognition and natural language understanding tasks with astonishing accuracy.

Deep learning systems are also mostly supervised learning systems, in that enormous amounts of labelled data have to be supplied by the data scientist to train these systems (that is, to learn the weights of the interconnections between the neurons). But, unlike more traditional statistical machine learning algorithms (like random forests), deep learning systems can automatically perform feature extraction from raw data, so the data scientists do not have to perform feature engineering. The significance of deep learning is that successive layers learn features at increasing levels of abstraction. So, while the first few layers might recognize edges and other low-level image features, the next few layers recognize higher-level features such as a nose, ear or mouth, the layers after that recognize the entire face, and so on.
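
The “edges first” idea can be illustrated with a single hand-written convolution. In a trained network the filter weights are learned from data; here the filter is fixed purely to show what an early layer might respond to, and the tiny “image” is invented:

```python
# A single convolution with a fixed vertical-edge filter. In a real
# network these weights are learned; here they are fixed to illustrate
# the kind of low-level feature an early layer might detect.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
kernel = [[-1, 1]]  # responds to a left-to-right brightness jump

def convolve(img, ker):
    kh, kw = len(ker), len(ker[0])
    out = []
    for r in range(len(img) - kh + 1):
        row = []
        for c in range(len(img[0]) - kw + 1):
            row.append(sum(ker[i][j] * img[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

print(convolve(image, kernel))  # peaks in the column where the edge is
```

A deep network stacks many such filters, feeding each layer's outputs into the next, which is how the higher-level features emerge.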

Generative Adversarial Networks (GANs) and autoencoders are examples of unsupervised deep learning systems. Reinforcement learning systems are deep learning systems that may be thought of as online learning systems, in that they learn directly from actions performed in a simulated environment and from feedback obtained when deployed in a real environment. Autonomous cars and game-playing systems such as AlphaGo utilize reinforcement learning; AlphaGo is a good example of simulation-based learning in that the system was trained by playing against itself a gazillion times. This is also, then, an example of unsupervised learning, since the system gets better on its own by observing its mistakes and correcting them.
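
For a feel of how reinforcement learning trades labelled data for trial-and-error feedback, here is a tabular Q-learning sketch on a toy 1-D corridor. The environment, reward and hyperparameters are all invented for illustration (real systems like AlphaGo use deep networks, not tables):

```python
import random

# Tabular Q-learning on a toy corridor: states 0..4, reward at state 4.
# Environment and hyperparameters are invented for illustration only.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # step left, step right

def step(state, action):
    nxt = max(0, min(GOAL, state + action))
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward

random.seed(0)
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for _ in range(500):  # episodes of simulated experience
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS) if random.random() < epsilon else \
            max(ACTIONS, key=lambda act: q[(s, act)])
        nxt, r = step(s, a)
        best_next = max(q[(nxt, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
        s = nxt

# After training, the greedy policy should move right in every state.
policy = {s: max(ACTIONS, key=lambda act: q[(s, act)]) for s in range(GOAL)}
print(policy)
```

No one labelled any state-action pair here; the reward signal alone shapes the behavior, which is the sense in which such systems “get better on their own”.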

There are many other related sub-fields of AI, such as evolutionary (genetic) algorithms, game theory, multi-agent systems and so on. Also, note that AI benefits from other disciplines such as mathematical optimization, which has long been part of other areas such as Operations Research (OR). In fact, the recent boom in AI has also rejuvenated interest in related fields such as control theory, since many of the algorithms behind autonomous cars, drones and robotics have mathematical roots in other disciplines. AI is therefore a truly inter-disciplinary field where scientists and engineers from a variety of backgrounds are able to apply their mathematical and software skills.

I have tried to keep this overview non-technical and brief. I hope this helps some business types get the hang of some of the buzzwords and jargon floating around the office.


Hey Bot, Why am I here?

Why am I here?

As 2017 rolls in, a few of us are probably pondering this metaphysical question. As they drop the ball in Times Square, you are asking yourself “How did I drop the ball?” But, don’t worry, this is not another New Year’s Eve motivational article asking you to peer deep inside your soul, and wonder where it all went wrong.

Rather, this is an article about the importance of dialog context for chatbots. Speaking of context, I went for a stroll on a nice Saturday afternoon at a nearby trail that I frequently visit. On a whim, I pulled out my iPhone and decided to have a chat with Siri and Google Assistant (Allo). This is what I had in mind:

  • Me: Where am I?
  • Bot: You are at <address>, GPS location, identifies landmark, shows map
  • Me: Why am I here?
  • Bot: Oh, this is a trail. You are probably walking or hiking.

Simple enough, right? I mean, I am asking where I am, and then, with that context in mind, I am asking why I am there. Yes, this is pathetic, I know; lonely me just having a sad little dialog with my bot. But this should be easy for a chatbot, right? Oh, before you object about the “here” part, I am perfectly OK with substituting the question “Why am I at this location?” — which I ended up doing, as you will see below. This, by the way, is in itself an interesting Natural Language Processing (NLP) disambiguation question (related to entity recognition); but that wasn’t my aim.

Now, imagine my surprise when I got these results from Google Assistant (first) and Siri (next). I am not showing the “Where am I” question which I asked first. Rest assured that I did ask that question first (to set up the context), and both bots pulled up a map, though neither proceeded to identify the landmark as a trail. Google Assistant was better than Siri on the “Where am I” question, but only marginally so. As I said, neither was able to tell that I am at a trail, even though this (GPS) location is identified as such on Google Maps.

Even accepting that neither bot could tell that I am at a trail, the answers they gave me are just ridiculous. I specifically asked why I am at this location; so the metaphysical/confusing answers are inexcusable. But, more importantly, the bots were not keeping the context of the dialog. They were treating the questions as separate conversations. These bots are more like information retrieval search engines with an NLP veneer; they are not true dialog bots or genuine conversational AI interfaces.

Instead, a bot’s dialog manager should hold the context variable like so:

  • Question-1: Where am I?
  • Bot uses the GPS/maps application(s) to set up the context variable, location = X
  • Question-2: Why am I here?
  • Bot recognizes the entity “here” to mean “location = X”, and properly interprets the question, and attempts an answer based on understanding that the user is at a particular location X.
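
A minimal sketch of this context-variable bookkeeping might look like the following. The GPS lookup is a stub, and the trail name is invented; a real dialog manager would call an actual maps/location service:

```python
# Minimal dialog-manager sketch: context variables persist across turns.
# The GPS lookup is a stub; a real bot would call a maps/location API.
def get_gps_location():
    return "Willow Creek Trail"  # stand-in for a real GPS/maps call

class DialogManager:
    def __init__(self):
        self.context = {}  # dialog state, e.g. {"location": ...}

    def handle(self, utterance):
        if utterance == "Where am I?":
            self.context["location"] = get_gps_location()
            return f"You are at {self.context['location']}."
        if utterance == "Why am I here?":
            loc = self.context.get("location")
            if loc is None:
                return "I don't know where 'here' is yet."
            return f"'{loc}' looks like a trail, so you are probably walking."
        return "Sorry, I didn't get that."

bot = DialogManager()
print(bot.handle("Where am I?"))
print(bot.handle("Why am I here?"))  # "here" resolves from stored context
```

The point is simply that the second question is unanswerable without the state left behind by the first, which is exactly what the bots I tested failed to keep.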

Maintaining the dialog state by obtaining values for these context variables is a key feature of any chatbot dialog manager. Context variables can either be obtained directly from the user, or, as in the above illustration, obtained as a result of performing an action(s) (in this case, GPS-based location retrieval) on some external application(s). For bonus points, the bot would obtain context variables through personalization based on access to a user’s dialog and information access history. Imagine this dialog:

  • Me: Where am I?
  • Bot: You are at your favorite trail <trail-name>.
  • Me: Why am I here?
  • Bot: You come here often, especially on weekends. You walk for about an hour or so.

Is this far-fetched? Not really. The information is readily available (thanks to ever-present GPS). Whether the GPS data is recorded and used to track the user’s whereabouts and historical patterns is clearly a matter for privacy experts to sort out. But I, for one, will be glad to allow these bots access to all such personal information to enable them to have better context. Notice that context from these data sources supplements context obtained from the conversation/dialog itself. So, the data integration from multiple sources aids the AI, NLP and machine learning by providing additional data to help further the dialog and enable an intelligent personalized conversation.

Lest you think that I am picking an isolated incident out of context (pardon the pun) to paint these bots in a bad light, let me narrate another recent example of my interaction with Google Assistant. A friend had sent a Google Calendar invite, via Gmail, for lunch at a nearby restaurant. I got there a little early, having used Google Maps on my iPhone to navigate to the place. To kill time before my friend arrived, I stood in front of the restaurant and asked Google Assistant (Allo) where I was, and why I was there. For the first question, it showed me the map, though strangely didn’t precisely identify the restaurant (this was a strip mall). It totally bombed the second question, even though I tried asking in many ways (such as who I was meeting).

Clearly, Google Assistant is not integrated with all these other services that Google owns; I have allowed personalization for all Google services where asked, so I am assuming these Google services have all this data stored under my identity, ready to be leveraged. Perhaps this has been rectified (my experience is a few months old). But this is an example of the importance of integrating many different data sources to obtain the context variables needed to conduct an intelligent dialog that is personal and effective. Ironically, integrating many different data sources and the history of previous dialogs will enable the user to have a shorter dialog with the bot. This is important since, in many cases, especially for business applications, the user is not trying to engage in a long conversation with the bot. The aim is to conduct just enough dialog to obtain information and perform the necessary actions towards accomplishing some goal.

Siri and Google Assistant are open-domain bots with unrestricted context and vocabulary. So, the bar is higher for these bots than for narrow-domain, single-application bots, which have restricted context and vocabulary. Still, the importance of data integration from a variety of sources remains the same for both types of bots. Data integration may not seem as sexy as NLP or deep learning, but it is just as important for bots to obtain the context needed to hold an intelligent dialog with the user. Identifying all relevant sources of such data and actively making them available to the bot is important. Historical data is used to train the bot with the proper context. Real-time data allows the bot to update the context variables with the current context as it pertains to the dialog.

Don’t just sit there, say something. Share your bot stories. Or, tell me I got this all wrong, and I am just being cranky and mean to these nice bots on New Year’s Eve. Go, right ahead. Unlike the bots, I will take it all in context.


Data-Driven Dialog Authoring

Most bots today are authored using bot building platforms such as api.ai or wit.ai. These tools typically have 3 components:

  • Intent classifier: Maps user utterances into the most relevant intent, where intent signifies the goal or action that the user is trying to accomplish.
  • Entity recognizer: Extracts structured data, synonymously called context variables, parameters or entities, from the unstructured utterance. Examples of entities include names and locations.
  • Dialog manager: Maintains (updates) the state of the dialog (input/output context, known/unknown parameters); responds to the user utterance based on the intent and context; fetches information from and performs actions on enterprise systems.

Learning intents and entities

Typically, the structure of an intent is: {[input_contexts],[utterances],[parameters],[actions],[responses],[output_contexts]}, where:

  • input_contexts: {<flag1>,<flag2>,…} are strings; each flag, when set, acts as an indicator of the existence of a particular input context for this intent.
  • utterances: {“string1”,”string2”,…} are alternate “User says” natural language statements, each representing a different variation of what the user might say to signal this intent. This utterance matching is conditional on the existence of the input_contexts, so the same utterance might match different intents based on the input_contexts.
  • parameters: {entity-1,entity-2,…} are entities that are either embedded in the utterance or obtained through an external [enterprise] system.
  • actions: {webhook-1,webhook-2,…} are typically “webhooks”: calls to RESTful APIs (HTTP POST/GET) that pass or retrieve parameters.
  • responses: {“string1”,”string2”,…} are alternate natural language statements, each representing a different variation of what the bot says in response to the user utterance.
  • output_contexts: {<flag1>,<flag2>,…} are strings; each flag, when set, acts as an indicator of the existence of a particular output context resulting from the execution of this intent.
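
Putting that structure together, a single intent might be represented as below. All the names (the restaurant-booking scenario, the webhook URL, the @-parameter convention) are invented for illustration; real platforms such as api.ai define their own JSON schemas:

```python
# One intent in the {input_contexts, utterances, parameters, actions,
# responses, output_contexts} structure described above. All names are
# invented; real bot platforms define their own JSON schemas.
intent = {
    "name": "book_table",
    "input_contexts": ["restaurant_chosen"],
    "utterances": [
        "Book a table for @party_size at @time",
        "Reserve a table at @time for @party_size people",
    ],
    "parameters": ["party_size", "time"],
    "actions": ["https://example.com/api/reservations"],  # webhook (POST)
    "responses": ["Done! A table for @party_size is booked at @time."],
    "output_contexts": ["table_booked"],
}

def missing_parameters(intent, known):
    """Parameters the dialog manager must still prompt the user for."""
    return [p for p in intent["parameters"] if p not in known]

print(missing_parameters(intent, {"party_size": 2}))
```

The `missing_parameters` helper hints at what the dialog manager does with this structure: keep prompting until every required parameter has a value, then fire the action.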

To train the bot to classify intents based on utterances, bot authors upload (either manually or programmatically through APIs) example sentences that signify a particular intent. To train the bot to recognize entities, bot authors either upload values for entities or annotate utterances with entities, so the bot learns to extract entities from utterances. Both of these are examples of the use of built-in machine learning to train the bot from the real conversations present in historical chat log transcripts. Given enough data, it is possible to obtain robust intent classification and entity recognition from bot platforms such as api.ai. However, this is not true for dialog management; there is limited (if any) support for data-driven dialog management in these scripting-based platforms.

Using historical data to author robust dialog scripts

Scripting a dialog amounts to “chaining” intents using the input and output context flags. This is the essence of authoring the dialog using the dialog manager. However, this is, at present, a laborious and error-prone process which results in a fragile bot that breaks when the user deviates from the expected script. For existing bot authoring platforms, I am not aware of a systematic way of authoring these dialogs automatically (or semi-automatically) using historical data. In other words, there is no learning involved in conversational dialog flows.

Currently, the only way to leverage historical chat log data for creating robust conversational flows is to test the authored flow using a “hold-out” dataset that has not been used in the dialog script authoring. This is akin to the standard practice in machine learning of a hold-out test dataset. This is an issue for enterprise bots that need to capture complex business process flows with nonlinear loops and forks. A data-driven approach to dialog script authoring will make it easier to leverage historical data to create robust dialog scripts using these authoring tools.

There are two artifacts used to create enterprise bots for applications such as customer/tech support: the process flow scripts currently used by call/chat center agents, and the historical chat logs. The former is used extensively for script authoring; the latter is seldom used. One approach to leveraging the latter, in current platforms, is to first use real data to templatize the conversational flow scripts so that the most common flows are captured as reusable flows. Then, these scripted flows can be tested using variants of the flow that have not been directly used in the design. Attention is currently being focused on the UI/UX aspect of conversational flow design, which is a welcome development. However, the use of historical chat log data for QA/QC of bot scripts is currently a neglected area, at least in the context of script authoring in existing bot platforms.

In the long run, the use of deep learning neural networks will likely render current script authoring practices obsolete. These deep learning systems will be completely data driven. But, at present, none of the bot platforms feature deep learning algorithms that obviate the need for bot scripting. They may use deep learning for intent classification or entity recognition, but I am not aware of any publicly available standard bot platform (such as api.ai) that uses deep learning for dialog management. Until such purely data-driven approaches become available, bot authors will need to work on industry-standard approaches to dialog scripting that move this discipline from an ad-hoc art to more of a systematic engineering practice.


Short-circuiting a chatbot: Yelling “representative”

Customer care/service/support is a natural application domain for chatbots, where they can be used to replace or augment existing Interactive Voice Response (IVR) systems. Naturally, we must ask ourselves what we can learn from decades of experience with IVR systems, even as IVR technology has matured and embraced advances in natural language understanding and speech recognition.

Given the current nascent state of conversational dialog management technology, chatbots are likely to face a problem common to most IVR systems — users yelling “representative” or “associate” or “agent” in an attempt to bypass the bot. At what point should the user be allowed to do this? We are talking about a situation where the user has not engaged the bot to a sufficient level to even give the bot a chance to solve their problem. The user just wants to talk to a live human. Why?

Because, in most cases, current IVR-type systems force the user to follow some pre-determined scripted path. The users know that, and they don’t even want to venture down that path. They want to “short-circuit” the bot and go straight to a live person. As IVR systems have gotten better, the navigation has become less rigid. Still, a user is likely to feel that they are better off talking to a human. Maybe they feel that their problem is unique and unlikely to be solved by an automated system. Perhaps they have had bad experiences with IVR systems in the past, and just don’t want to bother trying the newer ones, even though the warm and friendly IVR system says “I am a new system that can converse in natural language. So, please tell me in a few words what you are trying to do”. “Uh, hmm, representative!, associate!, agent!!”.

Current IVR systems are designed to at least get sufficient information from the user so as to route the call to the correct live agent. So, the dialog is limited to determining “intent” for call routing purposes. But even this can be bypassed if the user is sufficiently determined. Very few systems currently refuse to let the user continue unless they provide this information. “I see that you want to talk to an agent. But, to get you to the right representative, I need some basic information.” “Representative!, associate!, agent!!”. “Sorry, I didn’t understand that. Please tell me what you are calling about. You can say, for example, customer service, billing, tech support…” “Representative!, associate!, agent!!” At this point, most IVR systems will say “Alright, please hold for an agent” and transfer the user. Some systems will keep trying to get at least some information from the user. A few will just hang up on the user, or say “Please hang up and try again”.

How do we solve this problem? First, users need to gain confidence that the chatbot is able to assist them in a manner similar to that of a live agent. To do this, they should first give the bots a chance. Bots often feature learning through feedback, so with sufficient exposure to users, they will learn to do better. But how do we prevent users from short-circuiting the bots without giving them a fair shake? This jump-starting process requires business/economic and policy incentives. If the number of live human agents is reduced (a common objective of call center automation), users will find themselves in long queues waiting on hold for an agent. This is a common incentive to return to the chatbot and give it a chance. Another option is to make users pay for live human assistance; this passes on some of the higher costs of human agents to the users. A third option is to limit the hours of operation of the support agents, and emphasize that the chatbots are available 24/7.

In employee-to-employee (E2E) applications within an enterprise, users seeking support can be subjected to policy requirements that mandate usage of chatbots before they are allowed to contact a live human. Examples of such applications are the support of field technicians and technical/administrative support for employees. Experienced field technicians who call in seeking support from live agents particularly abhor rigid rules-based IVR-type systems. They often know exactly what they want, and hence seek the flexibility to perform the desired action (often involving a back-end enterprise system) without having to go through a detailed menu featuring a checklist of routine items. Hence, they try to bypass rigid IVR-like systems that follow a pre-determined script driven by rules-based enterprise systems. A chatbot that features a sophisticated conversational AI-based dialog system should allow such experienced users the flexibility to express what they need in a simple manner, so the dialog can be kept short and pleasant. A policy mandate (from supervisors and management) to try the chatbot first enables the requisite confidence-building to happen.

In customer-facing applications, where the chatbot is exposed to customers external to the enterprise, the jump-starting process is trickier. The reputation of the business is at stake; unhappy customers can be very costly to any business, especially in this age of social media. There may also be legal/regulatory restrictions. One solution is to trial these chatbots with a carefully selected set of customers who can then be counted on to virally spread the positive experience. Today, businesses of all stripes seem to be rushing to deploy customer-facing chatbots seemingly without careful consideration of how this new support channel ties in with existing customer care channels, including IVR. Many of today’s bots are simple transactional systems with very limited dialog capability. However, as the applications mature, businesses will find that fragile dialog systems break the trust that customers place in chatbots. This will compromise the ability of businesses to let chatbots learn through the powerful feedback mechanisms that machine learning and artificial intelligence technologies enable.

One solution to the short-circuiting problem is to allow the chatbot to make the decision to gracefully hand off the conversation to a live agent. Chatbots have an edge over existing IVR systems in that they feature NLP and AI algorithms that can assess the sentiment of the user in addition to objective metrics on the performance of the chat. The chatbot can make a case-by-case judgement on how persistent it should be before bringing a live agent on the line. Leveraging information about the user, based on the history of previous calls as well as other profile information, would be valuable in this context. Being “stateful” in the sense of keeping track of recent transactions is one way to gain the user’s trust. “I see that you called in earlier this morning about a problem with your service. Are you still experiencing the same issue?” is a reassuring way of beginning a conversation with a customer who has been calling repeatedly about a problem.
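
One possible shape for that case-by-case judgement is sketched below. The thresholds, the sentiment score in [-1, 1], and the failed-turn counter are invented stand-ins for whatever signals a real platform actually provides:

```python
# Sketch of a hand-off decision: escalate to a live agent when sentiment
# is poor or the bot keeps stumbling. The thresholds and the sentiment
# score are invented stand-ins for real platform signals.
def should_escalate(sentiment, failed_turns, asked_for_agent,
                    min_sentiment=-0.5, max_failed_turns=2):
    if asked_for_agent and failed_turns > 0:
        return True   # user insists and the bot has already stumbled
    if sentiment < min_sentiment:
        return True   # user is clearly frustrated
    return failed_turns > max_failed_turns

print(should_escalate(sentiment=-0.8, failed_turns=0, asked_for_agent=False))
```

Note that a first “agent!” with no failed turns does not escalate in this sketch; the bot gets one chance to prove itself, which is the confidence-building trade-off discussed above.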

A careful review of human-factors and human-computer interfaces as they apply to existing IVRs is necessary before full-scale deployment of dialog chatbots that feature truly conversational AI. Otherwise, we are likely to repeat the mistakes made with IVRs. Many conversational flows for chatbots, engaged in enterprise use-cases, involve complex process flows that resemble a multi-stage IVR tree. Even if the chatbot is capable of taking the dialog to completion, the user is unlikely to stick around till the bitter end. Why would they? It is easier to short-circuit the bot by yelling “Representative!, associate!, agent!!”.


Starting a chatbot company?

Looks like everyone wants to be a chatbot entrepreneur these days. Conversational AI is one of the hottest startup market opportunities right now.

If you are preparing to throw your hat in the ring as a (B2B) chatbot entrepreneur, as a reality check, it would be a good idea to read this article (a little dated, but relevant).

Now, before you despair: while the questions from this VC are worth pondering, Roger Chen makes a fundamental assumption: that the technology enabling true conversational AI exists. In reality, it doesn’t, even for narrow vertical enterprise applications. What you see advertised as chatbots are, in reality, laboriously scripted apps with an NLP layer. Many of his questions center on the value proposition of “better NLP” in the context of simple chatbots where NLP is the key technology.

But while machine learning and AI are inherent in NLP/NLU, that is not the “conversational AI” technology that excites people. Conversational AI involves learning outside of the NLP context; that is, learning what it takes to converse with humans. This means understanding many relevant contexts: linguistic context pertaining to the semantics and syntax of the conversation (for example, entities and their values), personality context pertaining to the persona of the human with whom the bot is engaging in the dialog, and organizational context pertaining to the particular B2B enterprise domain/applications that the bot connects to.

As a specific example of conversational AI learning outside of NLP, consider the vocabulary used by the bot. Normally, one would consider vocabulary to be part of NLP. But while character/word/sentence representations (such as word embeddings) may be NLP technologies, the vocabulary itself is domain-specific. So constructing a suitable vocabulary for the bot may itself be a distinguishing aspect of the proposed innovation. While vocabulary is seldom considered crucial, it plays a big role in practical applications. A large vocabulary is often a hindrance (at the least, a resource hog), while a small vocabulary leads to the problem of handling out-of-vocabulary (OOV) words/tokens. This is just one example of the many details that go into creating a truly conversational AI bot and that are orthogonal to the NLP technology itself.
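
To make the OOV trade-off concrete, here is a minimal vocabulary builder with an explicit OOV token. The tiny helpdesk corpus, the size cap and the `<OOV>` convention are illustrative choices, not a standard:

```python
from collections import Counter

# Building a capped domain vocabulary with an OOV token. The corpus,
# size cap and "<OOV>" convention are illustrative choices only.
corpus = ["reset the router", "router keeps dropping connection",
          "reset my password"]

def build_vocab(texts, max_size=5):
    counts = Counter(w for t in texts for w in t.split())
    keep = [w for w, _ in counts.most_common(max_size)]
    return {"<OOV>": 0, **{w: i + 1 for i, w in enumerate(keep)}}

def encode(text, vocab):
    """Map words to integer ids; unknown words collapse to the OOV id."""
    return [vocab.get(w, vocab["<OOV>"]) for w in text.split()]

vocab = build_vocab(corpus)
print(encode("reset the modem", vocab))  # "modem" maps to the OOV id
```

Raise `max_size` and memory/compute grow; lower it and more real words collapse into `<OOV>`, losing information: exactly the hindrance-versus-coverage tension described above.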

Current chatbots or “virtual assistants” essentially serve as new NLP-enabled messaging interfaces to existing enterprise systems with the real value coming from data integration from diverse sources. In other words, many of these “chatbots” are (semantic) search engines with NLP capability; they find answers to simple NLP queries that involve extracting data from multiple enterprise silos. This serves a useful automation function with a newer common interface: messaging front-ends. In many of these applications, the aim is, in fact, to limit or avoid conversations, and to return the information quickly. This is just a convenient productivity application for enterprise users, but they are not intended to be true conversational AI interfaces.

Most of these “chatbots” will break easily when the user goes off script, which will happen in a real conversation involving complex workflows (not simple IVR-type workflows). For enterprise applications, for example customer-service call-center applications, especially those that involve complex troubleshooting, these chatbots often require hand-coded rules that capture domain knowledge (like the expert systems of old). This approach will not generalize or scale.

So, the biggest opportunity right now is for startup tech companies that can create the (deep learning) technology to engineer a realistic conversation, even in a narrow domain. For a discussion of some of the technical challenges of a purely data-driven deep learning approach, see this article.


Unsupervised Deep Learning for Vertical Conversational Chatbots

One approach to building conversational (dialog) chatbots is to use an unsupervised sequence-to-sequence recurrent neural network (seq2seq RNN) deep learning framework. About a year ago, researchers at Google (Vinyals-Le) published an ICML paper, “A Neural Conversational Model”, that describes one such framework; a review can be found here. The Vinyals-Le paper (and associated framework) is instructive in understanding some of the parameters of such seq2seq chatbot models. We assume that Vinyals-Le used TensorFlow, though this is not explicitly stated in the paper.

Note that seq2seq may not be the best way to build a truly conversational chatbot; the Vinyals-Le chatbot is more of a Q/A system, built on a framework that originated in machine translation. As Vinyals-Le note, in machine translation and in Jeopardy-playing Q/A systems, the context is limited to the current question. In an extended dialog-based conversational system, the context needs to be maintained across multiple Q/A sequences; maintaining context over time is a key requirement for dialog systems. Seq2seq just happens to be a simple framework that is easy to generalize across domains and is purely data-driven. For a general discussion of alternatives, refer to this post; the taxonomy presented there would identify the Vinyals-Le method as a generative model that learns responses from historical data. Many existing production systems, such as the Google Assistant, use retrieval-based methods, such as the TensorFlow model presented here.
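For contrast with the generative approach, a retrieval-based system does not generate text at all; it ranks a fixed set of stored responses against the user's input and returns the best match. Here is a minimal sketch using bag-of-words cosine similarity; this is not the dual-encoder TensorFlow model mentioned above, and all the names and sample data are illustrative:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(context, candidates):
    """Pick the stored response whose text best matches the context."""
    ctx = Counter(context.lower().split())
    return max(candidates, key=lambda c: cosine(ctx, Counter(c.lower().split())))

candidates = [
    "try rebooting the router",
    "your ticket has been escalated",
    "please share the error code",
]
print(retrieve("the router keeps dropping the connection", candidates))
# -> "try rebooting the router"
```

Real retrieval systems score candidates with learned encoders rather than raw word overlap, but the ranking structure is the same: the output is always drawn from a known response set, which trades flexibility for predictability.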

Vinyals-Le present two different applications of this model: IT Helpdesk Troubleshooting (a vertical chatbot) and Movie Dialogs (a horizontal chatbot). By vertical chatbots, I mean closed-domain chatbots that are focused on particular vertical applications. By horizontal chatbots, I mean open-domain chatbots like Siri, Google Assistant, or Alexa. Vertical chatbots are easier to build than horizontal ones, and are often the ones needed in enterprise applications like the IT helpdesk application identified in this paper. Additionally, in vertical chatbots, data for the input context usually originates from domain-specific enterprise systems, and the desired output is often an action to be executed on one or more back-end enterprise systems. You could say that vertical chatbots are often goal-driven systems, where the purpose of the conversation or dialog is to obtain the information needed to execute some action. This has design implications: for example, intent classification is less of an issue, since the vertical domain provides considerable context for the conversation.

Unsupervised learning, in the chatbot context, implies that the model can be trained directly from historical chat log data (transcripts), without the need for any human labeling. If the aim is primarily to build a Q/A system, we can treat conversations as input Q/A pairs, where each sentence in the conversation is both an answer to the previous sentence and a question to the next sentence; that is, each sentence appears in two Q/A pairs. This is the approach Vinyals-Le took for their horizontal bot. But for their vertical bot, they explicitly introduced turn-taking and used previous Q/A pairs as context while generating a response to a given question; this is necessary to retain context across the dialog. Their conversations were about 400 words long on average, and their training corpus had 30M tokens, which translates to about 75K conversations. Assuming 20 words per sentence on average, each conversation had about 20 sentences.
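The sliding Q/A pairing described above can be sketched in a few lines; the `qa_pairs` helper and the sample chat are my illustration, not code from the paper:

```python
def qa_pairs(conversation):
    """Turn a list of successive utterances into (question, answer) training
    pairs; every inner sentence appears in two pairs, once as an answer
    and once as a question."""
    return list(zip(conversation, conversation[1:]))

chat = [
    "my vpn is down",
    "which client do you use",
    "the corporate one",
    "try reinstalling it",
]
for q, a in qa_pairs(chat):
    print(q, "->", a)
```

A 4-sentence conversation yields 3 training pairs, so a corpus of 75K conversations of roughly 20 sentences each produces on the order of 1.4M pairs.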

The size and content of the vocabulary is a key model parameter. The smaller the vocabulary, the shorter the training time. Vertical chatbots need a vocabulary tailored to the specific problem. For example, in the Vinyals-Le paper, for the IT helpdesk application, they used the most common 20K words, presumably derived from the corpus. They used special tokens like <URL>, <COMMAND> and <NAME> to deal with entities whose values are important to the conversation, yet cannot be in the vocabulary. This is necessary to distinguish these entities from the catchall <UNKNOWN> symbol, which refers to words in the corpus that are not in the vocabulary (out-of-vocabulary, OOV). A small vocabulary leads to more <UNKNOWN> OOV tokens; handling these unknowns is one of the challenges of building a vertical chatbot with a compact, domain-specific vocabulary. See this Luong et al. Google paper for one approach to handling the OOV problem, albeit in the context of machine translation.
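A minimal sketch of building such a vocabulary from a corpus: the special token names come from the paper, but the `build_vocab`/`encode` helpers and the toy corpus are illustrative, and a real pipeline would map words to integer ids rather than strings:

```python
from collections import Counter

SPECIALS = ["<UNKNOWN>", "<URL>", "<COMMAND>", "<NAME>"]  # reserved slots

def build_vocab(tokens, size):
    """Reserve the special tokens, then fill the remaining slots with
    the most frequent words in the corpus."""
    counts = Counter(t for t in tokens if t not in SPECIALS)
    words = [w for w, _ in counts.most_common(size - len(SPECIALS))]
    return SPECIALS + words

def encode(tokens, vocab):
    """Map every out-of-vocabulary word to the catchall <UNKNOWN> token."""
    v = set(vocab)
    return [t if t in v else "<UNKNOWN>" for t in tokens]

tokens = "ping the server then ping the gateway".split()
vocab = build_vocab(tokens, size=7)  # room for only 3 real words
print(encode(tokens, vocab))
```

With a 20K-word budget, as in the paper, the same logic applies at scale: rare domain terms fall into <UNKNOWN> unless they are promoted to special entity tokens.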

Vinyals-Le don’t fully describe how they handle the OOV problem. It appears that they removed these entities from the training corpus and replaced them with the special tokens. The problem with this approach is that the model will not be able to handle, and reason with, these OOV entity values when they are presented as part of a conversation. However, Vinyals-Le’s model does handle OOV entities such as URLs. My guess is that they handle them through separate input pre-processing and output post-processing steps that use a dictionary of OOV entity values maintained independently of the vocabulary. This would be similar to the approach used in the Luong et al. Google paper mentioned above, where OOV entities are traced back to their appearance in the source system (the input corpus). Regardless of how they are handled, these OOV entity values are important in vertical enterprise applications, since actions resulting from these conversations are tied to the specific values of these entities (such as a URL or phone number).
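To make the guessed mechanism concrete, here is a minimal sketch of such a pre/post-processing step, assuming URLs are the entity type; the `preprocess`/`postprocess` helpers and the indexed `<URL0>` placeholder are my illustration, not the paper's method:

```python
import re

URL_RE = re.compile(r"https?://\S+")

def preprocess(text):
    """Replace each URL with an indexed placeholder and remember the mapping."""
    mapping = {}
    def sub(match):
        key = f"<URL{len(mapping)}>"
        mapping[key] = match.group(0)
        return key
    return URL_RE.sub(sub, text), mapping

def postprocess(text, mapping):
    """Restore the original entity values in the model's output."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text

masked, mapping = preprocess("see http://wiki.corp/vpn for the fix")
print(masked)                        # the URL becomes a placeholder token
# ...the masked text would pass through the seq2seq model here...
print(postprocess(masked, mapping))  # round-trips back to the original
```

The mapping lives outside the model's vocabulary, so the model only ever sees a small set of placeholder tokens, while the actual entity values survive to drive downstream actions.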

Other important model parameters are the number of layers, the types and number of cells used in each layer, the maximum number of words allowed in each sentence (input and output), and the word embedding size. For the vertical bot, Vinyals-Le used 1 layer with 1024 LSTM cells; more typical values for chatbots are 3 layers of 512 cells. They didn’t specify the maximum sentence length, but based on the above discussion, we can assume it to be 20. This parameter is important because the TensorFlow seq2seq implementation pads all sequences (both input and output) to the same fixed length; this may be less of an issue for RNN/seq2seq implementations that allow variable sequence lengths. A large value for the maximum sentence length requires a correspondingly large memory allocation, while a small value means long sentences cannot be handled. 20 words per sentence is a reasonable compromise, since most chat dialogs involve short sentences. Vinyals-Le also don’t specify the embedding size, but a typical value is 256. As with all deep learning systems, higher values for these parameters result in more complex models that require more training time and larger datasets to prevent overfitting. There are many other relevant parameters, some specific to RNNs, such as the use of GRUs instead of LSTMs, attention mechanisms, regularization, mini-batch size, number of epochs, learning rate, loss function, and optimization algorithm (stochastic gradient descent, AdaGrad, Adam, etc.).
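As an illustration of the fixed-length padding discussed above, here is a minimal sketch; the `pad` helper and `<PAD>` token are illustrative, and a real seq2seq pipeline would also add start/end-of-sequence markers and possibly bucketing:

```python
PAD, MAX_LEN = "<PAD>", 20  # fixed length, assumed per the discussion above

def pad(tokens, max_len=MAX_LEN):
    """Truncate long sentences and pad short ones out to a fixed length,
    so every training example has the same shape."""
    tokens = tokens[:max_len]
    return tokens + [PAD] * (max_len - len(tokens))

short = pad("restart the vpn client".split())
print(len(short))  # always 20, padded with <PAD>
long_ = pad("a very long sentence".split() * 10)
print(len(long_))  # always 20, tail truncated
```

The memory-versus-coverage trade-off falls directly out of this: every sequence in a batch occupies MAX_LEN slots regardless of its true length, and anything past MAX_LEN is silently lost.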

Seq2seq chatbot models like the Vinyals-Le approach provide a simple way to build unsupervised vertical chatbots. The advantage of these models is their simplicity and generality, with little need for domain-dependent rules. The disadvantages include the need to adapt them to the chatbot task (given their origins in machine translation and Q/A), difficulty maintaining context in lengthy conversations, and difficulty holding a dialog with a consistent personality (though this is less of an issue for vertical bots than for horizontal bots). While a lot of (media) attention has been paid to horizontal chatbots like Google Assistant, Siri, and Alexa, vertical chatbots tailored to specific enterprises and domains might be a bigger business opportunity.


Data-driven dialog bots

This is a good overview of data-driven dialog systems (bots) based on a survey paper.

A survey of available corpora for building data-driven dialogue systems
