One approach to building conversational (dialog) chatbots is to use an unsupervised sequence-to-sequence recurrent neural network (seq2seq RNN) deep learning framework. About a year ago, researchers at Google (Vinyals and Le) published an ICML paper, “A Neural Conversational Model”, that describes one such framework; a review can be found here. The Vinyals-Le paper (and its associated framework) is instructive for understanding some of the parameters of such seq2seq chatbot models. We assume that Vinyals-Le used TensorFlow, though this is not explicitly stated in the paper.
Note that seq2seq may not be the best way to build a truly conversational chatbot; the Vinyals-Le chatbot is more of a Q/A system, reflecting its origins in machine translation. As Vinyals-Le note, in machine translation and Jeopardy-playing Q/A systems, the context is limited to the current question. In an extended, dialog-based conversational system, context needs to be maintained across multiple Q/A exchanges; maintaining context over a period of time is a key requirement for dialog systems. Seq2seq just happens to be a simple framework that is easy to generalize across domains and is purely data-driven. For a general discussion of alternatives, refer to this post; the taxonomy presented there would classify the Vinyals-Le method as a generative model that learns responses from historical data. Many existing production systems, such as the Google Assistant, use retrieval-based methods, such as the TensorFlow model presented here.
Vinyals-Le present two different applications of this model: IT Helpdesk Troubleshooting (a vertical chatbot) and Movie Dialogs (a horizontal chatbot). By vertical chatbots, I mean closed-domain chatbots that are focused on particular vertical applications. By horizontal chatbots, I mean open-domain chatbots like Siri, Google Assistant or Alexa. Vertical chatbots are easier to build than horizontal ones, and are often the ones needed in enterprise applications like the IT helpdesk application identified in this paper. Additionally, in vertical chatbots, data for the input context usually originates from domain-specific enterprise systems, and the desired output is often an action to be executed on one or more back-end enterprise systems. You could say that vertical chatbots are often goal-driven systems, where the purpose of the conversation or dialog is to obtain the information needed to execute some action. This has design implications: for example, intent classification is less of an issue, since the vertical domain provides considerable context for the conversation.
Unsupervised learning, in the chatbot context, implies that the model can be trained directly from historical chat log data (transcripts), without the need for any human labeling. If the aim is primarily to build a Q/A system, we can treat conversations as input Q/A pairs, where each sentence in the conversation is both an answer to the previous sentence and a question for the next sentence; that is, each sentence appears in two Q/A pairs. This is the approach Vinyals-Le took for their horizontal bot. For their vertical bot, however, they explicitly introduced turn-taking and used previous Q/A pairs as context when generating a response to a given question; this is necessary to retain context across the dialog. Their conversations averaged about 400 words, and their training corpus had 30M tokens, which translates to about 75K conversations. Assuming an average of 20 words per sentence, each conversation has about 20 sentences.
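The Q/A-pair construction above can be sketched in a few lines. This is an illustrative sketch, not the paper's actual preprocessing code, and the helpdesk-style sentences are made up; it just shows how each inner sentence of a conversation ends up in two pairs, once as an "answer" and once as a "question".

```python
# Sketch: turning one conversation transcript into overlapping Q/A
# training pairs. Each consecutive sentence pair becomes one example,
# so every inner sentence appears in exactly two pairs.

def qa_pairs(sentences):
    """Yield (question, answer) pairs from consecutive sentences."""
    return [(sentences[i], sentences[i + 1])
            for i in range(len(sentences) - 1)]

# Hypothetical 4-sentence helpdesk conversation (not from the paper).
conversation = [
    "hi , my screen is frozen",
    "have you tried restarting the machine ?",
    "yes , twice",
    "ok , let me open a ticket",
]

pairs = qa_pairs(conversation)
# 4 sentences -> 3 Q/A pairs; the second sentence appears in two of them.
```

With 75K conversations of roughly 20 sentences each, this scheme yields on the order of 1.4M training pairs from the 30M-token corpus.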
The size and content of the vocabulary is a key model parameter. The smaller the vocabulary, the shorter the training time. Vertical chatbots need a vocabulary tailored to the specific problem. For example, for the IT helpdesk application in the Vinyals-Le paper, they used the most common 20K words, presumably derived from the corpus. They used special tokens like <URL>, <COMMAND> and <NAME> to deal with entities whose values are important to the conversation, yet cannot be in the vocabulary. This is necessary to distinguish these entities from the catchall <UNKNOWN> symbol, which refers to words in the corpus that are not in the vocabulary (out-of-vocabulary, OOV). A small vocabulary leads to more <UNKNOWN> OOV tokens; handling these unknowns is one of the challenges of building a vertical chatbot with a tailored, compact, domain-specific vocabulary. See this Luong et al. Google paper for one approach to handling this OOV problem, albeit in the context of machine translation.
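A minimal sketch of this vocabulary scheme, under my own assumptions (a frequency-ranked cutoff with special tokens reserved at the front of the id space): the token names follow the paper, but the vocabulary size, ordering, and example corpus here are illustrative.

```python
# Sketch: a compact vocabulary of the N most frequent corpus words,
# with special tokens (<UNKNOWN>, <URL>, etc.) reserved up front.
# Words outside the vocabulary are mapped to the <UNKNOWN> id.
from collections import Counter

SPECIALS = ["<UNKNOWN>", "<URL>", "<COMMAND>", "<NAME>"]

def build_vocab(corpus_tokens, size):
    """Map the (size - len(SPECIALS)) most common words to ids."""
    counts = Counter(t for t in corpus_tokens if t not in SPECIALS)
    keep = [w for w, _ in counts.most_common(size - len(SPECIALS))]
    return {w: i for i, w in enumerate(SPECIALS + keep)}

def encode(tokens, vocab):
    """Replace OOV words with the <UNKNOWN> id."""
    unk = vocab["<UNKNOWN>"]
    return [vocab.get(t, unk) for t in tokens]

# Toy example: an 8-word vocabulary over a tiny "corpus".
vocab = build_vocab("the cat sat on the mat".split(), size=8)
ids = encode("the dog sat".split(), vocab)  # "dog" becomes <UNKNOWN>
```

For the helpdesk bot, `size` would be 20K and the corpus the 30M-token chat log; shrinking `size` speeds up training but pushes more words into <UNKNOWN>.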
Vinyals-Le don’t fully describe how they handle the OOV problem. It appears that they removed these entities from the training corpus and replaced them with the special tokens. The problem with this approach is that the model will not be able to handle, or reason with, these OOV entity values when they are presented as part of a conversation. However, Vinyals-Le’s model does handle OOV entities such as URLs. My guess is that they handle this through separate input pre-processing and output post-processing steps that use a dictionary of these OOV entity values, independent of the vocabulary. This would be similar to the approach used in the Luong et al. Google paper mentioned above, where OOV entities are traced back to their appearance in the source system (the input corpus). Regardless of how they are handled, these OOV entity values are important in vertical enterprise applications, since actions resulting from these conversations are tied to the specific values of these entities (such as a URL or phone number).
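To make the guessed pre-/post-processing step concrete, here is a sketch under that assumption: entity values are swapped for special tokens before the model sees the input, stashed in a side table that is independent of the vocabulary, and copied back into the model's output. The regex, the single <URL> entity type, and the in-order restoration are all simplifying assumptions of mine, not details from the paper.

```python
# Sketch: mask OOV entity values (here, URLs) with a special token on the
# way in, and restore the stored values into the generated reply on the
# way out. The side table lives outside the model's vocabulary.
import re

URL_RE = re.compile(r"https?://\S+")

def preprocess(text, table):
    """Replace each URL with <URL>, appending its value to the table."""
    def stash(match):
        table.append(match.group(0))
        return "<URL>"
    return URL_RE.sub(stash, text)

def postprocess(text, table):
    """Substitute stored values back for <URL> tokens, in order."""
    out = text
    for url in table:
        out = out.replace("<URL>", url, 1)
    return out

table = []
masked = preprocess("try http://support.example.com/reset first", table)
# masked: "try <URL> first"
reply = postprocess("please open <URL> and follow the steps", table)
```

A real system would need per-entity-type tables (<COMMAND>, <NAME>, phone numbers) and a policy for replies that emit more special tokens than the input supplied.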
Other important model parameters are the number of layers, the types and number of cells used in each layer, the maximum number of words allowed in each sentence (input and output), and the word embedding size. For the vertical bot, Vinyals-Le used 1 layer with 1024 LSTM cells; more typical values for chatbots are 3 layers and 512 cells. They didn’t specify the maximum sentence length, but based on the above discussion, we can assume this to be 20. Note that this parameter is important, because the TensorFlow seq2seq implementation pads all sequences (both input and output) to the same fixed length; this may be less of an issue for RNN / seq2seq implementations that allow for variable sequence lengths. A large value for maximum sentence length will require a correspondingly large memory allocation, while a smaller value will make the model unable to handle long sentences. 20 words per sentence is a reasonable compromise, since most chat dialogs involve short sentences. Vinyals-Le also don’t specify the embedding size, but a typical value is 256. As with all deep learning systems, higher values for these parameters will result in more complex models that require more training time and larger datasets to prevent overfitting. There are many other relevant parameters, some specific to RNNs, such as the use of GRUs instead of LSTMs, attention mechanisms, regularization, mini-batch size, number of epochs, learning rate, loss function and optimization algorithm (stochastic gradient descent, AdaGrad, Adam, etc.).
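The fixed-length padding trade-off above can be sketched as follows. The pad id and the maximum length of 20 are illustrative choices (the paper specifies neither); the point is that every token-id sequence is truncated or right-padded to one fixed size, which is what ties memory use to the maximum sentence length.

```python
# Sketch: force every token-id sequence to a fixed length, as a
# fixed-length seq2seq implementation requires. Long sentences are
# truncated (information loss); short ones are right-padded (wasted
# memory). MAX_LEN = 20 matches the assumed average sentence length.
PAD_ID = 0
MAX_LEN = 20

def pad(seq, max_len=MAX_LEN, pad_id=PAD_ID):
    """Truncate seq to max_len, then right-pad with pad_id."""
    seq = seq[:max_len]
    return seq + [pad_id] * (max_len - len(seq))

padded = pad([5, 6, 7])          # short input: 17 pad ids appended
clipped = pad(list(range(30)))   # long input: last 10 ids dropped
```

Doubling MAX_LEN roughly doubles the per-sequence memory and compute of every batch, which is why 20 is a reasonable compromise for short chat sentences.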
Seq2seq chatbot models, like the Vinyals-Le approach, provide a simple way to build unsupervised vertical chatbots. The advantage of these models is their simplicity and generality, with little need for domain-dependent rules. The disadvantages include the need to adapt them to the chatbot task (given their origins in machine translation and Q/A), difficulty maintaining context in lengthy conversations, and difficulty holding a dialog with a consistent personality (though this is less of an issue for vertical bots than for horizontal bots). While a lot of (media) attention has been paid to horizontal chatbots like Google Assistant, Siri and Alexa, vertical chatbots, tailored to specific enterprises and domains, might be a bigger business opportunity.