Optimizing Virtual Agent (VA) Training
Achieve NLU model’s precision, recall & accuracy up to 78%
The success of any Virtual Agent (VA) depends on the training of its Natural Language Understanding (NLU) model prior to configuration. A Major challenge is providing the right set of representative examples from historical data for this training. Identifying a few hundreds of right examples out of millions of historical data is a herculean task. What makes it even more daunting is that this task is usually done by digital service providers (DSPs) manually and it’s extremely time-consuming.
This article articulates about developing a machine-learning (ML) based Intent Analyzer tool to identify the most effective data set for NLU training. These datasets cover maximum scope for the respective intent making to train the NLU model in a highly efficient way which leads to improved precision, recall, and accuracy.
Machine learning-based Intent Analyzer tool identifies most relevant representative examples for NLU training
The conventional approach of identifying the training dataset for VA NLU depends heavily on DSP’s internal process experts. It involves choosing the most relevant few hundred examples of millions of historical chat. But it is crippled with inefficiencies because it lacks coverage of all the examples needed for training. Also, it makes way for manual biases and highly time-consuming.
Developing an ML-based intent analyzer tool is the most optimal approach for identifying representative training examples. Below are the steps involved:
Chat logs are sourced as .csv files and imported as data frames using panda’s library. The chat columns are sliced into frames for further processing of data.
It’s an initial cleanup process using regular expressions, followed by refinement of chat-specific components such as chat scripts, time stamps, removal of stop word and punctuation.
Text Vector Processing
TFIDF (Term Frequency Inverse Document Frequency) vectorization is a process to identify the relevance of the word within the context of the document followed by performing stemming to get root (stem) of word sans prefix/suffix.
K-means clustering is performed to identify useful logs. The number of chats (e.g. N= 50) can be specified so that the top N chat logs can be derived. The outcome of the clustering module is to provide the most relevant chats per use case identified.
Data Import – Templatize the input to optimize the most time-consuming step
It involves sourcing historical data from chat logs for respective intents/use cases. Millions of chats are picked and imported to identify the most relevant handful of them.
The input data need to be in a standard format. To ensure this standardization, it is recommended to create a template for it.
Below figure represent a sample from a template:
The most important parameters to be captured in a template are categorized into three sections
- Primary importance: Full customer conversation.
- Secondary importance: Agent utterance, Agent group name, Customer utterance, Business unit name.
- Tertiary importance: Start time, Escalated, Engagement ID and Transferred.
Remove random noise/white noise
Seasonality in data can lead to wrong inferences. For example, higher call drops or lower speeds during Thanksgiving or Christmas. To reduce the impact, choose the dataset spread over a larger period like 9-12 months.
For efficient separation of metadata, flatten the file into excel file or other simpler formats as shown below
Reducing import time
By performing parallel processing and avoiding overloading memory.
Data Preprocessing – Leverage raw text preprocessing, regular expression and lemmatization
The step involves initial cleanup of the chat data by removing chat specific components such as timestamps, stop-words, and punctuations.
Below are some recommendations to be followed during data preprocessing
Special character processing and text analysis
Below methods can be followed to remove special characters and analyze text at a high level
Raw text preprocessing
Removes chat specific notations and special characters which don’t add any value to the analysis. E.g. – timestamps, special characters, etc.
Segregates numerals from alphabets and retains only special strings of alphanumeric values.
Lemmatization or stemming
Focuses on reducing the words to their root words.
Rare word removal
They create noise through their association with other words. They might not be rare, but their usage in a certain context can be misleading. E.g. – revert, captive, hill.
Text Vector Pre-processing – Leverage term frequency-inverse document frequency (TFIDF) for most effective vectorization
Text vector preprocessing helps in understanding the importance of words as per the relative context. It focuses on the difference in relevance in different circumstances.
Choice of vectorization method
Choose a method based on the type of text data. It is recommended to choose ‘term frequency-inverse document frequency (TFIDF) vectorization’ since it considers the relative importance of a word in each context.
Term Frequency-Inverse Document Frequency (TFIDF)
TFIDF ensures that the chats are selected according to their relative importance. More than their overall significance in everyday usage, it measures how critical they are in the context of the chat log corpus being analyzed. This helps in identifying the most relevant chats as per the intent. Below techniques are used to increase the accuracy of the outcome from the TFIDF process.
N-grams and other multiword usage
Certain words have an entirely different meanings when used in combination with a few other words. Such words should be configured appropriately.
Tunes the parameters of the vectorization algorithm to optimize the output.
Clustering – Perform k-means clustering technique for effective classification
Clustering ensures that the top-N chats (where N is variable depending on business/NLU needs) are derived. These can be quickly analyzed to identify utterances, intents and entities. Additional ML processing such as entity or intent recognition can also be performed if required. All this results in significant time and effort saving.
Choice of clustering technique
Choose the technique that can work on huge volume of data like millions of customer chats. It is recommended to use k-means clustering for such a volume.
Ensure one chat is mapped with only one intent i.e. avoid overlapping
Ability to scale the number of top-N use-cases based on intent call-volume (by varying the number of clusters) this enables the number of representative samples to be adjusted based on whether a given intent has more or less volume.
Leveraging a machine learning-based approach for identification of training data for NLU can have following benefits:
- The number of use cases crossing confidence threshold can increase by 160-180%.
- Time efficiency can save up to 97% of time in identifying the examples
- Transfer to live agents can reduce by almost 80%
- Achieve Recall and Accuracy of up to 78%
By Sathya Ramana Varri C
Sathya is a Senior Director & heads the AI/ML and Intelligent Automation delivery for the largest US telecom service provider player in US for Prodapt, a global leader in providing software, engineering, and operational services to the communications industry. He has 20+ years of experience spread across various domains/technologies. Sathya is instrumental in several customer experience, intelligent automation & digital transformation initiatives.