Developing Machine Learning-based Approach for Optimizing Virtual Agent (VA) Training

Achieve NLU model precision, recall, and accuracy of up to 78%

The success of any Virtual Agent (VA) depends on the training of its Natural Language Understanding (NLU) model prior to configuration. A major challenge is providing the right set of representative examples from historical data for this training. Identifying a few hundred suitable examples out of millions of historical records is a herculean task. What makes it even more daunting is that digital service providers (DSPs) usually perform this task manually, which is extremely time-consuming.

This article describes the development of a machine-learning (ML) based Intent Analyzer tool that identifies the most effective dataset for NLU training. Such a dataset covers the maximum scope of each intent, allowing the NLU model to be trained in a highly efficient way and leading to improved precision, recall, and accuracy.

Machine learning-based Intent Analyzer tool identifies the most relevant representative examples for NLU training

The conventional approach to identifying the training dataset for a VA's NLU depends heavily on the DSP's internal process experts, who choose the most relevant few hundred examples out of millions of historical chats. It is crippled by inefficiencies: it lacks coverage of all the examples needed for training, it invites manual bias, and it is highly time-consuming.

Developing an ML-based intent analyzer tool is the most optimal approach for identifying representative training examples. Below are the steps involved:

Data Import

Chat logs are sourced as .csv files and imported as data frames using the pandas library. The chat columns are sliced into frames for further processing, as sketched below.
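
A minimal sketch of this import step with pandas (the file name and column names are illustrative assumptions, not the tool's actual schema):

    import pandas as pd

    # Load exported chat logs into a data frame (file name is illustrative).
    df = pd.read_csv("chat_logs.csv")

    # Slice out only the chat-related columns for further processing.
    chat_df = df[["engagement_id", "conversation"]]
    print(chat_df.head())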

Data Preprocessing

This is an initial cleanup using regular expressions, followed by refinement of chat-specific components such as chat scripts and timestamps, and removal of stop words and punctuation.

Text Vector Processing

TFIDF (Term Frequency-Inverse Document Frequency) vectorization identifies the relevance of each word within the context of the document; stemming is then performed to reduce words to their roots (stems), without prefixes or suffixes.

Clustering

K-means clustering is performed to identify useful logs. The number of chats (e.g., N = 50) can be specified so that the top N chat logs are derived. The clustering module outputs the most relevant chats for each identified use case.

Data Import – Templatize the input to optimize the most time-consuming step

This involves sourcing historical data from chat logs for the respective intents/use cases. Millions of chats are imported in order to identify the most relevant handful.

Recommendations:

  1. Templatize:

The input data needs to be in a standard format. To ensure this standardization, it is recommended to create a template for it.

The most important parameters to be captured in a template are categorized into three sections:

  • Primary importance: Full customer conversation.
  • Secondary importance: Agent utterance, Agent group name, Customer utterance, Business unit name.
  • Tertiary importance: Start time, Escalated, Engagement ID and Transferred.
  2. Remove random noise/white noise

Seasonality in data can lead to wrong inferences, for example higher call drops or lower speeds during Thanksgiving or Christmas. To reduce its impact, choose a dataset spread over a longer period, such as 9-12 months.

  3. Key-value pairs

For efficient separation of metadata, flatten the file into an Excel file or another simpler format, as shown below.
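
One way to illustrate this flattening (the record layout and field names are assumptions, not the tool's actual schema) is pandas' json_normalize, which turns nested metadata into simple key-value columns:

    import pandas as pd

    # A nested chat record, as it might arrive from an export API.
    records = [
        {"engagement_id": 101,
         "meta": {"business_unit": "Broadband", "escalated": False},
         "conversation": "Customer: my internet is slow ..."},
    ]

    # Flatten nested metadata into key-value columns such as meta_business_unit.
    flat = pd.json_normalize(records, sep="_")
    print(flat.columns.tolist())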

  4. Reduce import time

Perform parallel processing and avoid overloading memory, for example by streaming large files in chunks, as shown below.
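
A hedged sketch of chunked reading with pandas, which keeps memory usage flat while millions of rows are filtered down (the file name, column, and filter are illustrative):

    import pandas as pd

    relevant_chunks = []
    # Stream the file in 100k-row chunks instead of loading millions of rows at once.
    for chunk in pd.read_csv("chat_logs.csv", chunksize=100_000):
        # Keep only the rows relevant to the analysis (illustrative filter).
        relevant_chunks.append(chunk[chunk["business_unit"] == "Broadband"])

    chats = pd.concat(relevant_chunks, ignore_index=True)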

Data Preprocessing – Leverage raw text preprocessing, regular expression and lemmatization

This step involves initial cleanup of the chat data by removing chat-specific components such as timestamps, stop words, and punctuation.

Below are some recommendations to follow during data preprocessing:

  1. Special character processing and text analysis

The following methods can be used to remove special characters and analyze text at a high level; a combined sketch follows this list.

Raw text preprocessing

Removes chat-specific notations and special characters that add no value to the analysis, e.g., timestamps.

Regular expression

Segregates numerals from alphabetic characters and retains only the meaningful alphanumeric strings.

  2. Lemmatization or stemming

Focuses on reducing words to their root forms.

  3. Rare word removal

Rare words create noise through their association with other words. They might not be rare in general usage, but their usage in a certain context can be misleading, e.g., revert, captive, hill.
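
A combined sketch of these preprocessing steps, assuming NLTK for stop words and lemmatization (the timestamp pattern and noise-word list are illustrative):

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)

    STOP_WORDS = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    NOISE_WORDS = {"revert", "captive", "hill"}  # context-specific noise words

    def preprocess(chat: str) -> str:
        # Raw text preprocessing: strip chat notations such as [10:42:05] timestamps.
        chat = re.sub(r"\[\d{2}:\d{2}(:\d{2})?\]", " ", chat)
        # Regular expression: keep only alphanumeric tokens, dropping punctuation.
        tokens = re.findall(r"[a-z0-9]+", chat.lower())
        # Remove stop words and misleading rare words, then lemmatize to root forms.
        return " ".join(lemmatizer.lemmatize(t) for t in tokens
                        if t not in STOP_WORDS and t not in NOISE_WORDS)

    print(preprocess("[10:42:05] Agent: Please revert, your modem is offline!"))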

Text Vector Processing – Leverage term frequency-inverse document frequency (TFIDF) for the most effective vectorization

Text vector preprocessing helps in understanding the importance of words in their relative context: the same word can carry different relevance in different circumstances.

Recommendations:

  1. Choice of vectorization method

Choose a method based on the type of text data. Term frequency-inverse document frequency (TFIDF) vectorization is recommended since it considers the relative importance of a word in each context.

  2. Term Frequency-Inverse Document Frequency (TFIDF)

TFIDF ensures that chats are selected according to their relative importance. Rather than measuring a word's overall significance in everyday usage, it measures how critical the word is within the chat log corpus being analyzed, which helps identify the most relevant chats for each intent. The following techniques increase the accuracy of the TFIDF output.

N-grams and other multiword usage

Certain words take on an entirely different meaning when used in combination with other words. Such combinations should be configured appropriately.

Hyper-parameter tuning

Tunes the parameters of the vectorization algorithm to optimize the output.
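
A minimal sketch with scikit-learn's TfidfVectorizer, showing both n-gram configuration and the usual hyper-parameters to tune (the sample chats and parameter values are illustrative starting points, not tuned settings):

    from sklearn.feature_extraction.text import TfidfVectorizer

    chats = [
        "my internet keeps dropping every evening",
        "i want to upgrade my internet plan",
        "bill payment failed on the mobile app",
    ]

    # ngram_range=(1, 2) captures multiword phrases such as "internet plan";
    # min_df, max_df, and max_features are common hyper-parameters to tune.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1, max_df=0.95,
                                 max_features=5000)
    tfidf_matrix = vectorizer.fit_transform(chats)
    print(tfidf_matrix.shape)  # (number of chats, vocabulary size)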

Clustering – Perform k-means clustering for effective classification

Clustering ensures that the top-N chats (where N is variable depending on business/NLU needs) are derived. These can be quickly analyzed to identify utterances, intents, and entities. Additional ML processing such as entity or intent recognition can also be performed if required. All this results in significant time and effort savings.

Recommendations:

  1. Choice of clustering technique

Choose a technique that can handle a huge volume of data, such as millions of customer chats. K-means clustering is recommended at this scale.

  2. One-to-one mapping

Ensure each chat is mapped to only one intent, i.e., avoid overlapping assignments.

  3. Intent-specific scaling

Scale the number of top-N use cases to intent call volume by varying the number of clusters. This lets the number of representative samples be adjusted according to whether a given intent carries higher or lower volume, as in the sketch below.
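
A hedged sketch of the clustering step with scikit-learn, picking the chat closest to each cluster centre as its representative example (the sample chats and cluster count are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    chats = [
        "my internet keeps dropping every evening",
        "internet connection drops at night",
        "i want to upgrade my internet plan",
        "please upgrade my plan to the faster tier",
        "bill payment failed on the mobile app",
        "my payment did not go through",
    ]

    tfidf_matrix = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(chats)

    # In practice, n_clusters is scaled with the intent's call volume.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
    kmeans.fit(tfidf_matrix)

    # kmeans.transform gives each chat's distance to every cluster centre;
    # the nearest chat per centre serves as that cluster's representative.
    distances = kmeans.transform(tfidf_matrix)
    for c in range(3):
        print(f"Cluster {c}:", chats[int(np.argmin(distances[:, c]))])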

Leveraging a machine learning-based approach for identifying NLU training data can have the following benefits:

  • The number of use cases crossing the confidence threshold can increase by 160-180%.
  • Time spent identifying representative examples can drop by up to 97%.
  • Transfers to live agents can drop by almost 80%.
  • Recall and accuracy of up to 78% can be achieved.

By Sathya Ramana Varri C

Sathya is a Senior Director who heads AI/ML and Intelligent Automation delivery for the largest US telecom service provider at Prodapt, a global leader in providing software, engineering, and operational services to the communications industry. He has 20+ years of experience across various domains and technologies and has been instrumental in several customer experience, intelligent automation, and digital transformation initiatives.
