Learn More with Less Data: Active Learning

A key barrier for companies to adopt machine learning is not lack of data but lack of labeled data. Labeling data gets expensive, and the difficulties of sharing and managing large datasets for model development make it a struggle to get machine learning projects off the ground.

That’s where our “learn more from less data” approach comes into play. At JPMorgan Chase, we focus on reducing the amount of labeled data needed to build models by constructing gold training datasets, which lowers labeling costs and increases the agility of model development.


What is Active Learning?


Active learning is a form of semi-supervised learning that works well when you have a lot of data but getting it labeled is expensive. By identifying the samples that are most informative, teams can focus labeling effort on the data points that most improve model quality.


Active learning uses machine learning (ML) models to identify difficult data points and asks a human annotator to focus on labeling them.

To explain passive learning and active learning, let’s use the analogy of teacher and student. In the passive learning approach, a student learns by listening to the teacher's lecture. In active learning, the teacher describes concepts, students ask questions, and the teacher spends more time explaining the concepts that are difficult for a student to understand. Student and teacher interact and collaborate in the learning process.


In ML model development using active learning, the annotator and the modeler interact and collaborate. An annotator provides a small labeled dataset. The modeling team builds a model and generates input on what to label next. Within a few iterations, teams can build refined requirements, a gold labeled training set, an active learner, and a working machine learning model.
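As a minimal sketch of this loop (assuming a scikit-learn classifier; X_seed, y_seed, and X_pool are placeholder arrays, and label_fn stands in for the human annotator):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_seed, y_seed, X_pool, label_fn, n_rounds=5, batch_size=20):
    """Iteratively train, query the most uncertain pool points, request labels.

    label_fn stands in for the annotator: given indices into the current
    pool, it returns their labels. X_seed/y_seed are the small starting set.
    """
    X_train, y_train = X_seed.copy(), y_seed.copy()
    model = None
    for _ in range(n_rounds):
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        proba = model.predict_proba(X_pool)
        uncertainty = 1.0 - proba.max(axis=1)              # least-confidence score
        query_idx = np.argsort(uncertainty)[-batch_size:]  # most uncertain points
        new_labels = label_fn(query_idx)                   # annotator labels them
        X_train = np.vstack([X_train, X_pool[query_idx]])
        y_train = np.concatenate([y_train, new_labels])
        X_pool = np.delete(X_pool, query_idx, axis=0)      # remove from the pool
    return model
```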


How We Identify Difficult Data Points

To identify difficult data points, we use a combination of methods, including:

Classification Uncertainty Sampling

When querying for labels, this strategy selects the samples with the highest uncertainty: the data points the model knows least about. Labeling these data points makes the ML model more knowledgeable.
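A minimal sketch of this score, assuming proba is the (n_samples, n_classes) array returned by a model’s predict_proba:

```python
import numpy as np

def least_confidence(proba):
    """Uncertainty = 1 - probability of the predicted class; higher values
    mean the model knows less about the sample."""
    return 1.0 - proba.max(axis=1)

# query_idx = np.argmax(least_confidence(proba))  # single most uncertain sample
```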


Margin Uncertainty


When querying for labels, this strategy selects the samples with the smallest margin between the top two predicted classes. These are data points the model knows something about but isn’t confident enough to classify well. Labeling these examples increases model accuracy.
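A companion sketch, under the same assumption about proba:

```python
import numpy as np

def margin_uncertainty(proba):
    """Margin = top class probability minus second-best; a smaller margin
    means the model is torn between two classes."""
    part = np.sort(proba, axis=1)
    return part[:, -1] - part[:, -2]

# query_idx = np.argmin(margin_uncertainty(proba))  # smallest margin is queried
```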

Entropy Sampling


Entropy is a measure of uncertainty. It is proportional to the average number of guesses one has to make to find the true class. In this approach, we pick the samples with the highest entropy.
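In the same style, a sketch of entropy-based scoring:

```python
import numpy as np

def entropy_score(proba, eps=1e-12):
    """Shannon entropy of the predicted class distribution; eps guards
    against log(0). Higher entropy = more uncertain."""
    return -np.sum(proba * np.log(proba + eps), axis=1)

# query_idx = np.argmax(entropy_score(proba))  # highest-entropy sample is queried
```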

Disagreement-Based Sampling

With this method, we pick the samples on which different algorithms disagree. For example, suppose a model classifies into 5 classes (A, B, C, D, and E) and we use 5 different classifiers:


  • Bag of words

  • LSTM

  • CNN

  • BERT

  • HAN (Hierarchical Attention Networks)


The annotator can then label the examples on which the classifiers disagree.
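A sketch of one common disagreement measure (vote disagreement), assuming each committee member’s class predictions are stacked into a single array:

```python
import numpy as np
from collections import Counter

def vote_disagreement(predictions):
    """Fraction of committee members disagreeing with the majority vote.

    predictions: (n_models, n_samples) array of predicted class labels,
    e.g. one row each from the bag-of-words, LSTM, CNN, BERT, and HAN
    classifiers listed above.
    """
    n_models, n_samples = predictions.shape
    scores = np.empty(n_samples)
    for i in range(n_samples):
        top_count = Counter(predictions[:, i]).most_common(1)[0][1]
        scores[i] = 1.0 - top_count / n_models
    return scores

# Samples with the highest disagreement scores go to the annotator.
```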

Information Density

In this approach, we focus on denser regions of the data and select a few points in each dense region. Labeling these data points helps the model classify the large number of points surrounding them.
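A sketch of one density-weighting scheme (similarity-weighted uncertainty; the beta exponent is an assumed tuning knob):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def information_density(X_pool, uncertainty, beta=1.0):
    """Weight each point's uncertainty by its average similarity to the rest
    of the pool, so queries favor points in dense regions over isolated
    outliers. uncertainty can come from any of the scores sketched above."""
    density = cosine_similarity(X_pool).mean(axis=1)
    return uncertainty * (density ** beta)
```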

Business Value


In this method, we focus on labeling the data points that have higher business value than the others.

Alignment Between Humans and Machines

Traditionally, data scientists work with annotators to label a portion of their data and hope for the best when training their model. If the model is not sufficiently predictive, more data is labeled, and they try again until performance reaches an acceptable level. While this approach still makes sense for some problems, for those with vast amounts of data or unstructured data, we find that active learning is a better solution.


Active learning combines the power of machine learning with human annotators to select the next best data points to label. This intelligent selection leads to the creation of high-performance models in less time and at lower cost.





Snorkeling and Weak Supervision: Label Data With Less Labor



In the example above, we showed how active learning enables collaboration between the annotator and the data scientist to intelligently select data points to label. Using this approach, we can identify the important data points that the annotator should label to rapidly improve model performance.

Snorkeling complements active learning by partially automating the creation of data labels. It focuses on identifying easy data points that can be labeled programmatically, instead of by an annotator.

Background: What is Snorkeling?

Snorkel is a library developed at Stanford for programmatically building and managing training datasets.

In Snorkel, a Subject Matter Expert (SME) encodes a business rule for labeling data into a Labeling Function (LF). The LF can then be applied to the unlabeled data to produce automated candidate labels. Typically, multiple LFs are used to produce differing labels, and policies are defined for selecting the best final label choice. These policies include majority vote and a model-based weighted combination.
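As an illustrative sketch (not JPMC’s production code), here is how keyword-based LFs and the two combination policies might look with Snorkel’s Python API; the label names, keywords, and mock data are assumptions for a toy complaint-detection task:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel, MajorityLabelVoter

ABSTAIN, NOT_COMPLAINT, COMPLAINT = -1, 0, 1  # assumed label scheme

@labeling_function()
def lf_keyword_refund(x):
    # Pattern heuristic: "refund" strongly suggests a complaint.
    return COMPLAINT if "refund" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_keyword_thanks(x):
    # Pattern heuristic: "thank" suggests positive, non-complaint feedback.
    return NOT_COMPLAINT if "thank" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "Please process my refund immediately",
    "Thank you for the quick help",
]})
lfs = [lf_keyword_refund, lf_keyword_thanks]
L_train = PandasLFApplier(lfs=lfs).apply(df=df_train)  # (n_examples, n_LFs) matrix

# Policy 1: simple majority vote across LFs.
majority_labels = MajorityLabelVoter().predict(L=L_train)

# Policy 2: model-based weighted combination of LF outputs.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=100, seed=42)
weighted_labels = label_model.predict(L=L_train)
```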

The labeling functions can be evaluated for coverage of the unlabeled training data. The SME can determine whether gaps exist and add additional LFs for those cases. The labeled training data can then be used to train a classifier model, whose purpose is to evaluate the quality of the dataset produced by Snorkeling against a reference (gold) labeled dataset. This model is evaluated on manually labeled test data, and the results feed back to the SME to further tune the LFs.
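Snorkel ships a utility for exactly this kind of inspection; a brief sketch, continuing the example above:

```python
from snorkel.labeling import LFAnalysis

# Per-LF coverage, overlap, and conflict statistics; low coverage signals
# where the SME should add more labeling functions.
print(LFAnalysis(L=L_train, lfs=lfs).lf_summary())
```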



Figure: Snorkel labeling function and process.



LFs can use different types of heuristics. For example, patterns in the content can be identified, such as keywords or phrases. Or attributes of the content such as the length or source of the content could be used. The SME determines the best LFs based on knowledge of the domain, data, and by iteratively improving the LFs to increase coverage and reduce noise.
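Continuing the earlier sketch (reusing its imports and label constants), an attribute-based LF might key off content length; the 20-character threshold is purely illustrative:

```python
@labeling_function()
def lf_very_short(x):
    # Attribute heuristic: very short messages rarely contain a substantive
    # complaint (the threshold is an assumption, to be tuned by the SME).
    return NOT_COMPLAINT if len(x.text) < 20 else ABSTAIN
```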




Because producing manual labels is costly and time consuming, a variety of programmatic and machine learning techniques are used. Data scientists use a combination of techniques such as Snorkeling, active learning, and manual labeling, depending on the stage of ML development, the types of data, and the requirements of the training environment.

Why is Snorkeling valuable?

Snorkeling has two primary sources of value: labor savings and faster time to market.

Labor savings

Applying Snorkeling can substantially reduce the amount of labor required. In one project, two annotators produce 3,000 to 5,000 labels per week; in another, five annotators label approximately 2,000 customer interactions per week. A data engineer can create a set of Snorkel labeling functions in about one month for each project. The labeling functions can be run each week, and the results can either be used directly to retrain the model or be reviewed by the annotators in less than half the time it would take to annotate the unlabeled data from scratch.

Combining this approach with active learning allows the data scientist to create a high-performing model at significantly reduced cost compared to traditional data labeling approaches.




Faster time to market

Using Snorkeling, we built a model on unseen data using heuristics and a small set of labels, and augmented the data using fine-tuned transformation functions. The team then built and deployed a model within 10 days, far faster than the traditional development cycle of 30 days or more. Separately, our analysis showed that training a model on a dataset labeled using Snorkeling improved model accuracy significantly.

Applying Snorkel

Industry solutions

In a study at Google [Bach et al 2019], data scientists used an extension of Snorkel to process 684,000 unlabeled data points. Each data sample was selected from a larger dataset by a coarse-grained initial keyword-filtering step. A developer wrote ten labeling functions. These LFs included:

  • The presence of URLs in the content, and specific features of the URLs

  • The presence of specific entity types in the content, like ‘person’, ‘organization’, or ‘date’, using a Natural Language Processing tool for Named-Entity Recognition

  • The matching of the topic of the content with specific topic categories, using a topic model classifier

The model trained on the Snorkel-labeled dataset matched the performance of a model trained on 80K hand-labeled examples, and was within 5% of the performance metric (F1 score) of a model trained on 175K hand-labeled data points.

JPMC solution

Due to the COVID-19 pandemic and lockdowns, customer feedback and complaint patterns changed profoundly and at unprecedented speed. To understand customer issues, the team used Snorkeling, with a small set of labels, heuristic functions, and augmentation techniques, to create a dataset and a COVID-specific model.

A data scientist on the project team wrote 20 LFs for the Voice of the Customer (VoC) project to label data for training the VoC model on COVID-19 and lockdown themed customer feedback. Below is an example using mock-up/synthesized data.


Figure: Snorkel example (mock-up data).
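As a hedged illustration in the style of the earlier sketches (the keywords are synthetic, not the team’s actual LFs, and the label constants are reused from the sketch above):

```python
@labeling_function()
def lf_lockdown_branch_access(x):
    # Mock-up heuristic: lockdown-era branch-access complaints
    # (synthetic keywords for illustration only).
    keywords = ("branch closed", "lockdown", "cannot visit", "covid")
    return COMPLAINT if any(k in x.text.lower() for k in keywords) else ABSTAIN
```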


Using the model, the team identified complaint themes and the business took immediate action to solve customer problems.

Learn More and Get Started