Snorkeling: Label Data With Less Labor

Check out how Snorkeling can complement active learning and help partially automate the process of data label creation.

A key barrier to businesses' adoption of machine learning is not a lack of data but a lack of labeled data. In "Learn more with less data", we shared how active learning enables collaboration between the annotator and the data scientist to intelligently select data points to label. Using this approach, we can identify the important data points that the annotator should label to rapidly improve model performance.

Snorkeling complements active learning by partially automating the creation of data labels. It focuses on identifying easy data points that can be labeled programmatically instead of by an annotator.


Background: What is Snorkeling?

Snorkel is a library developed at Stanford for programmatically building and managing training datasets.

In Snorkel, a subject matter expert (SME) encodes a business rule for labeling data as a labeling function (LF). Each LF is then applied to the unlabeled data to produce candidate labels automatically. Typically, multiple LFs are used, and they may assign differing labels to the same data point, so a policy is defined for selecting the final label. These policies include majority vote and a model-based weighted combination.
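To make the idea concrete, here is a minimal plain-Python sketch of labeling functions combined by majority vote. It does not use Snorkel's own API. The spam/ham task, the three rules, the function names, and the choice to abstain on ties are all assumptions made for illustration.

```python
from collections import Counter

# Label values are illustrative assumptions; ABSTAIN means "this rule has no opinion".
ABSTAIN = -1
HAM = 0
SPAM = 1

# Each labeling function encodes one hypothetical business rule.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def apply_lfs(texts, lfs=LABELING_FUNCTIONS):
    """Return a label matrix: one row per text, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]

def majority_vote(votes):
    """Pick the most common non-abstain vote; abstain if there are no votes or a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

texts = [
    "Claim your prize at http://example.com today",
    "See you at lunch",
    "The quarterly report is attached for your review and comments",
]
label_matrix = apply_lfs(texts)
labels = [majority_vote(row) for row in label_matrix]
print(labels)
```

The third message matches none of the rules, so every LF abstains and it stays unlabeled. A model-based weighted combination would replace `majority_vote` with a step that weights each LF by its estimated reliability.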

The labeling functions can be evaluated for their coverage of the unlabeled training data. The SME can determine whether gaps exist and add LFs for those cases. The labeled training data can then be used to train a classifier model. The purpose of this model is to evaluate the quality of the labeled dataset produced by Snorkeling against a reference (gold) labeled dataset. The model is evaluated on manually labeled test data, and the results serve as feedback to the SME for further tuning of the LFs.
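The coverage check and the comparison against gold labels can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Snorkel's API: the label matrix, the combined labels, and the gold labels are made-up values, and for brevity the sketch scores the programmatic labels directly against gold labels instead of training a classifier first.

```python
ABSTAIN = -1  # assumed marker for "labeling function did not vote"

def lf_coverage(label_matrix):
    """Fraction of data points that each labeling function labels."""
    n_points = len(label_matrix)
    n_lfs = len(label_matrix[0])
    return [
        sum(row[j] != ABSTAIN for row in label_matrix) / n_points
        for j in range(n_lfs)
    ]

def overall_coverage(label_matrix):
    """Fraction of data points labeled by at least one labeling function."""
    covered = sum(any(v != ABSTAIN for v in row) for row in label_matrix)
    return covered / len(label_matrix)

def uncovered_indices(label_matrix):
    """Indices of data points that no labeling function covers (the gaps)."""
    return [
        i for i, row in enumerate(label_matrix)
        if all(v == ABSTAIN for v in row)
    ]

def accuracy_on_labeled(predicted, gold):
    """Accuracy against gold labels, counting only points that received a label."""
    pairs = [(p, g) for p, g in zip(predicted, gold) if p != ABSTAIN]
    if not pairs:
        return None
    return sum(p == g for p, g in pairs) / len(pairs)

# Hypothetical label matrix: rows are data points, columns are labeling functions.
label_matrix = [
    [1, 1, -1],
    [-1, -1, 0],
    [-1, -1, -1],
    [1, -1, -1],
]
predicted = [1, 0, -1, 1]   # hypothetical combined labels, one per data point
gold = [1, 0, 0, 0]         # hypothetical manually assigned reference labels

per_lf = lf_coverage(label_matrix)
total = overall_coverage(label_matrix)
gaps = uncovered_indices(label_matrix)
accuracy = accuracy_on_labeled(predicted, gold)
print(per_lf, total, gaps, accuracy)
```

The indices in `gaps` point the SME at examples that need a new LF, while a low accuracy on the labeled points suggests that an existing LF needs tuning.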

The overall process is shown in the following figure.