Scaling Clinical Trial Screening at MSKCC with Snorkel Flow

Snorkel AI

Technology Category	Sensors - Flow Meters; Sensors - Liquid Detection Sensors
Applicable Industries	Cement; Education
Applicable Functions	Product Research & Development; Quality Assurance
Use Cases	Tamper Detection; Virtual Training
Services	Testing & Certification; Training

Challenge

Memorial Sloan Kettering Cancer Center (MSKCC), the world’s oldest and largest private cancer center, needed to identify candidates for clinical trials by classifying patient records according to the presence of a relevant protein, HER-2. Reviewing records for HER-2 was laborious and time-consuming because clinicians and researchers had to sift through complex, variable patient data. The data science team at MSKCC wanted to use AI/ML to classify the records, but the lack of labeled training data was a significant bottleneck. Labeling complex patient records required clinician and researcher expertise and was prohibitively slow and expensive. Even when experts did annotate training data manually, their labels were at times inconsistent, which limited the performance a model could achieve.

About Customer

Memorial Sloan Kettering Cancer Center (MSKCC) is the world’s oldest and largest private cancer center. It provides care that improves the quality of life of more than 150,000 cancer patients annually. To support this work, it uses AI to speed the discovery of more effective strategies to prevent, control, and ultimately cure cancer. The data science team at MSKCC was tasked with using AI/ML to classify patient records based on the presence of HER-2, a protein common to many cancers.

Solution

MSKCC used Snorkel Flow to build an AI application that classifies patient records into five classes describing the presence of HER-2. The application feeds a downstream clinical trial screening system that identifies potential trial participants. The team started from 3,200 data points that had been labeled previously outside the platform, ingested them, and split them into training, validation, and test sets. The lead Bioinformatics Engineer on the project wrote just eight noisy, imperfect labeling functions, which Snorkel Flow combined to auto-label a training dataset. The team used this dataset to train an XGBoost model within the platform. Using the platform’s error analysis tools, the team examined where the model was confused and how to correct it. After a few rapid iterations, the team achieved an overall accuracy of 93% and an average F1 of 87% across all classes.
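To make the idea of combining noisy labeling functions concrete, here is a minimal sketch in plain Python. It is not Snorkel Flow code and not the team's actual rules: the class names, the keyword heuristics, and the function names are all hypothetical, and a simple majority vote stands in for whatever aggregation the platform applies.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # a labeling function may decline to vote

# Hypothetical class names; the case study only says there were five classes.
POSITIVE, NEGATIVE = "positive", "negative"


def _normalize(text: str) -> str:
    """Lowercase and unify the 'HER-2' / 'HER2' spellings."""
    return text.lower().replace("her-2", "her2")


def lf_mentions_positive(text: str) -> Optional[str]:
    # Illustrative keyword rule, not a clinical criterion.
    return POSITIVE if "her2 positive" in _normalize(text) else ABSTAIN


def lf_mentions_negative(text: str) -> Optional[str]:
    return NEGATIVE if "her2 negative" in _normalize(text) else ABSTAIN


def lf_amplification(text: str) -> Optional[str]:
    # Deliberately noisy: keys on one word and can disagree with other rules.
    t = _normalize(text)
    if "not amplified" in t:
        return NEGATIVE
    if "amplified" in t:
        return POSITIVE
    return ABSTAIN


LFS: List[Callable[[str], Optional[str]]] = [
    lf_mentions_positive,
    lf_mentions_negative,
    lf_amplification,
]


def majority_vote(text: str, lfs=LFS) -> Optional[str]:
    """Return the most common non-abstaining vote, or ABSTAIN on no votes or a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence; leave the record unlabeled
    return ranked[0][0]


if __name__ == "__main__":
    for note in [
        "HER-2 positive, gene amplified",
        "HER2 negative; FISH not amplified",
        "HER2 positive per outside report, not amplified",
        "No relevant findings recorded",
    ]:
        print(repr(note), "->", majority_vote(note))
```

Records on which the functions abstain or tie stay unlabeled in this sketch, so they would simply be left out of the auto-labeled training set.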

Impact #1

The document classification application the team built now powers a downstream clinical trial screening system, which lets MSKCC identify HER-2 in patient records without relying on human experts to review each record. Programmatic labeling with Snorkel Flow significantly reduced the time needed to label complex, domain-specific text documents as training data. It also increased explainability, because the labeling rationale for each training data point is encoded in labeling functions that can be inspected like code. Model-guided error analysis helped the team identify data quality issues and iterate rapidly.
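As a rough illustration of what error analysis can surface, the sketch below counts which (true class, predicted class) pairs a model confuses most often on a held-out set. It is a generic stand-in, not the platform's tooling; the function name, class names, and toy labels are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_confusions(
    y_true: Sequence[str], y_pred: Sequence[str], k: int = 3
) -> List[Tuple[Tuple[str, str], int]]:
    """Return the k most frequent (true, predicted) pairs among misclassified items."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)


# Toy validation labels, invented for illustration only.
y_true = ["positive", "positive", "negative", "equivocal", "equivocal", "negative"]
y_pred = ["positive", "equivocal", "negative", "positive", "positive", "negative"]

if __name__ == "__main__":
    for (true_label, pred_label), count in top_confusions(y_true, y_pred):
        print(f"true={true_label!r} predicted={pred_label!r}: {count}")
```

A pair that dominates this list points to the classes whose labeling functions or training labels are worth revisiting first.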

Benefit #1	Achieved an overall accuracy of 93% and an average F1 of 87% across all classes
Benefit #2	Auto-labeled thousands of patient records
Benefit #3	Reduced time to build a document classification application from months to weeks
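Overall accuracy and an F1 score averaged across classes can differ noticeably when some classes are rare. The sketch below computes both on invented toy labels. It assumes the reported "average F1" is an unweighted (macro) average over classes, which the case study does not state explicitly; the function names are illustrative.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of items whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: class "a" is common and easy, class "b" is rare and half wrong.
y_true_demo = ["a"] * 8 + ["b"] * 2
y_pred_demo = ["a"] * 8 + ["a", "b"]

if __name__ == "__main__":
    print(f"accuracy = {accuracy(y_true_demo, y_pred_demo):.3f}")
    print(f"macro F1 = {macro_f1(y_true_demo, y_pred_demo):.3f}")
```

In this toy case the macro F1 comes out lower than the accuracy because the single error falls on the rare class, which is one plausible reason the two reported figures differ.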
