Scaling Clinical Trial Screening at MSKCC with Snorkel Flow

Snorkel AI
Sensors - Flow Meters
Sensors - Liquid Detection Sensors
Cement
Education
Product Research & Development
Quality Assurance
Tamper Detection
Virtual Training
Testing & Certification
Training
Memorial Sloan Kettering Cancer Center (MSKCC), the world’s oldest and largest private cancer center, needed to identify candidates for clinical trials by classifying patient records according to the presence of a relevant protein, HER-2. Reviewing records for HER-2 was laborious and time-consuming because clinicians and researchers had to sift through complex, variable patient data. The data science team at MSKCC wanted to use AI/ML to classify the records, but the lack of labeled training data was a significant bottleneck. Labeling complex patient records required clinician and researcher expertise and was prohibitively slow and expensive. Even when experts did annotate training data manually, their labels were at times inconsistent, which limited achievable model performance.
Memorial Sloan Kettering Cancer Center (MSKCC) is the world’s oldest and largest private cancer center. It provides care to increase the quality of life of more than 150,000 cancer patients annually. To that end, it uses AI to speed the discovery of more effective strategies to prevent, control, and ultimately cure cancer. The data science team at MSKCC was tasked with using AI/ML to classify patient records based on the presence of HER-2, a protein common to many cancers.
MSKCC used Snorkel Flow to build an AI application that classifies patient records into five classes describing the presence of HER-2. The application feeds a downstream clinical trial screening system that identifies potential trial participants. The team started from 3,200 data points they had previously labeled outside the platform, ingested the data, and split it into training, validation, and test sets. The lead Bioinformatics Engineer on the project wrote just eight noisy, imperfect labeling functions, which Snorkel Flow combined to auto-label a training dataset. The team used that dataset to train an XGBoost model within the platform. Using the platform's error analysis tools, the team drew on feedback from this model to learn where it was confused and how to correct it. After a few rapid iterations, the team achieved an overall accuracy of 93% and an average F1 of 87% across all classes.
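To make the labeling-function idea concrete, here is a minimal sketch in plain Python. It is not MSKCC's code and does not use Snorkel Flow's API. The class names, regular expressions, and sample records are hypothetical, and a simple majority vote stands in for whatever label-combination method Snorkel Flow applies, which the case study does not describe.

```python
import re
from collections import Counter

ABSTAIN = -1
# Hypothetical class ids; the case study does not name its five classes.
POSITIVE, NEGATIVE, EQUIVOCAL = 0, 1, 2


def lf_positive_phrase(text):
    """Vote POSITIVE when the record states HER-2 is positive (illustrative rule)."""
    return POSITIVE if re.search(r"her-?2\s*(is\s+)?positive", text, re.I) else ABSTAIN


def lf_negative_phrase(text):
    """Vote NEGATIVE when the record states HER-2 is negative (illustrative rule)."""
    return NEGATIVE if re.search(r"her-?2\s*(is\s+)?negative", text, re.I) else ABSTAIN


def lf_ihc_score(text):
    """Vote from an IHC score such as '3+' (illustrative heuristic, not clinical guidance)."""
    match = re.search(r"ihc\s*(\d)\+", text, re.I)
    if not match:
        return ABSTAIN
    score = int(match.group(1))
    if score >= 3:
        return POSITIVE
    if score == 2:
        return EQUIVOCAL
    return NEGATIVE


LABELING_FUNCTIONS = [lf_positive_phrase, lf_negative_phrase, lf_ihc_score]


def majority_vote(votes):
    """Combine noisy votes; abstain when nothing fires or the top two classes tie."""
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    top = counted.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return ABSTAIN
    return top[0][0]


def auto_label(records):
    """Apply every labeling function to every record and combine the votes."""
    return [majority_vote([lf(text) for lf in LABELING_FUNCTIONS]) for text in records]


# Made-up example records, for illustration only.
records = [
    "HER2 positive, IHC 3+ on biopsy.",
    "Pathology: HER-2 negative.",
    "IHC 2+, FISH pending.",
    "No receptor testing documented.",
]
labels = auto_label(records)
print(labels)
```

Records that end up with an abstain label would simply be left out of the training set. Because each label traces back to the specific functions that voted for it, the labeling rationale can be inspected like any other code.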
The document classification application built by the team now powers a downstream clinical trial screening system. This system allows MSKCC to detect HER-2 in patient records without relying on human experts to review each record. Labeling programmatically with Snorkel Flow has significantly reduced the time needed to label complex, domain-specific text documents as training data. It has also increased explainability, because the labeling rationale for each training data point is encoded in labeling functions that can be inspected like code. The team was able to use model-guided error analysis to identify data quality issues and iterate rapidly to improve.
Achieved an overall accuracy of 93% and an average F1 of 87% across all classes
Auto-labeled thousands of patient records
Reduced time to build a document classification application from months to weeks
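The reported accuracy and average F1 can differ because they weight classes differently. The sketch below assumes "average F1" means the unweighted (macro) average of per-class F1 scores, which the case study does not specify. The toy labels are invented and have no connection to MSKCC's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 for one class: harmonic mean of its precision and recall."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Invented, imbalanced toy data: class 0 is common, class 1 is rare.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={mf1:.3f}")
```

In this toy case a single mistake on the rare class pulls the macro F1 well below the accuracy, which is one plausible reason an average F1 can trail overall accuracy on a multi-class task with uneven class sizes.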