PyCon Israel 2021

Automatic Curation of Test Sets
05-03, 15:30–15:55 (Asia/Jerusalem), PyData Track 1

Test sets are often designed to have a specific composition of cases, with constraints applied to each sub-population. Treating test-set curation as an optimization problem could save precious time and transition us towards a "data as code" paradigm.

Test-set preparation is an essential part of any data science project. Often the test set is not just a random choice of samples, but a carefully designed population, with specific limits on the number of cases from each important sub-group. As the constraints get more complicated, it can take a while to get them all just right. In this talk I'll show how to treat test-set curation as a constraint-optimization problem that can be solved automatically using linear programming. I will demonstrate an open-source Python library, curation-magic, which elegantly does this for you, and argue that treating test sets as the outcome of such an optimization is a desirable transition towards a "data as code" paradigm.
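To make the idea concrete, here is a minimal sketch of curation as an optimization problem. It does not use curation-magic or reproduce its API; it assumes SciPy 1.9 or later and calls `scipy.optimize.milp`, an integer variant of linear programming, so that every case is either fully in or fully out of the test set. The candidate pool, the `female` and `positive` flags, the limits, and the `curate` helper are all hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp


def curate(membership, lower, upper):
    """Pick cases so each constraint row's count lies in [lower, upper].

    membership[k][i] is 1 if case i counts towards constraint k.
    Returns the indices of the selected cases.
    """
    membership = np.asarray(membership, dtype=float)
    n_cases = membership.shape[1]
    result = milp(
        c=np.zeros(n_cases),             # pure feasibility: nothing to minimize
        integrality=np.ones(n_cases),    # every decision variable is an integer
        bounds=Bounds(0, 1),             # ... restricted to 0 (out) or 1 (in)
        constraints=LinearConstraint(membership, lower, upper),
    )
    if not result.success:
        raise ValueError("no test set satisfies the constraints")
    return np.flatnonzero(np.round(result.x) == 1)


# Hypothetical pool of 10 candidate cases with two sub-group flags.
female = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
positive = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
everyone = [1] * len(female)

# Exactly 6 cases in total, exactly 3 of them female, 2 to 3 positive.
selected = curate(
    membership=[everyone, female, positive],
    lower=[6, 3, 2],
    upper=[6, 3, 3],
)
print("selected case indices:", selected.tolist())
```

Because the constraints are plain data, changing the desired composition means editing the `lower` and `upper` lists and re-running, rather than re-picking cases by hand. A non-zero objective vector `c` could additionally express a preference among feasible test sets.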

Target audience

Data Scientists

Dr. Jonathan Laserson is a machine learning expert and consultant, and the lead AI strategist of Zebra Medical Vision. He did his PhD in the AI lab of Stanford University and his undergraduate studies at the Technion. He built ML systems for Google and IBM Research, and at Zebra Medical led the development of clinical AI products from the idea stage to FDA approval and production.