2021-05-03, 15:30–15:55, PyData Track 1
Test sets are often designed to have a specific composition of cases, with constraints applied to each sub-population. Treating test-set curation as an optimization problem could save precious time and transition us towards a "data as code" paradigm.
Test set preparation is an essential part of any data science project. It is often the case that the test set is not just a random choice of samples, but rather a carefully designed population, with specific limits on the number of cases from each important sub-group. As the constraints get complicated, it often takes a while to get them all just-right. In this talk I'll show how to treat the test-set curation as a constraint-optimization problem that can be automatically solved using linear programming. I will demonstrate an open-source python library, curation-magic, which elegantly does this for you, and argue that treating test-sets as an outcome of such optimization is a desired transition towards a "data as code" paradigm.