PyCon Israel 2021

Cutting the Right Corners: Handling High Cardinality by Understanding Your Data
2021-05-02, 16:00–16:25, General Track 2

Handling high cardinality with big data can be challenging. We improved our pipeline speed and stability by understanding which data matters more and creating a smart “Cardinality Protector” to reduce cardinality with minimal effect on the data.


As a marketing analytics platform, Singular ingests billions of user events daily, along with all the marketing data pertaining to each event: Was an ad clicked? When and where? Which network served that ad? How much did the ad cost? And much more. The data is then aggregated so our customers can use it to make informed decisions in their daily marketing operations.
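As a minimal sketch of that kind of aggregation (the field names below are illustrative, not Singular's actual schema):

    import pandas as pd

    # Hypothetical raw events; the real schema is far wider.
    events = pd.DataFrame([
        {"app": "game-x", "network": "net-a", "country": "IL", "creative": "c-17", "cost": 0.12},
        {"app": "game-x", "network": "net-a", "country": "IL", "creative": "c-17", "cost": 0.15},
        {"app": "game-x", "network": "net-b", "country": "US", "creative": "c-42", "cost": 0.30},
    ])

    # Aggregate raw events into the per-dimension rows customers query.
    report = events.groupby(["app", "network", "country", "creative"], as_index=False).agg(
        clicks=("cost", "size"),
        cost=("cost", "sum"),
    )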

As our operations scaled, we experienced cases where the sheer number of events, combined with the large number of columns saved per event (some of which have high cardinality), slowed down our data ingestion pipeline. The load ate up CPU, memory, and network resources to the point of affecting the user experience. The burden on the system was exacerbated by click spam, a type of fraud in which automated tools simulate millions of ad clicks, adding to the already high load on our pipeline.

Our challenge was to reduce the amount of data we ingest, improve our pipeline's speed and stability, and provide a better overall user experience. But we couldn't simply drop excess rows, because every row is essential, including those representing possibly fraudulent clicks: our customers want to measure click spam activity and find out where it originates. Could we retain the necessary information while still reducing the cardinality of some columns?
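One way to do that, sketched below with a hypothetical "__other__" placeholder (the talk presents our actual rules), is to keep every row but fold the long tail of a column's values into a single bucket:

    from collections import Counter

    def protect_column(values, max_distinct, placeholder="__other__"):
        """Keep the most frequent values; fold the long tail into one bucket.

        Every row survives, but the column's distinct-value count is capped.
        """
        counts = Counter(values)
        keep = {v for v, _ in counts.most_common(max_distinct)}
        return [v if v in keep else placeholder for v in values]

    sites = ["site-1"] * 3 + ["site-2"] * 2 + [f"spam-{i}" for i in range(10_000)]
    protected = protect_column(sites, max_distinct=2)
    # "site-1" and "site-2" survive; the 10,000 spammy values all become "__other__"

Every click is still counted, so fraud volume remains measurable; only the distinct values in the noisiest columns are reduced.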

This was the starting point for what became the "Cardinality Protector." In-depth research into our data helped us rank all the columns and metrics by their importance to customers. We then created smart rules to cut out some of the most extreme cardinality with minimal effect on the data.
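The actual prioritization came out of that research; purely as an illustration, such rules could take the form of per-column caps, with hypothetical column names and limits:

    import pandas as pd

    # Hypothetical caps: columns customers rely on most keep more of their
    # distinct values; low-priority columns are bucketed more aggressively.
    CARDINALITY_CAPS = {
        "network": 10_000,
        "creative": 5_000,
        "site": 500,
    }

    def protect_frame(df: pd.DataFrame, caps: dict, placeholder: str = "__other__") -> pd.DataFrame:
        """Cap the number of distinct values in each configured column."""
        for column, max_distinct in caps.items():
            top = df[column].value_counts().nlargest(max_distinct).index
            df[column] = df[column].where(df[column].isin(top), placeholder)
        return df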

In this session, we will show how we applied our cardinality protection logic to improve system performance significantly while minimizing the effect on the data. We'll talk about the challenges we ran into, in both prioritization logic and system resources, and unveil some of the cool tricks we used, with Pandas and on-disk sorting/group-by, to apply cardinality protection to large batches of data.
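To give a flavor of the batch-processing side (the session covers the real on-disk sorting/group-by tricks), here is a simplified two-pass sketch in Pandas for files too large to load at once; the function name and parameters are illustrative:

    import pandas as pd
    from collections import Counter

    def protect_large_file(path, column, max_distinct, out_path,
                           chunksize=1_000_000, placeholder="__other__"):
        """Two-pass cardinality protection for a CSV that doesn't fit in memory."""
        # Pass 1: count the column's values chunk by chunk.
        counts = Counter()
        for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
            counts.update(chunk[column].value_counts().to_dict())
        keep = {value for value, _ in counts.most_common(max_distinct)}

        # Pass 2: rewrite the file, folding the long tail into one bucket.
        first = True
        for chunk in pd.read_csv(path, chunksize=chunksize):
            chunk[column] = chunk[column].where(chunk[column].isin(keep), placeholder)
            chunk.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
            first = False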


Session language – English
Target audience – Developers, DevOps, Data Scientists, R&D