2022-06-28, 10:30–10:50, PyData
Running millions of tasks is difficult when dealing with high workload variance. We improved our pipeline efficiency by using ML models to classify task requirements and dynamically allocate the necessary system resources.
To process our customers' data, Singular's data pipeline fetches and enriches data from dozens of different data sources multiple times a day.
The pipeline consists of hundreds of thousands of daily tasks, each with a different processing time and resource requirements, depending on the customer's size and business needs. We deal with this scale by using Celery and Kubernetes as our tasks infrastructure. This lets us allocate dedicated workers and queues to each type of task based on its requirements.
Originally, our task requirements, required workers, and resources were all configured manually. As our customer base grew, we noticed that heavier and longer tasks were grabbing all the resources and causing unacceptable queues in our pipeline. Moreover, some of the heavier tasks required significantly more memory, leading to Out-Of-Memory kills and infrastructure issues.
If we could classify tasks by their heaviness and how long they were going to take, we could have segregated tasks in Celery based on their expected duration and memory requirements and thus minimized interruptions to the rest of the pipeline. However, the variance in the size and granularity of the fetched data made it impossible to classify if a task was about to take one minute or one hour.
Our challenge was: how do we categorize these tasks, accurately and automatically? To solve the issue we implemented a machine-learning model that could learn to predict the expected duration and memory usage of a given task. Using Celery’s advanced task routing capabilities, we could then dynamically configure different task queues based on the model's prediction.
This raised another challenge - how could we use the classified queues in the best way? We could have chosen to once again configure workers statically to consume from each queue. However, we felt this approach would be inadequate at scale. We decided to make use of Kubernetes’ vertical and horizontal autoscaling capabilities to dynamically allocate workers for each classified queue based on its length. This improved our ability to respond to pipeline load automatically, increasing performance and availability. Additionally, we were able to deploy shorter-lived workers on AWS Spot instances, giving us higher performance while lowering cloud costs.