»Boosting simulation performance with Python«
2019-06-04, 11:00–11:25, Hall 3

In this talk I will present the architecture of our simulation infrastructure, written in Python, which lets us simulate hours of real-life operation in only minutes of run time. I will describe the challenges and issues we encountered and how we handled them.

Outline

  • Introduction: Importance of simulations and benefits of deploying this approach
  • Brief overview of the SimPy library and its API
  • CSR's simulation architecture
  • Challenges and issues we encountered and how we handled them
  • Distributed simulation - adjusting the simulation architecture in the transition to microservices

Description

Our company's product uses a fleet of real (not virtual) robots to perform different tasks in a fulfillment warehouse. Simulations are very important to us: they allow us to test our solution and new features and to run regression tests without the need for real and expensive hardware, to measure and analyze the impact of different implementations and optimizations, and to explore possible solutions before deploying them in production. Tasks performed by physical robots take time (movement over the warehouse, box lifting, etc.), but in simulation, where virtual robots are used, there is no need to wait all that time. Shortening simulation time improves the development process by providing faster feedback to developers and quicker CI and testing cycles. Another benefit is a more deterministic simulation: with this approach, each component (thread) in the system gets an equal opportunity (CPU time) in each time tick, regardless of the underlying machine or operating system that the simulation is running on. It is also easy to simulate any hour of the day, so something like the "Y2K bug" need not cause panic.
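
To illustrate why simulated tasks need no waiting, here is a minimal sketch of a virtual clock that jumps directly to the next scheduled event instead of sleeping. It uses only the standard library; the `VirtualClock` class, its methods, and the task durations are hypothetical and are not taken from our code base or from SimPy.

```python
import heapq


class VirtualClock:
    """Hypothetical simulated clock: time jumps straight to the next event."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = 0  # tie-breaker so events at the same time keep insertion order

    def call_later(self, delay, callback):
        # Schedule `callback` to run `delay` simulated seconds from now.
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        # Process events in time order; no real waiting happens between them.
        while self._queue:
            self.now, _, callback = heapq.heappop(self._queue)
            callback()


clock = VirtualClock()
log = []


def lift_box():
    log.append(("lifted box", clock.now))


def arrive_at_shelf():
    log.append(("arrived at shelf", clock.now))
    clock.call_later(30.0, lift_box)  # assumed: lifting takes 30 simulated seconds


clock.call_later(600.0, arrive_at_shelf)  # assumed: driving takes 10 simulated minutes
clock.run()
```

After `clock.run()` returns, `log` records both steps with their simulated timestamps, although the script itself finishes almost immediately. Because the event order depends only on the scheduled times and the insertion order, repeated runs produce the same sequence.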

In this talk, I will first give a brief overview of the SimPy library (a discrete-event simulation framework; see https://simpy.readthedocs.io/) and its API. I will then go over our simulation architecture: we wrap time-related functionality in our own 'scheduler' module, divide time into discrete ticks, and subscribe the robots and other components of the system (represented by threads) to a central SimPy component. That component gives each of them the opportunity to perform its task in each time tick, which allows time to pass faster than in reality.

Next, I will dive deeper and describe the challenges and issues we encountered and the way we chose to solve them: time leaks in components of the system that are not tied to time (such as event-driven components), time leaks when new threads are created, the difficulty of debugging, and differences between the dev and production environments.

In the last part I will show how we extended SimPy to adapt this architecture during the transition to microservices; we plan to contribute this extension back to the open-source community. I will finish with an interesting comparison between single-process simulation, where only one CPU is utilized due to the GIL limitation, and distributed (multi-process) simulation, where there is a communication overhead.