2021-05-02, 10:00–10:25, General Track 1
Beyond basic algorithmic considerations, you would be surprised how easy it is to get a more than 100× speedup with less than 30 minutes of work, without even improving the time complexity.
When operating on big arrays we often fall into old habits of code writing, whether using pandas, numpy or vanilla Python. While these habits may optimize the speed at which we write code, they often fall short of the fastest code at run time. Even saving milliseconds of run time per task can accumulate to staggering amounts. Sometimes, despite very similar syntax between functions and packages, there is a huge difference in performance, because the internal workings of pandas, numpy and Python vary: each balances the overhead (or “init”) cost and the marginal cost differently.
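The overhead-versus-marginal-cost trade-off can be seen in a micro-benchmark of my own devising (the talk does not prescribe this exact comparison): summing a tiny list versus a large array with the builtin `sum` and with `np.sum`.

```python
import timeit
import numpy as np

small = list(range(10))
small_arr = np.array(small)

# On tiny inputs np.sum's fixed overhead (dispatch, dtype checks)
# dominates, so the builtin sum is often faster.
t_py_small = timeit.timeit(lambda: sum(small), number=10_000)
t_np_small = timeit.timeit(lambda: np.sum(small_arr), number=10_000)
print(f"10 items -> builtin sum: {t_py_small:.4f}s, np.sum: {t_np_small:.4f}s")

# On large arrays the marginal cost wins: numpy's C loop beats
# Python-level iteration by a wide margin.
big = list(range(1_000_000))
big_arr = np.array(big)
t_py_big = timeit.timeit(lambda: sum(big), number=10)
t_np_big = timeit.timeit(lambda: np.sum(big_arr), number=10)
print(f"1M items -> builtin sum: {t_py_big:.4f}s, np.sum: {t_np_big:.4f}s")
```

The crossover point depends on the machine and the operation, which is exactly why measuring beats guessing.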
We will explore common and run-time costly pitfalls when using pandas and numpy and we will see when it is more efficient to use vanilla python compared to these packages.
I will introduce a profiling method and a timing method. Used together, they help us detect the weakest points in our code and quickly test different options for improving it. A main point of the talk is how to generate many code variations and test them quickly to arrive at the best solution.
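The abstract does not name the specific tools; a minimal sketch of the profile-then-time workflow, assuming the standard library's `cProfile` and `timeit`, might look like this (the two `*_concat` functions are hypothetical stand-ins for a slow routine and a candidate rewrite):

```python
import cProfile
import pstats
import timeit

def slow_concat(n):
    # Repeated string concatenation; can be quadratic in the worst case
    # (CPython sometimes optimizes += in place, but it is not guaranteed).
    s = ""
    for i in range(n):
        s += str(i)
    return s

def fast_concat(n):
    # Collect pieces and join once.
    return "".join(str(i) for i in range(n))

# Step 1 -- profile to locate the hot spot.
profiler = cProfile.Profile()
profiler.enable()
slow_concat(20_000)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)

# Step 2 -- time candidate rewrites head-to-head.
for fn in (slow_concat, fast_concat):
    elapsed = timeit.timeit(lambda: fn(20_000), number=5)
    print(f"{fn.__name__}: {elapsed:.4f}s")
```

The profiler tells you where the time goes; the timer tells you which variation actually wins.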
I will present many often-neglected functions from these packages, as well as native Python alternatives, and experiment to see when each is more efficient: e.g. using a pandas index vs a dict with itemgetter; the numpy, pandas and pure-Python isin methods; apply vs map; concatenating and appending data to arrays; and many more.
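Two of these comparisons can be sketched concretely (the data and sizes here are my own illustrative choices, not from the talk): label lookup via a pandas index versus a plain dict with `itemgetter`, and `isin` done three ways.

```python
from operator import itemgetter
import numpy as np
import pandas as pd

# Hypothetical data: string keys mapped to integer values.
keys = [f"id_{i}" for i in range(100_000)]
values = np.arange(100_000)
probe = keys[::100]  # 1,000 keys to look up

# Option A: pandas label-based lookup on an indexed Series.
s = pd.Series(values, index=keys)
pd_result = s.loc[probe].to_numpy()

# Option B: vanilla dict + itemgetter fetches many keys in one call.
d = dict(zip(keys, values))
py_result = np.array(itemgetter(*probe)(d))

assert (pd_result == py_result).all()

# isin three ways: numpy, pandas, and a pure-Python set lookup.
targets = set(keys[:1_000])
keys_arr = np.array(keys)
np_mask = np.isin(keys_arr, list(targets))
pd_mask = pd.Series(keys).isin(targets).to_numpy()
py_mask = np.array([k in targets for k in keys])
assert (np_mask == pd_mask).all() and (pd_mask == py_mask).all()
```

All variants return the same answer; which is fastest depends on data size and dtype, so wrap each in a timer before committing to one.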
In addition, we will learn some useful and surprising efficiency tricks and data structures, such as sparse matrices, a numpy array as a dict replacement, clever uses of memoization, and more.
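Two of these tricks admit a short sketch (sparse matrices are omitted here to keep the example dependency-free; the toy mapping and `fib` function are illustrative assumptions, not the talk's examples):

```python
from functools import lru_cache
import numpy as np

# Trick 1: when keys are small non-negative integers, a numpy array can
# replace a dict -- lookups become plain indexing, with no hashing.
lookup_dict = {i: i * i for i in range(1_000)}
lookup_arr = np.full(1_000, -1, dtype=np.int64)  # -1 marks "missing"
for k, v in lookup_dict.items():
    lookup_arr[k] = v

assert lookup_arr[42] == lookup_dict[42]

# Trick 2: memoization with functools.lru_cache caches repeated calls,
# turning an exponential recursion into a linear one.
@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

assert fib(30) == 832040
```

The array trick also vectorizes naturally: `lookup_arr[key_array]` resolves a whole batch of keys at once, something a dict cannot do without a Python-level loop.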