Distributed Pandas – From Performance Art to Production



This talk explores how our modern set of distributed pandas tools came to be, starting with Map/Reduce, moving on to Spark and Sparkling Pandas, and finally looking at the ecosystem that exists today (Koalas, Modin, and Dask). Not only will you learn the limitations of modern distributed pandas, but you'll also see how open-source development and friendly competition "work." For those interested in getting involved, you will also learn how to contribute to improving distributed pandas. No talk like this would be complete without a conflict-of-interest disclosure: I am one of the two original co-authors of Sparkling Pandas (with some funny stories to show for it) and a Spark committer, which is balanced out by my current work co-writing books on Dask and Ray (which of course I hope you buy). By the end of this talk you will be questioning whether you really want to scale pandas given all of the duct tape involved, and you will have a good idea of how to choose which particular duct-taped-together solution involves the fewest rusty spoons in your eyeballs.


Holden Karau

Open Source Engineer at Netflix, Apache Spark Committer

Holden Karau is a transgender American-Canadian computer scientist and author based in San Francisco, CA. For $dayjob she works at Netflix, specializing in distributed data tools. She is a committer on the Apache Spark project and has written books about Spark, Ray, and Kubeflow.
