Scaling the Git Model to Billions of Objects


Modern data lake architectures rely on object storage as the single source of truth. We use it to store growing volumes of data that are increasingly complex and interconnected. While scalable, these object stores offer few safety guarantees: they lack the semantics for atomicity, rollbacks, and reproducibility that data quality and resiliency demand. Data version control systems designed for data lakes solve these problems by introducing concepts borrowed from Git: branching, committing, merging, and rolling back changes to data. In this talk you'll learn about the challenges of modern DataOps and why building the data lake on object storage is the first step to solving them. By the end of the session you'll understand how to scale a Git-like data model to petabytes of data across billions of objects, without sacrificing throughput or performance. We will also demo branching, writing data with Spark, and merging on a billion-object repository.
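To make the Git-like model concrete, here is a minimal illustrative sketch (names and structures are hypothetical, not the talk's or lakeFS's actual implementation): a branch is just a mapping from logical paths to immutable object IDs, so creating a branch copies metadata only, and a merge atomically exposes new objects on the destination branch. Production systems replace the dictionaries with scalable metadata trees to reach billions of objects.

```python
# Hypothetical sketch of Git-like versioning over object storage.
# A branch maps logical paths to immutable object IDs; the objects
# themselves are never copied or rewritten.

class Repo:
    def __init__(self):
        # branch name -> {logical path: object id}
        self.branches = {"main": {}}

    def branch(self, name, source="main"):
        # Branching copies metadata only: O(metadata), not O(data).
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, path, object_id):
        # A commit records a new immutable object under a logical path.
        self.branches[branch][path] = object_id

    def merge(self, source, dest):
        # Naive merge policy for illustration: source wins on conflicts.
        self.branches[dest].update(self.branches[source])

repo = Repo()
repo.commit("main", "events/part-0.parquet", "obj-a")
repo.branch("experiment")                        # isolate changes
repo.commit("experiment", "events/part-1.parquet", "obj-b")
repo.merge("experiment", "main")                 # atomically expose new data
```

The key property the sketch shows is that isolation and atomicity come from manipulating small metadata mappings rather than the underlying data, which is what lets the model scale independently of data volume.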

Adi Polak

VP of DevEx
As Vice President of Developer Experience at Treeverse, Adi works on lakeFS, an open-source project that brings Git-like operations to data lakes. She draws on her extensive industry research and engineering experience to educate and help teams design, architect, and build cost-effective data systems and machine learning pipelines that emphasize scalability, expertise, application lifecycle processes, team processes, and business goals. Adi is a frequent presenter and instructor worldwide and the author of O'Reilly's upcoming book, "Machine Learning With Apache Spark." Previously, she was a senior manager for Azure at Microsoft, where she focused on building advanced analytics systems and modern architectures. When Adi isn't building data pipelines or thinking up new software architectures, you can find her on the local cultural scene or at the beach.