Talk
Modern data lake architectures rely on object storage as the single source of truth. We use it to store an ever-growing volume of data that is increasingly complex and interconnected. While scalable, object stores provide few safety guarantees: they lack the semantics needed for atomicity, rollbacks, and reproducibility, capabilities essential to data quality and resiliency.
Data version control systems designed for data lakes solve these problems by introducing concepts borrowed from Git: branching, committing, merging, and rolling back changes to data. In this talk you'll learn about the challenges of modern DataOps, and how using object storage for data lakes is the first step to solving them.
By the end of the session you'll understand how to scale a Git-like data model to petabytes of data across billions of objects, without affecting throughput or performance. We will also demo branching, writing data using Spark, and merging it on a billion-object repository.