ML Data Version Control and Reproducibility at Scale


Petabytes of unstructured image and video data are the foundation on which successful machine learning (ML) models are built. A common way for researchers to get subsets of that data into their local environments for model training is simple copy-paste. This approach allows iterative experimentation, but it makes data management inefficient in three ways: limited reproducibility, inefficient data transfer, and limited compute power. First, copying and modifying data locally provides none of the version control and auditability that reproducibility requires, which makes iterating on models with different data subsets a daunting task. Second, regularly shuttling data between a central repository and local environments costs time and resources, especially when each training run uses a different subset. Third, working in a local environment makes it hard to harness parallel computing and the distributed capabilities of systems like Apache Spark. Data version control technologies can help computer vision researchers overcome these challenges. In this workshop, we will demonstrate hands-on tools that help you manage data more efficiently.


Einat Orr

CEO and Co-Founder

Einat Orr is the CEO and Co-Founder of lakeFS, a scalable data version control platform that delivers a Git-like experience to object-storage-based data lakes. She received her PhD in Mathematics from Tel Aviv University, in the field of optimization in graph theory. Einat previously led several engineering organizations, most recently as CTO at SimilarWeb.
