An Introduction to Modern Data Lake Storage Layers

In recent years we’ve seen a rise in new storage layers for data lakes. In 2017, Uber announced Hudi, an incremental processing framework for data pipelines. In 2018, Netflix introduced Iceberg, a new table format for managing extremely large cloud datasets. And in 2019, Databricks open-sourced Delta Lake, originally intended to bring ACID transactions to data lakes. 📹 If you’d like to watch a video that discusses the content of this post, I’ve also recorded an overview here....

 · 13 min
Updating Partition Values With Apache Hudi

If you’re not familiar with Apache Hudi, it’s a pretty awesome piece of software that brings transactions and record-level updates/deletes to data lakes. More specifically, if you’re doing analytics with S3, Hudi provides a way for you to consistently update records in your data lake, which historically has been pretty challenging. It can also optimize file sizes, allow for rollbacks, and make streaming CDC data impressively easy.

Updating Partition Values

I’m learning more about Hudi and was following this EMR guide to working with a Hudi dataset, but the “Upsert” operation didn’t quite work as I expected....

 · 3 min