hudi

If you’re not familiar with Apache Hudi, it’s a pretty awesome piece of software that brings transactions and record-level updates/deletes to data lakes. More specifically, if you’re doing analytics with S3, Hudi provides a way for you to consistently update records in your data lake, which historically has been pretty challenging. It can also optimize file sizes, allow for rollbacks, and make streaming CDC data impressively easy.

Updating Partition Values

I’m learning more about Hudi and was following this EMR guide to working with a Hudi dataset, but the “Upsert” operation didn’t quite work as I expected...

An Introduction to Modern Data Lake Storage Layers

Updating Partition Values With Apache Hudi