THE CLOUD

"Serverless" Analytics of Twitter Data with MSK Connect and Athena

Like many, I was recently drawn into a simple word game by the name of “Wordle”. Also like many, I wanted to dive into the analytics of all the yellow, green, and white-or-black-depending-on-your-dark-mode blocks. While you can easily query tweet volume using the Twitter API, I wanted to dig deeper. And the tweets were growing… Given the recent announcement of MSK Connect, I wanted to see if I could easily consume the Twitter Stream into S3 and query the data with Athena....

 · 6 min
A reflective lake

An Introduction to Modern Data Lake Storage Layers

In recent years we’ve seen a rise in new storage layers for data lakes. In 2017, Uber announced Hudi - an incremental processing framework for data pipelines. In 2018, Netflix introduced Iceberg - a new table format for managing extremely large cloud datasets. And in 2019, Databricks open-sourced Delta Lake - originally intended to bring ACID transactions to data lakes. 📹 If you’d like to watch a video that discusses the content of this post, I’ve also recorded an overview here....

 · 13 min

SSH to EC2 Instances with Session Manager

I’m kind of an old-school sys admin (aka, managed NT4 in the ’90s) so I’m really used to SSH’ing into hosts. More often than not, however, I’m working with AWS EC2 instances in a private subnet. If you’re not familiar with it, AWS Systems Manager Session Manager is a pretty sweet feature that allows you to connect remotely to EC2 instances with the AWS CLI, without needing to open up ports for SSH or utilize a bastion host....

 · 3 min
Skipping stones on the data lake...

Updating Partition Values With Apache Hudi

If you’re not familiar with Apache Hudi, it’s a pretty awesome piece of software that brings transactions and record-level updates/deletes to data lakes. More specifically, if you’re doing Analytics with S3, Hudi provides a way for you to consistently update records in your data lake, which historically has been pretty challenging. It can also optimize file sizes, allow for rollbacks, and make streaming CDC data impressively easy. Updating Partition Values: I’m learning more about Hudi and was following this EMR guide to working with a Hudi dataset, but the “Upsert” operation didn’t quite work as I expected....

 · 3 min
Jupyter Notebook Continuous Deployment Architecture

Continuous Deployment of Jupyter Notebooks

This is a guide on how to use AWS CodePipeline to continuously deploy Jupyter notebooks to an S3-backed static website. Overview: Since I started using EMR Studio, I’ve been making more use of Jupyter notebooks as scratch pads and often want to be able to easily share the results of my research. I hunted around for a few different solutions and while there are some good ones like nbconvert and jupytext, I wanted something a bit simpler and off-the-shelf....

 · 4 min
https://flic.kr/p/S3jt5j

Building and Testing a new Apache Airflow Plugin

Recently, I had the opportunity to add a new EMR on EKS plugin to Apache Airflow. While I’ve been a consumer of Airflow over the years, I’ve never contributed directly to the project. And weighing in at over half a million lines of code, Airflow is a pretty complex project to wade into. So here’s a guide on how I made a new operator in the AWS provider package. Overview: Before you get started, it’s good to have an understanding of the different components of an Airflow task....

 · 8 min
Example output of Air Quality Data

Build your own Air Quality Monitor with OpenAQ and EMR on EKS

Fire season is fast approaching, and as somebody who spent two weeks last year hunkered down inside with my browser glued to various air quality sites, I wanted to show how to use data from OpenAQ to build your own air quality analysis. With Amazon EMR on EKS, you can now customize and package your own Apache Spark dependencies, and I use that functionality for this post. Overview: OpenAQ maintains a publicly accessible dataset of various air quality metrics that’s updated every half hour....

 · 12 min
https://pimpyourowndevice.com/stickers/developer-avocado-cheerful/

5 things they don't tell you about being a developer advocate

Image credit: this awesome cheerful dev advocado sticker. I recently rejoined AWS as a developer advocate for analytics on the EMR team and in the past 6 months I’ve learned a lot of things about DevRel…and a lot about what I didn’t know. So here’s a top 5 list. 1. There are multiple roles in DevRel. Some people focus on community. Others focus on proof-of-concepts. Others still focus on demos, tutorials, or docs....

 · 2 min
stacked rocks on a beach, credit: https://flic.kr/p/LhbFfr

Big Data Stack with CDK

I wanted to write a post about how I built my own Apache Spark environment on AWS using Amazon EMR, Amazon EKS, and the AWS Cloud Development Kit (CDK). This stack also creates an EMR Studio environment that can be used to build and deploy data notebooks. Disclaimer: I work for AWS on the EMR team and built this stack for my various demos and it is not intended for production use-cases....

 · 9 min
Aurora over Lake Hawea, credit: https://flic.kr/p/U2vmjD

Initial Revision

Hi there 👋 It’s been a long time – 3,399 days to be exact – since I’ve blogged, but since starting a new role as a developer advocate I figured it was a good time to start again. 😁 Welcome to the 2021 revamp of my personal developer site. In it, you will find: posts about technical (and sometimes personal) content; personal and work projects I want to share. I get excited and make things, and will do my best to share that here....

 · 1 min