In the previous blog posts (part 1, part 2, part 3, and part 4) in this series, we talked about why we decided to build a marketing data warehouse. This endeavor started by figuring out how to deal with the first part: building the data lake. In this post, a more technical one, I’ll give some insights into how we’re leveraging Apache Airflow to build the more complicated data pipelines, and I’ll give you some tips on how to get started.
This blog post is part of a series of five? (maybe more, you never know), in which we’ll dive into the details of why we wanted to create a data warehouse, how we created the data lake, and how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @Hussain / MarketLytics, who we’ve worked alongside during this (ongoing) project.
Getting Started with Cloud Composer
Cloud Composer is part of Google’s Cloud Platform and brings you most of the upside of using Apache Airflow (open source) and barely any of the downsides (setup, maintenance, etc.). Or to follow their main USP: “A fully managed workflow orchestration service built on Apache Airflow.” While we had worked with Airflow before, we weren’t looking forward to spending time worrying about its management, as we planned to spend most of our time setting up and maintaining the data pipelines. In the end, all you have to focus on is creating pipelines (DAGs).
What is it suitable for?
You want to load data from the Google Analytics API, store it locally, translate some values to something new, and have it available in Google BigQuery. However you build it, it consists of multiple tasks and functions that depend on each other. You wouldn’t want to load the data into BigQuery if it hasn’t been cleaned yet (trash in, trash out sounds familiar?). With Airflow, the next task is only processed if the previous step was successful.
Tasks are, in almost every case, just one thing: get data from BigQuery, upload a file from GCS into BigQuery, download a file from Cloud Storage to local, process data. What makes Airflow very efficient to work with is that the majority of data processing tasks already have pre-built functions. The first three tasks that I listed here are covered by operators (GoogleCloudStorageDownloadOperator, GoogleCloudStorageToBigQueryOperator) that operate as functions.
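To make that "only continue if the previous step succeeded" behaviour a bit more tangible, here is a minimal plain-Python sketch. It is not Airflow itself: the task names (extract_from_api, clean_rows, load_to_warehouse) and the run_chain helper are hypothetical and only exist for this illustration. In a real DAG you would declare the same chain with operators and Airflow's >> dependency syntax, and the scheduler would do the bookkeeping for you.

```python
from typing import Callable, List, Tuple


def extract_from_api() -> None:
    """Stand-in for pulling data from an API (hypothetical)."""


def clean_rows() -> None:
    """Stand-in for the cleaning step; it fails here on purpose."""
    raise ValueError("unexpected value in column 'source'")


def load_to_warehouse() -> None:
    """Stand-in for the load step; must never run on dirty data."""


def run_chain(tasks: List[Callable[[], None]]) -> List[Tuple[str, str]]:
    """Run tasks in order; once one fails, the rest are not executed."""
    results = []
    upstream_ok = True
    for task in tasks:
        if not upstream_ok:
            # A previous step failed, so this task is skipped entirely.
            results.append((task.__name__, "upstream_failed"))
            continue
        try:
            task()
            results.append((task.__name__, "success"))
        except Exception:
            upstream_ok = False
            results.append((task.__name__, "failed"))
    return results


results = run_chain([extract_from_api, clean_rows, load_to_warehouse])
for name, state in results:
    print(f"{name}: {state}")
```

Because the cleaning step raises, the load step is marked as upstream_failed and never touches the warehouse, which is exactly the guarantee you want from a pipeline.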
Versus Google Cloud Functions
If you mainly run very simple ‘pipelines’ that consist of only one function that needs to be executed, or you have only a handful of use cases, it is likely overkill to leverage Cloud Composer: the costs might be too high, and you still have the overhead of DAGs. In that case, you might be better off with Google Cloud Functions, as you can write similar scripts and trigger them with Google Cloud Scheduler to run at a specific time.
The costs for Google Cloud Composer are doable: for a basic setup, it’s around 450 dollars (if you run the instances 24 hours * 7 days a week), as you leverage multiple (a minimum of 3) small instances. For more information on the costs, I would point to this pricing example.
See above an example data pipeline. In typical Airflow fashion, every task depends on the previous task. In other words: notify_slack_channel would not run if any of the previous tasks failed. All tasks happen in a particular order, from left to right. In most cases, data pipelines become more complicated, as you can have multiple flows going on at the same time that are combined at the end.
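For comparison, a single-job script on Cloud Functions can be as small as one HTTP entry point that Cloud Scheduler calls on a schedule. The sketch below is an assumption of what that might look like, not our actual setup: the run_export name, the date field, and the FakeRequest helper are all hypothetical. On Cloud Functions, the request argument is a Flask request object, which is why the sketch only relies on its get_json method.

```python
def run_export(request):
    """HTTP entry point: do one job and report the outcome."""
    # get_json(silent=True) returns None instead of raising on a bad body.
    payload = request.get_json(silent=True) or {}
    report_date = payload.get("date", "yesterday")
    # ... the single job (fetch, transform, load) would go here ...
    return (f"export finished for {report_date}", 200)


class FakeRequest:
    """Tiny local stand-in for the Flask request, for trying this out."""

    def __init__(self, body):
        self._body = body

    def get_json(self, silent=False):
        return self._body


body, status = run_export(FakeRequest({"date": "2020-01-31"}))
print(status, body)
```

There is no dependency handling, retrying, or ordering here, which is fine for one job and is exactly what you start missing once several jobs depend on each other.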
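A pipeline with several flows that join at the end is just a dependency graph. As a rough sketch, the snippet below uses Python's standard graphlib module (Python 3.9+) to work out a valid run order; it is an illustration of the idea, not how Airflow schedules internally. Only notify_slack_channel comes from the example pipeline; every other task name is hypothetical.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of tasks it depends on.
# Only notify_slack_channel is taken from the post; the rest is made up.
dependencies = {
    "clean_analytics": {"fetch_analytics"},
    "clean_ads": {"fetch_ads"},
    # Two independent flows are combined here.
    "load_to_bigquery": {"clean_analytics", "clean_ads"},
    "notify_slack_channel": {"load_to_bigquery"},
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two fetch/clean flows can run side by side in any order, but the load step always waits for both, and the Slack notification always comes last.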
Tips & Tricks
Google Cloud Build, Repositories
The files for Google Cloud Composer are saved in Google Cloud Storage, which is smart in itself, but at the same time you want them to live in a Git repository so you can efficiently work on them together. By following this blog post, you’re able to connect the Cloud Storage bucket to a repository and set up a sync between the two. This basically helps you build a deployment pipeline and makes sure that only production-ready code from your master branch ends up in GCS.
After working with it for a few months now, I’m still not sure if managing dependencies through Google Cloud Composer is a good or bad thing, as it creates some obstacles if you want to run a deployment that adds some Python libraries (your servers could be down 10-30 minutes at a time). In other setups, this is usually a bit smoother and creates less downtime.
Sendgrid for Email Alerts
One of the upsides of Apache Airflow is that it sends alerts upon failure of tasks. Make sure to set up the Sendgrid notifications while you’re setting up Google Cloud Composer. This will be the most straightforward way of receiving email alerts (for free, as in most cases, you shouldn’t get too many failure emails).
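On the DAG side, failure emails are switched on through the arguments you pass to your tasks, usually via a shared default_args dictionary. A minimal sketch, assuming the email service itself is already configured in your environment: the keys are standard Airflow task arguments, while the owner label and the recipient address are placeholders.

```python
from datetime import timedelta

# Shared settings, handed to every task through DAG(default_args=...).
default_args = {
    "owner": "data-team",             # hypothetical owner label
    "email": ["alerts@example.com"],  # placeholder recipient
    "email_on_failure": True,         # mail when a task ends up failed
    "email_on_retry": False,          # stay quiet on intermediate retries
    "retries": 1,                     # try once more before giving up
    "retry_delay": timedelta(minutes=5),
}
```

Keeping email_on_retry off while allowing a retry is one way to avoid being mailed about hiccups that fix themselves.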
Document the crap out of your setup and DAGs. When I took over some of the pipelines that were used at Postmates for XML sitemap generation, it was a nightmare: the code was hard to read and didn’t make a lot of sense, and we had to refactor certain things just because of that. Pipelines (just like regular code) can be left untouched and unviewed for months (as they sometimes literally only have one job), so you want to make sure that you can come back and understand what happens inside the tasks.