In the previous blog posts (part 1 and part 2) in this series, we talked about why we decided to build a marketing data warehouse. This endeavor started by figuring out how to deal with the first part: building the data lake. We’ll try to go a bit more into detail on how you can do this yourself in this post in which we transformed our marketing data lake into an actual data warehouse.
This blog post is part of a series of four? (we found enough content to add more articles ;-)), in which we’ll dive into the details of why we wanted to create a data warehouse, how we created the data lake, how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @hu_me / MarketLytics who we’ve worked with alongside during this (ongoing) project.
The Process of Building a Data Warehouse
In our endeavor of building a data warehouse, we had a couple of big initiatives that we first wanted to get done. We needed some reporting and visualization tables and aligned with that, we needed to make sure that we could have data that was cleaned for other purposes (deduplication, standardization: some typical ETL problems).
In order to streamline the processes, we used three different ways of getting the data streamlined:
Google Cloud Functions
Google Cloud Functions is used for both transforming our data as loading our initial data for a few use cases. Early on we noticed that not every vendor was available through regular data loading platforms, like StitchData. An example of that was Google Search Console, as we didn’t want to have the need to run additional infrastructure for just dealing with Load scripts we leveraged Cloud Functions to run a daily script (with support from Scheduler to make them daily).
After loading the data we also Transform some other tables from our marketing data lake to new production tables using Cloud Functions. All of our scripts are currently written in Python or Node.js, but as Cloud Functions makes it possible to deal with multiple languages it provides us with the flexibility to leverage others over time.
Backfill: As Functions can easily be rewritten and tested within the interface, it also provides us with a good way to backfill data easily as we can easily adjust the dates that a script needs to run.
In some other cases, we can also leverage Google BigQuery’s scheduled queries. In a few instances, we just want to load the data from raw data lake tables into a production table. Mainly because we don’t always need all the columns, we can limit our data drilling and be able to clean the data in the query itself. In that case, scheduled queries can come in pretty handy as they run on a certain schedule, can be easily updated, and already point towards another data set and table.
For more complicated data flows we’re currently using Airflow via Google Cloud Composer. Cloud Composer, as we mentioned in a previous blog post, enables us to not have to worry about maintaining the Airflow infrastructure but gives us all the other upside of it. This gives us the ability to focus on creating and maintaining the DAGs that help drive the actual data structuring flows.
How we mainly use Airflow is to combine, clean and enhance data from multiple sources and then reupload it back into Google BigQuery for visualization in other tools. Singular use cases are more easily captured by one or two tasks, but in Airflow we run flows that usually have multiple tasks that need to be executed in a certain order and not at all if one of them fails. This is what Airflow is meant to do, and that’s how we’re leveraging it too. As an example of our affiliate marketing campaigns, we have a structure set up that only pays out once travel is concluded (a very standard approach in the travel industry). This means that we need to retrieve orders from our partner > verify them with our database > create a new format to upload back to our vendor and run the actual upload. In addition, we want to set up some alerting for the team as well. Resulting in 6 tasks in this case that need to be executed in the right order: the perfect use case for Airflow.
In the previous blog post, I touched on how we wanted to set up raw tables that are transformed once or multiple times. We decided to do this to both make the data more streamlined and also to make them ready for visualization on our channel dashboards. The MarketLytics team did a great job documenting this with a very visual result that you can see here:
As discussed previously, we go through multiple stages with the data that we get into the data lake and transform it to the data warehouse.
Example of data enhancement: One of the most common scenarios that we’ve tried to solve for was connecting existing data from a vendor back to the data that we receive in our web analytics tool: Google Analytics. As an example, if properly tagged we should be able to get the data from a specific newsletter campaign from the UTM parameters and then connect the data to (in our case) Marketo to what we have there on deliverability and open rate (%).