This blog post is part of a series of three, in which we’ll dive into the details of why we wanted to create a data warehouse, how we created the data lake, how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @hu_me / MarketLytics, who we’ve worked with alongside during this (ongoing) project.
Six months ago, we were looking at Google Sheets, Google Analytics, and many other tools/vendors daily and had no marketing data live in a data warehouse (nor a lake). We wanted to get better at using (marketing) data but mostly find a better way to connect insights from multiple sources to find insights but also make more data available to the rest of the internal teams (think: PPC, SEO, Email, Product).
- Prove business value: Everything we do as Marketing needs to prove some kind of business value. We’re not doing things just to keep ourselves busy. So we wanted to make sure upfront that the business cases that we had in mind would have an upside by creating this setup. Some of these were: detailed insights into social media performance, attribution level data, and a way for us to connect transactions between data sources.
- Access to ‘our’ raw data: As often with marketing data, you’re using the interface of the actual marketing vendor to pull your daily numbers. But since so much of the marketing data overlaps with internal data from other channels we wanted to get access to all the raw data that is ‘ours’ (to begin with) and have the ability to combine it. This, for example, would provide us with insights on the lifetime value of customers by channels or the ability to export cohorts of users.
- The analysis doesn’t scale in a spreadsheet and isn’t repeatable: Besides creating reporting monkeys, we wanted our teams to not have to spend time pulling data from different sources and manually combining it. It’s too high a risk for potential mistakes, the process doesn’t scale, and valuable time of team members would be lost in the process.
- More control of data quality: We want to control our data pipelines and be involved in the ETL (Extract, Transform, Load) process so we know how data is shaped. There shouldn’t be a black box of data being created in the background. With having this process potentially in place, we can control the quality and usefulness of data more (and continue more over time).
- More flexibility for visualization and ‘deep analysis’: With having access to our raw data in a warehouse, we knew that we would have to rely less on building analysis right on top of vendor’s APIs or worry about potential issues on how to visualize data (on dashboards).
Looking at the complete picture
Marketing data is often confined within the platforms and tools used to distribute the marketing message. But your prospects don’t confine themselves to just one source or medium to interact with your company (the path length is usually longer than one session before a transaction takes place). To be genuinely able to measure the effectiveness of your marketing activities, extracting the data from these platforms and having the ability to analyze them is essential. This goes back to our first point on Why. It needs to solve cases that provide business value.
Beginning with the end in mind
One of the biggest challenges in working with data is that often you don’t actually know precisely what data you need to solve for your use case. Usually, you tend to go through the cycle of “capture – store – enrich – analyze” a couple of times before arriving at really valuable insights.
By creating your own data warehouse, you create the flexibility to transform and relate the data to each other in ways that aren’t possible when your data is locked within separate platforms.
Build versus Buy?
While it was clear that we wanted to own as much of the data ourselves, an apparent next discussion was how to gain access to the actual data. Would we build the pipelines ourselves? Would we buy tools to do so, and how much of the transformation of data would fall on the team? How about maintenance?
Pros of Buying a solution
- Plug and Play: A strong Pro for buying software usually is a decreased time to value because you’re skipping the building phase. So you avoid needing to have a strong learning curve and having to deal with the hard/soft-ware setup. What was a requirement for us is that we didn’t want to deal with a long process of setting up resources.
- Scalability: As we grow, our datasets will grow. While paying extra for additional volume isn’t a problem, we wanted to make sure that the infrastructure would hold up and could last while we reach a certain factor of scale.
Cons for Buy & Build
- High Costs: Buying tools would likely, over the long term, have more direct costs associated with it. While obviously, the indirect costs for building the solutions ourselves (increased headcount) would also be expected to be high.
- Vendor Buy-in: Regardless of either buy or build, it’s a strong Con that we’re basically bought into a vendor’s platform that we wanted to make sure that certain aspects of it are able to be moved over to another service if we deem that to be the right choice at that point in time.
- Standard data formatting: With some vendors, it was harder to change the data schema (often for a good reason) or pull additional data formats that we wanted to collect from, for example Facebook (the data was available in the API, just not in the vendor’s export).
- Time to first use: Building pipelines for extracting the data from certain platforms could take a long time. But obviously the real business value will, in most cases, only show up when you’re actively working with the data. Losing months while dealing with extracting the data was something (also taking into account our seasonality, which provides a ton of data) that would delay this. So going with platforms that could directly provide us with extracted data was a great pro versus a strong con for building it ourselves.
- Maintenance: We didn’t want to maintain extract pipelines from the marketing platforms that we use. While it’s a really important part of our data strategy, we wanted to avoid the need to spend a significant amount of our time to stay up to date and maintaining pipelines. With a buy solution, we would rely on the resources of the vendor to deal with this part of the data flows, which is very much what we would prefer at this point, mainly considering that the resources of our data team are minimal.
In the end, we decided to go with a combination of the two strategies. We decided to let most of the Extract processes be handled by software that we would buy, or that would already be available opensource as this is a task that we wanted to make as cost-effective as possible. Most of the Transform will be done through a set of tools that we would maintain and build ourselves as there are too many custom use cases and needs. With that, it meant that we also needed to ‘Load’ the data back into our data warehouse mainly ourselves. More on the details of that in our upcoming blog posts.
An often asked question in this process is what kind of costs we were expecting to have. While we won’t disclose the costs for the additional help received, we’ll quickly touch on the costs for this whole setup thus far. Parts of this are variable based on the size of your data but so far our costs:
- Google Cloud Platform: ~$400-700 monthly.
- This includes costs for Google BigQuery, Cloud Composer and other tools. We’re expecting that this will increase at some point but gradually will go up over time.
- Note: In another blog post we might dive deeper into Google Cloud Composer and how we leverage that for Airflow (processing data pipelines). Right now, Cloud Composer presents at least half the costs for Google Cloud Platform. If you decide not to use it, your costs will be significantly lower for GCP.
- Stitch Data: $ 350-500 monthly based on volume for now. We’re expecting that this will increase at some point but gradually will go up over time.
- SuperMetrics: $190 monthly per license per vendor (depending on the number of licenses you might need).
Strategy going forward
With some of these decisions made upfront, we decided to move forward with collecting the data channel by channel, starting with the most important channels. This way, instead of needing to collect all the data for every channel, we were able to directly start building dashboards on top of the data that would be useful for teams. The steps to take to get there were as follows, and we’ll dive into some of them in detail in upcoming blog posts:
- Extract all the data using Google BigQuery Transfer Service, Supermetrics and StitchData.
- Validate that the data in Google BigQuery is correct by comparing it to the regular data sets, existing dashboards, and the data that we would look at in the vendor’s platform.
- Transform the data so data sources can be combined and Transform the data to be directly useful for reporting purposes.
- Load this new data into new reporting tables, backfill historical data where possible, and sync it back into Google BigQuery.
- Depending on the channel, build dashboarding in Google DataStudio or Tableau.
In the next blog post, we’ll go into depth on how we created a marketing data lake, where all our raw marketing data from our vendors (Google Search Console, Google Ads, Bing Ads, etc.) lives. Expect that to go live within the next two weeks. Until then, feel free to share your thoughts on this blog post on Twitter.
Again… This blog post is written with the help of @RickDronkers and @hu_me / MarketLytics, who we’ve worked with alongside during this (ongoing) project.