Part 1: Why Build a Marketing Data Warehouse? – Building a Marketing Data Lake and Data Warehouse on Google Cloud Platform

This blog post is part of a series of three, in which we’ll dive into the details of why we wanted to create a data warehouse, how we created the data lake, how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @hu_me / MarketLytics, who we’ve worked with alongside during this (ongoing) project.

Six months ago, we were looking at Google Sheets, Google Analytics, and many other tools/vendors daily and had no marketing data live in a data warehouse (nor a lake). We wanted to get better at using (marketing) data but mostly find a better way to connect insights from multiple sources to find insights but also make more data available to the rest of the internal teams (think: PPC, SEO, Email, Product).

Why?

  • Prove business value: Everything we do as Marketing needs to prove some kind of business value. We’re not doing things just to keep ourselves busy. So we wanted to make sure upfront that the business cases that we had in mind would have an upside by creating this setup. Some of these were: detailed insights into social media performance, attribution level data, and a way for us to connect transactions between data sources.
  • Access to ‘our’ raw data: As often with marketing data, you’re using the interface of the actual marketing vendor to pull your daily numbers. But since so much of the marketing data overlaps with internal data from other channels we wanted to get access to all the raw data that is ‘ours’ (to begin with) and have the ability to combine it. This, for example, would provide us with insights on the lifetime value of customers by channels or the ability to export cohorts of users.
  • The analysis doesn’t scale in a spreadsheet and isn’t repeatable: Besides creating reporting monkeys, we wanted our teams to not have to spend time pulling data from different sources and manually combining it. It’s too high a risk for potential mistakes, the process doesn’t scale, and valuable time of team members would be lost in the process.
  • More control of data quality: We want to control our data pipelines and be involved in the ETL (Extract, Transform, Load) process so we know how data is shaped. There shouldn’t be a black box of data being created in the background. With having this process potentially in place, we can control the quality and usefulness of data more (and continue more over time).
  • More flexibility for visualization and ‘deep analysis’: With having access to our raw data in a warehouse, we knew that we would have to rely less on building analysis right on top of vendor’s APIs or worry about potential issues on how to visualize data (on dashboards). 

Looking at the complete picture

Marketing data is often confined within the platforms and tools used to distribute the marketing message. But your prospects don’t confine themselves to just one source or medium to interact with your company (the path length is usually longer than one session before a transaction takes place). To be genuinely able to measure the effectiveness of your marketing activities, extracting the data from these platforms and having the ability to analyze them is essential. This goes back to our first point on Why. It needs to solve cases that provide business value.

Beginning with the end in mind

One of the biggest challenges in working with data is that often you don’t actually know precisely what data you need to solve for your use case. Usually, you tend to go through the cycle of “capture – store – enrich – analyze” a couple of times before arriving at really valuable insights.

By creating your own data warehouse, you create the flexibility to transform and relate the data to each other in ways that aren’t possible when your data is locked within separate platforms.

Build versus Buy?

While it was clear that we wanted to own as much of the data ourselves, an apparent next discussion was how to gain access to the actual data. Would we build the pipelines ourselves? Would we buy tools to do so, and how much of the transformation of data would fall on the team? How about maintenance?

Pros of Buying a solution

  • Plug and Play: A strong Pro for buying software usually is a decreased time to value because you’re skipping the building phase. So you avoid needing to have a strong learning curve and having to deal with the hard/soft-ware setup. What was a requirement for us is that we didn’t want to deal with a long process of setting up resources.
  • Scalability: As we grow, our datasets will grow. While paying extra for additional volume isn’t a problem, we wanted to make sure that the infrastructure would hold up and could last while we reach a certain factor of scale.

Cons for Buy & Build

  • High Costs: Buying tools would likely, over the long term, have more direct costs associated with it. While obviously, the indirect costs for building the solutions ourselves (increased headcount) would also be expected to be high.
  • Vendor Buy-in: Regardless of either buy or build, it’s a strong Con that we’re basically bought into a vendor’s platform that we wanted to make sure that certain aspects of it are able to be moved over to another service if we deem that to be the right choice at that point in time.
  • Standard data formatting: With some vendors, it was harder to change the data schema (often for a good reason) or pull additional data formats that we wanted to collect from, for example Facebook (the data was available in the API, just not in the vendor’s export).
  • Time to first use: Building pipelines for extracting the data from certain platforms could take a long time. But obviously the real business value will, in most cases, only show up when you’re actively working with the data. Losing months while dealing with extracting the data was something (also taking into account our seasonality, which provides a ton of data) that would delay this. So going with platforms that could directly provide us with extracted data was a great pro versus a strong con for building it ourselves.
  • Maintenance: We didn’t want to maintain extract pipelines from the marketing platforms that we use. While it’s a really important part of our data strategy, we wanted to avoid the need to spend a significant amount of our time to stay up to date and maintaining pipelines. With a buy solution, we would rely on the resources of the vendor to deal with this part of the data flows, which is very much what we would prefer at this point, mainly considering that the resources of our data team are minimal.

In the end, we decided to go with a combination of the two strategies. We decided to let most of the Extract processes be handled by software that we would buy, or that would already be available opensource as this is a task that we wanted to make as cost-effective as possible. Most of the Transform will be done through a set of tools that we would maintain and build ourselves as there are too many custom use cases and needs. With that, it meant that we also needed to ‘Load’ the data back into our data warehouse mainly ourselves. More on the details of that in our upcoming blog posts.

Costs?

An often asked question in this process is what kind of costs we were expecting to have. While we won’t disclose the costs for the additional help received, we’ll quickly touch on the costs for this whole setup thus far. Parts of this are variable based on the size of your data but so far our costs:

  • Google Cloud Platform: ~$400-700 monthly.
    • This includes costs for Google BigQuery, Cloud Composer and other tools. We’re expecting that this will increase at some point but gradually will go up over time.
    • Note: In another blog post we might dive deeper into Google Cloud Composer and how we leverage that for Airflow (processing data pipelines). Right now, Cloud Composer presents at least half the costs for Google Cloud Platform. If you decide not to use it, your costs will be significantly lower for GCP.
  • Stitch Data: $ 350-500 monthly based on volume for now. We’re expecting that this will increase at some point but gradually will go up over time.
  • SuperMetrics: $190 monthly per license per vendor (depending on the number of licenses you might need).

Strategy going forward

With some of these decisions made upfront, we decided to move forward with collecting the data channel by channel, starting with the most important channels. This way, instead of needing to collect all the data for every channel, we were able to directly start building dashboards on top of the data that would be useful for teams. The steps to take to get there were as follows, and we’ll dive into some of them in detail in upcoming blog posts:

  1. Extract all the data using Google BigQuery Transfer Service, Supermetrics and StitchData.
  2. Validate that the data in Google BigQuery is correct by comparing it to the regular data sets, existing dashboards, and the data that we would look at in the vendor’s platform.
  3. Transform the data so data sources can be combined and Transform the data to be directly useful for reporting purposes.
  4. Load this new data into new reporting tables, backfill historical data where possible, and sync it back into Google BigQuery.
  5. Depending on the channel, build dashboarding in Google DataStudio or Tableau.

In the next blog post, we’ll go into depth on how we created a marketing data lake, where all our raw marketing data from our vendors (Google Search Console, Google Ads, Bing Ads, etc.) lives. Expect that to go live within the next two weeks. Until then, feel free to share your thoughts on this blog post on Twitter.
Again… This blog post is written with the help of @RickDronkers and @hu_me / MarketLytics, who we’ve worked with alongside during this (ongoing) project.


What books am I reading in 2020?

For the last four years, I wrote blog posts (2019, 2018, 2017 & 2016) listing the books that I read in the past year and that I wanted to be reading during that year. As always, the past year I didn’t read all the books that I’ve listed out in the blog post as I discovered some new ones and changed my focus on some others. But I did read a lot, as I finished ~25 books (and put two books aside that weren’t worth finishing).

What books I didn’t get to in 2019 and have re-added to the list for 2020:

My favorites from 2019:

  • High Growth Handbook: My favorite book from last year as it touched right on the topics that I care about on a daily basis, great interviews with daily practitioners on many topics that relate to High Growth companies including a lot of tactical advice.
  • The Ride of a Lifetime: A fascinating read on how Robert Iger made it’s way to the top of Disney as their CEO. One of my favorites as it shows the personal side of growth he had to go through to reach the top and it provides some insights along the way of how big Disney is of a company.
  • High-Performance Habits: Nothing wrong with learning more about performance habits that could help you improve.
  • Loonshots: A different book from what I would usually read as it went into detail on projects that have changed the world and created innovations that we still benefit from these days (think radar, energy).
  • Secrets of Sand Hill Road: A great book if you want to know more about investing, money and the workings around startups related to investors, deal flows and structures.

What books I’d like to be reading in 2020:


Leave your recommendations on Twitter: @MartijnSch as I’d love to know from others what I should be reading from your recommendations and should be on the list or removed from the list.


Starting & Growing SEO Teams

“Tell me, who/what do I need to hire to run our SEO program? What is the first hire for a new SEO team?” Questions I often get, usually followed by: “Do you know anybody for our team?”. As so many companies around the Bay Area are hiring it makes sense, which also makes hiring harder. I’ve previously blogged about writing a better job description for SEO roles but I also wanted to shed some light into what I’d suggest as good setups for an SEO team and what roles + seniority to hire for depending on your company structure.

Why need SEO support?

Most startup founders or early employees don’t have an extensive background in Marketing or specifically SEO (and they shouldn’t). Most of the time, they have been too busy building the company, getting to product-market fit and iterating on their product/service. In a lot of growing B2C companies, you need to establish plans for long term growth. That’s what SEO can usually bring to these companies: a sustainable long term growth strategy. But in order to get there, you’ll need to bring in extra help to make sure that it actually is sustainable. Instead of employing short term SEO tactics that might put your growth more at risk if you approach it wrong (as many startups also do).

Why create an SEO Team?

So why do you need to create an SEO team, for many of us this is common knowledge as we’re in this on a daily basis? But let’s say you’re getting started, these could be some of the objectives:

  • Dedicated focus on SEO, there are too many other channels to take care off.
  • Need to grow a long-term channel to success.
  • Too many tasks, need to specialize with its own dedicated specialized IC/team.
  • Build out more brand awareness for the company (SEO is a great way of doing this long term).
  • Grow revenue & transactions at a low Customer Acquisition Cost.

Consultant versus Hiring Inhouse

Hiring an SEO Consultant versus an Inhouse SEO

Some teams can’t always hire talent right away (think about the Bay Area where basically all the bigger companies constantly have a need to hire SEO talent) or it might take too long to ramp up SEO. In some other cases, it made more sense for the company to hire a consultant in the short term to take care of some issues and figure out what’s actually needed instead of just hiring somebody with potentially the wrong skill set for the long term.

My take on this is usually that if you already know what you want your SEO team to work on & are able to wait another 2-3 months that you’re better off hiring somebody in-house (if resources are available). In other cases: you have a short term need, you need a technical SEO but want to hire a content person, etc. You’re likely better off starting with a specialized consultant in an area to make sure your issues around that are covered.

Also, read my blogpost on levels & seniority in SEO roles.

Hiring SEOs isn’t good enough: resources!

The Skills of an SEO Leader

Provide them with resources, when I joined Postmates one of the questions that I wanted to make sure was that they provided me with not just resources to set up some tools but also that I had engineers available to run the actual implementations and a designer to create the new pages that we needed there.

  • Engineering: It’s important, as you the SEO can’t make all the changes yourself, you’ll need the team to make actual changes. Most SEOs that I meet don’t have the knowledge about their infrastructure to actually push code or design something that complies with brand guidelines.
  • Design: You need to add additional content, you need more blog posts, but they can’t just be text. There need to be visual add-ons to it, so you need design support.
  • Content: In a bigger company there will be an actual huge need for content (either new or to edit existing content). 

How have you been growing SEO teams, what is missing, what should SEO teams really focus on? Leave a comment!


Announcing my Technical SEO Course on CXL Institute

If there was one thing that I could teach people in SEO, it was always the technical side of SEO that came up first. Mostly, because I think it’s a skill that doesn’t suit too many SEOs and there is already enough (good or bad, you’ll be the judge of that) content about the international, link building or content side of SEO out there. As technical SEO is getting more and more technical and in-depth about the subject itself, I’m excited to announce that I’m launching a new technical course with the folks of CXL institute.

The course will cover everything from structured data to XML sitemaps and back to some more basic on-page optimization. Along the way, I show you my process for auditing a site and coming up with the improvements. I’ll try to teach you about as many different issues and solutions as I could think of.

It’s not going to be ‘the most complete’ course ever on this topic, technical SEO evolves quickly, and likely some things will already be outdated now it’s published, while we have worked on it for months. But I’m going to do my best to inform you here and on CXL Institute about any changes or any improvements that we might be able to make in a future version. If you have any questions about the course or want to cheer me on, reach out via Twitter on @MartijnSch.


Keyword Gap Analysis: Identifying your competitor’s keywords with opportunity (with SEMrush)

Keyword research can provide you with a lot of insights, no matter what tool you’re using they all can provide you a great deal of insight into your own performance but also that of others. But while I was doing some keyword research I thought about writing a bit more about one specific part: gap analysis. In itself an easy to understand the concept, but it can provide a skewed view of your competition (or not). To demonstrate this we’ll take a look at an actual example of some sites while using SEMrush’s data.

What does your competition look like?

You know who your competitors are right? At least your direct ones, but often people who work in SEO or the level above (Marketing/Growth) don’t always know who the actual players are. I worked in the food delivery industry, but more often than not I was facing more competition from totally random sites or some big ones than our competitors (for good reason). So it’s important to know what your overlap is in the search rankings (it’s one of the reasons you should actually be tracking rankings, but that’s a topic for another day) with other sites. This way you know what competitors are rising/declining in your space and what you can learn from their strategies to apply to your own site. But this is exactly where the caveat is, is that actually the case!?

So let’s look at an example, as you can see in this screenshot from SEMrush the playing field for Site A is quite large. They’re ‘ranking’ for tens of thousands of keywords and are placing in a decently sized industry. While they’re ahead of their competition it’s also clear that there are some ‘competitors’ in the space that are behind them in search visibility.

So let’s take the next competitor, we’ll call them ‘Competitor A’. What we see here is that they rank for 250.000 keywords. A significant number still, compared to what we’re ranking for. It doesn’t mean though that all their keywords are what we’re ranking for. So let’s dive into gap analysis.

Keyword Gap Analysis

In short, there are three ways to look at keyword analysis:

  • What keywords am I ranking for, that my competitor is also ranking for (overlap)?
  • What keywords am I ranking for, that my competitor is not ranking for (competitive advantage)?
  • What keywords am I not ranking for, that my competitor is ranking for (opportunity)?

Today we’ll only talk about the last one, what keywords could I be ranking for, as my competitor is already ranking for them, to drive more growth. When using SEMrush you can do this by creating a report like this (within the Keyword Gap Analysis feature):

Screenshot of: Creating a keyword gap analysis report in SEMrush

You always have the three options available to select. In this case, we’ll do the Common Keywords Option. And the result that we should see looks something like this:

Screenshot of the result, with a list of keywords.

What are the keywords with actual opportunity?

So there is apparently xx.xxx keywords that I’m not ranking for (and likely should). That’s significant and almost leads me to believe that we’re not doing a good job. So what the problem often is, which is not a bad thing. Is that the majority of these keywords are being driven by the long tail (specific queries with very low volume). So what ends up happening is that I’m likely looking at tons of keywords that I don’t want to focus on (and hopefully will benefit from by just creating a little bit more generic good content). So when I did this for a competitor and filtered down on keywords that were for them at least ranking position <20 and had a volume >10 monthly I had only 2500 keywords left. That’s just a few % of the keywords that we got started with. It’s required to add that I’m not saying to ignore the other keywords, but now you have the keywords that you have a real opportunity to drive actual results. In the end, you should be able to rank well, as your competitor is already ranking (position: <20), there is actual volume (>10) and you’re not in there at all.

This is just something that I was playing with while exploring some industries, and it’s a topic that I haven’t seen a lot of content about over the last years. While the data is often available it will both help you get new content ideas but also helps you identify the actual value (keyword volume should turn into business results: impressions x CTR x Conversion Rate == $$$) on the revenue side.