Building a data warehouse for SEO in Google BigQuery

Why do you want/need a data warehouse for SEO?

A data warehouse (and data lake) stores and structures data, fed through data pipelines, and then makes it possible to visualize that data. That means it can help create and power your SEO reporting infrastructure, especially when you’re combining lots of different data sources. And if you simply have a ton of data, a solution like this can process far more than your laptop ever could.

Some quick arguments for having a warehouse:

  • You need to combine multiple data sources with lots of data.
  • You want to enable the rest of the organization to access the same data.
  • You want to provide a deeper analysis into the inner workings of a vendor or search engine.

It’s not all sunshine and rainbows. Setting up a warehouse can be quick and easy, but maintaining it is where the real work comes in. Data formats and sources change over time, and keeping up with those changes creates overhead. Be aware of that.

What can I do with a warehouse?

Imagine you’re on a marketing team with a strong performance-marketing setup: you advertise in Google Ads while simultaneously competing for the same keywords in SEO to maximize click share. Your reporting becomes far more useful if it can show the total number of clicks in search, not just paid or organic in isolation. By joining the two datasets at scale you can achieve this and visualize progress over time. Excel/Google Sheets gives you the ability to do this once (or repeatedly, if you’re a true spreadsheet junkie), but not to build daily dashboards and share them easily with your colleagues. With a warehouse, you can store data from both ends (Google Ads and Google Search Console), combine the data, and visualize it later on.
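To make the join concrete, here’s a minimal Python sketch of what the warehouse does for you: two sets of rows keyed by (date, keyword), one standing in for a Google Ads export and one for Google Search Console, combined into total search clicks. The field names and sample rows are hypothetical placeholders, not the actual schemas of either export.

```python
from collections import defaultdict

def total_search_clicks(paid_rows, organic_rows):
    """Combine paid and organic clicks per (date, keyword).

    Each input is an iterable of dicts with 'date', 'keyword', and
    'clicks' keys -- stand-ins for rows exported from Google Ads and
    Google Search Console respectively.
    """
    totals = defaultdict(lambda: {"paid": 0, "organic": 0})
    for row in paid_rows:
        totals[(row["date"], row["keyword"])]["paid"] += row["clicks"]
    for row in organic_rows:
        totals[(row["date"], row["keyword"])]["organic"] += row["clicks"]
    # Add the combined total per key so dashboards can plot one number.
    return {
        key: {**split, "total": split["paid"] + split["organic"]}
        for key, split in totals.items()
    }

paid = [{"date": "2021-06-01", "keyword": "rv rental", "clicks": 120}]
organic = [{"date": "2021-06-01", "keyword": "rv rental", "clicks": 340}]
combined = total_search_clicks(paid, organic)
print(combined[("2021-06-01", "rv rental")]["total"])  # 460
```

In the warehouse itself this is just a SQL join on date and keyword; the point is that once both tables live in BigQuery, that join runs daily without anyone touching a spreadsheet.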

Seer Interactive wrote a good blog post about their decision to move to their own, homegrown rank-tracking solution. It provides an interesting insight into how they’re leveraging their internal warehouse for some of that data as well.

Do I actually need a warehouse?

Are you a small team, a solo SEO, or working on a small site? Make this a hobby project in your time off. At your scale, you likely don’t need a warehouse and can accomplish most things by connecting some of your data sources in a product like Google Data Studio. Smaller teams often have fewer (enterprise, duh!) SEO tools in their chest, so there is less data overall. At a smaller scale, a warehouse can easily be avoided and replaced by Google Sheets/Excel or a good visualization tool.

Why Google BigQuery?

Google BigQuery is RVshare’s choice for our marketing warehouse. Alternatives to Google BigQuery are Snowflake, Microsoft Azure, and Amazon’s Redshift. As we had a huge need for Google Ads data, and Google provides a free full export of it into BigQuery, it was a no-brainer for us to start there and leverage their platform. If you don’t have that need, you can replicate most of this with the other services out there. For the sake of this article, as I have experience dealing with BQ, we’ll use that.

What are the costs?

It depends, but let me give you insight into the actual costs of the warehouse for us. Google Search Console and Bing Webmaster Tools are free. Botify, Nozzle (SaaS pricing here), and Similar.ai are paid products that require a contract agreement.

  • Google Search Console & Bing Webmaster Tools: Free.
  • Nozzle, Similar.ai, Botify: Require contract agreements; reach out to me for some insight if you’re truly curious and seriously considering purchasing them.
  • StitchData: Starting at $1,000/year, depending on volume, although you’re likely fine with the minimum plan for just one data source.
  • SuperMetrics: $2,280/year for their Google BigQuery license, which we use to export Google Search Console data. There are cheaper alternatives, but given our existing setup it’s not worth it for us to switch providers.
  • Google Cloud Platform – Google BigQuery: Storage in BigQuery is affordable, especially if you’re importing just a handful of data sources; it only gets expensive with very large datasets. So holding the data itself is cheap, and optimizing how you process and visualize the data afterwards can save a lot on top of that. Querying/analysis averages $5 per TB processed, and with small date ranges and only a few selected columns it’s hard to hit that quickly.
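As a back-of-the-envelope check on that last bullet, you can estimate what a query will cost from the bytes it scans. This is a sketch based on BigQuery’s on-demand pricing model ($5 per TB processed, with a 10 MB minimum billed per query, at the time of writing); check current pricing before relying on the numbers.

```python
def estimate_query_cost(bytes_processed, price_per_tb=5.0):
    """Estimate a BigQuery on-demand query cost in USD.

    On-demand queries are billed per TB of data processed
    (1 TB = 2**40 bytes), with each query rounded up to a
    10 MB minimum.
    """
    min_billed = 10 * 2**20  # 10 MB minimum billed per query
    billed = max(bytes_processed, min_billed)
    return billed / 2**40 * price_per_tb

# Scanning 50 GB of keyword data costs roughly a quarter:
print(round(estimate_query_cost(50 * 2**30), 4))  # 0.2441
```

This is why selecting only the columns and date ranges you need matters: BigQuery bills on bytes scanned, not rows returned.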

Loading Vendors into Google BigQuery

A few years ago, you needed to develop your own data pipelines to stream data into Google BigQuery (BQ) and maintain the pipeline from vendor to BQ yourself. This caused a lot of overhead and required having your own (data) engineers. Those days are clearly over: plenty of SaaS vendors now facilitate this process for you at reasonable prices, as we just learned.

Bing Webmaster Tools & Google Search Console

Search Analytics reports from both Google and Bing are extremely useful, as they provide insight into volume, clicks, and CTR %. This helps you directly optimize your site for the right keywords. Both platforms have their own APIs that enable you to pull search analytics data from them. While Google’s API is widely available through most data connectors, the Bing Webmaster Tools API is a different story. Find the resource links below for more context on how to load this data into your warehouse, as more steps are involved (and still nobody knows what type of data that API actually returns).
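If you end up rolling your own Google Search Console pipeline instead of using a connector, the Search Analytics API takes a small JSON request body. Here’s a sketch of building one in Python; the helper function and defaults are my own, but the field names (`startDate`, `endDate`, `dimensions`, `rowLimit`, `startRow`) are the ones the API expects.

```python
def search_analytics_request(start_date, end_date,
                             dimensions=("date", "query"),
                             row_limit=25000, start_row=0):
    """Build a request body for the Search Console Search Analytics API.

    The API returns at most 25,000 rows per call, so a pipeline pages
    through results by incrementing startRow until no rows come back.
    """
    return {
        "startDate": start_date,
        "endDate": end_date,
        "dimensions": list(dimensions),
        "rowLimit": row_limit,
        "startRow": start_row,
    }

body = search_analytics_request("2021-06-01", "2021-06-30")
print(body["rowLimit"])  # 25000
```

With the google-api-python-client library, you’d pass this body to `searchanalytics().query(siteUrl=..., body=body)` and load the returned rows into BigQuery.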

Resources

→ Saving Bing Search Query Data from the Bing Webmaster Tools API

→ Saving Google Search Console data with StitchData or Supermetrics

→ Alternatively, read about the Google Search Console API here to implement a pipeline yourself

Rank Tracking: Nozzle

Nozzle is our current rank-tracking solution, at a relatively small scale. We chose them a few months ago, after having our data primarily in SEMrush, because they can make all our SERP data available to us via their BigQuery integration.

Technical SEO: Botify

At both Postmates and RVshare I brought Botify in, as it’s a great (enterprise) platform that combines log files, its crawl data, and visitor data to give insight into your technical performance.

Similar.ai

Lesser known is Similar.ai, which provides keyword data and entity extraction. It’s useful when you’re dealing with keywords at massive scale and want to understand the different categories they fall into, and it comes in especially handy when you’re creating topical clusters. With their Google Cloud Storage > Google BigQuery import, we’re able to show this data next to our keyword data (from Google Search Console).
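The step from entity extraction to topical clusters can be sketched in a few lines. The keyword-to-entity mapping below is a hypothetical stand-in for what an entity-extraction vendor returns; the grouping itself is all a cluster is.

```python
from collections import defaultdict

def topical_clusters(keyword_entities):
    """Group keywords into clusters by their extracted entity.

    `keyword_entities` maps keyword -> entity, the general shape of
    output an entity-extraction vendor produces.
    """
    clusters = defaultdict(list)
    for keyword, entity in keyword_entities.items():
        clusters[entity].append(keyword)
    return dict(clusters)

clusters = topical_clusters({
    "rv rental texas": "rv rental",
    "rv rental near me": "rv rental",
    "class c motorhome": "motorhome",
})
print(sorted(clusters["rv rental"]))  # ['rv rental near me', 'rv rental texas']
```

Joined against Google Search Console data in the warehouse, this lets you report clicks and impressions per topic instead of per individual keyword.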

Bonus: Google Ads

If you’re advertising in paid search with Google Ads, it can be useful to combine organic keyword data with paid data. That’s why I like quickly setting up the Data Transfer Service with Google Ads, so all reports are automatically synced. This is a free service between Google Ads and Google BigQuery. More information can be found here.

How to get started?

  1. Figure out which of the tools you currently use provide a way to export their data.
  2. Create a new project in Google Cloud Platform (or use an existing one) and save your project ID.
  3. Create an account for StitchData and, where needed, a (paid) account for Supermetrics.
  4. Connect the right data sources to Google BigQuery or your preferred warehouse solution.

Good luck with the setup and let me know if I can help in any way. I’m curious how you’re getting value from your SEO warehouse or what use cases you’d like to solve with it. Leave a comment or find me on Twitter: @MartijnSch

