Building a data warehouse for SEO in Google BigQuery

Why do you want/need a data warehouse for SEO?

A data warehouse (& data lake) stores and structures data (through data pipelines) and then makes it possible to visualize it. That means it can also help create and power your SEO reporting infrastructure, especially when you’re dealing with lots of different data sources that you’re looking to combine. And if you simply have a ton of data, you’re likely looking for a (quick) solution like this, as it can handle way more than your laptop can.

Some quick arguments for having a warehouse:

  • You need to combine multiple data sources with lots of data.
  • You want to enable the rest of the organization to access the same data.
  • You want to provide a deeper analysis into the inner workings of a vendor or search engine.

It’s not all sunshine and rainbows. Setting up a warehouse can be quick and easy, but maintaining it is where the real work comes in. Data formats and sources that change over time, in particular, create overhead. Be aware of that.

What can I do with a warehouse?

Imagine you’re a marketing team with a strong performance-marketing setup: you advertise in Google Ads and, meanwhile, try to compete for the same keywords in SEO to achieve great click share. It would be even more useful if your reporting could show the total number of clicks in search (not just paid or just organic). By joining the two datasets at scale you can achieve this and visualize progress along the way. Excel/Google Sheets will let you do this once (or repeat the process if you’re a true spreadsheet junkie), but won’t give you daily dashboards that you can easily share with your colleagues. With a warehouse, you’d be able to store data from both ends (Google Ads and Google Search Console), mingle the data, and visualize it later on.
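To make the paid-plus-organic idea concrete, here is a minimal Python sketch of the join. The row layout (date, keyword, clicks), the function name, and the sample numbers are assumptions for illustration, not the actual export schema of Google Ads or Google Search Console. In a warehouse you would typically express the same join in SQL.

```python
from collections import defaultdict


def combine_search_clicks(paid_rows, organic_rows):
    """Sum paid and organic clicks per (date, keyword) pair."""
    totals = defaultdict(lambda: {"paid": 0, "organic": 0})
    for row in paid_rows:
        totals[(row["date"], row["keyword"])]["paid"] += row["clicks"]
    for row in organic_rows:
        totals[(row["date"], row["keyword"])]["organic"] += row["clicks"]

    combined = []
    for (date, keyword), clicks in sorted(totals.items()):
        combined.append({
            "date": date,
            "keyword": keyword,
            "paid_clicks": clicks["paid"],
            "organic_clicks": clicks["organic"],
            "total_clicks": clicks["paid"] + clicks["organic"],
        })
    return combined


# Hypothetical sample rows, not real data
paid = [{"date": "2020-01-01", "keyword": "rv rental", "clicks": 40}]
organic = [
    {"date": "2020-01-01", "keyword": "rv rental", "clicks": 60},
    {"date": "2020-01-01", "keyword": "camper van", "clicks": 10},
]
report = combine_search_clicks(paid, organic)
```

Keywords that only appear in one of the two sources still end up in the report, with zero clicks for the other channel, which is usually what you want for a total-click-share view.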

Seer Interactive wrote a good blog post about their decision to move to their own, homegrown, rank tracking solution. It provides an interesting insight into how they’re leveraging their internal warehouse for some of its data as well.

Do I actually need a warehouse?

Are you a small team, a solo SEO, or do you work on a small site? Make this a hobby project in your time off. At your scale you likely don’t need a warehouse and can accomplish most things by connecting some of your data sources in a product like Google Data Studio. Smaller teams often have fewer (enterprise, duh!) SEO tools in their chest, so there is less data overall. At a smaller scale, a warehouse can easily be avoided and replaced by Google Sheets/Excel or a good visualization tool.

Why Google BigQuery?

Google BigQuery is RVshare’s choice for our marketing warehouse. Alternatives to Google BigQuery are Snowflake, Microsoft Azure, and Amazon’s Redshift. As we had a huge need for Google Ads data and it provided a full export into BigQuery for free, it was a no-brainer for us to start there and leverage their platform. If you don’t have that need, you can replicate most of this with the other services out there. For the sake of this article, as I have experience dealing with BQ, we’ll use that.

What are the costs?

It depends, but let me give you an insight into the actual costs of the warehouse for us. Google Search Console and Bing Webmaster Tools are free. Botify, Nozzle (SaaS pricing here), and Similar.ai are paid products, and you’ll require a contract agreement with them.

  • Google Search Console & Bing Webmaster Tools: Free.
  • Nozzle, Similar.ai, Botify: Requires contract agreements, reach out to me for some insight if you’re truly curious and seriously considering purchasing them.
  • StitchData: Starting at $1,000/year, depending on volume, although you’re likely fine with the minimum plan for just one data source.
  • Supermetrics: $2,280/year; this is for their Google BigQuery license, which helps export Google Search Console data. There are cheaper alternatives, but for legacy reasons it’s not worth it for us to switch providers.
  • Google Cloud Platform – Google BigQuery: Storage in BigQuery is affordable, especially if you’re just importing a handful of data sources, so having the data itself is cheap; it only gets expensive with larger data sets. If you optimize the way you process and visualize the data afterwards, you can also save a lot on costs. Querying/analysis costs on average $5 per TB processed, and with small date ranges and only a few selected columns it’s hard to reach that volume quickly.
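As a rough back-of-the-envelope check on the $5 per TB figure quoted above, the arithmetic can be sketched in Python. The price is the one mentioned in this post and may no longer be current, treating a TB as 2^40 bytes is an assumption, and the query size and frequency are made-up numbers.

```python
PRICE_PER_TB_USD = 5.0   # figure quoted in this post; check current pricing
BYTES_PER_TB = 2 ** 40   # assumption: 1 TB counted as 2^40 bytes
BYTES_PER_GB = 2 ** 30


def estimate_query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimate the cost in USD of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# Hypothetical dashboard query: scans 20 GB and runs once a day for 30 days
monthly_cost = estimate_query_cost(20 * BYTES_PER_GB) * 30
print("Estimated monthly cost: ${:.2f}".format(monthly_cost))
```

Under these assumptions the dashboard costs roughly three dollars a month, which is why narrowing date ranges and selecting only the columns you need keeps the bill small.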

Loading Vendors into Google BigQuery

A few years ago, you needed to develop your own data pipelines to stream data into Google BigQuery (BQ) and maintain the pipeline from the vendor to BQ yourself. This caused a lot of overhead and required having your own (data) engineers. Those days are clearly over, as plenty of SaaS vendors can facilitate this process for you at reasonable prices, as we just learned.

Bing Webmaster Tools & Google Search Console

Search Analytics reports from both Google and Bing are extremely useful as they provide insight into volume, clicks, and CTR %. This helps you directly optimize your site for the right keywords. Both platforms have their own APIs that enable you to pull search analytics data from them. While Google’s is widely used and available through most data connectors, the Bing Webmaster Tools API is a different story. Find the resource link below for more context on how to load this data into your warehouse, as more steps are involved (and still nobody knows what type of data that API actually returns).

Resources

→ Saving Bing Search Query Data from the Bing Webmaster Tools API

→ Saving Google Search Console data with StitchData or Supermetrics

→ Alternatively, read about the Google Search Console API here to implement a pipeline yourself
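If you go the do-it-yourself route with the Google Search Console API, the two pieces you end up writing are a request body and a step that flattens the response into warehouse-friendly rows. The sketch below shows both in Python. The field names follow the Search Analytics query method as I understand it, so verify them against the API documentation. Authentication and the actual HTTP call are left out, and the function names and sample response are hypothetical.

```python
def build_search_analytics_request(start_date, end_date, row_limit=1000, start_row=0):
    """Build the request body for a Search Analytics query (dates as YYYY-MM-DD)."""
    return {
        "startDate": start_date,
        "endDate": end_date,
        "dimensions": ["date", "query"],
        "rowLimit": row_limit,
        "startRow": start_row,  # increase this to page through larger result sets
    }


def flatten_rows(response, dimensions=("date", "query")):
    """Turn the nested API rows into flat dicts, one per warehouse table row."""
    flat = []
    for row in response.get("rows", []):
        # "keys" holds the dimension values in the order they were requested
        record = dict(zip(dimensions, row["keys"]))
        record.update({
            "clicks": row["clicks"],
            "impressions": row["impressions"],
            "ctr": row["ctr"],
            "position": row["position"],
        })
        flat.append(record)
    return flat


# Hand-written sample shaped like an API response (not real data)
sample_response = {
    "rows": [
        {"keys": ["2020-01-01", "rv rental"], "clicks": 12,
         "impressions": 300, "ctr": 0.04, "position": 3.2},
    ]
}
body = build_search_analytics_request("2020-01-01", "2020-01-07")
records = flatten_rows(sample_response)
```

Keeping the flattening step separate from the fetching step makes it easy to test with saved responses before you point it at a real property.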

Rank Tracking: Nozzle

Nozzle is our solution at the moment for rank tracking, at a relatively small scale. We chose them a few months ago, after having our data primarily in SEMrush, as they had the ability to make all our SERP data available to us via their BigQuery integration.

Technical SEO: Botify

Both at Postmates and RVshare I brought Botify in as it’s a great (enterprise) platform that combines log files, their crawl data, and visitor data with an insight into your technical performance.

Similar.ai

Lesser known is Similar.ai, which provides keyword data and entity extraction. It’s useful when you’re dealing with keywords at a massive scale and want to understand the different categories they fall into, and it comes in especially handy when you want to create topical clusters. With their Google Cloud Storage > Google BigQuery import we’re also able to show this next to our keyword data (from Google Search Console).

Bonus: Google Ads

If you’re advertising in paid search with Google Ads it can be useful to combine organic keyword data with paid data. It’s the reason why I like quickly setting up the Data Transfer Service with Google Ads so all reports are automatically synced. This is a free service between Google Ads and Google BigQuery. More information can be found here.

How to get started?

  1. Figure out which of the tools that you currently use provide a way to export their data.
  2. Create a new project in Google Cloud Platform (or use an existing one) and save your project ID.
  3. Create a new account for StitchData and, where needed, a (paid) account for Supermetrics.
  4. Connect the right data sources to Google BigQuery or your preferred warehouse solution.

Good luck with the setup and let me know if I can help in any way. I’m curious how you’re getting value from your SEO warehouse or what use cases you’d like to solve with it. Leave a comment or find me on Twitter: @MartijnSch


The top metrics & KPIs other teams care about for SEO

SEOs care about many metrics, often the wrong ones (rankings, DA, PA, you name it). What is often forgotten are the metrics that are important to the other teams/departments within their organization. In the end, building a company isn’t done by just one team. Over the years, it’s become clear that you work with many departments simultaneously on the same effort, and often they care about your channel’s metrics. Just not always the same ones, so in this blog post I wanted to shine some additional light on what metrics you should think about for other departments. It’s the follow-up to this tweet that got quite some attention, and this post gives me the ability to go a bit more into depth on the whys.

This list is likely incomplete and is using some generic names. Your organization might have different names or have additional departments that might not be covered here. Hopefully, this gives you a better insight into how to think about various departments related to SEO.

🏢 C-Level

Depending on what type of organization you work in and how broad your C-suite is, you are likely to report at some level into a COO/CMO that cares about the SEO metrics. But often in 100+ person companies, they don’t have the depth anymore to really deep dive into the SEO cases that you’re facing within an SEO team on a day-to-day basis.

Metrics:

  • Revenue, average order value (this number should in most use cases not be too much different from the performance of other channels), and the number of Transactions.
  • Sessions from organic search as an absolute number but also the percentage of total traffic. Primarily the latter as you want to keep a healthy/diverse balance for your marketing mix. Something that I blogged about before.

🛠 Product

What is Product building that you can benefit from, and how are you working with Product to prioritize the most important changes to the product to drive additional growth from organic search? No product is ever finished, so there is always something that you can help prioritize from an SEO point of view.

Metrics:

  • Load time: There has been enough buzz about the importance of site speed for good reason.
  • Number of Pages per Template
  • Growth in Sessions
  • Best Performing Page Segments
  • Conversion Rate from Organic Search, etc.

Not necessarily in that order, but usually, metrics that are impacted with/by the Product organization.

💻 Engineering

Sitespeed, code velocity, sitespeed and load times. Well you get the point. It’s all about how fast the site is and how quickly you can work with an engineering team to get changes that you want fixed implemented.

Metrics:

  • Load times/site speed, traffic to specific sections of the site.
  • Velocity of tickets/items that you want Engineering to implement.

💲 Finance

Metrics that show the potential for growth and the return on investment. In the end, in many companies, Finance is the gatekeeper of money flowing in and out. They want to get a better insight into what you’re spending and how that eventually contributes to the bottom line. Providing a simple version of a P&L for SEO will likely return a happy smile if you’re able to produce that.

Metrics:

  • ROI % (how much you have spent on SEO resourcing: team, tools, other expenses for content) versus what it will return.
  • Budget Spend, Returned Revenue, and future growth.
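A simple version of that ROI calculation could look like the Python sketch below. It assumes ROI is defined as (returned revenue minus spend) divided by spend, which your Finance team may define differently (for example, on margin instead of revenue), and all the numbers are hypothetical.

```python
def seo_roi_percent(returned_revenue, total_spend):
    """ROI % under the assumed definition: (revenue - spend) / spend * 100."""
    if total_spend <= 0:
        raise ValueError("total_spend must be positive")
    return (returned_revenue - total_spend) / total_spend * 100


# Hypothetical yearly SEO spend, split the way the bullet above describes it
spend = {"team": 150_000, "tools": 20_000, "content": 30_000}
total_spend = sum(spend.values())

roi = seo_roi_percent(returned_revenue=500_000, total_spend=total_spend)
print("SEO ROI: {:.0f}%".format(roi))
```

With these made-up numbers the result is an ROI of 150%. Listing the spend per category, as in the dictionary, already gives you most of a simple P&L for the channel.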

🖼  Marketing

Likely your closest allies in the ‘battle of SEO’ together with Product. Depending on the organizational structure, you probably find the SEO team itself here or in Product. So having enough impact on the metrics that your marketing team cares about is important.

📝 Content: Do you have a separate content team? They’ll likely care about the organic traffic coming to their pages, and they should care about the impact on those business metrics too. Besides that, any insight into specific keywords (volume, CTR) is always useful for a team like this to help optimize existing content.

Metrics:

  • Impact on branded search terms: sessions.
  • Increase/decline so you can measure the uplift of other brand awareness campaigns.

🧳   Sales & Business Development

At what scale are you still able to set up partnerships, and does that actually fit into the scope of SEO at scale? Likely the answer is no. That’s why you want to partner with a sales/biz dev team that can help you solidify partnerships with companies in your space. They have better skills for this, and meanwhile you can likely provide them with more useful input on who to go after.

Metrics:

  • The number of big partnerships.
  • A shortlist of partners that you want them to go after, not just for dumb link building (preferably not, in my opinion). Instead, create lasting relationships that impact the industry, TAM (Total Addressable Market), and market presence.

📞  Customer Service

The better and faster you can answer your customers’ questions, the more likely your business is to thrive in today’s environment. Often this means providing the answer directly in search (think featured snippets). You can’t do this alone as an SEO team; you need input from the people in Customer Service, as they’re the ones talking to your customers about the (mainly) negative and positive situations. The more you can support them with the metrics that they care about, the more comfortable both your lives might become.

Metrics:

  • Organic Traffic to Support related pages, the number of calls/chats that you avoid by better-optimized pages, etc.
  • The top questions that you can answer directly via featured snippets.
  • The top 100 pages on your support portal, based on organic search segmentation.

📈 Growth

When I was on the Growth team at Postmates, the insane velocity that was produced there to grow faster was great to see. As SEO often isn’t the fastest-growing channel (especially not in the short term, as I can throw 1M towards PPC tomorrow and create near-instant results), it’s important to show how it’s contributing to the mix of long- and short-term initiatives of a growth team.

Metrics:

  • Growth % of the SEO channel, compared MoM, WoW, or YoY.
  • Long-term contributions to growth, as lots of SEO growth is evergreen and comes at relatively low cost.
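The growth percentages above all come down to the same formula with a different comparison period. Here is a minimal Python sketch; the session numbers are hypothetical and the function name is mine.

```python
def growth_percent(current, previous):
    """Percentage growth of `current` over `previous`; None without a baseline."""
    if previous == 0:
        return None  # growth is undefined when the comparison period had no traffic
    return (current - previous) / previous * 100


# Hypothetical organic sessions per month
sessions = {"2019-03": 80_000, "2020-02": 96_000, "2020-03": 100_000}

mom = growth_percent(sessions["2020-03"], sessions["2020-02"])  # month over month
yoy = growth_percent(sessions["2020-03"], sessions["2019-03"])  # year over year
print("MoM: {:.1f}%, YoY: {:.1f}%".format(mom, yoy))
```

Reporting both numbers side by side helps a growth team see that a modest month-over-month figure can still add up to substantial year-over-year growth.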

👩‍💻 Human Resources

Admittedly, this is one of the departments most distant from SEO, but if you’re a big organization recruiting for dozens or even hundreds of roles, it could be important to help drive traffic to a Careers/Jobs section on the site. If that’s the case, showing how SEO drives job applicants could be incredibly helpful for HR to understand what SEO can do for them.

Metrics:

  • The number of job applicants that applied because they found the jobs via Search.
  • Traffic to a specific segment of the site from Organic Search: Careers/Jobs.
  • The number of pages marked up properly with structured data for Jobs.
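The last metric can be approximated with a small script that counts pages containing a JSON-LD block of type JobPosting. This is only a rough Python sketch: it checks for the presence of the markup, not whether all required properties are filled in, it ignores markup nested in an @graph or written as microdata, and the page contents are made up.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script> blocks
LD_JSON_PATTERN = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)


def has_job_posting_markup(html):
    """Return True if the page has a JSON-LD block with @type JobPosting."""
    for block in LD_JSON_PATTERN.findall(html):
        try:
            data = json.loads(block)
        except ValueError:
            continue  # skip malformed JSON-LD instead of failing the whole run
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "JobPosting":
                return True
    return False


# Hypothetical pages keyed by URL path
pages = {
    "/jobs/seo-manager": (
        '<script type="application/ld+json">'
        '{"@context": "https://schema.org", "@type": "JobPosting", "title": "SEO Manager"}'
        "</script>"
    ),
    "/jobs/designer": "<h1>Designer</h1>",
}
marked_up = sum(has_job_posting_markup(html) for html in pages.values())
print("{} of {} job pages have JobPosting markup".format(marked_up, len(pages)))
```

For a proper validity check you would still run the pages through a structured data testing tool; the count is mainly useful for tracking coverage over time.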

What metrics are missing? What do you measure for your organization? There are so many different business models out there that this list is likely far from complete; B2B cases, for example, are likely missing here.


Saving Bing Search Query Data from the Bing Webmaster Tools’ API

Over the last year, we spent a lot of time working on getting data from several marketing channels into our marketing data warehouse. The series that we did on this with the team has received lots of love from the community (thanks for that!). Retrieving Search Query data from Bing has proven to be one of the ‘harder’ data points: there is a lack of documentation, there are no real connectors directly to a data warehouse, and, as it turns out, the returned data (quality) is … ‘interesting’ to say the least. That’s why I wanted to write this blog post: to provide the code to easily pull your search query data out of Bing Webmaster Tools and give more people the chance to evaluate their data. Hopefully, this provides the overall community with a better insight into the data quality coming out of the API.

Getting Started

  1. Create an account on Bing Webmaster Tools.
  2. Add & Verify a site.
  3. Create an API Key within the interface (help guide).
  4. Save the API Key and the formatted site URL.

The code

These days I spend most of my time (whenever I get to write code) coding in Python, which is why this script is written in it.

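import csv
import datetime
import re

import requests

URL = "https://example.com"  # the site URL as it is verified in Bing Webmaster Tools
API_KEY = ''  # the API key created in the Bing Webmaster Tools interface

request_url = "https://ssl.bing.com/webmaster/api.svc/json/GetQueryStats"

# Pass the parameters separately so requests URL-encodes the site URL
request = requests.get(request_url, params={"apikey": API_KEY, "siteUrl": URL})
if request.status_code == 200:
    query_data = request.json()

    # newline='' keeps the csv module from writing blank lines between rows on Windows
    with open("bing_query_stats_{}.csv".format(datetime.date.today()), mode='w', newline='') as new_file:
        write_row = csv.writer(new_file, delimiter=',', quotechar='"')
        write_row.writerow(['AvgClickPosition', 'AvgImpressionPosition', 'Clicks', 'Impressions', 'Query', 'Created', 'Date'])

        for key in query_data["d"]:
            # Dates arrive as "/Date(<milliseconds since epoch>...)/"; capture only the leading digits
            match = re.search(r'/Date\((\d+)', key["Date"])

            write_row.writerow([key["AvgClickPosition"] / 10,
                                key["AvgImpressionPosition"] / 10,
                                key["Clicks"],
                                key["Impressions"],
                                key["Query"],
                                datetime.datetime.now(),
                                datetime.datetime.fromtimestamp(int(match.group(1)) // 1000)])
else:
    # Surface failures instead of silently writing nothing
    print("Request failed with status code {}".format(request.status_code))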

Or find the same code here in a Gist file on GitHub.

Steps to take

  • Make sure the one third-party dependency, requests, is installed; everything else the script imports ships with Python.
    • pip install requests
  • Enter the API Key and Site URL in the constants at the top of the script, then run it: python bing_query_stats.py
  • If everything is successful the information is saved in this file: bing_query_stats_YYYY-MM-DD.csv

Data Quality

As I mentioned in the intro, the data quality is questionable and leaves a lot to the imagination. It’s one of the reasons why I wanted to share this script, so others can get their data out and we can hopefully learn more together about what the data represents. The big caveat seems to be that the data is exported at the time of extraction with a date range of XX days, and it’s not possible to select a date range yourself. This means that you can only make this data useful if you save it over a longer period of time and calculate daily performance based on that. This is all doable in our setup, where we’re using Airflow to save the data into our Google BigQuery data lake, but because it isn’t as straightforward, it might be harder for others.
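One way to turn repeated exports into a usable daily table is to keep, for every date and query, only the row from the most recent export. The Python sketch below does that using the Date, Query, and Created columns written by the script above. The assumption that a later export supersedes an earlier one for the same date is mine, given how unclear the data is, and the sample rows are made up.

```python
def latest_snapshot_per_day(rows):
    """Keep, for every (Date, Query) pair, the row from the most recent export."""
    latest = {}
    for row in rows:
        key = (row["Date"], row["Query"])
        # Timestamps in the same "YYYY-MM-DD HH:MM:SS" format compare correctly as strings
        if key not in latest or row["Created"] > latest[key]["Created"]:
            latest[key] = row
    return sorted(latest.values(), key=lambda r: (r["Date"], r["Query"]))


# Hypothetical rows collected from two exports on consecutive days
rows = [
    {"Date": "2020-03-01", "Query": "rv rental", "Clicks": 5, "Created": "2020-03-02 08:00:00"},
    {"Date": "2020-03-01", "Query": "rv rental", "Clicks": 7, "Created": "2020-03-03 08:00:00"},
    {"Date": "2020-03-02", "Query": "rv rental", "Clicks": 4, "Created": "2020-03-03 08:00:00"},
]
daily = latest_snapshot_per_day(rows)
```

If it turns out that the API reports rolling totals rather than per-day values, you would instead subtract consecutive snapshots, which is exactly the kind of thing that comparing saved exports over time should reveal.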

So please share your ideas on the data and what you ran into with me via @MartijnSch


Starting & Growing SEO Teams

“Tell me, who/what do I need to hire to run our SEO program? What is the first hire for a new SEO team?” Questions I often get, usually followed by: “Do you know anybody for our team?” As so many companies around the Bay Area are hiring, these questions make sense, and all that hiring also makes it harder. I’ve previously blogged about writing a better job description for SEO roles, but I also wanted to shed some light on what I’d suggest as good setups for an SEO team and what roles + seniority to hire for, depending on your company structure.

Why do you need SEO support?

Most startup founders or early employees don’t have an extensive background in Marketing or specifically SEO (and they shouldn’t). Most of the time, they have been too busy building the company, getting to product-market fit, and iterating on their product/service. In a lot of growing B2C companies, you need to establish plans for long-term growth. That’s what SEO can usually bring to these companies: a sustainable long-term growth strategy. But in order to get there, you’ll need to bring in extra help to make sure that it actually is sustainable, instead of employing short-term SEO tactics that might put your growth at risk if you approach it wrong (as many startups do).

Why create an SEO Team?

So why do you need to create an SEO team? For many of us this is common knowledge, as we’re in this on a daily basis. But let’s say you’re just getting started; these could be some of the objectives:

  • Dedicated focus on SEO; there are too many other channels to take care of.
  • Need to grow a long-term channel to success.
  • Too many tasks; SEO needs its own dedicated, specialized IC/team.
  • Build out more brand awareness for the company (SEO is a great way of doing this long term).
  • Grow revenue & transactions at a low Customer Acquisition Cost.

Consultant versus Hiring Inhouse

Some teams can’t always hire talent right away (think about the Bay Area where basically all the bigger companies constantly have a need to hire SEO talent) or it might take too long to ramp up SEO. In some other cases, it made more sense for the company to hire a consultant in the short term to take care of some issues and figure out what’s actually needed instead of just hiring somebody with potentially the wrong skill set for the long term.

My take on this is usually that if you already know what you want your SEO team to work on and are able to wait another 2-3 months, you’re better off hiring somebody in-house (if resources are available). In other cases (you have a short-term need, or you need a technical SEO but want to hire a content person, etc.), you’re likely better off starting with a consultant specialized in that area to make sure those issues are covered.

Also, read my blogpost on levels & seniority in SEO roles.

Hiring SEOs isn’t good enough: resources!

Provide them with resources. When I joined Postmates, one of the things that I wanted to make sure of was that they provided me not just with resources to set up some tools, but also with engineers to run the actual implementations and a designer to create the new pages that we needed.

  • Engineering: It’s important, as you the SEO can’t make all the changes yourself, you’ll need the team to make actual changes. Most SEOs that I meet don’t have the knowledge about their infrastructure to actually push code or design something that complies with brand guidelines.
  • Design: You need to add additional content, you need more blog posts, but they can’t just be text. There need to be visual add-ons to it, so you need design support.
  • Content: In a bigger company there will be an actual huge need for content (either new or to edit existing content). 

How have you been growing SEO teams, what is missing, what should SEO teams really focus on? Leave a comment!


Announcing my Technical SEO Course on CXL Institute

If there was one thing that I could teach people in SEO, it was always the technical side of SEO that came up first. Mostly because I think it’s a skill that doesn’t suit too many SEOs, and there is already enough (good or bad, you’ll be the judge of that) content out there about the international, link building, or content side of SEO. As technical SEO is getting more and more technical and in-depth, I’m excited to announce that I’m launching a new technical SEO course with the folks at CXL Institute.

The course will cover everything from structured data to XML sitemaps and back to some more basic on-page optimization. Along the way, I show you my process for auditing a site and coming up with the improvements. I’ll try to teach you about as many different issues and solutions as I can think of.

It’s not going to be ‘the most complete’ course ever on this topic: technical SEO evolves quickly, and some things will likely already be outdated now that it’s published, even though we worked on it for months. But I’m going to do my best to inform you here and on CXL Institute about any changes or improvements that we might be able to make in a future version. If you have any questions about the course or want to cheer me on, reach out via Twitter on @MartijnSch.