Saving Bing Search Query Data from the Bing Webmaster Tools’ API

Over the last year, we spent a lot of time working on getting data from several marketing channels into our marketing data warehouse. The series that we did on this with the team has received lots of love from the community (thanks for that!). Retrieving Search Query data from Bing has proven to be one of the ‘harder’ data points: there is a lack of documentation, there are no real connectors that go directly into a data warehouse, and as it turns out, the quality of the returned data is … ‘interesting’, to say the least. That’s why I wanted to write this blog post: to provide the code to easily pull your search query data out of Bing Webmaster Tools and give more people the chance to evaluate their data. Hopefully, this provides the overall community with better insight into the data quality coming out of the API.

Getting Started

  1. Create an account on Bing Webmaster Tools.
  2. Add & Verify a site.
  3. Create an API Key within the interface (help guide).
  4. Save the API Key and the formatted site URL.

The code

These days I spend most of my time (whenever I get to write code) coding in Python, which is why the script below is written in Python.

import csv
import datetime
import json
import re

import requests

# The verified site URL and the API key from Bing Webmaster Tools.
URL = "https://example.com"
API_KEY = ''

request_url = "https://ssl.bing.com/webmaster/api.svc/json/GetQueryStats?apikey={}&siteUrl={}".format(API_KEY, URL)

response = requests.get(request_url)
if response.status_code == 200:
    query_data = json.loads(response.text)

    # Write one CSV file per extraction day.
    with open("bing_query_stats_{}.csv".format(datetime.date.today()), mode='w', newline='') as new_file:
        write_row = csv.writer(new_file, delimiter=',', quotechar='"')
        write_row.writerow(['AvgClickPosition', 'AvgImpressionPosition', 'Clicks', 'Impressions', 'Query', 'Created', 'Date'])

        for key in query_data["d"]:
            # The API returns dates as "/Date(1589500800000)/" (epoch milliseconds), so extract the timestamp.
            match = re.search(r'/Date\((\d+)\)/', key["Date"])

            write_row.writerow([key["AvgClickPosition"] / 10,  # positions appear to come back multiplied by 10
                                key["AvgImpressionPosition"] / 10,
                                key["Clicks"],
                                key["Impressions"],
                                key["Query"],
                                datetime.datetime.now(),  # timestamp of this extraction
                                datetime.datetime.fromtimestamp(int(match.group(1)) // 1000)])

Or find the same code here in a Gist on GitHub.

Steps to take

  • Make sure you have the one external dependency installed: requests (csv, json, re, and datetime are part of the Python standard library).
    • pip install requests
  • Enter the API Key and Site URL in the constants at the top of the script, then run it: python bing_query_stats.py
  • If everything is successful the information is saved in this file: bing_query_stats_YYYY-MM-DD.csv

Data Quality

As I mentioned in the intro, the data quality is questionable and leaves a lot up to the imagination. It’s one of the reasons why I wanted to share this script, so others can get their data out and we can hopefully learn more together about what the data represents. The big caveat seems to be that the data is exported at the time of extraction with a date range of XX days, and it’s not possible to select a date range. This means that you can only make this data useful if you save it over a longer period of time and calculate daily performance based on that. This is all doable in the setup we have, where we’re using Airflow to save the data into our Google BigQuery data lake, but because it isn’t as straightforward, this might be harder for others.
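
If you want to go the same route, here is a minimal sketch of appending each daily snapshot to BigQuery. It assumes the google-cloud-bigquery client library; the project, dataset, and table names are hypothetical.

# Minimal sketch: append today's CSV snapshot to a BigQuery table so daily
# performance can be derived later. Project/dataset/table names are hypothetical.
import datetime

from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.marketing_data_lake.bing_query_stats"  # hypothetical

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,          # skip the header row written by the script
    autodetect=True,              # let BigQuery infer the schema
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # keep the history
)

file_name = "bing_query_stats_{}.csv".format(datetime.date.today())
with open(file_name, "rb") as source_file:
    load_job = client.load_table_from_file(source_file, table_id, job_config=job_config)

load_job.result()  # wait for the load to finish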

So please share your ideas on the data and what you ran into with me via @MartijnSch


Case Study: How Restructuring 6800 Content Pieces For SEO Worked

I presented this content about a week ago to the Traffic Think Tank community (highly recommended), and after a Twitter thread on the topic as well, it’s time to turn it into a blog post.

Sometimes you have to take a stand and make something better when it’s already performing well. Over the last few months, the RVshare marketing team worked on some great projects; one of them that I was involved in was restructuring 6800 pieces of content that we created a while ago. The content, and the pages it lived on, was performing outstandingly well (growing +100% YoY without any real effort), but we wanted to do more, to help users and boost SEO traffic. So we got started…

Why restructure content?

A couple of years ago, we published the last WordPress page/post in a series of 600+. The intent: go after a category near and dear to the core of the RVshare business and help more people rent an RV. We did that by creating tons of articles specifically for cities/areas. Now, over two and a half years later, the content is driving millions of people yearly, mainly from SEO, but we knew that there was more to be had, as it’s not our core business. We also weren’t leveraging all the SEO features that have become available over the last two years: think additional structured data like FAQs, but also monetization that we thought was important. All improvements that we would have to go back into every post for if we wanted to take advantage of them.

What we did: leveraging Mechanical Turk

One of the biggest obstacles wasn’t necessarily rebuilding pages, coming up with a better design, etc. We have a great team that is nailing this on a daily basis. But having to deal with 650 posts that each contained ten sub-elements was a struggle. The content was structured in a similar way, but some quick proofs of concept showed that scraping wasn’t the solution, as the error ratio was way too high. As with most projects, we wanted to ensure that the content could be restructured at low cost so that the project would still have a valid business case (does the actual opportunity outweigh the potential costs to restructure the content?).

Scraping versus Mechanical Turk

As we had initially structured the content the same way (headline, description, etc.), we at least had a way to get the data out. But when we did some testing to see if we would be able to scrape it, things looked unfortunate: there were too many edge cases, as the HTML around the content was barely structured enough to get the actual content out of it.

We looked into Mechanical Turk as the second option, as it gave us the ability to quickly get thousands of people on a task to look at the content and take out what we needed. We wrote the briefing, divided the project into a few chunks, and within 10-12 hours, we had the content individualized per piece. We did our best to handle most of the data cleaning directly in the briefing and the form the workers filled out, but we also had some cleaning scripts ready. After it was cleaned, we imported the data into our headless CMS Prismic.

How to do this yourself?

  1. Create an account on Mechanical Turk.
  2. Create a project focused around content extraction.
  3. Identify what kind of content you want individualized. It works best if there is an existing structure (list format, table) that the workers can follow; this way, you can tell them to pick up content pieces X, Y, and Z for a specific URL.
  4. Identify the fields that you want to be copied.
  5. Upload a list of URLs that you want them to cover and, additionally, the position (#) that each item has on the list.
  6. Start the project and verify the results.
  7. Upload the data automatically back into your CMS (we used a script that could directly put the content as a batch into our headless CMS Prismic.io)

Rebuilding

We decided to build the content from the ground up, which meant:

  • Build out category pages with the top content pieces by state.
  • Build out the main index page with the top content from all states.
  • Build the ability to showcase this content on all of our other templated pages across RVshare.

Building out the specific templates gave us additional power to streamline internal linking, create better internal relevance, and build out structured data, but mainly to figure out the right way to leverage a headless CMS with all its capabilities instead of just having raw (read: ‘dumb’) content that can’t be appropriately structured. We already use the headless CMS Prismic.io for this, in which you can create custom post types, as you can see in this screenshot. You define the custom post type and pick the kind of fields that you want, which makes it behave like just another CMS after that. The content can then be retrieved through their API.
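
To give an idea of what that looks like in practice, here is a rough sketch of pulling documents of a custom type back out of the Prismic REST API with Python. The repository name and the custom type are made up for illustration.

# Rough sketch: fetch documents of a custom type from the Prismic REST API.
# The repository name and the custom type "campground_guide" are hypothetical.
import requests

API_ENDPOINT = "https://your-repo.cdn.prismic.io/api/v2"  # hypothetical repository

# The API requires a ref (a pointer to a content release); grab the master ref first.
api_info = requests.get(API_ENDPOINT).json()
master_ref = next(ref["ref"] for ref in api_info["refs"] if ref.get("isMasterRef"))

# Query all documents of the custom type, 100 per page.
response = requests.get(
    API_ENDPOINT + "/documents/search",
    params={
        "ref": master_ref,
        "q": '[[at(document.type,"campground_guide")]]',
        "pageSize": 100,
    },
)

for doc in response.json().get("results", []):
    print(doc["id"], doc["type"])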

How to do this yourself?

We were previously leveraging WordPress ourselves, but all entities were saved as a single post. If you’re able to do this differently and save pieces individually, it becomes many times easier to create overview pages by using categories (and/or tags). This isn’t always something you can do right away without development support.

Results

Because of the design changes, engagement increased by over 25% with the new format, and monetization is making it more interesting to keep iterating on the results. Sessions were unfortunately really hard to measure: we launched the integrations a few weeks prior to the start of COVID-19, resulting in a downward spiral and then a surge in demand right after. Hopefully, in the long term, we’ll be able to tell more about this. We are sure, though, that we didn’t suffer in SEO results.


Want to see the new structure of the pages? You can find it here in our piece on the top 10 campgrounds across the United States.


Part 5: Airflow on Google Cloud Composer – Building a Marketing Data Lake and Data Warehouse on Google Cloud Platform

In the previous blog posts (part 1, part 2, part 3, and part 4) in this series, we talked about why we decided to build a marketing data warehouse. This endeavor started by figuring out how to deal with the first part: making the data lake. In this fifth blog post, a more technical one, I’ll give some insights into how we’re leveraging Apache Airflow to build the more complicated data pipelines, and I’ll give you some tips on how to get started.

This blog post is part of a series of five? (maybe more, you never know), in which we’ll dive into the details of why we wanted to create a data warehouse, how we created the data lake, and how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @Hussain / MarketLytics, whom we’ve worked alongside during this (ongoing) project.

Getting Started with Cloud Composer

Cloud Composer is part of Google’s Cloud Platform and brings you most of the upside of using Apache Airflow (open source) and barely any of the downsides (setup, maintenance, etc.). Or, to follow their main USP: “A fully managed workflow orchestration service built on Apache Airflow.” While we had worked with Airflow before, we weren’t looking forward to spending time worrying about its management, as we planned to spend most of our time setting up and maintaining the data pipelines. In the end, all you have to do is stick to creating pipelines (DAGs).

What is it suitable for?

Say you want to load data from the Google Analytics API, store it locally, translate some values to something new, and have it available in Google BigQuery. However you would build it, it’s multiple tasks and functions that depend on each other. You wouldn’t want to load the data into BigQuery when the data hasn’t been cleaned yet (trash in, trash out, sounds familiar?). With Airflow, the next task is only processed if the previous step was successful.
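
To make that concrete, here is a minimal sketch of such a pipeline as an Airflow DAG. The task callables are placeholders, and the imports follow the Airflow 1.10 style that Cloud Composer shipped with at the time.

# Minimal sketch of the pipeline described above as an Airflow DAG.
# The callables (extract_ga, clean_data, load_to_bq) are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract_ga(**context):
    """Pull data from the Google Analytics API and store it locally / in GCS."""


def clean_data(**context):
    """Translate and clean values before they are allowed into BigQuery."""


def load_to_bq(**context):
    """Load the cleaned file into Google BigQuery."""


with DAG(
    dag_id="ga_to_bigquery",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_ga", python_callable=extract_ga, provide_context=True)
    clean = PythonOperator(task_id="clean_data", python_callable=clean_data, provide_context=True)
    load = PythonOperator(task_id="load_to_bq", python_callable=load_to_bq, provide_context=True)

    # Each task only runs if the previous one succeeded (Airflow's default trigger rule).
    extract >> clean >> load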

Tasks

Tasks are, in almost every case, just one thing: get data from BigQuery, upload a file from GCS into BigQuery, download a file from Cloud Storage to local, process data. What makes Airflow very efficient to work with is that for the majority of data processing tasks there are already pre-built operators. Several of the tasks I just listed map directly to operators (GoogleCloudStorageDownloadOperator, GoogleCloudStorageToBigQueryOperator) that work like ready-made functions.
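
For example, the “upload a file from GCS into BigQuery” task can be a single pre-built operator. The bucket, object, and table names below are hypothetical, and exact import paths and parameters differ slightly between Airflow versions (this is the 1.10-era contrib operator).

# A whole task handled by one pre-built operator: load a CSV from Cloud
# Storage into BigQuery. Bucket, object, and table names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

with DAG("gcs_to_bq_example", start_date=datetime(2020, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    load_ga_export = GoogleCloudStorageToBigQueryOperator(
        task_id="load_ga_export_to_bq",
        bucket="your-marketing-bucket",              # hypothetical bucket
        source_objects=["exports/ga_sessions.csv"],  # hypothetical object
        destination_project_dataset_table="your-project.marketing.ga_sessions",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
    )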

Versus Google Cloud Functions

If you mainly run very simple ‘pipelines’ that only consist of one function that needs to be executed, or you only have a handful of use cases, it is likely overkill to leverage Cloud Composer; the costs might be too high, and you still have the overhead of DAGs. In that case, you might be better off with Google Cloud Functions, where you can write similar scripts and trigger them with Google Cloud Scheduler to run at a specific time.
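
As a sketch, such a single-purpose, HTTP-triggered Cloud Function (Python runtime) that Cloud Scheduler hits on a cron schedule is not much more than this; the actual work is a placeholder.

# Sketch of a single-purpose, HTTP-triggered Cloud Function that Cloud
# Scheduler can call on a schedule. The work it does is a placeholder.
def load_daily_report(request):
    """Entry point for the Cloud Function; `request` is a Flask request object."""
    # ... fetch data from an API and write it to BigQuery / Cloud Storage ...
    return "ok", 200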

Costs

The costs for Google Cloud Composer are doable: a basic setup is around 450 dollars (if you run the instances 24 hours a day, 7 days a week), as you leverage multiple (a minimum of three) small instances. For more information on the costs, I would point to this pricing example.

Building Pipelines

See above for an example data pipeline: in typical Airflow fashion, every task depends on the previous task. In other words, notify_slack_channel would not run if any of the previous tasks failed. All tasks run in a particular order, from left to right. In most cases, data pipelines become more complicated than this, as you can have multiple flows going on at the same time and combine them at the end.
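
That fan-in pattern looks roughly like the sketch below; DummyOperator stands in for the real work, and the task names are placeholders.

# Sketch: two branches fan in, and notify_slack_channel only runs when every
# upstream task succeeded. DummyOperator stands in for the real work here.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG("fan_in_example", start_date=datetime(2020, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    fetch_orders = DummyOperator(task_id="fetch_orders")
    fetch_ga_data = DummyOperator(task_id="fetch_ga_data")
    combine_results = DummyOperator(task_id="combine_results")
    notify_slack_channel = DummyOperator(task_id="notify_slack_channel")

    # Both branches must finish successfully before combining and notifying.
    [fetch_orders, fetch_ga_data] >> combine_results >> notify_slack_channel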

Tips & Tricks

Google Cloud Build, Repositories

The files for Google Cloud Composer are saved in Google Cloud Storage. That is smart in itself, but at the same time, you want them to live in a Git repository so you can efficiently work on them together. By leveraging this blog post, you’re able to connect the Cloud Storage bucket to a repository and set up a sync between the two. This basically helps you build a deployment pipeline and makes sure that only production-ready code from your master branch ends up in GCS.

Managing Dependencies

After working with it for a few months now, I’m still not sure if managing dependencies through Google Cloud Composer is a good or a bad thing, as it creates some obstacles if you want to run a deployment and add some Python libraries (your servers could be down 10-30 minutes at a time). For other setups, this usually is a bit smoother and creates less downtime.

Sendgrid for Email Alerts

One of the upsides of Apache Airflow is that it sends alerts upon failure of tasks. Make sure to set up the Sendgrid notifications while you’re setting up Google Cloud Composer; it’s the most straightforward way of receiving email alerts (for free, as in most cases you shouldn’t get too many failure emails).
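
On the DAG side, the alerts are then driven by the default_args you pass in; the address below is a placeholder.

# Once email is configured for the environment, failure alerts come from the
# DAG's default_args. The address is a placeholder.
default_args = {
    "owner": "marketing-data",
    "email": ["data-alerts@example.com"],
    "email_on_failure": True,   # send an email when a task fails
    "email_on_retry": False,    # but not on every retry
    "retries": 1,
}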

README

Document the crap out of your setup and DAGs. When I took over some of the pipelines that were used at Postmates for XML sitemap generation, it was a nightmare: the code was hard to read, it didn’t make a lot of sense, and we had to refactor certain things just because of that. As pipelines (just like regular code) can sometimes be left untouched/unviewed for months (they literally sometimes have only one job), you want to make sure that you can come back and understand what happens inside the tasks.


Again… This blog post is written with the help of @RickDronkers and @Hussain / MarketLytics, whom we’ve worked alongside during this (ongoing) project.


Part 4: Visualization with Google DataStudio – Building a Marketing Data Lake and Data Warehouse on Google Cloud Platform

In the previous blog posts (part 1, part 2, and part 3) in this series, we talked about why we decided to build a marketing data warehouse. This endeavor started by figuring out how to deal with the first part: building the data lake. In this fourth blog post, we’ll chat about how we are visualizing all the data we saved in the previous steps by using Google DataStudio.

This blog post is part of a series of four? (maybe more, you never know), in which we’ll dive into the details of why we wanted to create a data warehouse, how we created the data lake, and how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @Hussain / MarketLytics, whom we’ve worked alongside during this (ongoing) project.

How we build dashboards

Try to think ahead about what you need: date ranges, data/date comparisons, filters, and what type of visualization. This will help you build a better first version right away. What that mainly looked like for us:

  • Date ranges: The business is so seasonal that our Year over Year growth is most important for RVshare, and since we often don’t get to see all the context on metrics on a weekly basis, we default to 30 days.
  • Filters: For some channels (PPC, Social), it’s more relevant to be able to filter down the data on a campaign or social network level. Because in most cases the aggregate level doesn’t tell the whole story right away.
  • Visualization: We need the top metrics (sessions and revenue) in view right away, with the YoY comparison, so we know within seconds what is going on and where things can be improved.

Talking to Stakeholders (Part Deux)

In the first blog post, we talked about connecting with our stakeholders (mainly our channel owners) and gathering their feedback to build the initial versions of their dashboards (beginning with the end in mind). We used this approach to put the first charts, tables, and graphs on the dashboards, after which we connected back with the owners to see what data points were missing and, in some cases, to validate the data that they were seeing on their dashboards. This helped us get additional feedback for fast follows and made for quick iterations on data that we already had and could show. For social media, as an example, it turned out that we wanted to show additional metrics that we hadn’t thought of initially but that were in our data lake anyway. These sessions provided a good way for us to build additional pieces into our data warehouse while we were at it. These days, some of these dashboards are used weekly to report to other teams in the organization or within the team itself.

Best Practices

Blended Data

Do you want to blend data in Google DataStudio, or do you want to create synced/aggregate tables in BigQuery? For most of our use cases, we have opted for using DataStudio to create blended (JOINed) data sources. It’s easier: we can quickly pull some new data together versus having to deal with the data structures and complicated queries. In some cases, we noticed while building dashboards that we were missing data in our warehouse tables (not the lake) and were able to make adjustments/improvements to them.

Single Account Owner

Because we work with Rick and Hussain as ‘third parties’, we opted for using one shared owner account. Transferring owner access is incredibly hard when it’s a Google Apps account, so we made sure that the dashboards are owned through an @rvshare.com account. It’s not a big topic, but it could cause tons of headaches in the long term.

Keep It Simple St*pid

Your stakeholders probably have the desire (and the time) to look at less than you think. Instead of having them jump through too many charts, start simple and add more based on feedback if they want to see more; less is more in this case.

This has the added benefit of them feeling engaged and more interested in using it. In our own case, we leverage our reporting on a weekly basis for a team meeting, which already makes it a more frequently used set of dashboards.

Calculated Fields – Yay or Nay?

As we built most of the tables that we leverage in DataStudio from scratch during our ETL process, we had the opportunity to decide whether we wanted to leverage calculated fields in DataStudio or do the work in the BigQuery queries themselves. Honestly, the answer wasn’t easy, and as we made modifications to the dashboards, it became clear that setting them up in DataStudio wasn’t always scalable or easy: with data modifications or changes in the underlying tables, the calculated fields are removed.

Google BigQuery

Tables or Queries? In our case, we often used the table information from BigQuery and the specific columns in there to drive the visualization in DataStudio. The alternative for some of them is to directly query the data in BigQuery; with the BI Engine reservation that we have, we can speed up intense queries rather easily.


Again… This blog post is written with the help of @RickDronkers and @Hussain / MarketLytics, whom we’ve worked alongside during this (ongoing) project.


Part 3: Transforming Into a Data Warehouse – Building a Marketing Data Lake and Data Warehouse on Google Cloud Platform

In the previous blog posts (part 1 and part 2) in this series, we talked about why we decided to build a marketing data warehouse. This endeavor started by figuring out how to deal with the first part: building the data lake. In this post, we’ll go into a bit more detail on how you can do this yourself, as we transformed our marketing data lake into an actual data warehouse.

This blog post is part of a series of four? (we found enough content to add more articles ;-)), in which we’ll dive into the details of why we wanted to create a data warehouse, how we created the data lake, and how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @hu_me / MarketLytics, whom we’ve worked alongside during this (ongoing) project.

The Process of Building a Data Warehouse

In our endeavor of building a data warehouse, we had a couple of big initiatives that we first wanted to get done. We needed some reporting and visualization tables, and aligned with that, we needed to make sure that we had data that was cleaned for other purposes (deduplication, standardization: some typical ETL problems).

In order to streamline the processes, we used three different ways of getting the data into shape:

Google Cloud Functions

Google Cloud Functions is used both for transforming our data and for loading our initial data for a few use cases. Early on, we noticed that not every vendor was available through regular data loading platforms like StitchData. An example of that was Google Search Console. As we didn’t want to run additional infrastructure just for dealing with Load scripts, we leveraged Cloud Functions to run a daily script (with support from Cloud Scheduler to trigger it daily).
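
A rough sketch of what such a function could look like is below. It uses the Search Console (Webmasters) API and the BigQuery client library; the site URL and table name are hypothetical, and it assumes the function’s default service account has access to both the Search Console property and the BigQuery dataset.

# Rough sketch of a scheduler-triggered Cloud Function that pulls one day of
# Google Search Console data and appends it to a BigQuery table.
# Site URL and table name are hypothetical; auth comes from default credentials.
import datetime

from google.cloud import bigquery
from googleapiclient.discovery import build

SITE_URL = "https://www.example.com/"                      # hypothetical
TABLE_ID = "your-project.marketing_data_lake.gsc_queries"  # hypothetical


def load_search_console(request):
    """HTTP entry point, triggered daily by Cloud Scheduler."""
    day = (datetime.date.today() - datetime.timedelta(days=3)).isoformat()  # GSC data lags a few days

    service = build("webmasters", "v3")
    response = service.searchanalytics().query(
        siteUrl=SITE_URL,
        body={
            "startDate": day,
            "endDate": day,
            "dimensions": ["date", "query", "page"],
            "rowLimit": 25000,
        },
    ).execute()

    rows = [
        {
            "date": row["keys"][0],
            "query": row["keys"][1],
            "page": row["keys"][2],
            "clicks": row["clicks"],
            "impressions": row["impressions"],
            "ctr": row["ctr"],
            "position": row["position"],
        }
        for row in response.get("rows", [])
    ]

    bigquery.Client().insert_rows_json(TABLE_ID, rows)
    return "ok", 200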

After loading the data, we also Transform some other tables from our marketing data lake into new production tables using Cloud Functions. All of our scripts are currently written in Python or Node.js, but as Cloud Functions supports multiple languages, it gives us the flexibility to leverage others over time.

Backfill: As Functions can easily be rewritten and tested within the interface, they also provide a good way to backfill data, since we can simply adjust the dates that a script needs to run for.

Scheduled Queries

In some other cases, we can also leverage Google BigQuery’s scheduled queries. In a few instances, we just want to load the data from raw data lake tables into a production table. Mainly because we don’t always need all the columns, we can limit the data we pull in and clean the data in the query itself. In that case, scheduled queries come in pretty handy, as they run on a set schedule, can be easily updated, and already point towards another dataset and table.
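
Scheduled queries are usually set up in the BigQuery UI, but they can also be created programmatically through the BigQuery Data Transfer Service client. The sketch below uses placeholder project, dataset, and query names, and exact parameters may differ slightly between client library versions.

# Sketch: creating a scheduled query with the BigQuery Data Transfer client.
# Project, dataset, table, and query are placeholders.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = client.common_project_path("your-project-id")

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="marketing_warehouse",
    display_name="Daily load: bing query stats",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT Query, Clicks, Impressions FROM `your-project.marketing_data_lake.bing_query_stats`",
        "destination_table_name_template": "bing_query_stats_clean",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(parent=parent, transfer_config=transfer_config)
print("Created scheduled query:", transfer_config.name)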

Airflow

For more complicated data flows, we’re currently using Airflow via Google Cloud Composer. Cloud Composer, as we mentioned in a previous blog post, enables us to not have to worry about maintaining the Airflow infrastructure while still giving us all the other upsides of it. This gives us the ability to focus on creating and maintaining the DAGs that drive the actual data structuring flows.

We mainly use Airflow to combine, clean, and enhance data from multiple sources and then re-upload it into Google BigQuery for visualization in other tools. Singular use cases are more easily captured by one or two tasks, but in Airflow we run flows that usually have multiple tasks that need to be executed in a certain order, and not at all if one of them fails. This is what Airflow is meant to do, and that’s how we’re leveraging it too. As an example, for our affiliate marketing campaigns we have a structure set up that only pays out once travel is concluded (a very standard approach in the travel industry). This means that we need to retrieve orders from our partner > verify them against our database > create a new format to upload back to our vendor > run the actual upload. In addition, we want to set up some alerting for the team as well. That results in 6 tasks that need to be executed in the right order: the perfect use case for Airflow.

Creating Structure

In the previous blog post, I touched on how we wanted to set up raw tables that are transformed once or multiple times. We decided to do this both to make the data more streamlined and to make it ready for visualization on our channel dashboards. The MarketLytics team did a great job documenting this with a very visual result that you can see here:

As discussed previously, we go through multiple stages with the data: we get it into the data lake and transform it into the data warehouse.

Example of data enhancement: One of the most common scenarios that we’ve tried to solve for is connecting existing data from a vendor back to the data that we receive in our web analytics tool, Google Analytics. As an example, if a newsletter campaign is properly tagged, we should be able to identify it from its UTM parameters and then connect that data to what we have in (in our case) Marketo on deliverability and open rate (%).
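
As an illustration, the kind of join we’re talking about could look like the pandas sketch below; the column names and numbers are made up, and in practice this connection happens in our warehouse tables.

# Illustrative sketch: connect campaign-level data from Google Analytics (via
# UTM parameters) to email stats from Marketo. Column names and values are made up.
import pandas as pd

ga = pd.DataFrame({
    "utm_campaign": ["spring_newsletter", "summer_promo"],
    "sessions": [5200, 3100],
    "transactions": [140, 75],
})

marketo = pd.DataFrame({
    "campaign_name": ["spring_newsletter", "summer_promo"],
    "delivered": [50000, 42000],
    "opens": [12500, 9800],
})

combined = ga.merge(marketo, left_on="utm_campaign", right_on="campaign_name", how="left")
combined["open_rate"] = combined["opens"] / combined["delivered"]
print(combined[["utm_campaign", "sessions", "transactions", "open_rate"]])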


Again… This blog post is written with the help of @RickDronkers and @hu_me / MarketLytics, whom we’ve worked alongside during this (ongoing) project.