Sometimes, you want to assign specific user interactions to a different user.
There are many cases where you want to send an event for user X while user Y performs the action, and it’s important that the information gets saved to the right user.
In our case, we ran into the use case where an RV owner needs to accept a booking. Only after the approval takes place do we consider this an actual purchase (a transaction in GA). According to regular logic, the RV owner would get the purchase attributed to their session in Google Analytics; after all, that’s the user who goes through the steps in the interface and confirms the transaction.
Marketing <> Google Analytics Logic
This doesn’t work for Marketing, though. We did our best to acquire the renter, and they’re the ones purchasing, yet according to Google Analytics we’d be firing an e-commerce purchase on behalf of the owner. That messes up channel attribution and performance measurement of our campaigns: the owner’s path likely doesn’t touch any paid or true marketing channels and is usually direct or email.
In summary, the wrong user would get credit for the conversion, which could cause issues with our ROAS measurement of marketing channels.
Switching Client IDs & Leveraging User IDs
When Google Analytics is loaded, it sets the client ID value in the _ga cookies in your browser. This value (the client ID) is used to tie events → pageviews → sessions together so they’re seen as one user (in combination with the userId value, obviously).
What we’re doing to change this behavior is pretty simple:
Whenever a user goes through the checkout funnel and creates a user account, we save their Google Analytics Client IDs (for GA4 and UA) to their profile.
When user X confirms the purchase, we’re sending an event to Tag Manager with the purchase context, including the impacted data from user Y.
Instead of directly firing the hit to Google Analytics, we swap out the client ID and userID from user X to user Y, so that the actual purchase gets attributed to user Y. You also need to mimic a session ID.
Google Analytics will now stitch this purchase to user Y instead of user X. You can choose for yourself what you want to fire for user Y.
Where possible, make sure that you’re already sending a userID to Google Analytics to ensure that the interactions can be truly tied back to the same user.
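To make the last step concrete, here’s a minimal sketch of one way to implement it server-side via the GA4 Measurement Protocol rather than through Tag Manager. The measurement ID, API secret, lookup function, and session ID below are placeholders/assumptions, not our production setup.

```python
import requests

# Hypothetical lookup: the renter's (user Y) GA identifiers were saved to
# their profile when they created an account during checkout.
def get_stored_ga_ids(renter_id: str) -> dict:
    return {"client_id": "1234567890.1614556800", "user_id": renter_id}

def send_purchase_as_renter(renter_id: str, transaction_id: str, value: float) -> None:
    ids = get_stored_ga_ids(renter_id)
    payload = {
        # Swap in user Y's identifiers instead of the owner's (user X).
        "client_id": ids["client_id"],
        "user_id": ids["user_id"],
        "events": [{
            "name": "purchase",
            "params": {
                "transaction_id": transaction_id,
                "currency": "USD",
                "value": value,
                # Mimic a session so GA4 can tie the purchase to user Y's activity;
                # reuse the renter's last known session_id if you stored it.
                "session_id": "1614556800",
            },
        }],
    }
    requests.post(
        "https://www.google-analytics.com/mp/collect",
        params={"measurement_id": "G-XXXXXXX", "api_secret": "YOUR_API_SECRET"},
        json=payload,
        timeout=10,
    )
```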
A data warehouse (& data lake) stores and structures data (through data pipelines) and then makes it possible to visualize it. That means it can also help create and power your SEO reporting infrastructure, especially when you’re dealing with lots of different data sources that you’re looking to combine. And if you simply have a ton of data, you’re likely looking to implement a (quick) solution like this, as it can handle far more than your laptop can.
Some quick arguments for having a warehouse:
You need to combine multiple data sources with lots of data.
You want to enable the rest of the organization to access the same data.
You want to provide a deeper analysis into the inner workings of a vendor or search engine.
It’s not all sunshine and rainbows. Setting up a warehouse can be quick and easy, but maintaining it is where the real work comes in. Especially when you’re dealing with data formats and sources that change over time, it can create overhead. Be aware of that.
What can I do with a warehouse?
Imagine you’re a marketing team with a strong performance-marketing setup: you advertise on Google Ads while, in SEO, you compete for the same keywords to win as much click share as possible. It would be even more useful if your reporting could show the total number of clicks in search (paid and organic combined, not one or the other). By joining the two datasets at scale, you can achieve this and visualize progress over time. Excel/Google Sheets will let you do this once (or repeat the process if you’re a true spreadsheet junkie), but not build daily dashboards that you can easily share with your colleagues. With a warehouse, you’d be able to store data from both ends (Google Ads and Google Search Console), combine the data, and visualize it later on.
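As a sketch of what that join could look like once both exports land in BigQuery (the project, dataset, table, and column names below are made up for illustration; your exports will differ):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table/column names; adjust to however your exports land in BigQuery.
query = """
SELECT
  gsc.query AS keyword,
  gsc.date,
  SUM(gsc.clicks) AS organic_clicks,
  COALESCE(SUM(ads.clicks), 0) AS paid_clicks,
  SUM(gsc.clicks) + COALESCE(SUM(ads.clicks), 0) AS total_search_clicks
FROM `my-project.search_console.searchdata_query` AS gsc
LEFT JOIN `my-project.google_ads.keyword_stats` AS ads
  ON LOWER(ads.keyword) = LOWER(gsc.query) AND ads.date = gsc.date
GROUP BY keyword, gsc.date
ORDER BY total_search_clicks DESC
LIMIT 100
"""

for row in client.query(query).result():
    print(row.keyword, row.date, row.organic_clicks, row.paid_clicks, row.total_search_clicks)
```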
Are you a small team, a solo SEO, or working on a small site? Make this a hobby project in your time off. At your scale, you likely don’t need a warehouse and can accomplish most things by connecting some of your data sources in a product like Google DataStudio. Smaller teams often have fewer (enterprise, duh!) SEO tools in their toolbox, so there is less data overall. At a smaller scale, a warehouse can easily be avoided and replaced by Google Sheets/Excel or a good visualization tool.
Why Google BigQuery?
Google BigQuery is RVshare’s choice for our marketing warehouse. Alternatives to Google BigQuery are Snowflake, Microsoft Azure, and Amazon’s Redshift. As we had a huge need for Google Ads data and Google provides a full export of it into BigQuery for free, it was a no-brainer for us to start there and leverage their platform. If you don’t have that need, you can replicate most of this with the other services out there. For the sake of this article, as I have experience dealing with BQ, we’ll use that.
What are the costs?
It depends, but let me give you an insight into the actual costs of the warehouse for us. Google Search Console and Bing Webmaster Tools are free. Botify, Nozzle (SaaS pricing here), and Similar.ai are paid products, and you’ll require a contract agreement with them.
Google Search Console & Bing Webmaster Tools: Free.
Nozzle, Similar.ai, Botify: Requires contract agreements, reach out to me for some insight if you’re truly curious and seriously considering purchasing them.
StitchData: Starting at $1,000/year, depending on volume, although you’re likely fine with the minimum plan if you only have one data source.
SuperMetrics: $2,280/year for their Google BigQuery license, which we use to export Google Search Console data. There are cheaper alternatives, but for legacy reasons it’s not worth it for us to switch providers.
Google Cloud Platform – Google BigQuery: Storage in BigQuery is affordable, especially if you’re just importing a handful of data sources; it only gets expensive with larger data sets. So having the data itself is cheap, and if you optimize the way you process and visualize the data afterwards, you can save a lot on top of that. Querying/analysis costs about $5 per TB processed, and with small date ranges and only a few columns selected it’s hard to hit that quickly (see the dry-run sketch below).
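If you want a feel for query costs before running anything, BigQuery’s dry-run mode tells you how many bytes a query would scan. A minimal sketch, assuming you have the google-cloud-bigquery client set up and using a placeholder table name:

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Placeholder table; selecting few columns over a narrow date range keeps scans small.
query = """
SELECT date, clicks, impressions
FROM `my-project.search_console.searchdata_query`
WHERE date BETWEEN '2021-01-01' AND '2021-01-31'
"""

job = client.query(query, job_config=job_config)
tb_scanned = job.total_bytes_processed / 1e12
print(f"Would scan ~{tb_scanned:.4f} TB, roughly ${tb_scanned * 5:.2f} at $5/TB")
```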
Loading Vendors into Google BigQuery
A few years ago, you needed to develop your own data pipelines to stream data into Google BigQuery (BQ) and maintain the pipeline from the vendor to BQ yourself. This created a lot of overhead and required having your own (data) engineers. Those days are clearly over, as plenty of SaaS vendors can facilitate this process for you at reasonable prices, as we just learned.
Bing Webmaster Tools & Google Search Console
Search Analytics reports from both Google and Bing are extremely useful as they provide insight into volume, clicks, and CTR %. This helps you directly optimize your site for the right keywords. Both platforms have their own APIs that enable you to pull search analytics data. While Google’s is widely used and available through most data connectors, the Bing Webmaster Tools API is a different story. Find the resource link below for more context on how to load this data into your warehouse, as more steps are involved (and still nobody knows what type of data that API actually returns).
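For reference, pulling Search Analytics data on Google’s side looks roughly like this. A sketch using the official Python client; the service-account key file and site URL are placeholders, and Bing’s API needs more steps than this:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
credentials = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES  # placeholder key file
)
service = build("searchconsole", "v1", credentials=credentials)

# Pull clicks/impressions/CTR/position per query per day for one month.
response = service.searchanalytics().query(
    siteUrl="https://www.example.com/",  # placeholder property
    body={
        "startDate": "2021-01-01",
        "endDate": "2021-01-31",
        "dimensions": ["query", "date"],
        "rowLimit": 25000,
    },
).execute()

for row in response.get("rows", []):
    keyword, day = row["keys"]
    print(day, keyword, row["clicks"], row["impressions"], row["ctr"], row["position"])
```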
Nozzle is our solution at the moment for rank tracking, at a relatively small scale. We chose them a few months ago, after having our data primarily in SEMrush, as they had the ability to make all our SERP data available to us via their BigQuery integration.
Technical SEO: Botify
Both at Postmates and RVshare I brought Botify in, as it’s a great (enterprise) platform that combines log files, its crawl data, and visitor data to give insight into your technical performance.
Lesser known is Similar.ai, which provides keyword data and entity extraction. It’s useful when you’re dealing with keywords at massive scale and want to understand the different categories they fall into, especially when you’re building topical clusters. With their Google Cloud Storage > Google BigQuery import, we’re able to show this data next to our keyword data (from Google Search Console).
Bonus: Google Ads
If you’re advertising in paid search with Google Ads, it can be useful to combine organic keyword data with paid data. That’s why I like to quickly set up the Data Transfer Service for Google Ads so all reports are automatically synced. This is a free service between Google Ads and Google BigQuery. More information can be found here.
How to get started?
Figure out which of the tools you currently use provide a way to export their data.
Create a new project in Google Cloud Platform (or use an existing one) and save your project ID.
Create a new account for StitchData and, where needed, a (paid) account for Supermetrics.
Connect the right data sources to Google BigQuery or your preferred warehouse solution.
Good luck with the setup and let me know if I can help in any way. I’m curious how you’re getting value from your SEO warehouse or what use cases you’d like to solve with it. Leave a comment or find me on Twitter: @MartijnSch
This blog post is part of a series of four (we found enough content to add more articles ;-)), in which we’ll dive into the details of why we wanted to create a data warehouse, how we created the data lake, and how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @hu_me / MarketLytics, whom we’ve worked alongside during this (ongoing) project.
The Process of Building a Data Lake
Four Options: BigQuery Transfer Service, Firebase, SuperMetrics, and StitchData
Loading data from a dozen marketing vendors (not all of them are visualized in this graph) as efficiently as possible was important to us, as we wanted to save money and resources for other parts of the data warehouse integration. In the end, we believe we should let people do the jobs they’re good at and leverage vendors instead of always following a ‘build it ourselves’ strategy. This makes it possible for us to focus on the actual use cases instead of worrying about the upkeep of integrating with over a dozen vendors, something that would take up a sizable chunk of time in the long term.
That’s why we made a few decisions on how to load data:
Firebase: Google Analytics App & Web can automatically export to a BigQuery dataset. So we decided to run our regular Universal Analytics integration of Google Analytics in parallel with App & Web for now, giving us a cheap way to export Google Analytics data.
BigQuery Transfer Service: Google Cloud offers great integrations for YouTube and Google Ads that can be set up within minutes, making it possible to instantly backfill and load data.
StitchData: We decided to go with Stitch as their pricing was more flexible and cheaper for our needs. There are many alternatives, but they had integrations with most of the vendors we needed. Since we didn’t require enterprise-level SLAs (yet), we knew we could rely on their standard integrations.
SuperMetrics: Some vendors weren’t available in StitchData at the time (Facebook Insights), so we decided to get a license for their BigQuery export.
‘Manual Import’: As a last resort, we decided to build connections to BigQuery ourselves. So far we’ve only had to do that for two vendors, which made the workload lower than expected.
Understanding platform APIs and Data Granularity
There were over a dozen vendors that we wanted to import data from, in the different ways described above, into our data lake: Google BigQuery. We needed to ensure upfront that the data and platforms we would be using were up for the task, as a few important areas could have thrown problems our way, and we needed a good understanding of the platforms and how we could leverage them. What we looked at:
Data Freshness & Backfill capabilities: How often would the data refresh, and how often would we need it to? For some vendors a 24-hour sync was more than enough, but for others a sync would run for many hours, basically making yesterday’s data useless because it arrived too late. On the other side, we needed to make sure that we could backfill data from vendors, and with most of the loading options we chose that was possible (except for manual imports, where we needed to write scripts that could go back in time to backfill the data; see the sketch after this list).
Cut time to collecting: Do we need all the data from an API or only partial responses? And, in combination with the previous point, do we need to load data hourly (real-time wasn’t much of a use case) or could we just gather the data daily to save collection time?
Hard to see everything end to end upfront: Google Ads exports dozens of tables through its Transfer Service into BigQuery, which meant it took a while to figure out which tables to use and how certain columns mapped to API fields & tables.
Using data to find gaps and roadblocks: What worked well for us was visualizing the data as soon as possible so we could quickly find anomalies (days of data missing due to a broken backfill process, or data not matching the interface because metric X turned out to be calculated differently than we thought).
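The manual backfill scripts mentioned above were simple in spirit: walk back through time in chunks and load each chunk into BigQuery. A rough sketch, where fetch_report and the destination table are hypothetical stand-ins for a vendor API call and your own dataset:

```python
from datetime import date, timedelta

from google.cloud import bigquery

client = bigquery.Client()
TABLE = "my-project.raw_vendor.daily_report"  # placeholder destination table


def fetch_report(day: date) -> list:
    # Hypothetical stand-in: replace with a real vendor API call that
    # returns one day's rows as a list of dicts matching the table schema.
    return []


def backfill(start: date, end: date) -> None:
    day = start
    while day <= end:
        rows = fetch_report(day)
        if rows:
            errors = client.insert_rows_json(TABLE, rows)
            if errors:
                print(f"{day}: load errors: {errors}")
        day += timedelta(days=1)


backfill(date(2020, 1, 1), date(2020, 12, 31))
```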
Associated Costs
BigQuery Storage: The lowest cost of all of this; while we store dozens of GBs daily, it’s ridiculously cheap to store all of it long-term.
Stitch: Pricing can be found here. This is really dependent on how much data you’re exporting and from how many sources as you’re billed based on usage.
SuperMetrics: Pricing can be found here. Basically $190 per vendor monthly.
BigQuery Transfer Service: This has been free for the Google Ads and YouTube exports for a couple of months now. Previously it was $2.50 per month, one less coffee at Starbucks.
BigQuery BI Engine Reservation: While it’s not fully related to the data lake, I want to mention it. We’re paying $60 per month for a 2GB reservation. Currently, we’ve optimized queries as much as possible to stay under this limit; if needed, we might move up. This has been a game-changer in making querying for dashboards lightning fast.
Cutting time to prove the value
We knew many of the use cases that we wanted to solve with our data lake, for example: showing Google Analytics metrics next to the open and unsubscribe rates from our marketing newsletters (Marketo). This meant we had our work cut out for us and could quickly take a sample of the data and build out a use case to see if the data could be matched (would we run into tracking issues somewhere? we sometimes did) and what possibilities we had for visualization. In short, we tried to cut as much time out of the initial process as possible, by not focusing on scalability yet, to figure out whether the use cases we had could actually provide value for the teams, always keeping the end goals in mind.
We did that by:
Talking to end stakeholders – interview & document: The goal of the project was to, at a minimum, provide every channel owner (PPC, SEO, Social, etc.) with dashboards for their core metrics. That was the baseline we asked them to provide; on top of that, we briefly discussed all the other use cases they knew about and wanted to repeat. By doing this we immediately had documentation on how the process currently worked for them, what data sources needed to be included, and what needed to be built. More importantly, it also created buy-in, as they were able to pick and choose what we built for them.
Taking a sample and building a use case: Load the data from the vendor into the data lake and investigate what the schemas were and whether the granularity of the data was enough. As long as there was a key in multiple tables that we could leverage across vendors, we knew we were on the right track (example: UTM tags in emails needed to match Google Analytics conventions).
Automate & Backfill later: We never focused on making sure that this could scale over the next years or on backfilling our data for multiple months at a time right away. In the end, we knew the process was iterative and that we would come up with new use cases, which could mean backfilling many times for the same vendors.
Backfill Manually: Depending on the vendor and the type of automation we could put in place, we backfilled some of the data manually where needed. Sometimes manually connecting data points in a database makes more sense when you’re trying to match vendors than overcomplicating things in a codebase. We figured this out quickly when we had 50 email templates that we needed to map to other data sources by hand.
The (unexpected) problems that we ran into?
Certain things always go differently from the expectations you set upfront; that’s why I wanted to touch on some of the ‘problems’ that we ran into and how we dealt with them.
Inaccurate data: Dashboard versus Dump
Exports are always different from the data you see in the dashboards, mainly because the information is stored differently from how it’s visualized. For example, cost data from Google Ads is saved in micros, which means you have to divide it by 1,000,000 to get the actual amount. CTRs from most vendors are just a calculation of Clicks / Impressions that you have to redo yourself. And in some cases, specific data points aren’t available for export through an API at all (looking at you, Facebook).
Overall this means you have to carefully assess upfront what data schema is available in the export and whether it matches the use cases you want to fulfill. If it doesn’t, it might not be worth trying to automate it.
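For example, normalizing the Google Ads transfer tables before visualizing them looks something like this. The table and column names are placeholders (check what the Data Transfer Service actually creates in your dataset), but the micros division and CTR recalculation are the point:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table/column names; verify against your own transfer dataset.
query = """
SELECT
  date,
  campaign_name,
  SUM(cost_micros) / 1e6 AS cost,                    -- micros -> currency units
  SAFE_DIVIDE(SUM(clicks), SUM(impressions)) AS ctr  -- recompute CTR from raw counts
FROM `my-project.google_ads.campaign_stats`
GROUP BY date, campaign_name
ORDER BY date
"""

for row in client.query(query).result():
    print(row.date, row.campaign_name, round(row.cost, 2), row.ctr)
```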
Duplication as a problem in Stitch
StitchData exports data into BigQuery differently from what we were used to, which meant we had to get used to deduplicating data before using it for analysis. In short: Stitch appends a new row whenever it notices that a unique record has changed, so we always need to find the latest row for each record. Not a massive problem in itself, but it took us a few days to figure out because the documentation around this was lacking at the time.
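A minimal sketch of that deduplication pattern, assuming the table has a primary key column id and Stitch’s _sdc_sequence metadata column (check your own tables for the exact key and metadata columns):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Keep only the most recent version of each record that Stitch loaded.
query = """
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY id               -- the record's primary key
      ORDER BY _sdc_sequence DESC   -- Stitch's load-order metadata column
    ) AS row_num
  FROM `my-project.stitch_marketo.leads`  -- placeholder table
)
WHERE row_num = 1
"""

deduplicated_rows = list(client.query(query).result())
```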
Backfills
How much useful data do you have available per vendor? Some vendors are very structured: Google Ads has its own transfer service into Google BigQuery, so it was easy to backfill the data there. But for other vendors, like Marketo and Sendgrid, it was harder, as we made so many changes over time to professionalize our setup. This meant that for some vendors we were able to backfill data back to 2018, whereas for others we could barely get data from the last six months. Knowing this upfront is essential, as it can make certain use cases irrelevant if you’re trying to learn from YoY behavior and it turns out the data isn’t comparable.
There were exceptions: for some sources (read: Marketo) we had to rely on other solutions, since there are limits to the data we can get out of their (old skool) API. For some of it, we’re getting daily email reports that are parsed and sometimes ‘manually’ ingested into Google BigQuery. These are the kinds of workarounds we had to accept, because we want the data accessible through our data lake and some vendors just aren’t as accessible.
Naming Conventions
While we had an early discussion about the naming conventions we wanted to follow, this required more thought than expected. We quickly had tons of datasets in BigQuery that internally followed their own conventions (defined by the data-loading providers). As we were applying ETL to some data tables, we needed to make sure it was clear what every table and field in the schema meant. For example, there were tables that were only used temporarily, for backfill, for production, for development, for transformation, etc. What made it more complicated is that most of them were also split out per vendor. In our case, we settled on distinguishing datasets by ‘prod/dev’ + ‘raw/formatted/dashboard’ + ‘tables’, and then specifying the vendor name and purpose in the table names within each dataset.
Again… This blog post is written with the help of @RickDronkers and @hu_me / MarketLytics, whom we’ve worked alongside during this (ongoing) project.