Assigning Purchases to Other Users in Google Analytics

Sometimes, you want to assign specific user interactions to a different user.

There are many cases where you want to send an event for user X while user Y actually performs the action, and it's important that you can attribute this information to the right user.

In our case, we ran into the use case where an RV owner needs to accept a booking. Only after that approval takes place do we consider it an actual purchase (a transaction in GA). Following regular logic, the RV owner would get the purchase attributed to their session in Google Analytics. After all, that user is the one who goes through the steps in the interface and confirms the transaction.

Marketing <> Google Analytics Logic

This doesn't work for Marketing, though. We did our best to acquire the renter, and they're the ones purchasing. According to Google Analytics, we'd be firing an e-commerce purchase on behalf of the owner. What this messes up is channel attribution and performance measurement of our campaigns: the owner's path likely doesn't touch any paid or true marketing channels, but rather direct or email.

In summary, the wrong user would get credit for the conversion, which could cause issues with our ROAS measurement of marketing channels.

Switching Client IDs & Leveraging User IDs

When Google Analytics is loaded, it sets the client ID value in the _ga cookie in your browser. This value (the client ID) is used to tie events → pageviews → sessions together so they're seen as one user (in combination with the userId value, obviously).

What we're doing to change this behavior is pretty simple:

  1. Whenever a user goes through the checkout funnel and creates a user account, we save their Google Analytics Client IDs (for GA4 and UA) to their profile.
  2. When user X confirms the purchase, we send an event to Tag Manager with the purchase context, including the saved identifiers of user Y.
  3. Instead of directly firing the hit to Google Analytics, we swap out the client ID and user ID from user X to user Y, so the actual purchase gets attributed to user Y. You also need to mimic a session ID (see the sketch after this list).
  4. Google Analytics will now stitch this purchase to user Y instead of user X. You can choose for yourself what you want to fire for user Y.
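To make the mechanics concrete, here's a minimal server-side sketch of the swap using the GA4 Measurement Protocol (one way to implement step 3). This is not our exact Tag Manager setup: the measurement ID, API secret, and the renter lookup are placeholders you'd swap for your own.

import requests

GA_ENDPOINT = "https://www.google-analytics.com/mp/collect"
MEASUREMENT_ID = "G-XXXXXXX"    # placeholder
API_SECRET = "your-api-secret"  # placeholder

def send_purchase_for_renter(renter, transaction_id, value):
    """Fire the purchase with the renter's (user Y) identifiers,
    even though the owner (user X) confirmed the booking."""
    payload = {
        # Swap in the client ID we stored on the renter's profile.
        "client_id": renter["ga4_client_id"],
        "user_id": renter["user_id"],
        "events": [{
            "name": "purchase",
            "params": {
                "transaction_id": transaction_id,
                "currency": "USD",
                "value": value,
                # Mimic a session ID so GA4 stitches the hit to a session.
                "session_id": renter["ga4_session_id"],
            },
        }],
    }
    requests.post(
        GA_ENDPOINT,
        params={"measurement_id": MEASUREMENT_ID, "api_secret": API_SECRET},
        json=payload,
        timeout=10,
    )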

Resources

→ Setting the Client ID in Google Analytics 4

→ Setting the Client ID in Google Analytics with gtag.js

User ID

Where possible, make sure you're already sending a userID to Google Analytics so that interactions can truly be tied back to the same user.


Part 5: Airflow on Google Cloud Composer – Building a Marketing Data Lake and Data Warehouse on Google Cloud Platform

In the previous blog posts (part 1, part 2, part 3, and part 4) in this series, we talked about why we decided to build a marketing data warehouse. This endeavor started by figuring out how to deal with the first part: making the data lake. In this fifth, more technical blog post, I'll give some insight into how we're leveraging Apache Airflow to build the more complicated data pipelines, and I'll give you some tips on how to get started.

This blog post is part of a series of five? (maybe more, you never know), in which we'll dive into the details of why we wanted to create a data warehouse, how we created the data lake, and how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @Hussain / MarketLytics, whom we've worked alongside during this (ongoing) project.

Getting Started with Cloud Composer

Cloud Composer is part of Google Cloud Platform and brings you most of the upside of using Apache Airflow (open source) with barely any of the downsides (setup, maintenance, etc.). Or, to quote their main USP: “A fully managed workflow orchestration service built on Apache Airflow.” While we had worked with Airflow before, we weren't looking forward to spending time worrying about its management, as we planned to spend most of our time setting up and maintaining the data pipelines. This way, you can stick to creating pipelines (DAGs).

What is it suitable for?

Say you want to load data from the Google Analytics API, store it locally, translate some values to something new, and have it available in Google BigQuery. However you would build it, it's multiple tasks and functions that depend on each other. You wouldn't want to load the data into BigQuery before it has been cleaned (garbage in, garbage out, sound familiar?). With Airflow, the next task is only processed if the previous step was successful.

Tasks

Tasks are, in almost every case, just one thing: get data from BigQuery, upload a file from GCS into BigQuery, download a file from Cloud Storage to local, process data. What makes Airflow very efficient to work with is that the majority of data processing tasks already have pre-built functions. Tasks like the ones I just listed map to operators (GoogleCloudStorageDownloadOperator, GoogleCloudStorageToBigQueryOperator) that work like ready-made functions.
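As a rough sketch of what that looks like in a DAG (using the Airflow 1.10-era contrib import paths; the bucket, object, and table names are made up):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcs_download_operator import GoogleCloudStorageDownloadOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

with DAG(dag_id="ga_to_bigquery", start_date=datetime(2020, 1, 1), schedule_interval="@daily") as dag:
    # Download a raw export from Cloud Storage to the local worker.
    download_file = GoogleCloudStorageDownloadOperator(
        task_id="download_file",
        bucket="my-data-lake",
        object="exports/ga_sessions.json",
        filename="/tmp/ga_sessions.json",
    )

    # Load the cleaned file from GCS into a BigQuery table.
    load_to_bq = GoogleCloudStorageToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-data-lake",
        source_objects=["exports/ga_sessions_clean.json"],
        destination_project_dataset_table="project.dataset.ga_sessions",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
    )

    download_file >> load_to_bq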

Versus Google Cloud Functions

If you mainly run very simple ‘pipelines’ that consist of only one function, or have only a handful of use cases, it is likely overkill to leverage Cloud Composer; the costs might be too high, and you still have the overhead of DAGs. In that case, you might be better off with Google Cloud Functions, as you can write similar scripts and trigger them with Google Cloud Scheduler to run at a specific time.
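To illustrate that alternative (the function, table, and rows below are made up), a one-step ‘pipeline’ can be a single HTTP-triggered function that a Cloud Scheduler job calls on a cron schedule:

# main.py, deployed as an HTTP-triggered Cloud Function.
from google.cloud import bigquery

def load_report(request):
    """A single-step 'pipeline': fetch some data and append it to BigQuery."""
    client = bigquery.Client()
    rows = [{"date": "2021-01-01", "clicks": 42}]  # stand-in for a real API fetch
    errors = client.insert_rows_json("project.dataset.daily_report", rows)
    return ("load failed", 500) if errors else ("ok", 200)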

Costs

The costs for Google Cloud Composer are doable: a basic setup is around 450 dollars per month (if you run the instances 24 hours a day, 7 days a week), as you leverage multiple (a minimum of 3) small instances. For more information on the costs, I would point to this pricing example.

Building Pipelines

See above an example data pipeline; in typical Airflow fashion, every task depends on the previous task. In other words: notify_slack_channel would not run if any of the previous tasks failed. All tasks happen in a particular order from left to right. In most cases, data pipelines become more complicated than this, as you can have multiple flows going on at the same time that are combined at the end.
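In code, that fan-in is just dependency wiring; the task names below are illustrative:

# Only the dependencies are shown; the tasks are defined elsewhere in the DAG.
# Two extract flows run in parallel and are combined before the alert.
[extract_ads_data, extract_ga_data] >> combine_sources
combine_sources >> upload_to_bigquery >> notify_slack_channel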

Tips & Tricks

Google Cloud Build, Repositories

The files for Google Cloud Composer are saved in Google Cloud Storage. That is smart in itself, but at the same time you want them to live in a Git repository so you can work on them together efficiently. Following this blog post, you can connect the Cloud Storage bucket to a repository and set up a sync between the two. This basically helps you build a deployment pipeline and makes sure that only production-ready code from your master branch ends up in GCS.

Managing Dependencies

After working with it for a few months now, I'm still not sure whether managing dependencies through Google Cloud Composer is a good or a bad thing: it creates some obstacles if you want to run a deployment and add some Python libraries (your servers could be down 10-30 minutes at a time). Other setups usually handle this a bit more smoothly, with less downtime.

Sendgrid for Email Alerts

One of the upsides of Apache Airflow is that it sends alerts upon task failure. Make sure to set up the Sendgrid notifications while you're setting up Google Cloud Composer; this is the most straightforward way of receiving email alerts (for free, as in most cases you shouldn't get too many failure emails).

README

Document the crap out of your setup and DAGs. When I took over some of the pipelines used at Postmates for XML sitemap generation, it was a nightmare: the code was hard to read, it didn't make a lot of sense, and we had to refactor certain things just because of that. As pipelines (just like regular code) can sometimes be left untouched/unviewed for months (they literally sometimes have only one job), you want to make sure that you can come back and understand what happens inside the tasks.


Again… This blog post is written with the help of @RickDronkers and @Hussain / MarketLytics, whom we've worked alongside during this (ongoing) project.


Part 4: Visualization with Google DataStudio – Building a Marketing Data Lake and Data Warehouse on Google Cloud Platform

In the previous blog posts (part 1, part 2, and part 3) in this series, we talked about why we decided to build a marketing data warehouse. This endeavor started by figuring out how to deal with the first part: building the data lake. In this fourth blog post, we'll chat about how we're visualizing all the data we saved in the previous steps using Google DataStudio.

This blog post is part of a series of four? (maybe more, you never know), in which we'll dive into the details of why we wanted to create a data warehouse, how we created the data lake, and how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @Hussain / MarketLytics, whom we've worked alongside during this (ongoing) project.

How we build dashboards

Try to think ahead about what you need: date ranges, date comparisons, filters, and the type of visualization. This helps you get to a good first version right away. What that mainly looked like for us:

  • Date ranges: The business is so seasonal that year-over-year growth is most important for RVshare, and since we often don't get all the context on metrics on a weekly basis, we default to 30 days.
  • Filters: For some channels (PPC, Social), it's more relevant to be able to filter the data down to a campaign or social network level, because in most cases the aggregate level doesn't tell the whole story right away.
  • Visualization: We need the top metrics (sessions and revenue) in view right away, with the YoY comparison, so we know within seconds what is going on and where things can improve.

Talking to Stakeholders (Part Deux)

In the first blog post, we talked about connecting with our stakeholders (mainly our channel owners) and gathering their feedback to build the first versions of their dashboards (beginning with the end in mind). We used this approach to put the first charts, tables, and graphs on the dashboards, after which we connected back with the owners to see what data points were missing and, in some cases, to validate the data they were seeing. This gave us additional feedback for fast follow-ups and made for quick iterations on the data we had and could show. For social media, as an example, it turned out we wanted to show additional metrics that we hadn't thought of initially but that were in our data lake anyway. These sessions were a good way for us to build additional pieces into our data warehouse while we were at it. These days, some of these dashboards are used weekly to report to other teams in the organization or within the team itself.

Best Practices

Blended Data

Do you want to blend data in Google DataStudio, or do you want to create synced/aggregate tables in BigQuery? For most of our use cases, we opted for DataStudio's blended data sources (essentially JOINs). It's easier: we can quickly pull new data together instead of dealing with data structures and complicated queries. In some cases, while building dashboards, we noticed we were missing data in our warehouse tables (not the lake) and were able to make adjustments and improvements to them.

Single Account Owner

Because we work with Rick and Hussain as ‘third parties’, we opted for one shared owner account. Transferring owner access is incredibly hard when it's a Google Apps account, so we made sure the dashboards are owned by an @rvshare.com account. It's not a big topic, but it could cause tons of headaches in the long term.

Keep It Simple St*pid

Your stakeholders probably have less desire, and less time, to look at dashboards than you think. Instead of having them jump through too many charts, start simple and then add based on feedback if they want to see more; less is more in this case.

This has the added benefit of making them feel engaged and more interested in using it. In our own case, we leverage our reporting on a weekly basis for a team meeting, which already makes it a more frequently used asset.

Calculated Fields – Yay or Nay?

As we made most of the tables that we leverage in DataStudio from scratch during our ETL process, we had the opportunity to decide whether we wanted to leverage calculated fields in DataStudio or do the work in the BigQuery queries themselves. Honestly, the answer wasn't easy. As we made modifications to the dashboards, it became clear that having calculated fields set up in DataStudio wasn't always scalable or easy, as data modifications or changes in tables remove them.

Google BigQuery

Tables or queries? In our case, we often used the table information from BigQuery, with the specific columns in there driving the visualization in DataStudio. The alternative, for some of them, is to query the data in BigQuery directly; with the BI Engine reservation that we have, we can speed up intense queries rather easily.


Again… This blog post is written with the help of @RickDronkers and @Hussain / MarketLytics, whom we've worked alongside during this (ongoing) project.


Part 3: Transforming Into a Data Warehouse – Building a Marketing Data Lake and Data Warehouse on Google Cloud Platform

In the previous blog posts (part 1 and part 2) in this series, we talked about why we decided to build a marketing data warehouse. This endeavor started by figuring out how to deal with the first part: building the data lake. In this post, we'll go into a bit more detail on how you can do this yourself, as we transformed our marketing data lake into an actual data warehouse.

This blog post is part of a series of four? (we found enough content to add more articles ;-)), in which we'll dive into the details of why we wanted to create a data warehouse, how we created the data lake, and how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @hu_me / MarketLytics, whom we've worked alongside during this (ongoing) project.

The Process of Building a Data Warehouse

In our endeavor of building a data warehouse, we had a couple of big initiatives that we first wanted to get done. We needed some reporting and visualization tables and, aligned with that, we needed to make sure we had data cleaned for other purposes (deduplication, standardization: some typical ETL problems).

We used three different ways of streamlining the data:

Google Cloud Functions

Google Cloud Functions is used both for transforming our data and for loading our initial data in a few use cases. Early on, we noticed that not every vendor was available through regular data loading platforms like StitchData. An example of that was Google Search Console; as we didn't want to run additional infrastructure just for dealing with Load scripts, we leveraged Cloud Functions to run a daily script (with support from Cloud Scheduler to trigger it daily).

After loading the data, we also Transform some other tables from our marketing data lake into new production tables using Cloud Functions. All of our scripts are currently written in Python or Node.js, but as Cloud Functions supports multiple languages, it gives us the flexibility to leverage others over time.

Backfill: As Functions can easily be rewritten and tested within the interface, they also provide us with a good way to backfill data, as we can easily adjust the dates for which a script needs to run.
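To make that concrete, here's a hedged sketch of such a daily Cloud Function; the site URL, dates, and table name are placeholders, and in practice the dates would be computed and authentication handled via the function's service account:

from google.cloud import bigquery
from googleapiclient.discovery import build

def load_search_console(request):
    """Daily Cloud Function (triggered by Cloud Scheduler): pull one day of
    Search Console data and append it to a raw BigQuery table."""
    service = build("webmasters", "v3")  # default credentials on GCP
    response = service.searchanalytics().query(
        siteUrl="https://www.example.com/",
        body={
            "startDate": "2021-01-01",  # in practice: yesterday's date
            "endDate": "2021-01-01",
            "dimensions": ["date", "query", "page"],
            "rowLimit": 25000,
        },
    ).execute()

    rows = [
        {
            "date": row["keys"][0],
            "query": row["keys"][1],
            "page": row["keys"][2],
            "clicks": row["clicks"],
            "impressions": row["impressions"],
            "ctr": row["ctr"],
            "position": row["position"],
        }
        for row in response.get("rows", [])
    ]
    bigquery.Client().insert_rows_json("project.dataset.gsc_raw", rows)
    return "ok"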

Scheduled Queries

In some other cases, we leverage Google BigQuery's scheduled queries. In a few instances, we just want to load data from raw data lake tables into a production table, mainly because we don't always need all the columns; we can limit the data and clean it in the query itself. Scheduled queries come in handy there: they run on a schedule, can be easily updated, and already point at another dataset and table.
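As an illustration, this is the kind of transformation we'd put in a scheduled query (dataset, table, and column names are made up); the scheduled-query UI holds just the SQL, shown here via the BigQuery Python client:

from google.cloud import bigquery

# Keep only the columns we need from a raw lake table and
# write them to a clean production table.
client = bigquery.Client()
client.query(
    """
    CREATE OR REPLACE TABLE `project.warehouse.ads_daily` AS
    SELECT date, campaign, SUM(clicks) AS clicks, SUM(cost) AS cost
    FROM `project.lake.ads_raw`
    WHERE cost IS NOT NULL
    GROUP BY date, campaign
    """
).result()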

Airflow

For more complicated data flows we’re currently using Airflow via Google Cloud Composer. Cloud Composer, as we mentioned in a previous blog post, enables us to not have to worry about maintaining the Airflow infrastructure but gives us all the other upside of it. This gives us the ability to focus on creating and maintaining the DAGs that help drive the actual data structuring flows.

How we mainly use Airflow is to combine, clean, and enhance data from multiple sources and then upload it back into Google BigQuery for visualization in other tools. Singular use cases are more easily captured by one or two tasks, but in Airflow we run flows that usually have multiple tasks that need to be executed in a certain order, and not at all if one of them fails. This is what Airflow is meant to do, and that's how we're leveraging it too. For our affiliate marketing campaigns, for example, we have a structure set up that only pays out once travel is concluded (a very standard approach in the travel industry). This means that we need to retrieve orders from our partner > verify them against our database > create a new format to upload back to our vendor > run the actual upload. In addition, we want to set up some alerting for the team as well. That results in six tasks that need to be executed in the right order: the perfect use case for Airflow (a sketch of the chain follows below).
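Sketched as Airflow dependencies (task names are illustrative, not our production DAG):

# Each name is an operator defined elsewhere in the DAG; the chain
# enforces the order and stops the flow as soon as one task fails.
(
    retrieve_partner_orders
    >> verify_against_database
    >> build_vendor_upload
    >> run_vendor_upload
    >> update_reporting
    >> notify_team
)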

Creating Structure

In the previous blog post, I touched on how we wanted to set up raw tables that are transformed once or multiple times. We decided to do this both to make the data more streamlined and to make it ready for visualization on our channel dashboards. The MarketLytics team did a great job documenting this, with a very visual result that you can see here:

As discussed previously, the data goes through multiple stages: we get it into the data lake and then transform it into the data warehouse.

Example of data enhancement: One of the most common scenarios we've tried to solve for is connecting existing data from a vendor back to the data we receive in our web analytics tool, Google Analytics. For example, if a newsletter campaign is properly tagged, we should be able to identify it from the UTM parameters and then connect that data to what we have in (in our case) Marketo on deliverability and open rate (%).


Again… This blog post is written with the help of @RickDronkers and @hu_me / MarketLytics, whom we've worked alongside during this (ongoing) project.


Part 1: Why Build a Marketing Data Warehouse? – Building a Marketing Data Lake and Data Warehouse on Google Cloud Platform

This blog post is part of a series of three, in which we'll dive into the details of why we wanted to create a data warehouse, how we created the data lake, and how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @hu_me / MarketLytics, whom we've worked alongside during this (ongoing) project.

Six months ago, we were looking at Google Sheets, Google Analytics, and many other tools/vendors daily, and had no marketing data living in a data warehouse (nor a lake). We wanted to get better at using (marketing) data: mostly to find a better way to connect insights from multiple sources, but also to make more data available to the rest of the internal teams (think: PPC, SEO, Email, Product).

Why?

  • Prove business value: Everything we do as Marketing needs to prove some kind of business value. We're not doing things just to keep ourselves busy, so we wanted to make sure upfront that the business cases we had in mind would have an upside with this setup. Some of these were: detailed insights into social media performance, attribution-level data, and a way for us to connect transactions between data sources.
  • Access to ‘our’ raw data: As often with marketing data, you're using the interface of the actual marketing vendor to pull your daily numbers. But since so much marketing data overlaps with internal data from other channels, we wanted access to all the raw data that is ‘ours’ (to begin with) and the ability to combine it. This, for example, would provide us with insights into the lifetime value of customers by channel, or the ability to export cohorts of users.
  • The analysis doesn't scale in a spreadsheet and isn't repeatable: Besides not wanting to create reporting monkeys, we didn't want our teams to spend time pulling data from different sources and manually combining it. The risk of mistakes is too high, the process doesn't scale, and valuable team member time would be lost along the way.
  • More control over data quality: We want to control our data pipelines and be involved in the ETL (Extract, Transform, Load) process so we know how data is shaped. There shouldn't be a black box of data being created in the background. With this process in place, we can better control the quality and usefulness of the data (and continue to do so over time).
  • More flexibility for visualization and ‘deep analysis’: With access to our raw data in a warehouse, we knew we would have to rely less on building analyses right on top of vendors' APIs or worry about how to visualize data (on dashboards).

Looking at the complete picture

Marketing data is often confined within the platforms and tools used to distribute the marketing message. But your prospects don't confine themselves to just one source or medium to interact with your company (the path is usually longer than one session before a transaction takes place). To genuinely measure the effectiveness of your marketing activities, extracting the data from these platforms and having the ability to analyze it is essential. This goes back to our first point on Why: it needs to solve cases that provide business value.

Beginning with the end in mind

One of the biggest challenges in working with data is that you often don't know precisely what data you need to solve your use case. Usually, you go through the cycle of “capture – store – enrich – analyze” a couple of times before arriving at really valuable insights.

By creating your own data warehouse, you create the flexibility to transform and relate the data to each other in ways that aren’t possible when your data is locked within separate platforms.

Build versus Buy?

While it was clear that we wanted to own as much of the data ourselves as possible, the obvious next discussion was how to gain access to the actual data. Would we build the pipelines ourselves? Would we buy tools to do so, and how much of the transformation of data would fall on the team? How about maintenance?

Pros of Buying a solution

  • Plug and Play: A strong pro for buying software is usually a decreased time to value, because you're skipping the building phase. You avoid a steep learning curve and don't have to deal with the hardware/software setup. A requirement for us was that we didn't want a long process of setting up resources.
  • Scalability: As we grow, our datasets will grow. While paying extra for additional volume isn't a problem, we wanted to make sure the infrastructure would hold up as we reach a certain scale.

Cons for Buy & Build

  • High Costs: Buying tools would likely, over the long term, have more direct costs associated with it, though obviously the indirect costs of building the solutions ourselves (increased headcount) would also be high.
  • Vendor Buy-in: Whether we buy or build, being locked into a vendor's platform is a strong con, so we wanted to make sure certain aspects could be moved over to another service if we deem that to be the right choice at some point in time.
  • Standard data formatting: With some vendors, it was harder to change the data schema (often for a good reason) or pull additional data formats that we wanted to collect from, for example, Facebook (the data was available in the API, just not in the vendor's export).
  • Time to first use: Building pipelines for extracting the data from certain platforms could take a long time, but the real business value will, in most cases, only show up when you're actively working with the data. Losing months to extracting data (also taking into account our seasonality, which provides a ton of data) would delay this. So going with platforms that could directly provide us with extracted data was a great pro for buying, and a strong con for building it ourselves.
  • Maintenance: We didn't want to maintain the extract pipelines from the marketing platforms we use. While it's a really important part of our data strategy, we wanted to avoid spending a significant amount of our time staying up to date and maintaining pipelines. With a buy solution, we rely on the vendor's resources for this part of the data flows, which is very much what we prefer at this point, mainly considering that our data team's resources are minimal.

In the end, we decided to go with a combination of the two strategies. We let most of the Extract processes be handled by software that we buy, or that is already available open source, as this is a task we wanted to make as cost-effective as possible. Most of the Transform is done through a set of tools that we maintain and build ourselves, as there are too many custom use cases and needs. That also meant we mostly needed to ‘Load’ the data back into our data warehouse ourselves. More on the details of that in our upcoming blog posts.

Costs?

An often-asked question in this process is what kind of costs we were expecting. While we won't disclose the costs of the additional help we received, we'll quickly touch on the costs of this whole setup thus far. Parts of this are variable based on the size of your data, but so far our costs are:

  • Google Cloud Platform: ~$400-700 monthly.
    • This includes costs for Google BigQuery, Cloud Composer, and other tools. We expect this to go up gradually over time.
    • Note: In another blog post we might dive deeper into Google Cloud Composer and how we leverage it for Airflow (processing data pipelines). Right now, Cloud Composer represents at least half of our Google Cloud Platform costs; if you decide not to use it, your GCP costs will be significantly lower.
  • Stitch Data: $350-500 monthly, based on volume for now. We expect this to go up gradually over time.
  • SuperMetrics: $190 monthly per license per vendor (depending on the number of licenses you might need).

Strategy going forward

With some of these decisions made upfront, we decided to move forward with collecting the data channel by channel, starting with the most important ones. This way, instead of needing to collect all the data for every channel, we could directly start building dashboards on top of data that would be useful for teams. The steps to get there were as follows, and we'll dive into some of them in detail in upcoming blog posts:

  1. Extract all the data using Google BigQuery Transfer Service, Supermetrics and StitchData.
  2. Validate that the data in Google BigQuery is correct by comparing it to the regular data sets, existing dashboards, and the data that we would look at in the vendor’s platform.
  3. Transform the data so data sources can be combined and Transform the data to be directly useful for reporting purposes.
  4. Load this new data into new reporting tables, backfill historical data where possible, and sync it back into Google BigQuery.
  5. Depending on the channel, build dashboarding in Google DataStudio or Tableau.

In the next blog post, we’ll go into depth on how we created a marketing data lake, where all our raw marketing data from our vendors (Google Search Console, Google Ads, Bing Ads, etc.) lives. Expect that to go live within the next two weeks. Until then, feel free to share your thoughts on this blog post on Twitter.
Again… This blog post is written with the help of @RickDronkers and @hu_me / MarketLytics, whom we've worked alongside during this (ongoing) project.


Calculating Click Through Rates for SEO, based on Google Search Console Data (in R)

Updated June 4, 2019: Added a YouTube video which will guide you through the setup process in RStudio and how to run the script yourself.


Averages lie, and average click-through rates aren't very helpful! There, I said it. What I do believe is that you can calculate click-through rates (CTR) for your own site in a great way, with data from Google Search Console, especially while using R. So let's dive into how this works and what you can do with it!

Why ‘Averages’ Lie?

Look at the graph below: a great CTR for position 1. But this research, for example, shows that the average CTR for position 1 is 24% (AWR data, Feb 2019). Which one is correct? Neither of them, as it really depends on the industry and on what features show up in the search results that might decrease CTR (think rich snippets, local packs, news results). All of this makes it really hard to analyze what you could expect if you were to rank higher for a set of keywords in your industry. So while I was working at Postmates on ranking certain category pages better, we decided to calculate our own CTR and were intrigued by how far our CTRs were off from the research (the research isn't wrong! It's just generalized across industries). Eventually, with the data in hand, we were able to make better estimates/forecasts of how much traffic we could expect when rankings increased in that segment. In the rest of this post, I'll go more in-depth into how we calculated this.

Using Google Search Console Data

Visual of Google Search Console Report (this is not RVshare data)

You've seen this report in Google Search Console: it provides a detailed view of the performance of your keywords and their average position. In this graph, we see something positive: a CTR and position that go up slightly over time. But what if you want to know the average CTR for a certain segment of keywords per position? That's way harder to do in the interface. Because of that, I used the R script from Mark Edmondson that he wrote about here almost three years ago.

It will help you extract the data from Google Search Console in a raw form, so you can digest it and create your own visualizations (like the one we'll talk about next).

Visualizing CTR Curves in R

CTR Curve visualized CTR per Position (note: this is randomized data for an unknown site)

So let's dive right into how you can do this yourself. I'll provide you with the full R script; you'll need to download RStudio yourself in step 1.

  1. Download and Install RStudio
  2. Download the following .r script from Gist
  3. Run these commands to install the right packages in RStudio:
    1. install.packages("googleAuthR")
    2. install.packages("searchConsoleR")
    3. install.packages("dplyr")
    4. install.packages("ggplot2") (if necessary)
  4. Line #21 – Change this to the property name from Google Search Console.
  5. Line #25 – Optional: If you want the CTR curve for positions beyond 20, change the number.
  6. Line #40 – Recommended: Exclude the word(s) that are part of your brand name, so you get the right CTR curve for non-branded keywords only.
  7. Line #41 – Optional: The script takes a ‘sample’ of 50,000 keywords to calculate your CTR curve. You can increase this limit if needed; if you have fewer than 50,000 keywords, it's not an issue.
  8. Run the script! The output should be a visual as shown earlier in this post

Want to take a deep breath and let me help you go over this again? I’ve made a quick screen share video of what to do in RStudio and how to use the R script.

Hopefully you now have a better understanding of the actual CTR for your own site, and you can use this to visualize CTR curves for specific parts of your site or for pages that have a similar META description. Over time, you could use this, for example, to measure the impact of changes on CTR.

Credit where credit is due! There are many use cases for visualizing CTR data with R, and I'm grateful that a while ago Mark Edmondson opened my eyes to this. Credits also to Tim Wilson's documentation on using R and improving visualizations.

Want to read this article in Spanish? Read it here.


Adding additional site speed metrics to Google Analytics: measuring First Input Delay (FID)

Web analytics is still one of my pet peeves, and while I don't get to spend a ton of time on it anymore these days, I still enjoy digging through blog posts and coming up with new ideas on what to track and how it can help with (speed) optimization. While I was looking through a big travel site's source code a couple of weeks ago, trying to figure out what we could improve, I noticed a Google Analytics event being fired that was ‘new’ to me: it was used to send information about ‘interaction delays’. After digging, I figured out what it was, and as I couldn't find much information about the topic in relation to Google Analytics, I think it's worth a blog post.

Disclaimer: There is not a lot of new original material in here; a lot of the information can be found in old updates in the Google Developers Web Updates section, and most credits go to Philip Walton. But I think it's worth giving these topics some more attention and providing some additional solutions for integrating this with Google Analytics.

Site Speed Reports in Google Analytics

The site speed reports in Google Analytics have proven useful for optimizing average load times and identifying how long it takes for a certain page to load. You can use the data to figure out if your server is slow (Server Connection Time and Server Response Time) or look at Domain Lookup Time to see how long a DNS lookup takes. But have you ever noticed that on some sites it takes a little while after the page starts loading before you can actually interact with it, while it's still being painted (the JS/CSS scripts) on your screen? Mostly on slow connections, like your phone's mobile network, this can be obvious from time to time. That's why the following metrics can come in handy: they measure the time to the first paint and the first input delay that can happen.

Why is this metric not already part of the reports? Google Analytics can only start measuring data once its script is loaded. The speed data it reports on is gathered through the speed API in your browser, but the data for a metric like this isn't part of that. It's also a fairly technical metric, as you'll realize after this, so for most basic users it would cause a lot of confusion, I'd imagine.

First Input Delay – FID

The important definitions:

  • FCP – First Contentful Paint: The time it takes (in milliseconds) for the first paint (pixels) on the screen.
  • TTI – Time to Interactive: The time it takes for the page to start loading and become fully interactive.
  • FID – First Input Delay: The time between the first interaction of the user (click, scroll, JS) and the moment the browser is able to act upon that input.

The FCP metric is already part of the reports you might have seen in Lighthouse. The obvious problem with that is that it's just a one-off measurement: it could be different for your actual users, and you likely want a higher sample size for this.

Measuring First Input Delays (FID)

So let's talk about how useful this is actually going to be for your site, starting with the implementation for Google Analytics and Google Tag Manager. The ChromeLabs team has done an amazing job providing a small ‘library’ for tracking these performance metrics via JavaScript. These are the steps to follow for tracking this in Google Analytics (gtag.js/analytics.js) and GTM:

Measuring First Input Delays in Google Analytics

The script that you have just included provides a listener that can be used to decide when an event needs to be fired, or simply to save the value to the dataLayer.

If you’re using analytics.js add this to your HEAD (under the minified script and after initializing GA):

perfMetrics.onFirstInputDelay(function(delay, evt) {
  ga('send', 'event', {
    eventCategory: 'SiteSpeed Metrics',
    eventAction: 'First Input Delay',
    eventLabel: evt.type,
    eventValue: Math.round(delay),
    nonInteraction: true
  });
});

If you're using gtag.js (GA's latest version), add this to your HEAD (under the minified script and after GA has been initialized):

perfMetrics.onFirstInputDelay(function(delay, evt) {
  gtag('event', 'First Input Delay', {
    'event_category': 'SiteSpeed Metrics',
    'event_label': evt.type,
    'value': Math.round(delay),
    'non_interaction': true
  });
});

Measuring First Input Delays in Google Tag Manager

The integration for Google Tag Manager is obviously a little more complex, as you need to add a new Trigger and Variable.

window.dataLayer = window.dataLayer || [];
perfMetrics.onFirstInputDelay(function(delay, evt) {
  dataLayer.push({
    'event': 'first_input_delay', // Not necessarily needed if this loads before GTM is initialized.
    'first_input_delay_value': Math.round(delay)
  });
});

Create the Variable:

Add it to the Google Analytics configuration so it's sent along with either your Events or your Pageviews (decide which works best for your use case). In this case, I've added it to a Custom Dimension at the page level, but you can also easily send it to a Custom Metric to calculate averages.

Custom Reporting on Speed Metrics

When you're using a custom metric to report on FID, you can easily create a page-level metric to show the average first input delay for a page type or template. In this case, I created an example report that shows this only for new visitors (who likely haven't cached assets like JS/CSS/images yet).

Adding other speed metrics

This is just the First Input Delay that you could be adding as a speed-related metric to GA. If you do some digging and are interested in this topic, I recommend going through the rest of the events here. That will give you enough information for a similar integration to measure First Paint (FP) and Time to Interactive (TTI).

All the resources on this topic:

Like I said in the disclaimer, I mostly wanted to make this easier to implement, but all the documentation on how to set this up and what it entails can be found in these resources.


Measuring Content Performance: Content Engagement Metrics

What is the effectiveness of our content, and how well does it work? Who is writing the best content? What should we be writing about next? These were the top X questions I received on a weekly basis while working at The Next Web, and I'm probably not the only one: I hear about a lot of companies asking the same questions. We were a publisher with a big editorial team (~10 FTE editors), publishing around 30-40 blog posts on a daily basis, and had done so for the past ~10 years. That meant there were over 65,000 articles in our archive that we could learn from. One of the hardest projects, in the end.

“Tons of page views doesn't always mean a post is doing well or is of the highest quality/the best journalism.” If one of our posts hit Reddit or Hacker News, or was a top story in Google News, we for sure celebrated, but the impact on the business wasn't always as big as we were hoping. Traffic from most of these sources has very low engagement and basically only brought in money through display advertising (with incredibly low CPMs). Overall, this meant we wanted to find a better metric to score the performance of editors and find out what kind of quality really worked well for the business, not just for engagement and visitor metrics. So we got started looking into this…

Usage Metrics, Useful & Useless

What I still see a lot of companies do (and in all honesty, you can't always blame them) is look at the basic usage metrics of content: how much time a user spent reading an article, what the bounce rate is, etcetera. That isn't great, because most of the time you don't know the context. Is a good blog post one that has been read for two minutes, or one? Obviously, the length of the blog post and whether it has videos or images affects this; data that is usually easily overlooked when analyzing content performance. Even taking this into account, it's hard to come up with recommendations when working for a publisher. Are you really going to recommend using at least 2¾ images in a blog post and requiring a video that is exactly 23 seconds long, so you know for sure people will watch it? Probably not.

Business results

Most of all, these metrics don't align with your business goals. How often does the CEO/COO/CMO ask what the bounce rate of your articles is so they can calculate the return on investment? Never, right? That's why, over time, we need better metrics that align with the business and provide useful insight into what content really helps and what content is just good for vanity metrics like page views. You'd be surprised how many companies are still driven by these.

Capturing goals in Google Analytics is important to measure the impact of performance.

Creating a Content Engagement Metric

It's going to be about the monetary value that content represents for your business, but we won't set it up that way directly. In the end, you need to be able to evaluate how you calculate this, and over time that calculation will likely change with your business. That's why it's important to make these decisions in an open and honest way, so we can make changes later when we think they're needed.

Setting up Goals

If you want to get started and do this on your own, it's important to know what kinds of events matter to your business and site. In our case we looked at, for example: newsletter subscriptions, article shares on social media (they bring in more visits, so useful for more revenue), ticket sales, ecommerce sales, etc. In general, we also wanted to know what a pageview was worth to us. I'm not going to walk you through the setup of all these goals in this post, but if you're unfamiliar with setting up goals in Google Analytics, read & watch this.

Custom & Calculated Metrics

So now you have determined the goals that matter to you. What's important in this step is making sure you assign ‘points/value’ to each of the goals. You don't have to edit the goals for that (probably even better not to, so you don't interfere with the page value metric). For example, think of it like this:

  • Pageview: 1 point
  • Article share: 5 points
  • Newsletter subscription: 10 points
  • Ecommerce sale: 150 points (or varying on the product value)
  • Ticket sale: 250 points

Now it's time to set up a calculated metric to take this into account. Why a calculated metric? By using one, we're able to adjust the formula to our needs over time, and we can individually assign values to the goals if we want to (in that case, we'd use goal completions as well).

When you get started with this, know which numbers you're using for the goals you have set up. You can find them in the Goals interface within Google Analytics.

  1. Name: Content Engagement Score.
  2. Formatting Type: Integer (you can change this to Float or Currency if that’s more applicable)
  3. Formula: {{Goal1 Completions}} * 1 + {{Goal2 Completions}} * 5 + {{Goal3 Completions}} * 10…

In the end that should look something like this:
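As a made-up illustration of the math: a post with 1,000 pageviews, 20 article shares, 5 newsletter subscriptions, and 1 ticket sale would score 1,000 × 1 + 20 × 5 + 5 × 10 + 1 × 250 = 1,400 points.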

Reporting & Dashboarding for Content Quality

You have now successfully created a calculated metric that we can use for reporting and for informing other teams in the company about content performance.

Trial & Error: Formulas should always be up for discussion

This formula isn't going to be right the first time; you'll have to tweak it and assign values that make more sense in your case. That's also why I didn't share TNW's actual values (they were different from the example): it all depends on your specific business and goals.

To talk a little more about how things worked out for us: we tweaked the formula three to four times over a period of two months to get closer to our actual goal of measuring performance. Initially, editors who wrote for a ton of pageviews could still rack up hundreds of thousands of points that way, while editors who did great on business results weren't rewarded for it. So we upped some of the points assigned to our business goals to align them better.

Custom Reporting

Just setting up the goals probably already gets you quite far. It allows you to create a pages + goals report in which you can see how many goals have been hit for certain pages; very useful if you want to measure performance.

Dashboarding: Google DataStudio/Chartio versus (Google) Sheets

You need to share the reporting around this in an easy way; what works best will depend on your teams: sharing spreadsheets, or making it easy for people to immediately look at a dashboard in a tool like Chartio and/or Google DataStudio.

Performance by day/editor/topic

Now that you have all this data, you can hopefully combine it with the data you've gathered through your custom dimensions. I've previously blogged on The Next Web about ideas for custom dimensions. Give it a go; it's surprising how much easier it becomes to report on and analyze performance across topics, time of day, and team members when you can really align this with your business goals.


This is just an example of how we measured the performance of content for the business. There are many other ways to do this, and you'll have to customize the formula for your own business. What other metrics and ideas would you take into account to analyze and report on this? Leave a comment or reach out on Twitter: @MartijnSch


Exporting Amplitude Data to Google BigQuery

I've written about using Amplitude before on this blog (in Dutch), but what if you want to combine the huge amount of data you have in Amplitude with your other big data? The Enterprise version gives you the ability to export your data to a Redshift cluster, but a lot of companies these days are on Google Cloud Platform and want to use Google BigQuery, which is similar in its setup to Redshift.

Amplitude

The Export API from Amplitude lets you download all your event data (regardless of your account plan) for free. There are some limits to the data it can export, but most startups/companies that use Amplitude are likely to stay under them on a daily/hourly basis, which means you can export the data. So basically, everything you need is a cronjob that retrieves the data every hour or day, parses it, and prepares new files that are temporarily stored in Google Cloud Storage (this ensures the data is available more easily/quickly for upload into Google BigQuery). The next step is loading the data from GCS into BigQuery.
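Condensed, the flow looks something like the sketch below. This is not the actual script from the repository: the bucket and table names are placeholders, and scheduling and error handling are left out.

import io
import zipfile

import requests
from google.cloud import bigquery, storage

EXPORT_API = "https://amplitude.com/api/2/export"

def export_window(api_key, secret_key, start, end):
    """Download one export window; Amplitude returns a zip archive of
    gzipped newline-delimited JSON files. Timestamps look like '20210101T00'."""
    response = requests.get(
        EXPORT_API,
        params={"start": start, "end": end},
        auth=(api_key, secret_key),
        timeout=300,
    )
    response.raise_for_status()
    return zipfile.ZipFile(io.BytesIO(response.content))

def stage_and_load(archive, bucket_name, table):
    """Stage the raw files in Cloud Storage, then load them into BigQuery
    (BigQuery can read gzipped newline-delimited JSON straight from GCS)."""
    bucket = storage.Client().bucket(bucket_name)
    uris = []
    for name in archive.namelist():
        bucket.blob(f"amplitude/{name}").upload_from_string(archive.read(name))
        uris.append(f"gs://{bucket_name}/amplitude/{name}")

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    )
    bigquery.Client().load_table_from_uri(uris, table, job_config=job_config).result()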

Google BigQuery

Over the last few years, I've wanted to do more with Google BigQuery, and this project was perfect for it. As BigQuery has many connectors to Google products (Analytics, Data Studio) and other vendors like Tableau, it gives companies the ability to analyze their data and connect it to other sources.

Schemas

Within Google BigQuery we’re going to save the data in two tables:

  • Events: As everything is an event in Amplitude, that's also one of the tables you'll need in Google BigQuery; every event ends up as its own row there.
  • Properties: Every event can have properties of a few different kinds: event, user, group, group properties, and an actual data property. We connect them to the data from the events table.

FAQ

  • Do I need to be a paying customer of Amplitude? No, you don't. The free plan, which I started using this on, supports the Export API as well.
  • What is the cost of the integration? The costs associated with this setup are related to Google Cloud Platform: you'll have to pay for storage in Google Cloud Storage and in Google BigQuery. For the setup that I'm running, we're saving millions of rows monthly, and the total costs are less than 10 dollars.
  • What do I need to do to get this up and running? Check out the README in the repository on GitHub; it gives you a getting-started checklist to ensure you can run this script.

Feedback? / Contribute?

I haven't been the first person to work on an integration with BigQuery; I've seen other solutions for Java and Python, but they all work a bit differently. If you have any feedback on the setup, leave an issue on GitHub or submit a pull request with your proposed changes. In the end, I can code, but I don't consider myself to be an engineer 😉


Using Amplitude for Product & Web Analytics

I’ve previously published this blog post in Dutch on Webanalisten.nl.

What if you're looking for a product for web analytics but have a lot of events, a complicated product, and you're sending more and more data over time? Sometimes it just wouldn't work to go with Google Analytics (360) or Adobe Analytics, and integrating your custom-built solution or Snowplow might be too complicated for your organization. In this post, I'd like to show you another tool that might be interesting to you: Amplitude. I've been working with it for the last year, and it provides some great flexibility over other products in the industry.

What is Amplitude?

“Analytics for modern product teams”

Or in normal language: you can track any event and connect it to a user. All the events you send can have properties, just like the user can have properties. You can filter by all these data points and shape your data into multiple types of charts: funnels, event hits, revenue, etc. In this article, we'll run through how you can use Amplitude and what it's good for. Let's go ahead and dive in!

Why would you be using Amplitude?

You want to measure what is happening within your product and what users are doing, keeping in mind that all this data can help you improve their workflows and measure the impact certain changes have on their behaviour. In that respect, Amplitude mostly competes with tools outside of web analytics, like Google Analytics for Firebase, Snowplow, KISSmetrics, and Mixpanel. In the next section, we'll explain why: a lot of features are interpreted differently from regular web analytics but can still help you a lot in your daily work.

What's the difference?

  • Instant Reporting/DIY: While most analytics tools provide a lot of pre-configured dashboards, Amplitude lets you do this all on your own. That can be time-consuming, but in my opinion it also makes you think a bit more about how you set up your analytics infrastructure.
  • No default metrics: Bounce rate doesn’t exist as any event can be triggered to influence it (plus, would that be your most useful metric anyway?)
  • Funnels: Anything can be a funnel, which in my opinion makes it very powerful: it doesn't require you to create predefined goals, and it lets you create funnels retroactively (recognize this pain point in Google Analytics?). If you added events a few weeks ago and now want to build a funnel out of them, you can. Want to change the funnel and add or remove events? You can.

  • User/Session: Sessions (on the web) don't really exist in Amplitude. While session is, in my opinion, a very loosely defined metric anyway, it can come in handy from time to time to measure retention. Amplitude does provide session data on mobile, where sessions are much easier to define (app open/close).
  • Channel/Traffic Source: If you're looking for an easy way to track where your visitors come from, with detailed reports tied to costs, that's just not what Amplitude is for. While it can easily save the UTM data you send along, it won't offer you great reporting around it; its focus is mostly on product analytics.
  • Merging Events/Linking Events: At the beginning of this section, we talked about the need to set up all the dashboards yourself. As you won't have a fully defined tracking plan for the next few years, it can be hard to follow a certain naming convention from scratch, which usually turns your analytics data into unstructured chaos over time. Within Amplitude, you're able to merge certain event names and link them together, so you can easily change your old event names to something new and still connect the data. One of the features I really miss in other tools when I'm trying to redefine naming conventions and clean them up.

Why data governance is even more important

The role of data governance becomes even more important when using tools like this, combined with the need for good documentation. If you come into an organization that is already sending hundreds of different events, it can be really hard to get started on deeper analysis, as you're not always familiar with the following:

  • Naming conventions: You want to make sure you're using the right names for the events and that they're logical given the data you send. It would be good to give this article on creating measurement plans, which I wrote for online-behavior.com, a read. We'll talk later about how Amplitude can still help you if you'd like to make changes to the events you've sent.

  • Segments/Cohorts: As most user data is saved in event or user properties, you also need to make sure that the data in there doesn't change too often, as it might affect how you've set up your segments and cohorts.

  • Funnels and many other reports, which can also be impacted by the way you save data.

Overview of features

  • Dashboarding/Charts: The flexibility Amplitude provides mostly shows in the way you work with charts and add them to dashboards. You can create dozens of different charts and add them to a dashboard, which will then, for example, give you the ability to change the date range. If that's not enough, you can still make all the changes from the actual chart.
  • A/B Testing – Significance Calculator: Are you running an A/B test on your site and sending the data to Amplitude? Within certain charts, you can segment out the results and immediately calculate whether they're significant for what you're analyzing. It saves time hunting for a significance calculator.

  • Custom Metrics: Just like many other web analytics tools, Amplitude gives you the ability to create custom formulas within a chart to calculate other metrics.

  • Retroactive reporting: You added tracking months ago, but only today figured out that an event should be measured as a goal? You can set up a goal funnel really easily with old data and have all of it available to you.
  • Realtime: The fact that all the events you send to Amplitude are processed in real time makes it very powerful. Within minutes of launching a new feature, you can start analyzing the data to see what's going on or whether something is broken. Gone are the times when you needed to wait hours for all the data you're collecting to be fully available.
  • Unlimited event/user properties & ‘dimensions’: Every event can have properties related to it. In addition, a user can have properties that can be used too. So if I want to mark certain users with an action, I can easily send that along with an event to update their records.
  • CLTV: Measuring the lifetime value of users will obviously require you to start identifying users (relatively easy to set up). But this enables you to look into how your users perform over time and, if you have high retention, what that means for their customer lifetime value. This example report would show the performance of a segment of users over the last 12 weeks and what they're worth to the business.

Chart for CLTV

What’s missing?

Google integrations? Obviously some things are missing: while the cohort feature's abilities are very powerful and Amplitude provides some cool integrations with other software, it still can't make the connection with the audience data from Google, which is obviously always going to be a big upside of the Google Analytics suite.

Transactions/Purchases: The way Amplitude tracks a conversion is a bit weird. You send all the products purchased as separate revenue events; there is no concept of a purchase, which seems strange. It's also really hard to identify what the Xth purchase was; these are events you need to set up yourself.

UTM/Traffic Source Reporting: It does exist, but it isn't great, and it's definitely not as powerful as what you're used to in your regular web analytics tools. Does it get the job done for product analytics? Yes, I'd say it does. If you're looking to do anything more advanced with the data, you should build additional capabilities on your own end.

Use Cases

  • Funnels: Every event can be part of a funnel, which makes it very flexible and useful if you want to compare user behaviour; for example, you can connect certain user actions that happen before a purchase funnel.
  • Customer Lifetime Value/Retention:
  • Cohorts: Where you have segments & audiences in Google Analytics, here you have the ability to create cohorts of users to measure the impact of certain properties/events on their behaviour over time. For example, at Postmates we often used a cohort of users whose sign-up referrer included google, yahoo, or bing (organic search users). We would use these cohorts either to export them for other marketing purposes (email/push campaigns) or to analyze their behaviour against other cohorts:
    • How do organic search users in the last month behave different if they have used x feature?
    • How do users who have touched feature x find the other features?

Segmenting users with its Cohort feature.

Conclusion

Overall, I'm pretty satisfied with Amplitude and how its flexibility helps you add/create events and figure out later what kind of dashboards/charts you'll build on top of them. But it's likely (for now) not going to replace most of the data you're used to in web analytics, as that would require a lot of additional setup and tracking. You can, however, use it very effectively within organizations to track certain aspects of user behaviour. All in all, it's a great addition to most analytics teams, and I'd advise most companies to use these tools together, as they can be very useful in providing more insight into what the user is specifically doing.

If you've worked with Amplitude and want to share more about your experiences, leave a comment! I'm currently working on exporting Amplitude data to Google BigQuery for ‘big data’ analysis; in a future post, I hope to share how you can set that up yourself.