I presented the content in this blog post about a week ago for the Traffic Think Tank community (highly recommend it), but after a Twitter thread on this topic as well, it’s time to turn it into a blog post.
Sometimes you have to take a stand and make something better when it’s already performing well. Over the last months, the RVshare marketing team worked on some great projects; one of them that I was involved in was restructuring 6800 pieces of content that we created a while ago. The content and pages they were on were performing outstanding (growing +100% YOY without any real effort), but we wanted to do more, to help users and boost SEO traffic. So we got started…
Why restructure content?
A couple of years ago, we published the last WordPress page/post in a series of 600+, the intent: go after a category near and dear to the core of the RVshare business: help more people rent an RV. We did that by creating tons of articles specifically for cities/areas. Now over two and a half years later, the content is driving millions of people yearly, mainly from SEO, but we knew that there was more as it’s not our core business. We also weren’t leveraging all the SEO features that have become available since two years ago, think about additional structured data like FAQs but also monetization that we thought was important. All improvements that we had to go back into every post for if we wanted to take advantage of it.
What we did, leveraging Mechanical Turk.
One of the biggest obstacles wasn’t necessarily rebuilding pages, coming up with a better design, etc. WE have a great team that is nailing this on a daily basis. But having to deal with 650 posts that contained ten sub-elements itself was a struggle. The content was structured in a similar way but some quick proof of concepts identified that scraping wasn’t the solution as the error ratio was way too high as with most projects we wanted to ensure that the content could be restructured at low costs not to avoid this project not having a valid business case (does the actual opportunity outweigh the potential costs to restructure the content?).
Scraping versus Mechanical Turk
As we had initially structured the content the same way: headline, description, etc. we were able to have at least a way to get the data out. When we did some testing to see if we would be able to scrape it looked unfortunate, there were too many edge cases as the HTML itself around it was barely structured enough to get the actual content out of it.
We looked into Mechanical Turk as the second option as it gave us the ability to quickly get thousands of people on a task to look at the content and take out what we needed. We wrote the briefing, divided the project in a few chunks, and within 10-12 hours, we had the content individualized per piece. We did our best to deal with most of the data cleaning from the workers directly in the briefing and form but also had some cleaning scripts ready. After it was cleaned, we imported the data into our headless CMS Prismic.
Create a project focused around content extraction.
Identify what kind of content you want individualized, it works best if there is a current structure (list format, table) that can be followed by the Turks. This way, you can tell them to pick up content piece X, Y, Z, for a specific URL.
Identify the fields that you want to be copied.
Upload a list of URLs that you want them to cover and additionally the # that it has on the list.
Start the project and verify the results.
Upload the data automatically back into your CMS (we used a script that could directly put the content as a batch into our headless CMS Prismic.io)
We decided to build the content from the ground up, which meant:
Build out category pages with the top content pieces by state.
Build out the main index page with the top content from all states.
Build the ability to showcase this content on all of our other templated pages across RVshare.
By building out the specific templates, it gave us additional power to streamline internal linking, create better internal relevance, build-out structured data but mainly figure out a right way on how to leverage a headless CMS with all its capabilities instead of just having raw (read: ‘dumb’) content that can’t be appropriately structured. We already use the headless CMS Prismic.io to do this, in which you can create custom post types, as you see in this screenshot. You define the custom post type and can pick the kind of fields that you want, which turns itself just another CMS after that. The content can then be leveraged through their API.
How to do this yourself?
We were previously leveraging WordPress ourselves, but all entities were saved as 1 post. If you’re able to do this differently and save pieces individually it’s many times easier to create overview pages by using categories (and/or tags). This is not right away something that you can always do without development support.
Because of the design changes, engagement increased with over 25% because of the new format. Monetization is making it more interesting to keep on iterating on the results. Sessions were unfortunately really hard to measure we launched the integrations a few weeks prior to the kick-off of COVID-19 resulting in a downwards spiral and a surge in demand right after. Hopefully, in the long-term, we’ll be able to tell more about this. We are sure though that we didn’t suffer on SEO results.
This blog post is part of a series of four? (we found enough content to add more articles ;-)), in which we’ll dive into the details of why we wanted to create a data warehouse, how we created the data lake, how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @hu_me / MarketLytics who we’ve worked with alongside during this (ongoing) project.
The Process of Building a Data Lake
Three Options: BigQuery Transfer Service, Firebase, SuperMetrics, and StitchData
Loading data from a dozen marketing vendors (not all of them are visualized in this graph) as efficiently as possible was important to us as we tried to save most money and resources for other parts of the data warehouse integration. In the end, we believe that we should let people do the jobs that they’re good at and leverage vendors instead of always following a ‘build ourselves’ strategy. This makes it possible for us to focus on the actual use cases instead of having to worry about the upkeep of integrating with over a dozen vendors, something that could/would take up a chunk of time in the long term.
That’s why we made a few decisions on how to load data:
Firebase: Google Analytics App & Web automatically can export to a BigQuery data set. So we decided to run our regular Universal Analytics integration of Google Analytics in sync with App & Web for now so we have the ability to cheaply export Google Analytics data.
BigQuery Transfer Service: Google Cloud offers some great integrations for YouTube, Google Ads that can be set up within minutes which makes it possible to instantly backfill and load data.
StitchData: We decided to go with Stitch as their pricing was more flexible and cheaper for the needs that we had. There are many alternatives to them, but they had integrations with most vendors that we needed. Since we didn’t require Enterprise level SLA’s (yet), we knew that we could rely on their standard integrations with many vendors that we have integrated with.
SuperMetrics: Some vendors weren’t available at the time in StitchData (Facebook Insights) so we decided to get a license for their BigQuery export.
‘Manual Import’: As a last resort we decided to build connections to BigQuery ourselves, so far we only had to do that for 2 vendors which made the workload for this lower than expected.
Understanding platform APIs and Data Granularity
There were over a dozen vendors that we wanted to import data from and in different ways, as described, to load the data into our data lake: Google BigQuery. We needed to ensure upfront that data and platforms that we would be using were up for the task, as some important areas could have thrown problems our way. This was needed to build a good understanding of the platforms and also how we could leverage them. What we looked at:
Data Freshness & Backfill capabilities: How often would the data refresh and how often would we require a refresh of the data? For some vendors a 24-hour sync was more than enough but for others this meant that a 24-hour sync would last for many hours, basically making yesterday’s data useless as it would take too long. On the other side, we needed to make sure that we could backfill data from vendors and with most decisions on how to load this data that was possible (except for manual backfill, in which case we needed to write scripts that could go back in time to backfill its data).
Cut time to collecting: Do we need all the data from an API or only partial responses? And in combination with the previous point, do we need to load data hourly (realtime wasn’t much of a case) or could we just gather the data every day to save time collecting it?
Hard to see all situations beginning end to end: Google Ads exports dozens of tables through their Transfer Service into BigQuery, which meant that it took a while to figure out what tables to use and how certain columns were translated into API fields & tables.
Using data to find gaps and roadblocks: What worked out for us is that we tried visualizing the data as soon as possible so we could quickly find anomalies (days of data missing due to a broken backfill process), otherwise the data wouldn’t match the interface because metric X turned out to be calculated differently from what we thought.
BigQuery Storage: The lowest cost out of all of this, while we store dozens of GBs daily it’s ridiculously cheap to store all of this long-term.
Stitch: Pricing can be found here. This is really dependent on how much data you’re exporting and from how many sources as you’re billed based on usage.
SuperMetrics: Pricing can be found here. Basically $190 per vendor monthly.
BigQuery Transfer Service: This is free for a couple of months for Google Ads export and YouTube. Previously it was $2.50 – one less coffee at Starbucks monthly.
BigQuery BI Engine Reservation: While it’s not fully related to the data lake I want to add a mention for it. We’re paying $60 monthly for a 2GB size reservation. Currently, we’ve optimized queries as much as possible to be under this limit, if needed we might move up. This has been a game-changer to make querying for dashboarding lightning fast.
Cutting time to prove the value
We knew many use cases that we wanted to solve with our data lake, for example: show Google Analytics metrics next to the open, unsubscribe rates from our marketing newsletters (Marketo). This meant that we had the work cut out for us and could quickly take a sample of the data and build out the use case to see if the data could be matched (would we run into tracking issues somewhere, we sometimes did) and what possibilities we had for visualization. In short, we tried to cut as much time out of the initial process as possible, by not focusing on scalability yet, to figure out if the use cases that we had actually could provide value for the teams. Basically thinking about the end goals that we had in mind.
We did that by:
Talking to end stakeholder – interview & document: The goal of the projects was to, at a minimum, provide every channel owner (PPC, SEO, Social, etc.) with dashboards with their core metrics. That was the basic level that we asked of them to provide, on top of that, we discussed briefly with them all the other use cases that they knew about that they wanted to repeat. By doing this we immediately had documentation on how the process currently worked for them, what data sources were needed to be included and what needed to be built. But more importantly, it also created their buy-in as they were able to pick and choose what we build for them.
Taking a sample and building a use case: Load the data from the vendor into the data lake and investigate what the schemas were and if the granularity of the data was enough. As long as there was a key in multiple tables that we could leverage between vendors, we knew we were on the right track (examples: UTM tags in emails needed to match with Google Analytics conventions).
Automate & Backfill later: We never focused on making sure that this could scale over the next years or would backfill our data directly for multiple months at a time. In the end, we noticed that we knew the process was iterative and that we would come up with new use cases which would potentially mean that we had to backfill many times for the same vendors.
Backfill Manually: Depending on the vendor and type of automation we could put in place, we backfilled some of the data manually where it was needed. Sometimes manually connecting some data points in a database makes more sense when you try to match vendors than it does trying to overcomplicate things in a codebase. We figured this out quickly when we had 50 email templates that we manually needed to map to other data sources.
The (unexpected) problems that we ran into?
Certain things always go differently from the expectations that we set upfront; that’s why I wanted to touch on some of the ‘problems’ that we ran into and how we dealt with it.
Inaccurate data: Dashboard versus Dump
Exports are always different from the data you see in the dashboards, mainly because the information is stored differently from how it’s visualized. For example, Cost data from Google Ads is saved in micro-cents, which means you always have to divide it to get the correct number. CTRs from multiple vendors are just a calculation of Clicks / Impressions. And in some cases, it meant that specific data points aren’t available for export through an API (looking at you Facebook).
Overall this means that you have to assess upfront what data schema is available in export carefully and if this matches the use cases that you want to fulfill. If that’s not the case, sometimes it might not be worth it to try to automate it.
Duplication as a problem in Stitch
StitchData has a different way of exporting data into BigQuery than we were used to. This meant that we had to get used to deduplicating data before we used it for analysis; in short: they created new rows in tables when they notice the unique row has changed. This means that we always needed to find the latest row value for a record. Not a massive problem in itself, but it took us a few days to figure this out as the documentation around this, at the time, lacked.
How much useful data do you have available by vendor? As some vendors are very structured: Google Ads have their own transfer service into Google BigQuery, so it was easy to backfill the data here. But for other vendors: Marketo, Sendgrid, it was harder as we made so many changes to professionalize it over time. This meant that for some, we were able to backfill data into 2018, whereas for others, we could barely get data from the last six months. Knowing this upfront is essential as it might make certain use cases irrelevant if you’re trying to learn more from YoY behavior, and it turns out the data isn’t compatible.
Exceptions, for some sources (read Marketo) we had to rely on other solutions. There are limitations to some of the data that we can get out of their (old skool) API. So for some of it, we’re getting daily email reports that are parsed and sometimes ‘manually’ ingested into Google BigQuery. It’s some of the workarounds that we had to do as we would want the data to be accessible through our data lake and some vendors aren’t always as accessible.
While we had an early discussion about the naming conventions that we wanted to follow, this was more food for thought than expected. Quickly we had tons of datasets in BigQuery that inside of them followed their conventions (defined by the data load providers). As we were applying ETL on some data tables, it meant that we needed to make sure that it was clear what every table and field in the schema meant. For example, there were tables that were only used: temporarily, for backfill, for production, for development, for transformation, etc. What made it more complicated is that most of them were also individualized by vendor. In our case, we settled on making the distinction between: ‘prod/dev’ + ‘raw/formatted/dashboard’ + ‘tables’ for datasets and then specifying the specific vendor name and goal with the tables in the dataset.
For the last four years, I wrote blog posts (2019, 2018, 2017 & 2016) listing the books that I read in the past year and that I wanted to be reading during that year. As always, the past year I didn’t read all the books that I’ve listed out in the blog post as I discovered some new ones and changed my focus on some others. But I did read a lot, as I finished ~25 books (and put two books aside that weren’t worth finishing).
What books I didn’t get to in 2019 and have re-added to the list for 2020:
The Intelligent Investor: Wanting to read more about investing, this is apparently a good book to start with to learn more.
My favorites from 2019:
High Growth Handbook: My favorite book from last year as it touched right on the topics that I care about on a daily basis, great interviews with daily practitioners on many topics that relate to High Growth companies including a lot of tactical advice.
The Ride of a Lifetime: A fascinating read on how Robert Iger made it’s way to the top of Disney as their CEO. One of my favorites as it shows the personal side of growth he had to go through to reach the top and it provides some insights along the way of how big Disney is of a company.
Loonshots: A different book from what I would usually read as it went into detail on projects that have changed the world and created innovations that we still benefit from these days (think radar, energy).
Secrets of Sand Hill Road: A great book if you want to know more about investing, money and the workings around startups related to investors, deal flows and structures.
What books I’d like to be reading in 2020:
The Great CEO Within: Started this book on the 1st of January and it aligns a lot with the High Growth Handbook but takes a bit more the approach from a CEO point of view with tons of tactical advice. With the lack of books around other C-level positions, it will hopefully provide enough useful input for other exec positions.