Case Study: How Restructuring 6800 Content Pieces For SEO Worked

I presented the content of this blog post about a week ago to the Traffic Think Tank community (highly recommend it), and after a Twitter thread on the topic as well, it's time to turn it into a proper blog post.

Sometimes you have to take a stand and make something better even when it's already performing well. Over the last few months, the RVshare marketing team worked on some great projects; one that I was involved in was restructuring 6800 pieces of content that we created a while ago. The content, and the pages it lived on, was performing outstandingly well (growing +100% YOY without any real effort), but we wanted to do more to help users and boost SEO traffic. So we got started…

Why restructure content?

A couple of years ago, we published the last WordPress page/post in a series of 600+. The intent: go after a category near and dear to the core of the RVshare business and help more people rent an RV. We did that by creating tons of articles specifically for cities/areas. Now, over two and a half years later, the content is driving millions of visitors yearly, mainly from SEO, but even though it's not our core business, we knew there was more we could do with it. We also weren't leveraging all the SEO features that have become available in the last two years, think additional structured data like FAQ markup, but also the monetization that we thought was important. All improvements that we would have had to go back into every single post for if we wanted to take advantage of them.

What we did: leveraging Mechanical Turk

One of the biggest obstacles wasn't necessarily rebuilding the pages, coming up with a better design, etc. We have a great team that nails this on a daily basis. But having to deal with 650 posts that each contained ten sub-elements was a struggle in itself. The content was structured in a similar way across posts, but some quick proofs of concept showed that scraping wasn't the solution, as the error ratio was way too high. As with most projects, we wanted to ensure that the content could be restructured at low cost so that the project would still have a valid business case (does the actual opportunity outweigh the potential costs of restructuring the content?).

Scraping versus Mechanical Turk

As we had initially structured the content the same way (headline, description, etc.), we at least had a way to get the data out. But when we did some testing to see if we would be able to scrape it, the results looked unfortunate: there were too many edge cases, as the surrounding HTML was barely structured enough to get the actual content out of it.

We looked into Mechanical Turk as the second option, as it gave us the ability to quickly get thousands of people on a task to look at the content and extract what we needed. We wrote the briefing, divided the project into a few chunks, and within 10-12 hours we had the content individualized per piece. We did our best to handle most of the data cleaning directly in the briefing and the form the workers filled out, but also had some cleaning scripts ready. After the data was cleaned, we imported it into our headless CMS, Prismic.

How to do this yourself?

  1. Create an account on Mechanical Turk.
  2. Create a project focused on content extraction.
  3. Identify what kind of content you want individualized; it works best if there is an existing structure (a list format, a table) that the workers can follow. This way, you can tell them to pick up content pieces X, Y, Z for a specific URL.
  4. Identify the fields that you want to be copied.
  5. Upload a list of the URLs that you want them to cover, along with the number each item has on the list.
  6. Start the project and verify the results.
  7. Upload the data automatically back into your CMS (we used a script that could push the content as a batch into our headless CMS Prismic.io). A rough sketch of such a script follows below.
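To give an idea of what step 7 can look like, here is a minimal sketch of that kind of script. It assumes a Mechanical Turk batch-results CSV; the `Input.*`/`Answer.*` column names are just examples of the fields you would define in your own project, and the actual push into the CMS depends on your CMS's import tooling, so this version only writes CMS-ready JSON files.

```python
import csv
import json
from pathlib import Path

# Hypothetical file and column names: adjust to the fields you defined in your MTurk project.
RESULTS_CSV = "mturk_batch_results.csv"
OUTPUT_DIR = Path("cms_import")

def rows_to_documents(results_csv: str):
    """Turn each approved MTurk assignment into a CMS-ready document dict."""
    with open(results_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Skip rejected work so bad extractions never reach the CMS.
            if row.get("AssignmentStatus") == "Rejected":
                continue
            yield {
                "source_url": row.get("Input.url", "").strip(),
                "list_position": row.get("Input.position", "").strip(),
                "title": row.get("Answer.title", "").strip(),
                "description": row.get("Answer.description", "").strip(),
            }

def main():
    OUTPUT_DIR.mkdir(exist_ok=True)
    for i, doc in enumerate(rows_to_documents(RESULTS_CSV)):
        # One JSON file per content piece; a separate batch job pushes these
        # into the headless CMS (Prismic's import tooling, in our case).
        (OUTPUT_DIR / f"piece_{i:05d}.json").write_text(
            json.dumps(doc, indent=2), encoding="utf-8"
        )

if __name__ == "__main__":
    main()
```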

Rebuilding

We decided to build the content from the ground up, which meant:

  • Build out category pages with the top content pieces by state.
  • Build out the main index page with the top content from all states.
  • Build the ability to showcase this content on all of our other templated pages across RVshare.

Building out the specific templates gave us additional power to streamline internal linking, create better internal relevance, and build out structured data, but mainly to figure out the right way to leverage a headless CMS with all its capabilities instead of just having raw (read: 'dumb') content that can't be appropriately structured. We already use the headless CMS Prismic.io for this, in which you can create custom post types, as you can see in this screenshot. You define the custom post type and pick the kinds of fields that you want, and from there it behaves like just another CMS. The content can then be leveraged through their API.
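To give an idea of what "leveraged through their API" looks like in practice, here's a small sketch of pulling all documents of a custom type out of Prismic's REST API with Python. The repository name and the `campground_guide` type are made up; swap in your own.

```python
import requests

# Hypothetical repository name and custom type; replace with your own.
API_ENDPOINT = "https://your-repo.cdn.prismic.io/api/v2"
CUSTOM_TYPE = "campground_guide"

def fetch_documents():
    # The API root returns the refs; queries run against the master ref.
    master_ref = next(
        ref["ref"] for ref in requests.get(API_ENDPOINT).json()["refs"]
        if ref["id"] == "master"
    )
    documents, page = [], 1
    while True:
        resp = requests.get(
            f"{API_ENDPOINT}/documents/search",
            params={
                "ref": master_ref,
                "q": f'[[at(document.type,"{CUSTOM_TYPE}")]]',
                "pageSize": 100,
                "page": page,
            },
        ).json()
        documents.extend(resp["results"])
        if resp.get("next_page") is None:
            break
        page += 1
    return documents

if __name__ == "__main__":
    docs = fetch_documents()
    print(f"Fetched {len(docs)} documents of type {CUSTOM_TYPE}")
```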

How to do this yourself?

We were previously leveraging WordPress ourselves, but all entities were saved as one post. If you're able to do this differently and save the pieces individually, it's many times easier to create overview pages by using categories (and/or tags). This isn't always something you can do right away without development support.

Results

Because of the design changes and the new format, engagement increased by over 25%. Monetization is making it more interesting to keep iterating on the results. Sessions were unfortunately really hard to measure: we launched the integrations a few weeks prior to the kick-off of COVID-19, which resulted in a downward spiral and then a surge in demand right after. Hopefully, in the long term, we'll be able to tell more about this. We are sure, though, that SEO results didn't suffer.


Want to see the new structure of the pages? You can find it here, in our overview of the top 10 campgrounds across the United States.


Part 5: Airflow on Google Cloud Composer – Building a Marketing Data Lake and Data Warehouse on Google Cloud Platform

In the previous blog posts (part 1, part 2, part 3, and part 4) in this series, we talked about why we decided to build a marketing data warehouse. This endeavor started by figuring out how to deal with the first part: building the data lake. In this fifth blog post, a more technical one, I'll give some insights into how we're leveraging Apache Airflow to build the more complicated data pipelines, and I'll give you some tips on how to get started.

This blog post is part of a series of five? (maybe more, you never know), in which we'll dive into the details of why we wanted to create a data warehouse, how we created the data lake, and how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @Hussain / MarketLytics, whom we've worked alongside during this (ongoing) project.

Getting Started with Cloud Composer

Cloud Composer is part of Google's Cloud Platform and brings you most of the upside of using Apache Airflow (open source) with barely any of the downsides (setup, maintenance, etc.). Or, to quote their main USP: "A fully managed workflow orchestration service built on Apache Airflow." While we had worked with Airflow before, we weren't looking forward to spending time worrying about its management, as we planned to spend most of our time setting up and maintaining the data pipelines. In the end, all you have to do is stick to creating pipelines (DAGs).

What is it suitable for?

Say you want to load data from the Google Analytics API, store it locally, translate some values into something new, and have it available in Google BigQuery. However you would build it, it's multiple tasks and functions that depend on each other. You wouldn't want to load the data into BigQuery when it hasn't been cleaned yet (trash in, trash out, sounds familiar?). With Airflow, the next task is only processed if the previous step was successful.

Tasks

Tasks are, in almost every case, just one thing: get data from BigQuery, upload a file from GCS into BigQuery, download a file from Cloud Storage to local, process data. What makes Airflow very efficient to work with is that the majority of data processing tasks already have pre-built functions. Several of the tasks that I listed here have corresponding operators (GoogleCloudStorageDownloadOperator, GoogleCloudStorageToBigQueryOperator) that work as ready-made functions.
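As a minimal sketch of how those pre-built operators are used: the bucket, file, and table names below are made up, and the import paths are the Airflow 1.10-era contrib modules that Cloud Composer shipped at the time (newer versions moved these into provider packages).

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcs_download_operator import GoogleCloudStorageDownloadOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

with DAG(
    dag_id="example_gcs_to_bq",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Pull a file from Cloud Storage onto the worker's local disk.
    download_file = GoogleCloudStorageDownloadOperator(
        task_id="download_file",
        bucket="example-marketing-bucket",
        object="exports/ga_sessions.csv",
        filename="/home/airflow/gcs/data/ga_sessions.csv",
    )

    # Load the same object from Cloud Storage straight into a BigQuery table.
    load_to_bq = GoogleCloudStorageToBigQueryOperator(
        task_id="load_to_bq",
        bucket="example-marketing-bucket",
        source_objects=["exports/ga_sessions.csv"],
        destination_project_dataset_table="example-project.raw.ga_sessions",
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
    )

    download_file >> load_to_bq
```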

Versus Google Cloud Functions

If you mainly run very simple 'pipelines' that consist of just one function that needs to be executed, or you only have a handful of use cases, it is likely overkill to leverage Cloud Composer; the costs might be too high, and you still have the overhead of writing DAGs. In that case, you might be better off with Google Cloud Functions, where you can write similar scripts and trigger them with Google Cloud Scheduler to run at a specific time.
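For comparison, a single-function 'pipeline' on Cloud Functions can be as small as the sketch below (project, dataset, and table names are made up); Cloud Scheduler then simply hits the function's HTTP endpoint on a cron schedule.

```python
from google.cloud import bigquery

def daily_rollup(request):
    """HTTP-triggered Cloud Function, invoked once a day by Cloud Scheduler."""
    client = bigquery.Client()
    # Illustrative query: aggregate yesterday's raw events into a daily table.
    client.query(
        """
        INSERT INTO `example-project.reporting.daily_sessions` (day, sessions)
        SELECT DATE(event_timestamp), COUNT(*)
        FROM `example-project.raw.events`
        WHERE DATE(event_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
        GROUP BY 1
        """
    ).result()
    return "ok"
```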

Costs

The costs for Google Cloud Composer are doable: for a basic setup, it's around 450 dollars a month (if you run the instances 24 hours a day, 7 days a week), as you're running multiple (a minimum of three) small instances. For more information on the costs, I would point to this pricing example.

Building Pipelines

See above an example data pipeline; in typical Airflow fashion, every task depends on the previous task. In other words: notify_slack_channel would not run if any of the previous tasks failed. All tasks happen in a particular order, from left to right. In most cases, data pipelines become more complicated than this, as you can have multiple flows going on at the same time that are combined at the end.
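That fan-in pattern is expressed in a DAG with Airflow's bitshift syntax; here's a bare-bones sketch with placeholder tasks (all task names, including notify_slack_channel here, are purely illustrative).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG("example_fan_in", start_date=datetime(2020, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    load_ads_data = DummyOperator(task_id="load_ads_data")
    load_ga_data = DummyOperator(task_id="load_ga_data")
    combine_sources = DummyOperator(task_id="combine_sources")
    upload_to_bigquery = DummyOperator(task_id="upload_to_bigquery")
    notify_slack_channel = DummyOperator(task_id="notify_slack_channel")

    # Two parallel load flows merge, then everything runs left to right;
    # notify_slack_channel only runs if all upstream tasks succeed.
    [load_ads_data, load_ga_data] >> combine_sources >> upload_to_bigquery >> notify_slack_channel
```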

Tips & Tricks

Google Cloud Build, Repositories

The files for Google Cloud Composer are saved in Google Cloud Storage, which is smart in itself, but at the same time you want them to live in a Git repository so you can efficiently work on them together. By leveraging this blog post, you're able to connect the Cloud Storage bucket to a repository and set up a sync between the two. This basically helps you build a deployment pipeline and makes sure that only production-ready code from your master branch ends up in GCS.

Managing Dependencies

After working with it for a few months now, I'm still not sure if managing dependencies through Google Cloud Composer is a good or a bad thing, as it creates some obstacles if you run a deployment and want to add some Python libraries (your workers could be down for 10-30 minutes at a time). In other setups, this is usually a bit smoother and creates less downtime.

Sendgrid for Email Alerts

One of the upsides of Apache Airflow is that it sends alerts upon failure of tasks. Make sure to set up the SendGrid notifications while you're setting up Google Cloud Composer. This is the most straightforward way of receiving email alerts (for free, as in most cases you shouldn't get too many failure emails).
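Once SendGrid is configured on the Composer environment, the DAG side is just standard Airflow default_args; a quick sketch (the alert address is made up):

```python
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    # Failure emails go out through the SendGrid integration configured on Composer.
    "email": ["marketing-data-alerts@example.com"],
    "email_on_failure": True,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_with_alerts",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)
```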

README

Document the crap out of your setup and DAGs. When I took over some of the pipelines that were used at Postmates for XML sitemap generation, it was a nightmare: they were hard to read, the code didn't make a lot of sense, and we had to refactor certain things just because of that. Since pipelines (just like regular code) can be left untouched/unviewed for months (they literally sometimes only have one job), you want to make sure that when you come back, you understand what happens inside the tasks.


Again… This blog post is written with the help of @RickDronkers and @Hussain / MarketLytics, whom we've worked alongside during this (ongoing) project.


Part 4: Visualization with Google DataStudio – Building a Marketing Data Lake and Data Warehouse on Google Cloud Platform

In the previous blog posts (part 1, part 2, and part 3) in this series, we talked about why we decided to build a marketing data warehouse. This endeavor started by figuring out how to deal with the first part: building the data lake. In this fourth blog post, we'll chat about how we're visualizing all the data we saved in the previous steps by using Google DataStudio.

This blog post is part of a series of four? (maybe more, you never know), in which we'll dive into the details of why we wanted to create a data warehouse, how we created the data lake, and how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @Hussain / MarketLytics, whom we've worked alongside during this (ongoing) project.

How we build dashboards

Try to think ahead about what you need: date ranges, data/date comparisons, filters, and what type of visualization. This will help you build a better first version right away. What that mainly looked like for us:

  • Date ranges: The business is so seasonal that Year over Year growth is the most important view for RVshare, and since we often don't get enough context on metrics on a weekly basis, we default to the last 30 days.
  • Filters: For some channels (PPC, Social), it's more relevant to be able to filter the data down to a campaign or social-network level, because in most cases the aggregate level doesn't tell the whole story right away.
  • Visualization: We need the top metrics, sessions and revenue, in view right away with the YoY comparison, so we know within seconds what is going on and where we can improve things.

Talking to Stakeholders (Part Deux)

In the first blog post, we talked about connecting with our stakeholders (mainly our channel owners) and gathering their feedback to build the initial versions of their dashboards (beginning with the end in mind). We used this approach to put the first charts, tables, and graphs on the dashboards, after which we connected with the owners again to see what data points were missing and, in some cases, to validate the data that they were seeing on their dashboards. This gave us additional feedback for fast follow-ups and made for quick iterations on the data that we had and could show. For social media, as an example, it turned out that we wanted to show additional metrics that we hadn't thought of initially but that were in our data lake anyway. These sessions were a good way for us to build additional pieces into our data warehouse while we were at it. These days, some of these dashboards are used weekly to report to other teams in the organization or within the team itself.

Best Practices

Blended Data

Do you want to blend data in Google DataStudio, or do you want to create joined/aggregate tables in BigQuery? For most of our use cases, we have opted for using DataStudio's blended data sources to handle the JOINs. It's easier: we have the ability to quickly pull some new data together versus having to deal with the data structures and complicated queries. In some cases, we noticed while building dashboards that we were missing data in our warehouse tables (not the lake) and were able to make adjustments/improvements to them.
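When blending in DataStudio gets unwieldy, the alternative is doing the JOIN in BigQuery; one lightweight way is a view that DataStudio can then read as a single source. A sketch with made-up project, dataset, and column names:

```python
from google.cloud import bigquery

client = bigquery.Client()

# A view joining paid-media cost data to analytics sessions by date and campaign.
view = bigquery.Table("example-project.dashboard.paid_media_performance")
view.view_query = """
    SELECT
      ads.date,
      ads.campaign,
      ads.cost,
      ga.sessions,
      ga.revenue
    FROM `example-project.formatted.google_ads_daily` AS ads
    LEFT JOIN `example-project.formatted.ga_campaign_daily` AS ga
      USING (date, campaign)
"""
client.create_table(view, exists_ok=True)
```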

Single Account Owner

Because we work with Rick and Hussain as 'third parties', we opted for using one shared owner account. Transferring owner access is incredibly hard when it's a Google Apps account, so we made sure that the dashboards are owned by an @rvshare.com account. It's not a big topic, but it could cause tons of headaches in the long term.

Keep It Simple St*pid

Your stakeholders probably have the desire, and the time, to look at less than you think. Instead of having them jump through too many charts, start simple and then add more based on feedback if they want to see more; less is more in this case.

This has the added benefit of making them feel engaged and more interested in using it. In our own case, we leverage our reporting on a weekly basis for a team meeting, which already makes it a more frequently used set of dashboards.

Calculated Fields – Yay or Nay?

As we made most of the tables that we leverage in DataStudio from scratch during our ETL process, we had the opportunity to decide if we wanted to leverage calculated fields in DataStudio or do the work in the queries themselves. Honestly, the answer wasn't easy, and as we made modifications to the dashboards, it became clear that having calculated fields set up in DataStudio wasn't always scalable and easy, as they get removed when the data or the underlying tables change.

Google BigQuery

Tables or Queries? In our case, we often used the table information from BigQuery, and the specific columns in there, to drive the visualization in DataStudio. The alternative, for some of them, is to query the data in BigQuery directly; with the BI Engine reservation that we have in place, we can speed up intense queries rather easily.


Again… This blog post is written with the help of @RickDronkers and @Hussain / MarketLytics, whom we've worked alongside during this (ongoing) project.


Part 3: Transforming Into a Data Warehouse – Building a Marketing Data Lake and Data Warehouse on Google Cloud Platform

In the previous blog posts (part 1 and part 2) in this series, we talked about why we decided to build a marketing data warehouse. This endeavor started by figuring out how to deal with the first part: building the data lake. In this post, in which we transform our marketing data lake into an actual data warehouse, we'll go into a bit more detail on how you can do this yourself.

This blog post is part of a series of four? (we found enough content to add more articles ;-)), in which we'll dive into the details of why we wanted to create a data warehouse, how we created the data lake, and how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @hu_me / MarketLytics, whom we've worked alongside during this (ongoing) project.

The Process of Building a Data Warehouse

In our endeavor of building a data warehouse, there were a couple of big initiatives that we wanted to get done first. We needed some reporting and visualization tables, and aligned with that, we needed to make sure that we had data that was cleaned for other purposes (deduplication, standardization: some typical ETL problems).

In order to streamline the process, we used three different ways of getting the data into shape:

Google Cloud Functions

Google Cloud Functions is used both for transforming our data and for loading our initial data in a few use cases. Early on, we noticed that not every vendor was available through regular data loading platforms like StitchData. An example of that was Google Search Console; as we didn't want to run additional infrastructure just for dealing with Load scripts, we leveraged Cloud Functions to run a daily script (with support from Cloud Scheduler to trigger it daily).

After loading the data, we also Transform some other tables from our marketing data lake into new production tables using Cloud Functions. All of our scripts are currently written in Python or Node.js, but as Cloud Functions supports multiple languages, it gives us the flexibility to leverage others over time.

Backfill: As Functions can easily be rewritten and tested within the interface, they also provide a good way to backfill data, since we can simply adjust the dates that a script needs to run for.
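To make that concrete, here's a rough sketch of the kind of daily Load function we mean: it pulls one day of Search Console data and appends it to a raw BigQuery table. The property URL, table name, and date logic are illustrative (not our exact script), and the function's service account needs to be granted access to the Search Console property.

```python
from datetime import date, timedelta

from googleapiclient.discovery import build
from google.cloud import bigquery

SITE_URL = "https://www.example.com/"               # Search Console property (illustrative)
TABLE_ID = "example-project.raw.gsc_performance"    # destination table (illustrative)

def load_gsc(request):
    """HTTP-triggered Cloud Function, run daily by Cloud Scheduler."""
    # Search Console data lags a couple of days behind.
    day = (date.today() - timedelta(days=3)).isoformat()

    service = build("searchconsole", "v1", cache_discovery=False)
    response = service.searchanalytics().query(
        siteUrl=SITE_URL,
        body={
            "startDate": day,
            "endDate": day,
            "dimensions": ["page", "query"],
            "rowLimit": 25000,
        },
    ).execute()

    rows = [
        {
            "date": day,
            "page": r["keys"][0],
            "query": r["keys"][1],
            "clicks": r["clicks"],
            "impressions": r["impressions"],
            "position": r["position"],
        }
        for r in response.get("rows", [])
    ]

    if rows:
        errors = bigquery.Client().insert_rows_json(TABLE_ID, rows)
        if errors:
            raise RuntimeError(f"BigQuery insert errors: {errors}")
    return f"loaded {len(rows)} rows for {day}"
```

Backfilling then becomes a matter of looping the same logic over a date range instead of a single day.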

Scheduled Queries

In some other cases, we can also leverage Google BigQuery's scheduled queries. In a few instances, we just want to load the data from raw data lake tables into a production table. Because we don't always need all the columns, we can limit the data we pull in and clean it in the query itself. In that case, scheduled queries come in pretty handy, as they run on a set schedule, can be easily updated, and already point towards the target dataset and table.
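We set these up in the BigQuery console, but for illustration, the kind of slimming-down query a scheduled query runs looks roughly like this, expressed here through the Python client with made-up table and column names; the scheduled query simply runs the same SQL on a schedule with the production table as its destination.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Keep only the columns we actually report on and do light cleaning in the query.
sql = """
    SELECT
      DATE(created_at) AS day,
      LOWER(TRIM(campaign_name)) AS campaign,
      SUM(clicks) AS clicks,
      SUM(impressions) AS impressions
    FROM `example-project.raw.vendor_export`
    GROUP BY day, campaign
"""

job_config = bigquery.QueryJobConfig(
    destination="example-project.formatted.vendor_daily",
    write_disposition="WRITE_TRUNCATE",
)
client.query(sql, job_config=job_config).result()
```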

Airflow

For more complicated data flows, we're currently using Airflow via Google Cloud Composer. Cloud Composer, as we mentioned in a previous blog post, means we don't have to worry about maintaining the Airflow infrastructure while still getting all its other upsides. This gives us the ability to focus on creating and maintaining the DAGs that drive the actual data structuring flows.

How we mainly use Airflow is to combine, clean, and enhance data from multiple sources and then upload it back into Google BigQuery for visualization in other tools. Singular use cases are more easily captured by one or two tasks, but in Airflow we run flows that usually have multiple tasks that need to be executed in a certain order, and not at all if one of them fails. This is what Airflow is meant to do, and that's how we're leveraging it too. As an example: for our affiliate marketing campaigns, we have a structure set up that only pays out once travel has concluded (a very standard approach in the travel industry). This means that we need to retrieve orders from our partner > verify them against our database > create a new format to upload back to our vendor and run the actual upload. In addition, we want to set up some alerting for the team as well. That results in six tasks that need to be executed in the right order: the perfect use case for Airflow.
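A skeleton of that kind of flow, just to show how the ordering maps onto a DAG; the task names and function bodies are placeholders, not our actual implementation (the real pipeline splits the work into six tasks).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def _placeholder(**context):
    pass  # each step's real logic lives here

with DAG("affiliate_payouts", start_date=datetime(2020, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    task_ids = [
        "fetch_partner_orders",     # retrieve orders from the partner
        "verify_against_database",  # match them with our own bookings
        "build_vendor_file",        # reformat into the vendor's upload format
        "upload_to_vendor",         # run the actual upload
        "notify_team",              # alert the team about the result
    ]
    tasks = [PythonOperator(task_id=t, python_callable=_placeholder, provide_context=True) for t in task_ids]

    # Strictly sequential: a step only runs when the previous one succeeded.
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
```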

Creating Structure

In the previous blog post, I touched on how we wanted to set up raw tables that are transformed once or multiple times. We decided to do this both to make the data more streamlined and to make it ready for visualization on our channel dashboards. The MarketLytics team did a great job documenting this with a very visual result that you can see here:

As discussed previously, the data goes through multiple stages: we get it into the data lake and transform it into the data warehouse.

Example of data enhancement: One of the most common scenarios that we've tried to solve for was connecting existing data from a vendor back to the data that we receive in our web analytics tool, Google Analytics. As an example, if a newsletter campaign is properly tagged, we should be able to identify it through its UTM parameters and then connect that data to what we have in (in our case) Marketo on deliverability and open rate (%).
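A simplified version of that enhancement as a BigQuery join; the table and column names are made up, but the key idea is that the UTM campaign in Google Analytics matches the campaign/program name used in Marketo.

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT
      ga.date,
      ga.utm_campaign,
      ga.sessions,
      ga.transactions,
      mkto.emails_delivered,
      SAFE_DIVIDE(mkto.emails_opened, mkto.emails_delivered) AS open_rate
    FROM `example-project.formatted.ga_campaign_daily` AS ga
    JOIN `example-project.formatted.marketo_campaign_daily` AS mkto
      ON ga.utm_campaign = mkto.program_name
     AND ga.date = mkto.send_date
"""
for row in client.query(sql).result():
    print(dict(row.items()))
```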


Again… This blog post is written with the help of @RickDronkers and @hu_me / MarketLytics, whom we've worked alongside during this (ongoing) project.


Part 2: Creating the Marketing Data Lake – Building a Marketing Data Lake and Data Warehouse on Google Cloud Platform

In the previous blog post in this series, we talked about why we decided to build a marketing data warehouse. This endeavor started by figuring out how to deal with the first part: building the data lake. In this post, we'll go into a bit more detail on how you can do this yourself as we create a marketing data lake. If you're looking for more inspiration on this topic, I would recommend reading this post by the Google Cloud team on how to build a marketing data warehouse. A lot of the thoughts in their article align with our views and the process we followed.

This blog post is part of a series of four? (we found enough content to add more articles ;-)), in which we'll dive into the details of why we wanted to create a data warehouse, how we created the data lake, and how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @hu_me / MarketLytics, whom we've worked alongside during this (ongoing) project.

The Process of Building a Data Lake

Our Options: BigQuery Transfer Service, Firebase, SuperMetrics, and StitchData

Loading data from a dozen marketing vendors (not all of them are visualized in this graph) as efficiently as possible was important to us, as we tried to save most of our money and resources for other parts of the data warehouse integration. In the end, we believe that we should let people do the jobs that they're good at and leverage vendors instead of always following a 'build it ourselves' strategy. This makes it possible for us to focus on the actual use cases instead of having to worry about the upkeep of integrations with over a dozen vendors, something that could/would take up a big chunk of time in the long term.

That’s why we made a few decisions on how to load data:

  • Firebase: Google Analytics App & Web can automatically export to a BigQuery dataset. So we decided to run our regular Universal Analytics implementation of Google Analytics alongside App & Web for now, which gives us the ability to cheaply export Google Analytics data.
  • BigQuery Transfer Service: Google Cloud offers some great integrations for YouTube and Google Ads that can be set up within minutes, which makes it possible to instantly backfill and load data.
  • StitchData: We decided to go with Stitch as their pricing was more flexible and cheaper for the needs that we had. There are many alternatives to them, but they had integrations with most of the vendors that we needed. Since we didn't require Enterprise-level SLAs (yet), we knew that we could rely on their standard integrations for most of the vendors we work with.
  • SuperMetrics: Some vendors weren't available in StitchData at the time (Facebook Insights, for example), so we decided to get a SuperMetrics license for their BigQuery export.
  • 'Manual Import': As a last resort, we decided to build connections to BigQuery ourselves; so far we've only had to do that for two vendors, which made the workload lower than expected.

Understanding platform APIs and Data Granularity

There were over a dozen vendors that we wanted to import data from, in the different ways described above, to load the data into our data lake: Google BigQuery. We needed to ensure upfront that the data and platforms we would be using were up for the task, as a few important areas could have thrown problems our way. This was needed to build a good understanding of the platforms and how we could leverage them. What we looked at:

  • Data Freshness & Backfill capabilities: How often would the data refresh, and how often would we require a refresh of the data? For some vendors, a 24-hour sync was more than enough, but for others a sync would last for many hours, basically making yesterday's data useless as it would arrive too late. On the other side, we needed to make sure that we could backfill data from vendors, and with most of the ways we chose to load data that was possible (except for manual imports, where we needed to write scripts that could go back in time to backfill the data).
  • Cut time to collecting: Do we need all the data from an API or only partial responses? And in combination with the previous point, do we need to load data hourly (realtime wasn't much of a use case) or could we just gather the data every day to save time collecting it?
  • Hard to see everything end to end upfront: Google Ads exports dozens of tables through their Transfer Service into BigQuery, which meant that it took a while to figure out which tables to use and how certain columns were translated into API fields & tables.
  • Using data to find gaps and roadblocks: What worked for us is that we tried visualizing the data as soon as possible so we could quickly find anomalies (days of data missing due to a broken backfill process, or the data not matching the interface because metric X turned out to be calculated differently from what we thought).

Associated Costs

  • BigQuery Storage: The lowest cost of all of this; while we store dozens of GBs daily, it's ridiculously cheap to store all of it long-term.
  • Stitch: Pricing can be found here. This is really dependent on how much data you’re exporting and from how many sources as you’re billed based on usage.
  • SuperMetrics: Pricing can be found here. Basically $190 per vendor monthly.
  • BigQuery Transfer Service: This is free for a couple of months for Google Ads export and YouTube. Previously it was $2.50 – one less coffee at Starbucks monthly.
  • BigQuery BI Engine Reservation: While it's not directly related to the data lake, I want to mention it. We're paying $60 monthly for a 2GB reservation. Currently, we've optimized queries as much as possible to stay under this limit; if needed, we might move up a tier. This has been a game-changer in making querying for dashboarding lightning fast.

Cutting time to prove the value

We knew many use cases that we wanted to solve with our data lake, for example: show Google Analytics metrics next to the open and unsubscribe rates from our marketing newsletters (Marketo). This meant that we knew what to build and could quickly take a sample of the data and build out the use case to see if the data could be matched (would we run into tracking issues somewhere? We sometimes did) and what possibilities we had for visualization. In short, we tried to cut as much time out of the initial process as possible, by not focusing on scalability yet, to figure out if the use cases that we had in mind could actually provide value for the teams. Basically: keeping the end goals in mind.

We did that by:

  • Talking to the end stakeholders – interview & document: The goal of the project was to, at a minimum, provide every channel owner (PPC, SEO, Social, etc.) with dashboards covering their core metrics. That was the baseline we asked of them; on top of that, we briefly discussed all the other use cases they knew about and wanted to repeat. By doing this, we immediately had documentation on how the process currently worked for them, what data sources needed to be included, and what needed to be built. But more importantly, it also created buy-in, as they were able to pick and choose what we built for them.
  • Taking a sample and building a use case: Load the data from the vendor into the data lake and investigate what the schemas were and whether the granularity of the data was enough. As long as there was a key in multiple tables that we could leverage between vendors, we knew we were on the right track (example: UTM tags in emails needed to match Google Analytics conventions).
  • Automate & Backfill later: We never focused on making sure that this could scale over the next few years or would backfill our data for multiple months at a time right away. We knew the process was iterative and that we would come up with new use cases, which would potentially mean backfilling many times for the same vendors.
  • Backfill Manually: Depending on the vendor and type of automation we could put in place, we backfilled some of the data manually where it was needed. Sometimes manually connecting some data points in a database makes more sense when you try to match vendors than it does trying to overcomplicate things in a codebase. We figured this out quickly when we had 50 email templates that we manually needed to map to other data sources.

The (unexpected) problems that we ran into?

Certain things always go differently from the expectations that we set upfront; that's why I wanted to touch on some of the 'problems' that we ran into and how we dealt with them.

Inaccurate data: Dashboard versus Dump

Exports are always different from the data you see in the dashboards, mainly because the information is stored differently from how it's visualized. For example, cost data from Google Ads is saved in micros (millionths of the account currency), which means you always have to divide it to get the correct number. CTRs from multiple vendors are just a calculation of Clicks / Impressions. And in some cases, specific data points aren't available for export through an API at all (looking at you, Facebook).

Overall, this means that you have to carefully assess upfront what data schema is available in the export and whether it matches the use cases that you want to fulfill. If that's not the case, it might not be worth trying to automate it.
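As a tiny illustration of that normalization step (the column names follow the micros convention but are otherwise made up):

```python
def normalize_ads_row(row: dict) -> dict:
    """Convert a raw Google Ads export row into the units we report on."""
    clicks = row["clicks"]
    impressions = row["impressions"]
    return {
        "campaign": row["campaign_name"],
        # Cost is exported in micros: one-millionth of the account currency.
        "cost": row["cost_micros"] / 1_000_000,
        "clicks": clicks,
        "impressions": impressions,
        # CTR isn't exported; it's just clicks divided by impressions.
        "ctr": clicks / impressions if impressions else 0.0,
    }

print(normalize_ads_row({
    "campaign_name": "brand_us",
    "cost_micros": 12_340_000,
    "clicks": 80,
    "impressions": 3_200,
}))
# -> cost 12.34, ctr 0.025
```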

Duplication as a problem in Stitch

StitchData has a different way of exporting data into BigQuery than we were used to. This meant that we had to get used to deduplicating data before we used it for analysis; in short, they create a new row in a table whenever they notice that a unique record has changed. This means that we always need to find the latest row value for a record. Not a massive problem in itself, but it took us a few days to figure this out, as the documentation around it was lacking at the time.
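The dedup itself boils down to picking the latest version of each record; a sketch of the query pattern we mean (Stitch adds metadata columns like _sdc_sequence to each table; check your own tables for the exact names, and the id column plus table names here are made up):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Keep only the most recent row per record, using Stitch's sequence metadata.
sql = """
    SELECT * EXCEPT (row_num)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (
          PARTITION BY id
          ORDER BY _sdc_sequence DESC
        ) AS row_num
      FROM `example-project.raw.stitch_marketo_leads`
    )
    WHERE row_num = 1
"""
deduped = client.query(sql).result()
print(f"{deduped.total_rows} unique records")
```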

Backfills

How much useful data do you have available per vendor? Some vendors are very structured: Google Ads has its own transfer service into Google BigQuery, so it was easy to backfill the data there. But for other vendors, like Marketo and Sendgrid, it was harder, as we had made so many changes to professionalize our setup over time. This meant that for some we were able to backfill data back into 2018, whereas for others we could barely get data from the last six months. Knowing this upfront is essential, as it might make certain use cases irrelevant if you're trying to learn from YoY behavior and it turns out the data isn't comparable.

Exceptions: for some sources (read: Marketo) we had to rely on other solutions. There are limitations to some of the data that we can get out of their (old skool) API. So for some of it, we're getting daily email reports that are parsed and sometimes 'manually' ingested into Google BigQuery. These are some of the workarounds we had to build, as we want the data to be accessible through our data lake and some vendors just aren't as accessible.

Naming Conventions

While we had an early discussion about the naming conventions that we wanted to follow, this required more thought than expected. We quickly had tons of datasets in BigQuery that internally followed their own conventions (defined by the data load providers). As we were applying ETL to some data tables, we needed to make sure it was clear what every table and field in the schema meant. For example, there were tables that were only used temporarily, for backfill, for production, for development, for transformation, etc. What made it more complicated is that most of them were also individualized by vendor. In our case, we settled on making the distinction between 'prod/dev' + 'raw/formatted/dashboard' for datasets, and then specifying the specific vendor name and goal in the table names within each dataset.


Again… This blog post is written with the help of @RickDronkers and @hu_me / MarketLytics, whom we've worked alongside during this (ongoing) project.