Exporting Amplitude Data to Google BigQuery
I’ve written about using Amplitude before on this blog (in Dutch), but what if you want to combine the huge amount of data that you have in Amplitude with your other big data. The Enterprise version gives you the ability to export your data to a Redshift cluster. But a lot of companies these days are also on Google Cloud Platform and want to use Google BigQuery, which is similar in its setup to Redshift.
The Export API from Amplitude lets you download all your events data (regardless of your account plan) for free. There are some limits to the data that they can export but most startups/companies that use them are likely to stay under that on a daily/hourly basis. Which means that you can export the data. So basically everything that you need to do is setup a cronjob that either every hour or day can retrieve the data. It will parse the data and prepare new files that will be temporarily stored in Google Cloud Storage (this will ensure that the data is easier/faster available for upload into Google BigQuery). The next step is loading the data from GCS to GBQ.
Over the last years I wanted to do more with Google BigQuery and this project was perfect for it. As BigQuery has many connectors with multiple Google products (Analytics, Data Studio) and other vendors like Tableau it should give companies the ability to analyze their data and connect to other sources.
Within Google BigQuery we’re going to save the data in two tables:
- Events: As everything is an event in Amplitude that’s also one of the tables that you’ll need to use in Google BigQuery, that’s why every event will end up as being it’s own row in Google BigQuery.
- Properties: Every event can have properties in a few different ways: events, users, groups, group properties and an actual data property. We are connecting them to the data from the events table.
- Do I need to be a paying customer for Amplitude? No, you don’t. The free plans, which I started to use this on will have the support for the Export API as well.
- What is the cost of the integration? The costs that are associated with this setup are related to Google Cloud Platform. You’ll have to pay for storage in Google Cloud Storage and the storage of Google BigQuery. For the setup that I’m running we’re saving millions of rows monthly and the costs are in total less than 10 dollar.
- What do I need to do to get this up and running? Check out the README in the repository on Github, it will give you a getting started checklist to ensure that you can run this script.
Feedback? / Contribute?
I haven’t been the first person to work on an integration with BigQuery. I’ve seen other solutions for Java and Python, but they all work a bit different. If you have any feedback on the setup, leave an Issue on Github, submit a Pull Request with your proposed changes. In the end, I can code, but I don’t consider myself to be an engineer 😉