
Quick start guide

info
This documentation only applies to Snowplow Open Source. See the feature comparison page for more information about the different Snowplow offerings.

This guide will take you through how to spin up an open source pipeline using the Snowplow Terraform modules. (Not familiar with Terraform? Take a look at Infrastructure as code with Terraform.)

Skip installation
Would you like to explore Snowplow for free with zero setup? Check out our free trial.


Prerequisites

Install Terraform 1.0.0 or higher. Follow the instructions to make sure the terraform binary is available on your PATH. You can also use tfenv to manage your Terraform installation.

Clone the repository at https://github.com/snowplow/quickstart-examples to your machine:

git clone https://github.com/snowplow/quickstart-examples.git

Install AWS CLI version 2.

Configure the CLI against a role that has the AdministratorAccess policy attached.

caution

AdministratorAccess allows all actions on all AWS services and shouldn't be used in production.

Details on how to configure the AWS Terraform Provider can be found on the registry.
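For reference, a minimal provider configuration looks like the sketch below. The quickstart modules already declare the AWS provider for you, so treat this purely as an illustration of how a region and a (hypothetical) named CLI profile are usually wired up.

provider "aws" {
  region  = "eu-west-2"        # the AWS region you want to deploy into
  profile = "snowplow-admin"   # hypothetical named profile created with `aws configure --profile snowplow-admin`
}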

Storage options

The sections below will guide you through setting up your destination to receive Snowplow data, but for now here is an overview.

Warehouse     AWS   GCP   Azure
Postgres      ✅    ✅    -
Snowflake     ✅    ✅    ✅
Databricks    ✅    -     coming soon
Redshift      ✅    -     -
BigQuery      -     ✅    -
Synapse       -     -     coming soon

There are four main storage options for you to select: Postgres, Redshift, Snowflake and Databricks. Additionally, there is an S3 option, which is primarily used to archive enriched (and/or raw) events and to store failed events.

We recommend loading data into a single destination only, but nothing prevents you from loading into multiple destinations with the same pipeline (e.g. for testing purposes).

Set up a VPC to deploy into

AWS provides a default VPC in every region for your sub-account. Take a note of the identifiers of this VPC and the associated subnets for later parts of the deployment.

Set up Iglu Server

The first step is to set up the Iglu Server stack required by the rest of your pipeline.

This will allow you to create and evolve your own custom events and entities. Iglu Server stores the schemas for your events and entities and fetches them as your events are processed by the pipeline.

Step 1: Update the iglu_server input variables

Once you have cloned the quickstart-examples repository, you will need to navigate to the iglu_server directory to update the input variables in terraform.tfvars.

cd quickstart-examples/terraform/aws/iglu_server/default
nano terraform.tfvars # or other text editor of your choosing

To update your input variables, you’ll need to know a few things:

  • Your IP Address. Help.
  • A UUIDv4 to be used as the Iglu Server’s API Key. Help.
  • How to generate an SSH Key.
tip

On most systems, you can generate an SSH Key with: ssh-keygen -t rsa -b 4096. This will output where your public key is stored, for example: ~/.ssh/id_rsa.pub. You can get the value with cat ~/.ssh/id_rsa.pub.
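As a rough sketch, and assuming variable names along the lines of those used in the quickstart-examples repository (check the comments in terraform.tfvars for the exact names in your copy), the values above end up looking something like this:

ssh_public_key     = "ssh-rsa AAAA... user@host"              # contents of ~/.ssh/id_rsa.pub
ssh_ip_allowlist   = ["203.0.113.10/32"]                      # your IP address, in CIDR notation
iglu_super_api_key = "xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx"   # the UUIDv4 you generated for the API Key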

Telemetry notice

By default, Snowplow collects telemetry data for each of the Quick Start Terraform modules. Telemetry allows us to understand how our applications are used and helps us build a better product for our users (including you!).

This data is anonymous and minimal, and since our code is open source, you can inspect what’s collected.

If you wish to help us further, you can optionally provide your email (or just a UUID) in the user_provided_id variable.

If you wish to disable telemetry, you can do so by setting telemetry_enabled to false.

See our telemetry principles for more information.
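In terraform.tfvars, these two settings look like this (the email shown is just a placeholder):

user_provided_id  = "you@example.com"   # optional: an email address or a UUID
telemetry_enabled = false               # set to false to opt out of telemetry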

Step 2: Run the iglu_server Terraform script

You can now use Terraform to create your Iglu Server stack.

You will be asked to select a region; you can find more information about available AWS regions here.

terraform init
terraform plan
terraform apply

The deployment will take roughly 15 minutes.

Once the deployment is done, it will output iglu_server_dns_name. Make a note of this; you’ll need it when setting up your pipeline. If you have attached a custom SSL certificate and set up your own DNS records, then you don’t need this value.

Prepare the destination

Depending on the destination(s) you’ve chosen, you might need to perform a few extra steps to prepare for loading data there.

tip

Feel free to go ahead with these while your Iglu Server stack is deploying.

If you are loading into Postgres, no extra steps are needed: the necessary resources (a PostgreSQL instance, database, table and user) will be created by the Terraform modules.

Set up the pipeline

In this section, you will update the input variables for the Terraform module, and then run the Terraform script to set up your pipeline. At the end you will have a working Snowplow pipeline ready to receive web, mobile or server-side data.

Step 1: Update the pipeline input variables

Navigate to the pipeline directory in the quickstart-examples repository and update the input variables in terraform.tfvars.

cd quickstart-examples/terraform/aws/pipeline/default
nano terraform.tfvars # or other text editor of your choosing

To update your input variables, you’ll need to know a few things:

  • Your IP Address. Help.
  • Your Iglu Server’s domain name from the previous step
  • Your Iglu Server’s API Key from the previous step
  • How to generate an SSH Key.
tip

On most systems, you can generate an SSH Key with: ssh-keygen -t rsa -b 4096. This will output where your public key is stored, for example: ~/.ssh/id_rsa.pub. You can get the value with cat ~/.ssh/id_rsa.pub.
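As with the Iglu Server module, these values end up in terraform.tfvars roughly as follows (variable names are illustrative; the file itself documents the exact ones):

iglu_server_dns_name = "<your-iglu-server-dns-name>"            # the iglu_server_dns_name output from the previous step
iglu_super_api_key   = "xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx"   # the API Key you configured for the Iglu Server
ssh_public_key       = "ssh-rsa AAAA... user@host"              # contents of ~/.ssh/id_rsa.pub
ssh_ip_allowlist     = ["203.0.113.10/32"]                      # your IP address, in CIDR notation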

Destination-specific variables

As mentioned above, there are several options for the pipeline’s destination database. For each destination you’d like to configure, set the <destination>_enabled variable (e.g. redshift_enabled) to true and fill all the relevant configuration options (starting with <destination>_).

When in doubt, refer back to the destination setup section where you have picked values for many of the variables.

caution

For all active destinations, change any _password setting to a value that only you know.

If you are using Postgres, set the postgres_db_ip_allowlist to a list of CIDR addresses that will need to access the database — this can be systems like BI Tools, or your local IP address, so that you can query the database from your laptop.
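Putting this together, enabling a destination in terraform.tfvars follows a pattern like the sketch below (redshift_enabled and postgres_db_ip_allowlist are named above; the rest stands in for whatever <destination>_ settings your copy of the module exposes):

redshift_enabled = true                           # likewise snowflake_enabled, databricks_enabled, etc.
# ...plus the remaining redshift_* settings, with any _password values changed to secrets only you know

postgres_db_ip_allowlist = ["203.0.113.10/32"]    # only if using Postgres: CIDRs allowed to reach the database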

Step 2: Run the pipeline Terraform script

You will be asked to select a region; you can find more information about available AWS regions here.

terraform init
terraform plan
terraform apply

This will output your collector_dns_name, postgres_db_address, postgres_db_port and postgres_db_id.

Make a note of the outputs: you'll need them when sending events and connecting to your database.

Empty outputs

Depending on your cloud and chosen destination, some of these outputs might be empty — you can ignore those.

If you have attached a custom SSL certificate and set up your own DNS records, then you don't need collector_dns_name, as you will use your own DNS record to send events from the Snowplow trackers.

Terraform errors

For solutions to some common Terraform errors that you might encounter when running terraform plan or terraform apply, see the FAQs section.

If you are curious, here’s what has been deployed. Now it’s time to send your first events to your pipeline!
