Snowflake loader

Snowflake Loader can run on AWS, GCP and Azure.

Early release

Currently, Azure support is in its pilot phase and we encourage you to share your thoughts and feedback with us on Discourse to help shape future development.

If you are interested in deploying BDP Enterprise (a private SaaS with extra features and SLAs) in your Azure account, please join our waiting list.

Alternatively, if you prefer loading data into Snowflake or Databricks hosted on Azure without managing any non-Azure infrastructure, consider BDP Cloud - our SaaS offering that fully supports these destinations.

Setting up Snowflake

You can use the steps outlined in our quick start guide to create most of the necessary Snowflake resources.

Snowflake Loader supports two authentication methods:

  • With the TempCreds method, there are no additional Snowflake resources needed.
  • With the NoCreds method, the Loader needs a Snowflake stage.

This choice is controlled by the loadAuthMethod configuration setting.

note

For GCP pipelines, only the NoCreds method is available.

Using the NoCreds method

First, create a Snowflake stage. For that, you will need a Snowflake database, Snowflake schema, Snowflake storage integration, Snowflake file format, and the path to the transformed events bucket (in S3, GCS or Azure Blob Storage).

You can follow this tutorial to create the storage integration.

Once the other required resources exist, you can create the Snowflake stage by following this document.

Finally, use the transformedStage configuration setting to point the loader to your stage.

Running the loader

There are dedicated Terraform modules for deploying Snowflake Loader on AWS and Azure. You can see how they are used in our full pipeline deployment examples here.

We don't have a Terraform module for deploying Snowflake Loader on GCP yet, so on GCP it currently needs to be deployed manually.

Downloading the artifact

The asset is published as a JAR file attached to the GitHub release notes for each version.

It's also available as a Docker image on Docker Hub under snowplow/rdb-loader-snowflake:5.7.3.

Configuring rdb-loader-snowflake

The loader takes two configuration files:

  • a config.hocon file with application settings
  • an iglu_resolver.json file with the resolver configuration for your Iglu schema registry.

Minimal Configuration | Extended Configuration
aws/snowflake.config.minimal.hocon | aws/snowflake.config.reference.hocon
gcp/snowflake.config.minimal.hocon | gcp/snowflake.config.reference.hocon
azure/snowflake.config.minimal.hocon | azure/snowflake.config.reference.hocon

For details about each setting, see the configuration reference.

See here for details on how to prepare the Iglu resolver file.

tip

All self-describing schemas for events processed by RDB Loader must be hosted on Iglu Server 0.6.0 or above. Iglu Central is a registry containing Snowplow-authored schemas. If you want to use them alongside your own, you will need to add Iglu Central to your resolver file. Keep in mind that it could override your own private schemas if you give it higher priority. For details, see here.

Running the Snowflake loader

The two config files need to be passed in as base64-encoded strings:

$ docker run snowplow/rdb-loader-snowflake:5.7.3 \
  --iglu-config "$RESOLVER_BASE64" \
  --config "$CONFIG_BASE64"
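
The command above expects RESOLVER_BASE64 and CONFIG_BASE64 to be set already. As a minimal sketch, the variables could be populated as shown below. It assumes the two files are named config.hocon and iglu_resolver.json and sit in the current directory. The encode_b64 helper is a hypothetical name introduced here for illustration and is not part of the loader. Line breaks are stripped because some base64 implementations wrap their output, which would split the value across lines.

```shell
# Hypothetical helper: print the contents of a file as a single-line
# base64 string. Some base64 implementations wrap long output across
# several lines, so any newlines are removed.
encode_b64() {
  base64 < "$1" | tr -d '\n'
}

# Assumed file names and location; adjust to match your setup.
if [ -f config.hocon ] && [ -f iglu_resolver.json ]; then
  CONFIG_BASE64="$(encode_b64 config.hocon)"
  RESOLVER_BASE64="$(encode_b64 iglu_resolver.json)"
  export CONFIG_BASE64 RESOLVER_BASE64
else
  echo "config.hocon or iglu_resolver.json not found in $(pwd)" >&2
fi
```

With both variables exported in the same shell session, the docker run command can be used as written.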

Telemetry notice

By default, Snowplow collects telemetry data for Snowflake Loader (since version 5.0.0). Telemetry allows us to understand how our applications are used and helps us build a better product for our users (including you!).

This data is anonymous and minimal, and since our code is open source, you can inspect what’s collected.

If you wish to help us further, you can optionally provide your email (or just a UUID) in the telemetry.userProvidedId configuration setting.

If you wish to disable telemetry, you can do so by setting telemetry.disable to true.

See our telemetry principles for more information.