Configuration

info

This page details general configurations that can apply across many of our packages. Each package also has specific configuration variables that define how its models run; please see each child page for the specifics of each package.

Mixing Variables

caution

When using multiple dbt packages, you must be careful to specify which scope a variable or configuration is defined within. In general, always specify each value in your dbt_project.yml nested under the specific package, e.g.:

dbt_project.yml
vars:
  snowplow_web:
    snowplow__atomic_schema: schema_with_snowplow_web_events
  snowplow_mobile:
    snowplow__atomic_schema: schema_with_snowplow_mobile_events

You can read more about variable scoping in dbt's docs around variable precedence.

Disabling a standard module

If you do not require certain modules provided by the package, you can disable them. For instance, to disable the users module in the snowplow_web package:

dbt_project.yml
models:
  snowplow_web:
    users:
      enabled: false

Note that any dependent modules will also need to be disabled. For instance, if you disable the sessions module in the web package, you will also have to disable the users module.
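As a sketch of that case, disabling the sessions module together with its dependent users module might look like the following. It assumes the sessions module is configured under a `sessions` key in the same way the users module is configured under `users` above; check the package's model folders for the exact names.

```yaml
# dbt_project.yml
models:
  snowplow_web:
    # Assumed key name, mirroring the `users` example above
    sessions:
      enabled: false
    # users depends on sessions, so it is disabled as well
    users:
      enabled: false
```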

Warehouse specific configurations

Postgres

In most modern analytical data warehouses, constraints are usually either unsupported or unenforced. For this reason, it is better to use dbt tests to assert the data constraints without actually materializing them in the database. For example, you can test that a primary key column is unique and not null. The snowplow_web package already includes these dbt tests for primary keys; see the testing section for more details.
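For a custom model of your own, such tests could be declared in a properties file along these lines. This is a minimal sketch: the file name schema.yml, the model name snowplow_web_sessions_custom, and the choice of domain_sessionid as the key column are illustrative assumptions rather than part of the package.

```yaml
# schema.yml (hypothetical properties file for a custom model)
version: 2

models:
  - name: snowplow_web_sessions_custom  # illustrative model name
    columns:
      - name: domain_sessionid  # assumed primary key column
        tests:
          - unique    # fails if any value appears more than once
          - not_null  # fails if any value is null
```

Running dbt test then reports any rows that violate either condition, without creating a constraint in the database.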

To optimize performance on large Postgres datasets, you can create indexes in your dbt model config for columns that are commonly used in joins or where clauses. For example:

# snowplow_web_sessions_custom.sql
{{
  config(
    ...
    indexes=[{'columns': ['domain_sessionid'], 'unique': True}]
  )
}}

Databricks

You can connect to Databricks using either the dbt-spark or the dbt-databricks adapter. The dbt-spark adapter does not allow dbt to take advantage of certain features that are unique to Databricks, whereas the dbt-databricks adapter does. Where possible, we recommend using the dbt-databricks adapter.

Unity Catalog support

With the rollout of Unity Catalog (UC), the dbt-databricks adapter added support for the three-level namespace as of dbt-databricks>=1.1.1. As a result, we have introduced the snowplow__databricks_catalog variable, which should be set if your Databricks environment has UC enabled and you are using a version of the dbt-databricks adapter that supports UC. The default value of this variable is hive_metastore, which is also the default name of your UC catalog; set snowplow__databricks_catalog to use a different catalog.

Since there are many different situations, we've created the following table to help guide your setup process (this should help resolve the Cannot set database in Databricks! error):

| | Adapter supports UC and UC enabled | Adapter supports UC and UC not enabled | Adapter does not support UC |
| --- | --- | --- | --- |
| Events land in default atomic schema | snowplow__databricks_catalog = '{name_of_catalog}' | Nothing needed | snowplow__databricks_catalog = 'atomic' |
| Events land in custom schema (not atomic) | snowplow__atomic_schema = '{name_of_schema}', snowplow__databricks_catalog = '{name_of_catalog}' | snowplow__atomic_schema = '{name_of_schema}' | snowplow__atomic_schema = '{name_of_schema}', snowplow__databricks_catalog = '{name_of_schema}' |
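
As an illustration of one cell of the table (the adapter supports UC, UC is enabled, and events land in a custom schema), the variables could be set as in the sketch below. The names my_catalog and my_events_schema are hypothetical placeholders, and scoping the variables under snowplow_web is an assumption; nest them under whichever package you are running.

```yaml
# dbt_project.yml
vars:
  snowplow_web:
    # Hypothetical schema containing your events table
    snowplow__atomic_schema: my_events_schema
    # Hypothetical Unity Catalog name
    snowplow__databricks_catalog: my_catalog
```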

Optimization of models

The dbt-databricks adapter allows our data models to take advantage of the auto-optimization features in Databricks. If you are using the dbt-spark adapter, you will need to manually alter the table properties of your derived and manifest tables using the following command after running the data model at least once. You will need to run the command in your Databricks environment once for each table, and we would recommend applying this to the tables in the _derived and _snowplow_manifest schemas:

ALTER TABLE {TABLE_NAME} SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true);

BigQuery

As mentioned in the Quickstart, many of our packages allow you to specify which column your events table is partitioned on. It will likely be partitioned on collector_tstamp or derived_tstamp. If it is partitioned on collector_tstamp you should set snowplow__derived_tstamp_partitioned to false. This will ensure only the collector_tstamp column is used for partition pruning when querying the events table:

dbt_project.yml
vars:
  snowplow_mobile:
    snowplow__derived_tstamp_partitioned: false