Custom models

tip

On this page, <package> can be one of: web, mobile, ecommerce, normalize

The Snowplow packages are designed to be easily customized or extended within your own dbt project by building your own custom models. The standard models we provide (base, page/screen views, sessions and users) are not designed to be modified by you. An example dbt project with custom models can be seen in the custom example directory of the snowplow-web repo and snowplow-mobile repo.

In general there are 3 types of custom models you may end up building:

  1. Full refresh/view models: these models build from the derived tables but require a full refresh each time they are built. This could include things like daily aggregations or rolling views.
  2. Incremental models: these models build from the existing (or custom) _this_run table(s) and have new data incrementally upserted to them on each run. This could include things like a custom sessions table with additional user information. These are the most common type of custom model.
  3. This run models: these models build from the base_events_this_run table that contains all columns for events being (re)processed in the run. These can be used to feed custom incremental models.

Guidelines & best practices

The standard models carry out the heavy lifting in establishing an incremental structure and providing the core logic for the most common aggregation use cases. This also allows custom models to be plugged in without impeding the maintenance of the standard models.

The following best practices should be followed to ensure that updates and bug fixes to the model can be rolled out with minimal complication:

  • Custom models should not modify any of the tables generated by the Snowplow packages i.e. the scratch, derived or manifest tables.
  • Customizations should not modify the SQL provided by the package - they should consist only of a new set of SQL statements that produce a separate table.
  • The logic for custom SQL should be idempotent and restart-safe - in other words, it should be written in such a way that a failure mid-way through a run, or a re-run of the model, will not change the deterministic output.
  • Redshift/Postgres users: be careful about joining context or self-describing event tables onto the main events table. Deduplication logic needs to be applied to avoid many-to-many joins. We have provided a macro - get_sde_or_context() - for you to use for this purpose from v0.14.0 of the snowplow-utils package. Check out the duplicates section and the package documentation on how to use it.

In short, the standard models can be treated as the source code for a distinct piece of software, and custom models can be treated as self-maintained, additive plugins - in much the same way that a Java package may let you use its public classes in your own code and provide an entry point for custom programs to run, but will not let you modify the original API.

The _this_run and derived tables are considered part of the 'public' class of tables in this model structure, and so we can give assurances that non-breaking releases won't alter them. The other tables may be used in custom SQL, but their logic and structure may change from release to release, or they may be removed. If one does use a scratch table in custom logic, any breaking changes can be mitigated by either amending the custom logic to suit, or copying the relevant steps from an old version of the model into the custom model. (However this will rarely be necessary).

Creating a custom model

Full refresh/view model

These models build directly off the derived tables and can be built the same way you would any other dbt model, using either the table or view materialization and referencing the derived table using the {{ ref(...) }} syntax. These models should NOT be tagged with the snowplow_<package>_incremental tag.

Note that because these models are not tagged, they will not be run when using the snowplow_<package> selector.
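
As an illustration, a simple daily aggregation built on the web package's derived page views table might look like the following; the model name, directory, columns and date logic are examples only, so adapt them to your package and warehouse.

/models/custom_aggregations/snowplow_web_daily_page_views.sql
{{
  config(
    materialized='table'
  )
}}

select
  cast(start_tstamp as date) as page_view_date, -- date of the page view, taken from the derived table
  count(*) as page_views
from {{ ref('snowplow_web_page_views') }}
group by 1

Since it is not tagged, include it explicitly in your job, e.g. dbt run --select snowplow_web_daily_page_views, rather than relying on the snowplow_<package> selector.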

Incremental models

As these models form part of the incremental processing of the package, they should be tagged with snowplow_<package>_incremental in order to leverage the incremental logic of the package. We recommend creating a subdirectory of your /models directory to contain all your custom models. In this example we create the subdirectory snowplow_<package>_custom_models. We can then apply the tag to all models in this directory:

dbt_project.yml
models:
  your_dbt_project:
    snowplow_<package>_custom_models:
      +tags: snowplow_<package>_incremental # Adds the tag to all models in the 'snowplow_<package>_custom_models' directory

These models should also make use of the optimized materialization by setting materialized='incremental' and snowplow_optimize=true in your model config. Finally, as well as referencing a _this_run table, these models should make use of the is_run_with_new_events macro so that the table is only processed when new events are available in the current run. The macro snowplow_utils.is_run_with_new_events(package_name) evaluates whether the particular model, i.e. {{ this }}, has already processed the events in the given run. This is returned as a boolean and effectively blocks the upsert to incremental models if the run only contains old data. This protects against your derived incremental tables being temporarily updated with incomplete data during batched back-fills of other models.

/models/snowplow_<package>_custom_models/my_custom_model.sql
{{
  config(
    materialized='incremental',
    upsert_date_key='my_date_field',
    unique_key='unique_id_field',
    snowplow_optimize=true,
    ...
  )
}}


select
...
from {{ ref('snowplow_<package>_<table>_this_run') }}
where {{ snowplow_utils.is_run_with_new_events('snowplow_<package>') }} --returns false if run doesn't contain new events.

By doing all of this you ensure your table will be managed correctly in the manifest tables and make use of the incremental logic of the package. Listed below are the recommended tables to reference as your input for a custom model, depending on the level of aggregation required:

  • Event level: snowplow_web_base_events_this_run
  • Page view level: snowplow_web_page_views_this_run
  • Session level: snowplow_web_sessions_this_run
  • User level: snowplow_web_users_this_run

This run models

If you need to produce a custom _this_run type model, you should build it off the relevant package's snowplow_<package>_base_events_this_run table. This table contains all the columns from your events table, including any self-describing events and custom contexts (except in Redshift). Note that this table has some custom logic in each package to identify which events to include in each run, including your app_ids filter. For more information about the way this works see the incremental logic page, but the key takeaway is that all events from any session that has had new events since the last processing are included in the next run - the session ID field varies by package, and the normalize package does not have this filter.

Custom SDEs or Contexts in Redshift

Since version 0.16.0 of snowplow_web it has been possible to include custom SDEs or contexts in the snowplow_web_base_events_this_run table by making use of the snowplow__entities_or_sdes variable. See the section on Utilizing custom contexts or SDEs for details of the structure, but as an example, to include the values from the contexts_com_mycompany_click_1 context you would use:

dbt_project.yml
vars:
  ...
  snowplow__entities_or_sdes: [{'name': 'contexts_com_mycompany_click_1', 'prefix': 'click', 'alias': 'mc', 'single_entity': true}]
  ...

which will add all fields from that context, with field names prefixed by click_, to the snowplow_web_base_events_this_run table, ready to use in any custom model. Note that you should only join single-entity context tables (i.e. those that have a single object per event, rather than multiple) this way, to avoid duplicates being generated in your base events table, which would lead to incorrect processing in the rest of the package. Any multi-valued contexts should be joined directly in your custom model, making use of our get_sde_or_context() macro (see the sketch below).

We plan to add this feature to our other packages in the future.
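
For that multi-valued case, the sketch below shows the general join pattern. It is illustrative only: the hard-coded dates and context name are placeholders, and you should refer to the snowplow-utils documentation for the exact argument list and the generated column names before relying on it.

/models/snowplow_web_custom_models/my_click_events.sql
-- Illustrative only: the macro generates a deduplicated CTE for the context table;
-- argument order and generated column aliases are per the snowplow-utils docs.
with {{ snowplow_utils.get_sde_or_context('atomic', 'contexts_com_mycompany_click_1', "'2023-01-01'", "'2023-02-01'") }}

select
  ev.event_id,
  clicks.* -- the deduplicated context fields
from {{ ref('snowplow_web_base_events_this_run') }} as ev
left join contexts_com_mycompany_click_1 as clicks
  on ev.event_id = clicks.contexts_com_mycompany_click_1__id
  and ev.collector_tstamp = clicks.contexts_com_mycompany_click_1__tstamp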

To build a custom _this_run table you need to ensure that the model is materialized as a table and that you use {{ ref('snowplow_<package>_base_events_this_run') }}. This type of model also needs to be tagged with snowplow_<package>_incremental. You can then use this model in other incremental-type custom models.
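
A minimal sketch of such a model is shown below; the model name and selected columns are illustrative, and the columns available will depend on your package and tracking setup.

/models/snowplow_<package>_custom_models/snowplow_<package>_filtered_events_this_run.sql
{{
  config(
    materialized='table',
    tags=['snowplow_<package>_incremental']
  )
}}

select
  event_id,
  collector_tstamp,
  ... -- any other event, self-describing event or context fields your downstream models need
from {{ ref('snowplow_<package>_base_events_this_run') }}

Downstream incremental custom models can then reference this table and apply the is_run_with_new_events filter as shown above.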

Retiring custom models

This process is the same for all types of custom model.

If you want to retire a custom model, you should delete the models from your project or disable them.
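
For example, to disable a custom model rather than delete it, you can switch it off in your dbt_project.yml; the project, directory and model names below are placeholders.

dbt_project.yml
models:
  your_dbt_project:
    snowplow_<package>_custom_models:
      my_custom_model:
        +enabled: false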

There is no need to remove the models from the snowplow_<package>_incremental_manifest table. The package identifies enabled models tagged with snowplow_<package>_incremental within your project and selects these models from the manifest in order to calculate the state of the models as described above.

danger

Do NOT just use --exclude to remove the retired models from your job in production. Currently the package is unable to identify which models are due to be executed in a given run. As a result, if you exclude a model the package will get stuck in State 3 and continue to attempt to sync your excluded models with the remaining models.

Back-filling custom models

Full refresh/view model & This run models

As these models are rebuilt fully each time the package is run, there is no special action to take to backfill these types of models.

Incremental models

Over time you may wish to add more custom incremental models to extend the functionality of this package. As you introduce new custom models into your project, assuming they are tagged correctly, the package will automatically replay all events up until the latest event to have been processed by the other models.

During back-filling, the derived tables are blocked from updating. This is to protect against a batched back-fill temporarily introducing incomplete data into these derived tables. The batch size of this back-fill is limited as outlined in the identification of events to process section, so it might take several runs to complete the back-fill, during which time no new events will be processed by the main models.

Back-filling a model can be performed either as part of the entire run of the Snowplow package, or in isolation to reduce cost (recommended):

dbt run --select snowplow_<package> tag:snowplow_<package>_incremental # Will execute all Snowplow <package> models, as well as custom models.
dbt run --select +my_custom_model # Will execute only your custom model + any upstream nodes.

Tearing down and restarting a subset of models

Full refresh/view model & This run models

As these models are rebuilt fully each time the package is run, there is no special action to take to restart these types of models.

Incremental models

As the code base for your custom incremental models evolves, you will likely need to replay events through a given model. In order to do this you must:

  1. Manually drop the table(s) from your custom model(s) in your database (you may wish to simply rename them until the back-fill is completed in case of any issues).
  2. Remove the models from the manifest table (see the Complete refresh section for an explanation as to why). This can be achieved either by:
    1. (Recommended) using the models_to_remove variable at run time
      dbt run --select +snowplow_<package>_custom_incremental_model --vars '{snowplow__start_date: "yyyy-mm-dd", models_to_remove: snowplow_<package>_custom_incremental_model}'
    2. (High Risk) manually deleting the record from the snowplow_<package>_incremental_manifest table.

By removing the snowplow_<package>_custom_incremental_model model from the manifest, the package will be in State 2 and will replay all events.

tip

If you want to replay events through a series of dependent models, you only need to pass the name of the endmost model within the run.

Tips for developing custom models

note

Most of these tips apply only to incremental-type custom models.

While developing custom models you may benefit from the following:

  1. Minimizing the amount of data being processed to reduce cost and run time.
  2. Using recent events from your events table to ensure you have all the latest contexts and event types available.
  3. BigQuery only: automatic handling of evolving schemas for custom contexts and unstructured events.

1. Reducing Costs

By setting snowplow__backfill_limit_days to 1 in your dbt_project.yml file you will only process a day's worth of data per run.

We have provided the get_value_by_target macro to dynamically switch the backfill limit depending on your environment i.e. dev vs. prod, with your environment determined by your target name:

dbt_project.yml
vars:
  snowplow_<package>:
    snowplow__backfill_limit_days: "{{ snowplow_utils.get_value_by_target(
                                        dev_value=1,
                                        default_value=30,
                                        dev_target_name='dev') }}"

2. Using Recent Data

This can be achieved by setting snowplow__start_date to a recent date. To dynamically change the start date depending on your environment, you can use the following:

dbt_project.yml
vars:
  snowplow_<package>:
    snowplow__start_date: "{{ snowplow_utils.get_value_by_target(
                                dev_value=snowplow_utils.n_timedeltas_ago(1, 'weeks'),
                                default_value='2020-01-01',
                                dev_target_name='dev') }}"

3. Handling of schema evolution

BigQuery Only

As your schemas for custom contexts and unstructured events evolve, multiple versions of the same column will be created in your events table, e.g. custom_context_1_0_0, custom_context_1_0_1. These columns contain nested fields, i.e. they are of the RECORD datatype. When modeling Snowplow data it can be useful to combine or coalesce each nested field across all versions of the column for a continuous view over time.

The snowplow-utils package provides the combine_column_versions macro, which will automatically coalesce the fields within every version of the specified column. This mitigates the need for you to update your models every time a new column version is created.

Please refer to the snowplow-utils docs for the full documentation on these macros.
