
Web

info

Some variables are only available in the latest version of our package, or might have changed format from older versions. If you are unable to use the latest version, check the dbt_project.yml file of our package for the version you are using to see what options are available to you.

Package Configuration Variables

This package utilizes a set of variables that are configured to recommended values for optimal performance of the models. Depending on your use case, you might want to override these values by adding to your dbt_project.yml file.
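
For example, a minimal sketch of overriding two of these variables in your dbt_project.yml (the values shown are illustrative, not recommendations):

```yml
vars:
  snowplow_web:
    snowplow__start_date: '2023-01-01'  # illustrative: date to start processing events from
    snowplow__backfill_limit_days: 60   # illustrative: raises the default of 30
```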

note

All variables in Snowplow packages start with snowplow__, but this prefix has been removed in the tables below for brevity.

Warehouse and tracker

| Variable Name | Description | Default |
|---|---|---|
| atomic_schema | The schema (dataset for BigQuery) that contains your atomic events table. | atomic |
| database | The database that contains your atomic events table. | target.database |
| dev_target_name | The target name of your development environment as defined in your profiles.yml file. See the Manifest Tables section for more details. | dev |
| events | This is used internally by the packages to reference your events table based on other variable values and should not be changed. | events |
| events_table | The name of the table that contains your atomic events. | events |
| ga4_categories_seed | Name of the model for the GA4 category mapping seed table, either a seed or a model (if you want to use a source, create a model to select from it). | snowplow_web_dim_ga4_source_categories |
| geo_mapping_seed | Name of the model for the Geo mapping seed table, either a seed or a model (if you want to use a source, create a model to select from it). | snowplow_web_dim_geo_country_mapping |
| heartbeat | Page ping heartbeat time as defined in your tracker configuration. | 10 |
| min_visit_length | Minimum visit length as defined in your tracker configuration. | 5 |
| rfc_5646_seed | Name of the model for the RFC 5646 (language) mapping seed table, either a seed or a model (if you want to use a source, create a model to select from it). | snowplow_web_dim_rfc_5646_language_mapping |
| sessions_table | The users module requires data from the derived sessions table. If you choose to disable the standard sessions table in favor of your own custom table, set this to reference your new table, e.g. {{ ref('snowplow_web_sessions_custom') }}. Please see the README in the custom_example directory for more information on this sort of implementation. | "{{ ref('snowplow_web_sessions') }}" |
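
As a sketch, overriding the warehouse and tracker variables above might look like the following in dbt_project.yml (the schema, database, and tracker values are illustrative):

```yml
vars:
  snowplow_web:
    snowplow__atomic_schema: my_events_schema  # illustrative schema (dataset) name
    snowplow__database: my_database            # illustrative database name
    snowplow__heartbeat: 30                    # should match your tracker configuration
    snowplow__min_visit_length: 10             # should match your tracker configuration
```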

Operation and logic

| Variable Name | Description | Default |
|---|---|---|
| allow_refresh | Used as the default value to return from the allow_refresh() macro. This macro determines whether the manifest tables can be refreshed or not, depending on your environment. See the Manifest Tables section for more details. | false |
| backfill_limit_days | The maximum number of days of new data to be processed since the latest event processed. Please refer to the incremental logic section for more details. | 30 |
| conversion_events | (Version 0.15.0+) A list of dictionaries that define a conversion event for your modeling, to add the relevant columns to the sessions table. The dictionary keys are name (required), condition (required), value, default_value, and list_events. For more information see the package documentation. | |
| cwv_days_to_measure | The number of days to use for web vital measurements (if enabled). | 28 |
| cwv_percentile | The percentile used for the web vitals measurements that are produced for all page views (if enabled). | 75 |
| days_late_allowed | The maximum allowed number of days between the event creation and it being sent to the collector. Exists to reduce lengthy table scans that can occur as a result of late arriving data. | 3 |
| limit_page_views_to_session | A boolean for whether to ensure page view aggregations are limited to pings in the same session as the page_view event, to ensure deterministic behavior. If false, you may get different results for the same page_view depending on which sessions are included in a run. See the stray page ping section for more information. | true |
| list_event_counts | A boolean for whether to include a JSON-type (varies by warehouse) column in the sessions table with a count of events for each event_type in that session. | false |
| lookback_window_hours | The number of hours to look back before the latest event processed, to account for late arriving data, which comes out of order. | 6 |
| max_session_days | The maximum allowed session length in days. For a session exceeding this length, all events after this limit will stop being processed. Exists to reduce lengthy table scans that can occur due to long sessions, which are usually a result of bots. | 3 |
| page_view_stitching | Determines whether to apply the user mapping to the page views table. Note this can be an expensive operation to perform on every run. One way to mitigate this is to run this update less frequently than your usual run, by enabling this variable only for that specific run. Please see the User Mapping section for more details. | false |
| session_identifiers | A list of key:value dictionaries which contain all of the contexts and fields where your session identifiers are located. For each entry in the list, if your map contains the schema value atomic, then this refers to a field found directly in the atomic events table. If you are trying to introduce a context/entity with an identifier in it, the package will look for the context in your events table with the name specified in the schema field, and will use the value in the field key as the field name to access. For Redshift/Postgres, using the schema key the package will try to find a table in your snowplow__events_schema schema with the same name as the schema value provided, and join that. If multiple fields are specified, the package will try to coalesce all fields in the order specified in the list. For a better understanding of the advanced usage of this variable, please see the Utils advanced operation section for more details. | [{"schema": "atomic", "field": "domain_sessionid"}] |
| session_lookback_days | Number of days to limit the scan on the snowplow_web_base_sessions_lifecycle_manifest table. Exists to improve performance of the model when there are many sessions. Should be set to as large a number as practical. | 730 |
| session_stitching | Determines whether to apply the user mapping to the sessions table. Please see the User Mapping section for more details. | true |
| session_sql | This allows you to override the session_identifiers SQL, to define completely custom SQL in order to build out a session identifier for your events. If you are interested in using this instead of providing identifiers through the session_identifiers variable, please see the Utils advanced operation section for more details on how to do that. | |
| session_timestamp | Determines which timestamp is used to build the sessionization logic. It's a good idea to have this be the same timestamp as the field you partition your events table on. | collector_tstamp |
| start_date | The date to start processing events from in the package on first run or a full refresh, based on collector_tstamp. | '2020-01-01' |
| total_all_conversions | A boolean flag for whether to calculate and add the cv__all_volume and cv__all_total columns. For more information see the package documentation. | false |
| upsert_lookback_days | Number of days to look back over the incremental derived tables during the upsert. Where performance is not a concern, this should be set to as large a value as possible. Having too short a period can result in duplicates. Please see the Snowplow Optimized Materialization section for more details. | 30 |
| user_identifiers | A list of key:value dictionaries which contain all of the contexts and fields where your user identifiers are located. For each entry in the list, if your map contains the schema value atomic, then this refers to a field found directly in the atomic events table. If you are trying to introduce a context/entity with an identifier in it, the package will look for the context in your events table with the name specified in the schema field, and will use the value in the field key as the field name to access. For Redshift/Postgres, using the schema key the package will try to find a table in your snowplow__events_schema schema with the same name as the schema value provided, and join that. If multiple fields are specified, the package will try to coalesce all fields in the order specified in the list. For a better understanding of the advanced usage of this variable, please see the Utils advanced operation section for more details. | [{"schema": "atomic", "field": "domain_userid"}] |
| user_sql | This allows you to override the user_identifiers SQL, to define completely custom SQL in order to build out a user identifier for your events. If you are interested in using this instead of providing identifiers through the user_identifiers variable, please see the Utils advanced operation section for more details on how to do that. | |
| user_stitching_id | This is the user_id you want to stitch to sessions (and/or page views) with matching domain_userids. It supports raw SQL expressions. | user_id |
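
For example, a sketch of a single conversion definition using the dictionary keys listed above (the event name, condition, and value expression are illustrative):

```yml
vars:
  snowplow_web:
    snowplow__conversion_events:
      - name: 'sign_up'                      # required; illustrative conversion name
        condition: "event_name = 'sign_up'"  # required; illustrative SQL condition
        value: 'tr_total'                    # optional; illustrative value expression
        default_value: 0                     # optional fallback value
        list_events: true                    # optional; see the package documentation
    snowplow__total_all_conversions: true    # also adds the cv__all_* columns
```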
tip

When modifying session_identifiers/user_identifiers or using session_sql/user_sql in the web package, the results will overwrite the domain_sessionid and domain_userid fields in the tables, rather than being written to session_identifier/user_identifier as in the core utils implementation. This is for historic reasons, to mitigate breaking changes. The original values of these fields can be found in the original_domain_sessionid/original_domain_userid columns in each table.
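
As an illustration of the session_identifiers behavior described above, the sketch below coalesces an identifier from a hypothetical custom context (com_mycompany_session_1 and its session_id field are placeholders) with the atomic default, in the order listed:

```yml
vars:
  snowplow_web:
    snowplow__session_identifiers:
      - schema: 'com_mycompany_session_1'  # hypothetical custom context name
        field: 'session_id'                # hypothetical field within that context
      - schema: 'atomic'                   # fall back to the atomic events table
        field: 'domain_sessionid'
```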

Contexts, filters, and logs

| Variable Name | Description | Default |
|---|---|---|
| app_id | A list of app_ids to filter the events table on for processing within the package. | [] (no filter applied) |
| enable_consent | Flag to enable the consent module. | false |
| enable_cwv | Flag to enable the Core Web Vitals module. | false |
| enable_iab | Flag to include the IAB enrichment data in the models. | false |
| enable_ua | Flag to include the UA Parser enrichment data in the models. | false |
| enable_yauaa | Flag to include the YAUAA enrichment data in the models. | false |
| has_log_enabled | When executed, the package logs information about the current run to the CLI. This can be disabled by setting to false. | true |
| page_view_passthroughs | Field(s) to carry through from the events table to the derived table. The field is taken from the page_view event record. Aggregation is not supported. A list of either flat column names from the events table, or dictionaries with the keys sql (the SQL code to select the column) and alias (the alias of the column in the output). | [] (no passthroughs) |
| session_passthroughs | Field(s) to carry through from the events table to the derived table. The field is based on the first page_view or page_ping event for that session. Aggregation is not supported. A list of either flat column names from the events table, or dictionaries with the keys sql (the SQL code to select the column) and alias (the alias of the column in the output). | [] (no passthroughs) |
| ua_bot_filter | Flag to filter out bots via a useragent string pattern match. | true |
| user_first_passthroughs | Field(s) to carry through from the events table to the derived table. The field is based on the first session record for that user. Aggregation is not supported. A list of either flat column names from the events table, or dictionaries with the keys sql (the SQL code to select the column) and alias (the alias of the column in the output). | [] (no passthroughs) |
| user_last_passthroughs | Field(s) to carry through from the events table to the derived table. The field is based on the last session record for that user. Aggregation is not supported. A list of either flat column names from the events table, or dictionaries with the keys sql (the SQL code to select the column) and alias (the alias of the column in the output). Note that flat fields will be aliased with a last_ prefix; aliases provided via dictionaries will not be, by default. | [] (no passthroughs) |
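
For example, a sketch combining a few of these variables (the app_ids and passthrough choices are illustrative; page_title and page_urlpath are standard atomic columns):

```yml
vars:
  snowplow_web:
    snowplow__app_id: ['website', 'mobile-web']  # illustrative app_ids to keep
    snowplow__enable_yauaa: true                 # include YAUAA enrichment data
    snowplow__page_view_passthroughs:
      - page_title                               # flat column from the events table
      - sql: 'lower(page_urlpath)'               # custom SQL for the column
        alias: page_urlpath_lower                # alias of the column in the output
```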

Warehouse Specific

| Variable Name | Description | Default |
|---|---|---|
| databricks_catalog | The catalog your atomic events table is in. Depending on the use case, it should either be the catalog (for Unity Catalog users on the databricks connector 1.1.1 onwards, defaulted to hive_metastore) or the same value as your snowplow__atomic_schema (unless changed, it should be 'atomic'). | hive_metastore |
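
For Unity Catalog users, for example, a minimal sketch (the catalog name is illustrative):

```yml
vars:
  snowplow_web:
    snowplow__databricks_catalog: my_unity_catalog  # illustrative catalog name
```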

Output Schemas

By default, all scratch/staging tables will be created in the <target.schema>_scratch schema, the derived tables will be created in <target.schema>_derived, and all manifest tables in <target.schema>_snowplow_manifest. Some of these schemas are only used by specific packages, so ensure you add the correct configurations for each package you are using. To change them, add the following to your dbt_project.yml file:

tip

If you want to use just your connection schema with no suffixes, set the +schema: values to null

```yml
models:
  snowplow_web:
    base:
      manifest:
        +schema: my_manifest_schema
      scratch:
        +schema: my_scratch_schema
    sessions:
      +schema: my_derived_schema
      scratch:
        +schema: my_scratch_schema
    user_mapping:
      +schema: my_derived_schema
    users:
      +schema: my_derived_schema
      scratch:
        +schema: my_scratch_schema
    page_views:
      +schema: my_derived_schema
      scratch:
        +schema: my_scratch_schema
```

Config Generator

You can use the interactive config generator to produce the code you need to place into your dbt_project.yml file to configure the package as you require. Any values not specified will use their default values from the package.
