# Incremental Sessionization Logic
On this page, `<package>` can be one of: `web`, `mobile`, `ecommerce`, `normalize`.
The general principle behind an incremental model is to identify new events/rows since the previous run of the model, and then only process these new events. This minimizes cost and reduces run times. This works well for basic event streams; however, you quickly run into issues with any aggregations you wish to perform.
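For example, a plain incremental model typically filters its source on a timestamp column so that only rows newer than the latest already-processed row are selected. A minimal sketch, assuming a hypothetical `events` source and an already-built target table `my_incremental_model` (not the package's actual code):

```sql
-- Minimal sketch of a basic incremental filter:
-- only pick up events that arrived after the latest event already processed.
select *
from events
where collector_tstamp > (select max(collector_tstamp) from my_incremental_model)
```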
For mobile and web event data we typically consider a session to be a complete 'visit' and as such calculate metrics across the entire session. This means that when we have a new event for a previously processed session, we have to reprocess all historic events for that session as well as the new events - otherwise our metrics would be incorrect. The logic we have implemented to achieve this is:
1. Identify new events since the previous run of the package.
2. Identify the `session_id` associated with the new events.
3. Look back over the events table to find all events associated with these `session_id`s (see the sketch below).
4. Run all these events through the page/screen views, sessions and users modules.
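A minimal sketch of steps 2 and 3, assuming a hypothetical `events` table with `session_id` and `collector_tstamp` columns and a placeholder last-processed timestamp (an illustration only, not the package's implementation):

```sql
-- Step 2: find the session_ids that have received new events since the last run.
with new_sessions as (
    select distinct session_id
    from events
    where collector_tstamp > timestamp '2021-06-03 00:00:00'  -- placeholder: last processed timestamp
)
-- Step 3: look back over the events table to pick up *all* events for those sessions,
-- not just the new ones, so session-level metrics can be recalculated correctly.
select e.*
from events e
join new_sessions ns
  on e.session_id = ns.session_id
```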
Given the size of event tables, Step 3 can be an expensive operation. To minimize cost, ideally we want to:
- Know when any given session started. This would allow us to limit scans on the events table when looking back for previous events.
  - This is achieved by the `snowplow_<package>_base_sessions_lifecycle_manifest` model, which records the start and end timestamp of all sessions (see the sketch after this list).
- Limit the maximum allowed session length. Sessions generated by bots can persist for years, which would mean scanning years of data on every run of the package.
  - For the web package this is achieved by the `snowplow_web_base_quarantined_sessions` model, which stores the `session_id` of any sessions that have exceeded the max allowed session length (`snowplow__max_session_days`). For such sessions, all events are processed up until the max allowed length; after that, no more data is processed for that session. This is not required for mobile.
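As a rough illustration of the first point, the lifecycle manifest lets the look-back scan be restricted to the window in which each session was active. The sketch below uses the web package's model name, but the `start_tstamp` and `end_tstamp` column names are assumptions made for illustration:

```sql
-- Sketch only: bound the events scan using the session start/end timestamps
-- recorded in the lifecycle manifest, instead of scanning the whole events table.
select e.*
from events e
join snowplow_web_base_sessions_lifecycle_manifest m
  on e.session_id = m.session_id
where e.collector_tstamp >= m.start_tstamp
  and e.collector_tstamp <= m.end_tstamp
```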
## The Incremental Manifest
Most of our packages use centralized manifest tables, `snowplow_<package>_incremental_manifest`, to record what events have already been processed and by which model/node. This allows for easy identification of what events to process in subsequent runs of the package. The manifest table is updated as part of an `on-run-end` hook, which calls the `snowplow_incremental_post_hook()` macro.
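Conceptually, the hook records the most recent event timestamp each model successfully processed, along the lines of the statement below. This is a simplified sketch only, not the macro's actual implementation, which handles every model in the run and each supported warehouse:

```sql
-- Simplified illustration: bump one model's last_success to the latest
-- event timestamp it processed in this run.
update snowplow_web_incremental_manifest
set last_success = '2021-06-03'
where model = 'snowplow_web_page_views'
```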
Example from the `snowplow_web_incremental_manifest`:
| model | last_success |
|---|---|
| snowplow_web_page_views_this_run | '2021-06-03' |
| snowplow_web_page_views | '2021-06-03' |
| snowplow_web_sessions | '2021-06-02' |
## Identification of events to process
The identification of which events to process is performed by the `get_run_limits` macro, which is called in the `snowplow_<package>_base_new_event_limits` model. This macro uses the metadata recorded in `snowplow_<package>_incremental_manifest` to determine the correct events to process next, based on the current state of the Snowplow dbt model. The selection of these events is done by specifying a range of `collector_tstamp`s to process, between a `lower_limit` and an `upper_limit`. The calculation of these limits is as follows.

First we query `snowplow_<package>_incremental_manifest`, filtering for all enabled models tagged with `snowplow_<package>_incremental` within your dbt project:
```sql
select
    min(last_success) as min_last_success,
    max(last_success) as max_last_success,
    coalesce(count(*), 0) as models
from snowplow_<package>_incremental_manifest
where model in (array_of_snowplow_tagged_enabled_models)
```
Based on the results, the model enters one of four states.

In all states the `upper_limit` is limited by the `snowplow__backfill_limit_days` variable. This protects against back-fills with many rows causing very long run times.
### State 1: First run of the package

The query returns `models = 0`, indicating that no models exist in the manifest.

- `lower_limit`: `snowplow__start_date`
- `upper_limit`: `least(current_tstamp, snowplow__start_date + snowplow__backfill_limit_days)`
### State 2: New model introduced

`models < size(array_of_snowplow_tagged_enabled_models)`, and therefore a new model, tagged with `snowplow_<package>_incremental`, has been added since the last run. The package will replay all previously processed events in order to back-fill the new model.

- `lower_limit`: `snowplow__start_date`
- `upper_limit`: `least(max_last_success, snowplow__start_date + snowplow__backfill_limit_days)`
### State 3: Models out of sync

`min_last_success < max_last_success`, and therefore the tagged models are out of sync, for example due to a particular model failing to execute successfully during the previous run. The package will attempt to sync all models.

- `lower_limit`: `min_last_success - snowplow__lookback_window_hours`
- `upper_limit`: `least(max_last_success, min_last_success + snowplow__backfill_limit_days)`
### State 4: Standard run

If none of the above criteria are met, then we consider it a 'standard run' and we carry on from the last processed event.

- `lower_limit`: `max_last_success - snowplow__lookback_window_hours`
- `upper_limit`: `least(current_tstamp, max_last_success + snowplow__backfill_limit_days)`
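To tie the four states together, the sketch below restates the limit calculation as plain SQL. The literal sample values, the `manifest_summary` column names, and the Postgres-style interval arithmetic are all placeholders for illustration; in the package this logic lives in the `get_run_limits` macro (written in Jinja), not in a SQL model.

```sql
-- Illustration only: how the four states map to lower_limit / upper_limit.
-- Literal values stand in for the dbt variables and the manifest query results.
with manifest_summary as (
    select
        timestamp '2021-01-05' as min_last_success,
        timestamp '2021-01-06' as max_last_success,
        3                      as models,           -- rows in the incremental manifest
        3                      as enabled_models,   -- size of the enabled-model array
        timestamp '2020-01-01' as start_date,       -- snowplow__start_date
        interval '30 days'     as backfill_limit,   -- snowplow__backfill_limit_days
        interval '6 hours'     as lookback_window   -- snowplow__lookback_window_hours
)
select
    case
        when models = 0                          then start_date                            -- State 1
        when models < enabled_models             then start_date                            -- State 2
        when min_last_success < max_last_success then min_last_success - lookback_window    -- State 3
        else                                          max_last_success - lookback_window    -- State 4
    end as lower_limit,
    case
        when models = 0                          then least(current_timestamp, start_date + backfill_limit)
        when models < enabled_models             then least(max_last_success, start_date + backfill_limit)
        when min_last_success < max_last_success then least(max_last_success, min_last_success + backfill_limit)
        else                                          least(current_timestamp, max_last_success + backfill_limit)
    end as upper_limit
from manifest_summary
```

With these sample values the query falls into State 3, returning a `lower_limit` of 2021-01-04 18:00:00 and an `upper_limit` of 2021-01-06 00:00:00.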
### How to identify the current state

If you want to check the current state of a model, run the `snowplow_<package>_base_new_event_limits` model. This will log the current state to the CLI while causing no disruption to the incremental processing of events.
```bash
dbt run --select snowplow_<package>_base_new_event_limits
```

```text
...
00:26:29 + Snowplow: Standard incremental run
00:26:29 + Snowplow: Processing data between 2021-01-05 17:59:32 and 2021-01-07 23:59:32
```