Snowplow Media Player Package
The package source code can be found in the snowplow/dbt-snowplow-media-player repo, and the docs for the model design here.
The package is built as an extension of the dbt-snowplow-web package that transforms raw media player event data into derived tables for easier querying generated by the Snowplow JavaScript tracker in combination with media tracking specific plugins such as the Media Tracking plugin or the YouTube Tracking plugin.
In order to keep the documentations separate and less verbose, this guide will assume the reader is already familiar with configuring and running the web model and will only explain how to operate the media-player package in conjunction with the web model as the media model was designed to be run together with it. Please refer to the snowplow-web dbt docs for a full breakdown of the package and how to set it up.
Please note that the media player package is not compatible with the Flutter tracker as it is reliant on the Snowplow JavaScript tracker.
Overviewโ
This package consists of a series of dbt models with the goal to produce the following main aggregated models from the raw media player events and relevant contexts:
snowplow_media_player_base
: This derived table summarizes the key media player events and metrics of each media element on a media_id and pageview level which is considered as a base aggregation level for media interactions.snowplow_media_player_plays_by_pageview
: This view removes impressions from the '_base' table to summarize media plays on a page_view by media_id level.snowplow_media_player_media_stats
: This derived table aggregates the '_base' table to individual media_id level, calculating the main KPIs and overall video/audio metrics.
The package is built on top of the dbt-snowplow-web package taking that as a basis to carry out the incremental update. It is designed to be run together with the web model in a similar manner to how a custom module would run.
The _interactions_this_run
table takes the snowplow_web_base_events_this_run
table generated by the web package as an input then adds the various contexts to enrich the base table with the additional media related fields. It could be used for custom models for more in-depth event level derived tables and further analysis.
The _base_this_run
table then aggregates the _interactions_this_run
table to media_id and pageview level and serves as a basis for the incrementalized derived table _media_base
.
The main _media_stats
derived table will also be updated incrementally based on the _media_base
derived table, however not through the snowplow_incremental
materialization, but using the native dbt incremental materialization on a pageview basis after a set time window passed. This is to prevent complex and expensive queries due to metrics which need to take the whole page_view events into calculation. This way the metrics will only be calculated once per pageview / media, after no new events are expected.
The additional _pivot_base
table is there to calculate the percent_progress boundaries and weights that are used to calculate the total play_time and other related media fields.
Operating with the Web packageโ
Due to its unique relationship with the web package, in order to operate the media player package together with the web model there are several considerations to keep in mind. Depending on the use case one of the following scenarios may happen:
- The web package was already being used and the media player package added later
- The web package has not been used
- Only the media player package needs to be run at any time
- Scenario 1
- Scenario 2
- Scenario 3
Adding the media player data model to an existing dbt project with web model data already runningโ
If you have already been using the web package then your new media player models will need to sync up with the data that has been processed (see the incremental logic section for why), so the backfilling needs to happen before we can process more web data. Please note that during backfill no new web data is allowed to be processed and depending on the snowplow__backfill_limit_days
configured and the period that needs backfilling it can take multiple run for all models to sync up and new web events to start to be processed again. You can temporarily overwrite the snowplow__backfill_limit_days
variable in your snowplow_web
configuration if you want to backfill in less runs.
To begin the synching process please run the following script:
dbt run --select snowplow_web.base snowplow_media_player --vars 'snowplow__start_date: <date_when_media_player_tracking_starts>'
Using the snowplow__start_date
allows you to start your media player data at a later date than the rest of your web data if you need to.
The web model's update logic will recognize the new media player models and begin backfilling them between the date you defined within snowplow_start_date
and the upper limit defined by the variable snowplow_backfill_limit_days
that is set for the web model. You should see the following in the logs:
Snowplow: New Snowplow incremental model. Backfilling
After this you should be able to see all media_player models added to the derived.snowplow_web_incremental_manifest
table. Any subsequent run from this point onwards could be carried out using the recommended web model running method, using the snowplow_web
selector.
dbt run --selector snowplow_web
Starting both the media and web model from scratchโ
As neither package has been run before, the snowplow_web_incremental_manifest
table is new; all models from both packages (plus any custom modules tagged with snowplow_web_incremental
) will be processed using the recommended web model running method - using the snowplow_web
selector - without any extra steps required.
dbt run --selector snowplow_web
3. Only running the media player package from the same dbt projectโ
Although the media player package is not designed for standalone usage, there can be scenarios where you wish to only have the media player models enabled and not the web model. In such case the web model still has to be configured, but you can disable all models the media player package does not rely on. You can disable all non-required models like so:
models:
snowplow_web:
page_views:
enabled: false
sessions:
enabled: false
user_mapping:
enabled: false
users:
enabled: false
Running it, however can still be achieved by running the selector as defined in the web model.
dbt run --selector snowplow_web
After the run finishes, you should only see the media player related models to be present within the snowplow_web_incremental_manifest table (if you did not have them enabled before).
Custom modelsโ
There are two custom models included in the package which could potentially be used in downstream models:
the
snowplow_media_player_session_stats
table, which aggregates the snowplow_media_base table on a session levelthe
snowplow_media_player_user_stats
table, which aggregates the snowplow_media_player_session_stats to user level
By default these are disabled, but you can enable them in the project's profiles.yml
, if needed.
models:
snowplow_media_player:
custom:
enabled: true
Just like in case of the web model, users are encouraged to use the Media Player model and its incremental logic to design their own custom models / modules. The snowplow_media_player_interactions_this_run
table is designed with this in mind, where a couple of potentially useful fields are generated that the Media Player model does not use downstream but they nonetheless have the potential to be incorporated into users custom models.
One such example is the player_current_time
, which is the playback position of a specific media in seconds whenever a media player event is fired, from which more precise time-based calculations could be made.
e.g. subsequent events' player_current_time
could be used to deduct the player_end_time
of an event. Subtracting these two would result in calculated play_times instead of taking the percent_progress fields as a base like the Media Player model.
with interaction_ends as (
select
event_id,
event_type,
start_tstamp,
lead(start_tstamp, 1) over(partition by play_id order by start_tstamp) as end_tstamp,
player_current_time,
lead(player_current_time, 1) over(partition by play_id order by start_tstamp) as player_end_time
from scratch.snowplow_media_player_interactions_this_run
)
select
event_id,
player_end_time - player_current_time as play_time_sec_calculated
from interaction_ends
where event_type = 'play'