Snowplow Normalize Models
This page is auto-generated from our dbt packages, some information may be incomplete
Snowplow Normalize
Snowplow Normalize Base Events This Run
models/base/scratch/<adaptor>/snowplow_normalize_base_events_this_run.sql
Description
For any given run, this table contains all required events to be consumed by subsequent nodes in the Snowplow dbt normalize package. This is a cleaned, deduped dataset, containing all columns from the raw events table.
Note: This table should be used as the input to any custom modules that require event level data, rather than selecting straight from atomic.events
File Paths
- bigquery
- databricks
- snowflake
models/base/scratch/bigquery/snowplow_normalize_base_events_this_run.sql
models/base/scratch/databricks/snowplow_normalize_base_events_this_run.sql
models/base/scratch/snowflake/snowplow_normalize_base_events_this_run.sql
Details
Columns
Base event this run table column lists may be incomplete and is missing contexts/unstructs, please check your warehouse for a more accurate column list.
Column Name | Description |
---|---|
app_id | Application ID e.g. ‘angry-birds’ is used to distinguish different applications that are being tracked by the same Snowplow stack, e.g. production versus dev. |
platform | Platform e.g. ‘web’ |
etl_tstamp | Timestamp event began ETL e.g. ‘2017-01-26 00:01:25.292’ |
collector_tstamp | Time stamp for the event recorded by the collector e.g. ‘2013-11-26 00:02:05’ |
dvce_created_tstamp | Timestamp event was recorded on the client device e.g. ‘2013-11-26 00:03:57.885’ |
event | The type of event recorded e.g. ‘page_view’ |
event_id | A UUID for each event e.g. ‘c6ef3124-b53a-4b13-a233-0088f79dcbcb’ |
txn_id | Transaction ID set client-side, used to de-dupe records e.g. 421828 |
name_tracker | Tracker namespace e.g. ‘sp1’ |
v_tracker | Tracker version e.g. ‘js-3.0.0’ |
v_collector | Collector version e.g. ‘ssc-2.1.0-kinesis’ |
v_etl | ETL version e.g. ‘snowplow-micro-1.1.0-common-1.4.2’ |
user_id | Unique ID set by business e.g. ‘jon.doe@email.com’ |
user_ipaddress | User IP address e.g. ‘92.231.54.234’ |
user_fingerprint | A user fingerprint generated by looking at the individual browser features e.g. 2161814971 |
domain_userid | User ID set by Snowplow using 1st party cookie e.g. ‘bc2e92ec6c204a14’ |
domain_sessionidx | A visit / session index e.g. 3 |
network_userid | User ID set by Snowplow using 3rd party cookie e.g. ‘ecdff4d0-9175-40ac-a8bb-325c49733607’ |
geo_country | ISO 3166-1 code for the country the visitor is located in e.g. ‘GB’, ‘US’ |
geo_region | ISO-3166-2 code for country region the visitor is in e.g. ‘I9’, ‘TX’ |
geo_city | City the visitor is in e.g. ‘New York’, ‘London’ |
geo_zipcode | Postcode the visitor is in e.g. ‘94109’ |
geo_latitude | Visitor location latitude e.g. 37.443604 |
geo_longitude | Visitor location longitude e.g. -122.4124 |
geo_region_name | Visitor region name e.g. ‘Florida’ |
ip_isp | Visitor’s ISP e.g. ‘FDN Communications’ |
ip_organization | Organization associated with the visitor’s IP address – defaults to ISP name if none is found e.g. ‘Bouygues Telecom’ |
ip_domain | Second level domain name associated with the visitor’s IP address e.g. ‘nuvox.net’ |
ip_netspeed | Visitor’s connection type e.g. ‘Cable/DSL’ |
page_url | The page URL e.g. ‘http://www.example.com’ |
page_title | Web page title e.g. ‘Snowplow Docs – Understanding the structure of Snowplow data’ |
page_referrer | URL of the referrer e.g. ‘http://www.referrer.com’ |
page_urlscheme | Scheme aka protocol e.g. ‘https’ |
page_urlhost | Host aka domain e.g. ‘“www.snowplow.io’ |
page_urlport | Port if specified, 80 if not 80 |
page_urlpath | Path to page e.g. ‘/product/index.html’ |
page_urlquery | Querystring e.g. ‘id=GTM-DLRG’ |
page_urlfragment | Fragment aka anchor e.g. ‘4-conclusion’ |
refr_urlscheme | Referer scheme e.g. ‘http’ |
refr_urlhost | Referer host e.g. ‘www.bing.com’ |
refr_urlport | Referer port e.g. 80 |
refr_urlpath | Referer page path e.g. ‘/images/search’ |
refr_urlquery | Referer URL querystring e.g. ‘q=psychic+oracle+cards’ |
refr_urlfragment | Referer URL fragment |
refr_medium | Type of referer e.g. ‘search’, ‘internal’ |
refr_source | Name of referer if recognised e.g. ‘Bing images’ |
refr_term | Keywords if source is a search engine e.g. ‘psychic oracle cards’ |
mkt_medium | Type of traffic source e.g. ‘cpc’, ‘affiliate’, ‘organic’, ‘social’ |
mkt_source | The company / website where the traffic came from e.g. ‘Google’, ‘Facebook’ |
mkt_term | Any keywords associated with the referrer e.g. ‘new age tarot decks’ |
mkt_content | The content of the ad. (Or an ID so that it can be looked up.) e.g. 13894723 |
mkt_campaign | The campaign ID e.g. ‘diageo-123’ |
se_category | Category of event e.g. ‘ecomm’, ‘video’ |
se_action | Action performed / event name e.g. ‘add-to-basket’, ‘play-video’ |
se_label | The object of the action e.g. the ID of the video played or SKU of the product added-to-basket e.g. ‘pbz00123’ |
se_property | A property associated with the object of the action e.g. ‘HD’, ‘large’ |
se_value | A value associated with the event / action e.g. the value of goods added-to-basket e.g. 9.99 |
tr_orderid | Order ID e.g. ‘#134’ |
tr_affiliation | Transaction affiliation (e.g. store where sale took place) e.g. ‘web’ |
tr_total | Total transaction value e.g. 12.99 |
tr_tax | Total tax included in transaction value e.g. 3.00 |
tr_shipping | Delivery cost charged e.g. 0.00 |
tr_city | Delivery address, city e.g. ‘London’ |
tr_state | Delivery address, state e.g. ‘Washington’ |
tr_country | Delivery address, country e.g. ‘France’ |
ti_orderid | Order ID e.g. ‘#134’ |
ti_sku | Product SKU e.g. ‘pbz00123’ |
ti_name | Product name e.g. ‘Cone pendulum’ |
ti_category | Product category e.g. ‘New Age’ |
ti_price | Product unit price e.g. 9.99 |
ti_quantity | Number of product in transaction e.g. 2 |
pp_xoffset_min | Minimum page x offset seen in the last ping period e.g. 0 |
pp_xoffset_max | Maximum page x offset seen in the last ping period e.g. 100 |
pp_yoffset_min | Minimum page y offset seen in the last ping period e.g. 0 |
pp_yoffset_max | Maximum page y offset seen in the last ping period e.g. 200 |
useragent | Raw useragent |
br_name | Browser name e.g. ‘Firefox 12’ |
br_family | Browser family e.g. ‘Firefox’ |
br_version | Browser version e.g. ‘12.0’ |
br_type | Browser type e.g. ‘Browser’ |
br_renderengine | Browser rendering engine e.g. ‘GECKO’ |
br_lang | Language the browser is set to e.g. ‘en-GB’ |
br_features_pdf | Whether the browser recognizes PDFs e.g. True |
br_features_flash | Whether Flash is installed e.g. True |
br_features_java | Whether Java is installed e.g. True |
br_features_director | Whether Adobe Shockwave is installed e.g. True |
br_features_quicktime | Whether QuickTime is installed e.g. True |
br_features_realplayer | Whether RealPlayer is installed e.g. True |
br_features_windowsmedia | Whether mplayer2 is installed e.g. True |
br_features_gears | Whether Google Gears is installed e.g. True |
br_features_silverlight | Whether Microsoft Silverlight is installed e.g. True |
br_cookies | Whether cookies are enabled e.g. True |
br_colordepth | Bit depth of the browser color palette e.g. 24 |
br_viewwidth | Viewport width e.g. 1000 |
br_viewheight | Viewport height e.g. 1000 |
os_name | Name of operating system e.g. ‘Android’ |
os_family | Operating system family e.g. ‘Linux’ |
os_manufacturer | Company responsible for OS e.g. ‘Apple’ |
os_timezone | Client operating system timezone e.g. ‘Europe/London’ |
dvce_type | Type of device e.g. ‘Computer’ |
dvce_ismobile | Is the device mobile? e.g. True |
dvce_screenwidth | Screen width in pixels e.g. 1900 |
dvce_screenheight | Screen height in pixels e.g. 1024 |
doc_charset | The page’s character encoding e.g. , ‘UTF-8’ |
doc_width | The page’s width in pixels e.g. 1024 |
doc_height | The page’s height in pixels e.g. 3000 |
tr_currency | Currency e.g. ‘USD’ |
tr_total_base | Total in base currency e.g. 12.99 |
tr_tax_base | Total tax in base currency e.g. 3.00 |
tr_shipping_base | decimal Delivery cost in base currency e.g. 0.00 |
ti_currency | Currency e.g. ‘EUR’ |
ti_price_base | decimal Price in base currency e.g. 9.99 |
base_currency | Reporting currency e.g. ‘GBP’ |
geo_timezone | Visitor timezone name e.g. ‘Europe/London’ |
mkt_clickid | The click ID e.g. ‘ac3d8e459’ |
mkt_network | The ad network to which the click ID belongs e.g. ‘DoubleClick’ |
etl_tags | JSON of tags for this ETL run e.g. “[‘prod’]” |
dvce_sent_tstamp | When the event was sent by the client device e.g. ‘2013-11-26 00:03:58.032’ |
refr_domain_userid | The Snowplow domain_userid of the referring website e.g. ‘bc2e92ec6c204a14’ |
refr_dvce_tstamp | The time of attaching the domain_userid to the inbound link e.g. ‘2013-11-26 00:02:05’ |
domain_sessionid | A visit / session UUID e.g. ‘c6ef3124-b53a-4b13-a233-0088f79dcbcb’ |
derived_tstamp | Timestamp making allowance for innaccurate device clock e.g. ‘2013-11-26 00:02:04’ |
event_vendor | Who defined the event e.g. ‘com.acme’ |
event_name | Event name e.g. ‘link_click’ |
event_format | Format for event e.g. ‘jsonschema’ |
event_version | Version of event schema e.g. ‘1-0-2’ |
event_fingerprint | Hash client-set event fields e.g. AADCE520E20C2899F4CED228A79A3083 |
true_tstamp | User-set “true timestamp” for the event e.g. ‘2013-11-26 00:02:04’ |
page_view_id |
Code
- bigquery
- databricks
- snowflake
{{
config(
tags=["this_run"]
)
}}
{%- set lower_limit, upper_limit, session_start_limit = snowplow_utils.return_base_new_event_limits(ref('snowplow_normalize_base_new_event_limits')) %}
-- without downstream joins, it's safe to dedupe by picking the first event_id found.
select
array_agg(e order by e.collector_tstamp limit 1)[offset(0)].*
from (
select
a.*
from {{ var('snowplow__events') }} as a
where
{# dvce_sent_tstamp is an optional field and not all trackers/webhooks populate it, this means this filter needs to be optional #}
{% if var("snowplow__days_late_allowed") == -1 %}
1 = 1
{% else %}
a.dvce_sent_tstamp <= {{ snowplow_utils.timestamp_add('day', var("snowplow__days_late_allowed", 3), 'a.dvce_created_tstamp') }}
{% endif %}
and a.collector_tstamp >= {{ lower_limit }}
and a.collector_tstamp <= {{ upper_limit }}
{% if var('snowplow__derived_tstamp_partitioned', true) and target.type == 'bigquery' | as_bool() %}
and a.derived_tstamp >= {{ snowplow_utils.timestamp_add('hour', -1, lower_limit) }}
and a.derived_tstamp <= {{ upper_limit }}
{% endif %}
and {{ snowplow_utils.app_id_filter(var("snowplow__app_id",[])) }}
) e
group by e.event_id
{{
config(
tags=["this_run"]
)
}}
{%- set lower_limit, upper_limit, session_start_limit = snowplow_utils.return_base_new_event_limits(ref('snowplow_normalize_base_new_event_limits')) %}
select
a.*
from {{ var('snowplow__events') }} as a
where
{# dvce_sent_tstamp is an optional field and not all trackers/webhooks populate it, this means this filter needs to be optional #}
{% if var("snowplow__days_late_allowed") == -1 %}
1 = 1
{% else %}
a.dvce_sent_tstamp <= {{ snowplow_utils.timestamp_add('day', var("snowplow__days_late_allowed", 3), 'a.dvce_created_tstamp') }}
{% endif %}
and a.collector_tstamp >= {{ lower_limit }}
and a.collector_tstamp <= {{ upper_limit }}
and {{ snowplow_utils.app_id_filter(var("snowplow__app_id",[])) }}
qualify row_number() over (partition by a.event_id order by a.collector_tstamp, a.etl_tstamp) = 1
{{
config(
tags=["this_run"],
sql_header=snowplow_utils.set_query_tag(var('snowplow__query_tag', 'snowplow_dbt'))
)
}}
{%- set lower_limit, upper_limit, session_start_limit = snowplow_utils.return_base_new_event_limits(ref('snowplow_normalize_base_new_event_limits')) %}
select
a.*
from {{ var('snowplow__events') }} as a
where
{# dvce_sent_tstamp is an optional field and not all trackers/webhooks populate it, this means this filter needs to be optional #}
{% if var("snowplow__days_late_allowed") == -1 %}
1 = 1
{% else %}
a.dvce_sent_tstamp <= {{ snowplow_utils.timestamp_add('day', var("snowplow__days_late_allowed", 3), 'a.dvce_created_tstamp') }}
{% endif %}
and a.collector_tstamp >= {{ lower_limit }}
and a.collector_tstamp <= {{ upper_limit }}
and {{ snowplow_utils.app_id_filter(var("snowplow__app_id",[])) }}
qualify row_number() over (partition by a.event_id order by a.collector_tstamp) = 1
Depends On
Snowplow Normalize Base New Event Limits
models/base/scratch/snowplow_normalize_base_new_event_limits.sql
Description
This table contains the lower and upper timestamp limits for the given run of the normalize model. These limits are used to select new events from the events table.
Type: Table
Details
Columns
Column Name | Description | Type |
---|---|---|
lower_limit | The lower collector_tstamp limit for the run | timestamp |
upper_limit | The upper collector_tstamp limit for the run | timestamp |
Code
- default
{{ config(
post_hook=["{{snowplow_utils.print_run_limits(this)}}"],
sql_header=snowplow_utils.set_query_tag(var('snowplow__query_tag', 'snowplow_dbt'))
)
}}
{%- set models_in_run = snowplow_utils.get_enabled_snowplow_models('snowplow_normalize') -%}
{% set min_last_success,
max_last_success,
models_matched_from_manifest,
has_matched_all_models = snowplow_utils.get_incremental_manifest_status(ref('snowplow_normalize_incremental_manifest'), models_in_run) -%}
{% set run_limits_query = snowplow_utils.get_run_limits(min_last_success,
max_last_success,
models_matched_from_manifest,
has_matched_all_models,
var("snowplow__start_date","2020-01-01")) -%}
{{ run_limits_query }}
Depends On
- Models
- Macros
Referenced By
Snowplow Normalize Incremental Manifest
models/base/manifest/snowplow_normalize_incremental_manifest.sql
Description
This incremental table is a manifest of the timestamp of the latest event consumed per model within the snowplow-normalize
package as well as any models leveraging the incremental framework provided by the package. The latest event's timestamp is based off collector_tstamp
. This table is used to determine what events should be processed in the next run of the model.
Type: Table
Details
Columns
Column Name | Description | Type |
---|---|---|
model | The name of the model. | string |
last_success | The latest event consumed by the model, based on collector_tstamp | timestamp |
Code
- default
{{
config(
materialized='incremental',
full_refresh=snowplow_normalize.allow_refresh()
)
}}
-- Boilerplate to generate table.
-- Table updated as part of end-run hook
with prep as (
select
cast(null as {{ snowplow_utils.type_max_string() }}) model,
cast('1970-01-01' as {{ type_timestamp() }}) as last_success
)
select *
from prep
where false
Depends On
- Macros
- macro.dbt.type_timestamp
- macro.snowplow_normalize.allow_refresh
- macro.snowplow_utils.type_max_string