How schema definitions translate to the warehouse
Self-describing events and entities use schemas to define which fields should be present, and of what type (e.g. string, number). This page explains what happens to this information in the warehouse.
Location
Where can you find the data carried by a self-describing event or an entity?
Redshift and Postgres

Each type of self-describing event and each type of entity gets its own dedicated table. The name of the table is composed of the schema vendor, the schema name and its major version (more on versioning later).

All characters are converted to lowercase, and all symbols (such as `.`) are replaced with an underscore.
Examples:
| Kind | Schema | Resulting table |
|---|---|---|
| Self-describing event | `com.example/button_press/jsonschema/1-0-0` | `com_example_button_press_1` |
| Entity | `com.example/user/jsonschema/1-0-0` | `com_example_user_1` |
Inside the table, there will be columns corresponding to the fields in the schema. Their types are determined according to the logic described below.
The name of each column is the name of the schema field converted to snake case.
If an event or entity includes fields not defined in the schema, those fields will not be stored in the warehouse.
For example, suppose you have the following field in the schema:

```json
"lastName": {
  "type": "string",
  "maxLength": 100
}
```

It will be translated into a column called `last_name` (note the underscore), of type `VARCHAR(100)`.
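As an illustration (not part of the original loader behavior described above), here is how you might join such a table back to the events table. This is a minimal sketch assuming the standard Snowplow layout, where tables live in the `atomic` schema and each dedicated table carries a `root_id` column referencing `events.event_id`:

```sql
-- A minimal sketch, assuming the atomic schema and the root_id link column.
SELECT
    ev.event_id,
    u.last_name            -- field from the example schema above
FROM atomic.events AS ev
JOIN atomic.com_example_user_1 AS u
    ON u.root_id = ev.event_id
LIMIT 10;
```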
Databricks

Each type of self-describing event and each type of entity gets its own dedicated column in the `events` table. The name of the column is composed of the schema vendor, the schema name and the major schema version (more on versioning later).

The column name is prefixed by `unstruct_event_` for self-describing events, and by `contexts_` for entities. (In case you were wondering, those are the legacy terms for self-describing events and entities, respectively.)

All characters are converted to lowercase, and all symbols (such as `.`) are replaced with an underscore.
Examples:
| Kind | Schema | Resulting column |
|---|---|---|
| Self-describing event | `com.example/button_press/jsonschema/1-0-0` | `events.unstruct_event_com_example_button_press_1` |
| Entity | `com.example/user/jsonschema/1-0-0` | `events.contexts_com_example_user_1` |
For self-describing events, the column will be of a `STRUCT` type, while for entities the type will be `ARRAY` of `STRUCT` (because an event can have more than one entity attached).

Inside the `STRUCT`, there will be fields corresponding to the fields in the schema. Their types are determined according to the logic described below. The name of each record field is the name of the schema field converted to snake case.
If an event or entity includes fields not defined in the schema, those fields will not be stored in the warehouse.
For example, suppose you have the following field in the schema:

```json
"lastName": {
  "type": "string",
  "maxLength": 100
}
```

It will be translated into a field called `last_name` (note the underscore), of type `STRING`.
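As an illustration (names taken from the examples above), here is how you might read these columns with Databricks SQL; `explode` is a standard Spark SQL function that turns the entity `ARRAY` into one row per element:

```sql
-- A minimal sketch; assumes the events table is in scope as shown above.
-- Dot notation reads a field out of the STRUCT column:
SELECT unstruct_event_com_example_button_press_1.last_name
FROM events;

-- Entities are an ARRAY of STRUCT, so explode them into one row per entity:
SELECT event_id, user_entity.last_name
FROM events
LATERAL VIEW explode(contexts_com_example_user_1) ctx AS user_entity;
```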
BigQuery

Each type of self-describing event and each type of entity gets its own dedicated column in the `events` table. The name of the column is composed of the schema vendor, the schema name and the full schema version (more on versioning later).

The column name is prefixed by `unstruct_event_` for self-describing events, and by `contexts_` for entities. (In case you were wondering, those are the legacy terms for self-describing events and entities, respectively.)

All characters are converted to lowercase, and all symbols (such as `.`) are replaced with an underscore.
Examples:
| Kind | Schema | Resulting column |
|---|---|---|
| Self-describing event | `com.example/button_press/jsonschema/1-0-0` | `events.unstruct_event_com_example_button_press_1_0_0` |
| Entity | `com.example/user/jsonschema/1-0-0` | `events.contexts_com_example_user_1_0_0` |
For self-describing events, the column will be of a `RECORD` type, while for entities the type will be `REPEATED RECORD` (because an event can have more than one entity attached).

Inside the record, there will be fields corresponding to the fields in the schema. Their types are determined according to the logic described below. The name of each record field is the name of the schema field converted to snake case.
If an event or entity includes fields not defined in the schema, those fields will not be stored in the warehouse.
For example, suppose you have the following field in the schema:

```json
"lastName": {
  "type": "string",
  "maxLength": 100
}
```

It will be translated into a field called `last_name` (note the underscore), of type `STRING`.
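The equivalent sketch for BigQuery (the `my_dataset` dataset name is a placeholder; column names come from the examples above):

```sql
-- A minimal sketch; replace my_dataset with your actual dataset.
-- Dot notation reads a field out of the RECORD column:
SELECT unstruct_event_com_example_button_press_1_0_0.last_name
FROM `my_dataset.events`;

-- Entities are REPEATED RECORD, so UNNEST them into one row per entity:
SELECT e.event_id, u.last_name
FROM `my_dataset.events` AS e,
     UNNEST(e.contexts_com_example_user_1_0_0) AS u;
```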
Snowflake

Each type of self-describing event and each type of entity gets its own dedicated column in the `events` table. The name of the column is composed of the schema vendor, the schema name and the major schema version (more on versioning later).

The column name is prefixed by `unstruct_event_` for self-describing events, and by `contexts_` for entities. (In case you were wondering, those are the legacy terms for self-describing events and entities, respectively.)

All characters are converted to lowercase, and all symbols (such as `.`) are replaced with an underscore.
Examples:
| Kind | Schema | Resulting column |
|---|---|---|
| Self-describing event | `com.example/button_press/jsonschema/1-0-0` | `events.unstruct_event_com_example_button_press_1` |
| Entity | `com.example/user/jsonschema/1-0-0` | `events.contexts_com_example_user_1` |
For self-describing events, the column will be of an `OBJECT` type, while for entities the type will be an `ARRAY` of objects (because an event can have more than one entity attached).

Inside the object, there will be keys corresponding to the fields in the schema. The values for the keys will be of type `VARIANT`.

If an event or entity includes fields not defined in the schema, those fields will be included in the object. However, remember that you need to set `additionalProperties` to `true` in the respective schema for such events and entities to pass schema validation.
For example, suppose you have the following field in the schema:

```json
"lastName": {
  "type": "string",
  "maxLength": 100
}
```

It will be translated into an object with a `lastName` key that points to a value of type `VARIANT`.
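A sketch of the Snowflake equivalent. Note that, unlike the other warehouses, the keys inside the object keep the original field names (`lastName`, not `last_name`), and values need casting out of `VARIANT`:

```sql
-- A minimal sketch; assumes the events table is in scope as shown above.
SELECT unstruct_event_com_example_button_press_1:lastName::VARCHAR AS last_name
FROM events;

-- Entities are an ARRAY, so FLATTEN them into one row per entity:
SELECT e.event_id, u.value:lastName::VARCHAR AS last_name
FROM events AS e,
     LATERAL FLATTEN(input => e.contexts_com_example_user_1) AS u;
```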
Versioning
What happens when you evolve your schema to a new version?
Redshift and Postgres

Because the table name for the self-describing event or entity includes the major schema version, each major version of a schema gets a new table:
| Schema | Resulting table |
|---|---|
| `com.example/button_press/jsonschema/1-0-0` | `com_example_button_press_1` |
| `com.example/button_press/jsonschema/1-2-0` | `com_example_button_press_1` |
| `com.example/button_press/jsonschema/2-0-0` | `com_example_button_press_2` |
When you evolve your schema within the same major version, (non-destructive) changes are applied to the existing table automatically. For example, if you change the `maxLength` of a `string` field, the limit of the `VARCHAR` column will be updated accordingly.

If you make a breaking schema change (e.g. change the type of a field from a `string` to a `number`) without creating a new major schema version, the loader will not be able to adapt the table to receive new data, and your loading process will halt.
Once the loader creates a column for a given schema version as `NULLABLE` or `NOT NULL`, it will never alter the nullability constraint for that column. For example, if a field is nullable in schema version `1-0-0` and not nullable in version `1-0-1`, the column will remain nullable. (In this example, the Enrich application will still validate data according to the schema, accepting `null` values for `1-0-0` and rejecting them for `1-0-1`.)
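If you need to analyze data across major versions, you have to combine the per-version tables yourself. A minimal sketch, assuming both tables live in the `atomic` schema and share the fields you select (`root_id` and `root_tstamp` are the standard link columns back to `atomic.events`):

```sql
-- A minimal sketch; list whichever schema fields the two versions share.
SELECT root_id, root_tstamp FROM atomic.com_example_button_press_1
UNION ALL
SELECT root_id, root_tstamp FROM atomic.com_example_button_press_2;
```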
Databricks

Because the column name for the self-describing event or entity includes the major schema version, each major version of a schema gets a new column:
| Schema | Resulting column |
|---|---|
| `com.example/button_press/jsonschema/1-0-0` | `unstruct_event_com_example_button_press_1` |
| `com.example/button_press/jsonschema/1-2-0` | `unstruct_event_com_example_button_press_1` |
| `com.example/button_press/jsonschema/2-0-0` | `unstruct_event_com_example_button_press_2` |
When you evolve your schema within the same major version, (non-destructive) changes are applied to the existing column automatically. For example, if you add a new optional field in the schema, a new optional field will be added to the `STRUCT`.

If you make a breaking schema change (e.g. change the type of a field from a `string` to a `number`) without creating a new major schema version, the loader will not be able to modify the column to accommodate the new data.

In this case, upon receiving the first event with the offending schema, the loader will instead create a new column, with a name like `unstruct_event_com_example_button_press_1_0_1_recovered_9999999`, where:
- `1-0-1` is the version of the offending schema
- `9999999` is a hash code unique to the schema (i.e. it will change if the schema is overwritten with a different one)
To resolve this situation:
- Create a new schema version (e.g. `1-0-2`) that reverts the offending changes and is again compatible with the original column. The data for events with that schema will start going to the original column as expected.
- You might also want to manually adapt the data in the `..._recovered_...` column and copy it to the original one, as sketched below.
Note that this behavior was introduced in RDB Loader 5.3.0.
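A sketch of that second option, reading a field from whichever column is populated for a given event (`some_field` is a hypothetical schema field; the recovered column name will differ in your warehouse):

```sql
-- A minimal sketch; some_field and the recovered column name are examples.
SELECT
    event_id,
    coalesce(
        unstruct_event_com_example_button_press_1.some_field,
        unstruct_event_com_example_button_press_1_0_1_recovered_9999999.some_field
    ) AS some_field
FROM events;
```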
BigQuery

Because the column name for the self-describing event or entity includes the full schema version, each version of a schema gets a new column:
| Schema | Resulting column |
|---|---|
| `com.example/button_press/jsonschema/1-0-0` | `unstruct_event_com_example_button_press_1_0_0` |
| `com.example/button_press/jsonschema/1-2-0` | `unstruct_event_com_example_button_press_1_2_0` |
| `com.example/button_press/jsonschema/2-0-0` | `unstruct_event_com_example_button_press_2_0_0` |
If you are modeling your data with dbt, you can use this macro to aggregate the data across multiple columns.
While our recommendation is to use major schema versions to indicate breaking changes (e.g. changing the type of a field from a `string` to a `number`), this is not particularly relevant for BigQuery: each schema version gets its own column, so there is no difference between major and minor versions. That said, we believe sticking to our recommendation is a good idea:
- Breaking changes might affect downstream consumers of the data, even if they don't affect BigQuery
- In the future, you might decide to migrate to a different data warehouse where our rules are stricter (e.g. Databricks)
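As a raw-SQL illustration of what such cross-version aggregation looks like (not the macro itself; `some_field` is a hypothetical schema field and `my_dataset` a placeholder):

```sql
-- A minimal sketch: at most one version column is populated per event.
SELECT
    event_id,
    coalesce(
        unstruct_event_com_example_button_press_1_2_0.some_field,
        unstruct_event_com_example_button_press_1_0_0.some_field
    ) AS some_field
FROM `my_dataset.events`;
```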
Snowflake

Because the column name for the self-describing event or entity includes the major schema version, each major version of a schema gets a new column:
| Schema | Resulting column |
|---|---|
| `com.example/button_press/jsonschema/1-0-0` | `unstruct_event_com_example_button_press_1` |
| `com.example/button_press/jsonschema/1-2-0` | `unstruct_event_com_example_button_press_1` |
| `com.example/button_press/jsonschema/2-0-0` | `unstruct_event_com_example_button_press_2` |
While our recommendation is to use major schema versions to indicate breaking changes (e.g. changing the type of a field from a `string` to a `number`), this is not particularly relevant for Snowflake: the event or entity data is stored in the column as is, in `VARIANT` form, so Snowflake is not “aware” of the schema. That said, we believe sticking to our recommendation is a good idea:
- Breaking changes might affect downstream consumers of the data, even if they don't affect Snowflake
- In the future, you might decide to migrate to a different data warehouse where our rules are stricter (e.g. Databricks)

Also, creating a new major version of the schema (and hence a new column) is the only way to indicate a change in semantics, where the data is in the same format but has a different meaning (e.g. amounts in dollars vs euros).
Types
How do schema types translate to the database types?
Nullability
Redshift and Postgres

All non-required schema fields translate to nullable columns. Required fields translate to `NOT NULL` columns:
```json
{
  "properties": {
    "myRequiredField": {"type": ...}
  },
  "required": [ "myRequiredField" ]
}
```
However, it is possible to define a required field where `null` values are allowed (the Enrich application will still validate that the field is present, even if it's `null`):

```json
"myRequiredField": {
  "type": ["null", ...]
}
```

OR

```json
"myRequiredField": {
  "enum": ["null", ...]
}
```

In this case, the column will be nullable. It does not matter if `"null"` is in the beginning, middle or end of the list of types or enum values.
See also how versioning affects this.
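To make this concrete, a sketch of the resulting column definitions (illustrative DDL only, with an assumed `VARCHAR(255)` width; the loader derives the real types as described under Types below):

```sql
-- Illustrative only; not the loader's exact output.
CREATE TABLE atomic.com_example_user_1 (
    my_required_field VARCHAR(255) NOT NULL, -- required, "null" not allowed
    my_nullable_field VARCHAR(255)           -- non-required, or required with "null" allowed
);
```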
Databricks

All schema fields, including the required ones, translate to nullable fields inside the `STRUCT`.
BigQuery

All non-required schema fields translate to nullable `RECORD` fields. Required schema fields translate to required `RECORD` fields:
```json
{
  "properties": {
    "myRequiredField": {"type": ...}
  },
  "required": [ "myRequiredField" ]
}
```
However, it is possible to define a required field where `null` values are allowed (the Enrich application will still validate that the field is present, even if it's `null`):

```json
"myRequiredField": {
  "type": ["null", ...]
}
```

OR

```json
"myRequiredField": {
  "enum": ["null", ...]
}
```

In this case, the `RECORD` field will be nullable. It does not matter if `"null"` is in the beginning, middle or end of the list of types or enum values.
Snowflake

All fields are nullable (because they are stored inside the `VARIANT` type).
Types themselves
Redshift and Postgres

The lookup table maps JSON Schema types to Redshift/Postgres types. The row order in this table is important: type lookup stops after the first match is found, scanning from top to bottom.

[The Json Schema → Redshift/Postgres type mapping table did not survive conversion; only the following notes are recoverable.]

- If content is longer than 4096 characters, it is truncated when inserted into Redshift.
- Object content is stringified and quoted; if it is longer than 65535 characters, it is truncated when inserted into Redshift.
- If nothing matches above, a catch-all applies: values are quoted as in JSON, and content longer than 4096 characters is truncated when inserted into Redshift.
Databricks

The lookup table maps JSON Schema types to Databricks types. The row order in this table is important: type lookup stops after the first match is found, scanning from top to bottom.

[The Json Schema → Databricks type mapping table did not survive conversion; only the following notes are recoverable.]

- Object values are quoted as in JSON.
- If nothing matches above, a catch-all applies: values are quoted as in JSON.
BigQuery

The lookup table maps JSON Schema types to BigQuery types. The row order in this table is important: type lookup stops after the first match is found, scanning from top to bottom.

[The Json Schema → BigQuery type mapping table did not survive conversion; only the following notes are recoverable.]

- Objects can be nullable; nested fields can also be nullable (same rules as for everything else).
- Arrays can be nullable; nested fields can also be nullable (same rules as for everything else).
- Object values are quoted as in JSON.
- If nothing matches above, a catch-all applies: values are quoted as in JSON.
Snowflake

All types are `VARIANT`.