S3 loader configuration reference
This is a complete list of the options that can be configured in the S3 loader HOCON config file. The example configs in github show how to prepare an input file.
parameter | description |
---|---|
purpose | Required. Use RAW to sink data exactly as-is. Use ENRICHED_EVENTS to also enable event latency metrics. Use SELF_DESCRIBING to enable partitioning self-describing data by its schema |
input.appName | Required. Kinesis Client Lib app name (corresponds to DynamoDB table name) |
input.streamName | Required. Name of the kinesis stream from which to read |
input.position | Required. Use TRIM_HORIZON to start streaming at the last untrimmed record in the shard, which is the oldest data record in the shard. Or use LATEST to start streaming just after the most recent record in the shard |
input.customEndpoint | Optional. Override the default endpoint for kinesis client api calls |
input.maxRecords | Required. How many records the client should pull from kinesis each time |
output.s3.path | Required. Full path to output data, e.g. s3://acme-snowplow-output/raw/ |
output.s3.partitionFormat | Optional. Added in version 2.1.0. Configures how files are partitioned into S3 directories.When loading raw files, you might choose to partition by date={yy}-{mm}-{dd} . When loading self describing jsons, you might choose to partition by {vendor}.{name}/model={model}/date={yy}-{mm}-{dd} . Valid substitutions are {vendor} , {name} , {format} , {model} for self-describing jsons; and {yy} , {mm} , {dd} , {hh} for year, month, day and hour. Defaults to {vendor}.{schema} when loading self-describing JSONs, or blank (no partitioning) when loading raw or enriched events |
output.s3.filenamePrefix | Optional. Adds a prefix to output |
output.s3.compression | Required. Either LZO or GZIP |
output.s3.maxTimeout | Required. Maximum Timeout that the application is allowed to fail for, e.g. in case of S3 outage |
output.s3.customEndpoint | Optional. Override the default endpoint for s3 client api calls |
region | Optional. When used with the output.s3.customEndpoint option, this sets the region of the bucket. Also sets the region of the dynamoDB table. Defaults to the current region |
output.bad.streamName | Required. Name of a kinesis stream to output failures |
buffer.byteLimit | Required. Maximum bytes to read from kinesis before flushing a file to S3 |
buffer.recordLimit | Required. Maximum records to read from kinesis before flushing a file to S3 |
buffer.timeLimit | Required. Maximum time to wait in milliseconds between writing files to S3 |
monitoring.snowplow.collector | Optional. E.g. https://snplow.acme.com. URI of a snowplow collector. Used for monitoring application lifecycle and failure events |
monitoring.snowplow.appId | Required only if the collector uri is also configured. Sets the appId field of the snowplow events |
monitoring.sentry.dsn | Optional, for tracking uncaught run time exceptions |
monitoring.metrics.cloudwatch | Optional boolean, with default true. This is used to disable sending metrics to cloudwatch |
monitoring.metrics.hostname | Optional, for sending loading metrics (latency and event counts) to a statsd server |
monitoring.metrics.port | Optional, port of the statsd server |
monitoring.metrics.tags | E.g.{ "key1": "value1", "key2": "value2" } . Tags are used to annotate the statsd metric with any contextual information |
monitoring.metrics.prefix | Optional, default snoplow.s3loader . Configures the prefix of statsd metric names |