Metadata

The metadata format, consumed in the TypingTransform stage, is an opinionated format for specifying common data typing actions.

It is designed to:

  • Support common data typing conversions found in business datasets.
  • Support limited ‘schema evolution’ of source data in the form of allowed lists of accepted input formats.
  • Collect errors into array columns so that a user can decide how to handle errors once all have been collected.

Common

Attributes

Attribute Type Required Description
id String true A unique identifier for this field. Ideally this is a GUID and should remain constant even when changing field attributes over time.
name String true The field name.
description String false A description of the field which will be embedded in the dataset metadata (and persisted in formats like Parquet/ORC).
type String true The data type to convert the field to. Supported values: boolean, date, decimal, double, integer, long, string, timestamp.
trim Boolean true Trim the field prior to data typing attempts.
nullable Boolean true Whether the field is allowed to be nullable. Will throw fatal exception if this constraint is not met.
nullableValues Array[String] false Values which when found will be converted to null. This is generally used for converting text serialisation formats back to correct null data types. This conversion is executed prior to nullable check.
nullReplacementValue String false An optional value that a null value input value is replaced with before typing.
metadata Object false Metadata to attach to the column after TypingTransform.

These values are limited to the basic JSON types string, double, long and array.

With array types the values must be all of the same type (i.e. [true, false] works but [true, 0] will not) and can only be values of basic types (string, double, long).

Examples

{
  "id" : "9712c383-22d1-44a6-9ca2-0087af4857f1",
  "name" : "first_name",
  "description" : "Customer First Name",
  "type" : "string",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

Binary

Additional Attributes

Attribute Type Required Description
encoding String true The binary-to-text encoding format of the value. Valid values base64, hexadecimal.

Examples

{
  "id" : "982cbf60-7ba7-4e50-a09b-d8624a5c49e6",
  "name" : "id",
  "description" : "GUID identifier",
  "type" : "binary",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "encoding" : "base64",
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }  
}

Boolean

Additional Attributes

Attribute Type Required Description
trueValues Array[String] true A list of values which are considered as true. Try to order this list so the values are arranged from most frequent to least frequent.
falseValues Array[String] true A list of values which are considered as false. Try to order this list so the values are arranged from most frequent to least frequent.

Examples

{
  "id" : "982cbf60-7ba7-4e50-a09b-d8624a5c49e6",
  "name" : "marketing_opt_in_flag",
  "description" : "Whether the customer has opted in to receive marketing communications.",
  "type" : "boolean",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "trueValues" : [ "true", "TRUE", "t", "1" ],
  "falseValues" : [ "false", "FALSE", "f", "0" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }  
}

Date

Date vs Timestamp

This class does not store or represent a time or time-zone. Instead, it is a description of the date, as used for birthdays. It cannot represent an instant on the time-line without additional information such as an offset or time-zone.

This means that if users will be executing SQL statements which have conditional logic based on date comparisons (such as WHERE [date] < CURRENT_DATE()) then it is safer to use a Timestamp with a hard-coded time component for that source data so you get consistent results regardless of which time zone your users are located.

Additional Attributes

Attribute Type Required Description
formatters Array[String] true The formatters to try to convert this field based on the Java DateTimeFormatter patterns. Try to order this list so the values are arranged from most frequent to least frequent.

Examples

{
  "id" : "0e8109ba-1000-4b7d-8a4c-b01bae07027f",
  "name" : "birth_date",
  "description" : "Customer Birth Date",
  "type" : "date",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "formatters" : [ "uuuuMMdd", "uuuu-MM-dd" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

Decimal

Additional Attributes

Attribute Type Required Description
precision Integer true The total number of digits. e.g. 1234.567 has a precision of 7.
scale Integer true The number of digits in the fraction part. e.g. 1234.567 has a scale of 3.
formatters Array[String] false The formatters to try to convert this field based on the Java DecimalFormat patterns. Try to order this list so the values are arranged from most frequent to least frequent.

Formatters be used to specify uncommon number formats such as #,##0.###;#,##0.###- for a decimal with trailing minus sign or #,##0.###;(#,##0.###) for a decimal where the negative is displayed in accounting format with brakets instead of minus sign.

Default: #,##0.###;-#,##0.###

Examples

{
  "id" : "9712c383-22d1-44a6-9ca2-0087af4857f1",
  "name" : "account_balance",
  "description" : "The current account balance",
  "type" : "decimal",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "precision": 10,
  "scale": 2,
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }  
}

Double

A Double is a double-precision 64-bit IEEE 754 floating point number.

Double vs Decimal

A Decimal should be used whenever precision is required or for numbers which must sum up correctly or balance, e.g. monetary transactions.

Additional Attributes

Attribute Type Required Description
formatters Array[String] false The formatters to try to convert this field based on the Java DecimalFormat patterns. Try to order this list so the values are arranged from most frequent to least frequent.

Formatters be used to specify uncommon number formats such as #,##0.###;#,##0.###- for a decimal with trailing minus sign or #,##0.###;(#,##0.###) for a decimal where the negative is displayed in accounting format with brakets instead of minus sign.

Default: #,##0.###;-#,##0.###

Examples

{
  "id" : "31541ea3-5b74-4753-857c-770bd601c35b",
  "name" : "last_meter_reading",
  "description" : "The last reading from the customer power meter.",
  "type" : "double",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

Integer

Use Integer when dealing with values up to ±2 billion (-231 to +231-1)

Additional Attributes

Attribute Type Required Description
formatters Array[String] false The formatters to try to convert this field based on the Java DecimalFormat patterns. Try to order this list so the values are arranged from most frequent to least frequent.

Formatters be used to specify uncommon number formats such as #,##0.###;#,##0.###- for a decimal with trailing minus sign or #,##0.###;(#,##0.###) for a decimal where the negative is displayed in accounting format with brakets instead of minus sign.

Default: #,##0;-#,##0

Examples

{
  "id" : "a66f3bbe-d1c6-44c7-b096-a4be59fdcd78",
  "name" : "update_count",
  "description" : "Number of updates to this customer record.",
  "type" : "integer",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

Long

Use a Long Integer when dealing with values greater than ±2 billion (-263 to +263-1)

Additional Attributes

Attribute Type Required Description
formatters Array[String] false The formatters to try to convert this field based on the Java DecimalFormat patterns. Try to order this list so the values are arranged from most frequent to least frequent.

Formatters be used to specify uncommon number formats such as #,##0.###;#,##0.###- for a decimal with trailing minus sign or #,##0.###;(#,##0.###) for a decimal where the negative is displayed in accounting format with brakets instead of minus sign.

Default: #,##0;-#,##0

Examples

{
  "id" : "1c0eec1d-17cd-45da-8744-7a9ef5b8b086",
  "name" : "transaction_num",
  "description" : "Global transaction sequence number.",
  "type" : "long",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

String

Additional Attributes

Attribute Type Required Description
minLength Integer false The minimum length of the string value.
maxLength Integer false The maximum length of the string value.

Examples

{
  "id" : "9712c383-22d1-44a6-9ca2-0087af4857f1",
  "name" : "first_name",
  "description" : "Customer First Name",
  "type" : "string",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "primaryKey" : false,
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

Time

TimeType

Spark does not have an internal TimeType representation of time. This type can be used to ensure time values are able to be successfully parsed as LocalTime objects but they are always stored in Spark as string formatted in the standard HH:mm:ss format type meaning they can safely used in the to_utc_timestamp SQL function (but be very careful with timezone offsets). If they cannot be parsed then an error will be inserted into the _errors array like all other types.

Additional Attributes

Attribute Type Required Description
formatters Array[String] true The formatters to try to convert this field based on the Java DateTimeFormatter patterns. Try to order this list so the values are arranged from most frequent to least frequent.

Examples

{
  "id" : "0f5162ce-64ca-409d-abd1-f0b5bb5830de",
  "name" : "transaction_time",
  "description" : "Time of the database transaction",
  "type" : "time",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "formatters" : [ "HHmmss" ],
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }  
}

Timestamp

Additional Attributes

Attribute Type Required Description
formatters Array[String] true The formatters to try to convert this field based on the Java DateTimeFormatter patterns. Try to order this list so the values are arranged from most frequent to least frequent.

Custom formats ssssssssss and sssssssssssss have been added to support epoch time (i.e. 1527727035) and epoch millis time (i.e. 1527727035456) respectively. Both require timezoneId of UTC.
timezoneId String true The timezone of the incoming timestamp. This uses the SimpleDateFormat supported timezones. All timestamps are internally stored in UTC to allow correctly sequenced events when dealing with events from multiple systems which may all run with different internal timezones.
time Map[String, Integer] false Use this capability if converting a Date label into a Timestamp for relative comparisons. Required fields are hour, minute, second and nano . These values can be agreed with source data suppliers to ensure intra-system data alignment. See below for example.

Examples

{
  "id" : "8e42c8f0-22a8-40db-9798-6dd533c1de36",
  "name" : "create_date",
  "description" : "Customer Creation Date",
  "type" : "timestamp",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "formatters": [
    "dd/MM/uuuu HH:mm:ss",
    "dd/MM/uuuu H:mm:ss",
    "dd/MM/uuuu HH:mm",
    "dd/MM/uuuu H:mm",
    "d/MM/uuuu HH:mm:ss",
    "d/MM/uuuu H:mm:ss",
    "d/MM/uuuu HH:mm",
    "d/MM/uuuu H:mm"
  ],
  "timezoneId": "+1000",    
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }
}

For converting a Date label into a Timestamp supply the time key:

{
  "id" : "8e42c8f0-22a8-40db-9798-6dd533c1de36",
  "name" : "create_date",
  "description" : "Customer Creation Date",
  "type" : "timestamp",
  "trim" : true,
  "nullable" : true,
  "nullableValues" : [ "", "null" ],
  "formatters": [
    "dd/MM/uuuu",
  ],
  "timezoneId": "Australia/Sydney",    
  "time": {
    "hour": 23,
    "minute": 59,
    "second": 59,
    "nano": 0,
  },
  "metadata": {
    "primaryKey" : true,
    "position": 1
  }  
}