Load

*Load stages write out Spark datasets to a database or file system.

*Load stages should meet these criteria:

  • Take in a single dataset.
  • Perform target-specific validation that the dataset has been written correctly.

AvroLoad

Since: 1.0.0 - Supports Streaming: False

The AvroLoad writes an input DataFrame to a target Apache Avro file.

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
outputURI URI true URI of the Avro file to write to.
authentication Map[String, String] false An authentication map for authenticating with a remote service. See authentication documentation.
description String false An optional stage description to help document job files and print to job logs to assist debugging.
numPartitions Integer false The number of partitions that will be used for controlling parallelism.
partitionBy Array[String] false Columns to partition the data by.
saveMode String false The mode for writing the output file to describe how errors are handled. Available options are: Append, ErrorIfExists, Ignore, Overwrite. Default is Overwrite if not specified.

Examples

Minimal

{
  "type": "AvroLoad",
  "name": "write customer avro extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "hdfs://output_data/customer/customer.avro"
}

Complete

{
  "type": "AvroLoad",
  "name": "write customer avro extract",
  "description": "write customer avro extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "hdfs://output_data/customer/customer.avro",
  "authentication": {},
  "numPartitions": 10,
  "partitionBy": [
    "country"
  ],
  "saveMode": "Overwrite"
}

AzureEventHubsLoad

Since: 1.0.0 - Supports Streaming: False

The AzureEventHubsLoad writes an input DataFrame to a target Azure Event Hubs stream. The input to this stage needs to be a single-column dataset of signature value: string and is intended to be used after a JSONTransform stage, which prepares the data for sending to the external server.

In the future, additional Transform stages (like ProtoBufTransform) could be added to prepare binary payloads instead of just JSON strings.
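
For example, a JSONTransform stage could be placed immediately before this stage to convert each row of the customer view into a single JSON string column. A minimal sketch, assuming the JSONTransform stage described in the Transform documentation; the customer_json view name is illustrative only:

{
  "type": "JSONTransform",
  "name": "convert customer rows to json strings",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputView": "customer_json"
}

The AzureEventHubsLoad stage would then set inputView to customer_json.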

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
namespaceName String true The Event Hubs namespace.
eventHubName String true The Event Hubs entity.
sharedAccessSignatureKeyName String true The Event Hubs Shared Access Signature Key Name.
sharedAccessSignatureKey String true The Event Hubs Shared Access Signature Key.
description String false An optional stage description to help document job files and print to job logs to assist debugging.
numPartitions Integer false The number of partitions that will be used for controlling parallelism. Azure Event Hubs will throw a ServerBusyException if too many executors write to a target in parallel; this can be avoided by reducing the number of partitions.
retryCount Integer false The maximum number of retries for the exponential backoff algorithm.

Default: 10.
retryMaxBackoff Long false The maximum time (in seconds) for the exponential backoff algorithm to wait between retries.

Default: 30.
retryMinBackoff Long false The minimum time (in seconds) for the exponential backoff algorithm to wait between retries.

Default: 0.

Examples

Minimal

{
  "type": "AzureEventHubsLoad",
  "name": "write customer to azure event hubs",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "namespaceName": "mynamespace",
  "eventHubName": "myeventhub",
  "sharedAccessSignatureKeyName": "mysignaturename",
  "sharedAccessSignatureKey": "ctzMq410TV3wS7upTBcunJTDLEJwMAZuFPfr0mrrA08="
}

Complete

{
  "type": "AzureEventHubsLoad",
  "name": "write customer to azure event hubs",
  "description": "write customer to azure event hubs",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "namespaceName": "mynamespace",
  "eventHubName": "myeventhub",
  "sharedAccessSignatureKeyName": "mysignaturename",
  "sharedAccessSignatureKey": "ctzMq410TV3wS7upTBcunJTDLEJwMAZuFPfr0mrrA08=",
  "numPartitions": 4,
  "retryCount": 30,
  "retryMaxBackoff": 60,
  "retryMinBackoff": 5
}

ConsoleLoad

Since: 1.2.0 - Supports Streaming: True

The ConsoleLoad prints an input streaming DataFrame to the console.

This stage has been included for testing Structured Streaming jobs as they can be very difficult to debug. Generally this stage would only be included when Arc is run in a test mode (i.e. the environment is set to test).
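
One way to do this is to restrict the stage to the test environment so it is skipped in production runs (the stage name below is illustrative):

{
  "type": "ConsoleLoad",
  "name": "debug the customer stream",
  "environments": [
    "test"
  ],
  "inputView": "customer"
}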

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
description String false An optional stage description to help document job files and print to job logs to assist debugging.
outputMode String false The output mode of the console writer. Allowed values Append, Complete, Update. See Output Modes for full details.

Default: Append

Examples

Minimal

{
  "type": "ConsoleLoad",
  "name": "write a streaming dataset to console",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer"
}

Complete

{
  "type": "ConsoleLoad",
  "name": "write a streaming dataset to console",
  "description": "write a streaming dataset to console",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputMode": "Append"
}

DatabricksDeltaLoad

Since: 1.8.0 - Supports Streaming: True

Experimental

The DatabricksDeltaLoad is currently in experimental state whilst the requirements become clearer.

This means this API is likely to change.

The DatabricksDeltaLoad writes an input DataFrame to a target Databricks Delta file.

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
outputURI URI true URI of the Delta file to write to.
description String false An optional stage description to help document job files and print to job logs to assist debugging.
numPartitions Integer false The number of partitions that will be used for controlling parallelism.
partitionBy Array[String] false Columns to partition the data by.
saveMode String false The mode for writing the output file to describe how errors are handled. Available options are: Append, ErrorIfExists, Ignore, Overwrite. Default is Overwrite if not specified.

Examples

Minimal

{
  "type": "DatabricksDeltaLoad",
  "name": "write customer Delta extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "/delta/customers"
}

Complete

{
  "type": "DatabricksDeltaLoad",
  "name": "write customer Delta extract",
  "description": "write customer Delta extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "/delta/customers",
  "numPartitions": 10,
  "partitionBy": [
    "country"
  ],
  "saveMode": "Overwrite"
}

DatabricksSQLDWLoad

Since: 1.8.1 - Supports Streaming: False

Experimental

The DatabricksSQLDWLoad is currently in experimental state whilst the requirements become clearer.

This means this API is likely to change.

The DatabricksSQLDWLoad writes an input DataFrame to a target Azure SQL Data Warehouse instance using a proprietary driver within a Databricks Runtime Environment.

Known limitations:

  • SQL Server date fields can only hold values in the range 1753-01-01 to 9999-12-31.

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
jdbcURL URI true The JDBC URL to connect to the target Azure SQL Data Warehouse.
dbTable String true The table to create in SQL DW.
tempDir URI true An Azure Blob Storage path to temporarily hold the data before executing the SQL DW load.
authentication Map[String, String] true An authentication map for authenticating with a remote service. See authentication documentation. Note this stage only works with the AzureSharedKey authentication method.
description String false An optional stage description to help document job files and print to job logs to assist debugging.
forwardSparkAzureStorageCredentials Boolean false If true, the library automatically discovers the credentials that Spark is using to connect to the Blob Storage container and forwards those credentials to SQL DW over JDBC.

Default: true.
tableOptions String false Used to specify table options when creating the SQL DW table.
maxStrLength Integer false The default length of String/NVARCHAR columns when creating the table in SQLDW.

Default: 256.
params Map[String, String] false Parameters for connecting to the Azure SQL Data Warehouse so that the password is not logged.

Examples

Minimal

{
  "type": "DatabricksSQLDWLoad",
  "name": "write customer extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "authentication": {
    "method": "AzureSharedKey",
    "accountName": "myaccount",
    "signature": "ctzMq410TV3wS7upTBcunJTDLEJwMAZuFPfr0mrrA08="
  },
  "jdbcURL": "jdbc:sqlserver://localhost;user=MyUserName",
  "dbTable": "customer",
  "tempDir": "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>",
  "params": {
    "password": "notlogged"
  }
}

Complete

{
  "type": "DatabricksSQLDWLoad",
  "name": "write customer extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "authentication": {
    "method": "AzureSharedKey",
    "accountName": "myaccount",
    "signature": "ctzMq410TV3wS7upTBcunJTDLEJwMAZuFPfr0mrrA08="
  },  
  "jdbcURL": "jdbc:sqlserver://localhost;user=MyUserName",
  "dbTable": "customer",
  "tempDir": "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>",
  "forwardSparkAzureStorageCredentials": true, 
  "tableOptions": "CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN", 
  "maxStrLength": 1024,
  "params": {
    "password": "notlogged"
  }
}

DelimitedLoad

Since: 1.0.0 - Supports Streaming: True

The DelimitedLoad writes an input DataFrame to a target delimited file.

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
outputURI URI true URI of the Delimited file to write to.
authentication Map[String, String] false An authentication map for authenticating with a remote service. See authentication documentation.
customDelimiter String true* A custom string to use as the delimiter. Required if delimiter is set to Custom.
delimiter String false The type of delimiter in the file. Supported values: Comma, Pipe, DefaultHive. DefaultHive is ASCII character 1, the default delimiter for Apache Hive extracts.

Default: Comma.
description String false An optional stage description to help document job files and print to job logs to assist debugging.
header Boolean false Whether to write a header row.

Default: false.
numPartitions Integer false The number of partitions that will be used for controlling parallelism.
partitionBy Array[String] false Columns to partition the data by.
quote String false The type of quoting in the file. Supported values: None, SingleQuote, DoubleQuote.

Default: DoubleQuote.
saveMode String false The mode for writing the output file to describe how errors are handled. Available options are: Append, ErrorIfExists, Ignore, Overwrite. Default is Overwrite if not specified.

Examples

Minimal

{
  "type": "DelimitedLoad",
  "name": "write customer as csv",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "hdfs://output_data/customer/customer.csv"
}

Complete

{
  "type": "DelimitedLoad",
  "name": "write customer as csv",
  "description": "write customer as csv",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "hdfs://output_data/customer/customer.csv",
  "authentication": {},
  "delimiter": "Custom",
  "customDelimiter": "#",
  "header": true,
  "numPartitions": 10,
  "partitionBy": [
    "country"
  ],
  "quote": "DoubleQuote",
  "saveMode": "Overwrite"
}

ElasticsearchLoad

Since: 1.9.0 - Supports Streaming: False

Experimental

The ElasticsearchLoad is currently in experimental state whilst the requirements become clearer.

This means this API is likely to change.

The ElasticsearchLoad writes an input DataFrame to a target Elasticsearch cluster.

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
output String true The name of the target Elasticsearch index.
description String false An optional stage description to help document job files and print to job logs to assist debugging.
numPartitions Integer false The number of partitions that will be used for controlling parallelism.
params Map[String, String] false Map of configuration parameters. Parameters for connecting to the Elasticsearch cluster are detailed in the elasticsearch-hadoop configuration documentation.
partitionBy Array[String] false Columns to partition the data by.

Examples

Minimal

{
  "type": "ElasticsearchLoad",
  "name": "write customer",
  "environments": [
    "production",
    "test"
  ],
  "output": "customer",
  "inputView": "customer",
  "params": {
    "es.nodes": "<my>.elasticsearch.com",
    "es.port": "443",
    "es.nodes.wan.only": "true",
    "es.net.ssl": "true"
  }
}

Complete

{
  "type": "ElasticsearchLoad",
  "name": "write customer",
  "environments": [
    "production",
    "test"
  ],
  "output": "customer",
  "inputView": "customer",
  "params": {
    "es.nodes": "<my>.elasticsearch.com",
    "es.port": "443",
    "es.nodes.wan.only": "true",
    "es.net.ssl": "true"
  },
  "numPartitions": 10,
  "partitionBy": [
    "country"
  ],
  "saveMode": "Overwrite"
}

HTTPLoad

Since: 1.0.0 - Supports Streaming: True

The HTTPLoad takes an input DataFrame and executes a series of POST requests against a remote HTTP service. The input to this stage needs to be a single-column dataset of signature value: string and is intended to be used after a JSONTransform stage, which prepares the data for sending to the external server.

In the future, additional Transform stages (like ProtoBufTransform) could be added to prepare binary payloads instead of just JSON strings.
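
As with the other stages that expect a single value column, the usual pattern is to chain a JSONTransform in front of this stage so that its outputView becomes this stage's inputView. A sketch of that pairing, assuming a standard stages array in the job file and an illustrative customer_json view name:

{
  "stages": [
    {
      "type": "JSONTransform",
      "name": "convert customer rows to json strings",
      "environments": [
        "production",
        "test"
      ],
      "inputView": "customer",
      "outputView": "customer_json"
    },
    {
      "type": "HTTPLoad",
      "name": "load customers to the customer api",
      "environments": [
        "production",
        "test"
      ],
      "inputView": "customer_json",
      "outputURI": "http://internalserver/api/customer"
    }
  ]
}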

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
outputURI URI true URI of the HTTP server.
description String false An optional stage description to help document job files and print to job logs to assist debugging.
headers Map[String, String] false HTTP Headers to set for the HTTP request. These are not limited to the Internet Engineering Task Force standard headers.
validStatusCodes Array[Integer] false A list of valid status codes which will result in a successful stage if the list contains the HTTP server response code. If not provided the default values are [200, 201, 202]. Note: all request response codes must be contained in this list for the stage to be successful.

Examples

Minimal

{
  "type": "HTTPLoad",
  "name": "load customers to the customer api",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "http://internalserver/api/customer"
}

Complete

{
  "type": "HTTPLoad",
  "name": "load customers to the customer api",
  "description": "load customers to the customer api",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "http://internalserver/api/customer",
  "headers": {
    "Authorization": "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==",
    "custom-header": "payload"
  },
  "validStatusCodes": [
    200,
    201
  ]
}

JDBCLoad

Since: 1.0.0 - Supports Streaming: True

The JDBCLoad writes an input DataFrame to a target JDBC Database. See Spark JDBC documentation.

Whilst it is possible to use JDBCLoad to create tables directly in the target database, Spark has only limited knowledge of the schema required in the destination database and so will translate types like StringType to a TEXT type in the target database (because internally Spark does not have limited-length strings). The recommendation is to use a preceding JDBCExecute stage to execute a CREATE TABLE statement which creates the intended schema, then insert into that table with saveMode set to Append.
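
A sketch of that pattern is shown below, assuming the JDBCExecute stage described in the Execute documentation and an illustrative create_customer_table.sql file containing the CREATE TABLE statement:

{
  "stages": [
    {
      "type": "JDBCExecute",
      "name": "create customer table",
      "environments": [
        "production",
        "test"
      ],
      "inputURI": "hdfs://datalake/sql/create_customer_table.sql",
      "jdbcURL": "jdbc:postgresql://localhost:5432/customer",
      "params": {
        "user": "mydbuser",
        "password": "mydbpassword"
      }
    },
    {
      "type": "JDBCLoad",
      "name": "write customer to postgres",
      "environments": [
        "production",
        "test"
      ],
      "inputView": "customer",
      "jdbcURL": "jdbc:postgresql://localhost:5432/customer",
      "tableName": "mydatabase.myschema.customer",
      "params": {
        "user": "mydbuser",
        "password": "mydbpassword"
      },
      "saveMode": "Append"
    }
  ]
}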

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
jdbcURL String true The JDBC URL to connect to. e.g., jdbc:mysql://localhost:3306.
tableName String true The target JDBC table. Must be in database.schema.table format.
params Map[String, String] true Map of configuration parameters. Currently requires user and password to be set here - see example below.
batchsize Integer false The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers.

Default: 1000.
bulkload Boolean false Whether to enable a bulk copy. This is currently only available for sqlserver targets but more targets can be added as drivers become available.

Default: false.
createTableColumnTypes String false The database column data types to use instead of the defaults when creating the table. Data type information should be specified in the same format as CREATE TABLE columns syntax (e.g. "name CHAR(64), comments VARCHAR(1024)"). The specified types should be valid Spark SQL data types.
createTableOptions String false This is a JDBC writer related option. If specified, this option allows setting of database-specific table and partition options when creating a table (e.g., CREATE TABLE t (name string) ENGINE=InnoDB).
description String false An optional stage description to help document job files and print to job logs to assist debugging.
isolationLevel String false The transaction isolation level, which applies to the current connection. It can be one of NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE, corresponding to the standard transaction isolation levels defined by JDBC's Connection object, with a default of READ_UNCOMMITTED. Please refer to the documentation for java.sql.Connection.
numPartitions Integer false The number of partitions that will be used for controlling parallelism. This also determines the maximum number of concurrent JDBC connections.
saveMode String false The mode for writing the output file to describe how errors are handled. Available options are: Append, ErrorIfExists, Ignore, Overwrite. Default is Overwrite if not specified.
tablock Boolean false When in bulkload mode whether to set TABLOCK on the driver.

Default: true.
truncate Boolean false When saveMode is set to Overwrite, this additional option causes Spark to truncate the existing table with TRUNCATE TABLE instead of executing a DELETE FROM statement.

Examples

Minimal

{
  "type": "JDBCLoad",
  "name": "write customer to postgres",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "jdbcURL": "jdbc:postgresql://localhost:5432/customer",
  "tableName": "mydatabase.myschema.customer",
  "params": {
    "user": "mydbuser",
    "password": "mydbpassword"
  }
}

Complete

{
  "type": "JDBCLoad",
  "name": "write customer to postgres",
  "description": "write customer to postgres",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "jdbcURL": "jdbc:postgresql://localhost:5432/customer",
  "tableName": "mydatabase.myschema.customer",
  "batchsize": 10000,
  "bulkload": false,
  "createTableColumnTypes": "name CHAR(64), comments VARCHAR(1024)",
  "createTableOptions": "CREATE TABLE t (name string) ENGINE=InnoDB",
  "isolationLevel": "READ_COMMITTED",
  "numPartitions": 10,
  "params": {
    "user": "mydbuser",
    "password": "mydbpassword"
  },
  "saveMode": "Append",
  "tablock": false,
  "truncate": false
}

JSONLoad

Since: 1.0.0 - Supports Streaming: True

The JSONLoad writes an input DataFrame to a target JSON file.

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
outputURI URI true URI of the JSON file to write to.
authentication Map[String, String] false An authentication map for authenticating with a remote service. See authentication documentation.
description String false An optional stage description to help document job files and print to job logs to assist debugging.
numPartitions Integer false The number of partitions that will be used for controlling parallelism.
partitionBy Array[String] false Columns to partition the data by.
saveMode String false The mode for writing the output file to describe how errors are handled. Available options are: Append, ErrorIfExists, Ignore, Overwrite. Default is Overwrite if not specified.

Examples

Minimal

{
  "type": "JSONLoad",
  "name": "write customer json extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "hdfs://output_data/customer/customer.json"
}

Complete

{
  "type": "JSONLoad",
  "name": "write customer json extract",
  "description": "write customer json extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "hdfs://output_data/customer/customer.json",
  "authentication": {},
  "numPartitions": 10,
  "partitionBy": [
    "country"
  ],
  "saveMode": "Overwrite"
}

KafkaLoad

Since: 1.0.8 - Supports Streaming: True

The KafkaLoad writes an input DataFrame to a target Kafka topic. The input to this stage needs to be either a single-column dataset of signature value: string - intended to be used after a JSONTransform stage - or a two-column dataset of signature key: string, value: string, which could be created by a SQLTransform stage.

In the future, additional Transform stages (like ProtoBufTransform) may be added to prepare binary payloads instead of just JSON strings.
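
For the two-column form, a SQLTransform stage can build the key and value columns, for example with a statement along the lines of SELECT CAST(customer_id AS STRING) AS key, TO_JSON(STRUCT(*)) AS value FROM customer. A sketch, assuming the SQLTransform stage described in the Transform documentation and an illustrative customer_key_value.sql file holding that statement:

{
  "type": "SQLTransform",
  "name": "prepare customer key/value pairs",
  "environments": [
    "production",
    "test"
  ],
  "inputURI": "hdfs://datalake/sql/customer_key_value.sql",
  "outputView": "customer_key_value"
}

The KafkaLoad stage would then read from customer_key_value as its inputView.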

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
bootstrapServers String true A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. e.g. host1:port1,host2:port2,...
topic String true The target Kafka topic.
acks Integer false The number of acknowledgments the producer requires the leader to have received before considering a request complete.

Allowed values:
1: the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers.
0: the job will not wait for any acknowledgment from the server at all.
-1: the leader will wait for the full set of in-sync replicas to acknowledge the record (safest).

Default: 1.
batchSize Integer false Number of records to send in a single request to reduce the number of requests to Kafka.

Default: 16384.
description String false An optional stage description to help document job files and print to job logs to assist debugging.
numPartitions Integer false The number of partitions that will be used for controlling parallelism.
retries Integer false How many times to try to resend any record whose send fails with a potentially transient error.

Default: 0.

Examples

Minimal

{
  "type": "KafkaLoad",
  "name": "write customer to kafka",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "bootstrapServers": "kafka:29092",
  "topic": "customers"
}

Complete

{
  "type": "KafkaLoad",
  "name": "write customer to kafka",
  "description": "write customer to kafka",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "bootstrapServers": "kafka:29092",
  "topic": "customers",
  "acks": 1,
  "batchSize": 16384,
  "numPartitions": 10,
  "retries": 0
}

ORCLoad

Since: 1.0.0 - Supports Streaming: True

The ORCLoad writes an input DataFrame to a target Apache ORC file.

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
outputURI URI true URI of the ORC file to write to.
authentication Map[String, String] false An authentication map for authenticating with a remote service. See authentication documentation.
description String false An optional stage description to help document job files and print to job logs to assist debugging.
numPartitions Integer false The number of partitions that will be used for controlling parallelism.
partitionBy Array[String] false Columns to partition the data by.
saveMode String false The mode for writing the output file to describe how errors are handled. Available options are: Append, ErrorIfExists, Ignore, Overwrite. Default is Overwrite if not specified.

Examples

Minimal

{
  "type": "ORCLoad",
  "name": "write customer ORC extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "hdfs://output_data/customer/customer.orc"
}

Complete

{
  "type": "ORCLoad",
  "name": "write customer ORC extract",
  "description": "write customer ORC extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "hdfs://output_data/customer/customer.orc",
  "authentication": {},
  "numPartitions": 10,
  "partitionBy": [
    "country"
  ],
  "saveMode": "Overwrite"
}

ParquetLoad

Since: 1.0.0 - Supports Streaming: True

The ParquetLoad writes an input DataFrame to a target Apache Parquet file.

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
outputURI URI true URI of the Parquet file to write to.
authentication Map[String, String] false An authentication map for authenticating with a remote service. See authentication documentation.
description String false An optional stage description to help document job files and print to job logs to assist debugging.
numPartitions Integer false The number of partitions that will be used for controlling parallelism.
partitionBy Array[String] false Columns to partition the data by.
saveMode String false The mode for writing the output file to describe how errors are handled. Available options are: Append, ErrorIfExists, Ignore, Overwrite. Default is Overwrite if not specified.

Examples

Minimal

{
  "type": "ParquetLoad",
  "name": "write customer Parquet extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "hdfs://output_data/customer/customer.parquet"
}

Complete

{
  "type": "ParquetLoad",
  "name": "write customer Parquet extract",
  "description": "write customer Parquet extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "hdfs://output_data/customer/customer.parquet",
  "authentication": {},
  "numPartitions": 10,
  "partitionBy": [
    "country"
  ],
  "saveMode": "Overwrite"
}

TextLoad

Since: 1.9.0 - Supports Streaming: False

The TextLoad writes an input DataFrame to a target text file.

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
outputURI URI true URI of the text file to write to.
authentication Map[String, String] false An authentication map for authenticating with a remote service. See authentication documentation.
description String false An optional stage description to help document job files and print to job logs to assist debugging.
numPartitions Integer false The number of partitions that will be used for controlling parallelism.
partitionBy Array[String] false Columns to partition the data by.
saveMode String false The mode for writing the output file to describe how errors are handled. Available options are: Append, ErrorIfExists, Ignore, Overwrite. Default is Overwrite if not specified.
singleFile Boolean false Write to a single text file instead of a directory containing one or more partitions. Warning: this will pull the entire dataset to memory on the driver process so will not work for large datasets unless the driver has a sufficiently large memory allocation.
prefix String false A string to add before the row data when in singleFile mode.
separator String false A separator string to insert between rows when in singleFile mode.
suffix String false A string to add after the row data when in singleFile mode.

Examples

Minimal

{
  "type": "TextLoad",
  "name": "write customer Text extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "hdfs://output_data/customer/customer.text"
}

Complete

{
  "type": "TextLoad",
  "name": "write customer Text extract",
  "description": "write customer text extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "hdfs://output_data/customer/customer.text",
  "authentication": {},
  "numPartitions": 10,
  "saveMode": "Overwrite",
  "singleFile": true,
  "prefix": "[",
  "separator": ",\n",
  "suffix": "]"
}

XMLLoad

Since: 1.0.0 - Supports Streaming: False

The XMLLoad writes an input DataFrame to a target XML file.

Parameters

Attribute Type Required Description
name String true Name of the stage for logging.
environments Array[String] true A list of environments under which this stage will be executed. See environments documentation.
inputView String true Name of incoming Spark dataset.
outputURI URI true URI of the XML file to write to.
authentication Map[String, String] false An authentication map for authenticating with a remote service. See authentication documentation.
description String false An optional stage description to help document job files and print to job logs to assist debugging.
numPartitions Integer false The number of partitions that will be used for controlling parallelism.
partitionBy Array[String] false Columns to partition the data by.
saveMode String false The mode for writing the output file to describe how errors are handled. Available options are: Append, ErrorIfExists, Ignore, Overwrite. Default is Overwrite if not specified.

Examples

Minimal

{
  "type": "XMLLoad",
  "name": "write customer XML extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "hdfs://output_data/customer/customer.xml"
}

Complete

{
  "type": "XMLLoad",
  "name": "write customer XML extract",
  "description": "write customer XML extract",
  "environments": [
    "production",
    "test"
  ],
  "inputView": "customer",
  "outputURI": "hdfs://output_data/customer/customer.xml",
  "authentication": {},
  "numPartitions": 10,
  "partitionBy": [
    "country"
  ],
  "saveMode": "Overwrite"
}