Deploy
Arc has been packaged as a Docker image to simplify deployment as a stateless process on cloud infrastructure.
Running a Job
An example command to start a job is:
```
docker run \
-e "ETL_CONF_ENV=production" \
-e "ETL_CONF_JOB_PATH=/opt/tutorial/basic/job/0" \
-it -p 4040:4040 seddonm1/arc:1.13.3 \
bin/spark-submit \
--master local[*] \
--class au.com.agl.arc.ARC \
/opt/spark/jars/arc.jar \
--etl.config.uri=file:///opt/tutorial/basic/job/0/basic.json
```
This command executes the following job file, which is included in the Docker image:
{"stages":
[{
"type": "SQLValidate",
"name": "a simple stage which prints a message",
"environments": [
"production",
"test"
],
"inputURI": ${ETL_CONF_JOB_PATH}"/print_message.sql",
"sqlParams": {
"message0": "Hello",
"message1": "World!"
},
"authentication": {},
"params": {}
}]
}
This example is included to demonstrate:

- `ETL_CONF_ENV` is a reserved environment variable which determines which stages to execute in the current mode. For each of the stages the job designer can specify an array of `environments` under which that stage will be executed (in the case above `production` and `test` are specified). The purpose of this setting is to make it possible to add or remove stages for execution modes like `test` or `integration` which are executed by a CI/CD tool prior to deployment and that you do not want to run in `production` mode - so, for example, a comparison against a known ‘good’ test dataset could be executed in only `test` mode (a sketch of such a stage follows this list).
- `ETL_CONF_JOB_PATH` is an environment variable that is parsed and included by string interpolation when the job file is executed. When the job starts, Arc will attempt to resolve all environment variables referenced in the `basic.json` job file. In this case `"inputURI": ${ETL_CONF_JOB_PATH}"/print_message.sql",` becomes `"inputURI": "/opt/tutorial/basic/job/0/print_message.sql",` after resolution. This allows different paths to be set for running in `test` vs `production` mode.
- In this sample job the Spark master is `local[*]`, indicating that this is a single-instance ‘cluster’ where Arc relies on vertical rather than horizontal scaling. Depending on the constraints of the job (i.e. CPU vs disk IO) it is often better to execute with vertical scaling on cloud compute rather than pay the cost of network shuffling.
- `etl.config.uri` is a reserved JVM property which tells Arc which job to execute. See below for all the properties that can be passed to Arc.
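For example, a test-only stage might look something like the sketch below. The stage name and the `compare_known_good.sql` file are hypothetical and not part of the tutorial image; the point is that the `environments` array contains only `test`, so the stage is skipped when `ETL_CONF_ENV=production`.

```
{
  "type": "SQLValidate",
  "name": "compare output against a known good dataset (test mode only)",
  "environments": [
    "test"
  ],
  "inputURI": ${ETL_CONF_JOB_PATH}"/compare_known_good.sql",
  "sqlParams": {},
  "authentication": {},
  "params": {}
}
```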
Configuration Parameters
Variable | Property | Description |
---|---|---|
ETL_CONF_JOB_ID | etl.config.job.id | A job identifier added to all the logging messages. |
ETL_CONF_JOB_NAME | etl.config.job.name | A job name added to all logging messages and Spark history server. |
ETL_CONF_TAGS | etl.config.tags | Custom key/value tags separated by space to add to all logging messages. E.g. ETL_CONF_TAGS=cost_center=123456 owner=jovyan. |
ETL_CONF_ENV | etl.config.environment | The environment to run under. E.g. if ETL_CONF_ENV is set to production then a stage with "environments": ["production", "test"] would be executed and one with "environments": ["test"] would not be executed. |
ETL_CONF_ENV_ID | etl.config.environment.id | An environment identifier to be added to all logging messages. Could be something like a UUID which allows joining to logs produced by ephemeral compute started by something like Terraform. |
ETL_CONF_URI | etl.config.uri | The URI of the job file to execute. |
ETL_CONF_STREAMING | etl.config.streaming | Whether to run in Structured Streaming mode. Boolean, default false. |
ETL_CONF_DISABLE_DEPENDENCY_VALIDATION | etl.config.disableDependencyValidation | Disable config dependency graph validation in case of dependency resolution defects. Boolean, default false. |
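As an illustration, the job identification parameters above can be added to the tutorial command as additional environment variables. This is a sketch only; the job identifier, name and tag values shown are hypothetical, not required.

```
# same tutorial job as above, with optional logging identifiers supplied as environment variables
docker run \
-e "ETL_CONF_ENV=production" \
-e "ETL_CONF_JOB_PATH=/opt/tutorial/basic/job/0" \
-e "ETL_CONF_JOB_ID=basic-0" \
-e "ETL_CONF_JOB_NAME=basic-tutorial" \
-e "ETL_CONF_TAGS=cost_center=123456 owner=jovyan" \
-it -p 4040:4040 seddonm1/arc:1.13.3 \
bin/spark-submit \
--master local[*] \
--class au.com.agl.arc.ARC \
/opt/spark/jars/arc.jar \
--etl.config.uri=file:///opt/tutorial/basic/job/0/basic.json
```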
Additionally, there are permission arguments that can be used to retrieve the job file from cloud storage:
Variable | Property | Description |
---|---|---|
ETL_CONF_S3A_ENDPOINT | etl.config.fs.s3a.endpoint | The endpoint for connecting to Amazon S3. |
ETL_CONF_S3A_CONNECTION_SSL_ENABLED | etl.config.fs.s3a.connection.ssl.enabled | Whether to enable SSL connection to Amazon S3. |
ETL_CONF_S3A_ACCESS_KEY | etl.config.fs.s3a.access.key | The access key for connecting to Amazon S3. |
ETL_CONF_S3A_SECRET_KEY | etl.config.fs.s3a.secret.key | The secret for connecting to Amazon S3. |
ETL_CONF_AZURE_ACCOUNT_NAME | etl.config.fs.azure.account.name | The account name for connecting to Azure Blob Storage. |
ETL_CONF_AZURE_ACCOUNT_KEY | etl.config.fs.azure.account.key | The account key for connecting to Azure Blob Storage. |
ETL_CONF_ADL_OAUTH2_CLIENT_ID | etl.config.fs.adl.oauth2.client.id | The OAuth client identifier for connecting to Azure Data Lake. |
ETL_CONF_ADL_OAUTH2_REFRESH_TOKEN | etl.config.fs.adl.oauth2.refresh.token | The OAuth refresh token for connecting to Azure Data Lake. |
ETL_CONF_GOOGLE_CLOUD_PROJECT_ID | etl.config.fs.gs.project.id | The project identifier for connecting to Google Cloud Storage. |
ETL_CONF_GOOGLE_CLOUD_AUTH_SERVICE_ACCOUNT_JSON_KEYFILE | etl.config.fs.google.cloud.auth.service.account.json.keyfile | The service account json keyfile path for connecting to Google Cloud Storage. |
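For example, the job file itself could be retrieved from cloud storage rather than from inside the image. The sketch below assumes a job file hosted on Amazon S3; the bucket, endpoint and credential values are placeholders, and ETL_CONF_URI is used in place of the --etl.config.uri argument shown earlier.

```
# hypothetical example: read the job file from Amazon S3 using the S3A parameters above
docker run \
-e "ETL_CONF_ENV=production" \
-e "ETL_CONF_S3A_ENDPOINT=s3.amazonaws.com" \
-e "ETL_CONF_S3A_CONNECTION_SSL_ENABLED=true" \
-e "ETL_CONF_S3A_ACCESS_KEY=<access-key>" \
-e "ETL_CONF_S3A_SECRET_KEY=<secret-key>" \
-e "ETL_CONF_URI=s3a://my-bucket/jobs/job.json" \
-it -p 4040:4040 seddonm1/arc:1.13.3 \
bin/spark-submit \
--master local[*] \
--class au.com.agl.arc.ARC \
/opt/spark/jars/arc.jar
```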
Examples
Streaming
This is an example of a streaming job. It is intended to be executed after the integration test environment has been started:
Start integration test environments:
```
docker-compose -f src/it/resources/docker-compose.yml up --build -d
```
Start the streaming job:
```
docker run \
--net "arc-integration" \
-e "ETL_CONF_ENV=test" \
-e "ETL_CONF_STREAMING=true" \
-e "ETL_CONF_ROWS_PER_SECOND=10" \
-it -p 4040:4040 seddonm1/arc:1.13.3 \
bin/spark-submit \
--master local[*] \
--class au.com.agl.arc.ARC \
/opt/spark/jars/arc.jar \
--etl.config.uri=file:///opt/tutorial/streaming/job/0/streaming.json
```
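When finished, the integration test environment started above can be stopped with the corresponding docker-compose teardown command:

```
docker-compose -f src/it/resources/docker-compose.yml down
```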
Spark and ulimit
On larger instances with many cores per machine it is possible to exceed the default (`1024`) max open files (`ulimit`). This should be verified on your instances if you are receiving `too many open files` type errors.
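As a rough sketch, the limit can be checked from inside the container and raised for a job with Docker's --ulimit flag. This assumes a shell is available in the image, and 65536 is an arbitrary example value, not a recommendation:

```
# check the current maximum number of open files inside the container
docker run -it seddonm1/arc:1.13.3 sh -c 'ulimit -n'

# run with a higher open files limit (soft:hard)
docker run --ulimit nofile=65536:65536 -it seddonm1/arc:1.13.3 sh -c 'ulimit -n'
```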