Skip to content

EMR Load

Overview

Use the EMR Load component to load data into an existing table from objects stored on an EMR cluster.

Many of the configuration settings on this component have sensible defaults, mirroring the defaults provided by Amazon Redshift when that option is not specified.

Mandatory settings are:

  • Target Table Name
  • Load Columns
  • EMR URL Location
  • EMR Object Prefix
  • Data File Type

In addition, it is likely you will need to confirm the following settings:

  • Delimiter
  • Compression Method
  • Ignore Header Rows

This component requires working AWS Credentials with read access to the EMR cluster containing the source data files. The is easily achieved by attaching an IAM role to the instance when launching Matillion ETL for Redshift; however, it can also be managed manually by editing an environment.

For more information on all the settings in this component, read the Amazon Redshift COPY syntax documentation.


Properties

Name = string

A human-readable name for the component.


Schema = drop-down

Select the table schema. The special value, [Environment Default], will use the schema defined in the environment. For more information on using multiple schemas, read Schemas.


Target Table = string

An existing table to load data into.


Load Columns = dual listbox

Use the arrow buttons to add columns into the load.


EMR URL Location = string

The URL of the EMR source path to retrieve files from. This follows the format emr://myemrclusterid/location, where location is optional.


EMR Object Prefix = string

All files that begin with this prefix will be included in the load into the target table.


IAM Role ARN = string

Supply the value of a role ARN that is already attached to your Amazon Redshift cluster, and has the necessary permissions to access the EMR cluster. This is optional, since without this style of setup, the credentials of the environment (instance credentials or manually entered access keys) will be used. Read here to learn more.


Data Type = drop-down

Available options are:

  • Delimited
  • CSV
  • Fixed Width: This requires an additional "Fixed Width Spec".
  • JSON: This requires an additional "JSON Format".
  • Avro: This requires an additional "Avro Format".

Delimiter = string

The delimiter that separates columns. The default is a comma ,. A [TAB] character can be specified as "\ ".


Fixed Width Spec = drop-down

Loads the data from a file where each column width is a fixed length, rather than separated by a delimiter. Each column is described by a name and length, separated by a colon. Each described column is then separated by a comma.

For example, you have four columns: name, id, age, state. These columns have the respective lengths; 12,8,2,2.

The written description to convert this data into a table using fixed-width columns would then be:

name:12,id:8,age:2,state:2

The columns can have any plaintext name. For more information, read Loading fixed-width data from Amazon S3.


CSV Quoter = string

Specifies the character to be used as the quote character when using the CSV option.


JSON Layout = string

Defaults to auto, which should work for the majority of JSON files if the fields match the table field names. Optionally can specify the URL to a JSONPaths file to map the data elements in the JSON source data to the columns in the target table.


AVRO Layout = string

Defaults to auto which should work for the majority of Avro files if the fields match the table field names. Optionally can specify the URL to a JSONPaths file to map the data elements in the Avro source data to the columns in the target table.


Compression Method = drop-down

Whether the input file is compressed in gzip format, LZOP format, or not compressed at all.


Encoding = drop-down

The encoding the data is in. This defaults to UTF-8.


Remove Quotes = boolean

Whether to remove any quotes surrounding data values.


Replace Invalid Characters = string

If there are any invalid unicode characters in the data, this parameter specifies the single character replacement for them. Defaults to ?.


Maximum Errors = integer

The maximum number of individual parsing errors that cause the whole load to fail. Values up to this will be substituted as null values. This value defaults to 0, but the Amazon default is 1000.


Date Format = string

Defaults to auto. This can be used to manually specify a date format.


Time Format = string

Defaults to auto.

This can be used to manually specify a time format.


Ignore Header Rows = string

The number of rows at the top of the file to ignore. Defaults to 0.


Accept Any Date = boolean

If this is enabled, invalid dates such as '45-65-2018' are not considered an error, but will be loaded as the null value.


Ignore Blank Lines = boolean

If this is set, any blank lines in the input file are ignored.


Truncate Columns = boolean

If this is set, any instance of data in the input file that is too long to fit into the specified target column width will be truncated to fit instead of causing an error.


Fill Record = boolean

Allows data files to be loaded when contiguous columns are missing at the end of some of the records. The remaining columns are set to null.


Trim Blanks = boolean

Removes trailing and leading whitespace from the input data.


Null As = string

This option replaces the specified string with null in the output table. Use this if your data has a particular representation of missing data.


Empty As Null = boolean

If this is set, empty columns in the input file will become NULL.


Blanks As Null = boolean

If this is set, blank columns in the input file will become NULL.


Comp Update = drop-down

Apply automatic compression to the target table. Default is On.


Escape = boolean

When this option is specified, the backslash character (\) in input data is treated as an escape character.


Stat Update = drop-down

Governs automatic computation and refresh of optimizer statistics at the end of a successful COPY command.


Round Decimals = boolean

If this option is set, round any decimals to fit into the column when the number of decimal places in the input data is larger than defined for the target column.


Explicit IDs boolean

Whether or not to load data from the EMR Objects into an IDENTITY column.


Snowflake Delta Lake on Databricks Amazon Redshift Google BigQuery Azure Synapse Analytics