S3 Load
The S3 Load component lets users load data into an existing table from objects stored in Amazon Simple Storage Service (Amazon S3).
The S3 Load component requires working AWS Credentials, with Read access to the bucket containing the source data files.
To access an S3 bucket from a different AWS account, read Background: Cross-account permissions and using IAM roles.
Note
If you're using a Matillion Fully-Managed solution, you may need to allow-list these IP address ranges from which Matillion Fully-Managed agents will call out to their source systems or to cloud data platforms.
Properties
Name
= string
A human-readable name for the component.
Stage
= drop-down
Select a staging area for the data. Staging areas can be created through Snowflake using the CREATE STAGE command. Internal stages can be set up this way to store staged data within Snowflake. Selecting [Custom] exposes additional properties for specifying a custom staging area on S3. Users can add a fully qualified stage by typing the stage name. This should follow the format databaseName.schemaName.stageName.
Authentication
= drop-down
Select the authentication method. Users can choose either:
- Credentials: Uses AWS security credentials.
- Storage Integration: Uses a Snowflake storage integration. A storage integration is a Snowflake object that stores a generated identity and access management (IAM) entity for your external cloud storage, along with an optional set of permitted or blocked storage locations (Amazon S3, Google Cloud Storage, or Microsoft Azure Blob Storage). More information can be found at CREATE STORAGE INTEGRATION.
Storage Integration
= drop-down
Select the storage integration. Storage integrations are required to permit Snowflake to read data from and write to a cloud storage location. Integrations must be set up in advance of selecting them. Storage integrations can be configured to support Amazon S3, Google Cloud Storage, or Microsoft Azure Blob Storage, regardless of the cloud provider that hosts your Snowflake account.
S3 Object Prefix
= string
The URL of the S3 bucket to load files from. The URL follows the format s3://bucket/path/. Type the URL or select from the list.
The "path" parameter in the URL is the subfolder and should be included. When a user enters a forward slash character (/) after a folder name, a validation of the file path is triggered.
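As a rough illustration of the s3://bucket/path/ format, the following Python sketch splits such a URL into its bucket and path parts. The helper name split_s3_url and the example bucket are hypothetical, and the sketch does not represent how the component validates the URL.

```python
from urllib.parse import urlparse


def split_s3_url(url: str) -> tuple[str, str]:
    """Split an s3://bucket/path/ URL into (bucket, path). Hypothetical helper."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {url!r}")
    # Drop the leading slash so the path reads as a key prefix inside the bucket.
    return parsed.netloc, parsed.path.lstrip("/")


print(split_s3_url("s3://testbucket/testDirectory/"))
```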
Pattern
= string
A regular expression pattern that matches the file paths and names to be included in the load. Defaults to .*, which matches all files within the S3 Object Prefix. This property is a pattern on the complete path of the file, and is not just relative to the directory configured in the S3 Object Prefix property.
The subfolder containing the object to load must be included here.
Encryption
= drop-down
Decide how the files are encrypted inside the S3 bucket. This property is available when using an existing Amazon S3 location for staging.
- None: No encryption.
- SSE KMS: Encrypt the data according to a key stored on KMS. Read AWS Key Management Service (AWS KMS) to learn more.
- SSE S3: Encrypt the data according to a key stored on an S3 bucket. Read Using server-side encryption with Amazon S3-managed encryption keys (SSE-S3) to learn more.
KMS Key ID
= drop-down
The ID of the KMS encryption key you have chosen to use in the Encryption property.
Warehouse
= drop-down
The Snowflake warehouse used to run the queries. The special value, [Environment Default], will use the warehouse defined in the environment. Read Overview of Warehouses to learn more.
Database
= drop-down
The Snowflake database. The special value, [Environment Default], will use the database defined in the environment. Read Databases, Tables and Views - Overview to learn more.
Schema
= drop-down
The Snowflake schema. The special value, [Environment Default], will use the schema defined in the environment. Read Database, Schema, and Share DDL to learn more.
Target Table
= string
Select an existing table to load data into. The tables available for selection depend on the chosen schema.
Load Columns
= dual listbox
Choose the columns to load. If you leave this parameter empty, all columns will be loaded.
Format
= drop-down
Select a pre-made file format that will automatically set many of the S3 Load component properties. These formats can be created through the Create File Format component.
Users can add a fully qualified format by typing the format name. This should read as databaseName.schemaName.formatName
File Type
= drop-down
Select the type of data to load. Available data types are: AVRO, CSV, JSON, ORC, PARQUET, and XML. Some file types may require additional formatting, which is explained in the Snowflake documentation. Component properties will change to reflect the selected file type.
Compression
= drop-down
Select the compression method if you wish to compress your data. If you do not wish to compress at all, select NONE. The default setting is AUTO.
Record Delimiter
= string
CSV only. Input a delimiter for records. This can be one or more single-byte or multibyte characters that separate records in an input file.
Accepted values include: leaving the field empty; a newline character \n or its hex equivalent 0x0a; a carriage return \r or its hex equivalent 0x0d. Also accepts a value of NONE.
If you set the Skip Header to a value such as 1, then you should use a record delimiter that includes a line feed or carriage return, such as \n or \r. Otherwise, your entire file will be interpreted as the header row, and no data will be loaded.
The specified delimiter must be a valid UTF-8 character and not a random sequence of bytes.
Do not specify characters used for other file type options such as Escape or Escape Unenclosed Field.
The default (if the field is left blank) is a newline character.
Field Delimiter
= string
CSV only. Input a delimiter for fields. This can be one or more single-byte or multibyte characters that separate fields in an input file.
Accepted characters include common escape sequences, octal values (prefixed by \), or hex values (prefixed by 0x). Also accepts a value of NONE.
This delimiter is limited to a maximum of 20 characters.
While multi-character delimiters are supported, the field delimiter cannot be a substring of the record delimiter, and vice versa. For example, if the field delimiter is "aa", the record delimiter cannot be "aabb".
The specified delimiter must be a valid UTF-8 character and not a random sequence of bytes.
Do not specify characters used for other file type options such as Escape or Escape Unenclosed Field.
The default setting is a comma (,).
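The length limit and the substring rule for delimiters can be checked up front. The Python sketch below is only an illustration of those two rules as stated above; the helper name check_delimiters is hypothetical, and the component performs its own validation.

```python
def check_delimiters(field_delim: str, record_delim: str) -> list[str]:
    """Return rule violations for a delimiter pair (empty list if none).

    Hypothetical helper that only covers the two rules described in the text.
    """
    problems = []
    if len(field_delim) > 20:
        problems.append("field delimiter is longer than 20 characters")
    # Empty strings are skipped: "" is trivially a substring of everything.
    if field_delim and record_delim:
        if field_delim in record_delim:
            problems.append("field delimiter is a substring of the record delimiter")
        elif record_delim in field_delim:
            problems.append("record delimiter is a substring of the field delimiter")
    return problems


print(check_delimiters("aa", "aabb"))
print(check_delimiters(",", "\n"))
```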
Skip Header
= integer
CSV only. Specify the number of rows to skip. The default is 0.
If Skip Header is used, the value of the record delimiter is not used to determine where the header line ends. Instead, the specified number of CRLF-terminated lines is skipped. For example, if Skip Header = 1, the load skips to the first CRLF that it finds. If you have set the Field Delimiter property to be a single character without a CRLF, the load skips to the end of the file (treating the entire file as a header).
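The effect of a missing line terminator can be seen in a small Python sketch. It assumes that a line ends at a line feed (which also covers CRLF) and uses a hypothetical helper, skip_header; it approximates the behaviour described above rather than reproducing Snowflake's parser.

```python
def skip_header(text: str, skip: int) -> str:
    """Drop the first `skip` lines, where a line ends at a line feed.

    Hypothetical helper. If the text has fewer line feeds than `skip`,
    everything is treated as header and nothing is left to load.
    """
    parts = text.split("\n", skip)
    if len(parts) <= skip:
        return ""
    return parts[skip]


# Records separated by CRLF: only the header row is skipped.
print(repr(skip_header("id,name\r\n1,a\r\n2,b\r\n", 1)))

# Records separated by ';' with no CRLF: the whole file counts as the header.
print(repr(skip_header("id,name;1,a;2,b", 1)))
```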
Skip Blank Lines
= boolean
CSV only. When True, ignores blank lines that only contain a line feed in a data file and does not try to load them. Default setting is False.
Date Format
= string
CSV only. Define the format of date values in the data files to be loaded. If a value is not specified or is AUTO, the value for the DATE_INPUT_FORMAT session parameter is used. The default setting is AUTO.
Time Format
= string
CSV only. Define the format of time values in the data files to be loaded. If a value is not specified or is AUTO, the value for the TIME_INPUT_FORMAT session parameter is used. The default setting is AUTO.
Timestamp Format
= string
CSV only. Define the format of timestamp values in the data files to be loaded. If a value is not specified or is AUTO, the value for the TIMESTAMP_INPUT_FORMAT session parameter is used.
Escape
= string
CSV only. Specify a single character to be used as the escape character for field values that are enclosed. Default is NONE.
Escape Unenclosed Field
= string
CSV only. Specify a single character to be used as the escape character for unenclosed field values only. The default is a backslash (\\). If you have set a value in the property Field Optionally Enclosed, all fields will become enclosed, rendering the Escape Unenclosed Field property redundant, in which case, it will be ignored.
Trim Space
= boolean
CSV only. When True, removes whitespace from fields. Default setting is False.
Field Optionally Enclosed
= string
CSV only. Specify a character used to enclose strings. The value can be NONE, a single quote character ('), or a double quote character ("). To use the single quote character, use the octal or hex representation (0x27) or the double single-quoted escape (''). Default is NONE.
When a field contains one of these characters, escape the field using the same character. For example, to escape a string like this: 1 "2" 3, use double quotation to escape, like this: 1 ""2"" 3.
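Python's csv module follows the same doubled-quote convention, so it can serve as a quick way to check how an enclosed field should be written. This is an analogy only; the sample row is made up and Snowflake's parser is not involved.

```python
import csv
import io

# One enclosed field containing quotes (doubled to escape them), then a plain field.
raw = '"1 ""2"" 3",other\n'

reader = csv.reader(io.StringIO(raw), delimiter=",", quotechar='"', doublequote=True)
row = next(reader)
print(row)
```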
Null If
= editor
Specify one or more strings (one string per row of the table) to convert to NULL values. When one of these strings is encountered in the file, it is replaced with an SQL NULL value for that field in the loaded table. Click + to add a string.
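The substitution can be pictured with a short Python sketch, where None stands in for SQL NULL. The helper name apply_null_if and the sample strings are hypothetical.

```python
def apply_null_if(fields: list[str], null_if: set[str]) -> list:
    """Replace any field that exactly equals a Null If string with None."""
    return [None if field in null_if else field for field in fields]


print(apply_null_if(["1", "N/A", "NULL", "x"], {"N/A", "NULL"}))
```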
Error On Column Count Mismatch
= boolean
CSV only. When True, generates an error if the number of delimited columns in an input file does not match the number of columns in the corresponding table. When False (default), an error is not generated and the load continues. If the file is successfully loaded in this case:
- Where the input file contains records with more fields than columns in the table, the matching fields are loaded in order of occurrence in the file, and the remaining fields are not loaded.
- Where the input file contains records with fewer fields than columns in the table, the non-matching columns in the table are loaded with NULL values.
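The two cases above can be sketched in Python, with None standing in for NULL. The helper fit_record is hypothetical and only mirrors the behaviour described in this section.

```python
def fit_record(fields: list, column_count: int, error_on_mismatch: bool = False) -> list:
    """Fit one parsed record to a table with `column_count` columns."""
    if error_on_mismatch and len(fields) != column_count:
        raise ValueError(
            f"expected {column_count} fields, found {len(fields)}"
        )
    # Extra fields are dropped; missing trailing columns are filled with None.
    kept = fields[:column_count]
    return kept + [None] * (column_count - len(kept))


print(fit_record(["a", "b", "c", "d"], 3))  # more fields than columns
print(fit_record(["a"], 3))                 # fewer fields than columns
```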
Empty Field As Null
= boolean
CSV only. When True, inserts NULL values for empty fields in an input file. This is the default setting.
Replace Invalid Characters
= boolean
When True, Snowflake replaces invalid UTF-8 characters with the Unicode replacement character. When False (default), the load operation produces an error when invalid UTF-8 character encoding is detected.
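Python's decoder offers a comparable choice between replacing and rejecting invalid bytes, which makes the difference easy to see. This is an analogy using made-up sample bytes, not a reproduction of Snowflake's behaviour.

```python
raw = b"caf\xe9"  # Latin-1 encoded bytes; the final byte is not valid UTF-8

# Comparable to True: the invalid byte becomes U+FFFD.
replaced = raw.decode("utf-8", errors="replace")
print(replaced)

# Comparable to False: decoding stops with an error.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)
```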
Encoding Type
= drop-down
CSV only. Select the string that specifies the character set of the source data when loading data into a table. Please refer to the Snowflake documentation for more information.
Enable Octal
= boolean
JSON only. When True, enables the parsing of octal values. Default setting is False.
Allow Duplicates
= boolean
JSON only. When True, allows duplicate object field names. Default setting is False.
Strip Outer Array
= boolean
JSON only. When True, instructs the JSON parser to remove outer brackets. Default setting is False.
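A simplified Python sketch shows the practical effect: with the outer brackets removed, each array element becomes its own record instead of the whole array being one record. The helper to_records and the sample document are hypothetical.

```python
import json


def to_records(document: str, strip_outer_array: bool) -> list:
    """Return the records a JSON document would yield. Hypothetical helper."""
    parsed = json.loads(document)
    if strip_outer_array and isinstance(parsed, list):
        return parsed   # one record per array element
    return [parsed]     # the whole document is a single record


doc = '[{"id": 1}, {"id": 2}]'
print(len(to_records(doc, strip_outer_array=True)))
print(len(to_records(doc, strip_outer_array=False)))
```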
Strip Null Values
= boolean
JSON only. When True, instructs the JSON parser to remove any object fields or array elements containing NULL values. Default setting is False.
Ignore UTF8 Errors
= boolean
JSON and XML only. When True, replaces any invalid UTF-8 sequences with the Unicode replacement character. When False (default), invalid UTF-8 sequences produce an error in the pipeline run.
Preserve Space
= boolean
XML only. When True, the XML parser preserves leading and trailing spaces in element content. Default setting is False.
Strip Outer Element
= boolean
XML only. When True, the XML parser strips out any outer XML elements, exposing second-level elements as separate documents. Default setting is False.
Disable Snowflake Data
= boolean
XML only. When True, the XML parser will not recognise Snowflake semi-structured data tags. Default setting is False.
Disable Auto Convert
= boolean
XML only. When True, the XML parser will disable automatic conversion of numeric and Boolean values from text to native representations. Default setting is False.
On Error
= drop-down
Decide how to proceed upon an error.
- Abort Statement: Aborts the load if any error is encountered. This is the default setting.
- Continue: Continue loading the file.
- Skip File: Skip file if any errors are encountered in the file.
- Skip File When n Errors: Skip file when the number of errors in the file is equal to or greater than the specified number in the next property, n.
- Skip File When n% Errors: Skip file when the percentage of errors in the file exceeds the percentage specified in the next property, n.
n
= integer
Specify the number of errors or the percentage of errors required to skip the file. This parameter accepts integers only. The % character is not accepted; specify percentages as a number only.
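The two threshold modes differ in their comparison: the count mode triggers at "equal to or greater than" n, while the percentage mode triggers only when n is exceeded. The Python sketch below illustrates this; the helper should_skip_file is hypothetical, and it assumes the percentage is the share of rows in the file that contain errors.

```python
def should_skip_file(error_count: int, row_count: int, n: int, as_percentage: bool) -> bool:
    """Decide whether a file is skipped under the two 'Skip File When' modes."""
    if as_percentage:
        # Integer arithmetic avoids rounding issues: errors / rows * 100 > n.
        return error_count * 100 > n * row_count
    return error_count >= n


print(should_skip_file(5, 100, 5, as_percentage=False))
print(should_skip_file(5, 100, 5, as_percentage=True))
```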
Size Limit (B)
= integer
Specify the maximum size, in bytes, of data to be loaded for a given COPY statement. If the maximum is exceeded, the COPY operation discontinues loading files. For more information, please refer to the Snowflake documentation.
Purge Files
= boolean
When True, purges data files after the data is successfully loaded. Default setting is False.
Match By Column Name
= drop-down
Specify whether to load semi-structured data into columns in the target table that match corresponding columns represented in the data.
- Case Insensitive: Load semi-structured data into columns in the target table that match corresponding columns represented in the data. Column names should be case-insensitive.
- Case Sensitive: Load semi-structured data into columns in the target table that match corresponding columns represented in the data. Column names should be case-sensitive.
- None: The COPY operation loads the semi-structured data into a variant column or, if a query is included in the COPY statement, transforms the data.
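The difference between the two matching modes can be sketched in Python. The helper match_columns, the column names, and the sample record are hypothetical; the sketch covers only name matching, not how unmatched columns or the None option are handled.

```python
def match_columns(record: dict, table_columns: list[str], case_sensitive: bool) -> dict:
    """Map fields of a semi-structured record onto table columns by name."""
    if case_sensitive:
        return {col: record[col] for col in table_columns if col in record}
    lowered = {key.lower(): value for key, value in record.items()}
    return {
        col: lowered[col.lower()]
        for col in table_columns
        if col.lower() in lowered
    }


record = {"id": 1, "Name": "Ada"}
columns = ["ID", "NAME"]
print(match_columns(record, columns, case_sensitive=False))
print(match_columns(record, columns, case_sensitive=True))
```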
Truncate Columns
= boolean
When True, strings are automatically truncated to the target column length. When False (default), the COPY statement produces an error if a loaded string exceeds the target column length.
Force Load
= boolean
When True, loads all files, regardless of whether they have been loaded previously and haven't changed since they were loaded. Default setting is False.
When set to True, this option reloads files and can lead to duplicated data in a table.
Metadata Fields
= dual listbox
Snowflake metadata columns available to include in the load. For more information, read Querying Metadata for Staged Files. This property is only available when an external stage is selected.
File patterns with Snowflake
In Snowflake, the Pattern parameter in the COPY INTO syntax is a pattern on the complete path of the file and is not just relative to the directory configured in the S3 Object Prefix parameter.
The table below provides an example of S3 Object Prefix and Pattern behaviours, including success and failure states.
S3 Object Prefix | Pattern | Outcome | Comments |
---|---|---|---|
s3://testbucket/ | testDirectory/alphabet_0_0_0.csv.gz | Success | This is the format that the S3 Load Generator will generate. |
s3://testbucket/testDirectory/ | testDirectory/alphabet_0_0_0.csv.gz | Success | Loads the file successfully because the pattern is matching the full path. |
s3://testbucket/testDirectory/ | .*.csv.gz | Success | Would load all files ending in .csv.gz in the testDirectory directory. |
s3://testbucket/testDirectory/ | alphabet_0_0_0.csv.gz | Failure | Does not load the file because the pattern does not match. |
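The four rows of the table can be reproduced with a small Python model. It assumes the pattern must match the object's complete path within the bucket and that only objects under the S3 Object Prefix are considered; the helper is_loaded is hypothetical and is a sketch of the table's outcomes, not Snowflake's implementation.

```python
import re


def is_loaded(prefix_url: str, pattern: str, object_key: str) -> bool:
    """Model whether an object is picked up for a given prefix and pattern."""
    # Strip "s3://" and the bucket name to get the key prefix inside the bucket.
    _bucket, _, key_prefix = prefix_url[len("s3://"):].partition("/")
    under_prefix = object_key.startswith(key_prefix)
    # The pattern has to match the complete path, not just the file name.
    return under_prefix and re.fullmatch(pattern, object_key) is not None


key = "testDirectory/alphabet_0_0_0.csv.gz"
cases = [
    ("s3://testbucket/", "testDirectory/alphabet_0_0_0.csv.gz"),
    ("s3://testbucket/testDirectory/", "testDirectory/alphabet_0_0_0.csv.gz"),
    ("s3://testbucket/testDirectory/", ".*.csv.gz"),
    ("s3://testbucket/testDirectory/", "alphabet_0_0_0.csv.gz"),
]
results = [is_loaded(prefix, pattern, key) for prefix, pattern in cases]
print(results)
```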