Cloud Storage Load
The Cloud Storage Load Component lets users load data stored on the Google Cloud Storage service into an existing Snowflake or Google BigQuery table. Many of the configuration settings on this component have sensible defaults, mirroring the defaults provided by the target platform when an option is not specified.
This component requires working Google Cloud Storage credentials with "read" access to the source data files.
External Stages and Metadata
If using an external stage, please be aware of the following information:
- All columns to be included in the load must be included in the Load Columns property, in the correct order and associated with the correct data type. The Create Table component can be used to specify the metadata of the columns.
- The Metadata Fields property will insert metadata columns at the end of the existing table and overwrite the same number of columns unless additional columns are added via the Create Table component.
- The table's data and structure are only accessed at runtime, meaning that additional columns must be added (and data types set) before the job is run.
Snowflake Properties
Name
= string
A human-readable name for the component.
Stage
= drop-down
Select your stage. To learn more about stages in Matillion ETL, read Manage Stages.
Storage Integration
= drop-down
Select the storage integration. Storage integrations are required to permit Snowflake to read data from and write to a cloud storage location. Integrations must be set up in advance of selecting them in Matillion ETL. To learn more about setting up a storage integration, read Storage Integration Setup Guide.
Storage integrations can be configured to support Amazon S3, Google Cloud Storage, or Microsoft Azure cloud storage regardless of the cloud provider that hosts your Snowflake account.
Google Storage URL Location
= string
Input, or select via the file path tree, the URL of the Google Storage bucket from which to retrieve files. This follows a format such as gs://<bucket-name>/<path>.
When a user enters a forward slash character / after a folder name, a validation of the file path is triggered. This works in the same manner as the Go button.
Pattern
= string
A regular expression pattern string that specifies the file names and/or paths to match. File names should be preceded by .*, e.g. .*file_name.file_type.
For more information, please refer to the Snowflake documentation.
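Assuming this property maps to Snowflake's PATTERN copy option (as the Snowflake documentation reference above suggests), the following is a minimal sketch of an equivalent COPY statement issued from Python with the snowflake-connector-python package. The connection details, table name (my_table), and external stage name (my_gcs_stage) are hypothetical placeholders, not names generated by Matillion ETL.

```python
import snowflake.connector

# Hypothetical connection details; substitute real credentials and object names.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="public",
)
# Load only files whose path matches the regular expression,
# e.g. 2023/sales_april.csv but not 2023/returns_april.csv.
conn.cursor().execute(r"""
    COPY INTO my_table
    FROM @my_gcs_stage
    PATTERN = '.*sales_.*[.]csv'
    FILE_FORMAT = (TYPE = CSV)
""")
```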
Warehouse
= drop-down
The Snowflake warehouse used to run the queries. The special value, [Environment Default], will use the warehouse defined in the environment. Read Overview of Warehouses to learn more.
Database
= drop-down
The Snowflake database. The special value, [Environment Default], will use the database defined in the environment. Read Databases, Tables and Views - Overview to learn more.
Schema
= drop-down
The Snowflake schema. The special value, [Environment Default], will use the schema defined in the environment. Read Database, Schema, and Share DDL to learn more.
Target Table
= string
Choose a target table to load data into.
Warning
This table will be recreated and will drop any existing table of the same name.
Load Columns
= dual listbox
Choose which columns from your selected target table to load.
Format
= drop-down
Select the format.
File Type
= drop-down
The file type for the file format. Available file types are: CSV, JSON, PARQUET, and XML.
Component properties beneath this property will change depending on the chosen file type. Matillion ETL provides settings based on the specific file type.
For additional information on file type options, please refer to the Snowflake documentation.
Compression
= drop-down
Select the compression method if you wish to compress your data. If you do not wish to compress at all, select NONE. The default setting is AUTO. For more information, please refer to the relevant Snowflake documentation for your file type.
Record Delimiter
= string
(CSV only) Input a delimiter for records. This can be one or more single-byte or multibyte characters that separate records in an input file.
Note
- Accepted values include: leaving the field empty; a newline character \n or its hex equivalent 0x0a; a carriage return \r or its hex equivalent 0x0d. Also accepts a value of NONE.
- If you set Skip Header to a value such as 1, you should use a record delimiter that includes a line feed or carriage return, such as \n or \r. Otherwise, your entire file will be interpreted as the header row, and no data will be loaded.
- The specified delimiter must be a valid UTF-8 character and not a random sequence of bytes.
- Do not specify characters used for other file type options such as Escape or Escape Unenclosed Field.
- The default (if the field is left blank) is a newline character.
- This delimiter is limited to a maximum of 20 characters.
- While multi-character delimiters are supported, the record delimiter cannot be a substring of the field delimiter, and vice versa. For example, if the record delimiter is "aa", the field delimiter cannot be "aabb".
Field Delimiter
= string
(CSV only) Input a delimiter for fields. This can be one or more single-byte or multibyte characters that separate fields in a file.
Note
- Accepted characters include common escape sequences, octal values (prefixed by \\), or hex values (prefixed by 0x). Also accepts a value of NONE.
- This delimiter is limited to a maximum of 20 characters.
- While multi-character delimiters are supported, the field delimiter cannot be a substring of the record delimiter, and vice versa. For example, if the field delimiter is "aa", the record delimiter cannot be "aabb".
- The specified delimiter must be a valid UTF-8 character and not a random sequence of bytes.
- Do not specify characters used for other file type options such as Escape or Escape Unenclosed Field.
- Default setting is a comma (,).
Skip Header
= integer
(CSV only) Specify the number of rows to skip. The default is 0.
If Skip Header is used, the value of the record delimiter will not be used to determine where the header line ends. Instead, the specified number of CRLF line endings will be skipped.
For example, if the value of Skip Header = 1, then Matillion ETL will skip to the first CRLF that it finds. If you have set the Record Delimiter property to a single character that does not include a CRLF, Matillion ETL skips to the end of the file (treating the entire file as a header).
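As a rough illustration of how these CSV properties interact, the sketch below issues an equivalent COPY statement directly via the snowflake-connector-python package. All credentials and object names (my_table, my_gcs_stage) are hypothetical, and the SQL that Matillion ETL actually generates may differ.

```python
import snowflake.connector

# Hypothetical connection details and object names, for illustration only.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="public",
)
# Skip one header row; rows end with a newline and fields are comma-separated.
conn.cursor().execute(r"""
    COPY INTO my_table
    FROM @my_gcs_stage
    FILE_FORMAT = (
        TYPE = CSV
        RECORD_DELIMITER = '\n'   -- Record Delimiter
        FIELD_DELIMITER = ','     -- Field Delimiter
        SKIP_HEADER = 1           -- Skip Header
    )
""")
```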
Date Format
= string
(CSV only) Specify the date format. Snowflake accepts dates in the most common forms, including YYYY-MM-DD and DD-MON-YYYY among others. Additionally, all accepted timestamps are valid inputs for dates.
Please refer to the Snowflake documentation for more information about date and time formats.
Time Format
= string
(CSV only) Specify the time format. Snowflake supports a single TIME data type, for storing times in the form of HH:MI:SS. TIME supports an optional precision parameter for fractional seconds, e.g. TIME(3). Time precision ranges from 0 (seconds) to 9 (nanoseconds). The default precision is 9.
Please refer to the Snowflake documentation for more information about date and time formats.
Timestamp Format
= string
(CSV only) Specify the timestamp format. Snowflake supports three variations of timestamp: TIMESTAMP_LTZ, TIMESTAMP_NTZ, and TIMESTAMP_TZ. To learn more, please consult the Snowflake documentation.
Escape
= string
(CSV only) Specify a single character to be used as the escape character for field values that are enclosed. Default is NONE.
Escape Unenclosed Field
= string
(CSV only) Specify a single character to be used as the escape character for unenclosed field values only. Accepts common escape sequences, octal values, or hex values. Also accepts a value of NONE. The default is a backslash (\\).
If a character is specified in the "Escape" field, it will override this field.
If you have set a value in the property Field Optionally Enclosed, all fields will become enclosed, rendering the Escape Unenclosed Field property redundant, in which case it will be ignored.
Trim Space
= drop-down
(CSV only) Specify whether to remove white space from fields (True) or not (False). Default setting is False.
Field Optionally Enclosed
= string
(CSV only) Specify a character used to enclose strings. The value can be NONE, single quote character ('), or double quote character ("). To use the single quote character, use the octal or hex representation (0x27) or the double single-quoted escape (''). Default is NONE.
When a field contains one of these characters, escape the field using the same character. For example, to escape a string like this: 1 "2" 3, use double quotation to escape, like this: 1 ""2"" 3.
Null If
= string
Specify one or more strings (one string per row of the table) to convert to NULL values. When one of these strings is encountered in the file, it is replaced with a SQL NULL value for that field in the loaded table. Click + to add a string.
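For illustration, the strings supplied here behave like Snowflake's NULL_IF copy option. The sketch below shows the idea with hypothetical credentials and object names; it is not the exact statement Matillion ETL generates.

```python
import snowflake.connector

# Hypothetical connection details and object names, for illustration only.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="public",
)
# Any field matching one of the listed strings is loaded as SQL NULL.
conn.cursor().execute(r"""
    COPY INTO my_table
    FROM @my_gcs_stage
    FILE_FORMAT = (
        TYPE = CSV
        NULL_IF = ('NULL', 'null', '')
    )
""")
```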
Error On Column Count Mismatch
= drop-down
(CSV only) Specify whether to generate an error (True) if the number of delimited columns in an input file does not match the number of columns in the corresponding table. If set to False an error is not generated and the load continues. If the file is successfully loaded in this case:
- Where the input file contains records with more fields than columns in the table, the matching fields are loaded in order of occurrence in the file, and the remaining fields are not loaded.
- Where the input file contains records with fewer fields than columns in the table, the non-matching columns in the table are loaded with NULL values.
In Matillion ETL, the default setting is False.
Empty Field As Null
= drop-down
(CSV only) Specify whether to insert NULL values for empty fields in an input file. Default is True.
Encoding Type
= drop-down
(CSV only) Select the string that specifies the character set of the source data when loading data into a table. Please refer to the Snowflake documentation.
Enable Octal
= drop-down
(JSON only) Select True to enable the parsing of octal values. Default setting is False.
Allow Duplicates
= drop-down
(JSON only) Select True to allow duplicate object field names. Default setting is False.
Strip Outer Array
= drop-down
(JSON only) Select True to instruct the JSON parser to remove outer brackets. Default setting is False.
Strip Null Values
= drop-down
(JSON only) Select True to instruct the JSON parser to remove any object fields or array elements containing null values. Default setting is False.
Ignore UTF8 Errors
= drop-down
(JSON, XML only) Select True to replace any invalid UTF-8 sequences with Unicode replacement characters. If set to False, invalid UTF-8 sequences will produce an error in the job run. Default setting is False.
Preserve Space
= drop-down
(XML only) Select True for the XML parser to preserve leading and trailing spaces in element content. Default setting is False.
Strip Outer Element
= drop-down
(XML only) Select True for the XML parser to strip out any outer XML elements, exposing 2nd-level elements as separate documents. Default setting is False.
Disable Snowflake Data
= drop-down
(XML only) Select True if you wish the XML parser to not recognise Snowflake semi-structured data tags. Default setting is False.
Disable Auto Convert
= drop-down
(XML only) Select True to specify that the XML parser will disable automatic conversion of numeric and Boolean values from text to native representations. Default setting is False.
On Error
= drop-down
Decide how to proceed upon an error.
- Abort Statement: Aborts the load if any error is encountered.
- Continue: Continue loading the file.
- Skip File: Skip file if any errors are encountered in the file.
- Skip File When n Errors: Skip file when the number of errors in the file is equal to or greater than the specified number in the next property, n.
- Skip File When n% Errors: Skip file when the percentage of errors in the file exceeds the specified percentage, n.
Default setting is Abort Statement.
n
= integer
Specify the number of errors or the percentage of errors required for Matillion ETL to skip the file.
This property only accepts integer values. The % character is not accepted; specify percentages as a number only.
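These two properties together behave like Snowflake's ON_ERROR copy option. As a hedged sketch (hypothetical credentials and object names throughout), choosing Skip File When n Errors with n = 10 is roughly equivalent to the following:

```python
import snowflake.connector

# Hypothetical connection details and object names, for illustration only.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="public",
)
# ON_ERROR values: 'CONTINUE', 'SKIP_FILE', 'SKIP_FILE_10' (skip after 10 errors),
# 'SKIP_FILE_10%' (skip after 10% errors), or 'ABORT_STATEMENT' (the default).
conn.cursor().execute(r"""
    COPY INTO my_table
    FROM @my_gcs_stage
    FILE_FORMAT = (TYPE = CSV)
    ON_ERROR = 'SKIP_FILE_10'
""")
```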
Size Limit (B)
= integer
Specify the maximum size, in bytes, of data to be loaded for a given COPY statement. If the maximum is exceeded, the COPY operation discontinues loading files. For more information, please refer to the Snowflake documentation.
Purge Files
= drop-down
Select True to purge data files after the data is successfully loaded. Default setting is False.
Truncate Columns
= drop-down
When True, Matillion ETL will automatically truncate strings to the target column length.
When False, the COPY statement produces an error if a loaded string exceeds the target column length.
Default setting is False.
Force Load
= drop-down
Select True to load all files, regardless of whether they have been loaded previously and haven't changed since they were loaded. Default setting is False.
When set to True, this option reloads files and can lead to duplicated data in a table.
Metadata Fields
= dual listbox
Snowflake metadata columns available to include in the load.
Snowflake automatically generates metadata for files in internal stages (i.e. Snowflake) and external stages (Google Cloud Storage, Microsoft Azure, or Amazon S3). This metadata is "stored" in virtual columns. These metadata columns are added to the staged data, but are only added to the table when included in a query of the table. For more information, read Querying Metadata for Staged Files.
This property is only available when an external stage is selected. To manage stages, click the Environments panel in the bottom-left, then right-click a Matillion ETL environment, and click Manage Stages. To learn more, read Manage Stages.
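To illustrate what including metadata fields means in Snowflake terms, the sketch below selects the METADATA$FILENAME and METADATA$FILE_ROW_NUMBER virtual columns from an external stage and loads them into extra columns of the target table. The credentials, column, table, and stage names are hypothetical, and the statement Matillion ETL generates may differ.

```python
import snowflake.connector

# Hypothetical connection details and object names, for illustration only.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="public",
)
# The target table needs the extra columns (src_file, src_row) to exist already,
# e.g. added beforehand with the Create Table component.
conn.cursor().execute(r"""
    COPY INTO my_table (col_a, col_b, src_file, src_row)
    FROM (
        SELECT t.$1, t.$2, METADATA$FILENAME, METADATA$FILE_ROW_NUMBER
        FROM @my_gcs_stage t
    )
    FILE_FORMAT = (TYPE = CSV)
""")
```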
Google BigQuery Properties
Name
= string
A human-readable name for the component.
Project
= drop-down
Select the Google Cloud project. The special value, [Environment Default], will use the project defined in the environment. For more information, read Creating and managing projects.
Dataset
= drop-down
Select the Google BigQuery dataset to load data into. The special value, [Environment Default], will use the dataset defined in the environment. For more information, read Introduction to datasets.
Target Table
= string
Choose a target table to load data into.
Warning
This table will be recreated and will drop any existing table of the same name.
Load Columns
= dual listbox
Choose which columns from your selected target table to load.
Google Storage URL Location
= string
Select the Google Cloud Storage bucket. Users can click through the file tree, or use the URL template: gs://<bucket>/<path>.
When a user enters a forward slash character / after a folder name, a validation of the file path is triggered. This works in the same manner as the Go button.
File Format
= drop-down
Select the type of expected data to load. Some file formats may require additional formatting, which is explained in the GCP documentation.
Available options are: AVRO, Cloud Datastore Backup, CSV, JSON (newline delimited), ORC, and PARQUET.
Component properties will change to reflect the choice made here, and Matillion ETL will provide options based on the specific file type.
Write Preference
= drop-down
Select your write preference (see the sketch after this list):
- Append to Table: Append an existing table with the loaded data.
- Overwrite Table: Overwrite an existing table with the loaded data.
- Write if Empty: Only write loaded data if the target table is empty.
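For readers who want to relate these choices to the underlying BigQuery load job, the sketch below uses the google-cloud-bigquery Python client; the bucket, dataset, and table names are hypothetical, and Matillion ETL neither exposes nor requires this code. In BigQuery terms, Append to Table, Overwrite Table, and Write if Empty correspond to the WRITE_APPEND, WRITE_TRUNCATE, and WRITE_EMPTY write dispositions respectively.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # "Append to Table"
    # WRITE_TRUNCATE -> "Overwrite Table", WRITE_EMPTY -> "Write if Empty"
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/sales.csv",   # hypothetical source files
    "my-project.my_dataset.my_table",     # hypothetical target table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```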
Number of Errors Allowed
= integer
The maximum number of individual parsing errors tolerated before the whole load fails. Errors up to this number are substituted with null values. This value defaults to 0.
Ignore Unknown Values
= drop-down
- Yes: Accept rows that contain values that do not match the schema. Unknown values are ignored.
- Note: Will ignore extra values at the end of a line for CSV files.
- No: Omit any rows with invalid values.
Delimiter
= string
(CSV format only) Specify a delimiter character to separate columns. The default value is a comma ,.
A [TAB] character can be specified as "\t".
CSV Quoter
= string
Quote character to use to enclose records. Default (field left blank) is double-quote ("). To indicate no quote character at all, use an empty string.
Encoding
= drop-down
The encoding type to use on the output data. This defaults to UTF-8.
Header rows to skip
= integer
The number of rows at the top of the file to ignore. Defaults to 0.
Allow Quoted Newlines
= drop-down
- Yes: Allow a CSV value to contain a newline character when the value is encased in quotation marks.
- No: A newline character, regardless of quotation marks, is always considered a new row.
Allow jagged rows
= drop-down
- Yes: Missing values are treated as "null" but accepted.
- No: Rows with missing data are treated as bad records. Note: a bad record will count toward the Number of Errors Allowed limit.
Null Marker
= string
The string value that represents a NULL value in the text file.
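The CSV-related properties above map closely onto BigQuery load job configuration fields. The following is a minimal, hedged sketch using the google-cloud-bigquery Python client with hypothetical bucket, dataset, and table names; it is shown only to clarify what each property controls, not as the call Matillion ETL makes.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=",",          # Delimiter
    quote_character='"',          # CSV Quoter
    encoding="UTF-8",             # Encoding
    skip_leading_rows=1,          # Header rows to skip
    allow_quoted_newlines=True,   # Allow Quoted Newlines
    allow_jagged_rows=False,      # Allow jagged rows
    null_marker="NULL",           # Null Marker
    max_bad_records=0,            # Number of Errors Allowed
    ignore_unknown_values=True,   # Ignore Unknown Values
)
client.load_table_from_uri(
    "gs://my-bucket/exports/*.csv",     # hypothetical source files
    "my-project.my_dataset.my_table",   # hypothetical target table
    job_config=job_config,
).result()
```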
| Snowflake | Delta Lake on Databricks | Amazon Redshift | Google BigQuery | Azure Synapse Analytics |
|---|---|---|---|---|
| ✅ | ❌ | ❌ | ✅ | ❌ |