S3 Load Generator (Redshift)
Overview
S3 Load Generator is a wizard that helps you load delimited data from objects in an S3 bucket (Amazon Simple Storage Service).
Unlike most components, the Load Generator does not appear as a standalone component when dragged onto the job canvas. Instead, it takes the form of a wizard that lets you load and view files on the fly, altering load component properties and observing their effects, without the need for a separate Transformation job. The wizard can also guess the schema of a table, relieving you of much of the work.
Please Note
This component requires working AWS credentials with "read" access to the bucket containing the source data file(s). This is achieved by attaching an IAM role to the instance when launching Matillion ETL for Redshift; however, it can also be managed by editing an Environment. See the example at the bottom of this article.
Furthermore, Matillion requires use of a policy that contains the s3:ListBucket action, such as the policy provided in the Managing Credentials documentation.
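As a rough illustration, a minimal IAM policy granting such read access might look like the following; the bucket name my-matillion-source-bucket is a placeholder, and the Managing Credentials documentation describes the recommended policy in full.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExampleListBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::my-matillion-source-bucket"]
    },
    {
      "Sid": "ExampleGetObject",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::my-matillion-source-bucket/*"]
    }
  ]
}
```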
The following section explains how to configure the S3 Load Generator in Matillion ETL for Redshift:
Configuring the S3 Load Generator Wizard
1. Begin by loading Matillion ETL for Redshift, and create an Orchestration job in the Project.
Create an Orchestration job
2. In the Components tab, search for the S3 Load Generator, and drag it on to the canvas.
S3 Load Generator component
Please Note
The three-page S3 Load Generator wizard will open at its Get Sample page as soon as the component is dropped onto the canvas.
3. In the Amazon S3 Storage field, use the button provided to select a file from an existing S3 bucket.
Please Note
The file must contain delimited data (such as a .csv file), and you must have permission to access it.
4. In the Compression field, use the drop-down menu provided to specify how the file is compressed. If the file is compressed, the correct compression method must be selected; supported compression methods are gzip and BZip2.
5. In the Encoding field, use the drop-down menu provided to specify the file's encoding. Supported encoding types are UTF8 (the default), UTF16, UTF16BE, and UTF16LE.
6. In the Row Limit field, set the number of rows to be sampled from the file, then click Get Sample.
Please Note
The wizard will load the selected file's data and attempt to guess the schema. The raw data is displayed in the sample box beneath the fields described above.
7. Click Next to progress to the next page of the wizard.
Get sample
8. In the Guess Schema page, the input file's data will be automatically displayed in the section at the top of the dialog. You might want to scroll through the list to view the entirety of the data.
9. Column data is displayed in the lower-right panel, and is available for editing. By default, columns are named "Column1", "Column2", and so on. For the purposes of this document, "Column1" and "Column2" are renamed to something more descriptive: States and State Codes, respectively.
10. Schema Configuration properties are displayed in the lower-left panel. Modify the fields relevant to the file type; the available properties are detailed in the Properties section, below.
11. Click Next to progress to the final page of the wizard.
Please Note
Clicking Guess Schema instructs the S3 Load Generator to attempt to guess as many of these properties as possible. However, all of the properties remain available for you to edit, as described above.
Guess schema
12. On the Validation page, the input file's data will be displayed in the panel at the top of the dialog. To view the resulting table's output, click Test, situated at the bottom left of the dialog.
Please Note
Running a test will create the requested table on the Redshift cluster, and show a sample for you to inspect.
13. To modify the output data, click Back, situated at the bottom centre of the dialog, to return to previous steps within the wizard and make property and column alterations.
14. Click Create & Run, situated to the right of the Back button, to complete the wizard.
Validation
The wizard will close, return you to the job interface, and create two linked components: Create/Replace Table and Load. Both components are parameterised by the S3 Load Generator, and can be run as a job by linking them to a Start component.
Job canvas
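Although the exact SQL is generated for you, the two components are broadly equivalent to a CREATE TABLE statement followed by a Redshift COPY command. The following is a hypothetical sketch using the example columns from step 9; the table name, S3 path, IAM role ARN, and column types are placeholder assumptions, not values the wizard necessarily produces:

```sql
-- Hypothetical sketch of what Create/Replace Table and Load execute.
-- Table name, S3 path, IAM role ARN, and column types are placeholders.
CREATE TABLE "states" (
    "States"      VARCHAR(255),
    "State Codes" VARCHAR(255)
);

COPY "states"
FROM 's3://my-matillion-source-bucket/states.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
DELIMITER ','
IGNOREHEADER AS 1
DATEFORMAT AS 'auto';
```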
Properties
For more information on all of the settings in this component, see the Amazon Redshift COPY syntax documentation.
Property | Setting | Description |
---|---|---|
Table Name | Text | The descriptive name of the table. |
Data Type | Select | CSV: Comma Separated Values file, delimited by commas. Delimited: Any file containing delimited data. |
Delimiter | Text | (Only available if 'Data Type' is 'Delimited'.) The delimiter that separates columns; the default is a comma (,). A [TAB] character can be specified as "\t". |
CSV Quoter | Text | Specifies the character to be used as the quote character when using the CSV option. |
Region | Select | The Amazon S3 region hosting the S3 bucket. This is not normally required and can be left as "None" if the bucket is in the same region as your Redshift cluster. |
Replace Invalid Characters | Text | If the data contains any invalid Unicode characters, this parameter specifies the single-character replacement for them. Defaults to '?'. |
Max Errors | Text | The maximum number of individual parsing errors allowed before the whole load fails. Values up to this number will be substituted as null values. Defaults to 0; note that the Amazon default is 1000. |
Date Format | Text | Defaults to 'auto' - this can be used to manually specify a date format. |
Time Format | Text | Defaults to 'auto' - this can be used to manually specify a time format. |
Ignore Header Rows | Text | The number of rows at the top of the file to ignore - defaults to 0. |
Null As | Text | This option replaces the specified string with null in the output table. Use this if your data has a particular representation of missing data. |
Accept Any Date | Checkbox | If this is enabled, invalid dates such as '45-65-2018' are not considered an error, but will be loaded as a null value. |
Ignore Blank Lines | Checkbox | If this is enabled, any blank lines in the input file are ignored. |
Truncate Columns | Checkbox | If this is enabled, any instance of data in the input file that is too long to fit into the specified target column width will be truncated to fit instead of causing an error. |
Fill Record | Checkbox | If this is enabled, Matillion ETL allows data files to be loaded when contiguous columns are missing at the end of some of the records. The remaining columns are set to null. |
Trim Blanks | Checkbox | If this is enabled, Matillion ETL removes trailing and leading whitespace from the input data. |
Empty As Null | Checkbox | If this is enabled, empty columns in the input file will become NULL. |
Blank As Null | Checkbox | If this is enabled, blank columns in the input file will become NULL. |
Comp Update | Checkbox | Controls whether compression encodings are automatically applied during a COPY. This is usually a good idea to optimise the compression used when storing the data. |
Stat Update | Checkbox | Governs automatic computation and refresh of optimizer statistics at the end of a successful COPY command. |
Round Decimals | Checkbox | If this is enabled, any decimal value is rounded to fit the target column when the input data contains more decimal places than the target column allows. |
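Most of these properties correspond directly to parameters of the Amazon Redshift COPY command issued by the generated Load component. The sketch below is illustrative only: the table name, S3 path, and IAM role ARN are placeholder assumptions, and the option values simply echo the defaults and examples described above.

```sql
-- Illustrative mapping of wizard properties to COPY parameters.
-- The table name, S3 path, and IAM role ARN are placeholders.
COPY "states"
FROM 's3://my-matillion-source-bucket/states.csv.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
GZIP                     -- Compression
ENCODING AS UTF8         -- Encoding
DELIMITER ','            -- Data Type / Delimiter
IGNOREHEADER AS 1        -- Ignore Header Rows
ACCEPTINVCHARS AS '?'    -- Replace Invalid Characters
MAXERROR AS 0            -- Max Errors
DATEFORMAT AS 'auto'     -- Date Format
TIMEFORMAT AS 'auto'     -- Time Format
NULL AS 'N/A'            -- Null As
ACCEPTANYDATE            -- Accept Any Date
IGNOREBLANKLINES         -- Ignore Blank Lines
TRUNCATECOLUMNS          -- Truncate Columns
FILLRECORD               -- Fill Record
TRIMBLANKS               -- Trim Blanks
EMPTYASNULL              -- Empty As Null
BLANKSASNULL             -- Blank As Null
COMPUPDATE ON            -- Comp Update
STATUPDATE ON            -- Stat Update
ROUNDEC;                 -- Round Decimals
```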