Azure Blob Load Generator

The Azure Blob Load Generator component loads delimited data from public objects in Azure Blob Storage. The load wizard lets users load and view files on the fly, altering load component properties and observing their effects, without the need for a separate transformation job. The generator can also guess the schema of a table, saving the user much of the setup work.

This component requires working Azure credentials with "read" access to the Azure Blob Storage location that contains the source files.

Note

  • Azure Data Lake Storage Gen2 provides a set of analytical capabilities built on Azure Blob Storage.
  • Azure Data Lake Storage Gen2 merges the capabilities of Azure Data Lake Storage Gen1 with Azure Blob Storage. For more information, and to migrate from Gen1 to Gen2, read Introduction to Azure Data Lake Storage Gen2.

The following section explains how to configure the Azure Blob Load Generator in Matillion ETL for Snowflake, Synapse, and Delta Lake.


Configuring the Azure Blob Load Generator wizard

  1. Create a new orchestration job.
  2. In the Components tab, search for the Azure Blob Load Generator, and drag it onto the job canvas.

    Note

    Once the component has been dropped onto the job canvas, the three-page Load Generator wizard opens immediately at the Get Sample page.

  3. In the Azure Blob Storage field, click ... to select a file from an existing Azure storage location.

    Note

    Alternatively, you can type the Azure storage location URL directly into the Azure Blob Storage field, using the template azure://<account>/<container>/<path>. The file must be delimited (for example, .csv), and you must have permission to access it. A sketch after this section shows how the template breaks down.

  4. In the Compression field, use the drop-down menu to select the file's compression method, if any. If the file is compressed, the correct compression method must be selected; supported methods are gzip and BZip2, and the available options differ depending on the chosen data warehouse. (The sketch after this section also shows one way to identify a file's compression from its magic bytes.)

  5. In the Row Limit field, select the number of rows to be returned from the sample file, then click Get Sample.

    Note

    The wizard will load the file's data and attempt to guess the schema. The raw data will be displayed in the sample box beneath these fields.

  6. Click Next to progress to the next page of the wizard.

  7. On the Guess Schema page, the input file's data will be displayed automatically in the section at the top of the dialog. You may want to scroll through the list to view the data in full.
  8. Click Guess Schema and the Azure Blob Load Generator will attempt to guess as many of the file's properties as possible (a naive sketch of this kind of type inference appears after this section). All of the column data in the lower-right panel is available to edit. Click + to add as many columns as you require; you will need to specify the Name, Type, Size, and Decimal Places values. To remove a column, click -.
  9. Schema Configuration properties are displayed in the lower-left panel. Modify the fields relevant to the file type. Click Next to progress to the final page of the wizard.
  10. On the Validation page, the input file's data will be displayed in the panel at the top of the dialog. To view the resulting table's output, click Test in the lower left.

    Note

    Running a test will create the requested table on the relevant data warehouse cluster and show a sample for you to inspect.

  11. To modify the output data, click Back to return to previous steps within the wizard, and make property and column alterations. Once modifications have been made, click Create and Run to complete the wizard.

The wizard will close, returning you to the job canvas, and create two linked components: Create Table and Azure Blob Storage Load. Each component is parameterized by the Azure Blob Load Generator and can be run as a job by linking it to a Start component.
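
As a rough illustration of the location template from step 3 and the compression options from step 4, the following sketch splits an azure:// URL into its parts and guesses a file's compression from its magic bytes. This is illustrative Python, not Matillion code, and the function names are hypothetical.

    # Hypothetical helpers: split an azure://<account>/<container>/<path>
    # location into its parts, and sniff gzip/BZip2 from magic bytes.
    from urllib.parse import urlparse

    def parse_azure_location(url: str) -> dict:
        """Split azure://<account>/<container>/<path> into its three parts."""
        parsed = urlparse(url)
        if parsed.scheme != "azure":
            raise ValueError("expected an azure:// URL")
        container, _, path = parsed.path.lstrip("/").partition("/")
        return {"account": parsed.netloc, "container": container, "path": path}

    def sniff_compression(first_bytes: bytes) -> str:
        """Guess gzip or BZip2 from a file's first few (magic) bytes."""
        if first_bytes[:2] == b"\x1f\x8b":  # gzip magic number
            return "gzip"
        if first_bytes[:3] == b"BZh":  # bzip2 magic number
            return "BZip2"
        return "none"

    print(parse_azure_location("azure://myaccount/mycontainer/data/sales.csv"))
    # {'account': 'myaccount', 'container': 'mycontainer', 'path': 'data/sales.csv'}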
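
The Guess Schema step (step 8) performs type inference over the sampled rows. The sketch below shows a naive version of the idea: choose the narrowest type that fits every sampled value in a column. It is an illustration only; Matillion's actual heuristics are not documented here.

    # Naive type inference over sampled CSV rows (illustration only).
    import csv
    import io
    from datetime import datetime

    def guess_type(values):
        """Return the narrowest type name that every sampled value satisfies."""
        def all_parse(parser):
            try:
                for value in values:
                    parser(value)
                return True
            except ValueError:
                return False
        if all_parse(int):
            return "INTEGER"
        if all_parse(float):
            return "FLOAT"
        if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
            return "DATE"
        return "VARCHAR"

    sample = "id,amount,sold_on\n1,9.99,2023-04-01\n2,12.50,2023-04-02\n"
    header, *rows = list(csv.reader(io.StringIO(sample)))
    for i, name in enumerate(header):
        print(name, guess_type([row[i] for row in rows]))
    # id INTEGER, amount FLOAT, sold_on DATE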


Properties

Snowflake

Name = string

A human-readable name for the component.


Field Delimiter = string

The delimiter that separates columns. The default is a comma (,). A [TAB] character can be specified as \t.
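
For example, a pipe-delimited file sets this property to |, so a row such as the following splits into three columns:

    1|Acme|9.99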


Field Optionally Enclosed By = string

A character that is used to enclose strings. Can be single quote (') or double quote (") or NONE (default). Note that the character chosen can be escaped by that same character.
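
For example, with double quote (") as the enclosing character, a double quote inside a field is escaped by doubling it:

    1,"She said ""stop""",2023-04-01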


Ignore Header Rows = integer

The number of rows at the top of the file to ignore. Default is 0.


Date Format = string

Defaults to "auto". This can be used to manually specify a date format.


Timestamp Format = string

Defaults to "auto". This can be used to manually specify a timestamp format.


Escape Unenclosed Field = string

(CSV only) Specify a single character to be used as the escape character for unenclosed field values only. Accepts common escape sequences, octal values, or hex values, as well as a value of NONE. The default is \.

If a character is specified in the Escape field, it overrides this field.

If you have set a value in the Field Optionally Enclosed By property, all fields become enclosed, rendering the Escape Unenclosed Field property redundant; it is then ignored.
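
For example, with the default \ escape character, a comma inside an unenclosed field is written as:

    1,two\, three,4

which loads the middle field as two, three.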


Escape = string

(CSV only) Specify a single character to be used as the escape character for field values that are enclosed. Default is NONE.


Null As = string

This option replaces the specified string with NULL in the output table. Use this if your data has a particular representation of missing data.
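
For example, if missing values appear in the file as \N, set this property to \N so that a row such as 1,\N,3 loads its middle column as NULL.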


Trim Blanks = checkbox

Tick to remove trailing and leading white space from the input data.


Error on Column Count = checkbox

If enabled, a parsing error is generated when the number of delimited columns (fields) in the input data file does not match the number of columns in the corresponding table.


Empty Field As Null = checkbox

If ticked, empty columns in the input file will become NULL.


Delta Lake

Name = string

A human-readable name for the component.


Field Delimiter = string

The delimiter that separates columns. The default is a comma (,). A [TAB] character can be specified as \t.


Date Format = string

(CSV and JSON only) Manually set a date format. If none is set, the default (auto) is yyyy-MM-dd.


Timestamp Format = string

(CSV and JSON only) Manually set a timestamp format. If none is set, the default (auto) is yyyy-MM-dd'T'HH:mm:ss.[SSS][XXX].
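
For example, the default pattern matches values such as 2023-04-01T12:30:45.123+01:00, where the bracketed sections (fractional seconds and time zone offset) are optional.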


Null As = string

This option replaces the specified string with NULL in the output table. Use this if your data has a particular representation of missing data.


Ignore Header Rows = integer

The number of rows at the top of the file to ignore. Default is 0.


Synapse

Name = string

A human-readable name for the component.


Field Delimiter = string

The delimiter that separates columns. The default is a comma (,). A [TAB] character can be specified as \t.


Field Optionally Enclosed By = string

A character that is used to enclose strings. Can be single quote (') or double quote (") or NONE (default). Note that the character chosen can be escaped by that same character.


First Row = integer

(CSV only) Specify the first row that is read in all files for the COPY command. The default value is 1.


Date Format = string

(CSV only) Specify the date format used to map dates to SQL Server date formats. To learn more about Transact-SQL date and time data types and functions, read Microsoft's documentation.
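
SQL Server recognizes the orderings mdy, dmy, ymd, ydm, myd, and dym; for example, ymd matches dates such as 2023-04-01.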


Max Errors = integer

Specify the maximum number of rejected rows allowed in the load before the COPY operation is cancelled. Each row that the COPY operation cannot import is ignored and counted as one error. The default value for this property is 0.