Create External Table
Create External Table is an orchestration component that creates a table to reference data stored in an Amazon S3 bucket. The table references external data, meaning the table itself does not hold the data.
If the component requires access to a cloud provider, it will use credentials as follows:
- If using Matillion Full SaaS: The component will use the cloud credentials associated with your environment to access resources.
- If using Hybrid SaaS: By default the component will inherit the agent's execution role (service account role). However, if there are cloud credentials associated to your environment, these will overwrite the role.
Referencing externally held data can be valuable when wanting to query large datasets without resorting to storing that same volume of data on the Amazon Redshift cluster.
External tables are part of Amazon Redshift Spectrum, and may not be available in all regions. For a list of supported regions, read Amazon Redshift endpoints and quotas.
For more information about working with external tables, read Creating external tables for Redshift Spectrum.
Properties
External Schema
= drop-down
Select the table schema. The special value, [Environment Default], will use the schema defined in the Data Productivity Cloud environment. For more information on using multiple schemas, read Schemas.
New Table Name
= string
The name of the external table to be created or used.
Method
= drop-down
- Create: Create the new table with the given name. The pipeline will fail if a table of that name already exists.
- Create if not exists: Create the new table with the given name unless one already exists. The pipeline will succeed and continue in either case. If the schema of the existing table does not match the schema defined in this component, no attempt is made to fix or correct it, which could lead to errors later in the pipeline if you did not expect an existing table to exist, or to have a different schema to the one defined in this component.
- Replace: Create the new table, potentially overwriting any existing table of the same name.
Since other database objects depend upon this table, DROP ...
cascade is used, which may actually remove many other database objects.
Table Metadata
= column editor
- Name: The name of the new column.
- Type: Select the data type.
- Text: Can hold any type of data, subject to a maximum size.
- Integer: Suitable for whole-number types (no decimals).
- Numeric: Suitable for numeric types, with or without decimals.
- Real: Suitable for data of a single precision floating-point number.
- Double Precision: Suitable for data of a double precision floating-point number.
- Boolean: Suitable for data that is either true or false.
- Date: Suitable for dates without times.
- DateTime: Suitable for dates, times, or timestamps (both date and time).
- Super: Suitable for storing semi-structured data or documents as values.
- Size: For text types, this is the maximum length. This is a limit on the number of bytes, not characters. For Amazon Redshift, since all data is stored using UTF-8, any non-ASCII character will count as 2 or more bytes. For numeric types, this is the total number of digits allowed, whether before or after the decimal point.
- Scale: Scale is the number of digits to the right of the decimal point in a Float. For example, the number 123.45 has a scale of 2.
To use grid variables, tick the Use Grid Variable checkbox at the bottom of the Table Metadata dialog.
Click the Text Mode toggle at the bottom of the Table Metadata dialog to open a multi-line editor that lets you add items in a single block. For more information, read Text mode.
Partition
= column editor
Assign one or more columns in this table as potential partitions.
Partition columns allow queries on large datasets to be optimized when that query is made against the columns chosen as partition columns. When a partition is created, values for that column become distinct Amazon S3 storage locations, allowing rows of data in a location that is dependent on their partition column value.
For example, it is common for a date column to be chosen as a partition column, thus storing all other data according to the date it belongs to. When creating partitioned data using the Add Partition component, it is vital that those partitioned columns have already been marked using this property.
To use grid variables, tick the Use Grid Variable checkbox at the bottom of the Partition dialog.
Location
= column editor
The Amazon S3 bucket location for the external table data.
If the component requires access to a cloud provider, it will use credentials as follows:
- If using Matillion Full SaaS: The component will use the cloud credentials associated with your environment to access resources.
- If using Hybrid SaaS: By default the component will inherit the agent's execution role (service account role). However, if there are cloud credentials associated to your environment, these will overwrite the role.
Format
= drop-down
Choose a file format for the source file.
Field Terminator
= string
(TEXTFILE only) The delimiter to be used that separates fields (columns) in the file. Defaults to \\A
.
Line Terminator
= string
(TEXTFILE only) The delimiter to be used that separates records (rows) in the file. Defaults to newline.
\
can also signify a newline.\\r
can signify a carriage return.
Skip Header Rows
= string
The number of rows at the top of the file to skip. The default setting is an empty field.
Strip Outer Array
= drop-down
(JSON only) Strips the outer array from the JSON file, enabling JSON files that contain a single, anonymous array to be loaded without error. The default setting is No.
Snowflake | Databricks | Amazon Redshift |
---|---|---|
❌ | ❌ | ✅ |