JSON Load
JSON Load is an orchestration component that allows you to load data into a table from one or more JSON files in a source location. It automatically infers the schema from the structure of the selected file(s), creates the corresponding table, and loads the data into it.
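The component performs schema inference automatically. Purely as an illustration of the idea, the minimal Python sketch below (with a made-up sample record and a simplified type mapping) shows how top-level JSON keys might map to table columns and types.

```python
import json

# Hypothetical sample record taken from one of the selected JSON files.
sample = '{"order_id": 1001, "customer": "Acme", "total": 99.95, "shipped": true}'
record = json.loads(sample)

# Simplified mapping from Python types to warehouse-style column types,
# used here only to illustrate the inference step.
type_map = {int: "INTEGER", float: "FLOAT", str: "VARCHAR", bool: "BOOLEAN"}

for key, value in record.items():
    print(f"{key} -> {type_map.get(type(value), 'VARIANT')}")
# order_id -> INTEGER
# customer -> VARCHAR
# total -> FLOAT
# shipped -> BOOLEAN
```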
Authentication configuration
You can authorize this component using either of these two methods:
- Cloud Provider Credentials: By default, explicit Cloud Provider Credentials are used to facilitate cross-account or cross-cloud data loads.
- Agent Identity: If cloud provider credentials are not specified, the component attempts to use the Matillion Agent's native service identity to access the source data.
While the agent's identity is sufficient for resources within its own environment, explicit cloud provider credentials are advised to ensure access to source data residing in a different cloud provider or account. This approach provides the flexibility required for diverse data locations and maintains consistency with the authentication models used in the Excel Query and Google Sheets components.
Properties
Name = string
A human-readable name for the component.
Connect
Type = drop-down
- Snowflake Managed: Load your file from a Snowflake internal stage.
- S3: Load your file from an S3 bucket.
- Google Cloud Storage: Load your file from a Google Cloud Storage bucket.
- Azure Blob Storage: Load your file from an Azure Blob Storage container.
Stage = drop-down
Select a staging area for the data. The special value [New Stage] will create a temporary stage to be used for loading the data when the corresponding parameter values are provided.
The options in this drop-down menu depend on the values you select for the Database and Schema parameters. If you change these values, the list of available options updates automatically, and the previously selected option may become invalid.
When [New Stage] is selected, the component uses the cloud credentials configured for your environment to access the required resources.
Staged file path = string (optional)
A stage may include directories. You can browse and select files from a specific directory path, for example, /Example/Path.
If this is left blank, and if the Pattern parameter is also empty, all files from the stage will be loaded.
S3 object prefix = file explorer
Use the file explorer to enter the path to the S3 bucket and folder holding the intended files, or select from the list of S3 buckets.
This must have the format s3://<bucket>/<path>.
Encryption = drop-down
Decide how the files are encrypted inside the S3 bucket. This property is available when using an existing Amazon S3 location for staging.
- Client Side Encryption: Encrypt the data according to a client-side master key. For more information, read Protecting data using client-side encryption.
- KMS Encryption: Encrypt the data according to a key stored on KMS. For more information, read Using server-side encryption with AWS KMS keys.
- S3 Encryption: Encrypt the data according to a key stored on an S3 bucket. For more information, read Using server-side encryption with Amazon S3-managed encryption keys (SSE-S3).
- None: No encryption.
KMS key ID = drop-down
The ID of the KMS encryption key you have chosen to use in the Encryption property.
Only available when Encryption is set to KMS Encryption.
Master key = drop-down
The secret definition denoting your master key for client-side encryption. The master key must be saved as a secret definition before using this component.
Only available when Encryption is set to Client Side Encryption.
Storage integration = drop-down
Select the storage integration. Storage integrations are required to permit Snowflake to read from and write to a cloud storage location. Integrations must be set up in advance and configured to support Google Cloud Storage. For more information, read Create storage integration.
Google Storage URL location = file explorer
Use the file explorer to enter the path to the Google Cloud Storage bucket and folder holding the intended files, or select from the list of Google Cloud Storage buckets.
This must have the format gs://<bucket>/<path>.
Azure storage location = file explorer
Use the file explorer to enter the path to the Azure Storage container holding the intended files, or select from the list of storage accounts.
This must have the format azure://<STORAGEACCOUNT>/<path>.
Pattern = string (optional)
A valid regular expression (regex) to match part of a file's path or name. Files that match the pattern will be included in the load.
If this parameter is left empty, the component automatically uses the pattern *, which matches all files within the specified Staged file path or Stage parameters. The pattern applies to the entire file path, not just the directory defined in Staged file path.
The subfolder containing the object to load must be included here.
Note
- Only JSON files are supported.
- Ensure that the pattern entered correctly targets the files you intend to load.
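Because the pattern is matched against the entire file path, a regex typically needs to account for any leading directories. The minimal Python sketch below (standard `re` module, hypothetical file paths and pattern) illustrates the matching behavior.

```python
import re

# Hypothetical staged file paths; the pattern is applied to the full path,
# not just the file name.
paths = [
    "loads/2024/sales_jan.json",
    "loads/2024/sales_feb.json",
    "loads/2024/inventory_jan.json",
]

pattern = re.compile(r".*sales_.*[.]json")

for path in paths:
    status = "included" if pattern.fullmatch(path) else "skipped"
    print(f"{path}: {status}")
# loads/2024/sales_jan.json: included
# loads/2024/sales_feb.json: included
# loads/2024/inventory_jan.json: skipped
```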
Type = drop-down
- S3: Load your file from an S3 bucket.
- Google Cloud Storage: Load your file from a Google Cloud Storage bucket.
- Azure Blob Storage: Load your file from an Azure Blob Storage container.
S3 object prefix = file explorer
Use the file explorer to enter the path to the S3 bucket and folder holding the intended files, or select from the list of S3 buckets.
This must have the format s3://<bucket>/<path>.
Google Storage URL location = file explorer
Use the file explorer to enter the path to the Google Cloud Storage bucket and folder holding the intended files, or select from the list of Google Cloud Storage buckets.
This must have the format gs://<bucket>/<path>.
For this to work, an external location with access to the specified location must be configured. For more information, read External Locations | Databricks on AWS.
Azure storage location = file explorer
Use the file explorer to enter the path to the Azure Storage container holding the intended files, or select from the list of storage accounts.
This must have the format azure://<STORAGEACCOUNT>/<path>.
Glob pattern = string (optional)
A valid glob pattern to match part of a file's path or name. Files that match the pattern will be included in the load.
If this parameter is left empty, the component automatically uses the pattern *, which matches all files within the specified location parameters. The pattern applies to the entire file path, not just the directory defined in the location parameter.
Note
- Only JSON files are supported.
- Ensure that the pattern entered correctly targets the files you intend to load.
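As an illustration of glob syntax only (not the component's exact matching rules), the short Python sketch below uses `pathlib` with hypothetical file paths to show how different globs select files.

```python
from pathlib import PurePosixPath

# Hypothetical file paths under the configured location.
paths = [
    "2024/01/orders.json",
    "2024/02/orders.json",
    "2024/02/orders.csv",
]

# PurePosixPath.match applies shell-style globbing in which "*" does not
# cross directory separators; "*.json" matches on the file name, while
# "2024/*/*.json" constrains the directory structure as well.
for glob in ("*.json", "2024/*/*.json"):
    matched = [p for p in paths if PurePosixPath(p).match(glob)]
    print(glob, "->", matched)
# *.json -> ['2024/01/orders.json', '2024/02/orders.json']
# 2024/*/*.json -> ['2024/01/orders.json', '2024/02/orders.json']
```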
Configure
File format = drop-down
The default value is [New File Format], which creates a temporary file format with the settings specified below when the component runs. Alternatively, select a pre-made file format.
Compression = drop-down
Select the compression algorithm used on the file being loaded.
The default setting is AUTO.
Enable octal = boolean
When True, enables the parsing of octal values.
Default setting is False.
Allow duplicates = boolean
When True, allows duplicate object field names.
Default setting is False.
Strip outer array = boolean
When True, instructs the JSON parser to remove outer brackets.
Default setting is False.
Strip null values = boolean
When True, instructs the JSON parser to remove any object fields or array elements containing NULL values.
Default setting is False.
Ignore UTF8 errors = boolean
When True, replaces any invalid UTF-8 sequences with Unicode characters. When False, invalid UTF-8 sequences produce an error in the pipeline run.
Default setting is False.
Binary format = drop-down
Determines how binary string values are encoded in data files. Use this setting when importing data into a table's binary columns.
Choose from the following options:
- HEX: Binary string values are encoded as hexadecimal strings.
- BASE64: Binary string values are encoded as Base64 strings.
- UTF8: Binary string values are encoded as UTF-8 strings.
Default setting is HEX.
This will only work when Match by column name is set to either Case Insensitive or Case Sensitive.
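For quick reference, the following minimal Python sketch (made-up binary value) shows how the same two bytes appear under each of the three encodings.

```python
import base64

# Hypothetical binary value destined for a BINARY column.
value = b"Hi"

print("HEX:   ", value.hex().upper())               # 4869
print("BASE64:", base64.b64encode(value).decode())  # SGk=
print("UTF8:  ", value.decode("utf-8"))             # Hi
```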
Multi line = boolean
Allows multiple lines in a JSON record. When set to False, any JSON record containing a newline character is considered invalid and will trigger an error.
Default setting is False.
Replace invalid characters = boolean
When True, Snowflake replaces invalid UTF-8 characters with the Unicode replacement character. When False, the load operation produces an error when invalid UTF-8 character encoding is detected.
Default setting is False.
Skip byte order mark = boolean
Determines whether to ignore a BOM (byte order mark) in the input file.
If set to False, Snowflake processes any BOM found in the file, which may lead to errors or cause the BOM to be included in the first column of the table.
Default setting is True.
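To see why skipping the BOM matters, the small Python sketch below (made-up file content) shows a UTF-8 BOM ahead of a JSON record; if the BOM is treated as data, its bytes end up in front of the first field.

```python
import codecs
import json

# Hypothetical file content: a UTF-8 BOM followed by a JSON record.
raw = codecs.BOM_UTF8 + b'{"id": 1}'

# Decoding with plain "utf-8" keeps the BOM as the character \ufeff,
# which would corrupt the first value if processed as data.
print(repr(raw.decode("utf-8")))            # '\ufeff{"id": 1}'

# Decoding with "utf-8-sig" strips the BOM, comparable in spirit to
# setting Skip byte order mark to True.
print(json.loads(raw.decode("utf-8-sig")))  # {'id': 1}
```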
Null if = editor (optional)
Specify one or more strings (one string per row of the table) to convert to NULL values. When one of these strings is encountered in the file, it is replaced with an SQL NULL value for that field in the loaded table. Click + to add a string.
This will only work when Match by column name is set to either Case Insensitive or Case Sensitive.
Ignore corrupt files = boolean (optional)
When set to Yes, corrupt input files are skipped and not included in the load.
Ignore missing files = boolean (optional)
When set to Yes, files that cannot be found are ignored during the load process.
Modified after = string (optional)
An optional timestamp as a filter to only ingest files that have a modification timestamp after the provided timestamp.
Modified before = string (optional)
An optional timestamp as a filter to only ingest files that have a modification timestamp before the provided timestamp.
Recursive file lookup = boolean (optional)
When set to Yes, files are read recursively from all subdirectories under the specified path. When enabled, partition inference is disabled. To control which files are loaded, use the Glob pattern property instead.
Allows backslash escaping any character = boolean (optional)
If Yes, will allow the quoting of all characters using the backslash quoting mechanism in JSON records.
Allow comments = boolean (optional)
If Yes, will allow JAVA/C++ style comments in JSON records.
Allow non-numeric numbers = boolean (optional)
If Yes, will allow using non-numeric number strings such as NaN, Infinity, -Infinity in JSON records.
Allow numeric leading zeros = boolean (optional)
If Yes, will allow leading zeros in numbers in JSON records.
Allow single quotes = boolean (optional)
If Yes, will allow single quotes in addition to double quotes in JSON records.
Allow unquoted control chars = boolean (optional)
If Yes, will allow JSON strings to include unquoted control characters (ASCII characters with values less than 32, including tab and line feed characters) in JSON records.
Allow unquoted field names = boolean (optional)
If Yes, will allow unquoted JSON field names (which are allowed by JavaScript, but not by the JSON specification).
Bad records path = string (optional)
An optional path to store bad records encountered during parsing. When specified, records that cannot be parsed are written to this location instead of causing the load to fail.
Column name of corrupt record = string (optional)
An optional column name to store the malformed JSON string when a record cannot be parsed. This allows you to capture and analyze corrupt records in the target table.
Date format = string (optional)
Describe the format of date values in the data files to be loaded. For example, yyyy-MM-dd.
Drop field if all null = boolean (optional)
If Yes, will ignore columns that contain only null values when inferring the schema.
Encoding = string (optional)
Select the character encoding to use when reading the input files.
Infer timestamp = boolean (optional)
If Yes, will attempt to infer timestamp strings as a TimestampType during schema inference.
Line separator = string (optional)
Specify the character or sequence of characters used to separate individual records (rows) in the JSON file.
Locale = string (optional)
Sets the locale for date and timestamp parsing. The locale should be in BCP 47 language tag format (e.g., en-US).
Mode = drop-down (optional)
Controls how the parser handles corrupt or malformed records (a reference sketch follows this list). Options include:
- PERMISSIVE: Sets fields to null when encountering corrupted records and places the malformed string into a field configured by Column name of corrupt record. This is the default setting.
- DROPMALFORMED: Ignores and drops rows containing corrupted records.
- FAILFAST: Throws an exception when encountering corrupted records.
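These option names correspond to the Apache Spark JSON data source. As a reference only (not necessarily the exact statement the component generates), the PySpark sketch below shows how the mode and corrupt-record column could be set; the session, path, and column name are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source path and corrupt-record column name.
df = (
    spark.read
    .option("mode", "PERMISSIVE")                     # or DROPMALFORMED / FAILFAST
    .option("columnNameOfCorruptRecord", "_corrupt")  # where malformed records land
    .json("s3://example-bucket/loads/")
)
```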
Multi line = boolean (optional)
If Yes, will parse records which may span multiple lines.
Primitives as string = boolean
If Yes, primitive data types are interpreted as strings in JSON files. If not specified, the default is No.
Prefers decimal = boolean (optional)
If Yes, all floating-point values will be treated as a decimal type during schema inference.
Reader case sensitive = boolean (optional)
Specifies case sensitivity behavior during schema rescue. If Yes, data columns are matched to the schema by name in a case-sensitive manner.
Rescued data column = string (optional)
An optional column name to store data that does not match the schema. When specified, fields that don't match the expected schema are collected into this column as a JSON string.
Single variant column = string (optional)
An optional column name to store the entire JSON record as a single VARIANT type value, rather than parsing it into separate columns.
Timestamp format = string (optional)
Describe the format of timestamp values in the data files to be loaded.
Time zone = string (optional)
Sets the timezone to use when parsing timestamps.
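Several of the parsing properties above (Date format, Timestamp format, Locale, Time zone, Line separator) take the same value formats as the corresponding Apache Spark JSON reader options. The illustrative PySpark sketch below (hypothetical path and values) is included only to show those value formats.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical path and option values, illustrating the expected formats:
# Spark date/time patterns, a BCP 47 locale tag, and a record separator.
df = (
    spark.read
    .option("dateFormat", "yyyy-MM-dd")
    .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
    .option("locale", "en-US")
    .option("timeZone", "UTC")
    .option("lineSep", "\n")
    .json("s3://example-bucket/loads/")
)
```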
Destination
Warehouse = drop-down
The Snowflake warehouse used to run the queries. The special value [Environment Default] uses the warehouse defined in the environment. Read Overview of Warehouses to learn more.
Database = drop-down
The Snowflake database. The special value [Environment Default] uses the database defined in the environment. Read Databases, Tables and Views - Overview to learn more.
Schema = drop-down
The Snowflake schema. The special value [Environment Default] uses the schema defined in the environment. Read Database, Schema, and Share DDL to learn more.
Target table name = string
Specify the name of the newly created or existing table to load data into. The table will be created according to the Load strategy you select below.
Load strategy = drop-down
- Replace: If the specified table name already exists, that table will be destroyed and replaced by the table created during this pipeline run.
- Truncate and Insert: If the specified table name already exists, all rows within the table will be removed and new rows will be inserted per the next run of this pipeline.
- Fail if Exists: If the specified table name already exists, this pipeline will fail to run, and no data will be copied to the table.
- Append: If the specified table name already exists, then the data is inserted without altering or deleting the existing data in the table. It's appended onto the end of the existing data in the table. If the specified table name doesn't exist, then an error message will appear and the table is not created. You must always provide a table name.
If not specified, the default is Fail if Exists.
Catalog = drop-down
Select a Databricks Unity Catalog. The special value [Environment Default] uses the catalog defined in the environment. Selecting a catalog will determine which databases are available in the next parameter.
Schema (Database) = drop-down
The Databricks schema. The special value [Environment Default] uses the schema defined in the environment. Read Create and manage schemas to learn more.
Target table name = string
Specify the name of the newly created or existing table to load data into. The table will be created according to the Load strategy you select below.
Load strategy = drop-down
- Replace: If the specified table name already exists, that table will be destroyed and replaced by the table created during this pipeline run.
- Truncate and Insert: If the specified table name already exists, all rows within the table will be removed and new rows will be inserted per the next run of this pipeline.
- Fail if Exists: If the specified table name already exists, this pipeline will fail to run, and no data will be copied to the table.
- Append: If the specified table name already exists, then the data is inserted without altering or deleting the existing data in the table. It's appended onto the end of the existing data in the table. If the specified table name doesn't exist, then an error message will appear and the table is not created. You must always provide a table name.
If not specified, the default is Fail if Exists.
Note
Databricks uses COPY INTO, which is idempotent. This means it tracks which source files have already been loaded into the target Delta table and skips them on subsequent runs.
Enabling Force load overrides this behavior for the Truncate and Insert and Append load strategies, causing all matching source files to be reprocessed.
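For reference, Databricks COPY INTO exposes this behavior through its force copy option. The minimal sketch below (run from Python via spark.sql on Databricks, with hypothetical table and location names) is not the statement the component generates, but illustrates the note above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical target table and source location. Without 'force',
# COPY INTO skips files it has already loaded into the Delta table;
# 'force' = 'true' reprocesses every matching source file.
spark.sql("""
    COPY INTO example_catalog.example_schema.orders
    FROM 's3://example-bucket/loads/'
    FILEFORMAT = JSON
    PATTERN = '*.json'
    COPY_OPTIONS ('force' = 'true')
""")
```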
Advanced Settings
On error = drop-down
Decide how to proceed upon an error.
- Abort Statement: Aborts the load if any error is encountered. This is the default setting.
- Continue: Continue loading the file.
- Skip File: Skip file if any errors are encountered in the file.
- Skip File When n Errors: Skip file when the number of errors in the file is equal to or greater than the specified number in the next property, N.
- Skip File When n% Errors: Skip file when the percentage of errors in the file exceeds the specified percentage, N.
N = integer
Specify the number of errors or the percentage of errors required to skip the file. This parameter accepts integers only; specify percentages as a number without the % sign.
This parameter is only available when On error is set to either Skip File When n Errors or Skip File When n% Errors.
Size limit (in bytes) = integer (optional)
Specify the maximum size, in bytes, of data to be loaded for a given COPY statement. If the maximum is exceeded, the COPY operation discontinues loading files. For more information, refer to the Snowflake documentation.
Purge files = boolean
When True, purges data files after the data is successfully loaded.
Default setting is False.
Match by column name = drop-down
Specify whether to load semi-structured data into columns in the target table that match corresponding columns represented in the data.
- Case Insensitive: Load semi-structured data into columns in the target table that match corresponding columns represented in the data. Column names are matched case-insensitively.
- Case Sensitive: Load semi-structured data into columns in the target table that match corresponding columns represented in the data. Column names are matched case-sensitively.
- None: The COPY operation loads the semi-structured data into a variant column or, if a query is included in the COPY statement, transforms the data.
Truncate columns = boolean
When True, strings are automatically truncated to the target column length. When False, the COPY statement produces an error if a loaded string exceeds the target column length.
Default setting is False.
Force load = boolean
When True, loads all files, regardless of whether they have been loaded previously and haven't changed since they were loaded.
Default setting is False.
When set to True, this option reloads files and can lead to duplicated data in a table.
Force load = boolean
When True, loads all files, regardless of whether they have been loaded previously and haven't changed since they were loaded. Default setting is False.
When set to True, this option reloads files and can lead to duplicated data in a table. For this setting to take effect, the Load strategy in the Destination section must be Truncate and Insert or Append.
| Snowflake | Databricks | Amazon Redshift |
|---|---|---|
| ✅ | ✅ | ❌ |