Extract Nested Data

Project Availability: Snowflake ✅ Databricks ✅ Amazon Redshift ✅

Component Type: Transformation

Connection Inputs: One

Connection Outputs: Unlimited

The Extract Nested Data transformation component unpacks data stored in JSON format into structured rows and columns in a table. The component is especially useful when using a Custom Connector to access an API that returns data in JSON format.

The input to this component should include one or more variant-type columns containing JSON that is to be unpacked. The component will operate on every suitable column in the input.

Each element in the source data can be mapped to a different column in your target table. For example, consider the following JSON structure, an object containing three elements:

{
    "name": "varchar",
    "png": "varchar",
    "alt": "varchar"
}

When data in this format is extracted, each of the three elements can be mapped to a different column in the target table. When we sample the output from the transformation, we will see something like the following table:

Name	Flag	Alt_Text
Cyprus	https://flagcdn.com/w320/cy.png	The flag of Cyprus.
Somalia	https://flagcdn.com/w320/so.png	The flag of Somalia.
Venezuela	https://flagcdn.com/w320/ve.png	The flag of Venezuela.
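
As a rough illustration of this mapping (not the component's actual implementation), the following Python sketch parses JSON records shaped like the structure above and maps each element to a target column. The names `COLUMN_MAP` and `extract_row` are hypothetical and exist only for this example, and treating a missing element as a null is an assumption. The sample values are taken from the table above.

```python
import json

# Hypothetical mapping from JSON key to target column name.
COLUMN_MAP = {"name": "Name", "png": "Flag", "alt": "Alt_Text"}


def extract_row(raw_json):
    """Parse one JSON document and map its elements to target columns."""
    record = json.loads(raw_json)
    # Assumption: a missing key becomes None (a null in the target table).
    return {column: record.get(key) for key, column in COLUMN_MAP.items()}


source = [
    '{"name": "Cyprus", "png": "https://flagcdn.com/w320/cy.png", "alt": "The flag of Cyprus."}',
    '{"name": "Somalia", "png": "https://flagcdn.com/w320/so.png", "alt": "The flag of Somalia."}',
]

rows = [extract_row(doc) for doc in source]
for row in rows:
    print(row)
```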

When sampling data, there is a limit of approximately 800 KB on the amount of data that can be returned in the sample. If very large JSON structures are being returned by the component, this may result in the sample being truncated with fewer than the requested number of rows.
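
To make the effect of the sample size limit concrete, here is a minimal Python sketch of how a byte budget can cut a sample short. It is an assumption-laden model, not a description of how the product measures or enforces the limit: the `sample_rows` function, the use of serialized JSON length as the row size, and the exact value of `SAMPLE_LIMIT_BYTES` are all hypothetical.

```python
import json

# Approximate limit quoted in the text; the exact byte count is an assumption.
SAMPLE_LIMIT_BYTES = 800 * 1024


def sample_rows(rows, requested, limit=SAMPLE_LIMIT_BYTES):
    """Return up to `requested` rows, stopping early once the byte budget is spent."""
    sample = []
    used = 0
    for row in rows[:requested]:
        size = len(json.dumps(row).encode("utf-8"))
        if used + size > limit:
            break  # the sample is truncated here
        sample.append(row)
        used += size
    return sample


# Five large rows are requested, but only some fit within the budget.
large_rows = [{"payload": "x" * 300_000} for _ in range(5)]
sample = sample_rows(large_rows, requested=5)
print(len(sample), "of 5 requested rows returned")
```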

Use Extract Nested Data's Columns property to define which elements in the source data will be mapped to columns in the target structured data.

Note

For an alternative method of extracting semi-structured data in a Snowflake project, you can use the Flatten Variant component.
For a method of extracting structured data in a Databricks project, you can use the Extract Structured Data component.
Extraction of data stored in XML format isn't currently supported in this component.

Use case

The Extract Nested Data component is used to flatten and extract fields from semi-structured data in JSON format. Some common uses for this include:

Taking data from an API that returns a nested JSON object, and putting it into table columns.
Extracting nested data from sources like S3 logs, CloudTrail, or Kafka, making the data ready for analysis.
Preparing semi-structured JSON data for loading into relational tables without losing granularity.
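
The first use case can be sketched in Python as follows. This is an illustration of the general flattening idea only, assuming a made-up API response: the `order` record, its field names, and the `explode` helper are all hypothetical. Each element of the nested array becomes one output row, and the parent fields are repeated on every row so that no granularity is lost.

```python
def explode(order):
    """Return one flat row per element of the nested "items" array."""
    rows = []
    for item in order.get("items", []):
        rows.append({
            "order_id": order["id"],
            "customer_name": order["customer"]["name"],
            "sku": item["sku"],
            "quantity": item["quantity"],
        })
    return rows


# Hypothetical nested JSON object, already parsed into Python types.
order = {
    "id": 1001,
    "customer": {"name": "Ada"},
    "items": [
        {"sku": "A-1", "quantity": 2},
        {"sku": "B-7", "quantity": 1},
    ],
}

flat_rows = explode(order)
for row in flat_rows:
    print(row)
```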

Properties

Snowflake, Databricks, and Amazon Redshift properties are listed in that order below.

Snowflake

Name = string

A human-readable name for the component.

Include Input Columns = boolean

Choose whether to include input columns in the output.

Columns = data structure

Use this property to select which elements from the JSON input will be mapped to columns in the output. The Columns dialog shows a graphical representation of every addressable element in the input. If the input has multiple columns of JSON data, all will be included here. Each element has a corresponding checkbox. Select an element's checkbox to include that element in the output. No elements are selected by default.

To select every element, click Select all.
To deselect every element, click Clear all.
To edit an element, click the three dots ... next to the element, and then click Edit element.
To add a new element, click the three dots ... next to the VARIANT heading at the top of the structure, and then click Add element. Each element should be assigned a unique Key, a Type, and an Alias.
To delete an element, click the three dots ... next to it and click Delete element.
To automatically add every element to the structure, click Autofill.
To remove all elements added to the structure so far, reverting to a blank structure, click Reset.

Click Save when you have finished editing and selecting elements.

Outer Join = boolean

Determines how to handle input rows that can't be expanded (for example, because they have no fields to expand, or because they can't be accessed). Select No to omit these rows from the output, or Yes to generate an output row with NULL values for them.
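
The difference between the two settings can be modeled with a short Python sketch. This is a simplified illustration rather than the component's implementation: the `expand` function and the `tags` field are hypothetical, and keeping the `id` value on the NULL row is an assumption.

```python
def expand(rows, outer_join):
    """Expand each row's "tags" array into one output row per element."""
    output = []
    for row in rows:
        tags = row.get("tags") or []
        if tags:
            output.extend({"id": row["id"], "tag": tag} for tag in tags)
        elif outer_join:
            # Outer Join = Yes: keep the row, with None (NULL) for the expanded column.
            output.append({"id": row["id"], "tag": None})
        # Outer Join = No: a row with nothing to expand is omitted.
    return output


rows = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}, {"id": 3}]

inner = expand(rows, outer_join=False)
outer = expand(rows, outer_join=True)
print(len(inner), "rows with Outer Join = No")
print(len(outer), "rows with Outer Join = Yes")
```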

Input Alias = string

If two input elements have identical names, one will be given this prefix to differentiate them. More than two identically named elements will result in an error. The default is i, and this does not need changing in the vast majority of use cases.

Array Prefix = string

If two array structures have identical names, one will be given this prefix to differentiate them. More than two identically named structures will result in an error. The default is f, and this does not need changing in the vast majority of use cases.

Casting Method = drop-down

Select how invalid or unparseable input elements will be handled:

Fail on invalid data (the default)
Replace all unparseable values with null
Replace unparseable dates and timestamps with null
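
The three options can be pictured with the following Python sketch. It is a hedged model of the behavior described by the option names, not the SQL the component generates: the `cast` function, the two supported target types, and the use of ISO date strings are assumptions made for the example.

```python
from datetime import date

FAIL = "Fail on invalid data"
NULL_ALL = "Replace all unparseable values with null"
NULL_DATES = "Replace unparseable dates and timestamps with null"


def cast(value, target, method=FAIL):
    """Cast a string to "int" or "date", handling failures per the chosen method."""
    parser = int if target == "int" else date.fromisoformat
    try:
        return parser(value)
    except ValueError:
        if method == NULL_ALL:
            return None
        if method == NULL_DATES and target == "date":
            return None
        raise  # FAIL, or a non-date value under NULL_DATES


print(cast("42", "int"))
print(cast("abc", "int", NULL_ALL))
print(cast("2024-13-01", "date", NULL_DATES))
```

Under this model, a bad integer still raises an error when the third option is selected, because only dates and timestamps are replaced with null.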

Case Columns Alias Names = drop-down

Set the case that will be used for alias column names. Settings include Upper, Lower, or No (the default).
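
A minimal Python sketch of the three settings follows. The `apply_alias_case` helper is hypothetical, and the reading that No leaves the alias exactly as defined is an assumption based on the option name.

```python
def apply_alias_case(alias, setting="No"):
    """Return the alias in the case selected by the setting."""
    if setting == "Upper":
        return alias.upper()
    if setting == "Lower":
        return alias.lower()
    # Assumption: "No" (the default) leaves the alias unchanged.
    return alias


aliases = ["Name", "Flag", "Alt_Text"]
print([apply_alias_case(a, "Upper") for a in aliases])
print([apply_alias_case(a, "Lower") for a in aliases])
print([apply_alias_case(a) for a in aliases])
```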

Databricks

Name = string

A human-readable name for the component.

Include Input Columns = boolean

Choose whether to include input columns in the output.

Columns = data structure

Use this property to select which elements from the JSON input will be mapped to columns in the output. The Columns dialog shows a graphical representation of every addressable element in the input. If the input has multiple columns of JSON data, all will be included here. Each element has a corresponding checkbox. Select an element's checkbox to include that element in the output. No elements are selected by default.

To select every element, click Select all.
To deselect every element, click Clear all.
To edit an element, click the three dots ... next to the element, and then click Edit element.
To add a new element, click the three dots ... next to the VARIANT heading at the top of the structure, and then click Add element. Each element should be assigned a unique Key, a Type, and an Alias.
To delete an element, click the three dots ... next to it and click Delete element.
To automatically add every element to the structure, click Autofill.
To remove all elements added to the structure so far, reverting to a blank structure, click Reset.

Click Save when you have finished editing and selecting elements.

Outer Join = boolean

Determines how to handle input rows that can't be expanded (for example, because they have no fields to expand, or because they can't be accessed). Select No to omit these rows from the output, or Yes to generate an output row with NULL values for them.

Input Alias = string

If two input elements have identical names, one will be given this prefix to differentiate them. More than two identically named elements will result in an error. The default is i, and this does not need changing in the vast majority of use cases.

Array Prefix = string

If two array structures have identical names, one will be given this prefix to differentiate them. More than two identically named structures will result in an error. The default is f, and this does not need changing in the vast majority of use cases.

Casting Method = drop-down

Select how invalid or unparseable input elements will be handled:

Fail on invalid data (the default)
Replace all unparseable values with null

Amazon Redshift

Name = string

A human-readable name for the component.

Include Input Columns = boolean

Choose whether to include input columns in the output.

Columns = data structure

Use this property to select which elements from the JSON input will be mapped to columns in the output. The Columns dialog shows a graphical representation of every addressable element in the input. If the input has multiple columns of JSON data, all will be included here. Each element has a corresponding checkbox. Select an element's checkbox to include that element in the output. No elements are selected by default.

To select every element, click Select all.
To deselect every element, click Clear all or Reset.

Click Save when you have finished editing and selecting elements.

Column Aliases = string

If two input elements have identical names, one will be given this prefix to differentiate them. More than two identically named elements will result in an error. The default is i, and this does not need changing in the vast majority of use cases.

Array Prefix = string

If two array structures have identical names, one will be given this prefix to differentiate them. More than two identically named structures will result in an error. The default is f, and this does not need changing in the vast majority of use cases.
