Cortex Parse Document

Public preview

Editions

Production use of this feature is available for specific editions only. Contact our sales team for more information.

Project Availability:

Snowflake ✅ Databricks ❌ Amazon Redshift ❌

Component Type:

Transformation

Connection Inputs:

None

Connection Outputs:

Unlimited

The Cortex Parse Document transformation component lets you extract content from PDFs and image files for use in your pipeline. It uses Snowflake Cortex to extract content from a document stored on an internal or external stage, and returns that content as an object that contains JSON-encoded objects as strings.

To use this component, you must use a Snowflake role that has been granted the SNOWFLAKE.CORTEX_USER database role. Read Required Privileges to learn more about granting this privilege.

To learn more about Snowflake Cortex, including availability, usage quotas, and cost management, read Large Language Model (LLM) Functions (Snowflake Cortex).

As per Snowflake's documentation:

PARSE_DOCUMENT supports processing of documents stored in an internal Snowflake stage, or an external stage. In creating your stage, Server Side Encryption is required. Otherwise, PARSE_DOCUMENT will return an error that the provided file isn't in the expected format or is client-side encrypted.

Use case

You can use this component to obtain a range of information from PDFs and image files. For example, use it to:

Automatically extract customer details from completed PDF forms.
Extract details from scanned receipts when processing expenses.

Properties

Name = string

A human-readable name for the component.

Database = drop-down

The Snowflake database. The special value [Environment Default] uses the database defined in the environment. Read Databases, Tables and Views - Overview to learn more.

Schema = drop-down

The Snowflake schema. The special value [Environment Default] uses the schema defined in the environment. Read Database, Schema, and Share DDL to learn more.

Stage = drop-down

The internal stage (Snowflake managed) or external stage (such as AWS S3, Azure Blob Storage, or Google Cloud Storage) where the file to extract content from is stored.

File Pattern = string (optional)

Use a file pattern to match files by name or extension. For example, file1.pdf matches any file with that name and extension.

Regular expressions are supported. For example, .*\.pdf matches all .pdf files in a stage.
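
The two patterns above can be tried out locally before you enter them in the component. The following is a minimal sketch in Python, assuming the pattern has to match the whole file name. This page does not state how the component applies the pattern, so treat that as an assumption. The helper match_files and the staged_files list are hypothetical and exist only for illustration.

```python
import re


def match_files(pattern: str, file_names: list[str]) -> list[str]:
    """Return the names that the pattern matches in full (assumed behavior)."""
    compiled = re.compile(pattern)
    return [name for name in file_names if compiled.fullmatch(name)]


# Hypothetical stage listing, for illustration only.
staged_files = ["file1.pdf", "invoice_2024.pdf", "receipt.png", "SCAN.PDF"]

# All files ending in .pdf (Python regular expressions are case-sensitive by default).
pdf_files = match_files(r".*\.pdf", staged_files)

# A single named file. The dot is escaped so it matches a literal ".".
single_file = match_files(r"file1\.pdf", staged_files)

print(pdf_files)
print(single_file)
```

In this sketch, SCAN.PDF is not matched because the pattern is written in lowercase. If your stage contains mixed-case extensions, a pattern such as .*\.(pdf|PDF) may be a safer choice.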

Mode = drop-down

OCR: This mode is optimized for text extraction and is recommended for documents that don't have a strong semantic structure. This is the default setting.
Layout: This mode is optimized for text and layout extraction, including elements such as tables. According to the Snowflake documentation, when using this mode, the data is returned as Markdown, and can capture the layout and structure of content elements better than OCR.

Read How Parse Document works to learn more about Snowflake's suggested usage of the OCR and Layout modes.
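
To make the Mode setting more concrete, the sketch below builds the kind of Snowflake SQL call that the mode would feed into. It assumes the underlying call takes a stage, a file path, and an options object with a mode key, as in SNOWFLAKE.CORTEX.PARSE_DOCUMENT. The exact SQL that the component generates is not shown on this page. The helper build_parse_document_sql, the stage name, and the parsed alias are hypothetical.

```python
VALID_MODES = {"OCR", "LAYOUT"}


def build_parse_document_sql(stage: str, file_path: str, mode: str = "OCR") -> str:
    """Build a PARSE_DOCUMENT query string for one staged file (illustrative only)."""
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported mode: {mode!r}. Expected one of {sorted(VALID_MODES)}.")
    # Escape single quotes so the path stays a valid SQL string literal.
    escaped_path = file_path.replace("'", "''")
    return (
        "SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT("
        f"@{stage}, '{escaped_path}', {{'mode': '{mode}'}}) AS parsed"
    )


# Hypothetical stage and file names, for illustration only.
query = build_parse_document_sql("my_db.my_schema.docs", "receipt.pdf", "layout")
print(query)
```

The default of OCR in the sketch mirrors the default setting described above. Choosing Layout only changes the mode value passed in the options object.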

Output Column Name = string

The name of the new column that is output when the component is executed.
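
Because the extracted content arrives as a JSON-encoded string, downstream steps usually need to decode it before use. The following is a minimal sketch in Python, assuming each value in the output column is a JSON string with a content field and a metadata.pageCount field. These field names are an assumption and are not confirmed on this page. The helper extract_text and the sample value are hypothetical.

```python
import json
from typing import Optional, Tuple


def extract_text(cell_value: str) -> Tuple[str, Optional[int]]:
    """Decode one output-column value and return (content, page count)."""
    parsed = json.loads(cell_value)
    # Field names below are assumed, not confirmed by this page.
    content = parsed.get("content", "")
    page_count = parsed.get("metadata", {}).get("pageCount")
    return content, page_count


# Hypothetical value from the output column, for illustration only.
sample_cell = json.dumps(
    {"content": "Receipt total: 42.50", "metadata": {"pageCount": 1}}
)

text, pages = extract_text(sample_cell)
print(text)
print(pages)
```

Using .get with defaults keeps the sketch from failing if a field is missing. Check the actual output of your pipeline before relying on specific keys.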