Cortex Parse Document
Editions
Production use of this feature is available for specific editions only. Contact our sales team for more information.
Cortex Parse Document is a transformation component that uses Snowflake Cortex to extract the content of a document stored on an internal or external stage. The extracted content is returned as an object that contains JSON-encoded objects as strings.
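Because the output holds JSON-encoded strings, downstream code usually needs to decode a value before reading its fields. The following is a minimal Python sketch of that decoding step. The sample value and its keys (content, metadata, pageCount) are assumptions made for illustration, and decode_parsed_document is a hypothetical helper, not part of the component.

```python
import json

# Hypothetical sample of one value from the output column. The keys used
# here ("content", "metadata", "pageCount") are assumptions for illustration.
raw_value = '{"content": "Quarterly report\\nRevenue grew.", "metadata": {"pageCount": 1}}'


def decode_parsed_document(value):
    """Return a dict from a JSON-encoded string, or pass a dict through unchanged."""
    if isinstance(value, str):
        return json.loads(value)
    return value


doc = decode_parsed_document(raw_value)

# Read fields defensively, since the exact shape is an assumption.
text = doc.get("content", "")
page_count = doc.get("metadata", {}).get("pageCount")

print(text.splitlines()[0])
print(page_count)
```

Reading fields with get rather than direct indexing keeps the sketch tolerant of values that lack one of the assumed keys.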
You must use a Snowflake role that has been granted the SNOWFLAKE.CORTEX_USER database role. Read Required Privileges to learn more about granting this database role.
As per Snowflake's documentation:
PARSE_DOCUMENT supports processing of documents stored in an internal Snowflake stage, or an external stage. In creating your stage, Server Side Encryption is required. Otherwise, PARSE_DOCUMENT will return an error that the provided file isn't in the expected format or is client-side encrypted.
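As a rough illustration of the encryption requirement quoted above, the Python sketch below builds a CREATE STAGE statement for an internal stage with server-side encryption enabled. The stage name and the build_stage_ddl helper are hypothetical, and the sketch only produces the SQL text; running it against Snowflake is left to whatever client you already use.

```python
def build_stage_ddl(stage_name):
    """Build DDL for an internal stage that uses Snowflake server-side encryption.

    stage_name is a hypothetical identifier supplied by the caller; it is
    inserted as-is, so only pass trusted, already-validated names.
    """
    return (
        f"CREATE STAGE IF NOT EXISTS {stage_name} "
        "ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')"
    )


ddl = build_stage_ddl("my_db.my_schema.documents_stage")
print(ddl)
```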
To learn more about Snowflake Cortex, such as availability, usage quotas, managing costs, and more, visit Large Language Model (LLM) Functions (Snowflake Cortex).
Properties
Database
= drop-down
The Snowflake database. The special value, [Environment Default], will use the database defined in the environment. Read Databases, Tables and Views - Overview to learn more.
Schema
= drop-down
The Snowflake schema. The special value, [Environment Default], will use the schema defined in the environment. Read Database, Schema, and Share DDL to learn more.
Stage
= drop-down
The internal stage (Snowflake managed) or external stage (such as AWS S3, Azure Blob Storage, or Google Cloud Storage) where the file to extract content from is stored.
File Pattern
= string optional
Use a file pattern to specify and then match files based on their names or extensions. For example, file1.pdf would return any file with that name and extension.
Regular expressions are supported. For example, .*\.pdf would match all .pdf files in a stage.
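To preview which files a pattern would select, you can test it locally before running the component. The sketch below is a Python approximation that assumes the pattern must match the whole file path. The file names are hypothetical, and Python's regular expression engine may differ in detail from the one Snowflake uses.

```python
import re

# Hypothetical file paths as they might appear in a stage.
staged_files = [
    "file1.pdf",
    "reports/2024/summary.pdf",
    "notes.txt",
    "file1.pdf.bak",
]


def match_files(pattern, names):
    """Return the names that the pattern matches in full (not just a prefix)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


exact = match_files(r"file1.pdf", staged_files)
all_pdfs = match_files(r".*\.pdf", staged_files)

print(exact)
print(all_pdfs)
```

Note that an unescaped dot matches any single character, so file1.pdf used as a regular expression would also match a name such as file1xpdf. Writing file1\.pdf avoids that.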
Mode
= drop-down
- OCR: This mode is optimized for text extraction from documents. This mode is recommended when extracting content from documents that don't have a strong semantic structure. This is the default setting.
- Layout: This mode is optimized for extracting both text and layout, including elements such as tables. According to the Snowflake documentation, this mode returns the data as Markdown and can capture the layout and structure of content elements better than OCR.
Read How Parse Document works to learn more about Snowflake's suggested usage of the OCR and Layout modes.
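For readers who want to see how the Mode choice might translate into a direct call to Snowflake's PARSE_DOCUMENT function, here is a minimal Python sketch that builds such a query. The SQL that the component itself generates is not documented here, so treat this as an assumption. The stage name, file path, and build_parse_document_sql helper are hypothetical.

```python
VALID_MODES = {"OCR", "LAYOUT"}


def build_parse_document_sql(stage, relative_path, mode="OCR"):
    """Build a SELECT that calls SNOWFLAKE.CORTEX.PARSE_DOCUMENT on one staged file.

    stage is inserted as-is and must be a trusted identifier; single quotes
    in relative_path are doubled so the path stays a valid SQL string literal.
    """
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    safe_path = relative_path.replace("'", "''")
    return (
        "SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT("
        f"@{stage}, '{safe_path}', {{'mode': '{mode}'}}) AS parsed"
    )


sql = build_parse_document_sql("my_db.my_schema.documents_stage", "report.pdf", "layout")
print(sql)
```

Validating the mode up front mirrors the two choices offered by the drop-down, so a typo fails locally instead of surfacing as a Snowflake error.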
Output Column Name
= string
The name of the new column that is output when the component is executed.
| Snowflake | Databricks | Amazon Redshift |
|---|---|---|
| ✅ | ❌ | ❌ |