Chunk Text

Chunk Text is an orchestration component that performs pushdown text chunking with a Python user-defined function (UDF), which runs in Snowflake on your Snowflake warehouse. Specify an existing Snowflake source table and set a target table. If the target table already exists, it will be overwritten.

You can choose Text, Markdown, or HTML as your data format.


Properties

Name = string

A human-readable name for the component.


Database = drop-down

The source Snowflake database to connect to. The special value, [Environment Default], will use the database defined in the environment. Read Databases, Tables and Views - Overview to learn more.


Schema = drop-down

The source Snowflake schema to connect to. The special value, [Environment Default], will use the schema defined in the environment. Read Database, Schema, and Share DDL to learn more.


Table = drop-down

An existing Snowflake table to use as the input. The tables available will depend on the schema you select.


Text Column = drop-down

The column in your table that holds the text data you wish to chunk.


Include Input Columns = dual listbox

Select any other input columns that you wish to include in the output table.


Data Format = drop-down

The format of the text to chunk.

  • Text: Your text data is chunked using a recursive character splitting method.
  • Markdown: Your text data is chunked using a header splitting method.
  • HTML: Your text data is chunked using a header splitting method.

Note

If your text data is Markdown, you can still set this parameter to Text.

Chunking Method = drop-down

Currently supports recursive character splitting.

Recursive character splitting lets you define characters to recursively split your text on until the chunks are small enough. Common characters are \n\n, \n, " " (a space), and . (a period). This method attempts to keep paragraphs (and then sentences, and then words) together for as long as possible.
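
The idea can be illustrated with a minimal pure-Python sketch. This is not the component's actual UDF; the function name recursive_split and its parameters are hypothetical, and overlap is left out to keep the recursion easy to follow. Each separator is tried in order, and any piece that is still too long is split again with the remaining separators.

```python
def recursive_split(text, chunk_size, separators):
    """Split text into chunks of at most chunk_size characters.

    Illustrative only: tries each (non-empty) separator in order and
    recurses with the remaining separators on pieces that are too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for piece in text.split(sep):
        # Try to keep neighbouring pieces together while they still fit.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it on the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph here.\n\n"
    "Second one is a bit longer than the first.\n\n"
    "Third."
)
chunks = recursive_split(sample, chunk_size=30, separators=["\n\n", "\n", " "])
for chunk in chunks:
    print(repr(chunk))
```

In this sketch the short paragraphs survive whole, and only the paragraph that exceeds 30 characters is broken further, on spaces.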


Chunk Size = integer

The maximum size of each chunk, in characters. For example, 100 or 250.


Chunk Overlap = integer

The number of overlapping characters between two chunks. Overlapping chunks can help to preserve context across chunks.

The integer value sets the total number of characters to overlap. For example, 10 or 25.
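
To make the relationship between Chunk Size and Chunk Overlap concrete, here is a simplified fixed-width sketch. It is an assumption-laden illustration, not the component's implementation: the real splitter works recursively on separators, so actual chunk boundaries will differ. The function name sliding_chunks is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters.

    Illustrative only: ignores separators entirely.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


windows = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for window in windows:
    print(window)
```

With a chunk size of 10 and an overlap of 3, each window starts 7 characters after the previous one, so the last 3 characters of one chunk are repeated at the start of the next.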


Separators = column editor

Define separator characters to recursively split on. Common characters are \n\n, \n, " " (a space), and . (a period).

Order matters in this list. If you wish to preserve the structure of a text document as much as possible, order your separators from the largest unit of structure to the smallest, for example with \n\n above the period (.). You can reorder your rows with click-and-drag.

Chunking Method = drop-down

Currently supports header splitting, wherein you can specify which Markdown headers to split your text on.


Headers To Split On = column editor

  • Header: Specify Markdown header syntax to define headers to split on. For example, use "#" to add a Header 1 to the list of headers to split on. Your output table will offer metadata about each row of chunked text, confirming which header element the chunked text is a child element of.
  • Alias: An alias for this header to contextualize the operation. For example, "Header 1", "H1", "Page title", and so on.
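
As a rough illustration of header splitting and the per-chunk metadata it produces, the following sketch splits a Markdown string on the configured headers and records each chunk's parent headers under their aliases. It is a simplified assumption of the behaviour, not the component's code: the function name split_on_headers and the output dictionary layout are hypothetical, and fenced code blocks inside the Markdown are not treated specially.

```python
import re


def split_on_headers(markdown, headers_to_split_on):
    """Split Markdown on selected headers.

    headers_to_split_on is a list of (header_syntax, alias) pairs,
    for example [("#", "Header 1"), ("##", "Header 2")].
    """
    alias_by_marker = dict(headers_to_split_on)
    chunks = []
    current_meta = {}
    current_lines = []

    def flush():
        # Emit the text collected so far with a copy of the active headers.
        body = "\n".join(current_lines).strip()
        if body:
            chunks.append({"text": body, "metadata": dict(current_meta)})
        current_lines.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match and match.group(1) in alias_by_marker:
            flush()
            marker, title = match.groups()
            # A new header replaces any header at the same or a deeper level.
            for other, alias in alias_by_marker.items():
                if len(other) >= len(marker):
                    current_meta.pop(alias, None)
            current_meta[alias_by_marker[marker]] = title.strip()
        else:
            current_lines.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n"
md_chunks = split_on_headers(doc, [("#", "Header 1"), ("##", "Header 2")])
for chunk in md_chunks:
    print(chunk)
```

Headers that are not in the list (for example "###" here) are kept as ordinary text inside the chunk rather than starting a new one.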

Chunking Method = drop-down

Currently supports header splitting, wherein you can specify which HTML headers to split your text on. You don't need to include the HTML tag characters <>. Specifying h1, h2, and so on is sufficient.


Headers To Split On = column editor

  • Header: Specify HTML header syntax to define headers to split on. For example, use "h1" to add a Header 1 to the list of headers to split on. Your output table will offer metadata about each row of chunked text, confirming which header element the chunked text is a child element of.
  • Alias: An alias for this header to contextualize the operation. For example, "Header 1", "H1", "Page title", and so on.
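
The same idea can be sketched for HTML using Python's standard html.parser module. This is an illustrative assumption of how header-based splitting might work, not the component's implementation: the class HeaderSplitter, the function split_html_on_headers, and the output layout are hypothetical, and text from neighbouring elements is simply joined with spaces.

```python
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    """Collect text between selected header tags, tagged with header aliases."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # For example [("h1", "Header 1"), ("h2", "Header 2")]; no <> needed.
        self.alias_by_tag = dict(headers_to_split_on)
        self.chunks = []
        self.meta = {}
        self.text_parts = []
        self.header_tag = None
        self.header_parts = []

    def flush_chunk(self):
        # Normalise whitespace and emit the text collected so far.
        body = " ".join(" ".join(self.text_parts).split())
        if body:
            self.chunks.append({"text": body, "metadata": dict(self.meta)})
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.alias_by_tag:
            self.flush_chunk()
            self.header_tag = tag
            self.header_parts = []

    def handle_endtag(self, tag):
        if tag == self.header_tag:
            # "h1" < "h2" < ... as strings, so this drops same-or-deeper headers.
            for other, alias in self.alias_by_tag.items():
                if other >= tag:
                    self.meta.pop(alias, None)
            self.meta[self.alias_by_tag[tag]] = "".join(self.header_parts).strip()
            self.header_tag = None

    def handle_data(self, data):
        if self.header_tag:
            self.header_parts.append(data)
        else:
            self.text_parts.append(data)


def split_html_on_headers(html, headers_to_split_on):
    parser = HeaderSplitter(headers_to_split_on)
    parser.feed(html)
    parser.close()
    parser.flush_chunk()  # emit whatever follows the last header
    return parser.chunks


page = "<h1>Guide</h1><p>Intro text.</p><h2>Setup</h2><p>Install it.</p>"
html_chunks = split_html_on_headers(page, [("h1", "Header 1"), ("h2", "Header 2")])
for chunk in html_chunks:
    print(chunk)
```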

Timeout = integer

The number of seconds to wait for script termination. After the set number of seconds has elapsed, the script is forcibly terminated. The default is 360 seconds (6 minutes).


Database = drop-down

The target Snowflake database. The special value, [Environment Default], will use the database defined in the environment. Read Databases, Tables and Views - Overview to learn more.


Schema = drop-down

The target Snowflake schema. The special value, [Environment Default], will use the schema defined in the environment. Read Database, Schema, and Share DDL to learn more.


Table = string

The name of your Snowflake target table. If the table already exists, it will be overwritten when you run this pipeline.


Output Column Name = string

Set a contextual name for your chunked text output column.

