Cortex Finetune
Editions
Production use of this feature is available for specific editions only. Contact our sales team for more information.
Cortex Finetune is an orchestration component that enables users to fine-tune large language models (LLMs) using Snowflake Cortex. This lets you adapt powerful pre-trained LLMs to your organization’s specific use case—whether it's customer support, summarization, content generation, or domain-specific reasoning—without needing to train models from scratch.
Fine-tuning a model on your own labeled dataset allows for more accurate, reliable responses aligned with your data and requirements. The resulting model can be called using your assigned name via the CORTEX.COMPLETE function in Snowflake.
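For example, once fine-tuning has finished, the model can be invoked directly in Snowflake SQL. The sketch below is illustrative only; the model, database, schema, table, and column names are hypothetical placeholders for whatever you configure in the component.

```sql
-- Call a fine-tuned model by the name assigned in this component.
-- "my_db.my_schema.my_support_model" and "support_tickets" are hypothetical examples.
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'my_db.my_schema.my_support_model',
    'Summarize the following support ticket: ' || ticket_body
) AS summary
FROM support_tickets;
```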
You must use a Snowflake role that has been granted the SNOWFLAKE.CORTEX_USER database role. Read Required Privileges to learn more about granting this privilege.
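For example, an administrator could grant the database role to an existing account role like this (the role name analyst_role is illustrative):

```sql
-- Grant the Cortex user database role to an example account role.
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE analyst_role;
```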
To learn more about Snowflake Cortex, such as availability, usage quotas, managing costs, and more, read Large Language Model (LLM) Functions (Snowflake Cortex).
Properties
Reference material is provided below for the Model, Training Data, and Validation Data properties.
Model
Name
= string
A human-readable name for the component. This name is used to reference the fine-tuned model in downstream Cortex functions.
Base Model
= drop-down
Select the base LLM that will be fine-tuned using your training data. Available models include:
- llama3-8b: Optimized for text classification, summarization, and sentiment analysis.
- llama3-70b: High-performance model ideal for chat, content creation, and enterprise use.
- llama3.1-8b: Lightweight, fast model with a 24K context window for moderate tasks.
- llama3.1-70b: Cost-effective, open-source model for advanced enterprise applications.
- mistral-7b: Fast, efficient model for summarization and simple question answering.
- mixtral-8x7b: Versatile model for generation, classification, and QA with low latency.
Choosing a smaller model (such as llama3-8b) is recommended for faster training and cost-efficiency in smaller-scale or experimental use cases.
Database
= drop-down
Select the Snowflake database where your training and (optionally) validation tables are stored.
Schema
= drop-down
Select the schema within the chosen database that contains your input tables.
Creation Mode
= drop-down
Defines how the component executes the model creation process.
- Synchronous: The component runs in a blocking manner and waits for the training job to complete.
- Asynchronous: Initiates the training job and allows the pipeline to continue running other components without waiting for completion.
Note
- For asynchronous mode, you can find the job ID in the task history table to track job progress (see the example query after this note).
- For synchronous mode, the job ID, job status, and model name are recorded in the pipeline logs.
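If you run the component asynchronously, you can also check on the job yourself from Snowflake using the underlying Cortex fine-tuning function. This is a sketch, assuming you have retrieved the job ID from the task history table:

```sql
-- Check the progress of an asynchronous fine-tuning job.
-- Replace '<job_id>' with the ID recorded in the task history table.
SELECT SNOWFLAKE.CORTEX.FINETUNE('DESCRIBE', '<job_id>');

-- List fine-tuning jobs visible to your role.
SELECT SNOWFLAKE.CORTEX.FINETUNE('SHOW');
```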
Training Data
You must supply a labeled dataset that contains pairs of prompts and expected completions.
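As an illustration, a minimal training table might look like the following. The table, column names, and rows are hypothetical; the component only requires that you can point it at one prompt column and one completions column.

```sql
-- Hypothetical training table of prompt/completion pairs.
CREATE OR REPLACE TABLE my_db.my_schema.support_finetune_train (
    prompt     VARCHAR,
    completion VARCHAR
);

INSERT INTO my_db.my_schema.support_finetune_train (prompt, completion) VALUES
    ('Customer asks how to reset their password.',
     'Direct the customer to Settings > Security > Reset Password and confirm the reset email was received.'),
    ('Customer reports a failed payment.',
     'Apologize, ask for the last four digits of the card, and open a billing ticket for the payments team.');
```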
Table
= drop-down
Select the table that contains your training data.
Prompt Column
= drop-down
Choose the column containing the user-provided prompts.
Completions Column
= drop-down
Choose the column containing the expected or target responses for those prompts.
Validation Data
To assess model performance, you can either auto-split the training data or provide a separate validation dataset.
Automatically Split Training Data
= boolean
- Yes: A portion of the training table is automatically used as validation data. You do not need to provide a separate validation table.
- No: You must supply a separate validation table and define corresponding columns.
Table
= drop-down
Select the table that contains your validation data. Only available if Automatically Split Training Data is set to No.
Prompt Column
= drop-down
Select the prompt column in the validation dataset. Only available if Automatically Split Training Data is set to No.
Completions Column
= drop-down
Select the column with expected completions in the validation dataset. Only available if Automatically Split Training Data is set to No.
Epochs
= integer
Specify the number of epochs—that is, the number of times the model should pass through the full training dataset during fine-tuning. A higher number of epochs may improve accuracy but may also increase training time and cost.
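Under the hood, the component orchestrates Snowflake's Cortex fine-tuning SQL. The sketch below shows a roughly equivalent manual call under the assumption that the FINETUNE 'CREATE' operation is used; all object names are placeholders, and the component builds and manages this call (including the optional validation query) for you.

```sql
-- Rough sketch of the equivalent manual call (the component handles this for you).
-- The training and validation queries should expose columns named "prompt" and "completion".
SELECT SNOWFLAKE.CORTEX.FINETUNE(
    'CREATE',
    'my_db.my_schema.my_support_model',                                        -- fine-tuned model name
    'llama3-8b',                                                               -- base model
    'SELECT prompt, completion FROM my_db.my_schema.support_finetune_train',   -- training data
    'SELECT prompt, completion FROM my_db.my_schema.support_finetune_valid'    -- optional validation data
);
```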
Once fine-tuning completes successfully, the model will be available for use with the assigned name through the CORTEX.COMPLETE function in Snowflake SQL, and is also available via Matillion's Cortex Completions component. The model is registered within your Snowflake account under the selected database and schema.
| Snowflake | Databricks | Amazon Redshift |
|---|---|---|
| ✅ | ❌ | ❌ |