AI Similarity

Public preview

Editions

Production use of this feature is available for specific editions only. Contact our sales team for more information.

The AI Similarity transformation component uses the Databricks ai_similarity() function to invoke generative AI to compare two strings and compute the semantic similarity score. This function uses a Databricks chat model serving endpoint made available by Databricks Foundation Model APIs. This lets the comparison go beyond simple string matching, as the chat model understands meaning, context, and phrasing.

The input is two columns of text data to be compared. Both columns must be in the same input table. To compare data from different tables, first use additional transformations, such as a Join, to bring the data into a single table.

The output is a float value representing the semantic similarity between the two input strings. The score is relative and should only be used for ranking. A score of 1 indicates that the two texts are equal.
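As a rough orientation, the comparison corresponds to calling ai_similarity() on two columns of the same table. The sketch below only assembles such a statement as a string. The table name, column names, and score alias are hypothetical, and the SQL the component actually generates may differ.

```python
def build_similarity_query(
    table: str,
    base_col: str,
    comparison_col: str,
    score_alias: str = "similarity_score",  # hypothetical output column name
) -> str:
    """Return a Databricks SQL statement that scores two columns of one table."""
    return (
        f"SELECT {base_col}, {comparison_col}, "
        f"ai_similarity({base_col}, {comparison_col}) AS {score_alias} "
        f"FROM {table}"
    )


# Hypothetical table and column names, for illustration only.
query = build_similarity_query("product_pairs", "listing_a", "listing_b")
print(query)
```

Because the score is relative, a statement like this would typically be followed by an ORDER BY on the score column rather than a comparison against a fixed absolute value.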

Note

Make sure you have read and understood the Requirements set out by Databricks before using this component.

Use case

Some typical use cases for this component include:

  • Deduplication of text data, by identifying and grouping duplicate or near-duplicate entries in datasets like product descriptions, survey responses, or user comments. For example, "iPhone 14 Pro Max 256GB" and "Apple iPhone 14 Pro Max, 256 GB" are non-matching strings but have a high similarity score, so they can be considered duplicates.
  • Record linking through semantic joins on datasets where the matching field contains slightly different wording.
  • Content overlap detection, to check whether content is reworded or copied from other sources.
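The use cases above generally come down to ranking candidate pairs by score. The following is a minimal sketch of that downstream step, assuming the scores have already been produced. The rows and score values are made up for illustration and are not real ai_similarity() output, and best_matches is a hypothetical helper.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical scored pairs: (base text, candidate text, similarity score).
# The scores are invented illustrative values, not real model output.
scored_pairs = [
    ("iPhone 14 Pro Max 256GB", "Apple iPhone 14 Pro Max, 256 GB", 0.93),
    ("iPhone 14 Pro Max 256GB", "Samsung Galaxy S23 Ultra", 0.41),
    ("Pixel 8 Pro 128GB", "Google Pixel 8 Pro (128 GB)", 0.90),
    ("Pixel 8 Pro 128GB", "Apple iPhone 14 Pro Max, 256 GB", 0.38),
]


def best_matches(
    pairs: Iterable[Tuple[str, str, float]],
) -> Dict[str, Tuple[str, float]]:
    """For each base text, keep the candidate with the highest score."""
    best: Dict[str, Tuple[str, float]] = {}
    for base, candidate, score in pairs:
        # Scores are relative, so only compare them against each other.
        if base not in best or score > best[base][1]:
            best[base] = (candidate, score)
    return best


for base, (candidate, score) in best_matches(scored_pairs).items():
    print(f"{base!r} -> {candidate!r} ({score})")
```

Selecting the top-ranked candidate per base record, rather than applying a fixed cutoff, follows the guidance that the score should only be used for ranking.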

Properties

Name = string

A human-readable name for the component.


Columns = column editor

  • Base Column: The base column.
  • Comparison Column: The column to compare against your base column.


Include Input Columns = boolean

  • Yes: Includes all input columns, including any that are not selected in Columns, along with the new semantic similarity scores column.
  • No: Only includes the new semantic similarity scores column.
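The effect of Include Input Columns on the output schema can be sketched as follows. The function, the score column name similarity_score, and the column ordering are assumptions made for illustration. The component's actual output column name and ordering are not stated here.

```python
from typing import List


def output_columns(
    input_columns: List[str],
    include_input_columns: bool,
    score_column: str = "similarity_score",  # hypothetical column name
) -> List[str]:
    """Return the output column names for a given Include Input Columns setting."""
    if include_input_columns:
        # Yes: every input column is kept, selected in Columns or not.
        return [*input_columns, score_column]
    # No: only the new score column is returned.
    return [score_column]


# Hypothetical input table with an extra column not used in the comparison.
columns = ["product_id", "listing_a", "listing_b"]
print(output_columns(columns, include_input_columns=True))
print(output_columns(columns, include_input_columns=False))
```

With No selected, the output carries no key or text columns, so choosing Yes is usually necessary if the scores need to be traced back to their source rows.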