Amazon OpenSearch Upsert

Editions

Production use of this feature is available for specific editions only. Contact our sales team for more information.

Amazon OpenSearch Upsert is an orchestration component that converts data stored in your cloud data warehouse into vector embeddings and stores them in an Amazon OpenSearch index. This lets you use alternative embedding models (for example, from OpenAI or Amazon Bedrock) instead of Amazon OpenSearch's built-in embedding.

Note

Currently, this component only supports provisioned OpenSearch services, not serverless.

Prerequisites

Before you use the Amazon OpenSearch Upsert component, you'll need to add AWS cloud credentials to the Data Productivity Cloud.

Permissions

You'll need to ensure you have permissions for Amazon OpenSearch Service. If you're using Amazon Bedrock for your embeddings, you'll also need permission to invoke the model.

Amazon OpenSearch Service

To use the Amazon OpenSearch Upsert component, ensure that your IAM role or user has the necessary permissions to interact with Amazon OpenSearch. Below is an example of an IAM policy that grants the required permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "es:ESHttpPost",
        "es:ESHttpPut",
        "es:ESHttpGet",
        "es:ESHttpDelete"
      ],
      "Resource": "arn:aws:es:<region>:<account-id>:domain/<your-domain-name>/*"
    }
  ]
}

If you are using fine-grained access control in Amazon OpenSearch Service, you'll also need OpenSearch user/role permissions (for example, a role mapped to the appropriate OpenSearch index with write access).
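As a minimal sketch, the IAM policy above can be generated with its placeholders filled in, which avoids hand-editing the ARN. The function name opensearch_policy and the region, account ID, and domain name used here are hypothetical values for illustration, not part of the component.

```python
import json


def opensearch_policy(region: str, account_id: str, domain: str) -> dict:
    """Build the example IAM policy for a single OpenSearch domain."""
    resource = f"arn:aws:es:{region}:{account_id}:domain/{domain}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "es:ESHttpPost",
                    "es:ESHttpPut",
                    "es:ESHttpGet",
                    "es:ESHttpDelete",
                ],
                "Resource": resource,
            }
        ],
    }


# Hypothetical values, for illustration only.
policy = opensearch_policy("eu-west-1", "123456789012", "my-domain")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console or attached with your usual tooling.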

Amazon Bedrock

If you are using Amazon Bedrock as your embedding provider, ensure that your IAM role or user has the necessary permissions to invoke the model. Below is an example of an IAM policy that grants the required permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:<region>::foundation-model/<model-name>"
    }
  ]
}
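One way to check that the bedrock:InvokeModel permission is in place is to request a single embedding yourself. The sketch below is an assumption about how such a check could look, not a description of what the component does internally. The helper name embed_with_titan is hypothetical, and the model ID amazon.titan-embed-text-v1 is assumed to correspond to Titan Embeddings G1 - Text.

```python
import json


def embed_with_titan(client, text: str,
                     model_id: str = "amazon.titan-embed-text-v1") -> list:
    """Request one embedding through a Bedrock runtime client.

    `client` is expected to be a boto3 "bedrock-runtime" client.
    """
    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps({"inputText": text}),
        accept="application/json",
        contentType="application/json",
    )
    # The response body is a stream containing a JSON document.
    payload = json.loads(response["body"].read())
    return payload["embedding"]


# Usage (requires boto3 and AWS credentials; the region is an example):
#   import boto3
#   client = boto3.client("bedrock-runtime", region_name="us-east-1")
#   vector = embed_with_titan(client, "How do I log in?")
#   print(len(vector))
```

If the role lacks permission, the call raises an access-denied error rather than returning a vector.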

Properties

Reference material is provided below for the Source, Configure, and Destination properties.

Name = string

A human-readable name for the component.

Source

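The first properties below depend on your cloud data warehouse: Snowflake, Databricks, or Amazon Redshift. Each description names the warehouse it applies to.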

Database = drop-down

The Snowflake source database. The special value [Environment Default] uses the database defined in the environment. Read Databases, Tables and Views - Overview to learn more.

Schema = drop-down

The Snowflake source schema. The special value [Environment Default] uses the schema defined in the environment. Read Database, Schema, and Share DDL to learn more.

Catalog = drop-down

Select a source Databricks Unity Catalog. The special value [Environment Default] uses the catalog defined in the environment. Selecting a catalog will determine which databases are available in the next parameter.

Schema (Database) = drop-down

The Databricks source schema. The special value [Environment Default] uses the schema defined in the environment. Read Create and manage schemas to learn more.

Schema = drop-down

The Amazon Redshift source schema. The special value [Environment Default] uses the schema defined in the environment. Read Schemas to learn more.

Table = drop-down

Select the table that contains the data you want to upsert into Amazon OpenSearch.

Key Column = drop-down

The column that uniquely identifies each row in the table. It ensures that data is not duplicated when it is loaded into the destination. For example, a column of product IDs that identifies each product in the table.
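The role of the key column can be illustrated with a small sketch that models the destination index as a Python dictionary keyed by the key column. This is a simplification under that assumption, not the component's implementation. The upsert function and the column names product_id and review are hypothetical.

```python
def upsert(index: dict, rows: list[dict], key_column: str) -> dict:
    """Insert each row, or overwrite the existing entry with the same key."""
    for row in rows:
        index[str(row[key_column])] = row
    return index


rows = [
    {"product_id": 1, "review": "Great"},
    {"product_id": 2, "review": "Too small"},
    # Same key as the first row: replaces it instead of adding a duplicate.
    {"product_id": 1, "review": "Great, would buy again"},
]

index = upsert({}, rows, "product_id")
print(len(index))  # 2
print(index["1"]["review"])  # Great, would buy again
```

Because the third row reuses an existing key, it updates the stored entry, so running the same load twice does not grow the index.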

Text Column = drop-down

The column that contains the text to convert into vectors, which are then upserted as embeddings to Amazon OpenSearch. For example, a column of product reviews that you want to convert into vectors for semantic search or sentiment analysis.

Limit = integer (optional)

Set the Limit to control the maximum number of records (rows) to load from the table. The default is 1000.

Configure

Embedding Provider = drop-down

The embedding provider is the API service used to convert the text in the Text Column into vectors. Choose either OpenAI or Amazon Bedrock. The embedding provider receives a piece of text (e.g. "How do I log in?") and returns a vector.

Choose your provider, OpenAI or Amazon Bedrock. The OpenAI properties are listed first, followed by the Amazon Bedrock properties.

OpenAI API Key = drop-down

Use the drop-down menu to select the secret definition that holds your OpenAI API key.

Read Secrets and secret definitions to learn how to create a new secret definition.

To create a new OpenAI API key:

1. Log in to OpenAI.
2. Click your avatar in the top-right of the UI.
3. Click View API keys.
4. Click + Create new secret key.
5. Give a name for your new secret key and click Create secret key.
6. Copy your new secret key and save it. Then click Done.

Model = drop-down

Select an OpenAI embedding model.

Currently supports:

text-embedding-ada-002
text-embedding-3-small
text-embedding-3-large

API Batch Size = integer

Set the number of rows sent to the embedding provider in each API call. The default is 10. With a batch size of 10, 1000 rows require 100 API calls.

You may wish to reduce this number if a row contains a high volume of data, and conversely, increase this number for rows with low data volume.
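The relationship between batch size and the number of API calls can be sketched as follows. The batches helper is hypothetical and only illustrates the arithmetic. It does not call any embedding API.

```python
from typing import Iterator


def batches(rows: list[str], batch_size: int = 10) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = [f"row {i}" for i in range(1000)]

# Each batch corresponds to one API call.
calls = list(batches(rows, batch_size=10))
print(len(calls))  # 100
```

A larger batch size means fewer calls but a larger payload per call, which is why rows with a lot of text may need a smaller value.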

Region = drop-down

Select the AWS region for your Amazon Bedrock embedding model.

Model = drop-down

Select an Amazon Bedrock embedding model.

Currently supports:

Titan Embeddings G1 - Text

Destination

Endpoint URL = string

The URL of the Amazon OpenSearch domain endpoint to upsert your vector embeddings to. To find your endpoint URL:

1. Log in to the Amazon OpenSearch Service console.
2. Navigate to the Domains page.
3. Click on the domain you want to use.
4. Copy the Domain Endpoint URL from the domain details page.

Index = string

The name of an existing Amazon OpenSearch index where the vector embeddings will be upserted. An index in Amazon OpenSearch is similar to a table in a relational database.

Below is an example code snippet you could use to create an index. You can run the following command in the OpenSearch Dev Tools console or a REST API client like curl or Postman:

PUT /test-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "vector": {
        "type": "knn_vector",
        "dimension": 1536
      },
      "rawData": {
        "type": "text",
        "index": false
      },
      "key": {
        "type": "object"
      }
    }
  }
}

Note

The dimension value must match the output dimension of the embedding model you have chosen:

text-embedding-ada-002 and text-embedding-3-small output vectors of dimension 1536.
text-embedding-3-large outputs vectors of dimension 3072.
Titan Embeddings G1 - Text outputs vectors of dimension 1536.

If you're using the text-embedding-3-large model, update "dimension": 1536 to "dimension": 3072 in the mapping above.
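To avoid a mismatch between the model and the mapping, the index body can be generated from the model name. This is a sketch that reuses the example mapping and the OpenAI dimensions stated in this section. The names EMBEDDING_DIMENSIONS and index_body are hypothetical.

```python
import json

# Output dimensions for the OpenAI models listed in this section.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def index_body(model: str) -> dict:
    """Return the example index definition with the model's dimension."""
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "vector": {
                    "type": "knn_vector",
                    "dimension": EMBEDDING_DIMENSIONS[model],
                },
                "rawData": {"type": "text", "index": False},
                "key": {"type": "object"},
            }
        },
    }


# Request body for PUT /test-index when using text-embedding-3-large.
print(json.dumps(index_body("text-embedding-3-large"), indent=2))
```

An unknown model name raises a KeyError, which surfaces the problem before the index is created with the wrong dimension.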

Region = drop-down

Select the AWS region of your Amazon OpenSearch Service domain.

Snowflake	Databricks	Amazon Redshift
✅	✅	✅
