Amazon OpenSearch Upsert
Editions
Production use of this feature is available for specific editions only. Contact our sales team for more information.
Amazon OpenSearch Upsert is an orchestration component that converts data stored in your cloud data warehouse into vector embeddings and stores them in an Amazon OpenSearch index. This lets you use alternative embedding models (for example, from OpenAI or Amazon Bedrock) instead of Amazon OpenSearch's built-in embedding.
Note
Currently, this component only supports provisioned OpenSearch services, not serverless.
Prerequisites
Before you use the Amazon OpenSearch Upsert component, you'll need to add AWS cloud credentials to the Data Productivity Cloud.
Permissions
You'll need to ensure you have permissions for Amazon OpenSearch Service. If you're using Amazon Bedrock for your embeddings, you'll also need permission to invoke the model.
Amazon OpenSearch Service
To use the Amazon OpenSearch Upsert component, ensure that your IAM role or user has the necessary permissions to interact with Amazon OpenSearch. Below is an example of an IAM policy that grants the required permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "es:ESHttpPost",
        "es:ESHttpPut",
        "es:ESHttpGet",
        "es:ESHttpDelete"
      ],
      "Resource": "arn:aws:es:<region>:<account-id>:domain/<your-domain-name>/*"
    }
  ]
}
If you are using fine-grained access control in Amazon OpenSearch Service, you'll also need OpenSearch user/role permissions (for example, a role mapped to the appropriate OpenSearch index with write access).
Amazon Bedrock
If you are using Amazon Bedrock as your embedding provider, ensure that your IAM role or user has the necessary permissions to invoke the model. Below is an example of an IAM policy that grants the required permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:<region>::foundation-model/<model-name>"
    }
  ]
}
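The two example policies can also be generated from a handful of values rather than edited by hand. The following is a minimal sketch, not part of the component itself: the function name `build_upsert_policy`, its parameters, and the sample values are hypothetical, and it assumes that Bedrock foundation-model ARNs leave the account ID field empty.

```python
import json


def build_upsert_policy(region, account_id, domain_name, model_name=None):
    """Return an IAM policy dict covering the OpenSearch domain and,
    optionally, one Bedrock foundation model (hypothetical helper)."""
    statements = [
        {
            "Effect": "Allow",
            "Action": [
                "es:ESHttpPost",
                "es:ESHttpPut",
                "es:ESHttpGet",
                "es:ESHttpDelete",
            ],
            "Resource": f"arn:aws:es:{region}:{account_id}:domain/{domain_name}/*",
        }
    ]
    if model_name is not None:
        statements.append(
            {
                "Effect": "Allow",
                "Action": ["bedrock:InvokeModel"],
                # Assumption: foundation-model ARNs have an empty account field.
                "Resource": f"arn:aws:bedrock:{region}::foundation-model/{model_name}",
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


# Sample values are placeholders, not real resources.
policy = build_upsert_policy(
    "us-east-1", "123456789012", "my-domain", "example-embedding-model"
)
print(json.dumps(policy, indent=2))
```

Omit `model_name` when OpenAI is the embedding provider, since no Bedrock permission is needed in that case.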
Properties
Reference material is provided below for the Source, Configure, and Destination properties.
Source
Database
= drop-down
The Snowflake source database. The special value [Environment Default] uses the database defined in the environment. Read Databases, Tables and Views - Overview to learn more.
Schema
= drop-down
The Snowflake source schema. The special value [Environment Default] uses the schema defined in the environment. Read Database, Schema, and Share DDL to learn more.
Catalog
= drop-down
Select a source Databricks Unity Catalog. The special value [Environment Default] uses the catalog defined in the environment. Selecting a catalog will determine which databases are available in the next parameter.
Schema (Database)
= drop-down
The Databricks source schema. The special value [Environment Default] uses the schema defined in the environment. Read Create and manage schemas to learn more.
Schema
= drop-down
The Amazon Redshift source schema. The special value [Environment Default] uses the schema defined in the environment. Read Schemas to learn more.
Table
= drop-down
Select the table that contains the data you want to upsert into Amazon OpenSearch.
Key Column
= drop-down
The column that uniquely identifies each row in the table. It ensures that data is not duplicated when it is loaded into the destination. For example, a column of product IDs that identifies each product in the table.
Text Column
= drop-down
The column whose text is converted into vectors, which are then upserted as embeddings to Amazon OpenSearch. For example, a column of product reviews that you want to convert into vectors for semantic search or sentiment analysis.
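The roles of the Key Column and Text Column can be illustrated with a small in-memory sketch. This is not the component's implementation: `toy_embed` is a stand-in for a real embedding provider and produces no meaningful semantics, `build_documents` is a hypothetical helper, and the document field names (`key`, `rawData`, `vector`) are borrowed from the example index mapping on this page; the exact shape of the `key` object is an assumption.

```python
def toy_embed(text, dimension=4):
    """Stand-in for an embedding provider. NOT a semantic embedding."""
    vector = [0.0] * dimension
    for i, ch in enumerate(text):
        vector[i % dimension] += ord(ch) / 1000.0
    return vector


def build_documents(rows, key_column, text_column, embed=toy_embed):
    """Map each row to one document, identified by its Key Column value.

    A later row with the same key replaces the earlier one, which is
    the behaviour an upsert relies on to avoid duplicates.
    """
    documents = {}
    for row in rows:
        doc_id = str(row[key_column])
        documents[doc_id] = {
            "key": {key_column: row[key_column]},  # assumed shape
            "rawData": row[text_column],
            "vector": embed(row[text_column]),
        }
    return documents


# Hypothetical sample data: product 1 appears twice.
rows = [
    {"PRODUCT_ID": 1, "REVIEW": "Great fit"},
    {"PRODUCT_ID": 2, "REVIEW": "Too small"},
    {"PRODUCT_ID": 1, "REVIEW": "Great fit, still going strong"},
]
docs = build_documents(rows, "PRODUCT_ID", "REVIEW")
print(len(docs))
print(docs["1"]["rawData"])
```

Three input rows yield two documents here, because the second row for product 1 overwrites the first instead of creating a duplicate.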
Limit
= integer (optional)
Set the Limit to control the maximum number of records (rows) to load from the table. The default is 1000.
Configure
Embedding Provider
= drop-down
The embedding provider is the API service used to convert the text in your Text Column into vectors. Choose either OpenAI or Amazon Bedrock. The embedding provider receives a piece of text (e.g. "How do I log in?") and returns a vector.
Choose your provider:
OpenAI API Key
= drop-down
Use the drop-down menu to select the secret definition that stores your OpenAI API key.
Read Secrets and secret definitions to learn how to create a new secret definition.
To create a new OpenAI API key:
- Log in to OpenAI.
- Click your avatar in the top-right of the UI.
- Click View API keys.
- Click + Create new secret key.
- Give a name for your new secret key and click Create secret key.
- Copy your new secret key and save it. Then click Done.
Model
= drop-down
Select an OpenAI embedding model.
Currently supports:
- text-embedding-ada-002
- text-embedding-3-small
- text-embedding-3-large
API Batch Size
= integer
Set the number of rows sent per API call. The default is 10. With a batch size of 10, 1000 rows would require 100 API calls.
Consider reducing this number if rows contain a large volume of data and, conversely, increasing it if rows contain little data.
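The relationship between batch size and the number of API calls can be checked with a short sketch. The helper `batch_rows` is hypothetical and only illustrates the arithmetic; it does not call any embedding API.

```python
import math


def batch_rows(rows, batch_size=10):
    """Yield consecutive slices of `rows`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = [f"row {i}" for i in range(1000)]
batches = list(batch_rows(rows, batch_size=10))

# One API call per batch: 1000 rows / 10 rows per call = 100 calls.
print(len(batches))
print(math.ceil(len(rows) / 10))
```

When the row count is not an exact multiple of the batch size, the final batch is simply smaller, so the number of calls is the row count divided by the batch size, rounded up.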
Region
= drop-down
Select the AWS region for your embedding model.
Model
= drop-down
Select an embedding model.
Currently supports:
Destination
Endpoint URL
= string
The URL of the Amazon OpenSearch domain endpoint to upsert your vector embeddings to. To find your endpoint URL:
- Log in to the Amazon OpenSearch Service console.
- Navigate to the Domains page.
- Click on the domain you want to use.
- Copy the Domain Endpoint URL from the domain details page.
Index
= string
The name of the Amazon OpenSearch index where the vector embeddings will be upserted. An index in Amazon OpenSearch is similar to a table in a relational database.
To create an index, you can run the following command in the OpenSearch Dev Tools console or a REST API client like curl or Postman:
PUT /test-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "vector": {
        "type": "knn_vector",
        "dimension": 1536
      },
      "rawData": {
        "type": "text",
        "index": false
      },
      "key": {
        "type": "object"
      }
    }
  }
}
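The `dimension` in the mapping needs to match the length of the vectors your embedding model returns. The value 1536 matches the default output size of text-embedding-ada-002 and text-embedding-3-small, while text-embedding-3-large returns 3072-dimensional vectors by default. The sketch below builds the same request body for a chosen OpenAI model; `MODEL_DIMENSIONS` and `build_index_body` are hypothetical names, and the sketch assumes the models are used with their default dimensions.

```python
import json

# Default output dimensions of the OpenAI embedding models listed on this page.
MODEL_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def build_index_body(model):
    """Return the index-creation body with a dimension matching `model`."""
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "vector": {
                    "type": "knn_vector",
                    "dimension": MODEL_DIMENSIONS[model],
                },
                "rawData": {"type": "text", "index": False},
                "key": {"type": "object"},
            }
        },
    }


body = build_index_body("text-embedding-3-large")
# Paste the output into Dev Tools after a line such as: PUT /test-index
print(json.dumps(body, indent=2))
```

An unknown model name raises a `KeyError`, which is intentional in this sketch: guessing a dimension would produce an index that rejects the vectors at upsert time.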
Region
= drop-down
Select the AWS region of your Amazon OpenSearch Service domain.
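If you are unsure which region to select, it is usually visible in the domain endpoint itself. The sketch below assumes the standard endpoint host format ending in `<region>.es.amazonaws.com`; the function name `region_from_endpoint` and the sample URL are hypothetical, and custom endpoints will not match.

```python
import re
from urllib.parse import urlparse

# Assumption: standard domain endpoints end in "<region>.es.amazonaws.com".
_ENDPOINT_PATTERN = re.compile(r"\.([a-z0-9-]+)\.es\.amazonaws\.com$")


def region_from_endpoint(endpoint_url):
    """Return the region embedded in a standard endpoint URL, or None."""
    host = urlparse(endpoint_url).hostname or ""
    match = _ENDPOINT_PATTERN.search(host)
    return match.group(1) if match else None


# Placeholder endpoint, not a real domain.
endpoint = "https://search-my-domain-abc123.us-east-1.es.amazonaws.com"
print(region_from_endpoint(endpoint))
```

A return value of `None` means the host did not follow the assumed format, in which case check the region shown on the domain details page in the console.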
| Snowflake | Databricks | Amazon Redshift |
|---|---|---|
| ✅ | ✅ | ✅ |