Amazon Transcribe Input
Amazon Transcribe Input is an orchestration component that extracts speech from audio and other media files and converts it into text.
The component uses the Amazon Transcribe API to retrieve data and load it into a table. This stages the data, so the table is reloaded each time the pipeline runs. You can then use transformations to enrich and manage the data in permanent tables. For more information, read Amazon Transcribe.
Prerequisites
Before you use the Amazon Transcribe Input component, you'll need to add AWS cloud credentials to the Data Productivity Cloud.
Properties
Reference material is provided below for the Destination and Configure properties.
Destination
Select your cloud data warehouse.
:mod-destination-sf:
:mod-warehouse-sf: :mod-database-sf: :mod-schema-sf: :mod-table-name-sf: :mod-load-strategy-sf: :mod-clean-staged-files-sf: :mod-stage-platform-sf:
:mod-stage-amazon-s3-bucket:
:mod-stage-internal-stage-type-sf:
:mod-stage-azure-storage-account: :mod-stage-azure-container:
:mod-stage-gcs-storage-integration: :mod-stage-gcs-bucket: :mod-stage-overwrite:
:mod-cloud-storage-load-strategy: :mod-cloud-storage-folder-path: :mod-cloud-storage-file-prefix: :mod-cloud-storage-storage:
:mod-cloud-storage-amazon-s3-bucket:
:mod-cloud-storage-azure-storage-account: :mod-cloud-storage-azure-container:
:mod-cloud-storage-gcs-bucket: :mod-cloud-storage-overwrite:
:mod-destination-db:
:mod-catalog-db: :mod-schema-database-db: :mod-table-name-db: :mod-load-strategy-db: :mod-clean-staged-files-db: :mod-stage-platform-db:
:mod-stage-amazon-s3-bucket:
:mod-stage-azure-storage-account: :mod-stage-azure-container:
:mod-stage-gcs-storage-integration: :mod-stage-gcs-bucket: :mod-stage-overwrite:
:mod-cloud-storage-load-strategy: :mod-cloud-storage-folder-path: :mod-cloud-storage-file-prefix: :mod-cloud-storage-storage:
:mod-cloud-storage-amazon-s3-bucket:
:mod-cloud-storage-azure-storage-account: :mod-cloud-storage-azure-container:
:mod-cloud-storage-gcs-bucket: :mod-cloud-storage-overwrite:
:mod-destination-rs:
:mod-schema-rs: :mod-table-name-rs: :mod-load-strategy-rs: :mod-clean-staged-files-rs: :mod-amazon-s3-bucket-rs:
:mod-cloud-storage-load-strategy: :mod-cloud-storage-folder-path: :mod-cloud-storage-file-prefix: :mod-cloud-storage-storage:
:mod-cloud-storage-amazon-s3-bucket:
:mod-cloud-storage-azure-storage-account: :mod-cloud-storage-azure-container:
:mod-cloud-storage-gcs-bucket: :mod-cloud-storage-overwrite:
Configure
S3 Bucket Region
= drop-down
The AWS region where the S3 bucket you want to connect to is located.
S3 Object Prefix
= string
The S3 path to the bucket, folder, or file that will be processed. For example, `s3://bucket-name/`, `s3://bucket-name/folder/`, or `s3://bucket-name/folder/specific-file.pdf`.
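To preview which objects fall under a given prefix before running the pipeline, you can list them directly. A minimal sketch using boto3, where the bucket name and prefix are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# List objects under the same prefix the component would scan
# (bucket name and prefix are placeholders; large buckets may
# need pagination beyond the first 1,000 keys).
response = s3.list_objects_v2(Bucket="bucket-name", Prefix="folder/")
for obj in response.get("Contents", []):
    print(obj["Key"])
```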
Source File Filter Pattern
= string
A regex pattern used to filter files. This is useful when you select a bucket or folder in the S3 Object Prefix parameter and want to filter which files are processed. For example, the value `.*\.pdf$` will match all files with a `.pdf` ending, or a similar pattern could be used to match specific parts of a file name.
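The sketch below illustrates how such a pattern behaves against a list of object keys. It assumes full-match semantics, which the documentation doesn't state explicitly, so verify against the files the component actually picks up:

```python
import re

# Hypothetical file list; the anchored pattern keeps only .pdf files.
keys = ["folder/report.pdf", "folder/notes.txt", "folder/scan.PDF"]
pattern = re.compile(r".*\.pdf$")

matching = [k for k in keys if pattern.fullmatch(k)]
print(matching)  # ['folder/report.pdf'] (matching is case-sensitive)
```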
Max Speakers
= integer
The maximum number of speakers that will be detected in the audio files. For example, for a meeting recording with 5 people, enter `5`. The default is 2.
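The component submits transcription jobs for you; for context, this property corresponds to speaker partitioning in the underlying Amazon Transcribe API. A sketch of the equivalent direct boto3 call, where the job name and media URI are placeholders:

```python
import boto3

transcribe = boto3.client("transcribe")

# Placeholders throughout; MaxSpeakerLabels plays the role of Max Speakers.
transcribe.start_transcription_job(
    TranscriptionJobName="meeting-transcription",
    Media={"MediaFileUri": "s3://bucket-name/folder/meeting.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    Settings={
        "ShowSpeakerLabels": True,  # required when setting MaxSpeakerLabels
        "MaxSpeakerLabels": 5,
    },
)
```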
Concurrent Jobs
= integer
Set how many Amazon Transcribe jobs can run concurrently. The default is 10, and the maximum is 100.
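The component manages this scheduling internally. The sketch below only illustrates the general pattern of capping in-flight work, with a stub standing in for one transcription job:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def transcribe_file(key: str) -> str:
    """Stub standing in for one Amazon Transcribe job (start + poll)."""
    time.sleep(0.1)  # simulate work
    return f"{key}: done"

files = [f"audio-{i}.mp3" for i in range(25)]

# max_workers caps in-flight jobs, analogous to Concurrent Jobs = 10.
with ThreadPoolExecutor(max_workers=10) as pool:
    for result in pool.map(transcribe_file, files):
        print(result)
```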
Deactivate soft delete for Azure blobs (Databricks)
If you intend to set your destination as Databricks and your stage platform as Azure Storage, you must turn off the "Enable soft delete for blobs" setting in your Azure account for your pipeline to run successfully. To do this:
- Log in to the Azure portal.
- In the top-left, click ☰ → Storage Accounts.
- Select the intended storage account.
- In the menu, under Data management, click Data protection.
- Untick Enable soft delete for blobs. For more information, read Soft delete for blobs.
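If you manage storage accounts programmatically rather than through the portal, soft delete can also be turned off with the Azure SDK. A sketch using azure-identity and azure-mgmt-storage, where the subscription ID, resource group, and account name are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import BlobServiceProperties, DeleteRetentionPolicy

# Placeholders: substitute your own subscription, resource group, and account.
client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Disable the "Enable soft delete for blobs" setting for the account.
client.blob_services.set_service_properties(
    resource_group_name="my-resource-group",
    account_name="mystorageaccount",
    parameters=BlobServiceProperties(
        delete_retention_policy=DeleteRetentionPolicy(enabled=False),
    ),
)
```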
| Snowflake | Databricks | Amazon Redshift |
| --- | --- | --- |
| ✅ | ✅ | ✅ |