
Amazon Transcribe Input

Amazon Transcribe Input is an orchestration component that extracts speech from audio and video files and converts it into text.

The component uses the Amazon Transcribe API to retrieve data to load into a table. Because the data is staged, the table is reloaded each time the pipeline runs. You can then use transformations to enrich and manage the data in permanent tables. For more information, read Amazon Transcribe.


Prerequisites

Before you use the Amazon Transcribe Input component, you'll need to add AWS cloud credentials to the Data Productivity Cloud.


Properties

Reference material is provided below for the Destination and Configure properties.

Destination

Select your cloud data warehouse.

:mod-destination-sf:

:mod-warehouse-sf: :mod-database-sf: :mod-schema-sf: :mod-table-name-sf: :mod-load-strategy-sf: :mod-clean-staged-files-sf: :mod-stage-platform-sf:

:mod-stage-amazon-s3-bucket:

:mod-stage-internal-stage-type-sf:

:mod-stage-azure-storage-account: :mod-stage-azure-container:

:mod-stage-gcs-storage-integration: :mod-stage-gcs-bucket: :mod-stage-overwrite:

:mod-cloud-storage-load-strategy: :mod-cloud-storage-folder-path: :mod-cloud-storage-file-prefix: :mod-cloud-storage-storage:

:mod-cloud-storage-amazon-s3-bucket:

:mod-cloud-storage-azure-storage-account: :mod-cloud-storage-azure-container:

:mod-cloud-storage-gcs-bucket: :mod-cloud-storage-overwrite:

:mod-destination-db:

:mod-catalog-db: :mod-schema-database-db: :mod-table-name-db: :mod-load-strategy-db: :mod-clean-staged-files-db: :mod-stage-platform-db:

:mod-stage-amazon-s3-bucket:

:mod-stage-azure-storage-account: :mod-stage-azure-container:

:mod-stage-gcs-storage-integration: :mod-stage-gcs-bucket: :mod-stage-overwrite:

:mod-cloud-storage-load-strategy: :mod-cloud-storage-folder-path: :mod-cloud-storage-file-prefix: :mod-cloud-storage-storage:

:mod-cloud-storage-amazon-s3-bucket:

:mod-cloud-storage-azure-storage-account: :mod-cloud-storage-azure-container:

:mod-cloud-storage-gcs-bucket: :mod-cloud-storage-overwrite:

:mod-destination-rs:

:mod-schema-rs: :mod-table-name-rs: :mod-load-strategy-rs: :mod-clean-staged-files-rs: :mod-amazon-s3-bucket-rs:

:mod-cloud-storage-load-strategy: :mod-cloud-storage-folder-path: :mod-cloud-storage-file-prefix: :mod-cloud-storage-storage:

:mod-cloud-storage-amazon-s3-bucket:

:mod-cloud-storage-azure-storage-account: :mod-cloud-storage-azure-container:

:mod-cloud-storage-gcs-bucket: :mod-cloud-storage-overwrite:


Configure

S3 Bucket Region = drop-down

The AWS region where the S3 bucket you want to connect to is located.


S3 Object Prefix = string

The S3 path to the bucket, folder, or file that will be processed. For example, s3://bucket-name/, s3://bucket-name/folder/, or s3://bucket-name/folder/specific-file.mp3.


Source File Filter Pattern = string

A regex pattern used to filter files. This is useful when the S3 Object Prefix parameter points to a bucket or folder and you want to control which of its files are processed. For example, .*\.mp3$ matches all files ending in .mp3; a pattern can also match specific parts of a file name, as in the sketch below.
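
To illustrate, here is a minimal sketch of how such a pattern filters object keys; the bucket contents shown are hypothetical.

```python
import re

# Hypothetical object keys found under the configured S3 Object Prefix.
keys = [
    "recordings/meeting-2024-01.mp3",
    "recordings/meeting-2024-01.txt",
    "recordings/archive/standup.mp3",
]

# The same kind of pattern you would enter in Source File Filter Pattern.
pattern = re.compile(r".*\.mp3$")

matched = [k for k in keys if pattern.match(k)]
print(matched)  # ['recordings/meeting-2024-01.mp3', 'recordings/archive/standup.mp3']
```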


Max Speakers = integer

The maximum number of speakers that will be detected in the audio files. For example, for a meeting recording with 5 people, enter 5. The default is 2.
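
The component creates and manages transcription jobs for you. For context, speaker detection in the underlying Amazon Transcribe API is controlled by the ShowSpeakerLabels and MaxSpeakerLabels job settings; the boto3 sketch below shows how a five-speaker job would be configured if you called the API directly. The job name, region, and media location are hypothetical.

```python
import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

# Hypothetical job: a 5-person meeting recording, so up to 5 speaker labels.
transcribe.start_transcription_job(
    TranscriptionJobName="meeting-2024-01",
    Media={"MediaFileUri": "s3://bucket-name/folder/meeting-2024-01.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    Settings={
        "ShowSpeakerLabels": True,  # enable speaker diarization
        "MaxSpeakerLabels": 5,      # analogous to Max Speakers = 5
    },
)
```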


Concurrent Jobs = integer

The number of Amazon Transcribe jobs that can run concurrently. The default is 10. The maximum is 100.
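
As a rough illustration of what a concurrency cap does (this is not the component's internal implementation), the sketch below processes a hypothetical list of files with at most 10 in flight at once.

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe_file(key: str) -> str:
    # Placeholder standing in for submitting one transcription job
    # and waiting for its result.
    return f"transcript for {key}"

keys = [f"recordings/file-{i}.mp3" for i in range(25)]

# max_workers=10 means at most 10 files are processed at a time,
# mirroring the effect of Concurrent Jobs = 10.
with ThreadPoolExecutor(max_workers=10) as pool:
    transcripts = list(pool.map(transcribe_file, keys))

print(len(transcripts))  # 25
```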


Deactivate soft delete for Azure blobs (Databricks)

If your destination is Databricks and your stage platform is Azure Storage, you must turn off the "Enable soft delete for blobs" setting on your Azure storage account for your pipeline to run successfully. To do this in the Azure portal (a scripted alternative follows these steps):

  1. Log in to the Azure portal.
  2. In the top-left, click ☰ → Storage Accounts.
  3. Select the intended storage account.
  4. In the menu, under Data management, click Data protection.
  5. Untick Enable soft delete for blobs. For more information, read Soft delete for blobs.
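
If you prefer to script the change, a minimal sketch using the azure-mgmt-storage Python SDK might look like the following; the subscription ID, resource group, and storage account name are hypothetical, and your authentication setup may differ.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import BlobServiceProperties, DeleteRetentionPolicy

# Hypothetical identifiers; substitute your own.
client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Disable soft delete for blobs on the staging storage account.
client.blob_services.set_service_properties(
    resource_group_name="my-resource-group",
    account_name="mystorageaccount",
    parameters=BlobServiceProperties(
        delete_retention_policy=DeleteRetentionPolicy(enabled=False)
    ),
)
```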
