Amazon Textract Input

Amazon Textract Input is an orchestration component that uses the Textract API to extract text, handwriting, layout elements, and data from scanned documents. You can choose to include footers and page numbers in the extraction process.

The component uses the Textract API to retrieve data and load it into a table. This action stages the data, meaning the table is recreated on each run. You can then use transformations to enrich and manage the data in permanent tables. For more information, read Amazon Textract.
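As context for what the component automates, the sketch below shows the kind of Textract API call involved, using Python and boto3. The region, bucket, and file names are hypothetical, and AWS credentials are assumed to be configured in the environment.

```python
import boto3

# Hypothetical region and S3 object; credentials come from the environment.
textract = boto3.client("textract", region_name="us-east-1")

# Synchronous text detection on a single scanned page stored in S3.
response = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "bucket-name", "Name": "folder/scan-page-1.png"}}
)

# Textract returns a list of Blocks; LINE blocks carry the detected text.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```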


Prerequisites

Before you use the Amazon Textract Input component, you'll need to add AWS cloud credentials to the Data Productivity Cloud.


Properties

Reference material is provided below for the Destination and Configure properties.

Destination

Select your cloud data warehouse.

:mod-destination-sf:

:mod-warehouse-sf: :mod-database-sf: :mod-schema-sf: :mod-table-name-sf: :mod-load-strategy-sf: :mod-clean-staged-files-sf: :mod-stage-platform-sf:

:mod-stage-amazon-s3-bucket:

:mod-stage-internal-stage-type-sf:

:mod-stage-azure-storage-account: :mod-stage-azure-container:

:mod-stage-gcs-storage-integration: :mod-stage-gcs-bucket: :mod-stage-overwrite:

:mod-cloud-storage-load-strategy: :mod-cloud-storage-folder-path: :mod-cloud-storage-file-prefix: :mod-cloud-storage-storage:

:mod-cloud-storage-amazon-s3-bucket:

:mod-cloud-storage-azure-storage-account: :mod-cloud-storage-azure-container:

:mod-cloud-storage-gcs-bucket: :mod-cloud-storage-overwrite:

:mod-destination-db:

:mod-catalog-db: :mod-schema-database-db: :mod-table-name-db: :mod-load-strategy-db: :mod-clean-staged-files-db: :mod-stage-platform-db:

:mod-stage-amazon-s3-bucket:

:mod-stage-azure-storage-account: :mod-stage-azure-container:

:mod-stage-gcs-storage-integration: :mod-stage-gcs-bucket: :mod-stage-overwrite:

:mod-cloud-storage-load-strategy: :mod-cloud-storage-folder-path: :mod-cloud-storage-file-prefix: :mod-cloud-storage-storage:

:mod-cloud-storage-amazon-s3-bucket:

:mod-cloud-storage-azure-storage-account: :mod-cloud-storage-azure-container:

:mod-cloud-storage-gcs-bucket: :mod-cloud-storage-overwrite:

:mod-destination-rs:

:mod-schema-rs: :mod-table-name-rs: :mod-load-strategy-rs: :mod-clean-staged-files-rs: :mod-amazon-s3-bucket-rs:

:mod-cloud-storage-load-strategy: :mod-cloud-storage-folder-path: :mod-cloud-storage-file-prefix: :mod-cloud-storage-storage:

:mod-cloud-storage-amazon-s3-bucket:

:mod-cloud-storage-azure-storage-account: :mod-cloud-storage-azure-container:

:mod-cloud-storage-gcs-bucket: :mod-cloud-storage-overwrite:


Configure

S3 Bucket region = drop-down

The AWS region where the S3 bucket you want to connect to is located.


S3 Object Prefix = string

The S3 path to the bucket, folder, or file to be processed. For example: s3://bucket-name/, s3://bucket-name/folder/, or s3://bucket-name/folder/specific-file.pdf.


Source File Filter Pattern = string

A regex pattern used to filter files. This is useful when the S3 Object Prefix parameter points to a bucket or folder and you want to control which files within it are processed. For example, a value of .*\.pdf$ matches all files ending in .pdf; similar patterns can match specific parts of a file name.
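To illustrate how the two properties combine, the sketch below lists the objects under an S3 Object Prefix and keeps only those matching the Source File Filter Pattern. The bucket and prefix values are hypothetical.

```python
import re

import boto3

# Hypothetical values standing in for the component's properties.
S3_OBJECT_PREFIX = "s3://bucket-name/folder/"   # S3 Object Prefix
FILTER_PATTERN = re.compile(r".*\.pdf$")        # Source File Filter Pattern

# Split the s3:// URL into a bucket name and a key prefix.
bucket, _, prefix = S3_OBJECT_PREFIX.removeprefix("s3://").partition("/")

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if FILTER_PATTERN.match(obj["Key"]):
            print(obj["Key"])  # this file would be included for processing
```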


Features = dual listbox

Select which text elements to extract from the document. Available options are Page numbers, Footers, or both.


Deactivate soft delete for Azure blobs (Databricks)

If you intend to set your destination as Databricks and your stage platform as Azure Storage, you must turn off the "Enable soft delete for blobs" setting in your Azure account for your pipeline to run successfully. To do this in the Azure portal (a scripted alternative is sketched after these steps):

  1. Log in to the Azure portal.
  2. In the top-left, click ☰ → Storage Accounts.
  3. Select the intended storage account.
  4. In the menu, under Data management, click Data protection.
  5. Untick Enable soft delete for blobs. For more information, read Soft delete for blobs.

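If you prefer to script this change, the sketch below disables the setting with the azure-mgmt-storage SDK. The subscription, resource group, and account names are hypothetical, and exact method signatures can vary between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import BlobServiceProperties, DeleteRetentionPolicy

# Hypothetical subscription ID, resource group, and storage account names.
client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Disable the blob soft-delete retention policy on the storage account.
client.blob_services.set_service_properties(
    resource_group_name="my-resource-group",
    account_name="mystorageaccount",
    parameters=BlobServiceProperties(
        delete_retention_policy=DeleteRetentionPolicy(enabled=False)
    ),
)
```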