Amazon Textract Input
Amazon Textract Input is an orchestration component that uses the Textract API to extract text, handwriting, layout elements, and data from scanned documents. You can choose to include footers and page numbers in the extraction process.
The component uses the Textract API to retrieve data and load it into a table. This stages the data, so the table is reloaded each time the component runs. You can then use transformations to enrich and manage the data in permanent tables. For more information, read Amazon Textract.
Prerequisites
Before you use the Amazon Textract Input component, you'll need to add AWS cloud credentials to the Data Productivity Cloud.
Properties
Reference material is provided below for the Destination and Configure properties.
Destination
Select your cloud data warehouse.
:mod-destination-sf:
:mod-warehouse-sf: :mod-database-sf: :mod-schema-sf: :mod-table-name-sf: :mod-load-strategy-sf: :mod-clean-staged-files-sf: :mod-stage-platform-sf:
:mod-stage-amazon-s3-bucket:
:mod-stage-internal-stage-type-sf:
:mod-stage-azure-storage-account: :mod-stage-azure-container:
:mod-stage-gcs-storage-integration: :mod-stage-gcs-bucket: :mod-stage-overwrite:
:mod-cloud-storage-load-strategy: :mod-cloud-storage-folder-path: :mod-cloud-storage-file-prefix: :mod-cloud-storage-storage:
:mod-cloud-storage-amazon-s3-bucket:
:mod-cloud-storage-azure-storage-account: :mod-cloud-storage-azure-container:
:mod-cloud-storage-gcs-bucket: :mod-cloud-storage-overwrite:
:mod-destination-db:
:mod-catalog-db: :mod-schema-database-db: :mod-table-name-db: :mod-load-strategy-db: :mod-clean-staged-files-db: :mod-stage-platform-db:
:mod-stage-amazon-s3-bucket:
:mod-stage-azure-storage-account: :mod-stage-azure-container:
:mod-stage-gcs-storage-integration: :mod-stage-gcs-bucket: :mod-stage-overwrite:
:mod-cloud-storage-load-strategy: :mod-cloud-storage-folder-path: :mod-cloud-storage-file-prefix: :mod-cloud-storage-storage:
:mod-cloud-storage-amazon-s3-bucket:
:mod-cloud-storage-azure-storage-account: :mod-cloud-storage-azure-container:
:mod-cloud-storage-gcs-bucket: :mod-cloud-storage-overwrite:
:mod-destination-rs:
:mod-schema-rs: :mod-table-name-rs: :mod-load-strategy-rs: :mod-clean-staged-files-rs: :mod-amazon-s3-bucket-rs:
:mod-cloud-storage-load-strategy: :mod-cloud-storage-folder-path: :mod-cloud-storage-file-prefix: :mod-cloud-storage-storage:
:mod-cloud-storage-amazon-s3-bucket:
:mod-cloud-storage-azure-storage-account: :mod-cloud-storage-azure-container:
:mod-cloud-storage-gcs-bucket: :mod-cloud-storage-overwrite:
Configure
S3 Bucket region
= drop-down
The AWS region where the S3 bucket you want to connect to is located.
S3 Object Prefix
= string
The S3 path to the bucket, folder, or file that will be processed. For example, s3://bucket-name/, s3://bucket-name/folder/, or s3://bucket-name/folder/specific-file.pdf.
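The three path forms above differ only in how much of the object key they specify. The following is a minimal sketch, not the component's own code, of how such a path splits into a bucket name and a key prefix. The helper name `split_s3_uri` is hypothetical and introduced only for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key_prefix).

    Hypothetical helper for illustration; the component's internal
    parsing is not documented here.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the network location; the rest is the key prefix.
    return parsed.netloc, parsed.path.lstrip("/")


for uri in (
    "s3://bucket-name/",
    "s3://bucket-name/folder/",
    "s3://bucket-name/folder/specific-file.pdf",
):
    print(split_s3_uri(uri))
```

An empty key prefix corresponds to the whole bucket, a prefix ending in `/` to a folder, and anything else to a single object.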
Source File Filter Pattern
= string
A regex pattern used to filter files. This is useful when you select a bucket or folder in the S3 Object Prefix parameter and want to filter which files are processed. For example, a value of .*\.pdf$ will match all files with a .pdf ending. A pattern can also match on specific parts of a file name.
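To check a pattern before using it, you can try it against a list of sample file names. The sketch below uses Python's `re` module and assumes the pattern is searched against the full object key and is case-sensitive. The component's actual regex flavor and matching rules may differ, and the helper `filter_keys` and the sample keys are hypothetical.

```python
import re


def filter_keys(keys: list[str], pattern: str) -> list[str]:
    """Return the keys that the regex pattern matches.

    Assumption: search semantics against the whole key, case-sensitive.
    """
    regex = re.compile(pattern)
    return [key for key in keys if regex.search(key)]


# Hypothetical object keys, for illustration only.
keys = [
    "folder/invoice-001.pdf",
    "folder/invoice-002.PDF",
    "folder/notes.txt",
    "folder/scan.pdf.bak",
]

# Files with a .pdf ending (lowercase only under these assumptions).
print(filter_keys(keys, r".*\.pdf$"))

# Match on a specific part of the file name instead of the extension.
print(filter_keys(keys, r"invoice-\d+"))
```

Under these assumptions, the first pattern skips `invoice-002.PDF` because the match is case-sensitive, and skips `scan.pdf.bak` because `$` anchors the match to the end of the key.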
Features
= dual listbox
Select the text elements you want to extract from the document. Available options are Page numbers and Footers; you can select either or both.
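As a rough illustration of what these options could correspond to, Textract's layout analysis labels blocks with types such as `LAYOUT_PAGE_NUMBER` and `LAYOUT_FOOTER`. The sketch below filters a response-shaped dictionary by block type. The mapping from this component's options to those block types is an assumption, and the sample response is hand-written, not real Textract output.

```python
# Assumed mapping from the component's options to Textract block types.
FEATURE_TO_BLOCK_TYPE = {
    "Page numbers": "LAYOUT_PAGE_NUMBER",
    "Footers": "LAYOUT_FOOTER",
}


def select_layout_blocks(response: dict, features: list[str]) -> list[dict]:
    """Keep only the blocks whose type matches a selected feature."""
    wanted = {FEATURE_TO_BLOCK_TYPE[name] for name in features}
    return [
        block
        for block in response.get("Blocks", [])
        if block.get("BlockType") in wanted
    ]


# Hand-written sample shaped like a Textract response (illustrative only).
sample_response = {
    "Blocks": [
        {"Id": "b1", "BlockType": "LAYOUT_TEXT", "Page": 1},
        {"Id": "b2", "BlockType": "LAYOUT_PAGE_NUMBER", "Page": 1},
        {"Id": "b3", "BlockType": "LAYOUT_FOOTER", "Page": 1},
    ]
}

for block in select_layout_blocks(sample_response, ["Page numbers", "Footers"]):
    print(block["Id"], block["BlockType"])
```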
Deactivate soft delete for Azure blobs (Databricks)
If you intend to set your destination as Databricks and your stage platform as Azure Storage, you must turn off the "Enable soft delete for blobs" setting in your Azure account for your pipeline to run successfully. To do this:
- Log in to the Azure portal.
- In the top-left, click ☰ → Storage Accounts.
- Select the intended storage account.
- In the menu, under Data management, click Data protection.
- Untick Enable soft delete for blobs.
For more information, read Soft delete for blobs.
| Snowflake | Databricks | Amazon Redshift |
|---|---|---|
| ✅ | ✅ | ✅ |