File Iterator
The File Iterator component lets users loop over matching files in a remote file system.
The component searches for files in one of several supported remote file systems, running its attached component once for each file found. Filenames and path names are mapped into environment variables, which can then be referenced from the attached component.
To attach the iterator to another component, use the connection ring beneath the file iterator to connect to the input of the other component. The two components will automatically "snap" together, with the file iterator sitting on top of the other component, and can be dragged around the canvas as a single component. For more information about stacking and detaching iterators, read Stacking and detaching iterators.
If you need to iterate more than one component, put them into a separate orchestration pipeline or transformation pipeline and use a Run Transformation or Run Orchestration component attached to the iterator. In this way, you can run an entire pipeline flow multiple times, once for each set of variable values, that is, once for each matching file.
All iterator components are limited to a maximum of 5000 iterations.
If the component requires access to a cloud provider, it will use credentials as follows:
- If using Matillion Full SaaS: The component will use the cloud credentials associated with your environment to access resources.
- If using Hybrid SaaS: By default the component will inherit the agent's execution role (service account role). However, if there are cloud credentials associated with your environment, these will override the role.
Properties
Name
= string
A human-readable name for the component.
Input Data Type
= drop-down
Select the remote file system to search. Available data types are:
- Azure Blob Storage
- Google Cloud Storage
- FTP
- SFTP
- S3
- Windows Fileshare
Input Data URL
= string
The URL of the source files, including the full path to the folder you wish to iterate over. You can further refine the filenames to be iterated over using the Filter Regex property.
Clicking this property will open the Input Data URL dialog. This displays a list of all existing storage accounts. Select a storage account, then a container, and then a subfolder if required. This constructs a URL with the following format:
DATATYPE://<account>/<container>/<path>
You can also type the URL directly into the Storage Accounts path field, instead of selecting listed elements. This is particularly useful when using variables in the URL, for example:
AZURE://${jv_blobStorageAccount}/${jv_containerName}
Special characters used in this field must be URL-safe.
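As a rough illustration of how a URL containing variables resolves, the sketch below substitutes hypothetical values for the two variables in the example above and percent-encodes a folder name that contains a space. The account, container, and folder values are invented for the example. Percent-encoding is only one common way to make a path segment URL-safe; whether the component expects exactly this encoding is an assumption.

```python
from string import Template
from urllib.parse import quote

# Hypothetical variable values; in a pipeline these would come from project variables.
variables = {
    "jv_blobStorageAccount": "myaccount",
    "jv_containerName": "mycontainer",
}

def resolve_url(url_template: str, values: dict) -> str:
    """Replace ${name} placeholders with their values."""
    return Template(url_template).substitute(values)

def url_safe_segment(segment: str) -> str:
    """Percent-encode one path segment (an assumed encoding, not confirmed by the docs)."""
    return quote(segment, safe="")

base_url = resolve_url("AZURE://${jv_blobStorageAccount}/${jv_containerName}", variables)
full_url = base_url + "/" + url_safe_segment("monthly reports")
print(full_url)  # AZURE://myaccount/mycontainer/monthly%20reports
```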
Domain
= string
Input your connection domain.
Username
= string
Provide a valid username for the connection.
Password
= drop-down
The secret definition denoting the password for the connection. Your password should be saved as a secret definition before using this component.
Key
= drop-down
The secret definition denoting your SFTP key for the connection. Your SFTP key should be saved as a secret definition before using this component.
This parameter is optional and will only be used if the data source requests it.
This must be the complete private key, beginning with "-----BEGIN RSA PRIVATE KEY-----" and conforming to the same structure as an RSA private key.
The following private key formats are currently supported:
- DSA
- RSA
- ECDSA
- Ed25519
In a Hybrid SaaS configuration, you need to manually convert the private key into a format that allows it to be stored in your AWS Secrets Manager or Azure Key Vault. You can do this with the following command:
ssh-keygen -p -f YOUR_PRIVATE_KEY -m pem
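After converting the key, it can be worth checking the header line before saving it as a secret. The sketch below is a minimal check, assuming the key is available as text; the function name and the placeholder key bodies are illustrative only and are not part of the product.

```python
RSA_PEM_HEADER = "-----BEGIN RSA PRIVATE KEY-----"

def has_rsa_pem_header(key_text: str) -> bool:
    """Return True if the key text starts with the RSA PEM header line."""
    return key_text.lstrip().startswith(RSA_PEM_HEADER)

# Placeholder key bodies for illustration only; these are not real keys.
pem_style = (
    "-----BEGIN RSA PRIVATE KEY-----\n<base64 body>\n-----END RSA PRIVATE KEY-----\n"
)
openssh_style = (
    "-----BEGIN OPENSSH PRIVATE KEY-----\n<base64 body>\n-----END OPENSSH PRIVATE KEY-----\n"
)

print(has_rsa_pem_header(pem_style))      # True
print(has_rsa_pem_header(openssh_style))  # False
```

A key that still shows the OPENSSH header has not been converted and would fail this check.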
Set Home Directory as Root
= boolean
- No: The URL path is from the server root.
- Yes: The URL path is relative to the user's home directory. Default setting is Yes.
This property is only available when the Input Data Type is set to FTP or SFTP.
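The difference between the two settings can be pictured as follows. This is only a sketch of the path arithmetic, assuming a POSIX-style server; the home directory, the URL path, and the function name are hypothetical, and the component's actual resolution logic is not documented here.

```python
import posixpath

def effective_path(url_path: str, home_dir: str, home_as_root: bool) -> str:
    """Return the server-side path that a URL path would refer to."""
    relative = url_path.lstrip("/")
    # Yes: resolve relative to the user's home directory. No: resolve from the server root.
    base = home_dir if home_as_root else "/"
    return posixpath.join(base, relative)

print(effective_path("/data/incoming", "/home/etl_user", home_as_root=True))
# /home/etl_user/data/incoming
print(effective_path("/data/incoming", "/home/etl_user", home_as_root=False))
# /data/incoming
```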
Recursive
= boolean
- No: Only search for files within the folder identified by the Input Data URL.
- Yes: Consider files in subdirectories when searching for files.
This property is only available when the Input Data Type is set to FTP, SFTP, or Windows Fileshare.
Max Recursive Depth
= integer
Set the maximum recursion depth into subdirectories. This property is only available when Recursive is set to Yes.
Ignore Hidden
= boolean
- No: Include hidden files.
- Yes: Ignore hidden files, even if they otherwise match the filter. This is the default setting.
This property is only available when the Input Data Type is set to FTP, SFTP, or Windows Fileshare.
Max Iterations
= integer
The total number of iterations to perform. This value cannot exceed 5000.
Filter Regex
= string
The Java-standard regular expression tested against each candidate file's full path. To match all files, specify .*
A common pattern is to start Filter Regex with a variable that holds the folder name, followed by /.* as the suffix. The forward slash restricts the match to files within that folder, and .* is the wildcard that returns all files in that folder.
Example: ${jv_folder}/.*
If Filter Regex includes a folder structure, such as ${jv_folder}/.*, you must set Recursive to Yes so that the component can find folders below the Input Data URL path, for example DataType://${jv_blobStorageAccount}/${jv_containerName}/
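To get a feel for how such patterns select files, the sketch below applies a few filters to a list of hypothetical paths. It uses Python's re module purely for illustration; for simple patterns like these the syntax matches Java's. Two assumptions are made: that candidate paths are relative to the Input Data URL, and that the pattern must match the whole path. Neither is confirmed by this page.

```python
import re

def filter_paths(paths, pattern):
    """Keep the paths that the pattern matches in full (assumed matching mode)."""
    compiled = re.compile(pattern)
    return [p for p in paths if compiled.fullmatch(p)]

# Hypothetical candidate paths, assumed to be relative to the Input Data URL.
candidates = [
    "sales/2024-01.csv",
    "sales/2024-02.csv",
    "hr/staff.csv",
    "readme.txt",
]

jv_folder = "sales"  # hypothetical value of the ${jv_folder} variable

all_files = filter_paths(candidates, ".*")
sales_files = filter_paths(candidates, f"{jv_folder}/.*")
csv_files = filter_paths(candidates, r".*\.csv")

print(all_files)    # all four paths
print(sales_files)  # ['sales/2024-01.csv', 'sales/2024-02.csv']
print(csv_files)    # the three .csv paths
```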
Concurrency
= drop-down
- Concurrent: Iterations are run concurrently.
- Sequential: Iterations are done in sequence, waiting for each to complete before starting the next. This is the default setting.
Full SaaS deployments are limited to 20 concurrent tasks, with additional tasks being queued. Hybrid SaaS deployments with customer-hosted agents have 20 concurrent tasks per agent instance, with a maximum of 100 instances if configured accordingly.
Variables
= column editor
Select project variables that will hold the values of file attributes. This will allow you to use the matching file's metadata (such as its filename) in the component attached to the File Iterator. The project variables must have been defined prior to using them in this component. Read Variables for more information.
Use + to add a variable, and specify the following:
- Variable: Select an existing project variable to hold a given file attribute.
- File Attribute: For each matched file, the project variable will be populated with the attribute selected here. The available attributes are:
  - Base Folder.
  - Subfolder. Useful when recursing.
  - Filename.
  - Last modified. A date formatted as ISO 8601, with a UTC indicator. For example: 2021-01-04T10:45:15.123Z
Users may notice a lag in how their data warehousing platform updates the last modified date, for example a difference between the time Data Productivity Cloud interacts with the file and the actual last modified date. This behaviour is a limitation of the platform and depends on that platform's metadata.
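If a downstream script needs to compare the Last modified value against another date, the string can be parsed into a timezone-aware timestamp. The sketch below assumes the value always has the millisecond form shown in the example for that attribute; the cutoff date and function name are hypothetical.

```python
from datetime import datetime, timezone

def parse_last_modified(value: str) -> datetime:
    """Parse a timestamp such as 2021-01-04T10:45:15.123Z into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)

last_modified = parse_last_modified("2021-01-04T10:45:15.123Z")
cutoff = datetime(2021, 1, 1, tzinfo=timezone.utc)  # hypothetical cutoff date

print(last_modified > cutoff)  # True
```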
Break on Failure
= boolean
- No: Attempt to run the attached component for each iteration, regardless of success or failure. This is the default setting.
- Yes: If the attached component does not run successfully, fail immediately.
If a failure occurs during any iteration, the failure link is followed. This parameter controls whether it is followed immediately or after all iterations have been attempted.
This property is only available when Concurrency is set to Sequential. When set to Concurrent, all iterations will be attempted.
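The effect of this setting on a sequential run can be sketched as a simple loop. This is a simplified model rather than the component's implementation; the file names and the run_attached callback are hypothetical stand-ins for the attached component.

```python
def run_iterations(files, run_attached, break_on_failure):
    """Run run_attached once per file, in order.

    Returns (attempted, failed) lists of file names.
    """
    attempted, failed = [], []
    for name in files:
        attempted.append(name)
        if not run_attached(name):
            failed.append(name)
            if break_on_failure:
                break  # stop immediately; remaining files are not attempted
    return attempted, failed

files = ["a.csv", "bad.csv", "c.csv"]

def succeeds(name):
    """Stand-in for the attached component: fails only for bad.csv."""
    return name != "bad.csv"

print(run_iterations(files, succeeds, break_on_failure=True))
# (['a.csv', 'bad.csv'], ['bad.csv'])
print(run_iterations(files, succeeds, break_on_failure=False))
# (['a.csv', 'bad.csv', 'c.csv'], ['bad.csv'])
```

In both cases the failed list is non-empty, which corresponds to the failure link being followed; the setting only changes how many iterations are attempted first.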
Snowflake | Databricks | Amazon Redshift |
---|---|---|
✅ | ✅ | ✅ |