File Iterator

The File Iterator component lets users loop over matching files in a remote file system.

The component searches for files in a number of remote file systems, running its attached component once for each file found. Filenames and path names are mapped into environment variables, which can then be referenced from the attached component.

To attach the iterator to another component, use the connection ring beneath the file iterator to connect to the input of the other component. The two components will automatically "snap" together, with the file iterator sitting on top of the other component, and can be dragged around the canvas as a single component. To uncouple the two components, delete the file iterator component.

If you need to iterate more than one component, put them into a separate orchestration pipeline or transformation pipeline, and attach a Run Orchestration or Run Transformation component to the iterator. In this way, you can run an entire pipeline multiple times, once for each matched file.

All iterator components are limited to a maximum of 5000 iterations.


Properties

Name = string

A human-readable name for the component.


Input Data Type = drop-down

Select the remote file system to search. Available data types are:

  • Azure Blob Storage
  • FTP
  • S3
  • Windows Fileshare

Input Data URL = string

The URL of the source files, including the full path to the folder you wish to iterate over. You can further refine the filenames to be iterated over using the Filter Regex property.

Clicking this property will open the Input Data URL dialog. This displays a list of all existing storage accounts. Select a storage account, then a container, and then a subfolder if required. This constructs a URL with the following format:

DATATYPE://<account>/<container>/<path>

You can also type the URL directly into the Storage Accounts path field, instead of selecting listed elements. This is particularly useful when using variables in the URL, for example:

AZURE://${jv_blobStorageAccount}/${jv_containerName}

Special characters used in this field must be URL-safe.
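
As an illustration of the URL format above, the following Python sketch assembles an Input Data URL and percent-encodes special characters in each path segment. The function name build_input_data_url is hypothetical and not part of the product. Reading "URL-safe" as percent-encoding is an assumption, as is leaving ${...} variable references unencoded so they can be resolved later.

```python
from urllib.parse import quote


def build_input_data_url(data_type: str, account: str, container: str, path: str = "") -> str:
    """Assemble DATATYPE://<account>/<container>/<path> (hypothetical helper)."""
    # Split the optional path into segments, dropping empty ones from stray slashes.
    segments = [container] + [part for part in path.split("/") if part]
    # Percent-encode each segment. "$", "{" and "}" are kept as-is (assumption)
    # so that variable references such as ${jv_containerName} survive intact.
    encoded = "/".join(quote(part, safe="${}") for part in segments)
    return f"{data_type.upper()}://{account}/{encoded}"


print(build_input_data_url("azure", "myaccount", "my container", "raw data/2024"))
print(build_input_data_url("AZURE", "${jv_blobStorageAccount}", "${jv_containerName}"))
```

The account, container, and folder names used here are placeholders.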


Domain = string

Input your connection domain.


Username = string

Provide a valid username for the connection.


Password = drop-down

The secret definition containing the password for the connection. Your password should be saved as a secret definition before using this component.


Set Home Directory as Root = boolean

  • No: The URL path is from the server root.
  • Yes: The URL path is relative to the user's home directory. This is the default setting.

This property is only available when the Input Data Type is set to FTP.


Recursive = boolean

  • No: Only search for files within the folder identified by the Input Data URL.
  • Yes: Also search for files in subdirectories of that folder.

This property is only available when the Input Data Type is set to FTP or Windows Fileshare.


Max Recursive Depth = integer

Set the maximum recursion depth into subdirectories. This property is only available when Recursive is set to Yes.


Ignore Hidden = boolean

  • No: Include hidden files.
  • Yes: Ignore hidden files, even if they otherwise match the filter. This is the default setting.

This property is only available when the Input Data Type is set to FTP or Windows Fileshare.
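
To show how Recursive, Max Recursive Depth, and Ignore Hidden might interact, here is a minimal Python sketch that walks an in-memory folder tree instead of a real file system. The list_files function and the sample tree are hypothetical. Two details are assumptions rather than documented behaviour: a depth of 1 means "one level of subfolders below the starting folder", and a hidden entry is one whose name starts with a dot.

```python
def list_files(tree, recursive=False, max_depth=None, ignore_hidden=True, _depth=0, _prefix=""):
    """Return file paths from a nested dict, where a dict value is a subfolder."""
    found = []
    for name, child in sorted(tree.items()):
        # Assumption: names starting with "." are hidden (files and folders alike).
        if ignore_hidden and name.startswith("."):
            continue
        path = f"{_prefix}{name}"
        if isinstance(child, dict):
            # Descend only when recursion is enabled and the depth limit allows it.
            if recursive and (max_depth is None or _depth < max_depth):
                found += list_files(child, recursive, max_depth, ignore_hidden,
                                    _depth + 1, path + "/")
        else:
            found.append(path)
    return found


# Hypothetical folder contents used only for illustration.
tree = {
    "a.csv": None,
    ".hidden.csv": None,
    "sub": {"b.csv": None, "deep": {"c.csv": None}},
}

print(list_files(tree))                               # top-level folder only
print(list_files(tree, recursive=True))               # all subfolders
print(list_files(tree, recursive=True, max_depth=1))  # one level of subfolders
```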


Max Iterations = integer

The total number of iterations to perform. This value cannot exceed 5000.


Filter Regex = string

The Java-standard regular expression that is tested against each candidate file's full path. To match all files, specify .*

To restrict matching to a particular folder, start Filter Regex with a variable that holds the folder name, followed by /.* as the suffix. The forward slash limits the match to paths inside that folder, and .* is the wildcard that matches every file in it.

Example: ${jv_folder}/.*

If Filter Regex includes a folder structure, such as ${jv_folder}/.*, you must set Recursive to Yes so that folders below the Input Data URL path DATATYPE://${jv_blobStorageAccount}/${jv_containerName}/ are searched.
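
The filter behaviour described above can be approximated in Python. This is a sketch under several assumptions: Python's re module stands in for Java regular expressions (the two agree for simple patterns like these), the variable is resolved by plain text substitution, the whole path must match the pattern, and candidate paths are relative to the Input Data URL. The names filter_paths and candidates are hypothetical.

```python
import re
from string import Template


def filter_paths(paths, filter_regex):
    """Keep only the paths that the pattern matches in full (assumed semantics)."""
    pattern = re.compile(filter_regex)
    return [path for path in paths if pattern.fullmatch(path)]


# Hypothetical candidate paths, relative to the Input Data URL.
candidates = ["sales/2024/jan.csv", "sales/feb.csv", "hr/staff.csv"]

# Resolve ${jv_folder} by text substitution before compiling the pattern.
resolved = Template("${jv_folder}/.*").substitute(jv_folder="sales")

all_files = filter_paths(candidates, ".*")
sales_files = filter_paths(candidates, resolved)

print(resolved)
print(sales_files)
```

Because . also matches the forward slash, the resolved pattern matches files in nested folders beneath the named folder as well as files directly inside it.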


Concurrency = drop-down

  • Concurrent: Iterations are run concurrently.
  • Sequential: Iterations are done in sequence, waiting for each to complete before starting the next. This is the default setting.

Note: The maximum concurrency is limited by the number of available threads (2x the number of processors on the cloud instance).
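
The difference between the two modes, and the thread limit in the note above, can be sketched in Python. This is an illustration only, not the product's scheduler: max_concurrency and run_iterations are hypothetical names, and the local processor count stands in for the cloud instance's.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def max_concurrency(processors=None):
    """Thread limit from the note: twice the number of processors."""
    count = processors if processors is not None else (os.cpu_count() or 1)
    return 2 * count


def run_iterations(files, run_attached, concurrent=False):
    """Run the attached step once per file, in sequence or on a bounded thread pool."""
    if not concurrent:
        # Sequential: each iteration finishes before the next one starts.
        return [run_attached(name) for name in files]
    # Concurrent: at most max_concurrency() iterations run at the same time.
    with ThreadPoolExecutor(max_workers=max_concurrency()) as pool:
        return list(pool.map(run_attached, files))


print(max_concurrency(4))
print(run_iterations(["a.csv", "b.csv"], str.upper, concurrent=True))
```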


Variables = column editor

Select project variables that will hold the values of file attributes. This will allow you to use the matching file's metadata (such as its filename) in the component attached to the File Iterator. The project variables must have been defined prior to using them in this component. Read Variables for more information.

Use + to add a variable, and specify the following:

  • Variable: Select an existing project variable to hold a given file attribute.
  • File Attribute: For each matched file, the project variable will be populated with the attribute selected here. The attributes which can be used are:
    • Base Folder.
    • Subfolder. Useful when recursing.
    • Filename.
    • Last modified. A date formatted as ISO8601, with a UTC indicator. For example: 2021-01-04T10:45:15.123Z.

The last modified date reported by your data warehousing platform may lag behind the file's actual last modified date, for example shortly after Data Productivity Cloud has interacted with the file. This behaviour is a limitation of the platform and depends on that platform's metadata.
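
If a downstream script needs to work with the Last modified value, it can be parsed as an ISO 8601 timestamp. The Python sketch below assumes the value always carries millisecond precision and a trailing Z, as in the example 2021-01-04T10:45:15.123Z. The function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_last_modified(value):
    """Parse a value such as 2021-01-04T10:45:15.123Z into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def format_last_modified(moment):
    """Format an aware datetime back into the same millisecond-precision UTC form."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


stamp = parse_last_modified("2021-01-04T10:45:15.123Z")
print(stamp.isoformat())
print(format_last_modified(stamp))
```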


Break on Failure = boolean

  • No: Attempt to run the attached component for each iteration, regardless of success or failure. This is the default setting.
  • Yes: If the attached component does not run successfully, fail immediately.

If a failure occurs during any iteration, the failure link is followed. This parameter controls whether it is followed immediately or after all iterations have been attempted.

This property is only available when Concurrency is set to Sequential. When set to Concurrent, all iterations will be attempted.
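
The effect of Break on Failure in Sequential mode can be sketched as a simple loop. This Python example is illustrative only: iterate and flaky_step are hypothetical names, and a raised exception stands in for the attached component failing.

```python
def iterate(files, run_attached, break_on_failure=False):
    """Run the attached step once per file and report what was attempted and what failed."""
    attempted, failed = [], []
    for name in files:
        attempted.append(name)
        try:
            run_attached(name)
        except Exception:
            failed.append(name)
            if break_on_failure:
                # Yes: stop immediately, leaving the remaining files unattempted.
                break
            # No: carry on with the next file; the failure is still recorded.
    return attempted, failed


def flaky_step(name):
    """Hypothetical attached step that fails for one particular file."""
    if name == "b.csv":
        raise RuntimeError(f"could not load {name}")


print(iterate(["a.csv", "b.csv", "c.csv"], flaky_step))
print(iterate(["a.csv", "b.csv", "c.csv"], flaky_step, break_on_failure=True))
```

In both cases the failed list is non-empty, which corresponds to the failure link being followed. The setting only changes whether the remaining files are attempted first.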

