Create a pipeline with custom connectors
This page describes how to configure a batch pipeline in Data Loader using a custom data source. Although Data Loader provides connectors to a range of commonly used data sources, there may be times when you need to connect to a data source not shown in the list of standard connectors. With Matillion Custom Connector, you can create custom data source connectors to build data pipelines in Data Loader.
This guide refers to standard custom connectors. Flex connectors are also available, which are preconfigured custom connectors tailored to a specific service. They are accessed from the list of Sources when creating a pipeline. After you have created a Flex connector, it will appear in the list of Custom Connectors when choosing a data source, and can be used in the same way as a standard custom connector at that point. Read Flex connector setup for more information.
This is a two-step process: first create a connector, then use it to create a pipeline.
Prerequisites
- You need to know how to authenticate with your third-party data source.
- You need the API documentation for the third-party service, including the main URL.
- You must have access to an instance of a supported destination platform, and a role or user with appropriate permissions on that platform. Currently only Snowflake on AWS is supported.
- Ensure that Custom Connector is set to use the same region (for example, US (N. Virginia)) as Data Loader.
- You will require your AWS credentials and a staging location, as the data will have to be staged before it's loaded into Snowflake.
Create a connector
- Create a custom connector for your data source by following the steps in the Custom Connector setup.
- Once created, the connector will be available in the Data Loader UI for building data pipelines.
Create a pipeline
- From the Custom Connector dashboard:
- Click Use connector next to a connector.
- Click Add a pipeline.
- This will take you to the Choose endpoints interface of Data Loader.
- From Choose endpoints, you can select the specific endpoints you want to use for building the structure of your data.
- You can then proceed with configuring the selected endpoints, authentication, behavior, variables, parameters, keys, destination, and sync frequency.
- Alternatively, from Data Loader:
- Click Add pipeline.
- On the Choose source page, select the Custom Connector tab.
- Select your custom connector from the grid of data sources.
- Your connector may have multiple endpoints. On the Choose endpoints interface, select one or more endpoints to use in building the structure of your data.
- To select an endpoint, click and drag it from the left-hand column to the right-hand column.
- You can also click an endpoint and then click the right arrow button to select it.
- Use the double-arrow button to select all endpoints or the left arrow button to deselect endpoints.
- Once you have selected the desired endpoints, click Continue with x endpoints.
Configure your endpoints
Next, you need to configure the endpoints you have selected, to determine the structure of your target table, prevent duplicates, allow updates, and execute incremental loading.
You can configure multiple endpoints here. All configured endpoints will be available when the connector is used in a pipeline, allowing the user to select which of the endpoints they want to use.
Before you start configuring your endpoint, you will be prompted to provide a table name in the Data Warehouse table name field, under the General tab. The Data Warehouse table name field is mandatory; you must specify a table name to proceed. The only restriction is a maximum length of 100 characters.
The fully qualified table name, which may contain a maximum of 255 characters, is created by combining the custom name and the custom table prefix.
Complete the following details on the Configure your endpoints screen. This must be done for each endpoint you have selected.
Authentication
On the Authentication tab, enter the credentials needed to authenticate with the data source. This will vary depending on the type of authentication you selected. Data Loader currently supports basic authentication, bearer token, API key, OAuth, and no authentication.
When you use OAuth authentication, the OAuth credentials are stored on the connector side, whereas details for the other authentication types are stored within Data Loader.
Authentication details added during connector creation are only used to test that the new connector produces a valid response. When the connector is used in a pipeline, you will need to add credentials again, and these can differ from those used during connector creation.
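The exact fields you enter vary with the authentication type. As a rough illustration (not Matillion code), the snippet below shows what each supported type typically looks like at the HTTP level; the URL, header names, and token values are placeholders.

```python
# Illustrative only: how the supported authentication types typically appear
# at the HTTP level. The URL, header names, and token values are placeholders.
import requests

BASE_URL = "https://api.example.com/v1/tickets"  # hypothetical endpoint

# Basic authentication: username and password
requests.get(BASE_URL, auth=("my_user", "my_password"))

# Bearer token: token sent in the Authorization header
requests.get(BASE_URL, headers={"Authorization": "Bearer <token>"})

# API key: often sent as a custom header or query parameter (varies by API)
requests.get(BASE_URL, headers={"X-API-Key": "<key>"})

# No authentication: plain request
requests.get(BASE_URL)
```

OAuth is not shown because, as noted above, OAuth credentials are handled on the connector side rather than entered here.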
Behavior
The Behavior tab displays a tree view of the elements provided by the data source, and lets you select which elements you want to include as columns in the target table. By default, all elements are selected to be included as columns in the target table.
Click the checkbox next to an element to deselect it. You can expand the tree to see and select/deselect individual elements within an object or array.
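As an illustration of how element selection shapes the target table, the sketch below flattens a small, invented record into dot-separated column names and drops a deselected element. This is not Matillion's internal logic, just a way to picture the mapping.

```python
# Illustrative sketch of how selected elements map to columns.
# The record and field names are invented; this is not Matillion's own logic.
record = {
    "id": 101,
    "status": "open",
    "requester": {"name": "Ada", "email": "ada@example.com"},
}

# Suppose requester.email is deselected on the Behavior tab:
selected = {"id", "status", "requester.name"}

def flatten(obj, prefix=""):
    """Flatten nested objects into dot-separated column names."""
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, f"{name}.")
        else:
            yield name, value

row = {col: val for col, val in flatten(record) if col in selected}
print(row)  # {'id': 101, 'status': 'open', 'requester.name': 'Ada'}
```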
Parameters
When you load data through the pipeline, the increment of data loaded will depend on the parameters passed in the API call. On the Parameters tab, you should configure any parameters needed for the endpoint to correctly select the data you want the pipeline run to retrieve.
:::info{title='Note'}
All parameters are mandatory, and fields on the Parameters tab turn red when left empty. This applies to all four parameter types:
- Query
- URI
- Header
- Body
Tooltips are available on all error icons to offer further information on what configuration is missing.
:::
Use the Dynamic? toggle to select whether the parameter is dynamic (toggle on/green) or static (toggle off/gray).
- Static parameters have an unchanging value that you must type into the Value field.
- Dynamic parameters have a behavior dependent on which Parser is selected from the drop-down: DATE or STRING.
The DATE parser retrieves source data based on a date in your current data set. To define which date is used, you need to create a variable (see Variables, below) mapping to that date. The MAX variable type identifies the maximum value of the mapped field in your data set and loads new data based on that value. For example, if the maximum (latest) date in the current data set is one week ago, all data newer than one week ago will be loaded when the pipeline runs.
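Conceptually, the MAX variable and DATE parser behave like the sketch below: take the latest value of the mapped date field from the data already loaded, then pass it as the parameter on the next run so only newer records are requested. The field and parameter names are invented.

```python
# Conceptual sketch of an incremental load driven by a MAX date variable.
# The field and parameter names are hypothetical.
already_loaded = [
    {"id": 1, "updated_at": "2024-05-01T09:00:00Z"},
    {"id": 2, "updated_at": "2024-05-08T14:30:00Z"},
]

# MAX variable: latest value of the mapped field in the current data set.
# ISO-8601 timestamps in the same format compare correctly as strings.
update_date = max(row["updated_at"] for row in already_loaded)

# On the next run, the DATE parser passes this value to the endpoint,
# e.g. as ?updated_after=2024-05-08T14:30:00Z, so only newer rows load.
print(update_date)
```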
The STRING parser allows you to write a custom query that will be used in the API call. To create this query:
- Click the Configure parser icon to the right of the parameter.
- Select the variables you want to use in the query from the Variables drop-down. You can select multiple variables; however, your query must use every variable you select. Do not select variables that you do not intend to use.
- Type the query into the Query field. To refer to the selected variables in the query, wrap them in ${ } as in this example:
tickets updated_at>${UPDATE_DATE} created_at>${CREATE_DATE}
- In the Default field, re-enter the query with a default value for each variable, so that the initial run of the pipeline can retrieve data. An example of a query with default values is:
tickets updated_at>2018-02-23T09:33:31Z created_at>2020-01-01
- Click Save when finished.
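The ${variableName} placeholders resolve by simple substitution. The sketch below uses Python's string.Template (which happens to share the same syntax) purely to illustrate how the query and its defaults expand; Matillion performs the substitution for you.

```python
# Illustration of how ${variable} placeholders resolve in a STRING parser query.
# Matillion does this substitution itself; the snippet only shows the syntax.
from string import Template

query = Template("tickets updated_at>${UPDATE_DATE} created_at>${CREATE_DATE}")

# Default values used on the initial pipeline run
print(query.substitute(UPDATE_DATE="2018-02-23T09:33:31Z",
                       CREATE_DATE="2020-01-01"))
# tickets updated_at>2018-02-23T09:33:31Z created_at>2020-01-01

# Later runs substitute the current variable values instead
print(query.substitute(UPDATE_DATE="2024-05-08T14:30:00Z",
                       CREATE_DATE="2020-01-01"))
```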
All four parameter types (Query, URI, Header, and Body) can be either static or dynamic. The Query, URI, and Header parameters can use either the DATE or STRING parser; the Body parameter only uses the STRING parser.
Ensure the "Body" parameter's "query" and "default" fields are valid JSON objects. Example:
{
"createdAfter": "${variableName}",
"anotherField": {
"name": "test name"
}
}
You can select one or more variables that the query will depend on, and then use the ${variableName} syntax to refer to them in the "Body" parameter.
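A quick way to sanity-check a Body template is to substitute a sample value and confirm the result still parses as a JSON object. A minimal sketch, assuming the same ${variableName} convention and invented field names:

```python
# Sanity-check sketch: substitute a sample value into a Body template and
# confirm the result is still a valid JSON object. Field names are invented.
import json
from string import Template

body_template = Template("""{
  "createdAfter": "${variableName}",
  "anotherField": {"name": "test name"}
}""")

candidate = body_template.substitute(variableName="2020-01-01")
parsed = json.loads(candidate)   # raises an error if the result is not valid JSON
assert isinstance(parsed, dict)  # must be a JSON object, not a list or scalar
print(parsed["createdAfter"])    # 2020-01-01
```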
Variables
Variables are used in API calls to set values for parameters (described above). Variables are set on the Parameters tab.
To set a variable:
- Click Add new variable.
- Select whether the variable type is MAX or DATE. This will determine how it is used with a parameter (see below).
- Provide a unique name for the variable.
- Select which data field the variable maps to.
- Click Add to save the variable.
Keys
You must set a key column, which is used to match existing data to prevent duplicates and to allow updates. At least one key column is required.
Add a key column by selecting a property from the drop-down.
:::warning{title='Warning'} Once the pipeline loads data successfully, you should not edit the keys as this will result in the pipeline malfunctioning. :::
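Conceptually, the key columns act like the dictionary key in the upsert sketch below: rows whose key values already exist are updated, and new key values are inserted, so repeated loads don't create duplicates. This is an illustration of the idea only; the real merge happens in the destination warehouse.

```python
# Conceptual sketch of how key columns prevent duplicates and allow updates.
# This illustrates the idea only; the real merge happens in the warehouse.
KEY_COLUMNS = ("id",)          # the key column(s) chosen on the Keys tab

target = {}                    # existing target table, keyed by key values

def upsert(rows):
    for row in rows:
        key = tuple(row[col] for col in KEY_COLUMNS)
        target[key] = row      # same key -> update, new key -> insert

upsert([{"id": 1, "status": "open"}])
upsert([{"id": 1, "status": "closed"}, {"id": 2, "status": "open"}])
print(len(target))             # 2 rows, not 3: id 1 was updated, not duplicated
```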
Click Continue when every tab has been completed for all endpoints.
Choose destination
- Choose the destination platform for your pipeline. Currently, only Snowflake on AWS is supported.
- Configure a connection to your destination, as described in Connect to Snowflake.
:::info{title='Note'} You will also have to provide your AWS credentials and a staging location, as the data will have to be staged before it's loaded into Snowflake. :::
Set frequency
| Property | Description |
|---|---|
| Pipeline name | A descriptive label for your pipeline. This is how the pipeline appears on the pipeline dashboard and how Data Loader refers to the pipeline. |
| Sync every | The frequency at which the pipeline should sync. Day values include 1-7. Hour values include 1-23. Minute values include 5-59. The input is also the length of delay before the first sync. |
Currently, you can't specify a start time. The pipeline will immediately run at the time it is first created, and then continue to run from that time with the specified frequency.
Once you are happy with your pipeline configuration, click Create pipeline to complete the process and add the pipeline to your dashboard.
Guidelines for connector and pipeline creation
When creating a custom connector and configuring a pipeline, consider the following:
- Region: If you can't find your custom connector listed in Data Loader, make sure your Data Loader region setting is the same as the region setting used to create the connector in Custom Connector.
- Populated arrays: Each element in a populated array will have its own row, and the data will be de-normalized, meaning that there will be redundant values in the target table (see the sketch after this list). Consider this behavior when designing your pipeline.
- Empty arrays: We recommend that you disable all empty arrays to get an initial run working, and check the logs when there is an error.
- Composite key: You need to add a new column to the key to create a composite key, which will be unique inside the array you are extracting. This is so that rows that already exist (and should be updated) can be identified.
- One array per pipeline: If in doubt, only specify one array per pipeline. The composite key uniqueness rule still applies.
- Unique table: Ensure the target table is unique by using an appropriate table prefix.
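The sketch below illustrates both the array de-normalization and the composite key guidelines above, using an invented record: each array element becomes its own row with the parent values repeated, and a second key column is needed to keep rows unique.

```python
# Illustrative sketch of array de-normalization and a composite key.
# The record and field names are invented.
record = {"order_id": 500, "customer": "Ada", "items": [
    {"sku": "A-1", "qty": 2},
    {"sku": "B-7", "qty": 1},
]}

# Each array element becomes its own row; parent values repeat (de-normalization)
rows = [
    {"order_id": record["order_id"], "customer": record["customer"], **item}
    for item in record["items"]
]
for row in rows:
    print(row)
# {'order_id': 500, 'customer': 'Ada', 'sku': 'A-1', 'qty': 2}
# {'order_id': 500, 'customer': 'Ada', 'sku': 'B-7', 'qty': 1}

# order_id alone is no longer unique, so a composite key is needed,
# e.g. (order_id, sku), to identify rows that should be updated.
composite_keys = {(r["order_id"], r["sku"]) for r in rows}
assert len(composite_keys) == len(rows)
```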