Create a pipeline with custom connectors
This page describes how to configure a batch pipeline in Data Loader using a custom data source. Data Loader provides connectors to a range of commonly used data sources, but you may sometimes need to connect to a data source that isn't in the list of standard connectors. With Custom Connector, you can create your own data source connectors and use them to build data pipelines in Data Loader.
This guide covers standard custom connectors. Flex connectors are also available; these are preconfigured custom connectors tailored to a specific service, accessed from the list of Sources when creating a pipeline. Once you have created a Flex connector, it appears in the list of Custom Connectors when you choose a data source and can be used in the same way as a standard custom connector. Read Flex connector setup for more information.
Building a pipeline with a custom connector is a two-step process: first create the connector, then create a pipeline that uses it.
Prerequisites
- You will need the base URL and endpoint information for the API of the service you are connecting to. Review the service's API documentation for more information.
- You must have access to an instance of a supported destination platform, and a role or user with appropriate permissions on that platform. Currently only Snowflake on AWS is supported.
- Ensure that Custom Connector is set to use the same region (for example, US (N. Virginia)) as Data Loader.
- You will require your AWS credentials and a staging location, as the data will have to be staged before it's loaded into Snowflake.
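For example, for a hypothetical ticketing service, the details you would gather from its API documentation might look like this (all values are illustrative):

Base URL: https://api.example.com/v2
Endpoint: /tickets
Full request URL: https://api.example.com/v2/tickets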
Create a connector
- Create a custom connector for your data source by following the steps in the Custom Connector setup.
- Once created, the connector will be available in the Data Loader UI for building data pipelines.
Create a pipeline
- From the Custom Connector dashboard:
- Click Use connector next to a connector.
- Click Add a pipeline.
- This will take you to the Choose endpoints interface of Data Loader.
- From Choose endpoints, you can select the specific endpoints you want to use for building the structure of your data.
- You can then proceed with configuring the selected endpoints, authentication, behavior, variables, parameters, keys, destination, and sync frequency.
- Alternatively, from Data Loader:
- Click Add pipeline.
- On the Choose source page, select the Custom Connector tab.
- Select your custom connector from the grid of data sources.
- Your connector may have multiple endpoints. On the Choose endpoints interface, select one or more endpoints to use in building the structure of your data.
- To select an endpoint, click and drag it from the left-hand column to the right-hand column.
- You can also click an endpoint and then click the right arrow button to select it.
- Use the double-arrow button to select all endpoints or the left arrow button to deselect endpoints.
- Once you have selected the desired endpoints, click Continue with x endpoints.
Configure your endpoints
Next, you need to configure the endpoints you have selected, determine the structure of your target table, prevent duplicates, allow updates, and enable incremental loading.
You can configure multiple endpoints. All of the configured endpoints will be available when the connector is used in a pipeline, so you can choose which ones each pipeline uses.
General
The General tab displays the endpoint URL. On this tab, you can set your data warehouse table name and choose either Basic or Advanced configuration. Advanced configuration requires more manual input.
Note
The only restriction on the data warehouse table name is a maximum length of 100 characters.
The fully qualified table name, which may contain a maximum of 255 characters, is created by combining the custom name and the custom table prefix.
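For example, assuming the prefix is prepended to the custom name (both values here are illustrative):

custom table prefix: CSTM_
table name: TICKETS
fully qualified table name: CSTM_TICKETS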
Note
The Behavior and Keys tabs are only accessible when Configuration Mode is set to Advanced.
Authentication
On the Authentication tab, enter the credentials needed to authenticate with the data source. The credentials required will vary depending on the type of Authentication you selected. Data Loader currently supports basic authentication, bearer token, API key, OAuth, and no authentication.
When using OAuth authentication, the OAuth credentials are securely stored within the connector itself. For other authentication types, such as basic authentication, bearer token, and API key, the authentication details are stored within Data Loader.
The authentication details added during connector creation are used only to test the new connector and confirm that it returns a valid response. When the connector is used in a pipeline, you will need to enter credentials again; these can differ from those added during connector creation.
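As a rough illustration of how these authentication types typically appear in the underlying API requests (header names and values are examples only, and the exact format depends on the service):

Basic authentication: Authorization: Basic <base64 of username:password>
Bearer token: Authorization: Bearer <your-token>
API key: X-API-Key: <your-key> (often sent as a header or query parameter)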
Behavior
The Behavior tab displays a tree view of the elements provided by the data source, and lets you select which elements you want to include as columns in the target table. By default, all elements are selected to be included as columns in the target table.
Click the checkbox next to an element to deselect it. You can expand the tree to see and select/deselect individual elements within an object or array.
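For example, if an endpoint returns a nested structure like the hypothetical response below, you can expand the requester object and the tags array in the tree and deselect individual elements, such as requester.email, so that they are not included as columns in the target table:

{
  "id": 101,
  "status": "open",
  "requester": {
    "name": "Jane Doe",
    "email": "jane@example.com"
  },
  "tags": ["billing", "urgent"]
}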
Parameters
When you load data through the pipeline, the increment of data loaded depends on the parameters passed in the API call. On the Parameters tab, configure any parameters the endpoint needs so that each pipeline run retrieves the data you want.
Note
Parameters marked with a red tooltip are mandatory, and the fields within those parameters will turn red if left empty.
Use the Dynamic? toggle to select whether the parameter is dynamic (toggle on/green) or static (toggle off/gray).
- Static parameters have an unchanging value that you must type into the Value field.
- Dynamic parameters have a behavior dependent on which Parser is selected from the drop-down: DATE or STRING.
The DATE parser retrieves source data based on some date in your current data set. To define which date will be used, you need to create a variable mapping to that date. The MAX variable type will identify the maximum value of the mapped field in your data set and load new data based on that value. For example, if the maximum (latest) date in the current data set is one week ago, all data newer than one week ago will be loaded when the pipeline runs.
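For example, suppose the endpoint accepts an updated_after parameter and you map a MAX variable to the updated_at field in your data (parameter and field names here are hypothetical). If the latest updated_at value already loaded is 2024-05-01T09:00:00Z, the next pipeline run would make a call along the lines of:

GET https://api.example.com/v2/tickets?updated_after=2024-05-01T09:00:00Z

so that only records newer than that date are retrieved.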
The STRING parser allows you to write a custom query that will be used in the API call. To create this query:
- Click the Configure parser icon to the right of the parameter.
- Select the variables you want to use in the query from the Variables drop-down. You can select multiple variables; however, your query must use every variable you select. Do not select variables that you do not intend to use.
- Type the query into the Query field. To refer to the selected variables in the query, use the ${variableName} syntax, as in this example:
tickets updated_at>${UPDATE_DATE} created_at>${CREATE_DATE}
- In the Default field, re-enter the query with a default value for each variable, so that the initial run of the pipeline can retrieve data. An example of a query with default values is:
tickets updated_at>2018-02-23T09:33:31Z created_at>2020-01-01
- Click Save when finished.
All four parameters (Query, URI, Header, and Body) can be either static or dynamic. The Query, URI, and Header parameters can use either the DATE or STRING parser; the Body parameter only uses the STRING parser.
Ensure that the Body parameter's Query and Default fields are valid JSON objects. Example:
{
"createdAfter": "${variableName}",
"anotherField": {
"name": "test name"
}
}
You can select one or more variables that the query will depend on and then use the ${variableName} syntax to refer to them in the Body parameter.
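Following the same pattern as the Query parameter example above, the Default field for this Body example would repeat the JSON with a concrete starting value in place of the variable, so that the initial run of the pipeline can retrieve data (the date shown is illustrative):

{
  "createdAfter": "2020-01-01T00:00:00Z",
  "anotherField": {
    "name": "test name"
  }
}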
Variables
Variables are used in API calls to set values for parameters. Variables are set on the Parameters tab.
To set a variable:
- Click Add new variable.
- Select whether the variable type is MAX or DATE. This will determine how it is used with a parameter.
- Provide a unique name for the variable.
- Select which data field the variable maps to.
- Click Add to save the variable.
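For example, a variable created with the following settings (names are illustrative) resolves at run time to the latest updated_at value already in your data set and can then be referenced in a parameter as ${UPDATE_DATE}:

Name: UPDATE_DATE
Type: MAX
Field: updated_at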
Keys
You must set at least one key column. Key columns are used to match existing data, prevent duplicates, and allow updates.
In the Keys tab, add a key column by selecting a property from the drop-down menu. You can use the Add property button to include additional properties as needed.
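For example, for a hypothetical tickets endpoint, the record identifier alone may be enough; if you are also extracting a nested array, add a property from the array so that the combination is unique (property names and paths are illustrative):

Key column 1: id
Key column 2: comments.id (only needed when extracting a nested comments array)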
Warning
Once the pipeline loads data successfully, do not edit the keys, as this will cause the pipeline to malfunction.
Once you have completed all the tabs for all endpoints, click Continue to proceed.
Choose destination
- Choose the destination platform for your pipeline. Currently, only Snowflake on AWS is supported.
- Configure a connection to your destination, as described in Connect to Snowflake.
Note
You will also have to provide your AWS credentials and a staging location, as the data will have to be staged before it's loaded into Snowflake.
Set frequency
Property | Description |
---|---|
Pipeline name | A descriptive label for your pipeline. This is how the pipeline appears on the pipeline dashboard and how Data Loader refers to the pipeline. |
Sync every | The frequency at which the pipeline should sync. Day values range from 1 to 7, hour values from 1 to 23, and minute values from 5 to 59. The chosen interval also determines the delay before the first synchronization. |
Currently, it is not possible to specify a start time. The pipeline will immediately run at the time it is first created, and then continue to run from that time with the specified frequency.
Once you are happy with your pipeline configuration, click Create pipeline to complete the process and add the pipeline to your dashboard.
Guidelines for connector and pipeline creation
When creating a custom connector and configuring a pipeline, consider the following:
- Region: If you can't find your custom connector listed in Data Loader, make sure your Data Loader region setting is the same as the region setting used to create the connector in Custom Connector.
- Populated arrays: Each element in a populated array will have its own row, and the data will be de-normalized, meaning there will be redundant values in the target table (see the illustration at the end of this list). Consider this behavior when designing your pipeline.
- Empty arrays: We recommend disabling all empty arrays to get an initial run working; if a run fails, check the logs for details.
- Composite key: To create a composite key, add a further column to the key so that the combination is unique within the array you are extracting. This allows rows that already exist (and should be updated) to be identified.
- One array per pipeline: If in doubt, only specify one array per pipeline. The composite key uniqueness rule still applies.
- Unique table: Ensure the target table is unique by using an appropriate table prefix.
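To illustrate the Populated arrays and Composite key guidelines, consider a hypothetical source record with a two-element tags array (all field names and values are illustrative). Each array element produces its own row in the target table, with the parent fields repeated, so id alone is no longer unique and a composite key, such as id combined with the tag value, is needed:

{ "id": 101, "status": "open", "tags": ["billing", "urgent"] }

id | status | tags |
---|---|---|
101 | open | billing |
101 | open | urgent |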