Making multiple API queries
Overview
Matillion ETL's API Query component is ideal for pulling data from a REST or SOAP API. However, there are cases where using a Python script to make the API call may be more efficient.
As an example, suppose you have a list of addresses and need the corresponding longitude and latitude written to a Redshift table. This can be achieved using the Google Maps API, which takes an address and returns its longitude and latitude in JSON format. However, a limitation of this API is that it only accepts one address per query, so if you have 100 addresses you will need to make 100 API queries.
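For illustration, a single lookup might look like the following Python sketch. The endpoint and response structure are those of the public Google Maps Geocoding API, but the API key and address are placeholders:

    import json
    import urllib.parse
    import urllib.request

    API_KEY = "YOUR_API_KEY"  # placeholder key
    address = "1600 Amphitheatre Parkway, Mountain View, CA"  # placeholder address

    # Build the Geocoding request for a single address
    url = ("https://maps.googleapis.com/maps/api/geocode/json?"
           + urllib.parse.urlencode({"address": address, "key": API_KEY}))

    with urllib.request.urlopen(url) as response:
        result = json.load(response)

    # The first result's geometry holds the coordinates for the address
    location = result["results"][0]["geometry"]["location"]
    print(location["lat"], location["lng"])

Each call like this returns coordinates for exactly one address, which is why 100 addresses mean 100 requests.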
Making all of these queries can be achieved in Matillion ETL by configuring an API profile and using the API Query component with a Table Iterator component. However, this will create 100 individual Redshift tables with one row of data in each, and these tables will then need to be merged on the Redshift database using a Matillion ETL transformation job.
This would be structured as follows:
The API Query component uses the attached RSD file as an API profile, which is configured to return the longitude and latitude for a given input address. The Table Iterator component iterates over each address, runs the API Query, and writes the longitude and latitude to an individual table per iteration. These tables are then merged together using a transformation job with a multi-table input:
The final step is to delete all of the temporary tables created.
This job is clearly inefficient, as the API Query component will create many files and then load each of these files into an individual table in Redshift, which results in many tables that then need to be deleted.
Another option is to make the API calls in Python and append the results to a single file on the Matillion ETL EC2 instance, then load this file into Redshift so only one table is created. An example of this job is shown below.
This job starts by creating a blank file, then iterates over the table of addresses and makes the API call via Python, using Python Script components. The results from each API call are appended to a CSV file. The Python used in the API call follows the pattern sketched below.
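A minimal sketch of such a script, assuming the Table Iterator maps the current address into a job variable named address (Matillion exposes job variables as Python variables inside the Python Script component) and that the first component created an empty /tmp/geocode_results.csv; the API key is again a placeholder:

    import csv
    import json
    import urllib.parse
    import urllib.request

    API_KEY = "YOUR_API_KEY"  # placeholder key

    # "address" is assumed to be injected by the Table Iterator as a job variable
    url = ("https://maps.googleapis.com/maps/api/geocode/json?"
           + urllib.parse.urlencode({"address": address, "key": API_KEY}))

    with urllib.request.urlopen(url) as response:
        result = json.load(response)

    location = result["results"][0]["geometry"]["location"]

    # Append one row per iteration so a single file accumulates every result
    with open("/tmp/geocode_results.csv", "a", newline="") as f:
        csv.writer(f).writerow([address, location["lat"], location["lng"]])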
A Bash Script component then copies the local file up to Amazon S3 so it can be loaded into Redshift using Matillion ETL's S3 Load component.
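If you prefer to stay in Python, the same upload could instead be done from a Python Script component using boto3; this is only a sketch, assuming boto3 is available on the instance, and the bucket and key names are hypothetical:

    import boto3

    # Upload the accumulated CSV so the S3 Load component can pick it up;
    # the bucket and key are hypothetical placeholders
    s3 = boto3.client("s3")
    s3.upload_file("/tmp/geocode_results.csv",
                   "my-staging-bucket",
                   "geocode/geocode_results.csv")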
Both jobs achieve the same output, but the Python script version reduces the load on the Redshift cluster and is about ten seconds quicker. This is illustrated in the following task history showing the job runs. The top task used the API Query component, and the second task used the custom Python script:
This illustrates that although the Matillion ETL API Query component can handle almost all API calls, there are some cases where it may be better to make the call with a Python script.