API Profiles - Pagination
Pagination is the process by which large API response data can be returned as multiple parts and iterated through. Paging requires an implied ordering. By default this may be the item's unique identifier, but can be other ordered fields such as a created date.
The underlying network protocols perform faster and more reliably when the bulk transfer of data is split into individually manageable chunks. For example if it's necessary to transfer 1,000,000 records, an API with paging enabled might expect the caller to make 100 individual HTTP requests, and will send back 10,000 records every time.
API paging can be enabled and managed at Response Configuration section of the API Query Profile and API Extract Profile creation wizards but can also be manually added to RSD files.
Below we describe the paging techniques supported by Matillion ETL and show how they are exposed in the RSD files.
Common features
All Matillion ETL paging implementations will add a new input named rows@next
to the list of columns inside the existing api:info
block. This virtual column is used to drive paging in the various ways demonstrated in this document.
In all cases, the driver will automatically loop through pages until any one of the following conditions is met:
- No data records were returned.
- The paging information held in the
rows@next
variable is missing. - The value of
rows@next
did not change since the last iteration.
Paging can be tricky to get right because RSD is a declarative language rather than an explicit loop, and so can be difficult to debug. It's usually impossible to tell just from the API Query profile "Test" screen whether or not paging has been invoked, because the "Test" screen only shows the first few records.
The best way to be certain that paging is working is to save the API Query Profile and then use it in an API Query component, and check that it successfully returns all the data that matches the query.
Rate limiting
Matillion ETL has no in-built support for artificial rate limiting. It will always issue the HTTP request for the next page immediately after the previous page has been processed.
Full Path pagination
This pagination feature will add the "pageURLPath" to the field you want to be next. You can select the field from the tree view on the left and Set Repeating Paging Element to populate the xpath.
This is a very common pattern, in which the API response in each page includes the instructions for fetching the "next" page. Sometimes it's a full URL; sometimes a partial URL. Sometimes the information is presented as part of the data, or it may be embedded as an HTTP header in the response.
For example: If you set repeating paging element as "/rates", every time it sees "/rates" it'll move to the next page as highlighted in generated the RSD file.
This is one of the simplest methods, and only three changes you can see over the baseline API Profile, highlighted below.
<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<api:info title="ExampleAPI" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
<attr name="GBP" xs:type="double" readonly="false" other:xPath="/json/rates/GBP" />
<attr name="INR" xs:type="double" readonly="false" other:xPath="/json/rates/INR" />
<attr name="USD" xs:type="double" readonly="false" other:xPath="/json/rates/USD" />
<input name="rows@next" desc="Identifier for the next page of results. Do not set this value manually."/>
</api:info>
<api:set attr="DataModel" value="DOCUMENT" />
<api:set attr="uri" value="http://your.api/getData" />
<api:set attr="EnablePaging" value="TRUE" />
<api:set attr="pageurlpath" value="/next_page" />
<api:set attr="JSONPath" value="/rates" />
<api:script method="GET">
<api:set attr="method" value="GET"/>
<api:call op="jsonproviderGet">
<api:push/>
</api:call>
</api:script>
</api:script>
Link Header pagination
You need to insert a parameter into the document; the example here is "pagesize" set as the parameter used to indicate the page size and then provide the Page Size number. Below are the field descriptions for Link Header Pagination.
- Page Size Parameter: This is the parameter used to indicate the page size. You can select this parameter by right-clicking on the field in the tree view.
- Page Size: The number of the records returned per page.
In this variation, the full next URL to call is returned by the API as an HTTP header according to RFC8288 i.e.
- with the value set to
http://your.api/getData....
- with
rel="next"
Both of those features must be present for this type of paging to work. You can check the headers returned from the first page using a curl -v command.
The technique is almost identical to the case described earlier where you have a full next URL but slightly more complex as it involves intercepting the custom header, and might look like this:
<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<api:info title="ExampleAPI" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
<attr name="AUD" xs:type="double" readonly="false" other:xPath="/json/rates/AUD" />
<attr name="GBP" xs:type="double" readonly="false" other:xPath="/json/rates/GBP" />
<attr name="INR" xs:type="double" readonly="false" other:xPath="/json/rates/INR" />
<attr name="USD" xs:type="double" readonly="false" other:xPath="/json/rates/USD" />
<input name="rows@next" desc="Identifier for the next page of results. Do not set this value manually." />
</api:info>
<api:set attr="DataModel" value="DOCUMENT" />
<api:set attr="EnablePaging" value="TRUE" />
<api:check attr="_input.rows@next">
<api:set attr="URI" value="[_input.rows@next]" />
<api:else>
<api:set attr="uri" value="http://your.api/getData?pagesize=5" />
</api:else>
</api:check>
<api:set attr="PushResponseHeader#" value="Link" />
<api:set attr="JSONPath" value="/rates" />
<api:script method="GET">
<api:set attr="method" value="GET"/> <!-- HTTP request type -->
<api:call op="jsonproviderGet">
<api:first>
<api:set item="userns" attr="link_header" value="[_out1.header:Link]" />
<api:enum list="[_out1.header:Link]" separator=",">
<api:set item="userns" attr="next_position" value="[_value | lfind('>; rel="next"',1) ]" />
<api:notequals item="userns" attr="next_position" value="-1">
<api:set item="userns" attr="pagelink" value="[_value | substring(1, [userns.next_position|subtract(1)]) ]" />
</api:notequals>
</api:enum>
</api:first>
<api:set attr="rows@next" value="[userns.pagelink]" />
<api:push/>
</api:call>
</api:script>
</api:script>
Offset-based pagination
(Also "Record Offset")
This is the simplest form of paging. Offset became popular with apps using SQL databases which already have LIMIT and OFFSET as part of the SQL SELECT Syntax.
In the wizard, Offset pagination will add a record count in the document by selecting the field from the tree view on the left and Set Record Count to populate the xpath on the basis of total number of records.
- Limit: The number of records to return per page.
- Record Count: Select a field from the tree to count the total number of records.
This category of API returns a number of records each time, and expects the caller to subsequently request records starting from an offset beyond the previous range.
There are a number of common variations:
- API provides the next offset: the server helpfully provides the next offset you use to get the next page of records.
- API tells you how many records it returned: caller needs to maintain a counter itself, by updating an offset once per page.
- Caller must count the records: caller must count the records itself and accumulate a counter client-side, incrementing per record.
The technique used here involves a input variable (named totalCount), which is captured from a named elementmappath (in this case /rates/INR). Inside the fetch loop, check to see if the attribute has been set, which it will be once per page, and if so add it to the offset held in rows@next
, starting at 1.
<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<api:info title="ExampleAPI" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
<attr name="GBP" xs:type="double" readonly="false" other:xPath="/json/rates/GBP" />
<attr name="INR" xs:type="double" readonly="false" other:xPath="/json/rates/INR" />
<attr name="USD" xs:type="double" readonly="false" other:xPath="/json/rates/USD" />
<input name="rows@next" desc="Identifier for the next page of results. Do not set this value manually." />
<input name="totalCount" desc="Records total count" />
</api:info>
<api:set attr="DataModel" value="DOCUMENT" />
<api:set attr="EnablePaging" value="TRUE" />
<api:set attr="elementmappath#" value="/rates/INR" />
<api:set attr="elementmapname#" value="totalCount" />
<api:set attr="limit" value="10" />
<api:check attr="_input.rows@next">
<api:set attr="uri" value="http://your.api/getData?offset=[_input.rows@next]&limit=[limit]" />
<api:else>
<api:set attr="_input.rows@next" value="1" />
<api:set attr="uri" value="http://your.api/getData?offset=[_input.rows@next]&limit=[limit]" />
</api:else>
</api:check>
<api:set attr="JSONPath" value="/rates" />
<api:script method="GET">
<api:set attr="method" value="GET"/> <!-- HTTP request type -->
<api:call op="jsonproviderGet">
<api:set attr="nextOffset" value="[_input.rows@next | add([limit])]" />
<api:check attr="totalCount">
</api:check>
<api:push/>
</api:call>
</api:script>
</api:script>
Note
rows@next
stores the absolute offset.totalCount
stores the relative offset, and it's fine if different pages contain different numbers of records.
Cursor-based pagination
This allows you to insert the cursor into the document. You can select the cursor from the tree view to the left by right clicking on the field and then select Set Cursor Paging Element.
Given a set, a cursor will be a piece of data that contains a pointer to an element and the info to get the next/previous elements. The server should return the cursor pointing to the next page in each request. Most of the cases the cursor is opaque so users cannot manipulate.
Below is the field description for cursor based pagination.
- Cursor: Select a field from the tree to set as cursor.
- Cursor Parameter: The name of the cursor query parameter.
- Page Size: The number of records returns per page.
- Page Size Parameter: The name of the page size query parameter.
Below is the example of Cursor based pagination in RSD response .
<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<api:info title="ExampleAPI" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
<!-- You can modify the name, type, and column size here. -->
<attr name="GBP" xs:type="double" readonly="false" other:xPath="/json/rates/GBP" />
<attr name="INR" xs:type="double" readonly="false" other:xPath="/json/rates/INR" />
<attr name="USD" xs:type="double" readonly="false" other:xPath="/json/rates/USD" />
<input name="rows@next" desc="Identifier for the next page of results. Do not set this value manually." />
</api:info>
<api:set attr="DataModel" value="DOCUMENT" />
<api:set attr="EnablePaging" value="TRUE" />
<api:set attr="elementmappath#" value="/rates/INR" />
<api:set attr="elementmapname#" value="rows@next" />
<api:check attr="_input.rows@next">
<api:set attr="uri" value="http://your.api/getData?pagesize=10&cursor=[_input.rows@next]" />
<api:else>
<api:set attr="uri" value="http://your.api/v1/latest?pagesize=10" />
</api:else>
</api:check>
<api:set attr="JSONPath" value="/rates" />
<api:script method="GET">
<api:set attr="method" value="GET"/>
<api:call op="jsonproviderGet">
<api:push/>
</api:call>
</api:script>
Whenever a parameter needs to be specified, it is because the name of the element (for example, CURSOR) used in the RSD may be different to the actual QUERY used in the response call URI. This applies to all paging options.
Page-based pagination
This is the simplest and most common form of paging, particularly for apps that use SQL databases. The set is divided by pages. The endpoint accepts a page param which is an integer indicating the page within the list to be returned.
Page-based pagination allows you to sort the API call response by page number, how many pages you want, and the last page element. The last page can be set by right clicking the field from the tree view and select Set Last Page.
Paging ends when there is no more data to send, and the server returns a success response which is valid but which contains no data.
- First URL:
http://your.api/getData
- Second URL:
http://your.api/getData?page=2
- Third URL:
http://your.api/getData?page=3
- Nth URL:
http://your.api/getData?page=N
- N+1th URL:
http://your.api/getData?page=N+1
(returns a successful HTTP 200 but no data)
Note
- You can either specify the last page you want to call (last page), or just carry on until the records run out (out of range).
- You can jump to any particular page and do not need to query 99 pages to get the page 100.
Below are the field descriptions for page-based pagination.
- Page Size Parameter: The name of the page size query parameter.
- Page Size: The number of records returned per page.
- Page Number Parameter: The name of the page number parameter.
- Last Page: Select how the last page will be handled and the field from the tree containing the last page by right-clicking on the tree view field.
- Out of Range: If you select "Out of Range", there are further two options to select from the drop-down: "Empty Page" and "Error Code".
The technique involves starting the count at 1, and then incrementing the page number for every call. In this example we used "lastPage" as "elementmapname".
<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<api:info title="pagination02" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
<attr name="GBP" xs:type="double" readonly="false" other:xPath="/json/rates/GBP" />
<attr name="INR" xs:type="double" readonly="false" other:xPath="/json/rates/INR" />
<attr name="USD" xs:type="double" readonly="false" other:xPath="/json/rates/USD" />
<input name="rows@next" desc="Identifier for the next page of results. Do not set this value manually." />
<input name="lastPage" desc="Keeps track of the last page value returned by the api" span="">
</api:info>
<api:set attr="DataModel" value="DOCUMENT" />
<api:set attr="EnablePaging" value="TRUE" />
<api:set attr="elementmappath#" value="/rates/INR" />
<api:set attr="elementmapname#" value="lastPage" />
<api:set attr="pagingsizeparam" value="pagesize" />
<api:set attr="pagingsizevalue" value="10" />
<api:set attr="pagingnumberparam" value="2" />
<api:check attr="_input.rows@next">
<api:set attr="uri" value="http://your.api/v1/latest?[pagingsizeparam]=[pagingsizevalue]&[pagingnumberparam]=[_input.rows@next]" />
<api:else>
<api:set attr="_input.rows@next" value="1" />
<api:set attr="uri" value="http://your.api/getData?[pagingsizeparam]=[pagingsizevalue]&[pagingnumberparam]=[_input.rows@next]" />
</api:else>
</api:check>
<api:set attr="JSONPath" value="/rates" />
<api:script method="GET">
<api:set attr="method" value="GET"/>
<api:call op="jsonproviderGet">
<api:set attr="nextPage" value="[_input.rows@next | add(1)]" />
<api:check value="[nextPage | isbetween(1,[lastPage])]">
<api:set attr="rows@next" value="[nextPage]" />
</api:check>
</api:check>
<api:push/>
</api:call>
</api:script>
Failure when the page is out of range
This variation is almost identical to the previous example, except that the server signifies the end of paging by returning a specific HTTP response code: for example a 400 instead of a 200.
It's almost identical to the ordinary paging example, except that the jsonproviderGet
<api:try>
<api:call op="jsonproviderGet">
<api:set attr="rows@next" value="[_input.rows@next | add(1)]" />
<api:push/>
</api:call>
<api:catch code="*">
<api:match pattern="true" value="[_description | contains('400 Bad Request')]" >
<api:else>
<api:throw code="[_code]" desc="[_description]" />
</api:else>
</api:match>
</api:catch>
</api:try>
The full solution might look like below. It will continue paging until it receives a 400 response code, which it treats as the end-of-paging signal.
```
<api:info title="Numbered pager with 400" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
<attr name="id" xs:type="integer" readonly="false" other:xPath="id" />
<input name=""rows@next"" desc=""Identifier" span="">
</api:info>
<api:set attr="DataModel" value="DOCUMENT" />
<input value=""http://your.api/getData?page=[_input.rows@next]"" span="">