Ingesting AWS Elasticsearch data via the API Query component
Overview
Elasticsearch is a distributed, scalable, multi-tenant search engine, often used to index data (for a website, for example) and to provide "fuzzy matching" and suggested text.
Data can be streamed into Elasticsearch via various methods, but a question that comes up repeatedly is how to export data from AWS Elasticsearch into your data warehouse.
Taking the example data uploaded in this tutorial, we will use Matillion ETL to load it into a cloud data warehouse (in this particular example, Amazon Redshift). The data contains movie/film information and can be searched on different criteria.
We will use Matillion ETL to call the AWS Elasticsearch REST API and load the data using the API Query component. This is given as an alternative to the Elasticsearch Query component.
Loading Elasticsearch data into the cloud data warehouse
1. Create an API profile to call the Elasticsearch endpoint with the search criteria. The following profile returns all movies:
<rsb:script xmlns:rsb="http://www.rssbus.com/ns/rsbscript/2/">
  <rsb:info title="AWS ElasticSearch">
    <attr name="director" xs:type="string" other:xPath="director" />
    <attr name="genre" xs:type="string" other:xPath="genre" other:valueformat="aggregate" columnsize="20000" />
    <attr name="year" xs:type="integer" other:xPath="year" />
    <attr name="actor" xs:type="string" other:xPath="actor" other:valueformat="aggregate" columnsize="20000" />
    <attr name="title" xs:type="string" other:xPath="title" />
  </rsb:info>
  <rsb:set attr="uri" value="https://search-mtln-elastic-test-jax2c2nx47u2ypbdz3jkjqgbxq.eu-west-1.es.amazonaws.com/movies/_search" />
  <rsb:set attr="RepeatElement" value="/hits/hits/_source" />
  <rsb:script method="GET">
    <rsb:call op="jsonproviderGet">
      <rsb:push/>
    </rsb:call>
  </rsb:script>
</rsb:script>
The uri value set in the rsb:set element can be changed to narrow down the search. For example, appending ?q=1996 to the URI, so that it ends in /movies/_search?q=1996, will return all movies from 1996, as sketched below.
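As an illustrative sketch, the modified rsb:set line would keep the same endpoint used in the profile above, with only the query string appended:

<rsb:set attr="uri" value="https://search-mtln-elastic-test-jax2c2nx47u2ypbdz3jkjqgbxq.eu-west-1.es.amazonaws.com/movies/_search?q=1996" />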
The attr elements specify which fields to return from the API, while the RepeatElement value specifies the repeat element, i.e. where the records are located in the JSON response from Elasticsearch. Further details on creating API profiles are available in the Creating API Profiles support documentation.
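For context, an Elasticsearch _search response nests each matching document under hits.hits[]._source, which is why the repeat element is set to /hits/hits/_source. A trimmed, illustrative response (the field values below are made up, but the field names match the attributes declared in the profile) looks roughly like this:

{
  "hits": {
    "hits": [
      {
        "_index": "movies",
        "_source": {
          "title": "Example Movie",
          "director": "Example Director",
          "year": 1996,
          "genre": "Comedy, Drama",
          "actor": "Example Actor"
        }
      }
    ]
  }
}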
2. After the API Profile has been created, you can use it in a Matillion ETL orchestration job in conjunction with the API Query component. Add an API Query component to the canvas, and select the Profile just created:
3. Select the Data Source that will form the new table. The available sources are the RSD files created in the API profile.
4. Select the fields in the Data Selection dual listbox to form the columns in the table:
5. After specifying an S3 Staging Area and Target Table name, run the API Query component to query Elasticsearch and copy the data to a new Amazon Redshift table:
6. The new table can now be used in a transformation job to view and transform the data:
7. You can use a Split Field component to convert the comma-separated list held in genre into separate columns; for example, a genre value of "Comedy, Drama" would be split into two columns: