API Profiles Example - MongoDB and DynamoDB
Overview
This page explains how users can force the MongoDB and DynamoDB Query components to resolve data sources (collections) and available columns from predefined schema-definition files (RSDs) rather than use automatic schema discovery to determine these properties.
The process involves generating a selection of RSDs that define each collection, its constituent columns, and their data types. Once generated, the component can be (optionally) forced to use these schema definitions instead of scanning a sample dataset to determine this information during design time.
Generating schema files
The MongoDB and DynamoDB components can generate these RSDs automatically by adding certain connection options.
- Click Project → Manage API Profiles.
- Create a new profile and name it appropriately, e.g. 'MongoDB_DBName'. Remember the profile name as you will be using it exactly in the following steps.
- Close the API Profiles menu.
- Create a new orchestration job and configure a new MongoDB or DynamoDB component (or copy a preconfigured component from another job). Ensure it is pointing at the database you are interested in generating the schema for.
- Add the following connection options:
- Location: /usr/share/tomcat/api_profiles/[profile_name]
- GenerateSchemaFiles: OnStart
- Run the component (right-click → Run Component).
- Remove the connection options Location and GenerateSchemaFiles to avoid accidentally generating them again.
Optionally, delete this job if you no longer need it.
The above run will generate a set of RSD files in the folder /usr/share/tomcat/api_profiles/[profile_name].
You can use any other location on the "Matillion" server. The API Profiles editor defaults to this location when editing profiles and we can use the editor to review/edit these schema definitions as explained below.
Reviewing and editing the generated schema
The API Profiles editor is a simple text editor in Matillion ETL that is typically used to manage profiles defined for the API Query component. By default it looks for available profiles in the folder /usr/share/tomcat/api_profiles. Choosing to use the same location helps us use the editor to manage the schema definitions generated.
To review or edit the generated schema files:
- Click Project → Manage API Profiles
- Edit the profile name.
You will now see a bunch of schema definitions to the left. You can change these as desired or create new definitions by copying from existing. The name of the data source (RSD) should match the name of your collection in MongoDB/DynamoDB.
Using generated schema definitions
The Location connection option determines whether a component should look for an existing schema definition or query MongoDB/DynamoDB for schema-related information.
To force a component to use schema definitions, add the Location connection option and set it to the appropriate folder—in this example, it is /usr/share/tomcat/api_profiles/[profile_name].
Next time you click the Data Source button, it will only list the profiles found in that location and any columns defined in those files. The component will no longer query for a list of tables/collections nor try to determine the columns and their data types.
Dealing with rare columns
One of the challenges with a schemaless database like MongoDB or DynamoDB is that not all columns will necessarily be present in all documents. This may lead to certain columns (rare) missing a column-definition in the RSD that is generated.
Typically, the component will analyse approximately 100 rows from each collection to identify the available columns and the most suitable data type. If the rare column is not see in the sample, then a definition will not be created.
The RowScanDepth connection option can be used to adjust the number of rows scanned from each collection. If the rare column exists in this now-larger sample, a definition will be created.
Alternatively, you may generate the basic definition and then edit it to manually add the column definition yourself as outlined in the section above.