Data lineage

Public preview

Data lineage provides a clear, visual representation of data flow in Data Productivity Cloud pipelines, giving you insights into pipeline dependencies and relationships, and allowing you to answer questions such as:

  • What was the data source?
  • How has the data changed through data transformations?
  • What is the final destination of the data?

Note

Currently, data lineage is available for transformation pipelines only.


Accessing data lineage

Data lineage is collected for each dataset used in your pipelines. To access data lineage:

  1. Log in to the Hub.
  2. Click Manage → Pipeline Runs.
  3. Click Lineage in the left sidebar.

The Lineage screen lists each table in your project. For each table, it shows the table name, the cloud data warehouse infrastructure (for example, Snowflake), and the table's location in the cloud data warehouse.

Click a dataset's Table name to drill down to see a lineage graph for that dataset.

Note

The Lineage viewer isn't a full schema view of your cloud data warehouse; it shows only those datasets that you have used in running pipelines.


Using data lineage

The data lineage for a dataset is displayed graphically on a canvas, in a diagram known as a lineage graph. The graph shows every state that the data passes through as transformation pipelines act on it, as shown in the following simple example:

Data lineage graph

In this example, two datasets have been combined by a transformation pipeline (denoted by the T icon) to produce a target dataset. Each dataset is represented by a separate box on the canvas, and the data flows from left to right, following the direction of the arrows. A data lineage graph may contain multiple datasets and multiple transformations that act on those datasets.
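To make the structure of a lineage graph concrete, the following is a minimal sketch that models datasets as nodes and transformation pipelines as directed edges, then walks backwards from a target dataset to find everything that feeds it. It is an illustrative model only, not an interface that the Data Productivity Cloud exposes; the dataset names, pipeline names, and the `upstream_sources` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lineage edges: (source dataset, pipeline name, target dataset).
# The names are illustrative and do not come from any real project.
EDGES = [
    ("customers", "build_orders_enriched", "orders_enriched"),
    ("orders", "build_orders_enriched", "orders_enriched"),
    ("orders_enriched", "build_daily_summary", "daily_summary"),
]


def upstream_sources(target, edges):
    """Return every dataset that directly or indirectly feeds `target`."""
    # Map each target dataset to the set of datasets it was built from.
    parents = defaultdict(set)
    for source, _pipeline, dest in edges:
        parents[dest].add(source)

    # Depth-first walk against the direction of the arrows.
    seen = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(upstream_sources("daily_summary", EDGES)))
```

Walking the edges in the opposite direction would answer the complementary question of where a dataset's data ends up.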

On this canvas, you can perform the following actions:

  • Zoom in and out using the controls at the bottom-right.
  • Drag the canvas around the window with your mouse.
  • Drag individual boxes (datasets) around the canvas to reorganise the view (the relationships between the datasets will remain unchanged).
  • Click a transformation icon, T, to open a panel giving you the following details:

    • Name of the pipeline.
    • Name of the project containing the pipeline.
    • Status of the most recent pipeline run (SUCCESS or FAILURE).
    • Date and time that the most recent pipeline run started and finished.
    • Approximate duration of the most recent pipeline run, in seconds.
    • The name of the user who most recently ran the pipeline.

    Click the pipeline name at the top of this panel to go to the Pipeline run details page in the Observability dashboard.

  • Click any dataset box to show information about that dataset in a panel on the right. This panel includes a Columns tab which shows the name and data type of every column in the dataset.

  • Click the down arrow in any of the dataset boxes to expand the box, displaying every column in the dataset.

With one or more dataset boxes expanded on the canvas, you can trace the full lineage of any individual column. To trace a column, click its name in any of the datasets. Arrows then trace that column through all the datasets in the lineage. The following simple example shows the ID column in two datasets, before and after a transformation.

Data lineage columns
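Column tracing can be pictured in the same way as dataset lineage, but with (dataset, column) pairs as the nodes. The sketch below is an assumption about how such a trace could be modelled, not a description of how the Lineage viewer is implemented; the column mappings and the `trace_column` function are hypothetical.

```python
# Hypothetical column-level edges:
# (source dataset, source column) -> (target dataset, target column).
COLUMN_EDGES = [
    (("customers", "ID"), ("customers_clean", "ID")),
    (("customers", "NAME"), ("customers_clean", "FULL_NAME")),
    (("customers_clean", "ID"), ("customer_report", "CUSTOMER_ID")),
]


def _walk(start, neighbours):
    """Collect every node reachable from `start` via the `neighbours` mapping."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def trace_column(start, edges):
    """Return `start` plus every column upstream or downstream of it."""
    downstream, upstream = {}, {}
    for src, dst in edges:
        downstream.setdefault(src, set()).add(dst)
        upstream.setdefault(dst, set()).add(src)
    # Walk each direction separately so that sibling columns, which merely
    # share an ancestor or a descendant, are not pulled into the trace.
    return {start} | _walk(start, upstream) | _walk(start, downstream)


for dataset, column in sorted(trace_column(("customers_clean", "ID"), COLUMN_EDGES)):
    print(f"{dataset}.{column}")
```

Starting from the ID column in the middle dataset, the trace reaches the ID column in the source and the renamed column in the report, but not the unrelated NAME column.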


Data access control

Lineage metadata is collected and aggregated at the Matillion account level, so you have a unified view of lineage across your projects. However, to preserve data security, project-level user permissions control which users can see which metadata. Specifically, a user can see a dataset's metadata only if that dataset has been read from, or written to, by at least one project that the user belongs to.
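The visibility rule above amounts to a set intersection between the projects a user belongs to and the projects that have touched a dataset. The following sketch expresses that rule under stated assumptions: the `visible_datasets` function, the project names, and the shape of the usage mapping are hypothetical and are not part of any Matillion interface.

```python
def visible_datasets(user_projects, dataset_usage):
    """Return the datasets whose metadata a user may see.

    `user_projects` is the set of projects the user belongs to.
    `dataset_usage` maps each dataset name to the set of projects that
    have read from or written to it (a hypothetical structure).
    """
    return {
        name
        for name, projects in dataset_usage.items()
        if projects & user_projects  # at least one project in common
    }


# Hypothetical account-level usage records.
usage = {
    "sales.orders": {"project_a", "project_b"},
    "sales.customers": {"project_b"},
    "hr.salaries": {"project_c"},
}

print(sorted(visible_datasets({"project_a"}, usage)))
print(sorted(visible_datasets({"project_b", "project_c"}, usage)))
```

A user who belongs to no project that has used a dataset gets an empty intersection for it, so that dataset's metadata stays hidden.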

