
Matillion data quality framework

The term "data quality" is very broad, and typically has a dual meaning:

  • Data that is complete, accurate and relevant
  • A set of processes to achieve such a state

We'll focus on the first of these two angles here, and act in the role of data consumers rather than primary creators. The foremost responsibility of a Data Productivity Cloud data processing system is to faithfully record all the input data, apply any required business rules, and deliver the output in an accessible and timely way.

As data consumers, we start from the assumption that the source data is fit for purpose: it's good enough to run the business operations. However, when designing a Data Productivity Cloud data processing system, you should always seek to understand and question the source data, and feed back any suspicions or anomalies.


Data quality problem scenarios

There are often technical problems with source data, in the form of outliers which can't be handled in the desired way. The primary responsibility is first to detect the "bad" data, then to reject it (so it doesn't break the ELT logic), and finally to audit the fact that some data had to be rejected.

That way:

  • Business decisions are made against a known baseline
  • There's a chance for any invalid data to get fixed at source

This document will look at specific ways in which you can use the Data Productivity Cloud to detect, reject and audit data that's of sufficiently bad quality that it's unusable in an ELT process.

Sometimes technical data quality issues like the above are caused by a genuine problem in the system of record. When you use a pipeline to reject and audit bad input, you are providing feedback that can help solve the problem at source.

However, some problems that manifest as data quality issues arise because you're reading the data wrongly. For example, if the system of record states that a flight from Sydney to Canberra took 33 minutes, and that a flight from Sydney to London took 22 minutes, you should take the opportunity to question whether "22" is really a measure in minutes. Perhaps there's a business rule stating that values below 30 should always be interpreted as hours.

Below are some typical data quality problems:

Missing values

Applies mainly to attributes rather than whole records. For example, it's easy to detect if an airline ticket record is missing a "flight number" attribute. However, it's usually impossible to determine that someone should have bought an airline ticket but that, for some reason, no record of the purchase ever appeared.

Handling duplicates

Duplicates are usually easy to detect, as long as it's clear what the unique identifier should be. Sometimes duplicates are genuinely errors, but sometimes they may indicate a missing business rule.

For example, you receive two ticket records for the same person on the same flight, but with different seat numbers. The missing business rule might be that their seat allocation got changed, and only the second record is still valid.

"Dates" that are not dates

This applies to any data which looks like something it is not: a human could perhaps read it, but it's not meaningful to a computer. For example, consider a string column containing the value '2017-01-29'.

That string is an array of bytes [32,30,31,37,2d,30,31,2d,32,39] which can be successfully converted into a date. Whereas the string 'tomorrow', which makes perfect sense to a human, is just the byte array [74,6f,6d,6f,72,72,6f,77], which is not a legitimate date.


Missing values

In a Transformation pipeline, detection of missing values is most commonly achieved using a Filter component.

The following pipeline shows a flight Table Input component with several filters. Records which pass all the filters along the "happy path" at the top have both an Origin and a Destination airport, and are therefore usable for further processing.

Data Quality Canvas

In contrast, any records which fail either filter will end up going to the Unite operator for accumulation. The Calculator component adds a couple of attributes recording the time and reason for failure.

Calculator CURRENT_DATE

Calculator rejections
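As an illustrative sketch, the expressions inside those two Calculator components might be as simple as the following; the output column names rejection_date and rejection_reason are assumptions, not prescribed names:

-- Expression for a "rejection_date" output column
CURRENT_DATE

-- Expression for a "rejection_reason" output column (one per rejection branch)
'Missing Origin or Destination airport'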

All records which end up in the "problem path" should be saved to an audit table using the designer's Table Output component. From there they can easily be sent back to the source system for data correction, or else reprocessed later once the missing business rules have been added.
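A minimal audit table for that Table Output component might look like the following sketch. The table name flight_rejections and its columns are assumptions for illustration; in practice the table would carry whichever source columns you need for correction or reprocessing:

CREATE TABLE flight_rejections (
    flight_number     VARCHAR,   -- copied from the source record
    origin            VARCHAR,   -- copied from the source record
    destination       VARCHAR,   -- copied from the source record
    rejection_date    DATE,      -- added by the Calculator component
    rejection_reason  VARCHAR    -- added by the Calculator component
);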


Complementary filters

This pipeline also illustrates the important and commonly used technique of adding complementary filters. The conditions in the pairs of filters highlighted below are exactly opposite to each other, and every record will either fall into one or the other category.

Complementary Filters

It's important that these pairs of filters really are exactly opposite, so that every record goes down exactly one of the two paths. There should be no case where a record is both successfully processed and audited to the rejection log, and no case where a record silently disappears by failing both filters.
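Expressed as SQL-style conditions, the complementary pair for the missing-values example above might look like this (the column names are illustrative):

-- "Happy path" filter: both airports are present
"origin" IS NOT NULL AND "destination" IS NOT NULL

-- "Problem path" filter: the exact logical opposite
"origin" IS NULL OR "destination" IS NULL

Note that negating an AND of conditions produces an OR of the negated conditions, which is why the two filters look structurally different even though they are exact opposites.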


Handling duplicates

It's vital for relational data to have known unique or primary keys. Bear in mind that in most cloud data warehouses, primary key constraints are informational only: they may be used by the optimizer, but they are not enforced, so duplicate rows can still arrive.

Detecting duplicate data and acting accordingly is most commonly done using an Analytic (or "Window") function in a Calculator component, with a formula like this:

ROW_NUMBER() OVER (
PARTITION BY <<unique columns>>
ORDER BY <<sort columns>>)

This allocates an integer sequence number to every record, and ensures that there can only be one number 1 per combination of unique columns. In the case of aircraft seat allocations, the unique key is the flight number plus the passenger, so the expression would be ROW_NUMBER() OVER (PARTITION BY flightnum, passenger ORDER BY timestamp). For one passenger on flight BA123, the data might look like this:

Flight Number | Seat Number | Timestamp           | ROW_NUMBER() | Comment
BA123         | 43F         | 2009-01-11 14:52:01 | 1            | First allocation
BA123         | 67B         | 2009-01-11 14:58:18 | 2            | Changed by customer
BA123         | 52A         | 2009-01-12 10:02:29 | 3            | Final allocation

Every group of records sharing the same key columns (even if there's actually no duplication) is guaranteed to have exactly one record with ROW_NUMBER() equal to 1. This can be used in the "happy path" filter exactly as described in the previous section.

In this example, it's more useful to order the data by timestamp descending. That way the "last" (i.e. the latest) record in each group receives number 1, which makes it very easy to pick out the final seat allocation from among the group of duplicates, e.g. for the final seating plan on the aircraft:

ROW_NUMBER() OVER (
PARTITION BY <<unique columns>>
ORDER BY <<sort columns>> DESC)

The complementary filters would be implemented by checking for the value equal to 1 (not a duplicate) and anything greater than 1 (is a duplicate), like this:

Complementary Filters Example
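Putting the pieces together, the pushdown SQL for the "happy path" branch would be broadly equivalent to the following sketch; the table name seat_allocations and the passenger and allocation_ts columns are assumptions for illustration:

SELECT flightnum, passenger, seatnum, allocation_ts
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (
               PARTITION BY flightnum, passenger
               ORDER BY allocation_ts DESC
           ) AS rn
    FROM seat_allocations s
) dedup
WHERE rn = 1;   -- the complementary "problem path" filter is rn > 1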


"Dates" that are not dates

You can rightly expect certain datatype conversions to be performed automatically, because they are unambiguous. For example, it usually causes no problems to implicitly convert the strings "1", "2" and "3" into integers and append them to a table already containing integers. The database should do this successfully on your behalf, with no manual intervention needed in the pushdown SQL.
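As a tiny illustration (the table name int_values is hypothetical), appending string literals to an integer column normally succeeds through an implicit cast:

-- num_flights is an INTEGER column; the string values are cast implicitly
INSERT INTO int_values (num_flights) VALUES ('1'), ('2'), ('3');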

Dates, however, are a different matter, and can cause data quality problems. For example, a CSV file might contain the string "2009-01-11" indicating a seat reservation date. Coming from an American system, where the month is written before the day, it should be read as January 11th; but coming from an Australian or British system, where the day comes before the month, it would mean November 1st.

You might be lucky, and the database might guess the interpretation you need correctly. But when converting strings like these into real date data types, it's best to be explicit about the format mask, using a TO_DATE function such as:
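-- an explicit format mask, assuming the input column is "some_string" as in the fuller example below
TO_DATE("some_string", 'YYYY-MM-DD')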

This also gives you the opportunity to apply business rules (such as "tomorrow" meaning system date + 1 day), and to reject dates that really do contain nonsense values.

For example, in a Calculator component, you could do as follows:

CASE WHEN "some_string" ~ '\\d{1,4}-\\d{1,2}-\\d{1,2}'
    THEN TO_DATE("some_string", 'YYYY-MM-DD')
    WHEN UPPER("some_string") = 'TOMORROW' THEN CURRENT_DATE + 1
    ELSE TO_DATE(NULL, 'YYYY-MM-DD')
END

This expression uses a pattern match to check that the string could conceivably be a date, and if so converts it using the explicit format mask YYYY-MM-DD.

The special value "tomorrow" is always treated as system date + 1 day.

Any other values are rejected, and a NULL is substituted. In this way the data can be added to a real date column if necessary. The complementary filters for this column would be:

  • the_date IS null or blank
  • the_date IS NOT null or blank

Applying business rules

The above examples have shown how to apply simple business rules to incoming data, to interpret it correctly.

  • If you receive more than one seat allocation record for a flight, the most-recently updated record is the one to use
  • The date value "tomorrow" means "today's date + 1 day"
  • Leave nonsense dates blank in the output

However it's possible to apply much more sophisticated business rules with the Data Productivity Cloud, again using pushdown SQL. For example, in a dataset of flights, the origin airport, destination airport and distance in miles are recorded. However there are a few inconsistencies:

  • Flight distance is sometimes missing
  • Sometimes there is a small disagreement in distance between the outbound and inbound legs of the journey between the same two airports
  • In some cases the recorded distances have changed over time (maybe due to differences in how it was calculated)

How can distance (and hence average flight speed and other descriptive statistics) be measured reliably if the actual measurements themselves are unreliable? In consultation with the business, the following rules were agreed:

  • If "distance" is present then just use it in the flight "fact" table
  • If "distance" is missing, then fill in from the most-commonly found distance value for other flights between the same origin and destination, in either direction.
  • In the "route" dimension, use the most-commonly found distance value for all flights between that origin and destination

The key to this is to find the most commonly recorded distance between every pair of airports. To improve performance and make maintenance easier, it's best to perform this kind of aggregation on its own, create an intermediate "working" table of results, and use it to apply the business rules.

Pipeline Overview

The above transformation pipeline creates the intermediate table in several steps:

  • Only use records where source, destination and distance are all present (Filter known)
  • Aggregate and count occurrences (Agg)
  • To obtain a consistent direction, multiplex through almost-complementary filters (origin < destination and origin > destination), and swap the column names along one branch. This aligns all data between pairs of airports, regardless of flight direction. It also takes the chance to remove any spurious data where the departure and arrival airports are the same.
  • In "Sum Frequency", add all the pairs of airports (so 100 instances of LAX → JFK plus 101 JFK → LAX would become 201 instances of JFK → LAX)
  • Use the ROW_NUMBER analytic function to find the most commonly recorded distance between every pair of airports
  • (Re-)Create the common_flight_distances table containing the results
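Taken together, the steps above are broadly equivalent to the following pushdown SQL sketch. The source table name flights and the column names origin, destination and distance are assumptions for illustration, the exact CREATE TABLE syntax varies by cloud data warehouse, and the direction alignment is expressed here with LEAST/GREATEST rather than the pair of filters and column swap used on the canvas:

CREATE OR REPLACE TABLE common_flight_distances AS
SELECT origin, destination, distance
FROM (
    SELECT aligned.origin,
           aligned.destination,
           aligned.distance,
           ROW_NUMBER() OVER (
               PARTITION BY aligned.origin, aligned.destination
               ORDER BY COUNT(*) DESC            -- most frequently recorded distance first
           ) AS rn
    FROM (
        SELECT LEAST(origin, destination)    AS origin,       -- align direction so origin is
               GREATEST(origin, destination) AS destination,  -- always alphabetically lower
               distance
        FROM flights
        WHERE origin IS NOT NULL
          AND destination IS NOT NULL
          AND distance IS NOT NULL
          AND origin <> destination                           -- drop spurious same-airport records
    ) aligned
    GROUP BY aligned.origin, aligned.destination, aligned.distance
) most_common
WHERE rn = 1;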

Having obtained the vital "most common" distances, it's quite straightforward to apply the business rule, in a pipeline like this:

Most common distances business rule

The common_flight_distances table is replicated and left-joined twice to the fact table: once on origin=origin and destination=destination, and once on origin=destination and destination=origin. This caters for the fact that in the summary table the origin is always alphabetically lower than the destination, whereas in reality flights can go in either direction.

After both joins, three distance measurements are available for every flight record:

  • The original distance
  • The most commonly measured distance (if the origin happens to be alphabetically lower than the destination)
  • The most commonly measured distance (if the origin happens to be alphabetically higher than the destination)

To apply the business rules, simply take the first of these three values which is not null.
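In pushdown SQL terms, the double join and the "first non-null value wins" rule look roughly like this sketch; the fact table name flights and its column names are assumptions for illustration:

SELECT f.origin,
       f.destination,
       COALESCE(f.distance,      -- 1. the original distance, if present
                d1.distance,     -- 2. most common distance, origin alphabetically lower
                d2.distance      -- 3. most common distance, origin alphabetically higher
       ) AS distance
FROM flights f
LEFT JOIN common_flight_distances d1
       ON  f.origin      = d1.origin
       AND f.destination = d1.destination
LEFT JOIN common_flight_distances d2
       ON  f.origin      = d2.destination
       AND f.destination = d2.origin;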

Summary

This article has demonstrated how to detect, reject and audit data which fails certain defined data quality rules, such as missing values and duplicates.

The main pattern is to detect data quality problems using SQL expressions, then multiplex the data through complementary filters, so that records are either processed normally or else rejected and audited.

Using this technique, you can use the Data Productivity Cloud's pushdown SQL to detect data quality problems in any input data, and to apply arbitrarily complex business rules.