Helping with the GDPR

Overview

From 25/5/2018 any company operating in the EU will be required to comply with the new GDPR Data Protection legislation if they store or process personal information relating to an individual.

If you're a Matillion ETL user your installation will be running within your AWS Infrastructure and so should be considered in the same way as your other EC2 instances, but is there any way Matillion ETL can help with your wider GDPR responsibilities?

Due to the ability to integrate cloud data warehouses with other services such as cloud storage and messaging systems - and indeed we use many of these systems internally within Matillion - the answer is 'Yes'.

This is of course not an exhaustive list and you should consult legal experts if you're in any doubt about your GDPR Compliance. The below examples are part of a wider GDPR policy and set of processes, and are only shown here as an example of the kind of checks that are possible.

Technical Processes Used:

Querying cloud services from within Matillion.
Querying data warehouse system tables from within Matillion ETL.
Getting data from Google Spreadsheets and comparing with information from AWS - although the information is just as relevant for other cloud platforms.

Example Process

One of the main GDPR principles is ensuring that data is up to date and not kept for longer than required - as well as a legal requirement this is just good data practice. When using Matillion ETL to create a data warehouse some areas where data may be stored and become stale include:

In data warehouse snapshots/backups.
In testing databases used during development.
Buckets (where CSV files or other files may be stored as required by your processes).

We'll look at how Matillion ETL can help with the reporting of the above, to aid your GDPR processes. These processes report via SNS and you'll need a SNS topic set up in advance.

Backups

Redshift

Redshift will automatically take Snapshots (backups) of your clusters multiple times a day, however these will expire after a set time (configured in the AWS Console). This means as long as you delete any data that must be removed as part of GDPR even the Automatic Snapshots will eventually expire and the data be removed from your backups.

Manual Snapshots, however, will never expire and must be manually deleted which means keeping track of them.

The attached file you can find the example job Report Snapshots that will query your data warehouse and alert you via an SNS topic of all of your manual snapshots:

See the instructions in the comment for the changes you'll need to make to get it to work for your environment.

The majority of the logic is done within the Python component which builds up an output message (a list of snapshots) which is then passed into the SNS component to be passed to the SNS topic and onto the recipients.

The received email looks like:

Snowflake

Snowflake has two main methods of backing up data that both potentially hold onto customer data; Time Travel and Fail-safe.

Even after removing data (for example: by dropping a table), Time Travel will allow you to restore it for up to 90 days. Time Travel data retention time can be changed (and turned off) using the DATA_RETENTION_TIME_IN_DAYS parameter, detailed here.

Fail-safe is, unfortunately, not configurable but will only hold data for 7 days (beginning from when Time Travel's retention ends).

Testing Databases

Perhaps during development you use one of more testing Databases on a data warehouse. If using customer data this could be a GDPR concern if these environment aren't cleared up after development - old data could lie around for months or years.

As such it's a good practice to always clear down test environments after development - and an automated process to alert you to which environments are populated can help with this.

This example reports databases with non-empty tables and the associated Transform will create a report - non-empty tables will query the system tables and report any databases with non-empty tables so you can ensure only the required environments are populated.

See the instructions in the comments for the changes you'll need to make to get it to work for your environment.

Note: the Matillion ETL environment this runs within will need to have sufficient database privileges to query the system tables for the environments needed - you may wish to run this as a root or admin user depending on your requirements.

The received email looks like:

Bucket File checking

In many cases you may be storing reference data in Buckets to be used by Matillion ETL jobs - perhaps this is business data that arrives in CSV form or lookups provided by external companies.

A requirement of GDPR is to know which data is stored where so it is likely that you have a log of this business data, with details of at least:

The name of the file
The Bucket it resides in

However, there will be a need to check that this list remains up to date, and also to ensure that any new files that appear are identified. If this information is stored in a Google Spreadsheet then Matillion ETL can be set up to access this using the Google Sheets Query component.

In this example we've created a Google Spreadsheet Matillion Cloud Storage Business Data Tracking Example with two sheets:

Sheet (S3/Cloud Storage)_buckets shows the bucket or buckets in scope for the check:

Sheet S3Cloud Storage_files shows the known files that are expected to exist in those buckets:

Also We have populated a Bucket matillion-example with the two known files, and an extra unlisted file (highlighted):

The provided example job 3) Bucket to Spreadsheet Comparison Example will:

Read the of known Buckets for checking from the Google sheet
Read the list of known files from the Bucket
Query S3Cloud Storage directly using Python to determine all of the files that exist
Send an email of any files found in the monitored Buckets that do not exist on the spreadsheet - so these can either be added (to keep your GDPR data log up to date) or removed if they are not required.

As you've hopefully seen from the screenshots the S3 bucket we're checking as an extra file that isn't logged on our known data spreadsheet. When the email arrives it looks like this: