Adding filename as a column to tables (BigQuery)
A question we are often asked is how to add the source filename as a column in the BigQuery table. Unfortunately the BQ Load command doesn't support this; however, there are some workarounds.
Option 1: Using a File Iterator to write the filename to a variable
The File Iterator is used to loop through all files in a location and write the names to a variable. The File Iterator can be configured to point at one folder in a Cloud Storage Bucket, and it identifies all files that match the regex.
In the File Iterator properties, the Input Data Type is "Cloud Storage", and the Input URL is the folder on the Cloud Storage Bucket where the data files are held:
The final step on the Iterator is to configure the variables. This example uses a Job Variable called "filename", and this is pointed at the filename returned from the File Iterator:
The File Iterator is connected to an Orchestration job, which contains the Cloud Storage Load, and a Transformation job. The Cloud Storage Load is configured to Load the file with the filename received from the iterator into a temporary stage table:
The Transformation job simply takes the data from the stage table, uses a calculator to append in the new filename, and then uses a table output to write it to a new table with the filename:
When the job is run, the File Iterator will grab the first file and set the name of it into the variable. The Loop Orchestration job will then load the data from that file into a stage table, and then call a Transformation job. The Transformation job will then load the data from the stage table into a permanent table, and will use the Calculator component to take the variable value containing the filename and will add it into a new column:
Option 2: Using a Cloud Function
There may be some instances where it's not possible to use a File Iterator to pick up the filename. Another option to load the data is to use a Cloud Function and pass through the filename.
The first step is to configure a Cloud Function to put a message on a Pub/Sub Queue when a new file is added into the Cloud Storage Bucket:
The Pub/Sub message is passing a variable called "file".
Matillion is then configured to listen to the Pub/Sub Queue. Further details on configuring Matillion to listen to a Pub/Sub Queue is available here.
An Orchestration Job is then used in Matillion to load the content of the new file into a staging BigQuery table, and then a Transformation Job is used to add in the filename from the variable using the same method as option 1:
Option 3: Create external table and use _FILE_NAME
Instead of creating a copy of the table, you can create an External Table. This external table will have a hidden column that can be referenced in an SQL script called _FILE_NAME, which will give the name of the file which references it:
SELECT _FILE_NAME as f FROM externalTable
The BigQuery Load command doesn't include the option to add the filename to the table created in BigQuery and hence the Matillion Cloud Storage Load component doesn't support adding the filename to the table. However this article included some examples of how to use other Matillion components and GCP services as a workaround to include the filename.