Setting up Matillion ETL in a private VPC
Overview
This guide explains how to configure Matillion ETL in a Virtual Private Cloud (VPC) with no internet access. A VPC configured with public and private subnets according to AWS best practices provides you with your own virtual network on AWS. You have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways.
This guide assumes that you already have a private subnet set up, and that you have (or can launch) an EC2 instance with Matillion ETL installed inside this subnet.
Matillion ETL works best when it has access to the internet, either via a publicly addressable IP address and an internet gateway, or via an Elastic Load Balancer. It's also possible to deploy Matillion ETL to a VPC without any internet access or to an isolated subnet with no further routing configured.
There are some caveats. Access to S3 goes through a VPC endpoint for S3, and only buckets within the same region are accessible. While Redshift can be supported in this configuration, access to the wider range of Matillion ETL components will be limited.
:::info{title='Note'} Matillion ETL instances require internet access to report on billing. To connect your instance to the Hub, refer to Configuring a connection from Matillion ETL to the Hub. :::
You can choose either a "single instance" deployment or a "high availability" (HA) deployment of a cluster. Below is the infrastructure diagram for "high availability". This high availability option provisions Matillion ETL for Amazon Redshift in existing AWS infrastructure.
The architecture sets up the following:
- A highly available architecture that spans two Availability Zones.
- A VPC configured with public and private subnets according to the AWS best practices, to provide you with your own virtual network on AWS.
- In the public subnets, two Matillion EC2 instances running Matillion ETL in a cluster, deployed across two Availability Zones.
- An Application Load Balancer to direct traffic to the Matillion ETL instances.
- An IAM role attached to the EC2 instance, specifying which AWS services the Matillion ETL instance can access.
- In a private subnet, an Amazon Redshift cluster and its components, such as a cluster subnet group, parameter group, workload management (WLM), and a security group that allows access to the VPC. This is the default behavior if the PubliclyAccessible parameter is set to False. If PubliclyAccessible is set to True, the cluster and its components will be created in the public subnets.
- An Amazon Simple Storage Service (Amazon S3) bucket for audit logs.
- A VPC endpoint for Amazon S3, so that Amazon Redshift and other AWS resources that are run in a private subnet can have controlled access to Amazon S3 buckets.
- Amazon CloudWatch alarms that monitor the CPU on the Matillion host and the CPU and disk space of the Amazon Redshift cluster, and that send an Amazon SNS notification when an alarm is triggered.
- Amazon SNS to send Amazon CloudWatch alarm and event notifications.
- The infrastructure uses a key from AWS Key Management Service (AWS KMS) to enable encryption at rest for the Amazon Redshift cluster, and creates a default master key when no other key is defined.
- An AWS Identity and Access Management (IAM) role that grants minimum permissions required to use Redshift Spectrum with Amazon S3, Amazon CloudWatch Logs, AWS Glue, and Amazon Athena.
- An AWS Glue Catalog as a metadata store.
:::info{title='Note'} If you're deploying Matillion ETL for Amazon Redshift into an existing VPC, make sure that your VPC has two private subnets in different Availability Zones for the PostgreSQL databases and that the subnets aren't shared. The Amazon Redshift database also requires a private subnet. The Matillion ETL for Amazon Redshift EC2 instance requires a public subnet. :::
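The subnet requirements in the note above can be checked before deployment. The following is a minimal sketch of such a check, not part of Matillion ETL or AWS tooling: the `Subnet` class, the `check_subnet_layout` function, and the subnet IDs are all hypothetical names introduced for illustration, and the subnet details are assumed to have been collected separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    """Hypothetical description of a subnet in the existing VPC."""
    subnet_id: str
    availability_zone: str
    is_public: bool


def check_subnet_layout(subnets):
    """Return a list of problems; an empty list means the layout matches the note."""
    problems = []
    private_zones = {s.availability_zone for s in subnets if not s.is_public}
    has_public = any(s.is_public for s in subnets)
    if len(private_zones) < 2:
        problems.append("need two private subnets in different Availability Zones")
    if not has_public:
        problems.append("need a public subnet for the Matillion ETL EC2 instance")
    return problems


# Illustrative values only; replace with the subnets of your own VPC.
layout = [
    Subnet("subnet-private-a", "eu-west-2a", is_public=False),
    Subnet("subnet-private-b", "eu-west-2b", is_public=False),
    Subnet("subnet-public-a", "eu-west-2a", is_public=True),
]
print(check_subnet_layout(layout))
```

Returning a list of problems rather than raising on the first one lets all layout issues be reported in a single pass.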
Accessing Redshift
For Matillion ETL to access the Redshift cluster, the cluster must be created in the same VPC as Matillion ETL, or else VPC peering needs to be set up. To create a Redshift cluster in a private VPC, you need a cluster subnet group associated with that VPC.
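If you prefer to script the subnet group rather than create it in the console, the request can be prepared as a plain dictionary. This is a sketch under assumptions: the keys follow the Amazon Redshift `CreateClusterSubnetGroup` API as exposed by boto3, while the function name, the group name, and the subnet IDs are hypothetical. The sketch only builds the parameters; it doesn't call AWS.

```python
def subnet_group_params(name, subnet_ids, description="Subnet group for the private VPC"):
    """Build keyword arguments for a Redshift cluster subnet group request."""
    subnet_ids = list(subnet_ids)
    if not subnet_ids:
        raise ValueError("a cluster subnet group needs at least one subnet")
    return {
        "ClusterSubnetGroupName": name,
        "Description": description,
        "SubnetIds": subnet_ids,
    }


# Hypothetical name and subnet IDs, for illustration only.
params = subnet_group_params("matillion-private", ["subnet-0aaa1111", "subnet-0bbb2222"])
print(params)

# With boto3 installed and credentials configured, the dictionary could be passed as:
#   boto3.client("redshift").create_cluster_subnet_group(**params)
```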
To create an Amazon Redshift Cluster, follow these steps:
- Sign in to the AWS Management Console and open the Amazon Redshift console. Then, choose Launch Cluster.
:::info{title='Note'}
- If you use IAM user credentials, make sure that you have the necessary permissions to perform the cluster operations. For more information, see Controlling access to IAM users in the Amazon Redshift Cluster Management Guide.
- When you don't have any clusters in an AWS Region and you open the Clusters page, you have the option to launch a cluster.
- When you have at least one cluster in an AWS Region, the Clusters section displays a subset of information about all the clusters for the account in that AWS Region. :::
- On the Cluster Details page, specify values for the following options, and then click Continue.
- Cluster Identifier: Enter a unique name for your cluster.
- Database Name: Enter the name if you want to create a database with a custom name. This is an optional field.
- Database Port: Provide a port number through which you plan to connect from client applications to the database.
- Master User Name: Enter an account name for the master user of the database.
- Master User Password and Confirm Password: Enter the password for the master user account, and then retype it to confirm the password.
- On the Node Configuration page, specify values for the following options, and then click Continue.
- Node Type: Choose a node type from the drop-down menu. When you do, the page displays information that corresponds to the selected node type, such as CPU, Memory, Storage, or I/O Performance.
- Cluster Type: Choose a cluster type from the drop-down menu. When you do, the page displays information that corresponds to the selected cluster type, such as Maximum or Minimum.
- On the Additional Configuration page, specify values for the following optional settings, and then click Continue:
- Cluster Parameter Group: Choose a cluster parameter group to associate with the cluster. If you don't choose one, the cluster uses the default parameter group.
- Encrypt database: Choose whether you want to encrypt all data within the cluster and its snapshots. If you leave the default setting as None, encryption is not enabled. If you want to enable encryption, choose whether you want to use AWS Key Management Service (AWS KMS) or a hardware security module (HSM), and then configure the related settings.
- To configure networking options, choose whether to launch your cluster in a virtual private cloud (VPC) or outside a VPC:
- Choose a VPC: To launch your cluster in a virtual private cloud (VPC), choose the VPC you want to use. You must have at least one Amazon Redshift subnet group set up to use VPCs.
- Cluster Subnet Group: Select the Amazon Redshift subnet group in which to launch the cluster. This option is available only for clusters in a VPC.
- Publicly Accessible: Choose Yes to enable connections to the cluster from outside of the VPC in which you launch the cluster. Choose No if you want to limit connections to the cluster from only within the VPC.
- Choose a Public IP Address: If you set Publicly Accessible to Yes, choose No here to have Amazon Redshift provide an Elastic IP (EIP) for the cluster. Alternatively, choose Yes if you want to use an EIP that you have created and manage. If you have Amazon Redshift create the EIP, it is managed by Amazon Redshift.
- Elastic IP: Select the EIP that you want to use to connect to the cluster from outside of the VPC.
- Availability Zones: Choose No Preference to have Amazon Redshift choose the Availability Zone that the cluster is created in. Otherwise, choose a specific Availability Zone.
- Enhanced VPC Routing: Choose Yes to enable enhanced VPC routing. Enhanced VPC routing might require some additional configuration.
- To optionally associate your cluster with one or more security groups, specify values for the following options:
- VPC Security Groups: Use the drop-down menu to choose one or more VPC security groups for the cluster. By default, the chosen security group is the default VPC security group. This option is only available if you launch your cluster in the EC2-VPC platform.
- Cluster Security Groups: Use the drop-down menu to choose an Amazon Redshift security group or groups for the cluster. By default, the chosen security group is the default security group.
- To optionally create a basic alarm for this cluster, configure the following options, and click Continue:
- Create CloudWatch Alarm: Choose Yes if you want to create an alarm that monitors the disk usage of your cluster, and then specify values for the corresponding options. Choose No if you don't want to create an alarm.
- Disk Usage Threshold: Choose the percentage of average disk usage at or above which the alarm should trigger.
- Use Existing Topic: Choose No if you want to create a new Amazon Simple Notification Service (Amazon SNS) topic for this alarm. In the Topic box, edit the default name if necessary. For Recipients, type the email addresses for any recipients who should receive the notification when the alarm triggers. Choose Yes if you want to choose an existing Amazon SNS topic for this alarm, and then in the topic list, choose the topic that you want to use.
- To optionally select your maintenance track for this cluster, choose Current or Trailing:
- If you choose Current, your cluster is updated with the latest approved release during your maintenance window. If you choose Trailing, your cluster is updated with the release that was approved earlier.
- On the Review page, review the details of the cluster. Click Launch Cluster to start the creation process. Otherwise, choose Back to make any necessary changes to the cluster properties, and click Continue to return to the Review page. Once you have initiated the launch process, a summary of the options you chose during this process will be displayed.
:::info{title='Note'} Some cluster properties, such as the values for Database Port and Master User Name, can't be modified later. If you need to change them, click Back to make modifications. :::
- After you have initiated the creation process, click Close. The cluster might take several minutes to launch.
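The console options above correspond to parameters of the Amazon Redshift `CreateCluster` API. The following is a minimal sketch of how a private, encrypted cluster request could be assembled for boto3, assuming the subnet group and security group already exist. The function name and all identifiers, credentials, and IDs are hypothetical placeholders, and the sketch builds the request without sending it.

```python
def cluster_params(identifier, master_user, master_password, subnet_group,
                   security_group_ids, node_type="dc2.large",
                   number_of_nodes=1, port=5439, db_name=None):
    """Build keyword arguments for a Redshift cluster that stays inside the VPC."""
    params = {
        "ClusterIdentifier": identifier,          # Cluster Details page
        "Port": port,                             # 5439 is the Redshift default port
        "MasterUsername": master_user,
        "MasterUserPassword": master_password,
        "NodeType": node_type,                    # Node Configuration page
        "ClusterType": "multi-node" if number_of_nodes > 1 else "single-node",
        "ClusterSubnetGroupName": subnet_group,   # networking options
        "VpcSecurityGroupIds": list(security_group_ids),
        "PubliclyAccessible": False,              # reachable only from within the VPC
        "Encrypted": True,                        # encryption at rest
        "MaintenanceTrackName": "current",        # or "trailing"
    }
    if number_of_nodes > 1:
        params["NumberOfNodes"] = number_of_nodes  # only valid for multi-node clusters
    if db_name:
        params["DBName"] = db_name                 # optional custom database name
    return params


# Placeholder values for illustration; never hard-code a real password.
params = cluster_params(
    identifier="matillion-dw",
    master_user="admin",
    master_password="replace-with-a-secret",
    subnet_group="matillion-private",
    security_group_ids=["sg-0123456789abcdef0"],
)
print(sorted(params))

# With boto3 installed and credentials configured, the dictionary could be passed as:
#   boto3.client("redshift").create_cluster(**params)
```

As noted in the steps above, the port and master user name can't be changed after launch, so it is worth reviewing the dictionary before sending the request.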
Interface endpoints
Matillion ETL requires S3 access to load data into Redshift. To grant your private VPC access to your S3 buckets, you need to create an interface endpoint. When you create it, you must specify the VPC in which to create the endpoint and the service to connect to. For further information, refer to VPC Endpoints.
- Open the Amazon VPC console. Choose Endpoints from the left-side navigation menu.
- For the Service category option, select AWS services.
- For the Service Name, choose the service you want to establish a connection with. Make sure Interface is selected for the Type.
- Complete the following information, and then choose Create endpoint.
- For VPC, select a VPC in which to create the endpoint.
- For Subnets, select the subnets (Availability Zones) in which to create the endpoint network interfaces. Not all Availability Zones may be supported for all AWS services.
- To enable private DNS for the interface endpoint, select the Enable Private DNS Name checkbox. This option is enabled by default.
- For Security group, select the security groups to associate with the endpoint network interfaces.
- (Optional) Add or remove a tag. Choose Add tag, and do the following:
- Key: Enter the key name.
- Value: Enter the tag value.
- To remove a tag, click the delete button ("x") to the right of the tag's Key and Value.
:::info{title='Note'} An interface endpoint creates endpoint network interfaces in the subnets you selected. If you use a gateway endpoint for S3 instead, creating it adds an entry to the VPC route table to route traffic to S3. :::
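The console steps above map onto the Amazon EC2 `CreateVpcEndpoint` API. The sketch below builds the corresponding request for boto3 without sending it. The function name and all IDs are hypothetical, the `com.amazonaws.<region>.<service>` service name pattern is assumed for the service you choose, and whether private DNS can be enabled depends on the service and on your VPC configuration.

```python
def interface_endpoint_params(vpc_id, region, service, subnet_ids,
                              security_group_ids, private_dns=True, tags=None):
    """Build keyword arguments for an interface VPC endpoint request."""
    params = {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.{service}",
        "SubnetIds": list(subnet_ids),            # one endpoint network interface per subnet
        "SecurityGroupIds": list(security_group_ids),
        "PrivateDnsEnabled": private_dns,
    }
    if tags:
        # Optional tags, expressed as the Key/Value pairs described in the steps above.
        params["TagSpecifications"] = [{
            "ResourceType": "vpc-endpoint",
            "Tags": [{"Key": key, "Value": value} for key, value in tags.items()],
        }]
    return params


# Hypothetical IDs, for illustration only.
params = interface_endpoint_params(
    vpc_id="vpc-0123456789abcdef0",
    region="eu-west-2",
    service="s3",
    subnet_ids=["subnet-0aaa1111", "subnet-0bbb2222"],
    security_group_ids=["sg-0123456789abcdef0"],
    tags={"Name": "matillion-s3-endpoint"},
)
print(params["ServiceName"])

# With boto3 installed and credentials configured, the dictionary could be passed as:
#   boto3.client("ec2", region_name="eu-west-2").create_vpc_endpoint(**params)
```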
Further considerations
Access to Matillion ETL should now work, and data can be loaded and transformed in your Redshift cluster. However, please note:
- The S3 bucket that your Matillion ETL instance uses for loading data must be in the same region as the Redshift cluster.
- Your Matillion ETL instance relies on internet access to connect to many services. Without a route to the internet, many components, such as the Facebook or Google Analytics Query components, may not be able to connect to their target APIs. Several key functions of Matillion ETL, such as updating and migration, may also fail.
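The same-region requirement can be checked up front once you know the region of each staging bucket. The following is a small sketch with hypothetical function names and bucket names. It assumes the bucket regions were looked up separately, for example from the S3 `GetBucketLocation` API, which reports no location constraint for buckets in us-east-1.

```python
def normalise_bucket_region(location_constraint):
    """Map a bucket location constraint to a region name (None means us-east-1)."""
    return location_constraint or "us-east-1"


def usable_buckets(bucket_locations, cluster_region):
    """Return the buckets that are in the same region as the Redshift cluster."""
    return sorted(
        bucket
        for bucket, location in bucket_locations.items()
        if normalise_bucket_region(location) == cluster_region
    )


# Hypothetical bucket names and locations, for illustration only.
locations = {
    "matillion-staging-london": "eu-west-2",
    "matillion-staging-virginia": None,
    "matillion-staging-dublin": "eu-west-1",
}
print(usable_buckets(locations, "eu-west-2"))
```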
Example CloudFormation template
The CloudFormation template deploys a VPC in eu-west-2 with a pair of public and private subnets spread across two Availability Zones. It deploys an internet gateway, with a default route on the public subnets.
- VPC Network:
- VPC CIDR: The ID of the existing VPC. This must be the VPC that contains the subnets.
- Private subnet 1: An existing private subnet to launch secondary resources, e.g. PostgreSQL database.
- Private subnet 2: An existing private subnet to launch secondary resources, e.g. PostgreSQL database.
- Public subnet 1: An existing public subnet to launch the Matillion EC2 instances into.
- Public subnet 2: An existing public subnet to launch the Matillion EC2 instances into.
- Matillion EC2 instance and Application Load Balancer:
- Matillion EC2 instance type: The Amazon EC2 instance type for the Matillion instance. A larger instance type enables greater workload concurrency. For example: m4.large.
- Matillion ALB DNS prefix: The Application Load Balancer DNS name prefix.
- VPC endpoint for S3 with an allow-all policy.
- IAM Profile/Role with full privileges on all services.
- 1 Security group for the public subnet, all traffic allowed.
- 1 Security group for the ELB, all traffic allowed.
- 1 Security group for the private subnet, all traffic allowed.
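To give a feel for how the S3 endpoint item in the list above might look in a template, the sketch below generates a small CloudFormation fragment as JSON. It is not the downloadable template: the resource names, the CIDR range, and the choice of a gateway-type endpoint attached to a single route table are assumptions made for illustration, and the allow-all policy mirrors the permissive example settings listed above rather than a recommended production policy.

```python
import json

# Allow-all endpoint policy, matching the permissive example settings.
ALLOW_ALL_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Principal": "*", "Action": "*", "Resource": "*"}
    ],
}


def build_template(vpc_cidr="10.0.0.0/16"):
    """Return a minimal template with a VPC, a route table, and an S3 endpoint."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative fragment: VPC with an S3 VPC endpoint",
        "Resources": {
            "Vpc": {
                "Type": "AWS::EC2::VPC",
                "Properties": {
                    "CidrBlock": vpc_cidr,  # hypothetical address range
                    "EnableDnsSupport": True,
                    "EnableDnsHostnames": True,
                },
            },
            "PrivateRouteTable": {
                "Type": "AWS::EC2::RouteTable",
                "Properties": {"VpcId": {"Ref": "Vpc"}},
            },
            "S3Endpoint": {
                "Type": "AWS::EC2::VPCEndpoint",
                "Properties": {
                    "VpcEndpointType": "Gateway",  # assumption; see lead-in
                    "VpcId": {"Ref": "Vpc"},
                    "ServiceName": {"Fn::Sub": "com.amazonaws.${AWS::Region}.s3"},
                    "RouteTableIds": [{"Ref": "PrivateRouteTable"}],
                    "PolicyDocument": ALLOW_ALL_POLICY,
                },
            },
        },
    }


template_json = json.dumps(build_template(), indent=2)
print(template_json)
```

Generating the fragment from a function keeps the CIDR range configurable and makes it straightforward to add the subnets, security groups, and IAM resources from the list above.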
For the detailed configuration example, please download the CloudFormation Template below.