Configuring Matillion ETL to use a Proxy
This topic explains how to configure Matillion ETL to use a proxy server. This guide assumes that your proxy is already configured and reachable from the Matillion ETL instance, and that any ports used for communication are open in the relevant security groups.
Besides Matillion ETL, there are other applications that do not depend on Tomcat yet use the proxy server configuration of the underlying Linux operating system. Examples of these include: any system processes/services that run in the background; AWS CLI; Matillion ETL's Bash component; or Python scripts run using the Python 2/3 interpreters in Matillion ETL.
Proxy servers are commonly deployed to provide an additional layer of security, or to act as an intermediary between your servers and the internet. Depending on the scenario, proxy servers may help with URL and web content filtering, IDS/IPS, data loss prevention, monitoring, and advanced threat protection.
Matillion ETL is a Java application hosted on an Apache Tomcat application server. To configure Tomcat to use a proxy server for HTTP/HTTPS communication (which Matillion ETL inherits), follow the instructions below.
Note
Take note of the following information:
- Matillion ETL supports lowercase variable names (for example, https_proxy rather than HTTPS_PROXY) from the server environment and the Emerald.properties file, with the lowercase form taking precedence if the uppercase form is also present.
- To ignore the proxy, both no_proxy (or http.noProxy as a -D arg) and nonProxyHosts must be configured because they have different formats. no_proxy is comma-separated, supports CIDR blocks, and passes through to scripts and other CLI tools; nonProxyHosts, by comparison, is pipe-separated hosts, and is used for JVM-native connections and any libraries that rely on the JVM to connect.
- If nonProxyHosts is not configured, a single, best-effort attempt will be made to convert any no_proxy configuration to nonProxyHosts by replacing commas with pipes.
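The comma-to-pipe conversion described above can be pictured with a short Python sketch. This is an illustration only and not Matillion ETL's own implementation: the function name no_proxy_to_non_proxy_hosts is hypothetical, and the whitespace trimming is an added assumption. CIDR entries pass through unchanged and may not be understood as host patterns, which is one reason to configure both settings explicitly.

```python
def no_proxy_to_non_proxy_hosts(no_proxy: str) -> str:
    """Convert a comma-separated no_proxy value into pipe-separated form."""
    # Split on commas, trim surrounding whitespace (an assumption of this
    # sketch), and drop any empty entries left by stray commas.
    entries = [entry.strip() for entry in no_proxy.split(",")]
    return "|".join(entry for entry in entries if entry)


# The CIDR block is carried over as-is; nothing here expands or validates it.
print(no_proxy_to_non_proxy_hosts("169.254.169.254, .amazonaws.com,10.0.0.0/8"))
```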
Note
Refer to JDBC components for specific connection options that need to be set to use the proxy.
Configuring Matillion ETL to use a proxy
The first (and usually the only) change required is to add the proxy definitions to JAVA_OPTS in your /etc/sysconfig/tomcat file, and then restart Tomcat.
Default JAVA_OPTS:
JAVA_OPTS='-Djavax.net.ssl.trustStore=/usr/lib/jvm/jre/lib/security/cacerts -Djavax.net.ssl.trustStorePassword=changeit -Djava.security.egd=file:/dev/./urandom -XX:+UseG1GC -XX:OnOutOfMemoryError=/usr/share/emerald/WEB-INF/classes/scripts/oom.sh'
Additions required:
HTTPS:
-Dhttps.proxyHost=yourProxyAddress
HTTP:
-Dhttp.proxyHost=yourProxyAddress
Non-default proxy port
If your proxy port is not the default (3128), then you can specify that as well:
HTTPS:
-Dhttps.proxyPort=3128
HTTP:
-Dhttp.proxyPort=3128
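As a rough aid for assembling these additions, the following Python sketch builds the flags to append to JAVA_OPTS. The helper name proxy_java_opts and the port 8080 in the usage lines are hypothetical; the flag names and the default port of 3128 are taken from this guide.

```python
from typing import Optional

DEFAULT_PROXY_PORT = 3128  # default port as stated in this guide


def proxy_java_opts(host: str, port: Optional[int] = None) -> str:
    """Return the -D flags to append to JAVA_OPTS for an HTTP/HTTPS proxy."""
    flags = [f"-Dhttps.proxyHost={host}", f"-Dhttp.proxyHost={host}"]
    # Port flags are only emitted when the port differs from the default.
    if port is not None and port != DEFAULT_PROXY_PORT:
        flags += [f"-Dhttps.proxyPort={port}", f"-Dhttp.proxyPort={port}"]
    return " ".join(flags)


print(proxy_java_opts("yourProxyAddress"))
print(proxy_java_opts("yourProxyAddress", 8080))  # 8080 is a made-up example port
```

The returned string is meant to be pasted inside the existing JAVA_OPTS quotes; the sketch does not edit /etc/sysconfig/tomcat or restart Tomcat.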
Setting different options
It is possible to set different options in the /usr/share/emerald/WEB-INF/classes/Emerald.properties file, for example:
HTTPS:
HTTPS_PROXY=http://yourProxyAddress:3128/
HTTP:
HTTP_PROXY=http://yourProxyAddress:3128/
This is necessary if your proxy only uses HTTP as the scheme.
Note
The HTTP scheme is usually required for both HTTP and HTTPS proxies. The configuration can be overridden for the items noted.
Users can set the properties to the same value in both file locations.
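To make the URL form of these properties concrete, here is a minimal Python sketch that picks a proxy setting and splits it into host and port. It reads the earlier precedence note as applying to lowercase property names, which is an interpretation rather than a confirmed detail. The function resolve_proxy, the plain dictionary standing in for the properties, and the example host names are all hypothetical, and this is not how Matillion ETL itself reads the file.

```python
from typing import Mapping, Optional, Tuple
from urllib.parse import urlsplit


def resolve_proxy(settings: Mapping[str, str], scheme: str) -> Optional[Tuple[str, int]]:
    """Return (host, port) for scheme 'http' or 'https', or None if unset."""
    name = f"{scheme}_proxy"
    # Assumption: the lowercase name wins when both spellings are present.
    value = settings.get(name.lower()) or settings.get(name.upper())
    if not value:
        return None
    parts = urlsplit(value)
    # The value must include a scheme (http://...), otherwise no host is found.
    if parts.hostname is None:
        return None
    # Fall back to 3128, the default port stated in this guide.
    return parts.hostname, parts.port or 3128


settings = {
    "HTTPS_PROXY": "http://upper.example:3128/",
    "https_proxy": "http://lower.example:8080/",
}
print(resolve_proxy(settings, "https"))
```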
Items configured automatically
The following items are configured automatically:
- Using KMS with external authentication (via sysconfig only)
- Git integration (via sysconfig only)
- Matillion Billing Hub
- OpenID authentication (via sysconfig only)
- External authentication
- CDC tasks
- Matillion update (can be overridden)
- API query / API extract (can be overridden)
- OAuth, MindSphere, Pardot connections (can be overridden)
- Bash and Python components (can be overridden)
The following components require additional steps:
- Data transfer components.
- Data staging components (for example, Salesforce Query).
Configuring Matillion ETL to ignore the proxy
If you want certain traffic to bypass the configured proxy, add the following setting to your /etc/sysconfig/tomcat file, and then restart Tomcat.
In the JAVA_OPTS section:
-Dhttp.nonProxyHosts=<targetAddress>
For example:
-Dhttp.nonProxyHosts=169.254.169.254
Wildcards can be included using the * symbol. Pipes can be used to separate multiple items. For example:
-Dhttp.nonProxyHosts=*.snowflakecomputing.com\\|169.254.169.254\\|*.amazonaws.com
Below is an example of the JAVA_OPTS entry using a proxy configuration to include the above non-proxy options:
JAVA_OPTS="-Djavax.net.ssl.trustStore=/usr/lib/jvm/jre/lib/security/cacerts -Djavax.net.ssl.trustStorePassword=changeit -Djava.security.egd=file:/dev/./urandom -Dcom.redhat.fips=true -XX:+UseG1GC -XX:OnOutOfMemoryError=/usr/share/emerald/WEB-INF/classes/scripts/oom.sh -Dhttps.proxyHost=<proxyhostname/ip> -Dhttp.proxyHost=<proxyhostname/ip> -Dhttps.proxyPort=<proxyport> -Dhttp.proxyPort=<proxyport> -Dhttp.nonProxyHosts=*.snowflakecomputing.com\\|169.254.169.254\\|*.amazonaws.com"
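Before restarting Tomcat, it can help to check which hosts a nonProxyHosts value would cover. The Python sketch below approximates the wildcard matching with the standard fnmatch module. It is not the JVM's own matcher (fnmatch also treats ? and [ ] as special characters), the function name bypasses_proxy is hypothetical, and the pattern string is written without the backslash escaping used in the sysconfig file.

```python
from fnmatch import fnmatchcase


def bypasses_proxy(host: str, non_proxy_hosts: str) -> bool:
    """Return True if host matches any pipe-separated pattern."""
    # Split on pipes and compare case-insensitively by lowercasing both sides.
    patterns = [p.strip().lower() for p in non_proxy_hosts.split("|") if p.strip()]
    return any(fnmatchcase(host.lower(), pattern) for pattern in patterns)


rule = "*.snowflakecomputing.com|169.254.169.254|*.amazonaws.com"
print(bypasses_proxy("s3.eu-west-1.amazonaws.com", rule))  # matched by *.amazonaws.com
print(bypasses_proxy("example.com", rule))                 # no pattern matches
```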
Data transfer components
Matillion ETL data transfer components communicate directly, rather than over HTTP/HTTPS.
Type | Ports |
---|---|
FTP | 21 |
SSH/SFTP | 22 |
LDAP (including external authentication) | 389 |
MS SQL Server | 1433 |
Oracle RDS | 1521 |
Squid | 3128/3130 |
PostgreSQL | 5432 |
Redshift | 5439 |
Note
FTP servers will normally try to communicate on a high-numbered port, and so you may need to allow the proxy to be bypassed for all connections to the FTP server IP address.
JDBC components
Generally, the JDBC drivers are configured using the options in the table below. These options can be set in the Connection Options property of any of the JDBC driver components such as Facebook Query.
Property | Description |
---|---|
ProxyServer | Host name/IP of your proxy server. This connection option is required. |
ProxyPort | The TCP port of your proxy (if not 3128). |
ProxyAutoDetect | Sets your configuration from the sysconfig settings. |
ProxyAuthScheme | BASIC, DIGEST, NONE, NEGOTIATE, NTLM, PROPRIETARY. |
ProxyUser | Your username. |
ProxyPassword | Your password. |
ProxySSLType | AUTO, ALWAYS, NEVER, TUNNEL. |
ProxyException | Semicolon-separated list of destinations that are not connected via the proxy. |
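The options in the table can be collected and sanity-checked before being typed into the Connection Options property. The following Python sketch is one way to do that. The function proxy_connection_options and the port 8080 in the usage line are hypothetical, the allowed values are copied from the table, and the sketch only builds a name/value mapping; it does not talk to Matillion ETL or to any driver.

```python
from typing import Dict, Optional

# Allowed values copied from the table above.
AUTH_SCHEMES = {"BASIC", "DIGEST", "NONE", "NEGOTIATE", "NTLM", "PROPRIETARY"}
SSL_TYPES = {"AUTO", "ALWAYS", "NEVER", "TUNNEL"}


def proxy_connection_options(
    server: str,
    port: int = 3128,
    auth_scheme: Optional[str] = None,
    user: Optional[str] = None,
    password: Optional[str] = None,
    ssl_type: Optional[str] = None,
) -> Dict[str, str]:
    """Build a name/value mapping of proxy connection options."""
    if not server:
        raise ValueError("ProxyServer is required")
    if auth_scheme is not None and auth_scheme not in AUTH_SCHEMES:
        raise ValueError(f"Unknown ProxyAuthScheme: {auth_scheme}")
    if ssl_type is not None and ssl_type not in SSL_TYPES:
        raise ValueError(f"Unknown ProxySSLType: {ssl_type}")

    options = {"ProxyServer": server}
    # ProxyPort is only needed when the proxy does not listen on 3128.
    if port != 3128:
        options["ProxyPort"] = str(port)
    optional = {
        "ProxyAuthScheme": auth_scheme,
        "ProxyUser": user,
        "ProxyPassword": password,
        "ProxySSLType": ssl_type,
    }
    # Keep only the optional settings that were actually supplied.
    options.update({k: v for k, v in optional.items() if v is not None})
    return options


print(proxy_connection_options("yourProxyAddress", port=8080, ssl_type="TUNNEL"))
```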
Some Query components (Cassandra, MongoDB, Redis) do not support these connection options, but can be configured using the "firewall" options instead:
Property | Description |
---|---|
FirewallServer | Host name/IP of your proxy server. This is required. |
FirewallPort | The TCP port of your proxy. This is required. |
Environment connections
Users on Snowflake and Redshift need to manually add connection options in their environment, in the same way as for JDBC driver components. BigQuery users do not need to add connection options for the proxy, as the connection is made automatically over HTTPS. To add connection properties:
- Click the Environments panel in the lower-left of the UI.
- Right-click an environment.
- Click Edit Environment.
- Click Next to navigate to the second Connection page of the Edit Environment dialog.
- Click Manage alongside Advanced Connection Settings to open the JDBC Connection Attributes window.
- Use the + button to add a new parameter entry. Select the required Parameter from the drop-down menu and enter the respective Value. You can select the Text Mode checkbox to enter parameters in plain text instead.
- Add additional parameters as required.
Refer to the following guides for more information regarding JDBC parameters for Snowflake and Redshift:
- Configuring the JDBC Driver for Snowflake.
- Options for JDBC driver version 2.1 configuration for Redshift.