Using R with Matillion ETL for Redshift
Overview
Matillion ETL for Redshift runs as an EC2 Linux instance in your AWS VPC. You have full administrative access to the EC2 instance, and can install additional software—such as the statistical package R.
From within Matillion ETL you can use the Bash Script component to run operating system commands, which means you can make use of software you have installed yourself.
This guide can also apply to Matillion ETL for Snowflake on AWS.
Installing R
You'll need administrative privilege to perform the software installation, so first connect to your EC2 instance using SSH, and then issue the following commands:
sudo su -
yum install R
After successful installation, launch R as root, by entering the command R
. From the interactive interface, install the RJDBC package by issuing the following R command:
install.packages("RJDBC", dep=T)
After installing RJDBC, you can leave the interactive R shell by typing q()
and then n
.
Using R
Once R has been installed on the EC2 instance, it can be accessed from a Bash Script component using the Here Document syntax.
To run R commands, it looks like this:
cat <<EOF | R -q --vanilla
# Your commands here
EOF
Matillion ETL comes pre-installed with a JDBC driver that can communicate with Amazon Redshift, located in your installation here:
/usr/share/emerald/WEB-INF/lib/postgresql-42.3.4.jar
Use R's JDBC command to specify this driver:
jdrv <- JDBC(driverClass="org.postgresql.Driver", classPath="/usr/share/emerald/WEB-INF/lib/postgresql-9.4-1206-jdbc41.jar", identifier.quote="'")
Then set up environment variables for the Redshift JDBC endpoint, username, and password. You can copy and paste the JDBC URL from the AWS Redshift console. Note that in this case, the example uses the Postgres JDBC driver, so use jdbc:postgresql://
instead of jdbc:redshift://
in your URL.
Once a JDBC database connection has been established, you can continue to use the Bash Script component on your Matillion ETL instance to use R.