Recently, Pentaho delivered Pentaho Business Analytics 7.1. The team has been working extremely hard on this release and in our eyes, it is much more than just a ‘dot’ release. Highlights of this release include: adaptive execution on any engine for big data processing, starting with Spark; expanded cloud integration with Microsoft Azure; enterprise-level security for Hortonworks, and improved in-line visualizations.
One of the areas that I am most excited about is our new support for Microsoft Azure’s HDInsight and Azure SQL. Building on the flexibility and openness of the Pentaho platform, Pentaho 7.1 removes data latency by allowing customers to both access data sources in the cloud and on-premises and then process data both in the cloud and on-premises in a hybrid fashion. Stemming from customer demand, in Pentaho 7.1 we are now supporting Azure to give customers more choice in terms of which public cloud vendor to use.
The following is a tutorial for connecting Pentaho with Microsoft Azure. I have highlighted the integration in two parts:
- Part 1 – Connecting Pentaho with Microsoft Azure HDInsight
- Part 2 – Connecting Pentaho to Microsoft Azure SQL and MS SQL Server/Azure VM
Part 1 – Connecting Pentaho with Microsoft Azure HDInsight
Starting with Pentaho 7.1, connectivity with Microsoft Azure HDInsight (HDI) is applicable as Pentaho Data Integration (PDI) extends its Public Cloud Hadoop distribution capabilities. Pentaho will work with an HDI 3.5 cluster configured with Linux nodes, and with the following cluster types: Hadoop, Hbase, and Spark, via the use of PDI steps and with Pentaho reporting and analytics.
PDI - Azure HDI 3.5 Hadoop Distribution
Connecting to HDI can be done in a variety of ways, thus offering a customer flexibility; with the first being a complete Public Cloud configuration and installation of Pentaho into an Azure VM, and the second being a Hybrid Cloud configuration where on premise (Private Cloud) is integrated with Public Cloud resources via a VPN type connection.
An all Public Cloud configuration utilizes the Azure Virtual Network service to create a dedicated virtual network (VNet) that contains Pentaho with Azure resources to create a secure environment. A VNet is a representation of a customer’s network in the cloud. A VNet is a logical isolation of the Azure cloud dedicated to a particular customer and will allow the customer to connect VNets to an on-premises network, thus creating a Hybrid Cloud implementation.
Public Cloud - Pentaho in VM in Azure and Azure Portal – Pentaho running in Azure VM with HDI Cluster
A hybrid Cloud implementation allows Pentaho to run both in Private and Public clouds and enhance Pentaho’s locality to the data it is processing as close as possible, thus reducing network latency and increasing performance.
Implementation of Hybrid Cloud with Pentaho and Azure
To connect the Pentaho PDI GUI client to an HDI Cluster, you will need to copy the cluster configuration files, which can be accomplished using the Ambari UI, to the hdi35 shim folder.
Ambari Web UI – Download Client Config files and location to copy Config Files in hdi35 folder
Windows Azure Blob Storage (WASB) is the protocol Azure HDInsight utilizes with Azure Blob Storage (ABS). ABS is the primary storage for an HDInsight cluster type.
Utilizing information from the core-site.xml, a Hadoop Cluster configuration is created with selecting the Storage as WASB (Port, Username and Password are not required due to the use of storage account credentials defined in cluster configuration files and from Azure Portal). The running of the Hadoop Cluster Test completes all tests successfully.
PDI can utilize WASB to perform file management related tasks, i.e. copying and comparing files. The PDI Hadoop Copy Files job step utilizes the Pentaho Virtual File System (VFS) feature both via a VFC Dialogue and under the hoods in its implementation. From the VFS Dialogue, you are able to configure a WASB URL connection string displayed, which was formulated via the Hadoop Cluster configuration that was created.
VFS Dialogue – Showing WASB – Folder Browsing
PDI File Management related steps can also be used with WASB connectivity. One example is the File Compare step, which does have VFS functionality within its implementation, but does not offer a VFS Dialogue via the Browse button, thus by entering the full WASB path to the files you want to compare will fulfill the desired requirement.
PDI File Management Folder and Job Step – File Compare
Non-VFS Dialogue – Need to explicitly enter WASB path to files
Microsoft Azure SQL and MS SQL Server/Azure VM
Beyond the ability of Pentaho 7.1 to connect to Azure HDI, there is the supported capability for PDI to connect to Azure SQL (Cloud Service) and a Microsoft SQL Server instance installed and running within an Azure VM. One implementation is to have Pentaho installed into an Azure VM, which positons PDI locality close to the VM and Azure SQL, all possibly within the same Azure Virtual Network.
Azure Portal – Pentaho running in Azure VM with Azure SQL and a SQL Server 2014 instance in an Azure VM
Part 2 – Connecting Pentaho to Microsoft Azure SQL and MS SQL Server/Azure VM
Another implementation is to have an on premises PDI connect to the Azure SQL databases via the fully qualified server name of the Azure SQL installation. No VPN or Hybrid Cloud implementation is required.
Connecting to Azure SQL or SQL Server in an Azure VM is the same process and same JDBC jar file that is used for PDI to connect to a SQL Server 2014 instance on premises. You will use the same Connection Type which allows you to set similar port and security settings. Enter the fully qualified server name of the Azure SQL Instance which can be obtained from the Azure Portal. Port 1433 is a universal Microsoft SQL Server port, but verify with your installation. Security can be SQL Server standard or via Active Directory – Integrated Security. Perform a test of your connection.
PDI – Database Connection with Azure SQL
After performing your connection test to Azure SQL, a Table Input step can utilize the connection and perform a query with T-SQL similar to what is used for SQL Server installation, i.e. SQL Server 2014.
PDI – Table Input Step with Connection to Azure SQL
In Pentaho 7.1, the new Azure HDI 3.5 shim will offer extensibility of PDI within the Public Cloud space. In addition, using the same JDBC jar file for Microsoft SQL Server instances, i.e. SQL Server 2014, etc., PDI can connect to Azure SQL. Customers who have business activities within Microsoft Azure can benefit from these new capabilities in Pentaho 7.1 relating to Azure HDInsight, Azure SQL, and SQL Server in an Azure VM. I hope you found this best practice useful!