How to optimise Azure Data Lake Integration Using MuleSoft

  • Written By Alexander Oxford
  • 16/09/2022

Azure Data Lake Storage Gen 2 is a scalable data storage service from Microsoft Azure, designed for big data analytics. It is built on top of Azure Blob Storage and combines the data organisation and security semantics of Azure Data Lake Storage Gen 1 with the cost and reliability benefits of Azure Blob Storage.

The MuleSoft Anypoint Connector for Azure Data Lake Storage Gen 2 (Azure Data Lake Storage Connector) provides access to standard Azure Data Lake Storage Gen 2 operations using the Anypoint Platform.

In this blog about Azure Data Lake integration, we’ll focus on uploading file data using the Azure Data Lake Storage Connector. The upload is divided into three steps:

  • Create the file,
  • Append (push) the data,
  • Flush the file.

These operations are based on the Azure Data Lake Gen 2 API.
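
For orientation, the sketch below shows roughly what those three calls look like when issued directly against the Gen 2 REST endpoint, expressed with Mule's generic HTTP Request connector rather than the dedicated connector. The requester configuration name, file system ("gold"), folder and file name are illustrative assumptions, and OAuth bearer-token handling is omitted.

<!-- Illustrative only: the three upload steps against the ADLS Gen 2 REST API
     (https://<account>.dfs.core.windows.net). Config name, file system and
     path are assumed examples; authentication is omitted here because the
     dedicated connector handles it for you. -->

<!-- 1. Create the file -->
<http:request method="PUT" config-ref="ADLS_HTTP_Config" path="/gold/landing/sample.csv">
    <http:headers><![CDATA[#[{'x-ms-version': '2018-11-09'}]]]></http:headers>
    <http:query-params><![CDATA[#[{resource: 'file'}]]]></http:query-params>
</http:request>

<!-- 2. Append (push) the data: the current payload is sent as the request body,
        starting at byte position 0 -->
<http:request method="PATCH" config-ref="ADLS_HTTP_Config" path="/gold/landing/sample.csv">
    <http:headers><![CDATA[#[{'x-ms-version': '2018-11-09', 'Content-Type': 'application/octet-stream'}]]]></http:headers>
    <http:query-params><![CDATA[#[{action: 'append', position: '0'}]]]></http:query-params>
</http:request>

<!-- 3. Flush (commit) everything appended so far: position equals the total
        number of bytes appended, and the request carries no body -->
<http:request method="PATCH" config-ref="ADLS_HTTP_Config" path="/gold/landing/sample.csv">
    <http:headers><![CDATA[#[{'x-ms-version': '2018-11-09'}]]]></http:headers>
    <http:query-params><![CDATA[#[{action: 'flush', position: '123456'}]]]></http:query-params>
</http:request>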

Getting started

To use this connector, you must be familiar with:

  • Azure Data Lake Storage Gen 2 API,
  • Anypoint Connectors,
  • Mule runtime engine (Mule),
  • Elements and global elements in a Mule flow,
  • How to create a Mule app using Anypoint Studio,
  • MuleSoft Azure Data Lake Connector version 1.0.2 (the latest version at the time of writing).

Before creating an app, you must have access to:

  • The Azure Data Lake Storage target resource,
  • Anypoint Platform,
  • Anypoint Studio version 7.1 or later.

Global configuration for Azure Connector:

The connector’s global configuration is based on OAuth credentials provided by the Azure platform, and requests use API version x-ms-version: 2018-11-09.

Reach out to your Azure team to get the details needed for connectivity.
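
As a rough sketch, the global element can look like the following in XML. The namespace, element and attribute names here are assumptions, so copy the exact XML that Anypoint Studio generates for your version of the connector.

<!-- Sketch only: namespace, element and attribute names are assumed and may
     differ from the XML Anypoint Studio generates for connector 1.0.2 -->
<azure-data-lake-storage:config name="Azure_Data_Lake_Storage_Config">
    <azure-data-lake-storage:oauth-connection
        accountName="${azure.account.name}"
        clientId="${azure.client.id}"
        clientSecret="${azure.client.secret}"
        tenantId="${azure.tenant.id}" />
</azure-data-lake-storage:config>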

Dependency Snippet for POM:
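
If you manage the project’s pom.xml by hand, the dependency entry typically looks like the snippet below. The groupId and artifactId shown here are assumptions; confirm the exact coordinates on the connector’s Anypoint Exchange asset page (adding the connector from the Studio palette or Exchange fills them in automatically).

<!-- Assumed Maven coordinates; verify against the connector's Anypoint Exchange page -->
<dependency>
    <groupId>com.mulesoft.connectors</groupId>
    <artifactId>mule-azure-data-lake-storage-connector</artifactId>
    <version>1.0.2</version>
    <classifier>mule-plugin</classifier>
</dependency>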

Example workflow:

The workflow below receives records (about 3,000 in this example) from the source and writes the same data as a CSV file to Azure Data Lake Storage.

The following are the steps for this workflow:

  1. Logger Start – log the incoming request.
  2. Flow Reference to Set Vars – set the prerequisite variables for the Azure operations.
  3. Create Path – create a file with the filename from the flow variable.
  4. Append File – push the data to be uploaded into the file created in the previous step.
  5. Flush File – write (commit) the data appended in step 4 to the file.
  6. Delete File on Error (optional) – delete the created path if an error occurs during the append/flush operations.
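
Putting the steps together, the skeleton of the flow looks roughly like this. The flow and sub-flow names are illustrative assumptions; the connector operations inside them are detailed in steps 3 to 6 below.

<!-- Skeleton only: flow and sub-flow names are illustrative assumptions -->
<flow name="upload-records-to-adls">
    <!-- Step 1: log the incoming request -->
    <logger level="INFO" message="Received records for upload to Azure Data Lake" />

    <!-- Step 2: set the prerequisite variables -->
    <flow-ref name="set-azure-vars" />

    <!-- Step 3: create the target file (Create path operation) -->
    <flow-ref name="create-adls-path" />

    <try>
        <!-- Steps 4 and 5: append the data, then flush it (Update path operation) -->
        <flow-ref name="append-and-flush-to-adls" />
        <error-handler>
            <on-error-propagate type="ANY">
                <!-- Step 6 (optional): remove the empty file created in step 3 -->
                <flow-ref name="delete-created-path" />
            </on-error-propagate>
        </error-handler>
    </try>
</flow>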

Step 1: Log the incoming request

Step 2: Sub-flow to set flowVars

Note: 

The tempPayload variable is created with the DataWeave raw selector (payload.^raw), output in Java (binary) form. This is required to count the total number of bytes; counting against a JSON output errors out.

Also note that AzureFileSize is set with a DataWeave expression that counts the bytes of tempPayload. AzureFileName and AzureFolderName can be set as per the business requirements.
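
A minimal sketch of this sub-flow, following the payload.^raw approach described above; the sub-flow name, folder name and file-naming expression are illustrative assumptions.

<sub-flow name="set-azure-vars">
    <!-- Capture the raw bytes of the current payload in Java (binary) form so the
         byte count can be taken; counting against JSON output errors out -->
    <ee:transform>
        <ee:variables>
            <ee:set-variable variableName="tempPayload"><![CDATA[%dw 2.0
output application/java
---
payload.^raw]]></ee:set-variable>
        </ee:variables>
    </ee:transform>

    <!-- Total number of bytes to upload; reused later as the flush position -->
    <set-variable variableName="AzureFileSize" value="#[sizeOf(vars.tempPayload)]" />

    <!-- Illustrative names only; set these per your business requirements -->
    <set-variable variableName="AzureFolderName" value="landing" />
    <set-variable variableName="AzureFileName"
                  value="#['records-' ++ (now() as String {format: 'yyyyMMddHHmmss'}) ++ '.csv']" />
</sub-flow>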

Step 3: Create a path

This operation sets the resource to “file” and passes a fully qualified name, built per the requirements, as the “path”.
The file system is the top-level container in the Azure Data Lake, for example “gold”.
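
A sketch of the Create path call; the XML element and attribute names here are assumptions derived from the operation names above, so take the exact XML from the configuration Anypoint Studio generates.

<!-- Assumed element and attribute names for the "Create path" operation -->
<azure-data-lake-storage:create-path
    config-ref="Azure_Data_Lake_Storage_Config"
    fileSystem="gold"
    resource="file"
    path="#[vars.AzureFolderName ++ '/' ++ vars.AzureFileName]" />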

Step 4: Append file

This operation pushes the data that needs to be written into the file in Azure Data Lake. An append is currently limited to 4000 MB per request. The component used is “Update path”, with the operation set to “append”.

Please note that this operation relies on the “AzureFileSize” variable set in step 2; the upload will fail if it is not set as described above. Also note that the “tempPayload” created in step 2 must be passed as “application/octet-stream”, because the Azure API accepts appended data in octet-stream format only.
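
A sketch of the append step, again with assumed element and attribute names. The key points are that the body comes from tempPayload as application/octet-stream and that the append starts at byte position 0.

<!-- Set the body to the raw bytes prepared in step 2 -->
<set-payload value="#[vars.tempPayload]" mimeType="application/octet-stream" />

<!-- Assumed element and attribute names for the "Update path" operation (append) -->
<azure-data-lake-storage:update-path
    config-ref="Azure_Data_Lake_Storage_Config"
    fileSystem="gold"
    path="#[vars.AzureFolderName ++ '/' ++ vars.AzureFileName]"
    action="append"
    position="0" />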

Note:

For larger file sizes, streaming or batching records is required.

Step 5: Flush the file

This operation is the next step after the append operation. The component used here is again “Update path”, and the operation is flush.
The “AzureFileName” and “AzureFolderName” values should be the same as in the previous step, and “AzureFileSize” is the value set in step 2 (it becomes the flush position).
Content-Length should be 0, because the flush request carries no body; it only commits the data already appended.
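
A sketch of the flush step, with the same caveat on element and attribute names; the position is the AzureFileSize value computed in step 2.

<!-- Assumed element and attribute names for the "Update path" operation (flush).
     No body is sent; position is the total number of bytes appended. -->
<azure-data-lake-storage:update-path
    config-ref="Azure_Data_Lake_Storage_Config"
    fileSystem="gold"
    path="#[vars.AzureFolderName ++ '/' ++ vars.AzureFileName]"
    action="flush"
    position="#[vars.AzureFileSize]" />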

Step 6: Delete file (optional step)

The append and flush operations are wrapped inside a Try scope, and if an error occurs, the redundant file created in step 3 is removed. Because the error happened during the append/flush, no data was written to the file, so the delete operation, using the same path and filename taken from the flow variables, simply removes the empty file. Please note this step is optional and depends on your use case.
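
A sketch of the Try scope and error handler. The Try, error-handler and on-error-propagate elements are standard Mule 4; the delete-path element name is an assumption based on the connector’s delete operation, and append-and-flush-to-adls is a hypothetical sub-flow wrapping steps 4 and 5.

<try>
    <!-- Steps 4 and 5: the append and flush operations shown above -->
    <flow-ref name="append-and-flush-to-adls" />
    <error-handler>
        <on-error-propagate type="ANY">
            <!-- Remove the empty file created in step 3 before re-raising the error -->
            <azure-data-lake-storage:delete-path
                config-ref="Azure_Data_Lake_Storage_Config"
                fileSystem="gold"
                path="#[vars.AzureFolderName ++ '/' ++ vars.AzureFileName]" />
        </on-error-propagate>
    </error-handler>
</try>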


If you’d like to leverage Azure Data Lakes and MuleSoft, give us a call or email us at salesforce@coforge.com.

Other useful links:

Best practices to implement Salesforce right the first time.

No need to search: technology, trends and insights for 2022.

Managing legacy systems (upgrade, replace, rebuild).
