Integrating Databricks Asset Bundles into a CI/CD Pipeline on Azure

Alfeu Duran
Databricks Platform SME
7 min read · Jul 10, 2024

Check out the sample repository on GitHub: here

This blog is part of a series of three posts, each focusing on a different cloud provider. For the post focused on AWS, see here; the GCP edition is coming soon.

Introduction

Databricks Asset Bundles (DABs) are a new tool for streamlining the development of complex data, analytics, and machine learning (ML) projects on the Databricks platform. DABs allow a complete project to be expressed as a collection of source files called a bundle. These files provide an end-to-end definition of the project, including details on how it is tested and deployed.
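To make this concrete, here is a minimal sketch of what a bundle's databricks.yml might look like. The bundle name, job, notebook path, cluster settings, and workspace host are illustrative placeholders, not values taken from the sample repository.

```yaml
# Minimal illustrative databricks.yml (placeholder names and host)
bundle:
  name: my_sample_bundle

resources:
  jobs:
    sample_job:
      name: sample-job
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./notebooks/main_notebook.py
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 1

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-1111111111111111.11.azuredatabricks.net
```

The resources section maps directly onto Databricks objects such as jobs and pipelines, and each entry under targets describes a workspace the bundle can be deployed to.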

The primary purpose of DABs is to offer a structured approach to managing and deploying data engineering and ML projects. Compared to the Databricks Terraform Provider, DABs provide a more agile approach by integrating the entire project lifecycle, including code, tests, and deployment configurations, into a single, cohesive bundle. For example, you can use DABs to deploy the Databricks MLOps Stacks, providing an efficient and streamlined way to handle complex project workflows.

Databricks offers an example of running a DABs workflow using GitHub Actions. This blog will demonstrate how to use DABs to build a multi-workspace workflow using Azure DevOps.

The Need for DABs in Remote CI/CD Workflows

Databricks Asset Bundles (DABs) are essential for optimizing remote CI/CD workflows. They simplify asset management and deployment by providing a unified package that can be integrated into different environments, keeping deployments consistent and reducing complexity across the stages of a CI/CD pipeline. This leads to smoother deployments, better collaboration, and more reliable outcomes. Additionally, DABs support several authentication methods, including username and password, OAuth machine-to-machine (M2M), and OAuth user-to-machine (U2M), enabling secure authentication for automated integration and deployment.

There are several situations where authentication needs to occur within a remote CI/CD pipeline (a minimal authentication sketch follows the list):

  1. Orchestrating Workloads Across Multiple Workspaces: We need a CI/CD pipeline to orchestrate development, testing, and promotion workloads across multiple workspaces, with a single repository serving as the sole source of truth.
  2. Consistent Identity Usage: We want to run jobs or pipelines using specific identities, such as a user or service principal.
  3. Production and Restricted Environments: In production and restricted Databricks environments, we prefer not to grant individual users write permissions to most resources. Instead, users need to promote code to make changes in these restricted workspaces or environments.
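As a minimal sketch of what that looks like in practice, the Databricks CLI can authenticate as an Azure service principal purely through environment variables. The pipeline variables referenced below ($(DATABRICKS_HOST) and the $(ARM_*) values) are assumed to be defined in a variable group or as pipeline secrets; they are not taken from the sample repository.

```yaml
# Illustrative Azure Pipelines step: the Databricks CLI picks up these
# environment variables and authenticates as the service principal.
steps:
  - script: databricks bundle validate -t staging
    displayName: 'Validate the bundle as a service principal'
    env:
      DATABRICKS_HOST: $(DATABRICKS_HOST)      # workspace URL
      ARM_TENANT_ID: $(ARM_TENANT_ID)          # Microsoft Entra ID tenant
      ARM_CLIENT_ID: $(ARM_CLIENT_ID)          # service principal application ID
      ARM_CLIENT_SECRET: $(ARM_CLIENT_SECRET)  # service principal client secret
```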

High-level Architecture

Workflow Overview

  1. Local Development: Users employ Databricks bundle commands to deploy code to a development workspace, where they can debug notebook code and the DABs YAML template code. Optionally, users can associate the dev workspace with the remote repository using Databricks Git integration to push code or create pull requests.
  2. Merge to Release Branch: After development, the user creates a pull request to the release branch, which is automatically approved.
  3. Deploy to Staging: With the pull request in the release branch, artifacts are deployed to the staging workspace, and the code is executed.
  4. Unit Testing: Unit tests are run using the Nutter framework.
  5. Pull Request to Main Branch: Once the code and tests run successfully, a pull request to the main branch can be created.
  6. Deploy to Production: The deployment to the production environment is done using the bundle deploy command (a pipeline sketch of the staging and production stages follows this list).
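The condensed sketch below shows how this workflow can map onto a two-stage Azure DevOps pipeline. The job names, target names, test path, job key (sample_job, matching the earlier placeholder), and $(CLUSTER_ID) variable are assumptions for illustration, and the authentication environment variables from the earlier sketch are omitted for brevity.

```yaml
# Illustrative two-stage pipeline: OnRelease deploys to staging, runs the job,
# and executes Nutter tests; toProduction deploys the bundle to production.
stages:
  - stage: OnRelease
    jobs:
      - job: DeployAndTestStaging
        steps:
          - script: databricks bundle deploy -t staging
            displayName: 'Deploy bundle to the staging workspace'
          - script: databricks bundle run -t staging sample_job
            displayName: 'Run the deployed job'
          - script: |
              pip install nutter
              # Nutter expects Databricks API credentials in its environment (omitted here).
              nutter run /path/to/test_notebooks --cluster_id $(CLUSTER_ID)
            displayName: 'Run Nutter unit tests on a remote cluster'

  - stage: toProduction
    dependsOn: OnRelease
    jobs:
      - job: DeployProduction
        steps:
          - script: databricks bundle deploy -t prod
            displayName: 'Deploy bundle to the production workspace'
```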

Advantages of Using Azure Service Principals

Azure Service Principals offer enhanced security, secure authentication, and permission isolation, ideal for automation and integration with DevOps tools. They facilitate centralized identity management and credential renewal, promoting the principle of least privilege. Specifically for Azure Databricks, they enable secure and automated access, ensuring efficient and safe operations without constant manual intervention.

Deployment Process Overview

In this section, we will detail the deployment process, from project creation in DevOps to artifact deployment, with a focus on implementing the service principal.

1. Create Azure DevOps Project

a. First, go to Azure DevOps within your organization and create a new project.

b. Select the location where the code is hosted.

c. Select the repository and the initial pipeline configuration file (.yml).

d. Click on Save next to the Run button to create the initial pipeline. To adapt the existing structure, we will need to create the variables used in the pipeline and the service principal to be used in this example.
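A minimal skeleton for that initial pipeline file might look like the following. The trigger branch, agent image, and CLI install method are assumptions to adapt to your repository; the skeleton is later expanded into the OnRelease and toProduction stages.

```yaml
# Illustrative azure-pipelines.yml skeleton.
trigger:
  branches:
    include:
      - release

pool:
  vmImage: ubuntu-latest

steps:
  # The new Databricks CLI is required for the bundle commands.
  - script: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
    displayName: 'Install the Databricks CLI'
  - script: databricks --version
    displayName: 'Verify the CLI installation'
```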

2. Create Service Principal Service Connection in Azure DevOps

a. In Project Settings, go to the Service Connections section.

b. Click on New service connection at the top right. Select Azure Resource Manager.

c. Choose Service Principal (manual). If you do not have a Service Principal configured, follow this documentation.

d. Enter the information about the configured Service Principal.

e. Once configured and validated, proceed to the next steps.
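One way to consume this service connection from the pipeline, shown here as a sketch rather than the sample repository's exact approach, is the AzureCLI task with addSpnToEnvironment enabled, which exposes the service principal credentials to the inline script. The connection name, target, and DATABRICKS_HOST variable are placeholders.

```yaml
# Illustrative use of the service connection created above. With
# addSpnToEnvironment: true, the task exposes servicePrincipalId,
# servicePrincipalKey, and tenantId to the bash script.
- task: AzureCLI@2
  displayName: 'Deploy bundle using the service connection'
  inputs:
    azureSubscription: 'my-databricks-service-connection'  # name from step b
    scriptType: bash
    scriptLocation: inlineScript
    addSpnToEnvironment: true
    inlineScript: |
      export ARM_CLIENT_ID=$servicePrincipalId
      export ARM_CLIENT_SECRET=$servicePrincipalKey
      export ARM_TENANT_ID=$tenantId
      databricks bundle deploy -t staging
  env:
    DATABRICKS_HOST: $(DATABRICKS_HOST)  # assumed to come from a variable group
```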

3. Create Environment Variables

a. Create a variable group for the environment variables.

b. Set the DATABRICKS_HOST variable to the workspace URL, and set ENV to the target environment configured in databricks.yml (see step 5).

c. Modify the pipeline .yml file to use the correct variable names. Note: the pipeline has two stages, OnRelease and toProduction, and we can use a different variable group for each (see the sketch after this list).

d. Update the service connection name used in the pipeline to match the service principal connection created in step 2. The service principal must have subscription-level permissions in Azure and the appropriate permissions in Databricks.
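The variable wiring for the two stages can then look roughly like this. The variable group names are assumptions, and each group is expected to define DATABRICKS_HOST and ENV for its environment.

```yaml
# Illustrative per-stage variable groups: each stage resolves DATABRICKS_HOST
# and ENV from its own group, and the bundle target follows $(ENV).
stages:
  - stage: OnRelease
    variables:
      - group: databricks-staging-vars   # defines DATABRICKS_HOST and ENV=staging
    jobs:
      - job: Deploy
        steps:
          - script: databricks bundle deploy -t $(ENV)
            displayName: 'Deploy bundle to $(ENV)'
            env:
              DATABRICKS_HOST: $(DATABRICKS_HOST)

  - stage: toProduction
    variables:
      - group: databricks-prod-vars      # defines DATABRICKS_HOST and ENV=prod
```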

4. Add Service Principal Permissions in Databricks

Grant permissions at the account level. In Databricks Accounts, under User Management in the Service Principals tab, add a Microsoft Entra ID-managed Service Principal by providing the Application ID and Service Principal name.
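If the bundle itself should also grant the service principal access to the resources it deploys, DABs support a top-level permissions mapping in databricks.yml. The snippet below is illustrative and the application ID is a placeholder.

```yaml
# Illustrative databricks.yml excerpt: grant the service principal access to
# the resources deployed by the bundle (placeholder application ID).
permissions:
  - level: CAN_MANAGE
    service_principal_name: "00000000-0000-0000-0000-000000000000"
```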

Once this is configured, proceed to configure the target environments through the bundle settings in databricks.yml.

5. Bundle Configuration (databricks.yml)

Change the target configurations for host and root_path, and add new environments if needed. If you generated the initial repository configuration using databricks bundle init, these targets will likely already be configured for your initial use case.
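A sketch of such a targets block is shown below. The workspace hosts, root paths, and application ID are placeholders, and production runs are configured to execute as the service principal.

```yaml
# Illustrative targets block with per-environment host and root_path overrides.
targets:
  staging:
    mode: development
    workspace:
      host: https://adb-2222222222222222.22.azuredatabricks.net
      root_path: /Workspace/Shared/.bundle/${bundle.name}/${bundle.target}
  prod:
    mode: production
    workspace:
      host: https://adb-3333333333333333.33.azuredatabricks.net
      root_path: /Workspace/Shared/.bundle/${bundle.name}/${bundle.target}
    run_as:
      service_principal_name: "00000000-0000-0000-0000-000000000000"
```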

Results

After running the pipeline, you can see the steps and results of the workflow executed as part of the release. At the end of the first stage, we can see that the job executions and tests passed successfully.

Additionally, as part of the results, we have tests conducted by Nutter on a remote cluster.

After completing the CI/CD pipeline in Azure DevOps, we can view the deployment artifacts and the jobs executed in Databricks as part of the staging environment evaluation. This includes artifacts such as configuration scripts, notebooks, and code libraries. These elements provide a clear overview of the deployment process and ensure that the workflow functions correctly before promoting to production. The images below illustrate the artifacts and the job executions in the Databricks interface, respectively.

Artifacts

Conclusion

This solution demonstrates how to deploy and integrate Databricks Asset Bundles within an Azure DevOps CI/CD pipeline. The Python notebooks and pipelines used in this sample solution are basic examples that can be obtained directly from the databricks bundle init templates. For more complex ML solutions, please refer to other Databricks ML resources, such as the Databricks MLOps Stacks.
