Building CI Pipeline with Databricks Asset Bundle and GitLab

cover
25 May 2024

Introduction

In the previous blog, I showed you how to build a CI pipeline using Databricks CLI eXtensions and GitLab. In this post, I will show you how to achieve the same objective with the latest and recommended Databricks deployment framework, Databricks Asset Bundles. DAB is actively supported and developed by the Databricks team as a new tool for streamlining the development of complex data, analytics, and ML projects for the Databricks platform.

I will skip the general introduction of DAB and its features and refer you to the Databricks documentation. Here, I will focus on how to migrate our dbx project from the previous blog to DAB. Along the way, I will explain some concepts and features that can help you grasp each step better.

Development pattern using Databricks GUI

In the previous post, we used the Databricks GUI to develop and test our code and workflows. For this blog post, we want to be able to use our local environment to develop our code as well. The workflow will be as follows:

  1. Create a remote repository and clone it to our local environment and Databricks workspace. We use GitLab here.

  2. Develop the program logic and test it inside the Databricks GUI or on our local IDE. This includes Python scripts to build a Python Wheel package, scripts to test data quality using pytest, and a notebook to run the pytest.

  3. Push the code to GitLab. The git push will trigger a GitLab Runner to build, deploy, and launch resources on Databricks using Databricks Asset Bundles.

    Setting up your development environments

    Databricks CLI

    First of all, we need to install Databricks CLI version 0.205 or above on your local machine. To check your installed version of the Databricks CLI, run the command databricks -v. To install Databricks CLI version 0.205 or above, see Install or update the Databricks CLI.

    Authentication

    Databricks supports various authentication methods between the Databricks CLI on our development machine and your Databricks workspace. For this tutorial, we use Databricks personal access token authentication. It consists of two steps:

    1. Create a personal access token on our Databricks workspace.
    2. Create a Databricks configuration profile on our local machine.

  4. To generate a Databricks token in your Databricks workspace, go to User Settings → Developer → Access tokens → Manage → Generate new token.

  5. To create a configuration profile, create the file ~/.databrickscfg in your root folder with the following content:

[asset-bundle-tutorial]
host  = https://xxxxxxxxxxx.cloud.databricks.com
token = xxxxxxx

Here, the asset-bundle-tutorial is our profile name, the host is the address of our workspace, and the token is the personal access token that we just created.

You can create this file using the Databricks CLI by running databricks configure --profile asset-bundle-tutorial in your terminal. The command will prompt you for the Databricks Host and Personal Access Token. If you don’t specify the --profile flag, the profile name will be set to DEFAULT.

Git integration (Databricks)

As the first step, we configure Git credentials & connect a remote repo to Databricks . Next, we create a remote repository and clone it to our Databricks repo , as well as on our local machine. Finally we need set up authentication between the Databricks CLI on the Gitlab runner and our Databricks workspace. To do that, we should add two environment variables, DATABRICKS_HOST and DATABRICKS_TOKEN to our Gitlab CI/CD pipeline configurations. For that open your repo in Gitlab, go to Settings→ CI/CD → Variables → Add variables

Both dbx and DAB are built around the Databricks REST APIs, so at their core, they are very similar. I will go through the steps to create a bundle manually from our existing dbx project.

The first thing that we need to set up for our DAB project is the deployment configuration. In dbx, we use two files to define and set up our environments and workflows (jobs and pipelines). To set up the environment, we used .dbx/project.json, and to define the workflows, we used deployment.yml.

In DAB, everything goes into databricks.yml, which is located in the root folder of your project. Here's how it looks:

bundle:
  name: DAB_tutorial #our bundle name 

# These are for any custom variables for use throughout the bundle.
variables:
  my_cluster_id:
    description: The ID of an existing cluster.
    default: xxxx-xxxxx-xxxxxxxx

#The remote workspace URL and workspace authentication credentials are read from the caller’s local configuration profile named <asset-bundle-tutorial>
workspace:
  profile: asset-bundle-tutorial


# These are the default job and pipeline settings if not otherwise overridden in
# the following "targets" top-level mapping.
resources:
    jobs:
      etl_job:
        tasks:
          - task_key: "main"
            existing_cluster_id: ${var.my_cluster_id}
            python_wheel_task:
              package_name: "my_package"
              entry_point: "etl_job" # take a look at the setup.py entry_points section for details on how to define an entrypoint
            libraries:
            - whl: ../dist/*.whl
          - task_key: "eda"
            existing_cluster_id: ${var.my_cluster_id}
            notebook_task:
              notebook_path: ../notebooks/explorative_analysis.py
              source: WORKSPACE
            depends_on:
              - task_key: "main"  

      test_job:         
        tasks:
          - task_key: "main_notebook"
            existing_cluster_id: ${var.my_cluster_id}
            notebook_task:
              notebook_path: ../notebooks/run_unit_test.py
              source: WORKSPACE
            libraries:
            - pypi:
                package: pytest

# These are the targets to use for deployments and workflow runs. One and only one of these
# targets can be set to "default: true".
targets:
  # The 'dev' target, used for development purposes.
  # Whenever a developer deploys using 'dev', they get their own copy.
  dev:
    # We use 'mode: development' to make sure everything deployed to this target gets a prefix
    # like '[dev my_user_name]'. Setting this mode also disables any schedules and
    # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
    mode: development
    default: true
    workspace:
      profile: asset-bundle-tutorial
      root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/my-envs/${bundle.target}      
      host: <path to your databricks dev workspace>

The databricks.yml bundle configuration file consists of sections called mappings. These mappings allow us to modularize the configuration file into separate logical blocks. There are 8 top-level mappings:

  • bundle

  • variables

  • workspace

  • artifacts

  • include

  • resources

  • sync

  • targets

Here, we use five of these mappings to organize our project.

bundle:

In the bundle mapping, we define the name of the bundle. Here, we can also define a default cluster ID that should be used for our development environments, as well as information about the Git URL and branch.

variables:

We can use the variables mapping to define custom variables and make our configuration file more reusable. For example, we declare a variable for the ID of an existing cluster and use it in different workflows. Now, in case you want to use a different cluster, all you have to do is to change the variable value.

resources:

The resources mapping is where we define our workflows. It includes zero or one of each of the following mappings: experiments, jobs, models, and pipelines. This is basically our deployment.yml file in the dbx project. Though there are some minor differences:

  • For the python_wheel_task, we must include the path to our wheel package; otherwise, Databricks can’t find the library. You can find more info about building wheel packages using DAB here.
  • We can use relative paths instead of full paths to run the notebook tasks. The path for the notebook to deploy is relative to the databricks.yml file in which this task is declared.

targets:

The targets mapping is where we define the configurations and resources of different stages/environments of our projects. For example, for a typical CI/CD pipeline, we would have three targets: development, staging, and production. Each target can consist of all the top-level mappings (except targets) as child mappings. Here is the schema of the target mapping (databricks.yml).

targets:
  <some-unique-programmatic-identifier-for-this-target>:
    artifacts:
      ...
    bundle:
      ...
    compute_id: string
    default: true | false
    mode: development
    resources:
       ...
    sync:
       ...
    variables:
      <preceding-unique-variable-name>: <non-default-value>
    workspace:
      ...


The child mapping allows us to override the default configurations that we defined earlier in the top-level mappings. For example, if we want to have an isolated Databricks workspace for each stage of our CI/CD pipeline, we should set the workspace child mapping for each target.

workspace:
  profile: my-default-profile

targets:
  dev:
    default: true
  test:
    workspace:
      host: https://<staging-workspace-url>
  prod:
    workspace:
      host: https://<production-workspace-url>

include:

The include mapping allows us to break our configuration file into different modules. For example, we can save our resources and variables to the resources/dev_job.yml file and import it into our databricks.yml file.

# yaml-language-server: $schema=bundle_config_schema.json
bundle:
  name: DAB_tutorial #our bundle name 

workspace:
  profile: asset-bundle-tutorial
include:
  - ./resources/*.yml

targets:
  # The 'dev' target, used for development purposes.
  # Whenever a developer deploys using 'dev', they get their own copy.
  dev:
    # We use 'mode: development' to make sure everything deployed to this target gets a prefix
    # like '[dev my_user_name]'. Setting this mode also disables any schedules and
    # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
    mode: development
    default: true

For more detailed explanation of DAB configurations check out Databricks Asset Bundle configurations

Workflows

The workflows are exactly what I described in previous blog. The only differences is the location of artifacts and files.

The project skeleton

here is how the final project looks like

ASSET-BUNDLE-TUTORAL/
├─ my_package/
│  ├─ tasks/
│  │  ├─ __init__.py
│  │  ├─ sample_etl_job.py
│  ├─ __init__.py
│  ├─ common.py
├─ test/
│  ├─ conftest.py
│  ├─ test_sample.py
├─ notebooks/
│  ├─ explorative_analysis.py
│  ├─ run_unit_test.py
├─ resources/
│  ├─ dev_jobs.yml
├─ .gitignore
├─ .gitlab-ci.yml
├─ databricks.yml
├─ README.md
├─ setup.py

Validate, Deploy & Run

Now, open your terminal and run the following commands from the root directory:

  1. validate: First, we should check if our configuration file has the right format and syntax. If the validation succeeds, you will get a JSON representation of the bundle configuration. In case of an error, fix it and run the command again until you receive the JSON file.

    databricks bundle validate
    

  1. deploy: Deployment includes building the Python wheel package and deploying it to our Databricks workspace, deploying the notebooks and other files to our Databricks workspace, and creating the jobs in our Databricks workflows.

    databricks bundle deploy
    

    If no command options are specified, the Databricks CLI uses the default target as declared within the bundle configuration files. Here, we only have one target so it doesn’t matter, but to demonstrate this, we can also deploy a specific target by using the -t dev flag.

  2. run: Run the deployed jobs. Here, we can specify which job we want to run. For example, in the following command, we run the test_job job in the dev target.

    databricks bundle run -t dev test_job
    

in the output you get a URL to that points to the job run in your workspace.

you can also find your jobs in he Workflow section of your Databricks workspace.

CI pipeline configuration

The general setup of our CI pipeline stays the same as the previous project. It consists of two main stages: test and deploy. In the test stage, the unit-test-job runs the unit tests and deploys a separate workflow for testing. The deploy stage, activated upon successful completion of the test stage, handles the deployment of your main ETL workflow.

Here, we have to add additional steps before each stage for installing Databricks CLI and setting up the authentication profile. We do this in the before_script section of our CI pipeline. The before_script keyword is used to define an array of commands that should run before each job’s script commands. More about it can be found here.

Optionally, you can use the after_project keyword to define an array of commands that should run AFTER each job. Here, we can use databricks bundle destroy --auto-approve to clean up after each job is over. In general, our pipeline go through these steps:

  • Install the Databricks CLI and create configuration profile.
  • Build the project.
  • Push the build artifacts to the Databricks workspace.
  • Install the wheel package on your cluster.
  • Create the jobs on Databricks Workflows.
  • Run the jobs.

here is how our .gitlab-ci.yml looks like:

image: python:3.9

stages:          # List of stages for jobs, and their order of execution
  - test
  - deploy

default:
  before_script:
    - echo "install databricks cli"
    - curl -V
    - curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
    - echo "databricks CLI installation finished"
    - echo "create the configuration profile for token authentication"
    - echo "[asset-bundle-tutorial]" > ~/.databrickscfg
    - echo "token = $DATABRICKS_TOKEN" >> ~/.databrickscfg
    - echo "host = $DATABRICKS_HOST" >> ~/.databrickscfg
    - echo "validate the bundle"
    - databricks bundle validate

  after_script:
    - echo "remove all workflows"
    #- databricks bundle destroy --auto-approve

unit-test-job:   # This job runs in the test stage.
  stage: test
  script:
    - echo "Running unit tests."
    - pip3 install --upgrade wheel setuptools
    - pip install -e ".[local]"
    - databricks bundle deploy -t dev
    - databricks bundle run -t dev test_job
    

deploy-job:      # This job runs in the deploy stage.
  stage: deploy  # It only runs when *both* jobs in the test stage complete successfully.
  script:
    - echo "Deploying application..."
    - echo "Install dependencies"
    - pip install -e ".[local]"
    - echo "Deploying Job"
    - databricks bundle deploy -t dev
    - databricks bundle run -t dev etl_job

Notes

Here are some notes that could help you set up your bundle project:

  • In this blog, we created our bundle manually. In my experience, this helps to understand the underlying concepts and features better. But if you want to have a fast start with your project, you can use default and non-default bundle templates that are provided by Databricks or other parties. Check out this Databricks post to learn about how to initiate a project with the default Python template.
  • When you deploy your code using databricks bundle deploy, Databricks CLI runs the command python3 setup.py bdist_wheel to build your package using the setup.py file. If you already have python3 installed but your machine uses the python alias instead of python3, you will run into problems. However, this is easy to fix. For example, here and here are two Stack Overflow threads with some solutions.

What’s next

In the next blog post, I will start with my first blog post on how to start a machine learning project on Databricks. It will be the first post in my upcoming end-to-end machine learning pipeline, covering everything from development to production. Stay tuned!

Resouces

Make sure you update the cluster_id in resources/dev_jobs.yml