Utilizing YAML Anchors in Databricks Asset Bundles

We all know what YAML is – it’s like JSON, just with indentation instead of brackets. Easier to write and read. That’s it, isn’t it? In most situations… yes. But if we look a little deeper, we’ll find features that many people have no idea exist. And let me emphasize right away, I’m not judging – I didn’t know about them myself until a few months ago.

YAML Aliases and Anchors

One of these features is anchors and aliases. Anchors… Okay, maybe before we move on to the definitions, let’s look at this simple example document that describes the configuration of several databases. As you can probably see, it’s not in line with the DRY (Don’t Repeat Yourself) principle – most of the code is copy-pasted. This isn’t a huge problem when dealing with simple config files like the one presented here, but in real-life scenarios, it would be nice to avoid these kinds of situations.

databases:
  config:
    type: azuresql
    sku_name: S0
    max_size_gb: 4
    collation: SQL_Latin1_General_CP1_CI_AS
  masterdata:
    type: azuresql
    sku_name: S0
    max_size_gb: 4
    collation: SQL_Latin1_General_CP1_CI_AS
  access:
    type: azuresql
    sku_name: S0
    max_size_gb: 4
    collation: SQL_Latin1_General_CP1_CI_AS

Now, back to the definitions. Anchors and aliases are features of YAML that simplify your code by minimizing repetition.
  • Anchors (&) allow you to define a block of data and label it with a unique name. This label can then be used to reference the same data elsewhere in the YAML document, effectively creating a reusable component. Anchors are particularly useful for defining commonly used configurations or settings that you want to reference multiple times without rewriting them.
  • Aliases (*) act as a reference or pointer to an anchor. By using an alias, you can include the data defined by an anchor at different points in the document. This helps avoid redundancy, as changes made to the anchored data automatically propagate to every location where the alias is used.

Those wishing to dive deeper into the topic can check the details here. We, on the other hand, will move on to an example showing how a configuration file can be simplified using the newly learned syntax.

default: &def
  type: azuresql
  sku_name: S0
  max_size_gb: 4
  collation: SQL_Latin1_General_CP1_CI_AS

databases:
  config:
    <<: *def
  masterdata:
    <<: *def
  access:
    <<: *def

The repeated block of code was moved to the beginning of the file and marked with the anchor &def, then reused multiple times via the alias *def (together with the merge key <<). Let’s check that this syntax really works by using a simple Python script that loads our file and prints its content to the screen.
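
A minimal sketch of such a check, assuming the configuration above is saved as databases.yml and the PyYAML package is installed:

import json

import yaml

# Load the YAML file; PyYAML resolves anchors, aliases and merge keys on load.
with open("databases.yml") as f:
    config = yaml.safe_load(f)

# Pretty-print the parsed structure to see the aliased blocks expanded.
print(json.dumps(config, indent=2))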

As you can see, everything works – the aliases are replaced. The only potential issue is that the default node remains part of our file.

This happens because we defined our reusable configuration as a separate top-level node, and anchors are not removed after being used – the anchored block remains a fully fledged node in our YAML file. To avoid this, we can rewrite the code and, instead of defining a separate block, simply anchor the first occurrence. The databases defined this way are equivalent to those in the configuration presented earlier in today’s post.

databases:
  config: &def
    type: azuresql
    sku_name: S0
    max_size_gb: 4
    collation: SQL_Latin1_General_CP1_CI_AS
  masterdata:
    <<: *def
  access:
    <<: *def

A quick check that it works as expected.
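
Again, a small sketch, assuming the second variant is saved as databases_v2.yml:

import yaml

with open("databases_v2.yml") as f:
    print(yaml.safe_load(f))

# Expected shape of the result: only the databases node, with the same values
# merged into config, masterdata and access – and no separate default block.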

Perfect.

Now, before we move on to using this syntax in Databricks Asset Bundles, I will cite a portion of the YAML specification, specifically the section “Serializing the Representation Graph”, which states that when a representation graph is serialized, second and subsequent references to the same node are replaced with aliases.

For sequential access mediums, such as an event callback API, a YAML representation must be serialized to an ordered tree. Since in a YAML representation, mapping keys are unordered and nodes may be referenced more than once (have more than one incoming “arrow”), the serialization process is required to impose an ordering on the mapping keys and to replace the second and subsequent references to a given node with place holders called aliases. YAML does not specify how these serialization details are chosen. It is up to the YAML processor to come up with human-friendly key order and anchor names, possibly with the help of the application. The result of this process, a YAML serialization tree, can then be traversed to produce a series of event calls for one-pass processing of YAML data.

For those interested in the topic, I refer you to this post. For everyone else, just don’t be surprised if your YAML is unexpectedly presented as shown below.
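
A small illustration of this behaviour using PyYAML – when the same object is referenced more than once, the dumper serializes the repeats as auto-generated anchors and aliases:

import yaml

# One shared dictionary referenced by three keys.
default = {"type": "azuresql", "sku_name": "S0"}
doc = {"databases": {"config": default, "masterdata": default, "access": default}}

print(yaml.dump(doc, default_flow_style=False))

# Typical output (anchor names and key order may differ):
# databases:
#   access: &id001
#     sku_name: S0
#     type: azuresql
#   config: *id001
#   masterdata: *id001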

Databricks Asset Bundles

Let’s now see how this functionality can be used when developing Databricks Asset Bundles (DAB).

I’ll start by saying that I personally opt for splitting DAB configurations into smaller files and leaving only the bare minimum in the databricks.yml root. An example root configuration file might look something like this.

bundle:
  name: hellobundles

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

include:
  - bundle/targets/*.yml
  - bundle/workflows/*.yml
  - bundle/variables.yml

This part will remain unchanged in the examples provided below.

Standard Configuration

Let’s start with a simple configuration that includes two environments: dev and tst. Both point to the same Databricks workspace, but with a key difference: dev is set to development mode, while tst is set to production mode. Additionally, in the tst environment, jobs run in the context of a defined service principal.

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
  tst:
    mode: production
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    run_as:
      service_principal_name: 11111111-aaaa-bbbb-cccc-dddddddddddd

The project includes one workflow definition consisting of three notebook tasks. In the definition below, besides the tasks mentioned earlier, you’ll find information about the job cluster being used, notifications, and the schedule. This is standard YAML syntax, and you can export such definitions from the Databricks GUI for existing workflows.

resources:
  jobs:
    hellobundles_master_job:
      name: hellobundles_master_job

      job_clusters:
        - job_cluster_key: cluster_small
          new_cluster:
            spark_version: 15.4.x-scala2.12
            azure_attributes:
              first_on_demand: 1
              availability: SPOT_WITH_FALLBACK_AZURE
              spot_bid_max_price: 100
            node_type_id: Standard_DS3_v2
            enable_elastic_disk: true
            data_security_mode: USER_ISOLATION
            runtime_engine: STANDARD
            num_workers: 1

      schedule:
        quartz_cron_expression: 0 30 0 * * ?
        timezone_id: UTC
        pause_status: UNPAUSED

      email_notifications:
        on_failure:
          - tkostyrka@hellobundles.com
          - anowak@hellobundles.com
          - jkowalski@hellobundles.com      

      tasks:
        - task_key: bronze
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/bronze_processor.py

        - task_key: silver
          depends_on:
            - task_key: bronze
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/silver_processor.py

        - task_key: gold
          depends_on:
            - task_key: silver
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/gold_processor.py

This is our starting point – the code is correct and works well. For a small solution, this is entirely sufficient.

Moving blocks to targets.yml

The first thing worth knowing is that workflow definitions can be overwritten in the targets node, which allows us to modify these definitions per environment – for example, we can change the cluster settings, pause the schedule, or disable notifications in development environments. This can be seen in the example below.

The job_clusters, schedule and email_notifications nodes have been moved to targets.yml.

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    resources:
      jobs:
        hellobundles_master_job:
          job_clusters:
            - job_cluster_key: cluster_small
              new_cluster:
                spark_version: 15.4.x-scala2.12
                azure_attributes:
                  first_on_demand: 1
                  availability: SPOT_WITH_FALLBACK_AZURE
                  spot_bid_max_price: 100
                node_type_id: Standard_DS3_v2
                enable_elastic_disk: true
                data_security_mode: USER_ISOLATION
                runtime_engine: STANDARD
                num_workers: 1

          schedule:
            quartz_cron_expression: 0 30 0 * * ?
            timezone_id: UTC
            pause_status: PAUSED
  tst:
    mode: production
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    run_as:
      service_principal_name: 11111111-aaaa-bbbb-cccc-dddddddddddd
    resources:
      jobs:
        hellobundles_master_job:
          job_clusters:
            - job_cluster_key: cluster_small
              new_cluster:
                spark_version: 15.4.x-scala2.12
                azure_attributes:
                  first_on_demand: 1
                  availability: SPOT_WITH_FALLBACK_AZURE
                  spot_bid_max_price: 100
                node_type_id: Standard_DS3_v2
                enable_elastic_disk: true
                data_security_mode: USER_ISOLATION
                runtime_engine: STANDARD
                num_workers: 1

          schedule:
            quartz_cron_expression: 0 30 0 * * ?
            timezone_id: UTC
            pause_status: UNPAUSED

          email_notifications:
            on_failure:
              - tkostyrka@hellobundles.com
              - anowak@hellobundles.com
              - jkowalski@hellobundles.com

In the workflow definition itself, we’ve left only the list of tasks.

resources:
  jobs:
    hellobundles_master_job:
      name: hellobundles_master_job   

      tasks:
        - task_key: bronze
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/bronze_processor.py

        - task_key: silver
          depends_on:
            - task_key: bronze
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/silver_processor.py

        - task_key: gold
          depends_on:
            - task_key: silver
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/gold_processor.py

This feature is great as it allows us to parameterize our workflows differently across environments, but it doesn’t yet solve the issue of code duplication.

YAML Aliases and Anchors

This is where the use of anchors and aliases will help us. With this functionality, we can define reusable blocks in the targets configuration (&clusters anchor) and then reference them multiple times when configuring specific workflows (*clusters alias).

clusters: &clusters
  job_clusters:
    - job_cluster_key: cluster_small
      new_cluster:
        spark_version: 15.4.x-scala2.12
        azure_attributes:
          first_on_demand: 1
          availability: SPOT_WITH_FALLBACK_AZURE
          spot_bid_max_price: 100
        node_type_id: Standard_DS3_v2
        enable_elastic_disk: true
        data_security_mode: USER_ISOLATION
        runtime_engine: STANDARD
        num_workers: 1

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    resources:
      jobs:
        hellobundles_master_job:
          <<: *clusters

          schedule:
            quartz_cron_expression: 0 30 0 * * ?
            timezone_id: UTC
            pause_status: UNPAUSED

  tst:
    mode: production
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    run_as:
      service_principal_name: 11111111-aaaa-bbbb-cccc-dddddddddddd
    resources:
      jobs:
        hellobundles_master_job:
          <<: *clusters

          schedule:
            quartz_cron_expression: 0 30 0 * * ?
            timezone_id: UTC
            pause_status: UNPAUSED

          email_notifications:
            on_failure:
              - tkostyrka@hellobundles.com
              - anowak@hellobundles.com
              - jkowalski@hellobundles.com

Still not convinced it’s worth it? Take a look at how our solution will grow as we add new workflows.

clusters: &clusters
  job_clusters:
    - job_cluster_key: cluster_small
      new_cluster:
        spark_version: 15.4.x-scala2.12
        azure_attributes:
          first_on_demand: 1
          availability: SPOT_WITH_FALLBACK_AZURE
          spot_bid_max_price: 100
        node_type_id: Standard_DS3_v2
        enable_elastic_disk: true
        data_security_mode: USER_ISOLATION
        runtime_engine: STANDARD
        num_workers: 1

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    resources:
      jobs:
        master_job:
          <<: *clusters
        admin_job:
          <<: *clusters
        weekend_job:
          <<: *clusters

It looks quite simple and clean, doesn’t it?
