We all know what YAML is – it’s like JSON, just with indentation instead of brackets. Easier to write and read. That’s it, isn’t it? In most situations… yes. But if we look a little deeper, we’ll find features that many people have no idea exist. And let me emphasize right away, I’m not judging – I didn’t know about them myself until a few months ago.
YAML Aliases and Anchors
One of these features is anchors and aliases. Anchors… Okay, maybe before we move on to the definitions, let’s look at this simple example document that describes the configuration of several databases. As you can probably see, it’s not in line with the DRY (Don’t Repeat Yourself) principle – most of the code is copy-pasted. This isn’t a huge problem when dealing with simple config files like the one presented here, but in real-life scenarios, it would be nice to avoid these kinds of situations.
databases:
  config:
    type: azuresql
    sku_name: S0
    max_size_gb: 4
    collation: SQL_Latin1_General_CP1_CI_AS
  masterdata:
    type: azuresql
    sku_name: S0
    max_size_gb: 4
    collation: SQL_Latin1_General_CP1_CI_AS
  access:
    type: azuresql
    sku_name: S0
    max_size_gb: 4
    collation: SQL_Latin1_General_CP1_CI_AS
- Anchors (&) allow you to define a block of data and label it with a unique name. This label can be used to reference the same data elsewhere in the YAML document, effectively creating a reusable component. Anchors are particularly useful for defining commonly used configurations or settings that you want to reference multiple times without rewriting them.
- Aliases (*) act as a reference or pointer to an anchor. By using an alias, you can include the data defined by an anchor at different points in the document. This helps to avoid redundancy, as changes made to the anchored data automatically propagate to all locations where the alias is used.
Those wishing to dive deeper into the topic can check the details here. We, on the other hand, will move on to an example showing how a configuration file can be simplified using the newly learned syntax.
default: &def
  type: azuresql
  sku_name: S0
  max_size_gb: 4
  collation: SQL_Latin1_General_CP1_CI_AS

databases:
  config:
    <<: *def
  masterdata:
    <<: *def
  access:
    <<: *def
A repeated block of code was moved to the beginning and marked with the anchor &def, then reused multiple times using the alias *def. The << merge key inserts the keys of the aliased mapping into the mapping in which it appears. Let’s check if this syntax really works by using a simple Python script that loads our file and then prints its content to the screen.
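A check along these lines could look as follows. This is a minimal sketch, assuming the PyYAML package is installed; the document is inlined as a string (the DOCUMENT name is mine) instead of being read from a file, so that the snippet is self-contained.

```python
import json

import yaml  # PyYAML, assumed to be installed (pip install pyyaml)

# The configuration from the example above, inlined so the sketch is self-contained.
DOCUMENT = """
default: &def
  type: azuresql
  sku_name: S0
  max_size_gb: 4
  collation: SQL_Latin1_General_CP1_CI_AS

databases:
  config:
    <<: *def
  masterdata:
    <<: *def
  access:
    <<: *def
"""

# safe_load resolves aliases and merge keys while building plain Python objects.
config = yaml.safe_load(DOCUMENT)

# Pretty-print the loaded structure; every database now carries the full settings.
print(json.dumps(config, indent=2))
```

The printed structure contains the three databases with all four settings each, plus the top-level default node that holds the anchored block.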
As you can see, everything works: the aliases are replaced. The only potential issue is that the default node remains part of our file.
This happens because we defined our reusable configuration separately, and an anchored node is not removed after being used; it remains a fully-fledged node in the loaded document. To avoid this situation, we can rewrite the code and, instead of defining a separate block, simply put the anchor on the first occurrence of the block. This configuration is equivalent to the original one presented at the beginning of today’s post.
databases:
  config: &def
    type: azuresql
    sku_name: S0
    max_size_gb: 4
    collation: SQL_Latin1_General_CP1_CI_AS
  masterdata:
    <<: *def
  access:
    <<: *def
A quick check if it works as expected.
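The same kind of check, again only a sketch that assumes PyYAML and an inlined copy of the document, confirms that no extra top-level node is left behind:

```python
import yaml  # PyYAML, assumed to be installed

# The anchor now sits on the first database instead of a separate block.
DOCUMENT = """
databases:
  config: &def
    type: azuresql
    sku_name: S0
    max_size_gb: 4
    collation: SQL_Latin1_General_CP1_CI_AS
  masterdata:
    <<: *def
  access:
    <<: *def
"""

config = yaml.safe_load(DOCUMENT)

# Only the databases node exists at the top level.
print(list(config))

# Each database ends up with the same four settings.
for name, settings in config["databases"].items():
    print(name, settings)
```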
Perfect.
Now, before we move on to using this syntax in Databricks Asset Bundles, I will cite a portion of the YAML standard, specifically the section “Serializing the Representation Graph”, which states that aliases are mandatory when serializing a structure in which a node is referenced more than once.
For sequential access mediums, such as an event callback API, a YAML representation must be serialized to an ordered tree. Since in a YAML representation, mapping keys are unordered and nodes may be referenced more than once (have more than one incoming “arrow”), the serialization process is required to impose an ordering on the mapping keys and to replace the second and subsequent references to a given node with place holders called aliases. YAML does not specify how these serialization details are chosen. It is up to the YAML processor to come up with human-friendly key order and anchor names, possibly with the help of the application. The result of this process, a YAML serialization tree, can then be traversed to produce a series of event calls for one-pass processing of YAML data.
For those interested in the topic, I refer you to this post. For everyone else, just don’t be surprised if your YAML is unexpectedly presented as shown below.
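To reproduce this effect yourself, here is a minimal sketch, assuming PyYAML. When the same Python object is referenced twice, the dumper emits an anchor at the first occurrence and an alias at the second. NoAliasDumper is a hypothetical helper name of mine; overriding ignore_aliases on a dumper subclass is one common way to force fully expanded output.

```python
import yaml  # PyYAML, assumed to be installed

# One dictionary object referenced from two places in the structure.
shared = {"type": "azuresql", "sku_name": "S0"}
config = {"databases": {"config": shared, "masterdata": shared}}

# The default dumper serializes the second reference as an alias.
with_aliases = yaml.safe_dump(config)
print(with_aliases)


class NoAliasDumper(yaml.SafeDumper):
    """Hypothetical helper: a dumper that never emits anchors or aliases."""

    def ignore_aliases(self, data):
        return True


# With aliases disabled, the shared block is written out twice.
expanded = yaml.dump(config, Dumper=NoAliasDumper)
print(expanded)
```

The first output contains an auto-generated anchor name chosen by the library, which is exactly the kind of surprise mentioned above; the second one repeats the block in full.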
Databricks Asset Bundles
Let’s now see how this functionality can be used when developing Databricks Asset Bundles (DAB).
I’ll start by saying that I personally opt for splitting DAB configurations into smaller files and leaving only the bare minimum in the root databricks.yml file. An example root configuration file might look something like this.
bundle:
  name: hellobundles

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

include:
  - bundle/targets/*.yml
  - bundle/workflows/*.yml
  - bundle/variables.yml
This part will remain unchanged in the examples provided below.
Standard Configuration
Let’s start with a simple configuration that includes two environments: dev and test. Both environments point to the same Databricks workspace, but with a key difference: dev is set to development mode, while test is set to production mode. Additionally, in the test environment jobs run in the context of a defined service principal.
targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
  tst:
    mode: production
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    run_as:
      service_principal_name: 11111111-aaaa-bbbb-cccc-dddddddddddd
The project includes one workflow definition consisting of three notebook tasks. In the definition below, besides the tasks mentioned earlier, you’ll find information about the job cluster being used, notifications, and the schedule. This is standard YAML syntax, and you can export such definitions from the Databricks UI for existing workflows.
resources:
  jobs:
    hellobundles_master_job:
      name: hellobundles_master_job
      job_clusters:
        - job_cluster_key: cluster_small
          new_cluster:
            spark_version: 15.4.x-scala2.12
            azure_attributes:
              first_on_demand: 1
              availability: SPOT_WITH_FALLBACK_AZURE
              spot_bid_max_price: 100
            node_type_id: Standard_DS3_v2
            enable_elastic_disk: true
            data_security_mode: USER_ISOLATION
            runtime_engine: STANDARD
            num_workers: 1
      schedule:
        quartz_cron_expression: 0 30 0 * * ?
        timezone_id: UTC
        pause_status: UNPAUSED
      email_notifications:
        on_failure:
          - tkostyrka@hellobundles.com
          - anowak@hellobundles.com
          - jkowalski@hellobundles.com
      tasks:
        - task_key: bronze
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/bronze_processor.py
        - task_key: silver
          depends_on:
            - task_key: bronze
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/silver_processor.py
        - task_key: gold
          depends_on:
            - task_key: silver
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/gold_processor.py
This is our starting point – the code is correct and works well. For a small solution, this is entirely sufficient.
Moving blocks to targets.yml
The first thing worth knowing is that workflow definitions can be overridden in the targets node, which allows us to modify these definitions per environment. For example, we can change the cluster settings, pause the schedule, or disable notifications in development environments. This can be seen in the example below, where the job_clusters, schedule and email_notifications nodes have been moved to targets.yml.
targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    resources:
      jobs:
        hellobundles_master_job:
          job_clusters:
            - job_cluster_key: cluster_small
              new_cluster:
                spark_version: 15.4.x-scala2.12
                azure_attributes:
                  first_on_demand: 1
                  availability: SPOT_WITH_FALLBACK_AZURE
                  spot_bid_max_price: 100
                node_type_id: Standard_DS3_v2
                enable_elastic_disk: true
                data_security_mode: USER_ISOLATION
                runtime_engine: STANDARD
                num_workers: 1
          schedule:
            quartz_cron_expression: 0 30 0 * * ?
            timezone_id: UTC
            pause_status: PAUSED
  val:
    mode: production
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    run_as:
      service_principal_name: 11111111-aaaa-bbbb-cccc-dddddddddddd
    resources:
      jobs:
        hellobundles_master_job:
          job_clusters:
            - job_cluster_key: cluster_small
              new_cluster:
                spark_version: 15.4.x-scala2.12
                azure_attributes:
                  first_on_demand: 1
                  availability: SPOT_WITH_FALLBACK_AZURE
                  spot_bid_max_price: 100
                node_type_id: Standard_DS3_v2
                enable_elastic_disk: true
                data_security_mode: USER_ISOLATION
                runtime_engine: STANDARD
                num_workers: 1
          schedule:
            quartz_cron_expression: 0 30 0 * * ?
            timezone_id: UTC
            pause_status: UNPAUSED
          email_notifications:
            on_failure:
              - tkostyrka@hellobundles.com
              - anowak@hellobundles.com
              - jkowalski@hellobundles.com
And in the workflow definition itself, we’ve left only the job name and the list of tasks.
resources:
  jobs:
    hellobundles_master_job:
      name: hellobundles_master_job
      tasks:
        - task_key: bronze
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/bronze_processor.py
        - task_key: silver
          depends_on:
            - task_key: bronze
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/silver_processor.py
        - task_key: gold
          depends_on:
            - task_key: silver
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/gold_processor.py
This feature is great as it allows us to parameterize our workflows differently across environments, but it doesn’t yet solve the issue of code duplication.
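Conceptually, the per-target settings are layered on top of the shared workflow definition. The sketch below only illustrates that idea with plain dictionaries; merge_override is a hypothetical helper of mine, not the Databricks CLI's actual merge algorithm, and the job content is trimmed to a few keys.

```python
from copy import deepcopy


def merge_override(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied; nested mappings are merged key by key."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Both sides are mappings: merge them recursively.
            result[key] = merge_override(result[key], value)
        else:
            # Otherwise the target-specific value wins.
            result[key] = deepcopy(value)
    return result


# Shared definition: only the name and the tasks (trimmed for brevity).
base_job = {
    "name": "hellobundles_master_job",
    "tasks": [{"task_key": "bronze"}, {"task_key": "silver"}, {"task_key": "gold"}],
}

# Target-specific additions, mirroring the dev and val targets above.
dev_override = {"schedule": {"timezone_id": "UTC", "pause_status": "PAUSED"}}
val_override = {
    "schedule": {"timezone_id": "UTC", "pause_status": "UNPAUSED"},
    "email_notifications": {"on_failure": ["tkostyrka@hellobundles.com"]},
}

dev_job = merge_override(base_job, dev_override)
val_job = merge_override(base_job, val_override)

print(dev_job["schedule"]["pause_status"])
print(val_job["schedule"]["pause_status"])
```

Both resulting jobs keep the shared name and tasks, while the schedule and notifications differ per environment.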
YAML Aliases and Anchors
This is where the use of anchors and aliases will help us. With this functionality, we can define reusable blocks in the targets configuration (the &clusters anchor) and then reference them multiple times when configuring specific workflows (the *clusters alias).
clusters: &clusters
  job_clusters:
    - job_cluster_key: cluster_small
      new_cluster:
        spark_version: 15.4.x-scala2.12
        azure_attributes:
          first_on_demand: 1
          availability: SPOT_WITH_FALLBACK_AZURE
          spot_bid_max_price: 100
        node_type_id: Standard_DS3_v2
        enable_elastic_disk: true
        data_security_mode: USER_ISOLATION
        runtime_engine: STANDARD
        num_workers: 1

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    resources:
      jobs:
        hellobundles_master_job:
          <<: *clusters
          schedule:
            quartz_cron_expression: 0 30 0 * * ?
            timezone_id: UTC
            pause_status: UNPAUSED
  val:
    mode: production
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    run_as:
      service_principal_name: 11111111-aaaa-bbbb-cccc-dddddddddddd
    resources:
      jobs:
        hellobundles_master_job:
          <<: *clusters
          schedule:
            quartz_cron_expression: 0 30 0 * * ?
            timezone_id: UTC
            pause_status: UNPAUSED
          email_notifications:
            on_failure:
              - tkostyrka@hellobundles.com
              - anowak@hellobundles.com
              - jkowalski@hellobundles.com
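If you want to verify what a YAML parser makes of such a file, a quick sketch like the one below can help. It assumes PyYAML and uses a trimmed copy of the configuration above (fewer cluster attributes, no workspace settings); it says nothing about how the Databricks CLI itself processes the file.

```python
import yaml  # PyYAML, assumed to be installed

# Trimmed version of the targets file above, inlined for a self-contained check.
TARGETS = """
clusters: &clusters
  job_clusters:
    - job_cluster_key: cluster_small
      new_cluster:
        spark_version: 15.4.x-scala2.12
        node_type_id: Standard_DS3_v2
        num_workers: 1

targets:
  dev:
    resources:
      jobs:
        hellobundles_master_job:
          <<: *clusters
          schedule:
            pause_status: UNPAUSED
  val:
    resources:
      jobs:
        hellobundles_master_job:
          <<: *clusters
          schedule:
            pause_status: UNPAUSED
          email_notifications:
            on_failure:
              - tkostyrka@hellobundles.com
"""

config = yaml.safe_load(TARGETS)
dev_job = config["targets"]["dev"]["resources"]["jobs"]["hellobundles_master_job"]
val_job = config["targets"]["val"]["resources"]["jobs"]["hellobundles_master_job"]

# The merge key pulls job_clusters into both jobs next to their own settings.
print(sorted(dev_job))
print(sorted(val_job))
print(dev_job["job_clusters"] == val_job["job_clusters"])
```

Both jobs receive the same job_clusters list, while the schedule and email_notifications nodes stay specific to each target.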
Still not convinced it’s worth it? Take a look at how our solution will grow as we add new workflows.
clusters: &clusters
  job_clusters:
    - job_cluster_key: cluster_small
      new_cluster:
        spark_version: 15.4.x-scala2.12
        azure_attributes:
          first_on_demand: 1
          availability: SPOT_WITH_FALLBACK_AZURE
          spot_bid_max_price: 100
        node_type_id: Standard_DS3_v2
        enable_elastic_disk: true
        data_security_mode: USER_ISOLATION
        runtime_engine: STANDARD
        num_workers: 1

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    resources:
      jobs:
        master_job:
          <<: *clusters
        admin_job:
          <<: *clusters
        weekend_job:
          <<: *clusters
It looks quite simple and clean, doesn’t it?
- Utilizing YAML Anchors in Databricks Asset Bundles - August 24, 2024