We all know what YAML is – it’s like JSON, just with indentation instead of brackets. Easier to write and read. That’s it, isn’t it? In most situations… yes. But if we look a little deeper, we’ll find features that many people have no idea exist. And let me emphasize right away, I’m not judging – I didn’t know about them myself until a few months ago.
YAML Aliases and Anchors
One of these features is anchors and aliases. Anchors… Okay, maybe before we move on to the definitions, let’s look at this simple example document that describes the configuration of several databases. As you can probably see, it’s not in line with the DRY (Don’t Repeat Yourself) principle – most of the code is copy-pasted. This isn’t a huge problem when dealing with simple config files like the one presented here, but in real-life scenarios, it would be nice to avoid these kinds of situations.
databases:
  config:
    type: azuresql
    sku_name: S0
    max_size_gb: 4
    collation: SQL_Latin1_General_CP1_CI_AS
  masterdata:
    type: azuresql
    sku_name: S0
    max_size_gb: 4
    collation: SQL_Latin1_General_CP1_CI_AS
  access:
    type: azuresql
    sku_name: S0
    max_size_gb: 4
    collation: SQL_Latin1_General_CP1_CI_AS
- Anchors (&) allow you to define a block of data and label it with a unique name. This label can be used to reference the same data elsewhere in the YAML document, effectively creating a reusable component. Anchors are particularly useful for defining commonly used configurations or settings that you want to reference multiple times without rewriting them.
- Aliases (*) act as a reference or pointer to an anchor. By using an alias, you can include the data defined by an anchor at different points in the document. This helps to avoid redundancy, as changes made to the anchored data automatically propagate to all locations where the alias is used.
Those wishing to dive deeper into the topic can check the details here. We, on the other hand, will move on to an example showing how a configuration file can be simplified using the newly learned syntax.
default: &def
  type: azuresql
  sku_name: S0
  max_size_gb: 4
  collation: SQL_Latin1_General_CP1_CI_AS

databases:
  config:
    <<: *def
  masterdata:
    <<: *def
  access:
    <<: *def
The repeated block of code was moved to the beginning and marked with the &def anchor, then pulled into each database definition using the merge key (<<) together with the *def alias. Let's check whether this syntax really works with a simple Python script that loads our file and prints its contents to the screen.
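Below is a minimal sketch of such a check. I'm assuming here that PyYAML is installed and that the configuration is saved as databases.yml (both the library choice and the file name are just illustrative).

import json

import yaml

# Load the YAML file; safe_load resolves anchors, aliases and merge keys while parsing.
with open("databases.yml") as f:
    config = yaml.safe_load(f)

# Pretty-print the parsed structure to see what the aliases expanded to.
print(json.dumps(config, indent=2))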
As you can see, everything works – the aliases are replaced. The only potential issue is that the default node remains part of our file.
This happens because we defined our reusable configuration separately, and the anchors are not removed after being used – they remain as a fully-fledged node in our YAML file. To avoid this situation, we can rewrite the code and, instead of defining a separate block, simply use its first occurrence. This configuration is equivalent to the one presented earlier in today’s post.
databases:
  config: &def
    type: azuresql
    sku_name: S0
    max_size_gb: 4
    collation: SQL_Latin1_General_CP1_CI_AS
  masterdata:
    <<: *def
  access:
    <<: *def
Let's quickly check that it works as expected.
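Here is the same kind of sketch for the rewritten file (again, PyYAML and the databases.yml name are my assumptions):

import yaml

with open("databases.yml") as f:
    config = yaml.safe_load(f)

# Each database should now carry the full set of settings pulled in via the merge key,
# and there is no separate top-level "default" node any more.
print(config["databases"]["masterdata"])  # e.g. {'type': 'azuresql', 'sku_name': 'S0', ...}
print("default" in config)                # False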
Perfect.
Now, before we move on to using this syntax in Databricks Asset Bundles, I will cite a portion of the YAML standard, specifically the section “Serializing the Representation Graph”, which clearly states that a serializer must replace repeated references to a node with aliases.
For sequential access mediums, such as an event callback API, a YAML representation must be serialized to an ordered tree. Since in a YAML representation, mapping keys are unordered and nodes may be referenced more than once (have more than one incoming “arrow”), the serialization process is required to impose an ordering on the mapping keys and to replace the second and subsequent references to a given node with place holders called aliases. YAML does not specify how these serialization details are chosen. It is up to the YAML processor to come up with human-friendly key order and anchor names, possibly with the help of the application. The result of this process, a YAML serialization tree, can then be traversed to produce a series of event calls for one-pass processing of YAML data.
For those interested in the topic, I refer you to this post. For everyone else, just don’t be surprised if your YAML is unexpectedly presented as shown below.
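To give a rough idea of what that can look like (assuming PyYAML once more; the spec leaves the anchor names up to the processor), dumping a structure in which the same object is referenced twice makes the serializer introduce anchors on its own:

import yaml

# The same dictionary object is referenced from two different keys.
common = {"type": "azuresql", "sku_name": "S0"}
data = {"databases": {"config": common, "masterdata": common}}

print(yaml.dump(data))
# The dumper keeps the first occurrence and replaces the repeated reference with an alias,
# producing something like:
#
# databases:
#   config: &id001
#     sku_name: S0
#     type: azuresql
#   masterdata: *id001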
Databricks Asset Bundles
Let’s now see how this functionality can be used when developing Databricks Asset Bundles (DAB).
I’ll start by saying that I personally opt for splitting DAB configurations into smaller files and leaving only the bare minimum in the databricks.yml root. An example root configuration file might look something like this.
bundle:
  name: hellobundles

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

include:
  - bundle/targets/*.yml
  - bundle/workflows/*.yml
  - bundle/variables.yml
This part will remain unchanged in the examples provided below.
Standard Configuration
Let’s start with a simple configuration that includes two environments: dev and test. Both environments point to the same Databricks workspace, but with a key difference: dev is set to development mode, while test is set to production mode. Additionally, in the test environment, jobs run in the context of a defined service principal.
targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
  tst:
    mode: production
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    run_as:
      service_principal_name: 11111111-aaaa-bbbb-cccc-dddddddddddd
The project includes one workflow definition consisting of three notebook tasks. In the definition below, besides the tasks mentioned earlier, you’ll find information about the job cluster being used, notifications, and the schedule. This is standard YAML syntax, and you can export such definitions from the Databricks GUI for existing workflows.
resources:
  jobs:
    hellobundles_master_job:
      name: hellobundles_master_job
      job_clusters:
        - job_cluster_key: cluster_small
          new_cluster:
            spark_version: 15.4.x-scala2.12
            azure_attributes:
              first_on_demand: 1
              availability: SPOT_WITH_FALLBACK_AZURE
              spot_bid_max_price: 100
            node_type_id: Standard_DS3_v2
            enable_elastic_disk: true
            data_security_mode: USER_ISOLATION
            runtime_engine: STANDARD
            num_workers: 1
      schedule:
        quartz_cron_expression: 0 30 0 * * ?
        timezone_id: UTC
        pause_status: UNPAUSED
      email_notifications:
        on_failure:
          - tkostyrka@hellobundles.com
          - anowak@hellobundles.com
          - jkowalski@hellobundles.com
      tasks:
        - task_key: bronze
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/bronze_processor.py
        - task_key: silver
          depends_on:
            - task_key: bronze
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/silver_processor.py
        - task_key: gold
          depends_on:
            - task_key: silver
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/gold_processor.py
This is our starting point – the code is correct and works well. For a small solution, this is entirely sufficient.
Moving blocks to targets.yml
The first thing worth knowing is that workflow definitions can be overridden in the targets node, which allows us to modify these definitions per environment. For example, we can change the cluster settings, pause the schedule, or disable notifications in development environments. This can be seen in the example below.
The job_clusters, schedule, and email_notifications nodes have been moved to targets.yml.
targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    resources:
      jobs:
        hellobundles_master_job:
          job_clusters:
            - job_cluster_key: cluster_small
              new_cluster:
                spark_version: 15.4.x-scala2.12
                azure_attributes:
                  first_on_demand: 1
                  availability: SPOT_WITH_FALLBACK_AZURE
                  spot_bid_max_price: 100
                node_type_id: Standard_DS3_v2
                enable_elastic_disk: true
                data_security_mode: USER_ISOLATION
                runtime_engine: STANDARD
                num_workers: 1
          schedule:
            quartz_cron_expression: 0 30 0 * * ?
            timezone_id: UTC
            pause_status: PAUSED
  tst:
    mode: production
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    run_as:
      service_principal_name: 11111111-aaaa-bbbb-cccc-dddddddddddd
    resources:
      jobs:
        hellobundles_master_job:
          job_clusters:
            - job_cluster_key: cluster_small
              new_cluster:
                spark_version: 15.4.x-scala2.12
                azure_attributes:
                  first_on_demand: 1
                  availability: SPOT_WITH_FALLBACK_AZURE
                  spot_bid_max_price: 100
                node_type_id: Standard_DS3_v2
                enable_elastic_disk: true
                data_security_mode: USER_ISOLATION
                runtime_engine: STANDARD
                num_workers: 1
          schedule:
            quartz_cron_expression: 0 30 0 * * ?
            timezone_id: UTC
            pause_status: UNPAUSED
          email_notifications:
            on_failure:
              - tkostyrka@hellobundles.com
              - anowak@hellobundles.com
              - jkowalski@hellobundles.com
And in the workflow definition itself, only the job name and the list of tasks remain.
resources:
  jobs:
    hellobundles_master_job:
      name: hellobundles_master_job
      tasks:
        - task_key: bronze
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/bronze_processor.py
        - task_key: silver
          depends_on:
            - task_key: bronze
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/silver_processor.py
        - task_key: gold
          depends_on:
            - task_key: silver
          job_cluster_key: cluster_small
          notebook_task:
            notebook_path: ../../notebooks/gold_processor.py
This feature is great as it allows us to parameterize our workflows differently across environments, but it doesn’t yet solve the issue of code duplication.
YAML Aliases and Anchors
This is where the use of anchors and aliases will help us. With this functionality, we can define reusable blocks in the targets configuration (&clusters anchor) and then reference them multiple times when configuring specific workflows (*clusters alias).
clusters: &clusters
  job_clusters:
    - job_cluster_key: cluster_small
      new_cluster:
        spark_version: 15.4.x-scala2.12
        azure_attributes:
          first_on_demand: 1
          availability: SPOT_WITH_FALLBACK_AZURE
          spot_bid_max_price: 100
        node_type_id: Standard_DS3_v2
        enable_elastic_disk: true
        data_security_mode: USER_ISOLATION
        runtime_engine: STANDARD
        num_workers: 1

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    resources:
      jobs:
        hellobundles_master_job:
          <<: *clusters
          schedule:
            quartz_cron_expression: 0 30 0 * * ?
            timezone_id: UTC
            pause_status: UNPAUSED
  tst:
    mode: production
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    run_as:
      service_principal_name: 11111111-aaaa-bbbb-cccc-dddddddddddd
    resources:
      jobs:
        hellobundles_master_job:
          <<: *clusters
          schedule:
            quartz_cron_expression: 0 30 0 * * ?
            timezone_id: UTC
            pause_status: UNPAUSED
          email_notifications:
            on_failure:
              - tkostyrka@hellobundles.com
              - anowak@hellobundles.com
              - jkowalski@hellobundles.com
Still not convinced it’s worth it? Take a look at how our solution will grow as we add new workflows.
clusters: &clusters
  job_clusters:
    - job_cluster_key: cluster_small
      new_cluster:
        spark_version: 15.4.x-scala2.12
        azure_attributes:
          first_on_demand: 1
          availability: SPOT_WITH_FALLBACK_AZURE
          spot_bid_max_price: 100
        node_type_id: Standard_DS3_v2
        enable_elastic_disk: true
        data_security_mode: USER_ISOLATION
        runtime_engine: STANDARD
        num_workers: 1

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net
    resources:
      jobs:
        master_job:
          <<: *clusters
        admin_job:
          <<: *clusters
        weekend_job:
          <<: *clusters
It looks quite simple and clean, doesn’t it?