
Connect Power BI and Spark notebooks with Microsoft Fabric Semantic Link

One of the most exciting pieces of news that has emerged in recent days is the availability of the Semantic Link feature in Microsoft Fabric. What exactly is it, and what makes it so thrilling? Let’s delve into it.

The Semantic Link feature empowers you to establish a connection between Power BI datasets and Synapse Data Science within the Microsoft Fabric ecosystem. That’s the official explanation, but it will also be highly beneficial for data engineers and Power BI developers, not just data scientists! Semantic Link serves several pivotal purposes, such as:

  • simplifying data connectivity,
  • facilitating the dissemination of semantic data information,
  • seamlessly integrating with widely-used data scientist tools.

Why is this so significant? Well, in the past, disseminating a Power BI model to Data Science workflows was quite challenging. While there were possibilities to extract some data and semantic information via XMLA Endpoints and the REST API of Power BI Service, it felt like a somewhat cumbersome, backward ETL process for distributing such data.

Now, you might wonder why data scientists would require anything from Power BI. The answer is quite straightforward: business logic. Some DAX measures contain crucial logic that is calculated on the fly, making it difficult to transfer them back to a lakehouse or data warehouse. In some projects I’ve observed, these calculations were duplicated in multiple places due to limited collaboration options. As a result, over time, maintenance issues arose, and these calculations began to diverge.

It appears that Fabric Semantic Link (FSL) is the new hero in town, offering a solution to this problem. The typical workflow with FSL looks like this:

Fabric Semantic Link effectively bridges the gap throughout the entire process. For instance, consider a scenario in which a data science team generates data that can be stored in OneLake using Spark. Once the data resides in OneLake, it becomes readily accessible in Power BI, where a semantic model with dynamic calculations and more can be applied. Additionally, this information can be seamlessly utilized in Synapse Data Science through Semantic Link, greatly simplifying the sharing of knowledge compared to previous methods. Semantic Link includes the sempy Python library, which can be used from a Fabric notebook to connect to the Power BI dataset where the model is located.
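To make the first step concrete, here is a minimal sketch of storing data in OneLake from a Fabric notebook attached to a lakehouse (the data and the table name are hypothetical):

# Write a small Spark DataFrame as a Delta table in the attached lakehouse,
# landing it in OneLake where Power BI can build a semantic model on top of it.
scores = spark.createDataFrame(
    [("Agent 1", "US", 10000.0), ("Agent 2", "GB", 5000.0)],
    ["SalesAgent", "Country", "PredictedRevenue"],
)
scores.write.format("delta").mode("overwrite").saveAsTable("agent_predictions")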

It’s important to remember that FSL isn’t just a connectivity feature; it also propagates semantic information from Power BI to empower new capabilities within the Microsoft Fabric ecosystem for data augmentation, validation, exploration, and an extendable set of semantic functions.

There are two methods for interacting with Power BI Datasets:

  • SemPy Python Library: This library provides functionality similar to Pandas, making it user-friendly for Data Scientists. It enhances the standard Pandas capabilities by incorporating additional metadata derived from data sources and Power BI Data Categories (e.g., geographic, barcode, URL), relationships, hierarchies, and more.
  • Spark Native Connector: This connector offers native Spark API functionalities, providing the flexibility to utilize Spark SQL, PySpark, R, and Scala, depending on your requirements.

I created a notebook in my Fabric workspace where I will write all the code. You can install the Semantic Link library in-line for your current session in the notebook using the following command:

%pip install semantic-link

Alternatively, you can install it from Library Management, which makes it available for the entire workspace.

After installation, you can import the SemPy library to begin working with it.

 

import sempy.fabric as sf

In my examples, I have uploaded one of Microsoft’s sample Power BI models called “Corporate Spend”. The first method I’d like to demonstrate is the read_table function. As the name suggests, it reads a table from the dataset. You simply provide the dataset name as the first parameter and the table name as the second:

df = sf.read_table("Corporate Spend", "Fact")
type(df)

I also verified the type of object returned by the read_table function: it is an object called a FabricDataFrame.

It closely resembles a standard Pandas DataFrame, but when you read a Power BI dataset into a FabricDataFrame, it automatically attaches semantic information, including metadata and annotations, from the dataset to the FabricDataFrame. You can create this type of object in two ways:

  1. Manually, by using in-memory data.
  2. By reading a table from Power BI data, as demonstrated above.

Manually in this case means that you can specify values directly:

df = sf.FabricDataFrame({
    "Sales Agent": ["Agent 1", "Agent 1", "Agent 2"],
    "Customer[Country/Region]": ["US", "GB", "US"],
    "Industry[Industry]": ["Services", "CPG", "Manufacturing"],
    "Product Category[Category]": ["Electronics", "Clothing", "Electronics"],
    "Revenue": [10000, 5000, 12000],
})
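
Whichever way you create the frame, the attached semantic information travels with it. A minimal sketch of peeking at it, assuming the column_metadata property exposed by SemPy's FabricDataFrame:

# column_metadata is empty for the hand-built frame above, but is populated
# with data categories, descriptions, and similar details when the frame
# comes from read_table.
print(df.column_metadata)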

As I mentioned earlier, with SemPy, you have the functionality of Pandas at your disposal. Therefore, you can easily use methods like head to display the first 10 records of your FabricDataFrame:

df.head(10)

Pandas indeed offers a vast array of methods for data manipulation and analysis (I recommend exploring the Pandas documentation). In this post, I won’t delve into the common Pandas data manipulation patterns; instead, I’ll concentrate on the essential functionality of interacting with Power BI using the Semantic Library.
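Still, to give a flavor, here is a minimal sketch of an everyday Pandas operation on the manually created frame from above, using its column names:

# Standard Pandas idioms work unchanged on a FabricDataFrame:
# total revenue per country from the hand-built frame.
revenue_by_country = df.groupby("Customer[Country/Region]")["Revenue"].sum()
print(revenue_by_country)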

One of the primary exploratory functions available is list_datasets, which retrieves all the models within our workspace (the workspace where our notebook is located) along with some fundamental information about them.

sf.list_datasets()

Once you’ve identified the model you want to explore, you can provide its name to the next function, called list_tables, to examine the objects contained within it. This function allows you to get a closer look at the tables and elements present in the selected model:

sf.list_tables("Corporate Spend")

Even more intriguing is the ability to examine the relationships between those tables, which can be done using the list_relationships function. As shown below, it provides a comprehensive list with details about the cardinality and the columns on which each relationship is based. This feature is indeed quite valuable for understanding the data structure and relationships within the Power BI model.

sf.list_relationships("Corporate Spend")

If you find that textual descriptions of relationships are not sufficient, you can take it a step further by visualizing these relationships using the plot_relationship_metadata function. This feature is not only visually appealing but also highly beneficial for gaining a more intuitive understanding of the model directly from your code.

from sempy.relationships import plot_relationship_metadata
plot_relationship_metadata(sf.list_relationships("Corporate Spend"))

Looks impressive, doesn’t it?

As you might anticipate, you can also review measures, including their associated expressions, to gain insights into how calculations and metrics are defined within the model. This capability allows you to examine the logic and formulas behind the measures, aiding in your understanding of the data and its calculations.

sf.list_measures("Corporate Spend")

Now, it’s a breeze to get the hang of how the entire model is put together. And if you ever fancy running a specific measure within a particular context, the evaluate_measure function has got your back. You can dial in not only the measure you want to crunch but also tinker with column groupings, apply filters, and more. It’s like having a toolbox of data tricks at your fingertips!

sf.evaluate_measure("Corporate Spend"
        ,measure="Amount"
        ,groupby_columns=["Department[VP]","Scenario[Scenario]"]
        ,filters={"Scenario[Scenario]": ["Actual", "Plan"]})

The example above is a pretty standard approach for folks working with data frames. However, if you’re fluent in DAX, you can send queries directly using the evaluate_dax method. This method lets you harness your DAX skills to craft custom queries and get precisely the results you need. It’s a powerful option for those who are well-versed in DAX.

df_result = sf.evaluate_dax(
    "Corporate Spend",
    """
    EVALUATE SUMMARIZECOLUMNS(
         'Department'[VP]
        ,'Scenario'[Scenario]
        ,"Total Revenue"
        ,CALCULATE([Amount])
        )
    """)

display(df_result)

For me personally, this is one of the most significant game-changers. Using the evaluate_dax method, we can automate data quality tests by comparing measures from the model with data from other layers of our architecture. It has never been this simple!
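
To sketch what such a test could look like: the lakehouse table and column names below are assumptions on my part, as is the convention that evaluate_measure returns the result in a column named after the measure.

import sempy.fabric as sf

# Measure total according to the Power BI model
model_df = sf.evaluate_measure("Corporate Spend", measure="Amount")
model_total = float(model_df["Amount"].iloc[0])

# The same total computed from a (hypothetical) lakehouse table
lake_total = spark.sql(
    "SELECT SUM(Amount) AS total FROM corporate_spend_fact"
).collect()[0]["total"]

# Fail loudly if the two layers diverge beyond a small tolerance
assert abs(model_total - float(lake_total)) < 0.01, (
    f"Model total {model_total} and lakehouse total {lake_total} diverge"
)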

But it can get even easier. As you may be aware, in notebooks, we have magic commands that allow us to switch the default language of a cell to a different one. And now, you can use DAX as the cell language! This means you can seamlessly integrate DAX expressions into your notebook cells, making it even more convenient for DAX experts to work their magic.

%load_ext sempy

Since %%dax is a cell magic, it must appear on the first line of its own cell:

%%dax "Corporate Spend"

    EVALUATE SUMMARIZECOLUMNS(
         'Department'[VP]
        ,'Scenario'[Scenario]
        ,"Total Revenue"
        ,CALCULATE([Amount])
        )

Of course, SemPy offers a plethora of features, and I’d recommend checking out the official library to explore it further and see what it can do for you!

Now, let me show you how you can utilize the native Spark connector. By setting the Spark configuration property spark.sql.catalog.pbi, you can harness standard Spark constructs to interact with Power BI. For instance, you can use commands like SHOW TABLES to work with datasets seamlessly. This integration with Spark opens up a wide range of possibilities for data exploration and analysis.

spark.conf.set("spark.sql.catalog.pbi", "com.microsoft.azure.synapse.ml.powerbi.PowerBICatalog")

With the catalog registered, you can run Spark SQL; since %%sql is also a cell magic, the query goes in a separate cell:

%%sql

SHOW TABLES FROM pbi.`Corporate Spend`

Extracting data from a dataset is a straightforward process. You can retrieve the desired data using standard SQL queries, similar to how you would interact with any other Spark data source. This simplicity makes working with Power BI datasets and performing data selection effortless.

%%sql

SELECT * FROM pbi.`Corporate Spend`.`Fact`
LIMIT 10

Certainly, it’s worth noting that not only Spark SQL but also PySpark constructs are at your disposal through the Native Connector. This means you can leverage the full power of PySpark for data processing and analysis when working with Power BI datasets.

spark.table("pbi.`Corporate Spend`.Fact").printSchema()
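
And since spark.table returns a regular Spark DataFrame, the usual transformations apply. A minimal, schema-agnostic sketch:

# Any PySpark DataFrame operation works on the table exposed by the catalog.
fact = spark.table("pbi.`Corporate Spend`.Fact")
print(fact.count())    # number of rows in the dataset table
fact.limit(10).show()  # quick preview without assuming column names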

Indeed, there are many more possibilities with Semantic Link that I haven’t covered here. Exploring the concept of semantic propagation and the range of semantic functions it offers, along with data augmentation capabilities, is definitely worth delving into. These aspects of Semantic Link can significantly enhance your data workflows and analytical capabilities. For me, it’s a huge game-changer!
