Why should I ‘refresh’ Direct Lake models in Microsoft Fabric?

You’ve probably heard about a new mode for datasets/semantic models called Direct Lake. This mode combines the advantages of the existing modes: the performance of Import mode and the freedom from copying and refreshing data that DirectQuery offers. It works by reading the Delta files directly from OneLake, without any extra steps, which means you can analyze the data more quickly without having to copy it into Power BI first. It’s really exciting to see it in action, especially because it can save time by not needing to refresh tons of models. What’s even more important is that by not needing to refresh, we can save Capacity Units, which are so important in our Fabric world.

So it might surprise you to see this:

or this:

You might wonder why you need to configure any refresh operation if Direct Lake doesn’t copy any data into your dataset. Let me explain this in a few simple steps. First of all, I created a lakehouse and a test table called “sales” using Spark SQL from a PySpark notebook:

spark.sql("""
CREATE TABLE sales (
    id INT,
    product_id INT,
    quantity INT,
    sale_date DATE
)
""")

We need some data, so I also inserted a few rows. I used this simple script, but feel free to use any technique you prefer, such as SQL, Dataflows, pipelines, etc.:

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.functions import current_date

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("product_id", IntegerType(), True),
    StructField("quantity", IntegerType(), True)
])

rows = [
    Row(1, 101, 5),
    Row(2, 102, 3),
    Row(3, 103, 2),
    Row(4, 104, 1),
    Row(5, 105, 10)
]

df = spark.createDataFrame(rows, schema).withColumn("sale_date", current_date())

df.write.mode('append').insertInto("sales")
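If you want to double-check that the rows landed before building anything on top of them, a quick query in the same notebook is enough (just a sanity check against the table created above):

# Sanity check: the five freshly inserted rows should come back
spark.sql("SELECT * FROM sales ORDER BY id").show()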

To demonstrate the refresh behavior of Direct Lake semantic models, we need to create a custom semantic model. Let’s switch to our lakehouse and create it there. Since not everything is supported in Power BI Desktop yet, I will use the web interface available via the Fabric portal. I’ve created a semantic model that contains only one table:

Once our model is ready, we can open it. When we select “Semantic model,” we can set the “Direct Lake Behavior” property. Here, we will select “Direct Lake Only” to disable the fallback to DirectQuery mode. I chose that to ensure that every interaction with the model uses Direct Lake instead of silently falling back.
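If you prefer to verify the model from a notebook instead of the portal, semantic-link (sempy) can list it together with its tables. This is a minimal sketch, assuming the library is available in your Fabric notebook and that the model is named “sales_model” (adjust the name to whatever you chose):

import sempy.fabric as fabric

# List the semantic models in the current workspace...
print(fabric.list_datasets())

# ...and the tables of our model - it should contain only "sales"
# "sales_model" is an assumed name, replace it with your own
print(fabric.list_tables("sales_model"))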

The model is ready, so let’s create a new report based on it by selecting “New report” from the ribbon.

For demonstration purposes, I created a simple table and slicer. As you can see in the picture below, it reflects exactly the data that resides on OneLake:
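You can also query the model straight from the notebook to confirm that it returns the same rows that sit in OneLake. Again a sketch based on semantic-link; the model name “sales_model” and the DAX query itself are my assumptions:

import sempy.fabric as fabric

# Run a simple DAX query against the Direct Lake model and
# compare the result with the raw Delta table in the lakehouse
result = fabric.evaluate_dax(
    "sales_model",                  # assumed model name
    "EVALUATE TOPN(10, 'sales')"    # return up to 10 rows from the sales table
)
print(result)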

When we go back to the notebook, we can insert an additional three rows:

rows = [
    Row(6, 101, 5),
    Row(7, 102, 3),
    Row(8, 103, 2)
]

df = spark.createDataFrame(rows, schema)
df = df.withColumn("sale_date", current_date())
df.write.mode('append').insertInto("sales")

When we open the report again, we will see eight records, as expected. Information about this refresh is not visible in the graphical interface, but if you are interested in the details, you can run a trace in Profiler to see what is happening behind the scenes:

This is because behind the scenes, Fabric “refreshes” the Direct Lake model so that it points to the newest version of the Delta table. We can turn off this automatic refresh in the settings of the model by disabling “Keep your Direct Lake data up to date”. Notice that even for this type of model you can configure a schedule for when the refresh should happen:
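Once automatic refresh is disabled, you decide when the model picks up the newest Delta version. Besides the schedule shown above, you can trigger the same operation on demand, for example from a notebook with semantic-link. A minimal sketch, again assuming the model is named “sales_model”:

import sempy.fabric as fabric

# Ask the service to "reframe" the Direct Lake model, i.e. update its
# pointers to the latest version of the Delta table - no data is copied
fabric.refresh_dataset("sales_model")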

To test whether it works, we will delete all the rows from the table. As you can see, a simple select statement now returns no rows:

%%sql

DELETE FROM sales

When we check what happened on the report side, the data is still accessible without any problems. Why is that? Because Direct Lake is pointing to a specific version – in this case, the version from “before” the delete statement. As you probably know, the Delta format itself works with versions, and every operation (insert, update, delete) performed on a table creates a new one. If you want to learn more about this, I recommend my articles dedicated to the Delta format (link).
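You can see these versions for yourself directly in the notebook. The sketch below uses standard Delta Lake commands: DESCRIBE HISTORY lists every version of the table, and time travel lets you read an older one, which is essentially what the not-yet-refreshed model is doing:

# Every operation on a Delta table creates a new version
spark.sql("DESCRIBE HISTORY sales") \
    .select("version", "timestamp", "operation") \
    .show(truncate=False)

# Time travel: read the table as it looked before the DELETE
# (the version number is an example - pick the right one from the history above)
spark.sql("SELECT * FROM sales VERSION AS OF 2").show()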

So the report still works because our model is pointing to an older version of the Delta table, and the only way to update those pointers is to “refresh” the model. As you can see, this operation has nothing to do with “copying” or “loading” the data; it only updates the reference to the newest Delta table version. Some of you may ask why we have control over it at all. I would say there are scenarios where it is useful, for example:

– Multiple loading processes: you can refresh the model once all loads have finished, rather than after every single one (see the sketch below).
– You refresh structures behind the scenes, but the report should not reflect the changes immediately.
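To illustrate the first scenario, this is roughly how an orchestration step in a notebook could look: run all the loads first and reframe the model only once at the end. The load functions below are placeholders, and the model name is again an assumption:

import sempy.fabric as fabric

def load_sales():
    # placeholder for your real loading logic (Spark job, pipeline trigger, etc.)
    ...

def load_products():
    # placeholder for another loading process
    ...

# Run every load first...
load_sales()
load_products()

# ...and update the Direct Lake pointers only once, at the very end
fabric.refresh_dataset("sales_model")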

So please remember that the refresh operation is pretty lightweight, and there is no data copying involved in it. I hope you enjoyed the article; the next one will appear pretty soon.
