You might wonder why you need to configure any refresh operation at all if Direct Lake doesn't copy any data into your dataset. Let me explain in a few simple steps. First, I created a lakehouse and a test table called "sales" using SQL and PySpark:
spark.sql("""
CREATE TABLE sales (
    id INT,
    product_id INT,
    quantity INT,
    sale_date DATE
)
""")
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.functions import current_date

# Schema for the rows we are going to insert (sale_date is added afterwards)
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("product_id", IntegerType(), True),
    StructField("quantity", IntegerType(), True)
])

rows = [
    Row(1, 101, 5),
    Row(2, 102, 3),
    Row(3, 103, 2),
    Row(4, 104, 1),
    Row(5, 105, 10)
]

# Build the DataFrame, stamp today's date and append it into the Delta table
df = spark.createDataFrame(rows, schema).withColumn("sale_date", current_date())
df.write.mode('append').insertInto("sales")
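To confirm that the rows actually landed in the Delta table, a quick sanity check in the same notebook will do:

# The table should now contain the five inserted rows
spark.sql("SELECT * FROM sales ORDER BY id").show()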
To demonstrate the specific refresh behavior of Direct Lake datasets, we need to create a custom dataset. Let's switch to our lakehouse and create it there. Since not everything is currently supported in Power BI Desktop, I will use the web interface available via the Fabric portal. I've created a semantic model that contains only one table:
Once our model is ready, we can open it. After selecting "Semantic model," we can set the "Direct Lake Behavior" property. Here we will choose "Direct Lake Only" to disable the fallback to DirectQuery mode; this ensures that every interaction with the model will use Direct Lake.
Now let's insert a few more rows into the table:

rows = [
    Row(6, 101, 5),
    Row(7, 102, 3),
    Row(8, 103, 2)
]

df = spark.createDataFrame(rows, schema)
df = df.withColumn("sale_date", current_date())
df.write.mode('append').insertInto("sales")

When we look at the report, the new rows appear almost immediately, even though we never triggered a refresh ourselves.
This is because behind the scenes, Fabric "refreshes" the Direct Lake model to point to the newest Delta table version. We can turn off this automatic refresh in the model's settings by disabling "Keep your Direct Lake data up to date". Notice that, also for this type of model, you can configure a schedule for when the refresh will happen:
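You can also trigger such a refresh yourself from a notebook. Below is a minimal sketch using the semantic-link (SemPy) library available in Fabric; the model name "sales_model" is just a placeholder for your own semantic model:

import sempy.fabric as fabric

# Re-frame the Direct Lake model so it points to the newest
# Delta table version ("sales_model" is a placeholder name)
fabric.refresh_dataset("sales_model")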
To test whether this works, we will delete all the rows from the table and then run a simple SELECT to confirm that nothing is returned:
%%sql
DELETE FROM sales
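And the check itself, which comes back empty:

%%sql
SELECT * FROM sales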
When we check what happened on the report side, the data is still accessible without any problems. Why is that? Because Direct Lake is pointing to a specific version: in this case, the version from "before" the delete statement. As you may know, the Delta format itself works on versions, and every operation (insert, update, delete) performed on a table creates a new one. If you want to learn more about this, I recommend my articles dedicated to the Delta format (link).
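You can see those versions directly in the lakehouse. Delta's history command lists every operation together with the version it created:

%%sql
-- Each insert/delete shows up as a separate table version
DESCRIBE HISTORY sales

And Delta time travel lets you query an older version, just like the report still does (the version number below is illustrative, pick one from the history output):

%%sql
-- Read the table as it looked before the delete
SELECT * FROM sales VERSION AS OF 2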
It is because our model still points to an older version of the Delta table, and the only way to update those pointers is to "refresh" the model. As you can see, this operation has nothing to do with "copying" or "loading" data; it only updates the reference to the newest Delta table version. Some of you may ask why we have control over it at all. There are scenarios where it is genuinely useful, for example:
– Multiple loading processes: you can refresh the model once, when all loads have finished, instead of after every single one (see the sketch after this list).
– You are refreshing structures behind the scenes, but the report should not reflect those changes immediately.
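For the first scenario, the pattern is straightforward: run all the loads, then refresh once at the end. A minimal sketch, again using semantic-link; the loading functions and the model name are placeholders for your own ETL logic:

import sempy.fabric as fabric

# Hypothetical loading steps - replace with your actual ETL code
load_sales()
load_products()
load_customers()

# One refresh after all loads, instead of one per load
fabric.refresh_dataset("sales_model")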
So please remember that the refresh operation for Direct Lake models is pretty lightweight, and there is no data copying involved; it simply repoints the model to the latest Delta table version. I hope you enjoyed the article; the next one will appear pretty soon.