In my latest post, I wrote a few words about the warehouses available within Fabric. Today, I would like to show you an end-to-end analytical scenario with the second most important item available on the platform, which is the Lakehouse. Microsoft’s Fabric Lakehouse serves as a cutting-edge data architecture platform that consolidates the storage, management, and analysis of structured and unstructured data into a unified hub. This dynamic and expandable solution empowers enterprises to efficiently handle vast data volumes through a range of possibilities that come with Spark. The most important thing to keep in mind is that Lakehouse is fully based on the Spark engine. Spark itself is a flexible tool where we can use different languages (Spark, Scala, SQL, or Python) inside notebooks to operate on the files stored within OneLake, which I spoke about last time. In this case, it is similar to the Warehouse, but there is a crucial difference between the two. The Warehouse gives us a full TSQL experience, whereas the Lakehouse is more of a data engineering tool.
One thing that can be misleading is the fact that with the Lakehouse, there is automatically created an SQL Endpoint that, in reality, is a read-only warehouse with synchronized metadata with the Lakehouse. However, I will tell a few words about it later in this article. Let’s get started.
Creating Fabric Lakehouse
Creating a lakehouse is as simple as creating any other artifact within Fabric. We can open the welcome page, and there is a pointer to Synapse Data Engineering.
From there, we can choose Lakehouse (preview):
The initial configuration is nothing more than giving a name to our lakehouse, and all other things will be set up for us automatically.
After a few moments, our lakehouse should be ready. I will go through the available options in a moment, but for now, let’s load some data. In the article about warehouses, I used Data Factory, but in this case, I will choose Dataflows Gen2:
Dataflows Gen2, as the name suggests, is a new version of the well-known Dataflows. It is, of course, based on Power Query, a very user-friendly technology where we can achieve quite complex transformations directly from a graphical interface.
To load the data, I chose “Get Data” -> “SQL Server database”. I have my own sample database from which I will extract data for demo purposes:
Configuring the connection should not be a problem for anyone, and it looks the same as it was in the past:
On the next screen, we can choose which objects we want to load. In this case, we will take all of them one by one, just for simplicity:
After choosing them, each object becomes a query, and each table can be transformed independently. Of course, Power Query is flexible enough to handle more complex transformations, such as aggregations, joins, pivoting, or transposition. However, since we don’t need them for now, I will skip this step.
A more interesting option is available in the Home tab, called “Add data destination.” This option allows us to load ingested data into one of four destinations (currently):
- Azure SQL Database
- Lakehouse
- Azure Data Explorer
- Azure Synapse Analytics
For our purposes, we will choose Lakehouse, of course.
Creating a connection is pretty simple. We just need to chose Data gateway if needed and way of authentication:
When we have defined how we want to connect to the Lakehouse, we can choose which lakehouse (of course, we can have multiple lakehouses) and which table will be our target. We can choose to create a new table (Dataflows will create a new delta table for us behind the scenes) or use an existing one.
After that, we can choose what should happen during the refresh, and we have two options:
- Append – new data will be inserted into the table.
- Replace – old data will be removed, and new data will be inserted.
These two options should be sufficient because we should use Dataflows or Data Factory only for data extraction between the source system and the Warehouse/Lakehouse. For further transformations, we have Notebooks (for Lakehouses) or TSQL code (for Warehouses). As you can see below, not all columns will be mapped; some columns will be discarded due to incompatible data types or encoding.
Under View tab we have possibility to choose Schema view. It is pretty nice because it is graphically shows entire flow which can be important to quickly understand how the flow looks like especially when it is much more complicated than my example:
You may also notice a small Lakehouse icon. When you hover your mouse over it or click on it, you will see the details of the destination connection. We have to repeat those steps for every single table that we want to load.
All the transformations, including the destination configuration, are also available in the standard Query settings window, which is visible on the right when we select a single table.
After publishing our dataflow, we should notice a bunch of new objects within our workspace. Some of them have “DataflowsStaging” in their name, as you can see in the picture below:
Those staging objects that came with dataflows are not important for us at the moment. What’s more important are the other three objects that have been created:
- Lakehouse – an instance of the lakehouse where we can query our structures using notebooks and the Spark engine.
- SQL endpoint – a read-only endpoint that we can use to query lakehouse structures using TSQL.
- Dataset (default) – a default Power BI dataset connected to the above (similar to how it was done for the Warehouse).
The important thing to note is that all of these objects have synchronized metadata. For example, if a new table is added to the Lakehouse, it will be automatically reflected in the SQL endpoint and default dataset. Now, let’s move on to the next step. To load our data correctly into the lakehouse, we need to start our dataflows.
After some time, the data should be in place, and we can open our Lakehouse. As you can see, all the objects are in place, and we have a standard notebook experience where we can write our code.
We can add additional cells to write our code or simply drag the name of a table to the cell, and a simple snippet with spark.sql
will be added for us. We can run the code (CTRL+Enter or click the play button attached to the cell, or use the “run all” button available on top). Then, some amazing things will happen – the Spark cluster will start in seconds, which is a great improvement compared to other Spark implementations where it can sometimes take more than 1 or 2 minutes.
Of course, behind the scenes, Spark is used, so we have access to all the logs, performance metrics, and metadata about our run.
Every notebook has its own default language that is used by default. We can choose from PySpark, Scala, R, and SQL. What’s even more important is that we can change the language locally for a single cell by using a magic command like %%sql. Then, we can use SQL directly without any additional wrappers like spark.sql or anything of that sort.
There are also other magic commands that gives you possibility to change language in a specific cell:
- %%pyspark – python,
- %%spark – scala,
- %%html – html,
- %%sparkr – r.
There is also special markdown cell where you can use this language to describe notebook.
Microsoft has also included a Data Frame Wrapper tool, which allows us to use a graphical interface to transform our Pandas dataframes (important to note that this is a Pandas dataframe, not a PySpark dataframe). We can transform our PySpark dataframe to Pandas using the toPandas() method, and then it will be available for use with the Data Frame Wrapper.
This tool gives us a pretty nice graphical representation of the data, including some profile information.
From the operations menu, we can choose a set of actions that we can perform on top of our data frame. For me, it looks similar to the Power Query experience, and it is very easy to use.
We can choose from a range of built-in functions, and when we are happy with our results, we can add the generated code to the cell of our notebook. It’s a nice way to learn how to perform basic transformations in Pandas!
Our notebook is not a proprietary format of Fabric, but it is compatible with well-known standards. It can be exported to formats such as ipynb, Python, HTML, or LaTeX.
When we are ready with our code, we can save the notebook in the workspace of our choice.
Delta Format & Spark configuration
Behind the scenes, all data registered in the metastore within Microsoft Fabric is stored in the Delta format. As you may already be aware, the primary building block of Delta is the Parquet file format. In Fabric, there have been notable enhancements made in connection to Parquet, and one of these advancements is the introduction of V-Order.
V-Order, implemented by Microsoft, is a write-time optimization specifically designed to enhance the performance of Parquet files when utilized under the Microsoft Fabric compute engines, including Power BI, SQL, Spark, and other compatible platforms. By leveraging V-Order, Fabric enables lightning-fast read operations, resulting in improved query performance and overall data processing efficiency.
For detailed information on the capabilities and advantages of V-Order within the Microsoft Fabric ecosystem, I recommend referring to the official documentation:
V-Order works by applying special sorting, row group distribution, dictionary encoding and compression on parquet files, thus requiring less network, disk, and CPU resources in compute engines to read it, providing cost efficiency and performance. V-Order sorting has a 15% impact on average write times but provides up to 50% more compression
This feature is a game-changer introduced with Fabric as it enables Verti-Scan engines, such as Power BI, to access delta tables with performance comparable to in-memory data access. It is important to note that the application of V-Order on Parquet files does not restrict compatibility with other tools; these files remain fully compatible. V-Order and Verti-Scan are key success factors behind DirectLake, which allows for the direct reading of Delta files from Power BI. However, I will postpone discussing this topic for now, as I intend to explore it in depth in future articles.
Some of you may ask what version of Spark was used and what resources were engaged to run my notebook. It’s a very good question, and we can find the answer in the workspace settings.
As you can see from this section, we can set up the Spark and Delta versions that will be used. Currently, it is Spark 3.3 (which is great news because many features come with this version) and Delta 2.2. In terms of performance, there is the concept of Pools. Currently, we have a pool with 1 to 10 medium-sized nodes that are memory-optimized. After the evaluation period, this pool will resize according to the capacity purchased. This Starter pool has also possibility to auto-scale when needed. Starter pools have Spark clusters that are always on so that is the reason why they are available so fast.
You also have the option to create your own Spark pool and define the specific resources you require. You will only be charged for the resources when they are being used, and they will be automatically removed after a specific Time To Live (TTL) that you have control over. If you are interested in this topic I recommend you to go to the official documentation.
Another important aspect is available on the next page of the Workspace settings, where we can add specific libraries if needed.
There are also some possibilities to add some custom properties for Spark but use them carefully only when needed because they can affect start time of a cluster.
Let’s go back to the workspace and take a closer look at the additional object, which is the SQL Endpoint. I’ve already mentioned it briefly, but let’s examine it in more detail.
As you can observe, the SQL Endpoint is simply an endpoint that allows us to query our lakehouse using SQL. However, it may seem a bit perplexing because you can also write SQL in a Notebook, and you’re absolutely right! But it’s a different flavor of SQL. In the SQL Endpoint, you can write T-SQL, whereas in a notebook, there is a distinct variation of it.
Another aspect that can cause confusion is when comparing the SQL Endpoint with the Warehouse. While they share similarities in concept, the main difference lies in the fact that the SQL Endpoint is tightly connected to the Lakehouse and is strictly read-only. This means we cannot utilize it to write T-SQL statements that modify or save any data, as opposed to the Warehouse, where such operations are indeed possible.
Just like mentioned earlier, we can utilize the SQL Endpoint not only to access and query data from the Lakehouse but also to combine it with data from the Warehouse.
We can always switch between SQL Endpoint and Lakehouse with just few clicks using builtin switch in top left corner:
When we navigate to the settings of a lakehouse, we can find the Connection String for the SQL Endpoint. This endpoint is built on top of TDS (Tabular Data Stream), which means we can use Management Studio to query the lakehouse seamlessly, without any issues. It’s a convenient feature!
Additionally, we have the capability to connect directly to OneLake, which is used to store delta tables behind the scenes. I previously covered this in my last article about the Warehouse, so I won’t repeat it here. However, as you can observe below, each lakehouse has its own folder where data is stored, and it can be accessed quite easily.
There is a lot of functionalities and we only touched the service!
One of the primary inquiries I frequently receive pertains to the concurrency limit and how it functions within the Fabric Lakehouse. In this regard, a queuing mechanism operates on a First-in-First-out (FIFO) basis, with the number of concurrent jobs being contingent upon the available Capacity. To provide an illustrative example, let us consider the smallest F2 Capacity SKU, which permits a maximum of one job to execute concurrently, while accommodating a queue length of four. Conversely, if we opt for the F32 Capacity SKU, eight concurrent jobs can be executed with a queue length of 32.
Furthermore, given the abundance of activities occurring within the workspace, Microsoft has implemented Dynamic Reserve-based throttling for jobs. Within this framework, jobs are categorized into two groups: interactive jobs, encompassing notebooks and lakehouse tasks, and batch jobs. Consequently, specific minimum and maximum reserve boundaries have been established for these job categories. Once again, I strongly recommend studying the documentation to acquire a comprehensive understanding of the intricacies involved (link). Of course those topics are much broader and I will not write all of them here but for sure I will dedicate few articles to data engineering scenarios within Microsoft Fabric.
Naturally, the subjects I’ve touched upon here only scratch the surface of the broader scope encompassed by data engineering within Microsoft Fabric. Consequently, I will be devoting a series of articles to delve deeper into this particular scenario. For now that’s all, I hope you enjoyed the article.
- Executing SQL queries from Azure DevOps using Service Connection credentials - August 28, 2024
- Setup Git credentials for Service Principal in Azure Databricks - August 21, 2024
- Microsoft Fabric 101 Episode 3: Pausing and Scaling using portal and Powershell - August 8, 2024
Last comments