One of the key components of Microsoft Fabric is Synapse Data Warehouse. This entity within the Fabric is specifically designed to facilitate the storage of data within the OneLake concept. It provides a seamless interface for interacting with the data through TSQL commands. If you were to consider Synapse SQL pools, such as Dedicated SQL Pool or Serverless SQL Pool, as the older siblings of the Fabric Data Warehouse, you would be entirely correct. For SQL professionals, working with Fabric Warehouse should be very intuitive and familiar, given their experience with Synapse or SQL Server. It is worth noting that the warehouse is seamlessly integrated with the OneLake concept, meaning that data is not stored in a proprietary format exclusive to the warehouse itself but rather resides within the same storage utilized by other artifacts. Furthermore, Fabric leverages the Delta format as its underlying mechanism.
What is the difference between Fabric lakehouse and Fabric warehouse?
This question may come to your mind, and there are multiple differences as well as some similarities between the two. I believe I will write a dedicated article on this topic, but for now, I will just mention a few factors that can influence your choice between the two:
Data types
A Lakehouse is a Spark-based solution, which means it can handle all types of data, including unstructured data. On the other hand, a Warehouse is best suited for structured data.
Language
A Lakehouse can be developed using Pyspark, Spark SQL, R, or Scala, whereas a Warehouse is a pure TSQL service. This can be an important factor to consider when assessing the skill set within our organization. Please keep in mind that Spark SQL and TSQL are both SQL languages, but they have some differences in syntax.
Things can get even more complicated when you notice that with a Lakehouse, another object called SQL Endpoint is created. This endpoint is essentially a service that allows you to read the Lakehouse using TSQL. However, let’s leave that topic for now and proceed to the demo.
Creating warehouse
Once you log in to Fabric, you can switch to the Data Warehouse experience by selecting it from the bottom left corner.
Once you’re there, you will have a few options to choose from. We will select Warehouse (Preview) to create an empty warehouse. I also recommend checking out Warehouse sample, which will create a warehouse for you with some sample data already inside.
Creating a warehouse is extremely simple because all you need to provide is a meaningful name and an optional sensitivity label (you can find more information about sensitivity labels within Fabric here)
After a while, our Warehouse should be ready. Please take a look at the message at the top, which states that a default Power BI Dataset will be created on top of our warehouse. An important aspect of this dataset is that it will be synchronized with your warehouse. This means that any tables you add to the warehouse will also be visible in the dataset.
The interface to interact with the Warehouse is straightforward. On the left, you have the Explorer window where you can see all the objects grouped into schemas. You will also notice the Queries section, which allows you to save queries within the warehouse.
The Warehouse supports the TDS protocol, which is the same protocol used for SQL Server or Synapse SQL Pools. Therefore, there is no problem connecting to it from tools like Management Studio. How can we do it? When we return to the workspace where our warehouse is saved, we can open the context menu and find the connection string to use:
Connecting is quite simple. We can paste our connection string and choose Azure Active Directory – Universal with MFA as the authentication mechanism. After logging in, we will see all the objects.
When we take a closer look at the workspace, we will also notice the automatically built Dataset (default) on top of our warehouse.
Loading sample data using Fabric Data Pipelines
If you want to quickly load samples into the warehouse, you can choose the “Warehouse sample” option that I mentioned earlier. However, for our purposes, I would like to demonstrate how to load data using Data Pipelines, also known as Data Factory, which is built into Fabric. To do this, let’s switch our experience to Data Factory:
Under this experience, we have two options to choose from: Data pipeline and Dataflow Gen2.
Dataflow Gen2 is an enhanced version of Power BI dataflows, which I will definitely write about in the future. However, for now, let’s select Data pipeline. Creating it is also simplified, and we only need to provide the name.
When you open it, you will see a new look of the pipeline, which also includes some activities that were not available before. We will load data from Azure SQL Database using the standard approach: Lookup to retrieve a list of objects to extract from the database, and Foreach to loop through them and copy from the source to the target warehouse. Let’s drag the Lookup activity and take a look at its configuration:
For those who have previously worked with Data Factory or Synapse Pipelines, the experience will be very similar, although some minor differences may arise. In the Lookup activity, on the Settings page, the first step is to choose the Data store type. This allows us to connect to artifacts within the workspace (such as other Warehouses, Likehouses, etc.) or to an External service. In this case, I selected External and clicked on “New” to create a new connection.
Below, you can see all the supported services for now. I have selected Azure SQL Database:
The configuration is a standard process, so I won’t go too much into the details.
The query inside the lookup should return all the tables from the database, and it appears as follows:
SELECT OBJECT_SCHEMA_NAME(object_id) AS SchemaName, OBJECT_NAME(object_id) AS ObjectName FROM sys.tables
We have Lookup so now let’s add Foreach loop:
Under the Items property, we need to add the output of a lookup. Let’s proceed with that step:
@activity('Get list of tables').output.value
Inside the loop we can add Copy activity:
The Copy activity on the Source side is configured as shown below. Similar to the previous example with the Lookup activity, we have the option to choose the connection to objects stored within the workspace, external ones, or even connect to a Sample dataset (I will revisit this option shortly). I used the same connection that I had previously set up for Azure SQL and inserted a simple expression that generates “SELECT * FROM table” for each table retrieved from the Lookup.
@concat('SELECT * FROM ',item().SchemaName,'.',item().ObjectName)
When considering the sample option, we have the opportunity to utilize datasets provided by Microsoft. These datasets include resources such as the COVID-19 Data Lake, NYC Taxi – Green, Diabetes, Public Holidays, and the Retail Data Model from Wide World Importers. These datasets are particularly valuable for learning and demonstration purposes.
As the destination of our Copy activity, I selected the Workspace and chose my warehouse. I used parameters from the Foreach activity as the target object name.
I also enabled the options “Auto create table on target” and DROP IF EXISTS:
@concat('DROP TABLE IF EXISTS ',item().SchemaName,'.',item().ObjectName)
Everything should work, so I saved it and now I can run it or create a schedule. I ran it once on demand.
After some time, I can view the status of my run in a neat hierarchical view:
If desired, I can also navigate to the Monitoring tab, where all the information about my run is available.
When I return to the Warehouse itself, all my objects are accessible. I can visually open them, and by clicking on a specific table, I can view sample rows.
At the top, I have a few options:
- Get data: A shortcut to create a new pipeline for transferring data.
- New SQL query: Opens a new query window where I can write T-SQL.
- New visual query: Opens the visual query editor similar to Power Query.
- New report: A shortcut to create a new report based on the default dataset created with the warehouse.
- New measure: Allows you to create a DAX measure that will be available for analysis in the default dataset.
I especially recommend exploring the visual query editor, as it provides a Power Query-like experience and is quite powerful, especially for individuals with limited T-SQL knowledge.
Please keep in mind that the visual query editor mentioned above has certain functionalities that can be translated behind the scenes to T-SQL (you may also notice the “View SQL” button in the screenshot). As a result, you may not have access to all the functionalities that you see in the standard Power Query.
When I view the properties of a specific table, I can clearly see a URL similar to Data Lake Storage Gen2 (with a “dfs” subresource in the URL). The relative path indicates that these objects are stored in the following schema:
Tables/Schema/TableName.
The Warehouse editor also provides the possibility to switch between the data warehouse and the default dataset editor. In the default dataset editor, you can define the default model and customize its appearance:
The “Model” is simply an editor for the default dataset.
Other loading methods
Of course, loading via Pipelines is not the only possibility. You also have the option to use the aforementioned Dataflows Gen2 or specific SQL constructs such as:
The last two options are particularly useful when you have multiple warehouses in your Fabric, as you can easily add them to your Warehouse editor.
So, writing cross-database queries using three-part names is possible.
It opens new opportunities from an architectural perspective because I can imagine a standard medallion architecture (bronze-silver-gold) as separate warehouses that have some cross-database queries between them.
If you would like to see how this data is stored behind the scenes, you can use tools like Storage Explorer. However, make sure you have installed the latest version and connect to it using your Azure AD credentials, which are also used in Fabric. It is important to note that OneLake is built on top of Data Lake storage, but not all functionalities have a direct correlation (you can find more information in the documentation).
The URL used for connection is https://onelake.dfs.fabric.microsoft.com/WorkspaceName/. When you connect, you will notice that each warehouse has its own folder.
Inside this folder, you will find several subfolders. The “Tables” folder contains all the managed tables created in the warehouse.
This storage is not fully managed by you – some structures are enforced and managed internally via the Fabric interface, but in certain areas, you can upload data, etc. (I will cover this in an article about OneLake that will be published soon).
TSQL features
If the Warehouse is created for T-SQL developers, it is worth mentioning a few words about feature coverage. I will briefly touch upon the most important aspects for now:
- Data types – most of types known from SQL Server are supported except:
- money/smallmone – use decimal
- datetime and smalldatetime – use datetime2
- nchar and nvarchar – use char and varchar
- text and ntext – use varchar
- image – use varbinary
- collation – for now only Latin1_general_100_BIN2_UTF8 is supported collation.
- primary key/unique constraints – oly supported when NONCLUSTERED and NOT ENFORCED options are used.
- FOREIGN KEY – only supported with NOT ENFORCED OPTIONS (link).
Following structures from SQL Server are not supported at all:
- Computed columns
- Indexed views
- Sequence
- Sparse columns
- Surrogate keys on number sequences with Identity columns
- Synonyms
- Triggers
- Unique indexes
- User-defined types
- Temporary tables
Everything not listed above should be supported by I encourage you to study documentation to see all the details (link).
It is important to be aware of certain limitations in order to properly plan migrations or implementations, especially for those who are transitioning from the SQL Server world
Summary
That’s all for now. I hope my article has provided you with some introductory information about the warehousing possibilities that come with Fabric. I have been playing around with these products for a few months now, and I can say that I am really looking forward to seeing all of it in action in real-world projects! Currently, the entire platform is in preview, so we can test, test, and test again. I encourage you to try it out yourself and send feedback to Microsoft! In my next articles, we will cover many more Fabric functionalities, so stay tuned
- Executing SQL queries from Azure DevOps using Service Connection credentials - August 28, 2024
- Setup Git credentials for Service Principal in Azure Databricks - August 21, 2024
- Microsoft Fabric 101 Episode 3: Pausing and Scaling using portal and Powershell - August 8, 2024
Last comments