In this article, I would like to mention 8 Power BI features that every data architect should know when building analytical systems on a bigger scale including not only reports themselves but also data warehousing solutions like Synapse Analytics. Synapse Analytics itself is integrated with Power BI – for example, you have a unified experience during authoring Power BI objects inside the Synapse Studio but when building a bigger solution you should consider much more than that. Because of that, I decided to prepare a list that contains a few options that are with us for some time and should be considered during every deployment that is involving not only Power BI itself but also other services. The list is subjective and is based on my personal experience.
1) Import vs Direct Query
The first point sounds like a fundamental thing that everyone knows but in my opinion it is only partially true. From my experience, I can say that Import mode is treated as a default option (which is good!) and people are afraid of Direct Query even if it will be a better option in some scenarios! Just to remind you Direct Query is a method of retrieving data directly from the data source. Important thing is that this retrieving is taking place during the query execution. In this mode Power BI is not storing data but sending requests to the data source and pulls it from there. You can imagine that when you will connect PBI to Synapse then your DAX measures etc. will be translated to SQL and Synapse will take the burden of those calculations.
In many scenarios Import mode is much better from a performance point of view but there are situations when the data size is way too big to keep in the model. Another example is a situation when our data source is changing all the time and we have to reflect those changes on our reports as soon as possible. In such scenarios, Direct Query can be a much better option.
Direct Query should be considered especially when Synapse Dedicated SQL Pool is our data source, Synapse can handle much larger volumes of data, and keeping such volumes in Power BI service is not always a good option for many reasons. Please keep in mind that your report in DQ can be as fast as your source so you have to remember about things like indexing, caching etc. It is crucial for you to remember that your model in DQ will be much faster if you will have all tables already modeled and you don’t need any additional transformations in Power Query.
One important note that I would like to mention here is the fact that we should think twice if we want to use Direct Query to Synapse SQL Serverless. Why? Because Power BI can query data source multiple times and every “click” on the report can generate additional queries, in Synapse Serverless there is no such thing like cache so it can be pretty costly.
Don’t forget about this possibility and choose the right option for the job!
Documentation: link
2) Composite model
This mode is strictly connected to the previous point. Let’s say you have a bunch of standard-size tables in your Synapse that are refreshed once per day and one huge fact table that loads every hour. Should we go with Direct Query in such scenario? It depends! One of the possibilities is Composite model – a model that combines data from more than one DirectQuery source or that combines DirectQuery with import data is called a composite model.
It is definitely worth to consider when we have multiple tables in a model and they have different requirements regarding refreshing process. When you have multiple data sources I would prefer to integrate them in datawarehouse first to avoid problems on Power BI side.
Please keep in mind one thing that is already mentioned in the documentation:
You can create relationships between tables as you always have, even when those tables come from different sources. Any relationships that are cross-source are created with a cardinality of many-to-many, regardless of their actual cardinality. You can change them to one-to-many, many-to-one, or one-to-one. Whichever cardinality you set, cross-source relationships have different behavior. You can’t use Data Analysis Expressions (DAX) functions to retrieve values on the one
side from the many
side. You may also see a performance impact versus many-to-many relationships within the same source.
Still I think every architect or modeler should remember this option.
Documentation: link
3) Incremental refresh
Incremental refresh is hard. In many projects I saw that data was transferred fully from sources up to the models and it is ok until it is a problem 🙂 In enterprise-size models, it can be hard to load everything fully especially when you have tons of data to reload. In those situations, Incremental refresh can be handy. It is giving us possibility to partition the dataset and then refresh only subset of partitions. Power BI gives us a simplified user interface to deal with that but the best effect we get when we will deal with partitions on our own via XMLA Endpoints.
When should we consider incremental refresh? When we see that time needed to refresh is reaching the limit or when we predict that data size can be big enough to reload it in incremental manner.
Documentation: link
4) Hybrid tables
During the good old times, we had Analysis Services Multidimensional. If you are old enough to remember these times you can also remember terms like ROLAP, HOLAP and MOLAP. In a nutshell, those terms refer to the situation where different partitions of a table can query data sources directly or cache data on the Analysis Services side. It was a pretty useful feature and guess what – similar approach we have in Power BI and it is called Hybrid tables. In reality, this feature is another step of Composite Models and Incremental Refresh features mentioned above but it is definitely worth mentioning in the dedicated paragraph.
In previous points, I said that we can have mix of tables in a model where one table can be in Direct Query and another one can be in Import. Hybrid tables go another step further and give the possibility to have partitions in Direct Query or Import within a single table! When it can be useful? Let’s say we have a fact table where only the current month is still open and can change and the rest of the months are closed and by definition cannot change. We can say that part that can change is “hot” and the rest is “cold”. In such scenario, it make no sense to refresh entire table if only one month is still open. Previously we had no other choice but currently, Hybrid tables gives us this possibility to refresh only what is needed.
Another example will be opposite to teh above. Let’s say that current month is queried by the users for 90% of time, older months are queried only in 10%. What we can do with it? Partition our table with two partitions and old months partition should be in Direct Query (to keep the model small) and current month is queried frequently so to give best performance we can keep it in Import Mode. Sounds cool? It is!
Documentation: link
5) Aggregations
When I am thinking about aggregations I have the same filling that I have with Hybrid tables – I am thinking about Analysis Services in the Multidimensional mode that contains pre-aggregated values to speed up queries. Aggregations available in Power BI play a similar role – some data can be aggregated and loaded to the model and very detailed data can be in a separate table in Direct Query mode. You can imagine the situation when you have 100 milion fact table with invoice data and you are showing on the reports aggregated data on a day level but you have requirement that there must be option to drill-through to the single line of an invoice. It sounds like a perfect match for aggregations. Most of a queries will hit aggregation table that is in Import mode and if needed single line will be retrieved directly from data source like Synapse. This feature can be extremely useful but many times designers forget or don’t know about it so don’t make such mistake and always consider it in such scenarios.
Documentation: link
6) Deployment pipelines
Deployments of a reporting solution are always a challenge. In the past, it was extremely hard or even impossible to automate deployments in Power BI environments. Thankfully right now we have a feature called Deployment pipelines that gives us the possibility to deploy Power BI objects between workspaces that act as different environments. Additionally, it gives us the possibility to graphically compare objects between workspaces, and what is even more important to change parameter values or data source connection strings so when deployment is taking place we can use different connections on different workspaces. I didn’t mention about REST API that comes with this feature – for some time now we have possibility to automate everything via REST API so it can be fully integrated with our CICD pipelines. Automation is always a good thing so definitely, all of us must be familiar with it.
Documentation: link
7) Refresh datasets via API
How do you refresh data in Power BI? From my experience, I must say that most of the time built-in schedule is used. Is there anything wrong with that? Of course not! But if you have an automated solution where Power BI is fully integrated with ETL processes it is good to align both. Fortunately, we can do it without any problem – you can authenticate to Power BI using Service Principal and send REST API request that will initiate the refresh of a dataset. Why we need something like that? There is multiple reasons for example:
- you want to have single tool that will orchestrate everything like Azure Data Factory,
- you don’t know exactly when your ETL process will end so you want to trigger dataset refresh as fast as possible when it will end,
- you don’t want to query your data warehouse (for example Synapse) from Power BI during ETL.
Personally, I am always trying to convince my customers that official reports should be refreshed from ETL.
Documentation: link
8) Power BI Datamarts
This is the new guy in town. Datamart is nothing but a fully managed database solution that bridge the gap between business users and IT. With this feature you can have a simple and no-code experience to perform self-service ETL using Power Query and load it to Azure SQL Database built-in into the Power BI portal. You can also query this database using external tools but the main purpose of it is to be source for your PBI solution.
When it can be useful? In many self-service scenarios for example when you want to build a departmental self-service solution or give some independent experience for teams without waiting for IR. For know this feature is in preview and has a few limitations but still worth considering especially that you can use it in much more complicated scenarios with for example Synapse in Hub and Spoke architecture where Dedicated SQL Pool acts as a Hub and PBI Data Mart can take data from hub. I definitely recommend overview videos prepared by the Synapse team on YT: link
Documentation: link.
Is it everything? Of course not! There is multiple other features that can help you with your deployments that I didn’t list above like embedding, dataflows, datasets certification or purview integration just to name few but as I said, in the beginning, this list is subjective and reflects my experiences. the only thing that is important is that Power BI is becoming more and more mature so our life is much easier. Thanks!
- Executing SQL queries from Azure DevOps using Service Connection credentials - August 28, 2024
- Setup Git credentials for Service Principal in Azure Databricks - August 21, 2024
- Microsoft Fabric 101 Episode 3: Pausing and Scaling using portal and Powershell - August 8, 2024
Last comments