One of the most exciting announcements at the Microsoft Build conference was Azure Synapse Link for SQL. Previously, Synapse Link could connect to Azure Cosmos DB and the Dataverse database, but guess what – now it is also available for Azure SQL Database and SQL Server 2022. Let’s check how it works in practice.
First of all, it is worth mentioning that Synapse Link for SQL is not just a connection but a near-real-time replication mechanism between the operational store (Azure SQL Database or SQL Server 2022) and a dedicated SQL pool. According to the official documentation, this mechanism has the following characteristics:
Minimum impact on the operational workload
Synapse Link for SQL extracts data incrementally from the operational store and loads it directly into a Synapse dedicated SQL pool. It is similar to transactional replication, so it works at a lower level than the simple extraction queries we know from standard ETL processes. Extraction is a lightweight process, so it should not affect the normal operational workload.
Reduced complexity with no ETL to manage
Most analytical systems have some sort of landing, staging, or ODS area. In most cases this area keeps data without any additional changes, and further transformations are built on top of it. The standard approach for this kind of load is to prepare Synapse integrated pipelines or Data Factory pipelines that use the Copy activity to copy data one-to-one based on metadata. With Synapse Link for SQL this is not needed, because it is done automatically. Of course, we still have to monitor and troubleshoot the entire process, but the goal can be achieved in a much simpler way.
Near-realtime insights into your operational data
True real time is almost impossible in the BI world, because there is always some delay between the insertion of data in the source system and its appearance in the target system – so we speak of near real time. In this case, replication propagates data to Synapse quickly, and certainly much faster than any ETL process can achieve.
Below you can see a simple architecture view of Synapse Link for SQL:
When you establish a connection between source and target, your data is replicated – initially the data is fully replicated, and afterwards changes are extracted in incremental feeds. The link works at the table level, so you can select only the objects that interest you, and you can add or remove objects later. If you need to pause replication, that is not a problem either. One important thing to mention concerns transactional consistency – keeping consistency across tables is possible, but it is optional. Depending on our needs we can enable it, but please keep in mind that enabling it limits the overall replication throughput.
So much for the theory – let’s go to the portal to check how it works in practice. I have two resources that I created before the demo:
- Azure SQL Database
- Synapse Workspace
Before we create the Synapse Link, we have to prepare one thing on the source side (Azure SQL) – we have to turn on the system-assigned managed identity. Without it, creation of the link will fail:
Enabling it is pretty simple: we just go to the Azure SQL server, and on the Identity tab there is an option to turn it on:
What is a system-assigned managed identity? It is an Azure Active Directory identity assigned to our resource (the SQL server) that can be used for authentication. The great thing about it is that it is passwordless and maintained automatically.
After that, creation should be possible. Let’s go to Synapse Studio, where under the Integrate tab we should find the new option to add a Link connection (Preview):
In the New link connection window we can choose the source type (Azure SQL or SQL Server) and click New, as you can see below:
The creation experience should be very familiar if you have used integrated pipelines before – we have to specify connection details, authorization information, etc.:
After that we can pick all the tables that we want to replicate – in this window we can also preview the data inside:
Of course, not everything can be replicated – there are some limitations; for example, the table row size cannot exceed 7,500 bytes:
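Before selecting tables, it can be handy to check which ones risk hitting that limit. Below is a rough T-SQL sketch based on column definitions; note that columns with `max_length = -1` (varchar(max), nvarchar(max), etc.) are excluded here and would need a separate review:

```sql
-- Approximate maximum row size per table (bytes), from column metadata.
-- MAX-typed columns (max_length = -1) are excluded and should be reviewed separately.
SELECT s.name AS [schema],
       t.name AS [table],
       SUM(CASE WHEN c.max_length = -1 THEN 0 ELSE c.max_length END) AS approx_row_bytes
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
JOIN sys.columns c ON c.object_id = t.object_id
GROUP BY s.name, t.name
HAVING SUM(CASE WHEN c.max_length = -1 THEN 0 ELSE c.max_length END) > 7500
ORDER BY approx_row_bytes DESC;
```

This is only an approximation of the declared maximum row size, not the actual on-page size, but it quickly flags candidates to exclude from the link.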
The full list of limitations for this preview version can be found here. Next, we have to set up the compute power that we want to use for replication – as you can see below, we have four different options here:
It is good to start small and increase it later if the provided throughput turns out to be too low. Of course, you have to pay for this reserved compute power. The last step is the mapping between objects:
By default, all target objects will have the same schema and object names as the source objects. Please ensure that the target schema exists. On this screen you can also choose the distribution and the type of target index. For demo purposes I put all the tables in the dbo schema, distributed using round-robin and stored as heaps.
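To illustrate what those settings mean, a table created with them in the dedicated SQL pool would look roughly like this (the table and column names are hypothetical):

```sql
-- Hypothetical target table with the settings chosen above:
-- round-robin distribution and a heap (no clustered index).
CREATE TABLE dbo.SalesOrder
(
    SalesOrderID INT            NOT NULL,
    OrderDate    DATE           NOT NULL,
    TotalDue     DECIMAL(19, 4) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);
```

Round-robin with a heap is a reasonable default for a landing area; for large fact tables that you query heavily, hash distribution and a clustered columnstore index are usually better choices.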
When all the objects were ready, I started the link; as you can see below, replication started after a few minutes:
I checked whether all the data replicated correctly by executing a query against both the source and the target, and as you can see below, everything works as expected:
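A simple sanity check is to compare row counts on both sides (the table name is just an example):

```sql
-- Run against both the source Azure SQL database and the target dedicated
-- SQL pool; the counts should match once the initial snapshot has completed.
SELECT COUNT(*) AS row_count
FROM dbo.SalesOrder;
```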
A very interesting thing can be noticed in the source Azure SQL database – three additional tables appeared in the “changefeed” schema:
These tables are probably used to keep track of the configuration of the mechanism and the status of replication. Please keep in mind that you can pause your Synapse dedicated SQL pool, but you should avoid stopping your link connection.
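You can inspect these system objects yourself; since the exact table names may change during the preview, this query simply enumerates whatever the feature created in that schema:

```sql
-- List the tables that Synapse Link created in the "changefeed" schema.
SELECT s.name AS [schema],
       t.name AS [table],
       t.create_date
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
WHERE s.name = 'changefeed'
ORDER BY t.name;
```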
I performed an additional test – I prepared a simple script that adds new rows to the source table in a loop, and then checked whether the data was replicated to Synapse. It works pretty well:
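A sketch of such a test script, assuming the same hypothetical dbo.SalesOrder table, could look like this:

```sql
-- Insert one row per second for 100 iterations into the source table,
-- then watch the rows appear on the Synapse side.
DECLARE @i INT = 0;
WHILE @i < 100
BEGIN
    INSERT INTO dbo.SalesOrder (SalesOrderID, OrderDate, TotalDue)
    VALUES (@i + 1000, GETDATE(), RAND() * 100);

    SET @i += 1;
    WAITFOR DELAY '00:00:01';
END;
```

Re-running the row-count query on the target while this loop runs gives a good feel for the actual replication latency.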
In what scenarios can this feature be useful? First of all, imagine a situation where you want specific data in your data warehouse as fast as possible. Instead of extracting and loading it in a loop, Synapse Link can be a much better solution. The same applies if you have to replicate data one-to-one – instead of using Data Factory or integrated pipelines, you can turn on Synapse Link. It can be cheaper and simpler to maintain, and, maybe even more importantly, we don’t have to worry about how to extract only the rows that changed since the last load – it is done automatically. Money is always an aspect to consider – for now I cannot say how much it will cost, but there will certainly be some cost associated with it, because we are reserving compute power for replication. Still, it is always good to have a technology like this available for the right scenarios. As I said at the beginning, not only Azure SQL Database but also SQL Server 2022 can be used as a source for replication – it works slightly differently, so maybe I will write a separate article about it.
For now, that’s all – see you!
Official documentation: link.
Hi Adrian.
Did you check how many connections the ‘link’ consumes when replicating to the desired target database using this 4+4 core option? Of course, this is something I would track/configure when it comes to the concurrency limits in Synapse, and I’m curious how it forks processes internally to replicate many objects 🙂
Hi Michał, yes, the link consumes concurrency slots like any other client tool 🙂