Purview is one of the most needed services in the Azure cloud. It gives us the opportunity to scan, classify, and govern data assets within our organization. This service is constantly developing, so we can notice more and more valuable tools. One of the main questions that I hear from my customers is about data lineage because they want to track how data flows through their processes. In this article, I would like to show you how to display lineage in Azure SQL using the relatively new ‘Lineage extraction’ feature. The described feature is not a simple static data extraction but a dynamic one. What does this mean? While it is technically possible to write a TSQL syntax parser and generate lineage based on it, such analysis cannot describe lineage based on things like parameter values that are not known until execution. The same idea applies to conditional logic inside the code that depends on runtime values, as the lineage can change from one execution to another. The table below illustrates the differences between static code analysis and analysis of runtime values
As you can see, almost everything can be extracted based on runtime analysis – Let’s explore how it works! Firstly, let’s set up Purview. I didn’t demonstrate how to do it on my blog, so let’s go through the process now. We need to search for the Purview service and create a new one:
Next, we need to provide some basic information about the service:
On the following page, we have networking settings where we can configure how to access Purview. For the purposes of this demo, we won’t be restricting access in any way.
Under the configuration tab, you can add Kafka configuration, but for the purpose of this demo, we will skip this part.
After the Purview resource is ready, we can open Purview Studio and navigate to Data Map -> Sources. From there, we can register a new data source:
Select Azure SQL Database from the available connectors:
Providing access is a simple process. You just need to provide the following information:
- Name: A simple name to identify the source.
- Azure Subscription: The subscription where the resource is located.
- Server name: The name of the SQL Server.
- Endpoint: The URI of the server.
- Select a collection: The Purview collection that should be used for registration.
- Data use management: Purview has the possibility to define a Data Access Policy for Azure SQL. For now, leave it disabled, but maybe I will write a few words about it in the future.
We have successfully added Azure SQL Database as a source for our Data Map inside the Purview. This means that we can now discover and manage data from the Azure SQL Database within the Purview environment:
Now we can set up a scan. Please be aware that the Purview managed identity should have access to the source Azure SQL in order to read metadata and runtime values. Particulary it should be added to the db_owner role. Other authentication methods are allowed, but in most cases, MSI should be an option. Also, please note that Lineage Extraction is optional, so please turn it on.
When setting up a scan for Azure SQL Database in Purview, it’s important to note that scoped scans are allowed. This means that we have the option to choose which objects we want to scan, such as tables, views, or stored procedures, rather than scanning the entire database. This can help to reduce the scan time and focus on the specific data objects that are most important to us. Therefore, we should carefully consider which objects to include in the scan scope based on our data management policies and compliance requirements.
After selecting the objects to scan for Azure SQL Database in Purview, we will be prompted to choose a scan ruleset on the next screen. A ruleset is a collection of scan rules that define the criteria for identifying scanned objects. For the purpose of this demo, we will use the default ruleset, which provides a good enough solution.
After setting up the scan ruleset for Azure SQL Database in Purview, we have the option to trigger the scan based on a schedule. However, for the purpose of this demo, we will only scan the objects once to demonstrate the scan process and view the results. Please keep in mind that in a real-world scenario, we should schedule the scans based on the frequency of data changes and business requirements.
Once we have confirmed that everything is set up correctly, we can save the scan configuration and run the scan (please be aware that for now only one scan with lineage extraction enabled is allowed):
We can navigate to the Monitoring tab to track the progress of the scan. Depending on the size of the database and the scope of the scan, the scan may take some time to complete.
Please notice that single scan has two operations performed:
– Manual is standard metadata scanning operation (it is manual because I triggered it manually) – Lineage extraction is additional step when runtime is checked to generate lineage. After the scan is complete, we can view the scan results in the Purview data map. The data map will display all the tables identified during the scan as assets, along with associated metadata such as schema. This allows us to easily discover the data assets, with the ability to search, filter, and analyze, we can quickly locate the data assets we need and collaborate with others to ensure that they are being used effectively.
When we select a single table from the data map in Purview and navigate to the Lineage tab, we may notice that the tab is empty. This is because, as mentioned earlier in this article, lineage information is extracted based on runtime values. To obtain information about lineage, we need to execute procedures that load it:
To test lineage I added two objects – first it will be source_customer table generated based on SalesLT.Customer:
SELECT TOP 0 NameStyle, Title, FirstName, MiddleName, LastName, Suffix, CompanyName, SalesPerson, EmailAddress, Phone, PasswordHash, PasswordSalt, rowguid, ModifiedDate INTO dbo.source_customer FROM SalesLT.Customer
Then I added two sample rows to this table:
INSERT INTO dbo.source_customer (NameStyle, Title, FirstName, MiddleName, LastName, Suffix, CompanyName, SalesPerson, EmailAddress, Phone, PasswordHash, PasswordSalt, rowguid, ModifiedDate) VALUES (1, 'Mr.', 'John', 'A.', 'Doe', NULL, 'ABC Company', 'Jane Smith', 'john.doe@example.com', '123-456-7890', 'hjg56dgfkj34r5gf', 'hjg56', NEWID(), GETDATE()); INSERT INTO dbo.source_customer (NameStyle, Title, FirstName, MiddleName, LastName, Suffix, CompanyName, SalesPerson, EmailAddress, Phone, PasswordHash, PasswordSalt, rowguid, ModifiedDate) VALUES (0, 'Ms.', 'Jane', NULL, 'Smith', NULL, 'XYZ Inc.', NULL, 'jane.smith@example.com', '555-555-1212', 'abc123', 'abc123', NEWID(), GETDATE());
We have our data in place so let’s create stored procedure that will load it:
CREATE PROC SalesLT.uspLoadCustomer AS BEGIN INSERT INTO SalesLT.Customer ( NameStyle,Title,FirstName,MiddleNAme,LastNAme,Suffix,CompanyNAme,SalesPerson ,EmailAddress,Phone,PasswordHash,PasswordSalt,rowguid,ModifiedDate ) SELECT NameStyle,Title,FirstName,MiddleNAme,LastNAme,Suffix,CompanyNAme,SalesPerson ,EmailAddress,Phone,PasswordHash,PasswordSalt,rowguid,GETDATE() AS ModifiedDate FROM dbo.source_customer END GO
In the end I executed procedure to load data from source to target:
dbo.uspLoadCustomer
I executed scanning one more time and I received my lineage:
Looks pretty nice and it gives overview of the data flow logic but more interesting information can be found when we go to properties where under Related assets Stored Procedure Statement Operations is located:
After clicking it we can see a lot of useful information including:
- columnMapping -source to sink mapping in JSON format,
- inputs – list of input tables,
- lineageParentQN – URI to parent object,
- lineageParentType – type of parent object,
- outputs – target table,
- qualifiedName – qualified name of this procedure execution,
- queryHash – SQL query hash,
- queryText – SQL query textm
- statementType – type of operation that query executes.
Very broad information. Let’s complicate it a little and add more complicated transformation and additional table:
ALTER PROC SalesLT.uspLoadCustomer AS BEGIN INSERT INTO SalesLT.Customer ( NameStyle,Title,FirstName,MiddleNAme,LastNAme,Suffix,CompanyNAme,SalesPerson ,EmailAddress,Phone,PasswordHash,PasswordSalt,rowguid,ModifiedDate ) SELECT * FROM ( SELECT TOP 1 NameStyle,Title,FirstName,MiddleNAme,LastNAme,Suffix,CompanyNAme,SalesPerson ,EmailAddress,Phone,PasswordHash,PasswordSalt,rowguid,GETDATE() AS ModifiedDate FROM dbo.source_customer WHERE FirstName='John' ORDER BY LastNAme ) as q1 UNION SELECT * FROM ( SELECT TOP 1 NameStyle,Title,FirstName,MiddleNAme,LastNAme,Suffix,CompanyNAme,SalesPerson ,EmailAddress,Phone,PasswordHash,PasswordSalt,rowguid,GETDATE() AS ModifiedDate FROM dbo.source_customer2 WHERE FirstName = 'Jane' ORDER BY LastNAme DESC ) as q2 END GO
Lineage generated correcly and now shows two source tables:
[ { "DatasetMapping": { "Source": "mssql://elit-sample-db.database.windows.net/sample/dbo/source_customer", "Sink": "mssql://elit-sample-db.database.windows.net/sample/SalesLT/Customer" }, "ColumnMapping": [ { "Source": "CompanyNAme", "Sink": "CompanyNAme" }, { "Source": "EmailAddress", "Sink": "EmailAddress" }, { "Source": "FirstName", "Sink": "FirstName" }, { "Source": "LastNAme", "Sink": "LastNAme" }, { "Source": "MiddleNAme", "Sink": "MiddleNAme" }, { "Source": "NameStyle", "Sink": "NameStyle" }, { "Source": "PasswordHash", "Sink": "PasswordHash" }, { "Source": "PasswordSalt", "Sink": "PasswordSalt" }, { "Source": "Phone", "Sink": "Phone" }, { "Source": "SalesPerson", "Sink": "SalesPerson" }, { "Source": "Suffix", "Sink": "Suffix" }, { "Source": "Title", "Sink": "Title" }, { "Source": "rowguid", "Sink": "rowguid" } ] }, { "DatasetMapping": { "Source": "mssql://elit-sample-db.database.windows.net/sample/dbo/source_customer2", "Sink": "mssql://elit-sample-db.database.windows.net/sample/SalesLT/Customer" }, "ColumnMapping": [ { "Source": "CompanyNAme", "Sink": "CompanyNAme" }, { "Source": "EmailAddress", "Sink": "EmailAddress" }, { "Source": "FirstName", "Sink": "FirstName" }, { "Source": "LastNAme", "Sink": "LastNAme" }, { "Source": "MiddleNAme", "Sink": "MiddleNAme" }, { "Source": "NameStyle", "Sink": "NameStyle" }, { "Source": "PasswordHash", "Sink": "PasswordHash" }, { "Source": "PasswordSalt", "Sink": "PasswordSalt" }, { "Source": "Phone", "Sink": "Phone" }, { "Source": "SalesPerson", "Sink": "SalesPerson" }, { "Source": "Suffix", "Sink": "Suffix" }, { "Source": "Title", "Sink": "Title" }, { "Source": "rowguid", "Sink": "rowguid" } ] } ]
As you can see we in mapping json we have now two mappings because we have two source tables. Pretty interesting and can be very useful.
Data lineage is an important aspect of data management that provides information on how data flows through different systems. It helps organizations to track and trace data from its origin to its destination, ensuring that data remains accurate and trustworthy throughout its lifecycle. Data lineage also provides insights into the impact of data changes on downstream processes, which is important for compliance, data governance, and risk management.All of these things can be achieved in Purview. I am looking forward to more dynamic implementations, such as in Dedicated SQL Pool. There is still a lot to do, but we are on the right path to getting all the necessary functionalities.
- Avoiding Issues: Monitoring Query Pushdowns in Databricks Federated Queries - October 27, 2024
- Microsoft Fabric: Using Workspace Identity for Authentication - September 25, 2024
- Executing SQL queries from Azure DevOps using Service Connection credentials - August 28, 2024
Last comments