One of the latest announcements related to Purview is a new Databricks connector that helps you gain insights into the Hive metastore within a Databricks instance. In this article, I have prepared a short tutorial to demonstrate how to use the connector, what it looks like, and the benefits you will receive. Please note that this feature works with the local metastore and has no connection with Unity Catalog. Microsoft has already announced integration with Unity Catalog, but that feature is not yet available.
Before we proceed to the portal, let’s review the functionalities that our new Purview connector will support. The official documentation can be found at the following link.
Below, you can see the supported capabilities:
Metadata extraction: Registered tables and views will be captured.
Full Scan: Purview will perform a full scan of the entire Hive metastore. Unfortunately, incremental scanning and scoped scanning are not currently supported. After the scan, you will get the following objects:
- Azure Databricks workspace
- Hive instance
- Database
- Tables
- Views
Lineage: Partially supported at present; it only shows the connection between Hive objects and the storage where the underlying data is located. Lineage between Hive objects themselves is not supported.
Other functionality, such as classifications, access policies, and data sharing, is not currently supported. As you can see, not everything is covered yet, but we can certainly expect more in the future. Even so, this is a more convenient way to scan Databricks than using the existing Hive connector.
Ok, let’s prepare our system – below you can see an overview diagram of the entire Purview & Databricks architecture:
This connector needs some additional components, so to use it we need a virtual machine with the integration runtime installed on it (5 and 6 in the picture above). I will not show you how to set up a virtual machine in Azure because that is not the point of this article, but for reference, I created the smallest possible VM with a Windows OS (DSv3).
When the VM is ready, we have to install the Self-Hosted Integration Runtime (SHIR) on it (link). From the download page, choose the newest version:
Installation is pretty simple – you can do it by clicking Next a few times in the wizard:
If everything is successful, you should see the “Configuration Manager” where you can register your SHIR.
To register your SHIR, go to your Purview instance and open the Microsoft Purview Governance Portal:
In the portal, go to the Data Map -> Integration runtimes section and click New:
First, select Self-Hosted as the type of integration runtime that you want to set up:
Provide a name for your SHIR:
On the final window, you will receive two authentication keys. Please copy one of them.
Paste it into the SHIR Configuration Manager and click Register:
If everything is successful, you should see a message indicating that the “Integration Runtime (Self-hosted) node has been registered successfully.” Please keep in mind that the SHIR and Purview must be able to communicate over the network for everything to work correctly.
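As a side note, if you prefer to script the registration instead of pasting the key into the Configuration Manager, the runtime ships with a command-line utility, dmgcmd.exe, which supports a -RegisterNewNode switch. Below is a minimal Python sketch; the installation path is the typical default and the version folder (5.0) is an assumption – verify both against your own installation:

```python
import subprocess

# Path to the dmgcmd utility installed with the Self-Hosted Integration Runtime.
# Adjust the version folder (5.0 here) to match your installation.
DMGCMD = r"C:\Program Files\Microsoft Integration Runtime\5.0\Shared\dmgcmd.exe"

# One of the two authentication keys copied from the Purview portal.
AUTH_KEY = "<paste-authentication-key-here>"

# -RegisterNewNode registers this machine as a new SHIR node.
result = subprocess.run([DMGCMD, "-RegisterNewNode", AUTH_KEY],
                        capture_output=True, text=True)
print(result.stdout or result.stderr)
```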
Make sure it also works on the cloud side by checking the portal. The newly added integration runtime should have a status of “Running.”
Ensure that you install Java 11 on the VM – other versions will not work:
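A quick way to confirm the Java installation on the VM is to check the version it reports; here is a small sketch (note that `java -version` writes its banner to stderr, not stdout):

```python
import re
import subprocess

# "java -version" prints its banner to stderr, not stdout.
proc = subprocess.run(["java", "-version"], capture_output=True, text=True)
banner = proc.stderr or proc.stdout

# Expect something like: openjdk version "11.0.19" ...
match = re.search(r'version "(\d+)', banner)
if match and match.group(1) == "11":
    print("OK: Java 11 detected")
else:
    first_line = banner.splitlines()[0] if banner else "no output"
    print(f"WARNING: expected Java 11, got: {first_line}")
```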
Now that our infrastructure is ready, let’s move on to the security section. Purview can authenticate to the Databricks instance using a Personal Access Token (PAT). To generate the PAT, go to your Databricks workspace and click on “User Settings,” located in the top-right corner.
Please be aware that a PAT has the same privileges as the person who generates it:
You can add a comment to your PAT to help identify it, especially if you have multiple tokens. It is important to note the lifetime of your token, as it will only be valid for the number of days specified. After that, it must be rotated or regenerated and replaced in your Azure Key Vault.
After generating the token, copy it and save it in the Azure Key Vault that was previously prepared. Purview will retrieve it from there.
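If you want to script this step rather than pasting the token into the portal, the azure-keyvault-secrets package can write the secret for you. A minimal sketch is below; the vault URL and secret name are placeholders of my choosing:

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Placeholders - replace with your own Key Vault and secret names.
VAULT_URL = "https://<your-key-vault-name>.vault.azure.net"
SECRET_NAME = "databricks-pat"

# DefaultAzureCredential picks up az login, environment variables, or a managed identity.
client = SecretClient(vault_url=VAULT_URL, credential=DefaultAzureCredential())

# Each call to set_secret creates a new secret version,
# which is convenient when the PAT has to be rotated later.
pat = "<paste-generated-databricks-pat-here>"
client.set_secret(SECRET_NAME, pat)
print(f"Stored secret '{SECRET_NAME}' in {VAULT_URL}")
```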
Ok, now we are ready to register a new data source – go to Data Map -> Sources -> Register:
On the new screen, choose the Azure Databricks connector:
Provide the following details:
- Name – the name of the data source,
- Subscription,
- Databricks workspace name,
- Workspace URL (this one will populate automatically),
- Collection – the target collection inside Purview where you want to place your Databricks source.
After that, an additional data source must be registered: Azure Key Vault. We need it because it stores the token generated earlier. Preparing this connection is relatively simple, but please remember that the Purview managed identity must have the privilege to read secrets from this Key Vault. You can assign this MSI the ‘Key Vault Secrets User’ role.
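The role assignment can also be done programmatically. Here is a sketch using the azure-mgmt-authorization package; the scope, subscription ID, and the Purview managed identity object ID are placeholders, and model names can vary slightly between SDK versions:

```python
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

# Placeholders - replace with your own identifiers.
SUBSCRIPTION_ID = "<subscription-id>"
KEY_VAULT_SCOPE = ("/subscriptions/<subscription-id>/resourceGroups/<rg-name>"
                   "/providers/Microsoft.KeyVault/vaults/<key-vault-name>")
PURVIEW_MSI_OBJECT_ID = "<purview-managed-identity-object-id>"

auth_client = AuthorizationManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Look up the built-in 'Key Vault Secrets User' role definition by name.
role_def = next(auth_client.role_definitions.list(
    KEY_VAULT_SCOPE, filter="roleName eq 'Key Vault Secrets User'"))

# Assign the role to the Purview managed identity at Key Vault scope.
auth_client.role_assignments.create(
    KEY_VAULT_SCOPE,
    str(uuid.uuid4()),  # role assignment names must be GUIDs
    RoleAssignmentCreateParameters(
        role_definition_id=role_def.id,
        principal_id=PURVIEW_MSI_OBJECT_ID,
        principal_type="ServicePrincipal",
    ),
)
print("Key Vault Secrets User assigned to the Purview managed identity")
```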
When the data source is registered, we must prepare the credentials needed to access it. To do so, go to Management -> Credentials and click New:
Additionally, make sure to give your credentials a meaningful name and a proper description. When registering the credentials, select “Access Token” as the type, since this is the authentication method used to access Databricks. Then choose the appropriate Key Vault and specify the name and, if necessary, the version of the secret you want to use:
Ok, we are ready to prepare a scan – go to Data Map -> Sources and choose the small scan icon in the Databricks box:
Provide all the needed information for the scan:
- Name
- Integration Runtime – the self-hosted integration runtime that we set up earlier
- Credential – the credential we created based on the Key Vault secret
- Cluster ID – you will find it in the cluster tags in the Databricks workspace (see screenshot below) or in the URL when you open the cluster; you can also read it programmatically, as shown in the notebook sketch after this list
- Mount Points – all the mount points, separated by semicolons
- Collection – the collection where all the scanned assets will be placed
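If you prefer not to click around for the Cluster ID and mount points, both can be read from a notebook attached to the cluster you want to scan. A minimal sketch (the `spark` and `dbutils` objects are provided automatically by the Databricks notebook runtime):

```python
# Run this in a Databricks notebook attached to the cluster you want Purview to scan.

# The cluster ID is exposed through the cluster usage tags in the Spark conf.
print(spark.conf.get("spark.databricks.clusterUsageTags.clusterId"))

# dbutils.fs.mounts() lists every mount point together with its storage source,
# which is everything you need to fill in the Mount Points field.
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)
```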
Next, a scan trigger can be set – for our purposes, I set it to Once:
It will take some time, but after a few minutes we should see the result – in my case, 20 assets were discovered:
We can go to the Data Catalog to see the details of all of those tables:
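You can also explore the scanned assets programmatically through the catalog search API. Below is a sketch using the azure-purview-catalog package; the endpoint and the "hive" keyword are placeholders of my choosing, and the exact response fields may differ slightly between API versions:

```python
from azure.identity import DefaultAzureCredential
from azure.purview.catalog import PurviewCatalogClient

# Placeholder - replace with your Purview account endpoint.
ENDPOINT = "https://<your-purview-account>.purview.azure.com"

client = PurviewCatalogClient(endpoint=ENDPOINT, credential=DefaultAzureCredential())

# Search the catalog for assets discovered by the Databricks scan.
response = client.discovery.query(search_request={"keywords": "hive", "limit": 10})
for asset in response.get("value", []):
    print(asset.get("qualifiedName"), "-", asset.get("entityType"))
```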
Data governance is important because it helps ensure that data is accurate, consistent, and secure. It also helps organizations comply with regulations, such as the General Data Protection Regulation (GDPR). By establishing clear rules and procedures for data management, organizations can avoid conflicts and disputes over data ownership and usage. Purview can help organizations with data governance by providing a centralized data catalog that enables discovery, understanding, and management of data. It allows organizations to create policies for data usage, classification, and retention, and provides tools for monitoring and enforcing these policies. Integration with Databricks is a great step forward, and I am looking forward to seeing integration with Unity Catalog, which will surely help many organizations. Please stay tuned as I plan to publish a series of articles in the near future, which will provide insights and information on the integration between Unity Catalog and Purview.