Collecting the history of query and command executions in Azure Databricks is essential for troubleshooting, optimizing performance, and ensuring robust security. This practice establishes an audit trail for swift issue resolution, helps build a usage baseline for objects, and contributes to maintaining a secure data processing environment. The historical data becomes a valuable resource for improving system reliability and for effective debugging. Today, I would like to demonstrate how to access query history not only for SQL Warehouses but also for standard general-purpose and job clusters by leveraging audit logs and system tables. This capability gives you valuable insight into your data workflows. Let’s delve into how it works!
Today, our focus will be on General Purpose and Job Cluster commands, not SQL Warehouses. Handling SQL Warehouses is relatively straightforward as we have Query history and Query Profile accessible directly from the Workspace, API, and other sources:
Let’s skip SQL Warehouses and get to the point. First of all, it’s important to note that execution history is not readily available in default logs. To access this information, Verbose Logging must be enabled in the Admin settings of your workspace.
Look for the Verbose Audit Logs setting under the Advanced tab in the Other section:
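By the way, if you prefer to script this instead of clicking through the UI, the workspace configuration REST API exposes the same switch. Below is a minimal Python sketch, assuming your workspace URL and PAT are stored in environment variables (the variable names are mine) and that enableVerboseAuditLogs is the key behind the Verbose Audit Logs toggle:
import os
import requests

# Assumed placeholders – point these at your own workspace URL and personal access token.
host = os.environ["DATABRICKS_HOST"]    # e.g. https://WORKSPACEID.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]

# The workspace-conf endpoint toggles workspace settings; the value must be a string.
resp = requests.patch(
    f"{host}/api/2.0/workspace-conf",
    headers={"Authorization": f"Bearer {token}"},
    json={"enableVerboseAuditLogs": "true"},
)
resp.raise_for_status()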
To begin searching for query execution history in system tables, we first need to enable them using the REST API and Personal Access Token (PAT) authentication. Let’s generate the token – you can do this in your User Settings:
In the Developer section, you can oversee and manage all your Personal Access Tokens:
Now, let’s go ahead and generate a new token:
The next piece of information we’ll need is the Databricks Workspace identifier. You can obtain it in various ways, but the simplest method is to extract the string between “https://” and “azuredatabricks.net” from the URL:
Another important detail for enabling System Tables is the Metastore identifier. It’s worth noting that, similar to other features in the Databricks ecosystem, Unity Catalog needs to be enabled.
Once you’ve logged in to your Databricks Account, you’ll find it directly available in your account settings.
Now that we have all the necessary details, let’s move forward with sending the request via the REST API. In my case, I’ll be using curl as the tool for making the request. The first call fetches all the system schemas along with their status, indicating whether they are enabled in my workspace. Please replace the uppercase placeholders with your own values.
curl -v -X GET -H "Authorization: Bearer PAT" "https://WORKSPACEID.azuredatabricks.net/api/2.0/unity-catalog/metastores/METASTOREID/systemschemas"
When you get the response, you’ll notice that some schemas are already turned on, and there are others that you can still activate:
{"schemas":[
{"schema":"storage","state":"AVAILABLE"},
{"schema":"operational_data","state":"UNAVAILABLE"},
{"schema":"access","state":"ENABLE_COMPLETED"},
{"schema":"billing","state":"ENABLE_COMPLETED"},
{"schema":"compute","state":"ENABLE_COMPLETED"},
{"schema":"marketplace","state":"AVAILABLE"},
{"schema":"lineage","state":"AVAILABLE"},
{"schema":"information_schema","state":"ENABLE_COMPLETED"}]
}
To activate a particular schema, include its name in the last part of the URI. It’s important to note that in the previous example, we used the GET method, but now we’re using PUT:
curl -v -X PUT -H "Authorization: Bearer PAT" "https://WORKSPACEID.azuredatabricks.net/api/2.0/unity-catalog/metastores/METASTOREID/systemschemas/SCHEMANAME"
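If curl is not your tool of choice, the same two calls can be made from Python with the requests library. Here is a minimal sketch, assuming the same PAT, workspace and metastore placeholders as in the curl examples, which additionally enables every schema that is still only AVAILABLE:
import requests

# Assumed placeholders – replace with your own values, as in the curl examples above.
WORKSPACE_URL = "https://WORKSPACEID.azuredatabricks.net"
METASTORE_ID = "METASTOREID"
PAT = "PAT"

headers = {"Authorization": f"Bearer {PAT}"}
base = f"{WORKSPACE_URL}/api/2.0/unity-catalog/metastores/{METASTORE_ID}/systemschemas"

# List all system schemas together with their state.
schemas = requests.get(base, headers=headers).json()["schemas"]
for s in schemas:
    print(s["schema"], s["state"])

# Enable everything that is available but not yet turned on.
for s in schemas:
    if s["state"] == "AVAILABLE":
        requests.put(f"{base}/{s['schema']}", headers=headers).raise_for_status()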
Once enabled, you’ll observe in your Catalog Explorer that the system catalog becomes visible, showcasing all the enabled schemas. I highly recommend exploring these tables as they contain valuable information useful in various scenarios:
Let’s return to our main focus – the history of queries and commands. We can find this information in the table called system.access.audit, but you must filter it to rows where action_name equals runCommand. It has various columns available, and an example query might look like this:
%sql
SELECT
  event_time,
  user_identity.email,
  service_name,
  action_name,
  request_params.notebookId,
  request_params.clusterId,
  request_params.executionTime,
  request_params.commandText
FROM `system`.`access`.`audit`
WHERE action_name = 'runCommand'
  AND lower(request_params.commandText) NOT LIKE '%from system.%'
  AND lower(request_params.commandText) NOT LIKE '%from `system`.%'
The result is exactly what we expected:
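Building on this result, you can also turn the same table into a simple usage baseline, for example per user and cluster. Below is a minimal PySpark sketch meant for a Databricks notebook (the grouping and the metrics are just one possible choice, not part of the original query):
from pyspark.sql import functions as F

# Runs in a Databricks notebook, where `spark` and `display` are predefined.
audit = spark.table("system.access.audit").where(F.col("action_name") == "runCommand")

# Count commands and sum execution time per user and cluster – a simple usage baseline.
baseline = (
    audit.groupBy(
        F.col("user_identity.email").alias("email"),
        F.col("request_params.clusterId").alias("clusterId"),
    )
    .agg(
        F.count("*").alias("commands"),
        F.sum(F.col("request_params.executionTime").cast("double")).alias("total_execution_time"),
    )
    .orderBy(F.desc("commands"))
)

display(baseline)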
Isn’t that awesome? But the story continues. Sometimes we need to combine this information with other metrics, and a Log Analytics workspace works well as a central store for such logs. To set it up, just go to your Databricks service in the Azure portal, click Diagnostic settings, and from there you can send the logs over to Log Analytics:
Check out the screenshot below – it shows all the log categories that can be captured. In our case, we only need the Databricks Notebook logs. Give the setting a name and choose Log Analytics as the destination. Keep in mind that you also have other options, such as a Storage Account, an Event Hub, or even a custom Partner solution, as alternative targets.
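If you would rather not click through the portal, the same diagnostic setting can be created via the Azure Resource Manager API. Below is a minimal Python sketch, assuming you have the full resource IDs of both workspaces at hand and that "notebook" is the log category feeding the DatabricksNotebook table (the setting name and both IDs are placeholders):
import requests
from azure.identity import DefaultAzureCredential

# Assumed placeholders – full Azure resource IDs of the Databricks workspace and the Log Analytics workspace.
DATABRICKS_RESOURCE_ID = "/subscriptions/.../resourceGroups/.../providers/Microsoft.Databricks/workspaces/..."
LOG_ANALYTICS_RESOURCE_ID = "/subscriptions/.../resourceGroups/.../providers/Microsoft.OperationalInsights/workspaces/..."

# Acquire a token for the ARM management endpoint.
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com{DATABRICKS_RESOURCE_ID}"
    "/providers/Microsoft.Insights/diagnosticSettings/notebook-to-log-analytics"
    "?api-version=2021-05-01-preview"
)

body = {
    "properties": {
        "workspaceId": LOG_ANALYTICS_RESOURCE_ID,
        "logs": [{"category": "notebook", "enabled": True}],
    }
}

requests.put(url, headers={"Authorization": f"Bearer {token}"}, json=body).raise_for_status()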
Once everything is set up, you can run some queries in Databricks. After a while, you should see in Log Analytics that the DatabricksNotebook table is ready for querying:
As you may be aware, we can use the Kusto Query Language to analyze logs stored in Log Analytics. It’s quite straightforward. For instance, the query below fetches fundamental details about queries executed in Databricks:
DatabricksNotebook
| where ActionName == 'runCommand'
| project
    OperationName,
    TimeGenerated,
    Category,
    parse_json(Identity).email,
    parse_json(RequestParams).commandText,
    parse_json(RequestParams).commandLanguage
The result is quite straightforward, and we can correlate it with other logs to obtain a much more comprehensive understanding.
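If you want to pull these records programmatically, for example to join them with data living outside Log Analytics, the azure-monitor-query SDK can run the same KQL from Python. A minimal sketch, where the Log Analytics workspace ID (the GUID, not the resource ID) is a placeholder:
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

# Assumed placeholder – the Log Analytics workspace ID (GUID).
WORKSPACE_ID = "00000000-0000-0000-0000-000000000000"

client = LogsQueryClient(DefaultAzureCredential())

kql = """
DatabricksNotebook
| where ActionName == 'runCommand'
| project TimeGenerated, Identity, RequestParams
"""

# Fetch the last 24 hours of notebook audit records.
response = client.query_workspace(WORKSPACE_ID, kql, timespan=timedelta(days=1))

for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))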
I highly recommend checking out the official documentation on the logs that Databricks can generate. You might discover some valuable information that could be useful for your specific situation. Here’s the link.
That’s all for today! I hope you found the article interesting. If you’re into all things Databricks, mark your calendar for SQLDay, one of the biggest conferences in this part of Europe, happening on May 13-15 in Wroclaw and online. I’m very happy that I will have the opportunity to lead a workshop titled “Databricks – zero to hero”. If you are interested, register while there are still slots available. More information and sign-up details are available here: link. See you there!
Is this solution going to work only for Unity Catalog enabled workspaces?
Hello Ela, System tables are only for Unity Catalog enabled workspaces, Audit Logs are available for all 🙂