Azure Synapse Resultset caching – how it works?

Dedicated SQL Pool in Azure Synapse has a very interesting feature called result set caching. This mechanism is very specific to Synapse and it can significantly increase the performance of a query. In a nutshell, we can say that result set caching when enabled automatically cache query results for repetitive use. Because of this, any further queries can retrieve this result directly from cache instead of recomputation from scratch. As you probably know Synapse has some concurrency slots and the good news is that queries that retrieve data from the cache do not count against those concurrency limits. How does it work? Let’s check that in practice.

For demo purposes, I created in my Synapse workspace Dedicated SQL Pool called SynapseDwh with DW100c size:

Inside the database, I loaded structures available in the Azure SQL Demo database:

I didn’t spend a lot of time defining the distribution of a table and all of them are distributed using a round-robin algorithm:

SELECT        
 OBJECT_SCHEMA_NAME( object_id ) AS schemaName,        
 OBJECT_NAME( object_id ) AS tableName,        
 distribution_policy_desc 
FROM sys.pdw_table_distribution_properties;

Of course, it is not the best configuration from a performance point of view but for our purposes is good enough.

Now let’s run our testing query:

SELECT 
     YEAR(SOH.OrderDate) 
    ,C.SalesPerson 
    ,P.Color 
    ,AVG(UnitPrice) 
    ,SUM(OrderQty) 
    ,COUNT(*) 
FROM dbo.[SalesOrderDetail] AS SOD 
JOIN dbo.[SalesOrderHeader] AS SOH 
    ON SOH.SalesOrderId = SOD.SalesOrderId 
JOIN dbo.Product AS P 
    ON P.ProductId = SOD.ProductId 
JOIN dbo.Customer AS C 
    ON C.CustomerId = SOH.CustomerId 
GROUP BY 
     YEAR(SOH.OrderDate) 
    ,C.SalesPerson 
    ,P.Color 
OPTION (LABEL = 'Cache test query')

The result of this query is not important but the metadata associated with it. As you can see I used OPTION(LABEL) to mark my query with ‘Cache test query’ text- it will be easier for me to identify all the information that I will find within dynamic management views etc. One of the places where I will find information about cache is a DMV called sys.dm_pdw_exec_requests:

SELECT 
     request_id 
    ,command 
    ,status 
    ,result_cache_hit 
FROM sys.dm_pdw_exec_requests AS PDW 
WHERE [label] ='Cache test query'

Below you can see information returned by the above query:

The most important for us is column result_cache_hit. This column can have few values:

1 – data retrieved from the cache,
0 – data not retrieved from the cache (cache miss)
negative values from -1 to -128 – data not retrieved from the cache because of some factor. A specific number is pointing to a detailed reason for example -1 means that result set caching is disabled on a database.

There are multiple reasons why data will not be taken from cache for example:

cache disabled on the session level,
row-level security on the source table,
usage of a temporary or external table,
usage of non-deterministic functions like GETDATE(),
estimated result set size is bigger than 10GB,
the result set contains rows with size bigger than 64 KB.

The above list does not contain all the possible reasons – the full list can be found in the documentation (link).

So we have information that caching is disabled on a database level, so let’s turn it on. It is pretty simple and can be executed by using ALTER DATABASE command:

ALTER DATABASE SynapseDWH SET RESULT_SET_CACHING ON ;

An important remark is that this query must be executed from the master database. When it is turned on we can run our testing query one more time and check DMV:

As you can see above, this time we have status 0 which means we didn’t get data from the cache because it is the first time that we run this query. After one more execution we have the following result:

Now we have the result that we expected. If we want to go deeper into the execution of a query we can check another DMV called sys.dm_pdw_request_steps:

SELECT 
    request_id
   ,step_index
   ,operation_type
   ,location_type
   ,status
   ,total_elapsed_time
   ,command 
FROM sys.dm_pdw_request_steps 
WHERE request_id IN ('QID7252','QID7263')

I filtered it by request_id that I get from the previous query. The result looks pretty interesting:

“Not cached query” required 18 steps to return data including some data movement activities etc, “Cached query” required only one step – return data from cache from Control Node. Sounds cool, isn’t it? We have also command executed to return data from the cache and it looks like this:

select * from [DWResultCacheDb].dbo.[iq_{0C460BE7-CC96-4A77-A2CD-7FD737F59A14}] OPTION(MAXDOP 1)

It looks that Synapse internally identifies our query and retrieved it from the specific table using one thread. Please remember that this mechanism is just retrieving the result set – there are not any additional calculations just simple “SELECT *”.

You can ask how much space this cache currently takes and what is the maximum limit for it. Currently, every Dedicated SQL Pool instance has 1TB of such cache (including the smallest 100DWc!), Checking the size is also pretty simple, you can use DBCC SHOWRESULTCACHESPACEUSED command:

Cache will be maintained automatically by Synapse – for example, it will be purged every 48 hours if it wasn’t used or has been invalidated or when it reaches maximum size. You can do those things also manually – the cache will be removed when you will turn it off for a database or when you will run DBCC DROPRESULTSETCACHE.

It was a little surprising for me when I first touch this feature that cache is not automatically dropped when we will pause our instance. It is very important information to keep in mind. It works like this because it is not in-memory RAM based cache but rather something similar to Bufferpool Extension from SQL Server (link) where data is placed in nonvolatile storage – in this case very fast NVMe SSD which is not as fast as RAM but still much faster than starndard disks. Please be aware that resultset cache is not working like query plan caching – in this case, to hit cache you have to run exactly the same query. Exactly in this case means that the query string must be 100% the same.

Let’s check how it works on example, I have a simple query that I executed two times:

SELECT TOP 1000 * FROM dbo.[SalesOrderDetail] OPTION (LABEL = 'Hit cache test')

then I modified it a little by putting the * sign in the new line:

SELECT TOP 1000  * FROM dbo.[SalesOrderDetail] OPTION (LABEL = 'Hit cache test')

As a result, I received the following information:

As you can see every change in the text will be treated as a new query. The same will happen when you will parametrize or add filter to your query:

SELECT   SUM(OrderQty) FROM dbo.[SalesOrderDetail] WHERE SalesOrderId = 71832 OPTION (LABEL = 'Hit cache - where clause')

and then I take subset from above:

SELECT   SUM(OrderQty) FROM dbo.[SalesOrderDetail] WHERE SalesOrderId = 71832 AND OrderQty<100 OPTION (LABEL = 'Hit cache - where clause')

As you can see resultset cache can be a powerful tuning technique but it is not feasible for all scenarios.

A few good examples come to my mind when this feature should be considered:

Initial screen of Power BI or any other reporting tools – it is sending a query to the database that is known in advance – result set can be placed in the cache so users will not wait for initial data loading on the report,
database is queried automatically by reporting tools like Power BI and some of those queries can be executed in advance to put data in cache (good old technique to warm cache is still valid),
some data is exported (for example via Power BI Paginated Reports) and we have well-known queries used for exporting,
during the ETL some query results must be referenced multiple times – it can be referenced from the cache,

Scenarios when resultset cache is not the best thing to use:

a lot of ad-hoc queries – workload with a lot of unique queries will place data in the cache but it will not be reused,
reporting tools will send tons of different queries,
data will change very often so the cache will be invalidated and will not be reused,
cached result set will be so big that it can cause high throttling on the control node and slow down the overall query response on the instance.

Please keep in mind that you can turn on and turn off caching whenever you want. For example, it can be a good idea to turn off it during ETL and turn on it after or control it on the session level (by simply running SET RESULT_SET_CACHING ON).

The last thing that I wanted to highlight is that queries that retrieve data from the resultset cache are not counting to the concurrency limit. There are scenarios where you can have a lot of concurrent queries and if they will hit cache then it will be amazingly fast and will compute nodes will not be even touched to return query result.

Please try it yourself and probably you will find examples where it will be very useful.

Useful links: