
Delta Lake 101 Part 3: Optimize, Z-Ordering and File Pruning

Today I would like to write a few words about two of the most important techniques used to improve performance in your Lakehouse: OPTIMIZE and ZORDER. Let’s start by introducing those two terms.

The OPTIMIZE command in Delta Lake helps tidy up how data files are stored in a Delta table. It rearranges these files to make queries faster by avoiding the need to go through extra data that isn’t needed. When you use the OPTIMIZE command, Delta Lake does things like combining small files into bigger ones and putting data in a better order. This all makes it easier to get to the data you want when you’re running queries.
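As a quick illustration (only a sketch, using a hypothetical table named my_table that is not part of this walkthrough), the simplest form of the command just compacts small files without imposing any particular ordering:

-- Compact the small files of a hypothetical table into larger ones
OPTIMIZE my_table;

-- Optionally limit the work to part of the data, assuming order_date is a partition column
OPTIMIZE my_table WHERE order_date >= '2024-01-01';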

File pruning in Spark is like tidying up before you start working with data. Imagine you have a big pile of files. Instead of looking through all of them, you only pick the ones that have the info you need. Spark does something similar. It figures out which parts of the data it actually needs for a task and only reads those parts, saving time and resources. This is handy when dealing with lots of data.

We will begin by setting up some global configurations as a starting point:

SET spark.sql.files.maxRecordsPerFile = 1000;
SET spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols = 1;

The first option sets a limit on the maximum number of records that can be stored within a single file, and in this specific example, that limit is set to one thousand. Meanwhile, the second option dictates the number of columns for which statistics, encompassing both minimum and maximum values, should be computed. By default, statistics are generated for the first 32 columns; however, I wish to restrict this calculation exclusively to the first column.
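As a side note (just a sketch, not something we rely on later), the same data-skipping limit can also be applied per table rather than per session through the delta.dataSkippingNumIndexedCols table property, shown here for a hypothetical table named my_table:

-- Collect min/max statistics only for the first column of this particular table
ALTER TABLE my_table SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = 1);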

Previously, I created a “customer” table containing a set of rows and placed it within the “raw” schema. Now, I am using this existing table to create a new one through the “CREATE TABLE AS” (CTAS) syntax. Specifically, I am generating a new table comprising a random selection of nine thousand records.

CREATE OR REPLACE TABLE customer
LOCATION 'abfss://data@sqldaylakehouse.dfs.core.windows.net/filepruning/customer'
AS 
SELECT 
*
FROM  
raw.customer
ORDER BY RAND() DESC
LIMIT 9000;
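Before opening the Spark UI, we can sanity-check how many files this CTAS produced. A quick way (a sketch relying only on standard Delta Lake commands) is DESCRIBE DETAIL, which returns, among other things, a numFiles column for the table:

DESCRIBE DETAIL customer;

With nine thousand rows and a cap of one thousand records per file, we should expect roughly nine files, which matches what we will see in the Spark UI shortly.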

With our foundational structures in place, let’s proceed to create a temporary view for convenience, bearing the moniker “abc.” To accomplish this task, we can employ the “createOrReplaceTempView” method.

%python
df = spark.sql("SELECT c_custkey FROM customer LIMIT 1")
df.createOrReplaceTempView("abc")

Now, let’s execute a query on this “abc” view:

SELECT * FROM abc

To determine how many files were read, we can inspect the job details. There, a “View” button should be available that, when clicked, opens a new window displaying a plethora of internal information, including the count of files read.


Once the new window is launched, please proceed to the SQL/DataFrame tab. Within this tab, locate our specific query and select it. This action will grant us access to the relevant details, including the count of files that were read during the execution.

Following these steps, you will be presented with the query details and the Directed Acyclic Graph (DAG), which represents the query plan. To access further information, find the “Scan” operation at the bottom of the DAG and click on it.


A lot of statistics will be visible here, and the two most important for us are:

  • files pruned – the number of files that were skipped,
  • files read – the number of files that were actually read.

As observed in this scenario, we read only one file while skipping eight others. This favorable outcome shows that unneeded files were pruned efficiently. Let’s move on to another example where our objective is to locate a specific customer.


SELECT input_file_name(),*
FROM customer
WHERE c_custkey = 342278

In this particular case, when we examine the execution plan, it becomes apparent that we read all nine files.

How does Spark determine which files to read? The critical information, namely the minimum and maximum values recorded for each file, can be found within the Delta log.
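If you want to see those statistics yourself, the transaction log sits in the _delta_log folder next to the data files, and every “add” entry carries a “stats” JSON string with the per-column minimum and maximum values. A hedged sketch, assuming the table location used earlier in this article:

SELECT add.path, add.stats
FROM json.`abfss://data@sqldaylakehouse.dfs.core.windows.net/filepruning/customer/_delta_log/*.json`
WHERE add IS NOT NULL;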


By default, these statistics are collected for the first 32 columns. However, as previously mentioned in this article, we modified this behavior to calculate statistics exclusively for the first column. Because the value we are searching for falls within the min/max range of every file, none of the files can be skipped; the remedy is to arrange the rows by the column used in the search clause, so that each file covers a narrow, non-overlapping range of values. To validate this approach, we will start by removing the restriction on the maximum number of rows per file, setting it to 0.

SET spark.sql.files.maxRecordsPerFile = 0

Additionally, we have the option to customize the default behavior of the “OPTIMIZE” command. Specifically, we can configure it to generate files with a maximum size of 4KB. It’s worth noting that the default size is set at 1GB, which is typically a more suitable value. However, for the purposes of testing and experimentation, adjusting it to 4KB is acceptable.

SET spark.databricks.delta.optimize.maxFileSize = 4096

The “OPTIMIZE” command plays a pivotal role in consolidating files into larger ones, making it a valuable option for enhancing the read performance of our lakehouse. Furthermore, we have the capability to dictate the ordering of data based on a specific column by configuring the “ZORDER BY” clause, providing further control over query performance optimization.

OPTIMIZE customer
ZORDER BY (c_custkey)

This operation may require some time to execute, but upon its conclusion, we will receive a statistics report. In my specific case, this resulted in the addition of 198 files and the removal of 9 files.
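The same numbers can also be retrieved later from the table history: the OPTIMIZE commit records metrics such as numAddedFiles and numRemovedFiles in its operationMetrics column. A quick sketch:

DESCRIBE HISTORY customer;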

Now, as we attempt to locate a specific customer and inspect the statistics, we observe that only one file was read:


SELECT * FROM customer
WHERE c_custkey = 251502

In conclusion, the strategic application of “OPTIMIZE” and “ZORDER” in your Apache Spark data processing endeavors can significantly enhance overall system efficiency. These two robust techniques fine-tune your Spark workflows, leading not only to improved performance but also to substantial resource savings.

With “OPTIMIZE,” you can orchestrate the restructuring of data files, merging them into more manageable units. This streamlined approach expedites query execution and encourages prudent resource allocation, resulting in a notable uptick in overall productivity.

In a similar vein, “ZORDER” provides you with precise control by allowing you to specify the column for data ordering, a crucial factor in refining query performance. Whether it’s related to sorting, filtering, or joining operations, the ability to customize data organization to suit your specific needs can yield substantial improvements.

As you apply these techniques in practical scenarios, it’s important to recognize the essential role played by statistics in assessing their effectiveness. A continuous routine of monitoring and refinement is key to consistently optimizing the benefits derived from “OPTIMIZE” and “ZORDER.”

In the ever-evolving realm of big data, the capacity to optimize data processing is not merely advantageous; it’s a necessity. Through “OPTIMIZE” and “ZORDER,” Apache Spark equips you with the tools required to navigate this landscape with precision and efficiency, ensuring that your data workflows remain at the forefront of operational excellence. Embrace these methodologies, adhere to industry best practices, and unlock the full potential of Apache Spark in your data analytics journey.

Of course, while “OPTIMIZE” and “ZORDER” are powerful techniques, it’s important to note that there are other optimization methods available. One notable innovation is Microsoft Fabric’s introduction of “VORDER,” a promising approach to enhancing query performance when working with Delta files. Stay tuned for the next articles, where we’ll delve into the benefits and applications of “VORDER” in further detail.

