As you continue adding files to your table, it keeps growing. One day, while reviewing it, you notice that some records in the middle contain mistakes. To troubleshoot this easily, you need proper metadata about the files you have already loaded. How can you obtain it? Another scenario is deduplication, where the only information about when the data arrived lives in the metadata of the stored files. These and many other use cases require file metadata. In today’s article, I will demonstrate how to get it with a few simple examples. Let’s get started!
The simplest way to achieve this is by performing an SQL query with the input_file_name function:
SELECT input_file_name() AS file_name
FROM delta.`abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/lineitem`
As you can see in the screenshot above, all the information is neatly organized. The function name can be slightly misleading: it suggests a file name, but in reality it returns the full path. That is not a problem, though, because you can easily transform it further to extract only the file name:
%sql
SELECT substring_index(input_file_name(), '/', -1) AS file_name
FROM delta.`abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/lineitem`
If you prefer Python syntax, using it is also quite straightforward:
df = (
    spark.read.format('delta')
    .load('abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/lineitem')
    .selectExpr('input_file_name()')
    .distinct()
)
display(df)
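If you also want only the file name on the Python side, a small sketch along the same lines (this variant is my addition, reusing the path from above) could split the path and keep its last element:

from pyspark.sql import functions as F

df = (
    spark.read.format('delta')
    .load('abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/lineitem')
    # Split the full path on '/' and keep only the last element, i.e. the file name.
    .select(F.element_at(F.split(F.input_file_name(), '/'), -1).alias('file_name'))
    .distinct()
)
display(df)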
input_file_name() is not the only method available for this purpose, which matters because, at the time of writing, this function does not work correctly on Databricks shared access mode clusters. Therefore, we need to explore alternative options that achieve the same result.
Another approach to gathering more information about your input files is the _metadata column. This is a hidden column that works across various input file formats. It is “hidden” because it is not visible by default: if you wish to see the _metadata column in the DataFrame you’re working with, you must include it explicitly in your query. There is a caveat, however: if your data source already has a column named _metadata, queries will return that column from the data source, not the file information. If possible, avoid such naming collisions altogether.
df = (
    spark.read.format('delta')
    .load('abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/lineitem')
    .selectExpr('_metadata.*')
    .distinct()
)
df.printSchema()
_metadata is not a simple column – it is a struct that consists of multiple elements, the most important of which are:
- file_path – the full path where the file is located,
- file_name – the name of the file,
- file_size – the size of the file,
- file_modification_time – the timestamp of when the file was last modified.
Avoid using * (as demonstrated in the previous example) to retrieve information from the _metadata column, as Databricks may add more attributes to it in the future. It’s advisable to be more specific in selecting the attributes you need:
df = (
    spark.read.format('delta')
    .load('abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/lineitem')
    .selectExpr('_metadata.file_name AS file_name', '_metadata.file_size AS file_size')
    .distinct()
)
display(df)
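Tying this back to the deduplication scenario from the introduction, here is a minimal sketch of how file_modification_time could be used to keep only the rows that arrived with the most recently modified file. The path is reused from the earlier examples; the filtering logic itself is just one possible approach I am assuming, not something from the original setup.

from pyspark.sql import functions as F

# Read the table together with the modification timestamp of each row's source file.
df = (
    spark.read.format('delta')
    .load('abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/lineitem')
    .select('*', F.col('_metadata.file_modification_time').alias('file_modification_time'))
)

# Find the newest modification timestamp across all loaded files (assumed dedup criterion).
latest_ts = df.agg(F.max('file_modification_time').alias('max_ts')).collect()[0]['max_ts']

# Keep only the rows that came from the most recently modified file(s).
display(df.where(F.col('file_modification_time') == F.lit(latest_ts)))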
You can use it not only in a simple Spark read operation but also with COPY INTO and Auto Loader. Below, you can see an example for COPY INTO; an Auto Loader sketch follows after the COPY INTO walkthrough.
First, let’s create the target table:
%sql
CREATE TABLE mytable
USING delta
AS SELECT * FROM PARQUET.`abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/nation` LIMIT 0
Then we can proceed to run COPY INTO itself. Please note that in COPY_OPTIONS I set mergeSchema to true, because the file_name column was not added in the CREATE TABLE statement, and without this option the command would fail with an error:
%sql
COPY INTO mytable
FROM (
  SELECT *, _metadata.file_name
  FROM 'abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/nation'
)
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true')
Now, we can check if the column is available to us, ensuring everything works as expected:
%sql
SELECT DISTINCT file_name FROM mytable
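For Auto Loader, the _metadata column can be captured in exactly the same way during streaming ingestion. Below is a minimal sketch: the checkpoint location and target table name are placeholders I made up for illustration, while the source path and file format mirror the COPY INTO example above.

# Incrementally ingest new Parquet files with Auto Loader, keeping the source file name.
df = (
    spark.readStream.format('cloudFiles')
    .option('cloudFiles.format', 'parquet')
    .load('abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/nation')
    .selectExpr('*', '_metadata.file_name AS file_name')
)

(
    df.writeStream
    # Placeholder checkpoint path – adjust to your own storage layout.
    .option('checkpointLocation', 'abfss://data@sqldaylakehouse.dfs.core.windows.net/checkpoints/nation')
    .trigger(availableNow=True)
    .toTable('mytable_autoloader')  # placeholder target table
)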
Databricks is a powerful tool that makes it easy to import data from various file types and organize it neatly into tables. However, when it comes to retaining details about the data’s origin, such as the source files, Databricks does not track this information automatically – you need to explicitly instruct it to keep these details when importing the data. Throughout this process, you can choose precisely which details to retain and associate with the table you’re creating. I strongly recommend always keeping track of such metadata in your lakehouse, because even if it is not useful right now, it may be in the future.
That’s all for today, thank you for your time and see you soon.
Would it be possible to use this approach to extract information about the refresh of an entire container/folder (aggregate info for all the Parquet files and some max date out of it)?
As for the per-Parquet approach, I have already tried it out and it works, thanks for the article 🙂