As you continue adding more files to your table, it keeps growing larger. One day, while reviewing it, you notice some records in the middle that contain mistakes. To troubleshoot this in an easy way, it’s essential to have proper metadata about the files you have already loaded. How can you obtain this information? Another scenario could be when you want to deduplicate the data, and the only information available about when the data arrived is in the metadata of files stored. These and many more use cases require metadata. In today’s article, I will demonstrate how to do this based on a few simple examples. Let’s get started!
The simplest way to achieve this is by performing an SQL query with the input_file_name function:
SELECT input_file_name() AS file_name FROM delta.`abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/lineitem`
As it is visible in the screenshot above, all the information is correctly organized. It might be slightly misleading that the function name suggests ‘file_name,’ but, in reality, it returns the full path. However, it is not an issue to further transform it to extract only the file name:
%sql SELECT substring(input_file_name(), length(substring_index(input_file_name(), '/', -1)) + 2) AS file_name FROM delta.`abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/lineitem`
If you prefer Python syntax, using it is also quite straightforward:
df =( spark.read.format('delta') .load('abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/lineitem') .selectExpr('input_file_name()') .distinct() ) display(df)
input_file_name() is not the only method available for this purpose. Its significance is highlighted, especially because this function may not work correctly (at the time of writing) on Databricks Shared Cluster. Therefore, we need to explore alternative options to achieve the same result.
Another approach to gather more information about your input files involves using the _metadata column. This is a discreet column that functions across various input files. It is labeled as ‘secret’ because it is not visible by default. If you wish to view the _metadata column in the DataFrame you’re working with, simply include it in your query. However, there’s a caveat: if your data source already has a column named _metadata, the queries will display details from that column in the data source, not the file’s information. To avoid this, if possible, steer clear of using such naming conventions.
df =( spark.read.format('delta') .load('abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/lineitem') .selectExpr('_metadata.*') .distinct() ) df.printSchema()
_Metadata is not simple column – it is a struct that consits on multiple elements, the most important ones:
- file_path – path where file is located,
- file_name – name of a file,
- file_size – size of a file,
- file_modification_time – timestamp where was modified.
Avoid using * (as demonstrated in the previous example) to retrieve information from the _metadata column, as Databricks may add more attributes to it in the future. It’s advisable to be more specific in selecting the attributes you need:
df =( spark.read.format('delta') .load('abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/lineitem') .selectExpr('_metadata.file_name AS file_name','_metadata.file_size AS file_size') .distinct() ) display(df)
You can not only use it in a simple Spark read operation but also with COPY INTO and Autoloader. Below, you can see an example for COPY INTO.
First, let’s create the target table:
%sql CREATE TABLE mytable USING delta AS SELECT * FROM PARQUET.`abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/nation` LIMIT 0
Then we can proceed to run the COPY INTO itself. Please note that in the COPY_OPTIONS, I specified mergeSchema as True because the file_name column wasn’t added during the CREATE TABLE statement and if I will not do it I will receive an error:
%sql COPY INTO mytable FROM ( SELECT *, _metadata.file_name FROM 'abfss://data@sqldaylakehouse.dfs.core.windows.net/raw/nation' ) FILEFORMAT = PARQUET COPY_OPTIONS ('mergeSchema' = 'true')
Now, we can check if the column is available to us, ensuring everything works as expected:
%sql SELECT DISTINCT file_name FROM mytable
Databricks is a powerful tool that facilitates the import of data from various file types and organizes it neatly into tables. However, when it comes to retaining specific details about the data’s origin, such as the source files, Databricks doesn’t automatically track this information. You need to explicitly instruct it to remember these details when importing the data. Throughout this process, you have the flexibility to choose precisely which details to retain and associate with the table you’re creating. I strongly recommend to always keep track of such metadata in your lakehouse because even if it is not usable right now, it can be in the future.
That’s all for today, thank you for your time and see you soon.
- Executing SQL queries from Azure DevOps using Service Connection credentials - August 28, 2024
- Setup Git credentials for Service Principal in Azure Databricks - August 21, 2024
- Microsoft Fabric 101 Episode 3: Pausing and Scaling using portal and Powershell - August 8, 2024
Czy dałoby się wykorzystać to podejście do wyciągania informacji o odświeżeniu całego kontenera/folderu (zbiorcze info dla wszystkich parquetow i jakiś max date z tego)?
A to podejście, per parquet, już wypróbowane, działa, dzięki za artykuł 🙂