Introduction
Companies store a lot of useful data in SharePoint, OneDrive (which is essentially SharePoint), and MS Teams (also based on SharePoint).
If you want to enhance your data platform with information that is usually stored in various file formats (Excel, MS Access, CSV, and others), you'll need to land it in a location better suited for analytics, such as Azure Data Lake Storage. To avoid retrieving the same files over and over, an incremental data loading method is necessary.
In the article below, I will guide you through building a simple pipeline in Azure Data Factory that incrementally retrieves files to Azure Data Lake Storage for further analytical use cases.
High Level Architecture
Our architecture requires the following Azure services/objects:
- Azure Data Factory: The primary tool responsible for executing the file retrieval pipeline.
- Service Principal/Managed Identity: An essential identity for authorizing access to SharePoint.
- Azure Data Lake: The destination for downloaded files.
- SQL Database: A metadata store to track information about downloaded files.
- (Optional) Azure Key Vault: To securely store the password for the Service Principal.
How it works
During the initial run, we selectively retrieve desired files from the SharePoint folder (e.g. only XLSX files), save them to the storage account, and log the operation. The logging includes file metadata, such as the File Last Modification Date.
In subsequent runs, we retrieve the logged metadata, identify the maximum Modification Date, and filter for files newer than that timestamp. We do this before pulling files from SharePoint so that only files that changed compared to the previous run are downloaded. We then load the files into ADLS (Azure Data Lake Storage) and log the operation again.
Azure Active Directory (Entra ID) Configuration
Service Principal Configuration
Option 1) Configure access for Service Principal to all SharePoint Sites.
To implement this option:
- Access your app registration,
- Add the ‘MS Graph’ API permission with ‘Application Permission’,
- Specify ‘Sites.Read.All’ as the permission name.
Tenant Admin consent is required since anyone with access to this App Registration will gain the ability to read all SharePoint sites in the tenant.
Option 2) Configure access for specific SharePoint Site.
For more granular control, you can grant access to a specific SharePoint site only. Add the same 'MS Graph' API permission as mentioned above, but this time with a different permission name – Sites.Selected. Then, to grant access to a specific SharePoint site, you can use tools like PowerShell. Use the following command:
Import-Module Microsoft.Graph.Sites

$params = @{
    roles = @("write")
    grantedToIdentities = @(
        @{
            application = @{
                id = "89ea5c94-1234-4321-abcd-3fa95f62b66e"
                displayName = "sharepoint-downloader"
            }
        }
    )
}

New-MgSitePermission -SiteId $siteId -BodyParameter $params
Where id is your App Registration Application ID and siteId is your SharePoint site ID.
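If you do not know the site ID yet, you can resolve it through the Graph API by the site's URL path. Below is a minimal PowerShell sketch, assuming the Microsoft.Graph PowerShell SDK is installed and you sign in with an account that can read the site; the tenant and site names are the sample values used in this article.
# Sketch: resolve the SharePoint site ID required by New-MgSitePermission.
# Assumes the Microsoft.Graph PowerShell SDK is installed; tenant/site names are this article's samples.
Connect-MgGraph -Scopes "Sites.Read.All"

# The ':/sites/<name>' path form lets Graph resolve a site by its URL path.
$site = Invoke-MgGraphRequest -Method GET `
    -Uri "https://graph.microsoft.com/v1.0/sites/3p5g5j.sharepoint.com:/sites/SampleTeamSite"

$siteId = $site.id
$siteId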
Managed Identity
Please note that you can also grant the same permissions to a Managed Identity (either System or User Assigned). This has several benefits: there is no password or secret to maintain, and the identity lives alongside your project resources. The limitation is that a Managed Identity cannot access SharePoint in a tenant other than the one where it exists.
Our scenario: we will be using a Service Principal, because our Data Factory and SharePoint reside in different tenants.
Azure Key Vault Configuration
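The only thing we need from Key Vault is a secret holding the Service Principal password, under the name referenced later in Data Factory ('DSS-SPI-SHAREPOINT' in our example). Below is a minimal PowerShell sketch, assuming the Az module and sufficient Key Vault permissions; the vault name is a placeholder.
# Sketch: store the Service Principal secret in Key Vault.
# 'my-keyvault' is a placeholder; 'DSS-SPI-SHAREPOINT' is the secret name referenced later in ADF.
$secret = ConvertTo-SecureString "<service-principal-secret-value>" -AsPlainText -Force
Set-AzKeyVaultSecret -VaultName "my-keyvault" -Name "DSS-SPI-SHAREPOINT" -SecretValue $secret

# If the vault uses access policies, the Data Factory identity also needs permission to read secrets.
Set-AzKeyVaultAccessPolicy -VaultName "my-keyvault" -ObjectId "<adf-managed-identity-object-id>" -PermissionsToSecrets Get,List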
SQL Server Configuration
In SQL Server, we are going to keep information about all files that we have downloaded to the Storage Account using a Stored Procedure that Data Factory will trigger.
To ensure proper access control:
- Create Users in Database
We need to create a user in the database and grant the permissions necessary to execute the procedure (which writes to the table) and to read the table data.
You can choose between creating a contained SQL user with a password:
CREATE USER sql_user WITH PASSWORD='strong!Pa$$word'
Or creating a user from the Data Factory System/User Assigned Managed Identity:
CREATE USER [adf-dev-eliti] FROM EXTERNAL PROVIDER;
where [adf-dev-eliti] is the name of your Data Factory instance.
- Create table:
CREATE TABLE [xls].[file_logs](
    [xls_name] [varchar](255) NULL,
    [lastModifiedDateTime] [datetime] NULL,
    [folder_name] [varchar](255) NULL,
    [site_name] [varchar](255) NULL,
    [hash] [varchar](255) NULL
)
- Create Procedure:
CREATE PROCEDURE [xls].[log_xls_load]
    @xls_name varchar(255)
    ,@lastModifiedDateTime varchar(255)
    ,@folder_name varchar(255)
    ,@site_name varchar(255)
    ,@hash varchar(255)
AS
SET NOCOUNT ON;
INSERT INTO xls.file_logs (xls_name, lastModifiedDateTime, folder_name, site_name, hash)
VALUES (@xls_name, @lastModifiedDateTime, @folder_name, @site_name, @hash)
GO
- Ensure that the created users have the necessary permissions to execute the stored procedure and read from the table. Grant the required permissions using the GRANT statement, such as:
GRANT EXECUTE ON [xls].[log_xls_load] TO [adf-dev-eliti];
GRANT SELECT ON [xls].[file_logs] TO [adf-dev-eliti];
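As an optional sanity check, you can call the procedure manually and read the log table back; the values below are made up, and you should remove the test row afterwards.
-- Optional sanity check: log a sample row through the procedure and read it back.
-- The values are made up; in the pipeline they come from the Graph API response.
EXEC [xls].[log_xls_load]
    @xls_name = 'Financial Sample_one_sheet.xlsx'
    ,@lastModifiedDateTime = '2023-11-21T16:33:49'
    ,@folder_name = 'DSS_XLS'
    ,@site_name = 'SampleTeamSite'
    ,@hash = 'sample-hash-value';

SELECT * FROM [xls].[file_logs];

-- Clean up the test row so it does not affect the incremental date later.
DELETE FROM [xls].[file_logs] WHERE [hash] = 'sample-hash-value';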
Data Factory Configuration
Data Factory Credentials
For authentication to the Graph API, you’ll need either ServicePrincipal or User Assigned Managed Identity credentials. In our example, we will be using ServicePrincipal credentials with the name ‘EXCEL-SPI.’
Provide the following information:
- Credential name: ‘EXCEL-SPI’ in our example.
- Azure Cloud Type: In most cases, it is ‘Azure Public.’
- Tenant ID: The Tenant ID where the Service Principal was created.
- Service Principal ID: The Application ID from the Service Principal configuration.
- AKV linked service: Refer to the Linked Service section where we will create a Linked Service to Azure Key Vault.
- Secret Name: It is a reference to the Key Vault Entry Name where we have put the Service Principal secret value (in our example it was ‘DSS-SPI-SHAREPOINT’).
- Secret Version: In most cases it will be 'Latest version'; this setting makes ADF always look up the most recent secret value, which is very helpful when the secret is rotated.
This configuration ensures that Data Factory can securely authenticate to the Graph API using the specified ServicePrincipal credentials, leveraging secret storage in Azure Key Vault.
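Before wiring this up in Data Factory, it can help to confirm that the Service Principal secret actually grants access to the Graph API. The PowerShell sketch below uses the standard client credentials flow; the tenant ID, client ID, and secret values are placeholders for your own.
# Sketch: verify the Service Principal can reach the Graph API before configuring ADF.
# Placeholder values - substitute your own tenant ID, application (client) ID and secret.
$tenantId     = "<tenant-guid>"
$clientId     = "89ea5c94-1234-4321-abcd-3fa95f62b66e"
$clientSecret = "<value of the DSS-SPI-SHAREPOINT secret>"

# Client credentials flow against the Microsoft identity platform v2.0 token endpoint.
$token = (Invoke-RestMethod -Method POST `
    -Uri "https://login.microsoftonline.com/$tenantId/oauth2/v2.0/token" `
    -Body @{
        client_id     = $clientId
        client_secret = $clientSecret
        scope         = "https://graph.microsoft.com/.default"
        grant_type    = "client_credentials"
    }).access_token

# If permissions are correct, this returns the site object including its 'id'.
Invoke-RestMethod -Headers @{ Authorization = "Bearer $token" } `
    -Uri "https://graph.microsoft.com/v1.0/sites/3p5g5j.sharepoint.com:/sites/SampleTeamSite"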
Linked Services
Azure Data Lake Gen2 – This is a Linked Service to the Azure Storage Account where we will send retrieved files.
- Type: Azure Data Lake Storage Gen2
- (1) Authentication: For production, it is recommended to use Managed Identity, but for simplicity in this example, we will use ‘Account Key’.
- (2) URL: It is the Data Lake Storage Endpoint, which you can find in the ‘Endpoints’ section of your storage account.
- (3) Storage Account Key: The key is available in the 'Access Key' section of your storage account.
HTTP – This is the core Linked Service that we will be using to call the Graph API.
- (1) Type: HTTP
- (2) Linked Service Parameter: URL (string type)
- (3) Base URL: Dynamic content
@{linkedService().URL}
That means we will pass the URL value during pipeline execution and reuse the same Linked Service.
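For reference, the JSON behind such a parameterized HTTP Linked Service looks roughly like the sketch below (the name 'HTTP_Graph' is illustrative; the authentication type is Anonymous here on the assumption that the Copy Activity will call the pre-authenticated download URL returned by the Graph API, as shown later):
{
    "name": "HTTP_Graph",
    "properties": {
        "type": "HttpServer",
        "parameters": {
            "URL": { "type": "string" }
        },
        "typeProperties": {
            "url": "@{linkedService().URL}",
            "authenticationType": "Anonymous",
            "enableServerCertificateValidation": true
        }
    }
}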
Azure Key Vault (OPTIONAL) – required for the Service Principal secret. This is not needed when you are using a Managed Identity.
Azure SQL Database – required if you want to load files incrementally and make your process more dynamic (metadata-driven).
- (1) Type: Azure SQL Database.
- (2) SQL Endpoint: You can either enter the server and database name manually or select from your Azure subscription.
- (3) Authentication type: System Identity (ADF Identity that we added to SQL Server).
Data Factory development
Datasets
HTTP with Binary format (name: HTTP_BinaryFiles) – Source dataset in Binary format, which allows us to copy any file type; it is linked with the HTTP Linked Service and will be used in the Copy Activity. Additionally, we create a dataset parameter named "URL" and, in the linked service properties, assign it to the linked service URL parameter by adding the dynamic content @dataset().URL.
We will be passing values to the URL parameter dynamically from the pipeline.
ADLS with Binary format (name: ADLS_binXLS) – Destination to Azure Data Lake, format binary, linked with Data Lake Linked Service.
Additionally, we create dataset parameters XLS_folder_path and XLSX_file_name. These parameters are used in the File Path attribute, which consists of three elements: ADLS container name, folder path, and file name. We will dynamically assign the folder path to take the current date into account, and dynamically assign the file name based on the file we are retrieving.
In our case:
- Container name is a static value: "raw".
- Folder path is dynamic content
@concat('file_format=other_xlsx',dataset().XLS_folder_path)
, where the base path is always file_format=other_xlsx and the rest of the path will be passed from the pipeline.
- File name is the value of the parameter
@dataset().XLSX_file_name
, which will also be passed from the pipeline.
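Put together, the dataset definition looks approximately like the JSON sketch below (the linked service name 'AzureDataLakeStorage1' is illustrative, and the JSON generated by your factory may differ slightly):
{
    "name": "ADLS_binXLS",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "AzureDataLakeStorage1",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "XLS_folder_path": { "type": "string" },
            "XLSX_file_name": { "type": "string" }
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "raw",
                "folderPath": {
                    "value": "@concat('file_format=other_xlsx',dataset().XLS_folder_path)",
                    "type": "Expression"
                },
                "fileName": {
                    "value": "@dataset().XLSX_file_name",
                    "type": "Expression"
                }
            }
        }
    }
}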
Azure SQL Dataset (name: sql_file_log_table) – You only need to use the Linked Service that we created; there is no need to point the Dataset to a specific table. It will be used to retrieve data from the log table in the SQL Database.
Pipeline – finally the core!
Pipeline graph:
To simplify the pipeline, I will 'hard code' some variables. Later, you can modify those parts of the pipeline to dynamically assign values, for example from the database, and iterate through each value. In our example, we will retrieve data from a single site and a single folder.
Our Pipeline has following variables:
- SiteURL – We will set this variable based on the activity Get_SPSiteId_by_name output value ‘Id.’ It will be the base URL to run the ‘Get drivers’ method.
- SiteName – We will set this variable to the source SharePoint site name from which we will be pulling data. We can pass the site name to this pipeline from an external pipeline to make it more dynamic.
- partition_folder – This variable will hold the target ADLS folder path in Hive-style partitioning convention.
- SharePoint_folder – We will set this variable to the name of the folder or path on SharePoint from which we will be pulling data. In our example, it will be a static value, but you can pass multiple values and loop through folders.
- increment_date – This variable will be used to store the timestamp retrieved from the SQL table and filter SharePoint files to include only new files compared to the previous run.
Pipeline Activities:
(0) Set_Site_name [Set Variable]
- Description: Set the SharePoint site name from which we would like to get the files. In our case, it is hardcoded to “SampleTeamSite“.
(1) Get_SPSiteId_by_name [Web]
- Description: This activity will be used to pull the SharePoint ID by providing the SharePoint site name from the “SiteName” variable.
- Configuration:
- URL:
@concat('https://graph.microsoft.com/v1.0/sites/3p5g5j.sharepoint.com:/sites/',variables('SiteName'))
, where you need to replace "3p5g5j" with your SharePoint tenant name.
- Method: GET
- Authentication: Service Principal
- Authentication reference method: Credential
- Credentials: EXCEL-SPI (please refer to Data Factory Credentials section)
- Resource: https://graph.microsoft.com
- Response: Should be JSON, with the most important column being ‘id,’ which we will be using in the next activity.
{... "id": "3p5g5j.sharepoint.com,f66dcbda-1234-1234-1234-1234a68afab5,12340827-09b4-4d9a-9113-fba255d8af11", ...}
(2) Set_Graph_SPSiteID [Set Variable]
- Description: This activity will set the ‘SiteURL‘ variable to the ‘ID‘ value from the previous activity response and concatenate the ‘ID‘ with the Graph API URL.
- Configuration:
- Name: SiteURL
- Value:
@concat('https://graph.microsoft.com/v1.0/sites/',activity('Get_SPSiteId_by_name').output.id)
(3) Get_drivers [Web]
- Description: This activity retrieves the list of drives (document libraries) of the site – the top-level SharePoint folders from which we can pull files.
- Configuration:
- URL:
@concat(variables('SiteURL'),'/drives')
Example URL: https://graph.microsoft.com/v1.0/sites/3p5g5j.sharepoint.com,f66dcbda-1234-1234-1234-1234a68afab5,12340827-09b4-4d9a-9113-fba255d8af11/drives
- Method: GET
- Authentication: Service Principal
- Authentication reference method: Credential
- Credentials: EXCEL-SPI (please refer to Data Factory Credentials section)
- Resource: https://graph.microsoft.com
- Response: Should be JSON, with the most important columns being ‘name’ and ‘Id,’ which we will be using in the next activity.
{... "id": "b!2stt9pFsLkGbFmGbpor66666MPO0CZpNkRP7olXYrxFTaM9Ddvs8Ra3f014nRyX2", .... "name": "DSS_XLS", ...}
(4) Set_SharePoint_Folder [Set Variable]
- Description: Set the SharePoint folder name from which we would like to get the files. In our case, it is hardcoded to ‘DSS_XLS.’ It will be used to filter the above response to select only the folder we are interested in to pull the data from.
- Configuration:
- Variable type: Pipeline variable
- Name: SharePoint_folder
- Value: DSS_XLS
(5) Filter_XLS_Folder [Filter]
- Description: In this activity, we filter the output of activity (3) and select the item where the name is equal to ‘DSS_XLS,’ which is the value from activity (4). You can make this pipeline more dynamic, for example, by filtering multiple folders (using an OR condition) or by implementing a For Each activity to iterate through all the folders from which you want to pull data.
- Configuration:
- Items (dynamic content):
@activity('Get_drivers').output.value
- Condition (dynamic content):
@equals(item().name,variables('SharePoint_folder'))
(6) Get_files [Web]
- Description: This activity invokes the Graph API to list the children of the Drive ID we filtered in the previous step, under a specific path – in our case '/items/root/children'. We refer to the Drive ID (folder ID) from the previous activity with activity('Filter_XLS_Folder').output.value[0].id. You can also list items in subfolders; this method also returns a list of folders that you can iterate through.
- Configuration:
- URL: dynamic content below
@concat(
    variables('SiteURL')
    ,'/drives/'
    ,activity('Filter_XLS_Folder').output.value[0].id
    ,'/items/root/children'
)
- Method: GET
- Authentication: Service Principal
- Credentials: EXCEL-SPI
- Resource: https://graph.microsoft.com
- Response: Below is the response from the activity. I have pasted only the most interesting parts that we are going to use in the pipeline. Basically, it returns the entire content of our SharePoint folder, including information such as File Name, File Modification Time, and Download URL.
{ "@odata.context": "https://graph.microsoft.com/v1.0/$metadata#sites('3p5g5j.sharepoint.....<longURL>...", "value": [ { "@microsoft.graph.downloadUrl": "https://3p5g5j.sharepoint.com/sites/Sam...<veryLongURL>...", "createdDateTime": "2023-11-21T17:24:12Z", ... "lastModifiedDateTime": "2023-11-21T17:24:12Z", "name": "DSS_sample_103.json", ... }, { "@microsoft.graph.downloadUrl": "https://3p5g5j.sharepoint.com/sites/Sample...<veryLongURL>...", "createdDateTime": "2023-11-21T16:33:49Z", ... "lastModifiedDateTime": "2023-11-21T16:33:49Z", "name": "Financial Sample_one_sheet.xlsx", ... } }
(7) get_increment_date [Lookup]
- Description: This activity connects to the SQL Database, specifically to the [xls].[file_logs] table, and retrieves the latest Modification Date among all files logged for a specific SharePoint folder. Each time we get the files, we log the file modification timestamp to this table; to retrieve only new files, we use the timestamp from the last load. Additionally, we need to convert the SQL datetime value to an ISO 8601 timestamp with 'T' between the date and time and 'Z' appended at the end, matching this format: '2023-11-21T16:33:49Z'. There is also a handler for NULLs, for example when this pipeline runs for the first time and there are no entries in the database table yet.
- Configuration:
- Source dataset: sql_file_log_table
- First row only: True
- Use query: Query
- Query:
select ISNULL(
    max(CONVERT(varchar(50), lastModifiedDateTime, 127) + 'Z')
    ,CONVERT(varchar(50), cast('1900-01-01' as datetime), 127) + 'Z'
) as max_dt
from [xls].[file_logs]
WHERE 1=1
    and site_name = 'SampleTeamSite'
    and folder_name = 'DSS_XLS'
NOTE: You should, of course, change values such as 'SampleTeamSite' to reference the pipeline variable 'SiteName' and build the whole query as a concatenation of strings and variables (see the sketch after this activity's Response below). However, to keep it readable for beginners, I have hardcoded them.
- Response: This activity responds with a single value, but it is of Object type e.g.
{"firstRow": { "max_dt": "2023-11-21T16:33:49Z"}, ...
(8) Set_increment_date [Set Variable]
- Description: This activity retrieves the value from activity (7) and converts it into a string.
- Configuration:
- Variable type: Pipeline variable
- Name: increment_date
- Value (dynamic content):
@string(activity('get_increment_date').output.firstRow.max_dt)
(9) Filter_xlsx_file_and_increment [Filter]
- Description: In this activity, we filter the output of activity (6), which provided us with a list of all files, and apply two filters to it.
The first filter is optional, and in our case, we want to retrieve only XLSX files. I have used the File Name to end with ‘.xlsx,’ but you can apply multiple conditions there.
The second filter is to retrieve only new files that have changed since the last pipeline run. To achieve this, we compare the 'lastModifiedDateTime' value with the 'increment_date' that we pulled from the SQL table and select only items with a greater value (comparing the ISO 8601 strings works here because, with identical formatting, their lexicographic order matches chronological order).
- Configuration:
- Items (dynamic content) :
@activity('Get_files').output.value
- Condition: dynamic content below
@and(
    endswith(item().name,'.xlsx')
    ,greater(item().lastModifiedDateTime, variables('increment_date'))
)
(10) Set_partition [Set Variable]
- Description: In this activity, we additionally set the target path in Data Lake storage for our data. We use Hive Style Partitioning for folder path names. This setup helps us identify where the files came from and when exactly they were uploaded to the Storage Account. An example path will be:
/system=sharepoint/site=SampleTeamSite/sp_folder=DSS_XLS/dt=2023-12-11/hh=13/mm=08/
We will use this in the Copy Activity.
- Configuration:
- Name: partition_folder
- Value: dynamic content below
@concat('/system=sharepoint', '/site=', variables('SiteName'), '/sp_folder=', variables('SharePoint_folder'), '/dt=', utcNow('yyyy-MM-dd'), '/hh=', utcNow('HH'), '/mm=', utcNow('mm'), '/')
Note that 'HH' gives the hour on a 24-hour clock (matching the example path above); lowercase 'hh' would return a 12-hour value.
(11) ForEach_xls_file [ForEach]
- Description: This activity will iterate through each file returned by the 'Filter_xlsx_file_and_increment' activity. For each file, it executes the Copy Activity and logs the file retrieval to the database (activities 12 and 13), running iterations in parallel.
- Configuration:
- Sequential: No (that means we want to get files in parallel)
- Batch count: Set it to the number of files you want to retrieve in parallel (max 50)
- Items (dynamic content):
@activity('Filter_xlsx_file_and_increment').output.value
(12) COPY_SP_binary_to_ADLS_binary [Copy Data]
- Description: This activity will invoke the retrieved download URL and save the output file to Azure Data Lake Storage, to the specific path we set in activity (10), which is ‘partition_folder’.
- Configuration Source:
- Dataset properties:
- URL (dynamic content):
@item()['@microsoft.graph.downloadUrl']
- Request method: GET
- Configuration Sink:
- Sink dataset: ADLS_binXLS
- Dataset properties (dynamic content):
- XLS_folder_path:
@variables('partition_folder')
- XLSX_file_name:
@item()['name']
(13) Log_file_upload [Stored Procedure]
- Description: This activity logs information about the loaded files to a database table. It includes details such as the file modification date, SharePoint site, SharePoint folder name, and file name.
- Configuration:
- Linked service: AzureSqlDatabase1
- Stored procedure name: [xls].[log_xls_load]
Stored procedure parameters (dynamic content):
- xls_name: @item().name
- lastModifiedDateTime: @item().lastModifiedDateTime
- folder_name: @item().parentReference.name
- site_name: @variables('SiteName')
- hash: @item().file.hashes.quickXorHash
- Note: I did not explain the hash parameter previously, but it is available from the Graph API result, and I am exploring whether it can help check if the content of a file has actually changed. This can prevent a situation where someone opened a file and it was auto-saved, which changes the modification time but not the content. Excluding such files could save us some processing time later.
Conclusion
We have created a simple pipeline that we can scale for multiple SharePoint sites and folders. We can utilize this technique for multiple purposes, such as:
- Backing up files from SharePoint to Storage Account
- Loading files to Data Lake or Data Lakehouse
- Staging files for further data processing
In addition, it allows us to do this incrementally and save money by avoiding storing duplicate files and/or processing the same file multiple times. The process runs very efficiently and can download many files in parallel (hundreds of XLSX files in a few minutes). Just schedule the trigger and enjoy 🙂