One of the hottest topics in the world of the Microsoft data platform is without any doubt Azure Synapse Analytics. It is a service that evolved from Azure Datawarehouse and is currently based on several pillars, where on the one hand we are dealing with Dedicated SQL Pool, i.e. a relational SQL engine operating in MPP (Massive Parallel Processing) technology, SQL Serverless where we are not interested in resources at all, and only the data volume we want to process and Spark Pool, i.e. a separate service based on Apache Spark. The whole is enriched with integration with version control, pipelines, data lake, Power BI and several other services within the Azure cloud. There is really a lot of it and, in general, the service gives great opportunities to analyze, store and process various data in several ways. Due to the fact that Azure has long been the number one topic when it comes to my technical interests, today I would like to share with you how to load data into the Dedicated SQL Pool. Of course, there are several possibilities to perform this type of operation, but today I would like to focus on a relatively simple, but extremely effective option, which is undoubtedly COPY INTO.
One of the most important topics in Azure Data Platform is Azure Synapse Analytics. This service evolved from Azure Datawarehouse and currently is based on a few services. First of all, we have Dedicated SQL Pool – a powerful relational engine that is based on MPP (Massive Parallel Processing) technology, we also have SQL Serverless where we don’t care about resources but the volume of data that we want to process but that is not everything because Spark Pool for big data workloads and Data Explorer for telemetry and log analysis at large scale. All those services are strictly connected to each other and additionally, we have the possibility to use them with pipelines, azure storage, Power BI and a few other services available in Azure. Today I would like to share with you some ideas on how mentioned above Dedicated SQL Pool can be loaded. There are few possibilities for how to do it but today I will focus on COPY INTO – a very simple but powerful command. Let’s get started!
If you remember the good old days when we had Azure DAtawarehouse you probably remember that the preferable way of loading data into the solution was Polybase. This technology is nothing but data virtualization mechanism that was used to load data from Data Lake storage or Blob storage by creating External Data Source. Polybase is also available in SQL Server instances to query data with T-SQL directly from SQL Server, Oracle, Teradata, MongoDB, Hadoop clusters, Cosmos DB, and S3-compatible object storage without separately installing client connection software – but not about SQL Server today! Regarding loading of Synapse Dedicated SQL Pool Polybase is still a valid option but we have a pretty good competitor nowadays which is COPY INTO. With this command, we can load data efficiently without the creation of additional objects.
Before we will go deeper into the details let’s say a few words about what kind of permissions we must have to run COPY INTO – there are two of them:
- ADMINISTER DATABASE BULK OPERATIONS,
- INSERT.
As I said in the beginning COPY INTO can load that from following storage services:
- Azure BLOB Storage,
- Azure Data Lake Storage Gen2.
Someone can ask why we have to load data from storage and not directly from data sources like Azure SQL. It is a good question but the answer is pretty simple – it just runs faster! The best situation will be when you will use Azure Data Lake Storage Gen 2 – this is the fastest storage for analytical workload because of many factors – the most crucial one is that with this storage we have hierarchical containers that give us the possibility to pretty quickly find data that is interesting for us. Data Lake is built on top of BLOB storage so connecting to both is very similar. In the example that I preferred we can authenticate using one of the following methods:
- Shared Access Signature,
- MSI – Managed Service Identity,
- Service Principal,
- Service Key,
- Azure Active Directory.
As you can see we have many options but please consider using Azure Active Directory-based methods because it is the most secure and reliable way for authentication. Please keep in mind that best practices from SMP(Symmetric Multi-Processing) systems like SQL Server not always are good for MPP (Massive Parallel Processing) like Synapse. COPY INTO and POLYBASE are loading data directly to worker nodes of the Synapse cluster and BULK INSERT is loading data to the control node and data must be distributed by the control node itself. As you can imagine this is a much slower operation than the direct load available with storage between the source system and target synapse dedicated SQL pool.
If we are talking about data formats that we can load using COPY INTO we have three options:
- CSV (compressed or not),
- Parquet,
- ORC.
Everyone knows what CSV is so I don’t need to explain it – this format is pretty popular and it is standard for many years. An interesting fact is that Synapse can read compressed CSV files so the file size will be much smaller and because of that read operation from disk can be faster. Parquet format is also pretty popular, this opensource standard comes from the Apache ecosystem. This format is based on columnar storage with a very efficient compression algorithm. When data is placed in parquet only those columns are read which are significant for specific queries and because of that we can save time on reading data from disk. Parquet is pretty useful for every system that reads data directly from the disk (Synapse Serverless can be a good example here). A very similar file format is Apache ORC which is also based on columnar storage and it is pretty popular in various implementations of big data solutions.
Ok, the introduction is behind us so let’s check everything in practice. As I already mentioned if you want to use COPY INTO there is no need to create additional database objects so the only object that we have to prepare is the target table. For demo purposes, I created a Date table that will keep calendar data that is based on NY Taxi dataset (It is publicly available and you can get it from here).
CREATE TABLE [dbo].[stage_Date] ( [DateID] int NOT NULL, [Date] datetime NULL, [DateBKey] char(10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [DayOfMonth] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [DaySuffix] varchar(4) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [DayName] varchar(9) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [DayOfWeek] char(1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [DayOfWeekInMonth] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [DayOfWeekInYear] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [DayOfQuarter] varchar(3) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [DayOfYear] varchar(3) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [WeekOfMonth] varchar(1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [WeekOfQuarter] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [WeekOfYear] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [Month] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [MonthName] varchar(9) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [MonthOfQuarter] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [Quarter] char(1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [QuarterName] varchar(9) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [Year] char(4) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [YearName] char(7) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [MonthYear] char(10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [MMYYYY] char(6) COLLATE SQL_Latin1_General_CP1_CI_AS NULL, [FirstDayOfMonth] date NULL, [LastDayOfMonth] date NULL, [FirstDayOfQuarter] date NULL, [LastDayOfQuarter] date NULL, [FirstDayOfYear] date NULL, [LastDayOfYear] date NULL, [IsHolidayUSA] bit NULL, [IsWeekday] bit NULL, [HolidayUSA] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ) WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );
As you probably noticed our table is a heap and we can explicitly declare that in WITH section (without declaring what kind of table we want COLUMNSTORE will be created). Next to that, we mentioned the distribution algorithm which is ROUND_ROBIN so proportional fill. Distribution and Indexing is a big topic in Synapse so probably I will write an article about it but for now I can mention that HEAP + Round Robin is a good choice for staging tables.
We have our target table already created so let’s test our loading statement:
COPY INTO [dbo].[stage_Date] FROM 'https://seequalitystorageaccount.blob.core.windows.net/nytaxi/data/Date/*.txt' WITH ( CREDENTIAL =(IDENTITY = 'Managed Identity'), FILE_TYPE = 'CSV', FIELDTERMINATOR = ',', ROWTERMINATOR='0X0A', ENCODING = 'UTF8' ) OPTION (LABEL = 'COPY : dbo.Date');
The above syntax is pretty clear and it is easy to read for everyone. In FROM section of a query, I put the full path to the container where my data is located – if there is a need I can put provide multiple container paths. In the location that I choose many files are located and because I use * all of them will be loaded. What can be new for you Is OPTION (LABEL) – these keywords give you the possibility to add a label to your query and then based on this label you can very easily identify your query in the logs, dynamic management views, etc – pretty cool!
In the WITH section on my query, I added also a few options like file type, field delimiter, row terminator, or file encoding. Obviously, I didn’t put all the options in the above query but the file that I choose is pretty simple and the above options are enough. You probably noticed also a Credential option that I specified – in this section, we have to say to Synapse how it can authenticate to the file location. In this case, I choose Managed Identity so Synapse identity will be used to connect to Data Lake Storage.
If the file structure is defined correctly and Synapse can reach the specified location everything should work smoothly and data from the file should land in the table:
Of course, there is many different options and properties in COPY INTO:
COPY INTO [schema.]table_name [(Column_list)] FROM '<external_location>' [,...n] WITH ( [FILE_TYPE = {'CSV' | 'PARQUET' | 'ORC'} ] [,FILE_FORMAT = EXTERNAL FILE FORMAT OBJECT ] [,CREDENTIAL = (AZURE CREDENTIAL) ] [,ERRORFILE = '[http(s)://storageaccount/container]/errorfile_directory[/]]' [,ERRORFILE_CREDENTIAL = (AZURE CREDENTIAL) ] [,MAXERRORS = max_errors ] [,COMPRESSION = { 'Gzip' | 'DefaultCodec'| 'Snappy'}] [,FIELDQUOTE = 'string_delimiter'] [,FIELDTERMINATOR = 'field_terminator'] [,ROWTERMINATOR = 'row_terminator'] [,FIRSTROW = first_row] [,DATEFORMAT = 'date_format'] [,ENCODING = {'UTF8'|'UTF16'}] [,IDENTITY_INSERT = {'ON' | 'OFF'}] )
In the above code, you can clearly notice many things like compression or error handling options. I would like to say a few words about error handling because it is pretty interesting. To demonstrate how it works I prepared a flat file with the same structure as we had above:
19000101,1900-01-00 00:00:00.000,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
As you noticed I left many of the columns empty but the most important thing is a mistake that I made – the second column contains 00 so it is improper date format and I should get an error and it happened when I started the query:
Msg 13812, Level 16, State 1, Line 145 Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 2 (Date) in data file /data/Date/MyTestFile.txt.
If we have a number of errors that we accept we can set it up in MAXERRORS parameter – below you can see that I accepted only one error:
TRUNCATE TABLE [dbo].[stage_Date] COPY INTO [dbo].[stage_Date] FROM 'https://seequalitystorageaccount.blob.core.windows.net/nytaxi/data/Date/*.txt' WITH ( CREDENTIAL =(IDENTITY = 'Managed Identity'), FILE_TYPE = 'CSV', FIELDTERMINATOR = ',', ROWTERMINATOR='0X0A', ENCODING = 'UTF8', MAXERRORS = 1 )
After execution proper data rows have been inserted but I also received information that one row has been rejected:
Query completed. Rows were rejected while reading from external source(s). 1 row rejected from table [stage_Date] in plan step 4 of query execution: Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 2 (Date) in data file /data/Date/MyTestFile.txt.
The information that I get is pretty straightforward and it indicates not only files but also rows and columns. This kind of information can be useful during development but for automation purposes, you can also redirect wrong errors to a specific location with ERRORFILE option (and ERRORFILE_CREDENTIAL which can be used to authenticate to this location. If the error location will be based on the same location as the data then we should provide a full path with the root container and then authorization will be based on credentials specified in the CREDENTIAL option):
COPY INTO [dbo].[stage_Date] FROM 'https://seequalitystorageaccount.blob.core.windows.net/nytaxi/data/Date/' WITH ( CREDENTIAL =(IDENTITY = 'Managed Identity'), FILE_TYPE = 'CSV', FIELDTERMINATOR = ',', ROWTERMINATOR='0X0A', ENCODING = 'UTF8', ERRORFILE = '/errors' )
Of course above query returns an error but this time all the errors have been redirected to the specified location as you can see below:
Synapse created a new folder named _rejectedrows and inside of it, we have an additional folder with the date and time when the query was executed. In this location, we have also two files where the file with the “Row” suffix contains a row with the error and the file with the “Error” suffix contains an error message.
Very interesting behavior we can notice in situations when we will provide MAXERROR bigger than 0 and specify ERRORFILE. A wrong row will be redirected to the error location but the query itself will end with success. I would like also to mention that even if an error will not appear folder will be created in the error location – we have to be aware of this very specific behavior.
Let’s go to the next topic and let’s check how to load a subset of columns and how to replace empty values. The below query is taking only the first two columns from my test file (we can identify them by names defined in the target table), additionally, for DateBKey we are replacing empty values with “#”:
COPY INTO [dbo].[stage_Date] (DateID,Date,DateBKey default '#') FROM 'https://seequalitystorageaccount.blob.core.windows.net/nytaxi/data/Date/MyTestFile.txt' WITH ( CREDENTIAL =(IDENTITY = 'Managed Identity'), FILE_TYPE = 'CSV', FIELDTERMINATOR = ',', ROWTERMINATOR='0X0A', ENCODING = 'UTF8' )
If you want to deal with empty values at this stage then it is worth considering.
We can use COPY INTO not only from the TSQL but also from other tools like Azure Data Factory/Integrated Pipelines. If you decide to use ADF we have a few things to remember – first of all, the entire process of loading is done by Synapse, and ADF is the only orchestrator, because of that Copy Activity DIU units will not affect the performance of loading. Second thing is that we have to configure Copy Activity in a different way for different file types.
If we want to load CSV files then we have to configure the following settings:
- row delimiter must be a single character or \r\n,
- NULL value cannot be handled or an empty string should be used,
- file encoding should be set to UTF-8 or UTF-16,
- Escape Character and Quote Character must be not empty and exactly the same for both settings,
- number of rows to skip must be set to 0 or left with a default value,
- the only supported compression is GZIP or no compression at all.
If we are talking about Parquet or ORC formats the only setting that we can affect is the compression of a file. As you can see loading flat files is much more complicated so we will do it in our example.
The first step is the creation of Linked Service to Data Lake Storage Gen2 and Synapse. It can be done in a standard way like any other connection so I skipped this part. In the next step, I created data set on data lake storage – the configuration of this object is visible below:
- In the file path setting, I put root container and child folders but without a single file because we want to load all the files from this location,
- For Escape character and Quote Character I put backslash because I cannot leave it empty – my files don’t have those so I set a sign that will not affect loading,
- Null value setting I left empty so empty values will not be managed during the load.
The second dataset that is connecting to Synapse is pretty simple and the only thing that we have to do is choose the target table:
In Copy Activity choosing the source looks like this:
- File path type in this case I set to Wildcard file path where the only thing to do is to put * instead of the hardcoded file name (in this scenario * is the only supported option)
- I selected also “Recursively” option – please select it when you want to load folders and subfolders from a provided location.
Sink configuration is also simple to understand and configure:
On this screen you can choose the method of loading – please remember that Polybase and Copy command is much faster than Bulk Insert. In the Default values section as the name suggests we can provide default values in a similar way as we did in the TSQL example. We can use “Additional options ” to provide additional options that in TSQL we provided in WITH section of COPY INTO. The rest of the options I left empty and as you probably noticed I am truncating the target table before the loading.
That’s all regarding configuration. After execution of this pipeline data should be loaded without any problems:
As I said in the begging entire loading process is done by Synapse and the only thing that ADF is doing is generating proper COPY INTO command. This generated query we can find in Synapse in many places – for example in dynamic management view sys.dm_pdw_exec_requests:
SELECT * FROM sys.dm_pdw_exec_requests WHERE status in ('Completed','Failed','Cancelled')
or in Query activity panel:
Using one of the above methods we can find a query generated by ADF:
COPY INTO [dbo].[stage_Date] ( [DateID] 1 ,[Date] 2 ,[DateBKey] 3 ,[DayOfMonth] 4 ,[DaySuffix] 5 ,[DayName] 6 ,[DayOfWeek] 7 ,[DayOfWeekInMonth] 8 ,[DayOfWeekInYear] 9 ,[DayOfQuarter] 10 ,[DayOfYear] 11 ,[WeekOfMonth] 12 ,[WeekOfQuarter] 13 ,[WeekOfYear] 14 ,[Month] 15 ,[MonthName] 16 ,[MonthOfQuarter] 17 ,[Quarter] 18 ,[QuarterName] 19 ,[Year] 20 ,[YearName] 21 ,[MonthYear] 22 ,[MMYYYY] 23 ,[FirstDayOfMonth] 24 ,[LastDayOfMonth] 25 ,[FirstDayOfQuarter] 26 ,[LastDayOfQuarter] 27 ,[FirstDayOfYear] 28 ,[LastDayOfYear] 29 ,[IsHolidayUSA] 30 ,[IsWeekday] 31 ,[HolidayUSA] 32 ) FROM 'https://seequalitystorageaccount.dfs.core.windows.net:443/nytaxi/data/Date/*' WITH ( IDENTITY_INSERT='OFF' ,CREDENTIAL=(IDENTITY='Storage Account Key',SECRET='***') ,FILE_TYPE='CSV' ,COMPRESSION='NONE' ,FIELDQUOTE='\' ,FIELDTERMINATOR=',' ,ROWTERMINATOR='0x0A' ,FIRSTROW=1 ) OPTION (LABEL='ADF Activity ID: 4062133e-ac79-4378-8393-9139687fd129')
The above query is showing us all the configurations that we set in ADF. Very important information you can see in CREDENTIAL option – to authenticate to Data Lake Storage Account Key was used (I chose this method in Linked Service configuration) and sensitive value is hidden and is not available for anyone who will analyze logs, etc. Additionally, our query has a label with an identifier of ADF activity so it can be easily joined with information available in Data FActory logs.
Ok, folks, that’s all regarding COPY INTO. As you can see loading data to Synapse is much easier with COPY INTO. I suggest testing all the options available on your own – below you can find some links that can be also useful for you. Enjoy!
Linki:
- Executing SQL queries from Azure DevOps using Service Connection credentials - August 28, 2024
- Setup Git credentials for Service Principal in Azure Databricks - August 21, 2024
- Microsoft Fabric 101 Episode 3: Pausing and Scaling using portal and Powershell - August 8, 2024
Last comments