Performance optimization for the Copy data activity in Azure

This article outlines the copy activity performance optimization features that you can leverage in Azure Data Factory and Synapse pipelines.


Copy performance optimization features

The service provides the following performance optimization features:


Database transaction units (DTUs)

Data Integration Units

Self-hosted integration runtime scalability

Parallel copy

Staged copy


Database transaction units (DTUs)

A database transaction unit (DTU) represents a blended measure of CPU, memory, reads, and writes. Service tiers in the DTU-based purchasing model are differentiated by a range of compute sizes with a fixed amount of included storage, a fixed retention period for backups, and a fixed price. All service tiers in the DTU-based purchasing model provide the flexibility of changing compute sizes with minimal downtime; however, there is a switch-over period where connectivity to the database is lost for a short amount of time, which can be mitigated by using retry logic. Single databases and elastic pools are billed hourly based on service tier and compute size.

When you copy from an Azure SQL source, you can use ADF to scale the database up to a higher DTU level (for example, 3000 DTUs) before the copy runs, and then scale it back down to the original DTU level after the copy completes.
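As an illustration only, here is a minimal sketch of such a pipeline in ADF JSON. It assumes a Script activity against a hypothetical AzureSqlDb linked service, a placeholder database name, and example service objectives (P6 as the scaled-up tier, S3 as the original tier); your tiers, datasets, and activity names will differ.

```json
{
    "name": "ScaleUpCopyScaleDown",
    "properties": {
        "activities": [
            {
                "name": "ScaleUpDtu",
                "type": "Script",
                "linkedServiceName": { "referenceName": "AzureSqlDb", "type": "LinkedServiceReference" },
                "typeProperties": {
                    "scripts": [
                        { "type": "NonQuery", "text": "ALTER DATABASE [MyDb] MODIFY (SERVICE_OBJECTIVE = 'P6');" }
                    ]
                }
            },
            {
                "name": "CopyFromAzureSql",
                "type": "Copy",
                "dependsOn": [ { "activity": "ScaleUpDtu", "dependencyConditions": [ "Succeeded" ] } ],
                "inputs": [ { "referenceName": "<Azure SQL source dataset>", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "<sink dataset>", "type": "DatasetReference" } ],
                "typeProperties": {
                    "source": { "type": "AzureSqlSource" },
                    "sink": { "type": "ParquetSink" }
                }
            },
            {
                "name": "ScaleDownDtu",
                "type": "Script",
                "dependsOn": [ { "activity": "CopyFromAzureSql", "dependencyConditions": [ "Succeeded" ] } ],
                "linkedServiceName": { "referenceName": "AzureSqlDb", "type": "LinkedServiceReference" },
                "typeProperties": {
                    "scripts": [
                        { "type": "NonQuery", "text": "ALTER DATABASE [MyDb] MODIFY (SERVICE_OBJECTIVE = 'S3');" }
                    ]
                }
            }
        ]
    }
}
```

Note that ALTER DATABASE ... MODIFY (SERVICE_OBJECTIVE = ...) returns quickly while the scale operation continues in the background, so a production pipeline would typically add a wait or polling step (for example, an Until activity) before starting the copy.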


Data Integration Units

A Data Integration Unit (DIU) is a measure that represents the power of a single unit in Azure Data Factory and Synapse pipelines.

Power is a combination of CPU, memory, and network resource allocation.

DIU applies only to the Azure integration runtime; it does not apply to the self-hosted integration runtime.
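As a hedged sketch, the DIU count can be set explicitly with the dataIntegrationUnits property in the copy activity's JSON definition (or the Data Integration Unit setting on the Settings tab); the dataset references and store types below are placeholders.

```json
{
    "name": "CopyWithExplicitDIU",
    "type": "Copy",
    "inputs": [ { "referenceName": "<source dataset>", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "<sink dataset>", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "AzureSqlSink" },
        "dataIntegrationUnits": 32
    }
}
```

If you leave dataIntegrationUnits unset, the service picks a default DIU value for each run based on the source-sink pair and data pattern.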


Self-hosted integration runtime scalability

You might want to host an increasing concurrent workload, or you might want to achieve higher performance at your present workload level. You can enhance the scale of processing in the following ways:

Scale up the self-hosted IR by increasing the number of concurrent jobs that can run on a node. Scale-up works only when the processor and memory of the node are less than fully utilized.

Scale out the self-hosted IR by adding more nodes (machines).

To choose between the two when you want higher throughput:

If the CPU and available memory on the self-hosted IR node are not fully utilized, but the number of concurrent jobs is reaching the limit, scale up by increasing the number of concurrent jobs that can run on a node.

If, on the other hand, CPU usage on the self-hosted IR node is high or available memory is low, add a new node to help scale out the load across multiple nodes.

Note that in the following scenarios, a single copy activity execution can leverage multiple self-hosted IR nodes:


Copy data from file-based stores, depending on the number and size of the files.

Copy data from a partition-option-enabled data store (including Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, and Oracle), depending on the number of data partitions; see the sketch after this list.
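As a sketch of the second scenario, a copy source can request partitioned reads with the partitionOption and partitionSettings properties; the column name and bounds below are illustrative assumptions.

```json
{
    "name": "CopyPartitionedSqlTable",
    "type": "Copy",
    "inputs": [ { "referenceName": "<Azure SQL source dataset>", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "<sink dataset>", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "partitionOption": "DynamicRange",
            "partitionSettings": {
                "partitionColumnName": "OrderId",
                "partitionLowerBound": "1",
                "partitionUpperBound": "1000000"
            }
        },
        "sink": { "type": "ParquetSink" }
    }
}
```

Each partition range can then be read in parallel, and those parallel reads can be spread across the self-hosted IR nodes.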


Parallel copy

You can set parallel copy (the parallelCopies property in the JSON definition of the Copy activity, or the Degree of parallelism setting on the Settings tab of the Copy activity properties in the user interface) to indicate the parallelism that you want the copy activity to use. You can think of this property as the maximum number of threads within the copy activity that read from your source or write to your sink data stores in parallel.


Parallel copy is orthogonal to Data Integration Units and self-hosted IR nodes; it is counted across all the DIUs or self-hosted IR nodes.

For each copy activity run, by default the service dynamically applies the optimal parallel copy setting based on your source-sink pair and data pattern.
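For illustration, the sketch below caps the copy activity at 8 parallel threads via parallelCopies; the datasets and store types are placeholders, and omitting the property lets the service choose the value automatically.

```json
{
    "name": "CopyWithParallelCopies",
    "type": "Copy",
    "inputs": [ { "referenceName": "<source dataset>", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "<sink dataset>", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "AzureSqlSink" },
        "parallelCopies": 8
    }
}
```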


Staged copy

A data copy operation can send the data directly to the sink data store. Alternatively,

you can choose to use Blob storage as an interim staging store.


How staged copy works

When you activate the staging feature, the data is first copied from the source data store to the staging storage (bring your own Azure Blob storage or Azure Data Lake Storage Gen2). Next, the data is copied from the staging storage to the sink data store. The copy activity automatically manages the two-stage flow for you, and also cleans up temporary data from the staging storage after the data movement is complete.

 

When you activate data movement by using a staging store, you can specify whether you want the data to be compressed before it moves from the source data store to the staging store, and then decompressed before it moves from the staging store to the sink data store.


Currently, you can't copy data between two data stores that are connected via different self-hosted IRs, with or without staged copy. For such a scenario, you can configure two explicitly chained copy activities: one that copies from the source to staging, and another that copies from staging to the sink.

 

Configure the enableStaging setting in the copy activity to specify whether you want the data to be staged in storage before you load it into the destination data store, as shown in the sketch below.

Note that if you use staged copy with compression enabled, service principal or MSI authentication isn't supported for the staging blob linked service.
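As a minimal sketch, staged copy is turned on with enableStaging plus a stagingSettings block that points at your own staging store; the linked service reference, path, and source/sink types below are assumptions.

```json
{
    "name": "CopyWithStaging",
    "type": "Copy",
    "inputs": [ { "referenceName": "<source dataset>", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "<sink dataset>", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "OracleSource" },
        "sink": { "type": "SqlDWSink" },
        "enableStaging": true,
        "stagingSettings": {
            "linkedServiceName": { "referenceName": "<staging Azure Blob linked service>", "type": "LinkedServiceReference" },
            "path": "stagingcontainer/path",
            "enableCompression": true
        }
    }
}
```

With enableCompression set to true, remember the restriction above: the staging blob linked service can't use service principal or MSI authentication.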


Thanks

 Rahul Sharma




