Posts

Performance optimization in Copy data activity in Azure

This article outlines the copy activity performance optimization features that you can leverage in Azure Data Factory and Synapse pipelines.

Copy performance optimization features
The service provides the following performance optimization features:
- Database transaction units (DTUs)
- Data Integration Units
- Self-hosted integration runtime scalability
- Parallel copy
- Staged copy

Database transaction units (DTUs)
A database transaction unit (DTU) represents a blended measure of CPU, memory, reads, and writes. Service tiers in the DTU-based purchasing model are differentiated by a range of compute sizes with a fixed amount of included storage, a fixed retention period for backups, and a fixed price. All service tiers in the DTU-based purchasing model provide the flexibility to change compute sizes with minimal downtime; however, there is a switch-over period during which connectivity to the database is lost for a short time, which can be mitigated using retry logic. Single...
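Since the switch-over briefly drops connectivity, that retry logic is worth building into any client that talks to the database while a tier change may be in flight. Below is a minimal, hypothetical sketch in Python using pyodbc; the connection string, retry count, and delay are illustrative assumptions, not values from the article.

import time
import pyodbc

# Hypothetical connection string; replace server, database, and credentials with your own.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=tcp:<your-server>.database.windows.net,1433;"
    "DATABASE=<your-db>;UID=<user>;PWD=<password>"
)

def connect_with_retry(conn_str, attempts=5, delay_seconds=10):
    # Retry the connection to ride out the short switch-over window of a tier change.
    for attempt in range(1, attempts + 1):
        try:
            return pyodbc.connect(conn_str, timeout=30)
        except pyodbc.Error:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)

conn = connect_with_retry(CONN_STR)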

Azure Blob Storage interview questions

Azure Blob Storage is a cloud-based storage service that is often used to store and process large amounts of data, so it is important to be prepared to answer questions about this service.

1. What is Azure Blob Storage?
Azure Blob Storage is a cloud-based storage service that is optimized for storing large amounts of unstructured data, such as video, audio, and images. Blob storage is also a good choice for storing data that needs to be accessed frequently but doesn't need to be updated often.

2. How does Azure Blob Storage work?
Azure Blob Storage is a cloud-based storage service that allows you to save and retrieve data from anywhere in the world. Blobs are stored in containers, which can be public or private, and can be accessed via a URL. Azure Blob Storage is a great way to store large amounts of data that you need to be able to access quickly and easily.

3. What are the key components of Azure Blob Storage?
The key components of Azure Blob Storage are containers and...
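To make the container/blob/URL relationship concrete, here is a small, hypothetical Python sketch using the azure-storage-blob SDK; the connection string, container name, and file name are placeholders, not values from the post.

from azure.storage.blob import BlobServiceClient

# Hypothetical connection string copied from the storage account's access keys.
service = BlobServiceClient.from_connection_string("<your-connection-string>")

# Containers group blobs; they can be private or allow public read access.
container = service.get_container_client("interview-demo")
container.create_container()

# Upload an image as a blob inside the container.
with open("photo.jpg", "rb") as data:
    container.upload_blob(name="photo.jpg", data=data)

# Every blob is addressable via a URL of the form
# https://<account>.blob.core.windows.net/<container>/<blob>.
print(container.get_blob_client("photo.jpg").url)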

Dynamic Partition pruning in Spark 3.0

We have often heard people say: SparkSQL is very slow after I integrate it with BI tools. Why do joins behave weirdly and slowly? I hate Spark because I have to learn how to manage memory, and I'm confused because I cannot decide how many executors, executor cores, and how much executor memory to use, or how to define shuffles and partitioning during map/reduce. A little bit of relief: Spark 3.0 introduces Dynamic Partition Pruning.

Assume we want to query logs_table and add a filter:
SELECT * FROM logs_table WHERE log_date = '2020-05-05'

Earlier we had a basic data flow: scan all the data first, then filter. As an optimization, many databases introduced filter pushdown: filter first, then scan the table. We use one more optimization, static pruning: data is stored in partition files in a multi-columnar data format, so the query first filters on the partition column and then scans only those partitions. We also use table denormalization:
SELECT * FROM logs_table JOIN dates ...
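As a rough illustration of how dynamic partition pruning kicks in when a partitioned fact table is joined to a filtered dimension, here is a hypothetical PySpark sketch. The table names mirror the example above, but since the original query is truncated, the dates dimension and its is_holiday column are my assumptions.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dpp-demo")
         # On by default in Spark 3.0; set explicitly here only for clarity.
         .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
         .getOrCreate())

# Assumes logs_table is partitioned by log_date and dates is a small dimension
# table with a log_date column and (hypothetically) an is_holiday flag.
result = spark.sql("""
    SELECT l.*
    FROM logs_table l
    JOIN dates d ON l.log_date = d.log_date
    WHERE d.is_holiday = true
""")

# The filter sits on the dimension, but at runtime Spark derives the matching
# log_date values and scans only those partitions of logs_table; the plan shows
# a dynamic pruning expression in the partition filters.
result.explain()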

Difference between repartition and coalesce in Databricks

Spark is best known for in-memory computation with a distributed execution mechanism. We can create partitions on large files or tables so that we don't need to scan all the data. For example, if we have a table with data for multiple countries and only need to fetch data for a specific country, we can partition on the country column and Spark will split the table into one data file per country. Sometimes we need to repartition a table to adjust the number of partitions. Now let's discuss the difference between repartition and coalesce:

Repartition:
1. Repartition is used to increase or decrease the number of partitions.
2. Repartition shuffles the data across partitions, and this data movement is expensive.
3. It is slower in performance.
4. It shuffles data from multiple partitions and creates newly defined partitions of equal size.

Coalesce:
1. It is used only to decrease the number of partitions.
2. It doesn't shuffle the data across partitions.
3. Performance is really good.
4. It doe...
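A small PySpark sketch of both operations; the DataFrame and partition counts are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()

df = spark.range(1_000_000)          # sample DataFrame
print(df.rdd.getNumPartitions())     # partitions Spark created by default

# repartition: full shuffle, can increase or decrease, produces evenly sized partitions
df_up = df.repartition(200)
df_by_col = df.repartition("id")     # can also repartition by a column, e.g. country

# coalesce: merges existing partitions without a shuffle, can only decrease
df_down = df.coalesce(10)

print(df_up.rdd.getNumPartitions(), df_down.rdd.getNumPartitions())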

How to scale up and scale down an Azure SQL database using Azure Data Factory?

Sometimes we need to scale an Azure SQL database up or down from code in our ADF pipeline. Most of the time we handle it manually, but what if the ADF pipeline is started by a schedule trigger and we need to scale the database up and down around the start and finish of the pipeline's copy activity? Here is the solution.

Create a proc to select and update the current DTU (service objective) of the Azure SQL database:

Create Proc SelectUpdate
(
    @CurrentDTU varchar(10) output,
    @DTUtoUpdate varchar(10)   -- e.g. S12 - 3000 DTUs
)
as
begin
    -- Read the current service objective of this database.
    SELECT @CurrentDTU = slo.service_objective
    FROM sys.databases d
    JOIN sys.database_service_objectives slo ON d.database_id = slo.database_id
    WHERE d.name = DB_NAME();

    -- SERVICE_OBJECTIVE does not accept a variable, so build the statement dynamically.
    DECLARE @sql nvarchar(max) =
        N'ALTER DATABASE [TestAzureDB] MODIFY (SERVICE_OBJECTIVE = ''' + @DTUtoUpdate + N''')';
    EXEC sp_executesql @sql;
end

ADF pipeline steps:
1. Call the proc (@cDTU output, 'S12') to read the current DTU and then update it. Call it from a Lookup activity, which will set the returned value into the @cDTU variable.
2. Use a Wait activity, because the database takes about a minute to reflect the change.
3. Copy data activity.
4. Call the same proc again by passi...
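Instead of (or in addition to) the fixed Wait in step 2, you can poll until the new service objective is actually in effect. Below is a hypothetical Python sketch using pyodbc and the built-in DATABASEPROPERTYEX function; the connection details and timings are placeholders, not part of the original pipeline.

import time
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=tcp:<your-server>.database.windows.net,1433;"
    "DATABASE=TestAzureDB;UID=<user>;PWD=<password>"
)

def wait_for_service_objective(target, poll_seconds=15, timeout_seconds=600):
    # Poll the database until DATABASEPROPERTYEX reports the requested tier.
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            with pyodbc.connect(CONN_STR, timeout=30) as conn:
                current = conn.execute(
                    "SELECT DATABASEPROPERTYEX(DB_NAME(), 'ServiceObjective')"
                ).fetchone()[0]
            if current == target:
                return
        except pyodbc.Error:
            pass   # connectivity drops briefly during the switch-over
        time.sleep(poll_seconds)
    raise TimeoutError(f"Database did not reach {target} in time")

wait_for_service_objective("S12")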

How to migrate Azure SQL tables to Databricks dynamically?

During an Azure SQL to Databricks migration, I was trying to find a way to migrate multiple tables across multiple databases, as defined (TableName, DBName) in a configuration table. The configuration table looks like this:

DBname   SchemaName   SqlTableName   SqlServerName   DatabricksDBName
DB1      dbo          t1             serverUrl1      Cp_DB1
DB1      dbo          t2             Serverurl2      Cp_DB1
DB2      dbo          t1             serverurl3      Cp_DB2
DB3      dbo          t2             serverUrl1      Cp_DB2

So I need to migrate these tables by picking data from the corresponding database and server, and load it into the Databricks bronze layer under the expected Databricks database name.

Solution
I created an ADF pipeline to loop through this table with a Lookup activity + ForEach activity, and passed the server name, table name, database name, user, and password (from Key Vault) as parameters to a Databricks notebook inside the pipeline.
ADF pipeline -> Lookup activity -> ForEach activity -> scale up the corresponding DB -> Notebook activity -> scale down the corresponding DB.
Databricks generic notebook - % python...
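The notebook itself is cut off in the excerpt, so here is a hypothetical sketch of what such a generic, parameterized Databricks notebook could look like in Python; the widget names, JDBC URL format, and bronze table naming are my assumptions.

# Databricks notebook (Python): parameters arrive from the ADF Notebook activity.
dbutils.widgets.text("serverName", "")
dbutils.widgets.text("dbName", "")
dbutils.widgets.text("schemaName", "dbo")
dbutils.widgets.text("tableName", "")
dbutils.widgets.text("databricksDbName", "")
dbutils.widgets.text("user", "")
dbutils.widgets.text("password", "")   # passed from Key Vault via the pipeline

server    = dbutils.widgets.get("serverName")
db        = dbutils.widgets.get("dbName")
schema    = dbutils.widgets.get("schemaName")
table     = dbutils.widgets.get("tableName")
bronze_db = dbutils.widgets.get("databricksDbName")
user      = dbutils.widgets.get("user")
password  = dbutils.widgets.get("password")

jdbc_url = f"jdbc:sqlserver://{server}:1433;database={db}"

# Read the source table from Azure SQL over JDBC.
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", f"{schema}.{table}")
      .option("user", user)
      .option("password", password)
      .load())

# Land it in the bronze layer as a Delta table in the target Databricks database.
df.write.mode("overwrite").format("delta").saveAsTable(f"{bronze_db}.{table}")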

Why do we have two keys in a storage account, and why do we need to rotate them?

On more than one occasion, developers do not see the point of having two keys for an Azure Storage account when they always use the first one in their developments. The truth is that I have found that most of the time there is no key maintenance in place to improve the security of the storage account. The idea with this pair of keys is to always have a «backup key» when you regenerate one of them, so that there is no loss of service at any time. There might be a case where you find that key1 has been compromised; you can then rotate the keys in your applications, and they will keep working. The rotation process in that case could be the following:
1. The first key is used in your applications.
2. When it's time to regenerate the keys, replace the first key with the second key in the settings of the applications that use it.
3. Regenerate the primary key, and again replace the secondary key in your applications with the newly generated one.
4. Regenerate the second key.
This process can easily be aut...
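The rotation can indeed be automated. Below is a hypothetical Python sketch using the azure-mgmt-storage and azure-identity packages; the subscription ID, resource group, and account name are placeholders, and the exact method signatures may vary by SDK version.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
ACCOUNT_NAME = "<storage-account>"

client = StorageManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# List both keys; applications should already be switched to key2 before key1 is regenerated.
keys = client.storage_accounts.list_keys(RESOURCE_GROUP, ACCOUNT_NAME)
for key in keys.keys:
    print(key.key_name)

# Regenerate the first (now unused) key; later, repeat the process for key2.
client.storage_accounts.regenerate_key(RESOURCE_GROUP, ACCOUNT_NAME, {"key_name": "key1"})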