Posts

Showing posts from November, 2022

Performance optimization in Copy data activity in Azure

This article outlines the copy activity performance optimization features that you can leverage in Azure Data Factory and Synapse pipelines. Copy performance optimization features The service provides the following performance optimization features: Database transaction units (DTUs), Data Integration Units, self-hosted integration runtime scalability, parallel copy, and staged copy. Database transaction units (DTUs) A database transaction unit (DTU) represents a blended measure of CPU, memory, reads, and writes. Service tiers in the DTU-based purchasing model are differentiated by a range of compute sizes with a fixed amount of included storage, a fixed retention period for backups, and a fixed price. All service tiers in the DTU-based purchasing model provide the flexibility to change compute sizes with minimal downtime; however, there is a switchover period where connectivity to the database is lost for a short time, which can be mitigated using retry logic. Single...
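As a rough illustration (not settings taken from the post itself), the Data Integration Units, parallel copy and staged copy options live in the copy activity's typeProperties. The fragment below, written as a Python dict, is a minimal sketch of what such a configuration might look like; the linked service name and staging path are placeholders.

    # Hedged sketch: a copy activity typeProperties fragment expressed as a Python dict.
    # Property names (dataIntegrationUnits, parallelCopies, enableStaging, stagingSettings)
    # follow the ADF copy activity JSON; the linked service name and path are placeholders.
    copy_type_properties = {
        "source": {"type": "AzureSqlSource"},
        "sink": {"type": "AzureSqlSink"},
        "dataIntegrationUnits": 8,      # DIUs: scale up the serverless copy engine
        "parallelCopies": 4,            # parallel copy: concurrent reads/writes per run
        "enableStaging": True,          # staged copy through an interim Blob store
        "stagingSettings": {
            "linkedServiceName": {
                "referenceName": "StagingBlobLinkedService",   # placeholder name
                "type": "LinkedServiceReference"
            },
            "path": "staging-container"  # placeholder container
        }
    }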

Azure Blob Storage interview questions

Azure Blob Storage is a cloud-based storage service that is often used to store and process large amounts of data, so it is important to be prepared to answer questions about this service. 1. What is Azure Blob Storage? Azure Blob Storage is a cloud-based storage service that is optimized for storing large amounts of unstructured data, such as video, audio, and images. Blob storage is also a good choice for storing data that needs to be accessed frequently but doesn't need to be updated often. 2. How does Azure Blob Storage work? Azure Blob Storage is a cloud-based storage service that allows you to save and retrieve data from anywhere in the world. Blobs are stored in containers, which can be public or private, and can be accessed via a URL. Azure Blob Storage is a great way to store large amounts of data that you need to be able to access quickly and easily. 3. What are the key components of Azure Blob Storage? The key components of Azure Blob Storage are containers and...
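Since the questions keep coming back to blobs sitting inside containers and being reached over a URL, here is a minimal sketch (not from the original post) using the azure-storage-blob Python SDK; the connection string, container and blob names are placeholders.

    # Minimal sketch with the azure-storage-blob SDK (v12); names below are placeholders.
    from azure.storage.blob import BlobServiceClient

    # Connect to the storage account (connection string from the portal / Key Vault).
    service = BlobServiceClient.from_connection_string("<storage-connection-string>")

    # Blobs live inside containers; create one (or reuse an existing one).
    container = service.get_container_client("demo-container")
    if not container.exists():
        container.create_container()

    # Upload a file as a blob, then read it back.
    with open("report.csv", "rb") as data:
        container.upload_blob(name="reports/report.csv", data=data, overwrite=True)

    downloaded = container.download_blob("reports/report.csv").readall()
    print(len(downloaded), "bytes downloaded")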

Dynamic Partition pruning in Spark 3.0

We often see people saying: SparkSQL is very slow after I integrate it with BI tools. Why do joins behave weirdly and slowly? I hate Spark because I have to learn how memory is used, and I'm confused because I cannot decide how many executors, executor cores and how much executor memory to use, or how to define shuffles and partitioning during map/reduce. A little bit of relief: Spark 3.0 introduces Dynamic Partition Pruning. Assume we want to query logs_table and add a filter: SELECT * FROM logs_table WHERE log_date = '2020-05-05'. Earlier we had the basic data flow: scan all data first -> then filter. As an optimization, many databases introduced filter push-down: filter first -> then scan the table. We also use one more optimization, static pruning: data is stored in partitioned files in a columnar data format, so the query first filters (on the partition column) -> then scans only those partitions. We also use table de-normalization: SELECT * FROM logs_table JOIN dates ...
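To make the dimension-filter case concrete, here is a small PySpark sketch (table and column names are illustrative, not taken from the post) of a de-normalized join where dynamic partition pruning can kick in on Spark 3.0:

    # Illustrative PySpark sketch; logs_table / dates and their columns are assumed names.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dpp-demo").getOrCreate()

    # DPP is enabled by default in Spark 3.0; shown here only for visibility.
    spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

    # Fact table partitioned on log_date, joined to a small dimension table.
    logs = spark.table("logs_table")      # partitioned by log_date
    dates = spark.table("dates")          # dimension: date_id, holiday flag, ...

    # The filter sits on the dimension; DPP pushes the matching log_date values
    # down to the scan of logs_table, so only the needed partitions are read.
    result = (logs.join(dates, logs.log_date == dates.date_id)
                  .where(dates.holiday == True))

    result.explain()   # look for "dynamicpruningexpression" in the physical plan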

Difference between repartition and coalesce in Databricks

Spark is known for in-memory computation with a distributed mechanism. We can create partitions on large files or tables so that we don't need to scan all the data. For example, if we have a table with data for multiple countries and need to fetch data for a specific country only, we can partition on the country column and Spark will split the table into a separate data file per country. Sometimes we need to repartition a table to adjust the number of partitions. Now let's discuss the difference between repartition and coalesce: Repartition: 1. Repartition is used to increase or decrease the number of partitions. 2. Repartition shuffles the data across partitions, and this data movement is expensive. 3. It is slower. 4. It shuffles data from multiple partitions and creates equally sized, newly defined partitions. Coalesce: 1. It is used only to decrease the number of partitions. 2. It doesn't shuffle the data across partitions. 3. Performance is really good. 4. It doe...
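A quick PySpark sketch of the difference (the DataFrame here is just synthetic demo data, not from the post):

    # Synthetic demo of repartition vs coalesce in PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()

    df = spark.range(0, 1_000_000)          # demo DataFrame
    print(df.rdd.getNumPartitions())        # e.g. 8, depending on the cluster

    # repartition: full shuffle, can increase or decrease, gives evenly sized partitions
    wider = df.repartition(16)
    narrow = df.repartition(4)

    # repartition can also hash-partition by a column, e.g. country:
    # by_country = df.repartition("country")

    # coalesce: only decreases, merges existing partitions without a full shuffle,
    # so it is cheaper but the resulting partitions may be uneven
    fewer = df.coalesce(4)

    print(wider.rdd.getNumPartitions(), narrow.rdd.getNumPartitions(), fewer.rdd.getNumPartitions())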

How to scale up and scale down an Azure SQL database using Azure Data Factory?

Sometimes we face the problem of scaling an Azure SQL DB up/down from code in our ADF pipeline. We handle it manually most of the time, but what if our ADF pipeline is triggered by a schedule trigger and we need to scale the DB up/down as the pipeline's copy activity starts/finishes? Here is the solution. Create a proc to select and update the current DTU (service objective) of the Azure SQL DB. Create Proc SelectUpdate (@CurrentDtu varchar(10) output, @DTUtoUpdate varchar(10)) -- S12 = 3000 DTUs as begin SELECT @CurrentDtu = slo.service_objective FROM sys.databases d JOIN sys.database_service_objectives slo ON d.database_id = slo.database_id ALTER DATABASE [TestAzureDB] MODIFY (SERVICE_OBJECTIVE = @DTUtoUpdate) end ADF pipeline steps: 1 - call the proc (@cDTU output, 'S12') to select the current DTU first and then update the DTU --- call this from a lookup activity and it will set the value into the @cDTU variable; 2 - use a wait activity, because the DB takes about 1 minute to reflect the change; 3 - copy data activity; 4 - call the same proc again by passi...

How to migrate Azure SQL tables to Databricks dynamically?

During an Azure SQL to Databricks migration, I was trying to find a way to migrate multiple tables from multiple databases, as defined (TableName, DBName) in a configuration table. The configuration table structure and data look like:

DBname  SchemaName  SqlTableName  SqlServerName  DatabricksDBName
DB1     dbo         t1            serverUrl1     Cp_DB1
DB1     dbo         t2            Serverurl2     Cp_DB1
DB2     dbo         t1            serverurl3     Cp_DB2
DB3     dbo         t2            serverUrl1     Cp_DB2

So I need to migrate these tables by picking data from the corresponding databases and servers, and load it into the Databricks bronze layer under the expected Databricks DB name. Solution: I created an ADF pipeline to loop through this table with a lookup activity + foreach activity, and passed the server name, table name, database name, user and password (from Key Vault) as parameters to a Databricks notebook inside the pipeline. ADF pipeline -> lookup activity -> foreach activity -> scale up corresponding DB -> notebook activity -> scale down corresponding DB. Databricks generic notebook - % python...
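The excerpt cuts off at the generic notebook, but a minimal sketch of what such a parameterized notebook could look like is below. The widget names, the JDBC options and the bronze-layer table naming are assumptions for illustration, not the post's exact code.

    # Hedged sketch of a generic Databricks notebook; widget names, JDBC options
    # and the bronze table naming below are assumptions for illustration.
    server = dbutils.widgets.get("SqlServerName")
    database = dbutils.widgets.get("DBname")
    schema = dbutils.widgets.get("SchemaName")
    table = dbutils.widgets.get("SqlTableName")
    target_db = dbutils.widgets.get("DatabricksDBName")
    user = dbutils.widgets.get("User")          # passed from Key Vault via ADF
    password = dbutils.widgets.get("Password")  # passed from Key Vault via ADF

    # Read the source table from Azure SQL over JDBC.
    source_df = (spark.read.format("jdbc")
                 .option("url", f"jdbc:sqlserver://{server};database={database}")
                 .option("dbtable", f"{schema}.{table}")
                 .option("user", user)
                 .option("password", password)
                 .load())

    # Land it in the Databricks bronze layer as a Delta table.
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {target_db}")
    (source_df.write.format("delta")
              .mode("overwrite")
              .saveAsTable(f"{target_db}.{table}"))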

Why do we have two keys in a storage account, and why do we need to rotate them?

On more than one occasion, developers do not see the point of having two keys for an Azure Storage account when they always use the first one in their developments. The truth is that I have found that, most of the time, there is no key maintenance to improve the security of the storage account. The idea with this pair of keys is to always have a «backup key» when you regenerate one of them, so that there is no loss of service at any time. There might be a case where you find that key1 has been compromised; you can then rotate the keys in your applications, and your application will keep working while you do so. The rotation process in this case could be the following: the first key is used in your applications; when it's time to regenerate the keys, switch the application settings that use the first key over to the second key; regenerate the first key and switch those settings again, this time to the newly generated first key; finally, regenerate the second key. This process can easily be aut...
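The excerpt cuts off at "automated", but a rough sketch of what that automation might look like with the azure-identity and azure-mgmt-storage Python SDKs is below. The resource group, account name and credential handling are placeholders, and the exact SDK call shapes should be checked against the library version you use.

    # Hedged sketch of automated key rotation with azure-identity / azure-mgmt-storage.
    # Resource group, account name and the order of steps mirror the procedure above;
    # treat the exact SDK signatures as assumptions to verify against your SDK version.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient

    credential = DefaultAzureCredential()
    client = StorageManagementClient(credential, "<subscription-id>")

    RESOURCE_GROUP = "<resource-group>"   # placeholder
    ACCOUNT = "<storage-account>"         # placeholder

    def rotate(key_name: str) -> str:
        """Regenerate one key and return its new value."""
        result = client.storage_accounts.regenerate_key(
            RESOURCE_GROUP, ACCOUNT, {"key_name": key_name}
        )
        return next(k.value for k in result.keys if k.key_name == key_name)

    # 1. Applications currently use key1; point them at key2 first (the app config
    #    update is environment-specific and not shown here).
    # 2. Regenerate key1, then point applications at the fresh key1.
    new_key1 = rotate("key1")
    # 3. Finally regenerate key2 so both keys are fresh.
    new_key2 = rotate("key2")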

What is soft delete in a storage account or delta lake?

Soft delete in a storage account Enabling this configuration for Azure Storage ensures that even if blobs/data were deleted from the storage account, the blobs/data objects remain recoverable for a particular time, which is set in the retention policy (retention can be 7 days to 365 days). Solution, from the Azure console: go to the storage account; for each storage account, navigate to Data Protection; enable soft delete and enter the number of days you want to retain soft-deleted data. Using the Azure Command-Line Interface 2.0, update the retention days in the command below: az storage blob service-properties delete-policy update --days-retained <RetentionDaysValue> --account-name <StorageAccountName> --enable true Default value: when a new storage account is created, soft delete is disabled by default.
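Besides the portal and the CLI, the same policy can also be set from Python. The small sketch below uses the azure-storage-blob SDK; the connection string is a placeholder and 7 days is only an example value.

    # Hedged sketch: enabling blob soft delete via the azure-storage-blob SDK.
    # The connection string is a placeholder; 7 days is only an example value.
    from azure.storage.blob import BlobServiceClient, RetentionPolicy

    service = BlobServiceClient.from_connection_string("<storage-connection-string>")

    # Turn on the delete retention (soft delete) policy for the blob service.
    service.set_service_properties(
        delete_retention_policy=RetentionPolicy(enabled=True, days=7)
    )

    # Read the setting back to confirm.
    props = service.get_service_properties()
    print(props["delete_retention_policy"])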

What are the benefits of Azure Storage?

Azure Storage is a great way to store your structured (CSV, text), semi-structured (JSON, XML) and unstructured (audio, video) data in the cloud, with the benefits below.... 1. Durable and highly available: Azure provides durability by letting us copy data to multiple locations across zones, geographies, etc., so we can keep multiple copies of the data, which helps with disaster recovery. 2. Secure: whatever data you store in the cloud is stored in encrypted form, and even when you retrieve it, it travels over the network only in encrypted form. 3. Scalable: Azure Storage is highly scalable; you can increase or reduce the amount of data or file size as needed, and you are charged on the basis of data size only. So if you want to store 100 GB of data you can store it in Azure, and if after a few months you only need to store 10 GB, Azure will scale it down accordingly. 4. Managed service: we don't have to worry about performance, patching or maintenance, because it is managed by Azure. 5. Ac...

What is Azure Storage account?

An Azure storage account contains all of your Azure Storage data objects, including blobs, file shares, queues, tables, and disks. The storage account provides a unique namespace for your Azure Storage data that's accessible from anywhere in the world over HTTP or HTTPS. We can store structured (CSV, text), semi-structured (JSON, XML) and unstructured (audio, video) data inside the storage account. Types of storage accounts There are basically two types of storage account, Blob storage and ADLS (Azure Data Lake Storage). Blob Storage A blob is a file, good for storage and capable of storing 4.77 TB per file. It all began with Blob Storage in Microsoft Azure. BLOB is an acronym and means Binary Large OBject. Or, in plain English, the unstructured files, such as images, video, music files, backup files, etc. There are three different ways to store blobs in Microsoft Azure: Block blob Good for file storage, capable of 4.77 TB per file. When you store a file as a block blob, it arrives to...

Introduction

Hello guys, this is my first blog and I will cover everything related to Azure data engineering, whether it's Azure Data Factory, Azure Delta Lake, Databricks, Key Vault, Logic Apps, Synapse, SQL Server and many more. Thank you