Archive for the ‘Microsoft Azure’ Category

SQL Server on Linux – Best practices post installation

August 30, 2018 1 comment


Setup SQL Server on Linux:

1. Spinning up a new Linux VM on Microsoft Azure

2. Install and Configure SQL Server 2017 on Linux Azure VM

3. Connect SQL Server on Linux with SSMS from a Windows machine

Best Practices:

Here are some of the best practices post installing SQL Server on Linux that can help you maximize database performance:

1. To maintain efficient Linux and SQL Scheduling behavior, it’s recommended to use the ALTER SERVER CONFIGURATION command to set PROCESS AFFINITY for all the NUMANODEs and/or CPUs. [Setting Process Affinity]

2. To reduce the risk of tempdb concurrency slowdowns in high performance environments, configure multiple tempdb files by adding additional tempdb files by using the ADD FILE command. [tempdb Contention]

3. Use mssql-conf to configure the memory limit and ensure there’s enough free physical memory for the Linux operating system.

sudo /opt/mssql/bin/mssql-conf set memory.memorylimitmb 1024
sudo systemctl restart mssql-server

4. On multi-node Non-Uniform Memory Access (NUMA) installations, auto NUMA balancing needs to be disabled to allow SQL Server to operate at maximum efficiency on a NUMA system.

sysctl -w kernel.numa_balancing=0

5. You can also change the kernel settings value for virtual address space to 256K, as the default value of 65K may be insufficient for a SQL Server installation.

sysctl -w vm.max_map_count=262144

6. Use the noatime attribute to disable last accessed timestamps with any file system that is used to store SQL Server data and log files.

7. For the most consistent performance experience, you must leave the Transparent Huge Pages (THP) option enabled.

8. Virtual machine (VM) features like Hyper-V Dynamic Memory shouldn’t be used with SQL Server installations. When using VMs, be sure to assign sufficient fixed-memory sizes.

9. Make sure you have a properly configured swapfile to avoid any out of memory issues.


Azure Databricks learning resources (documentation and videos)

August 7, 2018 1 comment


Databricks Introduction

What is Azure Databricks [Video]

Create Databricks workspace with Apache Spark cluster

Extract, Transform & Load (ETL) with Databricks

– Documentation:
   – Azure
   – Databricks

From Channel 9

1. Data Science using Azure Databricks and Apache Spark [Video]

2. Data ingestion, stream processing and sentiment analysis using Twitter [Video]

3. ETL with Azure Databricks using ADF [Video]

4. ADF new features & integration with Azure Databricks [Video]

5. Azure Databricks introduces R Studio Integration [Video]

6. Run Jars and Python scripts on Azure Databricks using ADF [Video]

From Microsoft Build Conf


Azure Databricks (a fully managed Apache Spark offering)

July 28, 2018 Leave a comment


Databricks Introduction:

Azure Databricks = Best of Databricks + Best of Azure

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform (PaaS).

It is a fast, easy-to-use, and collaborative Apache Spark–based analytics platform. Designed in collaboration with the creators of Apache Spark, it combines the best of Databricks and Azure to help you accelerate innovation with one-click set up, streamlined workflows, and an interactive workspace that enables collaboration among data scientists, data engineers, and business analysts. Because it’s an Azure service, you benefit from native integrations with other Azure services such as Power BI, SQL Data Warehouse, and Cosmos DB. You also get enterprise-grade Azure security, including Active Directory integration, compliance, and enterprise-grade SLAs.

–> With Databricks you can:
– Launch your new Spark environment with a single click.
– Integrate effortlessly with a wide variety of data stores.
– Use Databricks Notebooks to unify your processes and instantly deploy to production.
– Improve and scale your analytics with a high-performance processing engine optimized for the comprehensive, trusted Azure platform.

Learning Resources:

    Webinar recording on Azure Databricks

    My next blog in series

Azure SQL Data Sync – keep your data in sync between Azure Hybrid environment SQL Server and SQL DB

July 12, 2018 Leave a comment

SQL Data Sync is a service built on Azure SQL Database that lets you synchronize the data you select bi-directionally across multiple SQL databases and SQL Server instances.

Internal Mechanism & Performance impact:

Data Sync uses insert, update, and delete triggers to track changes. It creates side tables in the user database for change tracking. These change tracking activities have an impact on your database workload. Assess your service tier and upgrade if needed.

Since Data Sync is trigger-based, transactional consistency is not guaranteed. Microsoft guarantees that all changes are made eventually, and that Data Sync does not cause data loss.

Limitations & Requirements:

1. Each table must have a primary key
2. Snapshot isolation must be enabled
3. A table cannot have an identity column that is not the primary key
4. The names of objects (databases, tables, and columns) cannot contain the printable characters period (.), left square bracket ([), or right square bracket (]).
5. Unsupported data types
    a. FileStream
    b. SQL/CLR UDT
    c. XMLSchemaCollection (XML supported)
    d. Cursor, Timestamp, Hierarchyid
6. Azure Active Directory authentication is not supported.

Maximum number of sync groups any database can belong to 5
Maximum number of endpoints in a single sync group 30
Maximum number of on-premises endpoints in a single sync group 5
Database, table, schema, and column names 50 characters per name
Tables in a sync group 500
Columns in a table in a sync group 1000
Data row size on a table 24 Mb
Minimum sync interval 5 Minutes



1. Frequency can be set in seconds, minutes, hours and days (min 5 minutes)
2. You can choose and select desired tables and columns to sync
3. For on premises database you must configure a local DMG agent

Setup Azure SQL Data Sync:

Check here for Step-by-Step tutorial.

1. Create a Hub database on SQL DB (Hub & Spoke topology)
    a. Hub database: must be Azure SQL DB
    b. Spoke/Member database: rest of databases are either Azure SQL DB, on SQL Server instance

2. Create Sync group (On the Hub Database create a “New Sync Group”)
    a. Sync Schema: which data is being synchronized
    a. Sync metadata database: must be an Azure SQL DB
    b. Sync Interval: frequency
    c. Conflict Resolution Policy: (Hub wins or Member wins)

3. Add Sync Members (Spokes, can be either SQL DB or SQL instance)
    a. Sync Agent Gateway (for on-prem): download and install on on-premise server
    b. Sync Direction: bi-directional, or one direction

4. Configure Sync group
    a. Select the Tables/Columns which you want to sync.

Powershell script to:

Sync SQL DB & on-prem SQL Server instance
Sync between multiple SQL Databases

Best Practices:




Data Sync is not appropriate for the following scenarios:
1. Disaster Recovery
2. Read Scale
4. Migration from on-premises SQL Server to Azure SQL Database

Further reading:

An introduction to Azure HDInsight – Microsoft’s Big-Data/Hadoop solution on Azure

March 3, 2018 1 comment

The Microsoft Azure portal has all the details on HDInsight and is very vast. Here in this post I’ve simply curated main and important stuff for myself and others to get started with HDInsight.

Azure HDInsight is a standard Apache Hadoop distribution offered as a managed service on Microsoft Azure. It is based on the Hortonworks Data Platform (HDP) and provisioned as clusters on Azure. The clusters can be created on your choice of Windows or Linux Servers.

What HDInsight offers:

1. Provides an end-to-end SLA on all your production workloads.
2. Enables you to scale workloads up or down anytime and only pay for what you use.
3. Protects and Secure your data as per government compliance.
4. Provide Log Analytics to monitor your clusters.
5. Globally availability in multiple regions.
6. Provides various productivity tools for development.

HDInsight enables a broad range of scenarios such as: Process & Analyze Big-Data, Batch Processing, in-memory processing ETL, Data Warehousing, Machine Learning, IoT and more, by using a broad spectrum of open-source frameworks, like Hadoop, Spark, Kafka, HBase, Hive, Storm and R Server.


HDInsight Cluster Types:

1. Hadoop: A simple Map-Reduce programming model to process and analyze batch data in parallel. [Apache Hadoop]

2. Spark: An open-source, parallel-processing framework that supports in-memory processing to boost the performance of big-data analysis applications. [Apache Spark]

3. HBase: A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data. [Apache HBase]

4. R Server: A server for hosting and managing parallel, distributed R processes. It provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight.

5. Storm: A distributed, real-time computation system for processing large streams of data fast. [Apache Storm]

6. Hive: or Interactive Query (AKA: Live Long and Process), In-memory caching for interactive and faster Hive queries. [Apache Hive]

7. Kafka: An open-source platform that’s used for building streaming data pipelines and applications. Kafka also provides message-queue functionality that allows you to publish and subscribe to data streams. [Apache Kafka]

Other Components available with HDInsight:

Ambari Avro HCatalog
Mahout MapReduce YARN
Phoenix Pig Sqoop
Tez Oozie ZooKeeper

Storage options in HDInsight:

1. Azure Blob Store

2. Azure Data Lake Store

Azure Data Lake Store vs Azure Blob Storage

Role Based Security:

Owner Lets you manage everything
Contributor Lets you manage everything except access to resources
Reader Lets you view everything but not make changes
User Access Administrator Lets you manage user access to Azure resources


HDInsight security:

[Overview and more details on Microsoft Docs]