
SQL Server 2019 released, awesome new features – download now !!!

November 4, 2019

Today, on 4th November 2019, Microsoft announced the release of the new version of SQL Server, i.e. SQL Server 2019, at its MSIgnite2019 event.

New stuff in SQL Server 2019 is all about Big Data Clusters for SQL Server, which will allow you to:
– Deploy scalable clusters of SQL Server, Spark, HDFS on Kubernetes
– Read, write, and process big data from Transact-SQL or Spark
– With PolyBase, query data from external SQL Server, Oracle, Teradata, MongoDB, and ODBC data sources using external tables
– and many more, we will check below…

–> Download SQL Server (evaluation version):

To download SQL Server 2019 you can Register and Download the Full or Evaluation version (180 days) here.

Or you can directly download the installer SQL2019-SSEI-Eval.exe

–> Free Developer Version:

Back in March 2016 Microsoft announced that, going forward, the Developer version of any SQL Server release will be free for development and learning purposes. Register and Download the Developer version.

This Developer version is meant for development and testing only, and not for production environments or for use with production data. For more info please check my previous blog post.

–> Download SSMS (separate install):

Starting with SQL Server 2016, Microsoft decoupled SSMS from SQL Server setup, and it is now available as a separate installer. This supports the move to a universal version of SSMS for both on-premises SQL Server and Azure SQL Database, which will ship every month or so. The SSMS setup is available separately as a free download.

–> Download SSRS (Reporting Services):

Just like SSMS, SSRS is now also available as a separate install from the Microsoft Download Center.

–> Check new features in SQL Server 2019:

1. Big data clusters with scalable compute and storage composed of SQL Server, Spark, and HDFS. It provides key elements of a data lake – Hadoop Distributed File System (HDFS), Spark, and analytics tools deeply integrated with SQL Server. [more info]

2. A complete Data Science & AI platform to train and operationalize models in SQL Server ML Services or Spark ML using Azure Data Studio notebooks.

3. Data virtualization allows queries across relational and non-relational data without movement or replication. PolyBase enables you to run a T-SQL query inside SQL Server to pull data from Hadoop and return it in a structured format, all without moving or copying the data.

4. Intelligent Query Processing improves scaling of queries and Automatic Plan Correction resolves performance problems. [more info]
  – Table variable deferred compilation
  – Batch mode on row store
  – T-SQL Scalar UDF inlining
  – Approximate QP (Approximate COUNT DISTINCT)
  – Memory grant feedback, row mode

5. In-memory improvements for analytics on operational data using HTAP. Higher concurrency and scale through persistent memory (PMEM). [more info]
– Hybrid buffer pool
– Memory-optimized TempDB
– In-Memory OLTP support for Database Snapshots

6. Greater uptime with more online indexing operations. [more info]

7. Data Discovery & Classification labeling for GDPR and Vulnerability Assessment tool to track compliance.

8. Support for your choice of Windows, Linux, and containers. SQL Server on Linux for Docker Engine [link].

9. High Availability with five synchronous replica pairs, secondary-to-primary replica connection redirection, and Always On availability groups running on containers using Kubernetes. [more info]

10. Accelerated Database Recovery [more info]

11. SQL Graph enhancements, support of MATCH predicates in a MERGE statement, SHORTEST_PATH inside MATCH, and support for derived tables or view aliases in graph match query.

Check all the above and many more new features of SQL Server 2019 in MSDN Blogs.

Are You Prepared for Disaster? Evaluating Cloud Backup Solutions by AWS vs. Azure vs. Google Cloud

February 12, 2019


The adoption of public cloud computing services shows no signs of slowing down — Gartner predicted that the global public cloud services market will grow 17.3 percent in 2019 alone.

A huge draw of the public cloud is the use of its low-cost storage services for cloud backup purposes. Businesses can securely back up their data straight from an on-premises data center to the public cloud for disaster preparation. Such disasters, whether caused by natural factors or simple human error, can lead to the loss of data that is essential for business continuity.

Cloud backup is cost-effective, provides anytime, anywhere data access via an Internet connection, and stores data in an off-site location for data center redundancy. This article goes into detail on the cloud backup solutions offered by three major public cloud providers: AWS, Microsoft Azure, and Google Cloud. You’ll be able to compare pricing, features, and the level of support you get from the three service providers.

Main Cloud Backup Solutions/Features

Azure Backup is Microsoft Azure’s dedicated cloud-based backup solution. AWS has Amazon S3 (Simple Storage Service) and Amazon Glacier as its main storage services for cloud backup. Google Cloud Storage provides enterprise-grade public cloud storage.

Amazon Web Services (AWS)

S3 is the main AWS service suited for cloud backup purposes and there are 20 geographic regions housing data centers around the world. The global AWS infrastructure helps businesses benefit from storing their data in the region closest to their main operational base for more rapid data transfer in the event of an outage.

Backing up data to S3 is as straightforward as creating a storage bucket and uploading the relevant files. You can set permissions for each data object, encrypt your data, and add metadata.

Glacier is a long-term, low-cost storage service with an Active Archive option, which enables you to retrieve your data within 5 minutes. High-performance block-level storage is available from the EBS service. S3 and Glacier are object storage services.

Another important service in the context of cloud backup is AWS Storage Gateway, which provides your on-premises applications with a low-latency connection to AWS cloud storage services like S3, EBS and Glacier.

A submarket has emerged in the area of AWS cloud backup in which third-party vendors attempt to simplify workloads, meet compliance demands, and reduce costs when using S3 for backup purposes. Examples of such services include N2WS AWS Backup and Cloudberry.

Microsoft Azure

Azure Backup can be used as a dedicated cloud-based backup service that entirely replaces an existing on-premises or off-site backup solution. Your data can be replicated either locally within a region or to a separate region, which Azure terms locally redundant storage (LRS) and geo-redundant storage (GRS) respectively.

Data encryption, application-consistent backups, long-term retention, and data compression are some of the features available in each of the four separate components you can choose from within Azure Backup.

Google Cloud

Google Cloud Storage provides durable cloud storage accessed through a unified API. The unified API enables businesses to integrate cloud backup into their apps.

Google promises millisecond latency from its Cloud Storage service, which is helpful for achieving the required recovery time objective (RTO) for swift disaster recovery.


Pricing

All three of these cloud backup providers operate a pay-per-use model in which the monthly cost depends primarily on the amount of data stored. Other factors that influence the price are the frequency at which you access data and the geographic region your data is stored in.


Amazon Web Services (AWS)

The AWS free usage tier entitles users to up to 5 GB of free storage in S3. Beyond that point, the cost per gigabyte depends on the geographic region, the quantity of storage used, and the frequency of access. The table below provides costs for the US East region.


Microsoft Azure

The price you pay to use Azure cloud backup varies depending on whether you choose to make the data locally redundant or geographically redundant, with the latter being more costly due to the additional peace of mind it provides. As with AWS, the cost also varies depending on the amount of data storage consumed. See the table of costs for the US East region below:

Note that for storage needs greater than 5,000 TB, you need to contact Azure for a custom quote. Costs may differ when backing up data in other Azure regions.

Google Cloud

With Google Cloud Storage you also get 5 GB of free usage. Beyond this point, the per-gigabyte cost stays the same regardless of the amount of data stored, which makes the pricing more straightforward but doesn’t reward businesses storing a lot of data with lower costs.
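To make the difference between volume-tiered and flat per-gigabyte pricing concrete, here is a minimal Python sketch. All rates and tier sizes in it are made-up placeholders, not real AWS, Azure, or Google prices, and the 5 GB free allowance is applied in a simplified way. The function names are hypothetical as well.

```python
FREE_GB = 5  # free allowance mentioned in the text (simplified)

# Hypothetical rates in USD per GB per month, NOT real provider prices.
# Each tier is (size of the tier in GB, rate); None means "everything above".
TIERED_RATES = [(50_000, 0.030), (450_000, 0.025), (None, 0.020)]
FLAT_RATE = 0.026


def tiered_monthly_cost(stored_gb, tiers=TIERED_RATES, free_gb=FREE_GB):
    """Volume-tiered pricing: each successive slice of data is billed at its own rate."""
    remaining = max(stored_gb - free_gb, 0)
    cost = 0.0
    for size, rate in tiers:
        # Bill only the part of the data that falls inside this tier.
        portion = remaining if size is None else min(remaining, size)
        cost += portion * rate
        remaining -= portion
        if remaining <= 0:
            break
    return cost


def flat_monthly_cost(stored_gb, rate=FLAT_RATE, free_gb=FREE_GB):
    """Flat pricing: the same per-GB rate no matter how much is stored."""
    return max(stored_gb - free_gb, 0) * rate


for gb in (105, 60_005, 1_000_005):
    print(gb, round(tiered_monthly_cost(gb), 2), round(flat_monthly_cost(gb), 2))
```

With these placeholder numbers the flat rate is cheaper for small volumes, while the tiered model becomes cheaper once most of the data falls into the lower-priced tiers. Where the crossover lies depends entirely on the actual rates.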

The cost varies depending on whether you want data stored regionally (better performance, lower latency) or multi-regionally (geo-redundancy). Costs also differ between data accessed more often (Nearline storage) and data accessed less often (Coldline storage). Below you’ll see the costs for the US East region.


Support

A crucial aspect to consider in the public cloud is the level of support available from your service provider. You need to factor in the potential for problems and technical issues arising with your cloud service usage, and how promptly the service provider can respond.

All three providers have paid support plans available. Each company tiers its support plans, with the premium plans providing the quickest response times to technical issues.

The AWS Enterprise plan promises 24/7 support and sub-15-minute response times for critical issues, but it costs $15,000 per month. Business plan users pay from $100 per month for 24/7 tech support and response times of less than one hour for critical issues.

Google’s Platinum support package provides similar benefits to the AWS Enterprise support plan but the cost is given by quote only. Google has a Gold support package which delivers a 1-hour response time for critical issues.

Lastly, Azure’s Professional Direct plan provides 24/7 technical support and sub-one-hour response times for $1,000 per month. The Standard plan costs $300, but the response time for critical issues increases to two hours.


Conclusion

Your choice of cloud backup solution depends on the particular provider that best meets your needs. All three offer similar levels of premium technical support. Google differs slightly in pricing in that it doesn’t alter its per-gigabyte cost as you store more data.

Azure Backup meets the needs of businesses looking for a dedicated cloud backup solution. AWS is more general-purpose and requires expert knowledge to minimize costs and maximize performance as a backup service, and third-party AWS backup services can help out with that. Google Cloud Storage also has a wider range of use cases than just backup.

Spark/Scala: Convert or flatten a JSON having Nested data with Struct/Array to columns (Question)

January 9, 2019

The following JSON contains some attributes at root level, like ProductNum and unitCount.
It also contains a Nested attribute with name “Properties”, which contains an array of Key-Value pairs.

Now, what I want is to expand this JSON, and have all the attributes in form of columns, with additional columns for all the Keys in Nested array section, like in the “Expected Output” section below:



Expected output, as described above:

| ProductNum | invoice_id | job_id | sku_id | unitCount |  
| 6000078    | 923659     | 296160 | 312002 | 3         |  



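import org.apache.spark.sql.functions.{explode, first}
import spark.implicits._

// The JSON document from the question goes between the triple quotes
val DS_Products = spark.createDataset("""{
}""" :: Nil)

// Parse the JSON strings into a DataFrame
val DF_Products = spark.read.json(DS_Products)

// One row per element of the Properties array
val df_flatten = DF_Products
  .select($"*", explode($"Properties") as "SubContent")
  .drop("Properties")

// Assumption: each Properties element is a struct with fields named "key" and "value"
val df_flatten_pivot = df_flatten
  .groupBy($"ProductNum", $"unitCount")
  .pivot("SubContent.key")
  .agg(first("SubContent.value"))

Output of df_flatten:

|ProductNum|UnitCount|          SubContent|
|   6000078|        3|[invoice_id, 923659]|
|   6000078|        3|    [job_id, 296160]|
|   6000078|        3|    [sku_id, 312002]|

Output of df_flatten_pivot:

|ProductNum|UnitCount|invoice_id|job_id|sku_id|
|   6000078|        3|    923659|296160|312002|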
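
The explode-then-pivot idea can also be traced in plain Python, without Spark. This is only a sketch: the sample JSON is reconstructed from the expected output above, the field names "key" and "value" inside Properties are assumptions, and the helper names `explode` and `pivot` are hypothetical stand-ins for the Spark operations.

```python
import json

# Sample document reconstructed from the expected output.
# The "key"/"value" field names are an assumption.
RAW = """{
  "ProductNum": "6000078",
  "unitCount": 3,
  "Properties": [
    {"key": "invoice_id", "value": "923659"},
    {"key": "job_id", "value": "296160"},
    {"key": "sku_id", "value": "312002"}
  ]
}"""


def explode(record):
    """Return one row per element of the Properties array."""
    base = {name: value for name, value in record.items() if name != "Properties"}
    return [
        {**base, "SubContent": (pair["key"], pair["value"])}
        for pair in record["Properties"]
    ]


def pivot(rows):
    """Group rows by the root attributes and turn each key into a column."""
    grouped = {}
    for row in rows:
        group = (row["ProductNum"], row["unitCount"])
        wide = grouped.setdefault(
            group, {"ProductNum": row["ProductNum"], "unitCount": row["unitCount"]}
        )
        key, value = row["SubContent"]
        wide[key] = value
    return list(grouped.values())


exploded = explode(json.loads(RAW))
flattened = pivot(exploded)
print(flattened)
```

Each root-level attribute is repeated on every exploded row, which is what lets the pivot step group the rows back into a single wide record.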


2018 blogging in review (Happy New Year – 2019 !!!)

December 31, 2018


Happy New Year 2019… from SQL with Manoj !!!

As the stats helper monkeys have stopped preparing annual reports for blogs, I’ve again prepared my own Annual Report for this year.

–> Here are some Crunchy numbers from 2018

SQL with Manoj 2018 Stats

The Louvre Museum has 8.5 million visitors per year. This blog was viewed about 793,171 times by 542,918 unique visitors in 2018. If it were an exhibit at the Louvre Museum, it would take about 17 days for that many people to see it.

There were 68 pictures uploaded, taking up a total of ~6 MB. That’s about 6 pictures every month.

This blog also got its highest ever hits/views per day (i.e. 3,552 hits) on Sept 25th this year.


–> All-time posts, views, and visitors

SQL with Manoj all time views


–> Posting Patterns

In 2018, there were 26 new posts, growing the total archive of this blog to 546 posts.

LONGEST STREAK: 6 posts in Feb 2018


–> Attractions in 2018

These are the top 5 posts that got most views in 2018:

1. Download & Install SQL Server Management Studio (SSMS) 2016 (62,101 views)

2. SQL Server 2016 RTM full & final version available – Download now (31,705 views)

3. Getting started with SQL Server 2014 | Download and Install Free & Full version (20,443 views)

4. SQL Basics – Difference b/w WHERE, GROUP BY and HAVING clause (16,113 views)

5. SQL Basics – Difference b/w TRUNCATE, DELETE and DROP? (13,189 views)


–> How did they find me?

The top referring sites and search engines in 2018 were:

SQL with Manoj 2018 Search Engines referrers


–> Where did they come from?

Visitors came from 210 countries; the top 5 were India, United States, United Kingdom, Canada and Australia:

SQL with Manoj 2018 top Countries visitors


–> Followers: 407 (247 by email, 160 other)
Facebook Page: 1,358


–> Alexa Rank (lower the better)

Global Rank: 221,534
US Rank: 139,012
India Rank: 46,758
Estimated Monthly Revenue: $1,320
Actual Monthly Revenue: $300

SQL with Manoj 2018 Alexa ranking

Alexa history shows how the Alexa rank of this blog has varied in the past, which in turn also reflects the number of site visitors.

–> 2019 New Year Resolution

– Write at least 1 blog post every week
– Write on new features in SQL Server 2017 & 2019
– Also explore and write blog posts on Azure Data Platform
– Post at least 1 video every week on my YouTube channel


That’s all for 2018, see you in year 2019, all the best !!!


Hadoop/HDFS storage types, formats and internals – Text, Parquet, ORC, Avro

December 30, 2018

HDFS, or Hadoop Distributed File System, is the distributed file system provided by the Hadoop Big Data platform. The primary objective of HDFS is to store data reliably even in the presence of node failures in the cluster. This is achieved by replicating data across different racks in the cluster infrastructure. Files stored in HDFS are then used for further processing by engines such as Hadoop MapReduce, Hive, Spark, Impala, Pig, etc.

–> Here we will talk about different types of file formats supported in HDFS:

1. Text (CSV, TSV, JSON): These are flat file formats that can be used as a storage format in Hadoop. However, these formats do not carry their own schema, so a developer using any processing engine has to apply a schema while reading them.

2. Parquet: a column-oriented file format in the Hadoop ecosystem. Parquet stores binary data column-wise, which brings the following benefits:
– Less storage: values of the same data type sit next to each other, so the data compresses more efficiently.
– Increased query performance, as entire rows need not be loaded into memory.

The Parquet file format can be used with any tool in the Hadoop ecosystem, such as Hive, Impala, Pig, Spark, etc.

3. ORC: stands for Optimized Row Columnar, a column-oriented storage format. ORC is primarily used in the Hive world and gives better performance for Hive-based data retrieval because Hive has a vectorized ORC reader. The schema is self-contained in the file as part of the footer. Because of its column-oriented nature, it provides a better compression ratio and faster reads.

4. Avro: a row-oriented storage format that is a good fit for write-heavy applications. The schema is self-contained within the file in the form of JSON, which helps in achieving efficient schema evolution.
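
The differences between the four formats come down to where the schema lives and how values are laid out. The Python sketch below illustrates those ideas with in-memory structures only; it does not read or write real Parquet, ORC, or Avro files. The sample data, the `currency` field, and the `resolve` helper are hypothetical, and `resolve` is a much simplified stand-in for Avro schema resolution.

```python
import csv
import io
import json

# Hypothetical sample data, used only for illustration.
CSV_TEXT = "ProductNum,unitCount\n6000078,3\n6000079,1\n"

# 1. Text formats carry no schema: every field arrives as a string,
#    so the reader has to supply the types itself (schema-on-read).
CSV_TYPES = {"ProductNum": int, "unitCount": int}
rows = [
    {col: cast(raw[col]) for col, cast in CSV_TYPES.items()}
    for raw in csv.DictReader(io.StringIO(CSV_TEXT))
]

# 2. Row-oriented layout (the idea behind Avro): whole records are stored
#    one after another, so appending a new record is cheap.
row_layout = [(r["ProductNum"], r["unitCount"]) for r in rows]

# 3. Column-oriented layout (the idea behind Parquet and ORC): all values
#    of one column sit together, so a query can read just that column.
column_layout = {col: [r[col] for r in rows] for col in CSV_TYPES}
total_units = sum(column_layout["unitCount"])  # touches a single column

# 4. Self-describing file: the schema travels with the data as JSON.
WRITER_SCHEMA = {"type": "record", "name": "Product", "fields": [
    {"name": "ProductNum", "type": "long"},
    {"name": "unitCount", "type": "int"},
]}
file_header = json.dumps(WRITER_SCHEMA)

# A newer reader schema adds a field with a default value, so records
# written with the old schema can still be read (schema evolution).
READER_SCHEMA_V2 = {"type": "record", "name": "Product", "fields":
    WRITER_SCHEMA["fields"] + [
        {"name": "currency", "type": "string", "default": "USD"},
    ]}


def resolve(record, schema):
    """Fill fields missing from an old record with the schema's defaults."""
    return {
        field["name"]: record.get(field["name"], field.get("default"))
        for field in schema["fields"]
    }


evolved = [resolve(r, READER_SCHEMA_V2) for r in rows]
print(total_units, evolved[0])
```

The column layout is what makes single-column scans and type-aware compression possible, while the row layout keeps writes simple. That is the trade-off behind choosing Parquet or ORC for analytics and Avro for write-heavy pipelines.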

–> Now, let’s take a deep dive and look at these file formats through a series of videos below:


Author/Speaker Bio: Viresh Kumar is a v-blogger and an expert in the Big Data, Hadoop and Cloud world. He has ~14 years of experience in the Data Platform industry.


Book: Hadoop – The Definitive Guide: Storage and Analysis at Internet Scale