
Spark SQL – Beware of Implicit datatype conversions (TypeCoercion)

March 6, 2020

 
While working on some data analysis I noticed that a Spark SQL query was not returning the expected results. The table had a fair amount of data, and when I filtered on a value some records were missing. So I checked online and found that Spark SQL behaves differently from SQL Server in this case, i.e. when comparing columns or variables of two different datatypes.

–> To reproduce the scenario I'm populating some test data: 9 rows that store decimal values as strings, using the query below:

CREATE OR REPLACE TEMPORARY VIEW vwTestDataType as 
select * from values 
("row1", "2.0"), 
("row2", "1.5"), 
("row3", "1.0"), 
("row4", "0.8"), 
("row5", "0.6"), 
("row6", "0.4"), 
("row7", "0.2"), 
("row8", "0.0"),
("row9", null);

describe vwTestDataType;

col_name | data_type | comment
col1     | string    | null
col2     | string    | null

 

–> Now, I'll run a query similar to the one where I observed the issue. The query below should return 7 rows, but instead it returns just 3.

select * from vwTestDataType where col2 > 0

Running the above query in SQL Server against the same dataset throws the error below:

Conversion failed when converting the varchar value '2.0' to data type int.

 

–> Let's check why the Spark SQL query didn't fail and why it behaves like this.

I'll use the EXPLAIN EXTENDED operator to see what happens to the query while the Logical Plan is created.

explain extended select * from vwTestDataType where col2 > 0

Here is the plan. You can see that under the Analyzed Logical Plan the column "col2" is implicitly cast to INT, because the comparison value is of INT type. The cast truncates all the 0.x values to 0, so they get filtered out.

Plan

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('col2 > 0)
   +- 'UnresolvedRelation `vwTestDataType`

== Analyzed Logical Plan ==
col1: string, col2: string
Project [col1#13284, col2#13285]
+- Filter (cast(col2#13285 as int) > 0)
   +- SubqueryAlias `vwtestdatatype`
      +- Project [col1#13284, col2#13285]
         +- LocalRelation [col1#13284, col2#13285]

== Optimized Logical Plan ==
LocalRelation [col1#13284, col2#13285]

== Physical Plan ==
LocalTableScan [col1#13284, col2#13285]
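
To see the coercion in isolation, here is a minimal sketch (Scala, spark-shell or a notebook) that applies the same CAST the analyzer injects. It reuses the vwTestDataType view created above and assumes the same non-ANSI coercion behaviour described in this post:

// Minimal sketch: apply the analyzer's implicit CAST(col2 AS INT) ourselves,
// against the vwTestDataType view created above (assumes non-ANSI coercion, as in this post).
val coerced = spark.sql(
  "SELECT col1, col2, CAST(col2 AS INT) AS col2_as_int FROM vwTestDataType")
coerced.show()
// Values such as "0.8" are truncated to 0, so they fail the `> 0` predicate and disappear.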

 

–> Now, to avoid this issue you must explicitly cast the column and the value to the same datatype to get the expected result. Here we should convert the string column and the value to DOUBLE; this way the query returns all 7 rows as expected:

select * from vwTestDataType where double(col2) > double(0)
-- OR
select * from vwTestDataType where col2 > 0.0

Let's check the Logical Plan of the modified query again with the EXPLAIN EXTENDED operator and see how it looks:

explain extended select * from vwTestDataType where double(col2) > double(0)
-- OR
explain extended select * from vwTestDataType where col2 > 0.0

Plan

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('double('col2) > 'double(0))
   +- 'UnresolvedRelation `vwTestDataType`

== Analyzed Logical Plan ==
col1: string, col2: string
Project [col1#13213, col2#13214]
+- Filter (cast(col2#13214 as double) > cast(0 as double))
   +- SubqueryAlias `vwtestdatatype`
      +- Project [col1#13213, col2#13214]
         +- LocalRelation [col1#13213, col2#13214]

== Optimized Logical Plan ==
LocalRelation [col1#13213, col2#13214]

== Physical Plan ==
LocalTableScan [col1#13213, col2#13214]

 

So while working with Spark SQL we should make sure there are no such datatype conflicts, and ideally these issues should be handled right at the beginning, by modelling the tables with the correct datatypes.
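
As a rough sketch of handling this at modelling time, the data could be stored (or at least exposed) with a proper DOUBLE column, so that no implicit coercion happens at query time. The typed view below is only illustrative, built on top of the sample view from this post:

// Illustrative sketch: expose col2 as DOUBLE up front, so comparisons need no implicit coercion.
spark.sql("""
  CREATE OR REPLACE TEMPORARY VIEW vwTestDataTypeTyped AS
  SELECT col1, CAST(col2 AS DOUBLE) AS col2
  FROM vwTestDataType
""")

spark.sql("SELECT * FROM vwTestDataTypeTyped WHERE col2 > 0").show()  // returns all 7 rows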


SQL Error – “SELECT TOP 100” throws error in SparkSQL – what’s the correct syntax?

January 23, 2020

 
In SQL Server, to get the top-n rows from a table or dataset you just use the "SELECT TOP" clause, specifying the number of rows you want returned, as in the query below.

But when I tried to use the same query in Spark SQL, I got a syntax error, which means the TOP clause is not supported with the SELECT statement.

%sql
Select TOP 100 * from SalesOrder

Error in SQL statement: ParseException:
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '100' expecting (line 1, pos 11)

== SQL ==
Select top 100 * from SalesOrder
-----------^^^

 

As Spark SQL does not support the TOP clause, I tried the MySQL syntax instead, i.e. the "LIMIT" clause.

So I just removed "TOP 100" from the SELECT query and added a "LIMIT 100" clause at the end; it worked and gave the expected results!

%sql
Select * from SalesOrder LIMIT 100
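
One thing to keep in mind (a minimal sketch below): LIMIT on its own does not guarantee which 100 rows come back, so for a true "top 100" you usually pair it with an ORDER BY. The ORDER BY column used here (OrderDate) is just an assumed example column of the SalesOrder table:

// Hypothetical sketch: a deterministic "top 100" needs an explicit ORDER BY before LIMIT.
// "OrderDate" is an assumed column of SalesOrder, used only for illustration.
val top100 = spark.sql("""
  SELECT * FROM SalesOrder
  ORDER BY OrderDate DESC
  LIMIT 100
""")
top100.show()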

SQL Server 2019 released, awesome new features – download now !!!

November 4, 2019

 
Today, on 4th November 2019, Microsoft announced the release of the new version of SQL Server, i.e. SQL Server 2019, at the MS Ignite 2019 event.
 

New stuff in SQL Server 2019 is all about Big Data Clusters for SQL Server, which will allow you to:
– Deploy scalable clusters of SQL Server, Spark, HDFS on Kubernetes
– Read, write, and process big data from Transact-SQL or Spark
– With PolyBase, query data from external SQL Server, Oracle, Teradata, MongoDB, and ODBC data sources via external tables
– and many more, we will check below…

 
–> Download SQL Server (evaluation version):

To download SQL Server 2019 you can Register and Download the Full or Evaluation version (180 days) here.

Or you can directly download the installer SQL2019-SSEI-Eval.exe
 

–> Free Developer Version:

Back in March 2016 Microsoft announced that, going forward, the Developer edition of any SQL Server release will be free for development and learning purposes. Register and Download the Developer version.

This Developer version is meant for development and testing only, and not for production environments or for use with production data. For more info please check my previous blog post.
 

–> Download SSMS (separate install):

Starting with SQL Server 2016, Microsoft decoupled SSMS from the SQL Server setup, and it is now available as a separate installer. This supports the move towards a universal version of SSMS for both SQL Server on-prem and Azure SQL Database, shipping every month or so. The SSMS setup is available separately as a free download.
 

–> Download SSRS (Reporting Services):

Just like SSMS, SSRS is now also available as a separate install from the Microsoft Download Center, link.
 

–> Check new features in SQL Server 2019:

1. Big data clusters with scalable compute and storage composed of SQL Server, Spark, and HDFS. It provides key elements of a data lake – Hadoop Distributed File System (HDFS), Spark, and analytics tools deeply integrated with SQL Server. [more info]

2. A complete Data Science & AI platform to train and operationalize models in SQL Server ML Services or Spark ML using Azure Data Studio notebooks.

3. Data virtualization allows queries across relational and non-relational data without movement or replication. PolyBase enables you to run a T-SQL query inside SQL Server to pull data from Hadoop and return it in a structured format, all without moving or copying the data.

4. Intelligent Query Processing improves scaling of queries and Automatic Plan Correction resolves performance problems. [more info]
  – Table variable deferred compilation
  – Batch mode on row store
  – T-SQL Scalar UDF inlining
  – Approximate QP (Approximate COUNT DISTINCT)
  – Memory grant feedback, row mode

5. In-memory improvements for analytics on operational data using HTAP. Higher concurrency and scale through persistent memory (PMEM). [more info]
– Hybrid buffer pool
– Memory-optimized TempDB
– In-Memory OLTP support for Database Snapshots

6. Greater uptime with more online indexing operations. [more info]

7. Data Discovery & Classification labeling for GDPR and Vulnerability Assessment tool to track compliance.

8. Support for your choice of Windows, Linux, and containers. SQL Server on Linux for Docker Engine [link].

9. High Availability with five synchronous replica pairs, secondary-to-primary replica connection redirection, and Always On availability groups running on containers using Kubernetes. [more info]

10. Accelerated Database Recovery [more info]

11. SQL Graph enhancements: support for MATCH predicates in a MERGE statement, SHORTEST_PATH inside MATCH, and support for derived tables or view aliases in a graph match query.
 

Check all the above and many more new features of SQL Server 2019 in MSDN Blogs.


Are You Prepared for Disaster? Evaluating Cloud Backup Solutions by AWS vs. Azure vs. Google Cloud

February 12, 2019

 

 
The adoption of public cloud computing services shows no signs of slowing down — Gartner predicted that the global public cloud services market will grow 17.3 percent in 2019 alone.

A huge draw of the public cloud is the use of its low-cost storage services for cloud backup purposes. Businesses can securely back up their data straight from an on-premise data center to the public cloud for disaster preparation. Such disasters, whether caused by natural factors or simple human error, can lead to the loss of data that is essential for business continuity.

Cloud backup is cost-effective, it provides anytime, anywhere data access via an Internet connection, and it stores data in an off-site location for data center redundancy. This article goes into detail on the cloud backup solutions offered by three major public cloud providers — AWS, Microsoft Azure, and Google Cloud. You’ll be able to compare pricing, features, and the level of support you get from the three service providers.
 

Main Cloud Backup Solutions/Features

Azure Backup is Microsoft Azure’s dedicated cloud-based backup solution. AWS has Amazon S3 Simple Storage and Amazon Glacier as its main storage services for cloud backup. Google Cloud Storage provides enterprise-grade public cloud storage.
 

Amazon Web Services (AWS)

S3 is the main AWS service suited for cloud backup purposes and there are 20 geographic regions housing data centers around the world. The global AWS infrastructure helps businesses benefit from storing their data in the region closest to their main operational base for more rapid data transfer in the event of an outage.

Backing up data to S3 is as straightforward as creating a storage bucket and uploading the relevant files. You can set permissions for each data object, encrypt your data, and add metadata.

Glacier is a long-term, low-cost storage service with an Active Archive option, which enables you to retrieve your data within 5 minutes. High-performance block-level storage is available from the EBS service. S3 and Glacier are object storage services.

Another important service in the context of cloud backup is AWS Storage Gateway, which provides your on-premises applications with a low-latency connection to AWS cloud storage services like S3, EBS and Glacier.

A submarket has emerged in the area of AWS cloud backup in which third-party vendors attempt to simplify workloads, meet compliance demands, and reduce costs when using S3 for backup purposes. Examples of such services include N2WS AWS Backup and Cloudberry.
 

Microsoft Azure

Azure Backup can be used as a dedicated cloud-based backup service that entirely replaces an existing on-premises or off-site backup solution. Your data can be replicated either locally within a region or in a separate region in what Azure terms locally redundant storage (LRS) and geo-redundant storage (GRS).

Data encryption, application-consistent backups, long-term retention, and data compression are some of the features available in each of the four separate components you can choose from within Azure Backup.
 

Google Cloud

Google Cloud Storage provides durable cloud storage accessed through a unified API. The unified API enables businesses to integrate cloud backup into their apps.

Google promises millisecond latency from its Cloud Storage service, which is helpful for achieving the required recovery time objective (RTO) for swift disaster recovery.
 

Pricing

All three of these cloud backup providers operate a pay-per-use model in which the monthly cost depends primarily on the amount of data stored. Other factors that influence the price are the frequency at which you access data and the geographic region your data is stored in.
 

AWS

The AWS free usage tier entitles users to up to 5 GB of free storage in S3. Beyond that point, the cost per gigabyte depends on the geographic region, the quantity of storage used, and the frequency of access. The below table provides costs for the U.S East region.

Azure

The price you pay to use Azure cloud backup varies depending on whether you choose to make the data locally redundant or geographically redundant, with the latter being more costly due to the additional peace of mind it provides. Like in AWS, the cost also varies depending on the amount of data storage consumed. See the table of costs for the U.S Eastern region below:

Note that for storage needs greater than 5,000 TB, you need to contact Azure for a custom quote. Costs may differ when backing up data in other Azure regions.
 

Google Cloud

With Google Cloud Storage you also get 5 GB of free usage. Beyond this point, the per-gigabyte cost stays the same regardless of the amount of data stored, which makes the pricing more straightforward but doesn't reward businesses storing a lot of data with lower costs.

The cost varies depending on whether you want data stored regionally (better performance, lower latency) or multi-regionally (geo-redundancy). Costs also differ between data accessed more regularly (nearline storage) and data accessed infrequently (coldline storage). Below you'll see the costs for the U.S East region.

Support

A crucial aspect to consider in the public cloud is the level of support available from your service provider. You need to factor in the potential for problems and technical issues arising with your cloud service usage, and how promptly the service provider can respond.

All three providers have paid support plans available. Each company tiers its support plans, with the premium plans providing the quickest response times to technical issues.

The AWS Enterprise plan promises 24/7 support and sub-15-minute response times for critical issues, but it costs $15,000 per month. Business plan users pay from $100 per month for 24/7 tech support and sub-one-hour response times for critical issues.

Google’s Platinum support package provides similar benefits to the AWS Enterprise support plan but the cost is given by quote only. Google has a Gold support package which delivers a 1-hour response time for critical issues.

Lastly, Azure's Professional Direct plan provides 24/7 technical support and sub-one-hour response times for $1,000 per month. The Standard plan costs $300, but the response time for critical issues increases to two hours.
 

Conclusion

Your choice of cloud backup solution depends on the particular provider that best meets your needs. All three offer similar levels of premium technical support. Google differs slightly in pricing in that it doesn’t alter its per-gigabyte cost as you store more data.

Azure Backup meets the needs of businesses looking for a dedicated cloud backup solution. AWS is more general-purpose and requires expert knowledge to minimize costs and maximize performance as a backup service, and third-party AWS backup services can help out with that. Google Storage also has a wider range of use cases than just backup.


Spark/Scala: Convert or flatten a JSON having Nested data with Struct/Array to columns (Question)

January 9, 2019

 
The following JSON contains some attributes at the root level, like ProductNum and unitCount.
It also contains a nested attribute named "Properties", which holds an array of key-value pairs.

Now, what I want is to expand this JSON and have all the attributes as columns, with an additional column for each key in the nested array, as shown in the "Expected Output" section below:

{
   "ProductNum":"6000078",
   "Properties":[
      {
         "key":"invoice_id",
         "value":"923659"
      },
      {
         "key":"job_id",
         "value":"296160"
      },
      {
         "key":"sku_id",
         "value":"312002"
      }
   ],
   "unitCount":"3"
}

 

Expected output, as described above:

+-------------------------------------------------------+   
| ProductNum | invoice_id | job_id | sku_id | unitCount |  
+-------------------------------------------------------+   
| 6000078    | 923659     | 296160 | 312002 | 3         |  
+-------------------------------------------------------+

 

Solution:

import org.apache.spark.sql.functions.{explode, first}
import spark.implicits._  // for the $"col" syntax and the String encoder

// Wrap the sample JSON document in a Dataset[String]
val DS_Products = spark.createDataset("""{
   "ProductNum":"6000078",
   "Properties":[
      {
         "key":"invoice_id",
         "value":"923659"
      },
      {
         "key":"job_id",
         "value":"296160"
      },
      {
         "key":"sku_id",
         "value":"312002"
      }
   ],
   "UnitCount":"3"
}""" :: Nil)

// Parse the JSON strings into a DataFrame; "Properties" is inferred as an array of structs
val DF_Products = spark.read.json(DS_Products)

// Explode the Properties array so each key/value struct becomes its own row
val df_flatten = DF_Products
  .select($"*", explode($"Properties") as "SubContent")
  .drop($"Properties")

df_flatten.show()

// Pivot the exploded key/value pairs back into one column per distinct key
val df_flatten_pivot = df_flatten
  .groupBy($"ProductNum", $"UnitCount")
  .pivot("SubContent.key")
  .agg(first("SubContent.value"))

df_flatten_pivot.show()

Output:

+----------+---------+--------------------+
|ProductNum|UnitCount|          SubContent|
+----------+---------+--------------------+
|   6000078|        3|[invoice_id, 923659]|
|   6000078|        3|    [job_id, 296160]|
|   6000078|        3|    [sku_id, 312002]|
+----------+---------+--------------------+

+----------+---------+----------+------+------+
|ProductNum|UnitCount|invoice_id|job_id|sku_id|
+----------+---------+----------+------+------+
|   6000078|        3|    923659|296160|312002|
+----------+---------+----------+------+------+
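
For comparison, here is an alternative sketch (Spark 2.4+) that converts the key/value array into a map with map_from_entries and selects the fields directly. Unlike the pivot approach above, it needs the key names to be known up front; it reuses the DF_Products DataFrame from the solution:

// Alternative sketch (Spark 2.4+): build a map from the key/value structs and
// pull out the known keys directly, instead of explode + pivot.
import org.apache.spark.sql.functions.{col, map_from_entries}

val df_map = DF_Products
  .withColumn("props", map_from_entries(col("Properties")))
  .select(
    col("ProductNum"),
    col("props")("invoice_id").as("invoice_id"),
    col("props")("job_id").as("job_id"),
    col("props")("sku_id").as("sku_id"),
    col("UnitCount"))

df_map.show()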