Archive

Archive for the ‘Others’ Category

Spark SQL – Beware of Implicit datatype conversions (TypeCoercion)

March 6, 2020 1 comment

 
While working on some data analysis I saw one Spark SQL query was not getting me expected results. The table had some good amount of data, I was filtering on a value but some records were missing. So, I checked online and found that Spark SQL works differently compared to SQL Server, in this case while comparing 2 different datatypes columns or variables.

–> I’m populating some test data to reproduce the scenario, for that I’m inserting 9 rows and storing decimal values as String, query below:

CREATE OR REPLACE TEMPORARY VIEW vwTestDataType as 
select * from values 
("row1", "2.0"), 
("row2", "1.5"), 
("row3", "1.0"), 
("row4", "0.8"), 
("row5", "0.6"), 
("row6", "0.4"), 
("row7", "0.2"), 
("row8", "0.0"),
("row9", null);

describe vwTestDataType;

col_name | data_type | comment
col1           | string         | null
col2           | string         | null

 

–> Now, I’ll create a similar query where I was observing the issue. The below query should return me 7 rows, but instead it returns just 3 rows.

select * from vwTestDataType where col2 > 0

Running above query in “SQL Server” throws below error for the same dataset:

Conversion failed when converting the varchar value ‘2.0’ to data type int.

 

–> Let’s check why Spark SQL query didn’t failed and why its behaving like this.

I will use EXPLAIN EXTENDED operator to know what’s happening with the query while creating the Logical Plan.

explain extended select * from vwTestDataType where col2 > 0

Here is the plan you can see that under Analyzed Logical Plan the column “col2” is getting implicitly typecasted to INT, as the comparison value is an INT type. Thus it is converting all 0.x values to 0 and filtering them out.

Plan

== Parsed Logical Plan ==
‘Project [*]
+- ‘Filter (‘col2 > 0)
+- ‘UnresolvedRelation `vwTestDataType`

== Analyzed Logical Plan ==
col1: string, col2: string
Project [col1#13284, col2#13285]
+- Filter (cast(col2#13285 as int) > 0)
+- SubqueryAlias `vwtestdatatype`
+- Project [col1#13284, col2#13285]
+- LocalRelation [col1#13284, col2#13285]

== Optimized Logical Plan ==
LocalRelation [col1#13284, col2#13285]

== Physical Plan ==
LocalTableScan [col1#13284, col2#13285]

 

–> Now to avoid this issue you must explicitly type cast the column and value to the exact datatype to get expected result. Like here we should convert the String column & value to Double, this way the query returns all 7 rows as expected:

select * from vwTestDataType where double(col2) > double(0)
--OR--select * from vwTestDataType where col2 > 0.0

Let’s again check the Logical Plan of the modified query by using EXPLAIN EXTENDED operator how it looks like:

explain extended select * from vwTestDataType where double(col2) > double(0)
--OR--explain extended select * from vwTestDataType where col2 > 0.0

plan
== Parsed Logical Plan ==
‘Project [*]
+- ‘Filter (‘double(‘col2) > ‘double(0))
+- ‘UnresolvedRelation `vwTestDataType`

== Analyzed Logical Plan ==
col1: string, col2: string
Project [col1#13213, col2#13214]
+- Filter (cast(col2#13214 as double) > cast(0 as double))
+- SubqueryAlias `vwtestdatatype`
+- Project [col1#13213, col2#13214]
+- LocalRelation [col1#13213, col2#13214]

== Optimized Logical Plan ==
LocalRelation [col1#13213, col2#13214]

== Physical Plan ==
LocalTableScan [col1#13213, col2#13214]

 

So while working with Spark SQL we should make sure there should not be such datatype conflicts, and moreover these type of issues should be handled in way beginning while modelling the tables with correct datatype.


SQL Error – “SELECT TOP 100” throws error in SparkSQL – what’s the correct syntax?

January 23, 2020 Leave a comment

 
In SQL Server to get top-n rows from a table or dataset you just have to use “SELECT TOP” clause by specifying the number of rows you want to return, like in the below query.

But when I tried to use the same query in Spark SQL I got a syntax error, which meant that the TOP clause is not supported with SELECT statement.

%sql
Select TOP 100 * from SalesOrder

Error in SQL statement: ParseException:
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input ‘100’ expecting (line 1, pos 11)

== SQL ==
Select top 100 * from SalesOrder
———–^^^

 

As Spark SQL does not support TOP clause thus I tried to use the syntax of MySQL which is the “LIMIT” clause.

So I just removed “TOP 100” from the SELECT query and tried adding “LIMIT 100” clause at the end, it worked and gave expected results !!!

%sql
Select * from SalesOrder LIMIT 100

SQL Server 2019 released, awesome new features – download now !!!

November 4, 2019 Leave a comment

 
Today on 4th November 2019 Microsoft in MSIgnite2019 event announced the release of new version of SQL Server i.e. SQL Server 2019.
 

New stuff in SQL Server 2019 is all about Big Data Clusters for SQL Server, which will allow you to:
– Deploy scalable clusters of SQL Server, Spark, HDFS on Kubernetes
– Read, write, and process big data from Transact-SQL or Spark
– With Polybase query data from external SQL Server, Oracle, Teradata, MongoDB, and ODBC data sources with external tables
– and many more, we will check below…

 
–> Download SQL Server (evaluation version):

To download SQL Server 2019 you can Register and Download the Full or Evaluation version (180 days) here.

Or you can directly download the installer SQL2019-SSEI-Eval.exe
 

–> Free Developer Version:

Back in March 2016 Microsoft announced that going forward the Developer version of SQL Server any release will be free for Developers and Learning purpose. Register and Download the Developer version.

This Developer version is meant for development and testing only, and not for production environments or for use with production data. For more info please check my previous blog post.
 

–> Download SSMS (separate install):

Microsoft starting with SQL Server 2016 decoupled SSMS from SQL Server setup and is available as a separate installer. This is basically to support the move to make a universal version of SSMS for both SQL Server on-Prem & Azure SQL Database, that will ship every month or so. The SSMS setup is available separately as free download.
 

–> Download SSRS (Reporting Services):

Just like SSMS, now SSRS is also separately available for install from Microsoft Download Center, link.
 

–> Check new features in SQL Server 2019:

1. Big data clusters with scalable compute and storage composed of SQL Server, Spark, and HDFS. It provides key elements of a data lake – Hadoop Distributed File System (HDFS), Spark, and analytics tools deeply integrated with SQL Server. [more info]

2. A complete Data Science & AI platform to train and operationalize models in SQL Server ML Services or Spark ML using Azure Data Studio notebooks.

3. Data virtualization allows queries across relational and non-relational data without movement or replication. PolyBase enabled you to run a T-SQL query inside SQL Server to pull data from Hadoop and return it in a structured format—all without moving or copying the data.

4. Intelligent Query Processing improves scaling of queries and Automatic Plan Correction resolves performance problems. [more info]
  – Table variable deferred compilation
  – Batch mode on row store
  – T-SQL Scalar UDF inlining
  – Approximate QP (Approximate COUNT DISTINCT)
  – Memory grant feedback, row mode

5. In-memory improvements for analytics on operational data using HTAP. Higher concurrency and scale through persistent memory (PMEM). [more info]
– Hybrid buffer pool
– Memory-optimized TempDB
– In-Memory OLTP support for Database Snapshots

6. Greater uptime with more online indexing operations. [more info]

7. Data Discovery & Classification labeling for GDPR and Vulnerability Assessment tool to track compliance.

8. Support for your choice of Windows, Linux, and containers. SQL Server on Linux for Docker Engine [link].

9. High Availability with, five synchronous replica pairs, Secondary to primary replica connection redirection, Run Always On availability groups on containers using Kubernetes. [more info]

10. Accelerated Database Recovery [more info]

11. SQL Graph enhancements, support of MATCH predicates in a MERGE statement, SHORTEST_PATH inside MATCH, and support for derived tables or view aliases in graph match query.
 

Check all the above and many more new features of SQL Server 2019 in MSDN Blogs.


2018 blogging in review (Happy New Year – 2019 !!!)

December 31, 2018 1 comment

 

Happy New Year 2019… from SQL with Manoj !!!

As WordPress.com stats helper monkeys have stopped preparing annual report for any of their blogs, so I’ve prepared my own Annual Report for this year again.
 

–> Here are some Crunchy numbers from 2018

SQL with Manoj 2018 Stats

The Louvre Museum has 8.5 million visitors per year. This blog was viewed about 793,171 times by 542,918 unique visitors in 2018. If it were an exhibit at the Louvre Museum, it would take about 17 days for that many people to see it.

There were 68 pictures uploaded, taking up a total of ~6 MB. That’s about ~6 pictures every month.

This blog also got its highest ever hits/views per day (i.e. 3,552 hits) on Sept 25th this year.

 

–> All-time posts, views, and visitors

SQL with Manoj all time views


 

–> Posting Patterns

In 2018, there were 26 new posts, growing the total archive of this blog to 546 posts.

LONGEST STREAK: 6 post in Feb 2018

 

–> Attractions in 2018

These are the top 5 posts that got most views in 2018:

1. Download & Install SQL Server Management Studio (SSMS) 2016 (62,101 views)

2. SQL Server 2016 RTM full & final version available – Download now (31,705 views)

3. Getting started with SQL Server 2014 | Download and Install Free & Full version (20,443 views)

4. SQL Basics – Difference b/w WHERE, GROUP BY and HAVING clause (16,113 views)

5. SQL Basics – Difference b/w TRUNCATE, DELETE and DROP? (13,189 views)

 

–> How did they find me?

The top referring sites and search engines in 2018 were:

SQL with Manoj 2018 Search Engines referrers


 

–> Where did they come from?

Out of 210 countries, top 5 visitors came from India, United States, United Kingdom, Canada and Australia:

SQL with Manoj 2018 top Countries visitors


 

–> Followers: 407

WordPress.com: 160
Email: 247
Facebook Page: 1,358

 

–> Alexa Rank (lower the better)

Global Rank: 221,534
US Rank: 139,012
India Rank: 46,758
Estimated Monthly Revenue: $1,320
Actual Monthly Revenue: $300

SQL with Manoj 2018 Alexa ranking


Alexa history shows how the alexa rank of sqlwithmanoj.com has varied in the past, which in turn also tells about the site visitors.
 

–> 2019 New Year Resolution

– Write at least 1 blog post every week
– Write on new feaures in SQL Server 2017 & 2019
– Also explore and write blog post on Azure Data Platform
– Post at least 1 video every week on my YouTube channel

 

That’s all for 2018, see you in year 2019, all the best !!!
 

Connect me on Facebook, Twitter, LinkedIn, YouTube, Google, Email


Powershell – Restart Azure VM and log off Users remotely

October 21, 2018 1 comment

 

1. Open RUN by pressing Windows + R keys, type powershell command and hit Enter.
 

2. Now on Powershell window first connect to the Azure VM that you want to remotely restart:

PS C:\Users\manojp> Enter-PSSession -ComputerName MyAzureVMName

[MyAzureVMName]: PS C:\Users\manojp\Documents>

 

3. Now try issuing the Restart command:

[MyAzureVMName]: PS C:\Users\manojp\Documents> Restart-Computer

Restart-Computer : Failed to restart the computer MyAzureVMName with the following error message: The system shutdown
cannot be initiated because there are other users logged on to the computer.
+ CategoryInfo : OperationStopped: (MyAzureVMName:String) [Restart-Computer], InvalidOperationException
+ FullyQualifiedErrorId : RestartcomputerFailed,Microsoft.PowerShell.Commands.RestartComputerCommand

So, this gave us error as few users are still logged in, thus can’t restart the VM.

 

4. Let’s check who all are logged in on this VM:

[MyAzureVMName]: PS C:\Users\manojp\Documents> quser

USERNAME    SESSIONNAME    ID    STATE    IDLE TIME    LOGON TIME
charlesl         rdp-tcp#0           2      Active    1:07            12/21/2018 08:26 AM

 

5. Let’s try kicking users out by specifying the ID which is “2”:

[MyAzureVMName]: PS C:\Users\manojp\Documents> logoff 2

 

6. We will check if that user is kicked out or anybody is still remaining:

[MyAzureVMName]: PS C:\Users\manojp\Documents> quser

quser : No User exists for *
+ CategoryInfo : NotSpecified: (No User exists for *:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError

 

7. Let’s finally restart the VM:

[MyAzureVMName]: PS C:\Users\manojp\Documents> Restart-Computer

PS>