
Azure Data Factory (ADF) Pipeline failure – found more columns than expected column count (DelimitedTextMoreColumnsThanDefined)

July 29, 2020

 
I was setting up an Azure Data Factory (ADF) pipeline to copy files from Azure Data Lake Storage Gen1 to Gen2, but the pipeline run kept failing with the error below:

Operation on target Copy_sae failed: Failure happened on 'Sink' side.
ErrorCode=DelimitedTextMoreColumnsThanDefined,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Error found when processing 'Csv/Tsv Format Text' source '0_2019_11_09_01_43_32.avro' with row number 53: found more columns than expected column count 27.,
Source=Microsoft.DataTransfer.Common,'

 

After some research I figured out that it's because I had not selected the “Binary Copy” option while creating the Copy Data activity (shown in image below).

Root Cause: If the folder you are copying contains files with different schemas (a variable number of columns, different delimiters, different quote char settings) or files with some data issue, ADF tries to parse each file against one expected schema, and the pipeline fails with this error.

So, for bulk copying or migrating your data from one Data Lake to another, try choosing this option. ADF then won't open the files to read their schema; it simply treats every file as binary and copies it to the other location.
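
If you would rather find the offending rows than skip parsing altogether, a quick local check can help. The sketch below is only a diagnostic under stated assumptions: `find_wide_rows` is a hypothetical helper (not part of ADF), it uses Python's own `csv` module rather than ADF's parser, and it assumes a comma delimiter and double-quote as the quote char.

```python
import csv
import io


def find_wide_rows(text, expected_columns, delimiter=","):
    """Return (row_number, column_count) for rows wider than expected.

    Row numbers start at 1 and include the header row. This is a local
    diagnostic only; it does not claim to match ADF's own row counting.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [
        (row_number, len(row))
        for row_number, row in enumerate(reader, start=1)
        if len(row) > expected_columns
    ]


# Hypothetical sample: the third line has an unquoted comma in its last field.
sample = "id,name,city\n1,Ann,Pune\n2,Bob,New York, NY\n"
print(find_wide_rows(sample, expected_columns=3))
```

A properly quoted value such as "New York, NY" counts as a single column, so only rows with genuinely extra delimiters are reported.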


 
Hope this helps!

Migrate ADLS Gen1 to Gen2


SQL DBA – Change RECOVERY mode & SHRINK all databases at once in SQL Server

July 17, 2020

 
There are times when you are left with little or no space on your SQL Server VM disks. The main cause is not only the large amount of data coming in but also the heavy processing your database server is doing, which fills up the entire log space.

 
SQL Server provides some effective commands to shrink database files that got inflated by heavy processing, and so make room for new data.

ALTER DATABASE <db_name> SET RECOVERY SIMPLE;

DBCC SHRINKFILE (N'<log_file_name>' , 100);

But this works with one database at a time, so what if you have lots of databases? It is tiring and time consuming to visit each DB, get the database file names and apply the shrink command. It's even slower to do the same via the SSMS GUI.
 

Note: Do not change the recovery mode in a PROD environment unless it is really required. Switching to SIMPLE breaks the log backup chain, so point-in-time restore is no longer possible.

 
With the T-SQL statements below you can generate the scripts for all databases, then run the generated output to change all database settings at once:

-- Generate SQL Script to change Recovery mode to 'SIMPLE' for all DBs:
-- QUOTENAME() brackets each name safely, even if it contains spaces or ']'.
SELECT 
	'ALTER DATABASE ' + QUOTENAME(name) + ' 
	 SET RECOVERY SIMPLE;' as SimpleRecovery4AllDBs
FROM sys.databases  
WHERE recovery_model_desc <> 'SIMPLE'

-- Generate SQL Script to Shrink log files of all user DBs
-- (database_id > 4 skips the system databases; [type] = 1 selects log files):
SELECT 
	'USE ' + QUOTENAME(DB_NAME(database_id)) + '; 
	 DBCC SHRINKFILE (N''' + name + ''' , 100);' as ShrinkAllDBs
FROM sys.master_files 
WHERE database_id > 4 and [type] = 1
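
If you build the same statements from a client script instead, the database names need bracket-quoting, otherwise a name with a space or hyphen breaks the USE statement. Here is a minimal Python sketch of that idea. The helpers `quote_name` and `shrink_script` and the sample names are hypothetical; the quoting simply mirrors the default behaviour of T-SQL's QUOTENAME (wrap in brackets, double any closing bracket).

```python
def quote_name(name):
    """Bracket-quote an identifier: wrap in [ ] and double any ']' inside."""
    return "[" + name.replace("]", "]]") + "]"


def shrink_script(log_files, target_mb=100):
    """Build one USE + DBCC SHRINKFILE pair per (database, logical log file)."""
    lines = []
    for database, log_file in log_files:
        # Escape single quotes because the file name goes into a string literal.
        escaped_file = log_file.replace("'", "''")
        lines.append(f"USE {quote_name(database)};")
        lines.append(f"DBCC SHRINKFILE (N'{escaped_file}', {target_mb});")
    return "\n".join(lines)


# Hypothetical names; in practice they would come from sys.master_files.
print(shrink_script([("Sales DB", "Sales_log"), ("HR", "HR_log")]))
```

The script only prints the statements; review the output before running it against a server.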

 

Please let me know if you have any other (and more effective) way to do the same!
 


Databricks Notebook error: Your administrator has only allowed sql and scala commands on this cluster.

June 8, 2020

 
While creating a Python notebook and running it on my Databricks Cluster, I got the following error:

Your administrator has only allowed sql and scala commands on this cluster. This execution contained at least one disallowed language.

 

It's obvious that the error is due to a restriction applied at the Cluster level. So I went to the Cluster settings page, checked the Spark Config, and found the key-value setting below:

spark.databricks.repl.allowedLanguages sql,scala

 

So if you want to run other languages such as Python and R, you can remove the entire line. If you want to restrict certain language(s) instead, change the list as per your needs.


Spark SQL – Beware of Implicit datatype conversions (TypeCoercion)

March 6, 2020

 
While working on some data analysis, I noticed that one Spark SQL query was not returning the expected results. The table had a good amount of data and I was filtering on a value, but some records were missing. After checking online, I found that Spark SQL behaves differently from SQL Server when comparing columns or variables of two different datatypes.

–> To reproduce the scenario I’m populating some test data: 9 rows, with decimal values stored as String. Query below:

CREATE OR REPLACE TEMPORARY VIEW vwTestDataType as 
select * from values 
("row1", "2.0"), 
("row2", "1.5"), 
("row3", "1.0"), 
("row4", "0.8"), 
("row5", "0.6"), 
("row6", "0.4"), 
("row7", "0.2"), 
("row8", "0.0"),
("row9", null);

describe vwTestDataType;

col_name | data_type | comment
col1           | string         | null
col2           | string         | null

 

–> Now I’ll run a query similar to the one where I observed the issue. The query below should return 7 rows, but instead it returns just 3 rows.

select * from vwTestDataType where col2 > 0

Running the above query in “SQL Server” against the same dataset throws the error below:

Conversion failed when converting the varchar value '2.0' to data type int.

 

–> Let’s check why the Spark SQL query didn’t fail and why it’s behaving like this.

I will use the EXPLAIN EXTENDED operator to see what happens to the query while the Logical Plan is built.

explain extended select * from vwTestDataType where col2 > 0

Here is the plan. You can see that under Analyzed Logical Plan the column “col2” is implicitly cast to INT, because the comparison value is of INT type. The cast truncates all the 0.x values to 0, so those rows fail the “> 0” filter and are dropped.

Plan

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('col2 > 0)
   +- 'UnresolvedRelation `vwTestDataType`

== Analyzed Logical Plan ==
col1: string, col2: string
Project [col1#13284, col2#13285]
+- Filter (cast(col2#13285 as int) > 0)
   +- SubqueryAlias `vwtestdatatype`
      +- Project [col1#13284, col2#13285]
         +- LocalRelation [col1#13284, col2#13285]

== Optimized Logical Plan ==
LocalRelation [col1#13284, col2#13285]

== Physical Plan ==
LocalTableScan [col1#13284, col2#13285]

 

–> To avoid this issue, explicitly cast the column and the value to the exact datatype you need. Here we should convert the String column and the value to Double; this way the query returns all 7 rows as expected:

select * from vwTestDataType where double(col2) > double(0);
-- OR
select * from vwTestDataType where col2 > 0.0;
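
The difference between the two casts can be emulated in plain Python, without a Spark cluster. This is only a sketch under an assumption: `as_int` imitates the truncating string-to-int cast by using `int(float(...))`, which matches the behaviour described above but is not Spark's own code. The helper names are hypothetical.

```python
rows = [
    ("row1", "2.0"), ("row2", "1.5"), ("row3", "1.0"),
    ("row4", "0.8"), ("row5", "0.6"), ("row6", "0.4"),
    ("row7", "0.2"), ("row8", "0.0"), ("row9", None),
]


def as_int(value):
    """Emulate a truncating string-to-int cast (assumption, see lead-in)."""
    return None if value is None else int(float(value))


def as_double(value):
    """Emulate casting the string to double."""
    return None if value is None else float(value)


def filter_gt_zero(cast):
    """Keep row names where cast(col2) > 0; a NULL comparison is never true."""
    kept = []
    for name, value in rows:
        converted = cast(value)
        if converted is not None and converted > 0:
            kept.append(name)
    return kept


print(filter_gt_zero(as_int))     # rows surviving the int comparison
print(filter_gt_zero(as_double))  # rows surviving the double comparison
```

With the int cast only row1 to row3 survive, because 0.8 down to 0.2 all truncate to 0. With the double cast row1 to row7 survive.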

Let’s check the Logical Plan of the modified query again, using the EXPLAIN EXTENDED operator:

explain extended select * from vwTestDataType where double(col2) > double(0);
-- OR
explain extended select * from vwTestDataType where col2 > 0.0;

Plan

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('double('col2) > 'double(0))
   +- 'UnresolvedRelation `vwTestDataType`

== Analyzed Logical Plan ==
col1: string, col2: string
Project [col1#13213, col2#13214]
+- Filter (cast(col2#13214 as double) > cast(0 as double))
   +- SubqueryAlias `vwtestdatatype`
      +- Project [col1#13213, col2#13214]
         +- LocalRelation [col1#13213, col2#13214]

== Optimized Logical Plan ==
LocalRelation [col1#13213, col2#13214]

== Physical Plan ==
LocalTableScan [col1#13213, col2#13214]

 

So while working with Spark SQL, make sure there are no such datatype conflicts. Better still, handle these issues right at the beginning, by modelling the tables with the correct datatypes.


SQL Error – “SELECT TOP 100” throws error in SparkSQL – what’s the correct syntax?

January 23, 2020

 
In SQL Server, to get the top-n rows from a table or dataset you just use the “SELECT TOP” clause and specify the number of rows you want returned, as in the query below.

But when I tried the same query in Spark SQL I got a syntax error, which means the TOP clause is not supported in the SELECT statement.

%sql
Select TOP 100 * from SalesOrder

Error in SQL statement: ParseException:
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '100' expecting (line 1, pos 11)

== SQL ==
Select top 100 * from SalesOrder
-----------^^^

 

As Spark SQL does not support the TOP clause, I tried the MySQL syntax instead, which is the “LIMIT” clause.

So I removed “TOP 100” from the SELECT query and added a “LIMIT 100” clause at the end. It worked and gave the expected results!

%sql
Select * from SalesOrder LIMIT 100