Archive
SQL Error – “SELECT TOP 100” throws an error in Spark SQL – what’s the correct syntax?
In SQL Server, to get the top-n rows from a table or dataset, you simply use the “SELECT TOP” clause and specify the number of rows you want returned, as in the query below.
But when I tried the same query in Spark SQL, I got a syntax error, which means the TOP clause is not supported in Spark SQL’s SELECT statement.
%sql
Select TOP 100 * from SalesOrder
Error in SQL statement: ParseException:
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '100' expecting (line 1, pos 11)

== SQL ==
Select top 100 * from SalesOrder
-----------^^^
Since Spark SQL does not support the TOP clause, I tried the MySQL syntax instead, which uses the “LIMIT” clause.
So I removed “TOP 100” from the SELECT query and added a “LIMIT 100” clause at the end; it worked and returned the expected results.
%sql
Select * from SalesOrder LIMIT 100
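The same top-n behavior is also available from the DataFrame API if you are working in Scala rather than SQL. A minimal sketch, assuming the SalesOrder table is registered in the metastore (the OrderDate column in the comment is only an illustrative example, not part of the original question):

// Scala equivalent of "SELECT * FROM SalesOrder LIMIT 100"
val top100 = spark.table("SalesOrder").limit(100)
top100.show()

// For a deterministic top-n (like TOP ... ORDER BY in SQL Server), sort before limiting:
// spark.table("SalesOrder").orderBy(org.apache.spark.sql.functions.col("OrderDate").desc).limit(100)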
Spark/Scala: Convert or flatten a nested JSON with Struct/Array data into columns (Question)
The following JSON contains some attributes at the root level, like ProductNum and unitCount.
It also contains a nested attribute named “Properties”, which holds an array of key-value pairs.
Now, I want to expand this JSON so that all the root-level attributes become columns, with an additional column for each key in the nested array, as shown in the “Expected output” section below:
{ "ProductNum":"6000078", "Properties":[ { "key":"invoice_id", "value":"923659" }, { "key":"job_id", "value":"296160" }, { "key":"sku_id", "value":"312002" } ], "unitCount":"3" }
Expected output, as described above:
+-------------------------------------------------------+
| ProductNum | invoice_id | job_id | sku_id | unitCount |
+-------------------------------------------------------+
| 6000078    | 923659     | 296160 | 312002 | 3         |
+-------------------------------------------------------+
Solution:
import org.apache.spark.sql.functions.{explode, first}
import spark.implicits._

// Build a single-row Dataset[String] from the raw JSON and parse it into a DataFrame
val DS_Products = spark.createDataset("""{ "ProductNum":"6000078", "Properties":[ { "key":"invoice_id", "value":"923659" }, { "key":"job_id", "value":"296160" }, { "key":"sku_id", "value":"312002" } ], "UnitCount":"3" }""" :: Nil)
val DF_Products = spark.read.json(DS_Products)

// Explode the Properties array so each key/value pair becomes its own row
val df_flatten = DF_Products
  .select($"*", explode($"Properties") as "SubContent")
  .drop($"Properties")
df_flatten.show()

// Pivot the key/value rows back into one column per key
val df_flatten_pivot = df_flatten
  .groupBy($"ProductNum", $"UnitCount")
  .pivot("SubContent.key")
  .agg(first("SubContent.value"))
df_flatten_pivot.show()
Output:
+----------+---------+--------------------+
|ProductNum|UnitCount|          SubContent|
+----------+---------+--------------------+
|   6000078|        3|[invoice_id, 923659]|
|   6000078|        3|    [job_id, 296160]|
|   6000078|        3|    [sku_id, 312002]|
+----------+---------+--------------------+

+----------+---------+----------+------+------+
|ProductNum|UnitCount|invoice_id|job_id|sku_id|
+----------+---------+----------+------+------+
|   6000078|        3|    923659|296160|312002|
+----------+---------+----------+------+------+
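In practice the JSON usually comes from files rather than an inline string; the same explode-and-pivot flow still applies. A minimal sketch, reusing the imports from the solution above and assuming newline-delimited JSON files with the same ProductNum/Properties/UnitCount schema (the file path is hypothetical):

// Read newline-delimited JSON files into a DataFrame (path is a placeholder)
val products = spark.read.json("/mnt/data/products.json")

val flattened = products
  .select($"*", explode($"Properties") as "SubContent")
  .drop($"Properties")
  .groupBy($"ProductNum", $"UnitCount")
  .pivot("SubContent.key")
  .agg(first("SubContent.value"))

flattened.show()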
Azure Databricks learning resources (documentation and videos)
Databricks Introduction
– What is Azure Databricks [Video]
– Create Databricks workspace with Apache Spark cluster
– Extract, Transform & Load (ETL) with Databricks
– Documentation:
  – Azure
  – Databricks
From Channel 9
1. Data Science using Azure Databricks and Apache Spark [Video]
2. Data ingestion, stream processing and sentiment analysis using Twitter [Video]
3. ETL with Azure Databricks using ADF [Video]
4. ADF new features & integration with Azure Databricks [Video]
5. Azure Databricks introduces R Studio Integration [Video]
6. Run Jars and Python scripts on Azure Databricks using ADF [Video]
From the Microsoft Build Conference
Azure Databricks (a fully managed Apache Spark offering)
Databricks Introduction:
Azure Databricks = Best of Databricks + Best of Azure
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform (PaaS).
It is a fast, easy-to-use, and collaborative Apache Spark–based analytics platform. Designed in collaboration with the creators of Apache Spark, it combines the best of Databricks and Azure to help you accelerate innovation with one-click setup, streamlined workflows, and an interactive workspace that enables collaboration among data scientists, data engineers, and business analysts. Because it’s an Azure service, you benefit from native integrations with other Azure services such as Power BI, SQL Data Warehouse, and Cosmos DB. You also get enterprise-grade Azure security, including Active Directory integration, compliance, and enterprise-grade SLAs.
With Databricks you can:
– Launch your new Spark environment with a single click.
– Integrate effortlessly with a wide variety of data stores.
– Use Databricks Notebooks to unify your processes and instantly deploy to production.
– Improve and scale your analytics with a high-performance processing engine optimized for the comprehensive, trusted Azure platform.
Learning Resources:
– Webinar recording on Azure Databricks