Archive

Posts Tagged ‘Scala’

Spark error – Parquet does not support decimal. See HIVE-6384

August 5, 2020 1 comment

 
I was creating a Hive table in Databricks Notebook from a Parquet file located in Azure Data Lake store by following command:

val df = spark.read.parquet(
 "abfss://adlsstore@MyStorageAccount.dfs.core.windows.net/x/y/z/*.parquet")

df.write.mode("overwrite").saveAsTable("tblOrderDetail")

But I was getting following error:

warning: there was one feature warning; re-run with -feature for details
java.lang.UnsupportedOperationException: Parquet does not support decimal. See HIVE-6384

 
As per the above error it relates to some Hive version conflict, so I tried checking the Hive version by running below command and found that it is pointing to an old version (0.13.0). This version of Hive metastore did not support the BINARY datatypes for parquet formatted files.

spark.conf.get("spark.sql.hive.metastore.version")


 

Also as per this Jira Task on HIVE-6384 the support for multiple datatypes was implemented for Parquet SerDe in Hive 1.2.0 version.

 
So to update the Hive metastore to the current version you just need to add below commands in the configuration of the cluster you are using.

Click on “Clusters” –> click “Edit” on the top –> expand “Advanced Options” –> under “Spark” tab and “Spark Config” box add the below two commands:

spark.sql.hive.metastore.version 1.2.1
spark.sql.hive.metastore.jars builtin

You just need to restart the cluster so that the new settings are in use.
 


 

Some similar errors:
– Parquet does not support date
– Parquet does not support timestamp


Spark/Scala: Convert or flatten a JSON having Nested data with Struct/Array to columns (Question)

January 9, 2019 Leave a comment

 
The following JSON contains some attributes at root level, like ProductNum and unitCount.
It also contains a Nested attribute with name “Properties”, which contains an array of Key-Value pairs.

Now, what I want is to expand this JSON, and have all the attributes in form of columns, with additional columns for all the Keys in Nested array section, like in the “Expected Output” section below:

{
   "ProductNum":"6000078",
   "Properties":[
      {
         "key":"invoice_id",
         "value":"923659"
      },
      {
         "key":"job_id",
         "value":"296160"
      },
      {
         "key":"sku_id",
         "value":"312002"
      }
   ],
   "unitCount":"3"
}

 

Expected output, as described above:

+-------------------------------------------------------+   
| ProductNum | invoice_id | job_id | sku_id | unitCount |  
+-------------------------------------------------------+   
| 6000078    | 923659     | 296160 | 312002 | 3         |  
+-------------------------------------------------------+

 

Solution:

val DS_Products = spark.createDataset("""{
   "ProductNum":"6000078",
   "Properties":[
      {
         "key":"invoice_id",
         "value":"923659"
      },
      {
         "key":"job_id",
         "value":"296160"
      },
      {
         "key":"sku_id",
         "value":"312002"
      }
   ],
   "UnitCount":"3"
}""" :: Nil)

val DF_Products = spark.read.json(DS_Products)

val df_flatten = DF_Products
  .select($"*", explode($"Properties") as "SubContent")
  .drop($"Properties")

df_flatten.show()

val df_flatten_pivot = df_flatten
  .groupBy($"ProductNum",$"UnitCount")
  .pivot("SubContent.key")
  .agg(first("SubContent.value"))

df_flatten_pivot.show()

Output:

+----------+---------+--------------------+
|ProductNum|UnitCount|          SubContent|
+----------+---------+--------------------+
|   6000078|        3|[invoice_id, 923659]|
|   6000078|        3|    [job_id, 296160]|
|   6000078|        3|    [sku_id, 312002]|
+----------+---------+--------------------+

+----------+---------+----------+------+------+
|ProductNum|UnitCount|invoice_id|job_id|sku_id|
+----------+---------+----------+------+------+
|   6000078|        3|    923659|296160|312002|
+----------+---------+----------+------+------+