Archive for the ‘Microsoft Azure’ Category

Spark/Scala: Convert or flatten JSON with nested Struct/Array data into columns (Question)

January 9, 2019

 
The following JSON contains some attributes at the root level, like ProductNum and unitCount.
It also contains a nested attribute named “Properties”, which holds an array of key-value pairs.

What I want is to expand this JSON so that all the attributes become columns, with an additional column for every key in the nested array, as shown in the “Expected Output” section below:

{
   "ProductNum":"6000078",
   "Properties":[
      {
         "key":"invoice_id",
         "value":"923659"
      },
      {
         "key":"job_id",
         "value":"296160"
      },
      {
         "key":"sku_id",
         "value":"312002"
      }
   ],
   "unitCount":"3"
}

 

Expected output, as described above:

+-------------------------------------------------------+   
| ProductNum | invoice_id | job_id | sku_id | unitCount |  
+-------------------------------------------------------+   
| 6000078    | 923659     | 296160 | 312002 | 3         |  
+-------------------------------------------------------+

 

Solution:

import org.apache.spark.sql.functions.{explode, first}
import spark.implicits._

// Build a single-row Dataset[String] holding the raw JSON
val DS_Products = spark.createDataset("""{
   "ProductNum":"6000078",
   "Properties":[
      {
         "key":"invoice_id",
         "value":"923659"
      },
      {
         "key":"job_id",
         "value":"296160"
      },
      {
         "key":"sku_id",
         "value":"312002"
      }
   ],
   "UnitCount":"3"
}""" :: Nil)

// Parse the JSON strings into a DataFrame (Spark infers the nested schema)
val DF_Products = spark.read.json(DS_Products)

// Explode the key-value array into one row per pair, then drop the original array column
val df_flatten = DF_Products
  .select($"*", explode($"Properties") as "SubContent")
  .drop($"Properties")

df_flatten.show()

// Pivot the keys into columns, taking the first value seen for each key
val df_flatten_pivot = df_flatten
  .groupBy($"ProductNum", $"unitCount")
  .pivot("SubContent.key")
  .agg(first("SubContent.value"))

df_flatten_pivot.show()

Output:

+----------+---------+--------------------+
|ProductNum|unitCount|          SubContent|
+----------+---------+--------------------+
|   6000078|        3|[invoice_id, 923659]|
|   6000078|        3|    [job_id, 296160]|
|   6000078|        3|    [sku_id, 312002]|
+----------+---------+--------------------+

+----------+---------+----------+------+------+
|ProductNum|unitCount|invoice_id|job_id|sku_id|
+----------+---------+----------+------+------+
|   6000078|        3|    923659|296160|312002|
+----------+---------+----------+------+------+
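
As an aside, on Spark 2.4+ you can skip the explode/pivot round trip by converting the key-value array into a map with map_from_entries and selecting the keys directly. A minimal sketch, assuming the key names (invoice_id, job_id, sku_id) are known up front:

import org.apache.spark.sql.functions.map_from_entries

// Turn the array of {key, value} structs into a single map column,
// then pull each known key out as its own column
val df_wide = DF_Products
  .withColumn("Props", map_from_entries($"Properties"))
  .select(
    $"ProductNum",
    $"Props".getItem("invoice_id") as "invoice_id",
    $"Props".getItem("job_id") as "job_id",
    $"Props".getItem("sku_id") as "sku_id",
    $"unitCount")

df_wide.show()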

 


PowerShell – Restart an Azure VM and log off users remotely

October 21, 2018

 

1. Open the Run dialog by pressing Windows + R, type powershell, and hit Enter.
 

2. In the PowerShell window, first connect to the Azure VM that you want to restart remotely:

PS C:\Users\manojp> Enter-PSSession -ComputerName MyAzureVMName

[MyAzureVMName]: PS C:\Users\manojp\Documents>

 

3. Now try issuing the Restart command:

[MyAzureVMName]: PS C:\Users\manojp\Documents> Restart-Computer

Restart-Computer : Failed to restart the computer MyAzureVMName with the following error message: The system shutdown
cannot be initiated because there are other users logged on to the computer.
+ CategoryInfo : OperationStopped: (MyAzureVMName:String) [Restart-Computer], InvalidOperationException
+ FullyQualifiedErrorId : RestartcomputerFailed,Microsoft.PowerShell.Commands.RestartComputerCommand

So the restart fails because other users are still logged on to the VM. (Restart-Computer -Force would push the restart through anyway, but let’s log the users off cleanly first.)

 

4. Let’s check who is logged on to this VM:

[MyAzureVMName]: PS C:\Users\manojp\Documents> quser

USERNAME     SESSIONNAME    ID    STATE     IDLE TIME    LOGON TIME
charlesl     rdp-tcp#0       2    Active         1:07    12/21/2018 08:26 AM

 

5. Let’s log that user off by specifying the session ID, which is “2”:

[MyAzureVMName]: PS C:\Users\manojp\Documents> logoff 2

 

6. Let’s verify the user was logged off and nobody else remains:

[MyAzureVMName]: PS C:\Users\manojp\Documents> quser

quser : No User exists for *
+ CategoryInfo : NotSpecified: (No User exists for *:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError

 

7. Let’s finally restart the VM:

[MyAzureVMName]: PS C:\Users\manojp\Documents> Restart-Computer

PS>
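
For repeat use, the whole sequence can be collapsed into a single remote call. Here is a rough sketch that logs off every session quser reports and then forces the restart; the column parsing assumes active sessions with a session name, like the quser output above, so treat it as a starting point rather than a hardened script:

# Log off all user sessions on the VM, then restart it
Invoke-Command -ComputerName MyAzureVMName -ScriptBlock {
    # Skip the quser header row; the third column of each row is the session ID
    quser 2>$null | Select-Object -Skip 1 | ForEach-Object {
        $id = ($_.Trim() -split '\s+')[2]
        logoff $id
    }
    Restart-Computer -Force
}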


SQL Server on Linux – Best practices post installation

August 30, 2018

 

Set up SQL Server on Linux:

1. Spinning up a new Linux VM on Microsoft Azure

2. Install and Configure SQL Server 2017 on Linux Azure VM

3. Connect SQL Server on Linux with SSMS from a Windows machine
 

Best Practices:

Here are some of the best practices post installing SQL Server on Linux that can help you maximize database performance:

1. To maintain efficient Linux and SQL Server scheduling behavior, it’s recommended to use the ALTER SERVER CONFIGURATION command to set PROCESS AFFINITY for all NUMA nodes and/or CPUs. [Setting Process Affinity]
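
For example, on a machine with two NUMA nodes (the node range here is an assumption; match it to your hardware):

ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY NUMANODE = 0 TO 1;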

2. To reduce the risk of tempdb concurrency slowdowns in high-performance environments, configure multiple tempdb data files using the ADD FILE command. [tempdb Contention]
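
For instance, to add one extra tempdb data file (the file name, path, and sizes here are illustrative; a common rule of thumb is one data file per CPU core, up to 8):

ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2,
          FILENAME = '/var/opt/mssql/data/tempdb2.ndf',
          SIZE = 8MB, FILEGROWTH = 64MB);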

3. Use mssql-conf to configure the memory limit and ensure there’s enough free physical memory for the Linux operating system.

# The memory limit is specified in MB
sudo /opt/mssql/bin/mssql-conf set memory.memorylimitmb 1024
sudo systemctl restart mssql-server

4. On multi-node Non-Uniform Memory Access (NUMA) installations, auto NUMA balancing needs to be disabled to allow SQL Server to operate at maximum efficiency on a NUMA system.

sysctl -w kernel.numa_balancing=0

5. Also increase the kernel’s vm.max_map_count setting (the maximum number of memory map areas a process may have) to 262144, as the default of 65536 may be insufficient for a SQL Server installation.

sysctl -w vm.max_map_count=262144
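
Note that sysctl -w changes are lost on reboot. To make the two settings above permanent, add them to /etc/sysctl.conf as well, then reload with sudo sysctl -p:

# /etc/sysctl.conf
kernel.numa_balancing=0
vm.max_map_count=262144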

6. Use the noatime attribute to disable last accessed timestamps with any file system that is used to store SQL Server data and log files.
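
For example, an /etc/fstab entry for the data volume might look like this (the UUID and mount point are placeholders for your own volume):

# /etc/fstab
UUID=<your-uuid>  /var/opt/mssql  ext4  defaults,noatime  0  0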

7. For the most consistent performance experience, it is recommended to leave the Transparent Huge Pages (THP) option enabled.

8. Virtual machine (VM) features like Hyper-V Dynamic Memory shouldn’t be used with SQL Server installations. When using VMs, be sure to assign sufficient fixed-memory sizes.

9. Make sure you have a properly configured swap file to avoid out-of-memory issues.
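
A minimal sketch for creating and enabling a swap file (the 4G size is an assumption; size it to your workload):

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Persist the swap file across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab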


Azure Databricks learning resources (documentation and videos)

August 7, 2018

 

Databricks Introduction

What is Azure Databricks [Video]

Create Databricks workspace with Apache Spark cluster

Extract, Transform & Load (ETL) with Databricks

– Documentation:
   – Azure
   – Databricks
 

From Channel 9

1. Data Science using Azure Databricks and Apache Spark [Video]

2. Data ingestion, stream processing and sentiment analysis using Twitter [Video]

3. ETL with Azure Databricks using ADF [Video]

4. ADF new features & integration with Azure Databricks [Video]

5. Azure Databricks introduces R Studio Integration [Video]

6. Run Jars and Python scripts on Azure Databricks using ADF [Video]
 

From Microsoft Build Conf

[Embedded videos]

Azure Databricks (a fully managed Apache Spark offering)

July 28, 2018

 

Databricks Introduction:

Azure Databricks = Best of Databricks + Best of Azure

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform (PaaS).

It is a fast, easy-to-use, and collaborative Apache Spark–based analytics platform. Designed in collaboration with the creators of Apache Spark, it combines the best of Databricks and Azure to help you accelerate innovation with one-click set up, streamlined workflows, and an interactive workspace that enables collaboration among data scientists, data engineers, and business analysts. Because it’s an Azure service, you benefit from native integrations with other Azure services such as Power BI, SQL Data Warehouse, and Cosmos DB. You also get enterprise-grade Azure security, including Active Directory integration, compliance, and enterprise-grade SLAs.
 


 
–> With Databricks you can:
– Launch your new Spark environment with a single click.
– Integrate effortlessly with a wide variety of data stores.
– Use Databricks Notebooks to unify your processes and instantly deploy to production.
– Improve and scale your analytics with a high-performance processing engine optimized for the comprehensive, trusted Azure platform.
 

Learning Resources:

– Webinar recording on Azure Databricks
– My next blog in this series