

SQL Tips – Issues with BCP when you have two SQL Server instances

June 9, 2018

 
I observed one thing with BCP (Bulk Copy Program) when you have two versions of SQL Server installed on your PC or server. I had SQL Server 2014 & 2016 installed on one of my DEV servers.
 

So even though I was executing the query from the SQL 2016 instance, BCP was inserting records into the SQL 2014 instance:

exec master..xp_cmdshell 'BCP AdventureWorks2014.Person.Address2 IN d:\PersonAddressByQuery.txt -T -c'

 

Even when using the SQL 2016 version of BCP.exe explicitly, it was still inserting into the SQL 2014 instance:

exec master..xp_cmdshell 'C:\"Program Files"\"Microsoft SQL Server"\"Client SDK"\ODBC\130\Tools\Binn\BCP.exe AdventureWorks2014.Person.Address2 IN d:\PersonAddressByQuery.txt -T -c'

 

On digging in a bit I came to know that by default the BCP command connects to the default instance of SQL Server, no matter which instance you execute the query from. MSDN BoL

So you have to provide the -S parameter to specify the target server and instance; if it is not specified, BCP connects to the default instance.
 

So your BCP command should go with the "-S" option, like this:

exec master..xp_cmdshell 'C:\"Program Files"\"Microsoft SQL Server"\"Client SDK"\ODBC\130\Tools\Binn\BCP.exe AdventureWorks2014.Person.Address2 IN d:\PersonAddressByQuery.txt -T -c -S DBServerName\InstanceName'
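
To confirm which instance BCP actually connected to, here is a quick hedged sketch (reusing the same placeholder server and instance names as above): export @@SERVERNAME through BCP's queryout mode and check the output file:

exec master..xp_cmdshell 'BCP "SELECT @@SERVERNAME" queryout d:\WhoAmI.txt -T -c -S DBServerName\InstanceName'

The file d:\WhoAmI.txt should contain the server\instance name you passed with -S.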


Categories: SQL Tips

SQL Job creation failing (having VBScript step) after SQL Server 2016/2017 upgrade

May 21, 2018

 
Yesterday my friend pinged me and told me he was facing some issues while executing a SQL Job DDL script. They had upgraded their SQL Server from 2008 to 2016, and while creating SQL Jobs they were getting the error below:

Msg 14234, Level 16, State 1, Procedure sp_verify_subsystem, Line 28 [Batch Start Line 2]
The specified '@subsystem' is invalid (valid values are returned by sp_enum_sqlagent_subsystems).

 
I checked the code and saw that one of the SQL Job steps was configured to run a VBScript, as shown below:

...
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep 
@job_id=@jobId, 
@step_name=N'xyz VBScript', 
@step_id=3, 
@cmdexec_success_code=0, 
@on_success_action=4, 
@on_success_step_id=0, 
@on_fail_action=5, 
@on_fail_step_id=0, 
@retry_attempts=0, 
@retry_interval=0, 
@os_run_priority=0, 
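-- the 'ActiveScripting' subsystem below was removed in SQL Server 2016, which is what raises Msg 14234: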
@subsystem=N'ActiveScripting', 
...

A quick check on the net revealed that this feature has been discontinued and should not be used.

** Important ** This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature.
MS BoL link

 

This MS BoL link also mentions the discontinued feature:

ActiveX subsystem is discontinued. Use command line or PowerShell scripts instead.
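
So the fix is to re-create that job step on a supported subsystem. Below is a minimal sketch of the same step moved to the PowerShell subsystem (the step name and @command body are placeholders; the actual VBScript logic has to be ported to PowerShell):

...
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep 
@job_id=@jobId, 
@step_name=N'xyz PowerShell', 
@step_id=3, 
@subsystem=N'PowerShell', -- supported replacement for N'ActiveScripting'
@command=N'# ported VBScript logic goes here', 
...

Alternatively, @subsystem=N'CmdExec' works if the logic is ported to a command-line script instead.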


Prepare for Certification Exam 70-775: Perform Data Engineering on Microsoft Azure HDInsight

April 10, 2018

 
In my [previous post] I collated some basic stuff about HDInsight to help you get started. You can also check [Microsoft Docs] for HDInsight to know more and dive deeper into the Big-Data platform.
 

Preparing for a Microsoft Certification Exam is a good and easy way to understand a technology. You can find details about the Exam 70-775 certification on the Microsoft Certification page.

Though the web page provides most of the details of what will be asked in the exam, it lacks study material for each module and the topics under it. So in this post I've tried to find and provide study material links for each of the topics covered in these modules:
 

The exam is divided into 4 Modules:

1. Administer and Provision HDInsight Clusters
2. Implement Big Data Batch Processing Solutions
3. Implement Big Data Interactive Processing Solutions
4. Implement Big Data Real-Time Processing Solutions

 

Module #1. Administer and Provision HDInsight Clusters

1. Deploy HDInsight clusters
    – Create an HDInsight cluster [Portal] [ARM Template] [PowerShell] [.NET SDK] [CLI]
    – Create HDInsight clusters with Hadoop, Spark, Kafka, etc [Link]
    – Select an appropriate cluster type based on workload considerations [Link]
    – Create a cluster in a private virtual network [Link]
    – Create a domain-joined cluster [Link]
    – Create a cluster that has a custom metastore [link]
    – Manage managed disks [with Apache Kafka]
    – Configure vNet peering [Link]

2. Deploy and secure multi-user HDInsight clusters
    – Provision users who have different roles
    – Manage users, groups & permissions [Ambari] [PowerShell] [Apache Ranger]
    – Configure Kerberos [Link]
    – Configure service accounts
    – Implement SSH [Connecting] [Tunneling]
    – Restrict access to data [Link]

3. Ingest data for batch and interactive processing
    – Ingest data from cloud or on-premises data; store data in Azure Data Lake
    – Store data in Azure Blob Storage
    – Perform routine small writes on a continuous basis using Azure CLI tools
    – Ingest data in Apache Hive and Apache Spark by using Apache Sqoop, Azure Data Factory (ADF), AzCopy, and AdlCopy
    – Ingest data from an on-premises Hadoop cluster

4. Configure HDInsight clusters
    – Manage metastore upgrades
    – View and edit Ambari configuration groups
    – View and change service configurations through Ambari
    – Access logs written to Azure Table storage
    – Enable heap dumps for Hadoop services
    – Manage HDInsight configuration, use HDInsight .NET SDK, and PowerShell
    – Perform cluster-level debugging
    – Stop and start services through Ambari
    – Manage Ambari alerts and metrics

5. Manage and debug HDInsight jobs
    – Describe YARN architecture and operation
    – Examine YARN jobs through ResourceManager UI and review running applications
    – Use YARN CLI to kill jobs
    – Find logs for different types of jobs
    – Debug Hadoop and Spark jobs
    – Use Azure Operations Management Suite (OMS) to monitor and manage alerts, and perform predictive actions
 

Module #2. Implement Big Data Batch Processing Solutions

1. Implement batch solutions with Hive and Apache Pig
    – Define external Hive tables; load data into a Hive table (see the HiveQL sketch after this module's topic list)
    – Use partitioning and bucketing to improve Hive performance
    – Use semi-structured files such as XML and JSON with Hive
    – Join tables with Hive using shuffle joins and broadcast joins
    – Invoke Hive UDFs with Java and Python; design scripts with Pig
    – Identify query bottlenecks using the Hive query graph
    – Identify the appropriate storage format, such as Apache Parquet, ORC, Text, and JSON

2. Design batch ETL solutions for big data with Spark
    – Share resources between Spark applications using YARN queues and preemption
    – Select Spark executor and driver settings for optimal performance, use partitioning and bucketing to improve Spark performance
    – Connect to external Spark data sources
    – Incorporate custom Python and Scala code in a Spark DataSets program
    – Identify query bottlenecks using the Spark SQL query graph

3. Operationalize Hadoop and Spark
    – Create and customize a cluster by using ADF
    – Attach storage to a cluster and run an ADF activity
    – Choose between bring-your-own and on-demand clusters
    – Use Apache Oozie with HDInsight
    – Choose between Oozie and ADF
    – Share metastore and storage accounts between a Hive cluster and a Spark cluster to enable the same table across the cluster types
    – Select an appropriate storage type for a data pipeline, such as Blob storage, Azure Data Lake, and local Hadoop Distributed File System (HDFS)
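
To ground the Hive topics in item #1 above, here is a minimal HiveQL sketch (the table name, columns, and Blob storage path are illustrative assumptions): it defines a partitioned external table over Azure Blob storage, registers a partition, and queries it:

-- hypothetical external table over raw CSV files already in Blob storage
CREATE EXTERNAL TABLE sales_raw (
    order_id INT,
    amount DOUBLE
)
PARTITIONED BY (order_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://mycontainer@mystorageaccount.blob.core.windows.net/sales/';

-- register one partition, then aggregate over it
ALTER TABLE sales_raw ADD PARTITION (order_date='2018-04-01');
SELECT order_date, SUM(amount) FROM sales_raw GROUP BY order_date;

Re-storing the curated output as ORC or Parquet (the last topic in item #1) is the usual next step for query performance.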
 

Module #3. Implement Big Data Interactive Processing Solutions

1. Implement interactive queries for big data with Spark SQL
    – Execute queries using Spark SQL
    – Cache Spark DataFrames for iterative queries (see the Spark SQL sketch after this module's topic list)
    – Save Spark DataFrames as Parquet files
    – Connect BI tools to Spark clusters
    – Optimize join types such as broadcast versus merge joins
    – Manage Spark Thrift server and change the YARN resources allocation
    – Identify use cases for different storage types for interactive queries

2. Perform exploratory data analysis by using Spark SQL
    – Use Jupyter and Apache Zeppelin for visualization and developing tidy Spark DataFrames for modeling
    – Use Spark SQL’s two-table joins to merge DataFrames and cache results
    – Save tidied Spark DataFrames to performant format for reading and analysis (Apache Parquet)
    – Manage interactive Livy sessions and their resources

3. Implement interactive queries for big data with Interactive Hive
    – Enable Hive LLAP through Hive settings
    – Manage and configure memory allocation for Hive LLAP jobs
    – Connect BI tools to Interactive Hive clusters

4. Perform exploratory data analysis by using Hive
    – Perform interactive querying and visualization
    – Use Ambari Views
    – Use HiveQL
    – Parse CSV files with Hive
    – Use ORC versus Text for caching
    – Use internal and external tables in Hive
    – Use Zeppelin to visualize data

5. Perform interactive processing by using Apache Phoenix on HBase
    – Use Phoenix in HDInsight
    – Use Phoenix Grammar for queries
    – Configure transactions, user-defined functions, and secondary indexes
    – Identify and optimize Phoenix performance
    – Select between Hive, Spark, and Phoenix on HBase for interactive processing
    – Identify when to share metastore between a Hive cluster and a Spark cluster
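
To ground the caching and Parquet topics in item #1 above, here is a minimal Spark SQL sketch (the table names are illustrative assumptions), runnable from a Jupyter/Zeppelin notebook or via the Spark Thrift server:

-- cache a table in memory for iterative, interactive queries
CACHE TABLE sales_raw;

-- materialize the result as Parquet for fast repeated reads
CREATE TABLE sales_curated
USING PARQUET
AS SELECT order_date, SUM(amount) AS total_amount
FROM sales_raw
GROUP BY order_date;

SELECT * FROM sales_curated ORDER BY total_amount DESC LIMIT 10;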
 

Module #4. Implement Big Data Real-Time Processing Solutions

1. Create Spark streaming applications using DStream API
    – Define DStreams and compare them to Resilient Distributed Datasets (RDDs)
    – Start and stop streaming applications
    – Transform DStream (flatMap, reduceByKey, UpdateStateByKey)
    – Persist long-term data stores in HBase and SQL
    – Persist Long Term Data Azure Data Lake and Azure Blob Storage
    – Stream data from Apache Kafka or Event Hub
    – Visualize streaming data in a PowerBI real-time dashboard

2. Create Spark structured streaming applications
    – Use DataFrames and DataSets APIs to create streaming DataFrames and Datasets
    – Create Window Operations on Event Time
    – Define Window Transformations for Stateful and Stateless Operations
    – Stream Window Functions, Reduce by Key, and Window to Summarize Streaming Data
    – Persist Long Term Data HBase and SQL
    – Persist Long Term Data Azure Data Lake and Azure Blob Storage
    – Stream data from Kafka or Event Hub
    – Visualize streaming data in a PowerBI real-time dashboard

3. Develop big data real-time processing solutions with Apache Storm
    – Create Storm clusters for real-time jobs
    – Persist Long Term Data HBase and SQL
    – Persist Long Term Data Azure Data Lake and Azure Blob Storage
    – Stream data from Kafka or Event Hub
    – Configure event windows in Storm
    – Visualize streaming data in a PowerBI real-time dashboard
    – Define Storm topologies and describe Storm Computation Graph Architecture
    – Create Storm streams and conduct streaming joins
    – Run Storm topologies in local mode for testing
    – Configure Storm applications (Workers, Debug mode)
    – Conduct Stream groupings to broadcast tuples across components
    – Debug and monitor Storm jobs

4. Build solutions that use Kafka
    – Create Spark and Storm clusters in the virtual network
    – Manage partitions
    – Configure MirrorMaker
    – Start and stop services through Ambari
    – Manage topics

5. Build solutions that use HBase
    – Identify HBase use cases in HDInsight
    – Use HBase Shell to create, update, and drop HBase tables
    – Monitor an HBase cluster
    – Optimize the performance of an HBase cluster
    – Identify use cases for using Phoenix for analytics of real-time data (see the Phoenix sketch after this list)
    – Implement replication in HBase
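
To ground the Phoenix topic referenced just above, here is a minimal sketch of Phoenix's SQL grammar over HBase (the table, columns, and index name are illustrative assumptions):

-- hypothetical event table, row key = (sensor_id, event_time)
CREATE TABLE sensor_events (
    sensor_id VARCHAR NOT NULL,
    event_time TIMESTAMP NOT NULL,
    temperature DOUBLE
    CONSTRAINT pk PRIMARY KEY (sensor_id, event_time)
);

-- secondary index to serve queries that filter on temperature
CREATE INDEX idx_temperature ON sensor_events (temperature);

-- Phoenix writes are UPSERTs (insert-or-update)
UPSERT INTO sensor_events VALUES ('sensor-1', CURRENT_TIME(), 21.5);
SELECT sensor_id, MAX(temperature) FROM sensor_events GROUP BY sensor_id;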


An introduction to Azure HDInsight – Microsoft’s Big-Data/Hadoop solution on Azure

March 3, 2018

 
The Microsoft Azure portal has all the details on HDInsight, and it is vast. In this post I've simply curated the main and important stuff for myself and others to get started with HDInsight.
 

Azure HDInsight is a standard Apache Hadoop distribution offered as a managed service on Microsoft Azure. It is based on the Hortonworks Data Platform (HDP) and provisioned as clusters on Azure. The clusters can be created on your choice of Windows or Linux Servers.
 

What HDInsight offers:

1. Provides an end-to-end SLA on all your production workloads.
2. Enables you to scale workloads up or down anytime, and pay only for what you use.
3. Protects and secures your data to meet government compliance requirements.
4. Provides Log Analytics to monitor your clusters.
5. Global availability in multiple regions.
6. Provides various productivity tools for development.
 

HDInsight enables a broad range of scenarios, such as processing & analyzing Big Data, batch processing, in-memory ETL, data warehousing, machine learning, and IoT, by using a broad spectrum of open-source frameworks like Hadoop, Spark, Kafka, HBase, Hive, Storm, and R Server.


 

HDInsight Cluster Types:

1. Hadoop: A simple Map-Reduce programming model to process and analyze batch data in parallel. [Apache Hadoop]

2. Spark: An open-source, parallel-processing framework that supports in-memory processing to boost the performance of big-data analysis applications. [Apache Spark]

3. HBase: A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data. [Apache HBase]

4. R Server: A server for hosting and managing parallel, distributed R processes. It provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight.

5. Storm: A distributed, real-time computation system for processing large streams of data fast. [Apache Storm]

6. Hive / Interactive Query (LLAP, "Live Long and Process"): In-memory caching for interactive and faster Hive queries. [Apache Hive]

7. Kafka: An open-source platform that’s used for building streaming data pipelines and applications. Kafka also provides message-queue functionality that allows you to publish and subscribe to data streams. [Apache Kafka]
 

Other Components available with HDInsight:

Ambari, Avro, HCatalog, Mahout, MapReduce, YARN, Phoenix, Pig, Sqoop, Tez, Oozie, and ZooKeeper.
 

Storage options in HDInsight:

1. Azure Blob Store

2. Azure Data Lake Store

Azure Data Lake Store vs Azure Blob Storage
 

Role Based Security:

1. Owner – Lets you manage everything
2. Contributor – Lets you manage everything except access to resources
3. Reader – Lets you view everything, but not make changes
4. User Access Administrator – Lets you manage user access to Azure resources

 

HDInsight security:

[Overview and more details on Microsoft Docs]


 


Download SQL Server 2017 for free (with full MSBI stack)

March 1, 2018

 
With SQL Server 2014, Microsoft made its SQL Server Developer Edition free for development and test use in non-production environments. This edition is not meant for production environments or for use with production data.

SQL Server 2014 Dev Ed free

With the SQL Server 2017 Developer edition, developers can build any kind of application on top of SQL Server. It includes all the functionality of the Enterprise edition, but is licensed for use as a development and test system, not as a production server.

So, with this free edition you get the Database Engine as well as the full MSBI stack with DW/BI capabilities (i.e. SSIS/SSAS/SSRS) for free 🙂
 

Downloads here:

SQL Server 2017 Developer Edition

SQL Server Management Studio (SSMS, latest version)

– Sample databases for SQL Server [AdventureWorks] [Wide World Importers]

SQL Operations Studio