
All about Blockchain – Introduction

June 25, 2018

 
Blockchain is a disruptive technology that is going to transform many industries in the near future. Bitcoin is one of its implementations. Let’s check what Blockchain is all about.
 

What is Blockchain?

A Blockchain is an immutable, secure, digital, distributed ledger without a central authority, which uses public/private-key signature technology to validate and record transactions in near real time. It is essentially a data structure that works as a decentralized database or a distributed ledger, storing a registry of assets and transactions over a peer-to-peer network of computers.

Blockchain’s distributed digital ledger contains cryptographically signed transactions that are grouped into blocks. Each transaction recorded in the database is digitally signed and mathematically guaranteed to be authentic and resistant to fraud. Each block is cryptographically linked to the previous one after validation and a Proof of Work or consensus decision. As new blocks are added, older blocks become harder and harder to modify, and thus effectively immutable. New blocks are replicated across all copies of the ledger on the network’s nodes, and any conflicts are resolved automatically using established rules.

 

Blockchain Internals:

1. Blockchain uses a Distributed Ledger to track transactions

2. A Ledger is an immutable (append-only) database, most commonly used in accounting

3. The same copy of the data is distributed across all participating Nodes (decentralized)

4. All new transactions are securely encrypted and then broadcast across the Blockchain network to be added to the system

5. Participants in the Blockchain verify that a transaction is valid and then write it to the Ledger

6. Transactions are grouped together in Blocks; Blocks are linked to previous Blocks, which forms the blockchain

7. The transaction chain tracks how ownership changes, while the blockchain tracks the order of transactions
 

Example: How does Blockchain work?

1. A person X transfers $100 to person Y. Both X and Y have their account numbers and corresponding private keys.

2. A transaction record is created, containing the transaction details and digital signatures from both parties.

3. The transaction is broadcast to the network for verification, where the various computer nodes check that the transaction details are valid.

4. On successful validation, the transaction is accepted into the network and added to a Block. Each block contains a unique code called a hash; it also contains the hash of the previous block in the chain.

5. The verified Block is added to the Blockchain. The hash codes connect the blocks together in a specific order.

… However, it is not as simple as it sounds above. A lot of computation and verification goes into determining which transactions are valid and adding them to the Blockchain.
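
To make the hash-linking concrete, here is a minimal Python sketch of a hash-chained ledger. It is a toy model, not any real blockchain implementation: the block fields, the simple list-based chain, and the plain-dictionary transactions are all illustrative, and real networks add signatures, Proof of Work and consensus on top.

    import hashlib
    import json
    import time

    def compute_hash(block):
        # Hash the block's contents deterministically (sorted keys).
        payload = json.dumps(block, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def new_block(transactions, previous_hash):
        # Each block stores the hash of the previous block, forming the chain.
        block = {
            "timestamp": time.time(),
            "transactions": transactions,
            "previous_hash": previous_hash,
        }
        block["hash"] = compute_hash(block)  # computed over the fields above
        return block

    # Genesis block, then a block recording the $100 transfer from X to Y.
    chain = [new_block([], previous_hash="0" * 64)]
    chain.append(new_block([{"from": "X", "to": "Y", "amount": 100}],
                           previous_hash=chain[-1]["hash"]))

    # Tampering with an old block changes its hash and breaks every later link.
    assert compute_hash({k: v for k, v in chain[0].items() if k != "hash"}) \
           == chain[1]["previous_hash"]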
 

Blockchain vs Traditional Ledgers:

– Traditional ledgers are centralized, and thus require a third-party authority and middlemen to authenticate, approve and record transactions.

– Blockchain, in contrast, safely distributes the ledger across the entire network and does not require any middlemen.

Blockchain benefits:

– Eliminates Intermediaries: Allows industries to redefine or create new business models.

– Reduces Fraud: Highly secure and transparent, making it nearly impossible to change historical records.

– Increases Efficiency and Speed: Simplifies transactions and enables T+Zero settlement time.

– Increases Revenue and Savings: Potential savings and new revenue opportunities through more efficient processes and reduced costs.
 

Blockchain Use Cases:

1. Ownership
    – Land registries
    – Property titles
    – Other physical assets

2. Identities
    – Blockchain e-identities to citizens
    – Use services like voting
    – Healthcare records

3. Verification
    – Licenses
    – Proofs of records (degrees, grades, etc)
    – Transactions
    – Processes or events

4. Movement of assets
    – Transferring money from one person/entity to another.
    – Enabling direct payments, once a work condition has been performed.
 

Blockchain implementations:

1. Bitcoin (BTC) serves as the cryptographically secured unit of value, numeraire (standard for currency exchange) and currency of the Bitcoin protocol, and is used as a cryptocurrency.

2. Ethereum (ETH) serves as the cryptographically secured unit of value, numeraire and hybrid fuel/currency of the Ethereum protocol.

3. Others include Ripple, Litecoin, etc.
 

Blockchain Network Types:

1. Public:
– Many unknown participants, e.g. banks, traders, financial firms
– Writes by all participants
– Reads by all participants
– Consensus by Proof of Work

2. Private:
– Known participants from one organization, e.g. a bank
– Write permission centralized
– Reads may be public or restricted
– Multiple algorithms for consensus

3. Consortium:
– Known participants from multiple organizations
– Writes require consensus from several participants
– Reads may be public or restricted
– Multiple algorithms for consensus
 

Microsoft Blockchain offering:

Microsoft has a Blockchain offering on Microsoft Azure called Ethereum Blockchain as a Service (EBaaS), so enterprise clients and developers can have a single-click, low-cost, ready-made, cloud-based blockchain dev/test/production environment.
 

We will see more on Blockchain in upcoming posts!



Prepare for Certification Exam 70-775: Perform Data Engineering on Microsoft Azure HDInsight

April 10, 2018

 
In my [previous post] I tried to collate some basic stuff about HDInsight to cover the basics and help you get started. You can also check [Microsoft Docs] for HDInsight to learn more and dive deeper into the big-data platform.
 

Preparing for a Microsoft certification exam is a good and easy way to understand a technology. You can find details about the Exam 70-775 certification on the Microsoft Certification page.

Though the web page provides most of the details of what will be asked in the exam, it does not provide study material for each module and the topics under it. So, in this post, I’ve tried to find and provide study-material links for each of the topics covered in these modules:
 

The exam is divided into 4 Modules:

1. Administer and Provision HDInsight Clusters
2. Implement Big Data Batch Processing Solutions
3. Implement Big Data Interactive Processing Solutions
4. Implement Big Data Real-Time Processing Solutions

 

Module #1. Administer and Provision HDInsight Clusters

1. Deploy HDInsight clusters
    – Create an HDInsight cluster [Portal] [ARM Template] [PowerShell] [.net SDK] [CLI]
    – Create HDInsight clusters with Hadoop, Spark, Kafka, etc [Link]
    – Select an appropriate cluster type based on workload considerations [Link]
    – Create a cluster in a private virtual network [Link]
    – Create a domain-joined cluster [Link]
    – Create a cluster that has a custom metastore [link]
    – Manage managed disks [with Apache Kafka]
    – Configure vNet peering [Link]

2. Deploy and secure multi-user HDInsight clusters
    – Provision users who have different roles
    – Manage users, groups & permissions [Ambari] [PowerShell] [Apache Ranger]
    – Configure Kerberos [Link]
    – Configure service accounts
    – Implement SSH [Connecting] [Tunneling]
    – Restrict access to data [Link]

3. Ingest data for batch and interactive processing
    – Ingest data from cloud or on-premises data; store data in Azure Data Lake
    – Store data in Azure Blob Storage
    – Perform routine small writes on a continuous basis using Azure CLI tools
    – Ingest data in Apache Hive and Apache Spark by using Apache Sqoop, Azure Data Factory (ADF), AzCopy, and AdlCopy
    – Ingest data from an on-premises Hadoop cluster

4. Configure HDInsight clusters
    – Manage metastore upgrades
    – View and edit Ambari configuration groups
    – View and change service configurations through Ambari
    – Access logs written to Azure Table storage
    – Enable heap dumps for Hadoop services
    – Manage HDInsight configuration, use HDInsight .NET SDK, and PowerShell
    – Perform cluster-level debugging
    – Stop and start services through Ambari
    – Manage Ambari alerts and metrics

5. Manage and debug HDInsight jobs
    – Describe YARN architecture and operation
    – Examine YARN jobs through ResourceManager UI and review running applications
    – Use YARN CLI to kill jobs
    – Find logs for different types of jobs
    – Debug Hadoop and Spark jobs
    – Use Azure Operations Management Suite (OMS) to monitor and manage alerts, and perform predictive actions
 

Module #2. Implement Big Data Batch Processing Solutions

1. Implement batch solutions with Hive and Apache Pig
    – Define external Hive tables; load data into a Hive table (see the sketch after this list)
    – Use partitioning and bucketing to improve Hive performance
    – Use semi-structured files such as XML and JSON with Hive
    – Join tables with Hive using shuffle joins and broadcast joins
    – Invoke Hive UDFs with Java and Python; design scripts with Pig
    – Identify query bottlenecks using the Hive query graph
    – Identify the appropriate storage format, such as Apache Parquet, ORC, Text, and JSON
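
As a quick illustration of external tables and partitioning from the list above, here is a minimal PySpark sketch that runs HiveQL through a Hive-enabled SparkSession. The table name, schema, storage path and partition value are illustrative assumptions, not taken from the exam material.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-batch-example")
             .enableHiveSupport()   # lets spark.sql() run HiveQL against the metastore
             .getOrCreate())

    # Define an external Hive table over files that already exist in storage.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales (
            order_id BIGINT,
            amount   DOUBLE
        )
        PARTITIONED BY (order_date STRING)  -- enables partition pruning at query time
        STORED AS ORC                       -- columnar format, efficient for Hive
        LOCATION 'wasb:///data/sales/'
    """)
    spark.sql("MSCK REPAIR TABLE sales")    # discover existing partition directories

    # A query filtered on the partition column reads only that partition's files.
    spark.sql("SELECT COUNT(*) FROM sales WHERE order_date = '2018-01-01'").show()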

2. Design batch ETL solutions for big data with Spark
    – Share resources between Spark applications using YARN queues and preemption
    – Select Spark executor and driver settings for optimal performance; use partitioning and bucketing to improve Spark performance (see the sketch after this list)
    – Connect to external Spark data sources
    – Incorporate custom Python and Scala code in a Spark DataSets program
    – Identify query bottlenecks using the Spark SQL query graph
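
Below is a minimal batch-ETL sketch in PySpark along the lines of the list above. The executor settings, input path, column names and output path are illustrative assumptions; real values depend on the cluster and workload.

    from pyspark.sql import SparkSession, functions as F

    # Executor/driver sizing is workload-dependent; these values are placeholders.
    spark = (SparkSession.builder
             .appName("batch-etl")
             .config("spark.executor.memory", "4g")
             .config("spark.executor.cores", "2")
             .getOrCreate())

    # Extract: read raw JSON events (path is illustrative).
    raw = spark.read.json("wasb:///data/raw/events/")

    # Transform: drop bad rows and derive a date column to partition by.
    clean = (raw.filter(F.col("user_id").isNotNull())
                .withColumn("event_date", F.to_date("timestamp")))

    # Load: write Parquet partitioned by date so downstream queries prune files.
    (clean.write
          .mode("overwrite")
          .partitionBy("event_date")
          .parquet("wasb:///data/curated/events/"))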

3. Operationalize Hadoop and Spark
    – Create and customize a cluster by using ADF
    – Attach storage to a cluster and run an ADF activity
    – Choose between bring-your-own and on-demand clusters
    – Use Apache Oozie with HDInsight
    – Choose between Oozie and ADF
    – Share metastore and storage accounts between a Hive cluster and a Spark cluster to enable the same table across the cluster types
    – Select an appropriate storage type for a data pipeline, such as Blob storage, Azure Data Lake, and local Hadoop Distributed File System (HDFS)
 

Module #3. Implement Big Data Interactive Processing Solutions

1. Implement interactive queries for big data with Spark SQL
    – Execute queries using Spark SQL
    – Cache Spark DataFrames for iterative queries
    – Save Spark DataFrames as Parquet files (see the sketch after this list)
    – Connect BI tools to Spark clusters
    – Optimize join types such as broadcast versus merge joins
    – Manage Spark Thrift server and change the YARN resources allocation
    – Identify use cases for different storage types for interactive queries
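
Here is a hedged sketch of the caching, Parquet and broadcast-join items above; the paths and column names are assumptions for illustration.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("interactive-sql").getOrCreate()

    orders = spark.read.parquet("wasb:///data/orders/")        # illustrative paths
    customers = spark.read.parquet("wasb:///data/customers/")

    # Cache a DataFrame that will be queried repeatedly in this session.
    orders.cache()

    # Hint a broadcast join when one side is small enough to ship to every executor.
    joined = orders.join(F.broadcast(customers), "customer_id")

    joined.groupBy("country").count().show()

    # Persist results in a performant columnar format for later reads.
    joined.write.mode("overwrite").parquet("wasb:///data/marts/orders_enriched/")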

2. Perform exploratory data analysis by using Spark SQL
    – Use Jupyter and Apache Zeppelin for visualization and developing tidy Spark DataFrames for modeling
    – Use Spark SQL’s two-table joins to merge DataFrames and cache results
    – Save tidied Spark DataFrames to performant format for reading and analysis (Apache Parquet)
    – Manage interactive Livy sessions and their resources

3. Implement interactive queries for big data with Interactive Hive
    – Enable Hive LLAP through Hive settings
    – Manage and configure memory allocation for Hive LLAP jobs
    – Connect BI tools to Interactive Hive clusters

4. Perform exploratory data analysis by using Hive
    – Perform interactive querying and visualization
    – Use Ambari Views
    – Use HiveQL
    – Parse CSV files with Hive
    – Use ORC versus Text for caching
    – Use internal and external tables in Hive
    – Use Zeppelin to visualize data

5. Perform interactive processing by using Apache Phoenix on HBase
    – Use Phoenix in HDInsight
    – Use Phoenix grammar for queries (see the sketch after this list)
    – Configure transactions, user-defined functions, and secondary indexes
    – Identify and optimize Phoenix performance
    – Select between Hive, Spark, and Phoenix on HBase for interactive processing
    – Identify when to share metastore between a Hive cluster and a Spark cluster
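
For a taste of Phoenix grammar, here is a small sketch using the third-party phoenixdb Python driver against the Phoenix Query Server. The endpoint URL, table and values are assumptions for illustration.

    import phoenixdb  # third-party driver for the Phoenix Query Server

    # The Query Server endpoint is illustrative; use your cluster's address.
    conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
    cursor = conn.cursor()

    # Phoenix DDL: the primary key doubles as the HBase row key.
    cursor.execute("""CREATE TABLE IF NOT EXISTS readings (
        sensor_id INTEGER NOT NULL,
        ts        TIMESTAMP NOT NULL,
        value     DOUBLE,
        CONSTRAINT pk PRIMARY KEY (sensor_id, ts))""")

    # UPSERT inserts or updates a row by primary key (Phoenix has no plain INSERT).
    cursor.execute("UPSERT INTO readings VALUES (1, CURRENT_TIME(), 42.0)")

    cursor.execute("SELECT sensor_id, value FROM readings WHERE sensor_id = 1")
    print(cursor.fetchall())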
 

Module #4. Implement Big Data Real-Time Processing Solutions

1. Create Spark streaming applications using DStream API
    – Define DStreams and compare them to Resilient Distributed Datasets (RDDs)
    – Start and stop streaming applications
    – Transform DStreams (flatMap, reduceByKey, updateStateByKey); see the sketch after this list
    – Persist long-term data stores in HBase and SQL
    – Persist Long Term Data Azure Data Lake and Azure Blob Storage
    – Stream data from Apache Kafka or Event Hub
    – Visualize streaming data in a PowerBI real-time dashboard
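
Here is a minimal DStream sketch showing flatMap, reduceByKey and updateStateByKey. The socket source, host/port and checkpoint path are assumptions; a Kafka or Event Hubs source would use its own connector instead.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="dstream-example")
    ssc = StreamingContext(sc, batchDuration=5)       # 5-second micro-batches
    ssc.checkpoint("wasb:///checkpoints/dstream/")    # required for stateful ops

    # Simple text source for illustration (Kafka/Event Hubs use their own receivers).
    lines = ssc.socketTextStream("localhost", 9999)

    def update_count(new_values, running_total):
        # updateStateByKey keeps a running total per key across micro-batches.
        return sum(new_values) + (running_total or 0)

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .updateStateByKey(update_count))

    counts.pprint()
    ssc.start()
    ssc.awaitTermination()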

2. Create Spark structured streaming applications
    – Use DataFrames and DataSets APIs to create streaming DataFrames and Datasets
    – Create Window Operations on Event Time (see the sketch after this list)
    – Define Window Transformations for Stateful and Stateless Operations
    – Stream Window Functions, Reduce by Key, and Window to Summarize Streaming Data
    – Persist Long Term Data HBase and SQL
    – Persist Long Term Data Azure Data Lake and Azure Blob Storage
    – Stream data from Kafka or Event Hub
    – Visualize streaming data in a PowerBI real-time dashboard
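
And a structured-streaming counterpart: a windowed count over event time with a watermark for late data, reading from Kafka. The broker address, topic name and window sizes are illustrative.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("structured-streaming").getOrCreate()

    # Kafka source (broker and topic are placeholders).
    events = (spark.readStream
                   .format("kafka")
                   .option("kafka.bootstrap.servers", "broker:9092")
                   .option("subscribe", "events")
                   .load()
                   .selectExpr("CAST(value AS STRING) AS word", "timestamp"))

    # Tumbling 10-minute windows over event time; the watermark bounds state
    # by discarding events that arrive more than 15 minutes late.
    counts = (events.withWatermark("timestamp", "15 minutes")
                    .groupBy(F.window("timestamp", "10 minutes"), "word")
                    .count())

    query = (counts.writeStream
                   .outputMode("update")
                   .format("console")
                   .start())
    query.awaitTermination()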

3. Develop big data real-time processing solutions with Apache Storm
    – Create Storm clusters for real-time jobs
    – Persist Long Term Data HBase and SQL
    – Persist Long Term Data Azure Data Lake and Azure Blob Storage
    – Stream data from Kafka or Event Hub
    – Configure event windows in Storm
    – Visualize streaming data in a PowerBI real-time dashboard
    – Define Storm topologies and describe Storm Computation Graph Architecture
    – Create Storm streams and conduct streaming joins
    – Run Storm topologies in local mode for testing
    – Configure Storm applications (Workers, Debug mode)
    – Conduct Stream groupings to broadcast tuples across components
    – Debug and monitor Storm jobs

4. Build solutions that use Kafka
    – Create Spark and Storm clusters in the virtual network
    – Manage partitions
    – Configure MirrorMaker
    – Start and stop services through Ambari
    – Manage topics (see the sketch after this list)
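
A small sketch of topic and partition management using the kafka-python admin client; the broker address, topic name and counts are assumptions.

    from kafka.admin import KafkaAdminClient, NewTopic  # kafka-python package

    admin = KafkaAdminClient(bootstrap_servers="broker:9092")  # placeholder broker

    # Partition count caps consumer parallelism; replication_factor spreads
    # copies across brokers for fault tolerance.
    admin.create_topics([NewTopic(name="events",
                                  num_partitions=8,
                                  replication_factor=3)])

    print(admin.describe_topics(["events"]))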

5. Build solutions that use HBase
    – Identify HBase use cases in HDInsight
    – Use HBase Shell to create, update, and drop HBase tables
    – Monitor an HBase cluster
    – Optimize the performance of an HBase cluster
    – Identify use cases for using Phoenix for analytics of real-time data
    – Implement replication in HBase


Download SQL Server 2017 for free (with full MSBI stack)

March 1, 2018

 
With SQL Server 2014, Microsoft made its SQL Server Developer Edition free for development and test databases in non-production environments. This edition is not meant for production environments or for use with production data.

SQL Server 2014 Dev Ed free

With SQL Server 2017 Developer edition developers can build any kind of application on top of SQL Server. It includes all the functionality of Enterprise edition, but is licensed for use as a development and test system, not as a production server.

So, with this free edition you get the Database Engine as well as the full MSBI stack with DW/BI capabilities (i.e. SSIS/SSAS/SSRS) for free 🙂
 

Downloads here:

SQL Server 2017 Developer Edition

SQL Server Management Studio (SSMS, latest version)

– Sample databases for SQL Server [AdventureWorks] [Wide World Importers]

SQL Operations Studio


Meltdown and Spectre vulnerability, all about, and references for patching Windows OS & SQL Server

February 23, 2018

Meltdown and Spectre are hardware vulnerabilities in modern computers that can leak passwords and sensitive data. They affect nearly all modern operating systems (Windows, Linux, etc.) and processors (Intel, AMD, ARM, etc.). These hardware vulnerabilities allow programs to steal data that is currently being processed on the computer: passwords, personal photos, emails, instant messages and even business-critical documents.
 

–> On 4th January 2018, three vulnerabilities affecting many modern processors were publicly disclosed by Google’s Project Zero:

1. CVE-2017-5715 (Spectre, branch target injection) – Systems with microprocessors utilizing speculative execution and indirect branch prediction may allow unauthorized disclosure of information to an attacker with local user access via a side-channel analysis.

2. CVE-2017-5753 (Spectre, bounds check bypass) – Systems with microprocessors utilizing speculative execution and branch prediction may allow unauthorized disclosure of information to an attacker with local user access via a side-channel analysis.

3. CVE-2017-5754 (Meltdown, rogue data cache load) – Systems with microprocessors utilizing speculative execution and indirect branch prediction may allow unauthorized disclosure of information to an attacker with local user access via a side-channel analysis of the data cache.
 

Tech giants such as Apple, Alphabet, and Intel identified these vulnerabilities. Apple kept mum for a while, and Intel decided not to inform US-CERT (the United States Computer Emergency Readiness Team) upon learning about Meltdown and Spectre, as hackers had not yet taken advantage of the flaws. It was Google that disclosed the information to Intel, AMD and ARM Holdings back in June 2017.


 

What’s the vulnerability all about?

Most chip manufacturers around the world add optimizations to their hardware to make it run faster. The two main techniques used to speed up processors are caching and speculative execution. If exploited, these could give hackers and malicious/rogue programs access to data that was considered totally protected. The attacks on these two techniques are dubbed Meltdown and Spectre respectively, and are explained below.

 

What is Meltdown?

The vulnerability basically melts security boundaries which are normally enforced by the hardware. Meltdown breaks the mechanism that keeps applications from accessing arbitrary system memory. Consequently, applications can access system memory or cache.

Meltdown is a novel attack that overcomes memory isolation completely by providing a simple way for any user process to read the entire kernel memory of the machine it executes on, including all physical memory mapped in the kernel region. Meltdown does not exploit any software vulnerability, i.e., it works on all major operating systems. Instead, Meltdown exploits side-channel information available on most modern processors, e.g., modern Intel microarchitectures since 2010, and potentially CPUs of other vendors.

It is a software-based side-channel attack exploiting out-of-order execution on modern processors to read arbitrary kernel- and physical-memory locations from an unprivileged user-space program. Without requiring any software vulnerability, and independent of the operating system, Meltdown enables an adversary to read sensitive data of other processes or virtual machines in the cloud at up to 503 KB/s, affecting millions of devices.
 

What is Spectre?

This vulnerability’s root cause is speculative execution itself. As it is not easy to fix, it will haunt us for quite some time. Spectre tricks other applications into accessing arbitrary locations in their memory.

Speculative execution is a technique used by high-speed processors to increase performance by guessing likely future execution paths and prematurely executing the instructions in them. For example, when a program’s control flow depends on an uncached value located in physical memory, it may take several hundred clock cycles before the value becomes known. Rather than wasting these cycles by idling, the processor guesses the direction of control flow, saves a checkpoint of its register state, and proceeds to speculatively execute the program on the guessed path. When the value eventually arrives from memory, the processor checks the correctness of its initial guess. If the guess was wrong, the processor discards the (incorrect) speculative execution by reverting the register state back to the stored checkpoint, resulting in performance comparable to idling. If the guess was correct, however, the speculative execution results are committed, yielding a significant performance gain.
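
The guess/checkpoint/commit-or-rollback flow described above can be modelled in a few lines of Python. This is purely a conceptual toy: real speculation happens inside the CPU pipeline, and the function names here are invented for illustration.

    def slow_condition():
        # Stands in for an uncached value that takes hundreds of cycles to load.
        return True

    def speculative_branch(predicted, path_taken, path_not_taken):
        # Guess the branch direction and do the work before the condition is known.
        speculative_result = path_taken() if predicted else path_not_taken()
        actual = slow_condition()          # the real value finally arrives
        if actual == predicted:
            return speculative_result      # correct guess: commit, cycles saved
        # Wrong guess: discard the speculative work (roll back to the checkpoint)
        # and re-execute the correct path, costing roughly what idling would have.
        return path_taken() if actual else path_not_taken()

    result = speculative_branch(predicted=True,
                                path_taken=lambda: "fast path",
                                path_not_taken=lambda: "slow path")
    print(result)  # "fast path"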


 

Guidance for Windows OS: [Server link], [Client link]

Guidance for SQL Server: [link]

Guidance for Azure: [link]

Guidance for Oracle: [link]

Guidance for AWS: [link]


 

Meltdown demos (video):


 

References:
Google Project Zero
meltdownattack.com (Meltdown PDF)
spectreattack.com (Spectre PDF)
Good read on Meltdown and Spectre (csoonline.com)
Google Retpoline (Jump Over ASLR)
Microsoft Cloud blog
stratechery.com
blog.bitnami.com


An Introduction to Cloud Computing …aligned with Microsoft Azure

February 7, 2018

 

What is Cloud Computing?

Cloud Computing is the delivery of computing services, like servers, storage, databases, networking, software, analytics and more, over the Internet (“the cloud”). Here, computing resources such as servers, applications and data are integrated and provided as a service over the Internet to individuals and organizations. Companies offering these computing services are called cloud providers, and they typically charge for cloud computing services based on usage, similar to how you are billed for water or electricity at home.

Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. – NIST

 
The two prominent Cloud Computing providers in the market currently are:
– Microsoft Azure and
– Amazon’s AWS.
 

Uses of Cloud Computing:

1. On-demand and Self-Service, without any human intervention or manual work.

2. Create as many Virtual Machines (VMs) of your choice of Operating System (OS) quickly without worrying about hardware and office/lab space.

3. Instantaneously Scale up and Scale down the VMs and other services

4. Create new apps and services quickly

5. Resource pooling and Elasticity.

6. Host websites, portals and blogs

7. Store, back up and recover data

8. Stream audio and video

9. Analyse data for patterns and make predictions
 

Benefits of Cloud Computing:

1. Cost: eliminates the capital expense of buying hardware and software and setting up and running on-site datacenters

2. Global Scale: Quickly scale up and scale out as and when you need more resources, scale down when they are not needed, and pay as you use.

3. Reliability: Provision for data backup, Business Continuity and Disaster Recovery (BCDR), with data mirrored at multiple redundant sites on the cloud provider’s network.

4. Speed and Performance: The majority of computing resources can be provisioned in minutes, on state-of-the-art, latest-generation, high-end hardware.

5. Productivity: Rather than being tied up in IT management chores, IT teams can spend time on important business goals.
 

Types of Cloud Computing:

As per NIST (the National Institute of Standards and Technology), a cloud computing service provider should offer the following 3 service models to its customers:

1. Infrastructure as a Service (IaaS): The consumer can provision Processing, Storage, Networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include Operating Systems and Applications.

– The consumer does not manage or control the underlying cloud infrastructure.

– But has control over Operating Systems, Storage, deployed Applications, and possibly limited control of select networking components (e.g., host firewalls).

– Example: Windows and Linux VMs, Blob Storage, SQL Server on a Windows/Linux VM, Virtual Network, etc.

2. Platform as a Service (PaaS): The consumer can deploy onto the cloud infrastructure consumer-created or acquired applications created using Programming Languages and Tools supported by the provider.

– The consumer does not manage or control the underlying cloud infrastructure including Network, Servers, Operating Systems, or Storage.

– But has control over the deployed Applications and possibly application hosting environment configurations.

– Example: Azure SQL Database, DocumentDB, HDInsight, Data Factory, etc.

3. Software as a Service (SaaS): The consumer can use the provider’s applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based email).

– The consumer does not manage or control the underlying cloud infrastructure including Network, Servers, Operating Systems, Storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.

– Example: Microsoft Office 365, WordPress, Joomla, Django, etc.

Deployment Models:

1. Public cloud: The cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.

– Example: Microsoft Azure.

2. Private cloud: The cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on premise or off premise.

– Example: Microsoft Azure Stack.

3. Hybrid cloud: This combines Public and Private clouds, bound together by technology that allows data and applications to be shared between them, providing businesses greater flexibility and more deployment options.

– Example: Cloud Bursting for load-balancing between clouds.