Archive for the ‘Hadoop’ Category

What is Lambda Architecture? And what does Azure offer with its new Cosmos DB?

February 16, 2018

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch processing and stream processing methods, and minimizing the latency involved in querying big data.

It is a generic, scalable, and fault-tolerant data-processing architecture that addresses both batch and speed (latency) scenarios for big data and MapReduce workloads.

–> The system consists of three layers: Batch Layer, Speed Layer & Serving Layer

1. All data is pushed into both the Batch layer and Speed layer.

2. The Batch layer has a master dataset (immutable, append-only set of raw data) and pre-computes the batch views.

3. The Serving layer has Batch views for fast queries.

4. The Speed Layer compensates for processing time (to the serving layer) and deals with recent data only.

5. All queries can be answered by merging results from Batch views and Real-time views or pinging them individually.
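
To make point 5 concrete, here is a minimal PySpark sketch of answering a query by merging a pre-computed batch view with a real-time view. The storage paths, the column names (key, count), and the aggregation are illustrative assumptions, not part of the architecture description above.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-query-merge").getOrCreate()

# Batch view: pre-computed aggregates over the immutable master dataset.
batch_view = spark.read.parquet("/views/batch_view")    # columns: key, count

# Real-time view: aggregates over recent data only, produced by the speed layer.
speed_view = spark.read.parquet("/views/speed_view")    # same schema

# Answer the query by merging both views: sum the counts per key across them.
merged = (batch_view.unionByName(speed_view)
          .groupBy("key")
          .agg(F.sum("count").alias("count")))

merged.show()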

–> Lambda Architecture with Azure:

Azure offers a combination of the following technologies to accelerate real-time big data analytics:

1. Azure Cosmos DB, a globally distributed and multi-model database service.

2. Apache Spark for Azure HDInsight, a processing framework that runs large-scale data analytics applications.

3. Azure Cosmos DB change feed, which streams new data to the batch layer for HDInsight to process.

4. The Spark to Azure Cosmos DB Connector, which lets Spark read from and write to Cosmos DB collections (see the sketch after this list).
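
As a rough illustration of item 4, the sketch below reads a Cosmos DB collection into a Spark DataFrame through the azure-cosmosdb-spark connector. The format string and the option names (Endpoint, Masterkey, Database, Collection) are assumptions based on the connector's documentation, and the endpoint, key, database, and collection values are placeholders; verify all of them against the connector version installed on your HDInsight cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cosmosdb-batch-read").getOrCreate()

# Connection options for the azure-cosmosdb-spark connector (assumed names).
read_config = {
    "Endpoint": "https://<your-account>.documents.azure.com:443/",   # placeholder
    "Masterkey": "<your-primary-key>",                                # placeholder
    "Database": "lambda",                                             # illustrative
    "Collection": "master_dataset",                                   # illustrative
}

# Load the master dataset collection as a DataFrame.
master_df = (spark.read
             .format("com.microsoft.azure.cosmosdb.spark")
             .options(**read_config)
             .load())

master_df.printSchema()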

–> How Azure simplifies the Lambda Architecture:

1. All data is pushed into Azure Cosmos DB for processing.

2. The Batch layer has a master dataset (immutable, append-only set of raw data) stored in Azure Cosmos DB. Using HDI Spark, you can pre-compute your aggregations to be stored in your computed Batch Views.

3. The Serving layer is an Azure Cosmos DB database with collections for the master dataset and computed Batch View for fast queries.

4. The Speed layer compensates for processing time (to the serving layer) and deals with recent data only. It utilizes HDI Spark to read the Azure Cosmos DB change feed. This enables you to persist your data as well as to query and process it concurrently.

5. All queries can be answered by merging results from batch views and real-time views, or pinging them individually.
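
Below is a hedged sketch of steps 4 and 5: reading recent documents from the Azure Cosmos DB change feed with HDI Spark, aggregating them into a real-time view, and persisting that view back to Cosmos DB. The change-feed option names (ReadChangeFeed, ChangeFeedQueryName, ChangeFeedStartFromTheBeginning), like the connection options and collection names, are assumptions drawn from the azure-cosmosdb-spark connector and should be checked against its documentation.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cosmosdb-speed-layer").getOrCreate()

# Change-feed read options for the azure-cosmosdb-spark connector (assumed names).
change_feed_config = {
    "Endpoint": "https://<your-account>.documents.azure.com:443/",   # placeholder
    "Masterkey": "<your-primary-key>",                                # placeholder
    "Database": "lambda",                                             # illustrative
    "Collection": "master_dataset",                                   # illustrative
    "ReadChangeFeed": "true",
    "ChangeFeedQueryName": "speed-layer",
    "ChangeFeedStartFromTheBeginning": "false",
}

# Read only the documents delivered by the change feed (recent data).
recent = (spark.read
          .format("com.microsoft.azure.cosmosdb.spark")
          .options(**change_feed_config)
          .load())

# Aggregate the recent data into a real-time view.
realtime_view = recent.groupBy("key").agg(F.count("*").alias("count"))

# Persist the real-time view to its own Cosmos DB collection.
write_config = {
    "Endpoint": "https://<your-account>.documents.azure.com:443/",   # placeholder
    "Masterkey": "<your-primary-key>",                                # placeholder
    "Database": "lambda",                                             # illustrative
    "Collection": "realtime_view",                                    # illustrative
}

(realtime_view.write
 .format("com.microsoft.azure.cosmosdb.spark")
 .options(**write_config)
 .mode("append")
 .save())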

–> For complete details, see Microsoft Docs: Azure Cosmos DB: Implement a lambda architecture on the Azure platform


14 Sample Interview Questions and Answers for Hadoop Administration Certified Professionals

August 21, 2017

Despite plenty of opportunities for Hadoop professionals, landing a good job can still feel difficult: cracking the Hadoop Admin interview is a challenge in itself, and you must prepare for it. At Koenig Solutions, candidates not only acquire a Hadoop administration certification but also prepare for the interview, setting them up for a challenging yet lucrative career.

–> This article lists 14 important questions and answers commonly asked during Hadoop Administration job interviews:

Q1. What daemons are required to run a Hadoop cluster?
A. DataNode, NameNode, JobTracker, and TaskTracker are required to run a Hadoop cluster.

Q2. How would you restart a NameNode?
A. The easiest way is to run the stop-all.sh script to stop the running daemons and then run start-all.sh to bring them back up. To restart only the NameNode, run hadoop-daemon.sh stop namenode followed by hadoop-daemon.sh start namenode.

Q3. What are different schedulers available in Hadoop?
A. a. COSHH: Considers the workload, cluster and the user heterogeneity for scheduling decisions.
    b. FIFO Scheduler: Doesn’t consider heterogeneity, but orders jobs on the basis of their arrival time in the queue.
    c. Fair Sharing: Defines a pool for each user; users run their jobs in their own pools and receive a fair share of cluster resources.

Q4. What Hadoop shell commands can be used to perform copy operation?
A. hadoop fs -copyToLocal <src> <localdst>
    hadoop fs -put <localsrc> <dst>
    hadoop fs -copyFromLocal <localsrc> <dst>
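
If you need to script these copy operations, the small Python sketch below wraps the same shell commands with subprocess. It assumes the hadoop binary is on PATH, and the file paths are purely illustrative.

import subprocess

def hdfs_put(local_path, hdfs_path):
    """Copy a local file into HDFS (equivalent to: hadoop fs -put)."""
    subprocess.run(["hadoop", "fs", "-put", local_path, hdfs_path], check=True)

def hdfs_copy_to_local(hdfs_path, local_path):
    """Copy a file from HDFS to the local file system (hadoop fs -copyToLocal)."""
    subprocess.run(["hadoop", "fs", "-copyToLocal", hdfs_path, local_path], check=True)

if __name__ == "__main__":
    # Illustrative paths only; adjust them to your own cluster and files.
    hdfs_put("access.log", "/user/hadoop/logs/access.log")
    hdfs_copy_to_local("/user/hadoop/logs/access.log", "./access_copy.log")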

Q5. What’s the purpose of the jps command?
A. It is used to confirm whether the daemons of the Hadoop cluster are running or not. The output of the jps command shows the status of the DataNode, NameNode, Secondary NameNode, JobTracker, and TaskTracker processes.

Q6. How many NameNodes can be run on a single Hadoop cluster?
A. Only one.

Q7. What will happen when the NameNode on the Hadoop cluster is down?
A. Whenever the NameNode is down, the file system goes offline.

Q8. Detail crucial hardware considerations when deploying Hadoop in a production environment.
A. Operating System: 64-bit operating system
    Capacity: Larger form factor (3.5”) disks allow more storage and cost less.
    Network: Two ToR (top-of-rack) switches per rack for better redundancy.
    Storage: To achieve high performance and scalability, it is better to design the Hadoop platform so that compute moves to the data rather than data to the compute.
    Memory: System’s memory requirements vary based on the application.
    Computational Capacity: Can be determined by the total count of MapReduce slots existing across nodes within a Hadoop cluster.

Q9. Which command will you use to determine if the HDFS (Hadoop Distributed File System) is corrupt?
A. The hadoop fsck (file system check) command, for example: hadoop fsck /.

Q10. How can a Hadoop job be killed?
A. Using the command: hadoop job -kill <job-id>.

Q11. Can files be copied across multiple clusters? If yes, how?
A. Yes, it is possible using distributed copy. The distcp command can be used for intra- or inter-cluster copying, for example: hadoop distcp hdfs://namenode1/src hdfs://namenode2/dest.

Q12. Recommend the best Operating System to run Hadoop.
A. Linux (for example, Ubuntu) is the best choice. Windows can be used, but it can lead to several problems.

Q13. How often should the NameNode be reformatted?
A. Never, as it can lead to complete data loss. It is formatted only once, when the cluster is first set up.

Q14. What are Hadoop's configuration files and where are they located?
A. Hadoop has three main configuration files – mapred-site.xml, hdfs-site.xml, and core-site.xml – which are located in the “conf” subdirectory.
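
All of these *-site.xml files share the same <configuration>/<property> layout, so they can be inspected programmatically. Here is a small Python sketch that prints the properties defined in core-site.xml; the path is illustrative, so point it at wherever your installation keeps its configuration directory.

import xml.etree.ElementTree as ET

def read_hadoop_config(path):
    """Return the property name -> value mapping from a Hadoop *-site.xml file."""
    root = ET.parse(path).getroot()    # the <configuration> element
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.findall("property")}

if __name__ == "__main__":
    # Illustrative path; adjust it to your own installation.
    for name, value in read_hadoop_config("/usr/local/hadoop/conf/core-site.xml").items():
        print(name, "=", value)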

Check out – Best Free Resources For Sharpening Your Skills In Hadoop. These are just a few questions, but you may come across several others, depending on your Hadoop experience.

Author Bio: Michael Warne is a tech blogger and an expert in Hadoop certification training. He has 5 years of experience in the Hadoop industry and has worked as a certified Hadoop professional for top-notch IT companies.