Archive

Posts Tagged ‘Apache Spark Features’

Apache Spark – Introduction & main features (Part 1)

October 12, 2020 Leave a comment

 

Apache Spark is an open source, unified analytics engine, designed for distributed big data processing and machine learning.

Although Apache Hadoop was still there to cater for Big Data workloads, but its Map-Reduce (MR) framework had some inefficiencies and was hard to manage & administer.

– A typical Hadoop program involves multiple iterative MR jobs which loads data to disk and reads again in every iteration, thus involves latency and performance issues.

– Hadoop also provides SQL like ad-hoc querying capability thru Pig & Hive, but due to MR it involves writing/reading data to/from disks, which makes it much slower and gives a bad experience when you need quick data retrieval.

– Thus the creators of Hadoop MR at UC Berkeley lab took this problem as an opportunity and came up with a project called Spark, which was written in Scala and is 10-100 times faster compared to Hadoop MR.
 

The main features of Spark are:

1. Speed: It is much more faster in processing and crunching large scale data compared to traditional databases engines and MapReduce Hadoop jobs. With in-memory storage and computation of data it provides high performance batch & streaming processing. Not only this but features like DAG scheduler, Query Optimizer, Tungsten execution engine, etc. makes Spark blazingly fast.
 

2. Modularity & Generality: For different kind of workloads Spark provides multiple components in its core like SQL, Structured Streaming, MLlib for Machine Learning, and GraphX (all as separate modules but unified under one engine).


 
3. Ease of use (Polyglot): To write your apps Spark provides multiple programming languages of your choice which you are comfortable with, like Scala, SQL, Python, Java & R. It also provides n number of APIs & libraries for you to quickly build your apps without going through a steep learning curve.
 

4. Extensibility: With Spark you can read data from multiple heterogeneous sources like HDFS, HBase, Apache Hive, Azure Blob Storage, Amazon S3, Apache Kafka, Kinesis, Cassandra, MongoDB, etc. and other traditional databases (RDBMSs). Spark also supports reading various file formats, such as CSV, Text, JSON, Parquet, ORC, Avro, etc. and from RDBMS tables.


 
5. Resilient (Fault Tolerant): Spark works on the concept of RDDs i.e. “Resilient Distributed Dataset”. It is a Fault Tolerant collection of objects partitioned across several nodes. With the concept of lineage RDDs can rebuild a lost partition in case of any node failure.
 

6. Runs Anywhere: Spark runs on multiple platforms like standalone cluster manager, Hadoop Yarn, Apache Mesos and Kubernetes.