Spark SQL – Spark basics & internals


Apache Spark is an open-source, unified analytics engine designed for distributed big data processing and machine learning.
 

The main features of Spark are:

1. Speed: It is much faster at processing and crunching large-scale data than traditional database engines and Hadoop MapReduce jobs. With in-memory storage and computation it delivers high-performance batch & streaming processing. On top of that, features like the DAG scheduler, the query optimizer, and the Tungsten execution engine make Spark blazingly fast.

2. Modularity & Generality: For different kinds of workloads Spark provides multiple components on top of its core, like Spark SQL, Structured Streaming, MLlib for machine learning, and GraphX for graph processing (all separate modules, but unified under one engine).

3. Ease of use (Polyglot): Spark lets you write your apps in whichever supported programming language you are comfortable with: Scala, SQL, Python, Java & R. It also provides a rich set of APIs & libraries so you can quickly build your apps without going through a steep learning curve.

4. Extensibility: With Spark you can read data from multiple heterogeneous sources like HDFS, HBase, Apache Hive, Azure Blob Storage, Amazon S3, Apache Kafka, Kinesis, Cassandra, MongoDB, and traditional databases (RDBMSs). Spark also supports reading various file formats, such as CSV, text, JSON, Parquet, ORC, and Avro, as well as RDBMS tables over JDBC (a short read example follows this list).

5. Runs Anywhere: Spark runs on multiple platforms, like its built-in standalone cluster manager, Hadoop YARN, Apache Mesos, and Kubernetes.
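To make points 4 and 5 a bit more concrete, here is a minimal Scala sketch of reading a few of those formats through the DataFrameReader API. The paths, table name, and connection details are made-up placeholders, and "local[*]" is only for trying this out on a single machine.

```scala
import org.apache.spark.sql.SparkSession

object ReadFromManySources {
  def main(args: Array[String]): Unit = {
    // Entry point to Spark; "local[*]" is just for experimenting on one machine.
    val spark = SparkSession.builder()
      .appName("read-from-many-sources")
      .master("local[*]")
      .getOrCreate()

    // CSV with a header row (path is hypothetical).
    val csvDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/sales/2023/*.csv")

    // JSON and Parquet use the same reader interface, only the format changes.
    val jsonDf    = spark.read.json("/data/events/")
    val parquetDf = spark.read.parquet("/data/warehouse/orders/")

    // An RDBMS table over JDBC; connection details are placeholders and the
    // PostgreSQL JDBC driver would need to be on the classpath.
    val jdbcDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/shop")
      .option("dbtable", "public.customers")
      .option("user", "spark_reader")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .load()

    csvDf.printSchema()
    parquetDf.show(5)

    spark.stop()
  }
}
```

Whatever the source, the result is a DataFrame, so the rest of your code does not need to care where the data came from.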


 


 

Components & Architecture of Spark:

1. Spark Driver: The Driver is the process that runs your application's main program and orchestrates operations in parallel on a Spark cluster.
– It is responsible for communicating with the Cluster Manager to allocate the resources needed to launch Spark Executors.
– It also instantiates the SparkSession for the Spark Application.
– The Driver program splits the Spark Application into one or more Spark Jobs, and each Job is transformed into a DAG (Directed Acyclic Graph, aka the Spark execution plan). Each DAG is divided into Stages based on the operations to be performed, and each Stage is further divided into multiple Tasks, such that each Task maps to a single partition of data (a short sketch after this list illustrates this breakdown).
– Once the Cluster Manager allocates resources, the Driver program works directly with the Executors by assigning them Tasks.

2. Spark Session: A SparkSession provides a single entry point to interact with all Spark functionalities and the underlying core Spark APIs.
– For every Spark Application you have to create a SparkSession explicitly, but if you are working from an interactive shell the Spark Driver instantiates one implicitly for you (a minimal creation example also follows this list).
– The SparkSession also takes care of sending Spark Tasks to the Executors to run.

3. Cluster Manager: Its role is to manage and allocate resources for the cluster of nodes on which your Spark application runs.
– It works with the Spark Driver, provides information about the available Executor nodes, and schedules Spark Tasks on them.
– Currently Spark supports the built-in standalone cluster manager, Hadoop YARN, Apache Mesos, and Kubernetes.

4. Spark Executor: By now you probably have a good idea of what Executors are.
– They execute Tasks for a Spark Application on a Worker Node and keep communicating with the Spark Driver.
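Here is the SparkSession example promised above: a minimal, hedged sketch of creating the single entry point explicitly in an application. The application name, master URL, and config value are assumptions; in spark-shell an equivalent session is already available as the spark variable.

```scala
import org.apache.spark.sql.SparkSession

object MySparkApp {
  def main(args: Array[String]): Unit = {
    // Explicitly create the single entry point for this application.
    // "local[*]" is an assumption for local testing; on a real cluster the
    // master is usually supplied via spark-submit (e.g. YARN or Kubernetes).
    val spark = SparkSession.builder()
      .appName("my-spark-application")
      .master("local[*]")
      .config("spark.sql.shuffle.partitions", "8") // example config, tune for your data
      .getOrCreate()

    // The session exposes SQL, DataFrame, and catalog APIs from one place.
    spark.sql("SELECT 1 AS probe").show()

    spark.stop()
  }
}
```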

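And here is the Job / Stage / Task sketch referenced in the Driver section. The data and column names are made up; the point is that transformations only build the DAG, the final action triggers a Job, and the shuffle caused by groupBy typically splits that Job into two Stages, each running one Task per partition.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object JobStageTaskDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("job-stage-task-demo")
      .master("local[4]") // 4 local threads, so up to 4 Tasks can run in parallel
      .getOrCreate()
    import spark.implicits._

    // A tiny in-memory dataset, repartitioned so each Stage has several partitions/Tasks.
    val sales = Seq(("US", 100), ("US", 250), ("IN", 80), ("DE", 40), ("IN", 120))
      .toDF("country", "amount")
      .repartition(4)

    // Transformations only build up the DAG; nothing has executed yet.
    val totals = sales.groupBy("country").agg(sum("amount").as("total"))

    // The action triggers a Job; the shuffle from groupBy splits it into Stages.
    // While this runs you can inspect the Jobs, Stages and Tasks in the Spark UI
    // (by default at http://localhost:4040).
    totals.show()

    spark.stop()
  }
}
```

Running this and opening the Spark UI while it executes is a quick way to see the Driver's Job, Stage and Task breakdown for yourself.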
 

