Archive

Archive for October 19, 2020

Apache Spark – main Components & Architecture (Part 2)

October 19, 2020 1 comment

 

1. Spark Driver:

– The Driver program can run various operations in parallel on a Spark cluster.

– It is responsible to communicate with the Cluster Manager for allocation of resources for launching Spark Executors.

– And in parallel it instantiates SparkSession for the Spark Application.

– The Driver program splits the Spark Application into one or more Spark Jobs, and each Job is transformed into a DAG (Directed Acyclic Graph, aka Spark execution plan). Each DAG internally has various Stages based upon different operations to perform, and finally each Stage gets divided into multiple Tasks such that each Task maps to a single partition of data.

– Once the Cluster Manager allocates resources, the Driver program works directly with the Executors by assigning them Tasks.
 

2. Spark Session:

– A SparkSession provides a single entry point to interact with all Spark functionalities and the underlying core Spark APIs.

– For every Spark Application you’ve to create a SparkSession explicitly, but if you are working from an Interactive Shell the Spark Driver instantiates it implicitly for you.

– The role of SparkSession is also to send Spark Tasks to the executors to run.
 

3. Cluster Manager:

– Its role is to manage and allocate resources for the cluster nodes on which your Spark application is running.

– It works for Spark Driver and provides information about available Executor nodes and schedule Spark Tasks on them.

– Currently Spark supports built-in standalone cluster manager, Hadoop YARM, Apache Mesos and Kubernetes.
 

4. Spark Executor:

– By now you would have known what are Executors.

– These executes Tasks for an Spark Application on a Worker Node and keep communication with the Spark Driver.

– An Executor is actually a JVM running on a Worker node.