

Apache Spark Building Blocks

June 28, 2023

Meet the Author: Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An IIT and ISB alumnus with more than 17 years of experience, he has held prominent positions at IT majors such as HSBC, ITC Infotech, Infosys, and Deloitte. He is a sought-after IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of training experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.


 

Apache Spark is one of the best frameworks for managing enormous amounts of heterogeneous data. Benchmark studies have shown that this open-source framework can outperform the standard MapReduce programming model by a factor of up to 100. Spark was first developed at UC Berkeley's AMPLab in 2009, open-sourced in 2010, and later donated to the Apache Software Foundation in 2013. Let's quickly get a feel for Spark. For any activity or project to be completed, three crucial elements are required: the processing power (processor), the data, and the syntax and logic of the task, i.e. the commands the system must carry out. The major data stores that Spark can connect to are mentioned below.

Apache Spark

Apache Spark is extensively used for handling big data and performing a wide variety of tasks on batch data as well as streaming data; it can also be used for machine learning and for handling graph data. To begin with, Apache Spark is not a programming language. It is a framework/platform on which we can execute instructions (code). Spark brings a wide variety of features that make it a favorite among analysts in the big data circle. It also interfaces with an impressive number of relevant data stores, both distributed and non-distributed. A few of these are HDFS, Cassandra, OpenStack Swift, Amazon S3, and Kudu.

One of Spark's strongest features is that it offers several language APIs, which allows the whole programming community to use Spark. Spark primarily supports the language APIs of Scala, Java, Python, and R. These language APIs deserve special attention since they are a key factor in the widespread acceptance of Spark within the development community.

Scala is an open-source programming language that was released in 2004. More than 70% of Spark's code was written in Scala. It is a statically typed language that combines both functional and object-oriented programming principles.


Some of Spark's features are:

  • Open source
  • Distributed and parallel computation
  • In-memory computation
  • Suited for iterative and interactive applications
  • Distributed datasets
  • Batch and real-time applications
  • Programming in Scala, Java, Python, R, and SQL
  • Runs on commodity hardware
  • Fault tolerance
  • Scalability
  • Runs on both Windows and Linux
  • Written in the Scala programming language
  • Easy to use

Spark is renowned for its ability to process huge volumes of data quickly in a distributed setting. The magic comes from Spark leveraging the cluster's distributed memory: once the data has been loaded from disk into memory, all operations are performed entirely in memory.
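As a minimal PySpark sketch of this behaviour (assuming a local Spark installation; the data here is generated rather than read from disk), a dataset can be cached so that repeated computations are served from memory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("InMemoryDemo").master("local[*]").getOrCreate()

    # A synthetic DataFrame stands in for data that would normally be loaded from disk.
    df = spark.range(0, 1_000_000)

    df.cache()                              # keep the data in cluster memory after first use
    print(df.count())                       # the first action materialises the cache
    print(df.filter("id % 2 = 0").count())  # subsequent work is served from memory

    spark.stop()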

Three building blocks make up an Apache Spark application: SparkContext, RDDs (data objects), and Operations.

SparkContext - Any application's starting point is the SparkContext. It makes it possible for the application to interact with data sources and manipulate data.

SparkSession - From Spark 2.0 onwards, a new entry point called SparkSession was introduced. SparkSession has built-in support for Hive (HiveContext) and SQL-like (SQLContext) operations.
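A minimal sketch of creating these entry points in PySpark (the application name and the local master are arbitrary choices for illustration):

    from pyspark.sql import SparkSession

    # SparkSession is the unified entry point from Spark 2.0 onwards.
    spark = (SparkSession.builder
             .appName("SparkBasics")     # arbitrary application name
             .master("local[*]")         # run locally on all available cores
             .getOrCreate())

    # The underlying SparkContext is still accessible through the session.
    sc = spark.sparkContext
    print(sc.version)

    spark.stop()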

The next component is the data on which the processing is to be done. Spark uses a special abstraction for this data: it holds the data in memory and processes it in memory.

RDD: A Resilient Distributed Dataset is a read-only, in-memory abstraction of a data object in Spark. RDDs are collections of records that are immutable, fault-tolerant, and partitioned.
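A small PySpark sketch of these properties (the values and partition count are arbitrary): an RDD is built from a local collection, and a transformation returns a new RDD rather than modifying the original:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RDDDemo").master("local[*]").getOrCreate()
    sc = spark.sparkContext

    # Build an RDD from a local collection, split across 3 partitions.
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=3)

    # map() does not modify rdd; it returns a brand-new RDD (immutability).
    doubled = rdd.map(lambda x: x * 2)

    print(rdd.getNumPartitions())   # 3
    print(doubled.collect())        # [2, 4, 6, 8, 10, 12]

    spark.stop()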

The DataFrame was introduced in the Spark 1.3 release. The major difference is that RDDs hold unstructured objects, whereas DataFrames are organized in a tabular fashion.

 

DataFrames are collections of organised data distributed across the nodes of a cluster. Spark's SQL and DataFrame APIs make them comparable to relational database tables or Python's pandas DataFrames, and the schema can be inferred from a DataFrame automatically.
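A short PySpark sketch of building a DataFrame and letting Spark infer its schema (the rows and column names are invented for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DataFrameDemo").master("local[*]").getOrCreate()

    # Tabular data with named columns, comparable to a relational table or a pandas DataFrame.
    df = spark.createDataFrame(
        [("Alice", 34), ("Bob", 29), ("Cathy", 41)],
        ["name", "age"],
    )

    df.printSchema()               # inferred schema: name (string), age (long)
    df.filter(df.age > 30).show()  # SQL-like operations on the tabular data

    spark.stop()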

Datasets were made available in version 1.6. Datasets are an extension of DataFrames: Spark can examine the schema before running the code to see what is being specified. In other words, when compiling object-oriented operations, Spark can check and assess the data types associated with the data objects.

The term "dataset" refers to a collection of tightly typed, organised data. Datasets' main objective is to offer a simple means of performing transformations on objects without sacrificing the benefits of Spark's efficiency and resilience.

3rd Component - Operations: Apache Spark supports two types of operations: Transformations and Actions.

Transformations take RDDs as input and produce one or more (new) RDDs as the result. RDDs cannot be changed in place since they are read-only, in-memory, immutable objects. Transformations are lazy: instead of executing immediately, they build a lineage graph known as a DAG (Directed Acyclic Graph). These lineage graphs are also referred to as RDD operator or dependency graphs.

DAGs can be seen as the execution plan for the transformation operations that we want to perform.
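The lineage behind this plan can be inspected directly. In the PySpark sketch below (the particular transformations are arbitrary), nothing runs until an action is called, and toDebugString() prints the DAG that has been built up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DAGDemo").master("local[*]").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(100), numSlices=4)

    # Nothing executes yet -- these transformations only extend the DAG.
    result = (rdd.map(lambda x: (x % 10, x))
                 .filter(lambda kv: kv[1] > 5)
                 .reduceByKey(lambda a, b: a + b))

    # Print the lineage (operator/dependency) graph.
    print(result.toDebugString().decode())

    # Only an action such as collect() triggers execution of the plan.
    print(result.collect())

    spark.stop()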

Transformations can be classified as narrow transformations or wide transformations.

Narrow transformations are those where data does not need to move between partitions for the functions to execute; the data required resides in a single partition.
Examples: map(), mapPartitions(), flatMap(), filter(), union()

Wide transformations: The data required for the computation resides on many partitions, and hence data moves across multiple partitions. These are also known as shuffle transformations, as data gets shuffled while the operations are executed.
Examples: groupByKey(), aggregateByKey(), join(), repartition()
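A compact PySpark sketch contrasting the two kinds of transformations (the data is synthetic):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("NarrowWideDemo").master("local[*]").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(12), numSlices=4)

    # Narrow: map() and filter() operate partition by partition, with no data movement.
    narrow = rdd.map(lambda x: x * 10).filter(lambda x: x % 20 == 0)

    # Wide: groupByKey() needs all records with the same key on the same partition,
    # so it shuffles data across partitions.
    wide = rdd.map(lambda x: (x % 3, x)).groupByKey()

    print(narrow.collect())
    print([(k, sorted(vs)) for k, vs in wide.collect()])

    spark.stop()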

TRANSFORMATIONS EXAMPLES:

General

  • map
  • filter
  • flatMap
  • mapPartitions
  • mapPartitionsWithIndex
  • groupBy
  • sortBy

Math / Statistical

  • sample
  • randomSplit

Set Theory / Relational

  • union
  • intersection
  • subtract
  • distinct
  • cartesian
  • zip

Data Structure / I/O

  • keyBy
  • zipWithIndex
  • zipWithUniqueId
  • zipPartitions
  • coalesce
  • repartition
  • repartitionAndSortWithinPartitions
  • pipe

Action Operations:

Actions are eager operations. While transformation operations produce new RDD(s), action operations produce results from the RDDs. The outcomes of action operations are returned to the Spark driver (the driver is the JVM process that coordinates the workers and task execution) or written to an external storage system.
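A few representative actions in PySpark; each one forces the lazy plan to execute and returns a concrete result to the driver (the values are arbitrary and the output path is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ActionsDemo").master("local[*]").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize([4, 1, 7, 3, 9, 2])

    print(rdd.count())                     # 6  -> result returned to the driver
    print(rdd.reduce(lambda a, b: a + b))  # 26
    print(rdd.take(3))                     # [4, 1, 7]
    print(rdd.collect())                   # the whole dataset on the driver; use with care

    # Writing to external storage is also an action (hypothetical output path).
    rdd.saveAsTextFile("/tmp/actions_demo_output")

    spark.stop()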

ACTIONS EXAMPLES:

General

  • reduce
  • collect
  • aggregate
  • fold
  • first
  • take
  • foreach
  • top
  • treeAggregate
  • treeReduce
  • foreachPartition
  • collectAsMap
  • takeOrdered

Math / Statistical

  • count
  • takeSample
  • max
  • min
  • sum
  • histogram
  • mean
  • variance
  • stdev
  • sampleVariance
  • countApprox
  • countApproxDistinct

Set Theory / Relational

  • takeOrdered

Data Structure / I/O

  • saveAsTextFile
  • saveAsSequenceFile
  • saveAsObjectFile
  • saveAsHadoopDataset
  • saveAsHadoopFile
  • saveAsNewAPIHadoopDataset
  • saveAsNewAPIHadoopFile

Spark Optimization

Once an application is put into production, monitoring is crucial to maintain results and make sure that jobs complete successfully. The effectiveness of jobs is often evaluated on a few factors, including runtime, storage space, and metrics for data shuffled across nodes. The majority of developers concentrate only on creating applications; they pay little attention to refactoring and optimising the code.

Typically, optimisation is carried out on two levels: the cluster level and the application level. Cluster-level optimisation entails using the hardware and the Spark cluster to their fullest potential. Spark runs in parallel, so the more hardware, the better. Faster networks and more memory also improve performance, especially when shuffling data. Automatic memory management is available in recent Spark versions (1.6 and above). If an RDD is too large to fit in memory, the default caching option of MEMORY_ONLY can be changed to MEMORY_AND_DISK as the storage level. With this option, partitions that do not fit in memory are spilled to disk and read back when needed, rather than being recomputed. Disk storage is therefore equally important when data is temporarily moved from memory to disk. The number of cores can also be adjusted to ensure optimal results. The number of executors, the cores allotted to each executor, and the memory allotted to each executor are a few variables that may be changed for optimal use of Spark jobs.
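A PySpark sketch of these two knobs, assuming the application is submitted to a cluster; the resource numbers and the storage level are illustrative values, not recommendations:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    # Executor resources can be set when building the session (or with spark-submit
    # flags such as --num-executors, --executor-cores and --executor-memory).
    spark = (SparkSession.builder
             .appName("TuningDemo")
             .config("spark.executor.instances", "4")   # number of executors
             .config("spark.executor.cores", "4")       # cores per executor
             .config("spark.executor.memory", "8g")     # memory per executor
             .getOrCreate())

    rdd = spark.sparkContext.parallelize(range(1_000_000))

    # If the RDD may not fit in memory, spill the overflow partitions to disk
    # instead of recomputing them.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    print(rdd.count())

    spark.stop()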

Project Tungsten deserves special attention whenever optimisation in Spark is considered. Tungsten is enabled by default, so no setup is required to utilise it. Tungsten improves Spark applications' CPU and memory efficiency. The following list summarises the project's primary optimisations:

  • Memory use is decreased and memory management is optimised by avoiding JVM objects; this also eliminates much of the expense of garbage collection.
  • Data is handled in a binary format (also known as the UnsafeRow format) rather than as JVM objects; because JVM objects carry more overhead than the binary format, this speeds up processing.
  • Spark's structured APIs generate optimised bytecode for the written code, which gives good performance when writing large queries.

Writing code in Scala or Java optimises the application by avoiding serialisation problems. It is strongly advised to write user-defined functions (UDFs) in Scala for optimisation. For improved optimisation, you can also switch between the structured APIs and RDDs: when intensive computation is required, one can start with the DataFrame APIs and switch to RDDs for greater control over the application. Due to better compression, binary file formats are always preferable to text (CSV or JSON) file formats; the binary format is superior for network transport and storage. Columnar file formats (Parquet and ORC), which are also favoured with Spark, are best if you regularly need to read and compute on certain fields of a table, and columnar formats compress data well. Compression codecs such as Snappy or LZF provide very significant data compression.
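A short PySpark sketch of writing and reading a columnar, compressed format (the path and column names are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("FormatsDemo").master("local[*]").getOrCreate()

    df = spark.createDataFrame(
        [("A101", 250.0), ("A102", 99.5), ("A103", 410.75)],
        ["order_id", "amount"],
    )

    # Columnar, compressed storage; Snappy is Parquet's default codec in Spark.
    df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/orders_parquet")

    # Reading back only the columns we need benefits from the columnar layout.
    orders = spark.read.parquet("/tmp/orders_parquet")
    orders.select("amount").show()

    spark.stop()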

On the structured APIs (SQL, DataFrames, and Datasets), the Catalyst Optimizer offers a considerable level of speed optimisation. Catalyst transforms the written query into a logical query plan and then a physical query plan, which contains details such as which files to read or which tables to join. The data is handled more quickly thanks to partitioning and bucketing. Additionally, Spark SQL's shuffle partitions have a default value of 200, which controls parallelism. When we have a lot of data, we can reduce the size of each shuffle partition by increasing their number; the value of spark.sql.shuffle.partitions can be modified for proper optimisation. For RDDs, the number of partitions may be changed using coalesce() and repartition(). Spark SQL join strategies such as shuffle hash join, broadcast hash join, and Cartesian join are useful for specific kinds of optimisation. When working with large queries, setting spark.sql.codegen to true is a good idea. Slow-running jobs can be monitored using the Spark UI, and once spark.speculation is set to true, slow-running tasks can be re-executed on another node for speedy completion.
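A PySpark sketch of the settings mentioned above (the chosen values are arbitrary examples, not tuned recommendations):

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("SqlTuningDemo")
             .master("local[*]")
             .config("spark.speculation", "true")   # re-launch slow tasks on another node
             .getOrCreate())

    # Override the default of 200 shuffle partitions.
    spark.conf.set("spark.sql.shuffle.partitions", "64")

    df = spark.range(0, 1_000_000).withColumn("bucket", F.col("id") % 10)
    agg = df.groupBy("bucket").count()

    agg.explain()   # shows the physical plan produced by the Catalyst Optimizer

    # The partition count itself can also be adjusted.
    fewer = agg.coalesce(4)       # shrink the partition count without a full shuffle
    more = agg.repartition(32)    # full shuffle into 32 partitions
    print(fewer.rdd.getNumPartitions(), more.rdd.getNumPartitions())

    spark.stop()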


 
