Apache Spark Interview Questions and Answers

  • October 28, 2022
  • How is Apache Spark different from MapReduce?

    • a) Spark is open-source, whereas MapReduce is commercial
    • b) MapReduce is fault-tolerant and Spark isn't
    • c) Both platforms support real-time processing
    • d) Spark uses in-memory computation, whereas MapReduce is disk-based

    Answer - d) Spark uses in-memory computation, whereas MapReduce is disk-based

    Apache Spark and MapReduce differ in several respects. Data processing: Spark keeps intermediate data in memory where possible, whereas MapReduce writes intermediate results to disk between stages. Speed: because of that disk I/O, MapReduce is comparatively slow; Spark is often cited as up to 100x faster for in-memory workloads. Programming model: MapReduce exposes only a low-level map-and-reduce API, whereas Spark offers high-level APIs in several languages (Scala, Java, Python, SQL, and R). Processing modes: Spark supports both batch and real-time (streaming) processing, whereas MapReduce supports batch processing only. The other options fail because both are open-source Apache projects, both are fault-tolerant, and MapReduce does not support real-time processing.
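
    The disk-versus-memory difference described above can be illustrated with a toy model. The sketch below is plain Python, not Spark or Hadoop code: the function names run_disk_based and run_in_memory, the JSON files, and the disk_ops counter are all assumptions made for illustration. It only shows why a pipeline that persists every intermediate stage pays an I/O cost that an in-memory pipeline avoids.

```python
import json
import os
import tempfile


def run_disk_based(records, stages, workdir):
    """MapReduce-style toy: each stage reads its input from disk and
    writes its output back to disk before the next stage starts."""
    disk_ops = 0
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:          # initial input lands on disk
        json.dump(records, f)
    disk_ops += 1

    for i, stage in enumerate(stages, start=1):
        with open(path) as f:           # read the previous stage's output
            data = json.load(f)
        disk_ops += 1
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:      # persist this stage's output
            json.dump([stage(x) for x in data], f)
        disk_ops += 1

    with open(path) as f:               # read the final result
        result = json.load(f)
    disk_ops += 1
    return result, disk_ops


def run_in_memory(records, stages):
    """Spark-style toy: intermediate results stay in memory, so no
    disk operations are counted between stages."""
    data = list(records)
    for stage in stages:
        data = [stage(x) for x in data]
    return data, 0


# Hypothetical three-stage pipeline over a tiny dataset.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
records = list(range(5))

with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_ops = run_disk_based(records, stages, workdir)
memory_result, memory_ops = run_in_memory(records, stages)

print("same result:", disk_result == memory_result)
print("disk operations:", disk_ops, "vs", memory_ops)
```

    Both pipelines compute the same result; the disk-based one simply performs a read and a write per stage. Real Spark can still spill to disk when data does not fit in memory, so this is a simplification of the general idea rather than a performance model.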

  • Which of these is not an Apache Spark feature?

    • a) Lazy Evaluation
    • b) Real-time Processing
    • c) Batch-mode Processing only
    • d) In-Memory Computation

    Answer - c) Batch-mode Processing only

    Apache Spark is known as a fast in-memory cluster computing framework. Several features make it a common choice for data analysts, data engineers, and data scientists. Low latency: Spark achieves high processing speed by reducing read-write operations to disk; it is commonly cited as up to 100x faster for in-memory computation and up to 10x faster for on-disk computation. In-memory computation: keeping intermediate data in memory speeds up data processing. Spark also represents each job as a directed acyclic graph (DAG) of operations, which captures data lineage and helps it plan execution efficiently. Batch and real-time: the same Spark code can be reused for batch processing, data streaming, ad-hoc queries, and so on. Fault tolerance: Spark is fault-tolerant. It uses special data abstractions called RDDs (Resilient Distributed Datasets), which are in-memory abstractions of the data,

