Most Asked Big Data Interview Questions and Answers
Table of Contents
- What does Master/Slave Architecture mean in Hadoop?
- Explain Heartbeat in HDFS?
- What is fault tolerance in HDFS?
- What are Hadoop Daemons?
- How are HDFS blocks replicated?
- What tools/programming languages are preferred for Big Data?
- Name the tools which are helpful to extract Big data?
- Hadoop 1.x is known as a Single point of failure. How to recover from it?
- Explain the Hadoop framework’s configuration files?
What does Master/Slave Architecture mean in Hadoop?
Hadoop is a distributed framework: multiple nodes (systems) are interconnected to work as a single powerful system. Hadoop serves as both the storage layer and the compute engine.
To manage the nodes in the cluster, Hadoop uses a master/slave architecture for both its distributed storage and its distributed computation components. The master node controls operations and manages the entire cluster. Slave nodes carry out the tasks assigned by the master node.
Hadoop Architecture can be explained as:
- Storage Component – HDFS
- Compute Engine – MapReduce
- Cluster Manager – YARN
Hadoop has 5 Java background services (daemons) to support these components:
- NameNode
- Secondary NameNode
- DataNode
- Resource Manager (JobTracker)
- Node Manager (TaskTracker)
Storage component services:
Master: NameNode
NameNode executes and manages file system operations such as opening, closing, and renaming files and directories.
Slaves: DataNode
HDFS exposes a file system to clients, which allows them to store files. These files are split into blocks, which are stored on the DataNode(s). DataNodes are responsible for creating, replicating, and deleting blocks as instructed by the NameNode.
Computation component services:
Master: Resource Manager (JobTracker)
Resource Manager (JobTracker) is the service that manages the interaction between users and map/reduce tasks. When map/reduce jobs are submitted, Resource Manager (JobTracker) puts them in a queue, executes them on a first-come, first-served basis, and assigns the map and reduce tasks to the Node Managers (TaskTrackers).
Slaves: Node Manager (TaskTracker)
Node Manager (TaskTracker) executes tasks on instruction from the master, Resource Manager (JobTracker), and handles data movement between the map and reduce stages.
Explain Heartbeat in HDFS?
Each DataNode sends a signal to the NameNode at regular intervals (every 3 seconds by default) to indicate that it is alive. This signal is called a heartbeat.
If the NameNode stops receiving heartbeats from a DataNode for longer than a timeout (about 10.5 minutes with default settings), that node is considered dead. Heartbeats also carry information such as the data transfers in progress, the storage used on the DataNode, and its total storage capacity.
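The dead-node rule can be sketched in a few lines of Python. This is a simplified model, not Hadoop's actual code: the `HeartbeatMonitor` class and its methods are hypothetical, and the timeout formula (2 × recheck interval + 10 × heartbeat interval) assumes the stock HDFS defaults of a 5-minute recheck interval and a 3-second heartbeat.

```python
HEARTBEAT_INTERVAL_S = 3    # default DataNode heartbeat interval
RECHECK_INTERVAL_S = 300    # assumed default NameNode recheck interval (5 minutes)


def dead_node_timeout(heartbeat_s=HEARTBEAT_INTERVAL_S, recheck_s=RECHECK_INTERVAL_S):
    """Seconds of silence after which a DataNode is treated as dead."""
    return 2 * recheck_s + 10 * heartbeat_s


class HeartbeatMonitor:
    """Hypothetical stand-in for the NameNode's liveness bookkeeping."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # DataNode id -> time of last heartbeat (seconds)

    def record_heartbeat(self, node_id, now_s):
        self.last_seen[node_id] = now_s

    def dead_nodes(self, now_s):
        # A node is dead when its last heartbeat is older than the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now_s - t > self.timeout_s)


monitor = HeartbeatMonitor(dead_node_timeout())
monitor.record_heartbeat("dn1", now_s=0)
monitor.record_heartbeat("dn2", now_s=0)
monitor.record_heartbeat("dn1", now_s=600)  # dn1 keeps reporting, dn2 goes silent

print(dead_node_timeout())            # 630
print(monitor.dead_nodes(now_s=700))  # ['dn2']
```

A real NameNode also schedules re-replication of the blocks held by a dead node; that step is left out of this sketch.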
What is fault tolerance in HDFS?
Hadoop is a distributed computing framework in which both data and computation are distributed across a cluster of nodes. Data is divided into multiple blocks that are stored on different nodes. If each block existed on only one node, losing that node would mean losing data. To prevent this, Hadoop makes multiple copies of each data block and stores them on different DataNodes. When a node fails, the data can still be retrieved from the other nodes, and this is how fault tolerance is achieved in HDFS.
What are Hadoop Daemons?
Hadoop runs 5 Java daemon processes in the background. 3 of these daemons belong to the storage component, HDFS, and 2 belong to the computation component. 3 of the daemons run on the master node and 2 run on each slave node.
On the Master Node, the 3 daemon processes are-
- NameNode - This service maintains the metadata for HDFS.
- Resource Manager (JobTracker) – Manages MapReduce Jobs.
- Secondary NameNode – Handles housekeeping functions for the NameNode, such as backing up its metadata.
On Slave Node, the 2 background processes are-
- DataNode - This service stores the HDFS data blocks.
- Node Manager (TaskTracker) - This service instantiates and monitors each map-reduce task.
How are HDFS blocks replicated?
Hadoop is a distributed computing framework in which both data and computation are distributed across a cluster of nodes. Data is divided into multiple blocks that are stored on different nodes. Without extra copies, the failure of a single node would cause data loss, so Hadoop replicates each block: it makes multiple copies of the same block and stores them on different nodes. The replication factor is 3 by default.
With the default placement policy, the first replica is stored on the machine on which the client is running (if that machine is a DataNode), the second replica is stored on a randomly chosen node in a different rack, and the third replica is stored on a different, randomly selected node in the same remote rack as the second replica.
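The placement rule can be illustrated with a short Python sketch. It is only a model of the policy described above, not HDFS code: the `place_replicas` function and the rack/node names are hypothetical, and it assumes the client runs on a DataNode and that every remote rack has at least two nodes.

```python
import random


def place_replicas(cluster, client_node, rng=random):
    """Pick three DataNodes following the default placement rule.

    cluster: dict mapping rack name -> list of node names (hypothetical layout).
    client_node: node the writer runs on; assumed to be a DataNode in the cluster.
    """
    local_rack = next(r for r, nodes in cluster.items() if client_node in nodes)
    # First replica: the client's own node.
    first = client_node
    # Second replica: a random node on a different rack.
    remote_rack = rng.choice([r for r in cluster if r != local_rack])
    second = rng.choice(cluster[remote_rack])
    # Third replica: a different node on the same remote rack as the second.
    third = rng.choice([n for n in cluster[remote_rack] if n != second])
    return [first, second, third]


cluster = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4"],
    "rack3": ["n5", "n6"],
}
replicas = place_replicas(cluster, client_node="n1", rng=random.Random(7))
print(replicas)
```

Keeping two replicas on one remote rack limits cross-rack write traffic while still surviving the loss of a whole rack.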
What tools/programming languages are preferred for Big Data?
Many tools work with Big Data platforms. The tool or programming language is chosen based on the requirement.
Hadoop is a Big Data framework that can be treated as a platform or operating system on which multiple software tools work together to obtain the desired result. The platform combines the strength of multiple systems (computers interconnected to form a cluster) to handle huge amounts of data storage and processing.
To store huge amounts of data, Hadoop uses a data partitioning (sharding) concept, in which a single data file is broken down into smaller units called blocks (default size: 64 MB in Hadoop 1.x, 128 MB in Hadoop 2.x and later).
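As a quick worked example, the following Python sketch shows how a file would be cut into blocks. The `split_into_blocks` helper is hypothetical, and it assumes whole-megabyte sizes and the 128 MB default block size.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file of the given size occupies."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover data
    return sizes


blocks = split_into_blocks(500)
print(blocks)           # [128, 128, 128, 116]
print(len(blocks) * 3)  # 12 block replicas with a replication factor of 3
```

Note that the last block takes up only as much space as the data it holds, not a full 128 MB.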
To process the data, the MapReduce component is used: Map tasks retrieve the relevant data from the distributed blocks and emit intermediate key-value pairs, and Reduce tasks aggregate those pairs to produce the result.
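The map and reduce steps can be mimicked in plain Python with the classic word-count example. This is an in-memory sketch of the programming model only, not the Hadoop MapReduce API: the function names are hypothetical, and each string stands in for one HDFS block.

```python
from collections import defaultdict


def map_phase(block):
    """Map task: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce task: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


blocks = ["big data big cluster", "data node data"]
intermediate = [pair for block in blocks for pair in map_phase(block)]
word_counts = reduce_phase(shuffle(intermediate))
print(word_counts)  # {'big': 2, 'data': 3, 'cluster': 1, 'node': 1}
```

In a real cluster the map tasks run in parallel on the nodes that hold the blocks, and the shuffle moves the intermediate pairs across the network to the reducers.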
Hadoop can also work with other tools developed to run alongside it. A cluster manager called YARN (Yet Another Resource Negotiator) provides the required resources to these tools for processing.
Example: To perform ETL tasks, an open-source tool called Pig is popularly used. Apache Pig was originally developed at Yahoo.
To perform structured data analysis Apache Hive is used. Hive is SQL on Hadoop.
Similarly, we have Mahout to perform Machine Learning tasks on the data residing on HDFS.
R and Python, two popular programming languages for data analytics, are both supported on Big Data platforms. R is known for its maturity in statistical analysis and its visualization libraries.
Python is known for its Machine Learning, Deep Learning, and AI development libraries.
Hadoop itself is developed in Java, one of the most widely used programming languages, so Java is supported by default.
Apache Spark, a distributed computing framework, is gaining popularity among data analysts. Spark processes data in memory and is used for fast, near-real-time analysis, which the batch-oriented MapReduce model of Hadoop does not support.
Big Data platforms also work with various other languages and tools such as C, C++, Scala, SAS, MATLAB, SPSS, etc.
Name the tools which are helpful to extract Big data?
Businesses generate data from various sources. This data needs to be stored and processed to find insights that help management make timely and informed decisions. The data generated by different sources should be extracted, aggregated, and processed. Importing or extracting the data onto the Big Data platform requires specialized tools. There are numerous tools available for ingestion of structured and unstructured data. Example: NiFi, Sqoop, Kafka, Talend, Chukwa, Flume, Morphlines, Scriptella, etc.
Data ingestion can be broadly classified into 2 types: offline (batch) or online (real-time).
Ingestion tools must account for various factors when moving data from source to destination, such as file formats, hardware capabilities of the source and destination, different protocols, scalability, and security aspects.
The servers (source and destination) may be on-premises or in the cloud. The tools used for ingestion come with features like data security, reliability, fault tolerance, scalability, etc.
Hadoop 1.x is known as a Single point of failure. How to recover from it?
Hadoop has a master/slave architecture, and its services are managed by 5 daemons: NameNode, DataNode, Secondary NameNode, ResourceManager, and NodeManager. The NameNode is a single service that acts as the master and controls operations in the Hadoop cluster. If the node running the NameNode service (the master node) fails, the cluster cannot be accessed. This condition is called a single point of failure.
Hadoop has a mechanism to back up the metadata (FsImage and EditLogs) every 60 minutes onto the Secondary NameNode. If the cluster becomes inaccessible because the master node fails, this backup is used to recover the metadata to the point in time of the latest FsImage backup.
This approach is not foolproof, and data may still be lost, because the default interval of the scheduled metadata backup is 60 minutes. If the master node crashes 50 minutes after the latest metadata backup, the system can be recovered only up to that backup, and all transactions that happened in those 50 minutes are lost.
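The loss window can be made concrete with a small Python sketch. It is a toy model under stated assumptions, not Hadoop's recovery procedure: the `recover` function and the edit list are hypothetical, backups are assumed to land exactly on 60-minute boundaries, and only the copy on the Secondary NameNode is assumed to survive the crash.

```python
CHECKPOINT_INTERVAL_MIN = 60  # default backup interval described above


def recover(edits, crash_minute, interval=CHECKPOINT_INTERVAL_MIN):
    """Split edits into those captured by the last backup and those lost.

    edits: list of (minute, operation) tuples, a hypothetical stand-in for the edit log.
    """
    last_checkpoint = (crash_minute // interval) * interval
    recovered = [op for minute, op in edits if minute <= last_checkpoint]
    lost = [op for minute, op in edits if last_checkpoint < minute <= crash_minute]
    return last_checkpoint, recovered, lost


edits = [
    (10, "mkdir /logs"),
    (55, "create /logs/a"),
    (75, "create /logs/b"),
    (105, "delete /logs/a"),
]
# The master crashes at minute 110, i.e. 50 minutes after the backup at minute 60.
checkpoint, recovered, lost = recover(edits, crash_minute=110)
print(checkpoint)  # 60
print(recovered)   # ['mkdir /logs', 'create /logs/a']
print(lost)        # ['create /logs/b', 'delete /logs/a']
```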
Explain the Hadoop framework’s configuration files?
To configure the Hadoop framework, we set parameters in its configuration files.
The configuration files that need to be edited to set up a Hadoop environment are:
core-site.xml – Hadoop core configuration and runtime environment settings are defined in this file. It specifies the hostname and port of the default file system (the NameNode).
hdfs-site.xml – This file contains the settings for the storage component, HDFS. The replication factor and block size can be defined in this file.
yarn-site.xml – ResourceManager and NodeManager settings are defined in this file.
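mapred-site.xml – MapReduce related settings are defined in this file. From Hadoop 2.x, the JobTracker and TaskTracker settings in this file are obsolete: the file is used to select YARN as the MapReduce execution framework, and the YARN services themselves are configured in yarn-site.xml.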
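All of these files share the same layout: a `configuration` root element holding `property` entries, each with a `name` and a `value`. The Python sketch below generates that layout for two of the files. The `build_site_xml` helper is hypothetical, and the hostname and port in `fs.defaultFS` are placeholders; `dfs.replication`, `dfs.blocksize`, and `fs.defaultFS` are standard Hadoop property names.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict of settings in the <configuration>/<property> layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# hdfs-site.xml: replication factor and block size (128 MB expressed in bytes).
hdfs_site = build_site_xml({
    "dfs.replication": 3,
    "dfs.blocksize": 128 * 1024 * 1024,
})

# core-site.xml: placeholder NameNode hostname and port, replace with your own.
core_site = build_site_xml({"fs.defaultFS": "hdfs://namenode.example.com:9000"})

print(hdfs_site)
print(core_site)
```

In practice these files are usually edited by hand under the Hadoop configuration directory; generating them from code is mainly useful for automated cluster setup.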