In this Hadoop Interview Questions and Answers guide, we cover the top 25+ Hadoop interview questions and answers. The guide includes Hadoop scenario-based interview questions, Hadoop interview questions for freshers, and Hadoop interview questions and answers for experienced candidates. Working through this collection will prepare you to face a Hadoop interview with confidence.
So, if you have decided to start your career in Big Data or Hadoop, then you're in the right place. From basics to complex Hadoop interview questions, we have got everything covered. In this guide, we have categorized all these Hadoop interview questions and answers according to the Hadoop ecosystem components:
Hadoop Interview Questions
HDFS Hadoop Interview Questions
MapReduce Hadoop Interview Questions
Hadoop Interview Questions
Question: Difference between Hadoop and Traditional RDBMS
| | Hadoop | Traditional RDBMS |
|---|---|---|
| Data Types | Semi-structured and unstructured data | Structured data |
| Schema | Schema on read | Schema on write |
| Applications | Data discovery and massive storage/processing of unstructured data | Best suited for OLTP and complex ACID transactions |
| Speed | Writes are fast | Reads are fast |
Question: Define the Four V's of Big Data
The four V's of Big Data are:
- Volume – the scale of data
- Velocity – the speed at which data is generated and analyzed (streaming data)
- Variety – the different forms of data
- Veracity – the uncertainty of data
Question: Define HDFS and YARN
HDFS – Hadoop Distributed File System. It is the storage unit of Hadoop, responsible for storing all kinds of data as blocks in a distributed environment. HDFS follows a master-slave architecture.
YARN – Yet Another Resource Negotiator. It is the processing framework in Hadoop; it manages cluster resources and provides an execution environment for the processes.
Question: Discuss the Various Hadoop Components and Their Roles in a Hadoop Cluster
- NameNode: The master node, responsible for storing the metadata of all files and directories. It knows which blocks make up a file and where those blocks are located in the cluster.
- DataNode: The slave node that stores the actual data.
- Secondary NameNode: Periodically merges the changes (edit log) with the FsImage (Filesystem Image) present in the NameNode. It stores the modified FsImage in persistent storage, which can be used in case of NameNode failure.
- ResourceManager: The central authority that manages resources and schedules applications running on top of YARN.
- NodeManager: Runs on the slave machines and is responsible for launching the applications' containers (where applications execute their parts), monitoring their resource usage (CPU, memory, disk, network), and reporting these to the ResourceManager.
- JobHistoryServer: It maintains information about MapReduce jobs after the Application Master terminates.
Question: Define Active and Passive “NameNodes”?
The Active "NameNode" is the "NameNode" that runs in the cluster and serves all client requests.
The Passive "NameNode" is a standby "NameNode" that holds the same metadata as the active "NameNode" and takes over when the active "NameNode" fails, so the cluster is never without a working "NameNode".
Question: How Namenode Handles Datanode Failures?
The NameNode periodically receives a Heartbeat (signal) from each DataNode in the cluster, which implies the DataNode is functioning properly. If a DataNode fails to send heartbeats for a specific period of time, it is marked dead. The NameNode then re-replicates the blocks of the dead node to other DataNodes, using the replicas created earlier.
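The timeout logic can be sketched in plain Python. This is a toy model, not Hadoop's implementation: the class names and the 10-second timeout are illustrative (the real default dead-node interval works out to about 10.5 minutes, derived from the heartbeat and recheck intervals).

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative, not Hadoop's real default

class NameNodeSketch:
    """Toy model of dead-node detection via heartbeat timestamps."""

    def __init__(self):
        self.last_heartbeat = {}  # DataNode id -> timestamp of last heartbeat

    def receive_heartbeat(self, datanode_id, now=None):
        self.last_heartbeat[datanode_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        # A node whose last heartbeat is older than the timeout is marked dead.
        now = now if now is not None else time.time()
        return [dn for dn, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT]

# Usage: dn2 stops sending heartbeats and is eventually marked dead.
nn = NameNodeSketch()
nn.receive_heartbeat("dn1", now=0.0)
nn.receive_heartbeat("dn2", now=0.0)
nn.receive_heartbeat("dn1", now=60.0)   # dn1 keeps reporting
print(nn.dead_nodes(now=61.0))          # → ['dn2']
```

Once a node appears in the dead list, the real NameNode schedules re-replication of that node's blocks from the surviving replicas.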
Question: Structured Data vs Unstructured Data – Difference?
Data that can be stored in database systems in the form of rows and columns is referred to as structured data; online purchase transactions are one example. Data that can be stored only partially in database systems is called semi-structured data; XML records are one example.
Furthermore, unorganized and raw data that cannot be categorized as structured or semi-structured is referred to as unstructured data. A few examples of unstructured data: Facebook updates, tweets on Twitter, reviews, web logs, etc.
Question: Name the Best Configuration to Run Hadoop?
In my view, the best configuration for executing Hadoop jobs is dual-core machines or dual processors with 4 GB or 8 GB RAM that use ECC memory. ECC memory is recommended even though it is not low-end hardware, because many Hadoop users have experienced checksum errors when using non-ECC memory. However, the hardware configuration also depends on the workflow requirements and can change accordingly.
Question: List Out The Most Common Defined Input Formats in Hadoop?
- TextInputFormat – The default input format in Hadoop. Files are broken into lines; the key is the byte offset of the line and the value is the line itself.
- KeyValueTextInputFormat – Used for plain text files where each line is split into a key and a value by a separator character (a tab by default).
- SequenceFileInputFormat – Used for reading sequence files, Hadoop's binary key-value file format.
Question: How to Choose Different File Formats For Storing and Processing Data using Apache Hadoop?
The factors that determine the choice of file format are as follows:
i) Schema evolution – the ability to add, alter, and rename fields.
ii) Usage pattern – e.g., accessing 5 columns out of 50 vs accessing most of the columns.
iii) Splittability – whether the file can be processed in parallel.
iv) Read/write/transfer performance vs block compression for saving storage space.
File formats that can be used with Hadoop include CSV, JSON, sequence files, Avro, Parquet, and other columnar formats such as ORC.
HDFS Hadoop Interview Questions
Question: Define Block and Block Scanner in HDFS?
Block – The minimum amount of data that HDFS can read or write is referred to as a Block. The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later.
Block Scanner – The Block Scanner tracks the list of blocks present on a DataNode and verifies them to detect checksum errors. Block scanners use a throttling mechanism to limit the disk bandwidth they consume on the DataNode.
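The verification idea can be sketched in a few lines of Python. This is a simplified model: it uses one CRC32 per block for illustration, whereas HDFS actually stores CRC32C checksums per fixed-size chunk of each block.

```python
import zlib

def make_block(data: bytes):
    """Store a block together with a checksum computed at write time.
    (One CRC32 per block here; HDFS uses per-chunk CRC32C checksums.)"""
    return {"data": data, "checksum": zlib.crc32(data)}

def scan_blocks(blocks):
    """Return the ids of blocks whose data no longer matches the stored
    checksum - the essence of what the Block Scanner detects."""
    corrupt = []
    for block_id, block in blocks.items():
        if zlib.crc32(block["data"]) != block["checksum"]:
            corrupt.append(block_id)
    return corrupt

blocks = {
    "blk_1": make_block(b"hello"),
    "blk_2": make_block(b"world"),
}
blocks["blk_2"]["data"] = b"w0rld"   # simulate on-disk corruption
print(scan_blocks(blocks))           # → ['blk_2']
```

When the real scanner finds a corrupt block, it reports it to the NameNode, which re-replicates a good copy from another DataNode.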
Question: Difference Between Check Point Node and Backup Node
Checkpoint Node – The Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode's directory. It creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.
Backup Node – The Backup Node also provides checkpointing functionality like the Checkpoint Node, but it additionally maintains an up-to-date in-memory copy of the file system namespace that is always in sync with the active NameNode.
Question: Mention the Port Numbers for the Name Node, Task Tracker, and Job Tracker
Name Node – 50070
Task Tracker – 50060
Job Tracker – 50030
Question: Describe the Process of Inter Cluster Data Copying
HDFS provides a distributed data-copying facility through DistCp, which copies data from a source to a destination. When the copy runs between two different Hadoop clusters, it is referred to as inter-cluster data copying. DistCp requires both the source and the destination to run the same or a compatible version of Hadoop.
Question: What is the Difference Between NAS and HDFS?
NAS runs on a single machine, so there is no data redundancy, whereas HDFS runs on a cluster of machines and provides redundancy through its replication protocol.
NAS stores data on dedicated hardware, whereas in HDFS the data blocks are distributed across the local drives of the machines in the cluster.
In NAS, data is stored independently of the computation, so Hadoop MapReduce cannot be used for processing, whereas HDFS works with Hadoop MapReduce because the computation is moved to the data.
Question: Can Files Be Modified at Arbitrary Locations in HDFS?
No. HDFS does not support writes at arbitrary offsets in a file, nor multiple concurrent writers. Files are written by a single writer in append-only fashion, i.e. writes to a file in HDFS are always made at the end of the file.
Question: What Happens When a User Submits a Hadoop Job While the Job Tracker Is Down? Does the Job Go On Hold, or Does It Fail?
The Hadoop job fails when the Job Tracker is down.
Question: When a client submits a hadoop job, who receives it?
When a client submits a Hadoop job, the JobTracker receives it and takes care of resource allocation and scheduling to ensure timely completion. The JobTracker consults the NameNode, which looks up the data requested by the client and provides the block location information.
Hadoop MapReduce Interview Questions and Answers
Question: Describe the Usage of Context Object
The Context Object allows the mapper to interact with the rest of the Hadoop system. It can be used to update counters, report progress, and provide application-level status updates. The Context Object also holds the configuration details for the job and exposes the interfaces through which the mapper emits its output.
Question: Briefly Explain the Process of Partitioning, Shuffle and Sort Phase
Shuffle Phase-Once the first map tasks are completed, the nodes continue to perform several other map tasks and also exchange the intermediate outputs with the reducers as required. This process of moving the intermediate outputs of map tasks to the reducer is referred to as Shuffling.
Sort Phase– Hadoop MapReduce automatically sorts the set of intermediate keys on a single node before they are given as input to the reducer.
Partitioning Phase – The process that determines which reducer instance will receive each intermediate key-value pair is referred to as partitioning. The destination partition is the same for a given key irrespective of the mapper instance that generated it.
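The three phases can be sketched together in Python. This is a language-agnostic toy model, not Hadoop's Java implementation; CRC32 stands in for the key hash so the behavior is deterministic.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Deterministic stand-in for Hadoop's default HashPartitioner:
    (hash of key) mod (number of reducers)."""
    return zlib.crc32(key.encode()) % num_reducers

def shuffle_and_sort(map_outputs, num_reducers):
    """Route each intermediate (key, value) pair to its reducer
    (partitioning + shuffle), then sort each reducer's input by key
    (sort phase)."""
    per_reducer = defaultdict(list)
    for key, value in map_outputs:
        per_reducer[partition(key, num_reducers)].append((key, value))
    return {r: sorted(pairs) for r, pairs in per_reducer.items()}

# Usage: every occurrence of the same key lands in the same partition,
# and within each partition the pairs reach the reducer sorted by key.
map_outputs = [("cat", 1), ("dog", 1), ("cat", 1), ("ant", 1)]
print(shuffle_and_sort(map_outputs, num_reducers=2))
```

Note the guarantee the partition function provides: both `("cat", 1)` pairs end up in the same partition regardless of which mapper emitted them, which is exactly what lets a reducer see all values for a key.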
Question: How to Create a Custom Partitioner for a Hadoop Map Reduce Job?
- A new class must be created that extends the pre-defined Partitioner Class.
- getPartition method of the Partitioner class must be overridden.
- Set the custom partitioner on the job, either in a configuration file read by the wrapper that runs Hadoop MapReduce, or programmatically via the Job.setPartitionerClass() method.
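The steps above can be illustrated with a Python analogue of the Java classes. The class and rule below are hypothetical, used only to show the subclass-and-override pattern; in real Hadoop you would extend org.apache.hadoop.mapreduce.Partitioner and override getPartition in Java.

```python
import zlib

class Partitioner:
    """Stand-in for Hadoop's Partitioner base class; the default
    behavior mimics HashPartitioner."""
    def get_partition(self, key, value, num_partitions):
        return zlib.crc32(str(key).encode()) % num_partitions

class FirstLetterPartitioner(Partitioner):
    """Hypothetical custom rule: keys starting with a-m go to
    reducer 0, everything else to reducer 1."""
    def get_partition(self, key, value, num_partitions):
        return 0 if key[:1].lower() <= "m" else 1

# Usage: the overridden method replaces the default hash-based routing.
p = FirstLetterPartitioner()
print(p.get_partition("apple", 1, 2))   # → 0
print(p.get_partition("zebra", 1, 2))   # → 1
```

A custom partitioner like this is typically used to control load balancing across reducers or to guarantee that related keys land in the same output file.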
Question: List Out The Different Operational Commands in HBase at Record Level and Table Level?
Record Level – put, get, increment, scan and delete.
Table Level – create, describe, list, disable, drop, and alter.
Question: Define Row Key
Every row in an HBase table has a unique identifier known as the RowKey. It is used for grouping cells logically, and it ensures that all cells with the same RowKey are co-located on the same server. Internally, the RowKey is treated as a byte array.
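Because RowKeys are compared as raw bytes, rows are stored in lexicographic byte order, which is why key design matters for scan locality. A small Python sketch (the key values are made up for illustration):

```python
# HBase stores rows sorted lexicographically by RowKey bytes, so rows that
# share a key prefix are physically adjacent and can be scanned together.
row_keys = [b"user42#2024-01-03", b"user7#2024-01-01", b"user42#2024-01-01"]

for key in sorted(row_keys):   # byte-wise lexicographic order, as in HBase
    print(key.decode())
```

Note that both "user42" rows end up adjacent, while "user7" sorts after "user42" because the comparison is byte-wise, not numeric ('7' > '4' as a byte). Zero-padding or otherwise fixing the width of numeric components is a common key-design remedy.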
Question: State the Difference Between HBase and Hive?
HBase and Hive are completely different Hadoop-based technologies: Hive is a data warehouse infrastructure on top of Hadoop, whereas HBase is a NoSQL key-value store that runs on top of Hadoop. Hive lets SQL-savvy people run MapReduce jobs, whereas HBase supports four primary operations: put, get, scan, and delete. HBase is ideal for real-time querying of big data, whereas Hive is an ideal choice for analytical querying of data collected over a period of time.
Question: List Out Three Different Types of Tombstone Markers in HBase for Deletion?
Family Delete Marker – marks all the columns of a column family.
Version Delete Marker – marks a single version of a column.
Column Delete Marker – marks all the versions of a column.
The Future of Big Data and Hadoop Job Trends
If you are applying for a Hadoop job role, it is best to be prepared to answer any Hadoop interview question that might come your way.
Did you find these Hadoop interview questions useful, and do you think a prospective Hadooper will benefit from them? If yes, then please use the social media share buttons to help the big data community at large. We will keep updating this list of Hadoop interview questions to suit current industry standards. Stay tuned and connected with Softwareguiders.com!