Monitoring the Hadoop cluster while reading a file from its Distributed File System

Vinay Jadhav
5 min read · Oct 6, 2020

Abhinav Sinha1, Piyush Bagani2, Vinay Jadhav3, Ayush Chaplot4

1K.R. Mangalam University, South West Delhi, Delhi, India

2LDRP Institute of Technology & Research, Gandhinagar, Gujarat, India

3Datta Meghe College of Engineering, Airoli, Maharashtra, India

4Poornima College of Engineering, Jaipur, Rajasthan, India

Arth20 Learner’s Success Head: Anupreet Dube, TINJRIT, Udaipur, Rajasthan, India

Abstract:

Apache™ Hadoop is open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Introduction:

Hadoop is designed in such a way that it detects and handles failures that occur in its cluster. Our aim is to monitor the Hadoop cluster while reading a file from the distributed file system with the cat command.

Methodology:

We performed the following steps to monitor the Hadoop cluster:

1. Set up a Hadoop cluster with one Name Node (master), four Data Nodes (slaves), and one Client Node.

2. Upload a file through the Client to the Name Node.

3. Check which Data Nodes the master chooses to store the file on. This can be verified through the web UI that Hadoop serves on port 50070.

4. Once uploaded, read the file through the Client using the cat command, and while the file is being read, delete or stop the Data Node from which it is being served. Then monitor how Hadoop handles this situation. To monitor the Hadoop cluster, we used two tools:

i. tcptrack

ii. tcpdump
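Steps 2–4 above translate into a handful of client-side commands; a rough sketch, with the file name dummy.txt purely illustrative:

```shell
# Step 2: upload the file from the Client into HDFS
hadoop fs -put dummy.txt /dummy.txt

# Step 3: besides the web UI on port 50070, fsck (run on the Name Node)
# also lists which Data Nodes hold each block of the file
hadoop fsck /dummy.txt -files -blocks -locations

# Step 4: read the file back; stop a serving Data Node while this runs
hadoop fs -cat /dummy.txt
```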

Getting Started

We began by creating 1 Client, 1 NameNode, and 4 DataNode instances on AWS Elastic Compute Cloud (EC2), each running Red Hat Enterprise Linux 8 as the guest OS. JDK 8u171 and Hadoop 1.2.1 were then installed on all of these nodes.
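The installation itself comes down to a pair of rpm installs on every node; a sketch, with the package file names illustrative of the versions named above:

```shell
# Run on every node (NameNode, DataNodes, Client) as root.
# Package file names are examples; use the archives you actually downloaded.
rpm -ivh jdk-8u171-linux-x64.rpm     # Oracle JDK 8u171
rpm -ivh hadoop-1.2.1-1.x86_64.rpm   # Apache Hadoop 1.2.1

java -version    # verify the JDK install
hadoop version   # verify the Hadoop install
```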

● hdfs-site.xml configuration for the Name Node:

<configuration>

<property>

<name>dfs.name.dir</name>

<value>/nn</value>

</property>

</configuration>

● core-site.xml configuration for the Name Node:

<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://0.0.0.0:9001</value>

</property>

</configuration>

● hdfs-site.xml configuration for all the Data Nodes:

<configuration>

<property>

<name>dfs.data.dir</name>

<value>/dn</value>

</property>

</configuration>

● core-site.xml configuration for all the Data Nodes and the Client:

<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://65.0.27.101:9001</value>

</property>

</configuration>

**Here, 65.0.27.101 was the Public IP of our Name Node at the time of testing.
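With the configuration above in place, bringing the cluster up under Hadoop 1.x looks roughly like this (the format step wipes the metadata directory, so it is run once, on the Name Node only):

```shell
# On the Name Node: initialize the /nn metadata directory and start the daemon
hadoop namenode -format
hadoop-daemon.sh start namenode

# On each Data Node: start the daemon (it registers with the Name Node
# address configured in core-site.xml)
hadoop-daemon.sh start datanode

# Sanity check from any node: lists live Data Nodes and their capacity
hadoop dfsadmin -report
```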

Hadoop Cluster implemented during our research

First, we uploaded a dummy text file of size 157 MB and fetched its content into Client’s shell prompt, which looked something like this:

Method 1**:

To keep track of the transfer activity between the Client and the Data Nodes in real time, our first attempt used a simple yet quite helpful third-party tool called tcptrack. Running the tcptrack -i eth0 command displays the active connections in the terminal.
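A useful refinement: tcptrack accepts a pcap-style filter expression after the interface, and in Hadoop 1.x the Data Nodes serve block data on port 50010 by default, so the display can be narrowed to HDFS block transfers only:

```shell
# Show only HDFS block-transfer connections (port 50010 is the Hadoop 1.x
# default for dfs.datanode.address); requires root for packet capture
tcptrack -i eth0 port 50010
```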

In the image shown below, the Data Node at IP 13.127.80.171 has connected to the Client and begun transferring the data from the required block.

But during this process, we stopped that Data Node’s instance while the data was being transferred, and observed that within a couple of seconds a new data transaction began between the Client and the Data Node at IP 13.212.35.117, as shown below.

And the same occurred when we stopped that Data Node as well: a new Data Node at IP 15.206.166.29 showed up in the list.

** The IP addresses seen in this method differ from those in the diagram because the research was performed on two different days, and the public IP address of every EC2 instance changes each time the instance is stopped and started.

Method 2:

The image below shows the first Data Node (XX.XX.254.94) connected to the Client while fetching the dummy text content.

Upon stopping the Data Node’s instance, another Data Node (XX.XX.41.146) connected to the Client and resumed the pending transaction.

And again, the same was observed upon stopping this instance as well, with a new Data Node (XX.XX.213.64) connecting to the Client.
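The captures in this method can be reproduced with tcpdump, the second tool listed in the methodology; a sketch of the invocation on the Client (interface name eth0 assumed, 50010 being the default Hadoop 1.x block-transfer port):

```shell
# Capture HDFS block-transfer traffic on the Client; -n skips DNS lookups
# so the Data Node IP addresses are printed as-is
tcpdump -i eth0 -n tcp port 50010
```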

Results:

We found that even if we stop the Data Node from which the Client is reading the file, the read does not stop.

This is due to a concept called Replication. HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
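For our 157 MB dummy file under the Hadoop 1.x defaults (64 MB blocks via dfs.block.size, replication factor 3 via dfs.replication), the arithmetic works out as follows:

```shell
FILE_MB=157
BLOCK_MB=64
REPLICATION=3

# Ceiling division: the last block may be smaller than the rest
BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))
echo "blocks: $BLOCKS"                               # 3 blocks: 64 MB + 64 MB + 29 MB
echo "stored copies: $(( BLOCKS * REPLICATION ))"    # 9 block replicas across the Data Nodes
```

This is why stopping one Data Node mid-read never interrupted the transfer: two other replicas of the block being fetched were always available elsewhere in the cluster.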

Conclusion:

By the end of this task, we:

  1. Successfully determined the IP of those Data Nodes which transfer the data in blocks to the client while fetching a dummy text content in real time.
  2. Successfully observed how Hadoop deals with sudden Data Node failures, in which case the replicas of the blocks being transferred come into action and fulfill the pending data transactions.
  3. Successfully determined which node deals with uploading the data in blocks to the cluster’s Data Nodes, which turns out to be done by the Client.
  4. Successfully understood the process of replicating the blocks within the cluster, which turns out to be done internally by the Data Nodes by connecting to each other through the network.
