What is Big Data?

Vinay Jadhav
4 min read · Sep 17, 2020

Have you ever wondered how Facebook, Instagram, Twitter and other platforms store your account data and make it available to you whenever you need it?

Well, if you think they store it all on one huge hard disk, you are wrong, my friend. A lot happens behind the scenes.

First, let’s see some facts about data:

A data center normally holds petabytes to exabytes of data. Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters.

1 petabyte = 1,000,000 gigabytes

Facebook generates 4 petabytes of data per day, that is, 4 million gigabytes. All that data is stored in what is known as the Hive, which holds about 300 petabytes of data.

Every day, 306.4 billion emails are sent, and 5 million Tweets are made.

95 million photos and videos are shared every day on Instagram.

By the end of 2020, the entire digital universe is expected to reach 44 zettabytes.

So, how do these companies manage such a huge volume of data?

What is Big Data?

According to Wikipedia:

Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.

Big Data is not actually a technology; it is a problem that we face in today's world. It's about how we manage such a huge amount of data.

Three Sub-problems in Big Data:

  1. Volume: The quantity of data generated and stored. The size of the data determines its value and potential insight, and whether it can be considered big data at all.
  2. Velocity: The speed at which the data is generated and must be processed to meet the demands and challenges of growth and development.
  3. Variety: The type and nature of the data. Earlier technologies like RDBMSs were capable of handling structured data efficiently and effectively. However, the shift from structured to semi-structured and unstructured data challenged the existing tools and technologies.

So, here we will see how we solve the Volume and Velocity problems in big data:

To solve the volume and velocity problems, we use a concept called Distributed Storage.

Master-Slave Model

In Distributed Storage, we split the data into smaller blocks and save them on the hard disks of different commodity machines. This is the master-slave model: a master node keeps track of which block lives on which machine, while the slave nodes actually store the blocks.
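As a rough illustration, here is a small sketch of that idea. Everything in it (the node names, the block size, the round-robin placement) is made up for illustration; a real system such as HDFS has its own, more sophisticated placement policy.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockPlacement {

    public static void main(String[] args) {
        // Hypothetical cluster: the names and sizes below are made up for illustration.
        List<String> slaveNodes = List.of("slave-1", "slave-2", "slave-3", "slave-4");
        long fileSizeGb = 40_000;   // one 40 TB file, expressed in GB
        long blockSizeGb = 10_000;  // split into 10 TB blocks

        // The master node's job: remember which slave holds which block.
        Map<Integer, String> blockToNode = new HashMap<>();

        int blockCount = (int) Math.ceil((double) fileSizeGb / blockSizeGb);
        for (int blockId = 0; blockId < blockCount; blockId++) {
            // Round-robin placement: spread the blocks evenly across the slaves.
            String node = slaveNodes.get(blockId % slaveNodes.size());
            blockToNode.put(blockId, node);
        }

        // The master can now answer "where is block N?" for any reader.
        blockToNode.forEach((blockId, node) ->
                System.out.println("block " + blockId + " -> " + node));
    }
}
```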

So, instead of investing in a single, expensive piece of high-end hardware to store the data, companies use distributed storage, which is a cheaper option.

The main advantage of distributed storage is that it is scalable: if your storage requirement grows, you simply add more commodity machines to the cluster.

Okay, so we have solved the Volume problem of big data. But what about the Velocity problem?

Well, it is also solved by distributed storage.

How?

Let's consider an example.

Suppose we want to upload 40 TB of data to a single storage device, and the upload takes 40 minutes.

Now, if we use distributed storage with 1 master node and 4 slave nodes, we can split the 40 TB of data into 4 blocks of 10 TB each. Uploading one 10 TB block to a single node takes 10 minutes, and since all 4 nodes can receive their blocks at the same time, the whole 40 TB is uploaded in just 10 minutes.
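The time saving comes purely from doing the block transfers in parallel. The sketch below shows the shape of that idea; uploadBlock and the node names are hypothetical placeholders, not a real client API.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelUpload {

    // Hypothetical stand-in for sending one 10 TB block to one slave node.
    // On a real cluster this would be a network transfer taking ~10 minutes.
    static void uploadBlock(String node, int blockId) {
        System.out.println("Uploading block " + blockId + " to " + node);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> slaveNodes = List.of("slave-1", "slave-2", "slave-3", "slave-4");

        // One upload task per slave node, all running at the same time.
        ExecutorService pool = Executors.newFixedThreadPool(slaveNodes.size());
        for (int i = 0; i < slaveNodes.size(); i++) {
            final int blockId = i;
            final String node = slaveNodes.get(i);
            pool.submit(() -> uploadBlock(node, blockId));
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        // Wall-clock time is roughly the time of ONE block (about 10 minutes),
        // not the sum of all four (40 minutes), because the transfers overlap.
        System.out.println("All 4 blocks uploaded in parallel.");
    }
}
```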

So, we have also solved the Velocity problem.

But how do you build such a Distributed Storage cluster?

There are many software frameworks available in the market for building a storage cluster. One of them, which is widely used in the industry, is Hadoop.

Apache Hadoop is an open source software framework used to develop data processing applications which are executed in a distributed computing environment.

Hadoop consists of the Hadoop Common package, which provides file system and operating system level abstractions, a MapReduce engine and the Hadoop Distributed File System (HDFS). The Hadoop Common package contains the Java Archive (JAR) files and scripts needed to start Hadoop.
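To get a feel for how an application talks to HDFS, here is a minimal sketch using Hadoop's Java FileSystem API. The NameNode address and the file paths are assumptions chosen for illustration; in practice they come from your cluster's configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace it with your cluster's actual master.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        // FileSystem is the client-side entry point to HDFS.
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into the distributed file system. HDFS itself
        // splits it into blocks and spreads them across the DataNodes.
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"),        // local path (assumed)
                             new Path("/user/data/sales.csv")); // HDFS path (assumed)

        fs.close();
    }
}
```

Once the file is in HDFS, the framework takes care of splitting it into blocks, spreading them across the DataNodes and replicating them, which is exactly the distributed storage idea we walked through above.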
