Hadoop

Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it's not the amount of data that's important; it's what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

What is Hadoop?
Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running on clustered systems. It stores massive datasets on a cluster of inexpensive commodity machines in a distributed manner. In 2008, Yahoo handed Hadoop over to the Apache Software Foundation. Since then, two major versions have been released: version 1.0 in 2011 and version 2.0.6 in 2013.

It includes:
  • HDFS – the Hadoop Distributed File System, the storage layer
  • MapReduce – the distributed batch (offline) computing engine
  • YARN – the resource management layer of Hadoop (the three fit together in the job sketch below)
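To make these three layers concrete, here is a minimal sketch of the classic word-count job in Java: the input and output live in HDFS, MapReduce does the computation, and YARN schedules the job's tasks on the cluster. The /data/input and /data/output paths are hypothetical; TokenCounterMapper and IntSumReducer are stock helper classes that ship with Hadoop.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The job is submitted to YARN, which schedules its tasks on the cluster.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenCounterMapper.class);  // map phase: emit (word, 1) per token
        job.setCombinerClass(IntSumReducer.class);     // optional local aggregation
        job.setReducerClass(IntSumReducer.class);      // reduce phase: sum counts per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // read from HDFS (hypothetical path)
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // write results back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}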
Hadoop Features

Reliability
In a Hadoop cluster, if any node goes down, it does not disable the whole cluster. Instead, another node takes over the work of the failed node, and the cluster continues functioning as if nothing had happened. Fault tolerance is built into Hadoop.
Scalable
Hadoop integrates well with cloud-based services. If you install Hadoop in the cloud, you need not worry about scalability: you can procure more hardware and expand your Hadoop cluster within minutes.
Economical
Hadoop is deployed on commodity hardware, i.e., inexpensive machines, which makes it very economical. And since Hadoop is open source software, there are no license costs either.
Distributed Processing
In Hadoop, any job submitted by a client is divided into a number of sub-tasks. These sub-tasks are independent of each other, so they execute in parallel, giving high throughput.
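As an illustration of what such a sub-task looks like, below is a sketch of a word-count map task in Java. The class name TokenizingMapper is purely illustrative; the key point is that the framework invokes one mapper per input split, and mappers on different splits run in parallel on different nodes.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The framework calls map() once per record of this task's input split;
        // tasks for different splits run independently and in parallel.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);  // emit (word, 1) for the shuffle/reduce phase
        }
    }
}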
Distributed Storage
Hadoop splits each file into blocks (128 MB by default in Hadoop 2). These blocks are stored in a distributed fashion across the cluster of machines.
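A minimal sketch of this in Java, using Hadoop's FileSystem API: the file path is hypothetical, the block size shown is the Hadoop 2 default, and the configuration is assumed to point at a running cluster.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");   // hypothetical path
        long blockSize = 128L * 1024 * 1024;        // 128 MB, the Hadoop 2 default
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, blockSize)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        // Anything larger than blockSize is split into multiple blocks,
        // and the NameNode spreads those blocks across the DataNodes.
    }
}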
Fault Tolerance
Hadoop replicates every block of a file multiple times, as determined by the replication factor (3 by default). If a node goes down, the data stored on it is not lost: copies of the same blocks are available on other nodes thanks to replication, and Hadoop recovers them automatically. This makes Hadoop fault tolerant.
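The replication factor can also be inspected and changed per file through the same FileSystem API; the sketch below reuses the hypothetical /data/sample.txt path from the previous example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/sample.txt");  // hypothetical path
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("replication factor: " + current);  // 3 unless overridden
        fs.setReplication(file, (short) 5);  // ask HDFS to keep 5 copies of each block
    }
}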
[Figure: Hadoop ecosystem tools]
Difference between Hadoop and RDBMS
  • Data types: RDBMS relies on structured data, and the schema of the data is always known. Hadoop can store any kind of data, be it structured, semi-structured, or unstructured.
  • Processing: RDBMS provides limited or no processing capabilities. Hadoop allows us to process data distributed across the cluster in a parallel fashion.
  • Schema: RDBMS is based on "schema on write", where schema validation is done before loading the data. Hadoop, on the contrary, follows a "schema on read" policy.
  • Speed: In RDBMS, reads are fast because the schema of the data is already known. In Hadoop, writes are fast because no schema validation happens during an HDFS write.
  • Cost: RDBMS is licensed software, so you have to pay for it. Hadoop is an open source framework, so there is nothing to pay for the software.
  • Use case: RDBMS is used for OLTP (Online Transactional Processing) systems. Hadoop is used for data discovery, data analytics, and OLAP systems.
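To illustrate the schema-on-read point, here is a small sketch: a CSV file is stored in HDFS as raw text with no validation at write time, and the assumed id,name,amount layout is imposed only inside the mapper, at read time. The class name and field layout are hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CsvReadMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The "schema" exists only here, while reading: split each raw line
        // into the assumed id,name,amount layout.
        String[] fields = line.toString().split(",");
        if (fields.length == 3) {  // rows that don't fit the layout are simply skipped
            context.write(new Text(fields[0]), new Text(fields[2]));
        }
    }
}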
