What is Hadoop?
- Imagine you’re going from processing 1 GB of data locally to 1 TB… then 100 TB… and eventually petabytes of data!
- 1 Petabyte of data is 1000 TB, or 1 million GB.
- Your data might be structured or unstructured (videos, sensor data, weather data, etc.), and you need to process all of it to extract useful information
- You can use Hadoop for this:
- Hadoop is optimized to handle massive quantities of data by using distributed processing
- Distributed processing just means that multiple computers (often referred to as nodes, servers, or machines) work together to process data and perform tasks; this is exactly what Hadoop does
How does Hadoop work?
Hadoop has 3 main components:
- HDFS → Hadoop Distributed File System → stores large datasets by scaling from a single node to hundreds of nodes in a cluster (see the storage sketch after this list)
- MapReduce → the processing unit of Hadoop; it processes data by splitting it into smaller chunks and assigning each chunk to a different node in the cluster for parallel processing (all chunks at the same time; see the WordCount sketch after this list)
- For a while, MapReduce was the only way to access data stored in HDFS. Then along came Hive, Pig, and so on
- YARN (Yet Another Resource Negotiator) → allocates the RAM, storage, and CPU that Hadoop needs to run batch, stream, interactive, and graph processing on data stored in HDFS
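To make the storage side concrete, here is a minimal sketch of writing a file into HDFS with Hadoop's Java FileSystem API. The class name HdfsCopyExample and both file paths are hypothetical, and the sketch assumes a running HDFS cluster whose address is picked up from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS (the NameNode address) from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical paths, purely for illustration
    Path local = new Path("/tmp/sensor-readings.csv");
    Path remote = new Path("/data/raw/sensor-readings.csv");

    // HDFS splits the file into blocks and replicates each block
    // across several nodes in the cluster for fault tolerance
    fs.copyFromLocalFile(local, remote);

    System.out.println("Stored " + fs.getFileStatus(remote).getLen() + " bytes in HDFS");
    fs.close();
  }
}
```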
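And to make the processing side concrete, below is the classic WordCount job in the style of the standard Apache Hadoop MapReduce tutorial: the mapper runs in parallel over each node's chunk of the input, and the reducer sums the counts for each word. Treat it as a sketch; it assumes the Hadoop MapReduce client libraries are on the classpath and that input/output paths are passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: each node runs this over its own chunk (split) of the input
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one); // emit (word, 1) for each token
      }
    }
  }

  // Reducer: receives every count emitted for a given word and sums them
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Notice that nothing in the mapper or reducer says which node runs it; YARN handles the scheduling and resource allocation, which is why the same code scales from one machine to hundreds.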
Challenges of Hadoop
- Not good for processing transactions
- Not good when the work can’t be parallelized
- Not good when you need low latency (this is because Hadoop has to do multiple I/O operations against disk, versus reading from RAM, as Spark does, or from a cache)
- Not good when there are dependencies in the data (i.e., dataset 1 needs to be processed before the rest, and so on)
For many workloads, the drawbacks of Hadoop outweigh the benefits
MapReduce