What is Hadoop?
- Imagine you’re going from processing 1 GB of data locally to 1 TB… then 100 TB… and eventually petabytes of data!
- 1 Petabyte of data is 1000 TB, or 1 million GB.
- Your data might be structured or unstructured (videos, sensor data, weather data, etc.), and you need to process all of it to extract useful information
- You can use Hadoop for this:
- Hadoop is optimized to handle massive quantities of data by using distributed processing
- Distributed processing just means that multiple computers (often referred to as nodes, servers, or machines) work together to process data and perform tasks; this is exactly what Hadoop does
How does Hadoop work?
Hadoop has 3 main components:
- HDFS → Hadoop Distributed File System → stores large datasets by scaling from a single node to hundreds of nodes in a cluster (see the storage sketch after this list)
- MapReduce → the processing unit of Hadoop; it processes data by splitting it into smaller chunks and assigning each chunk to a different node in the cluster for parallel processing (all chunks at the same time; see the WordCount sketch after this list)
- For a while, MapReduce was the only way to access data stored in HDFS. Then along came Hive, Pig, and so on
- YARN (Yet Another Resource Negotiator) → allocates the RAM, storage, and CPU that Hadoop needs to run batch, stream, interactive, and graph processing on data stored in HDFS
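To make the storage side concrete, here is a minimal sketch of writing a file into HDFS with Hadoop's Java FileSystem API. The class name HdfsCopyExample and both file paths are hypothetical, and the sketch assumes a running HDFS cluster whose address is picked up from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS (the NameNode address) from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical paths, purely for illustration
    Path local = new Path("/tmp/sensor-readings.csv");
    Path remote = new Path("/data/raw/sensor-readings.csv");

    // HDFS splits the file into blocks and replicates each block
    // across several nodes in the cluster for fault tolerance
    fs.copyFromLocalFile(local, remote);

    System.out.println("Stored " + fs.getFileStatus(remote).getLen() + " bytes in HDFS");
    fs.close();
  }
}
```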
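And to make the processing side concrete, below is the classic WordCount job in the style of the standard Apache Hadoop MapReduce tutorial: the mapper runs in parallel over each node's chunk of the input, and the reducer sums the counts for each word. Treat it as a sketch; it assumes the Hadoop MapReduce client libraries are on the classpath and that input/output paths are passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: each node runs this over its own chunk (split) of the input
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one); // emit (word, 1) for each token
      }
    }
  }

  // Reducer: receives every count emitted for a given word and sums them
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Notice that nothing in the mapper or reducer says which node runs it; YARN handles the scheduling and resource allocation, which is why the same code scales from one machine to hundreds.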
Challenges of Hadoop
- Not good for processing transactions
- Not good when the work can’t be parallelized
- Not good when you need low latency (this is because Hadoop has to do multiple I/O operations against disk, versus reading from RAM, as Spark does, or from a cache)
- Not good when there are dependencies in the data (i.e., dataset 1 needs to be processed before the rest, and so on)
For many workloads, the drawbacks of Hadoop outweigh the benefits
MapReduce