Introduction

What is a Data Lake?

Could be a Hadoop Cluster
Cloud storage, e.g. AWS S3 (object storage) → you interact with the files’ metadata to get to the files/data
What if you have 1,000 files? You would have to go through each one and query it.
Table format → groups files into tables
- Table format → an abstraction on top of the data lake that organizes files into tables
- Different table formats achieve this differently.
How did we get here?
- Back when everyone was running out of resources to vertically scale their databases, they tried parallel processing of queries and other ways of optimizing their code, but it wasn’t working out too well.
- This is when Hadoop was created → to store and process vast amounts of data in a scalable, cost-effective way
- Hadoop used HDFS as its primary storage, holding large files across multiple nodes in a cluster. It did this by breaking large files into smaller blocks and replicating those blocks across different nodes in the cluster → for fault tolerance (in case one node goes down, the data is still available on the others). It also ran distributed processing on those nodes using MapReduce.
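A minimal sketch of the block-and-replica idea above, in plain Python rather than real HDFS. The helper names, the tiny block size, and the round-robin placement are all assumptions made for illustration; actual HDFS uses much larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement policy, chosen only to keep the sketch short.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# A 10-byte "file" with 4-byte blocks gives blocks of 4, 4 and 2 bytes.
blocks = split_into_blocks(b"x" * 10, block_size=4)
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(len(blocks), nodes, replication=3)
survives = readable_after_failure(placement, "node-a")
```

Because every block lives on three different nodes here, losing any single node still leaves two copies of each block to read from.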
- Hadoop full details:
Now, the issue with Hadoop was that anyone who wasn’t technically savvy struggled to use it:
1. Any user who wanted to use the data had to figure out how to fit their question into the MapReduce programming model and then write Java code to implement it.
  1. MR required you to write complex Java code in which you tell it which files make up its dataset, and then run the job on them
2. There was no metadata defining information about the dataset, like its schema.
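To make the first point concrete, here is a rough sketch of what "fitting a question into the MapReduce model" means, written in plain Python for brevity rather than the Java a real Hadoop job would need. The question ("how many events per month?"), the record layout, and all function names are made up for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one raw input line into (key, value) pairs."""
    month, _user = record.split(",")
    yield month, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in-process (no cluster involved)."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# The user has to know which records make up the dataset and how to parse them,
# since nothing describes the schema.
records = ["2024-01,alice", "2024-01,bob", "2024-02,alice"]
counts = run_job(records)
```

Even this simple count has to be split into a map step and a reduce step by hand, and the parsing logic in `map_phase` stands in for the missing schema.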
To get data into the hands of more users and address these shortcomings, Hive was built.

Now, Hive takes a user’s query in SQL and translates it into MapReduce jobs so that users can get their answers
- To address the metadata issue, the Hive table format was created

Hive Table Format

Hive had to figure out how to define this table format so that it overcomes Hadoop’s issues
- How to represent these tables so we can use SQL?
Approach: Directory based approach
If a file is in this folder, it’s part of the table. If there are any sub-folders in the folder, that means there are partitions.
Was more efficient! How → depending on how you organize your data into partitions, a query could read only the partitions it needed.
Automatically update a whole partition - i.e. update directory X every month
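The directory-based approach can be sketched as below. This is a toy illustration using local folders and CSV files, not Hive itself; the table name `sales`, the `month=` folder naming, and the `scan` helper are assumptions for the example.

```python
import tempfile
from pathlib import Path
from typing import Optional


def build_table(root):
    """Create a tiny 'table': one folder, with one sub-folder per partition."""
    data = {"2024-07": ["a,1", "b,2"], "2024-08": ["c,3"]}
    for month, rows in data.items():
        partition = root / f"month={month}"
        partition.mkdir(parents=True)
        (partition / "part-0000.csv").write_text("\n".join(rows) + "\n")


def scan(root, month: Optional[str] = None):
    """Read the table; if a month is given, open only that partition's folder.

    Returns (rows, number_of_files_opened).
    """
    if month is not None:
        partitions = [root / f"month={month}"]
    else:
        partitions = sorted(p for p in root.iterdir() if p.is_dir())
    rows, opened = [], 0
    for partition in partitions:
        for data_file in sorted(partition.glob("*.csv")):
            opened += 1
            rows.extend(data_file.read_text().splitlines())
    return rows, opened


with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "sales"
    build_table(table)
    all_rows, all_files = scan(table)                      # full table scan
    july_rows, july_files = scan(table, month="2024-07")   # only one partition
```

The filtered scan touches one file instead of two; on a real table with many partitions, skipping folders this way is where the speed-up comes from.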


Smaller updates → What if you wanted to update just one row?
- Had to go through all the files, find the row, and then change it
Large tables took a LONG time! Opening, reading, searching, and closing each and every file was very slow
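A rough sketch of why a single-row update is expensive in a layout like this: without any index of which file holds which row, every file has to be opened and searched, and the file containing the row is rewritten in full. The file layout and the `update_row` helper are hypothetical, for illustration only.

```python
import tempfile
from pathlib import Path


def update_row(table_dir, row_id, new_value):
    """Scan every data file under the table; rewrite any file that holds the row.

    Returns (files_opened, files_rewritten).
    """
    opened = rewritten = 0
    for data_file in sorted(table_dir.rglob("*.csv")):
        opened += 1
        lines = data_file.read_text().splitlines()
        updated = [
            f"{row_id},{new_value}" if line.split(",", 1)[0] == row_id else line
            for line in lines
        ]
        if updated != lines:
            # The whole file is written again just to change one row.
            data_file.write_text("\n".join(updated) + "\n")
            rewritten += 1
    return opened, rewritten


with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "sales"
    data = {"07": ["a,1", "b,2"], "08": ["c,3"], "09": ["d,4"]}
    for month, rows in data.items():
        partition = table / f"month={month}"
        partition.mkdir(parents=True)
        (partition / "part-0000.csv").write_text("\n".join(rows) + "\n")

    opened, rewritten = update_row(table, "c", "30")
    august = (table / "month=08" / "part-0000.csv").read_text()
```

Three files are opened to change one row in one file; the cost grows with the number of files in the table, not with the size of the change.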
Users have to know the physical layout of the data → there is partitioning. For example, the data for July goes into a july folder, the data for August goes into an august folder, etc.
You also sometimes partition the data by ENGINEERED (new) features that your users are not aware of. For example, instead of partitioning on the raw timestamp (which also records seconds or milliseconds), you create new columns such as year or month and partition on one of those.
So to actually take advantage of partitioning, users would need to know all of the above in order to get faster query responses.
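The problem with engineered partition columns can be illustrated with a small in-memory sketch. It assumes a simplified engine that prunes partitions only when the query filters on the partition columns themselves; the `events` data and the `query` helper are invented for the example.

```python
from datetime import datetime

events = [
    {"id": 1, "ts": datetime(2024, 7, 3, 9, 15)},
    {"id": 2, "ts": datetime(2024, 7, 21, 18, 40)},
    {"id": 3, "ts": datetime(2024, 8, 2, 7, 5)},
]

# Partition on derived (year, month) columns rather than on the raw timestamp.
partitions = {}
for event in events:
    key = (event["ts"].year, event["ts"].month)
    partitions.setdefault(key, []).append(event)


def query(ts_from, ts_to, year=None, month=None):
    """Return (matching ids, partitions scanned).

    Partitions are skipped only when the caller filters on year/month
    explicitly; a filter on `ts` alone does not prune anything.
    """
    hits, scanned = [], 0
    for (part_year, part_month), rows in partitions.items():
        if year is not None and part_year != year:
            continue
        if month is not None and part_month != month:
            continue
        scanned += 1
        hits.extend(row["id"] for row in rows if ts_from <= row["ts"] < ts_to)
    return hits, scanned


start, end = datetime(2024, 7, 1), datetime(2024, 8, 1)
by_timestamp_only = query(start, end)                      # scans every partition
with_partition_cols = query(start, end, year=2024, month=7)  # scans one partition
```

Both queries return the same rows, but only the user who knows about the `year` and `month` columns gets the cheaper scan.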