What is a Data Lake?
Could be a Hadoop Cluster
Could be cloud object storage such as AWS S3 → you interact with the files' metadata to get at the files/data
What if you have 1,000 files? Do you have to go through each one and query it?
Table format → an abstraction on the data lake that groups files into tables
Different table formats achieve this differently.
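As a rough illustration of the idea (not how any specific table format is implemented), a table format can be thought of as a mapping from table names to the files that belong to them. The file names and the `table_files` mapping below are made up for the example.

```python
# Hypothetical listing of files sitting in a data lake (names are made up).
all_files = [
    "orders_2024_01.parquet",
    "orders_2024_02.parquet",
    "customers.parquet",
    "clicks_2024_01.parquet",
]


def files_to_scan_without_tables(files):
    # Without a table format, nothing says which files matter,
    # so a query has to consider every file.
    return list(files)


# With a table format, files are grouped into tables (assumed grouping).
table_files = {
    "orders": ["orders_2024_01.parquet", "orders_2024_02.parquet"],
    "customers": ["customers.parquet"],
    "clicks": ["clicks_2024_01.parquet"],
}


def files_to_scan(table_name):
    # A query against one table only needs that table's files.
    return table_files[table_name]


print(len(files_to_scan_without_tables(all_files)))
print(files_to_scan("orders"))
```

The point is only the lookup: once files are grouped into tables, a query on `orders` never has to touch the `customers` or `clicks` files.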
How did we get here?
The issue with Hadoop was that anyone who wasn't technically savvy had trouble using it, since getting answers meant writing MapReduce jobs.
Hive was built to address these shortcomings and get data into the hands of more users.
Hive takes a user's SQL query and translates it into MapReduce jobs, so users can get their answers without writing those jobs themselves.
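To make the SQL → MapReduce idea concrete, here is a minimal sketch in plain Python of how a GROUP BY query can be expressed as a map step, a shuffle, and a reduce step. This is only an illustration of the pattern, not Hive's actual query planner; the `sales` rows and all function names are assumptions for the example.

```python
from collections import defaultdict

# Hypothetical rows of a "sales" table: (country, amount).
rows = [("US", 10), ("DE", 5), ("US", 7), ("FR", 3), ("DE", 1)]

# Query being illustrated:
#   SELECT country, SUM(amount) FROM sales GROUP BY country


def map_phase(rows):
    # Emit one (key, value) pair per input row.
    for country, amount in rows:
        yield country, amount


def shuffle(pairs):
    # Group all values by key, which happens between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Combine the values for each key into a single result.
    return {key: sum(values) for key, values in groups.items()}


result = reduce_phase(shuffle(map_phase(rows)))
print(result)
```

With Hive, the user only writes the SQL in the comment; the map and reduce steps are generated for them.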
Hive Table Format
Hive had to figure out how to define a table format that overcomes Hadoop's issues
Approach: directory-based
A folder is a table: if a file is in that folder, it belongs to the table. If there are sub-folders in the folder, those are partitions.
More efficient! How → depending on how you organize your data into partitions, a query can read only the partitions it needs.
Automatically update a whole partition - i.e. update directory X every month
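The directory-based approach can be sketched with ordinary folders. The example below builds a small, made-up `sales` table in a temporary directory, with one sub-folder per month, and then reads only the partition a query needs. The `month=...` folder names follow the key=value naming Hive uses for partition directories; the table name, file names, and helper functions are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def build_example_table(root: Path) -> Path:
    # The table is a folder; each sub-folder is one partition.
    table = root / "sales"
    for month in ["2024-01", "2024-02", "2024-03"]:
        part = table / f"month={month}"
        part.mkdir(parents=True)
        (part / "data.csv").write_text(f"row for {month}\n")
    return table


def list_partitions(table: Path) -> list[str]:
    # Every sub-folder of the table folder is a partition.
    return sorted(p.name for p in table.iterdir() if p.is_dir())


def files_for_partition(table: Path, month: str) -> list[Path]:
    # Only look inside the one partition folder the query asks for.
    part = table / f"month={month}"
    return sorted(f for f in part.iterdir() if f.is_file())


with tempfile.TemporaryDirectory() as tmp:
    table = build_example_table(Path(tmp))
    partitions = list_partitions(table)
    february_files = [
        f.relative_to(table).as_posix()
        for f in files_for_partition(table, "2024-02")
    ]
    print(partitions)
    print(february_files)
```

A query filtered on February only opens the `month=2024-02` folder and skips the others, and updating a whole partition amounts to replacing the contents of one folder.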