Spark is an open-source, in-memory engine for distributed data processing on massive data volumes
Spark is in-memory → operations work on data held in RAM → no need to read/write from disk between steps!
Scales very well as data volume grows
Primarily written in Scala and runs on the JVM
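A minimal sketch (in Scala, since that's Spark's native API) of the in-memory idea: cache a dataset in RAM once, then reuse it for several operations without re-reading from disk. The file events.json and the user_id column are hypothetical stand-ins, not from the notes above.

    import org.apache.spark.sql.SparkSession

    object InMemoryDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("in-memory-demo")
          .master("local[*]")          // run locally on all cores
          .getOrCreate()

        val events = spark.read.json("events.json")   // hypothetical input file
        events.cache()                                 // keep the DataFrame in RAM

        // Both actions below reuse the cached data instead of hitting disk again
        println(events.count())
        events.groupBy("user_id").count().show()       // "user_id" is an assumed column

        spark.stop()
      }
    }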
Distributed computing → A group/cluster of computers working together to appear as one system to the end user
Parallel processing: instead of running tasks one by one (where an error means debugging and re-running the whole job), we can spread the instructions across different computers and debug each one independently → if one has an error, the others can keep working
Instructions can run in parallel and don't have to wait on previous instructions to complete → reduces processing time, eases memory pressure on any single machine, improves scalability, and we can add / remove nodes/computers as needed!
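Rough sketch of this parallelism in Spark terms: the same transformation runs on many partitions at once instead of one at a time. The numbers and the partition count of 8 are just made up for illustration.

    import org.apache.spark.sql.SparkSession

    object ParallelDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parallel-demo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // Split 1..1,000,000 into 8 partitions; each partition is processed in parallel
        val numbers = sc.parallelize(1 to 1000000, numSlices = 8)
        val squared = numbers.map(n => n.toLong * n)   // runs independently on each partition
        println(squared.sum())                         // the action triggers the parallel work

        spark.stop()
      }
    }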
What is a computing cluster? → Basically a bunch of computers/servers/nodes horizontally scaled out to handle larger amounts of data
A compute cluster's nodes can run independently, and if one fails it does not affect the others → this is the EASY parallel model
Hard parallel model → the servers need to communicate with each other (i.e. data from server 1 is needed on server 4, etc.)
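One way to see the easy vs hard split in Spark terms (a sketch, not the only framing): map() is a narrow transformation where each partition works alone, while reduceByKey() is a wide transformation that forces a shuffle, i.e. the nodes have to exchange data with each other. The word data below is made up.

    import org.apache.spark.sql.SparkSession

    object ShuffleDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("shuffle-demo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val words = sc.parallelize(Seq("spark", "scala", "spark", "cluster", "scala", "spark"))

        // "Easy" parallel: each partition maps its own words, no cross-node communication
        val pairs = words.map(word => (word, 1))

        // "Hard" parallel: counts for the same word may sit on different nodes,
        // so Spark shuffles data between partitions before reducing
        val counts = pairs.reduceByKey(_ + _)

        counts.collect().foreach(println)
        spark.stop()
      }
    }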
Fault tolerance → if a node goes down in the cluster, its partitions are replicated on other servers, which lets us re-deploy a worker without its partition data being lost
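A hedged sketch of the fault-tolerance note: Spark's default recovery is to rebuild a lost partition from its lineage (the recorded chain of transformations), while replication of cached partitions is opt-in via a *_2 storage level. The app name and data below are made up.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object FaultToleranceDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("fault-tolerance-demo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val base = sc.parallelize(1 to 100000)
        val doubled = base.map(_ * 2)   // lineage: parallelize -> map

        // Keep each cached partition on two nodes; if one worker dies, the copy on the
        // other node (or a recompute from lineage) means the partition data is not lost
        doubled.persist(StorageLevel.MEMORY_ONLY_2)

        println(doubled.count())
        spark.stop()
      }
    }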