Spark is an open-source, in-memory engine for distributed data processing on massive data volumes
Spark is in-memory → operations work on data held in RAM → no need to read/write from disk between steps!
Scales very well as data volume grows
Primarily written in Scala and runs on the JVM
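A minimal sketch (in Scala, since that's Spark's native API) of the in-memory idea: cache a dataset in RAM once, then reuse it for several operations without re-reading from disk. The file events.json and the user_id column are hypothetical stand-ins, not from the notes above.

    import org.apache.spark.sql.SparkSession

    object InMemoryDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("in-memory-demo")
          .master("local[*]")          // run locally on all cores
          .getOrCreate()

        val events = spark.read.json("events.json")   // hypothetical input file
        events.cache()                                 // keep the DataFrame in RAM

        // Both actions below reuse the cached data instead of hitting disk again
        println(events.count())
        events.groupBy("user_id").count().show()       // "user_id" is an assumed column

        spark.stop()
      }
    }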
Distributed computing → A group/cluster of computers working together to appear as one system to the end user
Parallel processing: instead of running tasks one by one (where an error means debugging and re-running the whole job), we can spread the instructions across different computers and debug each one independently → if one has an error, the others can keep working
Instructions can run in parallel and don't have to wait on previous instructions to complete → reduces processing time, eases memory pressure on any single machine, improves scalability, and we can add / remove nodes/computers as needed!
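Rough sketch of this parallelism in Spark terms: the same transformation runs on many partitions at once instead of one at a time. The numbers and the partition count of 8 are just made up for illustration.

    import org.apache.spark.sql.SparkSession

    object ParallelDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parallel-demo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // Split 1..1,000,000 into 8 partitions; each partition is processed in parallel
        val numbers = sc.parallelize(1 to 1000000, numSlices = 8)
        val squared = numbers.map(n => n.toLong * n)   // runs independently on each partition
        println(squared.sum())                         // the action triggers the parallel work

        spark.stop()
      }
    }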
What is a computing cluster? → Basically a bunch of computers/servers/nodes horizontally scaled out to handle larger amounts of data
A compute cluster's nodes can run independently, and if one fails it does not affect the others → this is the EASY parallel model
Hard parallel model → the servers need to communicate with each other (i.e. data from server 1 is needed on server 4, etc.)
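One way to see the easy vs hard split in Spark terms (a sketch, not the only framing): map() is a narrow transformation where each partition works alone, while reduceByKey() is a wide transformation that forces a shuffle, i.e. the nodes have to exchange data with each other. The word data below is made up.

    import org.apache.spark.sql.SparkSession

    object ShuffleDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("shuffle-demo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val words = sc.parallelize(Seq("spark", "scala", "spark", "cluster", "scala", "spark"))

        // "Easy" parallel: each partition maps its own words, no cross-node communication
        val pairs = words.map(word => (word, 1))

        // "Hard" parallel: counts for the same word may sit on different nodes,
        // so Spark shuffles data between partitions before reducing
        val counts = pairs.reduceByKey(_ + _)

        counts.collect().foreach(println)
        spark.stop()
      }
    }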
Fault tolerance → if a node goes down in the cluster, its partitions are replicated on other servers, which lets us re-deploy a worker without its partition data being lost
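A hedged sketch of the fault-tolerance note: Spark's default recovery is to rebuild a lost partition from its lineage (the recorded chain of transformations), while replication of cached partitions is opt-in via a *_2 storage level. The app name and data below are made up.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object FaultToleranceDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("fault-tolerance-demo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val base = sc.parallelize(1 to 100000)
        val doubled = base.map(_ * 2)   // lineage: parallelize -> map

        // Keep each cached partition on two nodes; if one worker dies, the copy on the
        // other node (or a recompute from lineage) means the partition data is not lost
        doubled.persist(StorageLevel.MEMORY_ONLY_2)

        println(doubled.count())
        spark.stop()
      }
    }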