Introduction to Spark
- Spark is a strong fit when working with distributed systems, where data and computation are spread across many machines.
- Overview: This page provides an introduction to Apache Spark, an open-source distributed computing system designed for big data processing and analytics. The code examples are provided in Python, but Spark also supports other programming languages such as Scala and Java.
- Details: It explains Spark's key features, such as in-memory processing, fault tolerance, and support for multiple programming languages. It also covers Spark's versatility across diverse workloads, including batch processing, iterative algorithms, and real-time streaming; a short caching sketch follows the basic example below.
from pyspark.sql import SparkSession
# Create a SparkSession, the entry point to Spark functionality
spark = SparkSession.builder \
    .appName("Spark Intro") \
    .getOrCreate()
# Perform basic operations on a distributed collection
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)
count = rdd.count()
print("Count:", count)
spark.stop()
Spark Architecture
- Overview: This page covers the high-level architecture of Spark, highlighting its components and their roles in distributed data processing.
- Details: It discusses the driver program, which coordinates the execution of a Spark application; the cluster manager, which allocates resources; the worker nodes, whose executors run the tasks; and the distributed storage layer that provides fault-tolerant data storage. A sketch of how the driver picks its cluster manager follows the example below.
from pyspark import SparkContext
# The driver creates a SparkContext, which connects to the cluster manager
sc = SparkContext(appName="RDD Example")
# Create an RDD by distributing a local collection across the cluster
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
# Transformations run as tasks on the worker nodes' executors
squared_rdd = rdd.map(lambda x: x ** 2)
result = squared_rdd.collect()  # collect() brings the results back to the driver
print("Squared RDD:", result)
sc.stop()
RDD (Resilient Distributed Datasets)
- Overview: This page introduces RDD, the core data abstraction in Spark, representing an immutable distributed collection of objects.
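As a small illustrative example of that immutability (variable names are arbitrary), transformations never modify an RDD in place; they return a new RDD that records its lineage, which is what lets Spark recompute lost partitions after a failure:

from pyspark import SparkContext
sc = SparkContext(appName="Immutability Demo")
numbers = sc.parallelize([1, 2, 3, 4, 5])
doubled = numbers.map(lambda x: x * 2)  # a new RDD; `numbers` is unchanged
print(numbers.collect())  # [1, 2, 3, 4, 5]
print(doubled.collect())  # [2, 4, 6, 8, 10]
sc.stop()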