Table of Contents

  1. Functional Programming, Lambda Functions, Resilient Distributed Datasets (RDDs), Parallel Programming, Queries, parts, and Spark SQL
  2. DataFrames vs Datasets → RDDs, transformations, Catalyst and Tungsten, SQL optimization
  3. Development & Production Environments → Cluster Managers, Apache Spark & Kubernetes, Cluster configurations, Application request processing
  4. Apache Spark UI, Monitoring, Managing Memory, Log files, and Tuning

Spark Introduction

Apache Spark is an open-source, in-memory engine for distributed data processing on massive data volumes.
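
To make the idea of distributed processing concrete, here is a minimal sketch in plain Python (standard library only, not the Spark API). It assumes a toy workload, a sum of squares, and uses threads on one machine as a stand-in for Spark executors. The helper names split_into_partitions, partition_sum_of_squares, and partitioned_sum_of_squares are hypothetical and exist only for this illustration. The pattern it shows is: split the data into partitions, process each partition independently, then combine the partial results.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal chunks (hypothetical helper)."""
    # Ceiling division, with a floor of 1 so empty input does not break range().
    size = max(1, -(-len(data) // num_partitions))
    return [data[i:i + size] for i in range(0, len(data), size)]


def partition_sum_of_squares(partition):
    """Work done independently on a single partition."""
    return sum(x * x for x in partition)


def partitioned_sum_of_squares(data, num_partitions=4):
    """Split the data, process each partition, then combine the results."""
    partitions = split_into_partitions(data, num_partitions)
    # Threads stand in for executors here; a real cluster would spread
    # partitions across separate machines.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(partition_sum_of_squares, partitions))
    # Combine the per-partition results into one final value.
    return reduce(lambda a, b: a + b, partial_results, 0)


if __name__ == "__main__":
    print(partitioned_sum_of_squares(list(range(10))))
```

The lambda passed to reduce is the same functional style listed in the first item of the table of contents. In this sketch the partitions stay in one process's memory, so it only mimics the shape of the computation, not the cluster behaviour.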

Difference between Parallel Computing and Distributed Computing

[Figure: Parallel computing vs. distributed computing]