- Flink's Notions of Time:
- Event Time: Actual time when an event happened.
- Ingestion Time: When the event enters Flink (stamped at the source).
- Processing Time: When an operator actually processes the event (local machine clock).
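The three timestamps can be sketched in plain Python (this is not the Flink API; the sensor event and its field names are made up for illustration):

```python
import time

# Hypothetical sensor reading: the event carries its own event-time timestamp.
event = {"sensor_id": "s-42", "temperature": 21.5, "event_time": 1700000000.0}

# Ingestion time: stamped the moment the event is first seen by the system.
event["ingestion_time"] = time.time()

# Processing time: read from the local clock whenever an operator handles the event.
processing_time = time.time()

# Event time is fixed forever; ingestion and processing time depend on when the
# pipeline actually runs, which is why only event time gives reproducible results.
assert event["event_time"] <= event["ingestion_time"] <= processing_time
```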
- Why Event Time is Crucial:
- Provides consistent and reproducible results.
- Results depend on when the events occurred, not on when the analysis happens to run.
- Watermarks Explained:
- Watermarks help Flink know when it's received all events up to a certain time.
- Acts like a marker saying, "Events up to this time have been considered."
- Lateness and Watermarks:
- An event whose timestamp is behind the current watermark is considered "late."
- Helps in understanding the completeness of the event stream.
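The lateness rule is just a comparison against the watermark; a minimal plain-Python sketch (not the Flink API):

```python
def is_late(event_timestamp: float, current_watermark: float) -> bool:
    """An event is late if its timestamp is at or behind the watermark,
    i.e. the stream was already declared complete up to that time."""
    return event_timestamp <= current_watermark

# Watermark says: "events up to t=100 have been considered."
watermark = 100.0
assert is_late(95.0, watermark)       # its time slot was already closed -> late
assert not is_late(105.0, watermark)  # still ahead of the watermark -> on time
```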
- Trade-offs in Stream Processing:
- You can prioritize latency (risk missing late data) or completeness (results arrive later).
- Balance between immediate processing and waiting for all data.
- Approaches to Watermarking:
- Simple method: assume there's a maximum delay for events.
- Allows handling of out-of-order data and ensures stream completeness.
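The "maximum delay" approach above can be sketched as a small class: the watermark simply trails the highest event timestamp seen so far by a fixed bound. This is a plain-Python illustration of the idea, not Flink's actual watermark-generator interface:

```python
class BoundedOutOfOrdernessWatermark:
    """Sketch of the maximum-delay strategy: the watermark lags the
    highest event timestamp seen so far by a fixed bound."""

    def __init__(self, max_delay: float):
        self.max_delay = max_delay
        self.max_timestamp_seen = float("-inf")

    def on_event(self, event_timestamp: float) -> float:
        # Track the largest timestamp; out-of-order events never move it back.
        self.max_timestamp_seen = max(self.max_timestamp_seen, event_timestamp)
        return self.current_watermark()

    def current_watermark(self) -> float:
        # "Events up to this time have been considered."
        return self.max_timestamp_seen - self.max_delay

wm = BoundedOutOfOrdernessWatermark(max_delay=5.0)
wm.on_event(10.0)
wm.on_event(8.0)  # out-of-order event: the watermark does not regress
assert wm.current_watermark() == 5.0  # 10.0 - 5.0
```

Events with timestamps at or below the emitted watermark are the ones the "Lateness" section classifies as late.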
What is Flink? Apache Flink® — Stateful Computations over Data Streams
Flink is a stream processor → data has to come from somewhere (databases, logs, transactions, etc.), passes through Flink, and goes somewhere else
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams
- Almost any kind of data is produced as a stream of events: credit card transactions, sensor measurements, machine logs, user interactions on a website / app, DNS data… are all good cases for Apache Flink
- Unbounded Streams: Have a start but no defined end.
- They do not terminate and provide data as it is generated
- Must be continuously processed: events have to be handled as they arrive
- We mostly have to process this type of data in the order in which events occurred to be able to reason about result completeness
- Bounded Streams: Defined start and end → Ingest all data before performing any computation
- Ordered ingestion is NOT required (can be sorted after)
- Also known as batch processing → data at rest, i.e. analysis on historical data
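The bounded-stream model (ingest everything, sort afterwards, then compute) can be sketched in plain Python; the record layout here is made up for illustration:

```python
# Hypothetical bounded stream: a finite list of (timestamp, value) records,
# possibly out of order, since ordered ingestion is not required.
records = [(3, 30), (1, 10), (2, 20)]

# Batch style: all data is ingested first, sorted afterwards, then computed on.
records.sort(key=lambda r: r[0])
total = sum(value for _, value in records)

assert [ts for ts, _ in records] == [1, 2, 3]
assert total == 60
```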
Apache Flink is a distributed processing engine that requires compute resources in order to execute applications. It integrates with common cluster resource managers (such as Hadoop YARN and Kubernetes), but can also run as a standalone cluster. When deploying applications, Flink automatically identifies the required resources, requests them from the resource manager, and replaces failed containers by requesting new resources.
- It’s one of the few frameworks that provides both Batch and Real-time Data Processing
- Other technologies — Batch Processing: Hadoop, Hive, Spark
- Other technologies — Real Time Processing: Spark Streaming
Apache Flink Benefits