- Flink's Notions of Time:
- Event Time: Actual time when an event happened.
- Ingestion Time: When the event enters Flink (stamped at the source).
- Processing Time: When an operator actually processes the event (local machine clock).
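The three timestamps can be sketched in plain Python (this is not the Flink API; the sensor event and its field names are made up for illustration):

```python
import time

# Hypothetical sensor reading: the event carries its own event-time timestamp.
event = {"sensor_id": "s-42", "temperature": 21.5, "event_time": 1700000000.0}

# Ingestion time: stamped the moment the event is first seen by the system.
event["ingestion_time"] = time.time()

# Processing time: read from the local clock whenever an operator handles the event.
processing_time = time.time()

# Event time is fixed forever; ingestion and processing time depend on when the
# pipeline actually runs, which is why only event time gives reproducible results.
assert event["event_time"] <= event["ingestion_time"] <= processing_time
```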
- Why Event Time is Crucial:
- Provides consistent and reproducible results.
- Results depend on when the events occurred, not on when the analysis happens to run.
- Watermarks Explained:
- Watermarks help Flink know when it's received all events up to a certain time.
- Acts like a marker saying, "Events up to this time have been considered."
- Lateness and Watermarks:
- An event whose timestamp is behind the current watermark is considered "late."
- Helps in understanding the completeness of the event stream.
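The lateness rule is just a comparison against the watermark; a minimal plain-Python sketch (not the Flink API):

```python
def is_late(event_timestamp: float, current_watermark: float) -> bool:
    """An event is late if its timestamp is at or behind the watermark,
    i.e. the stream was already declared complete up to that time."""
    return event_timestamp <= current_watermark

# Watermark says: "events up to t=100 have been considered."
watermark = 100.0
assert is_late(95.0, watermark)       # its time slot was already closed -> late
assert not is_late(105.0, watermark)  # still ahead of the watermark -> on time
```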
- Trade-offs in Stream Processing:
- You can prioritize latency (risk missing late data) or completeness (results arrive later).
- Balance between immediate processing and waiting for all data.
- Approaches to Watermarking:
- Simple method: assume there's a maximum delay for events.
- Allows handling of out-of-order data and ensures stream completeness.
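The "maximum delay" approach above can be sketched as a small class: the watermark simply trails the highest event timestamp seen so far by a fixed bound. This is a plain-Python illustration of the idea, not Flink's actual watermark-generator interface:

```python
class BoundedOutOfOrdernessWatermark:
    """Sketch of the maximum-delay strategy: the watermark lags the
    highest event timestamp seen so far by a fixed bound."""

    def __init__(self, max_delay: float):
        self.max_delay = max_delay
        self.max_timestamp_seen = float("-inf")

    def on_event(self, event_timestamp: float) -> float:
        # Track the largest timestamp; out-of-order events never move it back.
        self.max_timestamp_seen = max(self.max_timestamp_seen, event_timestamp)
        return self.current_watermark()

    def current_watermark(self) -> float:
        # "Events up to this time have been considered."
        return self.max_timestamp_seen - self.max_delay

wm = BoundedOutOfOrdernessWatermark(max_delay=5.0)
wm.on_event(10.0)
wm.on_event(8.0)  # out-of-order event: the watermark does not regress
assert wm.current_watermark() == 5.0  # 10.0 - 5.0
```

Events with timestamps at or below the emitted watermark are the ones the "Lateness" section classifies as late.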
What is Flink? Apache Flink® — Stateful Computations over Data Streams
Flink is a stream processor → data has to come from somewhere (databases, logs, transactions, etc.), passes through Flink, and goes somewhere else
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams
- Almost any kind of data is produced as a stream of events: credit card transactions, sensor measurements, machine logs, user interactions on a website / app, DNS data… are all good cases for Apache Flink
- Unbounded Streams: Have a start but no defined end.
- They do not terminate and provide data as it is generated
- Must be continuously processed: events have to be handled as they arrive
- We mostly have to process this type of data in the order in which events occurred to be able to reason about result completeness
- Bounded Streams: Defined start and end → Ingest all data before performing any computation
- Ordered ingestion is NOT required (can be sorted after)
- Also known as batch processing → data at rest, i.e. analysis on historical data
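The bounded-stream model (ingest everything, sort afterwards, then compute) can be sketched in plain Python; the record layout here is made up for illustration:

```python
# Hypothetical bounded stream: a finite list of (timestamp, value) records,
# possibly out of order, since ordered ingestion is not required.
records = [(3, 30), (1, 10), (2, 20)]

# Batch style: all data is ingested first, sorted afterwards, then computed on.
records.sort(key=lambda r: r[0])
total = sum(value for _, value in records)

assert [ts for ts, _ in records] == [1, 2, 3]
assert total == 60
```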
Apache Flink is a distributed processing engine that requires compute resources in order to execute applications. It integrates with common cluster resource managers (such as Hadoop YARN and Kubernetes), but can also run as a standalone cluster. When deploying applications, Flink automatically identifies the required resources, requests them from the resource manager, and replaces failed containers by requesting new resources.
- It’s one of the few frameworks that provides both Batch and Real-time Data Processing
- Other technologies — Batch Processing: Hadoop, Hive, Spark
- Other technologies — Real Time Processing: Spark Streaming
Apache Flink Benefits