Stream processing is the discipline focused on the processes and techniques used to extract information and value from unbounded data. This kind of data has an undefined, theoretically infinite size, and often arrives at the processing system in no particular order. Even worse, it must be handled on limited, physical hardware. How could something infinite fit into something finite?

Well, it doesn’t. Instead, data enters the system, stays in memory for a short time, and then either moves on or expires - that is why it is called unbounded. In other words, data should always be in motion through the hardware, as in message queues or event streams.

The counterpart of stream processing is batch processing, where we deal with bounded data. This is easier to manage because the size of bounded data is known - think of text files and spreadsheets. They’re good examples of bits and bytes at rest in a computer. Let’s table batch processing for now.

The tricky part of stream processing is that data sources, the system that processes their data, and destinations all track time differently - like separate timelines in the MCU. Sources, or producers as we call them, are often IoT sensors, smartphones, and other gadgets. Their dimension is ruled by the event timeline, and this is often the only timeline the business cares about. Once data is created, it is sent to a stream processing system, such as an Apache Kafka cluster, whose universe is controlled by the processing timeline - this one matters to technical people. The image below illustrates the relationship between the event and processing timelines.

Events generating data on the producer side are represented by red circles, while their arrival in the stream processing system is shown as blue circles. A delivery delay is always expected - meaning that the event generating the data always happens before it arrives at the processing system. That might seem obvious, but it is a frequently overlooked fact. On top of that, there is no guarantee that events will arrive in the same order they were created, which makes things challenging.

Whether the time data arrives at the processing system matters depends on the type of analysis. Stateless operations, like mapping or filtering input values, don’t rely on having all the data - the system can work on each event independently. However, stateful operations, like summing, sorting, or finding maximum or minimum values, do require a complete view of the data.
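To make the contrast concrete, here is a minimal sketch in plain Python - no streaming framework involved. The generator stands in for an unbounded source, and the readings are made-up values:

```python
def stream():
    """Stand-in for an unbounded source; yields hypothetical sensor readings."""
    for reading in [3, 7, 1, 9, 4]:
        yield reading

# Stateless: each event is transformed on its own; no memory of other events.
doubled = [r * 2 for r in stream()]

# Stateful: the running maximum depends on every event seen so far.
maximum = None
for r in stream():
    maximum = r if maximum is None else max(maximum, r)

print(doubled)  # [6, 14, 2, 18, 8]
print(maximum)  # 9
```

The stateless transform would produce correct output even if the stream never ended; the maximum, by contrast, is only “final” once we decide we have seen enough data.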

Take the example of finding the maximum value in a column. With a static table, the processing system scans through all the rows of a column and identifies the highest value. But how do we do this if the dataset is always growing? This is where windowing and watermarking come in. We use both techniques to carve out manageable chunks of data and help the system decide when it has seen “enough” data to safely compute results. Let’s see how these techniques work.


Windowing Strategies

In stream processing, windows add boundaries to the event timeline. The most common windowing strategies are the tumbling window and the sliding window.

Tumbling windows divide the event timeline into fixed intervals, and any data generated within each interval is seen as a separate chunk.

In the image above, we have a 10-minute tumbling window. The first window starts at 00:00 and ends at 00:10, the second one starts at 00:10 and ends at 00:20. The same logic applies to subsequent windows.
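The window assignment itself is simple arithmetic. Here is a minimal sketch in Python, with event times expressed as hypothetical minutes since 00:00:

```python
def tumbling_window(event_minute, size=10):
    """Map an event time to the [start, end) interval of its tumbling window."""
    start = (event_minute // size) * size
    return (start, start + size)

print(tumbling_window(3))   # (0, 10)  -> falls in the first window
print(tumbling_window(10))  # (10, 20) -> window starts are inclusive
print(tumbling_window(17))  # (10, 20)
```

Note that every event lands in exactly one window - the intervals never overlap.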

Sliding windows are more complex. In addition to the window’s “length” interval, we also consider an update interval. The image below illustrates a 10-minute sliding window with a 5-minute update interval.

In the image above, the sliding window starts at 00:00, updates once at 00:05 and ends at 00:10. When the first update occurs, the window “forgets” events that happened before 00:05. When the second update happens, the window no longer remembers events that happened before 00:10. The same idea applies to subsequent updates.
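Because sliding windows overlap, a single event can belong to more than one of them. A minimal sketch, assuming the 10-minute length and 5-minute update interval from above, again with times as minutes since 00:00:

```python
def sliding_windows(event_minute, size=10, slide=5):
    """Return every [start, end) sliding window that contains the event."""
    windows = []
    start = (event_minute // slide) * slide  # latest window start <= event
    while start > event_minute - size:
        if start >= 0:  # don't emit windows before the timeline begins
            windows.append((start, start + size))
        start -= slide
    return sorted(windows)

print(sliding_windows(7))   # [(0, 10), (5, 15)]
print(sliding_windows(12))  # [(5, 15), (10, 20)]
```

An event at 00:07 is counted in both the window starting at 00:00 and the one starting at 00:05 - that overlap is exactly what a tumbling window avoids.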

But, as we mentioned before, the event and processing timelines come from different devices. Data can arrive at the processing system too late - like the event at 00:12 that arrived at 00:30. Should the processing system still include that event in its calculations? What if thousands of other events actually happen in the producer between 00:00 and 00:10, but due to network conditions, their data arrives only at 00:50? It could arrive even later. Should we keep waiting for them? We need to set a threshold for how late an event can arrive - that is what watermarking is about.


Watermarking

Watermarks delimit how late an event can arrive at the stream processing system and still be considered in a stateful operation. The image below illustrates different watermarks combined with 10-minute tumbling windows to calculate the number of events happening in a 30-minute interval.

When it’s 00:20 in the streaming system, only one event has arrived - because the events that occurred at 00:12 and 00:16 on the event timeline haven’t shown up yet. At 00:30, the system counts either 3 or 4 events, depending on the watermark we define.

With a 10-minute watermark, the system will only consider events that arrive within 10 minutes of their creation time. In that case, the count from 00:00 to 00:30 would be 3 events. But if we set a 25-minute watermark, the system would wait longer and count 4 events instead.
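As a rough sketch of that decision, the code below filters events by their delivery delay. The (event time, arrival time) pairs are hypothetical and only loosely inspired by the figure, given as minutes since 00:00:

```python
# Hypothetical (event_minute, arrival_minute) pairs.
events = [
    (2, 4),    # 2 minutes late
    (8, 12),   # 4 minutes late
    (16, 22),  # 6 minutes late
    (12, 30),  # 18 minutes late
]

def count_within_watermark(events, watermark_minutes):
    """Count events whose delivery delay is within the watermark."""
    return sum(
        1 for event_minute, arrival_minute in events
        if arrival_minute - event_minute <= watermark_minutes
    )

print(count_within_watermark(events, 10))  # 3 - the 18-minutes-late event is dropped
print(count_within_watermark(events, 25))  # 4 - the late event makes the cut
```

Real engines such as Spark Structured Streaming or Apache Flink implement this with an advancing watermark timestamp rather than a per-event delay check, but the accept-or-drop decision is the same idea.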

I believe that when watermarks are presented in a graph, like the one above, they become much easier to interpret. We can think of them as lines - like staircases in a building - that go up in parallel. It’s as if we’re counting how many people are on the stairs, but we also get to decide how many floor levels to consider.


Conclusion

Stream processing comes with its own set of challenges - just like everything we have to learn. To overcome these challenges, understanding ideas like the different timelines of the same process can ease the journey. Here, we’ve seen that windowing and watermarking are just artificial boundaries that we create to analyze a potentially unlimited data set with finite resources. That makes me wonder if Thanos would snap away half of all producers to make stream processing easier.


References