Spark

Stream Processing: The Key Ideas

Stream processing is the discipline concerned with the processes and techniques used to extract information and value from unbounded data. This kind of data has an undefined, theoretically infinite size and often arrives at the processing system in no particular order. Even worse, it must be handled on limited physical hardware. How could something infinite fit into something finite? Well, it doesn’t. Instead, data enters the system, stays in memory for a short time, and then either moves on or expires; that is why it is called unbounded. In other words, data should always be in motion through the hardware, as in message queues or event streams. ...
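To make the "in motion" idea concrete, here is a minimal sketch in plain Python. It is not tied to any streaming engine: the `ExpiringBuffer` class, its `capacity` and `ttl` parameters, and the `running_window_sums` helper are all hypothetical names chosen for illustration, and events are assumed to be stamped with their arrival time.

```python
from collections import deque


class ExpiringBuffer:
    """Finite buffer for an unbounded stream (illustrative only).

    Keeps at most `capacity` events and drops any event that arrived
    more than `ttl` time units before the newest one.
    """

    def __init__(self, capacity, ttl):
        self.capacity = capacity
        self.ttl = ttl
        self.events = deque()  # (arrival_time, value) pairs, oldest first

    def add(self, arrival_time, value):
        self.events.append((arrival_time, value))
        # Expire events that have stayed in memory for too long.
        while self.events and arrival_time - self.events[0][0] > self.ttl:
            self.events.popleft()
        # Still over the memory budget: the oldest events move on.
        while len(self.events) > self.capacity:
            self.events.popleft()

    def values(self):
        return [value for _, value in self.events]


def running_window_sums(stream, capacity, ttl):
    """Consume a (possibly infinite) iterable lazily, one event at a time."""
    buffer = ExpiringBuffer(capacity, ttl)
    for arrival_time, value in stream:
        buffer.add(arrival_time, value)
        yield sum(buffer.values())
```

Because `running_window_sums` is a generator, it can be fed an endless source such as `((t, t) for t in itertools.count())`; memory use stays bounded by `capacity` no matter how long the stream runs.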

The Spill Problem in Apache Spark

In the post Spark Executor and its memory, we explored tasks, their partitions, and how executor memory is divided into different regions, each with its own responsibilities. We also mentioned that problems may arise when partitions are too large. That is what we are going to discuss here. As we have seen, data partitions are stored in the executor unified memory, processed, and released when a task finishes. The size of these partitions varies, and sometimes they aren’t processed quickly enough to free up space for other partitions. Because of this, Spark compresses data from memory and stores it on disk until the application needs it. When there is enough space and the application requires those partitions, Spark reads, decompresses, and writes them back into memory for processing. This is what Spill actually is. ...
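The compress, store, read back cycle described above can be sketched in a few lines of plain Python. This is only a toy model under loud assumptions, not Spark's implementation: the `SpillableStore` class is hypothetical, `pickle` and `zlib` stand in for Spark's serializers and compression codecs, memory is counted in partitions rather than bytes, and the oldest partition is always the one spilled.

```python
import os
import pickle
import tempfile
import zlib


class SpillableStore:
    """Toy model of spilling: memory holds at most `max_in_memory` partitions."""

    def __init__(self, max_in_memory):
        if max_in_memory < 1:
            raise ValueError("max_in_memory must be at least 1")
        self.max_in_memory = max_in_memory
        self.in_memory = {}  # partition_id -> rows, oldest first
        self.on_disk = {}    # partition_id -> path of the spilled file
        self.spill_dir = tempfile.mkdtemp(prefix="spill-sketch-")

    def put(self, partition_id, rows):
        # Not enough room: spill older partitions before storing the new one.
        while len(self.in_memory) >= self.max_in_memory:
            self._spill_oldest()
        self.in_memory[partition_id] = rows

    def _spill_oldest(self):
        victim_id = next(iter(self.in_memory))
        rows = self.in_memory.pop(victim_id)
        path = os.path.join(self.spill_dir, f"{victim_id}.bin")
        with open(path, "wb") as spill_file:
            # Serialize, compress, and write to disk.
            spill_file.write(zlib.compress(pickle.dumps(rows)))
        self.on_disk[victim_id] = path

    def get(self, partition_id):
        if partition_id in self.on_disk:
            # Read, decompress, and bring the partition back into memory.
            path = self.on_disk.pop(partition_id)
            with open(path, "rb") as spill_file:
                rows = pickle.loads(zlib.decompress(spill_file.read()))
            os.remove(path)
            self.put(partition_id, rows)
        return self.in_memory[partition_id]
```

Even in this toy version, every spilled partition pays for serialization, compression, and two disk operations before it can be processed again, which hints at why spill tends to hurt performance.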

Developing a Testable Batch Spark Application

Introduction: In my experience, developing testable Spark application code is not an easy task for data practitioners. I am not going to discuss the underlying reasons. In this post, I present my reasoning while developing a testable batch Spark application. The text is presented in two sections. In the first section, “TDD - Developing code from the tests”, I show an example of how to develop code that is modular, readable, comprehensive, testable, and easy to maintain. In the last section, “More than producing pretty code - it’s about building organizational knowledge”, I emphasize the benefits of using TDD in data projects, based on my experience and on other sources that may help you understand this methodology. ...

Why should you be careful with DISTINCT?

If there’s a chance a DataFrame contains duplicated rows, it’s a good idea to deduplicate it before loading it into the table. Better to be safe than sorry, right? Absolutely. But sometimes using DISTINCT clauses carelessly leads to serious performance issues. I think every data practitioner has made the mistake of adding DISTINCT clauses to every query and DataFrame to ensure no duplicated rows are sneaking in. Since I’ve seen a lot of people doing this, I figured it’s a good idea to walk through an example and explain why this isn’t the best solution. ...

Adding Job Descriptions Details to an Apache Spark Application

Having a clear job description in an Apache Spark application makes it easy to spot optimization opportunities. By using the setJobGroup method properly, you can quickly link code issues with what shows up in the Spark UI. In this short post, I’ll show how to do just that. The problem: Let’s suppose we are running a benchmark for a simple application that only reads and sorts data. We want to evaluate its performance by varying the number of partitions. The initial code is shown below. ...

Spark Executor and its memory

In the Spark Application Architecture post, we discussed Apache Spark architecture concepts. As we saw, tasks are the fundamental unit of work in Spark, and we are going to use them here to talk about the Spark Executor and its memory. In the first section, “Tasks and Partitions”, we are going to see the relation among tasks, partitions, and the hardware. In the second section, “On-Heap and Off-Heap Memory”, we talk about the executor memory with a special focus on the On-Heap memory. In the third part, “Reserved, Unified and User Memories”, we describe the On-Heap memory in more detail and how it’s used. In the fourth, “Unified Memory: Storage and Execution”, we unveil some details about how this memory behaves according to the size of the objects being stored in it. ...

Apache Spark Application Architecture

In this post, I’d like to present some concepts for a better understanding of Apache Spark applications. Most of the content here is available in many books, blog posts, paid courses, and free YouTube videos. Here, I just compiled these materials and added some important details from my own experience. This text is divided into three sections. In the first section, “Apache Spark Components Overview”, I present the basic Apache Spark components and their respective roles when executing an application, as well as the composition of an Apache Spark application. In the second section, “Actions, Transformations and Lazy Evaluation”, I discuss these three important concepts, which are frequently mentioned in the first section as well as in every text about Apache Spark. The third section is the Conclusion, where I wrap up the previous sections. ...