The Spill Problem in Apache Spark
In the post Spark Executor and its memory, we explored tasks, their partitions, and how executor memory is divided into different regions, each with its own responsibilities. We also mentioned that problems may arise when partitions are too large. That is what we are going to discuss here. As we have seen, data partitions are stored in the executor's unified memory, processed, and released when a task finishes. The size of these partitions varies, and sometimes they aren't processed quickly enough to free up space for other partitions. When that happens, Spark compresses data from memory and stores it on disk until the application needs it. When there is enough space and the application requires those partitions, Spark reads, decompresses, and writes them back into memory for processing. This is what Spill actually is. ...
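As a quick taste of the scenario the post explores, here is a minimal PySpark sketch, with made-up dataset name, path, and columns, of the kind of job where large partitions can end up spilling; the memory setting is illustrative only (in practice it is usually set at submit time):

```python
from pyspark.sql import SparkSession

# Hypothetical setup: a deliberately small executor heap makes spill easier to reproduce.
spark = (
    SparkSession.builder
    .appName("spill-demo")                  # illustrative name
    .config("spark.executor.memory", "1g")  # illustrative, small heap
    .getOrCreate()
)

# "events", its path, and columns are placeholders; any dataset larger than
# the available execution memory will do.
events = spark.read.parquet("/data/events")

# Forcing a few large partitions before a wide operation (sort) increases the
# chance that a partition does not fit in execution memory and gets spilled.
result = events.repartition(4).sort("user_id", "timestamp")
result.write.mode("overwrite").parquet("/data/events_sorted")

# After running, the "Spill (Memory)" and "Spill (Disk)" columns in the
# Spark UI stage pages show how much data was written to disk and read back.
```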
Deploying a password generator application in AWS EKS - Part 2
In the previous post we started to deploy a password generator application in AWS EKS. By the end of it, two problems came to light: there is no SSL certificate assuring that our domain is ours (it could be someone else faking a connection to sniff data from your computer), and we need a ✨remarkable✨ address. We are going to start by registering an address. To follow along, we need to buy a domain from a registrar. In our example, we use Cloudflare ...
Deploying a password generator application in AWS EKS - Part 1
In the post Creating a distroless signed docker image, a password generator application was shipped in a distroless signed docker image. In this tutorial, we are going to use that image and deploy the same application in AWS EKS. We are not going to show how to install local dependencies or how to set up an AWS account and user permissions, but we provide links to the relevant documentation.
Dependencies
In order to complete this tutorial, you need to install and configure the following applications and services: ...
Developing a Testable Batch Spark Application
Introduction
In my experience, developing testable Spark application code is not an easy task for data practitioners. I am not going to discuss the underlying reasons here. In this post, I present my reasoning while developing a testable batch Spark application. The text is organized in two sections. In the first section, TDD - Developing code from the tests, I show an example of how to develop code that is modular, readable, comprehensible, testable, and easy to maintain. In the last section, More than producing pretty code - it's about building organizational knowledge, I emphasize the benefits of using TDD in data projects, based on my experience and on other sources that may help you understand this methodology. ...
Kubernetes Architecture - The Basics
Kubernetes is an open-source container orchestrator popular among Software Engineers and DevOps Engineers, and it's gaining momentum in Data. In this post, I'm sharing the notes I took while studying Kubernetes architecture. Before starting, I'd like to summarize some keywords.
The Jargon
Agent: software that acts on behalf of a user or of other software, which can itself be an agent.
Container: containers are all about resource isolation. An application running in a container shares the same hardware as the host, but it only gets the amount of computing resources, i.e. CPU, memory, and network, that the developer allows. It's like dedicating a slice of a computer to run an application.
Containerized Application: an application running in a container.
Container Engine: a high-level software tool responsible for automating the process of creating isolated, lightweight environments. It's the component humans usually interact with in order to create containers, and this includes managing container images and container orchestration. Container engines use a container runtime to process requests made by a user.
Container Runtime: the container engine component responsible for the interactions between the application in a container and the host operating system, resource allocation, and container execution.
Controller: controllers are non-terminating loops that regulate the state of a system. For example, a thermostat in a room keeps checking the temperature in order to decide whether to turn an air conditioner on or off. A minimal sketch of such a loop is shown below.
Cluster: a set of computers (nodes) connected in a network in order to work together as if they were a single computer.
Orchestrator: a system that reacts to a demand for computing resources. The orchestrator is responsible for allocating the desired amount of resources when tasks are submitted, checking that the proper amount of resources is available during the execution of an application, and self-healing when something breaks.
Pods: the smallest deployable unit of computing that you can create and manage in Kubernetes. Pods are composed of one or more containers, depending on the need. Containers in a Pod share the same network and storage, and run on the same node.
The Big Picture
A 40,000 ft look at the Kubernetes architecture looks like this: ...
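To make the controller idea above concrete, here is a minimal Python sketch of the observe-compare-act loop described in the thermostat example; the function names and values are illustrative placeholders, not part of Kubernetes or any real API:

```python
import random
import time

DESIRED_TEMPERATURE = 22.0  # the declared desired state

def read_current_temperature() -> float:
    """Observe the current state of the system (simulated placeholder)."""
    return random.uniform(18.0, 28.0)

def turn_air_conditioner(on: bool) -> None:
    """Act on the system to move it toward the desired state (placeholder)."""
    print("air conditioner:", "on" if on else "off")

def reconcile() -> None:
    # Compare the observed state with the desired state and act on the difference.
    current = read_current_temperature()
    turn_air_conditioner(on=current > DESIRED_TEMPERATURE)

if __name__ == "__main__":
    # A controller is a non-terminating loop: observe, compare, act, repeat.
    while True:
        reconcile()
        time.sleep(5)
```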
Creating a distroless signed docker image
Distroless Images
Distroless container images, unlike traditional ones, do not include software that is common in distro-based images, such as package managers and shells. This approach aims to minimize the image size and reduce vulnerabilities by removing components that are unnecessary to run an application. These images are better suited to production environments than to running interactive containers, since they are often smaller and have fewer attack vectors than traditional images. ...
Why should you be careful with DISTINCT?
If there’s a chance a DataFrame contains duplicated rows, it’s a good idea to deduplicate it before loading it into the table. Better safe than sorry, right? Absolutely. But using DISTINCT clauses carelessly can lead to serious performance issues. I think every data practitioner has made the mistake of adding DISTINCT clauses to every query and DataFrame to ensure no duplicated rows are sneaking in. Since I’ve seen a lot of people doing this, I figured it’s a good idea to walk through an example and explain why this isn’t the best solution. ...
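As an illustration of the pattern the post examines, here is a minimal PySpark sketch (table name, path, and columns are made up) of a "just in case" deduplication and a narrower alternative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct-demo").getOrCreate()

# "orders" and its path are placeholders for any large source table.
orders = spark.read.parquet("/data/orders")

# A blanket DISTINCT over every column forces Spark to shuffle the whole
# dataset and compare entire rows, even when the source is already unique.
deduped_everything = orders.distinct()

# If duplicates can only come from a known key, deduplicating on that key
# (or, better, fixing the upstream cause) keeps the work much smaller.
deduped_by_key = orders.dropDuplicates(["order_id"])

# explain() shows the extra Exchange (shuffle) the deduplication introduces.
deduped_everything.explain()
```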
Adding Job Description Details to an Apache Spark Application
Having a clear job description in an Apache Spark application makes it easy to spot optimization opportunities. By using the setJobGroup method properly, you can quickly link code issues with what shows up in the Spark UI. In this short post, I’ll show how to do just that.
The problem
Let’s suppose we are running a benchmark for a simple application that only reads and sorts data. We want to evaluate its performance by varying the number of partitions. The initial code is shown below. ...
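The post's own code is cut off in this excerpt; as a separate, hedged sketch of the setJobGroup call it discusses (paths, column, and partition counts are made up), labeling each benchmark run might look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-benchmark").getOrCreate()
sc = spark.sparkContext

# "events" and the partition counts below are placeholders for the benchmark scenario.
df = spark.read.parquet("/data/events")

for num_partitions in (8, 64, 256):
    # Every job triggered after this call is tagged in the Spark UI with the
    # group id and description set here, so each run is easy to tell apart
    # on the Jobs page.
    sc.setJobGroup(
        f"sort-{num_partitions}",
        f"read + sort with {num_partitions} partitions",
    )
    df.repartition(num_partitions).sort("timestamp").write.mode("overwrite").parquet(
        f"/tmp/sorted_{num_partitions}"
    )
```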
Spark Executor and its memory
In the Spark Application Architecture post, we discussed Apache Spark architecture concepts. As we saw, tasks are the fundamental unit of work in Spark, and we are going to use them here to talk about the Spark Executor and its memory. In the first section, “Tasks and Partitions”, we look at the relationship among tasks, partitions, and the hardware. In the second section, “On-Heap and Off-Heap Memory”, we talk about the executor memory, with a special focus on the On-Heap memory. In the third section, “Reserved, Unified and User Memories”, we describe the On-Heap memory in more detail and how it’s used. In the fourth, “Unified Memory: Storage and Execution”, we unveil some details about how this memory behaves according to the size of the objects stored in it. ...
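As a taste of the knobs the post walks through, here is a hedged PySpark configuration sketch showing the settings that shape those memory regions; the values are illustrative only, not recommendations:

```python
from pyspark.sql import SparkSession

# Illustrative values; the right sizes depend entirely on the workload.
spark = (
    SparkSession.builder
    .appName("executor-memory-demo")
    .config("spark.executor.memory", "4g")           # on-heap size per executor
    .config("spark.memory.fraction", "0.6")          # share of (heap - reserved) used as unified memory
    .config("spark.memory.storageFraction", "0.5")   # storage's protected share of unified memory
    .config("spark.memory.offHeap.enabled", "true")  # optional off-heap region
    .config("spark.memory.offHeap.size", "1g")
    .getOrCreate()
)
```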
Apache Spark Application Architecture
In this post, I’d like to show some concepts for a better understanding of Apache Spark applications. Most of the content here is available in many books, blog posts, paid courses, and free YouTube videos; I just compiled these materials and added some important details from my own experience. The text is divided into three sections. In the first section, “Apache Spark Components Overview”, I present the basic Apache Spark components and their respective roles when executing an application, as well as the composition of an Apache Spark application. In the second section, “Actions, Transformations and Lazy Evaluation”, I discuss these three important concepts, which are frequently mentioned in the first section as well as in every text about Apache Spark. The third section is the Conclusion, where I wrap up the previous sections. ...
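To illustrate the transformation/action distinction the post goes into, here is a minimal PySpark sketch (the tiny dataset is made up) showing that nothing runs until an action is called:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# A small made-up dataset, just to have something to work with.
df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("a", 3)],
    ["key", "value"],
)

# Transformations: these only build up a logical plan; nothing executes yet.
grouped = df.filter(F.col("value") > 1).groupBy("key").sum("value")

# Action: this is what actually triggers a Spark job and materializes a result.
grouped.show()
```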