The Spill Problem in Apache Spark
In the post Spark Executor and its memory, we explored tasks, their partitions, and how executor memory is divided into different regions, each with its own responsibilities. We also mentioned that problems may arise when partitions are too large. That is what we are going to discuss here. As we have seen, data partitions are stored in the executor's unified memory, processed, and released when a task finishes. The size of these partitions varies, and sometimes they aren't processed quickly enough to free up space for other partitions. When that happens, Spark compresses data from memory and stores it on disk until the application needs it. When there is enough space and the application requires those partitions, Spark reads, decompresses, and writes them back into memory for processing. This is what Spill actually is. ...
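As a quick taste of the scenario the post explores, here is a minimal PySpark sketch, with made-up dataset name, path, and columns, of the kind of job where large partitions can end up spilling; the memory setting is illustrative only (in practice it is usually set at submit time):

```python
from pyspark.sql import SparkSession

# Hypothetical setup: a deliberately small executor heap makes spill easier to reproduce.
spark = (
    SparkSession.builder
    .appName("spill-demo")                  # illustrative name
    .config("spark.executor.memory", "1g")  # illustrative, small heap
    .getOrCreate()
)

# "events", its path, and columns are placeholders; any dataset larger than
# the available execution memory will do.
events = spark.read.parquet("/data/events")

# Forcing a few large partitions before a wide operation (sort) increases the
# chance that a partition does not fit in execution memory and gets spilled.
result = events.repartition(4).sort("user_id", "timestamp")
result.write.mode("overwrite").parquet("/data/events_sorted")

# After running, the "Spill (Memory)" and "Spill (Disk)" columns in the
# Spark UI stage pages show how much data was written to disk and read back.
```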
Deploying a password generator application in AWS EKS - Part 2
In the previous post we started to deploy a password generator application in AWS EKS. By the end of it, two problems came to light: there is no SSL certificate assuring that our domain is ours (it could be someone else faking a connection to sniff data from your computer), and we need a ✨remarkable✨ address. We are going to start by registering an address. To follow along, we need to buy a domain from a registrar. In our example, we use Cloudflare ...
Deploying a password generator application in AWS EKS - Part 1
In the post Creating a distroless signed docker image, a password generator application was shipped in a distroless signed docker image. In this tutorial, we are going to use that image and deploy the same application in AWS EKS. We are not going to show how to install local dependencies or how to set up an AWS account and user permissions, but we provide links to the relevant documentation.
Dependencies
In order to complete this tutorial, you need to install and configure the following applications and services: ...
Developing a Testable Batch Spark Application
Introduction
In my experience, developing testable Spark application code is not an easy task for data practitioners. I am not going to discuss the underlying reasons here. In this post, I present my reasoning while developing a testable batch Spark application. The text is organized in two sections. In the first section, TDD - Developing code from the tests, I show an example of how to develop code that is modular, readable, comprehensible, testable, and easy to maintain. In the last section, More than producing pretty code - it's about building organizational knowledge, I emphasize the benefits of using TDD in data projects, based on my experience and on other sources that may help you understand this methodology. ...
Kubernetes Architecture - The Basics
Kubernetes is an open-source container orchestrator popular among Software Engineers and DevOps Engineers, and it's gaining momentum in Data. In this post, I'm sharing the notes I took while studying Kubernetes architecture. Before starting, I'd like to summarize some keywords.
The Jargon
Agent: software that acts on behalf of a user or of other software, which can itself be an agent.
Container: containers are all about resource isolation. An application running in a container shares the same hardware as the host, but it only gets the amount of computing resources, i.e. CPU, memory, and network, that the developer allows. It's like dedicating a slice of a computer to run an application.
Containerized Application: an application running in a container.
Container Engine: a high-level software tool responsible for automating the process of creating isolated, lightweight environments. It's the component humans usually interact with in order to create containers, and this includes managing container images and container orchestration. Container engines use a container runtime to process requests made by a user.
Container Runtime: the container engine component responsible for the interactions between the application in a container and the host operating system, resource allocation, and container execution.
Controller: controllers are non-terminating loops that regulate the state of a system. For example, a thermostat in a room keeps checking the temperature in order to decide whether to turn an air conditioner on or off. A minimal sketch of such a loop is shown below.
Cluster: a set of computers (nodes) connected in a network in order to work together as if they were a single computer.
Orchestrator: a system that reacts to a demand for computing resources. The orchestrator is responsible for allocating the desired amount of resources when tasks are submitted, checking that the proper amount of resources is available during the execution of an application, and self-healing when something breaks.
Pods: the smallest deployable unit of computing that you can create and manage in Kubernetes. Pods are composed of one or more containers, depending on the need. Containers in a Pod share the same network and storage, and run on the same node.
The Big Picture
A 40,000 ft look at the Kubernetes architecture looks like this: ...
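To make the controller idea above concrete, here is a minimal Python sketch of the observe-compare-act loop described in the thermostat example; the function names and values are illustrative placeholders, not part of Kubernetes or any real API:

```python
import random
import time

DESIRED_TEMPERATURE = 22.0  # the declared desired state

def read_current_temperature() -> float:
    """Observe the current state of the system (simulated placeholder)."""
    return random.uniform(18.0, 28.0)

def turn_air_conditioner(on: bool) -> None:
    """Act on the system to move it toward the desired state (placeholder)."""
    print("air conditioner:", "on" if on else "off")

def reconcile() -> None:
    # Compare the observed state with the desired state and act on the difference.
    current = read_current_temperature()
    turn_air_conditioner(on=current > DESIRED_TEMPERATURE)

if __name__ == "__main__":
    # A controller is a non-terminating loop: observe, compare, act, repeat.
    while True:
        reconcile()
        time.sleep(5)
```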
Creating a distroless signed docker image
Distroless Images
Distroless container images, unlike traditional ones, do not include software that is common in distro-based images, such as package managers and shells. This approach aims to minimize the image size and reduce vulnerabilities by removing components that are unnecessary to run an application. These images are better suited to production environments than to running interactive containers, since they are often smaller and have fewer attack vectors than traditional images. ...
Why should you be careful with DISTINCT?
If there’s a chance a DataFrame contains duplicated rows, it’s a good idea to deduplicate it before loading it into the table. Better safe than sorry, right? Absolutely. But using DISTINCT clauses carelessly can lead to serious performance issues. I think every data practitioner has made the mistake of adding DISTINCT clauses to every query and DataFrame to ensure no duplicated rows are sneaking in. Since I’ve seen a lot of people doing this, I figured it’s a good idea to walk through an example and explain why this isn’t the best solution. ...
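As an illustration of the pattern the post examines, here is a minimal PySpark sketch (table name, path, and columns are made up) of a "just in case" deduplication and a narrower alternative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct-demo").getOrCreate()

# "orders" and its path are placeholders for any large source table.
orders = spark.read.parquet("/data/orders")

# A blanket DISTINCT over every column forces Spark to shuffle the whole
# dataset and compare entire rows, even when the source is already unique.
deduped_everything = orders.distinct()

# If duplicates can only come from a known key, deduplicating on that key
# (or, better, fixing the upstream cause) keeps the work much smaller.
deduped_by_key = orders.dropDuplicates(["order_id"])

# explain() shows the extra Exchange (shuffle) the deduplication introduces.
deduped_everything.explain()
```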
Adding Job Description Details to an Apache Spark Application
Having a clear job description in an Apache Spark application makes it easy to spot optimization opportunities. By using the setJobGroup method properly, you can quickly link code issues with what shows up in the Spark UI. In this short post, I’ll show how to do just that.
The problem
Let’s suppose we are running a benchmark for a simple application that only reads and sorts data. We want to evaluate its performance by varying the number of partitions. The initial code is shown below. ...
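The post's own code is cut off in this excerpt; as a separate, hedged sketch of the setJobGroup call it discusses (paths, column, and partition counts are made up), labeling each benchmark run might look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-benchmark").getOrCreate()
sc = spark.sparkContext

# "events" and the partition counts below are placeholders for the benchmark scenario.
df = spark.read.parquet("/data/events")

for num_partitions in (8, 64, 256):
    # Every job triggered after this call is tagged in the Spark UI with the
    # group id and description set here, so each run is easy to tell apart
    # on the Jobs page.
    sc.setJobGroup(
        f"sort-{num_partitions}",
        f"read + sort with {num_partitions} partitions",
    )
    df.repartition(num_partitions).sort("timestamp").write.mode("overwrite").parquet(
        f"/tmp/sorted_{num_partitions}"
    )
```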
Spark Executor and its memory
In the Spark Application Architecture post, we discussed Apache Spark architecture concepts. As we saw, tasks are the fundamental unit of work in Spark, and we are going to use them here to talk about the Spark Executor and its memory. In the first section, “Tasks and Partitions”, we look at the relationship among tasks, partitions, and the hardware. In the second section, “On-Heap and Off-Heap Memory”, we talk about the executor memory, with a special focus on the On-Heap memory. In the third section, “Reserved, Unified and User Memories”, we describe the On-Heap memory in more detail and how it’s used. In the fourth, “Unified Memory: Storage and Execution”, we unveil some details about how this memory behaves according to the size of the objects stored in it. ...
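As a taste of the knobs the post walks through, here is a hedged PySpark configuration sketch showing the settings that shape those memory regions; the values are illustrative only, not recommendations:

```python
from pyspark.sql import SparkSession

# Illustrative values; the right sizes depend entirely on the workload.
spark = (
    SparkSession.builder
    .appName("executor-memory-demo")
    .config("spark.executor.memory", "4g")           # on-heap size per executor
    .config("spark.memory.fraction", "0.6")          # share of (heap - reserved) used as unified memory
    .config("spark.memory.storageFraction", "0.5")   # storage's protected share of unified memory
    .config("spark.memory.offHeap.enabled", "true")  # optional off-heap region
    .config("spark.memory.offHeap.size", "1g")
    .getOrCreate()
)
```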
Apache Spark Application Architecture
In this post, I’d like to show some concepts for a better understanding of Apache Spark applications. Most of the content here is available in many books, blog posts, paid courses, and free YouTube videos; I just compiled these materials and added some important details from my own experience. The text is divided into three sections. In the first section, “Apache Spark Components Overview”, I present the basic Apache Spark components and their respective roles when executing an application, as well as the composition of an Apache Spark application. In the second section, “Actions, Transformations and Lazy Evaluation”, I discuss these three important concepts, which are frequently mentioned in the first section as well as in every text about Apache Spark. The third section is the Conclusion, where I wrap up the previous sections. ...
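To illustrate the transformation/action distinction the post goes into, here is a minimal PySpark sketch (the tiny dataset is made up) showing that nothing runs until an action is called:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# A small made-up dataset, just to have something to work with.
df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("a", 3)],
    ["key", "value"],
)

# Transformations: these only build up a logical plan; nothing executes yet.
grouped = df.filter(F.col("value") > 1).groupBy("key").sum("value")

# Action: this is what actually triggers a Spark job and materializes a result.
grouped.show()
```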