Kubernetes Architecture - The Basics
Kubernetes is an open-source container orchestrator popular among software engineers and DevOps engineers, and it’s gaining momentum in data. In this post, I’m sharing the notes I took while studying Kubernetes architecture. Before starting, I’d like to summarize some key terms. The Jargon Agent: software that acts on behalf of a user or another piece of software, which can also be an agent. Container: containers are all about resource isolation. An application running in a container shares the same hardware as the host, but it only gets the amount of computing resources, i....
Distroless signed docker image application
Distroless Images Distroless container images, unlike traditional ones, do not include software that is common in distro-based images, such as package managers and shells. This approach aims to minimize the image size and reduce vulnerabilities by removing components that are not needed to run an application. These images are better suited to production environments than to running interactive containers, since they are often smaller and have fewer attack vectors than traditional images....
Why should you be careful with DISTINCT?
If there’s a chance a DataFrame contains duplicated rows, it’s a good idea to deduplicate it before loading into the table. Better to be safe than sorry, right? Absolutely. But sometimes using DISTINCT clauses carelessly leads to serious performance issues. I think every data practitioner has made the mistake of adding DISTINCT clauses to every query and DataFrame to ensure no duplicated rows are sneaking in. Since I’ve seen a lot of people doing this, I figured it’s a good idea to walk through an example and explain why this isn’t the best solution....
Adding Job Descriptions Details to an Apache Spark Application
Having a clear job description in an Apache Spark application makes it easy to spot optimization opportunities. By using the setJobGroup method properly, you can quickly link code issues with what shows up in the Spark UI. In this short post, I’ll show how to do just that. The problem Let’s suppose we are running a benchmark for a simple application that only reads and sorts data. We want to evaluate its performance by varying the number of partitions....