Why should you be careful with DISTINCT?

If there’s a chance a DataFrame contains duplicated rows, it’s a good idea to deduplicate it before loading it into the table. Better safe than sorry, right? Absolutely. But using DISTINCT clauses carelessly can lead to serious performance issues. I think every data practitioner has made the mistake of adding DISTINCT clauses to every query and DataFrame to ensure no duplicated rows sneak in. Since I’ve seen a lot of people doing this, I figured it would be a good idea to walk through an example and explain why this isn’t the best solution....

October 8, 2024 · Leandro Kellermann de Oliveira

Adding Job Description Details to an Apache Spark Application

Having a clear job description in an Apache Spark application makes it easier to spot optimization opportunities. By using the setJobGroup method properly, you can quickly link code issues with what shows up in the Spark UI. In this short post, I’ll show how to do just that. The problem: let’s suppose we are running a benchmark for a simple application that only reads and sorts data. We want to evaluate its performance by varying the number of partitions....

September 30, 2024 · Leandro Kellermann de Oliveira