The Spill Problem in Apache Spark
In the post Spark Executor and its memory, we explored tasks, their partitions, and how executor memory is divided into different regions, each with its own responsibilities. We also mentioned that problems may arise when partitions are too large. That is what we are going to discuss here.

As we have seen, data partitions are stored in the executor's unified memory, processed, and released when their task finishes. The size of these partitions varies, and sometimes they aren't processed quickly enough to free up space for other partitions. When that happens, Spark serializes and compresses part of the in-memory data and stores it on disk until the application needs it again. When there is enough memory and the application requires those partitions, Spark reads them back from disk, decompresses them, and loads them into memory for processing. This round trip between memory and disk is what Spill actually is. ...
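If you want to see spill happening in your own jobs, one option is to hook into the per-task spill counters that Spark already tracks (the same counters the Spark UI surfaces as "Spill (Memory)" and "Spill (Disk)" on the Stages tab). Below is a minimal Scala sketch of a SparkListener that logs them; the class and application names are illustrative, and it assumes a Spark 3.x application:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

// Illustrative listener: logs tasks whose data was spilled to disk.
// memoryBytesSpilled / diskBytesSpilled are the counters behind the
// "Spill (Memory)" and "Spill (Disk)" columns in the Spark UI.
class SpillListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    // taskMetrics can be null for failed tasks, so guard against it
    if (metrics != null && metrics.diskBytesSpilled > 0) {
      println(
        s"Task ${taskEnd.taskInfo.taskId} spilled: " +
        s"${metrics.memoryBytesSpilled} bytes in memory, " +
        s"${metrics.diskBytesSpilled} bytes on disk")
    }
  }
}

// Hypothetical entry point showing how the listener is registered.
object SpillDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spill-demo")
      .getOrCreate()
    spark.sparkContext.addSparkListener(new SpillListener)
    // ... run a shuffle-heavy job here; any spilled tasks get logged.
  }
}
```

Note that memoryBytesSpilled reports the size of the spilled records as they were held in memory, while diskBytesSpilled reports their serialized, compressed size on disk, which is why the second number is usually much smaller.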