Gaps and islands is a classical data analysis problem where we aim to identify gaps in a sequence or contiguous ranges of values (islands). It is relevant to applications where we need to detect the interruption of a sequence. To illustrate the problem, let's suppose we have cards as illustrated in the image below:
Each card contains an identification code at the top, a number in the middle representing anything with an inherent sense of order, such as a timestamp or a sequential code, and a letter at the bottom indicating an event.
For some reason, we need to create groups following a rule that someone has defined. Here, cards that appear before an event `a` and the card with the event `a` itself belong to the same group (island). The image below illustrates how cards are grouped into islands based on this rule:
We use those islands to create new cards containing only the first and the last event of each one. The image below illustrates this outcome:
We could even fill the gaps in the sequence within an island by creating a new event `x` in it, as occurs in island 3. This outcome is illustrated below, with the card `nx8` being the one created to fill the gap between the cards `n23` and `n8`:
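Before turning to SQL, the card-grouping rule can be sketched in a few lines of plain Python. This is only an illustration of the island rule described above; the card tuples and the `split_islands` helper are made up for this example:

```python
# Minimal sketch of the island rule: every card up to and including the
# next event "a" belongs to the same island.
# Each card is a hypothetical (code, order, event) triple.
cards = [
    ("n1", 1, "b"), ("n2", 2, "c"), ("n3", 3, "a"),
    ("n4", 4, "b"), ("n5", 5, "a"),
]

def split_islands(cards):
    islands, current = [], []
    for card in cards:
        current.append(card)
        if card[2] == "a":      # an "a" event closes the current island
            islands.append(current)
            current = []
    if current:                  # trailing cards with no closing "a"
        islands.append(current)
    return islands

islands = split_islands(cards)
```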
The same approach can be applied to rows in tables. In this post, I'm going to use two mock datasets to illustrate a real scenario where the gaps and islands problem was applied. I'll be using Databricks Community Edition to run this example. To build the mock datasets, simply copy, paste, and run the following code in the first cell of your Python notebook:
column_names_1 = ['id', 'day', 'vl_ds1']
dataset_1 = [
('A', 1, 'value_1'),
('A', 2, 'value_2'),
('A', 3, 'value_3'),
('A', 4, 'value_4'),
('A', 5, 'value_5'),
('A', 6, 'value_6'),
('B', 2, 'value_2'),
('B', 3, 'value_3'),
('C', 4, 'value_4'),
('C', 5, 'value_1'),
]
df1 = spark.createDataFrame(dataset_1, column_names_1)
df1.createOrReplaceTempView("dataset1")
column_names_2 = ['id', 'day', 'vl_ds2']
dataset_2 = [
('A', 2, 'value_2a'),
('A', 4, 'value_4a'),
('B', 2, 'value_2b'),
('C', 4, 'value_4c'),
]
df2 = spark.createDataFrame(dataset_2, column_names_2)
df2.createOrReplaceTempView("dataset2")
Framing the problem
At work this week, a friend of mine asked for help with a similar situation. Let's suppose we have two datasets named `dataset1` and `dataset2`, each one holding values for the events `vl_ds1` and `vl_ds2`, respectively. These events are related in spite of being recorded in different tables, and an event in `dataset2` can only happen if there is an event in `dataset1` on the same day.
The `dataset1` table is represented below:
id | day | vl_ds1 |
---|---|---|
A | 1 | value_1 |
A | 2 | value_2 |
A | 3 | value_3 |
A | 4 | value_4 |
A | 5 | value_5 |
A | 6 | value_6 |
B | 2 | value_2 |
B | 3 | value_3 |
C | 4 | value_4 |
C | 5 | value_1 |
And `dataset2` is:
id | day | vl_ds2 |
---|---|---|
A | 2 | value_2a |
A | 4 | value_4a |
B | 2 | value_2b |
C | 4 | value_4c |
When joining both tables with `dataset1` on the left, we have the following result:
id | day | vl_ds1 | vl_ds2 |
---|---|---|---|
A | 1 | value_1 | null |
A | 2 | value_2 | value_2a |
A | 3 | value_3 | null |
A | 4 | value_4 | value_4a |
A | 5 | value_5 | null |
A | 6 | value_6 | null |
B | 2 | value_2 | value_2b |
B | 3 | value_3 | null |
C | 4 | value_4 | value_4c |
C | 5 | value_1 | null |
But we need to create a `vl_ds2_fix` column where each non-null value of `vl_ds2` is propagated to the following rows of the same `id` until another non-null value appears, and the first non-null value is also propagated backwards to the preceding rows. The table below represents the desired outcome:
id | day | vl_ds1 | vl_ds2_fix | vl_ds2 |
---|---|---|---|---|
A | 1 | value_1 | value_2a | null |
A | 2 | value_2 | value_2a | value_2a |
A | 3 | value_3 | value_2a | null |
A | 4 | value_4 | value_4a | value_4a |
A | 5 | value_5 | value_4a | null |
A | 6 | value_6 | value_4a | null |
B | 2 | value_2 | value_2b | value_2b |
B | 3 | value_3 | value_2b | null |
C | 4 | value_4 | value_4c | value_4c |
C | 5 | value_1 | value_4c | null |
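Conceptually, the desired `vl_ds2_fix` is a forward fill of `vl_ds2` within each `id`, plus a back fill of the leading gap. Before doing it in SQL, here is a minimal plain-Python sketch; the `fill_gaps` helper is hypothetical and not part of the solution we build next:

```python
def fill_gaps(values):
    """Forward-fill non-null values, then back-fill the leading gap."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v               # carry the last seen value forward
        out.append(last)
    # back-fill the leading Nones with the first non-null value
    first = next((v for v in out if v is not None), None)
    return [first if v is None else v for v in out]

# vl_ds2 for id A ordered by day, as in the joined table above
filled = fill_gaps([None, "value_2a", None, "value_4a", None, None])
```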
With that in mind, let's start solving the problem.
Solving the problem
First, we need to create the joined table with new columns containing the previous and next values of `vl_ds2`. These columns will help us identify a pattern:
%sql
with full_data as (
select
a.id
, a.day
, a.vl_ds1
, b.vl_ds2
, lag(b.vl_ds2) over(partition by a.id order by a.day asc ) as prev_ds2
, lead(b.vl_ds2) over(partition by a.id order by a.day asc ) as next_ds2
from dataset1 a
left join dataset2 b on a.id = b.id and a.day = b.day
)
select * from full_data
This leads us to the following table:
id | day | vl_ds1 | vl_ds2 | prev_ds2 | next_ds2 |
---|---|---|---|---|---|
A | 1 | value_1 | null | null | value_2a |
A | 2 | value_2 | value_2a | null | null |
A | 3 | value_3 | null | value_2a | value_4a |
A | 4 | value_4 | value_4a | null | null |
A | 5 | value_5 | null | value_4a | null |
A | 6 | value_6 | null | null | null |
B | 2 | value_2 | value_2b | null | null |
B | 3 | value_3 | null | value_2b | null |
C | 4 | value_4 | value_4c | null | null |
C | 5 | value_1 | null | value_4c | null |
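For readers less used to window functions: `lag` and `lead` simply shift the ordered values of a partition by one position. Here is a plain-Python sketch of what the query above computes for id `A`; the helper names are mine:

```python
def lag(values, offset=1):
    """Previous value within the partition; None at the start, like SQL LAG."""
    return [None] * offset + values[:-offset]

def lead(values, offset=1):
    """Next value within the partition; None at the end, like SQL LEAD."""
    return values[offset:] + [None] * offset

# vl_ds2 for id A ordered by day, taken from the table above
vl_ds2 = [None, "value_2a", None, "value_4a", None, None]
prev_ds2 = lag(vl_ds2)
next_ds2 = lead(vl_ds2)
```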
Now, we start writing rules to define the churn, that is, whether the combination of `vl_ds2`, `prev_ds2`, and `next_ds2` indicates that a sequence was broken. By exploring the table above, we notice the following:

- When `vl_ds2` and `prev_ds2` are null, there is no churn.
- When `vl_ds2` is not null, but `prev_ds2` and `next_ds2` are both null, there is no churn.
- When `vl_ds2` is null, but `prev_ds2` or `next_ds2` is not null, there is churn.
Tip: write the `CASE WHEN` branches from the most specific boolean expression to the least specific one. In our example, the second rule presented above is the most specific, since it combines three conditions; therefore, it should be the first branch in the `CASE WHEN` statement.
These observations lead us to the following code:
%sql
with full_data as (
select
a.id
,a.day
,a.vl_ds1
,b.vl_ds2
,lag(b.vl_ds2) over(partition by a.id order by a.day asc ) as prev_ds2
,lead(b.vl_ds2) over(partition by a.id order by a.day asc ) as next_ds2
from dataset1 a
left join dataset2 b on a.id = b.id and a.day = b.day
),
tb_churn as (
-- 1 when there is churn, 0 otherwise
select id
,day
,vl_ds1
,vl_ds2
,prev_ds2
,next_ds2
,case when vl_ds2 is not null
and prev_ds2 is null
and next_ds2 is null
then 0
when vl_ds2 is null
and prev_ds2 is null
then 0
when vl_ds2 is null
and (
prev_ds2 is not null
or next_ds2 is not null
)
then 1 end as churn
from full_data
)
select * from tb_churn
id | day | vl_ds1 | vl_ds2 | prev_ds2 | next_ds2 | churn |
---|---|---|---|---|---|---|
A | 1 | value_1 | null | null | value_2a | 0 |
A | 2 | value_2 | value_2a | null | null | 0 |
A | 3 | value_3 | null | value_2a | value_4a | 1 |
A | 4 | value_4 | value_4a | null | null | 0 |
A | 5 | value_5 | null | value_4a | null | 1 |
A | 6 | value_6 | null | null | null | 0 |
B | 2 | value_2 | value_2b | null | null | 0 |
B | 3 | value_3 | null | value_2b | null | 1 |
C | 4 | value_4 | value_4c | null | null | 0 |
C | 5 | value_1 | null | value_4c | null | 1 |
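The `CASE WHEN` above can be mirrored as a small Python function, which makes the rule ordering easy to check. This is a sketch; the `churn` function is mine, not part of the SQL:

```python
def churn(vl_ds2, prev_ds2, next_ds2):
    """Mirror of the CASE WHEN above: 1 when the sequence breaks, 0 otherwise."""
    if vl_ds2 is not None and prev_ds2 is None and next_ds2 is None:
        return 0
    if vl_ds2 is None and prev_ds2 is None:
        return 0
    if vl_ds2 is None and (prev_ds2 is not None or next_ds2 is not None):
        return 1
    return None  # no branch matched, like a CASE without ELSE

# rows for id A from the table above: (vl_ds2, prev_ds2, next_ds2)
rows_a = [
    (None, None, "value_2a"),
    ("value_2a", None, None),
    (None, "value_2a", "value_4a"),
    ("value_4a", None, None),
    (None, "value_4a", None),
    (None, None, None),
]
flags = [churn(*r) for r in rows_a]
```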
The next step is creating the `island_id` column by computing a cumulative sum over the `churn` column:
%sql
with full_data as (
select
a.id
,a.day
,a.vl_ds1
,b.vl_ds2
,lag(b.vl_ds2) over(partition by a.id order by a.day asc ) as prev_ds2
,lead(b.vl_ds2) over(partition by a.id order by a.day asc ) as next_ds2
from dataset1 a
left join dataset2 b on a.id = b.id and a.day = b.day
),
tb_churn as (
-- 1 when there is churn, 0 otherwise
select id
,day
,vl_ds1
,vl_ds2
,prev_ds2
,next_ds2
,case when vl_ds2 is not null
and prev_ds2 is null
and next_ds2 is null
then 0
when vl_ds2 is null
and prev_ds2 is null
then 0
when vl_ds2 is null
and (
prev_ds2 is not null
or next_ds2 is not null
)
then 1 end as churn
from full_data
),
tb_island_id as (
select
sum(churn) over (
partition by id
order by day
rows between unbounded preceding and current row
) as island_id
,id
,day
,vl_ds1
,vl_ds2
,prev_ds2
,next_ds2
,churn
from tb_churn
)
select *
from tb_island_id
order by id
,day
island_id | id | day | vl_ds1 | vl_ds2 | prev_ds2 | next_ds2 | churn |
---|---|---|---|---|---|---|---|
0 | A | 1 | value_1 | null | null | value_2a | 0 |
0 | A | 2 | value_2 | value_2a | null | null | 0 |
1 | A | 3 | value_3 | null | value_2a | value_4a | 1 |
1 | A | 4 | value_4 | value_4a | null | null | 0 |
2 | A | 5 | value_5 | null | value_4a | null | 1 |
2 | A | 6 | value_6 | null | null | null | 0 |
0 | B | 2 | value_2 | value_2b | null | null | 0 |
1 | B | 3 | value_3 | null | value_2b | null | 1 |
0 | C | 4 | value_4 | value_4c | null | null | 0 |
1 | C | 5 | value_1 | null | value_4c | null | 1 |
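The running sum is what turns churn flags into island identifiers: each 1 starts a new island. In plain Python, `itertools.accumulate` does the same per partition:

```python
from itertools import accumulate

# churn flags for id A, ordered by day, from the table above
churn_a = [0, 0, 1, 0, 1, 0]
# cumulative sum: every churn flag bumps the island identifier
island_id = list(accumulate(churn_a))
```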
Now, we can define the `CASE WHEN` conditions to fill the gaps:
%sql
with full_data as (
select
a.id
,a.day
,a.vl_ds1
,b.vl_ds2
,lag(b.vl_ds2) over(partition by a.id order by a.day asc ) as prev_ds2
,lead(b.vl_ds2) over(partition by a.id order by a.day asc ) as next_ds2
from dataset1 a
left join dataset2 b on a.id = b.id and a.day = b.day
),
tb_churn as (
-- 1 when there is churn, 0 otherwise
select id
,day
,vl_ds1
,vl_ds2
,prev_ds2
,next_ds2
,case when vl_ds2 is not null
and prev_ds2 is null
and next_ds2 is null
then 0
when vl_ds2 is null
and prev_ds2 is null
then 0
when vl_ds2 is null
and (
prev_ds2 is not null
or next_ds2 is not null
)
then 1 end as churn
from full_data
),
tb_island_id as (
select
sum(churn) over(
partition by id
order by day
rows between unbounded preceding and current row
) as island_id
,id
,day
,vl_ds1
,vl_ds2
,prev_ds2
,next_ds2
,churn
from tb_churn
)
select island_id
,id
,day
,vl_ds1
,vl_ds2
,case when vl_ds2 is null
and prev_ds2 is not null
then prev_ds2
when vl_ds2 is null
and prev_ds2 is null
and next_ds2 is null
then
first_value(prev_ds2, true) ignore nulls over(
partition by id, island_id
order by day
rows between unbounded preceding and current row
)
when prev_ds2 is null
then
first_value(vl_ds2, true) ignore nulls over(
partition by id, island_id
order by day
rows between current row and unbounded following
)
when vl_ds2 is null
then
first_value(vl_ds2, true) ignore nulls over(
partition by id, island_id
order by day
rows between unbounded preceding and current row
)
end as vl_ds2_fix
,prev_ds2
,next_ds2
,churn
from tb_island_id
order by id
,day
island_id | id | day | vl_ds1 | vl_ds2 | vl_ds2_fix | prev_ds2 | next_ds2 | churn |
---|---|---|---|---|---|---|---|---|
0 | A | 1 | value_1 | null | value_2a | null | value_2a | 0 |
0 | A | 2 | value_2 | value_2a | value_2a | null | null | 0 |
1 | A | 3 | value_3 | null | value_2a | value_2a | value_4a | 1 |
1 | A | 4 | value_4 | value_4a | value_4a | null | null | 0 |
2 | A | 5 | value_5 | null | value_4a | value_4a | null | 1 |
2 | A | 6 | value_6 | null | value_4a | null | null | 0 |
0 | B | 2 | value_2 | value_2b | value_2b | null | null | 0 |
1 | B | 3 | value_3 | null | value_2b | value_2b | null | 1 |
0 | C | 4 | value_4 | value_4c | value_4c | null | null | 0 |
1 | C | 5 | value_1 | null | value_4c | value_4c | null | 1 |
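To make the branch logic easier to follow, here is a plain-Python mirror of the gap-filling `CASE WHEN`, applied to one (id, island_id) group at a time. This is a sketch; `fill_island` and its tuple layout are assumptions of mine, not part of the SQL solution:

```python
def fill_island(rows):
    """Mirror of the gap-filling CASE WHEN for one (id, island_id) group.
    Each row is a (vl_ds2, prev_ds2, next_ds2) tuple, ordered by day."""
    def first_nonnull(seq):
        return next((v for v in seq if v is not None), None)

    fixed = []
    for i, (vl, prev, nxt) in enumerate(rows):
        if vl is None and prev is not None:
            fixed.append(prev)
        elif vl is None and prev is None and nxt is None:
            # look back for the first non-null prev_ds2 in the island
            fixed.append(first_nonnull(r[1] for r in rows[: i + 1]))
        elif prev is None:
            # look forward for the first non-null vl_ds2 in the island
            fixed.append(first_nonnull(r[0] for r in rows[i:]))
        elif vl is None:
            fixed.append(first_nonnull(r[0] for r in rows[: i + 1]))
        else:
            fixed.append(None)
    return fixed
```

Running it over the islands of id `A` from the table above reproduces the `vl_ds2_fix` column.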
In order to keep only the values that we actually need, we write:
%sql
with full_data as (
select
a.id
,a.day
,a.vl_ds1
,b.vl_ds2
,lag(b.vl_ds2) over(partition by a.id order by a.day asc ) as prev_ds2
,lead(b.vl_ds2) over(partition by a.id order by a.day asc ) as next_ds2
from dataset1 a
left join dataset2 b on a.id = b.id and a.day = b.day
),
tb_churn as (
-- 1 when there is churn, 0 otherwise
select id
,day
,vl_ds1
,vl_ds2
,prev_ds2
,next_ds2
,case when vl_ds2 is not null
and prev_ds2 is null
and next_ds2 is null
then 0
when vl_ds2 is null
and prev_ds2 is null
then 0
when vl_ds2 is null
and (
prev_ds2 is not null
or next_ds2 is not null
)
then 1 end as churn
from full_data
),
tb_island_id as (
select
sum(churn) over(
partition by id
order by day
rows between unbounded preceding and current row
) as island_id
,id
,day
,vl_ds1
,vl_ds2
,prev_ds2
,next_ds2
,churn
from tb_churn
)
select id
,day
,vl_ds1
,case when vl_ds2 is null
and prev_ds2 is not null
then prev_ds2
when vl_ds2 is null
and prev_ds2 is null
and next_ds2 is null
then
first_value(prev_ds2, true) ignore nulls over(
partition by id, island_id
order by day
rows between unbounded preceding and current row
)
when prev_ds2 is null
then
first_value(vl_ds2, true) ignore nulls over(
partition by id, island_id
order by day
rows between current row and unbounded following
)
when vl_ds2 is null
then
first_value(vl_ds2, true) ignore nulls over(
partition by id, island_id
order by day
rows between unbounded preceding and current row
)
end as vl_ds2_fix
,vl_ds2
from tb_island_id
order by id
,day
id | day | vl_ds1 | vl_ds2_fix | vl_ds2 |
---|---|---|---|---|
A | 1 | value_1 | value_2a | null |
A | 2 | value_2 | value_2a | value_2a |
A | 3 | value_3 | value_2a | null |
A | 4 | value_4 | value_4a | value_4a |
A | 5 | value_5 | value_4a | null |
A | 6 | value_6 | value_4a | null |
B | 2 | value_2 | value_2b | value_2b |
B | 3 | value_3 | value_2b | null |
C | 4 | value_4 | value_4c | value_4c |
C | 5 | value_1 | value_4c | null |
Conclusion
The gaps and islands approach is well suited to writing well-structured code for a common data analysis problem involving sequences and their interruptions. One of its advantages is that the rules and patterns must be made explicit when creating the groups of data to be analyzed. Consequently, it is also easy for everyone to spot the rules in the code and update them when needed.