Gaps and islands is a classical data analysis problem where we aim to identify gaps in a sequence or contiguous ranges of values (islands). It is relevant to applications where we need to detect the interruption of a sequence. To illustrate the problem, let's suppose we have cards as illustrated in the image below:
Each card contains an identification code at the top, a number in the middle representing anything with an inherent sense of order, such as a timestamp or a sequential code, and a letter at the bottom indicating an event.
For some reason, we need to create groups following a rule that someone has defined. Here, cards that appear before an event `a` and the card with the event `a` itself belong to the same group (island). The image below illustrates how cards are grouped into islands based on this rule:
We use those islands to create new cards containing only the first and the last event of each one. The image below illustrates this outcome:
We could even fill the gaps in the sequence within an island by creating a new event `x` in it, as occurs in island 3. This outcome is illustrated below, with the card `nx8` being the one created to fill the gap between the cards `n23` and `n8`:
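Before turning to SQL, the card-grouping rule can be sketched in a few lines of plain Python. This is only an illustration of the island rule described above; the card tuples and the `split_islands` helper are made up for this example:

```python
# Minimal sketch of the island rule: every card up to and including the
# next event "a" belongs to the same island.
# Each card is a hypothetical (code, order, event) triple.
cards = [
    ("n1", 1, "b"), ("n2", 2, "c"), ("n3", 3, "a"),
    ("n4", 4, "b"), ("n5", 5, "a"),
]

def split_islands(cards):
    islands, current = [], []
    for card in cards:
        current.append(card)
        if card[2] == "a":      # an "a" event closes the current island
            islands.append(current)
            current = []
    if current:                  # trailing cards with no closing "a"
        islands.append(current)
    return islands

islands = split_islands(cards)
```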
The same approach can be applied to rows in tables. In this post, I'm going to use two mock datasets to illustrate a real scenario where the gaps and islands problem was applied. I'll be using Databricks Community Edition to run this example. To build the mock datasets, simply copy, paste, and run the following code in the first cell of your Python notebook:
column_names_1 = ['id', 'day', 'vl_ds1']
dataset_1 = [
('A', 1, 'value_1'),
('A', 2, 'value_2'),
('A', 3, 'value_3'),
('A', 4, 'value_4'),
('A', 5, 'value_5'),
('A', 6, 'value_6'),
('B', 2, 'value_2'),
('B', 3, 'value_3'),
('C', 4, 'value_4'),
('C', 5, 'value_1'),
]
df1 = spark.createDataFrame(dataset_1, column_names_1)
df1.createOrReplaceTempView("dataset1")
column_names_2 = ['id', 'day', 'vl_ds2']
dataset_2 = [
('A', 2, 'value_2a'),
('A', 4, 'value_4a'),
('B', 2, 'value_2b'),
('C', 4, 'value_4c'),
]
df2 = spark.createDataFrame(dataset_2, column_names_2)
df2.createOrReplaceTempView("dataset2")
Framing the problem
At work this week, a friend of mine asked for help with a similar situation. Let's suppose we have two datasets named `dataset1` and `dataset2`, each one holding values for the events `vl_ds1` and `vl_ds2`, respectively. These events are related in spite of being recorded in different tables, and an event in `dataset2` can only happen if there is an event in `dataset1` on the same day.
The `dataset1` table is represented below:
id | day | vl_ds1 |
---|---|---|
A | 1 | value_1 |
A | 2 | value_2 |
A | 3 | value_3 |
A | 4 | value_4 |
A | 5 | value_5 |
A | 6 | value_6 |
B | 2 | value_2 |
B | 3 | value_3 |
C | 4 | value_4 |
C | 5 | value_1 |
And `dataset2` is:
id | day | vl_ds2 |
---|---|---|
A | 2 | value_2a |
A | 4 | value_4a |
B | 2 | value_2b |
C | 4 | value_4c |
When joining both tables with `dataset1` on the left, we have the following result:
id | day | vl_ds1 | vl_ds2 |
---|---|---|---|
A | 1 | value_1 | null |
A | 2 | value_2 | value_2a |
A | 3 | value_3 | null |
A | 4 | value_4 | value_4a |
A | 5 | value_5 | null |
A | 6 | value_6 | null |
B | 2 | value_2 | value_2b |
B | 3 | value_3 | null |
C | 4 | value_4 | value_4c |
C | 5 | value_1 | null |
But we need to create a `vl_ds2_fix` column where each non-null value of `vl_ds2` is propagated to the following rows of the same `id` until another non-null value appears, and the first non-null value is also propagated backwards to the preceding rows. The table below represents the desired outcome:
id | day | vl_ds1 | vl_ds2_fix | vl_ds2 |
---|---|---|---|---|
A | 1 | value_1 | value_2a | null |
A | 2 | value_2 | value_2a | value_2a |
A | 3 | value_3 | value_2a | null |
A | 4 | value_4 | value_4a | value_4a |
A | 5 | value_5 | value_4a | null |
A | 6 | value_6 | value_4a | null |
B | 2 | value_2 | value_2b | value_2b |
B | 3 | value_3 | value_2b | null |
C | 4 | value_4 | value_4c | value_4c |
C | 5 | value_1 | value_4c | null |
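Conceptually, the desired `vl_ds2_fix` is a forward fill of `vl_ds2` within each `id`, plus a back fill of the leading gap. Before doing it in SQL, here is a minimal plain-Python sketch; the `fill_gaps` helper is hypothetical and not part of the solution we build next:

```python
def fill_gaps(values):
    """Forward-fill non-null values, then back-fill the leading gap."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v               # carry the last seen value forward
        out.append(last)
    # back-fill the leading Nones with the first non-null value
    first = next((v for v in out if v is not None), None)
    return [first if v is None else v for v in out]

# vl_ds2 for id A ordered by day, as in the joined table above
filled = fill_gaps([None, "value_2a", None, "value_4a", None, None])
```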
With that in mind, let's start solving the problem.
Solving the problem
First, we need to create the joined table with new columns containing the previous and next values of `vl_ds2`. These columns will help us identify a pattern:
%sql
with full_data as (
select
a.id
, a.day
, a.vl_ds1
, b.vl_ds2
, lag(b.vl_ds2) over(partition by a.id order by a.day asc ) as prev_ds2
, lead(b.vl_ds2) over(partition by a.id order by a.day asc ) as next_ds2
from dataset1 a
left join dataset2 b on a.id = b.id and a.day = b.day
)
select * from full_data
This leads us to the following table:
id | day | vl_ds1 | vl_ds2 | prev_ds2 | next_ds2 |
---|---|---|---|---|---|
A | 1 | value_1 | null | null | value_2a |
A | 2 | value_2 | value_2a | null | null |
A | 3 | value_3 | null | value_2a | value_4a |
A | 4 | value_4 | value_4a | null | null |
A | 5 | value_5 | null | value_4a | null |
A | 6 | value_6 | null | null | null |
B | 2 | value_2 | value_2b | null | null |
B | 3 | value_3 | null | value_2b | null |
C | 4 | value_4 | value_4c | null | null |
C | 5 | value_1 | null | value_4c | null |
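For readers less used to window functions: `lag` and `lead` simply shift the ordered values of a partition by one position. Here is a plain-Python sketch of what the query above computes for id `A`; the helper names are mine:

```python
def lag(values, offset=1):
    """Previous value within the partition; None at the start, like SQL LAG."""
    return [None] * offset + values[:-offset]

def lead(values, offset=1):
    """Next value within the partition; None at the end, like SQL LEAD."""
    return values[offset:] + [None] * offset

# vl_ds2 for id A ordered by day, taken from the table above
vl_ds2 = [None, "value_2a", None, "value_4a", None, None]
prev_ds2 = lag(vl_ds2)
next_ds2 = lead(vl_ds2)
```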
Now, we start writing rules to define the churn, that is, whether the combination of `vl_ds2`, `prev_ds2`, and `next_ds2` indicates that a sequence was broken. By exploring the table above, we notice the following:

- When `vl_ds2` and `prev_ds2` are null, there is no churn.
- When `vl_ds2` is not null, but `prev_ds2` and `next_ds2` are both null, there is no churn.
- When `vl_ds2` is null, but `prev_ds2` or `next_ds2` is not null, there is churn.
Tip: write the `CASE WHEN` branches from the most specific boolean expression to the least specific one. In our example, the second rule presented above is the most specific, since it combines three conditions; therefore, it should be the first branch in the `CASE WHEN` statement.
These observations lead us to the following code:
%sql
with full_data as (
select
a.id
,a.day
,a.vl_ds1
,b.vl_ds2
,lag(b.vl_ds2) over(partition by a.id order by a.day asc ) as prev_ds2
,lead(b.vl_ds2) over(partition by a.id order by a.day asc ) as next_ds2
from dataset1 a
left join dataset2 b on a.id = b.id and a.day = b.day
),
tb_churn as (
-- 1 when there is churn, 0 otherwise
select id
,day
,vl_ds1
,vl_ds2
,prev_ds2
,next_ds2
,case when vl_ds2 is not null
and prev_ds2 is null
and next_ds2 is null
then 0
when vl_ds2 is null
and prev_ds2 is null
then 0
when vl_ds2 is null
and (
prev_ds2 is not null
or next_ds2 is not null
)
then 1 end as churn
from full_data
)
select * from tb_churn
id | day | vl_ds1 | vl_ds2 | prev_ds2 | next_ds2 | churn |
---|---|---|---|---|---|---|
A | 1 | value_1 | null | null | value_2a | 0 |
A | 2 | value_2 | value_2a | null | null | 0 |
A | 3 | value_3 | null | value_2a | value_4a | 1 |
A | 4 | value_4 | value_4a | null | null | 0 |
A | 5 | value_5 | null | value_4a | null | 1 |
A | 6 | value_6 | null | null | null | 0 |
B | 2 | value_2 | value_2b | null | null | 0 |
B | 3 | value_3 | null | value_2b | null | 1 |
C | 4 | value_4 | value_4c | null | null | 0 |
C | 5 | value_1 | null | value_4c | null | 1 |
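The `CASE WHEN` above can be mirrored as a small Python function, which makes the rule ordering easy to check. This is a sketch; the `churn` function is mine, not part of the SQL:

```python
def churn(vl_ds2, prev_ds2, next_ds2):
    """Mirror of the CASE WHEN above: 1 when the sequence breaks, 0 otherwise."""
    if vl_ds2 is not None and prev_ds2 is None and next_ds2 is None:
        return 0
    if vl_ds2 is None and prev_ds2 is None:
        return 0
    if vl_ds2 is None and (prev_ds2 is not None or next_ds2 is not None):
        return 1
    return None  # no branch matched, like a CASE without ELSE

# rows for id A from the table above: (vl_ds2, prev_ds2, next_ds2)
rows_a = [
    (None, None, "value_2a"),
    ("value_2a", None, None),
    (None, "value_2a", "value_4a"),
    ("value_4a", None, None),
    (None, "value_4a", None),
    (None, None, None),
]
flags = [churn(*r) for r in rows_a]
```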
The next step is creating the `island_id` column by computing a cumulative sum over the `churn` column:
%sql
with full_data as (
select
a.id
,a.day
,a.vl_ds1
,b.vl_ds2
,lag(b.vl_ds2) over(partition by a.id order by a.day asc ) as prev_ds2
,lead(b.vl_ds2) over(partition by a.id order by a.day asc ) as next_ds2
from dataset1 a
left join dataset2 b on a.id = b.id and a.day = b.day
),
tb_churn as (
-- 1 when there is churn, 0 otherwise
select id
,day
,vl_ds1
,vl_ds2
,prev_ds2
,next_ds2
,case when vl_ds2 is not null
and prev_ds2 is null
and next_ds2 is null
then 0
when vl_ds2 is null
and prev_ds2 is null
then 0
when vl_ds2 is null
and (
prev_ds2 is not null
or next_ds2 is not null
)
then 1 end as churn
from full_data
),
tb_island_id as (
select
sum(churn) over (
partition by id
order by day
rows between unbounded preceding and current row
) as island_id
,id
,day
,vl_ds1
,vl_ds2
,prev_ds2
,next_ds2
,churn
from tb_churn
)
select *
from tb_island_id
order by id
,day
island_id | id | day | vl_ds1 | vl_ds2 | prev_ds2 | next_ds2 | churn |
---|---|---|---|---|---|---|---|
0 | A | 1 | value_1 | null | null | value_2a | 0 |
0 | A | 2 | value_2 | value_2a | null | null | 0 |
1 | A | 3 | value_3 | null | value_2a | value_4a | 1 |
1 | A | 4 | value_4 | value_4a | null | null | 0 |
2 | A | 5 | value_5 | null | value_4a | null | 1 |
2 | A | 6 | value_6 | null | null | null | 0 |
0 | B | 2 | value_2 | value_2b | null | null | 0 |
1 | B | 3 | value_3 | null | value_2b | null | 1 |
0 | C | 4 | value_4 | value_4c | null | null | 0 |
1 | C | 5 | value_1 | null | value_4c | null | 1 |
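The running sum is what turns churn flags into island identifiers: each 1 starts a new island. In plain Python, `itertools.accumulate` does the same per partition:

```python
from itertools import accumulate

# churn flags for id A, ordered by day, from the table above
churn_a = [0, 0, 1, 0, 1, 0]
# cumulative sum: every churn flag bumps the island identifier
island_id = list(accumulate(churn_a))
```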
Now, we can define the `CASE WHEN` conditions to fill the gaps:
%sql
with full_data as (
select
a.id
,a.day
,a.vl_ds1
,b.vl_ds2
,lag(b.vl_ds2) over(partition by a.id order by a.day asc ) as prev_ds2
,lead(b.vl_ds2) over(partition by a.id order by a.day asc ) as next_ds2
from dataset1 a
left join dataset2 b on a.id = b.id and a.day = b.day
),
tb_churn as (
-- 1 when there is churn, 0 otherwise
select id
,day
,vl_ds1
,vl_ds2
,prev_ds2
,next_ds2
,case when vl_ds2 is not null
and prev_ds2 is null
and next_ds2 is null
then 0
when vl_ds2 is null
and prev_ds2 is null
then 0
when vl_ds2 is null
and (
prev_ds2 is not null
or next_ds2 is not null
)
then 1 end as churn
from full_data
),
tb_island_id as (
select
sum(churn) over(
partition by id
order by day
rows between unbounded preceding and current row
) as island_id
,id
,day
,vl_ds1
,vl_ds2
,prev_ds2
,next_ds2
,churn
from tb_churn
)
select island_id
,id
,day
,vl_ds1
,vl_ds2
,case when vl_ds2 is null
and prev_ds2 is not null
then prev_ds2
when vl_ds2 is null
and prev_ds2 is null
and next_ds2 is null
then
first_value(prev_ds2, true) ignore nulls over(
partition by id, island_id
order by day
rows between unbounded preceding and current row
)
when prev_ds2 is null
then
first_value(vl_ds2, true) ignore nulls over(
partition by id, island_id
order by day
rows between current row and unbounded following
)
when vl_ds2 is null
then
first_value(vl_ds2, true) ignore nulls over(
partition by id, island_id
order by day
rows between unbounded preceding and current row
)
end as vl_ds2_fix
,prev_ds2
,next_ds2
,churn
from tb_island_id
order by id
,day
island_id | id | day | vl_ds1 | vl_ds2 | vl_ds2_fix | prev_ds2 | next_ds2 | churn |
---|---|---|---|---|---|---|---|---|
0 | A | 1 | value_1 | null | value_2a | null | value_2a | 0 |
0 | A | 2 | value_2 | value_2a | value_2a | null | null | 0 |
1 | A | 3 | value_3 | null | value_2a | value_2a | value_4a | 1 |
1 | A | 4 | value_4 | value_4a | value_4a | null | null | 0 |
2 | A | 5 | value_5 | null | value_4a | value_4a | null | 1 |
2 | A | 6 | value_6 | null | value_4a | null | null | 0 |
0 | B | 2 | value_2 | value_2b | value_2b | null | null | 0 |
1 | B | 3 | value_3 | null | value_2b | value_2b | null | 1 |
0 | C | 4 | value_4 | value_4c | value_4c | null | null | 0 |
1 | C | 5 | value_1 | null | value_4c | value_4c | null | 1 |
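To make the branch logic easier to follow, here is a plain-Python mirror of the gap-filling `CASE WHEN`, applied to one (id, island_id) group at a time. This is a sketch; `fill_island` and its tuple layout are assumptions of mine, not part of the SQL solution:

```python
def fill_island(rows):
    """Mirror of the gap-filling CASE WHEN for one (id, island_id) group.
    Each row is a (vl_ds2, prev_ds2, next_ds2) tuple, ordered by day."""
    def first_nonnull(seq):
        return next((v for v in seq if v is not None), None)

    fixed = []
    for i, (vl, prev, nxt) in enumerate(rows):
        if vl is None and prev is not None:
            fixed.append(prev)
        elif vl is None and prev is None and nxt is None:
            # look back for the first non-null prev_ds2 in the island
            fixed.append(first_nonnull(r[1] for r in rows[: i + 1]))
        elif prev is None:
            # look forward for the first non-null vl_ds2 in the island
            fixed.append(first_nonnull(r[0] for r in rows[i:]))
        elif vl is None:
            fixed.append(first_nonnull(r[0] for r in rows[: i + 1]))
        else:
            fixed.append(None)
    return fixed
```

Running it over the islands of id `A` from the table above reproduces the `vl_ds2_fix` column.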
In order to keep only the values that we actually need, we write:
%sql
with full_data as (
select
a.id
,a.day
,a.vl_ds1
,b.vl_ds2
,lag(b.vl_ds2) over(partition by a.id order by a.day asc ) as prev_ds2
,lead(b.vl_ds2) over(partition by a.id order by a.day asc ) as next_ds2
from dataset1 a
left join dataset2 b on a.id = b.id and a.day = b.day
),
tb_churn as (
-- 1 when there is churn, 0 otherwise
select id
,day
,vl_ds1
,vl_ds2
,prev_ds2
,next_ds2
,case when vl_ds2 is not null
and prev_ds2 is null
and next_ds2 is null
then 0
when vl_ds2 is null
and prev_ds2 is null
then 0
when vl_ds2 is null
and (
prev_ds2 is not null
or next_ds2 is not null
)
then 1 end as churn
from full_data
),
tb_island_id as (
select
sum(churn) over(
partition by id
order by day
rows between unbounded preceding and current row
) as island_id
,id
,day
,vl_ds1
,vl_ds2
,prev_ds2
,next_ds2
,churn
from tb_churn
)
select id
,day
,vl_ds1
,case when vl_ds2 is null
and prev_ds2 is not null
then prev_ds2
when vl_ds2 is null
and prev_ds2 is null
and next_ds2 is null
then
first_value(prev_ds2, true) ignore nulls over(
partition by id, island_id
order by day
rows between unbounded preceding and current row
)
when prev_ds2 is null
then
first_value(vl_ds2, true) ignore nulls over(
partition by id, island_id
order by day
rows between current row and unbounded following
)
when vl_ds2 is null
then
first_value(vl_ds2, true) ignore nulls over(
partition by id, island_id
order by day
rows between unbounded preceding and current row
)
end as vl_ds2_fix
,vl_ds2
from tb_island_id
order by id
,day
id | day | vl_ds1 | vl_ds2_fix | vl_ds2 |
---|---|---|---|---|
A | 1 | value_1 | value_2a | null |
A | 2 | value_2 | value_2a | value_2a |
A | 3 | value_3 | value_2a | null |
A | 4 | value_4 | value_4a | value_4a |
A | 5 | value_5 | value_4a | null |
A | 6 | value_6 | value_4a | null |
B | 2 | value_2 | value_2b | value_2b |
B | 3 | value_3 | value_2b | null |
C | 4 | value_4 | value_4c | value_4c |
C | 5 | value_1 | value_4c | null |
Conclusion
The gaps and islands approach is well suited to writing well-structured code for a common data analysis problem involving sequences and their interruptions. One of its advantages is that the rules and patterns must be made explicit when creating the groups of data to be analyzed. Consequently, it is also easy for everyone to spot the rules in the code and update them when needed.