Groupby And Resample Timeseries So Date Ranges Are Consistent
I have a dataframe which is basically several timeseries stacked on top of one another. Each time series has a unique label (group) and they have different date ranges. date = pd.t
Solution 1:
Another way:
import pandas as pd
from itertools import product
date = pd.to_datetime(pd.Series(['2010-01-01', '2010-01-02', '2010-01-03',
'2010-01-06', '2010-01-01', '2010-01-03']))
group = [1,1,1,1, 2, 2]
value = [1,2,3,4,5,6]
df = pd.DataFrame({'date':date, 'group':group, 'value':value})
dates = pd.date_range(df.date.min(), df.date.max())
groups = df.group.unique()
df = (pd.DataFrame(list(product(dates, groups)), columns=['date', 'group'])
.merge(df, on=['date', 'group'], how='left')
.sort_values(['group', 'date'])
.reset_index(drop=True))
df
# date group value
#0 2010-01-01 1 1.0
#1 2010-01-02 1 2.0
#2 2010-01-03 1 3.0
#3 2010-01-04 1 NaN
#4 2010-01-05 1 NaN
#5 2010-01-06 1 4.0
#6 2010-01-01 2 5.0
#7 2010-01-02 2 NaN
#8 2010-01-03 2 6.0
#9 2010-01-04 2 NaN
#10 2010-01-05 2 NaN
#11 2010-01-06 2 NaN
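If it helps to see which rows the left merge filled in, here is a minimal sketch using the `indicator=True` option of `merge`. The names `grid`, `full` and the `was_missing` column are just illustrative and not part of the answer above.

```python
import pandas as pd
from itertools import product

df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                            '2010-01-06', '2010-01-01', '2010-01-03']),
    'group': [1, 1, 1, 1, 2, 2],
    'value': [1, 2, 3, 4, 5, 6],
})

# Every (date, group) pair over the overall date range
dates = pd.date_range(df['date'].min(), df['date'].max())
grid = pd.DataFrame(list(product(dates, df['group'].unique())),
                    columns=['date', 'group'])

# indicator=True adds a '_merge' column saying where each row came from
full = grid.merge(df, on=['date', 'group'], how='left', indicator=True)

# 'left_only' rows exist only in the grid, i.e. they were absent from df
full['was_missing'] = full['_merge'] == 'left_only'
full = (full.drop(columns='_merge')
            .sort_values(['group', 'date'])
            .reset_index(drop=True))
```

Rows with `was_missing == True` are the ones that got a NaN in `value`, which is handy if you later fill them and still want to know which values were original.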
Solution 2:
Credit to zipa for getting the dates correct. I've edited my post to correct my mistake.
Set the index, then use pandas.MultiIndex.from_product to produce the Cartesian product of values. I also use fill_value=0 to fill in those missing values.
d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
[pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
names=d.index.names
)
d.reindex(midx, fill_value=0).reset_index()
date group value
0 2010-01-01 1 1
1 2010-01-01 2 5
2 2010-01-02 1 2
3 2010-01-02 2 0
4 2010-01-03 1 3
5 2010-01-03 2 6
6 2010-01-04 1 0
7 2010-01-04 2 0
8 2010-01-05 1 0
9 2010-01-05 2 0
10 2010-01-06 1 4
11 2010-01-06 2 0
Or, leaving out fill_value to keep NaN for the missing rows:
d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
[pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
names=d.index.names
)
d.reindex(midx).reset_index()
date group value
0 2010-01-01 1 1.0
1 2010-01-01 2 5.0
2 2010-01-02 1 2.0
3 2010-01-02 2 NaN
4 2010-01-03 1 3.0
5 2010-01-03 2 6.0
6 2010-01-04 1 NaN
7 2010-01-04 2 NaN
8 2010-01-05 1 NaN
9 2010-01-05 2 NaN
10 2010-01-06 1 4.0
11 2010-01-06 2 NaN
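The two reindex variants differ only in `fill_value`, so they could be wrapped in one small function. This is only a sketch: `reindex_to_full_grid` is a made-up name, and it assumes the columns are called `date` and `group` and that each (date, group) pair appears at most once, since `reindex` raises on a duplicated index.

```python
import pandas as pd

def reindex_to_full_grid(df, fill_value=None):
    """Return df with one row per (date, group) over the overall daily range.

    Hypothetical helper: assumes 'date' and 'group' columns and
    unique (date, group) pairs.
    """
    indexed = df.set_index(['date', 'group'])
    full_index = pd.MultiIndex.from_product(
        [pd.date_range(df['date'].min(), df['date'].max(), freq='D'),
         df['group'].unique()],
        names=indexed.index.names,
    )
    if fill_value is None:
        # Missing combinations become NaN
        return indexed.reindex(full_index).reset_index()
    # Missing combinations get the given fill value
    return indexed.reindex(full_index, fill_value=fill_value).reset_index()

df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                            '2010-01-06', '2010-01-01', '2010-01-03']),
    'group': [1, 1, 1, 1, 2, 2],
    'value': [1, 2, 3, 4, 5, 6],
})

with_nan = reindex_to_full_grid(df)
with_zero = reindex_to_full_grid(df, fill_value=0)
```

Both results have 12 rows (6 days times 2 groups); `with_nan` keeps NaN for the missing combinations and `with_zero` puts 0 there.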
Another dance we could do is a cleaned-up version of OP's attempt. Again I use fill_value=0 to fill in missing values. We could leave that out to produce NaN instead.
df.set_index(['date', 'group']) \
.unstack(fill_value=0) \
.asfreq('D', fill_value=0) \
.stack().reset_index()
date group value
0 2010-01-01 1 1
1 2010-01-01 2 5
2 2010-01-02 1 2
3 2010-01-02 2 0
4 2010-01-03 1 3
5 2010-01-03 2 6
6 2010-01-04 1 0
7 2010-01-04 2 0
8 2010-01-05 1 0
9 2010-01-05 2 0
10 2010-01-06 1 4
11 2010-01-06 2 0
Or, without fill_value, so the missing rows stay NaN:
df.set_index(['date', 'group']) \
.unstack() \
.asfreq('D') \
.stack(dropna=False).reset_index()
date group value
0 2010-01-01 1 1.0
1 2010-01-01 2 5.0
2 2010-01-02 1 2.0
3 2010-01-02 2 NaN
4 2010-01-03 1 3.0
5 2010-01-03 2 6.0
6 2010-01-04 1 NaN
7 2010-01-04 2 NaN
8 2010-01-05 1 NaN
9 2010-01-05 2 NaN
10 2010-01-06 1 4.0
11 2010-01-06 2 NaN
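As far as I know, newer pandas versions (2.1 and later) tie the `dropna` argument of `stack` to the legacy implementation, so the NaN variant above may warn or fail there. A possible workaround, sketched here with `pivot` and `melt` rather than `unstack`/`stack`, is below. It is a substitute technique, not what the answer above used, and the names `wide` and `long_df` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                            '2010-01-06', '2010-01-01', '2010-01-03']),
    'group': [1, 1, 1, 1, 2, 2],
    'value': [1, 2, 3, 4, 5, 6],
})

# One column per group, one row per date; asfreq('D') inserts the missing days
wide = df.pivot(index='date', columns='group', values='value').asfreq('D')

# melt keeps the NaN cells, so no dropna handling is needed
long_df = (wide.reset_index()
               .melt(id_vars='date', var_name='group', value_name='value')
               .astype({'group': 'int64'})
               .sort_values(['date', 'group'])
               .reset_index(drop=True))
```

`long_df` should have the same 12 rows as the NaN output above, ordered by date and then group.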