Groupby And Resample Timeseries So Date Ranges Are Consistent

I have a dataframe which is basically several timeseries stacked on top of one another. Each time series has a unique label (group) and they have different date ranges. date = pd.t

Solution 1:

Another way:

import pandas as pd
from itertools import product

date = pd.to_datetime(pd.Series(['2010-01-01', '2010-01-02', '2010-01-03', 
                                  '2010-01-06', '2010-01-01', '2010-01-03']))
group = [1,1,1,1, 2, 2]
value = [1,2,3,4,5,6]
df = pd.DataFrame({'date':date, 'group':group, 'value':value})


dates = pd.date_range(df.date.min(), df.date.max())
groups = df.group.unique()
df = (pd.DataFrame(list(product(dates, groups)), columns=['date', 'group'])
            .merge(df, on=['date', 'group'], how='left')
            .sort_values(['group', 'date'])
            .reset_index(drop=True))

df
#          date  group  value
# 0  2010-01-01      1    1.0
# 1  2010-01-02      1    2.0
# 2  2010-01-03      1    3.0
# 3  2010-01-04      1    NaN
# 4  2010-01-05      1    NaN
# 5  2010-01-06      1    4.0
# 6  2010-01-01      2    5.0
# 7  2010-01-02      2    NaN
# 8  2010-01-03      2    6.0
# 9  2010-01-04      2    NaN
# 10 2010-01-05      2    NaN
# 11 2010-01-06      2    NaN
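As a quick sanity check on the merge approach, here is a minimal sketch that rebuilds the example frame and confirms each group ends up with the same number of rows over the shared date range. The names `full`, `rows_per_group` and `missing` are only illustrative and are not part of the original answer.

```python
import pandas as pd
from itertools import product

# Rebuild the example data from the answer above.
date = pd.to_datetime(pd.Series(['2010-01-01', '2010-01-02', '2010-01-03',
                                  '2010-01-06', '2010-01-01', '2010-01-03']))
df = pd.DataFrame({'date': date,
                   'group': [1, 1, 1, 1, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6]})

# Every (date, group) pair over the overall date range.
dates = pd.date_range(df.date.min(), df.date.max())
groups = df.group.unique()
full = (pd.DataFrame(list(product(dates, groups)), columns=['date', 'group'])
          .merge(df, on=['date', 'group'], how='left')
          .sort_values(['group', 'date'])
          .reset_index(drop=True))

# Each group should now have one row per calendar day in the range.
rows_per_group = full.groupby('group').size()
# Rows that had no observation in the original data stay as NaN.
missing = int(full['value'].isna().sum())
```

If the groups were consistent, `rows_per_group` should hold the same count for every group, equal to `len(dates)`.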

Solution 2:

Credit to zipa for getting the dates correct. I've edited my post to correct my mistake.


Set the index then use pandas.MultiIndex.from_product to produce the Cartesian product of values. I also use fill_value=0 to fill in those missing values.

d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names)
d.reindex(midx, fill_value=0).reset_index()

         date  group  value
0  2010-01-01      1      1
1  2010-01-01      2      5
2  2010-01-02      1      2
3  2010-01-02      2      0
4  2010-01-03      1      3
5  2010-01-03      2      6
6  2010-01-04      1      0
7  2010-01-04      2      0
8  2010-01-05      1      0
9  2010-01-05      2      0
10 2010-01-06      1      4
11 2010-01-06      2      0

Or

d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names)
d.reindex(midx).reset_index()

         date  group  value
0  2010-01-01      1    1.0
1  2010-01-01      2    5.0
2  2010-01-02      1    2.0
3  2010-01-02      2    NaN
4  2010-01-03      1    3.0
5  2010-01-03      2    6.0
6  2010-01-04      1    NaN
7  2010-01-04      2    NaN
8  2010-01-05      1    NaN
9  2010-01-05      2    NaN
10 2010-01-06      1    4.0
11 2010-01-06      2    NaN
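If you need this in more than one place, the two reindex variants can be folded into one small helper. This is only a sketch: the function name `align_date_ranges` and its parameters are made up for illustration, and it assumes each (date, group) pair appears at most once, since `reindex` needs a unique index.

```python
import pandas as pd

def align_date_ranges(df, date_col='date', group_col='group',
                      fill_value=float('nan')):
    """Hypothetical helper: give every group the same daily date range."""
    indexed = df.set_index([date_col, group_col])
    # Cartesian product of all days in the overall range and all groups.
    full_index = pd.MultiIndex.from_product(
        [pd.date_range(df[date_col].min(), df[date_col].max()),
         df[group_col].unique()],
        names=indexed.index.names)
    # Pairs absent from the original data get fill_value (NaN by default).
    return indexed.reindex(full_index, fill_value=fill_value).reset_index()

# Example data from the question.
df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                            '2010-01-06', '2010-01-01', '2010-01-03']),
    'group': [1, 1, 1, 1, 2, 2],
    'value': [1, 2, 3, 4, 5, 6]})

with_zeros = align_date_ranges(df, fill_value=0)
with_nans = align_date_ranges(df)
```

Passing `fill_value=0` mirrors the first variant above, and leaving it out mirrors the second one that keeps NaN.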

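Another dance we could do is a cleaned-up version of OP's attempt: unstack the groups into columns, use asfreq('D') to fill in the missing days, then stack back to long form. Again I use fill_value=0 to fill in missing values. We could leave that out to get NaN instead.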

df.set_index(['date', 'group']) \
    .unstack(fill_value=0) \
    .asfreq('D', fill_value=0) \
    .stack().reset_index()

         date  group  value
0  2010-01-01      1      1
1  2010-01-01      2      5
2  2010-01-02      1      2
3  2010-01-02      2      0
4  2010-01-03      1      3
5  2010-01-03      2      6
6  2010-01-04      1      0
7  2010-01-04      2      0
8  2010-01-05      1      0
9  2010-01-05      2      0
10 2010-01-06      1      4
11 2010-01-06      2      0

Or

df.set_index(['date', 'group']) \
    .unstack() \
    .asfreq('D') \
    .stack(dropna=False).reset_index()

         date  group  value
0  2010-01-01      1    1.0
1  2010-01-01      2    5.0
2  2010-01-02      1    2.0
3  2010-01-02      2    NaN
4  2010-01-03      1    3.0
5  2010-01-03      2    6.0
6  2010-01-04      1    NaN
7  2010-01-04      2    NaN
8  2010-01-05      1    NaN
9  2010-01-05      2    NaN
10 2010-01-06      1    4.0
11 2010-01-06      2    NaN
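To see what the unstack/asfreq step is doing, it can help to look at the intermediate wide table. The sketch below stops at that wide frame and then goes back to long form with `melt` rather than `stack`. That is a swapped-in alternative, not what the answer above uses, and it avoids relying on the `dropna` argument of `stack`. The names `wide` and `long_form` are only illustrative.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                            '2010-01-06', '2010-01-01', '2010-01-03']),
    'group': [1, 1, 1, 1, 2, 2],
    'value': [1, 2, 3, 4, 5, 6]})

# One column per group, one row per observed date;
# asfreq('D') then inserts the calendar days that were missing.
wide = df.set_index(['date', 'group'])['value'].unstack().asfreq('D')

# Back to long form. melt keeps the NaN rows, so no dropna handling is needed.
long_form = (wide.reset_index()
                 .melt(id_vars='date', var_name='group', value_name='value'))
```

Here `wide` has one row per day and one column per group, which is why every group ends up covering the same dates once it is reshaped back.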