
SQL To Pandas - Aggregation Over Partition Python

What is the best way to aggregate values in pandas the way a SQL window function does with OVER (PARTITION BY ...)? SQL: select a.*, b.vol1 / sum(vol1) over ( partition by a.sale, a.d_id, a.month, a.p_id )

Solution 1:

The first part is the join, similar to the left join in your SQL code. One thing I noticed is that the same four columns ('sale', 'd_id', 'month', 'p_id') are repeated in your code, in both the join and the window. In SQL dialects that support a WINDOW clause, you can define a named window once at the end of the query and reuse it. In Python, you can store the column list in a variable and reuse it, which keeps the code clean. I also set these columns as the index, since they are needed again later for the windowing step:

index = ['sale', 'd_id', 'month', 'p_id']
df1 = df1.set_index(index)
df2 = df2.set_index(index)
merged = df1.join(df2, how='left')

Next, group by the index levels and compute the sum of vol1 within each group. Because the group total has to be aligned to every row, rather than collapsed to one row per group, use transform:

grouped = merged.groupby(index)
partitioned_sum = grouped.vol1.transform('sum')
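
To make the difference between a plain aggregate and transform concrete, here is a minimal sketch on a made-up frame. The name toy and its values are assumptions for illustration only; just the column names vol1 and p_id come from the question.

```python
import pandas as pd

# Hypothetical toy frame: the values are invented, only the column names mirror the post.
toy = pd.DataFrame({
    'p_id': [9, 9, 7],
    'vol1': [10.0, 30.0, 5.0],
})

# A plain aggregate collapses each group to a single row.
per_group = toy.groupby('p_id')['vol1'].sum()

# transform broadcasts the group total back onto every original row,
# which is the behaviour of SUM(vol1) OVER (PARTITION BY p_id) in SQL.
per_row = toy.groupby('p_id')['vol1'].transform('sum')

# Each row's share of its partition total.
share = toy['vol1'] / per_row
print(share.tolist())
```

Since per_row has the same length and index as toy, it can be divided into the original column directly, with no merge back.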

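From here, we can create vol_r and vol_t via the assign method, and drop the vol1 column. vol_t is written as a lambda because it depends on vol_r, which is created in the same assign call: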

(merged.assign(vol_r = merged.vol1.div(partitioned_sum), 
               vol_t = lambda df: df.vol_r.mul(df.vol2))
       .drop(columns='vol1')
       .reset_index()
)

   sale  d_id  month  p_id    vol2     vol_r     vol_t
0     2   580      4     9  11.000  0.084653  0.931185
1     2   580      4     9  11.000  0.087070  0.957766
2     2   580      4     9  11.000  0.161611  1.777716
3     2   580      4     9  11.314  0.084653  0.957766
4     2   580      4     9  11.314  0.087070  0.985106
5     2   580      4     9  11.314  0.161611  1.828462
6     2   580      4     9  20.065  0.084653  1.698566
7     2   580      4     9  20.065  0.087070  1.747052
8     2   580      4     9  20.065  0.161611  3.242716
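
The repeated rows in that output come from the join: when the key combination is duplicated on both sides, every left row is paired with every matching right row, and the partition sum is then taken over the joined rows. Below is a rough end-to-end sketch of the same steps on invented inputs. The contents of df1 and df2 are assumptions, since the original frames are not shown; I have only guessed from the SQL that vol2 lives in the left frame and vol1 in the right one.

```python
import pandas as pd

index = ['sale', 'd_id', 'month', 'p_id']

# Hypothetical inputs: the real df1 and df2 are not shown, so these values are made up.
df1 = pd.DataFrame({'sale': [2, 2], 'd_id': [580, 580], 'month': [4, 4],
                    'p_id': [9, 9], 'vol2': [10.0, 20.0]})
df2 = pd.DataFrame({'sale': [2, 2], 'd_id': [580, 580], 'month': [4, 4],
                    'p_id': [9, 9], 'vol1': [1.0, 3.0]})

# Left join on the shared index; duplicated keys on both sides give 2 x 2 = 4 rows.
merged = df1.set_index(index).join(df2.set_index(index), how='left')

# Window sum over the joined rows, aligned to each row.
partitioned_sum = merged.groupby(index)['vol1'].transform('sum')

result = (merged.assign(vol_r=merged['vol1'].div(partitioned_sum),
                        vol_t=lambda df: df['vol_r'].mul(df['vol2']))
                .drop(columns='vol1')
                .reset_index())
print(result)
```

Because the sum runs over the joined rows, vol_r adds up to 1 within each partition of the result, which is also the pattern in the output above. If the ratio should instead be taken against the totals of the right-hand frame before the join, compute the transform on that frame first.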
