SQL To Pandas - Aggregation Over Partition Python
What is the best way in pandas to aggregate values the way SQL does with an OVER (PARTITION BY ...) window? The SQL:
select a.*, b.vol1 / sum(vol1) over ( partition by a.sale, a.d_id, a.month, a.p_id )
Solution 1:
The first part is the join, similar to the left join in your SQL code. One thing I noticed is that four columns are repeated in your code, in both the join and the window: 'sale', 'd_id', 'month', 'p_id'. In SQL, you can define a named window once at the end of the query and reuse it. In Python, you can store the column list in a variable and reuse it, which keeps the code clean. I also set these columns as the index, since the same columns are needed again later for the windowing operation:
index = ['sale', 'd_id', 'month', 'p_id']
df1 = df1.set_index(index)
df2 = df2.set_index(index)
merged = df1.join(df2, how='left')
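To make the join step concrete, here is a minimal sketch on made-up data. The values, and the assumption that df1 carries vol2 while df2 carries vol1, are hypothetical and not taken from the question. It shows that when both frames repeat the same key, the index join pairs every left row with every matching right row, which is why the result further down has more rows than either input would suggest.

```python
import pandas as pd

index = ['sale', 'd_id', 'month', 'p_id']

# Hypothetical inputs: both frames hold two rows for the same key.
key = {'sale': [2, 2], 'd_id': [580, 580], 'month': [4, 4], 'p_id': [9, 9]}
df1 = pd.DataFrame({**key, 'vol2': [10.0, 20.0]}).set_index(index)
df2 = pd.DataFrame({**key, 'vol1': [1.0, 3.0]}).set_index(index)

# Left join on the shared index; duplicate keys on both sides
# produce every left/right combination for that key.
merged = df1.join(df2, how='left')
print(len(merged))  # 2 left rows x 2 right rows
```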
Next, group by the index and compute the aggregate sum of vol1. Since the aggregate has to be aligned to each row, as with a SQL window function, the pandas transform method helps with that:
grouped = merged.groupby(index)
partitioned_sum = grouped.vol1.transform('sum')
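As a small illustration of why transform is the counterpart of sum(...) over (partition by ...), the sketch below uses a toy frame with a single, hypothetical partition column. Unlike a plain groupby sum, which returns one row per group, transform returns a Series with the same length and index as the input, so it can be divided into the original column directly.

```python
import pandas as pd

# Toy data (hypothetical): two rows in partition 9, one row in partition 7.
df = pd.DataFrame({'p_id': [9, 9, 7], 'vol1': [1.0, 3.0, 5.0]})

# One value per group: shape differs from df.
per_group = df.groupby('p_id')['vol1'].sum()

# One value per row: the group total broadcast back to each row.
partitioned_sum = df.groupby('p_id')['vol1'].transform('sum')

# Each row's share of its partition total.
share = df['vol1'] / partitioned_sum
```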
From here, we can create vol_r and vol_t via the assign method, and drop the vol1 column:
(merged.assign(vol_r = merged.vol1.div(partitioned_sum),
               vol_t = lambda df: df.vol_r.mul(df.vol2))
       .drop(columns='vol1')
       .reset_index()
)
   sale  d_id  month  p_id    vol2     vol_r     vol_t
0     2   580      4     9  11.000  0.084653  0.931185
1     2   580      4     9  11.000  0.087070  0.957766
2     2   580      4     9  11.000  0.161611  1.777716
3     2   580      4     9  11.314  0.084653  0.957766
4     2   580      4     9  11.314  0.087070  0.985106
5     2   580      4     9  11.314  0.161611  1.828462
6     2   580      4     9  20.065  0.084653  1.698566
7     2   580      4     9  20.065  0.087070  1.747052
8     2   580      4     9  20.065  0.161611  3.242716
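One way to sanity-check a result like this is to confirm that vol_r sums to 1 within each partition, since each value is a row's share of the partition total. The sketch below runs the same steps end to end on made-up inputs (the data and the split of vol1/vol2 between df1 and df2 are assumptions for illustration) and then performs that check.

```python
import pandas as pd

index = ['sale', 'd_id', 'month', 'p_id']

# Hypothetical inputs sharing a single partition key.
key = {'sale': [2, 2], 'd_id': [580, 580], 'month': [4, 4], 'p_id': [9, 9]}
df1 = pd.DataFrame({**key, 'vol2': [10.0, 20.0]}).set_index(index)
df2 = pd.DataFrame({**key, 'vol1': [1.0, 3.0]}).set_index(index)

merged = df1.join(df2, how='left')
partitioned_sum = merged.groupby(index).vol1.transform('sum')

result = (merged.assign(vol_r=merged.vol1.div(partitioned_sum),
                        vol_t=lambda df: df.vol_r.mul(df.vol2))
                .drop(columns='vol1')
                .reset_index())

# Shares within each partition should add up to 1 (up to float rounding).
totals = result.groupby(index).vol_r.sum()
all_ok = bool(((totals - 1.0).abs() < 1e-9).all())
```

If all_ok comes back False, a likely cause is missing vol1 values from unmatched rows in the left join, which are skipped by the sum and leave NaN shares.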