Comparing Data Frames And Getting The Differences With Python
I have two data frames, shown below, that hold the hours for a project by resource. One holds the data from 10 days ago and the other is as of today. I want to find ONLY the di
Solution 1:
To compare the data frames and compute the differences, first set the index of each data frame to the PR No. and Resource columns. Combine the data frames using pd.concat (or DataFrame.append on pandas versions before 2.0, where it still exists). Then group by the index (the combination of PR No. and Resource) and compute the difference within each group. In a group containing two values, diff leaves a NaN in the first row; it is not needed, so dropna takes care of that. A group with a single value, meaning the row exists in only one data frame, keeps its original hours. Finally, call reset_index to bring back PR No. and Resource as columns.
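The NaN behaviour described above is easiest to see on a toy frame. The following is a minimal sketch, assuming hypothetical keys "a" and "b" that are not part of the original data; key "a" appears twice and key "b" only once.

```python
import pandas as pd

# Hypothetical toy data: key "a" appears twice, key "b" only once.
toy = pd.DataFrame(
    {"hours": [1.0, 2.0, 11.0]},
    index=pd.Index(["a", "b", "a"], name="key"),
)

grouped = toy.groupby(level="key")["hours"]

# diff() subtracts the previous row within each group, so the first
# row of every group has nothing to subtract from and becomes NaN.
step = grouped.diff()

# Singleton groups would be all NaN, so fall back to their original value,
# then drop the leftover NaN from the first row of the two-row group.
result = step.where(grouped.transform("size") > 1, toy["hours"]).dropna()
print(result)
```

Key "a" ends up with a single row holding the difference between its two values, while key "b" keeps its original value.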
# setup
import pandas as pd
data1 = [
["PN1", "Chris", 1],
["PN2", "Julie", 80],
["PN3", "John", 2.4],
["PN4", "Steve", 2]
]
data2 = [
["PN1", "Chris", 11],
["PN2", "Julie", 76],
["PN8", "John", 2.4],
["PN9", "Jonas", 2]
]
df1 = pd.DataFrame(data1, columns = ["PR No.", "Resource", "hours"])
df2 = pd.DataFrame(data2, columns = ["PR No.", "Resource", "hours"])
print(df1)
print(df2)
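# solution
group_by_cols = ["PR No.", "Resource"]
indexed_by_group_cols_1 = df1.set_index(group_by_cols)
indexed_by_group_cols_2 = df2.set_index(group_by_cols)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames instead
combined = pd.concat([indexed_by_group_cols_1, indexed_by_group_cols_2])
grouped_by_index = combined.groupby(level=group_by_cols)["hours"]
# diff() leaves NaN in the first row of each group; rows present in only
# one frame (group size 1) keep their original hours
hours_diff = grouped_by_index.diff().where(
    grouped_by_index.transform("size") > 1, combined["hours"]
)
# sort_index() orders the rows by PR No. and Resource before the index is reset
compare_diff = hours_diff.dropna().sort_index().reset_index()
print(compare_diff)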
Output:
DF1:
  PR No. Resource  hours
0    PN1    Chris    1.0
1    PN2    Julie   80.0
2    PN3     John    2.4
3    PN4    Steve    2.0
DF2:
  PR No. Resource  hours
0    PN1    Chris   11.0
1    PN2    Julie   76.0
2    PN8     John    2.4
3    PN9    Jonas    2.0
Result:
  PR No. Resource  hours
0    PN1    Chris   10.0
1    PN2    Julie   -4.0
2    PN3     John    2.4
3    PN4    Steve    2.0
4    PN8     John    2.4
5    PN9    Jonas    2.0
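In the result above, the hours column mixes true differences (PN1, PN2) with raw hours for rows that exist in only one data frame. If that distinction matters, one possible alternative is an outer merge with an indicator column. This is only a sketch, not the method used in the answer; the names merged, report, change and status, the "_old"/"_new" suffixes, and the labels "removed", "added" and "changed" are all assumptions made for illustration.

```python
import pandas as pd

cols = ["PR No.", "Resource", "hours"]
df1 = pd.DataFrame(
    [["PN1", "Chris", 1], ["PN2", "Julie", 80], ["PN3", "John", 2.4], ["PN4", "Steve", 2]],
    columns=cols,
)
df2 = pd.DataFrame(
    [["PN1", "Chris", 11], ["PN2", "Julie", 76], ["PN8", "John", 2.4], ["PN9", "Jonas", 2]],
    columns=cols,
)

keys = ["PR No.", "Resource"]

# An outer merge keeps rows from both frames; indicator=True adds a "_merge"
# column saying whether each key was found in the left, right, or both frames.
merged = df1.merge(
    df2, on=keys, how="outer", suffixes=("_old", "_new"), indicator=True
)

# The change is NaN when a row is missing on either side.
merged["change"] = merged["hours_new"] - merged["hours_old"]

# Hypothetical labels: "both" only means the key exists in both frames,
# so a row with change == 0 would also be labelled "changed" here.
labels = {"left_only": "removed", "right_only": "added", "both": "changed"}
merged["status"] = merged["_merge"].astype(str).map(labels)

report = merged.drop(columns="_merge")
print(report)
```

Compared with the grouped diff, this keeps the old and new hours side by side, at the cost of a wider result.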