Arranging Call Data From Salesforce In 15 Minute Intervals
Solution 1:
After learning a bit more, I came up with another (better) solution using groupby() and explode(). I am adding it as a second answer since my first one (Solution 2 below), while maybe a bit complicated, still works, and I also reference part of it in this answer.
I first added a few new columns to split status_duration into the first slice and the rest, and replaced the original value of status_duration with a corresponding 2-element list:
df['first'] = ((df['interval_start']+ pd.Timedelta('1sec')).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
df['rest'] = df['status_duration'] - df['first']
df['status_duration'] = df[['first','rest']].values.tolist()
df['status_duration'] = df['status_duration'].apply(lambda x: x if x[1] > 0 else [sum(x),0])
This gives you the following prepared dataframe:
       specialist  language       interval_start  ...  status_duration  first  rest
0     DonaldTrump    German  2021-09-23 14:28:00  ...      [120, 1680]    120  1680
1     DonaldTrump    German  2021-09-23 14:58:01  ...         [119, 5]    119     5
2     DonaldTrump    German  2021-09-24 10:05:00  ...        [600, 30]    600    30
3  MonicaLewinsky    German  2021-09-24 10:05:00  ...          [30, 0]    600  -570
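To make the arithmetic behind first and rest concrete, here is a minimal plain-Python sketch for a single row. The helper names seconds_to_next_boundary and split_first_rest are made up for illustration, and the sketch assumes timestamps with whole seconds.

```python
from datetime import datetime

SLICE = 900  # seconds in one 15-minute interval


def seconds_to_next_boundary(start: datetime) -> int:
    """Seconds from `start` to the next 15-minute boundary.

    A start that sits exactly on a boundary yields 900, which is also what
    the "+ 1 sec, then ceil" trick in the pandas code produces.
    """
    offset = (start.minute * 60 + start.second) % SLICE
    return SLICE - offset


def split_first_rest(start: datetime, duration: int) -> list:
    """Return [first, rest]; a duration that fits the first slice becomes [duration, 0]."""
    first = seconds_to_next_boundary(start)
    rest = duration - first
    return [first, rest] if rest > 0 else [duration, 0]


print(split_first_rest(datetime(2021, 9, 23, 14, 28, 0), 1800))  # [120, 1680]
print(split_first_rest(datetime(2021, 9, 24, 10, 5, 0), 30))     # [30, 0]
```

The second call corresponds to the last row of the table: first is 600 and rest is -570, so the whole duration of 30 seconds stays in the first slice.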
On this, you can now perform a groupby() and explode() similar to the code in your question. Afterwards, you round the intervals and group again to merge the intervals that now have multiple entries because of the explode(). To clean up, I dropped the rows with duration 0 and reset the index:
# the chained calls are wrapped in parentheses so they can span several lines
# 'min' is the minute alias ('T' is deprecated in recent pandas versions)
ref = (
    df.groupby(['specialist', 'language', pd.Grouper(key='interval_start', freq='min')], as_index=False)
      .agg(status_duration=('status_duration', lambda d: [d.iat[0][0], *([900]*(d.iat[0][1]//900)), d.iat[0][1]%900]), interval_start=('interval_start', 'first'))
      .explode('status_duration')
)
# every exploded slice of the same original row starts 900 seconds after the previous one
ref['interval_start'] = ref['interval_start'].dt.floor('15min') + pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit='sec')
ref = ref.groupby(['specialist', 'language', 'interval_start']).sum()
ref = ref[ref.status_duration != 0].reset_index()
This gives you your desired output:
       specialist  language       interval_start  status_duration
0     DonaldTrump    German  2021-09-23 14:15:00              120
1     DonaldTrump    German  2021-09-23 14:30:00              900
2     DonaldTrump    German  2021-09-23 14:45:00              899
3     DonaldTrump    German  2021-09-23 15:00:00                5
4     DonaldTrump    German  2021-09-24 10:00:00              600
5     DonaldTrump    German  2021-09-24 10:15:00               30
6  MonicaLewinsky    German  2021-09-24 10:00:00               30
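As a cross-check of the slicing logic (first slice, then as many full 900-second slices as fit, then the remainder), the same result can be reproduced without pandas. This is only a sketch: the function name slices and the tuple layout of rows are assumptions, and timestamps are assumed to carry whole seconds.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SLICE = 900  # seconds in one 15-minute interval


def slices(start: datetime, duration: int):
    """Yield (bucket_start, seconds) pairs for one call record."""
    bucket = start.replace(minute=start.minute - start.minute % 15, second=0, microsecond=0)
    first = SLICE - int((start - bucket).total_seconds())
    if duration <= first:
        parts = [duration]
    else:
        rest = duration - first
        # first slice, then as many full slices as fit, then the remainder
        parts = [first, *([SLICE] * (rest // SLICE)), rest % SLICE]
    for i, seconds in enumerate(parts):
        if seconds:  # skip empty slices, like the final filter in the pandas code
            yield bucket + timedelta(seconds=SLICE * i), seconds


rows = [  # (specialist, interval_start, status_duration) from the sample data
    ("DonaldTrump", datetime(2021, 9, 23, 14, 28, 0), 1800),
    ("DonaldTrump", datetime(2021, 9, 23, 14, 58, 1), 124),
    ("DonaldTrump", datetime(2021, 9, 24, 10, 5, 0), 630),
    ("MonicaLewinsky", datetime(2021, 9, 24, 10, 5, 0), 30),
]

totals = defaultdict(int)
for specialist, start, duration in rows:
    for bucket, seconds in slices(start, duration):
        totals[(specialist, bucket)] += seconds

for (specialist, bucket), seconds in sorted(totals.items()):
    print(specialist, bucket, seconds)
```

For the sample rows this yields the same seven intervals as the table above, for example 899 seconds in the 14:45 interval (780 from the first call plus 119 from the second).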
Note: The problem I described in the other answer, that the final grouping step could result in a status_duration > 900, should not be possible with real data, since a specialist shouldn't be able to start a second interval before the first one ends. So this is a case you do not need to handle after all.
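If you would rather verify that assumption than rely on it, a check along these lines could flag overlapping records before any slicing happens. The function name find_overlaps and the tuple layout are assumptions for this sketch, and the sample records contain an artificial one-second overlap.

```python
from datetime import datetime


def find_overlaps(records):
    """Return (earlier, later) pairs where a specialist starts before a previous record ended.

    `records` is an iterable of (specialist, interval_start, interval_end) tuples.
    """
    overlaps = []
    last_seen = {}  # specialist -> record with the latest end seen so far
    for rec in sorted(records, key=lambda r: (r[0], r[1])):
        specialist, start, end = rec
        prev = last_seen.get(specialist)
        if prev is not None and start < prev[2]:
            overlaps.append((prev, rec))
        if prev is None or end > prev[2]:
            last_seen[specialist] = rec
    return overlaps


records = [
    ("DonaldTrump", datetime(2021, 9, 23, 14, 28, 0), datetime(2021, 9, 23, 14, 58, 0)),
    ("DonaldTrump", datetime(2021, 9, 23, 14, 57, 59), datetime(2021, 9, 23, 15, 0, 3)),
    ("MonicaLewinsky", datetime(2021, 9, 24, 10, 5, 0), datetime(2021, 9, 24, 10, 5, 30)),
]

for earlier, later in find_overlaps(records):
    print(later[0], "starts at", later[1], "before the previous record ends at", earlier[2])
```

Records that merely touch (one starts exactly when the previous one ends) are not reported, since they cannot push an interval above 900 seconds.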
Solution 2:
Not sure whether this isn't unnecessarily convoluted, but it does get the job done. There are probably nicer, more pythonic approaches though...
I first added a few new columns to the df: the number of intervals that the status_duration suggests, the number of seconds that fit in the first interval, and the remainder of the duration:
df['len'] = 1 + (df['status_duration']-1)//900
df['first'] = ((df['interval_start']+timedelta(seconds=1)).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
df['rest'] = df['status_duration'] - df['first']
Then, we add one additional interval for each row with a positive rest and a first slice < 900:
df['len'] = np.where((df['rest'] > 0) & (df['first'] < 900), df['len'] + 1, df['len'])
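A quick plain-Python cross-check of this len computation (not part of the original answer) compares it with a direct count of the 15-minute intervals a record touches. The helper names are hypothetical, and whole-second timestamps with a duration of at least one second are assumed.

```python
from datetime import datetime, timedelta

SLICE = 900  # seconds in one 15-minute interval


def floor_15(t: datetime) -> datetime:
    """Round a timestamp down to its 15-minute boundary."""
    return t.replace(minute=t.minute - t.minute % 15, second=0, microsecond=0)


def len_from_formula(start: datetime, duration: int) -> int:
    """Mirror of the two `len` assignments above, for a single row."""
    first = SLICE - int((start - floor_15(start)).total_seconds())
    rest = duration - first
    n = 1 + (duration - 1) // SLICE
    if rest > 0 and first < SLICE:
        n += 1
    return n


def len_by_counting(start: datetime, duration: int) -> int:
    """Count the intervals that the seconds start .. start + duration - 1 fall into."""
    last_second = start + timedelta(seconds=duration - 1)
    return int((floor_15(last_second) - floor_15(start)).total_seconds()) // SLICE + 1


start = datetime(2021, 9, 23, 14, 28, 0)
print(len_from_formula(start, 1800), len_by_counting(start, 1800))  # 3 3
print(len_from_formula(start, 1000), len_by_counting(start, 1000))  # 3 2
```

On the sample rows both functions agree. For a record starting at 14:28:00 that lasts 1000 seconds, however, the formula appears to return one interval more than the record actually touches, so it may be worth validating the approach against your own data before relying on it.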
Now, I create the new dataframe by using np.repeat() to duplicate the rows so that I have the right number according to the number of intervals, and list comprehensions over df.iterrows() to build the interval_start and status_duration columns:
new_df = pd.DataFrame({'specialist': np.repeat(df['specialist'], df['len']),
                       'language': np.repeat(df['language'], df['len']),
                       # one start time per slice, 15 minutes apart (rounded down in the next step)
                       'interval_start': [el for sublist in [[x['interval_start'] + timedelta(minutes=15*y) for y in range(0, x['len'])] if (x['len'] > 1) else [x['interval_start']] for i, x in df.iterrows()] for el in sublist],
                       # first slice, full 900-second slices, then the remainder
                       'status_duration': [el for sublist in [([x['first']]+[900]*(x['len']-2)+[x['rest']%900]) if x['len'] > 1 else [x['status_duration']] for i, x in df.iterrows()] for el in sublist]
                       })
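The nested comprehensions follow the pattern [el for sublist in ... for el in sublist]: build one list per row, then flatten. A possibly more readable way to express the same idea is itertools.chain.from_iterable with two small helpers. In this sketch the dicts in rows are hypothetical stand-ins for two dataframe rows with len, first and rest already computed.

```python
from datetime import datetime, timedelta
from itertools import chain

# Hypothetical stand-ins for two dataframe rows
rows = [
    {"interval_start": datetime(2021, 9, 23, 14, 28, 0), "status_duration": 1800,
     "len": 3, "first": 120, "rest": 1680},
    {"interval_start": datetime(2021, 9, 24, 10, 5, 0), "status_duration": 30,
     "len": 1, "first": 600, "rest": -570},
]


def starts_for(row):
    """One start time per slice, 15 minutes apart (not yet rounded down)."""
    return [row["interval_start"] + timedelta(minutes=15 * y) for y in range(row["len"])]


def durations_for(row):
    """First slice, full 900-second slices, then the remainder."""
    if row["len"] > 1:
        return [row["first"]] + [900] * (row["len"] - 2) + [row["rest"] % 900]
    return [row["status_duration"]]


# Flatten the per-row lists into one list per column
interval_starts = list(chain.from_iterable(starts_for(r) for r in rows))
status_durations = list(chain.from_iterable(durations_for(r) for r in rows))
print(status_durations)  # [120, 900, 780, 30]
```

Both resulting lists have one entry per slice, which is the length np.repeat() produces for the specialist and language columns.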
Then we round the interval start time down to the nearest 15-minute boundary:
new_df['interval_start'] = new_df['interval_start'].dt.floor('15min')
All that's left to do now is grouping and resetting the index:
new_df = new_df.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()
Result:
       specialist  language       interval_start  status_duration
0     DonaldTrump    German  2021-09-23 14:15:00              120
1     DonaldTrump    German  2021-09-23 14:30:00              900
2     DonaldTrump    German  2021-09-23 14:45:00              899
3     DonaldTrump    German  2021-09-23 15:00:00                5
4     DonaldTrump    German  2021-09-24 10:00:00              600
5     DonaldTrump    German  2021-09-24 10:15:00               30
6  MonicaLewinsky    German  2021-09-24 10:00:00               30
One problem remains: the last grouping step could produce 15-minute intervals that, through the grouping, again get a status_duration > 900. Imagine the second row of your input data had an interval_start that was 2 seconds earlier:
       specialist  language       interval_start         interval_end  status_duration
0     DonaldTrump    German  2021-09-23 14:28:00  2021-09-23 14:58:00             1800
1     DonaldTrump    German  2021-09-23 14:57:59  2021-09-23 15:00:03              124
2     DonaldTrump    German  2021-09-24 10:05:00  2021-09-24 10:15:30              630
3  MonicaLewinsky    German  2021-09-24 10:05:00  2021-09-24 10:05:30               30
Then you'd wind up with a status_duration of 901 after grouping:
       specialist  language       interval_start  status_duration
0     DonaldTrump    German  2021-09-23 14:15:00              120
1     DonaldTrump    German  2021-09-23 14:30:00              900
2     DonaldTrump    German  2021-09-23 14:45:00              901
3     DonaldTrump    German  2021-09-23 15:00:00                3
4     DonaldTrump    German  2021-09-24 10:00:00              600
5     DonaldTrump    German  2021-09-24 10:15:00               30
6  MonicaLewinsky    German  2021-09-24 10:00:00               30
This is complicated by the fact that this "spilling over" can happen multiple times. One approach would be to repeat the above steps until no new_df rows with status_duration > 900 remain. This will carry over the overflow.
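The carry-over idea itself can be sketched in a few lines of plain Python. Note that this is a different, simpler formulation than repeating the split-and-group steps: it works on already grouped intervals of a single specialist and pushes any excess into the following interval. The function name carry_overflow and the dict layout are assumptions.

```python
from datetime import datetime, timedelta

SLICE = 900  # seconds in one 15-minute interval
STEP = timedelta(seconds=SLICE)


def carry_overflow(buckets):
    """Cap every interval at 900 seconds and push the excess into the next one.

    `buckets` maps a 15-minute interval start to the seconds booked in it
    (for a single specialist). Returns a new dict; the input is left untouched.
    """
    result = dict(buckets)
    start = min(result, default=None)
    while start is not None:
        excess = result[start] - SLICE
        if excess > 0:
            result[start] = SLICE
            result[start + STEP] = result.get(start + STEP, 0) + excess
        # move on to the next interval in time order, if there is one
        start = min((b for b in result if b > start), default=None)
    return result


grouped = {
    datetime(2021, 9, 23, 14, 15): 120,
    datetime(2021, 9, 23, 14, 30): 900,
    datetime(2021, 9, 23, 14, 45): 901,
    datetime(2021, 9, 23, 15, 0): 3,
}
fixed = carry_overflow(grouped)
print(fixed[datetime(2021, 9, 23, 14, 45)], fixed[datetime(2021, 9, 23, 15, 0)])  # 900 4
```

Because intervals are visited in time order, an excess that lands in the next interval is handled on the following pass of the loop, so repeated spilling is covered as well.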
Full example:
import pandas as pd
import numpy as np
from datetime import timedelta
input_df = pd.DataFrame(
    data=[['Donald Trump', 'German', '2021-9-23 14:28:00', '2021-9-23 14:58:00', 1800],
          ['Donald Trump', 'German', '2021-9-23 14:57:59', '2021-9-23 15:00:03', 124],
          ['Donald Trump', 'German', '2021-9-24 10:05:00', '2021-9-24 10:15:30', 630],
          ['Monica Lewinsky', 'German', '2021-9-24 10:05:00', '2021-9-24 10:05:30', 30]],
    columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration']
)
input_df['interval_start'] = pd.to_datetime(input_df['interval_start'])
input_df['interval_end'] = pd.to_datetime(input_df['interval_end'])
def build_df(df):
    # repeat the split/round/group steps until no interval holds more than 900 seconds
    while df['status_duration'].gt(900).any():
        df['len'] = 1 + (df['status_duration']-1)//900
        df['first'] = ((df['interval_start']+timedelta(seconds=1)).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
        df['rest'] = df['status_duration'] - df['first']
        df['len'] = np.where((df['rest'] > 0) & (df['first'] < 900), df['len'] + 1, df['len'])
        new_df = pd.DataFrame({'specialist': np.repeat(df['specialist'], df['len']),
                               'language': np.repeat(df['language'], df['len']),
                               'interval_start': [el for sublist in [[x['interval_start'] + timedelta(minutes=15*y) for y in range(0, x['len'])] if (x['len'] > 1) else [x['interval_start']] for i, x in df.iterrows()] for el in sublist],
                               'status_duration': [el for sublist in [([x['first']]+[900]*(x['len']-2)+[x['rest']%900]) if x['len'] > 1 else [x['status_duration']] for i, x in df.iterrows()] for el in sublist]
                               })
        new_df['interval_start'] = new_df['interval_start'].dt.floor('15min')
        new_df = new_df[new_df.status_duration != 0]
        new_df = new_df.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()
        # the grouped result is the input for the next pass
        df = new_df.copy()
    return df

output_df = build_df(input_df)
Result:
       specialist  language       interval_start  status_duration
0     DonaldTrump    German  2021-09-23 14:15:00              120
1     DonaldTrump    German  2021-09-23 14:30:00              900
2     DonaldTrump    German  2021-09-23 14:45:00              900
3     DonaldTrump    German  2021-09-23 15:00:00                4
4     DonaldTrump    German  2021-09-24 10:00:00              600
5     DonaldTrump    German  2021-09-24 10:15:00               30
6  MonicaLewinsky    German  2021-09-24 10:00:00               30
Looking at it now, I would guess that there should probably be an easier way, but this is all I got...