
Arranging Call Data From Salesforce In 15 Minute Intervals

I am new to Python and pandas, and also to Stack Overflow, so I apologize in advance for any mistakes. I have this dataframe df = pd.DataFrame( data=[['Donald Trump', 'Germ

Solution 1:

After learning a bit more, I came up with another (better) solution using groupby() and explode(). I add this as a second answer since my first one, while maybe a bit complicated, still works and I am also referencing a part of it in this answer.


I first added a few new columns to split up the status_duration into the first slice and the rest, and replaced the original value of status_duration with a corresponding 2-element list:

df['first'] = ((df['interval_start']+ pd.Timedelta('1sec')).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
df['rest'] = df['status_duration'] - df['first']
df['status_duration'] = df[['first','rest']].values.tolist()
df['status_duration'] = df['status_duration'].apply(lambda x: x if x[1] > 0 else [sum(x),0])

This gives you the following prepared dataframe:

        specialist language      interval_start  ... status_duration  first  rest
0     Donald Trump   German 2021-09-23 14:28:00  ...     [120, 1680]    120  1680
1     Donald Trump   German 2021-09-23 14:58:01  ...        [119, 5]    119     5
2     Donald Trump   German 2021-09-24 10:05:00  ...       [600, 30]    600    30
3  Monica Lewinsky   German 2021-09-24 10:05:00  ...         [30, 0]    600  -570
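As an aside, the `+ pd.Timedelta('1sec')` before the `ceil` above matters for starts that fall exactly on a 15-minute boundary: a plain `ceil` would leave such a row with a first slice of 0 seconds instead of a full 900. A small check:

```python
import pandas as pd

# Two start times: one mid-interval, one exactly on a 15-minute boundary.
ts = pd.Series(pd.to_datetime(['2021-09-23 14:28:00', '2021-09-23 14:30:00']))

# A plain ceil leaves the boundary start with a first slice of 0 seconds...
naive = (ts.dt.ceil('15min') - ts).dt.total_seconds().astype(int)
# ...while shifting by one second first gives it the full 900 seconds.
shifted = ((ts + pd.Timedelta('1sec')).dt.ceil('15min') - ts).dt.total_seconds().astype(int)

print(naive.tolist())    # [120, 0]
print(shifted.tolist())  # [120, 900]
```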

On this, you can now perform a groupby() and explode() similar to the code in your question. Afterwards, you round the interval starts and group again to merge the intervals that now have multiple entries because of the explode(). To clean up, I dropped the rows with duration 0 and reset the index:

ref = (df.groupby(['specialist', 'language', pd.Grouper(key='interval_start', freq='T')], as_index=False)
         .agg(status_duration=('status_duration', lambda d: [d.iat[0][0], *([900]*(d.iat[0][1]//900)), d.iat[0][1]%900]),
              interval_start=('interval_start', 'first'))
         .explode('status_duration'))

ref['interval_start'] = ref['interval_start'].dt.floor('15min') + pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit='sec')

ref = ref.groupby(['specialist', 'language', 'interval_start']).sum()
ref = ref[ref.status_duration != 0].reset_index()

This gives you your desired output:

        specialist language      interval_start  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              899
3     Donald Trump   German 2021-09-23 15:00:00                5
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30

Note: The problem I described in the other answer, namely that the final grouping step could result in a status_duration > 900, should not occur with real data, since a specialist shouldn't be able to start a second interval before the first one ends. So this is a case you do not need to handle after all.
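For completeness, here is Solution 1 assembled into one self-contained script. This is a sketch using the question's input data; it deviates slightly from the snippets above in that it uses `freq='min'` (the modern alias for `'T'`) and drops the separate re-aggregation of `interval_start`, since flooring the minute-binned grouper key to 15 minutes lands in the same buckets:

```python
import pandas as pd

# Input data from the question (status_duration in seconds).
df = pd.DataFrame(
    data=[['Donald Trump', 'German', '2021-9-23 14:28:00', '2021-9-23 14:58:00', 1800],
          ['Donald Trump', 'German', '2021-9-23 14:58:01', '2021-9-23 15:00:03', 124],
          ['Donald Trump', 'German', '2021-9-24 10:05:00', '2021-9-24 10:15:30', 630],
          ['Monica Lewinsky', 'German', '2021-9-24 10:05:00', '2021-9-24 10:05:30', 30]],
    columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration'])
df['interval_start'] = pd.to_datetime(df['interval_start'])

# Split each duration into the seconds up to the next 15-minute boundary and the rest.
df['first'] = ((df['interval_start'] + pd.Timedelta('1sec')).dt.ceil('15min')
               - df['interval_start']).dt.total_seconds().astype(int)
df['rest'] = df['status_duration'] - df['first']
df['status_duration'] = df[['first', 'rest']].values.tolist()
df['status_duration'] = df['status_duration'].apply(lambda x: x if x[1] > 0 else [sum(x), 0])

# Expand each list into [first, 900, ..., remainder] and explode into one row per slice.
ref = (df.groupby(['specialist', 'language',
                   pd.Grouper(key='interval_start', freq='min')], as_index=False)
         .agg(status_duration=('status_duration',
                               lambda d: [d.iat[0][0],
                                          *([900] * (d.iat[0][1] // 900)),
                                          d.iat[0][1] % 900]))
         .explode('status_duration'))
ref['status_duration'] = ref['status_duration'].astype(int)

# Shift each slice into its own 15-minute bucket, merge duplicates, drop zero rows.
ref['interval_start'] = (ref['interval_start'].dt.floor('15min')
                         + pd.to_timedelta(ref.groupby(level=0).cumcount() * 900, unit='sec'))
ref = ref.groupby(['specialist', 'language', 'interval_start'], as_index=False).sum()
ref = ref[ref.status_duration != 0].reset_index(drop=True)
print(ref)
```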

Solution 2:

Not sure whether this is unnecessarily convoluted, but it does get the job done. There are probably nicer, more Pythonic approaches though...

I first added a few new columns to the df: the number of intervals that the status_duration spans, the number of seconds that fit into the first interval, and the remainder of the duration:

df['len'] = 1 + (df['status_duration']-1)//900

df['first'] = ((df['interval_start']+timedelta(seconds=1)).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)

df['rest'] = df['status_duration'] - df['first']

Then, we add one additional interval for each row with a positive rest and a first slice < 900:

df['len'] = np.where((df['rest'] > 0) & (df['first'] < 900), df['len'] + 1, df['len'])
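As a sanity check on these formulas, here is the arithmetic applied to just the first input row (1800 seconds starting at 14:28:00):

```python
import numpy as np
import pandas as pd
from datetime import timedelta

df = pd.DataFrame({'interval_start': pd.to_datetime(['2021-09-23 14:28:00']),
                   'status_duration': [1800]})

df['len'] = 1 + (df['status_duration'] - 1) // 900      # ceil(1800/900) = 2
df['first'] = ((df['interval_start'] + timedelta(seconds=1)).dt.ceil('15min')
               - df['interval_start']).dt.total_seconds().astype(int)  # 120 s until 14:30
df['rest'] = df['status_duration'] - df['first']        # 1680 s left over
df['len'] = np.where((df['rest'] > 0) & (df['first'] < 900), df['len'] + 1, df['len'])
# The short first slice pushes the row into 3 intervals: [120, 900, 780].
```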

Now, I create the new dataframe by using np.repeat() to duplicate the rows so that each row appears once per interval, and list comprehensions over df.iterrows() to build the interval_start and status_duration columns:

new_df = pd.DataFrame({'specialist': np.repeat(df['specialist'], df['len']),
                 'language': np.repeat(df['language'], df['len']),
                 'interval_start': [el for sublist in [[x['interval_start'] + timedelta(minutes=15*y) for y in range(0, x['len'])] if (x['len'] > 1) else [x['interval_start']] for i, x in df.iterrows()] for el in sublist],
                 'status_duration': [el for sublist in [([x['first']] + [900]*(x['len']-2) + [x['rest'] % 900]) if x['len'] > 1 else [x['status_duration']] for i, x in df.iterrows()] for el in sublist]
})

Then we round the interval start times down to 15-minute boundaries:

new_df['interval_start'] = new_df['interval_start'].dt.floor('15min')

All that's left to do now is grouping and resetting the index:

new_df = new_df.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()

Result:

        specialist language      interval_start  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              899
3     Donald Trump   German 2021-09-23 15:00:00                5
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30

One problem remains: the final grouping step could again produce 15-minute intervals with a status_duration > 900.

Imagine your second row of your input data had an interval_start that was 2 seconds earlier:

        specialist language      interval_start        interval_end  status_duration
0     Donald Trump   German 2021-09-23 14:28:00 2021-09-23 14:58:00             1800
1     Donald Trump   German 2021-09-23 14:57:59 2021-09-23 15:00:03              124
2     Donald Trump   German 2021-09-24 10:05:00 2021-09-24 10:15:30              630
3  Monica Lewinsky   German 2021-09-24 10:05:00 2021-09-24 10:05:30               30

Then you'd wind up with a status_duration of 901 after grouping:

        specialist language      interval_start  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              901
3     Donald Trump   German 2021-09-23 15:00:00                3
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30

This is complicated by the fact that this "spilling over" can happen multiple times. One approach is to repeat the above steps, carrying the overflow forward, until no new_df rows with status_duration > 900 remain.

Full example:

import pandas as pd
import numpy as np
from datetime import timedelta

input_df = pd.DataFrame(
    data=[['Donald Trump', 'German', '2021-9-23 14:28:00','2021-9-23 14:58:00', 1800 ],
          ['Donald Trump', 'German', '2021-9-23 14:57:59','2021-9-23 15:00:03', 124 ],
          ['Donald Trump', 'German', '2021-9-24 10:05:00','2021-9-24 10:15:30', 630 ],
          ['Monica Lewinsky', 'German', '2021-9-24 10:05:00','2021-9-24 10:05:30', 30 ]],
    columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration']
)
input_df['interval_start'] = pd.to_datetime(input_df['interval_start'])
input_df['interval_end'] = pd.to_datetime(input_df['interval_end'])

def build_df(df):
    while df['status_duration'].gt(900).any():
        df['len'] = 1 + (df['status_duration']-1)//900
        df['first'] = ((df['interval_start'] + timedelta(seconds=1)).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
        df['rest'] = df['status_duration'] - df['first']
        df['len'] = np.where((df['rest'] > 0) & (df['first'] < 900), df['len'] + 1, df['len'])
        new_df = pd.DataFrame({'specialist': np.repeat(df['specialist'], df['len']),
                 'language': np.repeat(df['language'], df['len']),
                 'interval_start': [el for sublist in [[x['interval_start'] + timedelta(minutes=15*y) for y in range(0, x['len'])] if (x['len'] > 1) else [x['interval_start']] for i, x in df.iterrows()] for el in sublist],
                 'status_duration': [el for sublist in [([x['first']] + [900]*(x['len']-2) + [x['rest'] % 900]) if x['len'] > 1 else [x['status_duration']] for i, x in df.iterrows()] for el in sublist]
        })
        new_df['interval_start'] = new_df['interval_start'].dt.floor('15min')
        new_df = new_df[new_df.status_duration != 0]
        new_df = new_df.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()
        df = new_df.copy()
    return df

output_df = build_df(input_df)

Result:

        specialist language      interval_start  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              900
3     Donald Trump   German 2021-09-23 15:00:00                4
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30

Looking at it now, I would guess that there should probably be an easier way, but this is all I got...
