
Pandas - Expanding Z-score Across Multiple Columns

I want to calculate an expanding z-score for some time series data that I have in a DataFrame, but I want to standardize the data using the mean and standard deviation of multiple columns.

Solution 1:

This is my own attempt at calculating the expanding z-scores while pooling all of the columns. Comments on how to do it more efficiently would be welcome.
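For reference, the quantity the function aims to compute can be written roughly as follows. The notation is my own and not from the original post: x_{s,j} is the value in row s and column j, k is the number of columns, and t is the current row. The sketch assumes no missing values and uses the n-1 denominator that pandas `.std()` applies by default.

```latex
\mu_t = \frac{1}{tk}\sum_{s=1}^{t}\sum_{j=1}^{k} x_{s,j},
\qquad
\sigma_t = \sqrt{\frac{1}{tk-1}\sum_{s=1}^{t}\sum_{j=1}^{k}\left(x_{s,j}-\mu_t\right)^{2}},
\qquad
z_{t,j} = \frac{x_{t,j}-\mu_t}{\sigma_t}
```

Every column in row t is therefore standardized with the same pooled mean and standard deviation.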

import numpy as np
import pandas as pd


def pooled_expanding_zscore(df, min_periods=2):
    """Calculates an expanding Z-Score down the rows of the DataFrame while pooling all of the columns.

    Assumes that indexes are not hierarchical.
    Assumes that the index is named (df.index.name is not None).
    Assumes that df does not have columns named 'exp_mean' and 'exp_std'.
    """

    # Get last sorted column name (sorted() returns a copy, so df.columns is not mutated)
    colNames = sorted(df.columns)
    lastCol = colNames[-1]

    # Index name
    indexName = df.index.name

    # Normalize DataFrame: one row per (index, column) pair, ordered by index, then column
    df_stacked = pd.melt(df.reset_index(), id_vars=indexName).sort_values(by=[indexName, 'variable'])

    # Calculates the expanding mean and standard deviation on df_stacked
    df_exp = df_stacked['value'].expanding(2)
    df_stacked['exp_mean'] = df_exp.mean()
    df_stacked['exp_std'] = df_exp.std()

    # Keeps just the rows where 'variable'==lastCol, i.e. the statistics
    # after every column of that index value has been included
    exp_stats = (df_stacked.loc[df_stacked.variable == lastCol, :]
                 .drop(['variable', 'value'], axis=1)
                 .set_index(indexName))

    # add exp_mean and exp_std back to df
    df = pd.concat([df, exp_stats], axis=1)

    # Calculate Z-Score (to_numpy() replaces as_matrix(), which was removed in pandas 1.0)
    df_mat = df.loc[:, colNames].to_numpy()
    exp_mean_mat = df.loc[:, 'exp_mean'].to_numpy()[:, np.newaxis]
    exp_std_mat = df.loc[:, 'exp_std'].to_numpy()[:, np.newaxis]
    zScores = pd.DataFrame(
        (df_mat - exp_mean_mat) / exp_std_mat,
        index=df.index,
        columns=colNames)

    # Use min_periods to kill off early rows
    zScores.iloc[:min_periods-1, :] = np.nan

    return zScores
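On the efficiency question, one possible alternative is to skip the melt/sort/concat steps and work with cumulative sums instead. The sketch below is not from the original post. The function name `pooled_expanding_zscore_cumsum` is made up for illustration, and the code assumes all columns are numeric. It keeps running totals of the count, sum, and sum of squares of all non-missing values, then derives the pooled mean and sample standard deviation (n-1 denominator) for each row. Unlike the function above, it keeps the original column order rather than sorting the columns. The sum-of-squares formula can lose precision when the values are large relative to their spread, so treat this as a starting point.

```python
import numpy as np
import pandas as pd


def pooled_expanding_zscore_cumsum(df, min_periods=2):
    """Expanding z-score pooling all columns, computed from cumulative sums.

    Hypothetical helper for illustration; assumes every column is numeric.
    """
    values = df.to_numpy(dtype=float)
    valid = ~np.isnan(values)
    filled = np.where(valid, values, 0.0)  # NaNs contribute nothing to the sums

    # Running totals over every observation up to and including each row
    n = valid.sum(axis=1).cumsum()
    s1 = filled.sum(axis=1).cumsum()
    s2 = (filled ** 2).sum(axis=1).cumsum()

    with np.errstate(divide='ignore', invalid='ignore'):
        mean = s1 / n
        # Sample variance (ddof=1); undefined with fewer than 2 observations
        var = np.where(n > 1, (s2 - n * mean ** 2) / (n - 1), np.nan)
        std = np.sqrt(np.clip(var, 0.0, None))  # clip guards against tiny negative round-off
        z = (values - mean[:, np.newaxis]) / std[:, np.newaxis]

    zScores = pd.DataFrame(z, index=df.index, columns=df.columns)

    # Use min_periods to blank out early rows, as in the function above
    zScores.iloc[:min_periods - 1, :] = np.nan
    return zScores
```

Because this version never reshapes the data, it also does not need the index to be named.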
