Randomly Concat Data Frames By Row

July 31, 2024

How can I randomly merge, join or concat pandas data frames by row? Suppose I have four data frames something like this (with a lot more rows): df1 = pd.DataFrame({'col1':['1_1',

Solution 1:

Maybe something like this?

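import random
import pandas as pd

dfs = [df1, df2, df3, df4]
# Total number of columns across all frames (built-in sum; np.sum on a generator is deprecated)
n = sum(len(df.columns) for df in dfs)
# Concatenate side by side, then take all n columns in a random order
pd.concat(dfs, axis=1).iloc[:, random.sample(range(n), n)]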

Out[130]: 
  col1 col3 col1 col2 col1 col1 col2 col2 col3 col3 col3 col2
0  4_1  4_3  1_1  4_2  2_1  3_1  1_2  3_2  1_3  3_3  2_3  2_2

Or, if only the order of the data frames should be shuffled (each frame's columns stay together), you can do:

dfs = [df1, df2, df3, df4]
random.shuffle(dfs)
pd.concat(dfs, axis=1)

Out[133]: 
  col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3
0  4_1  4_2  4_3  2_1  2_2  2_3  1_1  1_2  1_3  3_1  3_2  3_3
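Note that both variants above apply one permutation to the whole frame, so every row ends up in the same order. Here is a minimal, self-contained sketch of the second variant that makes this visible. The `make_frame` helper, the toy data and the seeded `random.Random` are assumptions added for illustration; they are not part of the original answer.

```python
import random

import pandas as pd


def make_frame(k, nrows=2, ncols=3):
    """Hypothetical helper: a toy frame whose cells read '<frame>_<column>'."""
    return pd.DataFrame(
        {f"col{c}": [f"{k}_{c}"] * nrows for c in range(1, ncols + 1)}
    )


dfs = [make_frame(k) for k in range(1, 5)]

rng = random.Random(0)           # seeded only so the sketch is repeatable
order = list(range(len(dfs)))
rng.shuffle(order)               # a single permutation of the frames

# Every row shares the same frame order, because the frames are shuffled as a whole.
shuffled = pd.concat([dfs[i] for i in order], axis=1)
```

If each row needs its own random order, a per-row permutation is required, which is what Solution 2 does.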

Solution 2:

UPDATE: a much better solution from @Divakar:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1':["1_1", "1_1"], 'col2':["1_2", "1_2"], 'col3':["1_3", "1_3"], 'col4':["1_4", "1_4"]})
df2 = pd.DataFrame({'col1':["2_1", "2_1"], 'col2':["2_2", "2_2"], 'col3':["2_3", "2_3"], 'col4':["2_4", "2_4"]})
df3 = pd.DataFrame({'col1':["3_1", "3_1"], 'col2':["3_2", "3_2"], 'col3':["3_3", "3_3"], 'col4':["3_4", "3_4"]})
df4 = pd.DataFrame({'col1':["4_1", "4_1"], 'col2':["4_2", "4_2"], 'col3':["4_3", "4_3"], 'col4':["4_4", "4_4"]})

dfs = [df1, df2, df3, df4]
n = len(dfs)                # number of frames
nrows = dfs[0].shape[0]     # all frames are assumed to share this shape
ncols = dfs[0].shape[1]
# 3D array indexed as (row, frame, column within frame)
A = pd.concat(dfs, axis=1).values.reshape(nrows,-1,ncols)
# one random permutation of 0..n-1 per row
sidx = np.random.rand(nrows,n).argsort(1)
# reorder the frame axis row by row, then flatten back to 2D
out_arr = A[np.arange(nrows)[:,None],sidx,:].reshape(nrows,-1)
df = pd.DataFrame(out_arr)

Output:

In [203]: df
Out[203]:
     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
0  3_1  3_2  3_3  3_4  1_1  1_2  1_3  1_4  4_1  4_2  4_3  4_4  2_1  2_2  2_3  2_4
1  4_1  4_2  4_3  4_4  2_1  2_2  2_3  2_4  3_1  3_2  3_3  3_4  1_1  1_2  1_3  1_4

Explanation: (c) Divakar

NumPy based solution

Let's have a NumPy based vectorized solution and hopefully a fast one!

1) Let's reshape an array of concatenated values into a 3D array "cutting" each row into groups of ncols corresponding to the # of columns in each of the input dataframes -

A = pd.concat(dfs, axis=1).values.reshape(nrows,-1,ncols)

2) Next up, we trick argsort into giving us random unique indices ranging from 0 to N-1, where N is the number of input dataframes: argsorting each row of uniform random numbers yields a random permutation for that row -

sidx = np.random.rand(nrows,n).argsort(1)

3) Final trick is NumPy's fancy indexing together with some broadcasting to index into A with sidx to give us the output array -

out_arr = A[np.arange(nrows)[:,None],sidx,:].reshape(nrows,-1)

4) If needed, convert to dataframe -

df = pd.DataFrame(out_arr)
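The four steps above can be wrapped into one function, sketched below. The function name `shuffle_blocks_per_row`, the `seed` parameter, the shape check and the use of `np.random.default_rng` are assumptions added for illustration; the original answer uses `np.random.rand` directly. The sketch also assumes that all frames share the same shape and the same index, since `pd.concat(..., axis=1)` aligns on the index.

```python
import numpy as np
import pandas as pd


def shuffle_blocks_per_row(dfs, seed=None):
    """Concatenate equally shaped frames side by side, permuting the frame order per row.

    Returns the shuffled frame and the per-row permutations that were used.
    """
    n = len(dfs)
    nrows, ncols = dfs[0].shape
    # Assumption: every frame has the same shape (and the same index).
    if any(df.shape != (nrows, ncols) for df in dfs):
        raise ValueError("all frames must have the same shape")

    rng = np.random.default_rng(seed)
    # Step 1: (nrows, n, ncols), where axis 1 selects the source frame
    A = pd.concat(dfs, axis=1).to_numpy().reshape(nrows, n, ncols)
    # Step 2: argsort of random floats gives one permutation of 0..n-1 per row
    sidx = rng.random((nrows, n)).argsort(axis=1)
    # Step 3: fancy indexing with broadcasting reorders the frame axis row by row
    out = A[np.arange(nrows)[:, None], sidx, :].reshape(nrows, n * ncols)
    # Step 4: back to a dataframe
    return pd.DataFrame(out), sidx


# Toy input: four frames, three rows, four columns, cells read '<frame>_<column>'
dfs = [
    pd.DataFrame({f"col{c}": [f"{k}_{c}"] * 3 for c in range(1, 5)})
    for k in range(1, 5)
]
out, sidx = shuffle_blocks_per_row(dfs, seed=0)
```

Returning `sidx` alongside the result makes it easy to verify which frame each block of columns came from.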

OLD answer:

IIUC you can do it this way:

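import random
import numpy as np
import pandas as pd

dfs = [df1, df2, df3, df4]
n = len(dfs)
ncols = dfs[0].shape[1]
v = pd.concat(dfs, axis=1).values
# row k of `a` holds the column positions of frame k inside `v`
a = np.arange(n * ncols).reshape(n, ncols)

# for each row position, draw a fresh frame order and pick the matching columns
df = pd.DataFrame(np.asarray([v[i, a[random.sample(range(n), n)].reshape(n * ncols,)] for i in range(len(dfs[0]))]))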

Output:

In [150]: df
Out[150]:
     0    1    2    3    4    5    6    7    8    9   10   11
0  1_1  1_2  1_3  3_1  3_2  3_3  4_1  4_2  4_3  2_1  2_2  2_3
1  2_1  2_2  2_3  1_1  1_2  1_3  3_1  3_2  3_3  4_1  4_2  4_3

Explanation:

In [151]: v
Out[151]:
array([['1_1', '1_2', '1_3', '2_1', '2_2', '2_3', '3_1', '3_2', '3_3', '4_1', '4_2', '4_3'],
       ['1_1', '1_2', '1_3', '2_1', '2_2', '2_3', '3_1', '3_2', '3_3', '4_1', '4_2', '4_3']], dtype=object)

In [152]: a
Out[152]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
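To make the role of `a` concrete, here is a small sketch of what happens for a single output row: shuffling the rows of `a` and flattening them gives the column positions to pick from `v`. The seeded `random.Random` and the names `perm`, `col_idx`, `v_row` and `picked` are assumptions added for illustration.

```python
import random

import numpy as np

n, ncols = 4, 3                       # four frames with three columns each
a = np.arange(n * ncols).reshape(n, ncols)

rng = random.Random(1)                # seeded only for repeatability
perm = rng.sample(range(n), n)        # a random ordering of the frame numbers
col_idx = a[perm].reshape(-1)         # flat column positions for one output row

# One row of the concatenated values, laid out like a row of `v` above
v_row = np.array(
    [f"{k}_{c}" for k in range(1, n + 1) for c in range(1, ncols + 1)],
    dtype=object,
)
picked = v_row[col_idx]               # that row with its frame blocks reordered
```

Each block of `ncols` consecutive entries in `picked` comes from a single frame, in the order given by `perm`.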
