Python Pandas: Create A New Column For Each Different Value Of A Source Column (with Boolean Output As Column Values)

August 21, 2024

I am trying to split a source column of a dataframe into several columns based on its content, and then fill these newly generated columns with a boolean 1 or 0 indicating whether the row holds that value.

Solution 1:

You can try:

df = pd.get_dummies(df, columns=['source_column'])

or, if you prefer sklearn:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
# OneHotEncoder expects 2-D input, hence the double brackets
matrix = enc.fit_transform(df[['source_column']])
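
As a rough, self-contained sketch of the get_dummies route: the sample frame below is an assumption, reconstructed from the outputs shown in the other answers, and dtype=int is added here because recent pandas versions return True/False columns by default rather than 1/0.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, reconstructed from the outputs shown in this post.
df = pd.DataFrame({
    "ID": list("ABCDE"),
    "source_column": ["value 1", np.nan, "value 2", "value 3", "value 2"],
})

# One indicator column per distinct value; the NaN row gets 0 everywhere.
# dtype=int forces 1/0 instead of boolean output.
encoded = pd.get_dummies(df, columns=["source_column"], dtype=int)
print(encoded)
```

Note that with columns=..., the original column is replaced and the new column names are prefixed with "source_column_".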

Solution 2:

You can use the pandas function get_dummies and add the result to df, as shown below:

In [1]: col_names = df['source_column'].dropna().unique().tolist()

In [2]: df[col_names] = pd.get_dummies(df['source_column'])

In [3]: df
Out[3]:
  ID source_column  value 1  value 2  value 3
0  A       value 1        1        0        0
1  B           NaN        0        0        0
2  C       value 2        0        1        0
3  D       value 3        0        0        1
4  E       value 2        0        1        0
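
A slightly more defensive variant of the same idea, offered only as a sketch: joining the dummy frame keeps the column labels that get_dummies itself produces, so it does not depend on unique() and get_dummies listing the values in the same order. The sample data and the dtype=int argument (to get 1/0 rather than True/False on recent pandas) are assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data matching the output shown above.
df = pd.DataFrame({
    "ID": list("ABCDE"),
    "source_column": ["value 1", np.nan, "value 2", "value 3", "value 2"],
})

# Build the indicator columns separately, then attach them by index.
dummies = pd.get_dummies(df["source_column"], dtype=int)
result = df.join(dummies)
print(result)
```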

Solution 3:

So there is this possibility (a little bit hacky).

Reading the DataFrame from your example data:

In [4]: df = pd.read_clipboard().drop("ID", axis=1)

In [5]: df
Out[5]:
   source_column
A            1.0
B            NaN
C            2.0
D            3.0
E            2.0

After that, add a helper column with df['foo'] = 1.

Then work with unstacking:

In [22]: df.reset_index().set_index(['index', 'source_column']).unstack().fillna(0).rename_axis([None]).astype(int)
Out[22]:
              foo
source_column NaN 1.0 2.0 3.0
A               0   1   0   0
B               1   0   0   0
C               0   0   1   0
D               0   0   0   1
E               0   0   1   0

You then of course have to rename the columns and drop the NaN column, but this should cover your needs as a first pass.

EDIT:

Another approach, which suppresses the NaN column: use groupby + value_counts (also somewhat hacky):

In [30]: df.reset_index().groupby("index").source_column.value_counts().unstack().fillna(0).astype(int).rename_axis([None])
Out[30]:
source_column  1.0  2.0  3.0
A                1    0    0
C                0    1    0
D                0    0    1
E                0    1    0

This is the same idea (unstacking), but value_counts ignores NaN values by default, so no NaN column is produced. If you want to keep the rows with NaN values, you have to merge the result back onto the original dataframe. Both approaches work; choose whichever fits your needs best.
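
The merge-back step mentioned above could look roughly like the sketch below. It is not the answer's exact code: the sample frame is an assumption, NaN is dropped explicitly before grouping, and reindex is used as one possible way to restore the missing row.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: index A..E, one numeric source column with a NaN.
df = pd.DataFrame(
    {"source_column": [1.0, np.nan, 2.0, 3.0, 2.0]},
    index=list("ABCDE"),
)

# Drop NaN up front, count values per index label, and pivot the values
# into columns. Row B has no value, so it is absent from `counts`.
counts = (
    df["source_column"]
    .dropna()
    .groupby(level=0)
    .value_counts()
    .unstack(fill_value=0)
)

# Reindex on the original index to bring B back as an all-zero row,
# then attach the indicator columns to the original frame.
indicators = counts.reindex(df.index, fill_value=0).astype(int)
result = df.join(indicators)
print(result)
```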

Solution 4:

pd.concat([df, pd.crosstab(df.index, df.source_column)], axis=1).fillna(0)

Out[1028]:
  ID source_column  value 1  value 2  value 3
0  A       value 1      1.0      0.0      0.0
1  B             0      0.0      0.0      0.0
2  C       value 2      0.0      1.0      0.0
3  D       value 3      0.0      0.0      1.0
4  E       value 2      0.0      1.0      0.0
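
In the output above, the blanket fillna(0) also turns the NaN in source_column into 0 and leaves the indicator columns as floats. A possible variation, sketched here on assumed sample data, restricts the fill to the new columns and casts them back to integers:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data matching the output shown above.
df = pd.DataFrame({
    "ID": list("ABCDE"),
    "source_column": ["value 1", np.nan, "value 2", "value 3", "value 2"],
})

# crosstab counts (row label, value) pairs; the NaN row is left out.
table = pd.crosstab(df.index, df["source_column"])

# Align on the index. Row 1 has no match in `table`, so fill only the
# new columns and cast them to int, leaving source_column untouched.
result = pd.concat([df, table], axis=1)
result[table.columns] = result[table.columns].fillna(0).astype(int)
print(result)
```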
