Make New Pandas Columns Based On Pipe-delimited Column With Possible Repeats

This question pertains to the fine solution to my previous question, "Create Multiple New Columns Based on Pipe-Delimited Column in Pandas". I have a pipe-delimited column that I want

Solution 1:

stack() drops NaN entries by default, so the row whose PARTS value is missing disappeared from the result. Passing dropna=False keeps it:

# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
pd.get_dummies(df1.set_index(['ID','YEAR','AMT']).PARTS.str.split('|', expand=True)\
                  .stack(dropna=False), prefix='Part')\
  .groupby(level=0).sum()

Output:

      Part_12  Part_13  Part_28  Part_34  Part_51
ID
1202        0        0        0        0        0
9321        1        0        0        1        0
3832        2        0        0        1        0
1723        0        1        1        0        1
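
For what it's worth, the same counts can probably be had without stack at all on pandas versions that provide Series.explode (0.25 and later). This is a different technique from the one above, shown only as a minimal sketch: it assumes df1 looks like the frame printed in the answers, and count_parts is a hypothetical helper name, not something from the original post.

```python
import pandas as pd

# Sample frame; values are taken from the output printed in the answers.
df1 = pd.DataFrame({
    "ID": [1202, 9321, 3832, 1723],
    "YEAR": [2007, 2009, 2012, 2017],
    "AMT": [99.34, 61.21, 12.32, 873.74],
    "PARTS": [None, "12|34", "12|12|34", "28|13|51"],
})


def count_parts(frame, column="PARTS", sep="|", prefix="Part"):
    """Hypothetical helper: one count column per distinct part, indexed like `frame`."""
    # One row per part; a missing PARTS value stays as a single NaN row.
    exploded = frame[column].str.split(sep).explode()
    # NaN rows become all-zero indicator rows, so no original row is lost.
    dummies = pd.get_dummies(exploded, prefix=prefix)
    # Summing per original row label turns repeated parts into counts.
    return dummies.groupby(level=0).sum().astype(int)


part_counts = count_parts(df1)
result = df1.join(part_counts)
print(result)
```

Because explode keeps the original row labels, the join back onto df1 lines up without setting ID, YEAR and AMT as the index first.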

Solution 2:

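You can use sklearn.feature_extraction.text.CountVectorizer:

In [22]: from sklearn.feature_extraction.text import CountVectorizer

In [23]: cv = CountVectorizer()

In [24]: # regex=False treats '|' literally (str.replace no longer defaults to regex in pandas 2.0+);
    ...: # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2;
    ...: # toarray() is the explicit spelling of .A
    ...: t = pd.DataFrame(cv.fit_transform(df1.PARTS.fillna('').str.replace('|', ' ', regex=False)).toarray(),
    ...:                  columns=cv.get_feature_names_out(),
    ...:                  index=df1.index).add_prefix('PART_')
    ...:

In [25]: df1 = df1.join(t)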

In [26]: df1
Out[26]:
     ID  YEAR     AMT     PARTS  PART_12  PART_13  PART_28  PART_34  PART_51
0  1202  2007   99.34      None        0        0        0        0        0
1  9321  2009   61.21     12|34        1        0        0        1        0
2  3832  2012   12.32  12|12|34        2        0        0        1        0
3  1723  2017  873.74  28|13|51        0        1        1        0        1

Solution 3:

This expanded version should work too, and it additionally retains the original columns:

In [728]: import numpy as np
     ...: import pandas as pd

# Dataframe used from Mike's (data) above:
In [729]: df = pd.DataFrame(np.array([
     ...:     [1202, 2007, 99.34, None],
     ...:     [9321, 2009, 61.21, '12|34'],
     ...:     [3832, 2012, 12.32, '12|12|34'],
     ...:     [1723, 2017, 873.74, '28|13|51']]),
     ...:     columns=['ID', 'YEAR', 'AMT', 'PARTS'])

# quick glimpse of dataframe
In [730]: df
Out[730]:
     ID  YEAR     AMT     PARTS
0  1202  2007   99.34      None
1  9321  2009   61.21     12|34
2  3832  2012   12.32  12|12|34
3  1723  2017  873.74  28|13|51

# expand string based on delimiter ("|")
In [731]: expand_str = df["PARTS"].str.split('|', expand=True)

# generate dummies df (groupby(level=0).sum() replaces sum(level=0), removed in pandas 2.0):
In [732]: dummies_df = pd.get_dummies(expand_str.stack(dropna=False)).groupby(level=0).sum().add_prefix("Part_")

# gives concatenated or combined df (i.e. dummies_df + original df):
In [733]: pd.concat([df, dummies_df], axis=1)
Out[733]:
     ID  YEAR     AMT     PARTS  Part_12  Part_13  Part_28  Part_34  Part_51
0  1202  2007   99.34      None        0        0        0        0        0
1  9321  2009   61.21     12|34        1        0        0        1        0
2  3832  2012   12.32  12|12|34        2        0        0        1        0
3  1723  2017  873.74  28|13|51        0        1        1        0        1