Make New Pandas Columns Based On Pipe-delimited Column With Possible Repeats
This question pertains to the fine solution to my previous question, "Create Multiple New Columns Based on Pipe-Delimited Column in Pandas". I have a pipe-delimited column that I want
Solution 1:
stack was dropping NaNs. Using dropna=False will solve this:
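# Note: .sum(level=0) was removed in pandas 2.0; groupby(level=0, sort=False).sum() replaces it
pd.get_dummies(df1.set_index(['ID', 'YEAR', 'AMT']).PARTS.str.split('|', expand=True)\
               .stack(dropna=False), prefix='Part')\
  .groupby(level=0, sort=False).sum()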
Output:
      Part_12  Part_13  Part_28  Part_34  Part_51
ID
1202        0        0        0        0        0
9321        1        0        0        1        0
3832        2        0        0        1        0
1723        0        1        1        0        1
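For readers who would rather not rely on stack, here is a minimal sketch of the same count, assuming the sample data shown in the other answers. It swaps in Series.explode plus groupby in place of stack and sum(level=0); the frame df1 is rebuilt here only so the snippet runs on its own, and the names parts and counts are illustrative.

```python
import pandas as pd

# Sample data as shown in the other answers (rebuilt here so the snippet is standalone).
df1 = pd.DataFrame({
    "ID": [1202, 9321, 3832, 1723],
    "YEAR": [2007, 2009, 2012, 2017],
    "AMT": [99.34, 61.21, 12.32, 873.74],
    "PARTS": [None, "12|34", "12|12|34", "28|13|51"],
})

# One row per (ID, part). A missing PARTS value stays as a single NaN row,
# so the ID is not lost.
parts = df1.set_index("ID")["PARTS"].str.split("|").explode()

# get_dummies ignores NaN, so an ID without parts ends up as an all-zero row.
# Summing per ID turns repeated parts into counts rather than 0/1 flags.
counts = (
    pd.get_dummies(parts, prefix="Part", dtype=int)
    .groupby(level=0, sort=False)
    .sum()
)

print(counts)
```

If the original columns are needed alongside the counts, df1.join(counts, on="ID") should attach them row by row.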
Solution 2:
You can use sklearn.feature_extraction.text.CountVectorizer:
In [22]: from sklearn.feature_extraction.text import CountVectorizer
In [23]: cv = CountVectorizer()
In [24]: t = pd.DataFrame(cv.fit_transform(df1.PARTS.fillna('').str.replace('|', ' ', regex=False)).toarray(),
    ...: columns=cv.get_feature_names_out(),
    ...: index=df1.index).add_prefix('PART_')
...:
In [25]: df1 = df1.join(t)
In [26]: df1
Out[26]:
     ID  YEAR     AMT     PARTS  PART_12  PART_13  PART_28  PART_34  PART_51
0  1202  2007   99.34      None        0        0        0        0        0
1  9321  2009   61.21     12|34        1        0        0        1        0
2  3832  2012   12.32  12|12|34        2        0        0        1        0
3  1723  2017  873.74  28|13|51        0        1        1        0        1
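One caveat with this approach: CountVectorizer's default token pattern only keeps tokens of two or more word characters, so a single-character part code would be silently dropped. The sketch below is one possible way around that, assuming a custom tokenizer that splits on the pipe directly. The helper split_parts and the single-digit part "7" are hypothetical and are not part of the original data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Reduced sample frame; the single-digit part "7" is a hypothetical value
# added only to show the short-token case.
df1 = pd.DataFrame({
    "ID": [1202, 9321, 3832, 1723],
    "PARTS": [None, "12|34", "12|12|34", "7|13"],
})


def split_parts(text):
    """Split on '|' and drop empty strings (rows with no parts)."""
    return [tok for tok in text.split("|") if tok]


# token_pattern=None because the custom tokenizer is used instead of the regex.
cv = CountVectorizer(tokenizer=split_parts, token_pattern=None, lowercase=False)
matrix = cv.fit_transform(df1["PARTS"].fillna(""))

t = pd.DataFrame(
    matrix.toarray(),
    columns=cv.get_feature_names_out(),
    index=df1.index,
).add_prefix("PART_")

result = df1.join(t)
print(result)
```

Splitting in the tokenizer also makes the str.replace step unnecessary, since the pipe never has to be turned into whitespace.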
Solution 3:
This expanded version should work too; it also retains the original columns.
In [728]: import numpy as np
     ...: import pandas as pd

# Dataframe used from Mike's (data) above:
In [729]: df = pd.DataFrame(np.array([
     ...:     [1202, 2007, 99.34, None],
     ...:     [9321, 2009, 61.21, '12|34'],
     ...:     [3832, 2012, 12.32, '12|12|34'],
     ...:     [1723, 2017, 873.74, '28|13|51']]),
     ...:     columns=['ID', 'YEAR', 'AMT', 'PARTS'])

# quick glimpse of dataframe
In [730]: df
Out[730]:
     ID  YEAR     AMT     PARTS
0  1202  2007   99.34      None
1  9321  2009   61.21     12|34
2  3832  2012   12.32  12|12|34
3  1723  2017  873.74  28|13|51

# expand string based on delimiter ("|")
In [731]: expand_str = df["PARTS"].str.split('|', expand=True)

# generate dummies df (groupby(level=0).sum() replaces .sum(level=0), which was removed in pandas 2.0):
In [732]: dummies_df = pd.get_dummies(expand_str.stack(dropna=False)).groupby(level=0).sum().add_prefix("Part_")

# gives concatenated or combined df (i.e. dummies_df + original df):
In [733]: pd.concat([df, dummies_df], axis=1)
Out[733]:
     ID  YEAR     AMT     PARTS  Part_12  Part_13  Part_28  Part_34  Part_51
0  1202  2007   99.34      None        0        0        0        0        0
1  9321  2009   61.21     12|34        1        0        0        1        0
2  3832  2012   12.32  12|12|34        2        0        0        1        0
3  1723  2017  873.74  28|13|51        0        1        1        0        1