How To Remove, Randomly, Rows From A Dataframe But From Each Label?
This is for a machine learning project. I have a dataframe with 5 feature columns and 1 label column (Figure A). I want to randomly remove 2 rows from each label. So, a
Solution 1:
With groupby.apply, sampling 2 rows from each label:
df.groupby('label', as_index=False).apply(lambda x: x.sample(2)) \
  .reset_index(level=0, drop=True)
Out:
0 1 2 3 4 label
s1 0.433731 0.886622 0.683993 0.125918 0.398787 1
s1 0.719834 0.435971 0.935742 0.885779 0.460693 1
s2 0.324877 0.962413 0.366274 0.980935 0.487806 2
s2 0.600318 0.633574 0.453003 0.291159 0.223662 2
s3 0.741116 0.167992 0.513374 0.485132 0.550467 3
s3 0.301959 0.843531 0.654343 0.726779 0.594402 3
A cleaner way in my opinion would be with a comprehension:
pd.concat(g.sample(2) for idx, g in df.groupby('label'))
which would yield a result of the same shape (the rows themselves differ because the sampling is random):
0 1 2 3 4 label
s1 0.442293 0.470318 0.559764 0.829743 0.146971 1
s1 0.603235 0.218269 0.516422 0.295342 0.466475 1
s2 0.569428 0.109494 0.035729 0.548579 0.760698 2
s2 0.600318 0.633574 0.453003 0.291159 0.223662 2
s3 0.412750 0.079504 0.433272 0.136108 0.740311 3
s3 0.462627 0.025328 0.245863 0.931857 0.576927 3
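Note that both snippets above select 2 rows per label rather than remove them. If the goal is to drop those rows, one option is to drop the sampled index from the original frame. The following is a minimal sketch, not a definitive implementation: the toy data is made up for illustration, it assumes the index is unique (a default RangeIndex here), and it uses DataFrameGroupBy.sample, which needs pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 5 feature columns, 4 rows per label.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((12, 5)))
df["label"] = np.repeat([1, 2, 3], 4)

# Pick 2 random rows per label and keep only their index values.
# Assumption: the index is unique, otherwise drop() removes every
# row that shares a sampled index value.
to_drop = df.groupby("label").sample(n=2, random_state=0).index

# Drop the sampled rows; 2 rows per label remain in this toy example.
remaining = df.drop(to_drop)

print(remaining["label"].value_counts().sort_index())
```

If the index is not unique, calling df.reset_index() first gives each row its own positional index to drop by.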
Solution 2:
Here is a pretty straightforward way. Shuffle all the rows with sample(frac=1), then compute the cumulative count within each label and select the rows whose count is 1 or less (that is, the first 2 rows of each label in shuffled order).
df.loc[df.sample(frac=1).groupby('label').cumcount() <= 1]
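To make the one-liner easier to follow, here is a small sketch that splits it into steps and also shows the inverse mask, which removes 2 random rows per label instead of selecting them. The toy data and the names rank, selected and remaining are assumptions for illustration, and the explicit reindex is only there to make the alignment with the original row order visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 5 feature columns, 4 rows per label.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((12, 5)))
df["label"] = np.repeat([1, 2, 3], 4)

# Shuffle the rows, then number them 0, 1, 2, ... within each label
# in that shuffled order.
rank = df.sample(frac=1, random_state=1).groupby("label").cumcount()

# Put the ranks back into the original row order so the boolean mask
# lines up with df (assumes a unique index).
rank = rank.reindex(df.index)

selected = df.loc[rank <= 1]   # 2 random rows per label
remaining = df.loc[rank > 1]   # the rest, i.e. 2 rows removed per label
```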
And here it is with sklearn's StratifiedKFold:
from sklearn.model_selection import StratifiedKFold
X = df[[0,1,2,3,4]]
y = df.label
skf = StratifiedKFold(n_splits=2)
for train_index, test_index in skf.split(X, y):
    # split() yields positional indices, so index with iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
print(X_train)
0 1 2 3 4
0 0.656240 0.904032 0.256067 0.916293 0.262773
1 0.526509 0.555683 0.667756 0.208831 0.699438
4 0.096499 0.688737 0.328670 0.260733 0.834091
5 0.320150 0.602197 0.793404 0.911291 0.269915
8 0.913669 0.171831 0.534418 0.862583 0.994561
9 0.718337 0.256351 0.348813 0.420952 0.622890
print(y_train)
0 1
1 1
4 2
5 2
8 3
9 3
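One caveat: StratifiedKFold does not shuffle by default, which is why the output above keeps the first rows of each label. If the removed rows should be random, passing shuffle=True is one way to get that. The sketch below assumes made-up toy data with 4 rows per label and a hypothetical folds list for collecting the splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 5 feature columns, 4 rows per label.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.random((12, 5)))
df["label"] = np.repeat([1, 2, 3], 4)

X = df[[0, 1, 2, 3, 4]]
y = df["label"]

# shuffle=True randomizes which rows of each label land in each fold;
# random_state makes the split reproducible.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

folds = []
for train_index, test_index in skf.split(X, y):
    # split() yields positional indices, so index with iloc
    folds.append((X.iloc[train_index], y.iloc[train_index]))

# With 4 rows per label and 2 splits, each training part keeps 2 rows per label.
X_train, y_train = folds[0]
print(y_train.value_counts().sort_index())
```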