Looking To Transform Continuous Variables Into Categorical
Sample Data:
id       val1  val2  val3  val4  val5  val6  val7
///+8yr  NaN   0.0   2.0   NaN   1     3     23
///1vjh  NaN   NaN   NaN   NaN   NaN   7     62
///4wu   3
Solution 1:
IIUC you have two questions. The first, replacing values larger than 5 with 'Larger than 5',
can be done with boolean indexing. The second, grouping val7 into ranges, can be done with pd.cut().
DEMO:
import numpy as np
import pandas as pd
d = pd.read_clipboard()
Part 1
Obtaining the values that do not satisfy the larger than 5 criterion,
rest = d.loc[:,'val1':'val6'][~(d.loc[:,'val1':'val6']>5)]
rest
   val1  val2  val3  val4  val5  val6
0   NaN   0.0   2.0   NaN   1.0   3.0
1   NaN   NaN   NaN   NaN   NaN   NaN
2   3.0   NaN   NaN   NaN   NaN   NaN
Obtaining the larger than 5 values
larger_than_5=d.loc[:,'val1':'val6'][d.loc[:,'val1':'val6']>5]
print(larger_than_5)
   val1  val2  val3  val4  val5  val6
0   NaN   NaN   NaN   NaN   NaN   NaN
1   NaN   NaN   NaN   NaN   NaN   7.0
2   NaN   NaN   6.0   NaN   7.0   8.0
Updating with your logic,
larger_than_5[larger_than_5.notnull()]='Larger than 5'
print(larger_than_5)
  val1 val2           val3 val4           val5           val6
0  NaN  NaN            NaN  NaN            NaN            NaN
1  NaN  NaN            NaN  NaN            NaN  Larger than 5
2  NaN  NaN  Larger than 5  NaN  Larger than 5  Larger than 5
Updating rest with the logic,
rest.update(larger_than_5)
print(rest)
  val1 val2           val3 val4           val5           val6
0  NaN  0.0              2  NaN              1              3
1  NaN  NaN            NaN  NaN            NaN  Larger than 5
2  3.0  NaN  Larger than 5  NaN  Larger than 5  Larger than 5
Replacing values of the original df with updated values as per logic 1
d.loc[:,'val1':'val6']= rest
print(d)
        id val1 val2           val3 val4           val5           val6  \
0  ///+8yr  NaN  0.0              2  NaN              1              3
1  ///1vjh  NaN  NaN            NaN  NaN            NaN  Larger than 5
2   ///4wu  3.0  NaN  Larger than 5  NaN  Larger than 5  Larger than 5

   val7
0    23
1    62
2   180
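For reference, the same Part 1 result can probably be reached in one step with DataFrame.mask. This is only a sketch, not the method used above. The sample frame is rebuilt from the printed outputs, since the original was read from the clipboard, and the name cols is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the printed outputs above (an assumption,
# because the original data came from the clipboard).
d = pd.DataFrame({
    "id": ["///+8yr", "///1vjh", "///4wu"],
    "val1": [np.nan, np.nan, 3.0],
    "val2": [0.0, np.nan, np.nan],
    "val3": [2.0, np.nan, 6.0],
    "val4": [np.nan, np.nan, np.nan],
    "val5": [1.0, np.nan, 7.0],
    "val6": [3.0, 7.0, 8.0],
    "val7": [23, 62, 180],
})

cols = ["val1", "val2", "val3", "val4", "val5", "val6"]  # hypothetical helper name

# Cast to object so the columns can hold both numbers and strings,
# then overwrite every cell where the comparison is True.
# NaN > 5 evaluates to False, so missing values are left alone.
d[cols] = d[cols].astype(object).mask(d[cols] > 5, "Larger than 5")
print(d)
```

The comparison on the right-hand side is evaluated before the assignment, so it still sees the original numeric values.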
Part 2
Obtaining bins
bins = np.arange(0, d['val7'].max()+1, 30)
bins
array([ 0, 30, 60, 90, 120, 150, 180], dtype=int64)
Creating a new series
val7_groups = pd.cut(d['val7'], bins)
val7_groups
0 (0, 30]
1 (60, 90]
2 (150, 180]
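One thing to watch with these bins: pd.cut() uses right-closed intervals by default, so a value of exactly 0 gets no group, and a maximum that is not a multiple of 30 falls past the last edge. Both cases come back as NaN. The sketch below shows one possible way around this, using made-up values and a hypothetical step variable.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to hit both edge cases:
# a 0, and a maximum (170) that is not a multiple of 30.
val7 = pd.Series([0, 23, 62, 170])

step = 30
# Extend the upper bound by one full step so the last edge is >= the maximum.
bins = np.arange(0, val7.max() + step, step)

# include_lowest=True makes the first interval also contain its left edge (0).
groups = pd.cut(val7, bins, include_lowest=True)
print(groups)
```

With the sample data in the question (maximum 180, no zeros) the original bins work as they are.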
Adding that to the dataframe
d['val7_groups']= val7_groups
print(d)
        id val1 val2           val3 val4           val5           val6  \
0  ///+8yr  NaN  0.0              2  NaN              1              3
1  ///1vjh  NaN  NaN            NaN  NaN            NaN  Larger than 5
2   ///4wu  3.0  NaN  Larger than 5  NaN  Larger than 5  Larger than 5

   val7 val7_groups
0    23     (0, 30]
1    62    (60, 90]
2   180  (150, 180]
You can also set group labels by passing values to the labels parameter of pd.cut().
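A minimal sketch of the labels parameter, using the val7 values from the sample data. The label text ("0-30" and so on) is an arbitrary choice for illustration. pd.cut() expects one label per interval, that is len(bins) - 1 labels.

```python
import numpy as np
import pandas as pd

val7 = pd.Series([23, 62, 180])          # val7 column from the sample data
bins = np.arange(0, val7.max() + 1, 30)  # 0, 30, ..., 180

# Build one label per interval from consecutive bin edges.
labels = [f"{lo}-{hi}" for lo, hi in zip(bins[:-1], bins[1:])]

val7_groups = pd.cut(val7, bins, labels=labels)
print(val7_groups)
```

The result is still a categorical Series, so it can be assigned to d['val7_groups'] the same way as before.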