Preprocessing of Numeric Features

Machine learning models can be roughly divided into two categories:

- Tree-based models
- Non-tree-based models

For tree-based models, such as decision tree classifiers, rescaling a numeric feature does not change which side of a split each sample falls on, so it does not affect the structure of the tree. In principle, numeric features therefore need no preprocessing for this class of models.
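This invariance is easy to check empirically. A minimal sketch, assuming scikit-learn is installed; the toy data and the factor of 1000 are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: one numeric feature, labels determined by a threshold.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = (X[:, 0] > 5).astype(int)

# Fit the same tree on raw and heavily rescaled copies of the feature.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X * 1000.0, y)

# Predictions agree: a monotonic rescaling only moves the split
# threshold, not which samples end up on each side of it.
same = (tree_raw.predict(X) == tree_scaled.predict(X * 1000.0)).all()
print(same)
```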

For non-tree-based models, such as linear models, KNN, and neural networks, model quality depends on feature scale. Below are some of the most common preprocessing methods for numeric features.

  1. feature scaling
    The most common scaling methods:

    • MinMaxScaler: X=(X-X.min())/(X.max()-X.min())
    • StandardScaler: X=(X-X.mean())/X.std()

    Why scaling matters:

    • regularization impact turns out to be proportional to feature scale;
    • gradient descent methods can go crazy without proper scaling;
    • different feature scalings result in different model quality;
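Both scalers are available in scikit-learn; a minimal sketch on a tiny made-up column (the values are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])

mm = MinMaxScaler().fit_transform(X)    # maps the column into [0, 1]
ss = StandardScaler().fit_transform(X)  # zero mean, unit variance

print(mm.min(), mm.max())   # endpoints of the min-max range
print(ss.mean(), ss.std())  # close to 0 and 1
```

Note that StandardScaler divides by the population standard deviation (ddof=0), matching the X.std() formula above.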
  2. outliers

    • outliers may appear in feature values as well as in target values;
    • an effective remedy: clip feature values between two chosen lower and upper bounds, e.g. certain percentiles of that feature.
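Percentile clipping (winsorization) can be sketched as follows; the 1st/99th percentile bounds and the sample values are illustrative choices:

```python
import numpy as np

# One injected outlier (1000) among otherwise small values.
x = np.array([1., 2., 2., 3., 3., 4., 4., 5., 1000.])

# Clip to the 1st and 99th percentiles of the feature itself.
lo, hi = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lo, hi)

print(x_clipped.max())  # the outlier is pulled down to the upper bound
```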
  3. rank transformation

    • can be a better option than MinMaxScaler when we have outliers, because a rank transformation moves the outliers closer to the other objects.
  4. log transformation

    • drives too-big values closer to the feature's average value.
    • common choices: np.log(1+x), np.sqrt(x+2/3)
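Both transforms above can be sketched directly with numpy (`np.log1p(x)` is the numerically safe form of `np.log(1+x)`); the sample values are arbitrary:

```python
import numpy as np

x = np.array([1., 10., 100., 10000.])

x_log = np.log1p(x)        # same as np.log(1 + x), safe at x = 0
x_sqrt = np.sqrt(x + 2/3)  # the sqrt variant mentioned above

# Large values are pulled much closer to the rest:
print(x.max() / x.min())          # spread of the raw feature
print(x_log.max() / x_log.min())  # spread after the log transform
```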
  5. combining preprocessings

    • concatenate features produced by different preprocessings;
    • mix models trained on differently preprocessed data
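The first option, concatenating differently preprocessed copies of the same feature, can be sketched as follows (scikit-learn assumed available; the column values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.], [5.], [10.], [50.]])

# Concatenate several preprocessed versions of the same column,
# letting the downstream model pick whichever it finds useful.
X_mixed = np.hstack([
    MinMaxScaler().fit_transform(X),
    StandardScaler().fit_transform(X),
    np.log1p(X),
])
print(X_mixed.shape)  # one row per sample, one column per preprocessing
```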

Finally, a note on feature generation

  • It is defined as creating new features using knowledge about the features and the task.
  • Effective feature generation relies on creativity and data understanding.
  • Approaches: 1. prior knowledge, 2. EDA
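A classic prior-knowledge example: deriving a price-per-square-meter feature from a hypothetical housing dataset (the column names and numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical housing data: total price and area in square meters.
price = np.array([300000., 450000., 120000.])
area = np.array([100., 150., 60.])

# Prior knowledge: price per square meter is often more informative
# to a model than either raw feature on its own.
price_per_m2 = price / area
print(price_per_m2)  # [3000. 3000. 2000.]
```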