This article is about 2,000 words; estimated reading time 8 minutes. It gives a brief introduction to using random forests for feature selection.
A random forest is an ensemble learning algorithm that uses decision trees as its base learners. Random forests are very simple, easy to implement, and cheap to train, yet they deliver surprisingly strong performance on both classification and regression tasks. For this reason, random forests are often described as "a method representing the state of the art of ensemble learning."
1. A Brief Introduction to Random Forests (RF)
If you already understand how a single decision tree works, random forests are easy to grasp: the algorithm can be summarized in a few steps, namely growing many decision trees on bootstrap samples of the training data, restricting each split to a random subset of the features, and aggregating the trees' predictions. (The original post includes a flow-chart of these steps, taken from reference [2].) A small code sketch of the same procedure follows.
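A minimal sketch of that procedure, assuming scikit-learn's DecisionTreeClassifier as the base learner (the helper names train_random_forest and rf_predict are made up for illustration and are not part of the original post):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(x, y, n_trees=100, random_state=0):
    # For each tree: draw a bootstrap sample of the training set, then fit a
    # decision tree that only considers a random subset of features per split.
    rng = np.random.RandomState(random_state)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, len(x), size=len(x))            # sample with replacement
        tree = DecisionTreeClassifier(max_features='sqrt',   # random feature subset at each split
                                      random_state=rng.randint(1 << 30))
        trees.append(tree.fit(x[idx], y[idx]))
    return trees

def rf_predict(trees, x):
    # Aggregate the individual tree predictions by majority vote
    votes = np.stack([t.predict(x) for t in trees])          # shape: (n_trees, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

In practice you would simply use sklearn.ensemble.RandomForestClassifier, as the example in section 3 does; the sketch only shows where the two sources of randomness, bootstrap sampling and per-split feature subsampling, enter the algorithm.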
2. Evaluating Feature Importance
Here feature importance is scored with the Gini index. Write the importance score as $VIM$ and the Gini index as $GI$; for each feature $X_j$ we want a score $VIM_j^{(Gini)}$, i.e. the total change in node-splitting impurity that $X_j$ contributes over all decision trees in the RF.

The Gini index of node $m$ is

$$GI_m = \sum_{k=1}^{K}\sum_{k' \neq k} p_{mk}\,p_{mk'} = 1 - \sum_{k=1}^{K} p_{mk}^{2},$$

where $K$ is the number of classes and $p_{mk}$ is the proportion of class $k$ among the samples in node $m$. Intuitively, it is the probability that two samples drawn at random from node $m$ carry different class labels.

The importance of feature $X_j$ at node $m$ of the $i$-th tree, i.e. the change in Gini index before and after the split at node $m$, is

$$VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r,$$

where $GI_l$ and $GI_r$ are the Gini indices of the two child nodes created by the split. If $M$ is the set of nodes in decision tree $i$ at which feature $X_j$ is used to split, then the importance of $X_j$ in the $i$-th tree is

$$VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)}.$$

Assuming the RF contains $I$ trees in total,

$$VIM_{j}^{(Gini)} = \sum_{i=1}^{I} VIM_{ij}^{(Gini)}.$$

Finally, normalize all the importance scores so that they sum to one:

$$VIM_j = \frac{VIM_j^{(Gini)}}{\sum_{j'} VIM_{j'}^{(Gini)}}.$$
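To make the formulas concrete, here is a small NumPy sketch (the function names gini and split_importance and the toy labels are mine, purely for illustration): it computes $GI_m$ for one node and the impurity decrease $GI_m - GI_l - GI_r$ for one split.

import numpy as np

def gini(labels):
    # GI_m = 1 - sum_k p_mk^2: the probability that two samples drawn at
    # random from the node carry different class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_importance(parent, left, right):
    # VIM_jm^(Gini) = GI_m - GI_l - GI_r: the impurity decrease of one split
    return gini(parent) - gini(left) - gini(right)

# made-up class labels of the samples reaching a parent node and its two children
parent = np.array([0, 0, 0, 0, 1, 1, 1, 2])
left, right = np.array([0, 0, 0, 0]), np.array([1, 1, 1, 2])
print(gini(parent))                           # Gini index of the parent node
print(split_importance(parent, left, right))  # contribution of this split

Note that scikit-learn's feature_importances_, used in the next section, follows the same idea but weights each node's contribution by the fraction of training samples reaching that node before summing over nodes and trees and normalizing.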
3. A Worked Example
import pandas as pd

# Load the UCI wine dataset: 178 samples, 13 features, 3 classes
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
df = pd.read_csv(url, header=None)
df.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
              'Alcalinity of ash', 'Magnesium', 'Total phenols',
              'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
              'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
import numpy as np
np.unique(df['Class label'])
array([1, 2, 3], dtype=int64)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
Class label                     178 non-null int64
Alcohol                         178 non-null float64
Malic acid                      178 non-null float64
Ash                             178 non-null float64
Alcalinity of ash               178 non-null float64
Magnesium                       178 non-null int64
Total phenols                   178 non-null float64
Flavanoids                      178 non-null float64
Nonflavanoid phenols            178 non-null float64
Proanthocyanins                 178 non-null float64
Color intensity                 178 non-null float64
Hue                             178 non-null float64
OD280/OD315 of diluted wines    178 non-null float64
Proline                         178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    # fall back for scikit-learn versions older than 0.18
    from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier

x, y = df.iloc[:, 1:].values, df.iloc[:, 0].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
feat_labels = df.columns[1:]

# 10000 trees is overkill for 178 samples, but it keeps the importance estimates stable
forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
forest.fit(x_train, y_train)

# impurity-based (Gini) importances, one score per feature, summing to 1
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]   # sort features from most to least important
for f in range(x_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
1) Color intensity 0.182483
2) Proline 0.158610
3) Flavanoids 0.150948
4) OD280/OD315 of diluted wines 0.131987
5) Alcohol 0.106589
6) Hue 0.078243
7) Total phenols 0.060718
8) Alcalinity of ash 0.032033
9) Malic acid 0.025400
10) Proanthocyanins 0.022351
11) Magnesium 0.022078
12) Nonflavanoid phenols 0.014645
13) Ash 0.013916
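If a picture is easier to read than the printed list, the same ranking can be plotted (a sketch assuming matplotlib is installed, reusing importances, indices and feat_labels from above):

import matplotlib.pyplot as plt

# bar chart of the importance scores, most important feature first
plt.figure(figsize=(8, 4))
plt.bar(range(x_train.shape[1]), importances[indices], align='center')
plt.xticks(range(x_train.shape[1]), feat_labels[indices], rotation=90)
plt.ylabel('Importance')
plt.tight_layout()
plt.show()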
# keep only the columns whose importance exceeds the threshold
threshold = 0.15
x_selected = x_train[:, importances > threshold]
x_selected.shape
See? It has picked out the 3 features with importance above 0.15 for us.
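As a side note, scikit-learn ships a SelectFromModel transformer that wraps exactly this thresholding step, so the manual boolean mask above could also be written as follows (a sketch reusing the already-fitted forest; the variable name x_selected_sfm is mine):

from sklearn.feature_selection import SelectFromModel

# reuse the fitted forest (prefit=True) and keep features above the 0.15 threshold
sfm = SelectFromModel(forest, threshold=0.15, prefit=True)
x_selected_sfm = sfm.transform(x_train)
x_selected_sfm.shape   # same three columns as the manual mask above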
References
[2] 杨凯, 侯艳, 李康. 随机森林变量重要性评分及其研究进展[J]. 2015.