This article is about 2,000 words; estimated reading time: 8 minutes. It gives a brief introduction to using random forests for feature selection.
A random forest is an ensemble learning algorithm that uses decision trees as its base learners. Random forests are simple, easy to implement, and computationally cheap, yet they deliver surprisingly strong performance on both classification and regression tasks. For this reason the random forest is often described as "a method representative of the state of the art in ensemble learning".
1. A Brief Introduction to Random Forest (RF)
As long as you understand the decision tree algorithm, the random forest is fairly easy to understand: the algorithm can be summarized in a few steps, illustrated in the figure below.
[Figure: the random forest algorithm (image from reference [2])]
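To make those steps concrete, here is a minimal sketch of the training and prediction loop. This is illustrative code of my own (the names fit_simple_rf and predict_simple_rf are made up, not from the original article), using sklearn's DecisionTreeClassifier with max_features='sqrt' to stand in for the per-node random feature selection:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_simple_rf(x, y, n_trees=100, seed=0):
    # For each tree: draw a bootstrap sample of the data, then grow a decision
    # tree that considers only a random subset of features at every split.
    rng = np.random.RandomState(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, len(x), size=len(x))   # bootstrap sample, drawn with replacement
        tree = DecisionTreeClassifier(max_features='sqrt',
                                      random_state=rng.randint(1 << 30))
        trees.append(tree.fit(x[idx], y[idx]))
    return trees

def predict_simple_rf(trees, x):
    # Aggregate by majority vote over the individual trees' predictions.
    votes = np.stack([t.predict(x) for t in trees])  # shape (n_trees, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(),
                               axis=0, arr=votes)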

2. Feature Importance Evaluation
A feature's contribution is measured here by how much it decreases the Gini index when it is used to split a node. The Gini index of node m is

GI_m = \sum_{k=1}^{|K|} \sum_{k' \neq k} p_{mk} p_{mk'} = 1 - \sum_{k=1}^{|K|} p_{mk}^2

where |K| is the number of classes and p_{mk} is the proportion of class k in node m. Intuitively, it is the probability that two samples drawn at random from node m carry different class labels.

The importance of feature X_j at node m of the i-th tree, i.e. the change in Gini index before and after the split at node m, is

VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r

where GI_l and GI_r are the Gini indices of the two child nodes created by the split. If M denotes the set of nodes of decision tree i at which feature X_j appears, then the importance of X_j in the i-th tree is

VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)}

Assuming the RF contains I trees in total,

VIM_{j}^{(Gini)} = \sum_{i=1}^{I} VIM_{ij}^{(Gini)}

Finally, all the importance scores are normalized:

VIM_j = VIM_j^{(Gini)} / \sum_{j'} VIM_{j'}^{(Gini)}
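These formulas map almost directly onto the per-node impurity values stored in each fitted sklearn tree. The sketch below is my own illustration (the helper name gini_importance is made up): it walks the trees of a fitted RandomForestClassifier, accumulates GI_m - GI_l - GI_r for the splitting feature of every internal node, and normalizes at the end. Note that sklearn's built-in feature_importances_ additionally weights each term by the fraction of samples reaching the node, so the two rankings may differ slightly.

import numpy as np

def gini_importance(forest, n_features):
    # Accumulate VIM_jm = GI_m - GI_l - GI_r over every internal node of every
    # tree, following the (unweighted) formulas above.
    vim = np.zeros(n_features)
    for est in forest.estimators_:                # sum over the I trees in the forest
        tree = est.tree_
        for m in range(tree.node_count):
            left, right = tree.children_left[m], tree.children_right[m]
            if left == -1:                        # leaf node: no split, no contribution
                continue
            j = tree.feature[m]                   # feature X_j used to split node m
            vim[j] += tree.impurity[m] - tree.impurity[left] - tree.impurity[right]
    return vim / vim.sum()                        # final normalization step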
3. A Worked Example
import pandas as pd

# Load the UCI Wine dataset and name its columns.
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
df = pd.read_csv(url, header=None)
df.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
              'Alcalinity of ash', 'Magnesium', 'Total phenols',
              'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
              'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
import numpy as np
np.unique(df['Class label'])
array([1, 2, 3], dtype=int64)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
Class label                     178 non-null int64
Alcohol                         178 non-null float64
Malic acid                      178 non-null float64
Ash                             178 non-null float64
Alcalinity of ash               178 non-null float64
Magnesium                       178 non-null int64
Total phenols                   178 non-null float64
Flavanoids                      178 non-null float64
Nonflavanoid phenols            178 non-null float64
Proanthocyanins                 178 non-null float64
Color intensity                 178 non-null float64
Hue                             178 non-null float64
OD280/OD315 of diluted wines    178 non-null float64
Proline                         178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB
# train_test_split moved between sklearn versions; try the old location first.
try:
    from sklearn.cross_validation import train_test_split
except ImportError:
    from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split off 30% of the data for testing, then fit a forest of 10,000 trees.
x, y = df.iloc[:, 1:].values, df.iloc[:, 0].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
feat_labels = df.columns[1:]
forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
forest.fit(x_train, y_train)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(x_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
 1) Color intensity                0.182483
 2) Proline                        0.158610
 3) Flavanoids                     0.150948
 4) OD280/OD315 of diluted wines   0.131987
 5) Alcohol                        0.106589
 6) Hue                            0.078243
 7) Total phenols                  0.060718
 8) Alcalinity of ash              0.032033
 9) Malic acid                     0.025400
10) Proanthocyanins                0.022351
11) Magnesium                      0.022078
12) Nonflavanoid phenols           0.014645
13) Ash                            0.013916
threshold = 0.15
x_selected = x_train[:, importances > threshold]
x_selected.shape
(124, 3)
See? It has already picked out the three features whose importance is greater than 0.15 for us.
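As an aside (not part of the original walkthrough), the same threshold-based selection can also be done with sklearn's SelectFromModel:

from sklearn.feature_selection import SelectFromModel

# prefit=True reuses the forest trained above instead of refitting it.
sfm = SelectFromModel(forest, threshold=0.15, prefit=True)
x_selected = sfm.transform(x_train)
print(x_selected.shape)   # the same three columns as the manual boolean mask above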
References
[2] 杨凯, 侯艳, 李康. 随机森林变量重要性评分及其研究进展[J]. 2015.