Linear Regression Modeling with Variable Selection and Regularization (2): The R Package glmnet
4. A worked example with the glmnet package
The prostate dataset used here can be downloaded from the ESL homepage.
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.0.1
## Loading required package: Matrix
## Loading required package: lattice
## Loaded glmnet 1.9-3
prostate <- read.csv("E:/RB/prostate.csv")
head(prostate)
## lcavol age lbph lcp gleason lpsa
## 1 -0.5798 50 -1.386 -1.386 6 -0.4308
## 2 -0.9943 58 -1.386 -1.386 6 -0.1625
## 3 -0.5108 74 -1.386 -1.386 7 -0.1625
## 4 -1.2040 58 -1.386 -1.386 6 -0.1625
## 5 0.7514 62 -1.386 -1.386 6 0.3716
## 6 -1.0498 50 -1.386 -1.386 6 0.7655
x <- as.matrix(prostate[, 2:6])  # predictors: age, lbph, lcp, gleason, lpsa
y <- prostate[, 1]  # response: lcavol
set.seed(1)  # make the train/test split reproducible
train <- sample(1:nrow(x), nrow(x) * 2/3)  # 2/3 of the rows for training
test <- (-train)  # negative indices select the remaining rows
# Ridge regression (alpha = 0)
r1 <- glmnet(x = x[train, ], y = y[train], family = "gaussian", alpha = 0)
plot(r1, xvar = "lambda")  # coefficient paths against log(lambda)
r1.cv <- cv.glmnet(x = x, y = y, family = "gaussian", alpha = 0, nfolds = 10)
plot(r1.cv)  # CV error curve, with lambda.min and lambda.1se marked
mte <- predict(r1, newx = x[test, ])  # predictions for every lambda on the path
mte <- apply((mte - y[test])^2, 2, mean)  # test MSE per lambda
points(log(r1$lambda), mte, col = "blue", pch = 19)  # overlay test MSE on the CV plot
legend("topleft", legend = c("10-fold CV", "Test"), pch = 19, col = c("red", "blue"))
# Refit on the full data at the lambda chosen by CV
r1.min <- glmnet(x = x, y = y, family = "gaussian", alpha = 0, lambda = r1.cv$lambda.min)
coef(r1.min)
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -1.49547
## age 0.01736
## lbph -0.07641
## lcp 0.28666
## gleason 0.08176
## lpsa 0.50190
# The lasso (alpha = 1)
r2 <- glmnet(x = x[train, ], y = y[train], family = "gaussian", alpha = 1)
plot(r2, xvar = "lambda")  # coefficient paths against log(lambda)
plot(r2)  # the same paths against the L1 norm of the coefficients
r2.cv <- cv.glmnet(x = x, y = y, family = "gaussian", alpha = 1, nfolds = 10)
plot(r2.cv)
mte <- predict(r2, newx = x[test, ])
mte <- apply((mte - y[test])^2, 2, mean)  # test MSE per lambda
points(log(r2$lambda), mte, col = "blue", pch = 19)
legend("topleft", legend = c("10-fold CV", "Test"), pch = 19, col = c("red", "blue"))
# lambda.min vs lambda.1se: refit on the full data with each
r2.cv$lambda.min
## [1] 0.002954
r2.cv$lambda.1se
## [1] 0.1771
r2.1se <- glmnet(x = x, y = y, family = "gaussian", alpha = 1, lambda = r2.cv$lambda.1se)
coef(r2.1se)
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 0.3234
## age .
## lbph .
## lcp 0.2462
## gleason .
## lpsa 0.4320
r2.min <- glmnet(x = x, y = y, family = "gaussian", alpha = 1, lambda = r2.cv$lambda.min)
coef(r2.min)
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -1.44505
## age 0.01851
## lbph -0.08585
## lcp 0.29688
## gleason 0.05081
## lpsa 0.53741
# Comparing ridge regression and the lasso at lambda.1se
lasso.pred <- predict(r2, s = r2.cv$lambda.1se, newx = x[test, ])
ridge.pred <- predict(r1, s = r1.cv$lambda.1se, newx = x[test, ])
mean((lasso.pred - y[test])^2)
## [1] 0.3946
mean((ridge.pred - y[test])^2)
## [1] 0.4239
Notes on using the glmnet package
(1) glmnet() and cv.glmnet()
The first time I used this package I had a rather silly question: with cv.glmnet() available, why keep glmnet() at all? cv.glmnet() uses cross-validation to find the best model with respect to lambda, but as far as the glmnet package is concerned that is still not a complete answer: cross-validating alpha is left to the user (the package documentation gives a few hints). glmnet() is still needed because it returns the whole regularization path, and since coordinate descent involves some randomness in locating the optimum, the path remains important for variable selection.
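As an illustration, here is a minimal sketch of what that alpha search might look like; the grid of alpha values and the shared foldid below are my own choices, not part of the package, though the documentation does suggest fixing foldid when comparing alphas:
# Sketch: cross-validate alpha as well, reusing the same folds for every alpha
alphas <- seq(0, 1, by = 0.1)  # candidate mixing parameters (assumed grid)
foldid <- sample(rep(1:10, length.out = nrow(x)))  # fixed folds make alphas comparable
cv.list <- lapply(alphas, function(a)
    cv.glmnet(x = x, y = y, family = "gaussian", alpha = a, foldid = foldid))
cv.err <- sapply(cv.list, function(cv) min(cv$cvm))  # best CV error at each alpha
best <- which.min(cv.err)
alphas[best]  # chosen alpha
cv.list[[best]]$lambda.min  # its best lambda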
(2) lambda.min and lambda.1se in cv.glmnet()
lambda.min: the value of lambda that gives the minimum cvm (the mean cross-validated error).
lambda.1se: the largest value of lambda such that the error is within one standard error of the minimum.
There seems to be some confusion about which of the two to use. Most of the online discussion I have seen recommends lambda.1se, which yields a more parsimonious model. This is the so-called 1-SE rule: use the "one standard error" rule to select the best model, i.e. pick the most parsimonious model among those whose score is within one standard error of the best score. But there is also the opposite view: the 1-SE rule only works well when the noise is low; with high noise, one or two folds can have very large errors, the CV curve then rises quickly, and the chosen lambda ends up too large. Comments from those more expert than me are welcome!
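To make the rule concrete, lambda.1se can be recomputed by hand from the fields of the cv.glmnet() object; a sketch using the r2.cv object from above:
# Reconstruct lambda.1se: largest lambda whose CV error is within one SE of the minimum
i.min <- which.min(r2.cv$cvm)  # index of the minimum CV error
threshold <- r2.cv$cvm[i.min] + r2.cv$cvsd[i.min]  # minimum error plus one standard error
max(r2.cv$lambda[r2.cv$cvm <= threshold])  # should agree with r2.cv$lambda.1se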
References:
[6] Trevor Hastie. Sparse Linear Models: with demonstrations using glmnet. 2013.
[7] Zou, Hui & Trevor Hastie (2005). Regularization and variable selection via the elastic net. JRSS (B) 67(2): 301-320.
Comments:
Where can this dataset be downloaded? And what does "the ESL homepage" refer to? Many thanks!
ESL is the book The Elements of Statistical Learning.
Same question: where can the data be downloaded?
ESL: http://statweb.stanford.edu/~tibs/ElemStatLearn/
In the R labs of An Introduction to Statistical Learning, lambda.min is used as the best lambda =)
Great post. It would be even better with some comments in the code and an explanation of the output ~~
When I ran this line on my own data:
points(log(r2$lambda), mte, col = "blue", pch = 19)
I got an error saying x and y have different lengths:
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
I keep going back and forth on this... every source says something different...
Why is lambda.min different every time I run cv.glmnet on the same data? Isn't lambda.min supposed to be the optimal value of lambda? If the result changes from run to run, how am I supposed to choose lambda, and hence the number of variables the lasso selects? Could some expert explain? Much appreciated.
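The fold assignment in cv.glmnet() is random, which is why lambda.min changes between runs. A sketch of two common ways to make the result reproducible:
# Option 1: fix the random seed before calling cv.glmnet()
set.seed(123)
cv1 <- cv.glmnet(x = x, y = y, family = "gaussian", alpha = 1)
# Option 2: fix the fold assignment itself and reuse it
foldid <- sample(rep(1:10, length.out = nrow(x)))
cv2 <- cv.glmnet(x = x, y = y, family = "gaussian", alpha = 1, foldid = foldid)
cv2$lambda.min  # identical on every rerun with the same foldid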
On "glmnet() is still needed because it returns the regularization path": after fit <- glmnet(x, y) and print(fit), the %Dev values I get are all very close to 0. The explanation I found says this value lies between 0 and 1, and the closer to 1 the better the model; if it is 0, the model predicts no better than simply using the mean of the response. Does that mean my lasso model is just poor, and I should try alpha = 0 or 0.5?
Does it mean the lasso model is poor and I need to bring in other, more complex variables? Is it unreasonable to just press on with cv.glmnet regardless?