贝叶斯结构化时间序列方法
这是google的两个数据科学家的工作,主要是为了解决多变量时间序列分析中的变量筛选问题。
we call Bayesian Structural Time Series (BSTS) that seems to work well for variable selection problems in time series applications.
具体做法可以看Scott和Varian的两篇论文。
We offer a very brief description here; more details are available in
Scott and Varian [2012a,b].
思路如下:首先按照时间序列分解的思路将时间序列看成三部分的组合,即常数u、趋势t、和其它影响因素x。如此,时间序列可以表述为:yt = u + b*t + beta*x + e:
进步一步分解为:
由此,可以估计回归系数beta和残差e。由此,可以构造优化的卡尔曼预测(optimal Kalman forecast)。从后验概率分布中选取预测能力最强的时间序列。效果图如下:
更新:
一年之后, Scott将上面的这些想法写成一个R包,发布在cran上:
bsts_0.5.1.tar.gz 27-Jun-2014 13:56 114K
bsts_0.6.0.tar.gz 03-Dec-2014 17:02 116K
bsts_0.6.1.tar.gz 05-Dec-2014 01:22 116K
https://cran.r-project.org/web/packages/bsts/index.html
另外,在作者的谷歌个人站点,Scott还维护了一个bsts的课程资料:
https://sites.google.com/site/stevethebayesian/googlepageforstevenlscott/course-and-seminar-materials/bsts-bayesian-structural-time-series
所以,现在可以直接使用这个R包了。
## Example 6: Including regressors
library(bsts)
data(iclaims)
ss <- AddLocalLinearTrend(list(), initial.claims$iclaimsNSA)
ss <- AddSeasonal(ss, initial.claims$iclaimsNSA, nseasons = 52)
model <- bsts(iclaimsNSA ~ ., state.specification = ss, data =
initial.claims, niter = 1000)
plot(model)
plot(model, "components")
plot(model, "coefficients")
plot(model, "predictors")
使用我自己的收集的一个数据来做逐步递增的模型:
关于模型的总体情况,主要反映的是回归部分的效果,例如:
> ?summary.lm.spike
> summary(model, burn = 0)
$residual.sd
[1] 0.2121491
$prediction.sd
[1] 0.2767504
$rsquare
[1] 0.8909681
$relative.gof
[1] 0.6482318
$size
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 3.000 2.746 3.000 4.000
$coefficients
mean sd mean.inc sd.inc inc.prob
news 0.332497280 0.07541234 0.33653571 0.06630017 0.988
query 1.076958052 0.21219863 1.10005930 0.14337502 0.979
poll -1.076729016 0.78528349 -1.47497126 0.50681842 0.730
sent 0.004495306 0.02895581 0.09174094 0.09632891 0.049
(Intercept) 0.000000000 0.00000000 0.00000000 0.00000000 0.000
coefficients
A five-column matrix with rows representing model coefficients.
The first two columns are the posterior mean and standard deviation of each coefficient, including the point mass at zero.
The next two columns are the posterior mean and standard deviations conditional on the coefficient being nonzero.
The last column is the probability of a nonzero coefficient.
residual.sd
A summary of the posterior distribution of the residual standard deviation parameter.
rsquare
A summary of the posterior distribution of the R^2 statistic: 1 - residual.sd^2 / var(y)
计算方法:Spike and slab regression
这是一种贝叶斯技术的方法,回归过程:
a、首先,所有变量都拥有同等被回归估计的可能性;
b、我们根据回归系数,建立分布(零均值,spike是非零系数的可能性;大方差,系数的先验分布);
c、系数先验分布+系数的条件分布,将两者与同后验概率结合;
d、通过蒙特卡洛技术重复拟合分布,最后得到变量、变量系数、Y的分布的总结表。
library(BoomSpikeSlab)
#Title: MCMC for Spike and Slab Regression
#Author: Steven L. Scott <stevescott@google.com>
n <- 100
p <- 10
ngood <- 3
niter <- 1000
sigma <- .8
x <- cbind(1, matrix(rnorm(n * (p-1)), nrow=n))
beta <- c(rnorm(ngood), rep(0, p - ngood))
y <- rnorm(n, x %*% beta, sigma)
x <- x[,-1]
model <- lm.spike(y ~ x, niter=niter)
plot.ts(model$beta)
opar <- par(ask=TRUE)
on.exit(par(opar))
hist(model$sigma) ## should be near 8
plot(model)
summary(model)
因果影响
故事到这里,并没有结束。在2015年与Brodersen et al.合作的一篇基于bsts测量因果影响的一篇论文发表在Annals of Applied Statistics (2015).他们高调开放了自己的代码在github上。http://google.github.io/CausalImpact/CausalImpact.html
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41854.pdf
install.packages("devtools")
library(devtools)
devtools::install_github("google/CausalImpact")
library(CausalImpact)
参考文献
Steve Scott and Hal Varian. Bayesian variable selection for nowcasting
economic time series. Technical report, Google, 2012a. URL http://www.ischool.berkeley.edu/~hal/Papers/2012/fat.pdf
Steve Scott and Hal Varian. Predicting the present with Bayesian structural
time series. Technical report, Google, 2012b. URL http://www.ischool.berkeley.edu/~hal/Papers/2013/pred-present-with-bsts.pdf
we call Bayesian Structural Time Series (BSTS) that seems to work well for variable selection problems in time series applications.
具体做法可以看Scott和Varian的两篇论文。
We offer a very brief description here; more details are available in
Scott and Varian [2012a,b].
思路如下:首先按照时间序列分解的思路将时间序列看成三部分的组合,即常数u、趋势t、和其它影响因素x。如此,时间序列可以表述为:yt = u + b*t + beta*x + e:
进步一步分解为:
由此,可以估计回归系数beta和残差e。由此,可以构造优化的卡尔曼预测(optimal Kalman forecast)。从后验概率分布中选取预测能力最强的时间序列。效果图如下:
BSTS |
更新:
一年之后, Scott将上面的这些想法写成一个R包,发布在cran上:
bsts_0.5.1.tar.gz 27-Jun-2014 13:56 114K
bsts_0.6.0.tar.gz 03-Dec-2014 17:02 116K
bsts_0.6.1.tar.gz 05-Dec-2014 01:22 116K
https://cran.r-project.org/web/packages/bsts/index.html
另外,在作者的谷歌个人站点,Scott还维护了一个bsts的课程资料:
https://sites.google.com/site/stevethebayesian/googlepageforstevenlscott/course-and-seminar-materials/bsts-bayesian-structural-time-series
所以,现在可以直接使用这个R包了。
## Example 6: Including regressors
library(bsts)
data(iclaims)
ss <- AddLocalLinearTrend(list(), initial.claims$iclaimsNSA)
ss <- AddSeasonal(ss, initial.claims$iclaimsNSA, nseasons = 52)
model <- bsts(iclaimsNSA ~ ., state.specification = ss, data =
initial.claims, niter = 1000)
plot(model)
plot(model, "components")
plot(model, "coefficients")
plot(model, "predictors")
使用我自己的收集的一个数据来做逐步递增的模型:
why we tweet about us election |
关于模型的总体情况,主要反映的是回归部分的效果,例如:
> ?summary.lm.spike
> summary(model, burn = 0)
$residual.sd
[1] 0.2121491
$prediction.sd
[1] 0.2767504
$rsquare
[1] 0.8909681
$relative.gof
[1] 0.6482318
$size
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 3.000 2.746 3.000 4.000
$coefficients
mean sd mean.inc sd.inc inc.prob
news 0.332497280 0.07541234 0.33653571 0.06630017 0.988
query 1.076958052 0.21219863 1.10005930 0.14337502 0.979
poll -1.076729016 0.78528349 -1.47497126 0.50681842 0.730
sent 0.004495306 0.02895581 0.09174094 0.09632891 0.049
(Intercept) 0.000000000 0.00000000 0.00000000 0.00000000 0.000
coefficients
A five-column matrix with rows representing model coefficients.
The first two columns are the posterior mean and standard deviation of each coefficient, including the point mass at zero.
The next two columns are the posterior mean and standard deviations conditional on the coefficient being nonzero.
The last column is the probability of a nonzero coefficient.
residual.sd
A summary of the posterior distribution of the residual standard deviation parameter.
rsquare
A summary of the posterior distribution of the R^2 statistic: 1 - residual.sd^2 / var(y)
计算方法:Spike and slab regression
这是一种贝叶斯技术的方法,回归过程:
a、首先,所有变量都拥有同等被回归估计的可能性;
b、我们根据回归系数,建立分布(零均值,spike是非零系数的可能性;大方差,系数的先验分布);
c、系数先验分布+系数的条件分布,将两者与同后验概率结合;
d、通过蒙特卡洛技术重复拟合分布,最后得到变量、变量系数、Y的分布的总结表。
library(BoomSpikeSlab)
#Title: MCMC for Spike and Slab Regression
#Author: Steven L. Scott <stevescott@google.com>
n <- 100
p <- 10
ngood <- 3
niter <- 1000
sigma <- .8
x <- cbind(1, matrix(rnorm(n * (p-1)), nrow=n))
beta <- c(rnorm(ngood), rep(0, p - ngood))
y <- rnorm(n, x %*% beta, sigma)
x <- x[,-1]
model <- lm.spike(y ~ x, niter=niter)
plot.ts(model$beta)
opar <- par(ask=TRUE)
on.exit(par(opar))
hist(model$sigma) ## should be near 8
plot(model)
summary(model)
因果影响
故事到这里,并没有结束。在2015年与Brodersen et al.合作的一篇基于bsts测量因果影响的一篇论文发表在Annals of Applied Statistics (2015).他们高调开放了自己的代码在github上。http://google.github.io/CausalImpact/CausalImpact.html
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41854.pdf
install.packages("devtools")
library(devtools)
devtools::install_github("google/CausalImpact")
library(CausalImpact)
参考文献
Steve Scott and Hal Varian. Bayesian variable selection for nowcasting
economic time series. Technical report, Google, 2012a. URL http://www.ischool.berkeley.edu/~hal/Papers/2012/fat.pdf
Steve Scott and Hal Varian. Predicting the present with Bayesian structural
time series. Technical report, Google, 2012b. URL http://www.ischool.berkeley.edu/~hal/Papers/2013/pred-present-with-bsts.pdf
> 我来回应