
Multiple Linear Regression

### 1. The Multiple Linear Regression Model
Suppose the dependent variable $Y$ and several explanatory variables $X_1, X_2, \dots, X_k$ have a linear relationship, i.e. $Y$ is a linear function of the explanatory variables. The model is then called a multiple linear regression model:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$

For $n$ observations, written as a system of equations:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, 2, \dots, n$$

In matrix form:

$$Y = X\beta + \varepsilon$$

The task is then to compute the estimate $\hat{\beta}$.
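To make the setup concrete, here is a small sketch (not from the original post) that simulates data from such a model with synthetic coefficients; the design matrix carries a column of ones for the intercept:

```python
import numpy as np

# Illustrative only: simulate n = 100 observations from a model with
# k = 2 explanatory variables and known (made-up) coefficients.
rng = np.random.default_rng(0)
n, k = 100, 2
X = rng.normal(size=(n, k))
beta_true = np.array([2.0, 0.5, -1.5])   # intercept, beta1, beta2
Z = np.hstack([np.ones((n, 1)), X])      # design matrix with constant column
eps = rng.normal(scale=0.1, size=n)      # noise term
Y = Z @ beta_true + eps                  # Y = X beta + eps in matrix form
print(Y.shape)
```

Estimating `beta_true` from `(X, Y)` alone is exactly the task of the next section.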


### 2. Parameter Estimation

The parameters can be estimated by ordinary least squares or via QR decomposition.

#### 2.1 Least Squares

By the least-squares principle, the parameter estimates are the values that minimize the residual sum of squares:

$$\min_{\beta}\; (Y - X\beta)^{T}(Y - X\beta)$$

Multiplying both sides of the regression model by the transpose $X^{T}$ of the design matrix $X$, with the constant term absorbed into the parameter vector (a column of ones in $X$), gives the normal equations:

$$X^{T}X\hat{\beta} = X^{T}Y$$

Solving them yields the final result:

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$$
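As a quick sketch of the closed-form solution (on synthetic data, not the post's own dataset), the normal equations can be solved directly with NumPy; `np.linalg.solve` is used rather than forming the inverse explicitly:

```python
import numpy as np

# Synthetic data for illustration
rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 3))
Z = np.hstack([np.ones((n, 1)), X])          # add the constant column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
Y = Z @ beta_true + rng.normal(scale=0.01, size=n)

# Normal equations: (Z^T Z) beta = Z^T Y
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
print(beta_hat.round(2))
```

With little noise, `beta_hat` recovers the true coefficients closely.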

#### 2.2 QR Decomposition

The QR decomposition factors the design matrix as $X = QR$, where $Q$ is an orthogonal square matrix and $R$ is an upper triangular matrix. (The original figure illustrating the factorization is omitted here.) For the algorithm and worked examples, see the Wikipedia article on QR decomposition.

Substituting $X = QR$ into the normal equations and using $Q^{T}Q = I$ reduces the problem to the triangular system $R\hat{\beta} = Q^{T}Y$, solved by back-substitution. The QR approach thus avoids the matrix inversion required in the least-squares formula.
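A minimal sketch of the QR route, again on synthetic data: NumPy's `np.linalg.qr` returns the reduced factorization by default ($Q$ with orthonormal columns rather than a full square matrix, which suffices here), and SciPy's `solve_triangular` performs the back-substitution:

```python
import numpy as np
from scipy.linalg import solve_triangular

# Synthetic data for illustration
rng = np.random.default_rng(2)
n = 60
X = rng.normal(size=(n, 3))
Z = np.hstack([np.ones((n, 1)), X])
beta_true = np.array([0.5, 1.0, -2.0, 3.0])
Y = Z @ beta_true + rng.normal(scale=0.01, size=n)

# Reduced QR: Z = Q R, Q (n x p) with orthonormal columns, R (p x p) upper triangular
Q, R = np.linalg.qr(Z)
# Z beta = Y  =>  R beta = Q^T Y; back-substitution, no explicit inverse
beta_hat = solve_triangular(R, Q.T @ Y)
print(beta_hat.round(2))
```

No matrix is inverted anywhere in this path, which is the numerical advantage the text describes.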


### 3. Code

#### Python

Readers familiar with Python may know `lstsq` in `numpy.linalg`, which returns the least-squares solution to a linear matrix equation.

Instead of relying on the built-in function, we estimate the parameters with the sweep operator introduced in the previous post, and with the QR decomposition described above; `lstsq` is used at the end only to check the results:

```python
import numpy as np
from scipy import linalg
import pandas as pd

def qr(A):
    """Householder QR: A = Q R with Q orthogonal, R upper triangular."""
    n, m = A.shape
    R = A.astype(float).copy()
    Q = np.eye(n)
    for k in range(m - 1):
        # Householder vector for column k, acting on rows k..n-1
        x = np.zeros((n, 1))
        x[k:, 0] = R[k:, k]
        v = x.copy()
        v[k] = x[k] + np.sign(x[k, 0]) * np.linalg.norm(x)
        s = np.linalg.norm(v)
        if s != 0:
            u = v / s
            # Apply the reflection I - 2 u u^T to R and to the product of Q's
            R -= 2 * np.dot(u, np.dot(u.T, R))
            Q -= 2 * np.dot(u, np.dot(u.T, Q))
    Q = Q.T
    return Q, R

def mySweep(A, m):
    """Sweep operator applied to the first m diagonal entries of A."""
    B = np.copy(A)
    n = B.shape[0]
    for k in range(m):
        for i in range(n):
            for j in range(n):
                if i != k and j != k:
                    B[i, j] = B[i, j] - B[i, k] * B[k, j] / B[k, k]
        for i in range(n):
            if i != k:
                B[i, k] = B[i, k] / B[k, k]
        for j in range(n):
            if j != k:
                B[k, j] = B[k, j] / B[k, k]
        B[k, k] = -1 / B[k, k]
    return B

def mylm1(X, Y):
    """Regression coefficients via the sweep operator on Z^T Z."""
    n, m = X.shape
    Z = np.hstack((np.ones(n).reshape((n, 1)), X, Y.reshape((n, 1))))
    A = np.dot(Z.T, Z)
    S = mySweep(A, m + 1)
    beta_hat = S[0:m + 1, m + 1]
    return beta_hat

def mylm2(X, Y):
    """Regression coefficients via Householder QR and back-substitution."""
    n, m = X.shape
    Z = np.hstack((np.ones(n).reshape((n, 1)), X, Y.reshape((n, 1))))
    _, R = qr(Z)
    R1 = R[:m + 1, :m + 1]
    Y1 = R[:m + 1, m + 1]
    beta = np.linalg.solve(R1, Y1)
    return beta

A = pd.read_csv(r"C:\Users\Nicky\Documents\state_x77.csv")
X = A.loc[:, ['Population', 'Income', 'Illiteracy', 'Murder',
              'HS.Grad', 'Frost', 'Area', 'Density']].values
Y = A.loc[:, ['Life.Exp']].values
mybeta1 = mylm1(X, Y)
print(mybeta1)
mybeta2 = mylm2(X, Y)
print(mybeta2)
# Cross-check with the built-in least-squares solver
Z = np.hstack((np.ones(X.shape[0]).reshape((X.shape[0], 1)), X))
w = linalg.lstsq(Z, Y)[0]
w = w.reshape(1, X.shape[1] + 1)
print(w)
```

#### R

R's built-in `lm()` fits a multiple linear regression directly; below we again implement the QR and sweep versions by hand and compare against it:

```r
# Householder QR: A = Q R with Q orthogonal, R upper triangular
myqr <- function(A)
{
  n = nrow(A)
  m = ncol(A)
  R = A
  Q = diag(n)
  for (k in 1:(m - 1))
  {
    # Householder vector for column k, acting on rows k..n
    x = matrix(rep(0, n), nrow = n)
    x[k:n, 1] = R[k:n, k]
    g = sqrt(sum(x^2))
    v = x
    v[k] = x[k] + sign(x[k, 1]) * g
    s = sqrt(sum(v^2))
    if (s != 0)
    {
      u = v / s
      R = R - 2 * u %*% t(u) %*% R
      Q = Q - 2 * u %*% t(u) %*% Q
    }
  }
  result <- list(t(Q), R)
  names(result) <- c("Q", "R")
  result
}

# Sweep operator applied to the first m diagonal entries of A
mySweep <- function(A, m)
{
  n <- dim(A)[1]
  for (k in 1:m)
  {
    for (i in 1:n)
      for (j in 1:n)
        if (i != k & j != k)
          A[i, j] <- A[i, j] - A[i, k] * A[k, j] / A[k, k]
    for (i in 1:n)
      if (i != k)
        A[i, k] <- A[i, k] / A[k, k]
    for (j in 1:n)
      if (j != k)
        A[k, j] <- A[k, j] / A[k, k]
    A[k, k] <- -1 / A[k, k]
  }
  return(A)
}

# Regression coefficients via QR and back-substitution
mylm1 <- function(X, Y)
{
  n = nrow(X)
  m = ncol(X)
  Z = cbind(rep(1, n), X, Y)
  R = myqr(Z)$R
  R1 = R[1:(m + 1), 1:(m + 1)]
  Y1 = R[1:(m + 1), m + 2]
  beta = solve(R1, Y1)
  return(beta)
}

# Regression coefficients via the sweep operator
mylm2 <- function(X, Y)
{
  col = ncol(X)
  row = nrow(X)
  Z = cbind(rep(1, row), X, Y)
  A = t(Z) %*% Z
  S = mySweep(A, col + 1)
  # First col + 1 entries of the last column: intercept plus slopes
  beta_hat = S[1:(col + 1), col + 2]
  return(beta_hat)
}

b <- read.csv("state_x77.csv")
X = cbind(b$Population, b$Income, b$Illiteracy, b$Murder,
          b$HS.Grad, b$Frost, b$Area, b$Density)
Y = b$Life.Exp
mylm1(X, Y)
mylm2(X, Y)
# Cross-check with the built-in fit
lm(Y ~ X)
```