【R言語】一般化線形モデル　関数glmの使いかた

RユーザのためのRStudio[実践]入門 −tidyverseによるモダンな分析フローの世界−

ポチップ

Rstudioで統計解析がしたい方へ

¥3,058 （2022/06/01 17:32時点 | Amazon調べ）

ポチップ

関数glm

R言語で一般化線形モデルを実行するには、関数glmを用います。

基本的な使いかたは回帰分析で用いたlmと変わりません。

lmとの違いとして、引数に誤差構造やリンク関数を指定する点があります。

関数glmは以下の引数から成ります。

glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL,
    etastart, mustart, offset, control = list(…), model = TRUE, method = "glm.fit",
    x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, …)

glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL,

etastart, mustart, offset, control = list(…), model = TRUE, method = "glm.fit",

x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, …)

引数は以下の通りです。

関数glmの引数
formula	回帰に用いるモデル（式）	offset	オフセット項の指定
family	誤差構造とモデルに用いられるリンク関数。	control	適合のプロセスを制御するパラメータのリスト
data	回帰に用いるデータセット	model	返り値にモデルを含むかどうか
weights	prior weights. 事前に重みを与えたい場合数値ベクトルで与える。	method	モデルを適合する際に用いるアルゴリズム
subset	回帰する際に用いる観測値の部分集合	x	返り値に計画行列Xを含むかどうか
na.action	dataがNAを含む場合にとる行動。	y	返り値に目的変数yを含むかどうか
start	回帰係数の初期値	singular.ok	計画行列Xが非正則の場合、回帰を行うか
etastart	線形予測しの初期値	contrasts	オプションのリスト。詳細はcontrasts.arg。
mustart	線形予測値の平均値の初期値

上で紹介した引数のうち、familyについて詳しく見ていきます。

引数familyを変更することで、誤差構造とそれに対応するリンク関数を変更することが可能となります。

誤差構造とリンク関数を以下にまとめました。

familyオブジェクト
family	link
binomial	"logit", "probit", "cauchit", "log", "cloglog"
gaussian	"identity", "log", "inverse"
Gamma	"inverse", "identity", "log"
inverse.gaussian	"1/mu^2", "inverse", "identity", "log"
poisson	"log", "identity", "sqrt"
quasi	"logit", "probit", "clolog", "identity", "inverse", "log","1/mu^2", "sqrt"
quasibinomial	"logit", "probit", "cauchit", "log", "cloglog"
quasipoisson	"log", "identity", "sqrt"

例として、目的変数がc(1, 3, 6, 3, 9, 2, 3, 4)のような離散値である場合、母集団にポアソン分布</spanを指定します。

また、c(1, 0, 1, 1, 0, 1, 1)のような二値である場合、二項分布を指定します。

この後、より具体的な実行方法をみていきます。

glmの実行例

誤差構造がポアソン分布

まず、glmの使用例として、ポアソン回帰の実行方法についてみていきます。

誤差構造がポアソン分布に従う場合の回帰分析の実行方法を解説していきます。

簡単に言うと、目的変数が「1, 2, 3, 4」といった離散値であるときの回帰分析の手順について説明します。

ここでは、以下の線形モデルから成るポアソン回帰の実行方法について触れます。

リンク関数にlogを使っていることに注意してください。

ポアソン回帰

ポアソン回帰は次の線形モデルで表されます。

\begin{align}\log \lambda_i = \log \mathrm{E}[Y_i | \boldsymbol{x}]= \beta_0 + \beta_1x_{1i} + \cdots + \beta_kx_{ki}, \ \ i = 1, \ldots, n,\end{align}

ここに$\lambda_i$は事象の発生回数の期待値であり、$\boldsymbol{x}_i = (x_{1i}, \ldots, x_{ki})^T$である。

では早速、ポアソン回帰の実行方法をみていきましょう。

データセットの用意

目的変数がポアソン分布に従う場合のglmの実行方法について解説します。

次のようなデータセットポアソン分布に従う列countsを持つデータを用います。

counts <- c(18, 17, 15, 20, 10, 20, 25, 13, 12, 19, 16, 14)
outcome <- gl(3, 1, 12)
treatment <- gl(3, 3, 12)
data <- data.frame(counts, outcome, treatment)

counts <- c(18, 17, 15, 20, 10, 20, 25, 13, 12, 19, 16, 14)

outcome <- gl(3, 1, 12)

treatment <- gl(3, 3, 12)

data <- data.frame(counts, outcome, treatment)

次のように、dataは離散変数のcountsと「1, 2, 3」を水準に持つ因子型の列outcomeとtreatmentから成ります。

> head(data)
  counts outcome treatment
1     18       1         1
2     17       2         1
3     15       3         1
4     20       1         2
5     10       2         2
6     20       3         2

> head(data)

counts outcome treatment

1 18 1 1

2 17 2 1

3 15 3 1

4 20 1 2

5 10 2 2

6 20 3 2

また、パッケージggplot2を用いてcountsとoutcomeの経験分布及び箱ひげ図を描画すると、outcomeの水準によってcountsが変化することが分かります。

library(ggplot2)
library(gridExtra)
densityPlot <- ggplot(data, aes(x = counts, y = ..density.., colour = outcome, fill = outcome))
densityPlot <- densityPlot + geom_density(stat = "density", position = "identity", alpha = 0.75)
densityPlot <- densityPlot + xlim(min(data$counts), max(data$counts))

boxPlot <- ggplot(data, aes(x = counts, y = outcome, fill = outcome)) 
boxPlot <- boxPlot + geom_boxplot(alpha = 0.75, colour = "gray")

grid.arrange(densityPlot, boxPlot)

library(ggplot2)

library(gridExtra)

densityPlot <- ggplot(data, aes(x = counts, y = ..density.., colour = outcome, fill = outcome))

densityPlot <- densityPlot + geom_density(stat = "density", position = "identity", alpha = 0.75)

densityPlot <- densityPlot + xlim(min(data$counts), max(data$counts))

boxPlot <- ggplot(data, aes(x = counts, y = outcome, fill = outcome))

boxPlot <- boxPlot + geom_boxplot(alpha = 0.75, colour = "gray")

grid.arrange(densityPlot, boxPlot)

上を実行することで、次のような経験分布と箱ひげ図がプロットされます。outcomeが1のcountsは2, 3のものより比較的大きいことが読み取れます。

また、treatmentとcountsの経験分布と箱ひげ図も以下のように引数を変更することでプロットすることができます。

densityPlot <- ggplot(data, aes(x = counts, y = ..density.., colour = treatment, fill = treatment))
densityPlot <- densityPlot + geom_density(stat = "density", position = "identity", alpha = 0.75)
densityPlot <- densityPlot + xlim(min(data$counts), max(data$counts))

boxPlot <- ggplot(data, aes(x = counts, y = treatment, fill = treatment)) 
boxPlot <- boxPlot + geom_boxplot(alpha = 0.75, colour = "gray")

grid.arrange(densityPlot, boxPlot)

densityPlot <- ggplot(data, aes(x = counts, y = ..density.., colour = treatment, fill = treatment))

densityPlot <- densityPlot + geom_density(stat = "density", position = "identity", alpha = 0.75)

densityPlot <- densityPlot + xlim(min(data$counts), max(data$counts))

boxPlot <- ggplot(data, aes(x = counts, y = treatment, fill = treatment))

boxPlot <- boxPlot + geom_boxplot(alpha = 0.75, colour = "gray")

grid.arrange(densityPlot, boxPlot)

treatmentが1のcountsは3のものと比べて、ばらつきが非常に小さいことが分かります。

回帰分析の実行

回帰分析を実行する前に、まずdataをトレインデータとテストデータに分けます。後にテストデータを用いた予測をするために、dataを学習用と検証用に分割します。

今回は簡単に前から9個のデータをトレインデータ、残りのデータをテストデータに分割します。

#トレインデータとテストデータの作成
trainIndex <- seq_len(9)
validationIndex <- setdiff(seq_len(nrow(data)), trainIndex)

trainData <- data[trainIndex, ] #トレインデータ
validationData <- data[validationIndex, ] #テストデータ

#トレインデータとテストデータの作成

trainIndex <- seq_len(9)

validationIndex <- setdiff(seq_len(nrow(data)), trainIndex)

trainData <- data[trainIndex, ] #トレインデータ

validationData <- data[validationIndex, ] #テストデータ

改めて、次の9行から成るデータに対し回帰分析を実行していきます。

> trainData
  counts outcome treatment
1     18       1         1
2     17       2         1
3     15       3         1
4     20       1         2
5     10       2         2
6     20       3         2
7     25       1         3
8     13       2         3
9     12       3         3

> trainData

counts outcome treatment

1 18 1 1

2 17 2 1

3 15 3 1

4 20 1 2

5 10 2 2

6 20 3 2

7 25 1 3

8 13 2 3

9 12 3 3

ポアソン回帰を実行するには、次のように関数glmの引数を与えます。

1つ目の引数formulaと2つ目の引数dataに関しては、関数lmと全く同じです。formulaにはモデルを表す式を代入し、dataには回帰に用いるデータセットを代入します。

これらの引数に加えて、3つ目の引数familyにpoissonを代入することでポアソン回帰ができます。

regression <- glm(counts ~., data = data, family = poisson)

1	regression <- glm(counts ~., data = data, family = poisson)

regressionを参照するとモデルの回帰係数やAICがあるのが分かります。

> regression

Call:  glm(formula = counts ~ ., family = poisson, data = data)

Coefficients:
(Intercept)     outcome2     outcome3   treatment2   treatment3  
    3.01539     -0.38137     -0.29585      0.01005      0.01005  

Degrees of Freedom: 11 Total (i.e. Null);  7 Residual
Null Deviance:	    11.37 
Residual Deviance: 5.782 	AIC: 71.31

> regression

Call: glm(formula = counts ~ ., family = poisson, data = data)

Coefficients:

(Intercept) outcome2 outcome3 treatment2 treatment3

3.01539 -0.38137 -0.29585 0.01005 0.01005

Degrees of Freedom: 11 Total (i.e. Null); 7 Residual

Null Deviance: 11.37

Residual Deviance: 5.782 AIC: 71.31

次にこれらの結果をまとめる方法について解説していきます。

次のように、関数summayによって、これらの解析結果をリストの要素としてまとめることができます。

summaryReg <- summary(regression)

1	summaryReg <- summary(regression)

ポイント

summaryRegを参照すると、Coefficientsの欄の(intercept)とoutcome2の一番右に「*」がついているのが分かります。p値が0.05未満であり、統計的に有意である変数であることが言えます。

> summaryReg

Call:
glm(formula = counts ~ ., family = poisson, data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1450  -0.3701  -0.2113   0.6050   1.1397  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.01539    0.13142  22.945   <2e-16 ***
outcome2    -0.38137    0.17336  -2.200   0.0278 *  
outcome3    -0.29585    0.16908  -1.750   0.0802 .  
treatment2   0.01005    0.17350   0.058   0.9538    
treatment3   0.01005    0.17350   0.058   0.9538    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 11.367  on 11  degrees of freedom
Residual deviance:  5.782  on  7  degrees of freedom
AIC: 71.315

Number of Fisher Scoring iterations: 4

> summaryReg

Call:

glm(formula = counts ~ ., family = poisson, data = data)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.1450 -0.3701 -0.2113 0.6050 1.1397

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 3.01539 0.13142 22.945 <2e-16 ***

outcome2 -0.38137 0.17336 -2.200 0.0278 *

outcome3 -0.29585 0.16908 -1.750 0.0802 .

treatment2 0.01005 0.17350 0.058 0.9538

treatment3 0.01005 0.17350 0.058 0.9538

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 11.367 on 11 degrees of freedom

Residual deviance: 5.782 on 7 degrees of freedom

AIC: 71.315

Number of Fisher Scoring iterations: 4

また、解析結果を得る方法についてみていきます。summaryReg$. . . でsummaryReg中の要素を参照することができます。

> coeffs <- summaryReg$coefficients
> aic <- summaryReg$aic

> coeffs
               Estimate Std. Error    z value      Pr(>|z|)
(Intercept)  3.01538709  0.1314192 22.9447986 1.660647e-116
outcome2    -0.38136756  0.1733559 -2.1999112  2.781320e-02
outcome3    -0.29584538  0.1690815 -1.7497203  8.016659e-02
treatment2   0.01005034  0.1734964  0.0579282  9.538058e-01
treatment3   0.01005034  0.1734964  0.0579282  9.538058e-01

> aic
[1] 71.31494

> coeffs <- summaryReg$coefficients

> aic <- summaryReg$aic

> coeffs

Estimate Std. Error z value Pr(>|z|)

(Intercept) 3.01538709 0.1314192 22.9447986 1.660647e-116

outcome2 -0.38136756 0.1733559 -2.1999112 2.781320e-02

outcome3 -0.29584538 0.1690815 -1.7497203 8.016659e-02

treatment2 0.01005034 0.1734964 0.0579282 9.538058e-01

treatment3 0.01005034 0.1734964 0.0579282 9.538058e-01

> aic

[1] 71.31494

ポイント

上の例では、summaryReg$coefficientsで回帰係数やp値などが格納されたデータフレームに取得し、summaryReg$aicでモデルのaicを取得しています。

最後に、これらの解析結果をcsvファイルに出力しましょう。

関数write.csvを用いて、上で取り上げた解析結果をcsvファイルに出力します。

データフレームcoeffsとaicを含むデータフレームresultsを以下のように与えます。

results <- as.data.frame(cbind(round(coeffs, 5),
                         aic = c(round(aic, 5), rep("", nrow(coeffs) - 1))))

1 2	results <- as.data.frame(cbind(round(coeffs, 5), aic = c(round(aic, 5), rep("", nrow(coeffs) - 1))))

resultsは以下のようになっています。

> results
            Estimate Std. Error  z value Pr(>|z|)      aic
(Intercept)  3.01539    0.13142  22.9448        0 71.31494
outcome2    -0.38137    0.17336 -2.19991  0.02781         
outcome3    -0.29585    0.16908 -1.74972  0.08017         
treatment2   0.01005     0.1735  0.05793  0.95381         
treatment3   0.01005     0.1735  0.05793  0.95381

> results

Estimate Std. Error z value Pr(>|z|) aic

(Intercept) 3.01539 0.13142 22.9448 0 71.31494

outcome2 -0.38137 0.17336 -2.19991 0.02781

outcome3 -0.29585 0.16908 -1.74972 0.08017

treatment2 0.01005 0.1735 0.05793 0.95381

treatment3 0.01005 0.1735 0.05793 0.95381

以下のように、write,csvでデータフレームresultsをcsvファイルへと出力します。

write.csv(results, "ポアソン回帰.csv")

1	write.csv(results, "ポアソン回帰.csv")

上を実行すると、作業ディレクトリ中に"ポアソン回帰.csv"が保存されているのが確認できます。

csvファイルを開くと次の画像のように適切に出力されているのが分かります。

テストデータを用いた予測

次に、先ほど作ったテストデータをモデルにあてはめて、テストデータを予測していきます。

テストデータvalidationDataの値をモデルにあてはめるには、関数predictを用います。

次のようにpredictの引数typeを"response"にすることで、予測値を得ることができます。

> predictedY <- predict(regression, validationData, type = "response") #予測値
> predictedY
      10       11       12 
20.39698 13.92965 15.17337

> predictedY <- predict(regression, validationData, type = "response") #予測値

> predictedY

10 11 12

20.39698 13.92965 15.17337

また、予測値とテストデータの比較方法についてみていきます。

これらの予測値を実際の値と比較するために、次のようなデータフレームpredictedDataを構成します。

> names(predictedY) <- rep("", length(predictedY))
> predictedData <- data.frame(counts = predictedY,
+                             outcome = outcome[validationIndex],
+                             treatment = treatment[validationIndex])
> head(predictedData)
    counts outcome treatment
1 20.39698       1         1
2 13.92965       2         1
3 15.17337       3         1

> names(predictedY) <- rep("", length(predictedY))

> predictedData <- data.frame(counts = predictedY,

+ outcome = outcome[validationIndex],

+ treatment = treatment[validationIndex])

> head(predictedData)

counts outcome treatment

1 20.39698 1 1

2 13.92965 2 1

3 15.17337 3 1

counts以外のoutcomeとtreatmentはvalidationDataのものと全く同じです。

次に、このpredictedDataとvalidationDataを縦方向に結合させます。

まず、どこまでのデータが予測値であるか実際の値であるかを区別するために、labelsというベクトルを作ります。

> labels <- c(rep("validation", nrow(validationData)),
+             rep("prediction", nrow(predictedData)))
> labels <- as.factor(labels)
> labels
[1] validation validation validation prediction prediction prediction
Levels: prediction validation

> labels <- c(rep("validation", nrow(validationData)),

+ rep("prediction", nrow(predictedData)))

> labels <- as.factor(labels)

> labels

[1] validation validation validation prediction prediction prediction

Levels: prediction validation

最後に、rbindとdata.frameを用いて次のようなテストデータと予測値が結合したデータフレームを作成します。

> validPredData <- data.frame(labels, rbind(validationData, predictedData))
> validPredData
       labels   counts outcome treatment
10 validation 19.00000       1         1
11 validation 16.00000       2         1
12 validation 14.00000       3         1
1  prediction 20.39698       1         1
2  prediction 13.92965       2         1
3  prediction 15.17337       3         1

> validPredData <- data.frame(labels, rbind(validationData, predictedData))

> validPredData

labels counts outcome treatment

10 validation 19.00000 1 1

11 validation 16.00000 2 1

12 validation 14.00000 3 1

1 prediction 20.39698 1 1

2 prediction 13.92965 2 1

3 prediction 15.17337 3 1

ggplotを用いて、labelsの値についてcountsの値を描画します。

densityPlot <- ggplot(validPredData, aes(x = outcome, y = counts, fill = labels, color = labels))
densityPlot <- densityPlot + geom_point()
densityPlot

densityPlot <- ggplot(validPredData, aes(x = outcome, y = counts, fill = labels, color = labels))

densityPlot <- densityPlot + geom_point()

densityPlot

上を実行すると次のようなプロットが出力されます。

青がテストデータの値で、赤が予測値です。

だいたい予測できていることが見て取れます。

レッドブルエナジードリンク 250ml×24本

¥3,509 （2022/06/11 02:16時点 | Amazon調べ）

ZONe Ver.2.2.0 type-T エナジードリンク 500mlx24本

ポチップ

¥4,036 （2022/06/11 02:41時点 | Amazon調べ）