在分析二元结果时,逻辑回归是分析师对回归建模的默认方法。随机研究中,当然很容易估计比较两个治疗组的风险比。对于观察数据,治疗不是随机分配的,估计治疗效果的风险比有点棘手。
理想情况 - 随机治疗分配
理想情况下,我们首先模拟(在Stata中)一个大型数据集,该数据集可能在随机试验中出现:
gen x = rnormal() gen z =(runiform()<0.5) gen xb = x + z gen pr = exp(xb)/(1 + exp(xb)) gen y =(runiform()<pr)
此代码为10,000个人生成数据集。每个都有一个基线变量x的值,它是从标准N(0,1)分布模拟的。接下来,根据随机研究,我们模拟一个二进制变量z,概率0.5为1,概率0.5为0.然后生成二元结果y,我们从逻辑回归模型生成它,对数几率为1等于x + z。因此,对于x,调整z = 1与z = 0的真实优势比是exp(1)= 2.72。
由于处理是随机分配的,我们可以忽略x并使用带有日志链接的GLM命令估计比较z = 1到z = 0的风险比:
Generalized linear models No. of obs = 10000 Optimization : ML Residual df = 9998 Scale parameter = 1 Deviance = 12960.39176 (1/df) Deviance = 1.296298 Pearson = 10000 (1/df) Pearson = 1.0002 Variance function: V(u) = u*(1-u) [Bernoulli] Link function : g(u) = ln(u) [Log] AIC = 1.296439 Log likelihood = -6480.195879 BIC = -79124.59 ------------------------------------------------------------------------------ | OIM y | Risk Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- z | 1.429341 .0241683 21.13 0.000 1.382748 1.477503 _cons | .495886 .0070829 -49.11 0.000 .4821963 .5099643 ------------------------------------------------------------------------------
风险比估计为1.43,因为数据集很大,95%置信区间非常窄。
估算观测数据的风险比
现在让我们考虑观测数据的情况。为此,我们模拟了一个新的数据集 :
gen x=rnormal() gen tr_xb=x gen tr_pr=exp(tr_xb)/(1+exp(tr_xb)) gen z=(runiform() < tr_pr) gen xb=x+z gen pr=exp(xb)/(1+exp(xb)) gen y=(runiform() < pr)
如果我们为y运行相同的GLM模型,忽略x,我们得到:
. glm y z, family(binomial) link(log) eform Generalized linear models No. of obs = 10000 Optimization : ML Residual df = 9998 Scale parameter = 1 Deviance = 12014.93919 (1/df) Deviance = 1.201734 Pearson = 10000 (1/df) Pearson = 1.0002 Variance function: V(u) = u*(1-u) [Bernoulli] Link function : g(u) = ln(u) [Log] AIC = 1.201894 Log likelihood = -6007.469594 BIC = -80070.04 ------------------------------------------------------------------------------ | OIM y | Risk Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- z | 1.904111 .0353876 34.65 0.000 1.836 1.974748 _cons | .4101531 .0069811 -52.36 0.000 .396696 .4240667 ------------------------------------------------------------------------------
使用对数广义线性模型
最明显的方法是在我们的GLM命令中添加x:
. glm y z x, family(binomial) link(log) eform Iteration 0: log likelihood = -9271.4631 Iteration 1: log likelihood = -5833.7585 Iteration 2: log likelihood = -5733.9167 (not concave) Iteration 3: log likelihood = -5733.9167 (not concave) ...
然而,这无法收敛 。
通过逻辑模型估计风险比率
一个相对简单的替代方案是使用逻辑模型来估计调整x的治疗风险比。
Logistic regression Number of obs = 10000 LR chi2(2) = 2797.60 Prob > chi2 = 0.0000 Log likelihood = -5343.6848 Pseudo R2 = 0.2075 ------------------------------------------------------------------------------ y | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- z | 2.947912 .1442639 22.09 0.000 2.678297 3.244668 x | 2.693029 .0811728 32.87 0.000 2.538542 2.856918 _cons | .9910342 .0322706 -0.28 0.782 .929761 1.056345 ------------------------------------------------------------------------------
然而,我们可以使用该模型来计算每个人的预测概率,首先假设所有个体都未被治疗(z = 0),然后假设所有个体都被治疗(z = 1):
replace z=0 predict pr_z0, pr replace z=1 predict pr_z1, pr
此代码首先生成一个新变量zcopy,它保留原始治疗分配变量的副本。然后我们将所有个体z设置为0,并询问y = 1的预测概率。然后我们将所有个体设置为z = 1,并再次计算P(y = 1)。现在估计z = 1与z = 0相比的风险比,我们简单地考虑这两个条件下的边际风险比,即P(y = 1 | z = 1)/ P(y = 1) | Z = 0):
summ pr_z0 pr_z1 Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- pr_z0 | 10000 .4970649 .2060767 .0166056 .9765812 pr_z1 | 10000 .7094445 .1765126 .0474181 .9919309 di .7094445/.4970649 1.4272673 replace z=zcopy
给出了我们估计的风险比,比较z = 1到z = 0,为1.43,与我们第一次模拟数据时估计的风险比相同,其中治疗分配是完全随机的(特别是独立于x)。
置信区间
我们已经找到了风险比的点估计,但我们当然也喜欢置信区间,以指示估计的精确度。
首先,当z设置为0然后设置为1时,我们使用teffects给出我们对二元结果的边际均值的估计(相当于y = 1的概率):
Iteration 0: EE criterion = 1.898e-21 Iteration 1: EE criterion = 5.519e-34 Treatment-effects estimation Number of obs = 10000 Estimator : regression adjustment Outcome model : logit Treatment model: none ------------------------------------------------------------------------------ | Robust y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- POmeans | z | 0 | .4957607 .0072813 68.09 0.000 .4814897 .5100318 1 | .7078432 .0071566 98.91 0.000 .6938164 .7218699 ------------------------------------------------------------------------------
为了计算风险比和置信区间,我们首先使用teffects ra,coeflegend来查找Stata保存估算值的名称:
Treatment-effects estimation Number of obs = 10000 Estimator : regression adjustment Outcome model : logit Treatment model: none ------------------------------------------------------------------------------ y | Coef. Legend -------------+---------------------------------------------------------------- POmeans | z | 0 | .4957607 _b[POmeans:0bn.z] 1 | .7078432 _b[POmeans:1.z]
我们现在可以使用nlcom计算风险比及其置信区间。但是,由于这将为我们提供基于Wald的对称置信区间,因此最好找到对数风险比的这个区间,然后将得到的区间反向转换为风险比例:
_nl_1: log(_b[POmeans:1.z] / _b[POmeans:0bn.z]) ------------------------------------------------------------------------------ y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- _nl_1 | .3561291 .0172327 20.67 0.000 .3223536 .3899047 ------------------------------------------------------------------------------ . di exp(.3561291) 1.4277919 . di exp(.3223536) 1.3803728 . di exp(.3899047) 1.47684
因此,我们估计风险比为1.43,95%置信区间为1.38至1.48。