SAS log likelihood fit statistic from SCORE statement in PROC LOGISTIC [closed]

Question

This question appears to be off-topic because it focuses on programming, debugging, or performing routine operations, or it asks about obtaining datasets. You could try the support links we maintain or the Open Data site instead.

If the question is actually a statistical topic disguised as a coding question, then OP should edit the question to clarify this. After the statistical content has been clarified, the question is eligible for reopening.

Closed 3 months ago.

Improve this question

When using a SCORE statement in PROC LOGISTIC in SAS, I can get fit statistics with FITSTAT. My response variable is binary.

I want to get log likelihood, but looking at this documentation, I'm actually not sure if SAS is using the right formula for log likelihood. This is the formula given in the SAS documentation (my data doesn't have any frequency or weights columns): $$ \log L =\sum_i f_i w_i \log(\hat\pi_i). $$

In particular, the $\hat\pi_i$ in the formula is said to be the 'predicted probability of the observation'. By this do they mean the 'corrected' probability (which is the predicted probability of the observed outcome of $y_i$). See this page for a reference. It's particularly suspicious if you look at the SAS documented binary brier score formula, which seems to suggest that $\hat\pi_i$ is not the corrected probability but just the probability that $y_i=1$ (but perhaps I'm reading that wrong; I'm not too sure about all the $f_i, w_i, n_i$ and $ r_i$ variables).

My understanding is that log likelihood should be: $$ \sum_i y_i \log(p_i) + (1-y_i)\log(1-p_i) $$ where $p_i$ is the predicted probability that $y_i=1$. Or alternatively, $$ \sum_i \log(P_i) $$ where $P_i$ is the 'corrected' probability (i.e. $P_i= p_i$ if $y_i=1$ and $P_i=1-p_i$ if $y_i=0$.)

So basically I'm not sure if I trust SAS here. After posting this I will try to test this on a small dataset to see, but would appreciate any insight.

"Scoring" in SAS terms is prediction, which does not take into account the observed outcome -- in fact, more often than not there isn't even any particular one, nor does the already fitted model care! What, then, is $y_i$ in your proposed formula? (you'll notice that the non-normalized LL of the model fitting does use the formula you mention) — PBulls, Commented Apr 15 at 7:07
@PBulls the fitstat option requires the observed values. "Specifying the FITSTAT option displays the following fit statistics when the data set being scored includes the response variable:" — cpahanson, Commented Apr 15 at 7:32
I stand corrected, it does take into account the observed response for fitstat, but it then also takes the predicted probability of the observed outcome as $\pi_i$ which makes the two formulae equivalent (minus normalization constant that is included in the default reported fit statistics). — PBulls, Commented Apr 15 at 8:03
@PBulls Thanks. Do you think that's consistent with the $\hat\pi_i$ that appears in the same table in the binary brier score formula? This table: documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/… — cpahanson, Commented Apr 15 at 8:05
Presumably yes, they say "For binary response models, $\hat\pi_i$ is the predicted probability of the observation" which empirically seems to match with taking the prediction for the observed response. — PBulls, Commented Apr 15 at 8:09

PBulls · Accepted Answer · 2024-04-15 08:29:32Z

To summarize the comments, FITSTAT requires you to provide the observed response, which is then used to match each predicted probability to the observed outcome.

Put differently, the $y_i$ and $(1-y_i)$ in your second formulae essentially serve as indicator functions, the mapping of which is automatically handled by SAS internally, and it'll select $\hat\pi_i=p_i$ for outcome $y_i$ or $\hat\pi_i=(1-p_i)$ for response $(1-y_i)$. The end result is identical -- the only difference versus the log likelihood SAS prints in the default fit statistics is that that includes an additional normalizing constant.

When specifying "single-trial" syntax (i.e. not using model events/trials = ...) you have to keep in mind that $n_i=r_i=1$, but it's all still the same in the end.

This is easily verified empirically using the example dataset here:

proc logistic data=neuralgia;
   class treatment sex;
   model pain(event="Yes") = treatment|sex age duration;
   score out=score fitstat;
   /* Score LL = -24.2979, Brier = 0.120297 */
run;

data _null_;
   do _n_ = 1 by 1 until (eof);
      set score end=eof;
      y = (pain eq "Yes");
      p = p_yes;
      ll = sum(ll, y*log(p) + (1-y)*log(1-p));
      /* pi_hat in SAS formulae */
      pi = ifn(y, p_yes, p_no);
      /* Recall ni = ri = 1, and w = sum(ni) */
      brier = sum(brier, 1*(1-pi)**2 + (1-1)*pi**2);
      w = sum(w, 1);
   end;
   brier = brier / w;
   put ll= brier=;
run;
/* ll = -24.29788 brier = 0.1202971 */

Stack Exchange Network

SAS log likelihood fit statistic from SCORE statement in PROC LOGISTIC [closed]

1 Answer 1

Not the answer you're looking for? Browse other questions tagged
logistic
sas
log-loss
or ask your own question.

Hot Network Questions

SAS log likelihood fit statistic from SCORE statement in PROC LOGISTIC [closed]

1 Answer 1

Not the answer you're looking for? Browse other questions tagged logisticsaslog-loss or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
logistic
sas
log-loss
or ask your own question.