Td corrigé #!/bin/sh # This is a shell archive, meaning: # 1. Remove everything ... pdf

#!/bin/sh # This is a shell archive, meaning: # 1. Remove everything ...

Mais le recours à une distribution corrigée, essayant d'intégrer les privilèges de la ...... est un exercice privé de sens s'il se limite à dresser la liste de ce qui est ...... (ii) une spécification similaire à un modèle Weibull en temps continu de la forme ...... (iii) la gratuité de certains services publics tels que l'éducation et la santé ...

part of the document

5 to 22.5, etc?

The "right" answer is more philosophy than statistics. Luckily, on most
data sets, I don't think it will make any real difference which way it is
if test -f 'Agreg.3'
echo shar: over-writing existing file "'Agreg.3'"
cat 'Agreg.3'
Calculated hazard for the "test2" data set.

There are two possible formulations of the hazard. The first is in
reference to the hazard function's time scale. The second is with
reference to a particular subject, who might enter/leave the risk set
multiple times, might change strata, etc, and we want the total hazard that
would be experienced by someone traversing that "path".
In a standard Cox model these are the same, since each person starts
at time zero, has no gaps, and stays in the same strata.

1. Baseline hazard for covariate "x".
Let B= beta, s=exp(B)

Start Stop Event x #at risk 1/(dLambda) xbar
1 2 1 1 2 s+1 s/(s+1)
2 3 1 0 3 s+2 s/(s+2)
5 6 1 0 5 3s+2 3s/(3s+2)
2 7 1 1 4 3s+1 3s/(3s+1)
1 8 1 0 4 3s+1 3s/(3s+1)
7 9 1 1 5 (3s+2)/2 3s/(3s+2)
3 9 1 1 "
4 9 0 1 "
8 14 0 0 2 0 0
8 17 0 0 1 0 0

At times 2,3,6,7,8,9 the baseline hazard is
H = cumsum(dLambda), or
H = cumsum( 1/(s+1), 1/(s+2), 1/(3s+2), 1/(3s+1), 1/(3s+1), 2/(3s+2) )

The desired hazard is exp(Bx) * H

The "Greenwood" term in the variance is similar to H, except square all
of the denominators, i.e., G= cumsum( 1/(s+1)^2, ...)

The increments to d are (xbar-x) dLambda.

The total variance is exp(2Bx) * (G + dId')

B=0, x=0 (Survival = exp(-H))

Time Survival Var(log surv) dId' portion
2 .60653 .289880 .03987948
3 .43460 .444632 .08320533
6 .35582 .548804 .14769310
7 .27711 .748855 .28524405
8 .21582 .993771 .46765942
9 .14467 1.372728 .76661724

B=-.0845293, x=.5

Time Survival Var(log surv) dId' portion/ s
2 .6068014 .2496241 .0000763297
3 .4369373 .3606177 .003492431
6 .3571892 .3999496 .002099579
7 .2767459 .4631417 .000012849
8 .2144194 .5308199 .002807969
9 .1432922 .6152773 .006326855

It is a bit of a surprize just how much difference it can make 'where'
the expected survival curve is computed.

2. Baseline survival for someone with a history. A practical example where
one might wish to do this is hard to come by, particularly so with
time dependent covariates. One problem is that it is all too easy to construct
a covariate/followup/strata history that would never occur with a 'real'
subject, so that the resulting curve, though mathematically accurate, is a
useless summary.
Nevertheless---- Let this be the patient's time/covariate pattern

Patient time hazard time x
----------- ----------- ---
0-3 0-3 0
3-6 - - not at risk for event
6-10 6-10 1
10-15 0-5 0 (some intervention 'restarts'
his hazard pattern)
15-25 5-15 2

Time Inc in cum haz Inc in Greenwood var Inc in d
--- -------------- -------------------- ----------------
2 H[1] g[1] (xbar[1]-0)*H[1]
3 H[2] g[2] (xbar[2]-0)*H[2]
7 H[4]*s g[4]*s^2 (xbar[4]-1)*H[4]*s
8 H[5]*s g[5]*s^2 (xbar[5]-1)*H[5]*s
9 H[6]*s g[6]*s^2 (xbar[6]-1)*H[6]*s
12 H[1] g[1] (xbar[1]-0)*H[1]
13 H[2] g[2] (xbar[2]-0)*H[2]
16 H[3]*s^2 g[3]*s^4 (xbar[3]-2)*H[3]*s^2
17 H[4]*s^2 g[4]*s^4 (xbar[4]-2)*H[4]*s^2
18 H[5]*s^2 g[5]*s^4 (xbar[5]-2)*H[5]*s^2
19 H[6]*s^2 g[6]*s^4 (xbar[6]-2)*H[6]*s^2

In the above, H and g refer to the increments in hazard and in the
Greenwood variance found in part 1 above. One feature to be noted is that
with a time dependent "target" value of x, we can no longer factor the
exp(B*x) term out of the sum of increments.

if test -f 'Agreg.efficiency'
echo shar: over-writing existing file "'Agreg.efficiency'"
cat 'Agreg.efficiency'
The agreg function can do both time dependent and multiple event Cox
models. It is modeled on the martingale notation of Anderson and Gill, and
hence the name. In the martingale notation, one has for each subject 3
data objects. N(t) is the cumulative number of events experienced by the
subject, X(t) is the time-dependent covariate vector, and Y(t) is an
indicator function which is 1 whenever the subject is at risk. In the usual
way of martingales and gambling schemes, Y can depend in any way on X and N
as long as it does not look into the future.
In this implimentation, we constuct an X matrix of constant values along
with vectors start, stop, and event. Each subject is entered as multiple
rows of the X matrix. Each row represents an interval (start, stop] over
which the time dependent covariates were constant, and event is a 0/1
indicator with event==1 if an event occured at time "stop". The time interval
is open at the left and closed on the right.
The program can handle multiple events and/or missing intervals of risk.
For example, consider a study where the 'event' is readmission to the
hospital. A patient can have multiple re-admissions. During the period just
after an admission, however, the patient is not at risk (while they are still
in the hospital). Each period of time outside of the hospital would appear
as a separate row of data in x.
The more common way to deal with time dependent cox models is to have
a computation occur at each death time, for example, BMDP and the new SAS
PHREG do this. One advantage of that proceedure over this one is the
ability to code continuously time-dependent covariates: our method only
accomodates step functions. (Giving every subject a separate line of data
per death time in the entire data set would work, so technically we can do
continuous variables, but the task of setting up said data matrix is
daunting). However, our method has several advantages over the BMDP strategy:

1) The ability to handle multiple events per subject, and intervals
without observation.

2) The ablitily to tie the time scale down to something other than
patient entry to the study, i.e., patients who don't start at time zero. For
instance, assume that you were doing a study on factory workers, and that you
wanted to model a baseline hazard that was tied to calendar time rather than
employee exposure time, perhaps there were changes in the equipment or some
such and you expect the baseline hazard to jump on March 1.
An example I have worked with involved Parkinson's disease and
L-Dopa therapy. The natural time 0 for each subject is the time of diagnosis,
so that the baseline hazard tracks the natural history of the disease. But
patients didn't enter the study until they came to our facility for care,
usually through referral. In the BMDP scheme, one might attempt to code this
as t=time from Dx, with a time dependent variable X1(t) = "has come to Mayo".
The fit will show that referral is the worst thing that can happen to you,
since no one dies until they come here!
The issue of "aligning" subjects properly on the hazard time scale has
not gotton much attention in the past, and deserves some thought.

3) Time dependent strata are easily coded. If someone changes strata
at time s they are broken into two observations at that time point. The
second observation can even be restarted in time to start at 0 if you
wish. For instance, if the strata change corresponds to some strong
intervention therapy, then the natural risk set in the second strata might
be based on "time since intervention" rather than "time since entry to study".

4) When only a few subjects have time dependent variates, this method can
be much more efficient. In a recent study here with multiple lab tests,
the proportion with 1,2, etc tests was geometrically decreasing. Only 2
patients had 5 values. Thus most patients had only 1 line of data in x.
Let r(t) be the number of subjects in the risk set at death time t, p be
the number of covariates, q the number of time-dependent covariates, and
s(t) the number of rows in this strata (when set up the agreg way). The
agreg calculation has at each death time a search over s(t) terms with a
sum over r(t) of them, the BMDP calculation has a sum over r(t) terms, each of
which requires q calls to the computation subroutine. So my total time is
O(p*p* E(r)) + O(E(s)), BMDP's is O(p*p* E(r)) + O(q* E(r)), each times the
number of events. If the subroutine is at all complex agreg wins.

As a footnote, I was really surprized the first time I ran agreg on a
stratified data set: it was faster than I expected! When strata are
introduced the program spends less time searching out whom is part of the
current risk set. Without strata it has to look over the entire data set,
with strata it has to look only over the current group.

Terry T.
if test -f 'Coxreg.1'
echo shar: over-writing existing file "'Coxreg.1'"
cat 'Coxreg.1'
Examples that test out the Cox and agreg programs.

1) Here is a made up data set for Cox reg, where all of the results have been
worked out exactly.

Time Status x
1 1 1
1 0 1
2 1 1
2 1 0
3 0 0
4 1 0

Define B= beta (since I don't have Greek chars), and s=exp(B)

loglik(B) = 2B - log(3s+3) - 2log(s+3)

U = derivative of loglik with respect to B
= (6 + 3s - s^2)/ ((s+1) (s+3))

I = information matrix = s/ (s+1)^2 + 6s/(s+3)^2

Actual solution is at U=0, or s= .5*( 3 +- sqrt(33) )
s= 4.3722813....
beta-hat = log(s) = 1.4752849

loglik(0) = -4.56435
loglik(B) = -3.8247495
I(B) = .6341681

Iteration is B= 0, 1.6, 1.4727236, 1.4752838, ...

Let a=(s+1)*(3s+3), and b=(s+3)^2.
Then the score residuals are (2s+3)/a
-s /a
-s/a + 3(3-s)/b
s/a + s(s+1)/b
s/a + 2s/b
s/a + 2s/b

Score residuals at B=0 are (1/24) * (10, -2, 7, -1, 5, 5)

Martingale residuals are 1- s/(3s+3)
0- s/(3s+3)
1- s( 1/(3s+s) + 2/(s+3) )
1- ( 1/(3s+s) + 2/(s+3) )
0- ( 1/(3s+s) + 2/(s+3) )
1- s( 1/(3s+s) + 2/(s+3) +1)

Shoenfeld residuals are Time=1: 1/(s+1)
2: (3-s)/(3+s)
4: 0

2. A worked agreg example

Start Stop Event x #at risk 1/(dLambda)
1 2 1 1 2 s+1
2 3 1 0 3 s+2
5 6 1 0 5 3s+2
2 7 1 1 4 3s+1
1 8 1 0 4 3s+1
7 9 1 1 5 (3s+2)/2
3 9 1 1
4 9 0 1
8 14 0 0 2 0
8 17 0 0 1 0

loglik = 4B - ln(s+1) - ln(s+2) - 3ln(3s+2) -2ln(3s+1)
loglik(0) = -9.3926619, loglik(solution) = -9.387015

U = 4 - (63s^4 + 201s^3 +184s^2 + 48s)/(9s^4 +36s^3 +47s^2 +24s +4)
solution at s=30834667/33554432 = .9189446, thanks to Macsyma
beta-hat = log(s) = -.0845293

I = s/(s+1)^2 + 2s/(s+2)^2 + 6s/(3s+2)^2 + 3s/(3s+1)^2
+ 3s/(3s+1)^2 + 12s/(3s+2)^2
I(0) = 2821/1800 = 1.56722222222....

Score residuals at B=log(2), or s=2
1/9, -3/8, -21/32, -165/784, -2417/14112, 33/392, -15/784, -211/784,
3/16, 3/16
Score residuals at B=0
1/4, -2/9, -12/25, -413/3600, 191/1800, 71/400, 7/200, -73/200, 6/25,

Shoenfeld residuals Time=2 : 1/(s+1)
3 : -s/(s+2)
6 : -3s/(s+2)
7 : 1/(3s+1)
8 : 3s/(3s+1)
9 : 4/(3s+2)

Martingale residuals: Let a,b,c,d, e, and f be the increments of
dLambda listed above, i.e., a= 1/(s+1). Then the residuals are (event - ?)
where ? = a *s
(b + c + d) *s
a + b + c + d + e
(e + f)*s
(c + d + e + f) *s
(c + d + e + f) *s

if test -f 'Coxreg.2'
echo shar: over-writing existing file "'Coxreg.2'"
cat 'Coxreg.2'

1. Assume that at time t there are d tied deaths. Then the Breslow
approximation to the likelihood is

w[1] w[2] w[d]
----- * ----- * ... * -----
sum(w) sum(w) sum(w)

Here I have numbered the people in the risk set from 1 to n, and made sure
that the d deaths are in positions 1 to d. W is the vector of risk weights
w[i] = exp(beta %*% x[i,])

Let m = the sum of the scores for the deaths = sum(w[1:d]). The Efron
approximation is

w[1] w[2] w[3]
------- * ----------- * ------------ * ....
sum(w) sum(w) - m/d sum(w) - 2m/d

This is equivalent to using new weights at each of the death times within
a tie! Multiply each of the old weights by a factor (1+ 1/d - j/d) to get
the new jth term. Since this downweighting is not a function of beta, none
of the derivatives wrt beta changes in its form, e.g., I is still a sum
of weighted covariances, but I have new weights.

For the canonical coxreg example, at time 2, the weights are

first death: s, 1, 1, 1 (as before)
second : s/2, 1/2, 1, 1

The third term of U is (0 - s/(s+5))
The third term of I is (s/(s+5)) - (s/(s+5))^2
The new solution is s^3 - 23s -30 =0, so beta ~1.6768575

2. Exact partial likelihood. The term is

w[1] * w[2] * ......w[d]
sum(prod(w[d choose n]))

The denominator is the sum over all subsets of size d, and the number inside
the sum is the product of the weights from that subset. For the
coxreg example, at time 2, the likelihood term is now

s * 1 s
------------ = ---
s*1 + s*1 + s*1 + 1*1 + 1*1 + 1*1 3s +1

And the maximum occurs at beta= infinity.

If the problem is changed to "x=c(1,1,2,0,0,0)", the rest staying the
same, then beta=.961687.


Residuals under the Efron approximation.

Martingale beta =0 beta-hat
1 - s(1/3s+3) 5/6 .7191708
0 - s(1/3s+3) -1/6 -.2808092
1- (s/3s+3 + s/s+3 + s/s+5) 5/12 -.4383401
1- (1/3s+3 + 1/s+3 + 1/s+5) 5/12 .7310864
0- (1/3s+3 + 1/s+3 + 2/s+5) -3/4 -.365544
" "

1/s+1 1/2 .1575124
1- s(s+4)/((s+3)(s+5)) 19/24
0- "" 5/24 -.5787554

Score (I hated this one)
(2s+3)/((s+1)(3s+3)) 5/12 .1132783

-s /((s+1)(3s+3)) -1/12 -.04423407

(675+1305s +756s^2 -4s^3 -79s^4 -13s^5)/
(3(s+1)^2 (s+3)^2 (s+5)^2)) 55/144 -.1029196

(147s -9s^2 -200s^3 -140s^4 -35s^5 -3s^6)/
(3(s+1)^2 (s+3)^2 (s+5)^2)) -5/144 -.4078401

2s (177 + 282s + 182s^2 + 50s^3 + 5s^4)/
(3(s+1)^2 (s+3)^2 (s+5)^2)) 29/144 .2208585


Residuals under the exact likelihood:
Not in my lifetime.
if test -f 'Hazard.cox'
echo shar: over-writing existing file "'Hazard.cox'"
cat 'Hazard.cox'
Here is a simple example of the hazard and variance of the hazard, worked
out by hand.

Time Status x
1 1 1
1 0 1
2 1 1
2 1 0
3 0 0
4 1 0

Define B= beta (since I don't have Greek chars), and s=exp(B)

Now, the weights for each subject are exp(B*x) = (s,s,s,1,1,1), giving the
hazard and cumulative hazard of
Time Hazard Cum hazard
1 1/(3s+3) 1/(3s+3)
2 2/(s+3) " + 2/(s+3)
3 0 " + " + 0
4 1/1 " + " + " + 1

This estimated survival for a subject with x=0 is just exp(-1* cum hazard).
For a subject with arbitrary x, multiply the cumulative hazard by exp(B*x).

The variance of the cumulative hazard is the sum of two terms. Term 1
is a natural extension of Greenwood's formula to the case where there are
weights. It is a running sum, with an increment at each death time of
#deaths/ (sum of weights, for those still at risk)^2
Another reasonable Greenwood term would have r(r-d) instead of r^2 in the
denominator, where r=sum of weights over the risk set and d=sum of weights for
the deaths (though I have seen no one propose this in a paper). For an x other
than zero, the weights are relative to the chosen x, which has the effect
of multiplying the term by exp(2*B*x).

The second term is dId', where I is the variance-covariance matrix of the
Cox model, and d is a vector. The second term accounts for the fact that the
weights themselves have a variance; d is the derivative of S(t;x) wrt beta,
and can be formally written as
exp(B*x) \int_0^t (\bar z(s) - x) d\hat\Lambda_0 (s)
where z is the covariate set that gave rise to the estimated hazard, and
\bar z(s) is the usual Cox model term "weighted mean of z at time s".
This can be recognized as -1* the score residual process, which measures
leverage of a particular observation on the estimate of beta. It is
intuitive that a small score residual --- an obs with such and such
covariates has little influence on beta --- results in a small added
variance, i.e., beta has little influence on the estimated survival. This
is the score residual process if the 'psuedo subject' has no event, so the
correspondence with the actual score residuals (for the data at hand) is
not perfect.

For x=0, the ith element of d increases at each death time by
(weighted mean of ith covariate) * #deaths/(sum of weights in risk set)
The weighted mean is computed over those still in the risk set.

Time Term 1
------ --------
1 1/(3s+3)^2
2 " + 2/(s+3)^2
3 " + "
4 " + " + 1/(1^2)

Time d
----- ----
1 (s/(s+1))* 1/(3s+3)
2 " + (s/(s+3))* 2/(s+3)
3 " + "
4 " + " + 0 * 1

For this data set, I = s/(s+1)^2 + 6s/(s+3)^2; the variance matrix is I

For B=0, x=0
Time Variance
1 1/36 + 1.6*(1/12)^2 = 7/180
2 (1/36 + 2/16) + 1.6*(1/12 + 2/16)^2 = 2/9
4 (1/36 + 2/16 + 1) + 1.6*(1/12 + 2/16 + 0)^2 = 11/9

For B=1.4752849, x=0
Time Variance
1 .0038498 + .004021 = .007871
2 .040648 + .0704631 = .111111
4 1.040648 + .0704631 =1.111111

For B=0, X=1
The cum hazard is s*old, the first term of the variance is s^2*old.
The new d has increments of

The variance for a posited subject with arbitrary x is most easily
calculated by first recentering the covariate matrix at that value, and
then computing as in the above. But if you want to do it the other way---

Assume you want the hazard and survival for a subject with covariate
vector c. Let r = exp(B'c). Then the new hazard = r * old hazard. The
first term in the variance = r^2 * old first term in the variance. The
new d = r*(old d - c* Cum hazard). This last term is surprizing at first;
I multiply the old hazard by a constant r, and the variance does not go up
by r^2.


Changes due to the Efron approximation.

At time 2, the two deaths now experience a hazard of 1/(s+3) + 1/(s+5),
and the "survivors" at time 2 experience a hazard of 1/(s+3) + 2/(s+5).
The basic idea here is that each of the tied deaths experiences half of the
second hazard jump, so that the identity sum(deaths) = sum(hazard) remains
in force. For curve estimation we have

Time Hazard Cum hazard
1 1/(3s+3) 1/(3s+3)
2 1/(s+3) +2/(s+5) " + 1/(s+3) + 2/(s+5)
3 0 " + " + " +0
4 1/1 " + " + " + 1

The variance formulae extend by noticing that the "Greenwood" term is
the sum of (hazard increment ^2), when you treat a tied death as d separate
hazard increments. The same logic holds for the second term in the hazard.

Time Term 1
------ --------
1 1/(3s+3)^2
2 " + 1/(s+3)^2 + 4/(s+5)^2
3 " + " + "
4 " + " + " + 1/(1^2)

The weighted means at time 2 are now s/(s+3) and s/(s+5).

Time d
----- ----
1 (s/(s+1))* 1/(3s+3)
2 " + (s/(s+3))* 1/(s+3) + (s/(s+5))* 2/(s+5)
3 " + "
4 " + " + 0 * 1

For this data set, I = s/(s+1)^2 + 3s/(s+3)^2 + 5s/(s+5)^2; the variance
matrix is I inverse.

For B=0, x=0 c== I inverse = 144/83 ~ 1.7349
Time Variance
1 1/36 + c(1/12)^2 = 119/2988
2 (1/36 + 1/16 + 4/25) + c(1/12 + 1/16 + 1/18)^2 = 1996/6225
4 (1/36 + 1/16 + 4/25 +1) + c(1/12 + 1/16 + 1/18 +0)^2 = 1 + "

For B=1.676857, x=0
Time Variance
1 .00275667 + .00319386 = 0.0059505
2 .05445330 + .0796212 = 0.134075
4 1.05445330 + .0796212 = 1.134075
if test -f 'Install'
echo shar: over-writing existing file "'Install'"
cat 'Install'
First-- a note on my philosophy. I can't ensure that this will work for
all installations or machines, and I won't even try. The instructions below
work for me, on my machines. I am quite willing to give help if these don't
work for you, but make no guarrantees about the efficacy of the same.


Dynamic loading:

0. Make a directory SHOME/library/survival, and place the code there.
cd SHOME/library/survival
sh < /wherever_you_have_put_it/survival3

"SHOME" refers to the S home directory. This is where you want to
install things so that all S users can see them.

1. In that directory:
mkdir .Data
mkdir .Data/.Help
touch .Data/.Audit
chmod a-w .Data/.Audit

The two lines about the .Audit file are optional-- but I assume that you
don't want to waste a bunch of space with an audit file of the installation

To save some space, you may want to delete those rate tables (for
expected survival) that are irrelevant to your geographic needs: Arizona
survexp.azr.s Arizona (by race)
survexp.fl.s Florida
survexp.flr.s Florida Minnesota
survexp.mnwhite.s Minnesota
survexp.wnc.s West-North-Central US US total
survexp.usr.s US
survexp.uswhite.s US white

2. If you are running Splus:
a. Remove my Makefile

b. Type "Splus CHAPTER *.c *.d *.s" to make a nice new one
that is optimized for your machine.

c. If you are in subdirectory "xxx" then this new Makefile will
create an object xxx_l.o. Note that CHAPTER will fail if your
directory name has punctuation marks in it (like ".").

c. READ the first portion of the new Makefile. In particular,
find the comments on static versus dynamic loading and set things
up to choose dynamic loading.

d. See the manual page for "library" for instructions on.

e. type "make dyn.load".

3. If you are running ATT S instead of Splus there are 2 choices:

a. They don't recognized ".s" extensions. So do
cat *.s > all.q
S CHAPTER all.q *.c *.d

Then proceed pretty much as before.

b. Try using my Makefile. (I don't use CHAPTER locally because
I need to interface with SCCS). Find the line that defines
"SPLUS = ../../Splus", and change it to point to the actual
shell script. (This is probably ../../S or /usr/local/bin/S).
If your system doesn't have the Unix "install" command, then
find the string "install -m0444" in the Makefile, and change it
to "cp".
The makefile won't work for Ultrix, and perhaps for other systems
as well, as it uses many Sun "template" extensions.

4. Anyone who want to use the code will have to type the following
lines, every time they use S:

library(survival, first=T)
library.dynam("survival", "survival_l.o")

4a. You might want to make a local function out of the lines in (2),
say "newsurv", and place it on the default S search path.

4b. Users of Splus (at least) have 2 other options.
i. In this directory, exectute the "First.lib" function
% Splus
> source("First.lib")
> q()
Now your users only have to type the first line above, i.e.,
library(survival, first=T)
and can skip the second one.

ii. Do what we do at Mayo, and install an local startup file. A copy
of ours in here as First.local. Use it as a template for your own.

% Splus
> source("First.local")
> First.local() #test it!
> q()
% mv .Data/First.local SHOME/splus/.Functions/.First.local

4c. The "first=T" argument is important, since I override 4 of the
standard S functions.

5. You may want to move the Examples directory to another place, or
remove it.

To test the routines, cd to the Examples directory. Be aware that many
of my tests depend on the "date" functions, also available from statlib.

if test -f 'Mart.residual'
echo shar: over-writing existing file "'Mart.residual'"
cat 'Mart.residual'
Hopefully you, the reader, know TeX, because that is how the formulas
in this document are written. Notation is that of the Biometrika paper
(1990) ``Martingale Based Residuals for Survival Data''.

\def\bhat{\hat \beta} %define "bhat" to mean "beta hat"
\def\Mhat{\widehat M} %define "Mhat" to mean M-hat
\def\zbar{\bar Z}
\def\lhat{\hat \Lambda}

Let $Y_i(t)$ be the indicator that a given subject is at risk and under
observation at time t, $Z_{ij}(t)$ be the jth covariate of the ith person
(possibly time dependent), and $\bhat$ be the Cox or Agreg estimate of
beta. Let $N_i(t)$ be the step function for the ith subject, which counts
the number of ``events'' for that subject up to time t. Define $w_i(t)$ to
be $exp^{\bhat ' Z_i(t)}$, i.e., the risk score for the ith subject after
the fit is finished. Then the Breslow (or Tsiatis or Link) estimate of the
baseline hazard is
$$ \lhat(t) = {\int_0^n {\sum_{i=1}^n dN_i(s)} \over
{\sum_{j=1}^n Y_i(s)w_i(s)}} .$$

The martingale residual at time t is
$$ \Mhat_i(t) = N_i(t) - \int_0^t w_i(s) Y_i(s) d \lhat(s). $$
The program returns the residual at $t = \infty$.
For non-time-dependent covariates, the $w_i$ term can be factored out of the
integral, giving $w_i \lhat(t)$.

The deviance residual is a normalizing transform of the martingale residual
$$ deviance_i = sign(\Mhat_i) * \sqrt{-\Mhat_i - N_i log((N_i - \Mhat_i)/N_i)}$$

The other two residuals are based on the score process $L_{ij}(t)$ for the
ith subject and the jth variable:
$$ L_{ij}(t) = \int_0^t (Z_{ij}(s) - \zbar_j(s)) d \Mhat_i(s),$$
where $\zbar_j(s)$ is defined as the weighted mean (using $w_i$'s) of the
jth covariate, over the subjects still in the risk set at time s.

The score residual is then defined, for each subject and each variable
(an n by p matrix) as $L_{ij}(\infty)$. It is the sum of the score process
over time.

The other useful transform is to sum the score process up over indivduals
to get a total score process $L_j(t)$. This is just the score vector at
time t, so that at $\bhat$ we must have $L_j(0) = L_j(\infty) = 0$.
Because $\lhat$ is discreet, our estimated score process will also be
discreete, having jumps at each of the unique death times. Residual type
4 is a p column matrix with one row per jump. To get the ``Brownian bridge''
plots found in the technical report, just use the cumsum function on the
returned value.

There are two simplifying identies for the type 4 residuals:
$$ \sum_i L_{ij}(t) = \sum_i \int_0^t Z_{ij}(s) d \Mhat_i(s) $$
$$ \sum_i L_{ij}(t) = \sum_i \int_0^t (Z_{ij}(s) - \zbar_j(s)) d N_i(s) $$
The second of these is equivalent to a residual proposed by Schoenfeld.
Since the sums are the same for all $t$, each increment must be the same.
Our program uses the first identity, as $d \Mhat$ seems to be less work to
compute than $\zbar$. Note that $d \Mhat_i(t)$ is zero when subject i is
not in the risk set at time t.

Residuals 1, and 3 are integrals over time of some object. One
consequence of this is that they work well for the Anderson-Gill
program also. Specificly, in setting up agphm, a single subject is broken
up into multiple lines of the input data, as though he were a set of
different individuals observed over disjoint times. After the agphm
program is finished, the residual for that person is just the sum of the
residuals for these ``psuedo'' subjects. This property is not true for the
deviance residual, however.
Residual 4 works the same in both models, since it is reported at the
unique death times, and not for individuals at all.
if test -f 'Notes'
echo shar: over-writing existing file "'Notes'"
cat 'Notes'
Predicted survival after a Cox model

1. Time dependent covariates are near impossible.

a. I need to know who is who. The cohort survival for a study of
sized 2 where $\beta$=large and $x_1(t)=0$, $x_2(t)=1$, is much
different than if subjects 1&2 interchange covariates after a
short term.

In the first case, the overall survival is large -- there is >50%
chance that subject 1 is still alive. In case 2, they are both dead
after a while, since both have had a period of high risk. But the
input files for the two cases are identical!

b. Conceptually, things are undefined if someone has an interval
without risk.

c. For both the Ederer and Hakulinen estimates, I have to predict the
covariate beyond the time of measurement . The conditional estimate
only estimates a difference, not a true survival, so I'd be better
off to plot the time-dep coefficient.
if test -f 'Param'
echo shar: over-writing existing file "'Param'"
cat 'Param'
Since the density of a dist with small \sigma can be >1, the loglik for
the saturated model may be >0. Thus the deviance may be more than 2*LL.

if test -f 'ToDo'
echo shar: over-writing existing file "'ToDo'"
cat 'ToDo'
survreg: Add predict and plot methods.
Figure out how to use/expand the glm "family" concept to survreg
Impliment the residual tests of Escobar and Meeker PARTIAL
Add Gaussian, generalized gamma, and generalized F dist PARTIAL

Make sure that survtest uses the Left continuous KM estimate! DONE

Add the survival weights of Lin's paper to the program. NOT NEEDED, see paper
with Pat Grambsch.

How do caseweights propogate to the residuals calculation? THEY DON'T
How do risk weights propogate? Just an offset term.

Can't put a survival object inside a data frame. (You can with model.frame,
but then goes nuts). FIXED

Suspect that the plot of null agreg residiuals, not collapsed on subject,
versus the x variate will correctly give the functional form for a time
dependent covariate.

Understand Dave Harrington's deviance residuals arguments.

Add a 'mode' argument to the regression routines. If set to 'single', then
the X matrix is kept in single precision only. Saves a lot of
storage for Frank, et. al.

Allow singular matrices if singular.ok=T. Requires a change to qr style
calculations? DONE-- via a change to cholesky routines.

Add caseweights to coxreg and
Just how to do this for the Efron approx is highly unclear. DONE

Does my "na.action" option override an na.action attribute on a data frame? NO

Should or should not missing be included as a strata in and coxreg.
ADDED AS AN ARG TO "strata()"

test the singular=ok option for weibull & company

Remove the "infinite beta" check from coxreg. DONE

Allow multiple strata statements. DONE

Set up the "methods" based NA actions. DONE

Allow a strata argument to surv.exp DONE

Test the use of expected rather than observed information in survreg. TRIED -
didn't work out, and better starting estimates obviate the need.

Expand survdiff to rho-gamma tests.

Add frailty to coxph.

Update model.newframe to use hidden transformation parameters.

Individual survival estimates from predict. PARTIAL

Allow the newdata in survfit.coxph to contain only a subset of the values,
and use the means for the rest.
For a factor, default to the first level rather than the mean.

In print.survfit, warn if there is truncation? In this case the estimate of
the mean is problematic.

Hazard ratio output for survreg.

plot methods for coxph

Add a column "number of censored" to Kaplan-Meier printout. For left
truncated data (ie agreg), we need to print the number at risk more
often, which means that essentially the same column is added.

If there is a strata*covariate interaction term in a Cox model, then
the survfit method for the model gets very confused.

Add the Hakulinen method for expected survival post Cox model.

Realize that I will never get all of this done.
if test ! -d 'latex'
mkdir 'latex'
cd 'latex'
if test -f 'S.sty'
echo shar: over-writing existing file "'S.sty'"
cat 'S.sty'
% S book functions
% Example environment: for S examples -- like verbatim except that
% you can insert special things by using \ commands and braces (which
% then themselves have to be escaped)
% \s inserts a Postscript symbol that stands for the 1988 S book in references
% The next few lines change the text width:
%the default width corresponds to
% some functions for avoiding lousy pagebreaks
\def\need#1 {\vskip 0pt plus #1 \penalty -200 \vskip 0pt plus -#1}
\def\Exampleskip{{\widowpenalty=200\need 20pt }}
% \Sfigure{filename} inserts a figure using psfig
\def\docspecials{\do\ \do\$\do\&%

\def\alltt{\trivlist \item[]\if@minipage\else\vskip\parskip\fi
\@tempswafalse \def\par{\if@tempswa\hbox{}\fi\@tempswatrue\@@par}
\obeylines \small \tt \catcode``=13 \@noligs \let\do\@makeother \docspecials


%%following two lines toggle on/off marginal notes
%%\newcommand{\Marginpar}[1]{\marginpar{\small\em #1}}
\newcommand{\Co}[1]{{\small\tt #1}}
% some commands from Doug Bates for nonlinear chapter
\newcommand{\E}[1]{{\rm E}\left[ #1 \right]}
\newcommand{\bG}{\bf G}
\newcommand{\bX}{\bf X}
\newcommand{\by}{\bf y}
\newcommand{\bz}{\bf z}
\newcommand{\trans}{^{\rm T}}
\newcommand{\logit}{\mathop{\rm logit}\nolimits}

% \lower 2.2pt\psfig{figure=/usr/jmc/book2/,height=11pt,silent=}
%\kern -1.5pt
% }}

% commands for figures


\def\@makechapterhead#1{{\def\\{,}\addtocontents{toc}{{\@author}} }
\vspace*{50pt} { \parindent 0pt \raggedright
\ifnum \c@secnumdepth >\m@ne \huge\bf \@chapapp{} \thechapter \par
\vskip 20pt \fi \Huge \bf #1\par
\vskip 1em {\large \lineskip .75em
\nobreak \vskip 40pt }

\vskip 10pt
\setbox\@tempboxa\hbox{ #1: {\it #2}}
\ifdim \wd\@tempboxa >\hsize { #1: {\it #2}}\par \else \hbox
%\newcommand{\DraftHeading}{\chaptermark{\copyright AT\&T: DRAFT \today}}
\newcommand{\Plus}{\mbox{ \tt + }}
\newcommand{\Star}{\mbox{ \tt * }}
\newcommand{\Colon}{\mbox{ \tt :\,}}
\newcommand{\Divide}{\mbox{ \tt / }}
\newcommand{\In}{\mbox{ \tt \%in\% }}
\newcommand{\Minus}{\mbox{ \tt - }}
\newcommand{\var}{{\rm var}}
\newcommand{\ev}{{\it E}}
\newcommand{\dev}{{\it D}}
\newcommand{\AIC}{{\it AIC}}
\renewcommand{\vec}[1]{\mbox{\boldmath $#1$}}
\newcommand{\df}{{\rm df}}

%These next two are to do with placement of figures; see page 178 of Latex Manual

%This is for the S HELP file documentation
\item[{\small\uppercase{#1}}]\ \\
\index{#1 (Helpfile)}
\vbox to .375in{\vfill
\hbox to \textwidth {\ {\bf #1}\hfill #2\hfill {\bf #1}\ }
\markboth{\hskip 1pc{\bf #1}\hfill \uppercase{Appendix B}}{\uppercase{ S Functions and Classes}\hfill{\bf #1}\hskip 1pc}
\newenvironment{Helpfile}[2]{\need 2in \Title{#1}{#2}\begin{list}{}{\setlength{\leftmargin}{.58in}\setlength{\labelwidth}{.5in}\setlength{\labelsep}{.08in}}\item}{\end{list}}


\newcommand{\seeTitle}[2]{\need 20pt \Title{#1}{#2}}
\newcommand{\GLIM}{{\sc glim}}
\newcommand{\GENSTAT}{{\sc genstat}}
\newcommand{\FORTRAN}{{\sc fortran}}
\newcommand{\C}{{\sc c}}
\newcommand{\GAM}{{\sc gam}}
\newcommand{\GLM}{{\sc glm}}
\newcommand{\LM}{{\sc lm}}
\newcommand{\UNIX}{{\sc unix}}
\newcommand{\LINPACK}{{\sc linpack}}

\hyphenation{weight-ed data-base para-met-riz-ed Like-li-hood like-li-hood}
\hyphenation{point-wise pre-plot}
if test -f '.gs1.therneau'
echo shar: over-writing existing file "'.gs1.therneau'"
cat '.gs1.therneau'
ZZversion 3¨screensize 32x80¨t Tue Nov 29 07:55:53 1994¨8~75¤border^return indentmake window 1¬Readme 12x36 12x36vZª
echo shar: a missing newline was added to "'.gs1.therneau'"
if test -f 'Readme'
echo shar: over-writing existing file "'Readme'"
cat 'Readme'
The "official" reference for this document is:

Terry M Therneau, A package for survival analysis in S, Technical Report
number 53, Section of Biostatistics, Mayo Clinic, 1994.

If you have no access to latex we will happily mail you a copy.

Terry Therneau
Section of Biostatistics
Mayo Clinic
Rochester, Minnesota 55905
if test -f 'survival.tex'
echo shar: over-writing existing file "'survival.tex'"
cat 'survival.tex'
\title {A Package for Survival Analysis in S}
\author{Terry M. Therneau \\
Mayo Foundation }

\def\bhat{\hat \beta} %define "bhat" to mean "beta hat"
\def\Mhat{\widehat M} %define "Mhat" to mean M-hat
\def\zbar{\bar Z}
\def\lhat{\hat \Lambda}


Over the last ten years I have been using the S package as a personal tool
for my investigations of survival analysis. This work gained a large
amount of momentum during my joint efforts with Patricia Grambsch
and Tom Fleming on residuals for survival models, some of which is
described in \cite{TGF}.
The set of
routines based on that work has been posted to statlib for several years
under the name ``survival2''
and is also included as a standard part of the S-plus package from
Statistical Sciences. With the advent of the new object-oriented
methods within S, laid
out in the book ``Statistical models in S" by John Chambers and Trevor
Hastie \cite{Smodel}, I became interested in adapting some of these methods to the
survival package.

The execution of this task has turned out to be a substantial effort, but
it afforded a chance to upgrade several of the routines with new features,
and I think the collection has been much improved in both functionality and
ease of use. Substantial opportunities for further
improvements still remain, however.

Section 2 gives a very terse overview of the available commands, without
attempting to explain the various options.
Section 3 contains the formal statististical details for the methods.

Section 4 gives
detailed examples. This is the place to expend most of your reading effort,
at least for early users.
Sections 5, 6 and 7 discuss three separate areas of the
package's interaction with S in general: side effects, missing values,
and conflicts with existing S functions.

There are several good texts on survival analysis; examples are Miller
\cite{Miller} and Kalbfleisch and Prentice \cite{Kalb}. A more
technical treatment of the Cox model, including the important extensions
of time dependent-covariates and multiple events, can be found in
Fleming and Harrington \cite{FandH}, and in Andersen et al. \cite{Ander2}.

My thanks to Sin-Ho Jung, Paul Novotny, and particularly Peter Thall for
helpful comments on an earlier draft of this manuscript.
Frank Harrell continues to provide feedback on the methods, along with
new ideas for improvement. (If I ever find time to implement all of
Frank's suggestions, this will become an {\em amazing} survival package!)
The summary below is purposefully very terse. If you are familiar
with survival analysis {\em and} with other S modeling functions it will
provide a good summary. Otherwise, just skim the section to get an overview
of the type of computations available from this package, and move on to
section 3 for a fuller description.

\item[Surv()] A {\em packaging} function; like I() and C() it doesn't
transform its argument. This is used for the left hand side of all the
\item \Co{Surv(time, status)} -- right censored data
\item \Co{Surv(time, s==3)} -- right censored data, a value of 3 = death
\item \Co{Surv(t1, t2, status)} -- counting process data, as in the current
agreg() function
\item \Co{Surv(t1, ind, type='left')} -- left censoring
\item Methods
Math.Surv \\
Ops.Surv \\
Summary.Surv \\
{[}.Surv \\ \\
\item[coxph()] Cox's proportional hazards model.
\item \Co{coxph(Surv(time, status) {\Twiddle}x, data=aml)} --
standard Cox model
\item \Co{coxph(Surv(t1, t2, stat) {\Twiddle} (age + surgery)* transplant)} --
time dependent data.
\item \Co{y 1}r_i}
\right) \ldots,
where ${r_1, r_2, \ldots, r_i}$ are the per subject risk scores.
Assume that the real data are continuous, but the data as recorded have tied
death times. For instance, we might have several subjects die on day 1 of
their hospital stay, but of course they did not all perish at the same
For a simple example, assume 5 subjects in time order,
with the first two both dying at the same recorded time.
If the time data had been more precise, then
the first two terms in the likelihood would be either
\left(\frac{r_1}{r_1+r_2+r_3+r_4+r_5} \right)
\left( \frac{r_2}{r_2+r_3+r_4+r_5} \right)
\left(\frac{r_2}{r_1+r_2+r_3+r_4+r_5} \right)
\left( \frac{r_1}{r_1+r_3+r_4+r_5}\right) \,,
but we don't know which. Notice that the product of the numerators remains
constant, but that of the denominators does not. How do we approximate this?

The Breslow approximation is the most commonly used, as it is the easiest
to program. It simply uses the complete sum $\sum r_i$ for both denominators.
Clearly, if the proportion of ties is large this will deflate the partial

The Efron approximation uses $.5r_1 + .5r_2 + r_3 + r_4 +r_5$ as the
second denominator, based on the idea that $r_1$ and $r_2$ each have
a 50\% chance of appearing in the ``true'' second term. If there were 4
tied deaths, then the ratios for $r_1$ to $r_4$ would be 1, 3/4, 1/2, and
1/4 in each of the four denominator terms, respectively.
Though it is not widely used, the Efron approximation is only slightly more
difficult to program than the Breslow version. In particular, since the
downweighting is independent of $w$ and thus of $\beta$, the form of the
derivatives is unchanged.

There are several ways to approach an ``exact'' calculation. One is to
use the average of the two possible denominators as the denominator for the
second term. This calculation quickly gets cumbersome if the number
of tied subjects $d$ who perish at a given time is at all large, since it
is the average of $d$ terms for the second denominator, ${d \choose 2}$
terms for the third, etc.
Note that if the risk scores for the tied subjects were all equal, then the
Efron approximation agrees precisely with this exact calculation.

Another tack is to use the marginal probability that subjects 1 and
2 both expire before subjects 3, 4 and 5. The form of the likelihood
changes considerably in this case, and the product of terms 1 and 2 is
replaced by
\int_0^\infty \left[1-\exp \left(-\frac{r_1}{r_3+r_4+r_5}\right) \right]
\left[1-\exp \left(-\frac{r_2}{r_3+r_4+r_5}\right) \right]
e^{-t} \,dt
If there are $d$ subjects $r_1$ to $r_d$ tied some given time, and we let $s$ be the
sum of the remaining scores in the risk set, the above integral expands to
1 - \sum_{i=1}^{d}\frac{s}{r_i +s} + \sum_{i \ne j} \frac{s}{r_i + r_j +s}
- \sum_{i \ne j \ne k} \frac{s}{r_i+r_j+r_k +s} + \ldots \,,
which is the same amount of work as the average denominator calculation.
(Though similar, the two expressions are not equivalent for $d>2$).
Some tedious algebra shows that if the risk scores for the tied subjects are
all equal, this equals $d!$ times the Efron approximation,
and thus leads to exactly the same solution for $\bhat$.
This would imply that
the first and second ``exact'' methods would be close for actual data sets.

The exact logistic likelihood, or exact partial likelihood, comes from viewing
the data as genuinely discrete.
The denominator in this case is
$ \sum_{i \ne j} r_i r_j$
if there are two subjects tied, $\sum_{i\ne j \ne k} r_i r_j r_k$ if there
are three subjects tied, etc.
The compute time for this case will be even larger than for the calculation
above. If there are ties, the value can be considerably different than the
first exact method.

The SAS phreg procedure implements the second and third exact method. A
small amount of empiric checking has verified that the Efron approximation is
very close to the exact marginal likelihood, and so only the exact partial
likelihood has been implemented in the S package.

Because of the superiority of the Efron approximation, the coxph function has
departed from all other Cox regression programs (that I know of) by making
it the default option rather than the Breslow approximation.
Note that when there are no ties, all the methods reduce to the same form.

The Efron approximation also induces changes in the residuals' definitions.
In particular, the Cox score statistic is still
U = \sum_{i=1}^n \int_0^\infty (Z_i(s) - \bar Z(s))\, dN_i(s)\,,
but the definition of $\bar Z(s)$ has changed if there are tied deaths
at time $s$.
If there are $d$ deaths at $s$, then there were $d$ different values of $\bar Z$
used at the time point. The Schoenfeld residuals use
$\bar{\bar Z}$, the average of
these $d$ values, in the computation.

The martingale and score residuals require a new definition of
If there are $d$ tied deaths at time $t$,
we again assume that in the exact (but unknown) untied data
there are events and corresponding jumps in the cumulative hazard at $t \pm
\epsilon_1 < ...< t \pm \epsilon_d$.
Then each of the tied subjects will in
expectation experience all of the first hazard increment, but only
$(d-1)/d$ of the second, $(d-2)/d$ of the
next, and etc. If we equate observed to expected hazard at each
of the $d$ deaths, then the total increment in hazard at the time point is
the sum of the denominators of the weighted means. Returning to our earlier
example of 5 subjects of which 1 and 2 are a tied deaths:
d\lhat(t) = \frac{1}{r_1+r_2+r_3+r_4+r_5} + \frac{1}{r_1/2 + r_2/2
+r_3 + r_4 + r_5}\,.
For the null model where $r_i=1$ for all $i$, this agrees with the suggestion
of Nelson (1969) to use $1/5+1/4$ rather than $2/5$ as the increment to the
cumulative hazard.

The score residuals do not work out to as neat a formula, though the
computation is no harder. For subject 1 in the example,
the residual at time 1 is the sum $a+b$ of the 2 terms:
a &=& \left(Z_1 - \frac{r_1Z_1 + r_2Z_2+ \ldots + r_5Z_5}
{r_1 + r_2 + \ldots + r_5} \right )
\left ( \frac{dN_1}{2} - \frac{r_1}{r_1 + r_2 + \ldots+ r_5}
\right ) \; {\rm and}\\
b &=& \left(Z_1 - \frac{r_1Z_1/2 + r_2Z_2/2 + \ldots +r_5Z_5}
{r_1/2 + r_2/2 + \ldots + r_5} \right )
\left ( \frac{dN_1}{2} - \frac{r_1/2}{r_1/2 + r_2/2 + \ldots+ r_5}
\right ) \,.
This product does not neatly collapse into $(Z_1 - \bar{\bar Z})\, d\Mhat_i$
but is nevertheless fairly easy to compute.
To those who wish to check the algebra: start with the expanded ($d$ term)
definition of the increment to $U$, and repeat the infinitesimal jackknife
calculations of Cain and Lange \cite{Cain}.

This argument carries through as well to predicted survival. There is a
change in weights but no change in the form of the equations.

The connection between residuals and the exact partial likelihood is not
as precise, e.g. the score residuals will not correspond to an infinitesimal
The exact calculation is used only rarely, the form of the
computations will be quite different, and it thus appears to be not worth
the bother. If residuals are requested after a Cox fit with the \Co{exact}
method, the Breslow formulae are used.

\subsubsection{Tests for Proportional Hazards}

The key ideas of this section are taken from Grambsch and Therneau
Most of the common alternatives to proportional hazards can be cast in
terms of a {\em time-varying coefficient} model. That is, we assume that
\lambda(t;Z) = \lambda_0(t) e^{\beta_1(t)Z_1 + \beta_2(t)Z_2 + \ldots}
(If $Z_j$ is a 0-1 covariate, such as treatment, this formulation is completely
general in that it encompasses all alternatives to proportional hazards.)
The proportional hazards assumption is then a test for $\beta(t) = \beta$,
which is a test for zero slope in the appropriate plot of
$\bhat(t)$ on $t$.

Let $i$ index subjects, $j$ index variables, and $k$ index the death times.
Then let $s_k$ be the Schoenfeld residual and $V_k$ be the contribution to
the information matrix at time $t_k$ (see equation \ref{inf}).
Define the rescaled Schoenfeld residual as
s^*_k = \bhat + s_k V_k^{-1} \, .

The main results are:
\item $E(s^*_k) = \beta(t_k)$, so that a smoothed plot of $s^*$ versus time gives a
direct estimate of $\bhat(t)$.
\item Many of the common tests for proportional hazards are linear tests for
zero slope, applied to the plot of $r^*$ versus $g(t)$ for some function $g$.
In particular, the Z:PH test popularized in the SAS PHGLM procedure
corresponds to $g(t) =$ rank of the death time.
The test of Lin \cite{Lin} corresponds to $g(t) = K(t)$, where $K$ is the
\item Confidence bands, tests for individual variables, and a global test
are available, and all have the fairly standard ``linear models'' form.
\item The estimates and tests are affected very little if the individual
variance estimates $V_k$ are replaced by their global average
$\Vbar=\sum V_k /d = {\cal I}/d$. Calculations then require only the Schoenfeld
residuals and the standard Cox variance estimate ${\cal I}^{-1}$.

For the global test, let $g_1(t), g_2(t), \ldots$ be the desired
transformations of time for variables 1, 2, etc,
and $G_k$ be the diagonal matrix $g_1(t_k), g_2(t_k), \ldots$ .
T = \left( \sum G_k s_k \right)' D^{-1} \left( \sum G_k s_k \right)
is asymptotically $\chi^2$ on $p$ degrees of freedom, where
D = \sum G_kV_kG_k - \left(\sum G_k V_k \right )
\left( \sum V_k \right)^{-1} \left(\sum G_k V_k \right )' \, .
Because the $s_k$ sum to zero, a little algebra shows that the above
expression is invariant if $G_k$ is replaced by $G_k - cI$ for any
constant $c$. Subtraction of a mean will, however, result in less
computer round-off error.

In any rational application, we will have $g_1 = g_2= \ldots = g$, and then
we can replace each matrix $G_k$ with a scalar $g(t_k) \equiv g_k$
in the above formulas. A
further simplification occurs by using $\Vbar$, leading to
T = \left[ \sum (g_k-\bar g) s_k \right]'
\left [ \frac{d {\cal I}^{-1}}{\sum(g_k - \bar g)^2} \right ]
\left[ \sum( (g_k- \bar g) s_k \right]

For a given covariate $j$, the diagnostic plot will have $y_k = s^*_{kj}$
on the vertical axis and $g_k$ on the horizontal.
The variance matrix of the $y$ vector is $\Sigma_j = (A - cJ) + cI$,
where $A$ is a $d$x$d$ diagonal matrix whose $k$th diagonal element is
$V^{-1}_{k,jj}$, $c = {\cal I}^{-1}_{jj}$,
$J$ is a $d$x$d$ matrix of ones and $I$ is the identity.
The $cI$ term reflects the uncertainty in $s^*$ due the the $\bhat$
term. If only the shape of $\beta(t)$ is of interest (e.g., is it linear
or sigmoid) the term could be dropped. If absolute values are important
(e.g. $\beta(t)=0$ for $t>2$ years) it should be retained.

For smooths that are linear operators, such as splines or the lowess function,
the final smooth is $\hat y = H y$ for some matrix $H$. Then $\hat y$
will be asymptotically normal with mean 0 and variance $H \Sigma_j H'$.
Standard errors can be computed using ordinary linear model methods.
Although the covariance between residuals is roughly $-1/d$ times their variance, it cannot
be neglected in computing standard errors for the smooth. For larger
smoothing spans, simulations showed up to 25\% inflation in the width of
the interval if covariance was ignored.

If $V_k$ is replaced with $\Vbar$, then $\Sigma_j$ simplifies to
${\cal I}^{-1}_{jj}((d+1)I -J)$. With the same substitution, the
component-wise test for linear association is
t_j = \frac{\sum (g_k - \bar g) y_k}
{\sqrt{d {\cal I}^{-1}_{jj} \sum(g_k - \bar g)^2 }}

The \Co{cox.zph} function uses (\ref{Tph}) as a global test of proportional
hazards, and (\ref{tph}) to test individual covariates. The plot method for
\Co{cox.zph} uses a natural spline smoother (\Co{lowess} might be preferred,
but the necessary $H$ matrix is not readily obtained);
confidence bands for the smooth are based
on the full covariance matrix, with $\Vbar$ replacing $V_k$.

Formulae aside, reasonably accurate results can be obtained by using other
methods directly on the residuals. The return value of\Co{cox.zph}
contains both the $g(t)$ vector and the $y$ matrix that are appropriate for plots.
These can be used as the $x$ and $y$ data for a \Co{gam} model with
identity link and Gaussian errors. for example.
The size of the confidence
band will be conservative (too large) since,
as discussed above,
the correlation
between the data points has been ignored. This effect will tend to decrease
as the sample size increases, since a smaller fraction of the data will be
in any smoothing window. Secondly, the overall estimate of variation may
be larger, since it is estimated using the variation of each $y$ value from
the fitted function; the $V_k$ estimates are based on the variation of each
$y$ from its risk set.

Though the simulations in Grambsch and Therneau (1993) did not uncover
any situations where the simpler formulae based on $\Vbar$ were
less reliable, such cases could arise. The substitution trades a possible
increase in bias for a substantial reduction in the variance of the individual
$V_k$. It is likely to be unwise in those cases where the variance of
the covariates, within the risk sets, differs substantially between different
risk sets. Two examples come to mind. The first would be a stratified
Cox model, where the strata represent different populations. In a muli-center
clinical trial for instance, inner city, Veterans Administration and
suburban hospitals often service quite disparate populations.
In this case a separate average $\Vbar$ should be formed for each strata.
A second example
is where the covariate mix changes markedly over time, perhaps because of
aggressive censoring of certain patient types.

These cases have not been addressed directly in the software. However,
\Co{coxph.detail} will return all of the $V_k$ matrices,
which can then be used to construct specialized tests for
such situations.

Clearly, no one scaling function $g(t)$ will be optimal for all situations.
The \Co{cox.zph} function directly supports four common choices: identity,
log, rank, and 1 - Kaplan-Meier. By default, it will use the last of these,
based on the following rationale. Since the test for proportional hazards
is essentially a linear regression of the scaled residual on $g(t)$, we
would expect this test to be adversely effected if there are outliers in $x$.
We would also like the test to be only mildly (if at all) effected by the
censoring pattern of the data. The Kaplan-Meier transform appears to satisfy
both of these criteria.

\subsubsection{Robust Variance}
Robust variance calculations are based on the {\em sandwich estimate}
V = A B A' \,
where $A^{-1} = \cal I$ is the usual information matrix, and B is a
``correction term''.
The genesis of this formula can be found in Huber \cite{Huber}, who discusses
the behavior of any solution to an estimating equation
\sum_{i=1}^n \phi(x_i, \bhat) =0 \,.
Of particular interest is the case of a maximum likelihood estimate based on distribution
$f$ (so that $\phi = \partial \log f / \partial \beta$),
when in fact the data are observations from distribution $g$.
Then, under appropriate conditions, $\bhat$ is asymptotically normal with
mean $\beta$ and covariance $V= ABA'$, where
A &=& \left ( \frac{\partial E \Phi(\beta)} {\partial \beta}
\right )^{-1} \\
and $B$ is the covariance matrix for $\Phi = \sum \phi(x_i, \beta)$.
Under most situations the derivative can be moved inside the expectation,
and $A$ will be the inverse of the usual information matrix.
This formula was rediscovered by White \cite{White1} \cite{White2}
(under less general conditions I believe, but all these papers are a bit
over my head),
and is also known in the econometric literature as White's method.

Under the common case of maximum likelihood estimation we have
\sum \phi &=& \sum_{i=1}^n \frac{\partial \log f(x_i)} {\partial \beta} \\
&\equiv& \sum_{i=1}^n u_i(\beta) \,.
Then by interchanging the order of the expectation and the derivative,
$A^{-1}$ is the expected value of the information matrix, which will be
estimated by the observed information ${\cal I}$.
Since $E[u_i(\beta)]$ =0,
B = {\rm var}(\Phi) &=& E(\Phi^2) \nonumber \\
&=& \sum_{i=1}^n E[u_i'(\beta) u_i(\beta)]
+ \sum_{i \ne j} E[u_i'(\beta) u_j(\beta)] \,
where $u_i(\beta)$ is assumed to be a row vector.
If the observations are independent, then the $u_i$ will also be independent
and the cross terms in equation (\ref{U4}) above will be zero. Then a natural
estimator of $B$ is
\widehat B &=& \sum_{i=1}^n u_i'(\bhat) u_i(\bhat) \\
&=& U'U \,,
where $U$ is the matrix of {\em score residuals}, the $i$th row of $U$ equals
$u_i(\bhat)$. The column sums of $U$ are the efficient score vector $\Phi$.

As a simple example consider generalized linear models. McCullagh and
Nelder \cite{glim} maintain that overdispersion ``is the norm in practice and
nominal dispersion the exception.'' To account for overdispersion they
recommend inflating the nominal covariance matrix of the regression
coefficients $A = (X'WX)^{-1}$ by a factor
c = \sum_{i=1}^n \frac{(y_i - \mu_i)^2}{V_i} / (n-p) \, ,
where $V_i$ is the nominal variance. Smith and Heitjan \cite{SH} show that
$AB$ may be regarded as a multivariate version of this variance
adjustment factor, and that $c$ and $AB$ may be interpreted as the average
ratio of actual variance $(y_i - \mu_i)^2$ to nominal variance $V_i$. By
premultiplying by $AB$, each element of the nominal variance-covariance
matrix $A$ is adjusted differentially for departures from nominal

When the observations are not independent, the estimator $B$
must be adjusted accordingly. The ``natural'' choice
$(\sum u_i)^2$ is not available of course, since $\Phi(\bhat)=0$ by
definition. However, a reasonable estimate is available
when the correlation is confined to subgroups. In particular, assume that
the data comes from clustered sampling with $j=1, 2, \dots, k$ clusters, where
there may be correlation within cluster but observations from different
clusters are independent. Using equation (\ref{U4}), the cross-product terms
between clusters can be eliminated, and the resulting equation rearranged
{\rm var}(\Phi) = \sum_{j=1}^k \tilde u_j(\beta)' \tilde u_j(\beta) \,,
where $\tilde u_j$ is the sum of $u_i$ over all subjects in the $j$th cluster.
This leads to the {\em modified sandwich estimator}
V = A (\tilde U' \tilde U) A \,,
where the collapsed score matrix $\tilde U$ is obtained by replacement of
each cluster of rows in $U$ by the sum of those rows.
If the total number of clusters is
small, then this estimate will be sharply biased towards zero, and some other
estimate must be considered. In fact, ${\rm rank}(V) < k$, where $k$ is
the number of clusters. Asymptotic results for the modified sandwich
estimator require that the number of clusters tend to infinity.

Application of these results to the Cox model requires an expression for the
score residuals matrix $U$. Equations (\ref{U1}) and (\ref{U2}) show the
partial likelihood written in two forms, and (\ref{U3}) yet another;
which should be used as a basis for
our work? One way to proceed is to define a weighted Cox partial likelihood,
and then let
u_i(\beta) \equiv \left( \frac{\partial U}{\partial w_i}
\right)_ { w =1} \, ,
where $w$ is the vector of weights; this approach isused in Cain and Lange
\cite{Cain} to define a leverage or influence measure for Cox regression.
In particular, they derive the leverage matrix
L = U {\cal I}^{-1}\,,
where $L_{ij}$ is the approximate change in $\bhat$ when observation $i$ is
removed from the data set. Their estimate can be recognized as a form of
the {\em infinitesimal jackknife}, see for example the discussion in Efron
\cite{Efron} for the linear models case.
The same leverage estimate is derived using a slightly different argument
by Reid and Cr\'{e}peau \cite{Reid}. They mention, but do not persue, the use
of $L'L$ as a variance estimate.

Specific applications
of the sandwich and modified sandwich estimators, detailed below, have all
re-derived this result as part of their development.

In fact the connection to the jackknife is quite general. For any model
stated as an estimating equation, the Newton-Raphson iteration has step
\Delta \beta = 1'(U{\cal I}^{-1})\,,
the column sums of the matrix $L= U{\cal I}^{-1}$. At the solution $\hat \beta$
the iteration's step size is, by definition, zero. Consider the following
approximation to the jackknife
\item treat the information matrix ${\cal I}$ as fixed
\item remove observation $i$
\item beginning at the full data solution $\hat\beta$, do one Newton-Raphson
This is equivalent to removing one row from $L$, and using the new column
sum as the increment. Since the column sums of $L(\bhat)$ are zero, the
increment must be $\Delta \beta = -L_{i.}$. That is, the rows of $L$ are
an approximation to the jackknife, and the sandwich estimate of variance
$L'L$ is an approximation to the jackknife estimate of variance.

When the data are correlated, the appropriate form of the jackknife is to
leave out an entire {\em subject} at time, rather than one observation, i.e.,
the grouped jackknife. To approximate this, we leave out groups of
rows from $L$, leading to $\tilde L' \tilde L$ as the approximation to the

Lin and Wei \cite{LW} show the applicability of Huber's work to the partial
likelihood, and derive the ordinary Huber sandwich estimate
$V= {\cal I}^{-1}(U'U){\cal I}^{-1}= L'L$.
They also discuss situations in which this is estimate is preferable,
including the important cases of omitted covariates and incorrect functional
form for the covariate.
The relationship of their estimate to the leverage matrix $L$ is
not noted by the authors.

Lee, Wei and Amato \cite{LWA}
consider highly stratified data sets which arise from inter observation
correlation. As an example they use paired eye data on visual loss
due to diabetic retinopathy, where photocoagulation was randomly assigned
to one eye of each patient.
There are $n/2=1742$ clusters (patients) with 2 observations per cluster.
Treating each pair of eyes as a cluster, they derive the modified sandwich
estimate $V=\tilde L' \tilde L$, where $\tilde L$ is derived from $L$ in the
following way. $L$ will have one row, or observation, per eye.
Because of possible correlation, we want to reduce this to a leverage matrix
$\tilde L$ with one row per individual. The leverage (or row) for an individual is
simply the sum of the rows for each of their eyes. (A subject, if any, with
only one eye would retain that row of leverage data unchanged).
The resulting estimator is shown to be much more efficient than analysis
stratified by cluster.
A second example given in Lee, Wei and Amato
concerns a litter-matched experiment. In this case the number of rats/litter
may vary.

Wei, Lin and Weissfeld \cite{WLW} consider multivariate survival times. An
example is the measurement of both time to progression of disease and
time to death for a group of cancer patients.
The data set again contains $2n$ observations, time and status variables,
subject id, and covariates. It also contains an indicator variable \Co{etype} to
distinguish the event type, progression vs. survival. The suggested model
is stratified on event type, and includes all strata$\times$covariate
interaction terms. One way to do this in S is
> fit2 Ltilde newvar 0$.

Using weights appropriate to the survey sampling scheme, Binder suggests use
of the modified sandwich estimate ${\cal I}^{-1} B {\cal I}^{-1}$ where
$B = {\rm var}(U)$ ``can be estimated using design based methods'', though
he gives no specifics on what these might be. His derivation of the
score residual vectors $u_i$ differs from the above, but the same result
is obtained, and shown in the last equation of his section 3.
In a simulation study he compares the naive, sandwich, and modified sandwich
estimators, with the latter being the most reliable.

Lin \cite{Lin} also develops a weighted Cox model, in the context of tests
for proportional hazards. His estimates of the score and hence of
$\bhat$ are based on equation (\ref{bind:u}), but without redefinition of
$\bar Z$ to include weights. It is thus not related to case weights, but
rather to weighted log-rank statistics such as the Tarone-Ware family
Estimates for this model can be obtained from S in three steps; assume that
$w$ is the weight variable:
\item Use \Co{coxph} with $w$ as weights and $-\log(w)$ as an offset to
estimate Lin's weighted $\bhat$.
\item Fit a second cox model, without weights or an offset, but with the
coefficients constrained to equal the results of the first model.
(Use initial values and \Co{iter=0}.)
The \Co{coxph.detail}
function can be applied to this second model to obtain the individual
$v_i$ estimates.
\item Estimate the variance of $\bhat$ as $ABA$, where $A= (\sum w_i
v_i)^{-1}$ and $B= \sum w_i^2 v_i$.
Tests for proportional hazards are more easily accomplished, however, using the
\Co{cox.zph} function.

An exciting use of weights is presented in Pugh et al. \cite{Pugh}, for
inference with missing covariate data. Let $\pi_i$ be the probability
that none of the covariates for subject $i$ is missing, and $p_i$
be an indicator function which is 0 if any of the covariates actually
is NA, so that $E(p_i) = \pi_i$.
The usual strategy is to compute the Cox model fit over only
the complete cases, i.e., those with $p_1=1$. If information is not missing
at random, this can lead to serious bias in the estimate of $\bhat$.
A weighted analysis with weights of $p_i/ \pi_i$ will correct for this
imbalance. There is an obvious connection between this idea and survey
sampling: both reweight cases from underrepresented groups.

In practice $\pi_i$ will be unknown, and the authors suggest estimating
it using a logistic regression with $p_i$ as the dependent variable. The
covariates for the logistic regression may be some subset of the Cox
model covariates (those without missing information), as well as others.
In an example, the authors use a logistic model with follow-up time
and status as the predictors.
Let $T$ be the matrix of score residuals from the logistic model, i.e.
T_{ij} = \frac{\partial}{\partial \alpha_j}
[p_i \log \pi_i(\alpha) + (1-p_i) \log(1- \pi_i(\alpha))] ,
where $\alpha$ are the coefficients of the fitted logistic regression.
Then the estimated variance matrix for $\bhat$
is the sandwich estimator ${\cal I}^{-1} B {\cal I}^{-1}$,
B = U'U - [U'T] [T'T]^{-1} [T'U] \,.
This is equivalent to first replacing each row of $U$ with the residuals
from a regression of $U$ on $T$, and then forming the product $U'U$.
Note that if the logistic regression is completely uninformative
($\hat\pi_i = $ constant), this reduces to the ordinary sandwich estimate.

For either of the Breslow or the Efron approximations,
the extra programming to handle weights is modest.
For the Breslow method the logic behind the addition is also straightforward,
and corresponds to the derivation given above. For tied data and the
Efron approximation, the formula is based on extending the basic idea of
the approximation, $E(f(r_1, r_2, \ldots)) \approx f(E(r_1), E(r_2), \ldots)$
to include the weights, as necessary. Returning to the simple example
of section \ref{Tied} above, the second term of the partial likelihood is
\left( \frac{w_1r_1}{w_1r_1 + w_3r_3 + w_4r_4 + w_5r_5} \right )
\left( \frac{w_2r_2}{w_2r_2 + w_3r_3 + w_4r_4 + w_5r_5} \right )\,.
To compute the Efron approximation,
separately replace the numerator with $.5(w_1r_1 + w_2r_2)$ and the
denominator with $ .5w_1r_1 + .5w_2r_2 + w_3r_3 + w_4r_4 + w_5r_5$.

\subsubsection{Estimated survival}
The methods for expected survival after a Cox model parallel those for
expected survival based on a population. One can create individual
survival curves or cohort survival curves, and the latter can be in any
of the Ederer, Hakulinen, or conditional frameworks.

For individual survival,
recall that the estimated cumulative hazard for a subject with covariate
process $x(t)$ is
\Lambda(t; x) = \int_0^t e^{\bhat' x(s)} d\lhat_0(s) ',,
with either the Breslow or Efron estimator of $\lhat_0$.
The choice of the Breslow or Efron estimate should be consistent with the
option chosen to break ties when the Cox model was fit (this is the default
action of the software).
The estimated
survival function is then $\hat S(t;x) = \exp(-\lhat(t;x))$.

If the vector of coefficients $\bhat$ were treated as fixed, then the
variance of the cumulative hazard would be
V(t) = \int_0^t e^{2\bhat' x(s)} \frac {d \bar N(s)}
{(\sum Y_i(s) e^{\bhat' Z_i(s)})^2}
for the Breslow estimator, which is a natural extension of the Aalen estimate
of variance in Nelson's hazard estimator. If Efron's method were used, the
variance estimate will be slightly larger.

The actual variance for the cumulative hazard must also account for the
error in estimation of $\bhat$
Tsiatis \cite{tsiatis} and Link \cite{Link1, Link2} have derived this, and
with some rearragement their result can be written as
V_c(t) = \int_0^t \{1 + [x(s) - \zbar(s)]' {\cal I}^{-1} (x(s) - \zbar(s)]\}
dV(s) '\,,
where $V_c$ is the variance under the Cox model, $V$ is the naive variance
given above, $Z$ is the covariate set for the fitted model, $x$ is the
covariate vector for this curve,
and ${\cal I}^{-1}$ is the variance matrix for $\bhat$.
The increase in variance is largest for a covariate set that is far from
the mean.
Gail and Byar \cite{Gail} have extended this result to the estimation of
multiple curves, and show that if $x(s)$ and $x^*(s)$ were two separate
covariate vectors, then the covariance of the two estimated cumulative
hazards is
\int_0^t \{1 + [x(s) - \zbar(s)]' {\cal I}^{-1} (x^*(s) - \zbar(s)]\}
dV(s) '\,,
where $V$ has had the obvious substitution of $x(s) + x^*(s)$ in place of
$2 x(s)$.
No results are given for a stratified Cox model. Any individual curve
will be for a given covariate set $x$ and a given strata $k$, so the variance
term will be as before with $dV_k$ as a differential. The cross product
terms could presumably be integrated with respect to $(dV_k + dV_{k^*})/2$;
however, this creates extra complexity in the programming and has not yet
been implimented.

Since the formula involves the covariate average over time $\zbar(t)$ for
the {\em original study}, it cannot usually be calculated from published
summaries, e.g., to esimate the expected survival of the patient in front
of you based on a report in the literature. However, for most Cox fits
$\zbar(t)$ will vary only slightly over time, and a reasonable approximation
could be made if the report contained both the initial averages and the
confindence intervals for any particular case.

We can also write $V_c(t) = V(t) + d(t)'{\cal I}^{-1} d(t)$. Some
manipulation shows that
the vector $d(t)$ is the score residual process for a hypothetical new
subject with covariates $x$ and no events, which is a measure of the leverage
of such an observation on the estimate of $\beta$. It is intuitive that
if the covariate value(s) $x$ has small leverage on $\bhat$,
then the variance in $\bhat$ will have small effect on the curve.

The calculations for a weighted model have the same form, using the jumps
in the weighted cumulative hazard function
\lhat_0(t) = \sum_{i=1}^n \int_0^t w_i \frac{dN_i(s)}
{\sum Y_j(s)w_j \hat r_j(s)} \,.

This curve is applicable to a single patient,
and is the appropriate object to plot when considering the predicted survival
for some future patient who has some particular set of covariates.
Another use is to fit a stratified model, for example
Surv(time, status) ~ age + sex + strata(treatment)
Then a plot of the pair of curves is a comparison of the treatment groups,
adjusted to common age and sex values. This can be useful when two treatments
are unbalanced with respect to an important covariate.

It is common practice to use these curves for group survival as well.
The curve for the ``average'' patient, i.e., the curve corresponding to
a fictional subject with mean values for each covariate, is then used
as the predicted survival curve for the entire cohort. Though convenient
this procedure is incorrect. What should be done follows from exactly the
same discussion as found above in the section on expected survival for a
population matched reference. Either the Ederer, Hakulinen, or conditional
computation can be used.

One use of these cohort averages is to summarize and present the results
of a study.
This issue is reviewed in Thomsen, Keiding and Altman \cite{Thomsen},
albeit using very different terms. Center stage is given to the analog of
the Ederer estimate, referred to as the {\em
direct adjusted survival curve}, a term coined earlier by Makuch
Using a simple example, Thomsen et. al. demonstrate that the estimate based
on a single average patient will be badly biased whenever there is
large variation in the individual risk scores $\beta'x$.

A second use is for historical comparison.
With the increased availability of published regression analyses of survival
with specific diseases it has become possible to make historical comparisons
between the observed survival in a (possibly fairly small) study group and the
survival that was to be expected from a published regression analysis. One
area where this has been used is in transplantation, where randomized studies
are logistically (and perhaps ethically) impossible, so that the best answer
available to the question ``what would have happened to these patients if they
had not been transplanted?'' will be via comparison with published
pre-transplatation-therapy experience \cite{Dickson}.
In this case $\lambda_i$ will come from the older study, with the original
Cox model covariates as the matching variables. Follow-up and censoring
time will come from the new data set.

A variance estimate for the direct adjusted curve $S_d$ is derived in Gail and
Byar \cite{Gail}. Let
S_d = (1/n) \sum_{i=1}^n S_i(t)
Where $S_i$ is the individual survival curve for subject $i$ in the {\em new}
data set. This is calculated as above, using $x_i(t)$ for the new subject
but risk sets and variance based on the original Cox model fit. Then
{\rm var}(S_d) = (1/n)^2 [ \sum {\rm var}(S_i) + \sum_{i\ne j}
{\rm cov}(S_i, S_j) ].

Thomsen et al. also discuss the
conditional estimate
\exp \left[-\int_0^t \frac{\sum Y_i(s)\lambda_0(s) e^\eta_i(s)}
{\sum Y_i(s)} \,ds\right ] \,.
They conclude that the curve itself is ``not easy to interpret'' because it
mixes observed mortality, through $Y_i$, with expected mortality,
through $\lambda_i$. However, the difference in log survival curves can
be used as an estimate of excess mortality as is done in
Andersen and V\ae th \cite{Ander4}.

This author believes that extension of Hakulinen's cohort method
is the most appropriate way to combine expected curves in the Cox model.
However, I am not aware of any discussion of this in the literature.
The failure of the Ederer method, remember, occurs when there is significant
change in the enrollment criteria over the course of a study. This is of
major concern in historical reviews that span $>10$ years, as the age profile
often changes dramatically. For a Cox model, this will be an issue whenever
there is similar change in the values at entry for one of the variables
that was included in the model.

To conclude this section I must comment on a common {\em abuse} of the
Cox model expected curve. This is the comparison of the
expected curve for a study to the Kaplan-Meier of the same study as a
test for ``goodness of fit''.
This is nonsense since:
\item Some difference between the two can arise because of different
approximations, i.e., the S functions default to an estimate that is
comparable to the Fleming-Harrington method. This will differ from the
Kaplan-Meier in the tails, where $n$ is small.
\item Some simple algebra shows that the conditional estimator {\em is} the
F-H estimate of the raw data, independent of the value of $\bhat$.
\item If the censoring pattern is not independent of risk, then the Ederer
estimate will differ from the K-M because of the censoring effect, even if
the Cox model is completely correct.
\item For most data sets the value of $\zbar(t)$ varies only slowly with time.
In this case the individual survival curve for the average subject $\zbar(0)$
will also be approximately equal to the F-H estimator, independent of $\bhat$.
\item If a data set includes time-dependent covariates, the individual
survival curve for any fixed $x$ can be very surprizing.
\subsection{Parametric Regression}

\subsubsection{An IRLS formulation}
With some care, parametric survival can be written so as to fit into the
iteratively reweighted least squares formulation used in Generalized
Linear Models of McCullagh and Nelder \cite{glim}.
A detailed description of this setup for general maximum likelihood
computation is found in Green \cite{green}.

Let $y$ be the data vector, and $x_i$ be the vector of covariates for the
$ith$ observation. Assume that
$$ z_i \equiv \frac{t(y_i) - x_i'\beta}{\sigma} \sim f $$
for some distribution $f$, where $y$ may be censored and $t$ is a
transformation function.

Then the likelihood for $t(y)$ is
$$ l = \left( \prod_{exact} f(z_i)/\sigma \, \right)
\left( \prod_{right} \int_{z_i}^\infty f(u) du \, \right)
\left( \prod_{left} \int_{-\infty}^{z_i} f(u) du \,\right)
\left( \prod_{interval} \int_{z_i^*}^{z_i} f(u) du \right),
where ``exact'', ``right'', ``left'', and ``interval'' refer to uncensored,
right censored, left censored, and interval censored observations,
and $z_i^*$ is the lower endpoint of a censoring interval.
Then the log likelihood is defined as
\log(l) = \sum_{exact} g_1(z_i) - log(\sigma) + \sum_{right} g_2(z_i) +
\sum_{left} g_3(z_i) + \sum_{interval} g_4(z_i, z_i^*).
\label{ggdef} \end{equation}

Derivatives of the LL with respect to the regression parameters are:
\frac{\partial LL}{\partial \beta_j} &=&
\sum_{i=1}^n x_{ij} \frac{\partial g}{\partial \eta_i} \\
\frac{\partial^2 LL} {\partial \beta_j \beta_k} &=&
\sum x_{ij}x_{ik} \frac{\partial^2 g}{\partial \eta_i^2}\, ,
where $\eta = X'\beta$ is the vector of linear predictors.

Thus if we treat $\sigma$ as fixed, then iteration
is equivalent to IRLS with weights of $-g''$ and adjusted dependent variable
of $\eta - g'/g''$.
The Newton-Rhaphson step defines an update $\delta$
$$ (X^T DX) \delta = X^T U, $$
where $D$ is the diagonal matrix formed from $-g''$,
and $U$ is the vector $g'$.
The current estimate $\beta$ satisfies $X \beta = \eta$, so that the new
estimate $\beta + \delta$ will have
(X^T DX)(\beta + \delta) &=& X^T D \eta + X^T U \\
&=& (X^T D) (\eta + D^{-1}U)
One clever use of this within S is to return an object
that inherits from class {\em glm}. Many of the stepwise methods are
inherited: they do one step update approximations, effectively under the
assumption that the extra parameters are fixed. This is a useful and quick
first approximation for new fits.

Nevertheless, there are several differences from GLM models.
\item Glm models assume that $y$ comes from a particular distribution, while
its mean follows some link form $t(\mu) = \eta$. Here we assume that the
transformed data $t(y)$ itself follows the given distribution.
If the data were uncensored, the fit of a gaussian model with log link will
{\em not} be the same as a glm fit with log link. Survfit will assume that
the error distribution is log-normal.
\item The nice starting estimate procedure for glm models had to be modified.
The other coefs are not independent of $\sigma$, and $\sigma$ often appears
to be the most touchy variable in the iteration. We start with a naive
estimate: the variance of y, ignoring censoring.
With sigma fixed, the other variables can be estimated by doing one iteration
with $\eta = t(y)$ (for interval censored use the midpoint),
e.g., the glm trick.
Then $\sigma$ is reestimated as the variance of the
unweighted residuals.
\item The maximized value of the loglikelihood
for a right or left censored observation is 0, since by making $\eta$ sufficiently
large or small we can insure that the relevant integral is 1.
Thus there is no ``deviance correction". The deviance correction for
an interval censored datum is a nuisance for the
the asymmetric distributions, but simple for the others.

Most often, the routines will be used with $t(y) = \log(y)$, which
corresponds to the set of accelerated failure time models. But using the
above formulation, it was not very hard to make the programs completely
general. Note that $t()$
does not appear explicitly in any of the derivations.
The only difference between the following two fits is the scale of the
fitted values: in the latter they are in units of $\log(y)$.
> fit1 fit2 ovarian coxph(Surv(futime, fustat){\Twiddle} age, ovarian)
Call: coxph(formula = Surv(futime, fustat) {\Twiddle} age,

coef exp(coef) se(coef) z p
age 0.162 1.18 0.0497 3.25 0.00116

Likelihood ratio test=14.3 on 1 df, p=0.000156 n= 26

For a more complicated model, the result should probably be saved in a
temporary variable. It can then be printed multiple times, and residuals
and/or predicted values may be extracted.
> fit print(fit)
Call: coxph(formula = Surv(futime, fustat) {\Twiddle} + rx +

coef exp(coef) se(coef) z p 1.347 3.844 0.680 1.980 0.0478
rx -0.749 0.473 0.595 -1.260 0.2078 0.453 1.573 0.590 0.767 0.4431

Likelihood ratio test=6.03 on 3 df, p=0.11 n= 26

The print function is invoked automatically, so in the example above the
user could have typed ``fit'' instead of ``print(fit)''. A more complete
printout is produced by the summary function. This adds confidence intervals,
Wald and score tests,
and an $R^2$ measure based on work of Nagelkirke \cite{Nagel}. This
measure needs to be proven over time, but certainly is one of
the easier ones to implement that I've seen, and appears very well founded.
An option to the summary function can be
used to get confidence intervals at levels other than .95.

The stratified Cox model can be obtained by
using a strata directive within the fit.
> fit summary(fit)
coxph(formula = Surv(futime, fustat) {\Twiddle} age + + strata(rx),

N= 26

coef exp(coef) se(coef) z p
age 0.1385 1.149 0.048 2.885 0.00391 -0.0967 0.908 0.630 -0.154 0.87799

exp(coef) exp(-coef) lower .95 upper .95
age 1.149 0.871 1.045 1.26 0.908 1.102 0.264 3.12

Rsquare= 0.387 (max possible= 0.874 )
Likelihood ratio test= 12.7 on 2 df, p=0.00174
Efficient score test = 12.2 on 2 df, p=0.0022

After the fit is completed residuals from the fitted model can be obtained
using the \Co{resid} function. By default the martingale residuals are
produced, also available are deviance, score, and Schoenfeld residuals.
For any of these it is sufficient to give the shortest unique abbreviation
of the residual type. Two common transformations of the score residuals
can also be requested: \Co{dbeta} and \Co{dfbetas}. These are the
approximate change in the coefficient vector if observation $i$ is dropped,
and that change scaled by the variance of $\beta$.

Martingale residuals are used most often in an
assessment of functional form, score residuals play a role in assessing
influential or leverage data points as well as computation of robust
``sandwich'' variance estimators, and the Schoenfeld residuals are
useful in assessing time trends or lack of proportionality in one of
the coefficients of the model. Deviance residuals, though they have an
interesting theoretical justification, have not proven very useful in

> fit mresid dresid sresid resid(fit, "scho")
age rx
59 2.69315678 0.06761161 -0.1256239 -0.5072536
115 5.36390193 0.08039118 -0.1493686 -0.6031317
156 -0.89877404 0.10683988 -0.1985109 0.1984379
268 6.95664457 0.12857952 -0.2389036 0.2388158
329 -15.73656567 0.28889884 -0.5367805 -0.4634169
353 4.06104424 -0.70587652 0.4535120 0.5282024
365 5.50035871 0.25348266 0.4796229 -0.4413864
431 -8.06809462 0.27490178 -0.4297023 -0.5248323
464 -2.15471513 0.23158423 0.5066040 0.4814387
475 0.57065101 0.25226661 0.5518479 0.5244351
563 0.06487254 -0.47274521 0.3319974 0.2747028
638 1.64752693 -0.50593435 -0.6446946 0.2939883

The martingale and deviance residuals are each a vector of length $n$,
where $n$ is the number of subjects in the data.
The score residuals form an $n$ by $p$ matrix, with one column per regressor
variable, and are components of the first derivative of the partial
likelihood. By definition, the column sums of the score residual matrix
will be zero at $\hat\beta$. The Schoenfeld residuals have one row for
each death in the data set and $p$ columns,
the time point of the death is returned as the
row label of the matrix.
As with other models in S,
a factor variable may be expanded into multiple contrasts in the $X$ matrix
(though there are none in this example).
It will then appear as multiple columns in the score or Schoenfeld residuals
as well.

Tests for proportional hazards are based on rescaled Schoenfeld residuals,
and can be obtained with \Co{cox.zph}.
They are based on Grambsch and Therneau \cite{GTresid}, and are discussed
further in section on mathematical backround.
> temp print(temp)
rho chisq p
age -0.03989 0.02618 0.8715 -0.14168 0.24626 0.6197
rx 0.13246 0.20012 0.6546 0.48448 1.88192 0.1701
GLOBAL NA 3.36086 0.4993
> plot(temp)
> plot(temp, var=2)

The plot shows time (or a monotone function of time) on the $x$ axis and
the rescaled residuals on the $y$ axis, with one plot per covariate.
An overlaid smooth curve is an estimate of $\beta(t)$, a time-dependent
regression coefficient. Under proportional hazards we must have
$\beta(t) = \beta$, i.e., the hazard ratio does not vary with time. The
standard printout includes the correlation coefficients $\rho$ for the plots
of each variable
along with tests for $\rho=0$, and a global test of proportional hazards
based on all the covariates. (There is no applicable value of $\rho$ for
the global test).
The \Co{var} option can be used to
create only a single plot, in this case for the second variable in the
model statement: (\Co{}). This is useful when one wishes to
add a main title, a horizontal line at $y=0$, or other annotation to
the plot.

Predicted values are available based on either the linear predictor
$\eta=X'\hat\beta$, the risk for an individual relative to the average subject
within the data set $r=\exp(\eta)$, the expected number of events for an
individual over the time interval that they were observed to be at risk
(which is a component of the martingale residual), or for individual
components of the linear predictor $\eta$.

> lp risk expt term round(risk,2)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
14.43 18.9 9.71 26.49 0.38 1.14 2.16 0.44 0.54 0.93 1.21 1.18 1.71 5.21 0.42

16 17 18 19 20 21 22 23 24 25 26
1.27 0.16 1.66 0.86 0.1 0.23 0.31 0.2 0.25 0.17 0.72
> fit

An optional \Co{data} argument to the predict function
is a new data set. It allows one to obtain predicted
values for subjects who are not part of the original
study. Because it is positioned as the second argument to \Co{predict},
the examples above must explicitly include
``type ='' (or a second comma), where this was not necessary in \Co{residual}.
Another option to the
predict function is to return standard errors of the
predicted values.

\subsection{Fitted survival curves}
The survfit function is used for fitting a survival curve, either to
original data or after a Cox model or parametric model fit.
> sf summary(sf)

The above example would result in four survival curves, indexed by the two
levels of treatment and the two levels of residual disease. The right hand
side of the formula is interpreted differently than it would be for an
ordinary linear or Cox model.
Technically, the formula should have been expressed using
a * operator instead of +, since the desired result is for all
four levels or `interactions'.
We process the + symbol as though it
were an interaction, as well as producing labels that are longer (and
hopefully more readable) than the default labels generated by the *

Each of the following formulas would have produced
the same output curves, though the first has different labels.
Surv(futime, fustat) {\Twiddle} interaction(rx,
Surv(futime, fustat) {\Twiddle} strata(rx,

Another example is shown
below, for a somewhat smaller study of acute
myelogenous leukemia; the data can be found on page
49 of Miller \cite{Miller},

> amlsf sf
Call: survfit.formula(formula = Surv(time, status) {\Twiddle} group, data=aml)

n events mean se(mean) median 0.95CI 0.95CI
group=Maintained 11 7 52.6 19.83 31 18 NA
group=Nonmaintained 12 11 22.7 4.18 23 8 NA

Similarly to other S model programs, the print function gives a very
short synopsis and the summary function provides more complete information.
For survfit the default print contains only n, the total number of events,
the mean survival time and its standard error,
median survival, and confidence intervals for the
median. The mean is based on a truncated estimate, i.e., the survival
curve is assumed to go to zero just past the last follow-up time. It is
thus an under estimate of the mean when the last observation(s) is
censored. The confidence interval for the median is based on the
confidence intervals for the survival curve:
the lower and upper limits are the intersection of a horizontal line at .5
with the lower and upper confidence bands for $S(t)$.
The confidence interval will change if a different
confidence level or confidence type is specified in the \Co{survfit} call.
If the upper confidence band for $S(t)$ never reaches 0.5, as in the example
above, then the upper confidence limit for the median is unknown.

Call: survfit.formula(formula = Surv(time, status) {\Twiddle} group, data=aml)

time n.risk n.event survival std.err lower 95% CI upper 95% CI
9 11 1 0.909 0.0867 0.7541 1.000
13 10 1 0.818 0.1163 0.6192 1.000
18 8 1 0.716 0.1397 0.4884 1.000
23 7 1 0.614 0.1526 0.3769 0.999
31 5 1 0.491 0.1642 0.2549 0.946
34 4 1 0.368 0.1627 0.1549 0.875
48 2 1 0.184 0.1535 0.0359 0.944

time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 12 2 0.8333 0.1076 0.6470 1.000
8 10 2 0.6667 0.1361 0.4468 0.995
12 8 1 0.5833 0.1423 0.3616 0.941
23 6 1 0.4861 0.1481 0.2675 0.883
27 5 1 0.3889 0.1470 0.1854 0.816
30 4 1 0.2917 0.1387 0.1148 0.741
33 3 1 0.1944 0.1219 0.0569 0.664
43 2 1 0.0972 0.0919 0.0153 0.620
45 1 1 0.0000 NA NA NA

By default, the summary includes one row for each time at which a
death occurred, and for each of these times lists the number of subjects
who are at risk just prior to the event, the number of events that
occurred, the survival estimate, its standard error, and upper and lower
95\% confidence intervals calculated on the hazard scale.

Options to the survfit
routine include estimates based on the Nelson cumulative hazard estimator
instead of the Kaplan-Meier (as proposed by Fleming and Harrington),
confidence intervals based on the log hazard scale, and the level for the
confidence interval. An option to the summary routine
allows the listing to be printed for selected survival times and/or to
include censored time points within the printout.

Survival estimates for the data set as a whole can be generated by
using the null model, i.e. \Co{\Twiddle 1} as the right-hand side of the formula.
> survfit(Surv(time, status) {\Twiddle}1, aml)
It is also allowable to leave off the \Co{\Twiddle 1}; the same result will
be obtained.

The survfit function now supports case weights via the weight argument.
One obvious use of this feature is for data sets with multiplicities of the
input lines, i.e., instead of having one line of data appear three times it
could appear once with a weight of three. For instance, if weights of two
are attached to each of the 23 cases in the AML data this effectively
doubles the data set, the resultant survival is identical to the original,
but the variance of the hazard is halved. Case weights were included as an
option in the program less for this reason, however, than to facilitate
various manipulations that require fractional case weights, see for
example Turnbull \cite{Turnbull}, who uses this in an EM algorithm for
left censored data.
When \Co{survfit} is used with fractional weights, the returned variance
estimate is probably worthless.

The survfit function also can generate predicted survival curves for a Cox
model by using the resultant fit as a first argument to the function.
> attach(ovarian)
> fit survfit(fit)

This will produce two survival curves,
one for each of the two \Co{rx} strata;
each curve will be for a ``pseudo cohort" whose age and ecog performance
score are equal to the mean values for the data set that was fit.

An important aspect of the new modeling language is the (largely undocumented)
number of {\em side effects}.
If the last line above were executed without first attaching the \Co{ovarian}
data frame (perhaps in a later S session) it would fail.
This is because estimation of the survival curve
requires some data summaries that were not saved in the {\em fit} object.
In order to obtain them,
the data is reconstructed based on the call that produced the {\em fit}
object, and this reconstruction requires essentially the same environment to exist as was
originally present.
However, if the original fit had been produced using the \Co{data=ovarian}
argument to \Co{coxph}, no attachment would be necessary.
More subtle errors arise if important options differ at the later call, i.e.,
\Co{na.action} or \Co{contrasts}.
A summary of the side effects in survival models is
found later in this document.

The original version of survfit had an option for {\em risk weights}.
This option had to do with survival curves following
a Cox model with known coefficients, and is now
obtained in a more obvious way. The example below shows a fit
with risk weights, i.e. $\exp(X'\beta)$, where $X'\beta= $ .1 to 2.3.
> survfit(coxph(Surv(aml$time, aml$status) {\Twiddle} offset(1:23/10)))

This example is contrived:
normally the offset statement would contain
the result of a coxph fit, perhaps on a different set of data.
I have used this feature on two occasions in my own analyses. In each case
the second data set had a small number of patients, and we wished to
test the effect of variable ``x'' after adjusting for a set of baseline variates
known to effect survival. Though the data set was not large enough to
model and estimate all of the variables concurrently, it was sufficient when
the regression coefficients for the baseline variates were fixed at the
values from an earlier, larger study.
For instance, if the earlier study had a variable \Co{old} and a fitted
regression coefficient of 2.3, the new model would have had
\Co{offset(2.3 *old)} as a term in the model.

Another use of offsets is to model the {\em relative} mortality of a
population. This is discussed in example Andersen, et al.
using diabetes in the county of Fyn. In this model the covariates were
age at onset, sex, and the population mortality or hazard ``phaz''
which depends on age and sex and is
set up as a time dependent covariate. Then a model for
absolute mortality is
Surv(start, stop, status) ~ age + sex
and the model for relative mortality is
Surv(start, stop, status) ~ age + sex + offset(log(phaz))
As a comparison of the two models, they suggest fitting a third model
Surv(start, stop, status) ~ age + sex + log(phaz)
where a values of $\bhat_3 =0$ corresponds to absolute mortality and
$\bhat_3 =1$ to a model for the relative mortality. The fitted value of
-.39 suggests that the model for absolute mortality is preferred.

Note that when a curve is requested using only the \Co{offset} directive,
using the same data set that created the coefficients for the offset,
the resultant survival estimate is identical
to the curve based on the model's \Co{fit} component but the variance will be
deflated. The survival curve from a Cox model includes in
its variance a term that accounts for the error in $\hat\beta$, and an offset
model implicitedly states that $\beta$ is known without error.

In order to get curves for a pseudo cohort other than one centered at the
mean of the covariates, use the newdata argument to survfit. The newdata
argument is a list, data frame or matrix which contains in each row all
variables found in the right-hand side of the equation that was fit,
excluding strata terms.
Multiple rows in the new data frame will result in multiple ``cohorts" of
subjects, for example:
> fit survfit(fit, newdata=list(age=c(30,50),,2))
This will produce two survival curves, the first for an imagined cohort of
subject who were age 30 with a performance score of 2, and the second for
an age 30 group. A more complete example is given below.
(Amendment: if the model includes a strata by covariate interaction, use of
the \Co{newdata} argument currently leads to failure, including a very
nonintuitive error message. This is a bug, and should be fixed in the next

In keeping with other S models, the newdata argument can also be
the number of a data frame.
(I don't like this, but it was important to be compatible).
When the formula has only one variable on the
right-hand side of the equation and the given newdata argument is a single
then the intent of the user is ambiguous.
For instance
> fit survfit(fit, newdata=3)
Is the intention to have $x=3$ or to use the variable $x$ found in data
frame number 3?
To avoid this anomaly, be explicit by writing
\Co{survfit(fit, newdata=list(x=3))}.
(And for those who are curious, the ambiguity is, if necessary, resolved by
the following rule:
If the number is a positive
integer, this routine assumes it is a frame number, otherwise the number is
assumed to be the actual value for the covariate.)

Here is another example from the Fleming data
Note the use of \Co{x=T}.
This causes a copy of the final $X$ matrix to be saved in the \Co{fit}
object, including any transformations, dummy variables that represent factors,
When subsequent calculations are planned for a
fitted Cox model, such as residuals or survival curves, this will save
significant computation time (at the expense of a larger \Co{fit} object)
since $x$ does not have to be reconstructed.
> fit summary(fit)
coxph(formula = Surv(futime, fustat) {\Twiddle} age + + strata(rx),
data=ovarian, x=T)
N= 26

coef exp(coef) se(coef) z p
age 0.1385 1.149 0.048 2.885 0.00391 -0.0967 0.908 0.630 -0.154 0.87799

exp(coef) exp(-coef) lower .95 upper .95
age 1.149 0.871 1.045 1.26 0.908 1.102 0.264 3.12

Rsquare= 0.387 (max possible= 0.874 )
Likelihood ratio test= 12.7 on 2 df, p=0.00174
Efficient score test = 12.2 on 2 df, p=0.0022

> summary(survfit(fit))
Call: survfit.coxph(object = fit)

time n.risk n.event survival std.err lower 95% CI upper 95% CI
59 13 1 0.978 0.0266 0.9275 1
115 12 1 0.951 0.0478 0.8620 1
156 11 1 0.910 0.0760 0.7722 1
268 10 1 0.861 0.1055 0.6776 1
329 9 1 0.736 0.1525 0.4909 1
431 8 1 0.627 0.1704 0.3680 1
638 5 1 0.333 0.2296 0.0865 1

time n.risk n.event survival std.err lower 95% CI upper 95% CI
353 13 1 0.943 0.141 0.703 1
365 12 1 0.880 0.144 0.638 1
464 9 1 0.789 0.165 0.523 1
475 8 1 0.697 0.177 0.424 1
563 7 1 0.597 0.182 0.329 1

> summary(survfit(fit, list(age=c(30,70),,3))))
Call: survfit.coxph(object = fit, newdata = list(age = c(30, 70), = c( 2, 3)))

time n.risk n.event survival1 survival2
59 13 1 0.999 0.87905
115 12 1 0.999 0.74575
156 11 1 0.998 0.57399
268 10 1 0.996 0.41765
329 9 1 0.992 0.16677
431 8 1 0.988 0.06492
638 5 1 0.973 0.00161

time n.risk n.event survival1 survival2
353 13 1 0.999 0.7093
365 12 1 0.997 0.4739
464 9 1 0.994 0.2494
475 8 1 0.991 0.1207
563 7 1 0.987 0.0489

The first call to survfit asks only for a single curve at the mean of the
covariates. (The value of those means is stored in the fit object as
fit\$means.) The second call asks for curves for two hypothetical cohorts, one
has an age of 30 and a performance score of two, the second is age 70 with
a performance score of three. The printout requires some
explanation. The printouts for the two treatment strata are listed in sequence:
since the event times are different in the two strata they cannot be listed
side-by-side. The survivals for the two age * ps cohorts are listed side
by side since they are computed at the same time points.

Age is a very significant variable in the model:
survival of a subject age 70 is significantly worse
than one
age 30. The standard error and confidence intervals
are computed in the second example as they were in
the first, and are present in the returned survival
structure, but since their inclusion would be too wide for the
paper the printing routine leaves them off.

The functions for predicted survival, unfortunately, share a basic flaw
with S prediction in other models. If the model equation involves any
special functions such as \Co{ns}, \Co{poly} {\em or involves factor
variables}, then naive use of \Co{survfit} will result in incorrect answers.
In particular, for a factor it is important that \Co{newdata} has the same
number of levels as the original data, or contrasts will not be coded as
they were originally.
> fit srv srv srv srv srv options(contrasts=c("contr.treatment", "contr.poly"))
> sfit.1 _ coxph(Surv(start, stop, event){\Twiddle} (age + surgery)*transplant,
data=jasa1, method='breslow')
> sfit.2 _ coxph(Surv(start, stop, event){\Twiddle} year*transplant,
data=jasa1, method='breslow')
> sfit.3 _ coxph(Surv(start, stop, event){\Twiddle} (age + year)*transplant,
data=jasa1, method='breslow')
> sfit.4 _ coxph(Surv(start, stop, event){\Twiddle} (year + surgery)*transplant,
data=jasa1, method='breslow')
> sfit.5 _ coxph(Surv(start, stop, event){\Twiddle} (age +surgery)*transplant +year,
data=jasa1, method='breslow')
> sfit.6 _ coxph(Surv(start, stop, event){\Twiddle} age*transplant + surgery + year,
data=jasa1, method='breslow')
> summary(sfit.1)
coxph(formula = Surv(start, stop, event) {\Twiddle} (age + surgery) * transplant)

N= 172

coef exp(coef) se(coef) z p
age 0.0138 1.014 0.0181 0.763 0.446
surgery -0.5457 0.579 0.6109 -0.893 0.372
transplant 0.1181 1.125 0.3277 0.360 0.719
age:transplant 0.0348 1.035 0.0273 1.276 0.202
surgery:transplant -0.2916 0.747 0.7582 -0.385 0.701

exp(coef) exp(-coef) lower .95 upper .95
age 1.014 0.986 0.979 1.05
surgery 0.579 1.726 0.175 1.92
transplant 1.125 0.889 0.592 2.14
age:transplant 1.035 0.966 0.982 1.09
surgery:transplant 0.747 1.339 0.169 3.30

Rsquare= 0.07 (max possible= 0.969 )
Likelihood ratio test= 12.4 on 5 df, p=0.0291
Efficient score test = 12 on 5 df, p=0.0345

> sfit.2
Call: coxph(formula = Surv(start, stop, event) {\Twiddle} year * transplant)

coef exp(coef) se(coef) z p
year -0.265 0.767 0.105 -2.518 0.0118
transplant -0.282 0.754 0.514 -0.549 0.5831
year:transplant 0.136 1.146 0.141 0.967 0.3337

Likelihood ratio test=8.61 on 3 df, p=0.035 n= 172

One line of the above printout may generate confusion: \Co{N = 172}.
This is the number of {\em observations} in the data set, not the
number of subjects. There are actually 103 patients, of which 69 had a
transplant and are thus represented using 2 rows of data.

When there are time dependent covariates, the predicted survival curve
can present something of a dilemma. The usual call's result is for a
pseudo cohort whose covariates do not change--
>fit1 fit2 data survfit(sfit.1, data, individual=T)

Another useful extension is time dependent strata. The following examples
come from a study of liver transplantation. As in the heart transplant study,
there is a variable waiting time for any patient. One question of interest
is the efficacy of liver transplant, another is the utility of a particular
risk score developed at the Mayo Clinic. The variables for a given
patient are:
\item (tstart, tstop, status]: the time interval, open on the left and closed
on the right. Status is 1 if the subject died at time tstop. All times are
in days since enrollment in the study.
\item base.rsk: The risk score at study entry. This covariate was defined by
the ``Mayo model'' analysis, on an independent data set.
The actual definition involves 5 variables: .871*log(bilirubin) + .039*age + \ldots.
\item trisk: time-dependent risk score. The latest value we have for the
risk score (most patients are evaluated about once a year).
\item transplant: time-dependent transplant status.
\item tx.time: For transplanted patients, the number of days from enrollment
to transplant.

There are 83 patients, who generate approximately 600 observations in the
constructed data frame. The number of observations for a given patient
depends on the number of determinations of his/her risk score, and on whether
they were transplanted.
> attach(timedep)
> options(contrasts=c("contr.treatment", "contr.poly"))
> yy tfit1 tfit2 tfit3 tfit4 yyy coxph(yyy {\Twiddle} trisk*strata(trplant) )
In SAS or BMDP, it is possible to mimic time dependent strata by breaking a
subject into two new subjects. Because each subject's time interval
implicitly starts at 0 in these packages, there is effectively a realignment
of the data.

\subsection{Differences in survival}
There is a single function \Co{survdiff} to test for differences between 2
or more survival curves. It implements the $G^\rho$ family of Fleming and
Harrington \cite{FH2}. A single parameter $\rho$ controls the weights
given to different survival times, $\rho=0$ yields the log-rank test and
$\rho=1$ the Peto-Wilcoxon. Other values give a test that is intermediate
to these two. The default value is $\rho=0$.

The interpretation of the formula is the same as for \Co{survfit}, i.e.,
variables on the right hand side of the equation jointly break the
patients into groups.
> survdiff(Surv(time, status){\Twiddle} group, aml)

N Observed Expected (O-E)^2/E
Maintained 11 7 10.689 1.273
Nonmaintained 12 11 7.311 1.862

Chisq= 3.4 on 1 degrees of freedom, p= 0.06534

For one-sample tests see the section on expected survival.

\subsection{Competing Risks}
This running example is taken from the paper by Wei, Lin and Weissfeld (WLW)
\cite{WLW}. They use a data set on time to recurrence of bladder
cancer; a copy of the data may be obtained from statlib. A portion of
the data is shown below:

Recurrence Time
Treatment Follow-up Initial Initial ----------------
group time number size 1 2 3 4

1 0 1 1
1 1 1 3
1 4 2 1
1 7 1 1
1 10 5 1
1 10 4 1 6
1 14 1 1
1 18 1 1
1 18 1 3 5
1 18 1 1 12 16
1 23 3 3
1 23 3 3 10 15
1 23 1 1 3 16 23
1 23 1 1 3 9 21
. .
. .
. .

Code for reading in this data can be found in the Examples directory. After
reading it in, we have created the following data set

> bladder[1:20,]
id rx size number start stop event enum
1 1 1 1 3 0 1 0 1
2 1 1 1 3 1 1 0 2
3 1 1 1 3 1 1 0 3
4 1 1 1 3 1 1 0 4
5 2 1 2 1 0 4 0 1
6 2 1 2 1 4 4 0 2
7 2 1 2 1 4 4 0 3
8 2 1 2 1 4 4 0 4
9 3 1 1 1 0 7 0 1
10 3 1 1 1 7 7 0 2
11 3 1 1 1 7 7 0 3
12 3 1 1 1 7 7 0 4
13 4 1 5 1 0 10 0 1
14 4 1 5 1 10 10 0 2
15 4 1 5 1 10 10 0 3
16 4 1 5 1 10 10 0 4
17 5 1 4 1 0 6 1 1
18 5 1 4 1 6 10 0 2
19 5 1 4 1 10 10 0 3
20 5 1 4 1 10 10 0 4
. .
. .
. .

Notice that this data set has exactly 4 observations for each subject.
A second data set, bladder2, has had all of the rows with
\Co{start==stop} removed, and also has a fifth observation for some
subjects (those with follow-up after the fourth recurrence).

The model explored in WLW is easily fit by the following commands.
The key addition to the model is \Co{cluster(id)}, which asserts that
subjects with the same value of the variable \Co{id} may be correlated.
In order to compare the results directly to WLW, we wish to
look at a different set of contrasts than the S default. These are
created ``by hand''
> options(contrasts='contr.treatment')
> wfit rx cmat wvar sqrt(diag(wvar))
[1] 0.3075006 0.3639071 0.4151602 0.4896743

The same coefficients can also be obtained, as WLW do, by performing four
separate fits, and then combining the results.
> fit1 fit2 fit3 fit4 sc1 sc2 sc3 sc4 t11 t12 t44 wvar.all wvar fit2pa fit2pb afit sqrt(diag(afit$var))
[1] 0.24183067 0.05689293 0.07221747
> sqrt(diag(afit$naive.var))
[1] 0.20007253 0.04800789 0.07025758
The naive estimate of standard error is .20, the correct estimate of .24
is intermediate between the naive estimate and the linear combination
estimate. (Since this model does not include strata by covariate
interactions, I would not expect an exact match).
Further discussion on these estimators can be found in section \ref{rvar}.

\subsection{Expected Survival}

Expected survival is closely related to a standard method in
epidemiology called {\em person years}, which consists of counting the total
amount of follow-up time contributed by the subjects within any of several
strata. Person-years analysis
is accomplished within S by the \Co{pyears} function. The
main complication of this function, as opposed to
a straightforward use of \Co{tapply}, is
that a subject may contribute to several different cells of the output array
during his follow-up. For instance, if the desired output table were
treatment group * age in years, a subject with 4 years of observation would
contribute to 5 different cells of the table (4 cells if he entered the
study exactly on his birthdate). The following counts up years of
observation for the Stanford heart patients by age group and surgical

age fit plot(x1, resid(fit))
That is, the residual from a model is the same shape as the {\em input} data,
independent of reductions that may have occurred in the intervening $X$
Other actions in this last step are possible. For instance, if the na.action
had imputed numbers for the missing values, the predicted value for such
observations should perhaps be set to NA or otherwise marked as

Of the four effects, the second is dealt with by the \Co{na.action} extension to base
S. The first, the ability to choose a global action, is accomplished with
a change to \Co{model.frame.default}. If neither the particular function
call nor the data frame contained an explicit na.action, the routine now
checks for an na.action option.
To make na.omit the global default, type
> options(na.action='na.omit')
Because \Co{model.frame.default} is called by all of the S modeling functions,
this action will apply to all of them, not just survival.

In order to implement the third and fourth effects, it was necessary to
pass na.action information as part of the returned model fit. This in
turn requires the na.action routine to store something in the model frame
that is passed to the fitting function.
Because the definition of what ``should be done'' to label the printout or
to modify the residuals depends on the na.action chosen, the fitting,
print, residual, and predict routines do not try to manipulate the
passed information directly, but use it as an argument to \Co{naprint}
and \Co{naresid} methods. This allows the actions taken to be extremely
general, and more importantly, they are easily modified by the user.

\item The na.action routine optionally adds an attribute named {\em na.action}
to the data frame. The class of the attribute object should be the name of
the na action, but the contents or form of the object is unrestricted
and depends on the particular na.action.
\item The fitting routine adds this attribute to its output, as a component
named {\em na.action}.
\item The print routines for coxph, survfit, and etc. each contain a call
to the naprint method, with the passed na.action object as its argument.
This method is assumed to return a character string appropriate for inclusion
in the printout.
\item The residual and predicted value routines each pass their final result
through a call to the naresid method, whose arguments are the na.action
object and the vector or matrix of results.
The package includes a modified version of \Co{na.omit} along with the
methods \Co{naprint.omit} and \Co{naresid.omit}.

The default naprint method returns a null string, and the default naresid
method returns its second argument, this mimics the unmodified S.
This allows the survival routines to work with ``standard'' S na.action
The standard modeling functions such as \Co{lm} ignore any attributes on
the data frame, and work ``as usual'' with the modified na.omit function.
\section {Overrides}
The survival package overrides four S functions. Effects of the
model.frame.default and na.omit functions were discussed above. The na.omit change
should have no effect on other S modeling routines. The change to
model.frame.default is more global: most of the modeling functions will now
``obey'' the na.action specified in an options statement. There are two
exceptions. The tree routine does not use model.frame.default, and appears
to not be affected at all, and the gam function has side effects. I have
not tracked this down completely, but think that it is due to If a gam model has been fit, and missing values were
deleted due to a global na.action option, then further manipulations using
the result of that fit can give mysterious errors about mismatched dimensions
and such. I believe that when the follow-up functions rebuild portions of
the data frame they do so without calling model.frame.default, and thus
create mismatches with the original fit. For the present, it is best to
explicitly give the na.action in gam.

The overrides to data.frame and allow for the inclusion
of Surv and date objects in a data frame. I have not seen any untoward
effects due to this change.

\bibitem{AG} Andersen, P.K. and Gill, R.D. (1982).
Cox's regression model for counting processes: A large sample study.
{\em Ann. Stat.} {\bf 10}, 1100-20.

\bibitem{Ander2} Andersen, P.K., Borgan, O., Gill, R.D., and Keiding, N. (1991)
{\em Statistical Models Based on Counting Processes}.
Springer-Verlag, New York.

\bibitem{Ander3} Andersen, P.K., Borch-Johnsen, K., Deckert, T., Green, A.,
Hougaard, P., Keiding, N., and Kreiner, S. (1985).
A Cox regression model for the relative mortality and its application to
diabetes mellitus survival data.
{\em Biometrics} {\bf 41}, 921-32.

\bibitem{Ander4} Andersen, P.K. and V\ae th, M. (1989).
Simple parametric and non-parametric models for excess and relative
{\em Biometrics} {\bf 45}, 523-35.

\bibitem{Barlow} { Barlow, W.E. and Prentice, R.L.} (1988).
Residuals for relative risk regression.
{\em Biometrika} {\bf 75}, 65-74.

\bibitem{Berry} Berry, G. (1983)
The analysis of mortality by the subject years method.
{\em Biometrics} {\bf 39}, 173-84.

\bibitem{Binder} Binder, D.A. (1992).
Fitting Cox's proportional hazards models from survey data.
{\em Biometrika} {\bf 79}, 139-47.

\bibitem{Cain} Cain, K.C. and Lange, N.T. (1984).
Approximate case influence for the proportional hazards regression model
with censored data.
{\em Biometrics} {\bf 40}, 493-9.

\bibitem{Smodel} Chambers, J.M. and Hastie, T.J. (1992).
{\em Statistical Models in S}.
Wadsworth and Brooks/Cole, California.

\bibitem{Chen} Chen, C. and Wang, P.C. (1991).
Diagnostic plots in Cox's regression model.
{\em Biometrics} {\bf 47}, 841-50.

\bibitem{Crowley} Crowley, J. and Hu, M. (1977), Covariance analysis of
heart transplant data.
{\em J. Am. Stat. Assoc.} {\bf 72}, 27-36.

\bibitem{DKorn} Dorey, F.J. and Korn, E.L. (1987).
Effective sample sizes for confidence intervals for survival
{\em Statistics in Medicine} {\bf 6}, 679-87.

\bibitem{Ederer} Ederer, F., Axtell, L.M. and Cutler, S.J. (1961).
The relative survival rate: A statistical methodology.
{National Cancer Institute Mongraphs} {\bf 6}, 101-21.

\bibitem{Eder2} Ederer, F. and Heise, H. (1977).
Instructions to IBM 650 programmers in processing survival computations,
{\em Methodological Note No. 10, End Results Evaluation Section},
National Cancer Institute.

\bibitem{ovarian} Edmunson, J.H., Fleming, T.R., Decker, D.G.,
Malkasian, G.D., Jefferies, J.A., Webb, M.J., and Kvols, L.K. (1979),
Different Chemotherapeutic Sensitivities and Host Factors Affecting
Prognosis in Advanced Ovarian Carcinoma vs. Minimal Residual Disease.
{\em Cancer Treatment Reports} {\bf 63}, 241-47.

\bibitem{Efron} Efron, B. (19xx)
{\em The Jackknife, the bootstrap, and other resampling plans}.
SIAM CMBS-NSF Monograph {\bf 38}.

\bibitem{Escobar} Escobar, L.A. and Meeker W.Q., Jr. (1992).
Assessing influence in regression analysis with censored data.
{\em Biometrics} {\bf 48}, 507-528.

\bibitem{FH2} Fleming, T.R. and Harrington, D.P. (1981),
A class of hypothesis tests for one and two sample censored survival data.
{\em Communications in Statistics} {\bf A10(8)}, 763-94.

\bibitem{FH3} Fleming, T.R. and Harrington, D.P. (1984),
Nonparametric estimation of the survival distribution in censored data.
{\em Comm. in Statistics A} {\bf 13}, 2469-86.

\bibitem{FandH} Fleming, T.R. and Harrington, D.P. (1991),
{\em Counting Processes and Survival Analysis}, Wiley, New York.

\bibitem{Gail} Gail, M.H. and Byar, D.P. (1986).
Variance calculations for direct adjusted survival curves, with
applications to testing for no treatment effect.
{\em Biom J} {\bf 28}, 587-99.

\bibitem{Dickson} Grambsch, P.M., Dickson, E.R., Wiesner, R.H.
and Langworthy, A. (1989).
Application of the Mayo PBC survival model to liver transplant patients.
{\em Mayo Clinic Proc} {\bf 64}, 699-704.

\bibitem{GTresid} Grambsch, P. and Therneau, T.M. (1993)
Proportional hazards tests and diagnostics based on weighted residuals.
{\em Submitted}.

\bibitem{green} Green P.J. (1984).
Iteratively reweighted least squares for maximum likelihood estimation,
and some robust and resistant alternatives (with discussion).
{\em JRSSB} {\bf 46}, 149-92

\bibitem{Hak1} Hakulinen, T. (1982).
Cancer survival corrected for heterogeneity in patient withdrawal.
{\em Biometrics} {\bf 38}, 933.

\bibitem{Hak2} Hakulinen, T. and Abeywickrama, K.H. (1985).
A computer program package for relative survival analysis.
{\em Computer Programs in Biomedicine} {\bf 19}, 197-207.

\bibitem{Hak3} Hakulinen, T. (1977).
On long term relative survival rates.
{\em J. Chronic Diseases} {\bf 30}, 431-43.

\bibitem{Harrington} Harrington, D.P. and Fleming, T.R. (1982).
A class of rank test proceedures for censored survival data.
{\em Biometrika}, {\bf 69}, 553-66.

\bibitem{Hartz} Hartz, A.J., Giefer, E.E. and Hoffmann, G.G. (1983).
A comparison of two method for calculating expected mortality.
{\em Statistics in Medicine} {\bf 2}, 381-6.

\bibitem{Hartz2} Hartz, A.J., Giefer, E.E. and Hoffmann, G.G. (1984).
Letter and rejoinder on
``A comparison of two method for calculating expected mortality."
{\em Statistics in Medicine} {\bf 3}, 301-2.

\bibitem{Hartz3} Hartz, A.J., Giefer, E.E. and Hoffmann, G.G. (1985).
Letters and rejoinder on
``A comparison of two method for calculating expected mortality."
{\em Statistics in Medicine} {\bf 4}, 105-9.

\bibitem{Huber} Huber, P.J. (1967).
The behavior of maximum likelihood estimates under non-standard conditions.
{\em Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics
and Probability} {\bf 1}, 221-33.

\bibitem{Kalb} Kalbfleisch, J.D. and Prentice R.L. (1980).
{\em The Statistical Analysis of Failure Time Data}, Wiley, New York.

\bibitem{Klein} Klein, J.P. (1991).
Small sample moments of some estimators of the variance of the
Kaplan-Meier and Nelson-Aalen estimators.
{\em Scand. J. Statistics} {\bf 18}, 333-40.

\bibitem{LWA} {Lee, E.W., Wei, L.J. and Amato D.} (1992).
{\em Cox-type regression analysis for large number of small groups of
correlated failure time observations}.
In Klein, J.P and Goel, P.K. (eds), Survival Analysis, State of the
Art, 237-247, Kluwer Academic Publishers, Netherlands.

\bibitem{Lin} Lin, D.Y. (1991).
Goodness-of-fit analysis for the Cox regression model based on a
class of parameter estimators.
{\em JASA} {\bf 86}, 725-728.

\bibitem{LW} { Lin, D.Y. and Wei, L.J.} (1989).
The robust inference for the Cox Proportional Hazards Model.
{\em J. Am. Stat. Assoc.} {\bf 84}, 1074-79.

\bibitem{LWZ} { Lin, D.Y., Wei, L.J. and Ying, Z.} (1992).
Checking the Cox model with cumulative sums of martingale-based residuals.
{\em Technical Report \#111}, Dept. of Biostatistics, U. of Washington.

\bibitem{Link1} Link, C.L. (1984).
Confidence intervals for the survival function using Cox's proportional
hazard model with covariates.
{\em Biometrics} {\bf 40}, 601-10.

\bibitem{Link2} Link, C.L. (1986).
Confidence intervals for the survival function in the presence of
{\em Biometrics} {\bf 42}, 219-20.

\bibitem{Makuch} Makuch, R.W. (1982).
Adjusted survival curve estimation using covariates.
{\em J Chronic Diseases} {\bf 3}, 437-43.

\bibitem{Mantel} Mantel, N. (1966).
Evaluation of survival data and two new rank order statistics arising in
its consideration.
{\em Cancer Chemotherapy Reports} {\bf 50}, 163-6.

\bibitem{Miller} Miller, R.G. (1981), {\em Survival Analysis},
Wiley, New York.

\bibitem{glim} McCullagh, P. and Nelder, J.A. (1983).
{\em Generalized Linear Models.}
Chapman and Hall.

\bibitem{Nagel} Nagelkirke, N. (1991).
A note on a general definition of the coefficient of determination.
{\em Biometrika} {\bf 78}, 691-2.

\bibitem{Nelson} Nelson, W.B. (1969).
Hazard plotting for incomplete failure data.
{\em J. Quality Technology} {\bf 1}, 27-52.

\bibitem{Pren2} Prentice, R.L., Williams, B.J. and Peterson, A.V. (1981).
On the regression analysis of multivariate failure time data.
{\em Biometrika} {\bf 68}, 373-89.

\bibitem{Pugh} Pugh, M., Robins, J., Lipsitz, S., and Harrington, D. (1993).
Inference in the Cox proportional hazards model with missing covariate
data. In press.

\bibitem{Reid} Reid, N. and Cr\'{e}peau, H. (1985).
Influence functions for proportional hazards regression.
{\em Biometrika} {\bf 72}: 1-9.

\bibitem{Scho} Schoenfeld, D. (1980).
Chi-squared goodness-of-fit tests for the proportional hazards regression
{\em Biometrika} {\bf 67}, 145-53.

\bibitem{SH} Smith, P.J. and Hietjan, D.F. (to appear).
Testing and adjusting for overdispersion in generalized linear models.
{\em J Royal Statistical Soc, Series C}.

\bibitem{Turnbull} Turnbull, B.W. (1974)
Nonparametric estimation of a survivorship function with doubly
censored data.
{\em JASA} {\bf 69}, 169-73.

\bibitem{TGF} Therneau, T.M., Grambsch P.M. and Fleming, T.R. (1990).
Martingale based residuals for survival models.
{\em Biometrika} {\bf 77}, 147-60.

\bibitem {Thomsen} Thomsen, T.L., Keiding, N. and Altman, D.G. (1991),
A note on the calculation of expected survival, illustrated by the
survival of liver transplant patients.
{\em Statistics in Medicine} {\bf 10}, 733-8.

\bibitem{tsiatis} Tsiatis, A.A. (1981).
A large sample study of Cox's regression model.
{\em Ann. Statistics} {\bf 9}, 93-108.

\bibitem{Verheul} Verheul, H.A., Dekker, E., Bossuyt, P., Moulijn, A.C.
and Dunning A.J. (1993).
Backround mortality in clinical survival studies,
{\em Lancet} {\bf 341}, 872-5.

\bibitem{WLW} { Wei, L.J., Lin, D.Y. and Weissfeld, L.} (1989).
Regression analysis of multivariate incomplete failure time data by
modeling marginal distributions.
{\em J. Am. Stat. Assoc.} {\bf 84}, 1065-73.

\bibitem{White1} White, H. (1980).
A heteroskedasticity-consistent covariance matrix estimator and a direct
test for heteroskedasticity.
{\em Econometrica} {\bf 48}, 817-30.

\bibitem{White2} White, H. (1982).
Maximum likelihood estimation of misspecified models.
{\em Econometrika} {\bf 50} 1-25.

\bibitem{smoke} {\em The Health Benefits of Smoking Cessation} (1990).
US Department of Health and Human Services. Public Health Service,
Centers for Disease Control, Center for Chronic Disease Prevention and
Health Promotion, Office on Smoking and Health.
DHHS Publication No (CDC)90-8416.

cd ..
if test -f 'predict.doc'
echo shar: over-writing existing file "'predict.doc'"
cat 'predict.doc'
What do I need to do predictions --

linear predictor: exists
+se : X matrix
+newdata: means of old X matrix, new X matrix, new offset

risk -- same as lp

expected -- cumulative hazard for subject= baseline haz + time + risk
+se : sqrt(expected)
+new : baseline hazard function, new time, new x, means of old X,
new offset, new strata

terms -- : X matrix and the means
+se : I matrix
+new : new X matrix and the old means

survival: baseline cum haz & it's time vector, strata
+se : old x matrix, subject strata, time, risk --> weighted means matrix
+new : above + new risk

The baseline hazard is a function of risk, time, and strata. So is the
cumulative hazard. For survival, se=T will usually be desired.
# End of shell archive
exit 0