Declare that the data are time-to-event data

Written by:

Ylva B Almquist

Before we can analyse time-to-event data (a.k.a. survival-time data), we need to declare the data as such to Stata. This is a bit complicated, but take a deep breath! We need one key variable, namely the variable that indicates the event. We also need to create some time variables, which should be in date format. It is also good practice to include an identification variable.

Key variables

`failure`	Indicates whether the individual has experienced the event or not. For example: event is coded as 1 and no event as 0.
`event`	The date that the individual experiences the event.
`origin`	The date that defines when the time is zero. This is optional (but we think it makes sense to give the same date here as for enter). However, if we want attained age as the timescale, origin is specified as the date of birth.
`enter`	The date that the individual becomes at risk, i.e. enters the observational period. For example, if you have a follow-up period, enter is specified as the start date of that period.
`exit`	The date that the individual exits the study, i.e. the latest time that the individual is at risk. This is optional (the default is that the individual is removed after the event has occurred). However, it is useful if you want to specify the end date of the follow-up period.
`id`	Identification (id) number.

Note
We do not actually need to name the variables failure, event, origin, enter, exit, or id. This is our choice.

The next step is to use stset to declare that the data are time-to-event data.

We will start from the following command structure:

stset event, failure(failure==1) enter(time enter) exit(time exit) origin(time origin) scale(365.25) id(id)

Note
The scale option transforms the observational time from days (default) to years.
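
To see the command structure at work, here is a minimal sketch on invented toy data. The three individuals, their dates, and the helper variable event_s are assumptions made up for illustration; they are not part of StataData1.dta. Note that clear drops whatever data are in memory, so run this separately from your working dataset.

```stata
* Toy data: three hypothetical individuals (all values invented)
clear
input id failure str9 event_s
1 1 "15mar2015"
2 0 "31dec2020"
3 1 "02aug2018"
end

* Convert the string to a Stata date and build the time variables
gen event  = date(event_s, "DMY")
gen enter  = mdy(1,1,2011)
gen exit   = mdy(12,31,2020)
* Here time zero is the start of follow-up
gen origin = enter
format %td event enter exit origin

stset event, failure(failure==1) enter(time enter) exit(time exit) origin(time origin) scale(365.25) id(id)

* Inspect what stset produced
list id event failure _t0 _t _d _st
```

Because origin equals enter in this sketch, _t0 is 0 for everyone and _t is the number of years since the start of follow-up.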

More information
help stset

Practical example

Dataset

StataData1.dta

Variable name	cvd
Variable label	Out-patient care due to CVD (Ages 41-50, Year 2011-2020)
Value labels	0=No 1=Yes

Variable name	cvd_date_str
Variable label	Date of out-of-patient care due to CVD (Ages 41-50, Years 2011-2020)
Value labels	N/A

Failure

In this example, we will focus on the variable cvd, which measures the occurrence of out-patient care due to cardiovascular disease (CVD). It looks like this:

tab cvd

Event

Connected to this variable, we have a variable (cvd_date_str) reflecting year, month, and day of the individual’s first out-patient care event due to CVD. This variable is currently a string variable which we need to transform into a time variable through a series of steps, earlier described in this guide. We will just quickly repeat these steps here (see Substring & Date variables for more details).

Note
If your dataset already contains this time variable (i.e. if you are using a saved dataset where you already performed the practical exercises in Substring & Date variables), you should not perform these commands again.

gen cvd_year_str= substr(cvd_date_str,1,4)

gen cvd_month_str= substr(cvd_date_str,5,2)

gen cvd_day_str= substr(cvd_date_str,7,2)

gen cvd_year=real(cvd_year_str)

gen cvd_month=real(cvd_month_str)

gen cvd_day=real(cvd_day_str)

gen cvd_date=mdy(cvd_month,cvd_day,cvd_year)

format %d cvd_date
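
As a possible shortcut, Stata's date() function can convert such a string in one step. The sketch below assumes the string is stored as YYYYMMDD without separators (which is what the substr() positions above imply) and uses invented toy values; cvd_date_alt is a hypothetical name. It runs on toy data after clear, so do not run it on top of your working dataset.

```stata
* Toy data: two hypothetical event dates and one individual without an event
clear
input str8 cvd_date_str
"20190311"
"20120717"
""
end

* One-step conversion; "YMD" tells date() the order of the components
gen cvd_date_alt = date(cvd_date_str, "YMD")
format %td cvd_date_alt

* The substring route used in this guide, written compactly
gen cvd_date = mdy(real(substr(cvd_date_str,5,2)), real(substr(cvd_date_str,7,2)), real(substr(cvd_date_str,1,4)))
format %td cvd_date

* Both routes should agree, including the missing value for the empty string
assert cvd_date_alt == cvd_date
list
```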

This is what the cvd_date variable looks like for the first 100 individuals (sorted by id).

tab cvd_date in 1/100

But we need to do one more step (should be performed even if you carried out the previous steps in Substring & Date variables): impose the censoring date for the individuals that do not have an event. As will be explained later, we will censor the individuals at the end of follow-up (December 31, 2020). We will create a new variable for this purpose. This is actually the one that we will use in the analysis.

gen cvd_faildate=cvd_date

replace cvd_faildate=mdy(12,31,2020) if cvd_faildate==.

format %d cvd_faildate
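
The same censoring logic can also be written in a single line with cond(), followed by a couple of checks. This is only a sketch on invented toy data (two hypothetical individuals), not a replacement for the commands above; clear drops the data in memory, so run it separately.

```stata
* Toy data: one individual with an event, one without
clear
input id cvd
1 1
2 0
end
gen cvd_date = mdy(3,11,2019) if cvd == 1

* Use the event date when there is one, otherwise the end of follow-up
gen cvd_faildate = cond(missing(cvd_date), mdy(12,31,2020), cvd_date)
format %td cvd_date cvd_faildate

* Checks: nobody lacks a date, and censored individuals carry 31dec2020
assert !missing(cvd_faildate)
assert cvd_faildate == mdy(12,31,2020) if cvd == 0
list
```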

Origin

In this dataset, the individuals are born in 1970. We do not have any detailed information about birth date, so we will specify the date as being in the middle of the year (June 30, 1970). After this, we format cvd_origin to be a date variable.

gen cvd_origin=mdy(6,30,1970)

format %d cvd_origin

Enter

The follow-up of out-patient care due to CVD starts on January 1, 2011:

gen cvd_enter=mdy(1,1,2011)

format %d cvd_enter

Exit

The follow-up period ends on December 31, 2020. This will be the exit date for the individuals that do not experience the event.

gen cvd_exit=mdy(12,31,2020)

format %d cvd_exit

For the individuals that do experience the event, the exit date will be replaced with the failure date (i.e. the date that they experience the event):

replace cvd_exit=cvd_faildate if cvd==1

Note
In the current example, we will keep it simple and just assume that all individuals stayed alive and did not drop out during the follow-up period.

Stset

Now, we have what we need to stset the data:

stset cvd_faildate, failure(cvd==1) enter(time cvd_enter) exit(time cvd_exit) origin(time cvd_origin) scale(365.25) id(id)

There are four variables that are created when we use stset.

Variables created by stset
`_t0`	Analysis time when observational period starts
`_t`	Analysis time when observational period ends
`_d`	Indicator of event (=1 if event has occurred)
`_st`	Indicator of whether the individual is included in the `stset`

Let us have a look at the variables we used for stset, including the new ones created. We will display this only for the first 10 individuals in the dataset.

list cvd_faildate cvd_origin cvd_enter cvd_exit _st _d _t _t0 in 1/10

All individuals enter on the same date (cvd_enter=01jan2011) and they have the same origin date (cvd_origin=30jun1970). This means that the age at the beginning of the follow-up period is the same for everyone (_t0=40.506502).

We can see that the 5th and 7th individuals have the value 1 for the variable _d. In other words, they have experienced the event (out-patient care due to CVD). The corresponding date is shown in cvd_faildate (11mar2019 and 17jul2012, respectively), and the corresponding age at the event is shown in _t (48.695414 and 42.047912, respectively). For the remaining individuals, cvd_faildate is set to 31dec2020 and _t is 50.505133.
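
The ages in the listing follow directly from the dates: analysis time is the number of days between the date and the origin, divided by the scale. As a small worked check (a sketch that needs no data in memory), the four values quoted above can be reproduced with display:

```stata
* Analysis time = (date - origin) / 365.25, i.e. age in years

* _t0: age at entry on 01jan2011 (40.506502)
display %9.6f (mdy(1,1,2011) - mdy(6,30,1970)) / 365.25

* _t for the event on 11mar2019 (48.695414)
display %9.6f (mdy(3,11,2019) - mdy(6,30,1970)) / 365.25

* _t for the event on 17jul2012 (42.047912)
display %9.6f (mdy(7,17,2012) - mdy(6,30,1970)) / 365.25

* _t for individuals censored on 31dec2020 (50.505133)
display %9.6f (mdy(12,31,2020) - mdy(6,30,1970)) / 365.25
```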

This is the output we got when we executed the stset command:

Here, we can see that we have 10,000 individuals, of which 518 have experienced the event (out-patient care due to CVD). We also have a total of 97,394 years at risk and under observation. The “earliest observed entry t” is 40.5, reflecting age at enter. The “last observed exit t” is 50.5, which reflects age at exit.
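
The totals in the stset output can be cross-checked by hand, since each individual contributes _t minus _t0 years at risk. The sketch below illustrates this on invented toy data with three hypothetical individuals, so its totals will not match the 97,394 years above; risktime and fail_s are made-up names. On the real data, the last four commands could be run as they are once the stset is active.

```stata
* Toy data: two events and one censored individual (hypothetical)
clear
input id cvd str9 fail_s
1 1 "11mar2019"
2 1 "17jul2012"
3 0 "31dec2020"
end
gen cvd_faildate = date(fail_s, "DMY")
gen cvd_origin   = mdy(6,30,1970)
gen cvd_enter    = mdy(1,1,2011)
gen cvd_exit     = cvd_faildate
format %td cvd_faildate cvd_origin cvd_enter cvd_exit

stset cvd_faildate, failure(cvd==1) enter(time cvd_enter) exit(time cvd_exit) origin(time cvd_origin) scale(365.25) id(id)

* Years at risk per individual; the sum is the total analysis time
gen double risktime = _t - _t0 if _st == 1
summarize risktime
display "Total years at risk: " %9.3f r(sum)

* Number of events
count if _d == 1
```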

Want to unset the data?

To remove the st markers from the dataset, just type:

stset, clear

Want to do a different stset?

It is not uncommon to run several Cox regressions with different outcomes on the same dataset. In that case, you should create a set of time variables for each outcome (some variables can often be reused, e.g. origin and enter). Just make sure that you have the right stset active before you carry out the analysis. To check the current status, you can type:

st

Note
You do not have to unset the data before doing another stset.
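
As an illustration of re-declaring the data, the sketch below issues two stset commands back to back on invented toy data (two hypothetical individuals). For simplicity it switches the timescale rather than the outcome: first attained age, then time since entry. The second declaration simply replaces the first, and typing stset without arguments redisplays the settings currently in force.

```stata
* Toy data: one event and one censored individual (hypothetical)
clear
input id cvd str9 fail_s
1 1 "11mar2019"
2 0 "31dec2020"
end
gen cvd_faildate = date(fail_s, "DMY")
gen cvd_origin   = mdy(6,30,1970)
gen cvd_enter    = mdy(1,1,2011)
format %td cvd_faildate cvd_origin cvd_enter

* First declaration: attained age as the timescale
stset cvd_faildate, failure(cvd==1) enter(time cvd_enter) origin(time cvd_origin) scale(365.25) id(id)
summarize _t0 _t

* Second declaration, issued directly on top of the first: years since entry
stset cvd_faildate, failure(cvd==1) enter(time cvd_enter) origin(time cvd_enter) scale(365.25) id(id)
summarize _t0 _t

* Redisplay the settings that are currently active
stset
```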