Step 2: Examine the variables
A good starting point is to examine each variable that is relevant to your research question(s). This might seem straightforward when you are working with example data, but in a thesis you may be presented with large datasets from which you will only use a handful of variables. Further, you will probably produce several new variables based on the ones present in the dataset. It is therefore very helpful to examine and understand how the existing variables are built before creating new ones. Be thorough – you will thank yourself later.
X, Y and Z variables
X
The independent variable is depression: Postpartum depression in the mother (within 3 months of the birth of the child)
Y
The dependent variable is psyc: Psychiatric problems in the child (1997–2012; 2–17 years)
Z
The covariates are:
- mothereduc: Mother’s level of education (at birth of child; 1995)
- famtype: Family type (at birth of child)
- sex: Sex of the child
- support: Mother’s level of social support (during the child’s first year)
The psycdate_str variable is used to create date variables. The id variable is also in the dataset, but will not be included in the analysis.
To investigate the variables you can use the codebook command for all the variables. You can exclude the id variable.
codebook psyc psycdate_str depression mothereduc famtype sex support
The output will provide information on scale type, which is useful to know when you start working with your variables. You can read more about this in the chapter on measurement scales, but as you can see in the output below, the psyc variable says “Numeric (float)” under type. You can also see that the variable has two unique values, 0=No and 1=Yes. This tells us that the variable is nominal (categorical, where the values cannot be ranked) and binary (only two categories).

Another example is the support variable. This is also a categorical variable, but since it has three categories (1=Low; 2=Medium; 3=High) that can be ranked, the scale is ordinal.

The output in Stata shows us information on all of the variables, although only two examples are shown here. We see that depression has two categories (0=No; 1=Yes), mothereduc has four categories (1=Compulsory school; 2=High school; 3=Higher education; 4=Postgraduate degree), famtype has two categories (0=Parents living together; 1=Single mother), and sex has two categories (1=Boy; 2=Girl).
You can also use the compact option, to reduce and compress the amount of output:
codebook id psyc psycdate_str depression mothereduc famtype sex support, compact
More information: help codebook
To produce appropriate descriptive statistics for all the categorical variables (psyc, depression, mothereduc, famtype, sex, and support) you can use the tab1 command. You can alternatively use the tab command to examine the variables one by one. Two of the outputs are provided below as examples.
tab1 psyc depression mothereduc famtype sex support
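If you prefer to go through the variables one at a time, the tab approach mentioned above could look roughly like the sketch below. The missing and nolabel options are additions for illustration and are not part of the original walkthrough.

```stata
* One-way frequency table for a single variable
tab psyc

* Same idea, but also listing missing values as their own row
tab mothereduc, missing

* Show the underlying numeric codes instead of the value labels
tab support, nolabel
```

Looking at the numeric codes can be useful later on, since conditions such as psyc==1 refer to the codes and not to the labels.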


For the psyc variable you can see that 93% of the study population never had psychiatric problems in the years 1997–2012. From the mothereduc variable we can see that the most common level of completed education was high school (40%), while higher education is not far behind.
Step 3: Format a date variable
Once you have familiarized yourself more with the data, it is time to prepare for the analyses. Formatting the data properly is crucial to conducting an accurate and meaningful Cox regression analysis. The next step is to format the variable “psycdate_str” as a date variable.
We will generate three variables that specify year, month, and day, respectively, and then combine them into one date variable.
browse psycdate_str
All three variables are string variables. To make things smoother, we will transform them into numeric variables, using real.
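The commands for these two steps are not shown here, so the following is only a sketch of how the splitting and the conversion with real might be done. It assumes that psycdate_str is stored as "YYYYMMDD" (for example "19990315"); check the layout with browse first and adjust the substr positions if it differs. The names ending in _str are hypothetical helper variables, while psycyear, psycmonth, and psycday are the names used in the mdy command in this step.

```stata
* Assumption: psycdate_str looks like "YYYYMMDD"; adjust positions otherwise
* Split the string into year, month, and day (hypothetical helper names)
gen psycyear_str  = substr(psycdate_str, 1, 4)
gen psycmonth_str = substr(psycdate_str, 5, 2)
gen psycday_str   = substr(psycdate_str, 7, 2)

* Convert the string pieces into numeric variables
gen psycyear  = real(psycyear_str)
gen psycmonth = real(psycmonth_str)
gen psycday   = real(psycday_str)
```

Note that real returns a missing value for any string that cannot be read as a number, so rows with an empty or malformed date string will end up with missing year, month, and day.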
The next step is to generate the date variable and to format it so that it is displayed as a readable date in Stata.
gen psycdate=mdy(psycmonth,psycday,psycyear)
format psycdate %td
Note: For more review, revisit Date variables.
Step 4: Create a population variable – the “pop” variable
Once you have decided on, created, and recoded all the variables that you will include in your analysis, it is time to define your analytical sample. When creating your “pop” variable it is important to remember to include the variables you are actually going to use, not all the variables in your dataset.
The next step in this example is to create a “pop” variable for the variables psyc, depression, mothereduc, famtype, sex, and support.
gen pop=1 if psyc!=. & depression!=. & mothereduc!=. & famtype!=. & sex!=. & support!=.
tab1 pop psyc depression mothereduc famtype sex support, m
8,453 individuals are included in the analytical sample. These individuals represent 99.45% of the original sample, which means that there was a dropout of 0.55%. Once you have this information, you should be able to discuss whether the dropout is problematic or not. In this case, the dropout is very low, so it is unlikely to be a problem. Usually, especially with survey data, attrition or non-response will be higher.
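The percentages can be read from the tab1 output, but they can also be computed directly. The following is a minimal sketch, assuming that the pop variable has been created as above; the local macro names total and included are made up for this example.

```stata
* Total number of individuals in the dataset
count
local total = r(N)

* Number of individuals with complete data on all analysis variables
count if pop==1
local included = r(N)

* Share included and share lost, in percent
display "Included: " %5.2f 100*`included'/`total' "%"
display "Dropout:  " %5.2f 100*(`total'-`included')/`total' "%"
```

Run these lines together (for example from a do-file), since local macros are only available within the run in which they are defined.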
Note: For more review, revisit The “pop” variable.
Step 5: Create date variables
Create the following date variables:
- faildate
- origin
- enter
- exit
These variables collectively structure the time-to-event data in survival analysis, defining the time duration, the starting point, the entry and exit times for individuals, and the occurrence of events or censoring. They are crucial in conducting survival analysis and understanding the timing and occurrence of events within a study or observation period.
gen faildate=psycdate
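The commands for the remaining three variables are not shown here. A possible sketch is given below: the origin and enter dates are the ones reported for this sample in Step 6 (30jun1995 and 01jan1997), whereas the exit date of 31dec2012 is an assumption based on the 1997–2012 follow-up period.

```stata
* Origin: the date from which analysis time is counted (reported in Step 6)
gen origin = mdy(6, 30, 1995)

* Enter: the date at which follow-up starts (reported in Step 6)
gen enter = mdy(1, 1, 1997)

* Assumption: follow-up ends on the last day of 2012
gen exit = mdy(12, 31, 2012)

* Display all four variables as readable dates
format faildate origin enter exit %td
```

This sketch also assumes that faildate is non-missing for everyone. If children without an event have a missing psycdate, they would probably need a censoring date in faildate (such as the exit date) before the data are declared as time-to-event data.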
Step 6: Declare time-to-event data
Declare that the dataset is time-to-event data using the stset command.
Note: Limit this to those with a value of 1 for the pop variable.
stset faildate if pop==1, failure(psyc==1) enter(time enter) exit(time exit) origin(time origin) scale(365.25) id(id)
list faildate origin enter exit _st _d _t _t0 in 1/10

Note:
- _st is an indicator of whether the individual is included in the stset
- _d is an indicator of event (=1 if the event has occurred)
- _t is the analysis time when the observational period ends
- _t0 is the analysis time when the observational period starts
All individuals in the sample enter on the same date (enter=01jan1997) and they have the same origin date (origin=30jun1995). This means that the age at the beginning of the follow-up period is the same for everyone in the sample (_t0=1.5085558).
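The value of _t0 follows directly from the two dates and the scale(365.25) option: it is the number of days between origin and enter, divided by 365.25. As a quick check, the calculation can be sketched like this, using the two dates given above:

```stata
* Days between origin (30jun1995) and enter (01jan1997), rescaled to years
display %9.7f (mdy(1,1,1997) - mdy(6,30,1995)) / 365.25
```

There are 551 days between the two dates, and 551/365.25 gives 1.5085558, which matches the _t0 value in the output.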