Substring

If you need to extract a portion (a substring) from a string variable, you can use the substr() function.

The authors of the guide can happily reveal that they have applied this a lot when working with ICD codes (classification system for diagnoses).

Function

Basic command
gen newvarname=substr(oldvarname,start,length)
Explanations
newvarname: Insert the name of the new variable (containing the substring).
oldvarname: Insert the name of the old variable (the original string variable).
substr: Extract a portion of the string variable.
start: Specify the position of the first character of the substring (the first character of the string is position 1).
length: Specify the length of the substring.
More information
help substr
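
To get a feel for how start and length behave, substr() can be tried on literal strings with display before applying it to a variable. The lines below are a minimal sketch: the value "I251" is a made-up example in the style of an ICD-10 code and the date string is hypothetical; neither is taken from the dataset.

```stata
* Illustrative values only: "I251" and "20150312" are hypothetical examples

* First three characters (start 1, length 3): I25
display substr("I251", 1, 3)

* Fourth character only (start 4, length 1): 1
display substr("I251", 4, 1)

* A negative start counts from the end of the string: 12
display substr("20150312", -2, 2)

* A missing length (.) returns everything from start to the end: 0312
display substr("20150312", 5, .)
```

The last two forms can be handy when the part you need sits at the end of a string whose total length varies.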

Practical example

Dataset
StataData1.dta
Variable name: cvd_date_str
Variable label: Date of out-patient care due to CVD (Ages 41-50, Years 2011-2020)
Value labels: N/A

We have a string variable called cvd_date_str that contains the date of out-patient care due to cardiovascular disease (CVD), coded as YYYYMMDD. Suppose that we want to extract the year (YYYY), month (MM), and day (DD) into separate variables.

gen cvd_year_str= substr(cvd_date_str,1,4)
gen cvd_month_str= substr(cvd_date_str,5,2)
gen cvd_day_str= substr(cvd_date_str,7,2)

As the commands above show, for year we specify 1 as the position of the first character of the substring and 4 as its length. For month, we specify 5 and 2. Finally, for day, we specify 7 and 2.

browse cvd_date_str cvd_year_str cvd_month_str cvd_day_str
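
Besides inspecting the result in the data browser, the extraction can be checked with assert. The sketch below uses a hypothetical three-row dataset typed in with input (not StataData1.dta), and wraps everything in preserve and restore so that any data already in memory is left untouched. It assumes every date is a complete eight-character string; in real data, empty or shorter values would make the second assert fail and would need to be handled first.

```stata
* Sketch only: a hypothetical toy dataset, not StataData1.dta
preserve
clear
input str8 cvd_date_str
"20110105"
"20151231"
"20200617"
end

* Same extraction as in the practical example
gen cvd_year_str  = substr(cvd_date_str, 1, 4)
gen cvd_month_str = substr(cvd_date_str, 5, 2)
gen cvd_day_str   = substr(cvd_date_str, 7, 2)

* Gluing the three parts back together should rebuild the original string
assert cvd_year_str + cvd_month_str + cvd_day_str == cvd_date_str

* Every date in the toy data should be exactly eight characters long
assert strlen(cvd_date_str) == 8

list
restore
```

If both assert lines run without an error message, the positions and lengths cover the whole string with no gaps or overlaps.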

Let us also add some labels for these new variables.

label variable cvd_year_str "Year of out-patient care due to CVD (Ages 41-50, Years 2011-2020)"
label variable cvd_month_str "Month of out-patient care due to CVD (Ages 41-50, Years 2011-2020)"
label variable cvd_day_str "Day of out-patient care due to CVD (Ages 41-50, Years 2011-2020)"