Once we have information about the data material, the next step is to get to know the data and look at the variables we will use in our analyses. Often, we do not use all of the variables in a dataset, and we may also need to merge datasets and create new variables based on the existing ones.
When we have downloaded and opened the dataset, it is wise to take some time to browse through the dataset and the variables. For this purpose, the describe command will give us an overview of the dataset, number of observations and variables.
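A minimal sketch of this first look at the data is shown below. The file name mydata.dta is hypothetical; replace it with the name of your own dataset.

```stata
* Open the dataset (hypothetical file name) and get an overview
use "mydata.dta", clear
describe

* Optional: a compact overview showing only the number of observations and variables
describe, short
```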
| Note If you want to read more about different options of reviewing your dataset and variables, check out the section on Review dataset. |
| Note We want to really urge you to structure your do-files properly right from the start of your analysis. It is highly recommended to describe in your file what you do and why you do it, in order to be able to go back and check your work. You will most likely have to re-do analyses and commands, so having a nice structure from the beginning saves you a lot of time and frustration! It is also highly recommended to name your do-file by the date and the analyses you have been doing when you save it, for example “231116_Variables”. |
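One possible way to structure a do-file is sketched below. This is only a suggestion: the file names are hypothetical, and the version number is an assumption that should match the Stata version you actually use.

```stata
* ------------------------------------------------------------
* Do-file: 231116_Variables.do
* Purpose: Review variables, create index, define analytical sample
* Data:    mydata.dta (hypothetical file name)
* ------------------------------------------------------------
version 17          // assumption: change to the Stata version you use
clear all
capture log close
log using "231116_Variables.log", replace text

use "mydata.dta", clear

* --- Step 1: Review the variables ---------------------------
* (commands go here, each with a comment on what you do and why)

log close
```

Keeping a log file next to the do-file makes it easier to go back and check exactly which output a given set of commands produced.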
We go through examples on how to go about the process of getting to know your variables down below, where we’ve already chosen the relevant variables for our analyses.
X, Y and Z variables
X
The independent variable (bullied) in our dataset is collected using the question “Have you been subjected to bullying at school in the past year?” and is a binary/categorical variable with the answers “No” and “Yes”.
Y
The outcome variable (healthissues) is an index that will be created down below for this thesis, containing several different variables. These variables are:
- headache: How often do you have a headache?
- sad: How often do you feel sad and depressed?
- afraid: How often do you feel afraid?
- poorappetite: How often do you have a poor appetite?
- change: How often do you feel like changing something about yourself?
- nervousstomach: How often do you have a nervous stomach?
- notenough: How often do you feel like you are not good enough?
- nosleep: How often do you have trouble falling asleep?
- dissatappear: How often do you feel dissatisfied with your appearance?
- uncomfortable: How often do you feel sluggish and uncomfortable?
- restless: How often do you sleep restlessly?
Z
The covariates in this thesis are:
- sex: Sex
- gradeyear: Grade year
- bornoutswe: Are you born outside of Sweden?
- famtype: Family type
- numrelocations: How many times have you moved since starting school?
Step 1: Review the variables
The first step is to review all the variables that will be included in the analysis (a total of 17), in order to get to know the data further. We use the following command for this purpose:
codebook bullied headache sad afraid poorappetite change nervousstomach notenough nosleep dissatappear uncomfortable restless sex gradeyear bornoutswe famtype numrelocations



The output is very extensive, which is why we only show three examples from it, but we can see that the independent variable bullied is binary (0=no, 1=yes) and that the variables that make up the index for the outcome variable are categorical on a five-point scale. Regarding the covariates, sex is a binary variable (0=boy, 1=girl), gradeyear is binary (1=7th grade, 2=10th grade), bornoutswe is binary (0=no, 1=yes), famtype is categorical (1=both parents, 2=alternate, 3=one parent), and numrelocations is a continuous variable.
Step 2: Create, review, and describe a new variable
We want to create a summative index of all 11 health issues (headache, sad, afraid, poorappetite, change, nervousstomach, notenough, nosleep, dissatappear, uncomfortable, restless). The new variable should be named “healthissues”.
We use the following command for this purpose:
gen healthissues = headache + sad + afraid + poorappetite + change + nervousstomach + notenough + nosleep + dissatappear + uncomfortable + restless
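Note that when variables are added with +, Stata returns a missing value if any of the terms is missing, so a respondent who skipped even one of the 11 items gets no value on healthissues. A minimal sketch for checking how many respondents this affects is shown below; nmiss_health is a hypothetical helper variable introduced only for this illustration.

```stata
* Count, per respondent, how many of the 11 items are missing
egen nmiss_health = rowmiss(headache sad afraid poorappetite change nervousstomach notenough nosleep dissatappear uncomfortable restless)
tab nmiss_health

* Number of respondents without a value on the index
count if missing(healthissues)
```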
Then we need to add a label for the variable healthissues. We use the following command for this purpose:
label variable healthissues "Health Issues"
Then we review the healthissues variable. We use the following command for this purpose:
codebook healthissues

From the output in Stata, we can see that the variable healthissues is continuous. We obtain descriptive statistics for the healthissues variable using the following commands:
histogram healthissues, frequency normal discrete

tabstat healthissues, stats(min max mean median sd skewness)

Now we can look at the distribution of healthissues. The mean and median values are quite similar. Visually, the histogram shows a roughly normal distribution, although slightly skewed. The skewness value is below 0.5, which together with the above suggests that we can treat healthissues as approximately normally distributed. Here we can also reflect on the values of the variable: what do high and low values of healthissues mean?
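As a complement, the same measures can be obtained with the summarize command, sketched here under the assumption that healthissues has been created as described above:

```stata
* Detailed summary: percentiles, mean, standard deviation, skewness and kurtosis
summarize healthissues, detail

* The skewness value is stored after summarize, detail and can be displayed on its own
display r(skewness)
```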
| Note For extra review, please re-visit the section on Descriptive analysis. |
Step 3: Define the analytical sample
Next, we create a variable that identifies individuals with valid values on all 7 variables that we will actually use in the analyses: healthissues, bullied, gradeyear, sex, bornoutswe, famtype, numrelocations. We name the variable pop and use the following command for this purpose:
gen pop=1 if healthissues !=. & bullied !=. & gradeyear !=. & sex !=. & bornoutswe !=. & famtype !=. & numrelocations !=.
Then we review the pop variable using the following command:
codebook pop

We can also look at the pop variable with this command:
tab pop
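Because pop is coded 1 for individuals with valid values and left missing for everyone else, a plain tab only lists the 1s. A minimal sketch that also shows the excluded individuals:

```stata
* Include missing values in the table to see both included and excluded individuals
tab pop, missing

* Count the individuals in the analytical sample
count if pop == 1
```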

From these commands, we can see how many individuals are in our analytical sample (9,023 individuals). The total sample consisted of 12,540 individuals. This means that 72 percent of the total sample is retained (9,023/12,540≈0.72) and the remaining 28 percent are considered internal dropouts. It may be appropriate here to look at the missing values: how the missing data are distributed and whether this is problematic for our analyses and the interpretation of results. If you want to read more about this, see Missing data. In our example, we do not explore the distribution of missing data further.
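The shares can be checked directly in Stata, and the missing values per variable can be inspected in a simple way. The sketch below is only a starting point and does not replace a proper missing data analysis.

```stata
* Share of the total sample retained in the analytical sample (about 0.72)
display 9023/12540

* Share lost to internal dropout (about 0.28)
display 1 - 9023/12540

* Number of missing values for each variable used in the analyses
misstable summarize healthissues bullied gradeyear sex bornoutswe famtype numrelocations
```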
| Note For extra review on creating a pop variable, please re-visit the section on The “pop” variable. |
Step 4: Generate descriptive statistics and create a box plot
| Note Remember that the descriptive statistics and the box plot should only be based on those with a value of 1 for the pop variable. |
The descriptive statistics may feel a bit unnecessary when we have already reviewed the variables with the codebook command. However, the analytical sample must be described later in the results section. Therefore, we give examples below of how to produce descriptive statistics for your analytical sample. It may also be relevant to describe the independent variables and covariates in relation to the outcome variable, which is common in scientific studies but not always necessary. More information about how to present descriptive statistics can be found under Designing descriptive tables and graphs.
First we start with the categorical variables, where we want to describe the distribution between the categories for each variable.
codebook bullied sex gradeyear bornoutswe famtype if pop==1


Here are two examples of what the output looks like. In the output in Stata, we can note the number of observations for the categories in each variable, and then calculate the percentage in each category.
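If you prefer not to calculate the percentages by hand, one alternative is to request one-way frequency tables, which report both counts and percentages. A minimal sketch:

```stata
* One-way frequency tables (counts and percentages) for each categorical variable,
* restricted to the analytical sample
tab1 bullied sex gradeyear bornoutswe famtype if pop==1
```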
Next we perform the descriptive analysis for our continuous variables and choose which measurements we want to know about the variables.
tabstat healthissues numrelocations if pop==1, stats(min max mean median sd)

In the output in Stata, we can note the descriptives for each variable, to later on create the table on descriptive statistics of all study variables. When we’ve seen the distribution between categories for categorical variables and the distribution for the continuous variables, it may also be appropriate to reflect on the group size for the categories. Could there be any potential problems with small sample size when performing the analyses? If so, we might have to modify our analytical model, or at least be aware of potential problems that may occur. In this example, we will not make any modifications.
Now we also want to create a box plot with bullied as the grouping variable and healthissues as the test variable, to visually describe the two groups. This box plot is just an initial exploration to see if there might be something interesting here to begin with, before going into investigating the association using regression analyses.
We use the following command for this purpose:
graph box healthissues if pop == 1, by(bullied)

The graph shows the distribution of healthissues in the group exposed to bullying and in the group not exposed. The distribution is shifted somewhat higher for those exposed to bullying. This indicates that the group exposed to bullying may have more health issues.
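To put numbers on what the box plot shows, the same comparison can be sketched with tabstat, split by bullied. This is only a descriptive complement to the graph, not a test of the association.

```stata
* Descriptive statistics for healthissues in each group of bullied,
* restricted to the analytical sample
tabstat healthissues if pop==1, by(bullied) stats(n mean median sd min max)
```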
There are some outliers, but they are relatively few and close to the minimum and maximum values on the graph. We can therefore assume that outliers do not influence the sample too much.
Now that the analytical sample is defined and descriptive statistics have been produced for the variables, we are ready to perform the regression analyses.
A note on writing the section on variables
When you write the part on variables, it is important to be clear about what the variables represent (for example, answers from a questionnaire), which measurement scale the variables have, and what categories are included in each variable. This information is often best presented in both text and tables. Include both a description of the study variables and a motivation for how they were operationalised. As with everything else, try to make it as clear as possible!
If you want to go further into the process of how you created new variables, categorized continuous variables, or made some other changes to the original data, discuss with your supervisor whether it is appropriate to add this in the text or as an appendix in your thesis.