From study sample to analytical sample

This section connects the two previous sections. We often split an analysis into several steps or models. Different models include different sets of variables, and different variables have different amounts of missing data. The number of individuals may therefore vary across models, which makes the results difficult to compare. To avoid this, we should ensure that all our analyses, and all steps of each analysis, are based on the same individuals. These individuals make up our analytical sample (or effective sample). Put differently: our analytical sample consists of only those individuals who have valid information (i.e. no missing values) on all variables used in the analysis.
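The definition above can be illustrated with a minimal sketch in Python. The data, the variable names (education, income, died) and the helper analytical_sample are all hypothetical and chosen only for illustration; missing values are assumed to be coded as None.

```python
# Hypothetical study sample; None marks a missing value.
study_sample = [
    {"id": 1, "education": "high", "income": 310, "died": 0},
    {"id": 2, "education": None, "income": 250, "died": 0},
    {"id": 3, "education": "low", "income": None, "died": 1},
    {"id": 4, "education": "low", "income": 190, "died": 1},
    {"id": 5, "education": "medium", "income": 275, "died": None},
]

# Every variable used in any step of the analysis (assumed names).
analysis_vars = ["education", "income", "died"]


def analytical_sample(rows, variables):
    """Keep only rows with valid (non-missing) values on all listed variables."""
    return [row for row in rows if all(row[v] is not None for v in variables)]


sample = analytical_sample(study_sample, analysis_vars)
print(f"Study sample: n={len(study_sample)}")
print(f"Analytical sample: n={len(sample)}")
```

The point of the sketch is that the filter uses the full list of variables from all models at once, so every later model, test and descriptive table is run on the same individuals.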

It is a good idea to first check the amount of missing data for each variable included in the analysis, to see whether any particular variable is especially problematic in terms of missingness. If a variable has a large share of missing values, it could be wise to exclude it from the analysis (although this depends on how important the variable is for your study).
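A per-variable check of this kind could be sketched as follows. Again, the data, the variable names and the helper missing_summary are hypothetical, and missing values are assumed to be coded as None.

```python
# Hypothetical data; None marks a missing value.
rows = [
    {"education": "high", "income": 310, "died": 0},
    {"education": None, "income": None, "died": 0},
    {"education": "low", "income": None, "died": 1},
    {"education": "low", "income": 190, "died": 1},
]
analysis_vars = ["education", "income", "died"]


def missing_summary(rows, variables):
    """Return {variable: (number missing, percent missing)} for each variable."""
    n = len(rows)
    summary = {}
    for v in variables:
        n_missing = sum(1 for row in rows if row[v] is None)
        pct_missing = 100 * n_missing / n if n else 0.0
        summary[v] = (n_missing, pct_missing)
    return summary


summary = missing_summary(rows, analysis_vars)
for variable, (n_missing, pct_missing) in summary.items():
    print(f"{variable}: {n_missing} missing ({pct_missing:.1f}%)")
```

In this made-up example, income stands out with half of its values missing, which is the kind of variable one might reconsider including.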

The analytical sample should be the basis not only for the regression analysis but also for all other statistical tests and descriptive statistics. Moreover, make sure to state the total number of individuals in the heading of each table and each figure. It could look something like this (see Designing descriptive tables and figures for more advice on how to write headings):

Some examples
Table 1. Descriptive statistics for all study variables (n=9,451).

Figure 5. Histogram of annual income (n=9,451).

Table 3. The association between educational attainment and mortality. Results from logistic regression analysis, separately for men (n=4,701) and women (n=4,750).
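One way to keep the n in such headings consistent is to generate it from the analytical sample rather than typing it by hand. The following is only a sketch: the helper heading is hypothetical, and the sample size is hard-coded here where it would normally be the number of rows in the analytical sample.

```python
def heading(label, title, n):
    """Build a table or figure heading that ends with the sample size."""
    return f"{label}. {title} (n={n:,})."


# Assumed value; in practice, the number of rows in the analytical sample.
n_analytical = 9451

print(heading("Table 1", "Descriptive statistics for all study variables", n_analytical))
print(heading("Figure 5", "Histogram of annual income", n_analytical))
```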