Crosstable

Quick facts

Number of variables
Two

Scales of variable(s)
Categorical

Introduction

A crosstable is a description of how individuals are distributed according to two variables.  

This function is used primarily for categorical variables (i.e. nominal/ordinal) but can be used for any type of variable; the main concern is that the table becomes too complex and difficult to interpret if there are many categories/values in the variables used.

Moreover, it is possible to add a chi-square test to the crosstable (for more information, see Chi-squared).  

Unless otherwise specified, a crosstable shows only the frequency distribution. This is usually not what we are after; we would rather see the percentage distribution.

There are two options to choose from: column and row percentages. The frequencies (i.e. the number of individuals) in the cells are the same, but the percentages differ because the focus shifts between the tables. If you find this difficult to keep apart, a good tip is to check where the percentages add up to 100%: with row percentages, each row sums to 100% (shown in the Total column), and with column percentages, each column sums to 100% (shown in the Total row).

Note
If any individuals have missing information on either of the two variables, they are excluded from the crosstable unless otherwise specified.

Function

Basic command
tab varname1 varname2
Useful options
tab varname1 varname2, row 
tab varname1 varname2, col
Explanations
varname1: Insert the name of the first variable you want to use (is included as the row variable).
varname2: Insert the name of the second variable you want to use (is included as the column variable).
row: Show row percentages.
col: Show column percentages.
m: Include missing.
Short names
tab = tabulate
col = column
m = missing
Note
Options can be used simultaneously, e.g.:
tab varname1 varname2, row col m
More information
help tabulate twoway
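
As a minimal sketch of how these options combine, the commands below use Stata's bundled example dataset auto (an assumption made for illustration; it is not the dataset used in this guide). The variables foreign and rep78 come from that dataset, and rep78 contains some missing values, which makes the effect of the m option visible.

```stata
* Load Stata's bundled example data (assumption: used here only for illustration)
sysuse auto, clear

* Frequencies only: foreign is the row variable, rep78 the column variable
tab foreign rep78

* Add row percentages; cars with missing rep78 are left out of the table
tab foreign rep78, row

* Same table, but missing values are shown as a category of their own
tab foreign rep78, row m
```

Comparing the totals of the last two tables shows how many observations are dropped when missing values are excluded.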

Practical example

Dataset
StataData1.dta
Variable name: sex
Variable label: Sex
Value labels: 0=Man, 1=Woman

Variable name: bullied
Variable label: Exposure to bullying (Age 15, Year 1985)
Value labels: 0=No, 1=Yes

tab sex bullied

In the table above, we specified sex as the row variable, and bullied as the column variable.

Frequencies are shown in the different cells. We can observe that: 

  • There are 4,145 men and 4,574 women.
  • There are 7,780 individuals who have not been exposed to bullying and 939 who have been exposed to bullying.

If we focus on the frequency distribution of bullying across gender, we can see that:

  • Among men, there are 3,799 who have not been exposed to bullying and 346 who have been exposed to bullying.
  • Among women, there are 3,981 who have not been exposed to bullying and 593 who have been exposed to bullying.

If we instead shift the focus to the frequency distribution of sex across bullying, the results show that:

  • Among those who have not been exposed to bullying, there are 3,799 men and 3,981 women.
  • Among those who have been exposed to bullying, there are 346 men and 593 women. 

Comparing frequencies is, however, rather tricky since the number of individuals differs across the categories of the variables. That is why it is often more practical to focus on percentages.

tab sex bullied, row

In the table above, we have added row percentages. Since sex is our row variable, the table now shows the percentage distribution of bullying across sex.

  • In total, 89% have not been exposed to bullying whereas 11% have been exposed to bullying. 
  • Among men, 92% have not been exposed to bullying whereas 8% have been exposed to bullying.
  • Among women, 87% have not been exposed to bullying whereas 13% have been exposed to bullying.
tab sex bullied, col

In the table above, we have added column percentages. Since bullied is our column variable, the table now shows the percentage distribution of sex across bullying.

  • In total, 48% are men and 52% are women (4,145 and 4,574 out of 8,719 individuals).
  • Among those who have not been exposed to bullying, 49% are men and 51% are women.
  • Among those who have been exposed to bullying, 37% are men and 63% are women.
Note
It is seldom necessary to report decimals for percentages – if you choose to do this, it is most often sufficient to report only one decimal.
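
If you want to check the percentages without loading the dataset, one option is Stata's immediate command tabi, which builds a crosstable directly from cell counts. The sketch below simply re-enters the four frequencies reported above; tabi labels the rows and columns 1 and 2 rather than Man/Woman and No/Yes, so the comments document which cell is which.

```stata
* Cell counts taken from the frequency table above:
*            bullied = No   bullied = Yes
*   Man          3,799           346
*   Woman        3,981           593
* Rows are separated by a backslash; row and col request both percentage types
tabi 3799 346 \ 3981 593, row col

* Check one row percentage by hand: bullied men out of all men
display %4.1f 100 * 346 / 4145

* Check one column percentage by hand: men out of all bullied individuals
display %4.1f 100 * 346 / 939
```

The two hand calculations correspond to the 8% (row percentage) and 37% (column percentage) reported for bullied men above.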

How to choose between row/column variables and row/column percentages?

There are many ways to think about this, but we will here present our preferred strategy.

To start with, we suggest that you think about what you actually want to compare.

In the example of sex and bullying, it is reasonable to be interested in whether there are sex differences in terms of exposure to bullying. Accordingly, we want to compare men and women.

We recommend that you place the variable you want to make the comparison by (in this case, sex) as the row variable and then request row percentages.

Thinking about our example, this enables us to compare the percentage of bullied men with the percentage of bullied women, to see if there are indeed sex differences in bullying.

Example
To clarify things further, consider one more example. Assume that we are interested in whether there are differences in the prevalence of lung cancer between non-smokers and smokers. We thus have two variables: smoking status and lung cancer. Since we want to make the comparison by smoking status, we choose this as our row variable and request row percentages.
Note
Choosing the variable we want to make the comparison by as the row variable and requesting row percentages is in practice the same as choosing that variable as the column variable and requesting column percentages. However, we think the former alternative makes the comparison easier.
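
The equivalence described in this note can be seen by running both versions on the sex and bullying example. This is a sketch that assumes StataData1.dta is stored in the current working directory; adjust the path if it is located elsewhere.

```stata
* Assumption: StataData1.dta is in the current working directory
use StataData1.dta, clear

* Comparison variable (sex) in the rows, with row percentages
tab sex bullied, row

* Transposed table: sex in the columns, with column percentages
tab bullied sex, col
```

Both tables contain the same percentages (for example, the share of men and of women who have been bullied); only the layout differs.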