Box plot

Quick facts

Number of variables
One group variable (optional) 
One test variable 

Scales of variable(s)
Group variable: categorical
Test variable: continuous

Introduction

A box plot, also called a box-and-whisker plot, summarizes the distribution of a variable in four parts. The four parts are delimited by five values: minimum, first quartile, median, third quartile, and maximum.

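Below is a simple illustration: we draw a box from the first quartile (q1) to the third quartile (q3). The line in the middle of the box represents the median (q2). The whiskers extend to the minimum (min) and maximum (max) values, so each of the four parts contains approximately 25% of the values. Note that Stata's graph box draws each whisker only as far as the most extreme value within 1.5 times the interquartile range (q3 - q1) of the box, and plots any values beyond that as individual dots.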
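To see the five values behind a box plot as numbers, something like the following minimal sketch can be used. It assumes Stata's bundled example dataset auto and its variable mpg, which are not part of this guide; substitute your own variable.

```stata
* Assumption: auto is Stata's bundled example dataset, not part of this guide
sysuse auto, clear

* summarize with the detail option stores the quartiles in r()
quietly summarize mpg, detail

* The five values that delimit the four parts of the box plot
display "min    = " r(min)
display "q1     = " r(p25)
display "median = " r(p50)
display "q3     = " r(p75)
display "max    = " r(max)
```

Running summarize without quietly prints the same percentiles in a table.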

It is not necessary to include a group variable in a box plot, but we chose to place box plots here instead of Descriptive analysis, since we think that it is a nice alternative for comparing groups in a descriptive way.

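Extreme values appear as separate dots in a box plot and can stretch the axis so much that the boxes become compressed and hard to compare. If you discover that your variable has any extreme values, you might need to reconsider your box plot (e.g. by excluding the outliers).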
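One way to check how many values would be drawn as dots is to compute the 1.5 x IQR limits by hand. The sketch below again assumes the bundled auto dataset and its variable mpg (not part of this guide), and the local macro names are our own. The count should correspond to the dots in the plot, but treat it as an approximation rather than a guarantee.

```stata
* Assumption: auto is Stata's bundled example dataset, not part of this guide
sysuse auto, clear

* Quartiles of the variable
quietly summarize mpg, detail

* Limits at 1.5 times the interquartile range beyond the box
local iqr   = r(p75) - r(p25)
local lower = r(p25) - 1.5 * `iqr'
local upper = r(p75) + 1.5 * `iqr'

* Number of non-missing observations outside the limits
count if (mpg < `lower' | mpg > `upper') & !missing(mpg)

* Draw the box plot without plotting the outside values
graph box mpg, nooutsides
```

The nooutsides option only hides the dots in the graph; it does not drop any observations from the data.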

Function

Basic command
graph box yvar, over(groupvar)
Explanations
yvar: Insert the name of the variable that you want to use as your y-variable.
groupvar: Insert the variable defining the groups.
More information
help graph box
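
Since the group variable is optional, the command can be run with or without over(). The following sketch shows both forms, plus the horizontal variant graph hbox, using the bundled auto dataset with mpg as yvar and foreign as groupvar. These are our choices for illustration and are not part of this guide.

```stata
* Assumption: auto is Stata's bundled example dataset, not part of this guide
sysuse auto, clear

* One box for the whole sample (no group variable)
graph box mpg

* One box per category of the group variable
graph box mpg, over(foreign)

* The same comparison drawn with horizontal boxes
graph hbox mpg, over(foreign)
```

Each graph command replaces the previous graph window, so run them one at a time to inspect each plot.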

Practical example

Dataset
StataData1.dta
Variable name: gpa
Variable label: Grade point average (Age 15, Year 1985)
Value labels: N/A

Variable name: sex
Variable label: Sex
Value labels: 0=Man, 1=Woman

graph box gpa, over(sex)

The box plot above shows the distribution of gpa according to sex.

We can see that the distribution is slightly shifted upwards among women compared to men: their median grade point average is higher.
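
To back up what the plot shows with numbers, the quartiles can be tabulated for each group. This is a minimal sketch that assumes StataData1.dta is in your current working directory; adjust the path if it is stored elsewhere.

```stata
* Assumption: StataData1.dta is in the current working directory
use "StataData1.dta", clear

* Number of observations and the five box plot values, separately by sex
tabstat gpa, by(sex) statistics(n min p25 p50 p75 max)
```

The p50 column holds the median for each group, which is the line drawn inside each box.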

There are some outliers, shown as dots beyond the whiskers, but they are few and do not seem to be a big problem.

Summary
The median grade point average is higher among women than among men.