Handling missing values in Stata

May 16, 2019

Effectively handling missing data in Stata largely comes down to dealing with how Stata treats missing values to begin with. Stata treats missing values as positive infinity and will denote them with a period (ie. .). Furthermore Stata will evaluate the missing values in any logical operations performed, which is why it is crucial to take special measures to avoid the risk of making false inferences about the data. The simplest but also the most verbose way to handle missing values is to use the !=. Some other options include recoding missing values (using the mvencode command) or removing them altogether.

Let's look at some examples using the auto-data dataset:

sysuse auto

Now we can use the misstable command in order to find missing values in the dataset, like so:

misstable summarize

Here we can see that there are 5 missing values for the variable rep78 (repair records from 1978). As stated earlier Stata will evaluate the missing values to positive infinity, and will ultimately be regarded as greater than any of the numerical values in the prior sequence. One way to mitigate this is to use the != operator.

To see which cars have a higher repair record of 5, type:

tab make if rep78 >= 5 & rep78 != .

Another option would be to recode the missing values with the command mvencode. This is clearly a more elegant solution even if it might be difficult to come up with a meaningful re-coding strategy. For example, to turn all missing values into -1, type

mvencode *, mv(-1)

The mvdecode command can similarly be used to decode any encoded values back to their original value:

mvdecode *, mv(-1)