Getting quick insights into a new dataset | Johan Osterberg - Product Engineer

Getting quick insights into a new dataset

May 31, 2019

Whenever you get your hands on a new dataset, you probably want as much insight as possible, as quickly as possible, so that you know what you're dealing with. In this post we'll look at some of the best initial Stata commands to run on a fresh dataset.

  1. describe - start here for a quick overview of the dataset: the number of observations and variables, plus each variable's name, storage type, display format, and label.
describe
  2. summarize - next, run summarize to get basic summary statistics (number of observations, mean, standard deviation, minimum, and maximum) for all variables. Add the detail option to also get percentiles, variance, skewness, and kurtosis.
summarize
summarize, detail
  3. mdesc - mdesc is a user-written command that tabulates missing values, so it has to be installed from SSC before first use. Once installed, it displays a table with the number of missing values, the total number of cases, and the percent missing for each variable in varlist, or for every variable if you leave varlist out.
ssc install mdesc
mdesc
  4. duplicates report - use duplicates report to see how many observations are duplicated. Without a varlist, observations count as duplicates only if they match on every variable.
duplicates report
  5. tabulate - run tabulate on a categorical variable to see the frequency, percentage, and cumulative percentage of each category.
tabulate variable_name
  6. pwcorr - run pwcorr to examine pairwise correlations between two or more variables. Each correlation is computed from all observations available for that pair.
pwcorr varlist
  7. histogram - plot a histogram of a variable to get an idea of how it is distributed.
histogram variable_name
  8. scatter - use scatter plots to understand the relationship between two continuous variables. The first variable goes on the y-axis and the second on the x-axis.
scatter variable1 variable2
  9. graph box - visualize a variable with a box plot to identify outliers.
graph box variable_name
  10. generate - based on your understanding of the data, you might want to create new variables from existing ones. For instance, you might derive a binary variable from a continuous one, or group several categories of a categorical variable together.
generate variable_name = exp
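
To make the list concrete, here is a minimal sketch of the same steps run as a do-file against the auto dataset that ships with Stata. The dataset, the choice of variables, the 6000 price cutoff, and the new variable names expensive and rep_group are all assumptions made for illustration; none of them come from a particular project.

```stata
* Illustrative do-file: first look at a dataset, using Stata's bundled
* auto data as a stand-in for your own file (an assumption).
sysuse auto, clear

* Steps 1-2: structure and summary statistics
describe
summarize
summarize price, detail

* Step 3: missing values (install mdesc from SSC only if it is not there yet)
capture which mdesc
if _rc ssc install mdesc
mdesc

* Steps 4-6: duplicates, categorical overview, pairwise correlations
duplicates report
tabulate foreign
tabulate rep78, missing          // also list missing values as a category
pwcorr price mpg weight, obs     // obs shows the N behind each correlation

* Steps 7-9: quick plots; name() keeps each graph instead of overwriting it
histogram price, name(hist_price, replace)
scatter price weight, name(sc_price_weight, replace)
graph box price, name(box_price, replace)

* Step 10a: binary variable from a continuous one (6000 is an arbitrary cutoff).
* Stata treats missing as larger than any number, so exclude it explicitly.
generate expensive = price > 6000 if !missing(price)

* Step 10b: collapse the five repair-record categories into three groups.
* Observations with missing rep78 stay missing in rep_group.
generate rep_group = 1 if inlist(rep78, 1, 2)
replace rep_group = 2 if rep78 == 3
replace rep_group = 3 if inlist(rep78, 4, 5)
tabulate rep_group expensive, missing
```

To try this on your own data, replace the sysuse line with a use command pointing at your file and swap in your own variable names. The final tabulate is a quick sanity check that the generated variables look the way you expect.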

That's it: a few of the most basic commands for getting a quick overview of a dataset. We'll explore some of these further in upcoming posts.



Written by Johan Osterberg, who lives and works in Gothenburg, Sweden, as a developer specialized in e-commerce. Connect with me on LinkedIn.

2024 © Johan Osterberg