Classifying Income Bracket from 1994 US Census Data

We ran classification algorithms on the 1994 US Census database to separate people who earn more than 50,000 USD per year from those who don't. The best model reached 86.6% classification accuracy, with marital status as the top predictor.

Executive Summary

The dataset, prepared by Ronny Kohavi and Barry Becker from the 1994 Census database, was used to explore whether it is possible to predict if a person earns more than 50,000 USD per year.

The dataset was cleaned and prepared for the classification models. The cleaned dataset contains 30,162 rows (observations) and 103 columns (features, expanded from the original 14 mainly by one-hot encoding the categorical variables).

Using all 103 features, we tested various classifiers, namely GBM, random forest, decision tree, logistic regression, SVM, kNN, and naive Bayes.

With a minimum target accuracy of 78% (see the Proportional Chance Criterion under Models below), the three models with the highest classification accuracy are GBM (86.6%), random forest (84.8%), and logistic regression with L1 regularization (84.8%). The top predictor shared by these three classifiers is marital status.

Note that because more than 90% of the observations are from the US, the results may not generalize to other populations.

Data Description

The dataset, prepared by Ronny Kohavi and Barry Becker from the 1994 Census database, contains 15 columns (14 features and 1 target variable) and a total of 32,560 observations. The data source is https://archive.ics.uci.edu/ml/datasets/census+income

The features are:

  1. age
  2. workclass
  3. fnlwgt
  4. education
  5. education-num (dropped in modeling since it encodes the same information as education)
  6. marital-status
  7. occupation
  8. relationship
  9. race
  10. sex
  11. capital-gain
  12. capital-loss
  13. hours-per-week
  14. native-country

The target is the annual income, which can either be > 50K USD or <= 50K USD.
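
The jump from 14 raw features to the 103 model inputs mentioned above comes from one-hot encoding the categorical columns. Below is a minimal sketch of how the cleaned design matrix could be built with pandas; the file name, the exact cleaning steps, and the resulting column count are assumptions rather than the author's exact code.

```python
import pandas as pd

# Column names from the UCI "Adult" dataset documentation.
cols = ["age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week",
        "native-country", "income"]

# Missing values in adult.data are coded as "?".
df = pd.read_csv("adult.data", names=cols, na_values="?",
                 skipinitialspace=True)

# Drop rows with missing values and the redundant education-num column.
df = df.dropna().drop(columns=["education-num"])

# Binary target: 1 if annual income > 50K USD, else 0.
y = (df.pop("income") == ">50K").astype(int)

# One-hot encode the categorical columns; the 13 remaining feature
# columns expand to roughly a hundred dummy and numeric columns.
X = pd.get_dummies(df)
print(X.shape)  # expected to be roughly (30162, 103)
```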

Models

Proportional Chance Criterion

The Proportional Chance Criterion (PCC) measures the probability of classifying a data point correctly by chance alone; it is computed as the sum of the squared class proportions. As a rule of thumb, a model is only considered useful if its accuracy exceeds 1.25 x PCC. Since about 75% of the observations fall in the <=50K class and 25% in the >50K class, PCC ≈ 0.75^2 + 0.25^2 ≈ 0.625, so we need to exceed 1.25 x 0.625 ≈ 78% accuracy.
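
As a quick check, the PCC target can be computed directly (a minimal sketch; the 75/25 class split is approximate):

```python
# Proportional Chance Criterion: sum of squared class proportions.
# Approximate class split in the cleaned data: ~75% <=50K, ~25% >50K.
proportions = [0.75, 0.25]
pcc = sum(p ** 2 for p in proportions)  # ~0.625
target = 1.25 * pcc                     # ~0.78
print(f"PCC = {pcc:.3f}, target accuracy = 1.25 x PCC = {target:.3f}")
```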

Results

Using all 103 features, we tested various classifiers, namely GBM, random forest, decision tree, logistic regression, SVM, kNN, and naive Bayes.
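
The post doesn't include the training code, but a minimal sketch of such a comparison with scikit-learn could look like the following, continuing from the data-preparation sketch in the Data Description; the hyperparameters, the 75/25 train/test split, and the use of LinearSVC for the SVM are assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# X and y come from the data-preparation sketch above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "GBM": GradientBoostingClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Logistic regression (L1)": LogisticRegression(
        penalty="l1", solver="liblinear"),
    "SVM": LinearSVC(max_iter=5000),
    "kNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

# Fit each classifier and report its held-out accuracy.
for name, model in models.items():
    acc = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {acc:.3f}")
```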

With the minimum target accuracy of 78%, the three models with the highest classification accuracy are GBM (86.6%), random forest (84.8%), and logistic regression with L1 regularization (84.8%). The top predictor shared by these three classifiers is marital status.
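
The post doesn't state how the top predictor was identified; one common approach is to inspect a fitted model's feature importances (a sketch continuing from the comparison above; note that after one-hot encoding, marital status appears as several dummy columns):

```python
import pandas as pd

gbm = models["GBM"]  # already fitted in the loop above

# Rank the one-hot-encoded features by their GBM importance scores.
importances = pd.Series(gbm.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
# Dummy columns derived from marital-status are expected near the top.
```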

Note that because more than 90% of the observations are from the US, the results may not generalize to other populations.

References and Acknowledgements

I would like to acknowledge Professor Christopher P. Monterola, whose Python code was the foundation of some of the code used in this analysis.
The dataset was downloaded from https://archive.ics.uci.edu/ml/datasets/census+income

GitHub link: https://github.com/PrinceJavier/income_class_prediction
