Exploring Regression Methods on Public Datasets

A semester capstone project to deepen understanding of the strengths and weaknesses of different regression techniques. Each method was run 20 times on each of 20 publicly available datasets, and the results were averaged.

My team was tasked with proposing an interesting capstone project based on what we learned in our Data Science and Analytics course.

We landed on a benchmarking task to measure the “power” of various regression techniques on live, publicly available datasets. Click the link above to read the 1.5-page proposal.

Analyses Run

  • Multiple Linear Regression

  • K-Nearest Neighbors Regression

  • Ridge Regression

  • Lasso Regression

  • Bagging

  • Random Forest Regression
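The benchmarking loop described above can be sketched with scikit-learn. This is a minimal illustration, not the project's actual code: the synthetic dataset, the 80/20 split, and R² as the scoring metric are all assumptions standing in for the real datasets and evaluation details.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score

# Hypothetical stand-in for one of the cleaned public datasets
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# The six methods listed above
models = {
    "Multiple Linear Regression": LinearRegression(),
    "K-Nearest Neighbors": KNeighborsRegressor(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "Bagging": BaggingRegressor(),
    "Random Forest": RandomForestRegressor(),
}

n_runs = 20  # matching the 20 repetitions described above
for name, model in models.items():
    scores = []
    for run in range(n_runs):
        # A fresh random split per run, so the averaged score
        # reflects performance across resampled train/test sets
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=run
        )
        model.fit(X_train, y_train)
        scores.append(r2_score(y_test, model.predict(X_test)))
    print(f"{name}: mean R^2 = {np.mean(scores):.3f}")
```

Averaging over repeated splits reduces the influence of any one lucky (or unlucky) train/test partition on the comparison.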

40 different datasets were gathered from a variety of sources, although the majority were pulled from Kaggle.com. Datasets were initially selected because they seemed interesting, but the list was ultimately pared down to 20, eliminating those that were too small, insufficiently diverse, or not appropriate for regression analysis.

Click the link above to see all ~40 datasets considered.

Data Cleaning

Each dataset had unique cleaning requirements, but most followed similar procedures.

These included:

  • Changing data types with the .astype() function

  • Creating dummy variables for categorical predictors using pd.get_dummies()

  • Recoding boolean variables to 0 and 1 using np.where()

  • Re-coding ordinal string data into numerical format with OrdinalEncoder

  • Removing missing values using .dropna()
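The cleaning steps above can be sketched together on a small example. The toy DataFrame and its column names here are hypothetical, invented only to demonstrate each listed operation:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical raw dataset illustrating the cleaning steps above
df = pd.DataFrame({
    "price": ["100", "250", "175", None],          # numbers stored as strings
    "neighborhood": ["north", "south", "north", "east"],  # nominal categorical
    "furnished": [True, False, True, False],       # boolean
    "condition": ["poor", "good", "excellent", "good"],   # ordinal strings
})

# Change data types with .astype()
df["price"] = df["price"].astype(float)

# Dummy variables for nominal categorical predictors
df = pd.get_dummies(df, columns=["neighborhood"], drop_first=True)

# Recode boolean variables to 0 and 1 with np.where()
df["furnished"] = np.where(df["furnished"], 1, 0)

# Re-code ordinal string data numerically, preserving the order
encoder = OrdinalEncoder(categories=[["poor", "good", "excellent"]])
df["condition"] = encoder.fit_transform(df[["condition"]])

# Remove rows with missing values
df = df.dropna()
```

Note the distinction between `get_dummies()` for unordered categories and `OrdinalEncoder` for ordered ones: dummy variables discard ordering, while the ordinal encoding preserves it as increasing integers.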

Results
