Exploring Regression Methods on Public Datasets
A semester capstone project exploring the strengths and weaknesses of different regression techniques. Each method was run 20 times on each of 20 publicly available datasets, and the results were averaged.
My team was tasked with proposing an interesting capstone project based on what we learned in our Data Science and Analytics course.
We settled on a benchmarking study to measure the predictive performance of various regression techniques on real, publicly available datasets. Click the link above to read the 1.5-page proposal.
Analyses Run
Multiple Linear Regression
K-Nearest Neighbors Regression
Ridge Regression
Lasso Regression
Bagging
Random Forest Regression
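The six methods above can be sketched as a single scikit-learn comparison loop. This is a minimal illustration, not the project's actual code: the train/test split, hyperparameters, and R² scoring are assumptions, and a synthetic dataset stands in for the real public datasets.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

# Synthetic stand-in for one of the public datasets
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One instance of each technique compared in the project
# (hyperparameters here are illustrative defaults)
models = {
    "Multiple Linear Regression": LinearRegression(),
    "K-Nearest Neighbors Regression": KNeighborsRegressor(n_neighbors=5),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
    "Bagging": BaggingRegressor(random_state=0),
    "Random Forest Regression": RandomForestRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on held-out data
```

In the actual study each fit would be repeated 20 times per dataset and the scores averaged, which smooths out the randomness in the split and in the ensemble methods.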
40 different datasets were gathered from a variety of sources, the majority from Kaggle.com. Datasets were initially selected for interest, but the list was ultimately pared down to 20, eliminating those that were too small, insufficiently diverse, or not appropriate for regression analysis.
Click the link above to see all ~40 datasets considered.
Data Cleaning
Each dataset had unique cleaning requirements, but most followed similar procedures.
These included:
Changing data types with the .astype() function
Creating dummy variables for categorical predictors using pd.get_dummies()
Recoding boolean variables to 0 and 1 using np.where()
Recoding ordinal string data into numerical format with scikit-learn's OrdinalEncoder
Removing missing values using .dropna()
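The steps above can be sketched on a small hypothetical DataFrame. The column names and values here are invented for illustration; the real datasets each needed their own variant of this pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical raw data illustrating the cleaning steps above
df = pd.DataFrame({
    "price": ["100", "250", None, "175"],   # numeric data stored as strings
    "color": ["red", "blue", "red", "green"],  # categorical predictor
    "in_stock": [True, False, True, True],     # boolean variable
    "size": ["small", "large", "medium", "small"],  # ordinal string data
})

# Remove rows with missing values
df = df.dropna()

# Change data types
df["price"] = df["price"].astype(float)

# Recode booleans to 0 and 1
df["in_stock"] = np.where(df["in_stock"], 1, 0)

# Recode ordinal strings into numeric ranks
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size"] = encoder.fit_transform(df[["size"]]).ravel()

# Create dummy variables for the categorical predictor
df = pd.get_dummies(df, columns=["color"], drop_first=True)
```

After these steps every column is numeric, which is what the regression methods require; `drop_first=True` avoids the dummy-variable trap in the linear models.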