Predicting students' chances of graduate school admission from multiple variables using regression and other forms of statistical reasoning.
Technologies used: Python, linear + polynomial regression, backward elimination, SVR, Decision Tree, Random Forest
```python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Kaggle dataset
dataset = pd.read_csv('Admission_Predict.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, 8].values

# Prepend a column of ones so statsmodels' OLS fits an intercept
# (visible as the leading 1s in the output below and as "const" in
# the OLS summary)
X = np.append(arr=np.ones((len(X), 1)), values=X, axis=1)
```
```
X
Out[332]:
array([[  1.  , 337.  , 118.  , ...,   4.5 ,   9.65,   1.  ],
       [  1.  , 324.  , 107.  , ...,   4.5 ,   8.87,   1.  ],
       [  1.  , 316.  , 104.  , ...,   3.5 ,   8.  ,   1.  ],
       ...,
       [  1.  , 330.  , 116.  , ...,   4.5 ,   9.45,   1.  ],
       [  1.  , 312.  , 103.  , ...,   4.  ,   8.78,   0.  ],
       [  1.  , 333.  , 117.  , ...,   4.  ,   9.66,   1.  ]])

y
Out[334]:
array([0.92, 0.76, 0.72, 0.8 , 0.65, 0.9 , 0.75, 0.68, 0.5 , 0.45, 0.52,
       0.84, 0.78, 0.62, 0.61, 0.54, 0.66, 0.65, 0.63, 0.62, 0.64, 0.7 ,
       ...])
```
Data separated into dependent and independent variables
Multiple visible instances of linearity
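The plots this caption refers to are not reproduced here. A minimal sketch of how such pairwise scatter plots could be generated, assuming the column names of the Kaggle CSV (some versions include trailing spaces in 'LOR ' and 'Chance of Admit '):

```python
# Hypothetical sketch: plot each predictor against the admission chance
# to visually check for linear relationships.
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('Admission_Predict.csv')
features = ['GRE Score', 'TOEFL Score', 'University Rating',
            'SOP', 'LOR ', 'CGPA', 'Research']
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, col in zip(axes.ravel(), features):
    ax.scatter(dataset[col], dataset['Chance of Admit '], s=8)
    ax.set_xlabel(col)
    ax.set_ylabel('Chance of Admit')
axes.ravel()[-1].set_visible(False)  # only seven predictors, hide the eighth panel
plt.tight_layout()
plt.show()
```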
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

regressor2 = LinearRegression()
regressor2.fit(X_train, y_train)
y_pred = regressor2.predict(X_test)

R2 = r2_score(y_test, y_pred)
MSE = mean_squared_error(y_test, y_pred)
print("R squared value:", R2, "\nMean squared error:", MSE)
```
R squared value: 0.78397 Mean squared error: 0.00423
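The OLS summary below was presumably produced with statsmodels; a minimal sketch of that step, assuming the intercept column added when the data was loaded:

```python
# Sketch: fit ordinary least squares on all features and print the
# summary table shown below. X is assumed to contain the leading
# column of ones as its intercept term.
import statsmodels.api as sm

regressor_OLS = sm.OLS(endog=y, exog=X).fit()
print(regressor_OLS.summary())
```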
```
OLS Regression Results
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.2594      0.125    -10.097      0.000      -1.505      -1.014
x1             0.0017      0.001      2.906      0.004       0.001       0.003
x2             0.0029      0.001      2.680      0.008       0.001       0.005
x3             0.0057      0.005      1.198      0.232      -0.004       0.015
x4            -0.0033      0.006     -0.594      0.553      -0.014       0.008
x5             0.0224      0.006      4.034      0.000       0.011       0.033
x6             0.1189      0.012      9.734      0.000       0.095       0.143
x7             0.0245      0.008      3.081      0.002       0.009       0.040
==============================================================================
```
Predictors x3 and x4 will be removed because their p-values (0.232 and 0.553) exceed the maximum significance level of 0.05.
```python
import statsmodels.api as sm

X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7]]
sigLvl = 0.05

def backwardElim(x, y, sl):
    # Repeatedly fit OLS and drop the single least significant
    # predictor until every remaining p-value is at or below sl.
    while True:
        regressor_OLS = sm.OLS(y, x).fit()
        pVals = regressor_OLS.pvalues.astype(float)
        worst = np.argmax(pVals)
        if pVals[worst] <= sl:
            return x
        x = np.delete(x, worst, axis=1)

X_new = backwardElim(X_opt, y, sigLvl)
```
Algorithm for automatic backward elimination
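The re-fitting step that produced the improved scores below is not shown; presumably the same split and linear regression were rerun on the reduced matrix `X_new`, roughly:

```python
# Sketch: retrain the linear model on the reduced feature set returned
# by backwardElim, reusing the earlier split settings.
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=3)
regressor2 = LinearRegression()
regressor2.fit(X_train, y_train)
y_pred = regressor2.predict(X_test)
print("R squared value:", r2_score(y_test, y_pred),
      "\nMean squared error:", mean_squared_error(y_test, y_pred))
```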
R squared value: 0.78450 Mean squared error: 0.00422
```python
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
# StandardScaler expects a 2D array, so reshape the target first
y_train_scaled = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()

# Fitting the SVR Model to the Dataset
from sklearn.svm import SVR
regressor = SVR(kernel='rbf')
regressor.fit(X_train, y_train_scaled)

# Predicting New Result (undo the target scaling on the predictions)
y_pred = sc_y.inverse_transform(regressor.predict(X_test).reshape(-1, 1)).ravel()
```
The SVR class does not automatically feature scale (transform the data to relative values), so this was done explicitly using the StandardScaler class of the sklearn library. There is no categorical data in this dataset, so an encoder was not needed.
R squared value: 0.79292 Mean squared error: 0.00426
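A Decision Tree regressor is listed among the technologies used but its code does not appear above; a minimal sketch of that step, assuming the same train/test split (trees are insensitive to the feature scaling applied for SVR):

```python
# Sketch: fit a single decision tree regressor for comparison.
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(random_state=3)  # random_state is an assumption
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
print("R squared value:", r2_score(y_test, y_pred),
      "\nMean squared error:", mean_squared_error(y_test, y_pred))
```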
Significance of variables determined by the Random Forest algorithm
For this dataset, the Random Forest approach provided the most accurate predictions of the models tested.
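The Random Forest code itself is not shown above; a minimal sketch of how the model and the variable importances could have been obtained (n_estimators and random_state are assumptions):

```python
# Sketch: fit a random forest and read off per-feature importances.
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=100, random_state=3)  # assumed settings
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
print("R squared value:", r2_score(y_test, y_pred))

# feature_importances_ sums to 1; larger values mean the feature did
# more of the work in reducing prediction error across the trees.
print(forest.feature_importances_)
```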
The most significant factor, by a wide margin, in graduate school admission is the student's undergraduate GPA (CGPA). The least significant is the rating of the student's current university.
For further analysis, I will fine-tune the parameters of the Random Forest algorithm to confirm that a student's GPA truly plays such a significant role in the college admissions process.
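A minimal sketch of what that tuning could look like with scikit-learn's GridSearchCV; the parameter grid is an assumption for illustration:

```python
# Sketch: grid-search a few random forest hyperparameters with
# 5-fold cross-validation, scored by R squared.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 2, 5],
}
search = GridSearchCV(RandomForestRegressor(random_state=3),
                      param_grid, cv=5, scoring='r2')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# Compare the tuned forest's importances against the untuned ones to
# see whether CGPA keeps its dominant weight.
print(search.best_estimator_.feature_importances_)
```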