Graduate Admissions
Data Analysis

Predicting a student's chance of admission to graduate school from multiple variables using regression and other forms of statistical reasoning.


Technologies used: Python, linear + polynomial regression, backward elimination, SVR, Decision Tree, Random Forest

Separating Data

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Kaggle dataset
dataset = pd.read_csv('Admission_Predict.csv')
X = dataset.iloc[:, 1:-1].values   # predictors: GRE Score through Research
y = dataset.iloc[:, 8].values      # target: Chance of Admit
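
Note that the array displayed below carries a leading column of ones that the slice above does not create; an intercept column was evidently prepended for the statsmodels OLS fit used later. A reconstruction of that step (the exact original line is an assumption):

# Prepend a column of ones so statsmodels' OLS reports an intercept
# ("const" in the regression table below) -- reconstructed step, inferred
# from the leading 1.0 column in the X output
X = np.append(arr = np.ones((len(X), 1)), values = X, axis = 1)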
                        
X
Out[332]: 
array([[  1.  , 337.  , 118.  , ...,   4.5 ,   9.65,   1.  ],
       [  1.  , 324.  , 107.  , ...,   4.5 ,   8.87,   1.  ],
       [  1.  , 316.  , 104.  , ...,   3.5 ,   8.  ,   1.  ],
       ...,
       [  1.  , 330.  , 116.  , ...,   4.5 ,   9.45,   1.  ],
       [  1.  , 312.  , 103.  , ...,   4.  ,   8.78,   0.  ],
       [  1.  , 333.  , 117.  , ...,   4.  ,   9.66,   1.  ]])
       
y
Out[334]: 
array([0.92, 0.76, 0.72, 0.8 , 0.65, 0.9 , 0.75, 0.68, 0.5 , 0.45, 0.52,
       0.84, 0.78, 0.62, 0.61, 0.54, 0.66, 0.65, 0.63, 0.62, 0.64, 0.7 ,
       ...])

Data separated into dependent and independent variables

Visualizing relationships

Multiple visible instances of linearity
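
The original scatter plots are not reproduced here; a minimal sketch of the kind of plots used to inspect these relationships, using the libraries imported above (the column names are taken from the Kaggle dataset's header and are assumptions on my part):

# Plot each predictor against the admission chance to check for linearity;
# column names here are assumed from the Kaggle dataset header
for col in ['GRE Score', 'TOEFL Score', 'CGPA']:
    plt.scatter(dataset[col], y, s = 10)
    plt.xlabel(col)
    plt.ylabel('Chance of Admit')
    plt.show()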

Statistical Algorithms

Linear Regression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 3)

regressor2 = LinearRegression()
regressor2.fit(X_train, y_train)
y_pred = regressor2.predict(X_test)

R2 = r2_score(y_test, y_pred)
MSE = mean_squared_error(y_test, y_pred)
print("R squared value:", R2)
print("Mean squared error:", MSE)
                    
R squared value: 0.78397
Mean squared error: 0.00423
                    

Linear Regression with Backward Elimination

                            OLS Regression Results                            

==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.2594      0.125    -10.097      0.000      -1.505      -1.014
x1             0.0017      0.001      2.906      0.004       0.001       0.003
x2             0.0029      0.001      2.680      0.008       0.001       0.005
x3             0.0057      0.005      1.198      0.232      -0.004       0.015
x4            -0.0033      0.006     -0.594      0.553      -0.014       0.008
x5             0.0224      0.006      4.034      0.000       0.011       0.033
x6             0.1189      0.012      9.734      0.000       0.095       0.143
x7             0.0245      0.008      3.081      0.002       0.009       0.040
==============================================================================
                    

Columns 3 and 4 (x3 and x4, corresponding to University Rating and SOP) will be removed because their p-values exceed the maximum significance level of 0.05.

import statsmodels.api as sm

X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7]]
sigLvl = 0.05

def backwardElim(x, y, sl):
    # Repeatedly fit OLS and drop the predictor with the highest p-value
    # until every remaining p-value is within the significance level
    while True:
        regressor_OLS = sm.OLS(y, x).fit()
        pVals = regressor_OLS.pvalues.astype(float)
        if pVals.max() <= sl:
            return x
        x = np.delete(x, pVals.argmax(), 1)

X_new = backwardElim(X_opt, y, sigLvl)
                

Algorithm for automatic backward elimination
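
The improved metrics below were presumably produced by refitting the linear regression on the reduced matrix X_new; a minimal sketch of that step, assuming the same split settings as before:

# Refit on the reduced feature set -- an assumed step, inferred from the
# metrics reported below
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size = 0.2, random_state = 3)
regressor3 = LinearRegression()
regressor3.fit(X_train, y_train)
y_pred = regressor3.predict(X_test)
print("R squared value:", r2_score(y_test, y_pred))
print("Mean squared error:", mean_squared_error(y_test, y_pred))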

R squared value: 0.78450
Mean squared error: 0.00422
                    

Support Vector Regression

# Feature Scaling -- SVR is sensitive to feature magnitudes, so both the
# predictors and the target are standardized
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()

# Fitting the SVR Model to the Dataset
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X_train, y_train)

# Predicting new results, mapped back to the original target scale
y_pred = sc_y.inverse_transform(regressor.predict(X_test).reshape(-1, 1)).ravel()
                

The SVR class does not perform feature scaling on its own (rescaling each variable to a comparable range), so this was done explicitly with the StandardScaler class of the sklearn library. There is no categorical data in this dataset, so an encoder was not needed.

R squared value: 0.79292
Mean squared error: 0.00426
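
The technologies list also names a Decision Tree regressor, but no code for it survives in this write-up. A minimal sketch of how one would fit into the same pipeline (the unscaled split and the default hyperparameters are assumptions):

from sklearn.tree import DecisionTreeRegressor

# Decision trees need no feature scaling, so an unscaled split is used
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size = 0.2, random_state = 3)

treeRegressor = DecisionTreeRegressor(random_state = 3)
treeRegressor.fit(X_train, y_train)
y_pred = treeRegressor.predict(X_test)

print("R squared value:", r2_score(y_test, y_pred))
print("Mean squared error:", mean_squared_error(y_test, y_pred))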
                    

Random Forest

Significance of variables as determined by the Random Forest algorithm
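
The importance chart itself is not reproduced here, and no Random Forest code survives either; below is a minimal sketch of how the model and its importance ranking might be produced. The fresh unscaled split (the SVR step above overwrote X_train with scaled data) and the hyperparameter n_estimators = 100 are illustrative assumptions, not the original settings:

from sklearn.ensemble import RandomForestRegressor

# Fresh, unscaled split on the full predictor set, dropping the ones column
X_rf = X[:, 1:]
X_train, X_test, y_train, y_test = train_test_split(X_rf, y, test_size = 0.2, random_state = 3)

# n_estimators = 100 is an illustrative choice, not the original setting
rfRegressor = RandomForestRegressor(n_estimators = 100, random_state = 3)
rfRegressor.fit(X_train, y_train)
y_pred = rfRegressor.predict(X_test)

print("R squared value:", r2_score(y_test, y_pred))
print("Mean squared error:", mean_squared_error(y_test, y_pred))

# Rank predictors by impurity-based importance
for name, imp in sorted(zip(dataset.columns[1:-1], rfRegressor.feature_importances_),
                        key = lambda p: p[1], reverse = True):
    print(name, round(imp, 3))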

Findings

For this dataset, the Random Forest approach provided the most accurate predictions.

The most significant factor (by a wide margin) in graduate school admission is the student's undergraduate GPA. The least significant is the rating of the student's university.


For further analysis, I will fine-tune the parameters of the Random Forest algorithm to confirm that a student's GPA truly plays such a significant role in the admissions process.