Executive Summary

This analysis has three objectives:

  • (1) describe, visualize, and model a complex dataset using supervised machine learning techniques.
  • (2) describe, visualize, and segment a complex dataset using unsupervised machine learning techniques.
  • (3) extract business insights from applied SQL analysis.

First, this analysis surveyed the distributions with the following selected findings:

distributions-all

  • x1 (independent variable):
    • uniform distribution
  • x2 (independent variable):
    • normal distribution
  • x3 (independent variable):
    • t-distribution
  • x5 (independent variable):
    • exponential distribution
  • x6 (independent variable):
    • binomial distribution
  • ya (dependent variable):
    • chi2 distribution
  • yb (dependent variable):
    • reverse log normal distribution
  • yc (dependent variable):
    • logistic distribution

Regarding predicting the dependent variables, this analysis found the following:

  • For ya:
$ \begin{align} ya = 2.513(x1) + 6.236(x2) + 0.956(x3) + 7.58 \end{align} $



ya-equation

  • For yb:
$ \begin{align} yb = 0.253(x1) + 0.8422 \end{align} $



yb-equation

  • For yc:
    • No model appropriate because yc does not appear to be related to the independent variables.

Second, this analysis found through unsupervised machine learning:

  • that the unsupervised learning data set could be reduced from 10 dimensions to just 3 (depicted as spheroids below).
    • via a Principal Component Analysis (PCA):
  • that the observations could be grouped into 4 cohorts (depicted in hues below).
    • via a K-Means Clustering Analysis:

kmeans-on-pca

Third, this analysis derived business insights through SQL:

  • such as the following query finding the number of posts in 2017:

query-3-a-via-Sublime

  • such as the following query finding the distribution of views for users:

query-3-b-via-Sublime

  • such as the following query finding the top 10 tags associated with posts created in 2017:

query-3-c-via-Sublime

  • such as the following query finding the average response time in seconds from when a 2017 question was first posted to when its accepted answer was posted:

query-3-d-via-Sublime

Credit to God, my Mother, family and friends.

All errors are my own.

Best,
George John Jordan Thomas Aquinas Hayward, Optimist
Data Scientist
September 30, 2019

Surveying the DS1 Distributions

distributions-all

DS1 Pair Plot

ds1-pair-plot-lower-res

DS1 Correlation Matrix

ds1-correlation-matrix

Using XGBoost for 'ya' Feature Selection

ya-xgboost-feature-selection

Using XGBoost for 'yb' Feature Selection

yb-xgboost-feature-selection

Multiple Linear Regression for 'ya' Prediction

ya-equation

Simple Linear Regression for 'yb' Prediction

yb-equation

Selecting Number of Components for DS2 PCA

picking-pca

3D-Plotting DS2's K-Means Cohort Results After PCA

kmeans-on-pca

SQL Love-Language Queries

query-3-a-via-Sublime
query-3-b-via-Sublime
query-3-c-via-Sublime
query-3-d-via-Sublime

And it begins...

In [1]:
import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import warnings
warnings.filterwarnings("ignore")
from fitter import Fitter
from scipy import stats
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn import linear_model
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score, median_absolute_error, \
explained_variance_score, confusion_matrix, accuracy_score, precision_score, recall_score
import xgboost as xgb
import statsmodels.formula.api as sm
from statsmodels.api import add_constant
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import FunctionTransformer
import matplotlib as mpl
from mpl_toolkits import mplot3d
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
import hypertools as hyp

Part I. - Supervised Machine Learning


  • Consider data set 1 (ds1.csv). The data set comprises features (the five x's) along with three sequences that may or may not be generated from those features (the three y's).

Question 1 - Part I.A


  • Describe the data set in a few sentences. E.g. What are the distributions of each feature? Summary statistics?

👇🏾 We begin by reading in the data. 👇🏾

  • We will check for NULLS.
  • The raw data contains its own index column, but a Pandas dataframe comes with an index by default, so we will drop this 'extra' index column.
  • We will summarize the data.
  • We will check the distributions.
In [2]:
#read in data
ds1 = pd.read_csv("ds1.csv")
In [3]:
#peek at the data
ds1.head(3)
Out[3]:
Unnamed: 0 x1 x2 x3 x5 x6 ya yb yc
0 1 2.642583 -1.715220 1.909334 0.027139 -3.447187 13.630850 1.828765 0.008386
1 2 4.588761 -2.507543 4.239107 1.704150 -2.782809 7.834582 2.162110 0.000008
2 3 7.919796 -5.108415 3.039451 0.992815 5.551587 -5.107041 2.797083 -0.000005
In [4]:
#drop "Unamed:0" column and peek at the data again:
ds1 = ds1.drop(ds1.columns[0], axis=1)
ds1.head(3)
Out[4]:
x1 x2 x3 x5 x6 ya yb yc
0 2.642583 -1.715220 1.909334 0.027139 -3.447187 13.630850 1.828765 0.008386
1 4.588761 -2.507543 4.239107 1.704150 -2.782809 7.834582 2.162110 0.000008
2 7.919796 -5.108415 3.039451 0.992815 5.551587 -5.107041 2.797083 -0.000005
In [5]:
#check for NULLS
msno.matrix(ds1,  color = (.0,.0,.2))
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c27ac7d30>

💭👆🏾This indicates that there are no NULLS.👆🏾💭

👇🏾 We now describe the data. 👇🏾

  • When we talk about describing the data, we look at numerical summaries. These include:
    • the standard deviation to get a sense of the spread of the data
    • the mean and median
    • the min and max
    • the null count
  • Pandas has a nice 'describe' method, which we'll use here.
  • Next we'll get to the distributions.
In [6]:
ds1.describe()
Out[6]:
x1 x2 x3 x5 x6 ya yb yc
count 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 1.000000e+05
mean 5.011059 -3.005565 2.500593 0.999136 0.000647 3.827867 2.111854 1.022765e-04
std 2.873568 2.000799 1.752906 1.002006 4.663860 18.438514 0.769306 3.121215e-02
min 0.000015 -12.498524 -3.489212 0.000003 -13.885453 -64.021995 -0.523672 -5.433613e-01
25% 2.536309 -4.353844 1.189920 0.285629 -2.611943 -8.998020 1.580212 -2.423158e-03
50% 5.022191 -3.002649 2.503764 0.690903 -0.000611 2.666992 2.231057 -5.670000e-10
75% 7.486275 -1.649283 3.802236 1.386862 2.621841 15.579687 2.733318 2.476745e-03
max 9.999887 6.089820 8.679097 15.102966 13.924740 107.713997 3.841363 8.183882e-01

👇🏾 We will now inspect and fit the distributions. 👇🏾

  • We will plot histograms to visualize the distributions.
  • We will use the Python third party package "fitter" to check these distributions against some of the most common and useful statistical distributions.

🔬 Function 1: Visualizing & Fitting the Distribution 🔬

  • I am about to write a function which will:
    • first, plot a histogram for our five independent and three dependent variables
    • second, use fitter to fit a distribution to our histogram
    • third, just for good measure, add a probability plot
  • I will loop it through each column in the data frame to do all this automatically.
  • See you on the other side!
In [7]:
def distribution_visualizer_and_fitter(df, col_str):
    #this part makes the histogram:
    fig, axs = plt.subplots(figsize=(12,4))
    plt.hist(df[col_str], bins=100)
    plt.title('Histogram of {}'.format(col_str), fontweight = 'bold')
    plt.ylabel('Frequency')
    plt.xlabel('Values')
    plt.show()
    #this part fits the distribution to the histogram:
    f = Fitter(df[col_str], distributions=['gamma', 'beta', 'rayleigh', 'norm', 'pareto', 'uniform', \
                         'logistic', 'expon', 'chi2', 'pearson3'], verbose = False, timeout = 10)
    f.fit()
    f.summary()
    #this part makes the probability plot
    fig, axs = plt.subplots(ncols=2, figsize=(12,4))
    axs[1].set_axis_off()
    stats.probplot(df[col_str], plot=axs[0])
    plt.show()
In [8]:
for i in ds1.columns:
    distribution_visualizer_and_fitter(ds1, i)

💭👆🏾Conclusion: What do we now know about the distributions?👆🏾💭

  • x1 (independent variable):
    • uniform distribution
  • x2 (independent variable):
    • normal distribution
  • x3 (independent variable):
    • t-distribution
      • some will also call this a 'nearly normal' distribution
      • can be modeled with beta distribution
  • x5 (independent variable):
    • exponential distribution
      • a kind of gamma distribution
  • x6 (independent variable):
    • binomial distribution
  • ya (dependent variable):
    • chi2 distribution
      • can also say it's a positively skewed normal distribution, BUT:
      • to me, the key is the long tail on the right, looks a lot like chi2 distribution
  • yb (dependent variable):
    • reverse log normal distribution
    • can be approximated with beta distribution
  • yc (dependent variable):
    • logistic distribution
      • more unclear
      • in my subsequent analysis, I'll discuss how I don't think this 'dependent' feature is actually based on the independent features
      • however, from my fitting algorithm, it appears a logistic distribution is the closest (a quick quantitative sanity check on one of these fits follows below)
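Purely as an illustrative add-on (not part of the original fitting flow), here is a minimal sketch of how one of these visual calls could be checked quantitatively with a Kolmogorov-Smirnov test from scipy.stats, which is already imported above. The choice of x1 versus a uniform distribution, with loc/scale taken from the column's own min and range, is an assumption made for demonstration only.

#hedged sketch: quantify the visual 'x1 looks uniform' call with a one-sample KS test
#(loc/scale come from x1 itself here, which is fine for a rough sanity check only)
ks_stat, ks_pvalue = stats.kstest(ds1['x1'], 'uniform',
                                  args=(ds1['x1'].min(),
                                        ds1['x1'].max() - ds1['x1'].min()))
print('KS statistic: {:.4f}, p-value: {:.4f}'.format(ks_stat, ks_pvalue))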

Question 2 - Part I.B


  • Try to come up with a predictive model, e.g. y = f(x_1 , … , x_n) for each y sequence. Describe your model and how you came up with them. What (if any) are the predictive variables? How good would you say each of your models is?

👇🏾 Modeling Gameplan 👇🏾

  • First a few assumptions:
    • The problem seems to ask for an equation as the description of the model
    • I assume this means that the client needs a linear regression output
      • Many clients like this because, even if there are more accurate models, a linear regression is highly interpretable
    • Nevertheless, we will still want to begin with a decision tree model because we can use it to figure out which features are most important
      • I will use the decision tree models as a kind of feature selection algorithm
  • I will proceed in two steps:
    • 1. Exploratory Data Analysis
      • Check correlations
      • Check pair plots
      • Check if polynomial transformations or interactions are helpful here
        • I am NOT going to do log transformations because there are too many negative numbers
          • You can add a constant to correct this, but you need to know the underlying data well to do so, and it is something the Data Scientist should discuss with the team.
          • Why? Because even though shifting by a constant will not alter the variance, it can alter the interpretability or meaning of the numbers.
          • Since we don't know what these underlying numbers actually represent in real life, I decided not to apply a log transformation here (see the sketch just after this list).
      • To select features I will use:
        • An XGBoost feature importance ranking
        • Backwards elimination regression, if needed
          • Please note that since this is just EDA, there will be no train/test split...I'm just trying to get a sense of what variables 'matter most' across the dataset
        • I will use XGBoost first.
        • This is often my preferred model.
        • However, I'm assuming that the client actually needs an interpretable linear-regression-esque equation
        • Thus we cannot end with XGBoost
          • The nature of the tree splits is to be less interpretable
          • We know that something matters, but we won't have a coefficient or a direction (positive/negative)
    • 2. Modeling
      • So after deploying XGBoost for selection, I will run Linear Regressions next for the official modeling
      • This will help us with our ability to understand our features.
        • This regression will NOT be scaled.
        • Why?
          • In a word, interpretability
          • Once I scale the data, especially if I normalize it, then I lose the units
          • To be fair, this data does not seem to have units, but I want to make this as real as possible
          • Before we lose the units through scaling and then apply the regularization via Ridge or Lasso, we want to be sure we care more about prediction than inference
          • Given how this problem is stated, I'm assuming the goal is an odd combination of interpretability and prediction, which here means an interpretable linear regression equation
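For reference only, here is a minimal sketch of the shifted-log transform described in the list above, i.e. the approach we decided against. The shift amount below is derived mechanically from the column itself and is purely illustrative, not a domain-informed choice.

#hedged sketch of a shifted-log transform (NOT used in the modeling below)
def shifted_log(series):
    #shift the series so its minimum lands at zero, then apply log1p so log(0) is safe
    shift_constant = -series.min()
    return np.log1p(series + shift_constant)

#illustrative usage on a column that contains negative values, e.g.:
#ds1['x6_shifted_log'] = shifted_log(ds1['x6'])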
In [9]:
#check correlations
#all credit due to Pedro Marcelino; https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
corrmat = ds1.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
plt.savefig('ds1_correlation_matrix.png',dpi=300, bbox_inches='tight')

💭👆🏾Just a few thoughts...👆🏾💭

  • We can already see that 'dependent' variable 'yc' doesn't seem to be correlated with any of the independent features.
  • We also see that the 'dependent' variables seem to correlate with one another. This may be because they are somehow generated from the same independent variables.
  • This is also going to come up later on, but we can see that x1 and x3 are correlated.
  • We can begin to say:
    • ya is correlated to x1, x2, and x3
      • this does not factor in collinearity
    • yb is correlated to x1 and x3
      • this does not factor in collinearity
    • yc does not seem to be correlated to any of the x's
In [10]:
#check the pair plot
#all credit due to Pedro Marcelino; https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
sns.pairplot(ds1, height = 2.5)
plt.savefig('ds1_pair_plot_lower_res.png',dpi=150, bbox_inches='tight')
plt.show()

💭👆🏾Just a few thoughts...👆🏾💭

  • As we scan the chart, we can see a few linear-ish or nearly linear relationships.
  • These charts don't include interactions.
  • We see some blobs, and some bizarre relationships.
  • We will eliminate features in the feature selection process.

👇🏾 Using XGBoost & Regressions To See Which Features Seem Important Before Modeling👇🏾¶

  • Now we shall use XGBoost to get an idea for the features that are most important.
  • Before checking the polynomial relationships, we just want to get an idea of what's going on.
  • After each XGBoost output we'll also put all the features into a regression against the given dependent variable and see how closely the regression follows the XGBoost insights.
  • First, we will start by defining our dependent and independent dataframes.
In [11]:
#define the dependent and independent relationships
features = ds1[['x1','x2','x3','x5','x6']]
ya = ds1['ya']
yb = ds1['yb']
yc = ds1['yc']

🔬 Function 2: XGBoost Modeler 🔬

  • This function will check all features against the dependent variables
  • It will then plot the predicteds vs. actuals as well as the residuals.
  • It will provide pertinent model metrics.
  • It will then make a horizontal bar chart for feature importance.
In [12]:
def xgboost_modeler(X, y, cv_folds_int, x_left, x_right, y_bottom, y_top):
    X_train_xgb, X_test, y_train_xgb, y_test = train_test_split\
    (X, y, test_size = 0.2, random_state = 1)
    xgb_model = xgb.XGBRegressor(objective="reg:squarederror", random_state=42)
    xgb_model.fit(X_train_xgb,y_train_xgb)
    y_pred_xgb = xgb_model.predict(X_test)
    #linear regression standard
    #print r^2 and plot linear regression
    print("R^2:"+str(round(r2_score(y_test, y_pred_xgb),4)))
    print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error((y_test),\
                                                                               (y_pred_xgb)),2)))
    print("Median Absolute Error (in original units): "+str(round(median_absolute_error((y_test),\
                                                                               (y_pred_xgb)),2)))
    print("STD of Training Data (in original units): "+str(round(np.std((y_test)),2)))
    print("Mean of Training Data (in original units): "+str(round(np.mean((y_test)),2)))
    print("Residual Skew: "+str(round(stats.skew((y_pred_xgb)-(y_test)),2)))
    print("Residual Kurtosis: "+str(round(stats.kurtosis((y_pred_xgb)-(y_test)),2)))
    fig, axs = plt.subplots(ncols=(3), nrows=(1), figsize=(15,4))
    plt.subplots_adjust(top = 0.95, bottom=0.01, hspace=0.4, wspace=0.16)
    axs[0].scatter((y_test), (y_pred_xgb))
    axs[0].set_title('Actuals (x) vs Predicteds (y)',fontweight = 'bold')
    axs[0].set_xlim([x_left, x_right])
    axs[0].set_ylim([y_bottom, y_top])
    axs[1].scatter((y_test), ((y_pred_xgb)-(y_test)))
    axs[1].set_title('Actuals (x) vs Residuals [pred. - actual] (y)',fontweight = 'bold')
    axs[1].set_xlim([x_left, x_right])
    axs[1].set_ylim([y_bottom, y_top])
    axs[2].hist(((y_pred_xgb)-(y_test)), bins=100)
    axs[2].set_title('Histogram of Residuals [predicted - actual]',fontweight = 'bold')
    plt.savefig('xgboost_modeling_',dpi=300, bbox_inches='tight')
    plt.show()
    print("Cross-Validation Scoring for XGBoost")
    print('Mean Mean Absolute Error: {}'.format(-1*round(cross_val_score(xgb_model, X_train_xgb, y_train_xgb, \
                                                    cv=cv_folds_int, scoring='neg_mean_absolute_error').mean(),2)))
    print('Median Mean Absolute Error: {}'.format(-1*round(np.median(cross_val_score(xgb_model, X_train_xgb,\
                                                                                     y_train_xgb, \
                                                     cv=cv_folds_int, scoring='neg_mean_absolute_error')),2)))      
    print('Mean R^2: {}'.format(round(cross_val_score(xgb_model, X_train_xgb, y_train_xgb, cv=cv_folds_int,\
                                                      scoring='r2').mean(),3)))
    print('Median R^2: {}'.format(round(np.median(cross_val_score(xgb_model,\
                                             X_train_xgb, y_train_xgb, cv=cv_folds_int, scoring='r2')),3)))
    print('Max R^2: {}'.format(round(np.max(cross_val_score(xgb_model,\
                                             X_train_xgb, y_train_xgb, cv=cv_folds_int, scoring='r2')),3)))
    print('Min R^2: {}'.format(round(np.min(cross_val_score(xgb_model,\
                                             X_train_xgb, y_train_xgb, cv=cv_folds_int, scoring='r2')),3)))
    top_xgb_features = pd.DataFrame(sorted(list(zip(X,xgb_model.feature_importances_))\
       ,key = lambda x: abs(x[1]),reverse=True)[:10], columns=['Feature', 'XGBoost Importance'])
    top_xgb_features
    bar_count = range(len(top_xgb_features.Feature))
    fig, axs = plt.subplots(ncols=2, figsize=(14,4))
    #using a subplot method coupled with an inline parameter to have high resolution
    #note: "[::-1]" reverses the column in a pandas dataframe
    axs[1].set_axis_off()
    axs[0].barh(bar_count, top_xgb_features['XGBoost Importance'][::-1],\
                 align='center', alpha=1)
    axs[0].set_xlabel('Values')
    axs[0].set_yticks(bar_count)
    axs[0].set_yticklabels(top_xgb_features.Feature[::-1], fontsize=10)
    axs[0].set_xlabel('XGBoost Importance')
    axs[0].set_title("XGBoost's Feature Importances",fontweight = 'bold')
    extent = axs[0].get_window_extent().transformed(fig.dpi_scale_trans.inverted())
    fig.savefig('xgb_features_importance_dependent_var_',dpi=300, bbox_inches=extent.expanded(1.5, 1.5))
    plt.show()

👇🏾 First, we check dependent variable 'ya' 👇🏾

In [13]:
xgboost_modeler(X=features, y=ya, cv_folds_int=4, x_left=-100, x_right=100, y_bottom=-100, y_top=100)
R^2:0.704
Mean Absolute Error (in original units): 7.99
Median Absolute Error (in original units): 6.79
STD of Training Data (in original units): 18.39
Mean of Training Data (in original units): 3.98
Residual Skew: -0.02
Residual Kurtosis: 0.03
Cross-Validation Scoring for XGBoost
Mean Mean Absolute Error: 8.02
Median Mean Absolute Error: 8.03
Mean R^2: 0.703
Median R^2: 0.704
Max R^2: 0.707
Min R^2: 0.696

💭👆🏾Just a few thoughts...👆🏾💭

  • This lets us know that for dependent variable ya it looks like:
    • x5 and x6 are **not** important
    • x2 is most important, followed by x1 and x3
  • We'll also peek at a standard regression...
In [14]:
features_regression = add_constant(features)
regressor_OLS = sm.OLS(endog = ya, exog = features_regression).fit()
regressor_OLS.summary()
Out[14]:
OLS Regression Results
Dep. Variable: ya R-squared: 0.677
Model: OLS Adj. R-squared: 0.677
Method: Least Squares F-statistic: 4.190e+04
Date: Mon, 30 Sep 2019 Prob (F-statistic): 0.00
Time: 23:23:39 Log-Likelihood: -3.7685e+05
No. Observations: 100000 AIC: 7.537e+05
Df Residuals: 99994 BIC: 7.538e+05
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 7.5697 0.089 84.712 0.000 7.395 7.745
x1 2.5177 0.020 124.571 0.000 2.478 2.557
x2 6.2346 0.017 376.359 0.000 6.202 6.267
x3 0.9511 0.033 28.707 0.000 0.886 1.016
x5 0.0023 0.033 0.070 0.944 -0.063 0.067
x6 0.0041 0.007 0.581 0.561 -0.010 0.018
Omnibus: 41.538 Durbin-Watson: 1.997
Prob(Omnibus): 0.000 Jarque-Bera (JB): 42.505
Skew: 0.039 Prob(JB): 5.89e-10
Kurtosis: 3.064 Cond. No. 19.7


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

💭👆🏾Just a few thoughts...👆🏾💭

  • This regression confirms our XGBoost insights regarding dependent variable ya.

👇🏾 Second, we check dependent variable 'yb' 👇🏾

In [15]:
xgboost_modeler(X=features, y=yb, cv_folds_int=4, x_left=-5, x_right=5, y_bottom=-5, y_top=5)
R^2:0.9323
Mean Absolute Error (in original units): 0.16
Median Absolute Error (in original units): 0.14
STD of Training Data (in original units): 0.77
Mean of Training Data (in original units): 2.12
Residual Skew: -0.02
Residual Kurtosis: 0.03
Cross-Validation Scoring for XGBoost
Mean Mean Absolute Error: 0.16
Median Mean Absolute Error: 0.16
Mean R^2: 0.932
Median R^2: 0.932
Max R^2: 0.933
Min R^2: 0.931

💭👆🏾Just a few thoughts...👆🏾💭

  • This is amazing!!!
  • It looks like just one variable, **x1**, decides what dependent variable 'yb' is going to be.
  • We'll check this with a standard regression...
In [16]:
features_regression = add_constant(features)
regressor_OLS = sm.OLS(endog = yb, exog = features_regression).fit()
regressor_OLS.summary()
Out[16]:
OLS Regression Results
Dep. Variable: yb R-squared: 0.895
Model: OLS Adj. R-squared: 0.895
Method: Least Squares F-statistic: 1.702e+05
Date: Mon, 30 Sep 2019 Prob (F-statistic): 0.00
Time: 23:25:20 Log-Likelihood: -3038.6
No. Observations: 100000 AIC: 6089.
Df Residuals: 99994 BIC: 6146.
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.8427 0.002 396.244 0.000 0.839 0.847
x1 0.2542 0.000 528.388 0.000 0.253 0.255
x2 6.148e-05 0.000 0.156 0.876 -0.001 0.001
x3 -0.0018 0.001 -2.279 0.023 -0.003 -0.000
x5 0.0003 0.001 0.353 0.724 -0.001 0.002
x6 7.677e-05 0.000 0.454 0.650 -0.000 0.000
Omnibus: 2347.227 Durbin-Watson: 1.989
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2710.565
Skew: -0.336 Prob(JB): 0.00
Kurtosis: 3.445 Cond. No. 19.7


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

💭👆🏾Just a few thoughts...👆🏾💭

  • Even though you could argue that x3 is also statistically significant, you can see that x1's t-score of over 500 shows that it is likely doing most of the work here, just like XGBoost said.
  • Let's actually run this regression again...but only for x1 and see if we keep the R^2 at about the same level (meaning that the variance can be accounted for by x1).

💭👇🏾Let's also check that one feature, x1, which seems to be doing the brunt of the work here.👇🏾💭

In [17]:
features_regression_x1 = add_constant(ds1.x1)
regressor_OLS = sm.OLS(endog = yb, exog = features_regression_x1).fit()
regressor_OLS.summary()
Out[17]:
OLS Regression Results
Dep. Variable: yb R-squared: 0.895
Model: OLS Adj. R-squared: 0.895
Method: Least Squares F-statistic: 8.511e+05
Date: Mon, 30 Sep 2019 Prob (F-statistic): 0.00
Time: 23:25:20 Log-Likelihood: -3041.4
No. Observations: 100000 AIC: 6087.
Df Residuals: 99998 BIC: 6106.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.8428 0.002 531.489 0.000 0.840 0.846
x1 0.2533 0.000 922.571 0.000 0.253 0.254
Omnibus: 2346.712 Durbin-Watson: 1.989
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2709.981
Skew: -0.336 Prob(JB): 0.00
Kurtosis: 3.445 Cond. No. 11.9


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

💭👆🏾Just a few thoughts...👆🏾💭

  • Yup, the R^2 didn't budge. It stayed right there at 0.895.
    • Looks like we have the x1 independent variable predicting the yb dependent variable.

👇🏾 Third, we check dependent variable 'yc' 👇🏾

In [18]:
xgboost_modeler(X=features, y=yc, cv_folds_int=4, x_left=-0.1, x_right=0.1, y_bottom=-0.1, y_top=0.1)
R^2:-0.0013
Mean Absolute Error (in original units): 0.01
Median Absolute Error (in original units): 0.0
STD of Training Data (in original units): 0.03
Mean of Training Data (in original units): 0.0
Residual Skew: -0.7
Residual Kurtosis: 43.41
Cross-Validation Scoring for XGBoost
Mean Mean Absolute Error: 0.01
Median Mean Absolute Error: 0.01
Mean R^2: -0.003
Median R^2: -0.003
Max R^2: -0.001
Min R^2: -0.006

💭👆🏾Just a few thoughts...👆🏾💭

  • We know from the correlation chart above that yc was going to be trouble because it did not show any correlation to any of the x's.
    • Here we see more of the same: the R^2 on the test set was literally negative, meaning the model is actually worse than simply predicting the mean.
    • This is a tell-tale sign that yc has nothing to do with the x's.
    • We'll double-check this with a regression, and we'll give polynomial feature engineering a shot later on.
In [19]:
features_regression = add_constant(features)
regressor_OLS = sm.OLS(endog = yc, exog = features_regression).fit()
regressor_OLS.summary()
Out[19]:
OLS Regression Results
Dep. Variable: yc R-squared: 0.000
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: 0.9993
Date: Mon, 30 Sep 2019 Prob (F-statistic): 0.416
Time: 23:26:55 Log-Likelihood: 2.0480e+05
No. Observations: 100000 AIC: -4.096e+05
Df Residuals: 99994 BIC: -4.095e+05
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -0.0001 0.000 -0.427 0.669 -0.001 0.000
x1 0.0001 6.02e-05 2.008 0.045 2.87e-06 0.000
x2 7.659e-06 4.93e-05 0.155 0.877 -8.9e-05 0.000
x3 -0.0002 9.87e-05 -1.844 0.065 -0.000 1.14e-05
x5 8.846e-05 9.85e-05 0.898 0.369 -0.000 0.000
x6 -9.798e-07 2.12e-05 -0.046 0.963 -4.25e-05 4.05e-05
Omnibus: 38901.894 Durbin-Watson: 1.994
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8034394.166
Skew: 0.721 Prob(JB): 0.00
Kurtosis: 46.888 Cond. No. 19.7


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

💭👆🏾Just a few thoughts...👆🏾💭

  • Again, we see an R^2 of 0.
  • And the p-values don't seem too impressive either. (There are two around the 5% level, but given everything else we've seen, I don't think yc relates to the given independent features.)
    • ...and if this were Hayward Data Science (my consulting business), I would not model this, because we might get an even worse outcome than not knowing -- thinking we know when we are actually wrong.

💭👇🏾Brief 'Side Path' - Checking for Polynomial Features👇🏾💭

  • We've already discussed that we are not going to perform log transforms due to the negative numbers and the lack of units (such that we don't know whether adding a constant would hurt our interpretability, even though it wouldn't affect the variance).
  • What we will want to do is train our XGBoost models on the polynomial features to see if there is marked improvement.
  • Just for demonstration purposes I will also run a Backwards Elimination Regression Algorithm as well.
In [20]:
def backwardElimination(X, y, significance_level_float):
    #iteratively refit an OLS model and drop the feature with the highest p-value
    #until every remaining feature clears the chosen significance level
    numCols = len(X.columns)
    for i in range(numCols):
        regressor_OLS = sm.OLS(y, X).fit()
        maxVar = max(regressor_OLS.pvalues)
        if maxVar > significance_level_float:
            #drop the single worst (highest p-value) column, then refit on the next pass
            X = X.drop(columns=[regressor_OLS.pvalues.idxmax()])
        else:
            #all remaining p-values are below the threshold, so we can stop early
            break
    return X
In [21]:
#make the polynomial features
poly = PolynomialFeatures(degree=3, include_bias=False)
features_w_poly = poly.fit_transform(features)
features_w_poly = pd.DataFrame(features_w_poly, columns=poly.get_feature_names(features.columns))
In [22]:
#adding in a constant to prepare for regression (we had bias as False above so we need to add the constant now)
features_poly_regression = add_constant(features_w_poly)
In [23]:
#XGboost polynomial test for ya
xgboost_modeler(X=features_w_poly, y=ya, cv_folds_int=4, x_left=-100, x_right=100, y_bottom=-100, y_top=100)
R^2:0.7034
Mean Absolute Error (in original units): 8.0
Median Absolute Error (in original units): 6.78
STD of Training Data (in original units): 18.39
Mean of Training Data (in original units): 3.98
Residual Skew: -0.02
Residual Kurtosis: 0.03
Cross-Validation Scoring for XGBoost
Mean Mean Absolute Error: 8.03
Median Mean Absolute Error: 8.03
Mean R^2: 0.703
Median R^2: 0.704
Max R^2: 0.707
Min R^2: 0.697

💭👆🏾Just a few thoughts...👆🏾💭

  • Even with the new polynomial features, we are still focused on x1 and x2.
  • Plus, the R^2 actually dropped from 0.704 to 0.703.
  • We've added a ton of features, but we don't see a marked improvement.
  • All things being equal, we want a simpler model, so I'll be sticking to the features we had when we move to the modeling stage for ya.
In [24]:
#backwards elimination regression algo for ya
#note this cell runs the Backwards Elimination, and this cell will take a while to execute
Features_Poly_Pruned_ya = backwardElimination(features_poly_regression, ya, significance_level_float = 0.00001)
Features_Poly_Pruned_ya = add_constant(Features_Poly_Pruned_ya)
pruned_regressor_OLS = sm.OLS(endog = ya, exog = Features_Poly_Pruned_ya).fit()
pruned_regressor_OLS.summary()
Out[24]:
OLS Regression Results
Dep. Variable: ya R-squared: 0.705
Model: OLS Adj. R-squared: 0.705
Method: Least Squares F-statistic: 3.418e+04
Date: Mon, 30 Sep 2019 Prob (F-statistic): 0.00
Time: 23:41:54 Log-Likelihood: -3.7226e+05
No. Observations: 100000 AIC: 7.445e+05
Df Residuals: 99992 BIC: 7.446e+05
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.6807 0.191 3.556 0.000 0.305 1.056
x1 3.0860 0.129 23.934 0.000 2.833 3.339
x2 3.0625 0.048 64.086 0.000 2.969 3.156
x3 0.9346 0.032 29.531 0.000 0.873 0.997
x1^2 0.2466 0.026 9.372 0.000 0.195 0.298
x1 x2 0.8758 0.022 39.854 0.000 0.833 0.919
x1^3 -0.0166 0.002 -9.875 0.000 -0.020 -0.013
x1^2 x2 -0.0365 0.002 -17.168 0.000 -0.041 -0.032
Omnibus: 3.440 Durbin-Watson: 1.993
Prob(Omnibus): 0.179 Jarque-Bera (JB): 3.430
Skew: 0.014 Prob(JB): 0.180
Kurtosis: 3.008 Cond. No. 2.87e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.87e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

💭👆🏾Just a few thoughts...👆🏾💭

  • The regression here seems to generally map on to what I just said about the XGBoost run above.
  • We added a lot of features, but we don't see much uptick in explanation of variance.
  • I'm inclined to stick with my simpler features.
In [25]:
#XGboost polynomial test for yb
xgboost_modeler(X=features_w_poly, y=yb, cv_folds_int=4, x_left=-6, x_right=6, y_bottom=-6, y_top=6)
R^2:0.9323
Mean Absolute Error (in original units): 0.16
Median Absolute Error (in original units): 0.14
STD of Training Data (in original units): 0.77
Mean of Training Data (in original units): 2.12
Residual Skew: -0.02
Residual Kurtosis: 0.03
Cross-Validation Scoring for XGBoost
Mean Mean Absolute Error: 0.16
Median Mean Absolute Error: 0.16
Mean R^2: 0.932
Median R^2: 0.932
Max R^2: 0.933
Min R^2: 0.931

💭👆🏾Just a few thoughts...👆🏾💭

  • This is truly amazing!!!!!!!!
  • No matter what we throw at yb, it just wants to stick with x1.
  • yb is the faithful, monogamous type 😊 hahaha!
In [26]:
#backwards elimination regression algo for yb
#note this cell runs the Backwards Elimination, and this cell will take a while to execute
Features_Poly_Pruned_yb = backwardElimination(features_poly_regression, yb, significance_level_float = 0.00001)
Features_Poly_Pruned_yb = add_constant(Features_Poly_Pruned_yb)
pruned_regressor_OLS = sm.OLS(endog = yb, exog = Features_Poly_Pruned_yb).fit()
pruned_regressor_OLS.summary()
Out[26]:
OLS Regression Results
Dep. Variable: yb R-squared: 0.930
Model: OLS Adj. R-squared: 0.930
Method: Least Squares F-statistic: 4.446e+05
Date: Mon, 30 Sep 2019 Prob (F-statistic): 0.00
Time: 23:56:28 Log-Likelihood: 17486.
No. Observations: 100000 AIC: -3.496e+04
Df Residuals: 99996 BIC: -3.493e+04
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.3990 0.003 154.586 0.000 0.394 0.404
x1 0.6033 0.002 270.430 0.000 0.599 0.608
x1^2 -0.0604 0.001 -116.458 0.000 -0.061 -0.059
x1^3 0.0028 3.41e-05 82.706 0.000 0.003 0.003
Omnibus: 5.212 Durbin-Watson: 1.995
Prob(Omnibus): 0.074 Jarque-Bera (JB): 5.344
Skew: 0.000 Prob(JB): 0.0691
Kurtosis: 3.036 Cond. No. 1.97e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.97e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

💭👆🏾Just a few thoughts...👆🏾💭

  • Again, it's all x1 and variations thereof.
  • We're definitely just going to stick with x1 for the modeling.
    • ❤️ x1 and yb are a true data love story. ❤️
In [27]:
#XGboost polynomial test for yc
xgboost_modeler(X=features_w_poly, y=yc, cv_folds_int=4, x_left=-0.1, x_right=0.1, y_bottom=-0.1, y_top=0.1)
R^2:-0.0021
Mean Absolute Error (in original units): 0.01
Median Absolute Error (in original units): 0.0
STD of Training Data (in original units): 0.03
Mean of Training Data (in original units): 0.0
Residual Skew: -0.69
Residual Kurtosis: 43.28
Cross-Validation Scoring for XGBoost
Mean Mean Absolute Error: 0.01
Median Mean Absolute Error: 0.01
Mean R^2: -0.004
Median R^2: -0.003
Max R^2: -0.003
Min R^2: -0.007

💭👆🏾Just a few thoughts...👆🏾💭

  • Again, it's not looking good for modeling yc.
  • At first blush it looks like XGBoost may have finally found something, but:
    • We see that we are essentially interacting x1, x5, and x6 together in one mega column, and the importance gain is still only about 0.10.
    • Since we've seen such limited correlation in our EDA phase, I'm inclined to be quite skeptical here and say that if you throw enough things against the wall, then something is bound to stick.
      • But we are trying to follow the data, not jam the data into our goals.
  • I would be inclined to not model yc at all here.
In [28]:
#backwards elimination regression algo for yc
#note this cell runs the Backwards Elimination, and this cell will take a while to execute
Features_Poly_Pruned_yc = backwardElimination(features_poly_regression, yc, significance_level_float = 0.00001)
Features_Poly_Pruned_yc = add_constant(Features_Poly_Pruned_yc)
pruned_regressor_OLS = sm.OLS(endog = yc, exog = Features_Poly_Pruned_yc).fit()
pruned_regressor_OLS.summary()
Out[28]:
OLS Regression Results
Dep. Variable: yc R-squared: 0.000
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: inf
Date: Tue, 01 Oct 2019 Prob (F-statistic): nan
Time: 00:10:11 Log-Likelihood: 2.0480e+05
No. Observations: 100000 AIC: -4.096e+05
Df Residuals: 99999 BIC: -4.096e+05
Df Model: 0
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.0001 9.87e-05 1.036 0.300 -9.12e-05 0.000
Omnibus: 38898.706 Durbin-Watson: 1.994
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8032981.336
Skew: 0.720 Prob(JB): 0.00
Kurtosis: 46.884 Cond. No. 1.00


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

💭👆🏾Just a few thoughts...👆🏾💭

  • And here we have, as my late Dad would say, 'more trouble in River City.'
    • Namely, in the Backwards Elimination Regression round, none of the independent variables were able to meet the threshold level needed to stay in the regression.
    • The data is screaming at us not to model yc.
In [29]:
def multiple_linear_regression_modeler(X, y, cv_folds_int, x_left, x_right, y_bottom, y_top):
    X_train1, X_test, y_train1, y_test = train_test_split\
    (X, y, test_size = 0.2, random_state = 1)
    model = LinearRegression()
    model.fit(X_train1,y_train1)
    y_slr_predict = model.predict(X_test)
    #linear regression standard
    #print r^2 and plot linear regression
    print("R^2:"+str(round(r2_score(y_test, y_slr_predict),4)))
    print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error((y_test),\
                                                                               (y_slr_predict)),2)))
    print("STD of Training Data (in original units): "+str(round(np.std((y_test)),2)))
    print("Mean of Training Data (in original units): "+str(round(np.mean((y_test)),2)))
    print("Residual Skew: "+str(round(stats.skew((y_slr_predict)-(y_test)),2)))
    print("Residual Kurtosis: "+str(round(stats.kurtosis((y_slr_predict)-(y_test)),2)))
    fig, axs = plt.subplots(ncols=(3), nrows=(1), figsize=(15,4))
    plt.subplots_adjust(top = 0.95, bottom=0.01, hspace=0.25, wspace=0.16)
    axs[0].scatter((y_test), (y_slr_predict))
    axs[0].set_title('Actuals (x) vs Predicteds (y)',fontweight = 'bold')
    axs[0].set_xlim([x_left, x_right])
    axs[0].set_ylim([y_bottom, y_top])
    axs[1].scatter((y_test), ((y_slr_predict)-(y_test)))
    axs[1].set_title('Actuals (x) vs Residuals [pred. - actual] (y)',fontweight = 'bold')
    axs[1].set_xlim([x_left, x_right])
    axs[1].set_ylim([y_bottom, y_top])
    axs[2].hist(((y_slr_predict)-(y_test)), bins=100)
    axs[2].set_title('Histogram of Residuals [predicted - actual]',fontweight = 'bold')
    plt.savefig('linear_regression_modeling_',dpi=300, bbox_inches='tight')
    plt.show()
    print("Cross-Validation Scoring for Standard Linear Regression")
    print('Mean Absolute Error: {}'.format(-1*round(cross_val_score(model, X_train1, y_train1, \
                                                 cv=cv_folds_int, scoring='neg_mean_absolute_error').mean(),2)))
    print('Median Mean Absolute Error: {}'.format(-1*round(np.median(cross_val_score(model, X_train1, y_train1, \
                                                 cv=cv_folds_int, scoring='neg_mean_absolute_error')),2)))  
    
    print('Mean R^2: {}'.format(round(cross_val_score(model, X_train1, y_train1, cv=cv_folds_int, \
                                                 scoring='r2').mean(),2)))
    print('Median R^2: {}'.format(round(np.median(cross_val_score(model,\
                                             X_train1, y_train1, cv=cv_folds_int, scoring='r2')),3)))
    print('Max R^2: {}'.format(round(np.max(cross_val_score(model,\
                                             X_train1, y_train1, cv=cv_folds_int, scoring='r2')),3)))
    print('Min R^2: {}'.format(round(np.min(cross_val_score(model,\
                                             X_train1, y_train1, cv=cv_folds_int, scoring='r2')),3)))
    linear_regression_feature_list = []
    linear_regression_coef_list = []
    for i in X.columns:
        linear_regression_feature_list.append(i)
    for i in range(len(model.coef_)):
        linear_regression_coef_list.append(model.coef_[i])
    top_10_linear_regression_features = pd.DataFrame(sorted(list(zip(linear_regression_feature_list,\
                                                                     linear_regression_coef_list))\
       ,key = lambda x: abs(x[1]),reverse=True)[:10], columns=['Feature', 'Linear Regression Coefficient'])
    top_10_linear_regression_features   
    print(top_10_linear_regression_features)
    print('Intercept: {}'.format(model.intercept_))
In [30]:
def single_linear_regression_modeler(X, single_feature_str, y, cv_folds_int, x_left, x_right, y_bottom, y_top):
    X_train1, X_test, y_train1, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
    model = LinearRegression()
    model.fit(X_train1,y_train1)
    y_slr_predict = model.predict(X_test)
    print("R^2:"+str(round(r2_score(y_test, y_slr_predict),4)))
    print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error((y_test),\
                                                                                   (y_slr_predict)),2)))
    print("STD of Training Data (in original units): "+str(round(np.std((y_test)),2)))
    print("Mean of Training Data (in original units): "+str(round(np.mean((y_test)),2)))
    print("Residual Skew: "+str(round(stats.skew((y_slr_predict)-(y_test)),2)))
    print("Residual Kurtosis: "+str(round(stats.kurtosis((y_slr_predict)-(y_test)),2)))
    fig, axs = plt.subplots(ncols=(3), nrows=(1), figsize=(15,4))
    plt.subplots_adjust(top = 0.95, bottom=0.01, hspace=0.25, wspace=0.16)
    axs[0].scatter((y_test), (y_slr_predict))
    axs[0].set_title('Actuals (x) vs Predicteds (y)',fontweight = 'bold')
    axs[0].set_xlim([x_left, x_right])
    axs[0].set_ylim([y_bottom, y_top])
    axs[1].scatter((y_test), ((y_slr_predict)-(y_test)))
    axs[1].set_title('Actuals (x) vs Residuals [pred. - actual] (y)',fontweight = 'bold')
    axs[1].set_xlim([x_left, x_right])
    axs[1].set_ylim([y_bottom, y_top])
    axs[2].hist(((y_slr_predict)-(y_test)), bins=100)
    axs[2].set_title('Histogram of Residuals [predicted - actual]',fontweight = 'bold')
    plt.savefig('linear_regression_modeling_',dpi=300, bbox_inches='tight')
    plt.show()
    print("Cross-Validation Scoring for Standard Linear Regression")
    print('Mean Absolute Error: {}'.format(-1*round(cross_val_score(model, X_train1, y_train1, \
                                                     cv=cv_folds_int, scoring='neg_mean_absolute_error').mean(),2)))
    print('Median Mean Absolute Error: {}'.format(-1*round(np.median(cross_val_score(model, X_train1, y_train1, \
                                                     cv=cv_folds_int, scoring='neg_mean_absolute_error')),2)))  

    print('Mean R^2: {}'.format(round(cross_val_score(model, X_train1, y_train1, cv=cv_folds_int, \
                                                     scoring='r2').mean(),2)))
    print('Median R^2: {}'.format(round(np.median(cross_val_score(model,\
                                                 X_train1, y_train1, cv=cv_folds_int, scoring='r2')),3)))
    print('Max R^2: {}'.format(round(np.max(cross_val_score(model,\
                                                 X_train1, y_train1, cv=cv_folds_int, scoring='r2')),3)))
    print('Min R^2: {}'.format(round(np.min(cross_val_score(model,\
                                                 X_train1, y_train1, cv=cv_folds_int, scoring='r2')),3)))
    linear_regression_feature_list = []
    linear_regression_coef_list = []
    for i in list([single_feature_str]):
        linear_regression_feature_list.append(i)
    for i in range(len(model.coef_)):
        linear_regression_coef_list.append(model.coef_[i])
        top_10_linear_regression_features = pd.DataFrame(sorted(list(zip(linear_regression_feature_list,\
                                                                         linear_regression_coef_list))\
           ,key = lambda x: abs(x[1]),reverse=True)[:10], columns=['Feature', 'Linear Regression Coefficient'])
    top_10_linear_regression_features   
    print(top_10_linear_regression_features)
    print('Intercept: {}'.format(model.intercept_))

👇🏾 First, let's build a multiple linear regression model for dependent variable 'ya.'👇🏾

In [31]:
multiple_linear_regression_modeler(X=ds1[['x1','x2','x3']], 
                        y=ya, cv_folds_int=4, x_left=-100, x_right=100, y_bottom=-100,y_top=100)
R^2:0.6769
Mean Absolute Error (in original units): 8.33
STD of Training Data (in original units): 18.39
Mean of Training Data (in original units): 3.98
Residual Skew: -0.04
Residual Kurtosis: 0.1
Cross-Validation Scoring for Standard Linear Regression
Mean Absolute Error: 8.36
Median Mean Absolute Error: 8.36
Mean R^2: 0.68
Median R^2: 0.677
Max R^2: 0.683
Min R^2: 0.671
  Feature  Linear Regression Coefficient
0      x2                       6.236009
1      x1                       2.513063
2      x3                       0.956032
Intercept: 7.579525599825636

👇🏾 Conclusion: Equation for Predicting 'ya.'👇🏾

$ \begin{align} ya = 2.513(x1) + 6.236(x2) + 0.956(x3) + 7.58 \end{align} $
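As a quick, illustrative sanity check (not part of the model pipeline above), we can plug the fitted coefficients back into ds1 by hand and look at the resulting mean absolute error; it should land in the same neighborhood as the roughly 8.3 reported by the cross-validated model.

#hand-rolled check of the fitted equation for ya against the actual values in ds1
ya_hat = 2.513*ds1['x1'] + 6.236*ds1['x2'] + 0.956*ds1['x3'] + 7.58
print('MAE of the hand-rolled equation: {:.2f}'.format((ya_hat - ds1['ya']).abs().mean()))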

👇🏾 Second, let's build a single linear regression model for dependent variable 'yb.' 👇🏾

In [32]:
single_linear_regression_modeler(X=ds1.x1.values.reshape(-1, 1), single_feature_str='x1',\
                                 y=yb, cv_folds_int=4, x_left=-6, x_right=6, y_bottom=-6, y_top=6)
R^2:0.8965
Mean Absolute Error (in original units): 0.2
STD of Training Data (in original units): 0.77
Mean of Training Data (in original units): 2.12
Residual Skew: 0.32
Residual Kurtosis: 0.38
Cross-Validation Scoring for Standard Linear Regression
Mean Absolute Error: 0.2
Median Mean Absolute Error: 0.2
Mean R^2: 0.89
Median R^2: 0.894
Max R^2: 0.895
Min R^2: 0.894
  Feature  Linear Regression Coefficient
0      x1                        0.25334
Intercept: 0.8422290091647144

👇🏾 Conclusion: Equation for Predicting 'yb.' 👇🏾

$ \begin{align} yb = 0.253(x1) + 0.8422 \end{align} $

👇🏾 Finally, let's build a multiple linear regression model for dependent variable 'yc.' We don't expect it to work. (Note that the call below passes the full ds1 frame, which still contains yc itself, so the perfect R^2 in the output is a trivial artifact rather than a real model.) 👇🏾

In [33]:
multiple_linear_regression_modeler(X=ds1,y=yc, cv_folds_int=4, x_left=-0.1, x_right=0.1, y_bottom=-0.1, y_top=0.1)
R^2:1.0
Mean Absolute Error (in original units): 0.0
STD of Training Data (in original units): 0.03
Mean of Training Data (in original units): 0.0
Residual Skew: -1.59
Residual Kurtosis: 3.56
Cross-Validation Scoring for Standard Linear Regression
Mean Absolute Error: 0.0
Median Mean Absolute Error: 0.0
Mean R^2: 1.0
Median R^2: 1.0
Max R^2: 1.0
Min R^2: 1.0
  Feature  Linear Regression Coefficient
0      yc                   1.000000e+00
1      yb                   7.799962e-15
2      x1                  -1.658834e-15
3      x2                   8.528791e-16
4      x3                   1.688990e-16
5      ya                  -1.358584e-16
6      x5                  -2.225314e-17
7      x6                  -2.418898e-18
Intercept: -5.4766033288217164e-15

Part II. - Unsupervised Machine Learning


  • Consider data set 2 (ds2.csv). The data set comprises a set of observations.

Question 3 - Part II.A


  • Describe the data set in a few sentences.

👇🏾 Just as we did in Part 1, Question 1, we begin by reading in the data. 👇🏾

  • We will check for NULLS.
  • The raw data contains its own index column, but a Pandas dataframe comes with an index by default, so we will drop this 'extra' index column.
  • We will summarize the data.
  • We will also check the distributions.
In [34]:
ds2 = pd.read_csv('ds2.csv')
In [35]:
#drop the extra index column since a Pandas dataframe has one built in
ds2 = ds2.drop(ds2.columns[0], axis=1)
In [36]:
#peek at the data
ds2.head()
Out[36]:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
0 23.778224 13.319974 15.565124 -3.713626 7.296793 -19.371013 -0.894130 -6.110282 -28.959316 2.851336
1 16.602950 23.311281 21.099052 -0.304154 -3.218990 2.357643 12.027277 7.070349 -5.762185 -23.050198
2 12.084683 19.710443 9.837102 -1.081918 -1.201942 9.738019 16.125920 19.119391 -15.582122 -12.292535
3 13.044534 10.749040 5.884407 -11.703525 -4.134358 -22.344666 -1.263349 0.493711 -15.305347 6.799087
4 8.314115 6.748794 5.388535 -0.000290 -4.724787 -16.346812 3.293600 -10.848273 -17.285491 6.034214
In [37]:
#check for NULLS
msno.matrix(ds2,  color = (.0,.0,.2))
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c2aa80048>

👇🏾 Next, just as in Part 1, Question 1, we describe the data. 👇🏾

  • Let's quickly describe the data.
  • This will give us the key metrics to see what we're dealing with.
In [38]:
#let's describe the data
ds2.describe()
Out[38]:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
count 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000
mean 8.677829 11.716801 9.252817 -2.679634 2.774942 0.077631 8.200884 8.712878 -12.860135 -1.339360
std 11.971827 6.655333 9.861443 10.625164 8.800553 15.375478 10.426023 9.881482 8.881026 14.534143
min -25.824199 -8.497562 -23.666439 -29.429655 -22.033329 -35.264019 -21.428538 -16.811146 -36.065150 -36.468083
25% 0.231327 7.161564 2.648845 -10.652694 -4.098043 -14.003670 -0.812888 1.480706 -19.430697 -13.216190
50% 12.754335 11.896021 11.422441 -2.631318 2.484416 1.500838 8.532535 9.628064 -14.417786 -2.094310
75% 17.364337 16.279210 16.503676 5.340314 9.660898 14.050512 17.138898 16.081044 -6.534142 10.561797
max 32.268570 32.909917 31.230550 26.422798 29.312010 31.727042 32.084297 36.847922 13.553705 32.641789
In [39]:
#all credit due to: Pedro Marcelino; https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
#correlation matrix
corrmat = ds2.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
plt.savefig('ds2_correlation_matrix.png',dpi=300, bbox_inches='tight')

💭👆🏾Just a few thoughts...👆🏾💭

  • Looking at this correlation matrix, it's hard to see any definite pattern, except that X6-X9 seem to be somehow correlated with one another.
In [40]:
#all credit due to: Pedro Marcelino, https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
sns.pairplot(ds2, height = 2.5)
plt.show()
In [41]:
#lets also check the distributions again
for i in ds2.columns:
    distribution_visualizer_and_fitter(ds2, i)

💭👆🏾Just a few thoughts...👆🏾💭

  • One thing that immediately jumps out at you is how bi-modal the data is.

Question 4 - Part II.B


  • How would you visualize this data set? Can you make an interesting visualization?

👇🏾 Gameplan for Principal Component Analysis (PCA)👇🏾

  • The difficulty with visualizing data like this, or doing any type of serious unsupervised learning on it, is the high dimensionality.
  • As I read somewhere on StackOverflow once, 'do you know anyone who can see in 10D?' 😊
  • So what we need to do is figure out how to reduce the dimensionality.
  • We want to see how few components we need to capture most of the cumulative variance.
  • Once we reduce the components a bit, we'll be able to tease out insight.
  • Once we tease out the insight, we'll be able to visualize it.
    • And, not to give away the next section, but then we can begin to cluster it with K-Means.

👇🏾 First thing's first: how many PCA components do we actually need?👇🏾

  • The first thing we do is figure out how many components we need to explain most of the variance in this high dimensional dataset.
  • I'll show you two methods.
    • One through a standard PCA method. Special thanks to Bartosz Mikulski for blogging about it.
    • One through a special Python third-party package. Special thanks to Andrew Heusser for writing it.
In [42]:
#it's important to scale the data so you can compare apples-to-apples how much each component contributes to variance
scaler = MinMaxScaler()
data_rescaled = scaler.fit_transform(ds2)
In [43]:
#we'll need to set a variance threshold. I choose 95%; others may choose 99%; it's up to you.
pca = PCA(n_components = 0.95)
pca.fit(data_rescaled)
reduced = pca.transform(data_rescaled)
In [44]:
#all credit here to Bartosz Mikulski, https://www.mikulskibartosz.name/pca-how-to-choose-the-number-of-components/
pca = PCA().fit(data_rescaled)

plt.rcParams["figure.figsize"] = (12,6)

fig, ax = plt.subplots()
xi = np.arange(1, 11, step=1)
y = np.cumsum(pca.explained_variance_ratio_)

plt.ylim(0.0,1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')

plt.xlabel('Number of Components')
plt.xticks(np.arange(0, 11, step=1)) #change from 0-based array index to 1-based human-readable label
plt.ylabel('Cumulative variance (%)')
plt.title('The number of components needed to explain variance')

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)

ax.grid(axis='x')
plt.show()

💭👆🏾Just a few thoughts...👆🏾💭

  • Reading a chart like this is, I'd say, both an art and a science.
  • You are sort of looking for the 'elbow' point.
  • As we look above we can see that after 3 components we've already gotten to above 80% of the cumulative variance.
    • That's a major win for us because it reduces our data set to just 3 components.
  • To get to the 95% cumulative variance level we'd need 7 components.
    • But that doesn't really help us so much because 7 components is still a lot to interpret. If we, say, reduced a 3000 component set to 7 then that would be saying something. But if we take 10 components and make it 7, in my opinion we haven't really helped ourselves.
  • Thus, I will go with 3 components as the number for our reduction (a quick programmatic check follows below).
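As a quick programmatic check on the eyeballed elbow (an add-on sketch, reusing the full pca object fitted above), we can count how many components are needed to clear the two thresholds discussed in this list:

#count how many components are needed to reach a given cumulative explained-variance level
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
for threshold in (0.80, 0.95):
    n_components_needed = int(np.argmax(cumulative_variance >= threshold)) + 1
    print('{} component(s) reach {:.0%} cumulative variance'.format(n_components_needed, threshold))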
In [45]:
#there is also a third-party Python plug-in called 'Hypertools' which has a similar procedure
#it looks at how well the reduced data correlates with the original data
#the package was written by Andrew Heusser, 
#https://hypertools.readthedocs.io/en/latest/auto_examples/plot_describe.html#sphx-glr-auto-examples-plot-describe-py
hyp.describe(ds2)
Out[45]:
{'average': [0.9736018878700515,
  0.9861890676941355,
  0.988700379639046,
  0.9912663044390128,
  0.9933949689870721,
  0.9953561617427169,
  0.9969256847288822,
  0.9985214973696905],
 'individual': [[0.9736018878700515,
   0.9861890676941355,
   0.988700379639046,
   0.9912663044390128,
   0.9933949689870721,
   0.9953561617427169,
   0.9969256847288822,
   0.9985214973696905]]}

💭👆🏾Just a few thoughts...👆🏾💭

  • So, we see that 3 components seems to be the magic elbow point.
  • Let's see what happens when we graph these 3 components....

👇🏾 Second thing's second: let's look at our data in 3D, as opposed to 10D.👇🏾

  • Now we will take a look at our first visualization.
  • We will set our PCA algorithm to have n_components of 3 after all the work we just did to arrive at that number.
In [46]:
pca_execution = PCA(n_components=3)
pca_doing_its_work = pca_execution.fit_transform(data_rescaled)
print("original shape:   ", data_rescaled.shape)
print("transformed shape:", pca_doing_its_work.shape)
original shape:    (2000, 10)
transformed shape: (2000, 3)
In [47]:
pca_conjecture = pd.DataFrame(pca_doing_its_work, columns=['component_1','component_2', 'component_3'])
In [48]:
pca_conjecture.head()
Out[48]:
component_1 component_2 component_3
0 -0.513079 0.150428 0.069967
1 0.164240 0.392656 -0.071182
2 0.243555 0.171085 -0.136090
3 -0.385704 0.017588 0.252501
4 -0.303847 0.083962 0.351603
In [49]:
#all credit here to Jake Vanderplas, 
#https://jakevdp.github.io/PythonDataScienceHandbook/04.12-three-dimensional-plotting.html
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.scatter3D(pca_conjecture.component_1, pca_conjecture.component_2, pca_conjecture.component_3)
plt.savefig('pca_3D_graph.png',dpi=300, bbox_inches='tight')

💭👆🏾Just a few thoughts...👆🏾💭

  • This is a really cool 3D plot of how we can group the data.
  • Although there are 10(!) dimensions, we can reduce those to basically 3 dimensions.
  • Very, very conveniently, the human mind can picture 3 dimensions on a 3D plot.
    • It's very possible to have reduced to components which still would not be easily interpreted, so I think we really lucked out here! That's why I say 'The Lord Always Delivers.' 😊🙏
  • Key point: It's tempting to think that this means that all the observations (i.e., the rows) can be split into 3 fields. But we're not there just yet.
    • We have basically reduced the columns in this gigantic data set from 10 to 3, but we now will want to figure out how to apply K-Means clustering to find out more.
  • Note: K-Means begins to break down at high dimensions because of what's called 'the curse of dimensionality.'
    • Basically, anything that uses Euclidean distance becomes a lot less meaningful in high dimensions, because the distances between points start to look nearly the same for every pair. Therefore, it helps to reduce the components first.
      • Also, please note: There are other clustering methods that can perform better with high dimensionality, but that's outside the scope of this notebook. If you are curious, more info here.
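
To make that 'curse of dimensionality' point a bit more concrete, here is a tiny illustrative sketch (not from the original analysis, using random data only) showing how the relative spread of Euclidean distances shrinks as the number of dimensions grows:

#illustrative sketch: distance concentration as dimensionality grows
#for random points, the gap between the nearest and farthest neighbor shrinks
#relative to the distances themselves as the dimension d increases
import numpy as np

rng = np.random.RandomState(42)
for d in [2, 10, 100, 1000]:
    points = rng.rand(500, d)                                #500 random points in d dimensions
    dists = np.linalg.norm(points - points[0], axis=1)[1:]   #distances from the first point
    relative_spread = (dists.max() - dists.min()) / dists.min()
    print("d =", d, "-> relative spread of distances:", round(relative_spread, 3))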

Question 5 - Part II.C


  • Someone suggests that the observations are really from multiple different files and were accidentally joined into one larger data set. Does anything about the data set suggest this? If so, how many different sources/files do you think there are?

👇🏾 We begin our K-Means Clustering analysis now that we've completed the PCA round.👇🏾

  • Let the K-Means games begin!
  • The first rule of thumb is to use an algorithm to choose your number of clusters, k, for K-Means.
  • It's tempting to just jump right in and use the number 3 from before, but we will want to evaluate an **'elbow'** graph again.
    • By the way, in case you are curious, I actually tested K-Means on the PCA-generated dataframe offline, and 3 was **not** the optimum number of k clusters.
In [50]:
#all credit to Dmitriy Kavyazin, http://bit.ly/2mYDWv7
ks = range(1, 10)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    
    # Fit model to samples
    model.fit(pca_conjecture)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

💭👆🏾Just a few thoughts...👆🏾💭

  • So we can actually see the elbow point comes at **4** clusters (an optional cross-check is sketched just below)!!! ✅🙏
  • Let's go ahead and plot that next!!
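
Before the plot, here is that optional cross-check of the elbow reading (not in the original notebook): the average silhouette score for a few candidate values of k, assuming the pca_conjecture dataframe from above.

#optional cross-check of the elbow result using silhouette scores
#a higher average silhouette suggests better-separated clusters
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(pca_conjecture)
    print("k =", k, "-> average silhouette score:", round(silhouette_score(pca_conjecture, labels), 3))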
In [51]:
#here comes K-Means at long last!
kmeans = KMeans(n_clusters=4)
kmeans.fit(pca_conjecture)
choice_kmeans = kmeans.predict(pca_conjecture)
In [52]:
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.scatter3D(pca_conjecture.component_1, pca_conjecture.component_2, pca_conjecture.component_3,\
             c=choice_kmeans, cmap='Dark2')
plt.savefig('kmeans_on_pca.png',dpi=300, bbox_inches='tight')

💭👆🏾Just a few thoughts...👆🏾💭

  • And voilà: now we have a great visualization of the 10D data.
  • We can see that there are basically **3 blobs** in the space of the 3 principal components, and **4 cohorts (the colors)** that the actual observations can be sorted into.
    • In future analysis, we'll want to study why 2 cohorts seem to get jammed into 1 blob, yet are not themselves considered the same cohort.
      • We'll want to better understand what features K-Means is using to differentiate them (one way to start digging is sketched just below).
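
One way to start digging (a sketch only, assuming the kmeans and pca_execution objects fit above are still in memory) is to map the 4 cluster centers from the 3-component PCA space back to the 10 rescaled features and compare them side by side:

#hedged sketch: push the 4 K-Means cluster centers back into the original (rescaled) feature space
#note: values are on the rescaled feature scale used for the PCA, not the raw X1..X10 scale
import pandas as pd

centers_in_pca_space = kmeans.cluster_centers_                                      #shape (4, 3)
centers_in_feature_space = pca_execution.inverse_transform(centers_in_pca_space)    #shape (4, 10)
pd.DataFrame(centers_in_feature_space,
             columns=['X' + str(i) for i in range(1, 11)],
             index=['cohort_' + str(c) for c in range(4)])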

Question 6 - Part II.D - Bonus


  • Bonus points: If you think there is more than one source in ds2, can you assign each observation to the right source (based on the number of sources you identified in 2c)?

👇🏾 Applying our K-means work from the last question...👇🏾

  • We can simply take the same K-Means classifications we used to make the hues in the 3D plot above, and apply them to the ds2 dataframe.
In [53]:
#just to be safe, we can check the shape so we know that they'll match
print(ds2.shape)
print(choice_kmeans.shape)
(2000, 10)
(2000,)
In [54]:
ds2['cohort_guess'] = choice_kmeans
In [55]:
ds2.sample(10)
Out[55]:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 cohort_guess
856 16.258705 7.536404 20.738980 -12.694768 2.361967 -13.735497 1.114422 7.757069 -11.394184 21.557510 2
1107 -3.724744 16.132432 -6.568144 9.931925 2.570133 15.679603 8.533902 15.854891 -3.841659 -2.757685 1
992 7.901057 11.937911 13.580849 -4.198245 -1.866585 -22.293985 -0.826338 -6.077889 -15.110786 5.361091 0
1297 -9.897615 25.005272 -9.610588 18.815536 6.059984 22.159695 29.886364 19.724171 -3.086989 -4.117812 1
445 18.250491 17.457891 18.771506 -0.833597 -13.159148 9.135771 21.157722 11.466289 -5.570972 -17.053007 3
183 -10.539711 15.283469 -11.576097 9.137384 2.902189 15.165293 20.014466 19.858354 5.253761 -18.449395 1
1411 -5.274538 21.802477 -10.129634 2.056461 1.218785 12.577448 20.278365 19.309603 2.381341 -12.527796 1
1161 12.363580 7.115654 16.053999 -1.434357 5.038282 -18.648144 4.398782 -7.789061 -13.777822 10.110424 0
873 8.534240 4.586206 10.560645 -19.470090 15.850141 -6.419461 -2.522902 9.986515 -17.717444 20.802697 2
1348 8.210770 -3.624457 4.939408 -5.635591 21.514586 -10.027645 -1.592554 9.510031 -20.588316 15.208638 2

💭👆🏾Just a few thoughts...👆🏾💭

  • And there we have it!
  • I took a random sample of the dataframe instead of using the head() method, so you can get an idea of the different 'cohort guesses'.
  • You can see that there are 4 different K-Means cohorts now.
    • 0, 1, 2, and 3
  • I would argue that these cohorts are where the disparate files came from before they were joined into ds2 (a quick supporting summary is sketched just below).
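
A quick supporting summary (a minimal sketch, assuming the ds2 dataframe with the cohort_guess column added above):

#hedged sketch: summarize the guessed cohorts
#cohort sizes first
print(ds2['cohort_guess'].value_counts().sort_index())

#then per-cohort means of the original features; systematically different means
#would be consistent with the rows coming from different source files
ds2.groupby('cohort_guess').mean().round(2)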

Part III. - SQL Coding


  • Stack Overflow provides a tool at https://data.stackexchange.com/stackoverflow/query/new that allows SQL queries to be run against their data. After reviewing the database schema provided on their site, please answer the questions below by providing both your answer and the query used to derive it.

👇🏾 Full answers and clickable/runnable results can be found at the Stack Overflow query permalink.👇🏾

http://bit.ly/2mJ5nct

Question 7 - Part III.A


query-3-a-via-Sublime

In [56]:
query_3_a_results = pd.read_csv('stackoverflow_query_3-a_results.csv')
query_3_a_results
Out[56]:
Num_Posts_2017
0 5062160

Question 8 - Part III.B


query-3-b-via-Sublime

In [57]:
query_3_b_results = pd.read_csv('stackoverflow_query_3-b_results.csv')
query_3_b_results.head(10)
Out[57]:
View_Level Frequency
0 0 6014518
1 1 1317265
2 2 725275
3 3 466790
4 4 326298
5 5 242879
6 6 188453
7 7 151391
8 8 125166
9 9 105231
In [58]:
#matplotlib is imported as mpl here for the tick formatter below (a no-op if it was already imported earlier)
import matplotlib as mpl
ax = sns.regplot(x="View_Level", y="Frequency", data=query_3_b_results, fit_reg=False)
ax.set_ylabel('Frequency', fontweight = 'bold')
ax.set_xlim(0,40)
ax.set_xlabel("View Level", fontweight = 'bold')
ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
ax.set_title("Histogram of Views Generated by Above SQL Query", fontweight = 'bold')
plt.show()
In [59]:
plt.figure(figsize=(16,5))
plt.bar(query_3_b_results.View_Level, query_3_b_results.Frequency)
plt.ylim(0,300000)
plt.xlim(0,80)
plt.title("Zoomed-In Snapshot of Stack Overflow Views Distribution", fontweight='bold')
plt.ylabel("Frequency", fontweight = 'bold')
plt.xlabel("View Level", fontweight = 'bold')
plt.show()
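
Because the view counts span several orders of magnitude, a log-scaled y-axis can make the long tail easier to read; here is an optional sketch (assuming the query_3_b_results dataframe loaded above):

#optional: the same distribution with a log-scaled y-axis to expose the long tail
plt.figure(figsize=(16,5))
plt.bar(query_3_b_results.View_Level, query_3_b_results.Frequency)
plt.yscale('log')
plt.xlim(0,80)
plt.title("Stack Overflow Views Distribution (log-scaled frequency)", fontweight='bold')
plt.ylabel("Frequency (log scale)", fontweight='bold')
plt.xlabel("View Level", fontweight='bold')
plt.show()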

Question 9 - Part III.C


  • What are the top 10 tags associated with posts created in 2017?

query-3-c-via-Sublime

In [60]:
query_3_c_results = pd.read_csv('stackoverflow_query_3-c_results.csv')
query_3_c_results
Out[60]:
Tag Tag_Count
0 javascript 250807
1 python 192652
2 java 175109
3 php 142844
4 android 136363
5 c# 132168
6 html 109411
7 jquery 92614
8 css 75326
9 ios 66249
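
For parity with the views plots above, the top-10 tags can also be visualized directly; a minimal sketch, assuming the query_3_c_results dataframe just loaded:

#optional: bar chart of the top 10 tags on posts created in 2017
plt.figure(figsize=(12,5))
plt.bar(query_3_c_results.Tag, query_3_c_results.Tag_Count)
plt.title("Top 10 Tags on Posts Created in 2017", fontweight='bold')
plt.ylabel("Tag Count", fontweight='bold')
plt.xlabel("Tag", fontweight='bold')
plt.show()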

Question 10 - Part III.D - Bonus


  • Bonus points: For the questions created in 2017, what was the average time (in seconds) between when the question was created and when the accepted answer was provided?

query-3-d-via-Sublime

In [61]:
query_3_d_bonus_results = pd.read_csv('stackoverflow_query_3-d_results.csv')
query_3_d_bonus_results
Out[61]:
Avg_Response_Secs
0 201332
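
To put that figure in more familiar units (a quick conversion, not part of the original output):

#convert the average response time into hours and days for readability
avg_secs = 201332
print(avg_secs, "seconds is about", round(avg_secs/3600, 1), "hours, or about", round(avg_secs/86400, 1), "days")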

Best,

George John Jordan Thomas Aquinas Hayward, Optimist

george-hayward-data-scientist

GJJTAHO