Executive Summary

This analysis has three objectives:

  • (1) describe, visualize, and model a complex dataset using supervised machine learning techniques.
  • (2) describe, visualize, and segment a complex dataset using unsupervised machine learning techniques.
  • (3) extract business insights from applied SQL analysis.

First, this analysis surveyed the distributions with the following selected findings:

distributions-all

  • x1 (independent variable):
    • uniform distribution
  • x2 (independent variable):
    • normal distribution
  • x3 (independent variable):
    • t-distribution
  • x5 (independent variable):
    • exponential distribution
  • x6 (independent variable):
    • binomial distribution
  • ya (dependent variable):
    • chi2 distribution
  • yb (dependent variable):
    • reverse log normal distribution
  • yc (dependent variable):
    • logistic distribution

Regarding predicting the dependent variables, this analysis found the following:

  • For ya:
$ \begin{align} ya = 2.513(x1) + 6.236(x2) + 0.956(x3) + 7.58 \end{align} $



ya-equation

  • For yb:
$ \begin{align} yb = 0.253(x1) + 0.8422 \end{align} $



yb-equation

  • For yc:
    • No model appropriate because yc does not appear to be related to the independent variables.

Second, this analysis found through unsupervised machine learning:

  • that the unsupervised learning data set could be reduced from 10 dimensions to just 3 (depicted as spheroids below).
    • via a Principal Component Analysis (PCA):
  • that the observations could be grouped into 4 cohorts (depicted in hues below).
    • via a K-Means Clustering Analysis:

kmeans-on-pca

Third, this analysis derived business insights through SQL:

  • such as the following query finding the number of posts in 2017:

query-3-a-via-Sublime

  • such as the following query finding the distribution of views for users:

query-3-b-via-Sublime

  • such as the following query finding the top 10 tags associated with posts created in 2017:

query-3-c-via-Sublime

  • such as the following query finding the average response time in seconds from when a 2017 question was first posted to when its accepted answer was posted:

query-3-d-via-Sublime

Credit to God, my Mother, family and friends.

All errors are my own.

Best,
George John Jordan Thomas Aquinas Hayward, Optimist
Data Scientist
September 30, 2019

Surveying the DS1 Distributions

distributions-all

DS1 Pair Plot

ds1-pair-plot-lower-res

DS1 Correlation Matrix

ds1-correlation-matrix

Using XGBoost for 'ya' Feature Selection

ya-xgboost-feature-selection

Using XGBoost for 'yb' Feature Selection

yb-xgboost-feature-selection

Multiple Linear Regression for 'ya' Prediction

ya-equation

Simple Linear Regression for 'yb' Prediction

yb-equation

Selecting Number of Components for DS2 PCA

picking-pca

3D-Plotting DS2's K-Means Cohort Results After PCA

kmeans-on-pca

SQL Love-Language Queries

query-3-a-via-Sublime
query-3-b-via-Sublime
query-3-c-via-Sublime
query-3-d-via-Sublime

And it begins...

In [1]:
import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import warnings
warnings.filterwarnings("ignore")
from fitter import Fitter
from scipy import stats
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn import linear_model
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score, median_absolute_error, \
explained_variance_score, confusion_matrix, accuracy_score, precision_score, recall_score
import xgboost as xgb
import statsmodels.formula.api as sm
from statsmodels.api import add_constant
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import FunctionTransformer
import matplotlib as mpl
from mpl_toolkits import mplot3d
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
import hypertools as hyp

Part I. - Supervised Machine Learning


  • Consider data set 1 (ds1.csv). The data set comprises features (the five x's) along with three sequences that may or may not be generated from those features (the three y's).

Question 1 - Part I.A


  • Describe the data set in a few sentences. E.g. What are the distributions of each feature? Summary statistics?

👇🏾 We begin by reading in the data. 👇🏾

  • We will check for NULLS.
  • The raw data contains its own index column, but a Pandas dataframe comes with an index by default, so we will drop this 'extra' index column.
  • We will summarize the data.
  • We will check the distributions.
In [2]:
#read in data
ds1 = pd.read_csv("ds1.csv")
In [3]:
#peek at the data
ds1.head(3)
Out[3]:
Unnamed: 0 x1 x2 x3 x5 x6 ya yb yc
0 1 2.642583 -1.715220 1.909334 0.027139 -3.447187 13.630850 1.828765 0.008386
1 2 4.588761 -2.507543 4.239107 1.704150 -2.782809 7.834582 2.162110 0.000008
2 3 7.919796 -5.108415 3.039451 0.992815 5.551587 -5.107041 2.797083 -0.000005
In [4]:
#drop "Unamed:0" column and peek at the data again:
ds1 = ds1.drop(ds1.columns[0], axis=1)
ds1.head(3)
Out[4]:
x1 x2 x3 x5 x6 ya yb yc
0 2.642583 -1.715220 1.909334 0.027139 -3.447187 13.630850 1.828765 0.008386
1 4.588761 -2.507543 4.239107 1.704150 -2.782809 7.834582 2.162110 0.000008
2 7.919796 -5.108415 3.039451 0.992815 5.551587 -5.107041 2.797083 -0.000005
In [5]:
#check for NULLS
msno.matrix(ds1,  color = (.0,.0,.2))
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c27ac7d30>

💭👆🏾This indicates that there are no NULLS.👆🏾💭

👇🏾 We now describe the data. 👇🏾

  • When we talk about describing the data, we look at numerical summaries. These include:
    • the standard deviation to get a sense of the spread of the data
    • the mean and median
    • the min and max
    • the null count
  • Pandas has a nice 'describe' method, which we'll use here.
  • Next we'll get to the distributions.
In [6]:
ds1.describe()
Out[6]:
x1 x2 x3 x5 x6 ya yb yc
count 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 1.000000e+05
mean 5.011059 -3.005565 2.500593 0.999136 0.000647 3.827867 2.111854 1.022765e-04
std 2.873568 2.000799 1.752906 1.002006 4.663860 18.438514 0.769306 3.121215e-02
min 0.000015 -12.498524 -3.489212 0.000003 -13.885453 -64.021995 -0.523672 -5.433613e-01
25% 2.536309 -4.353844 1.189920 0.285629 -2.611943 -8.998020 1.580212 -2.423158e-03
50% 5.022191 -3.002649 2.503764 0.690903 -0.000611 2.666992 2.231057 -5.670000e-10
75% 7.486275 -1.649283 3.802236 1.386862 2.621841 15.579687 2.733318 2.476745e-03
max 9.999887 6.089820 8.679097 15.102966 13.924740 107.713997 3.841363 8.183882e-01

👇🏾 We will now inspect and fit the distributions. 👇🏾

  • We will plot histograms to visualize the distributions.
  • We will use the Python third party package "fitter" to check these distributions against some of the most common and useful statistical distributions.

🔬 Function 1: Visualizing & Fitting the Distribution 🔬

  • I am about to write a function which will:
    • first, plot a histogram for our five independent and three dependent variables
    • second, use fitter to fit a distribution to our histogram
    • third, just for good measure, add a probability plot
  • I will loop it through each column in the data frame to do all this automatically.
  • See you on the other side!
In [7]:
def distribution_visualizer_and_fitter(df, col_str):
    #this part makes the histogram:
    fig, axs = plt.subplots(figsize=(12,4))
    plt.hist(df[col_str], bins=100)
    plt.title('Histogram of {}'.format(col_str), fontweight = 'bold')
    plt.ylabel('Frequency')
    plt.xlabel('Values')
    plt.show()
    #this part fits the distribution to the histogram:
    f = Fitter(df[col_str], distributions=['gamma', 'beta', 'rayleigh', 'norm', 'pareto', 'uniform', \
                         'logistic', 'expon', 'chi2', 'pearson3'], verbose = False, timeout = 10)
    f.fit()
    f.summary()
    #this part makes the probability plot
    fig, axs = plt.subplots(ncols=2, figsize=(12,4))
    axs[1].set_axis_off()
    stats.probplot(df[col_str], plot=axs[0])
    plt.show()
In [8]:
for i in ds1.columns:
    distribution_visualizer_and_fitter(ds1, i)

💭👆🏾Conclusion: What do we now know about the distributions?👆🏾💭

  • x1 (independent variable):
    • uniform distribution
  • x2 (independent variable):
    • normal distribution
  • x3 (independent variable):
    • t-distribution
      • some will also call this a 'nearly normal' distribution
      • can be modeled with beta distribution
  • x5 (independent variable):
    • exponential distribution
      • a kind of gamma distribution
  • x6 (independent variable):
    • binomial distribution
  • ya (dependent variable):
    • chi2 distribution
      • can also say it's a positively skewed normal distribution, BUT:
      • to me, the key is the long tail on the right, looks a lot like chi2 distribution
  • yb (dependent variable):
    • reverse log normal distribution
    • can be approximated with beta distribution
  • yc (dependent variable):
    • logistic distribution
      • more unclear
      • in my subsequent analysis, I'll discuss how I don't think this 'dependent' feature is actually based on the independent features
      • however, from my fitting algorithm, it appears a logistic distribution is the closest (a quick quantitative sanity check on one of these fits follows below)
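Purely as an illustrative add-on (not part of the original fitting flow), here is a minimal sketch of how one of these visual calls could be checked quantitatively with a Kolmogorov-Smirnov test from scipy.stats, which is already imported above. The choice of x1 versus a uniform distribution, with loc/scale taken from the column's own min and range, is an assumption made for demonstration only.

#hedged sketch: quantify the visual 'x1 looks uniform' call with a one-sample KS test
#(loc/scale come from x1 itself here, which is fine for a rough sanity check only)
ks_stat, ks_pvalue = stats.kstest(ds1['x1'], 'uniform',
                                  args=(ds1['x1'].min(),
                                        ds1['x1'].max() - ds1['x1'].min()))
print('KS statistic: {:.4f}, p-value: {:.4f}'.format(ks_stat, ks_pvalue))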

Question 2 - Part I.B


  • Try to come up with a predictive model, e.g. y = f(x_1 , … , x_n) for each y sequence. Describe your model and how you came up with them. What (if any) are the predictive variables? How good would you say each of your models is?

👇🏾 Modeling Gameplan 👇🏾

  • First a few assumptions:
    • The problem seems to ask for an equation as the description of the model
    • I assume this means that the client needs a linear regression output
      • Many clients like this because, even if there are more accurate models, a linear regression is highly interpretable
    • Nevertheless, we will still want to begin with a decision tree model because we can use it to figure out which features are most important
      • I will use the decision tree models as a kind of feature selection algorithm
  • I will proceed in two steps:
    • 1. Exploratory Data Analysis
      • Check correlations
      • Check pair plots
      • Check if polynomial transformations or interactions are helpful here
        • I am NOT going to do log transformations because there are too many negative numbers
          • You can add a constant to correct this, but you need to know the underlying data well to do so, and it is something the Data Scientist should discuss with the team.
          • Why? Because even though shifting by a constant will not alter the variance, it can alter the interpretability or meaning of the numbers.
          • Since we don't know what these underlying numbers actually represent in real life, I decided not to apply a log transformation here (see the sketch just after this list).
      • To select features I will use:
        • An XGBoost feature importance ranking
        • Backwards elimination regression, if needed
          • Please note that since this is just EDA, there will be no train/test split...I'm just trying to get a sense of what variables 'matter most' across the dataset
        • I will use XGBoost first.
        • This is often my preferred model.
        • However, I'm assuming that the client actually needs an interpretable linear-regression-esque equation
        • Thus we cannot end with XGBoost
          • The nature of the tree splits is to be less interpretable
          • We know that something matters, but we won't have a coefficient or a direction (positive/negative)
    • 2. Modeling
      • So after deploying XGBoost for selection, I will run Linear Regressions next for the official modeling
      • This will help us with our ability to understand our features.
        • This regression will NOT be scaled.
        • Why?
          • In a word, interpretability
          • Once I scale the data, especially if I normalize it, then I lose the units
          • To be fair, this data does not seem to have units, but I want to make this as real as possible
          • Before we lose the units through scaling and then apply the regularization via Ridge or Lasso, we want to be sure we care more about prediction than inference
          • Given how this problem is stated, I'm assuming the goal is an odd combination of interpretability and prediction, which here means an interpretable linear regression equation
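For reference only, here is a minimal sketch of the shifted-log transform described in the list above, i.e. the approach we decided against. The shift amount below is derived mechanically from the column itself and is purely illustrative, not a domain-informed choice.

#hedged sketch of a shifted-log transform (NOT used in the modeling below)
def shifted_log(series):
    #shift the series so its minimum lands at zero, then apply log1p so log(0) is safe
    shift_constant = -series.min()
    return np.log1p(series + shift_constant)

#illustrative usage on a column that contains negative values, e.g.:
#ds1['x6_shifted_log'] = shifted_log(ds1['x6'])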
In [9]:
#check correlations
#all credit due to Pedro Marcelino; https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
corrmat = ds1.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
plt.savefig('ds1_correlation_matrix.png',dpi=300, bbox_inches='tight')

💭👆🏾Just a few thoughts...👆🏾💭

  • We can already see that 'dependent' variable 'yc' doesn't seem to be correlated with any of the independent features.
  • We also see that the 'dependent' variables seem to correlate with one another. This may be because they are somehow generated from the same independent variables.
  • This is also going to come up later on, but we can see that x1 and x3 are correlated.
  • We can begin to say:
    • ya is correlated to x1, x2, and x3
      • this does not factor in collinearity
    • yb is correlated to x1 and x3
      • this does not factor in collinearity
    • yc does not seem to be correlated to any of the x's
In [10]:
#check the pair plot
#all credit due to Pedro Marcelino; https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
sns.pairplot(ds1, height = 2.5)
plt.savefig('ds1_pair_plot_lower_res.png',dpi=150, bbox_inches='tight')
plt.show()

💭👆🏾Just a few thoughts...👆🏾💭

  • As we scan the chart, we can see a few linear-ish or nearly linear relationships.
  • These charts don't include interactions.
  • We see some blobs, and some bizarre relationships.
  • We will eliminate features in the feature selection process.

👇🏾 Using XGBoost & Regressions To See Which Features Seem Important Before Modeling👇🏾¶

  • Now we shall use XGBoost to get an idea for the features that are most important.
  • Before checking the polynomial relationships, we just want to get an idea of what's going on.
  • After each XGBoost output we'll also put all the features into a regression against the given dependent variable and see how closely the regression follows the XGBoost insights.
  • First, we will start by defining our dependent and independent dataframes.
In [11]:
#define the dependent and independent relationships
features = ds1[['x1','x2','x3','x5','x6']]
ya = ds1['ya']
yb = ds1['yb']
yc = ds1['yc']

🔬 Function 2: XGBoost Modeler 🔬

  • This function will check all features against the dependent variables
  • It will then plot the predicteds vs. actuals as well as the residuals.
  • It will provide pertinent model metrics.
  • It will then make a horizontal bar chart for feature importance.
In [12]:
def xgboost_modeler(X, y, cv_folds_int, x_left, x_right, y_bottom, y_top):
    X_train_xgb, X_test, y_train_xgb, y_test = train_test_split\
    (X, y, test_size = 0.2, random_state = 1)
    xgb_model = xgb.XGBRegressor(objective="reg:squarederror", random_state=42)
    xgb_model.fit(X_train_xgb,y_train_xgb)
    y_pred_xgb = xgb_model.predict(X_test)
    #linear regression standard
    #print r^2 and plot linear regression
    print("R^2:"+str(round(r2_score(y_test, y_pred_xgb),4)))
    print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error((y_test),\
                                                                               (y_pred_xgb)),2)))
    print("Median Absolute Error (in original units): "+str(round(median_absolute_error((y_test),\
                                                                               (y_pred_xgb)),2)))
    print("STD of Training Data (in original units): "+str(round(np.std((y_test)),2)))
    print("Mean of Training Data (in original units): "+str(round(np.mean((y_test)),2)))
    print("Residual Skew: "+str(round(stats.skew((y_pred_xgb)-(y_test)),2)))
    print("Residual Kurtosis: "+str(round(stats.kurtosis((y_pred_xgb)-(y_test)),2)))
    fig, axs = plt.subplots(ncols=(3), nrows=(1), figsize=(15,4))
    plt.subplots_adjust(top = 0.95, bottom=0.01, hspace=0.4, wspace=0.16)
    axs[0].scatter((y_test), (y_pred_xgb))
    axs[0].set_title('Actuals (x) vs Predicteds (y)',fontweight = 'bold')
    axs[0].set_xlim([x_left, x_right])
    axs[0].set_ylim([y_bottom, y_top])
    axs[1].scatter((y_test), ((y_pred_xgb)-(y_test)))
    axs[1].set_title('Actuals (x) vs Residuals [pred. - actual] (y)',fontweight = 'bold')
    axs[1].set_xlim([x_left, x_right])
    axs[1].set_ylim([y_bottom, y_top])
    axs[2].hist(((y_pred_xgb)-(y_test)), bins=100)
    axs[2].set_title('Histogram of Residuals [predicted - actual]',fontweight = 'bold')
    plt.savefig('xgboost_modeling_',dpi=300, bbox_inches='tight')
    plt.show()
    print("Cross-Validation Scoring for XGBoost")
    print('Mean Mean Absolute Error: {}'.format(-1*round(cross_val_score(xgb_model, X_train_xgb, y_train_xgb, \
                                                    cv=cv_folds_int, scoring='neg_mean_absolute_error').mean(),2)))
    print('Median Mean Absolute Error: {}'.format(-1*round(np.median(cross_val_score(xgb_model, X_train_xgb,\
                                                                                     y_train_xgb, \
                                                     cv=cv_folds_int, scoring='neg_mean_absolute_error')),2)))      
    print('Mean R^2: {}'.format(round(cross_val_score(xgb_model, X_train_xgb, y_train_xgb, cv=cv_folds_int,\
                                                      scoring='r2').mean(),3)))
    print('Median R^2: {}'.format(round(np.median(cross_val_score(xgb_model,\
                                             X_train_xgb, y_train_xgb, cv=cv_folds_int, scoring='r2')),3)))
    print('Max R^2: {}'.format(round(np.max(cross_val_score(xgb_model,\
                                             X_train_xgb, y_train_xgb, cv=cv_folds_int, scoring='r2')),3)))
    print('Min R^2: {}'.format(round(np.min(cross_val_score(xgb_model,\
                                             X_train_xgb, y_train_xgb, cv=cv_folds_int, scoring='r2')),3)))
    top_xgb_features = pd.DataFrame(sorted(list(zip(X,xgb_model.feature_importances_))\
       ,key = lambda x: abs(x[1]),reverse=True)[:10], columns=['Feature', 'XGBoost Importance'])
    top_xgb_features
    bar_count = range(len(top_xgb_features.Feature))
    fig, axs = plt.subplots(ncols=2, figsize=(14,4))
    #using a subplot method coupled with an inline parameter to have high resolution
    #note: "[::-1]" reverses the column in a pandas dataframe
    axs[1].set_axis_off()
    axs[0].barh(bar_count, top_xgb_features['XGBoost Importance'][::-1],\
                 align='center', alpha=1)
    axs[0].set_xlabel('Values')
    axs[0].set_yticks(bar_count)
    axs[0].set_yticklabels(top_xgb_features.Feature[::-1], fontsize=10)
    axs[0].set_xlabel('XGBoost Importance')
    axs[0].set_title("XGBoost's Feature Importances",fontweight = 'bold')
    extent = axs[0].get_window_extent().transformed(fig.dpi_scale_trans.inverted())
    fig.savefig('xgb_features_importance_dependent_var_',dpi=300, bbox_inches=extent.expanded(1.5, 1.5))
    plt.show()

👇🏾 First, we check dependent variable 'ya' 👇🏾

In [13]:
xgboost_modeler(X=features, y=ya, cv_folds_int=4, x_left=-100, x_right=100, y_bottom=-100, y_top=100)
R^2:0.704
Mean Absolute Error (in original units): 7.99
Median Absolute Error (in original units): 6.79
STD of Training Data (in original units): 18.39
Mean of Training Data (in original units): 3.98
Residual Skew: -0.02
Residual Kurtosis: 0.03
Cross-Validation Scoring for XGBoost
Mean Mean Absolute Error: 8.02
Median Mean Absolute Error: 8.03
Mean R^2: 0.703
Median R^2: 0.704
Max R^2: 0.707
Min R^2: 0.696

💭👆🏾Just a few thoughts...👆🏾💭

  • This lets us know that for dependent variable ya it looks like:
    • x5 and x6 are **not** important
    • x2 is most important, followed by x1 and x3
  • We'll also peek at a standard regression...
In [14]:
features_regression = add_constant(features)
regressor_OLS = sm.OLS(endog = ya, exog = features_regression).fit()
regressor_OLS.summary()
Out[14]:
OLS Regression Results
Dep. Variable: ya R-squared: 0.677
Model: OLS Adj. R-squared: 0.677
Method: Least Squares F-statistic: 4.190e+04
Date: Mon, 30 Sep 2019 Prob (F-statistic): 0.00
Time: 23:23:39 Log-Likelihood: -3.7685e+05
No. Observations: 100000 AIC: 7.537e+05
Df Residuals: 99994 BIC: 7.538e+05
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 7.5697 0.089 84.712 0.000 7.395 7.745
x1 2.5177 0.020 124.571 0.000 2.478 2.557
x2 6.2346 0.017 376.359 0.000 6.202 6.267
x3 0.9511 0.033 28.707 0.000 0.886 1.016
x5 0.0023 0.033 0.070 0.944 -0.063 0.067
x6 0.0041 0.007 0.581 0.561 -0.010 0.018
Omnibus: 41.538 Durbin-Watson: 1.997
Prob(Omnibus): 0.000 Jarque-Bera (JB): 42.505
Skew: 0.039 Prob(JB): 5.89e-10
Kurtosis: 3.064 Cond. No. 19.7


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

💭👆🏾Just a few thoughts...👆🏾💭

  • This regression confirms our XGBoost insights regarding dependent variable ya.

👇🏾 Second, we check dependent variable 'yb' 👇🏾

In [15]:
xgboost_modeler(X=features, y=yb, cv_folds_int=4, x_left=-5, x_right=5, y_bottom=-5, y_top=5)
R^2:0.9323
Mean Absolute Error (in original units): 0.16
Median Absolute Error (in original units): 0.14
STD of Training Data (in original units): 0.77
Mean of Training Data (in original units): 2.12
Residual Skew: -0.02
Residual Kurtosis: 0.03
Cross-Validation Scoring for XGBoost
Mean Mean Absolute Error: 0.16
Median Mean Absolute Error: 0.16
Mean R^2: 0.932
Median R^2: 0.932
Max R^2: 0.933
Min R^2: 0.931

💭👆🏾Just a few thoughts...👆🏾💭

  • This is amazing!!!
  • It looks like just one variable, **x1**, decides what dependent variable 'yb' is going to be.
  • We'll check this with a standard regression...
In [16]:
features_regression = add_constant(features)
regressor_OLS = sm.OLS(endog = yb, exog = features_regression).fit()
regressor_OLS.summary()
Out[16]:
OLS Regression Results
Dep. Variable: yb R-squared: 0.895
Model: OLS Adj. R-squared: 0.895
Method: Least Squares F-statistic: 1.702e+05
Date: Mon, 30 Sep 2019 Prob (F-statistic): 0.00
Time: 23:25:20 Log-Likelihood: -3038.6
No. Observations: 100000 AIC: 6089.
Df Residuals: 99994 BIC: 6146.
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.8427 0.002 396.244 0.000 0.839 0.847
x1 0.2542 0.000 528.388 0.000 0.253 0.255
x2 6.148e-05 0.000 0.156 0.876 -0.001 0.001
x3 -0.0018 0.001 -2.279 0.023 -0.003 -0.000
x5 0.0003 0.001 0.353 0.724 -0.001 0.002
x6 7.677e-05 0.000 0.454 0.650 -0.000 0.000
Omnibus: 2347.227 Durbin-Watson: 1.989
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2710.565
Skew: -0.336 Prob(JB): 0.00
Kurtosis: 3.445 Cond. No. 19.7


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

💭👆🏾Just a few thoughts...👆🏾💭

  • Even though you could argue that x3 is also statistically significant, you can see that x1's t-score of over 500 shows that it is likely doing most of the work here, just like XGBoost said.
  • Let's actually run this regression again...but only for x1 and see if we keep the R^2 at about the same level (meaning that the variance can be accounted for by x1).

💭👇🏾Let's also check that one feature, x1, which seems to be doing the brunt of the work here.👇🏾💭

In [17]:
features_regression_x1 = add_constant(ds1.x1)
regressor_OLS = sm.OLS(endog = yb, exog = features_regression_x1).fit()
regressor_OLS.summary()
Out[17]:
OLS Regression Results
Dep. Variable: yb R-squared: 0.895
Model: OLS Adj. R-squared: 0.895
Method: Least Squares F-statistic: 8.511e+05
Date: Mon, 30 Sep 2019 Prob (F-statistic): 0.00
Time: 23:25:20 Log-Likelihood: -3041.4
No. Observations: 100000 AIC: 6087.
Df Residuals: 99998 BIC: 6106.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.8428 0.002 531.489 0.000 0.840 0.846
x1 0.2533 0.000 922.571 0.000 0.253 0.254
Omnibus: 2346.712 Durbin-Watson: 1.989
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2709.981
Skew: -0.336 Prob(JB): 0.00
Kurtosis: 3.445 Cond. No. 11.9


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

💭👆🏾Just a few thoughts...👆🏾💭

  • Yup, the R^2 didn't budge. It stayed right there at 0.895.
    • Looks like we have the x1 independent variable predicting the yb dependent variable.

👇🏾 Third, we check dependent variable 'yc' 👇🏾

In [18]:
xgboost_modeler(X=features, y=yc, cv_folds_int=4, x_left=-0.1, x_right=0.1, y_bottom=-0.1, y_top=0.1)
R^2:-0.0013
Mean Absolute Error (in original units): 0.01
Median Absolute Error (in original units): 0.0
STD of Training Data (in original units): 0.03
Mean of Training Data (in original units): 0.0
Residual Skew: -0.7
Residual Kurtosis: 43.41
Cross-Validation Scoring for XGBoost
Mean Mean Absolute Error: 0.01
Median Mean Absolute Error: 0.01
Mean R^2: -0.003
Median R^2: -0.003
Max R^2: -0.001
Min R^2: -0.006

💭👆🏾Just a few thoughts...👆🏾💭

  • We know from the correlation chart above that yc was going to be trouble because it did not show any correlation to any of the x's.
    • Here we see more of the same: the R^2 on the test set was literally negative, meaning the model is actually worse than simply predicting the mean.
    • This is a tell-tale sign that yc has nothing to do with the x's.
    • We'll double-check this with a regression, and we'll give polynomial feature engineering a shot later on.
In [19]:
features_regression = add_constant(features)
regressor_OLS = sm.OLS(endog = yc, exog = features_regression).fit()
regressor_OLS.summary()
Out[19]:
OLS Regression Results
Dep. Variable: yc R-squared: 0.000
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: 0.9993
Date: Mon, 30 Sep 2019 Prob (F-statistic): 0.416
Time: 23:26:55 Log-Likelihood: 2.0480e+05
No. Observations: 100000 AIC: -4.096e+05
Df Residuals: 99994 BIC: -4.095e+05
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -0.0001 0.000 -0.427 0.669 -0.001 0.000
x1 0.0001 6.02e-05 2.008 0.045 2.87e-06 0.000
x2 7.659e-06 4.93e-05 0.155 0.877 -8.9e-05 0.000
x3 -0.0002 9.87e-05 -1.844 0.065 -0.000 1.14e-05
x5 8.846e-05 9.85e-05 0.898 0.369 -0.000 0.000
x6 -9.798e-07 2.12e-05 -0.046 0.963 -4.25e-05 4.05e-05
Omnibus: 38901.894 Durbin-Watson: 1.994
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8034394.166
Skew: 0.721 Prob(JB): 0.00
Kurtosis: 46.888 Cond. No. 19.7


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

💭👆🏾Just a few thoughts...👆🏾💭

  • Again, we see an R^2 of 0.
  • And the p-values don't seem too impressive either. (There are two around the 5% level, but given everything else we've seen, I don't think yc relates to the given independent features.)
    • ...and if this were Hayward Data Science (my consulting business), I would not model this, because we might get an even worse outcome than not knowing -- thinking we know when we are actually wrong.

💭👇🏾Brief 'Side Path' - Checking for Polynomial Features👇🏾💭

  • We've already discussed that we are not going to perform log transforms due to the negative numbers and the lack of units (such that we don't know whether adding a constant would hurt our interpretability, even though it wouldn't affect the variance).
  • What we will want to do is train our XGBoost models on the polynomial features to see if there is marked improvement.
  • Just for demonstration purposes I will also run a Backwards Elimination Regression Algorithm as well.
In [20]:
def backwardElimination(X, y, significance_level_float):
    #iteratively refit an OLS model and drop the feature with the highest p-value
    #until every remaining feature clears the chosen significance level
    numCols = len(X.columns)
    for i in range(numCols):
        regressor_OLS = sm.OLS(y, X).fit()
        maxVar = max(regressor_OLS.pvalues)
        if maxVar > significance_level_float:
            #drop the single worst (highest p-value) column, then refit on the next pass
            X = X.drop(columns=[regressor_OLS.pvalues.idxmax()])
        else:
            #all remaining p-values are below the threshold, so we can stop early
            break
    return X
In [21]:
#make the polynomial features
poly = PolynomialFeatures(degree=3, include_bias=False)
features_w_poly = poly.fit_transform(features)
features_w_poly = pd.DataFrame(features_w_poly, columns=poly.get_feature_names(features.columns))
In [22]:
#adding in a constant to prepare for regression (we had bias as False above so we need to add the constant now)
features_poly_regression = add_constant(features_w_poly)
In [23]:
#XGboost polynomial test for ya
xgboost_modeler(X=features_w_poly, y=ya, cv_folds_int=4, x_left=-100, x_right=100, y_bottom=-100, y_top=100)
R^2:0.7034
Mean Absolute Error (in original units): 8.0
Median Absolute Error (in original units): 6.78
STD of Training Data (in original units): 18.39
Mean of Training Data (in original units): 3.98
Residual Skew: -0.02
Residual Kurtosis: 0.03
Cross-Validation Scoring for XGBoost
Mean Mean Absolute Error: 8.03
Median Mean Absolute Error: 8.03
Mean R^2: 0.703
Median R^2: 0.704
Max R^2: 0.707
Min R^2: 0.697

💭👆🏾Just a few thoughts...👆🏾💭

  • Even with the new polynomial features, we are still focused on x1 and x2.
  • Plus, the R^2 actually dropped from 0.704 to 0.703.
  • We've added a ton of features, but we don't see a marked improvement.
  • All things being equal, we want a simpler model, so I'll be sticking to the features we had when we move to the modeling stage for ya.
In [24]:
#backwards elimination regression algo for ya
#note this cell runs the Backwards Elimination, and this cell will take a while to execute
Features_Poly_Pruned_ya = backwardElimination(features_poly_regression, ya, significance_level_float = 0.00001)
Features_Poly_Pruned_ya = add_constant(Features_Poly_Pruned_ya)
pruned_regressor_OLS = sm.OLS(endog = ya, exog = Features_Poly_Pruned_ya).fit()
pruned_regressor_OLS.summary()
Out[24]:
OLS Regression Results
Dep. Variable: ya R-squared: 0.705
Model: OLS Adj. R-squared: 0.705
Method: Least Squares F-statistic: 3.418e+04
Date: Mon, 30 Sep 2019 Prob (F-statistic): 0.00
Time: 23:41:54 Log-Likelihood: -3.7226e+05
No. Observations: 100000 AIC: 7.445e+05
Df Residuals: 99992 BIC: 7.446e+05
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.6807 0.191 3.556 0.000 0.305 1.056
x1 3.0860 0.129 23.934 0.000 2.833 3.339
x2 3.0625 0.048 64.086 0.000 2.969 3.156
x3 0.9346 0.032 29.531 0.000 0.873 0.997
x1^2 0.2466 0.026 9.372 0.000 0.195 0.298
x1 x2 0.8758 0.022 39.854 0.000 0.833 0.919
x1^3 -0.0166 0.002 -9.875 0.000 -0.020 -0.013
x1^2 x2 -0.0365 0.002 -17.168 0.000 -0.041 -0.032
Omnibus: 3.440 Durbin-Watson: 1.993
Prob(Omnibus): 0.179 Jarque-Bera (JB): 3.430
Skew: 0.014 Prob(JB): 0.180
Kurtosis: 3.008 Cond. No. 2.87e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.87e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

💭👆🏾Just a few thoughts...👆🏾💭

  • The regression here seems to generally map on to what I just said about the XGBoost run above.
  • We added a lot of features, but we don't see much uptick in explanation of variance.
  • I'm inclined to stick with my simpler features.
In [25]:
#XGboost polynomial test for yb
xgboost_modeler(X=features_w_poly, y=yb, cv_folds_int=4, x_left=-6, x_right=6, y_bottom=-6, y_top=6)
R^2:0.9323
Mean Absolute Error (in original units): 0.16
Median Absolute Error (in original units): 0.14
STD of Training Data (in original units): 0.77
Mean of Training Data (in original units): 2.12
Residual Skew: -0.02
Residual Kurtosis: 0.03
Cross-Validation Scoring for XGBoost
Mean Mean Absolute Error: 0.16
Median Mean Absolute Error: 0.16
Mean R^2: 0.932
Median R^2: 0.932
Max R^2: 0.933
Min R^2: 0.931

💭👆🏾Just a few thoughts...👆🏾💭

  • This is truly amazing!!!!!!!!
  • No matter what we throw at yb, it just wants to stick with x1.
  • yb is the faithful, monogamous type 😊 hahaha!
In [26]:
#backwards elimination regression algo for yb
#note this cell runs the Backwards Elimination, and this cell will take a while to execute
Features_Poly_Pruned_yb = backwardElimination(features_poly_regression, yb, significance_level_float = 0.00001)
Features_Poly_Pruned_yb = add_constant(Features_Poly_Pruned_yb)
pruned_regressor_OLS = sm.OLS(endog = yb, exog = Features_Poly_Pruned_yb).fit()
pruned_regressor_OLS.summary()
Out[26]:
OLS Regression Results
Dep. Variable: yb R-squared: 0.930
Model: OLS Adj. R-squared: 0.930
Method: Least Squares F-statistic: 4.446e+05
Date: Mon, 30 Sep 2019 Prob (F-statistic): 0.00
Time: 23:56:28 Log-Likelihood: 17486.
No. Observations: 100000 AIC: -3.496e+04
Df Residuals: 99996 BIC: -3.493e+04
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.3990 0.003 154.586 0.000 0.394 0.404
x1 0.6033 0.002 270.430 0.000 0.599 0.608
x1^2 -0.0604 0.001 -116.458 0.000 -0.061 -0.059
x1^3 0.0028 3.41e-05 82.706 0.000 0.003 0.003
Omnibus: 5.212 Durbin-Watson: 1.995
Prob(Omnibus): 0.074 Jarque-Bera (JB): 5.344
Skew: 0.000 Prob(JB): 0.0691
Kurtosis: 3.036 Cond. No. 1.97e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.97e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

💭👆🏾Just a few thoughts...👆🏾💭

  • Again, it's all x1 and variations thereof.
  • We're definitely just going to stick with x1 for the modeling.
    • ❤️ x1 and yb are a true data love story. ❤️
In [27]:
#XGboost polynomial test for yc
xgboost_modeler(X=features_w_poly, y=yc, cv_folds_int=4, x_left=-0.1, x_right=0.1, y_bottom=-0.1, y_top=0.1)
R^2:-0.0021
Mean Absolute Error (in original units): 0.01
Median Absolute Error (in original units): 0.0
STD of Training Data (in original units): 0.03
Mean of Training Data (in original units): 0.0
Residual Skew: -0.69
Residual Kurtosis: 43.28
Cross-Validation Scoring for XGBoost
Mean Mean Absolute Error: 0.01
Median Mean Absolute Error: 0.01
Mean R^2: -0.004
Median R^2: -0.003
Max R^2: -0.003
Min R^2: -0.007

💭👆🏾Just a few thoughts...👆🏾💭

  • Again, it's not looking good for modeling yc.
  • At first blush it looks like XGBoost may have finally found something, but:
    • We see that we are essentially interacting x1, x5, and x6 together in one mega column, and the importance gain is still only about 0.10.
    • Since we've seen such limited correlation in our EDA phase, I'm inclined to be quite skeptical here and say that if you throw enough things against the wall, then something is bound to stick.
      • But we are trying to follow the data, not jam the data into our goals.
  • I would be inclined to not model yc at all here.
In [28]:
#backwards elimination regression algo for yc
#note this cell runs the Backwards Elimination, and this cell will take a while to execute
Features_Poly_Pruned_yc = backwardElimination(features_poly_regression, yc, significance_level_float = 0.00001)
Features_Poly_Pruned_yc = add_constant(Features_Poly_Pruned_yc)
pruned_regressor_OLS = sm.OLS(endog = yc, exog = Features_Poly_Pruned_yc).fit()
pruned_regressor_OLS.summary()
Out[28]:
OLS Regression Results
Dep. Variable: yc R-squared: 0.000
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: inf
Date: Tue, 01 Oct 2019 Prob (F-statistic): nan
Time: 00:10:11 Log-Likelihood: 2.0480e+05
No. Observations: 100000 AIC: -4.096e+05
Df Residuals: 99999 BIC: -4.096e+05
Df Model: 0
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.0001 9.87e-05 1.036 0.300 -9.12e-05 0.000
Omnibus: 38898.706 Durbin-Watson: 1.994
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8032981.336
Skew: 0.720 Prob(JB): 0.00
Kurtosis: 46.884 Cond. No. 1.00


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

💭👆🏾Just a few thoughts...👆🏾💭

  • And here we have, as my late Dad would say, 'more trouble in River City.'
    • Namely, in the Backwards Elimination Regression round, none of the independent variables were able to meet the threshold level needed to stay in the regression.
    • The data is screaming at us not to model yc.
In [29]:
def multiple_linear_regression_modeler(X, y, cv_folds_int, x_left, x_right, y_bottom, y_top):
    X_train1, X_test, y_train1, y_test = train_test_split\
    (X, y, test_size = 0.2, random_state = 1)
    model = LinearRegression()
    model.fit(X_train1,y_train1)
    y_slr_predict = model.predict(X_test)
    #linear regression standard
    #print r^2 and plot linear regression
    print("R^2:"+str(round(r2_score(y_test, y_slr_predict),4)))
    print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error((y_test),\
                                                                               (y_slr_predict)),2)))
    print("STD of Training Data (in original units): "+str(round(np.std((y_test)),2)))
    print("Mean of Training Data (in original units): "+str(round(np.mean((y_test)),2)))
    print("Residual Skew: "+str(round(stats.skew((y_slr_predict)-(y_test)),2)))
    print("Residual Kurtosis: "+str(round(stats.kurtosis((y_slr_predict)-(y_test)),2)))
    fig, axs = plt.subplots(ncols=(3), nrows=(1), figsize=(15,4))
    plt.subplots_adjust(top = 0.95, bottom=0.01, hspace=0.25, wspace=0.16)
    axs[0].scatter((y_test), (y_slr_predict))
    axs[0].set_title('Actuals (x) vs Predicteds (y)',fontweight = 'bold')
    axs[0].set_xlim([x_left, x_right])
    axs[0].set_ylim([y_bottom, y_top])
    axs[1].scatter((y_test), ((y_slr_predict)-(y_test)))
    axs[1].set_title('Actuals (x) vs Residuals [pred. - actual] (y)',fontweight = 'bold')
    axs[1].set_xlim([x_left, x_right])
    axs[1].set_ylim([y_bottom, y_top])
    axs[2].hist(((y_slr_predict)-(y_test)), bins=100)
    axs[2].set_title('Histogram of Residuals [predicted - actual]',fontweight = 'bold')
    plt.savefig('linear_regression_modeling_',dpi=300, bbox_inches='tight')
    plt.show()
    print("Cross-Validation Scoring for Standard Linear Regression")
    print('Mean Absolute Error: {}'.format(-1*round(cross_val_score(model, X_train1, y_train1, \
                                                 cv=cv_folds_int, scoring='neg_mean_absolute_error').mean(),2)))
    print('Median Mean Absolute Error: {}'.format(-1*round(np.median(cross_val_score(model, X_train1, y_train1, \
                                                 cv=cv_folds_int, scoring='neg_mean_absolute_error')),2)))  
    
    print('Mean R^2: {}'.format(round(cross_val_score(model, X_train1, y_train1, cv=cv_folds_int, \
                                                 scoring='r2').mean(),2)))
    print('Median R^2: {}'.format(round(np.median(cross_val_score(model,\
                                             X_train1, y_train1, cv=cv_folds_int, scoring='r2')),3)))
    print('Max R^2: {}'.format(round(np.max(cross_val_score(model,\
                                             X_train1, y_train1, cv=cv_folds_int, scoring='r2')),3)))
    print('Min R^2: {}'.format(round(np.min(cross_val_score(model,\
                                             X_train1, y_train1, cv=cv_folds_int, scoring='r2')),3)))
    linear_regression_feature_list = []
    linear_regression_coef_list = []
    for i in X.columns:
        linear_regression_feature_list.append(i)
    for i in range(len(model.coef_)):
        linear_regression_coef_list.append(model.coef_[i])
    top_10_linear_regression_features = pd.DataFrame(sorted(list(zip(linear_regression_feature_list,\
                                                                     linear_regression_coef_list))\
       ,key = lambda x: abs(x[1]),reverse=True)[:10], columns=['Feature', 'Linear Regression Coefficient'])
    top_10_linear_regression_features   
    print(top_10_linear_regression_features)
    print('Intercept: {}'.format(model.intercept_))
In [30]:
def single_linear_regression_modeler(X, single_feature_str, y, cv_folds_int, x_left, x_right, y_bottom, y_top):
    X_train1, X_test, y_train1, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
    model = LinearRegression()
    model.fit(X_train1,y_train1)
    y_slr_predict = model.predict(X_test)
    print("R^2:"+str(round(r2_score(y_test, y_slr_predict),4)))
    print("Mean Absolute Error (in original units): "+str(round(mean_absolute_error((y_test),\
                                                                                   (y_slr_predict)),2)))
    print("STD of Training Data (in original units): "+str(round(np.std((y_test)),2)))
    print("Mean of Training Data (in original units): "+str(round(np.mean((y_test)),2)))
    print("Residual Skew: "+str(round(stats.skew((y_slr_predict)-(y_test)),2)))
    print("Residual Kurtosis: "+str(round(stats.kurtosis((y_slr_predict)-(y_test)),2)))
    fig, axs = plt.subplots(ncols=(3), nrows=(1), figsize=(15,4))
    plt.subplots_adjust(top = 0.95, bottom=0.01, hspace=0.25, wspace=0.16)
    axs[0].scatter((y_test), (y_slr_predict))
    axs[0].set_title('Actuals (x) vs Predicteds (y)',fontweight = 'bold')
    axs[0].set_xlim([x_left, x_right])
    axs[0].set_ylim([y_bottom, y_top])
    axs[1].scatter((y_test), ((y_slr_predict)-(y_test)))
    axs[1].set_title('Actuals (x) vs Residuals [pred. - actual] (y)',fontweight = 'bold')
    axs[1].set_xlim([x_left, x_right])
    axs[1].set_ylim([y_bottom, y_top])
    axs[2].hist(((y_slr_predict)-(y_test)), bins=100)
    axs[2].set_title('Histogram of Residuals [predicted - actual]',fontweight = 'bold')
    plt.savefig('linear_regression_modeling_',dpi=300, bbox_inches='tight')
    plt.show()
    print("Cross-Validation Scoring for Standard Linear Regression")
    print('Mean Absolute Error: {}'.format(-1*round(cross_val_score(model, X_train1, y_train1, \
                                                     cv=cv_folds_int, scoring='neg_mean_absolute_error').mean(),2)))
    print('Median Mean Absolute Error: {}'.format(-1*round(np.median(cross_val_score(model, X_train1, y_train1, \
                                                     cv=cv_folds_int, scoring='neg_mean_absolute_error')),2)))  

    print('Mean R^2: {}'.format(round(cross_val_score(model, X_train1, y_train1, cv=cv_folds_int, \
                                                     scoring='r2').mean(),2)))
    print('Median R^2: {}'.format(round(np.median(cross_val_score(model,\
                                                 X_train1, y_train1, cv=cv_folds_int, scoring='r2')),3)))
    print('Max R^2: {}'.format(round(np.max(cross_val_score(model,\
                                                 X_train1, y_train1, cv=cv_folds_int, scoring='r2')),3)))
    print('Min R^2: {}'.format(round(np.min(cross_val_score(model,\
                                                 X_train1, y_train1, cv=cv_folds_int, scoring='r2')),3)))
    linear_regression_feature_list = []
    linear_regression_coef_list = []
    for i in list([single_feature_str]):
        linear_regression_feature_list.append(i)
    for i in range(len(model.coef_)):
        linear_regression_coef_list.append(model.coef_[i])
        top_10_linear_regression_features = pd.DataFrame(sorted(list(zip(linear_regression_feature_list,\
                                                                         linear_regression_coef_list))\
           ,key = lambda x: abs(x[1]),reverse=True)[:10], columns=['Feature', 'Linear Regression Coefficient'])
    top_10_linear_regression_features   
    print(top_10_linear_regression_features)
    print('Intercept: {}'.format(model.intercept_))

👇🏾 First, let's build a multiple linear regression model for dependent variable 'ya.'👇🏾

In [31]:
multiple_linear_regression_modeler(X=ds1[['x1','x2','x3']], 
                        y=ya, cv_folds_int=4, x_left=-100, x_right=100, y_bottom=-100,y_top=100)
R^2:0.6769
Mean Absolute Error (in original units): 8.33
STD of Training Data (in original units): 18.39
Mean of Training Data (in original units): 3.98
Residual Skew: -0.04
Residual Kurtosis: 0.1
Cross-Validation Scoring for Standard Linear Regression
Mean Absolute Error: 8.36
Median Mean Absolute Error: 8.36
Mean R^2: 0.68
Median R^2: 0.677
Max R^2: 0.683
Min R^2: 0.671
  Feature  Linear Regression Coefficient
0      x2                       6.236009
1      x1                       2.513063
2      x3                       0.956032
Intercept: 7.579525599825636

👇🏾 Conclusion: Equation for Predicting 'ya.'👇🏾

$ \begin{align} ya = 2.513(x1) + 6.236(x2) + 0.956(x3) + 7.58 \end{align} $
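As a quick, illustrative sanity check (not part of the model pipeline above), we can plug the fitted coefficients back into ds1 by hand and look at the resulting mean absolute error; it should land in the same neighborhood as the roughly 8.3 reported by the cross-validated model.

#hand-rolled check of the fitted equation for ya against the actual values in ds1
ya_hat = 2.513*ds1['x1'] + 6.236*ds1['x2'] + 0.956*ds1['x3'] + 7.58
print('MAE of the hand-rolled equation: {:.2f}'.format((ya_hat - ds1['ya']).abs().mean()))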

👇🏾 Second, let's build a single linear regression model for dependent variable 'yb.' 👇🏾

In [32]:
single_linear_regression_modeler(X=ds1.x1.values.reshape(-1, 1), single_feature_str='x1',\
                                 y=yb, cv_folds_int=4, x_left=-6, x_right=6, y_bottom=-6, y_top=6)
R^2:0.8965
Mean Absolute Error (in original units): 0.2
STD of Training Data (in original units): 0.77
Mean of Training Data (in original units): 2.12
Residual Skew: 0.32
Residual Kurtosis: 0.38
Cross-Validation Scoring for Standard Linear Regression
Mean Absolute Error: 0.2
Median Mean Absolute Error: 0.2
Mean R^2: 0.89
Median R^2: 0.894
Max R^2: 0.895
Min R^2: 0.894
  Feature  Linear Regression Coefficient
0      x1                        0.25334
Intercept: 0.8422290091647144

👇🏾 Conclusion: Equation for Predicting 'yb.' 👇🏾

$ \begin{align} yb = 0.253(x1) + 0.8422 \end{align} $

👇🏾 Finally, let's build a multiple linear regression model for dependent variable 'yc.' We don't expect it to work. (Note that the call below passes the full ds1 frame, which still contains yc itself, so the perfect R^2 in the output is a trivial artifact rather than a real model.) 👇🏾

In [33]:
multiple_linear_regression_modeler(X=ds1,y=yc, cv_folds_int=4, x_left=-0.1, x_right=0.1, y_bottom=-0.1, y_top=0.1)
R^2:1.0
Mean Absolute Error (in original units): 0.0
STD of Training Data (in original units): 0.03
Mean of Training Data (in original units): 0.0
Residual Skew: -1.59
Residual Kurtosis: 3.56
Cross-Validation Scoring for Standard Linear Regression
Mean Absolute Error: 0.0
Median Mean Absolute Error: 0.0
Mean R^2: 1.0
Median R^2: 1.0
Max R^2: 1.0
Min R^2: 1.0
  Feature  Linear Regression Coefficient
0      yc                   1.000000e+00
1      yb                   7.799962e-15
2      x1                  -1.658834e-15
3      x2                   8.528791e-16
4      x3                   1.688990e-16
5      ya                  -1.358584e-16
6      x5                  -2.225314e-17
7      x6                  -2.418898e-18
Intercept: -5.4766033288217164e-15

Part II. - Unsupervised Machine Learning


  • Consider data set 2 (ds2.csv). The data set comprises a set of observations.

Question 3 - Part II.A


  • Describe the data set in a few sentences.

👇🏾 Just as we did in Part 1, Question 1, we begin by reading in the data. 👇🏾

  • We will check for NULLS.
  • The raw data contains its own index column, but a Pandas dataframe comes with an index by default, so we will drop this 'extra' index column.
  • We will summarize the data.
  • We will also check the distributions.
In [34]:
ds2 = pd.read_csv('ds2.csv')
In [35]:
#drop the extra index column since a Pandas dataframe has one built in
ds2 = ds2.drop(ds2.columns[0], axis=1)
In [36]:
#peek at the data
ds2.head()
Out[36]:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
0 23.778224 13.319974 15.565124 -3.713626 7.296793 -19.371013 -0.894130 -6.110282 -28.959316 2.851336
1 16.602950 23.311281 21.099052 -0.304154 -3.218990 2.357643 12.027277 7.070349 -5.762185 -23.050198
2 12.084683 19.710443 9.837102 -1.081918 -1.201942 9.738019 16.125920 19.119391 -15.582122 -12.292535
3 13.044534 10.749040 5.884407 -11.703525 -4.134358 -22.344666 -1.263349 0.493711 -15.305347 6.799087
4 8.314115 6.748794 5.388535 -0.000290 -4.724787 -16.346812 3.293600 -10.848273 -17.285491 6.034214
In [37]:
#check for NULLS
msno.matrix(ds2,  color = (.0,.0,.2))
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c2aa80048>

👇🏾 Next, just as in Part 1, Question 1, we describe the data. 👇🏾

  • Let's quickly describe the data.
  • This will give us the key metrics to see what we're dealing with.
In [38]:
#let's describe the data
ds2.describe()
Out[38]:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
count 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000
mean 8.677829 11.716801 9.252817 -2.679634 2.774942 0.077631 8.200884 8.712878 -12.860135 -1.339360
std 11.971827 6.655333 9.861443 10.625164 8.800553 15.375478 10.426023 9.881482 8.881026 14.534143
min -25.824199 -8.497562 -23.666439 -29.429655 -22.033329 -35.264019 -21.428538 -16.811146 -36.065150 -36.468083
25% 0.231327 7.161564 2.648845 -10.652694 -4.098043 -14.003670 -0.812888 1.480706 -19.430697 -13.216190
50% 12.754335 11.896021 11.422441 -2.631318 2.484416 1.500838 8.532535 9.628064 -14.417786 -2.094310
75% 17.364337 16.279210 16.503676 5.340314 9.660898 14.050512 17.138898 16.081044 -6.534142 10.561797
max 32.268570 32.909917 31.230550 26.422798 29.312010 31.727042 32.084297 36.847922 13.553705 32.641789
In [39]:
#all credit due to: Pedro Marcelino; https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
#correlation matrix
corrmat = ds2.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
plt.savefig('ds2_correlation_matrix.png',dpi=300, bbox_inches='tight')

💭👆🏾Just a few thoughts...👆🏾💭

  • Looking at this correlation matrix, it's hard to see any definite pattern, except that X6-X9 seem to be somehow correlated with one another.
In [40]:
#all credit due to: Pedro Marcelino, https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
sns.pairplot(ds2, height = 2.5)
plt.show()
In [41]:
#lets also check the distributions again
for i in ds2.columns:
    distribution_visualizer_and_fitter(ds2, i)

💭👆🏾Just a few thoughts...👆🏾💭

  • One thing that immediately jumps out at you is how bi-modal the data is.

Question 4 - Part II.B


  • How would you visualize this data set? Can you make an interesting visualization?

👇🏾 Gameplan for Principal Component Analysis (PCA)👇🏾

  • The difficulty with visualizing data like this, or doing any type of serious unsupervised learning on it, is the high dimensionality.
  • As I read somewhere on StackOverflow once, 'do you know anyone who can see in 10D?' 😊
  • So what we need to do is figure out how to reduce the dimensionality.
  • We want to see how few components we need to capture most of the cumulative variance.
  • Once we reduce the components a bit, we'll be able to tease out insight.
  • Once we tease out the insight, we'll be able to visualize it.
    • And, not to give away the next section, but then we can begin to cluster it with K-Means.

👇🏾 First thing's first: how many PCA components do we actually need?👇🏾

  • The first thing we do is figure out how many components we need to explain most of the variance in this high dimensional dataset.
  • I'll show you two methods.
    • One through a standard PCA method. Special thanks to Bartosz Mikulski for blogging about it.
    • One through a special Python third-party package. Special thanks to Andrew Heusser for writing it.
In [42]:
#it's important to scale the data so you can compare apples-to-apples how much each component contributes to variance
scaler = MinMaxScaler()
data_rescaled = scaler.fit_transform(ds2)
In [43]:
#we'll need to set a variance threshold. I choose 95%; others may choose 99%; it's up to you.
pca = PCA(n_components = 0.95)
pca.fit(data_rescaled)
reduced = pca.transform(data_rescaled)
In [44]:
#all credit here to Bartosz Mikulski, https://www.mikulskibartosz.name/pca-how-to-choose-the-number-of-components/
pca = PCA().fit(data_rescaled)

plt.rcParams["figure.figsize"] = (12,6)

fig, ax = plt.subplots()
xi = np.arange(1, 11, step=1)
y = np.cumsum(pca.explained_variance_ratio_)

plt.ylim(0.0,1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')

plt.xlabel('Number of Components')
plt.xticks(np.arange(0, 11, step=1)) #change from 0-based array index to 1-based human-readable label
plt.ylabel('Cumulative variance (%)')
plt.title('The number of components needed to explain variance')

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)

ax.grid(axis='x')
plt.show()

💭👆🏾Just a few thoughts...👆🏾💭

  • Reading a chart like this is, I'd say, both an art and a science.
  • You are sort of looking for the 'elbow' point.
  • As we look above we can see that after 3 components we've already gotten to above 80% of the cumulative variance.
    • That's a major win for us because it reduces our data set to just 3 components.
  • To get to the 95% cumulative variance level we'd need 7 components.
    • But that doesn't really help us so much because 7 components is still a lot to interpret. If we, say, reduced a 3000 component set to 7 then that would be saying something. But if we take 10 components and make it 7, in my opinion we haven't really helped ourselves.
  • Thus, I will go with 3 components as the number for our reduction (a quick programmatic check follows below).
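As a quick programmatic check on the eyeballed elbow (an add-on sketch, reusing the full pca object fitted above), we can count how many components are needed to clear the two thresholds discussed in this list:

#count how many components are needed to reach a given cumulative explained-variance level
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
for threshold in (0.80, 0.95):
    n_components_needed = int(np.argmax(cumulative_variance >= threshold)) + 1
    print('{} component(s) reach {:.0%} cumulative variance'.format(n_components_needed, threshold))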
In [45]:
#there is also a third-party Python plug-in called 'Hypertools' which has a similar procedure
#it looks at how well the reduced data correlates with the original data
#the package was written by Andrew Heusser, 
#https://hypertools.readthedocs.io/en/latest/auto_examples/plot_describe.html#sphx-glr-auto-examples-plot-describe-py
hyp.describe(ds2)
Out[45]:
{'average': [0.9736018878700515,
  0.9861890676941355,
  0.988700379639046,
  0.9912663044390128,
  0.9933949689870721,
  0.9953561617427169,
  0.9969256847288822,
  0.9985214973696905],
 'individual': [[0.9736018878700515,
   0.9861890676941355,
   0.988700379639046,
   0.9912663044390128,
   0.9933949689870721,
   0.9953561617427169,
   0.9969256847288822,
   0.9985214973696905]]}

💭👆🏾Just a few thoughts...👆🏾💭

  • So, we see that 3 components seems to be the magic elbow point.
  • Let's see what happens when we graph these 3 components....

👇🏾 Second thing's second: let's look at our data in 3D, as opposed to 10D.👇🏾

  • Now we will take a look at our first visualization.
  • We will set our PCA algorithm to have n_components of 3 after all the work we just did to arrive at that number.
In [46]:
pca_execution = PCA(n_components=3)
pca_doing_its_work = pca_execution.fit_transform(data_rescaled)
print("original shape:   ", data_rescaled.shape)
print("transformed shape:", pca_doing_its_work.shape)
original shape:    (2000, 10)
transformed shape: (2000, 3)
In [47]:
pca_conjecture = pd.DataFrame(pca_doing_its_work, columns=['component_1','component_2', 'component_3'])
In [48]:
pca_conjecture.head()
Out[48]:
component_1 component_2 component_3
0 -0.513079 0.150428 0.069967
1 0.164240 0.392656 -0.071182
2 0.243555 0.171085 -0.136090
3 -0.385704 0.017588 0.252501
4 -0.303847 0.083962 0.351603
In [49]:
#all credit here to Jake Vanderplas, 
#https://jakevdp.github.io/PythonDataScienceHandbook/04.12-three-dimensional-plotting.html
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.scatter3D(pca_conjecture.component_1, pca_conjecture.component_2, pca_conjecture.component_3)
plt.savefig('pca_3D_graph.png',dpi=300, bbox_inches='tight')

💭👆🏾Just a few thoughts...👆🏾💭

  • This is a really cool 3D plot of how we can group the data.
  • Although there are 10(!) dimensions, we can reduce those to basically 3 dimensions.
  • Very, very conveniently, the human mind can picture 3 dimensions on a 3D plot.
    • It's very possible to have reduced to components which still would not be easily interpreted, so I think we really lucked out here! That's why I say 'The Lord Always Delivers.' 😊🙏
  • Key point: It's tempting to think that this means that all the observations (i.e., the rows) can be split into 3 fields. But we're not there just yet.
    • We have basically reduced the columns in this gigantic data set from 10 to 3, but we now will want to figure out how to apply K-Means clustering to find out more.
  • Note: K-Means begins to break down at high dimensions because of what's called 'the curse of dimensionality.'
    • Basically, anything that uses Euclidean distance becomes a lot less meaningful in high dimensions, because the distances between points start to look nearly the same for every pair. Therefore, it helps to reduce the components first.
      • Also, please note: There are other clustering methods that can perform better with high dimensionality, but that's outside the scope of this notebook. If you are curious, more info here.
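
To make that 'curse of dimensionality' point a bit more concrete, here is a tiny illustrative sketch (not from the original analysis, using random data only) showing how the relative spread of Euclidean distances shrinks as the number of dimensions grows:

#illustrative sketch: distance concentration as dimensionality grows
#for random points, the gap between the nearest and farthest neighbor shrinks
#relative to the distances themselves as the dimension d increases
import numpy as np

rng = np.random.RandomState(42)
for d in [2, 10, 100, 1000]:
    points = rng.rand(500, d)                                #500 random points in d dimensions
    dists = np.linalg.norm(points - points[0], axis=1)[1:]   #distances from the first point
    relative_spread = (dists.max() - dists.min()) / dists.min()
    print("d =", d, "-> relative spread of distances:", round(relative_spread, 3))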

Question 5 - Part II.C


  • Someone suggests that the observations are really from multiple different files and were accidentally joined into one larger data set. Does anything about the data set suggest this? If so, how many different sources/files do you think there are?

👇🏾 We begin our K-Means Clustering analysis now that we've completed the PCA round.👇🏾

  • Let the K-Means games begin!
  • The first rule of thumb is to use an algorithm to choose your number of clusters, k, for K-Means.
  • It's tempting to just jump right in and use the number 3 from before, but we will want to evaluate an **'elbow'** graph again.
    • By the way, in case you are curious, I actually tested K-Means on the PCA-generated dataframe offline, and 3 was **not** the optimum number of k clusters.
In [50]:
#all credit to Dmitriy Kavyazin, http://bit.ly/2mYDWv7
ks = range(1, 10)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    
    # Fit model to samples
    model.fit(pca_conjecture)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

💭👆🏾Just a few thoughts...👆🏾💭

  • So we can actually see the elbow point comes at **4** clusters (an optional cross-check is sketched just below)!!! ✅🙏
  • Let's go ahead and plot that next!!
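
Before the plot, here is that optional cross-check of the elbow reading (not in the original notebook): the average silhouette score for a few candidate values of k, assuming the pca_conjecture dataframe from above.

#optional cross-check of the elbow result using silhouette scores
#a higher average silhouette suggests better-separated clusters
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(pca_conjecture)
    print("k =", k, "-> average silhouette score:", round(silhouette_score(pca_conjecture, labels), 3))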
In [51]:
#here comes K-Means at long last!
kmeans = KMeans(n_clusters=4)
kmeans.fit(pca_conjecture)
choice_kmeans = kmeans.predict(pca_conjecture)
In [52]:
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.scatter3D(pca_conjecture.component_1, pca_conjecture.component_2, pca_conjecture.component_3,\
             c=choice_kmeans, cmap='Dark2')
plt.savefig('kmeans_on_pca.png',dpi=300, bbox_inches='tight')

💭👆🏾Just a few thoughts...👆🏾💭

  • And voilà: now we have a great visualization of the 10D data.
  • We can see that there are basically **3 blobs** in the space of the 3 principal components, and **4 cohorts (the colors)** that the actual observations can be sorted into.
    • In future analysis, we'll want to study why 2 cohorts seem to get jammed into 1 blob, yet are not themselves considered the same cohort.
      • We'll want to better understand what features K-Means is using to differentiate them (one way to start digging is sketched just below).
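
One way to start digging (a sketch only, assuming the kmeans and pca_execution objects fit above are still in memory) is to map the 4 cluster centers from the 3-component PCA space back to the 10 rescaled features and compare them side by side:

#hedged sketch: push the 4 K-Means cluster centers back into the original (rescaled) feature space
#note: values are on the rescaled feature scale used for the PCA, not the raw X1..X10 scale
import pandas as pd

centers_in_pca_space = kmeans.cluster_centers_                                      #shape (4, 3)
centers_in_feature_space = pca_execution.inverse_transform(centers_in_pca_space)    #shape (4, 10)
pd.DataFrame(centers_in_feature_space,
             columns=['X' + str(i) for i in range(1, 11)],
             index=['cohort_' + str(c) for c in range(4)])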

Question 6 - Part II.D - Bonus


  • Bonus points: If you think there is more than one source in ds2, can you assign each observation to the right source (based on the number of sources you identified in 2c)?

👇🏾 Applying our K-means work from the last question...👇🏾

  • We can simply take the same K-Means classifications we used to make the hues in the 3D plot above, and apply them to the ds2 dataframe.
In [53]:
#just to be safe, we can check the shape so we know that they'll match
print(ds2.shape)
print(choice_kmeans.shape)
(2000, 10)
(2000,)
In [54]:
ds2['cohort_guess'] = choice_kmeans
In [55]:
ds2.sample(10)
Out[55]:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 cohort_guess
856 16.258705 7.536404 20.738980 -12.694768 2.361967 -13.735497 1.114422 7.757069 -11.394184 21.557510 2
1107 -3.724744 16.132432 -6.568144 9.931925 2.570133 15.679603 8.533902 15.854891 -3.841659 -2.757685 1
992 7.901057 11.937911 13.580849 -4.198245 -1.866585 -22.293985 -0.826338 -6.077889 -15.110786 5.361091 0
1297 -9.897615 25.005272 -9.610588 18.815536 6.059984 22.159695 29.886364 19.724171 -3.086989 -4.117812 1
445 18.250491 17.457891 18.771506 -0.833597 -13.159148 9.135771 21.157722 11.466289 -5.570972 -17.053007 3
183 -10.539711 15.283469 -11.576097 9.137384 2.902189 15.165293 20.014466 19.858354 5.253761 -18.449395 1
1411 -5.274538 21.802477 -10.129634 2.056461 1.218785 12.577448 20.278365 19.309603 2.381341 -12.527796 1
1161 12.363580 7.115654 16.053999 -1.434357 5.038282 -18.648144 4.398782 -7.789061 -13.777822 10.110424 0
873 8.534240 4.586206 10.560645 -19.470090 15.850141 -6.419461 -2.522902 9.986515 -17.717444 20.802697 2
1348 8.210770 -3.624457 4.939408 -5.635591 21.514586 -10.027645 -1.592554 9.510031 -20.588316 15.208638 2

💭👆🏾Just a few thoughts...👆🏾💭

  • And there we have it!
  • I took a random sample of the dataframe instead of using the head() method, so you can get an idea of the different 'cohort guesses'.
  • You can see that there are 4 different K-Means cohorts now.
    • 0, 1, 2, and 3
  • I would argue that these cohorts are where the disparate files came from before they were joined into ds2 (a quick supporting summary is sketched just below).
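
A quick supporting summary (a minimal sketch, assuming the ds2 dataframe with the cohort_guess column added above):

#hedged sketch: summarize the guessed cohorts
#cohort sizes first
print(ds2['cohort_guess'].value_counts().sort_index())

#then per-cohort means of the original features; systematically different means
#would be consistent with the rows coming from different source files
ds2.groupby('cohort_guess').mean().round(2)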

Part III. - SQL Coding


  • Stack Overflow provides a tool at https://data.stackexchange.com/stackoverflow/query/new that allows SQL queries to be run against their data. After reviewing the database schema provided on their site, please answer the questions below by providing both your answer and the query used to derive it.

👇🏾 Full answers and clickable/runnable results can be found at the Stack Overflow query permalink.👇🏾

http://bit.ly/2mJ5nct

Question 7 - Part III.A


query-3-a-via-Sublime

In [56]:
query_3_a_results = pd.read_csv('stackoverflow_query_3-a_results.csv')
query_3_a_results
Out[56]:
Num_Posts_2017
0 5062160

Question 8 - Part III.B


query-3-b-via-Sublime

In [57]:
query_3_b_results = pd.read_csv('stackoverflow_query_3-b_results.csv')
query_3_b_results.head(10)
Out[57]:
View_Level Frequency
0 0 6014518
1 1 1317265
2 2 725275
3 3 466790
4 4 326298
5 5 242879
6 6 188453
7 7 151391
8 8 125166
9 9 105231
In [58]:
#matplotlib is imported as mpl here for the tick formatter below (a no-op if it was already imported earlier)
import matplotlib as mpl
ax = sns.regplot(x="View_Level", y="Frequency", data=query_3_b_results, fit_reg=False)
ax.set_ylabel('Frequency', fontweight = 'bold')
ax.set_xlim(0,40)
ax.set_xlabel("View Level", fontweight = 'bold')
ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
ax.set_title("Histogram of Views Generated by Above SQL Query", fontweight = 'bold')
plt.show()
In [59]:
plt.figure(figsize=(16,5))
plt.bar(query_3_b_results.View_Level, query_3_b_results.Frequency)
plt.ylim(0,300000)
plt.xlim(0,80)
plt.title("Zoomed-In Snapshot of Stack Overflow Views Distribution", fontweight='bold')
plt.ylabel("Frequency", fontweight = 'bold')
plt.xlabel("View Level", fontweight = 'bold')
plt.show()
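
Because the view counts span several orders of magnitude, a log-scaled y-axis can make the long tail easier to read; here is an optional sketch (assuming the query_3_b_results dataframe loaded above):

#optional: the same distribution with a log-scaled y-axis to expose the long tail
plt.figure(figsize=(16,5))
plt.bar(query_3_b_results.View_Level, query_3_b_results.Frequency)
plt.yscale('log')
plt.xlim(0,80)
plt.title("Stack Overflow Views Distribution (log-scaled frequency)", fontweight='bold')
plt.ylabel("Frequency (log scale)", fontweight='bold')
plt.xlabel("View Level", fontweight='bold')
plt.show()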

Question 9 - Part III.C


  • What are the top 10 tags associated with posts created in 2017?

query-3-c-via-Sublime

In [60]:
query_3_c_results = pd.read_csv('stackoverflow_query_3-c_results.csv')
query_3_c_results
Out[60]:
Tag Tag_Count
0 javascript 250807
1 python 192652
2 java 175109
3 php 142844
4 android 136363
5 c# 132168
6 html 109411
7 jquery 92614
8 css 75326
9 ios 66249
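
For parity with the views plots above, the top-10 tags can also be visualized directly; a minimal sketch, assuming the query_3_c_results dataframe just loaded:

#optional: bar chart of the top 10 tags on posts created in 2017
plt.figure(figsize=(12,5))
plt.bar(query_3_c_results.Tag, query_3_c_results.Tag_Count)
plt.title("Top 10 Tags on Posts Created in 2017", fontweight='bold')
plt.ylabel("Tag Count", fontweight='bold')
plt.xlabel("Tag", fontweight='bold')
plt.show()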

Question 10 - Part III.D - Bonus


  • Bonus points: For the questions created in 2017, what was the average time (in seconds) between when the question was created and when the accepted answer was provided?

query-3-d-via-Sublime

In [61]:
query_3_d_bonus_results = pd.read_csv('stackoverflow_query_3-d_results.csv')
query_3_d_bonus_results
Out[61]:
Avg_Response_Secs
0 201332
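
To put that figure in more familiar units (a quick conversion, not part of the original output):

#convert the average response time into hours and days for readability
avg_secs = 201332
print(avg_secs, "seconds is about", round(avg_secs/3600, 1), "hours, or about", round(avg_secs/86400, 1), "days")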

Best,

George John Jordan Thomas Aquinas Hayward, Optimist

george-hayward-data-scientist

GJJTAHO