Executive Summary

This analysis seeks two objectives:

  • (1) recommend the best button for a webpage by analyzing A/B test results and making causal inferences.
  • (2) create engagement segmentations for the user base via clustering.

First, this analysis recommends that the light button be selected for the complete website rollout.

ex-C-converted-sends

conversion-funnel-final

  • For discoverability, this analysis used overall button clicks as the metric.
    • Here, the dark button narrowly edged out the light button (23% to 21%).
    • Although the dark button received more clicks—and the difference was statistically significant—it is unlikely to be practically significant.

ex-A-clicks

  • For message usage, this analysis used conversion rate after clicking a button, given that clicking a button was the user's first action post enrollment in the experiment.
    • Here, the light button dominated. It has a 45% message use conversion rate, while the dark button had a 22% message use conversion rate.
      • This difference was both statistically and practically significant.
  • Notably, this A/B test is imperfectly designed because we are testing both color and the presence of text ("Get help" vs. no text) at once. These two features may confound each other.
  • This data scientist recommends running another test of a dark button with the text "Get help."
  • Thought:
    • This data scientist contends that the dark button got more clicks because
      • (1) aesthetically, it contrasted with the white background and
      • (2) it attracted curiosity due to its lack of any text-based signage or explanation.
    • So the dark button got more clicks but also more bounces.
    • The light button did not stand out as much, but featured the phrase "Get help."
      • As such, its clicks were more intentional, and its conversion rate was thus higher.

Second, this analysis recommends segmenting users into the following three categories:

  • (1) Inactive, Passive Users (not depicted below)
  • (2) Active Users Who Call Frequently, But Don't Use the Online Platform
  • (3) Active Users Who Are Often Online, But Rarely Call In

2sigma-function

3sigma-function

These segmentations were supported by K-Means clustering.

  • Credit is given to Jake Vanderplas for his remarkable data science explanations and code. In particular, I found his take on the K-Means clustering algorithm very helpful.

This analysis makes these segmentations with a business situation and use-case in mind:

  • Situation:
    • Customers who mostly call and who may not be aware of online resources.
      • If they are made aware of some of these, they may have a better experience and the company will save money.
    • On the other hand, customers who only use online resources may underestimate the efficacy of talking on the phone with a health insurance representative. If they are made aware of all that can be accomplished on the phone, they may have a better experience.
  • Use-case:
    • Ideally, these segmentations would be the analytical foundation for directed-information or advertising campaigns geared towards increasing customer retention and satisfaction.

Credit to God, my Mother, family and friends.

All errors are my own.

Best,
George John Jordan Thomas Aquinas Hayward, Optimist

Selected Data Visualizations

A/B Testing

The Dark Button Got Slightly More Clicks

ex-A-clicks

The Light Button Got More Absolute Message Sends

ex-B-send-it

The Light Button Dominated The Dark Button In Terms of Message Send Conversion Rate

ex-C-converted-sends

conversion-funnel-final

Clustering

Watch The K-Means Clustering Classification Progress As We Select For More and More Active Users.
Users seem to really have an overall 'either/or' preference between calling and going online.

0sigma-function

1sigma-function

2sigma-function

3sigma-function

20sigma-function

Further Exploration

Could New Yorkers use the service more than anyone else?

new-yorkers-use-more-questionmark

Do people on exchanges use the service more than people who are not on exchanges?

on-exchange-equal-more-use-questionmark

Do seniors use the service less frequently than other age groups?

seniors-using-the-product-less-questionmark

And It Begins

In [3]:
#for part 1
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
import mysql.connector as mysql
import missingno as msno
import statsmodels
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.proportion import proportions_ztest
from scipy import stats
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
#added in for part 2
import seaborn as sns
from sklearn import preprocessing
from sklearn.cluster import KMeans

Prompt 1: Messaging Button Experiment


  • Background: Oscar has run a two-armed experiment to test a new design for the messaging button that appears on the homepage of the member website. This homepage button is one of several possible entry points to the messaging feature. Our messaging feature allows members to message a dedicated care team with any benefits or plan related questions.

  • The goals of this experiment were to improve discoverability and increase usage of the messaging feature. The control design (dark) was shown to 50% of user traffic for the duration of the experiment period; the other 50% of traffic was shown the experimental design (light).

1buttons

its-a-web-page

2ab-experiment-tables

Step 0 - Data Game Plan & A/B Testing Strategy

  • We read in the data.
  • We then get the data into a usable database for us to query, and then pull back.
  • There are some parts here that will be easier in Pandas, and there are some parts which are easier in SQL. So we'll use both.
  • First, we want to read in the data and check for nulls.
  • Given the way I want to evaluate the A/B test, I need to use Pandas to set up a table that I want to query.
  • Here's the game plan for generating the table we want:
    • We start with the experiment_subjects table.
      • From working with this table, I happen to know that it has a single null value for a user_id, so we want to eliminate that.
      • We thus drop duplicates in Pandas on user_id, and we eliminate rows that have a null user_id. That's just a glitch in the data.
    • This experiment_subjects table is the base on which we build our A/B testing analysis table.
    • Next we need to bring in the experiment_actions table.
      • This table has a user_id and then an action they took, and a timestamp for that action. Actions can include Clicking the Button, Viewing the Inbox, Viewing a Message Thread, and then Sending a Message.
      • Critically, although these could constitute an e-commerce-like funnel, they are not always in order. This is because the Button is only one of several ways to reach the messaging screen.
      • For simplicity in the A/B testing stage, we want to convert our data into Bernoulli Trials. This means that each data point will have a discrete outcome of success, 1, or failure, 0.
      • Data scientists love this because it lowers the variance in our populations: each outcome is one of 2 discrete values, instead of the infinity of values a continuous variable can take.
      • This will let us use a two-sample proportion z-test, and we'll also follow that up with a two-sample t-test, although our sample size should be large enough for the z-test alone.
      • This will also require that the users are truly randomized and independent, which, I think, is a reasonable assumption here.
    • So this brings us back to building the data table.
    • Since we are using a Bernoulli Trial method, we're not concerned that a given user clicked on a button a thousand times; we're concerned about how many users clicked on a button. Thus we are going to be looking at distinct user counts.
    • This was not too complicated to do in the experiment_subjects table, since that table is just a user, their treatment/control status, and their experiment enrollment timestamp.
    • This is a bit harder to do in the experiment_actions table.
    • To solve this problem, we want to group this table by both the user_id and the action. We don't care how many times a user did a particular action, but we do care a lot about the different, discrete actions that a given user took. This grouping will take care of that.
      • But what do we do if a user does a given action more than once?
      • We can simply keep the earliest timestamp per user_id-action pair.
      • We can sort the table by timestamp, then do the group by user_id-action in Pandas, then set a parameter in Pandas to select the first of the duplicate user_id-actions pairs.
    • The result will be a shorter (in # of rows) experiment_actions table with one row per distinct user_id-action pair; when a user did a given action more than once, it keeps the earliest occurrence.
    • The next point is critical:
      • We will only include users who have actions that start AFTER the test enrollment timestamp.
      • This is a decision I have made because users who were already doing these actions BEFORE our experiment began are, in my view, tainted: they bring in whatever their old habits and memories are. In fact, the novelty of the new Button alone can be enough for people to click on it. What we really want to know is how users reacted who are unbiased by their past experiences on the site.
        • Now you could reasonably argue that you want to emphasize repeat-users, and that would be OK, but, for the potential biases stated above, I contend this is the best way for us to test, given that the goal is to bring on many, many more new users, who have not used our service before.

Step 1 - Building the Final Tables to Support A/B Testing Strategy

  • The next thing we have to do is set up the A/B testing tables, where we will actually see the proportions.
  • For checking the discoverability, we care most about button clicks.
    • Thus, we care about the proportion of users in the light button cohort that clicked the light button VS. the proportion of users in the dark button cohort that clicked the dark button.
  • For checking usage of the messaging feature, we care about conversion from the button to sending a message.
    • Thus, here, we care about what proportion of the users in the light button cohort who clicked the light button (where clicking the button was their first action) went on to actually send a message VS. what proportion of the users in the dark button cohort who clicked the dark button (where clicking the button was their first action) went on to actually send a message.
    • For this table we really have to think about the timing of the actions. To think about conversion, we need the button to be the first thing they clicked on. If they somehow sent a message first and only then clicked the button, I contend it's unclear whether the button had anything to do with it.
      • The most apples-to-apples comparison will be when the user FIRST clicked a button, then went on to send a message.
      • I will, accordingly, for this table, only count messages sent as a 'success' when clicking the button was the first thing they did.
  • Once all this is done, we round out our analysis with a two-sample proportion z-test, and then a two-sample t-test.
  • The plan is to visualize using pie charts.

👇🏾 We begin by preparing the experiment_subjects table. 👇🏾

In [2]:
experiment_subjects = pd.read_csv('experiment_subjects.csv')
In [3]:
#let's check it out
experiment_subjects.describe()
Out[3]:
user_id audience_name enrolled_at
count 34951 34953 34953
unique 34935 2 34444
top bff9cf7a-fd4b-4ddd-9d65-4c2bf91878c1 light 2017-12-05 01:26:46
freq 2 17539 3

👆🏾You can see that there are some duplicates in the user_id. We need to remove that.👆🏾

In [4]:
experiment_subjects = experiment_subjects.drop_duplicates(['user_id'])
experiment_subjects.describe()
Out[4]:
user_id audience_name enrolled_at
count 34935 34936 34936
unique 34935 2 34427
top 7c8690d4-62f8-41db-a0ff-fb8667624569 light 2017-12-01 19:07:19
freq 1 17522 3

💭👆🏾Something's still off. There are now 34,935 user_ids, but 34,936 rows. 👆🏾💭

  • This means that we've got a NULL user_id somewhere.
  • Sidenote:
    • This shows the answer to an age-old Python vs. SQL question, namely: counting nulls. Pandas does not count nulls; its count function counts only the non-null values in a column.
In [5]:
experiment_subjects = experiment_subjects.dropna(subset=['user_id'])
experiment_subjects.describe()
Out[5]:
user_id audience_name enrolled_at
count 34935 34935 34935
unique 34935 2 34426
top 7c8690d4-62f8-41db-a0ff-fb8667624569 light 2017-12-01 19:07:19
freq 1 17522 3

👆🏾Now we're talking! The rows line up. 👆🏾😊✅⛪

👇🏾 Let's make sure those dates are in a datetime format. 👇🏾

In [6]:
experiment_subjects.enrolled_at.dtype
Out[6]:
dtype('O')
In [8]:
#that means it's an object; we better turn this into a datetime format:
experiment_subjects['enrolled_at'] = pd.to_datetime(experiment_subjects['enrolled_at'],\
                                                format='%Y-%m-%d %H:%M:%S')
experiment_subjects.enrolled_at.dtype
Out[8]:
dtype('<M8[ns]')
In [9]:
#let's confirm that's a datetime
np.dtype('datetime64[ns]') == np.dtype('<M8[ns]')
Out[9]:
True

👆🏾Great, now we're in datetime format. We're in good shape, and we can turn to the next table before we join.👆🏾

In [10]:
experiment_subjects.head()
Out[10]:
user_id audience_name enrolled_at
0 624cf3cb-d23c-4f5d-bbb7-034d9e101576 dark 2017-11-12 16:29:56
1 703fda1a-d406-42d5-98c1-ddc584da80db dark 2017-11-14 03:49:57
2 02a8cc81-7643-49f1-8c34-58c84155c4ab dark 2017-11-29 17:02:30
3 8a7b5f34-5283-4f91-ab27-d630fac065b1 dark 2017-11-25 16:17:07
4 a8dc881e-489e-4414-a281-ea3dc1164628 dark 2017-11-24 19:00:19

👇🏾 We now move to preparing the experiment_actions table. 👇🏾

In [23]:
experiment_actions = pd.read_csv('experiment_actions.csv')
In [24]:
#let's check it out
experiment_actions.describe()
Out[24]:
user_id action new_thread timestamp
count 78985 78985 31808 78985
unique 11718 4 2 61138
top bfd8c778-01d7-428d-aa0d-04f510ac2374 Viewed Messaging Inbox False 2017-12-03 19:27:29
freq 223 30071 21272 7

💭 As said in the game plan, we're not going to be too worried about the new_thread column.

  • For this analysis, we're going to focus on whether a button was clicked and whether a message was sent.
  • The new thread vs old thread dichotomy will be more relevant to mid-funnel analyses, but that is outside the scope of this notebook.
In [25]:
#similar to the above situation, let's get timestamp into datetime format
experiment_actions['timestamp'] = pd.to_datetime(experiment_actions['timestamp'],\
                                                format='%Y-%m-%d %H:%M:%S')

👇🏾 We now want to group the actions column to give us 1 row per user_id-action. 👇🏾

  • We could have also used Pandas's groupby function.
  • We critically want to sort this table first by ascending timestamp order.
    • This lets us make sure we get the user's earliest action, in the event they take the same action more than once.
    • As discussed in the game plan at the outset, this will be very helpful in how we understand conversion later.
In [26]:
experiment_actions = experiment_actions.sort_values('timestamp').drop_duplicates(subset=['user_id', 'action'],\
                                                                                  keep='first')
In [27]:
experiment_actions.describe()
Out[27]:
user_id action new_thread timestamp
count 30266 30266 11339 30266
unique 11718 4 2 24034
top 4efb6683-a836-4773-95c0-66813ea67a69 Viewed Messaging Inbox True 2017-11-28 16:41:54
freq 4 10908 7108 4
first NaN NaN NaN 2017-06-28 15:18:17
last NaN NaN NaN 2017-12-06 23:59:38

👇🏾 We're now ready to join the tables together and then send them over to SQL. 👇🏾

  • SQL is my love language.
  • We will use a left join.
  • We will create our a/b test results dataframes from this table.
In [41]:
ab_test = experiment_subjects.merge(experiment_actions, left_on='user_id', right_on='user_id',\
                                   how='left')

👇🏾 As discussed in the game plan, we eliminate rows where the action happened before the experiment. 👇🏾

  • But we also want to keep in the original experiment subjects who 'bounced', never did any action, and thus have a Null value for the timestamp column.
  • So this next line of code says, basically, to eliminate the rows where a user's action happened BEFORE they were enrolled in the experiment.
In [53]:
ab_test = ab_test[(ab_test.enrolled_at <= ab_test.timestamp) | (pd.isnull(ab_test.timestamp) == True)] 
In [54]:
ab_test.describe()
Out[54]:
user_id audience_name enrolled_at action new_thread timestamp
count 47475 47475 47475 24258 8440 24258
unique 33081 2 32634 4 2 18587
top ade9f184-1c1f-4151-b2e7-6364222f4c23 light 2017-11-12 17:13:22 Viewed Messaging Inbox True 2017-12-03 15:20:28
freq 4 23953 8 8570 5668 4
first NaN NaN 2017-11-12 00:00:14 NaN NaN 2017-11-12 00:05:38
last NaN NaN 2017-12-05 23:59:32 NaN NaN 2017-12-06 23:59:38
In [55]:
engine = create_engine('mysql+mysqlconnector://newuser:data@localhost:3306/sys', echo=False)
ab_test.to_sql(name='ab_test_os', con=engine, if_exists = 'replace', index=False)
In [56]:
#connect to the MySQL database
db = mysql.connect(
    host = "localhost",
    user = "newuser",
    passwd = "data",
    auth_plugin='mysql_native_password',
    database = 'sys')

💭👇🏾Thinking about our discoverability table.👇🏾💭

  • We need a table that, in the first column, lists each individual user in the experiment.
  • Next, in the second column, it will list which audience (light button or dark button) the user is in.
  • Then, in the third column, it will have a 1 if they clicked and a 0 if they did not click.
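For reference, here is a rough, illustrative Pandas-only sketch of the same table (the names clickers_pd and discoverability_pd are mine, not part of the analysis); the actual table is built with the SQL in the next cell, against the ab_test data we pushed to MySQL above.

#illustrative sketch only: a Pandas-only version of the discoverability table
#assumes the ab_test dataframe merged above (user_id, audience_name, action, ...)
clickers_pd = (ab_test.loc[ab_test.action == 'Clicked Messaging Button', ['user_id']]
               .drop_duplicates()
               .assign(clicked_yn=1))
discoverability_pd = (ab_test[['user_id', 'audience_name']]
                      .drop_duplicates(subset=['user_id'])
                      .rename(columns={'audience_name': 'button'})
                      .merge(clickers_pd, on='user_id', how='left')
                      .fillna({'clicked_yn': 0}))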
In [60]:
discoverability = pd.read_sql("""
with experiment_subjects_distinct as(
select distinct
user_id,
audience_name
from sys.ab_test_os
), 

clickers as (
select
user_id,
audience_name,
case when action = 'Clicked Messaging Button' then 1 end as clicked_yn
from sys.ab_test_os
where action = 'Clicked Messaging Button'
)

select
esd.user_id,
esd.audience_name as button,
coalesce(c.clicked_yn, 0) as clicked_yn
from experiment_subjects_distinct esd
left join clickers c on c.user_id = esd.user_id;

""", con=db)
In [67]:
discoverability.sample(5)
Out[67]:
user_id button clicked_yn
26056 228c1835-fba2-4347-9e79-00c0605d5ead light 1
8252 6d0a5838-7549-4dbb-b7de-8656aab061c4 dark 0
19190 cf699e8f-8c74-4b35-a0bb-2e4ab8c35f40 light 0
17946 2c1acbbb-6d0f-4855-87d0-9a776cb10ba4 light 0
2197 0b822fa3-b0ec-49bd-8d84-b5b12e3411d8 dark 0
In [68]:
discoverability.describe()
Out[68]:
clicked_yn
count 33081.000000
mean 0.219099
std 0.413642
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000

💭👇🏾Thinking about our message_usage table.👇🏾💭

  • We need a table that, in the first column, lists each individual user in the experiment.
  • Next, in the second column, it will list which audience (light button or dark button) the user is in.
  • Then, in the third column, it will have a 1 if they sent a message after clicking the button, where clicking the button was their first action.
    • It will have a 0 otherwise.
In [69]:
message_usage = pd.read_sql("""
with experiment_subjects_distinct as(
select distinct
user_id,
audience_name
from sys.ab_test_os), 

ab_test_with_action_order as (
select 
*,
dense_rank() over(partition by user_id order by timestamp asc) as user_event_order
from sys.ab_test_os
),

first_clickers as (
select distinct
user_id
from ab_test_with_action_order
where action = 'Clicked Messaging Button' and user_event_order = 1
),

sent_message_after_clicking_button_first as (
select
ab.user_id,
case when action = 'Sent Message' then 1 end as sent_message_yn
from ab_test_with_action_order ab
join first_clickers fc on fc.user_id = ab.user_id
where action = 'Sent Message' and user_event_order > 1
)

select
esd.user_id,
esd.audience_name as button,
coalesce(sm.sent_message_yn, 0) as sent_message_yn
from experiment_subjects_distinct esd
left join sent_message_after_clicking_button_first sm on sm.user_id = esd.user_id;

""", con=db)
In [75]:
message_usage.sample(5)
Out[75]:
user_id button sent_message_yn
15709 2607040c-abd0-4cc2-8062-c0647b22ede8 dark 0
15578 573f8132-7ec8-48e0-a65d-2a016e1e4384 dark 0
29233 7ad9a77f-1e6a-4f28-b76c-b183f644ccec light 0
5351 791be0c4-72ba-4c10-8025-2ed35974d692 dark 1
31568 a28946e5-7651-4e28-abdc-c2a027ca5c63 light 0
In [71]:
message_usage.describe()
Out[71]:
sent_message_yn
count 33081.000000
mean 0.061455
std 0.240167
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000

💭👇🏾For our conversion metric, we'll want to have a first_clickers table.👇🏾💭

  • So we now know (from right above) how many people went on to send a message after their first action post enrollment was clicking the button.
  • But for how many people in each audience (light button vs. dark button) was this the case?
  • We'll want to have a tally of first clickers, and that's what this table will do.
  • For this table:
    • The first column will be the test group the user is in.
    • The second column will just be the number of users, from that group, who were a first click.
    • This should be a lot like the first table, except it can't just be any click of a button; it has to be a case where the click of the button was the first thing they did post experiment enrollment.
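For reference, here is a rough, illustrative Pandas analogue of the dense_rank() window-function trick the SQL below uses (the names actions_ranked and first_click_ids are mine); it flags the users whose earliest post-enrollment action was clicking the button.

#illustrative sketch only: a Pandas analogue of the SQL dense_rank() approach below
#assumes the ab_test dataframe built above; bounced users (null timestamps) drop out here
actions_ranked = ab_test.dropna(subset=['timestamp']).copy()
actions_ranked['user_event_order'] = (actions_ranked.groupby('user_id')['timestamp']
                                      .rank(method='dense'))
first_click_ids = actions_ranked.loc[
    (actions_ranked.action == 'Clicked Messaging Button') &
    (actions_ranked.user_event_order == 1), 'user_id'].unique()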
In [76]:
first_clickers = pd.read_sql("""
with experiment_subjects_distinct as (
select distinct
user_id,
audience_name
from sys.ab_test_os
), 

ab_test_with_action_order as (
select 
*,
dense_rank() over(partition by user_id order by timestamp asc) as user_event_order
from sys.ab_test_os
),

first_clickers as (
select 
user_id,
case when action = 'Clicked Messaging Button' then 1 end as first_action_was_click_yn
from ab_test_with_action_order
where action = 'Clicked Messaging Button' and user_event_order = 1
)

select
esd.user_id,
esd.audience_name as button,
coalesce(fc.first_action_was_click_yn, 0) as first_action_was_click_yn
from experiment_subjects_distinct esd
left join first_clickers fc on fc.user_id = esd.user_id;
""", con=db)
In [77]:
first_clickers.sample(5)
Out[77]:
user_id button first_action_was_click_yn
13685 bd5fec4a-c2aa-4521-9e74-581126ed60f8 dark 1
24207 fcd2e5f8-8713-41cc-bd5a-559c36bfd2ff light 0
4335 ac15997a-9156-4319-b001-0078cab2ff97 dark 0
13248 19f30902-ca80-4884-a156-6de44abf2dd8 dark 0
21628 9137187f-9f8a-466b-a7f4-2ba3faf4f6f9 light 0
In [78]:
first_clickers.describe()
Out[78]:
first_action_was_click_yn
count 33081.000000
mean 0.185938
std 0.389062
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000

👆🏾One cool thing to notice: although we're at ~ 21% overall for clicks, we're at ~19% overall for first-clicks.👆🏾

  • This is just a kind of temporal quirk you often see in A/B tests.
  • There are some users who clicked who, somehow, someway, went to the message thread first and then clicked the button second.
    • One instance could be that a person had already sent a message, then clicked back through the button to send another.

Fantastic!

  • We now have three tables that will help us unlock the secrets of this A/B test.
    • ✅ discoverability
    • ✅ message_usage
    • ✅ first_clickers

Step 2 - Enter Statistics: Hypothesis Testing

  • In a nutshell, we've set up these tables as Bernoulli trials so that they work smoothly with two-sample proportion z-tests and t-tests.
  • We will see different proportions of success and failure depending on the light button vs dark button audience groups.
  • We're just trying to get a sense of whether the difference we see is likely due to chance, or whether there really is a difference between the groups caused by something other than chance.
  • In a good experiment, the only difference between the groups (which we assume to be truly randomly assigned users, who have no impact on each other, i.e., independent) would be the buttons themselves, and so we can make a causal inference that the button is the change catalyst.
  • In statistics terms, we start out with a null hypothesis.
    • I'm a lawyer, and I can proudly say that we do the same in the law.
    • We begin by assuming a party is innocent, just like we begin by assuming that both the test groups are equal. In other words, we assume the button has no effect.
    • But as we look at the data, we'd expect a certain amount of fluctuation. Just like in a criminal case, we'd expect a certain amount of suspicion to be based on behavior that just happened to put the suspect near the crime.
    • But as the data stacks more and more in one direction, we become more and more likely to reject the null hypothesis and conclude that something is really going on. In the law, we'd conclude that the person is not innocent.
  • The p-value is the probability that we'd see a given result (or a more extreme one) purely due to chance, assuming the null hypothesis is true.
    • The lower that value is, the more likely we are seeing something other than chance at work.
    • For a lot of experiments we reject the null if the p value is under 5%, so that's what we call 'statistical significance.'
    • Notably, just because a difference is statistically significant does not mean it is practically significant. For instance, we might detect an effect caused by a feature of an experiment that changes the outcome by 1%. We might be able to conclude, with a large enough sample size, that that 1% was not a random change. But does the business care about a 1% difference? Maybe they do, but maybe they don't.
  • By the way, while I'm at it, a confidence interval is a range that we are x% sure (often 95%) holds the true value of the population's proportion, since we're just looking at a sample.
    • That's how you get the so-called 'margin of error' in polls.
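To make the mechanics concrete, here is a minimal, self-contained sketch of the pooled two-sample proportion z-test described above, using purely illustrative made-up counts (not this experiment's numbers, which come later in the notebook):

#illustrative sketch: the pooled two-proportion z-test computed by hand on made-up counts
import numpy as np
from scipy import stats

successes = np.array([120, 150])   #hypothetical clickers in arm A and arm B
trials = np.array([1000, 1000])    #hypothetical users in each arm

p_pooled = successes.sum() / trials.sum()
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / trials[0] + 1 / trials[1]))
z = (successes[0] / trials[0] - successes[1] / trials[1]) / se
p_value = 2 * stats.norm.sf(abs(z))  #two-sided p-value
print('Z-Score = %.3f, p-Value = %.4f' % (z, p_value))
#this is the same pooled statistic that statsmodels' proportions_ztest reports below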
In [79]:
#helpful to peek at the data again.
discoverability.sample(3)
Out[79]:
user_id button clicked_yn
3232 4e05eefd-71a2-4509-b419-a04d7e57ad09 dark 0
27885 0f6861fe-25d0-481e-ad75-4b4f8ffa3c5a light 0
16861 f902eb57-8cad-478f-ab28-abd4b6743a61 dark 0

👇🏾 To start out, just for fun, let's get a 95% confidence interval of what the true population proportion is. 👇🏾

In [81]:
count_clicks = discoverability.clicked_yn.sum()
users_clicks = discoverability.clicked_yn.count()
In [107]:
#this is a normal-approximation confidence interval; alpha=0.05 gives a 95% interval
print('95-Percent Confidence Interval: \n Lower Bound = %.4f, Upper Bound = %.4f' % \
      (statsmodels.stats.proportion.proportion_confint(\
            count_clicks, users_clicks, alpha=0.05, method='normal'))) 
95-Percent Confidence Interval: 
 Lower Bound = 0.2146, Upper Bound = 0.2236

👆🏾Pretty cool! 👆🏾

  • This tells us that we are 95% sure the true proportion of overall clicks, un-segmented, is between 21.46% and 22.36%.
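As a quick sanity check, the same interval can be computed by hand with the normal approximation, reusing the count_clicks and users_clicks variables defined a couple of cells up:

#sanity-check sketch: the normal-approximation confidence interval by hand
p_hat = count_clicks / users_clicks
se = np.sqrt(p_hat * (1 - p_hat) / users_clicks)
print('Manual 95-Percent CI: Lower Bound = %.4f, Upper Bound = %.4f' %
      (p_hat - 1.96 * se, p_hat + 1.96 * se))
#should agree with the 0.2146 / 0.2236 bounds printed above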

👇🏾Now that we're warmed up, let's get to the actual A/B test results for discoverability.👇🏾

In [88]:
#we'll store these as variables for easier use later
clicked_light_button = discoverability[discoverability["button"]=='light']['clicked_yn']
clicked_dark_button = discoverability[discoverability["button"]=='dark']['clicked_yn']
In [111]:
#now for use in the proportions test, we need to make a mini dataframe
#counts is the number of successes
#in a Bernoulli trial setup where it's just 1s and 0s, we can just use sum() to get this
#nobs (renamed 'users' here) is the total number of trials; len() can work, though I chose to use count()
click_button_ab_test = pd.DataFrame({
    "count": [clicked_light_button.sum(), clicked_dark_button.sum()],
     "users": [clicked_light_button.count(), clicked_dark_button.count()]
    }, index=['light_button_clicks', 'dark_button_clicks'])
In [97]:
#now we use this to feed into the stats test
#note: click_button_ab_test.count returns the DataFrame's count method, not the column...
#so you have to say click_button_ab_test['count']
print('Z-Score = %.3f, p-Value = %.10f' % statsmodels.stats.proportion.proportions_ztest(\
                                               click_button_ab_test['count'], click_button_ab_test['users']))
#z score, p-value
Z-Score = -5.277, p-Value = 0.0000001313

👆🏾Pretty cool! 👆🏾

  • This very low p-value encourages us to go ahead, reject the null hypothesis, look at the difference in proportions across the treatment and control groups, and conclude that that difference is really caused by the buttons.

👇🏾But before we begin, let's also quickly run a t-test on it.👇🏾

In [114]:
print('T-Score = %.3f, p-Value = %.10f' % stats.ttest_ind(clicked_light_button, clicked_dark_button))
T-Score = -5.279, p-Value = 0.0000001306

👆🏾Pretty cool! 👆🏾

  • For these large sample sizes, and with this many degrees of freedom, I'd expect the t-distribution to be pretty close to the z-distribution, so this isn't too surprising.
  • We can definitely say, as I like to say in my own parlance, "There's a there, there." lol

💭👇🏾So let's take a look at the actual difference finally!!!👇🏾💭

In [119]:
click_button_ab_test['proportion'] = round(click_button_ab_test['count'] / click_button_ab_test['users'],2)
click_button_ab_test
Out[119]:
count users proportion
light_button_clicks 3437 16593 0.21
dark_button_clicks 3811 16488 0.23

👆🏾So, we've got our first conclusion!! 👆🏾😊✅⛪

  • The dark buttons are more 'discoverable.'
  • This 2-percentage-point difference is most likely not due to chance.
  • In fact, there were fewer users exposed to the dark buttons, and the dark buttons still got more clicks!

🔬 A/B Test Metric 2 - Message Usage: Conversion Proportion For Either Button 🔬

  • Let's check out message usage.
  • Now we want to see how many people actually sent a message after they clicked on the button, given that their first action post experiment enrollment was to click on the button.
  • We will first look at the raw numbers of actual sends that either type of button has.
  • Then we'll look at the button conversion percentage.
In [100]:
#helpful to peek at the data again.
message_usage.sample(3)
Out[100]:
user_id button sent_message_yn
20182 abfbdad5-3078-411a-bd2b-0a5d67044b75 light 0
15852 49344f91-ac02-4d51-84e4-e5dbfd535800 dark 1
19847 4035cf20-c26e-40a6-b163-03b0f9a7ff2d light 0

👇🏾 To start out, just for fun, let's get a 95% confidence interval of what the true population proportion is. 👇🏾

  • This is for the raw numbers of users who went on to actually send. This isn't the last-stage-of-the-funnel conversion proportion yet.
In [101]:
count_sends = message_usage.sent_message_yn.sum()
users_sends = message_usage.sent_message_yn.count()
In [106]:
#this is a normal-approximation confidence interval; alpha=0.05 gives a 95% interval
print('95-Percent Confidence Interval: \n Lower Bound = %.4f, Upper Bound = %.4f' % \
      (statsmodels.stats.proportion.proportion_confint(\
            count_sends, users_sends, alpha=0.05, method='normal'))) 
95-Percent Confidence Interval: 
 Lower Bound = 0.0589, Upper Bound = 0.0640

👆🏾Pretty cool! 👆🏾

  • This tells us that we are 95% sure the true proportion of overall actual-sends, un-segmented, is between 5.89% and 6.40%.
  • We can definitely see that a message sent is rarer than a button click.
    • This makes sense.

👇🏾Now that we're warmed up, let's get to the actual A/B test results for message usage.👇🏾

In [109]:
#we'll store these as variables for easier use later
sent_light_button = message_usage[message_usage["button"]=='light']['sent_message_yn']
sent_dark_button = message_usage[message_usage["button"]=='dark']['sent_message_yn']
In [123]:
#now for use in the proportions test, we need to make a mini dataframe
#counts is the number of successes
#in a Bernoulli trial setup where it's just 1s and 0s, we can just use sum() to get this
#nobs (renamed 'users' here) is the total number of trials; len() can work, though I chose to use count()
send_button_ab_test = pd.DataFrame({
    "count": [sent_light_button.sum(), sent_dark_button.sum()],
     "users": [sent_light_button.count(), sent_dark_button.count()]
    }, index=['light_button_sends', 'dark_button_sends'])
In [113]:
#now we use this to feed into the stats test
#note: send_button_ab_test.count returns the DataFrame's count method, not the column...
#so you have to say send_button_ab_test['count']
print('Z-Score = %.3f, p-Value = %.10f' % statsmodels.stats.proportion.proportions_ztest(\
                                               send_button_ab_test['count'], send_button_ab_test['users']))
#z score, p-value
Z-Score = 13.794, p-Value = 0.0000000000

👆🏾Pretty cool! 👆🏾

  • This very low p-value encourages us to go ahead, reject the null hypothesis, look at the difference in proportions across the treatment and control groups, and conclude that that difference is really caused by the buttons.

👇🏾But before we begin, let's also quickly run a t-test on it.👇🏾

In [115]:
print('T-Score = %.3f, p-Value = %.10f' % stats.ttest_ind(sent_light_button, sent_dark_button))
T-Score = 13.834, p-Value = 0.0000000000

👆🏾Pretty cool! 👆🏾

  • For these large sample sizes, and with this many degrees of freedom, I'd expect the t-distribution to be pretty close to the z-distribution, so this isn't too surprising.
  • We can definitely say, as I like to say in my own parlance, "There's a there, there." lol

💭👇🏾So let's take a look at the actual difference finally!!!👇🏾💭

In [242]:
send_button_ab_test['proportion'] = round(send_button_ab_test['count'] / send_button_ab_test['users'],2)
send_button_ab_test
Out[242]:
count users proportion
light_button_sends 1321 16593 0.08
dark_button_sends 712 16488 0.04

👆🏾So, we've got our second conclusion!! 👆🏾😊✅⛪

  • The light buttons lead to more messages sent.
  • This 4-percentage-point difference is most likely not due to chance.
  • We will still want to look at the conversion numbers.

👇🏾Let's look a bit at the first-clickers we've spoken about.👇🏾

In [120]:
#take a peek
first_clickers.sample(3)
Out[120]:
user_id button first_action_was_click_yn
13024 43d65887-f1d3-4d03-b35a-a78a6b0f4562 dark 0
18547 996e38c2-c4c5-4953-844a-0aa04c66534f light 0
19603 3c659ecb-5ec0-45f0-8b00-0869d0258436 light 0

👇🏾We want to build a dataframe that lets us see the conversion funnel after a first click.👇🏾

In [141]:
#we'll store these as variables for easier use later
first_clicked_light_button = first_clickers[first_clickers["button"]=='light']['first_action_was_click_yn']
first_clicked_dark_button = first_clickers[first_clickers["button"]=='dark']['first_action_was_click_yn']
In [142]:
#now for use in the proportions test, we need to make a mini dataframe
#counts is the number of successes
#in a Bernoulli trial setup where it's just 1s and 0s, we can just use sum() to get this
#nobs (renamed 'users' here) is the total number of trials; len() can work, though I chose to use count()
first_clicks_button_ab_test = pd.DataFrame({
    "count": [first_clicked_light_button.sum(), first_clicked_dark_button.sum()],
     "users": [first_clicked_light_button.count(), first_clicked_dark_button.count()]
    }, index=['light_button_first_clicks', 'dark_button_first_clicks'])
In [143]:
first_clicks_button_ab_test
Out[143]:
count users
light_button_first_clicks 2957 16593
dark_button_first_clicks 3194 16488
In [144]:
send_button_ab_test
Out[144]:
count users
light_button_sends 1321 16593
dark_button_sends 712 16488

👇🏾Now we can combine the above two dataframes to make the funnel dateframe.👇🏾

In [145]:
click_to_send_conversions_funnel = pd.DataFrame({
    "users": [send_button_ab_test.users['light_button_sends'], send_button_ab_test.users['dark_button_sends']],
     "first-clicking users": [first_clicks_button_ab_test['count']['light_button_first_clicks'],\
                        first_clicks_button_ab_test['count']['dark_button_first_clicks']],
     "sending users": [send_button_ab_test['count']['light_button_sends'],\
                       send_button_ab_test['count']['dark_button_sends']]
    }, index=['light_button', 'dark_button'])
In [146]:
click_to_send_conversions_funnel['conversion'] \
   = round(click_to_send_conversions_funnel['sending users'] / \
        click_to_send_conversions_funnel['first-clicking users'],2)
In [147]:
click_to_send_conversions_funnel
Out[147]:
users first-clicking users sending users conversion
light_button 16593 2957 1321 0.45
dark_button 16488 3194 712 0.22
In [148]:
#now we use this to feed into the stats test
#note: attribute access like click_to_send_conversions_funnel.count returns a DataFrame method, not a column...
#so you have to use bracket indexing, e.g. click_to_send_conversions_funnel['sending users']
print('Z-Score = %.3f, p-Value = %.10f' % statsmodels.stats.proportion.proportions_ztest(\
         click_to_send_conversions_funnel['sending users'], click_to_send_conversions_funnel['first-clicking users']))
#z score, p-value
Z-Score = 18.644, p-Value = 0.0000000000

👆🏾So, we've got our third and final conclusion!! 👆🏾😊✅⛪

  • The light button has the superior message-send conversion rate.
  • Not that this is surprising, but:
    • This 23-percentage-point difference is most likely not due to chance, haha!
In [149]:
click_button_ab_test
Out[149]:
count users proportion
light_button_clicks 3437 16593 0.21
dark_button_clicks 3811 16488 0.23
In [265]:
labels = ['Clicking Users', '']
light_button_clicks = [click_button_ab_test['count'][0], click_button_ab_test['users'][0]-\
                      click_button_ab_test['count'][0]]
dark_button_clicks = [click_button_ab_test['count'][1], click_button_ab_test['users'][1]-\
                     click_button_ab_test['count'][1]]
adg_font = {'fontname':'Adobe Garamond Pro'}

fig, axs = plt.subplots(1, 2,figsize=(11, 6))
colors_lb = ['darkgray','gainsboro']
colors_db = ['cornflowerblue','lightsteelblue']

axs[0].pie(light_button_clicks, labels=labels, autopct='%1.2f%%', shadow=False, colors=colors_lb, explode = (0, 0.08)\
          , startangle = 6)
axs[0].set_title('Light Button', fontsize = 19, **adg_font)

axs[1].pie(dark_button_clicks, labels=labels, autopct='%1.2f%%', shadow=False, colors=colors_db, explode = (0, 0.08))
axs[1].set_title('Dark Button', fontsize = 19, **adg_font)

plt.subplots_adjust(wspace=0.3, hspace=1)
plt.suptitle("    Anytime Clicks, By Button Type", fontsize = 24, fontweight = 'bold', **adg_font)
plt.savefig("ex_A_clicks.png",dpi=300, bbox_inches='tight')
plt.show()
In [266]:
send_button_ab_test
Out[266]:
count users proportion
light_button_sends 1321 16593 0.08
dark_button_sends 712 16488 0.04
In [267]:
labels = ['Sending Users', '']
light_button_sends = [send_button_ab_test['count'][0], send_button_ab_test['users'][0]-\
                     send_button_ab_test['count'][0]]
dark_button_sends = [send_button_ab_test['count'][1], send_button_ab_test['users'][1]-\
                    send_button_ab_test['count'][1]]
adg_font = {'fontname':'Adobe Garamond Pro'}

fig, axs = plt.subplots(1, 2,figsize=(11, 6))
colors_lb = ['darkgray','gainsboro']
colors_db = ['cornflowerblue','lightsteelblue']

axs[0].pie(light_button_sends, labels=labels, autopct='%1.2f%%', shadow=False, colors=colors_lb, explode = (0, 0.08))
axs[0].set_title('Light Button', fontsize = 19, **adg_font)

axs[1].pie(dark_button_sends, labels=labels, autopct='%1.2f%%', shadow=False, colors=colors_db, explode = (0, 0.08))
axs[1].set_title('Dark Button', fontsize = 19, **adg_font)

plt.subplots_adjust(wspace=0.3, hspace=1)
plt.suptitle("    Sends After Clicks, By Button Type", fontsize = 24, fontweight = 'bold', **adg_font)
plt.savefig("ex_B_send_it.png",dpi=300, bbox_inches='tight')
plt.show()
In [268]:
click_to_send_conversions_funnel
Out[268]:
users first-clicking users sending users conversion
light_button 16593 2957 1321 0.45
dark_button 16488 3194 712 0.22
In [276]:
labels = ['Converted Users', '']
light_button_converted_sends = \
  [click_to_send_conversions_funnel['sending users'][0], click_to_send_conversions_funnel['first-clicking users'][0]-\
  click_to_send_conversions_funnel['sending users'][0]]
dark_button_converted_sends = \
  [click_to_send_conversions_funnel['sending users'][1], click_to_send_conversions_funnel['first-clicking users'][1]-\
  click_to_send_conversions_funnel['sending users'][1]]
adg_font = {'fontname':'Adobe Garamond Pro'}

fig, axs = plt.subplots(1, 2,figsize=(11, 6))
colors_lb = ['darkgray','gainsboro']
colors_db = ['cornflowerblue','lightsteelblue']

axs[0].pie(light_button_converted_sends, labels=labels, autopct='%1.2f%%', shadow=False, \
           colors=colors_lb, explode = (0, 0.08), startangle = -43)
axs[0].set_title('Light Button', fontsize = 19, **adg_font)

axs[1].pie(dark_button_converted_sends, labels=labels, autopct='%1.2f%%', shadow=False, \
           colors=colors_db, explode = (0, 0.08))
axs[1].set_title('Dark Button', fontsize = 19, **adg_font)

plt.subplots_adjust(wspace=0.3, hspace=1)
plt.suptitle("      Conversion Rate, By Button Type", fontsize = 24, fontweight = 'bold', **adg_font)
plt.savefig("ex_C_converted_sends.png",dpi=300, bbox_inches='tight')
plt.show()

Prompt 2: Classifying Member Engagement Status


  • Background: At Oscar, we have analytical and business use cases that require a classification of how engaged members are with Oscar (targeting outreach campaigns to more or less engaged members, understanding how member engagement intersects with clinical profile to affect costs, etc). The attached dataset includes three years of simulated engagement data by member by month. We would like you to use this data to build a classification of member engagement.

4member-engagement

Step 0 - Read in The Data & K-Means Clustering Strategy Discussion

  • As I read in the data—and this is sort of a summary of some work I've done offline—I thought about a few things:
    • What leverage points did I see in the data
    • What kinds of business use cases might I glean from the data
      • Use those business cases for guidance
  • As I looked at the user information, I saw a familiar theme:
    • Most users are quite inactive.
  • Thus the first segmentation is to filter out those users who are not interacting with the platform.
  • Next, we look to the categorical and the continuous data.
  • After I looked through it, there was one thing that really stuck out to me:
    • you have users who are very active online
    • you have users who are very active offline (such as phone calls)
  • I isolated logins and inbound calls and plotted them against each other.
  • I found that the people who logged in the most tended to call the least
  • And the people who called the most tended to log in the least
  • Please note that all this is for only the most active users, after other users have been filtered out.
    • And I always do these filterings on the basis of standard deviation rank:
      • Meaning, for example, all users below +2 standard deviations in logins can be filtered out, because being below that threshold essentially means they have close to 0 logins. The same goes for inbound phone calls.
  • Then we bring in the K-means clustering algorithm from scikit-learn and see how it fares on the data.
  • In line with my hypothesis, I decided to use 2 clusters for K-means, and this worked best.
    • I could tell it worked best by looking at the hues on scatter plots according to how K-means divided the data (a quick inertia check is sketched just after this list).
  • The reason I thought of this kind of segmentation is the advertising use-case:
    • Tell people who login all the time, but never call that representatives are there to help them.
    • Tell people who call all the time and never log in, that there is an exciting app, which they can use.
  • Then for further research if time allows, I have provided a look at the low-call, high-online VS. high-call, low-online graphs split over other categoricals, such as:
    • region
    • exchange status
    • age bracket
  • My ultimate conclusion is that segmentation should be done as follows:

    • (1) - passive users
    • (2) - active users who are low-call, high-online
    • (3) - active users who are high-call, low-online.

    • From my analysis, it appears very few users, if any, are both high-call and high-online, or both low-call and low-online. It seems people really have a preference for how they communicate. If it turned out that there were large categories like these, we could segment into them.

      • But, as of now, the high-high and low-low categories do not seem to exist here.
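As referenced in the list above, here is a minimal sketch (not part of the original analysis, and assuming the member_engagement dataframe read in below) for double-checking the choice of 2 clusters with an inertia 'elbow' curve, on the same two features and the same kind of 2σ filter:

#illustrative sketch: an elbow check on the number of clusters for the 2-sigma active users
#assumes member_engagement has already been read in (see the cells below)
active = member_engagement[
    (np.abs(member_engagement.login_count - member_engagement.login_count.mean())
        >= 2 * member_engagement.login_count.std())
    | (np.abs(member_engagement.ib_call_count - member_engagement.ib_call_count.mean())
        >= 2 * member_engagement.ib_call_count.std())
][['login_count', 'ib_call_count']]

inertias = [KMeans(n_clusters=k, random_state=0).fit(active).inertia_ for k in range(1, 7)]

plt.plot(range(1, 7), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Within-cluster sum of squares (inertia)')
plt.title('Elbow check for the active-user clusters')
plt.show()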

Step 1 - Deploy K-Means!

  • I have written the K-means algorithm into a function.
    • As always, I want to thank Jake Vanderplas for his great tutorials on K-means.
      • I've taken what he's said about K-means, and made a function that works nicely for our purposes.
  • This function works with my standard deviation splitting strategy and matplotlib.
  • Here's what it does:
    • First, you tell the function how active you want your users to be. The more active the users, the better the function works.
      • So if you tell the function that you want a sigma of 6, it will try to cluster people who are above 6 standard deviations from the mean in EITHER a) number of logins or b) number of inbound calls.
    • Second, the function applies the k-means unsupervised machine learning algorithm.
    • Third, the function then plots a graph of logins on the x-axis and inbound calls on the y-axis for the specified standard-deviation 'cut', and colors the points by the two K-means clusters.

👇🏾Let's take a quick peek at the data.👇🏾

In [7]:
member_engagement = pd.read_csv('member_engagement.csv')
In [8]:
#this is just to take a look at the data
member_engagement.describe()
Out[8]:
ib_call_count ob_call_count ob_call_answered_count ib_message_count ob_message_count ob_message_read_count login_count telemedicine_count search_count grievance_count
count 351362.000000 351362.000000 351362.000000 351362.000000 351362.000000 351362.000000 351362.000000 351362.000000 351362.000000 351362.000000
mean 0.228969 0.038738 0.007368 0.144577 0.163319 0.134064 1.114802 0.027308 0.320049 0.007684
std 0.806874 0.353109 0.094743 0.956451 0.782292 0.745294 2.917814 0.191751 1.377393 0.099852
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000
max 90.000000 15.000000 4.000000 62.000000 40.000000 40.000000 108.000000 6.000000 36.000000 2.000000
In [9]:
member_engagement.dtypes
Out[9]:
member_id                 object
policy_id                 object
month                     object
policy_relation           object
age_group                 object
region                    object
enrollment_type           object
ib_call_count              int64
ob_call_count              int64
ob_call_answered_count     int64
ib_message_count           int64
ob_message_count           int64
ob_message_read_count      int64
login_count                int64
telemedicine_count         int64
search_count               int64
grievance_count            int64
dtype: object
In [10]:
member_engagement.enrollment_type.unique()
Out[10]:
array(['Off Exchange', 'On Exchange'], dtype=object)
In [11]:
member_engagement.policy_relation.unique()
Out[11]:
array(['Subscriber', 'Dependent', 'Spouse'], dtype=object)
In [12]:
member_engagement.region.unique()
Out[12]:
array(['New York', 'Orlando', 'San Antonio', 'Austin', 'New Jersey'],
      dtype=object)
In [13]:
member_engagement.month.unique()
Out[13]:
array(['2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01',
       '2017-05-01', '2017-06-01', '2017-07-01', '2017-08-01',
       '2017-09-01', '2017-10-01', '2017-11-01', '2017-12-01',
       '2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01',
       '2019-05-01', '2019-06-01', '2018-01-01', '2018-02-01',
       '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01',
       '2018-07-01', '2018-08-01', '2018-09-01', '2018-10-01',
       '2018-11-01', '2018-12-01'], dtype=object)
In [15]:
member_engagement.sample(4)
Out[15]:
member_id policy_id month policy_relation age_group region enrollment_type ib_call_count ob_call_count ob_call_answered_count ib_message_count ob_message_count ob_message_read_count login_count telemedicine_count search_count grievance_count
145277 28193a93-0caa-4e06-b4c1-831c328d8ece 33b792fe-d263-4466-8b5b-f388a2c740c9 2019-01-01 Spouse 36-45 New York On Exchange 0 0 0 0 0 0 3 0 2 0
307812 a4637f25-1c78-4bfd-829e-0e34e9be82e8 6dbbfc8b-3be5-41ef-9adb-fa819a2ca889 2019-05-01 Subscriber 36-45 Orlando On Exchange 0 0 0 0 0 0 8 1 1 0
151349 185aeac7-f717-42ca-9152-c08f952229e0 b6e95fbd-6358-4506-9d6c-43e4582f9f8d 2018-04-01 Subscriber 56+ San Antonio On Exchange 1 0 0 0 0 0 0 0 0 0
228915 d391bdce-70f5-4a0e-9f5a-0acfe6cd3d0a d7841194-a7e5-43cf-92e6-b93f9117cb59 2019-04-01 Subscriber 27-35 New York Off Exchange 0 0 0 0 0 0 5 1 1 0

🔬 Introducing My Function: K-Means-THINK_Binary 🔬

  • Special thanks to Jake VanderPlas for inspiring me and showing the foundational code.
In [16]:
def k_means_think_binary(sigma):
    super_peeps =  member_engagement[
    (np.abs(member_engagement.login_count-\
                    member_engagement.login_count.mean())\
       >= (sigma*member_engagement.login_count.std())) 
                                 |\
      (np.abs(member_engagement.ib_call_count-\
                    member_engagement.ib_call_count.mean())\
       >= (sigma*member_engagement.ib_call_count.std()))                           
                                ]
    super_peeps = super_peeps[['login_count','ib_call_count']]
    kmeans = KMeans(n_clusters=2)
    kmeans.fit(super_peeps)
    y_kmeans = kmeans.predict(super_peeps)
    #y_kmeans = ['Loves To Call, But Not Online' if x == 0 else 'Always Online, Never Calls' for x in y_kmeans]
    if sigma < 3: #this is a hacky bit to make sure the legend colors are correctly assigned
        y_kmeans = ['Loves To Call, But Not Online' if x == 0 else 'Always Online, Never Calls' for x in y_kmeans] 
                #^^^thanks to @arboc, https://stackoverflow.com/a/4406777/11736959
        sns.set_palette("husl", 8)
    #elif sigma in (3,10,15):
     #   y_kmeans = ['Loves To Call, But Not Online' if x == 1 else 'Always Online, Never Calls' for x in y_kmeans] 
                #^^^thanks to @arboc, https://stackoverflow.com/a/4406777/11736959
      #  sns.set_palette("Set2")
    else:
        y_kmeans = ['Loves To Call, But Not Online' if x == 1 else 'Always Online, Never Calls' for x in y_kmeans] 
                #^^^thanks to @arboc, https://stackoverflow.com/a/4406777/11736959 
        flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
        sns.set_palette(flatui)
    super_peeps['Classification Concept'] = y_kmeans
    super_peeps
    #sns.set_palette("Paired")
    #sns.set_palette("husl", 8)
    sns.lmplot( x="login_count", y="ib_call_count", data=super_peeps, fit_reg=False, hue="Classification Concept", \
               legend=True, legend_out=True)
    #plt.legend(loc='upper right')
    user_count = super_peeps.login_count.count()
    plt.ylabel('Number of Inbound Calls Per Month')
    plt.xlabel('Number of Customer Logins Per Month')
    plt.suptitle('For Users With Inbound Calls or Logins Above {}σ'.format(sigma),\
              fontweight = 'bold', y=1.05)
    plt.title('                              {:,} customers'.format(user_count))
    plt.savefig('analyzer_{}-std-kmeans.png'.format(sigma),dpi=400, bbox_inches='tight')
    plt.ylim(top=30, bottom=0) 
    plt.xlim(right=80, left=0)
    plt.show()
In [17]:
for standard_deviation in [0,0.5,0.75,1,2,3,15,20]:
        k_means_think_binary(standard_deviation)

👆🏾Look out for this! 👆🏾

  • Sometimes K-means alters the 1 or 0 it is using for classification, and throws off the labels in the legend.
  • The left side of this graph is always 'Loves To Call, But Not Online'
    • the right side of this graph is always 'Always Online, Never Calls'
    • But sometimes the K-means ML algorithm swaps the identifiers
  • So just please be aware that the labels may get swapped once in a while, randomly.
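One hedged way to pin the labels for good, instead of the sigma-based color hack inside the function, is to name each cluster by its center rather than by the arbitrary 0/1 id K-means assigns. This is just a sketch; it assumes the fitted kmeans object and the raw integer predictions y_kmeans inside k_means_think_binary, before any relabeling.

#illustrative sketch: derive labels from the cluster centers so they never swap
#assumes kmeans = KMeans(n_clusters=2).fit(super_peeps) and y_kmeans = kmeans.predict(super_peeps)
centers = kmeans.cluster_centers_        #column 0 = login_count, column 1 = ib_call_count
online_cluster = centers[:, 0].argmax()  #the cluster whose center has the most logins
y_kmeans = ['Always Online, Never Calls' if cluster_id == online_cluster
            else 'Loves To Call, But Not Online' for cluster_id in y_kmeans]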

Step 2 - Further Exploration

  • This takes the same filtering approach as the function above and, one by one, breaks the data down across different categorical variables in the user base.
  • This is a foundation for further analysis.

🔬 Introducing My Other Function: The Analyzer 🔬

  • Special thanks to Jake VanderPlas for inspiring me and showing the foundational code.
In [20]:
def the_analyzer(num, groupby):
    super_peeps =  member_engagement[
    (np.abs(member_engagement.login_count-\
                    member_engagement.login_count.mean())\
       >= (num*member_engagement.login_count.std())) 
                                 |\
      (np.abs(member_engagement.ib_call_count-\
                    member_engagement.ib_call_count.mean())\
       >= (num*member_engagement.ib_call_count.std()))                           
                                ]
    sns.set_palette("Dark2")
    sns.lmplot( x="login_count", y="ib_call_count", data=super_peeps, fit_reg=False, hue=groupby, \
               legend=True)
    user_count = super_peeps.member_id.count()
    plt.ylabel('Number of Inbound Calls Per Month')
    plt.xlabel('Number of Customer Logins Per Month')
    plt.suptitle('            For Users With Inbound Calls or Logins Above {}σ'.format(num),\
              fontweight = 'bold', y=1.05)
    plt.title('     {:,} customers'.format(user_count))
    plt.savefig('analyzer_{}-std_{}-groupby.png'.format(num, groupby),dpi=400, bbox_inches='tight')
    plt.ylim(top=30, bottom=0) 
    plt.xlim(right=80, left=0)
    plt.show()
In [21]:
for i in [0,0.5,0.75,1,2,3,10,15, 20]:
    for q in ['enrollment_type', 'age_group', 'region']:
        the_analyzer(i, q)

Best,

George John Jordan Thomas Aquinas Hayward, Optimist

george-hayward-data-scientist