Executive Summary

This analysis seeks two objectives:

  • (1) recommend the best button for a webpage by analyzing A/B test results and making causal inferences.
  • (2) create engagement segmentations for the user base via clustering.

First, this analysis recommends that the light button be selected for the complete website rollout.

ex-C-converted-sends

conversion-funnel-final

  • For discoverability, this analysis used overall button clicks as the metric.
    • Here, the dark button narrowly edged out the light button (23% to 21%).
    • Although the dark button received more clicks—and the difference was statistically significant—it is unlikely to be practically significant.

ex-A-clicks

  • For message usage, this analysis used conversion rate after clicking a button, given that clicking a button was the user's first action post enrollment in the experiment.
    • Here, the light button dominated. It has a 45% message use conversion rate, while the dark button had a 22% message use conversion rate.
      • This difference was both statistically and practically significant.
  • Notably, this A/B test is imperfectly designed because we are testing both color and the presence of text ("Get help" vs. no text) at once. These two features may confound each other.
  • This data scientist recommends running another test of a dark button with the text "Get help."
  • Thought:
    • This data scientist contends that the dark button got more clicks because
      • (1) aesthetically, it contrasted with the white background and
      • (2) it attracted curiosity due to its lack of any text-based signage or explanation.
    • So the dark button got more clicks but also more bounces.
    • The light button did not stand out as much, but featured the phrase "Get help."
      • As such, its clicks were more intentional, and its conversion rate was thus higher.

Second, this analysis recommends segmenting users into the following three categories:

  • (1) Inactive, Passive Users (not depicted below)
  • (2) Active Users Who Call Frequently, But Don't Use the Online Platform
  • (3) Active Users Who Are Often Online, But Rarely Call In

2sigma-function

3sigma-function

These segmentations were supported by K-Means clustering.

  • Credit is given to Jake Vanderplas for his remarkable data science explanations and code. In particular, I found his take on the K-Means clustering algorithm very helpful.

This analysis makes these segmentations with a business situation and use-case in mind:

  • Situation:
    • Customers who mostly call and who may not be aware of online resources.
      • If they are made aware of some of these, they may have a better experience and the company will save money.
    • On the other hand, customers who only use online resources may underestimate the efficacy of talking on the phone with a health insurance representative. If they are made aware of all that can be accomplished on the phone, they may have a better experience.
  • Use-case:
    • Ideally, these segmentations would be the analytical foundation for directed-information or advertising campaigns geared towards increasing customer retention and satisfaction.

Credit to God, my Mother, family and friends.

All errors are my own.

Best,
George John Jordan Thomas Aquinas Hayward, Optimist

Selected Data Visualizations

A/B Testing

The Dark Button Got Slightly More Clicks

ex-A-clicks

The Light Button Got More Absolute Message Sends

ex-B-send-it

The Light Button Dominated The Dark Button In Terms of Message Send Conversion Rate

ex-C-converted-sends

conversion-funnel-final

Clustering

Watch The K-Means Clustering Classification Progress As We Select For More and More Active Users.
Users seem to really have an overall 'either/or' preference between calling and going online.

0sigma-function

1sigma-function

2sigma-function

3sigma-function

20sigma-function

Further Exploration

Could New Yorkers use the service more than anyone else?

new-yorkers-use-more-questionmark

Do people on exchanges use the service more than people who are not on exchanges?

on-exchange-equal-more-use-questionmark

Do seniors use the service less frequently than other age groups?

seniors-using-the-product-less-questionmark

And It Begins

In [3]:
#for part 1
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
import mysql.connector as mysql
import missingno as msno
import statsmodels
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.proportion import proportions_ztest
from scipy import stats
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
#added in for part 2
import seaborn as sns
from sklearn import preprocessing
from sklearn.cluster import KMeans

Prompt 1: Messaging Button Experiment


  • Background: Oscar has run a two-armed experiment to test a new design for the messaging button that appears on the homepage of the member website. This homepage button is one of several possible entry points to the messaging feature. Our messaging feature allows members to message a dedicated care team with any benefits or plan related questions.

  • The goals of this experiment were to improve discoverability and increase usage of the messaging feature. The control design (dark) was shown to 50% of user traffic for the duration of the experiment period; the other 50% of traffic was shown the experimental design (light).

1buttons

its-a-web-page

2ab-experiment-tables

Step 0 - Data Game Plan & A/B Testing Strategy

  • We read in the data.
  • We then get the data into a usable database for us to query, and then pull back.
  • There are some parts here that will be easier in Pandas, and there are some parts which are easier in SQL. So we'll use both.
  • First, we want to read in the data and check for nulls.
  • Given the way I want to evaluate the A/B test, I need to use Pandas to set up a table that I want to query.
  • Here's the game plan for generating the table we want:
    • We start with the experiment_subjects table.
      • From working with this table, I happen to know that it has a single null value for a user_id, so we want to eliminate that.
      • We thus drop duplicates in Pandas on user_id, and we eliminate rows that have a null user_id. That's just a glitch in the data.
    • This experiment_subjects table is the base on which we build our A/B testing analysis table.
    • Next we need to bring in the experiment_actions table.
      • This table has a user_id and then an action they took, and a timestamp for that action. Actions can include Clicking the Button, Viewing the Inbox, Viewing a Message Thread, and then Sending a Message.
      • Critically, although these could constitute an e-commerce-like funnel, they are not always in order. This is because the Button is only one of several ways to reach the messaging screen.
      • For simplicity in the A/B testing stage, we want to convert our data into Bernoulli Trials. This means that each data point will have a discrete outcome of success, 1, or failure, 0.
      • Data scientists love this because it lowers the variance in our populations: each outcome is one of 2 discrete values, instead of the infinity of values a continuous variable can take.
      • This will let us use a two-sample proportion z-test, and we'll also follow that up with a two-sample t-test, although our sample size should be large enough for the z-test alone.
      • This will also require that the users are truly randomized and independent, which, I think, is a reasonable assumption here.
    • So this brings us back to building the data table.
    • Since we are using a Bernoulli Trial method, we're not concerned that a given user clicked on a button a thousand times; we're concerned about how many users clicked on a button. Thus we are going to be looking at distinct user counts.
    • This was not too complicated to do in the experiment_subjects table, since that table is just a user, their treatment/control status, and their experiment enrollment timestamp.
    • This is a bit harder to do in the experiment_actions table.
    • To solve this problem, we want to group this table by both the user_id and the action. We don't care how many times a user did a particular action, but we do care a lot about the different, discrete actions that a given user took. This grouping will take care of that.
      • But what do we do if a user does a given action more than once?
      • We can simply keep the earliest timestamp per user_id-action pair.
      • We can sort the table by timestamp, then do the group by user_id-action in Pandas, then set a parameter in Pandas to select the first of the duplicate user_id-actions pairs.
    • The result will be a shorter (in # of rows) experiment_actions table with one row per distinct user_id-action pair; when a user did a given action more than once, it keeps the earliest occurrence.
    • The next point is critical:
      • We will only include users who have actions that start AFTER the test enrollment timestamp.
      • This is a decision I have made because users who were already doing these actions BEFORE our experiment began are, in my view, tainted: they bring in whatever their old habits and memories are. In fact, the novelty of the new Button alone can be enough for people to click on it. What we really want to know is how users reacted who are unbiased by their past experiences on the site.
        • Now you could reasonably argue that you want to emphasize repeat-users, and that would be OK, but, for the potential biases stated above, I contend this is the best way for us to test, given that the goal is to bring on many, many more new users, who have not used our service before.

Step 1 - Building the Final Tables to Support A/B Testing Strategy

  • The next thing we have to do is set up the A/B testing tables, where we will actually see the proportions.
  • For checking the discoverability, we care most about button clicks.
    • Thus, we care about the proportion of users in the light button cohort that clicked the light button VS. the proportion of users in the dark button cohort that clicked the dark button.
  • For checking usage of the messaging feature, we care about conversion from the button to sending a message.
    • Thus, here, we care about what proportion of the users in the light button cohort who clicked the light button (where clicking the button was their first action) went on to actually send a message VS. what proportion of the users in the dark button cohort who clicked the dark button (where clicking the button was their first action) went on to actually send a message.
    • For this table we really have to think about the timing of the actions. To think about conversion, we need the button to be the first thing they clicked on. If they somehow sent a message first and only then clicked the button, I contend it's unclear whether the button had anything to do with it.
      • The most apples-to-apples comparison will be when the user FIRST clicked a button, then went on to send a message.
      • I will, accordingly, for this table, only count messages sent as a 'success' when clicking the button was the first thing they did.
  • Once all this is done, we round out our analysis with a two-sample proportion z-test, and then a two-sample t-test.
  • The plan is to visualize using pie charts.

👇🏾 We begin by preparing the experiment_subjects table. 👇🏾

In [2]:
experiment_subjects = pd.read_csv('experiment_subjects.csv')
In [3]:
#let's check it out
experiment_subjects.describe()
Out[3]:
user_id audience_name enrolled_at
count 34951 34953 34953
unique 34935 2 34444
top bff9cf7a-fd4b-4ddd-9d65-4c2bf91878c1 light 2017-12-05 01:26:46
freq 2 17539 3

👆🏾You can see that there are some duplicates in the user_id. We need to remove that.👆🏾

In [4]:
experiment_subjects = experiment_subjects.drop_duplicates(['user_id'])
experiment_subjects.describe()
Out[4]:
user_id audience_name enrolled_at
count 34935 34936 34936
unique 34935 2 34427
top 7c8690d4-62f8-41db-a0ff-fb8667624569 light 2017-12-01 19:07:19
freq 1 17522 3

💭👆🏾Something's still off. There are now 34,935 user_ids, but 34,936 rows. 👆🏾💭

  • This means that we've got a NULL user_id somewhere.
  • Sidenote:
    • This shows the answer to an age-old Python vs. SQL question, namely: counting nulls. Pandas does not count nulls; its count function counts only the non-null values in a column.
In [5]:
experiment_subjects = experiment_subjects.dropna(subset=['user_id'])
experiment_subjects.describe()
Out[5]:
user_id audience_name enrolled_at
count 34935 34935 34935
unique 34935 2 34426
top 7c8690d4-62f8-41db-a0ff-fb8667624569 light 2017-12-01 19:07:19
freq 1 17522 3

👆🏾Now we're talking! The rows line up. 👆🏾😊✅⛪

👇🏾 Let's make sure those dates are in a datetime format. 👇🏾

In [6]:
experiment_subjects.enrolled_at.dtype
Out[6]:
dtype('O')
In [8]:
#that means it's an object; we better turn this into a datetime format:
experiment_subjects['enrolled_at'] = pd.to_datetime(experiment_subjects['enrolled_at'],\
                                                format='%Y-%m-%d %H:%M:%S')
experiment_subjects.enrolled_at.dtype
Out[8]:
dtype('<M8[ns]')
In [9]:
#let's confirm that's a datetime
np.dtype('datetime64[ns]') == np.dtype('<M8[ns]')
Out[9]:
True

👆🏾Great, now we're in datetime format. We're in good shape, and we can turn to the next table before we join.👆🏾

In [10]:
experiment_subjects.head()
Out[10]:
user_id audience_name enrolled_at
0 624cf3cb-d23c-4f5d-bbb7-034d9e101576 dark 2017-11-12 16:29:56
1 703fda1a-d406-42d5-98c1-ddc584da80db dark 2017-11-14 03:49:57
2 02a8cc81-7643-49f1-8c34-58c84155c4ab dark 2017-11-29 17:02:30
3 8a7b5f34-5283-4f91-ab27-d630fac065b1 dark 2017-11-25 16:17:07
4 a8dc881e-489e-4414-a281-ea3dc1164628 dark 2017-11-24 19:00:19

👇🏾 We now move to preparing the experiment_actions table. 👇🏾

In [23]:
experiment_actions = pd.read_csv('experiment_actions.csv')
In [24]:
#let's check it out
experiment_actions.describe()
Out[24]:
user_id action new_thread timestamp
count 78985 78985 31808 78985
unique 11718 4 2 61138
top bfd8c778-01d7-428d-aa0d-04f510ac2374 Viewed Messaging Inbox False 2017-12-03 19:27:29
freq 223 30071 21272 7

💭 As said in the game plan, we're not going to be too worried about the new_thread column.

  • For this analysis, we're going to focus on whether a button was clicked and whether a message was sent.
  • The new thread vs old thread dichotomy will be more relevant to mid-funnel analyses, but that is outside the scope of this notebook.
In [25]:
#similar to the above situation, let's get timestamp into datetime format
experiment_actions['timestamp'] = pd.to_datetime(experiment_actions['timestamp'],\
                                                format='%Y-%m-%d %H:%M:%S')

👇🏾 We now want to group the actions column to give us 1 row per user_id-action. 👇🏾

  • We could have also used Pandas's groupby function.
  • We critically want to sort this table first by ascending timestamp order.
    • This lets us make sure we get the user's earliest action, in the event they take the same action more than once.
    • As discussed in the game plan at the outset, this will be very helpful in how we understand conversion later.
In [26]:
experiment_actions = experiment_actions.sort_values('timestamp').drop_duplicates(subset=['user_id', 'action'],\
                                                                                  keep='first')
In [27]:
experiment_actions.describe()
Out[27]:
user_id action new_thread timestamp
count 30266 30266 11339 30266
unique 11718 4 2 24034
top 4efb6683-a836-4773-95c0-66813ea67a69 Viewed Messaging Inbox True 2017-11-28 16:41:54
freq 4 10908 7108 4
first NaN NaN NaN 2017-06-28 15:18:17
last NaN NaN NaN 2017-12-06 23:59:38

👇🏾 We're now ready to join the tables together and then send them over to SQL. 👇🏾

  • SQL is my love language.
  • We will use a left join.
  • We will create our a/b test results dataframes from this table.
In [41]:
ab_test = experiment_subjects.merge(experiment_actions, left_on='user_id', right_on='user_id',\
                                   how='left')

👇🏾 As discussed in the game plan, we eliminate rows where the action happened before the experiment. 👇🏾

  • But we also want to keep in the original experiment subjects who 'bounced', never did any action, and thus have a Null value for the timestamp column.
  • So this next line of code says, basically, to eliminate the rows where a user's action happened BEFORE they were enrolled in the experiment.
In [53]:
ab_test = ab_test[(ab_test.enrolled_at <= ab_test.timestamp) | (pd.isnull(ab_test.timestamp) == True)] 
In [54]:
ab_test.describe()
Out[54]:
user_id audience_name enrolled_at action new_thread timestamp
count 47475 47475 47475 24258 8440 24258
unique 33081 2 32634 4 2 18587
top ade9f184-1c1f-4151-b2e7-6364222f4c23 light 2017-11-12 17:13:22 Viewed Messaging Inbox True 2017-12-03 15:20:28
freq 4 23953 8 8570 5668 4
first NaN NaN 2017-11-12 00:00:14 NaN NaN 2017-11-12 00:05:38
last NaN NaN 2017-12-05 23:59:32 NaN NaN 2017-12-06 23:59:38
In [55]:
engine = create_engine('mysql+mysqlconnector://newuser:data@localhost:3306/sys', echo=False)
ab_test.to_sql(name='ab_test_os', con=engine, if_exists = 'replace', index=False)
In [56]:
#connect to the MySQL database
db = mysql.connect(
    host = "localhost",
    user = "newuser",
    passwd = "data",
    auth_plugin='mysql_native_password',
    database = 'sys')

💭👇🏾Thinking about our discoverability table.👇🏾💭

  • We need a table that, in the first column, lists each individual user in the experiment.
  • Next, in the second column, it will list which audience (light button or dark button) the user is in.
  • Then, in the third column, it will have a 1 if they clicked and a 0 if they did not click.
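For reference, here is a rough, illustrative Pandas-only sketch of the same table (the names clickers_pd and discoverability_pd are mine, not part of the analysis); the actual table is built with the SQL in the next cell, against the ab_test data we pushed to MySQL above.

#illustrative sketch only: a Pandas-only version of the discoverability table
#assumes the ab_test dataframe merged above (user_id, audience_name, action, ...)
clickers_pd = (ab_test.loc[ab_test.action == 'Clicked Messaging Button', ['user_id']]
               .drop_duplicates()
               .assign(clicked_yn=1))
discoverability_pd = (ab_test[['user_id', 'audience_name']]
                      .drop_duplicates(subset=['user_id'])
                      .rename(columns={'audience_name': 'button'})
                      .merge(clickers_pd, on='user_id', how='left')
                      .fillna({'clicked_yn': 0}))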
In [60]:
discoverability = pd.read_sql("""
with experiment_subjects_distinct as(
select distinct
user_id,
audience_name
from sys.ab_test_os
), 

clickers as (
select
user_id,
audience_name,
case when action = 'Clicked Messaging Button' then 1 end as clicked_yn
from sys.ab_test_os
where action = 'Clicked Messaging Button'
)

select
esd.user_id,
esd.audience_name as button,
coalesce(c.clicked_yn, 0) as clicked_yn
from experiment_subjects_distinct esd
left join clickers c on c.user_id = esd.user_id;

""", con=db)
In [67]:
discoverability.sample(5)
Out[67]:
user_id button clicked_yn
26056 228c1835-fba2-4347-9e79-00c0605d5ead light 1
8252 6d0a5838-7549-4dbb-b7de-8656aab061c4 dark 0
19190 cf699e8f-8c74-4b35-a0bb-2e4ab8c35f40 light 0
17946 2c1acbbb-6d0f-4855-87d0-9a776cb10ba4 light 0
2197 0b822fa3-b0ec-49bd-8d84-b5b12e3411d8 dark 0
In [68]:
discoverability.describe()
Out[68]:
clicked_yn
count 33081.000000
mean 0.219099
std 0.413642
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000

💭👇🏾Thinking about our message_usage table.👇🏾💭

  • We need a table that, in the first column, lists each individual user in the experiment.
  • Next, in the second column, it will list which audience (light button or dark button) the user is in.
  • Then, in the third column, it will have a 1 if they sent a message after clicking the button, where clicking the button was their first action.
    • It will have a 0 otherwise.
In [69]:
message_usage = pd.read_sql("""
with experiment_subjects_distinct as(
select distinct
user_id,
audience_name
from sys.ab_test_os), 

ab_test_with_action_order as (
select 
*,
dense_rank() over(partition by user_id order by timestamp asc) as user_event_order
from sys.ab_test_os
),

first_clickers as (
select distinct
user_id
from ab_test_with_action_order
where action = 'Clicked Messaging Button' and user_event_order = 1
),

sent_message_after_clicking_button_first as (
select
ab.user_id,
case when action = 'Sent Message' then 1 end as sent_message_yn
from ab_test_with_action_order ab
join first_clickers fc on fc.user_id = ab.user_id
where action = 'Sent Message' and user_event_order > 1
)

select
esd.user_id,
esd.audience_name as button,
coalesce(sm.sent_message_yn, 0) as sent_message_yn
from experiment_subjects_distinct esd
left join sent_message_after_clicking_button_first sm on sm.user_id = esd.user_id;

""", con=db)
In [75]:
message_usage.sample(5)
Out[75]:
user_id button sent_message_yn
15709 2607040c-abd0-4cc2-8062-c0647b22ede8 dark 0
15578 573f8132-7ec8-48e0-a65d-2a016e1e4384 dark 0
29233 7ad9a77f-1e6a-4f28-b76c-b183f644ccec light 0
5351 791be0c4-72ba-4c10-8025-2ed35974d692 dark 1
31568 a28946e5-7651-4e28-abdc-c2a027ca5c63 light 0
In [71]:
message_usage.describe()
Out[71]:
sent_message_yn
count 33081.000000
mean 0.061455
std 0.240167
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000

💭👇🏾For our conversion metric, we'll want to have a first_clickers table.👇🏾💭

  • So we now know (from right above) how many people went on to send a message after their first action post enrollment was clicking the button.
  • But for how many people in each audience (light button vs. dark button) was this the case?
  • We'll want to have a tally of first clickers, and that's what this table will do.
  • For this table:
    • The first column will be the test group the user is in.
    • The second column will just be the number of users, from that group, who were a first click.
    • This should be a lot like the first table, except it can't just be any click of a button; it has to be a case where the click of the button was the first thing they did post experiment enrollment.
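For reference, here is a rough, illustrative Pandas analogue of the dense_rank() window-function trick the SQL below uses (the names actions_ranked and first_click_ids are mine); it flags the users whose earliest post-enrollment action was clicking the button.

#illustrative sketch only: a Pandas analogue of the SQL dense_rank() approach below
#assumes the ab_test dataframe built above; bounced users (null timestamps) drop out here
actions_ranked = ab_test.dropna(subset=['timestamp']).copy()
actions_ranked['user_event_order'] = (actions_ranked.groupby('user_id')['timestamp']
                                      .rank(method='dense'))
first_click_ids = actions_ranked.loc[
    (actions_ranked.action == 'Clicked Messaging Button') &
    (actions_ranked.user_event_order == 1), 'user_id'].unique()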
In [76]:
first_clickers = pd.read_sql("""
with experiment_subjects_distinct as (
select distinct
user_id,
audience_name
from sys.ab_test_os
), 

ab_test_with_action_order as (
select 
*,
dense_rank() over(partition by user_id order by timestamp asc) as user_event_order
from sys.ab_test_os
),

first_clickers as (
select 
user_id,
case when action = 'Clicked Messaging Button' then 1 end as first_action_was_click_yn
from ab_test_with_action_order
where action = 'Clicked Messaging Button' and user_event_order = 1
)

select
esd.user_id,
esd.audience_name as button,
coalesce(fc.first_action_was_click_yn, 0) as first_action_was_click_yn
from experiment_subjects_distinct esd
left join first_clickers fc on fc.user_id = esd.user_id;
""", con=db)
In [77]:
first_clickers.sample(5)
Out[77]:
user_id button first_action_was_click_yn
13685 bd5fec4a-c2aa-4521-9e74-581126ed60f8 dark 1
24207 fcd2e5f8-8713-41cc-bd5a-559c36bfd2ff light 0
4335 ac15997a-9156-4319-b001-0078cab2ff97 dark 0
13248 19f30902-ca80-4884-a156-6de44abf2dd8 dark 0
21628 9137187f-9f8a-466b-a7f4-2ba3faf4f6f9 light 0
In [78]:
first_clickers.describe()
Out[78]:
first_action_was_click_yn
count 33081.000000
mean 0.185938
std 0.389062
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000

👆🏾One cool thing to notice: although we're at ~ 21% overall for clicks, we're at ~19% overall for first-clicks.👆🏾

  • This is just a kind of temporal quirk you often see in A/B tests.
  • There are some users who clicked who, somehow, someway, went to the message thread first and then clicked the button second.
    • One instance could be that a person had already sent a message, then clicked back through the button to send another.

Fantastic!

  • We now have three tables that will help us unlock the secrets of this A/B test.
    • ✅ discoverability
    • ✅ message_usage
    • ✅ first_clickers

Step 2 - Enter Statistics: Hypothesis Testing

  • In a nutshell, we've set up these tables as Bernoulli trials so that they work smoothly with two-sample proportion z-tests and t-tests.
  • We will see different proportions of success and failure depending on the light button vs dark button audience groups.
  • We're just trying to get a sense of whether the difference we see is likely due to chance, or whether there really is a difference between the groups caused by something other than chance.
  • In a good experiment, the only difference between the groups (which we assume to be truly randomly assigned users, who have no impact on each other, i.e., independent) would be the buttons themselves, and so we can make a causal inference that the button is the change catalyst.
  • In statistics terms, we start out with a null hypothesis.
    • I'm a lawyer, and I can proudly say that we do the same in the law.
    • We begin by assuming a party is innocent, just like we begin by assuming that both the test groups are equal. In other words, we assume the button has no effect.
    • But as we look at the data, we'd expect a certain amount of fluctuation. Just like in a criminal case, we'd expect a certain amount of suspicion to be based on behavior that just happened to put the suspect near the crime.
    • But as the data stacks more and more in one direction, we become more and more likely to reject the null hypothesis and conclude that something is really going on. In the law, we'd conclude that the person is not innocent.
  • The p-value is the probability that we'd see a given result (or a more extreme one) purely due to chance, assuming the null hypothesis is true.
    • The lower that value is, the more likely we are seeing something other than chance at work.
    • For a lot of experiments we reject the null if the p value is under 5%, so that's what we call 'statistical significance.'
    • Notably, just because a difference is statistically significant does not mean it is practically significant. For instance, we might detect an effect caused by a feature of an experiment that changes the outcome by 1%. We might be able to conclude, with a large enough sample size, that that 1% was not a random change. But does the business care about a 1% difference? Maybe they do, but maybe they don't.
  • By the way, while I'm at it, a confidence interval is a range that we are x% sure (often 95%) holds the true value of the population's proportion, since we're just looking at a sample.
    • That's how you get the so-called 'margin of error' in polls.
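To make the mechanics concrete, here is a minimal, self-contained sketch of the pooled two-sample proportion z-test described above, using purely illustrative made-up counts (not this experiment's numbers, which come later in the notebook):

#illustrative sketch: the pooled two-proportion z-test computed by hand on made-up counts
import numpy as np
from scipy import stats

successes = np.array([120, 150])   #hypothetical clickers in arm A and arm B
trials = np.array([1000, 1000])    #hypothetical users in each arm

p_pooled = successes.sum() / trials.sum()
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / trials[0] + 1 / trials[1]))
z = (successes[0] / trials[0] - successes[1] / trials[1]) / se
p_value = 2 * stats.norm.sf(abs(z))  #two-sided p-value
print('Z-Score = %.3f, p-Value = %.4f' % (z, p_value))
#this is the same pooled statistic that statsmodels' proportions_ztest reports below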
In [79]:
#helpful to peek at the data again.
discoverability.sample(3)
Out[79]:
user_id button clicked_yn
3232 4e05eefd-71a2-4509-b419-a04d7e57ad09 dark 0
27885 0f6861fe-25d0-481e-ad75-4b4f8ffa3c5a light 0
16861 f902eb57-8cad-478f-ab28-abd4b6743a61 dark 0

👇🏾 To start out, just for fun, let's get a 95% confidence interval of what the true population proportion is. 👇🏾

In [81]:
count_clicks = discoverability.clicked_yn.sum()
users_clicks = discoverability.clicked_yn.count()
In [107]:
#this is a normal-approximation confidence interval; alpha=0.05 gives a 95% interval
print('95-Percent Confidence Interval: \n Lower Bound = %.4f, Upper Bound = %.4f' % \
      (statsmodels.stats.proportion.proportion_confint(\
            count_clicks, users_clicks, alpha=0.05, method='normal'))) 
95-Percent Confidence Interval: 
 Lower Bound = 0.2146, Upper Bound = 0.2236

👆🏾Pretty cool! 👆🏾

  • This tells us that we are 95% sure the true proportion of overall clicks, un-segmented, is between 21.46% and 22.36%.
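As a quick sanity check, the same interval can be computed by hand with the normal approximation, reusing the count_clicks and users_clicks variables defined a couple of cells up:

#sanity-check sketch: the normal-approximation confidence interval by hand
p_hat = count_clicks / users_clicks
se = np.sqrt(p_hat * (1 - p_hat) / users_clicks)
print('Manual 95-Percent CI: Lower Bound = %.4f, Upper Bound = %.4f' %
      (p_hat - 1.96 * se, p_hat + 1.96 * se))
#should agree with the 0.2146 / 0.2236 bounds printed above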

👇🏾Now that we're warmed up, let's get to the actual A/B test results for discoverability.👇🏾

In [88]:
#we'll store these as variables for easier use later
clicked_light_button = discoverability[discoverability["button"]=='light']['clicked_yn']
clicked_dark_button = discoverability[discoverability["button"]=='dark']['clicked_yn']
In [111]:
#now for use in the proportions test, we need to make a mini dataframe
#counts is the number of successes
#in a Bernoulli trial setup where it's just 1s and 0s, we can just use sum() to get this
#nobs (renamed 'users' here) is the total number of trials; len() can work, though I chose to use count()
click_button_ab_test = pd.DataFrame({
    "count": [clicked_light_button.sum(), clicked_dark_button.sum()],
     "users": [clicked_light_button.count(), clicked_dark_button.count()]
    }, index=['light_button_clicks', 'dark_button_clicks'])
In [97]:
#now we use this to feed into the stats test
#note: click_button_ab_test.count returns the DataFrame's count method, not the column...
#so you have to say click_button_ab_test['count']
print('Z-Score = %.3f, p-Value = %.10f' % statsmodels.stats.proportion.proportions_ztest(\
                                               click_button_ab_test['count'], click_button_ab_test['users']))
#z score, p-value
Z-Score = -5.277, p-Value = 0.0000001313

👆🏾Pretty cool! 👆🏾

  • This very low p-value encourages us to go ahead, reject the null hypothesis, look at the difference in proportions across the treatment and control groups, and conclude that that difference is really caused by the buttons.

👇🏾But before we begin, let's also quickly run a t-test on it.👇🏾

In [114]:
print('T-Score = %.3f, p-Value = %.10f' % stats.ttest_ind(clicked_light_button, clicked_dark_button))
T-Score = -5.279, p-Value = 0.0000001306

👆🏾Pretty cool! 👆🏾

  • For these large sample sizes, and with this many degrees of freedom, I'd expect the t-distribution to be pretty close to the z-distribution, so this isn't too surprising.
  • We can definitely say, as I like to say in my own parlance, "There's a there, there." lol

💭👇🏾So let's take a look at the actual difference finally!!!👇🏾💭

In [119]:
click_button_ab_test['proportion'] = round(click_button_ab_test['count'] / click_button_ab_test['users'],2)
click_button_ab_test
Out[119]:
count users proportion
light_button_clicks 3437 16593 0.21
dark_button_clicks 3811 16488 0.23

👆🏾So, we've got our first conclusion!! 👆🏾😊✅⛪

  • The dark buttons are more 'discoverable.'
  • This 2-percentage-point difference is most likely not due to chance.
  • In fact, there were fewer users exposed to the dark buttons, and the dark buttons still got more clicks!

🔬 A/B Test Metric 2 - Message Usage: Conversion Proportion For Either Button 🔬

  • Let's check out message usage.
  • Now we want to see how many people actually sent a message after they clicked on the button, given that their first action post experiment enrollment was to click on the button.
  • We will first look at the raw numbers of actual sends that either type of button has.
  • Then we'll look at the button conversion percentage.
In [100]:
#helpful to peek at the data again.
message_usage.sample(3)
Out[100]:
user_id button sent_message_yn
20182 abfbdad5-3078-411a-bd2b-0a5d67044b75 light 0
15852 49344f91-ac02-4d51-84e4-e5dbfd535800 dark 1
19847 4035cf20-c26e-40a6-b163-03b0f9a7ff2d light 0

👇🏾 To start out, just for fun, let's get a 95% confidence interval of what the true population proportion is. 👇🏾

  • This is for the raw numbers of users who went on to actually send. This isn't the last-stage-of-the-funnel conversion proportion yet.
In [101]:
count_sends = message_usage.sent_message_yn.sum()
users_sends = message_usage.sent_message_yn.count()
In [106]:
#this is a normal-approximation confidence interval; alpha=0.05 gives a 95% interval
print('95-Percent Confidence Interval: \n Lower Bound = %.4f, Upper Bound = %.4f' % \
      (statsmodels.stats.proportion.proportion_confint(\
            count_sends, users_sends, alpha=0.05, method='normal'))) 
95-Percent Confidence Interval: 
 Lower Bound = 0.0589, Upper Bound = 0.0640

👆🏾Pretty cool! 👆🏾

  • This tells us that we are 95% sure the true proportion of overall actual-sends, un-segmented, is between 5.89% and 6.40%.
  • We can definitely see that a message sent is rarer than a button click.
    • This makes sense.

👇🏾Now that we're warmed up, let's get to the actual A/B test results for message usage.👇🏾

In [109]:
#we'll store these as variables for easier use later
sent_light_button = message_usage[message_usage["button"]=='light']['sent_message_yn']
sent_dark_button = message_usage[message_usage["button"]=='dark']['sent_message_yn']
In [123]:
#now for use in the proportions test, we need to make a mini dataframe
#counts is the number of successes
#in a Bernoulli trial setup where it's just 1s and 0s, we can just use sum() to get this
#nobs (renamed 'users' here) is the total number of trials; len() can work, though I chose to use count()
send_button_ab_test = pd.DataFrame({
    "count": [sent_light_button.sum(), sent_dark_button.sum()],
     "users": [sent_light_button.count(), sent_dark_button.count()]
    }, index=['light_button_sends', 'dark_button_sends'])
In [113]:
#now we use this to feed into the stats test
#note: send_button_ab_test.count returns the DataFrame's count method, not the column...
#so you have to say send_button_ab_test['count']
print('Z-Score = %.3f, p-Value = %.10f' % statsmodels.stats.proportion.proportions_ztest(\
                                               send_button_ab_test['count'], send_button_ab_test['users']))
#z score, p-value
Z-Score = 13.794, p-Value = 0.0000000000

👆🏾Pretty cool! 👆🏾

  • This very low p-value encourages us to go ahead, reject the null hypothesis, look at the difference in proportions across the treatment and control groups, and conclude that that difference is really caused by the buttons.

👇🏾But before we begin, let's also quickly run a t-test on it.👇🏾

In [115]:
print('T-Score = %.3f, p-Value = %.10f' % stats.ttest_ind(sent_light_button, sent_dark_button))
T-Score = 13.834, p-Value = 0.0000000000

👆🏾Pretty cool! 👆🏾

  • For these large sample sizes, and with this many degrees of freedom, I'd expect the t-distribution to be pretty close to the z-distribution, so this isn't too surprising.
  • We can definitely say, as I like to say in my own parlance, "There's a there, there." lol

💭👇🏾So let's take a look at the actual difference finally!!!👇🏾💭

In [242]:
send_button_ab_test['proportion'] = round(send_button_ab_test['count'] / send_button_ab_test['users'],2)
send_button_ab_test
Out[242]:
count users proportion
light_button_sends 1321 16593 0.08
dark_button_sends 712 16488 0.04

👆🏾So, we've got our second conclusion!! 👆🏾😊✅⛪

  • The light buttons lead to more messages sent.
  • This 4-percentage-point difference is most likely not due to chance.
  • We will still want to look at the conversion numbers.

👇🏾Let's look a bit at the first-clickers we've spoken about.👇🏾

In [120]:
#take a peek
first_clickers.sample(3)
Out[120]:
user_id button first_action_was_click_yn
13024 43d65887-f1d3-4d03-b35a-a78a6b0f4562 dark 0
18547 996e38c2-c4c5-4953-844a-0aa04c66534f light 0
19603 3c659ecb-5ec0-45f0-8b00-0869d0258436 light 0

👇🏾We want to build a dataframe that lets us see the conversion funnel after a first click.👇🏾

In [141]:
#we'll store these as variables for easier use later
first_clicked_light_button = first_clickers[first_clickers["button"]=='light']['first_action_was_click_yn']
first_clicked_dark_button = first_clickers[first_clickers["button"]=='dark']['first_action_was_click_yn']
In [142]:
#now for use in the proportions test, we need to make a mini dataframe
#counts is the number of successes
#in a Bernoulli trial setup where it's just 1s and 0s, we can just use sum() to get this
#nobs (renamed 'users' here) is the total number of trials; len() can work, though I chose to use count()
first_clicks_button_ab_test = pd.DataFrame({
    "count": [first_clicked_light_button.sum(), first_clicked_dark_button.sum()],
     "users": [first_clicked_light_button.count(), first_clicked_dark_button.count()]
    }, index=['light_button_first_clicks', 'dark_button_first_clicks'])
In [143]:
first_clicks_button_ab_test
Out[143]:
count users
light_button_first_clicks 2957 16593
dark_button_first_clicks 3194 16488
In [144]:
send_button_ab_test
Out[144]:
count users
light_button_sends 1321 16593
dark_button_sends 712 16488

👇🏾Now we can combine the above two dataframes to make the funnel dateframe.👇🏾

In [145]:
click_to_send_conversions_funnel = pd.DataFrame({
    "users": [send_button_ab_test.users['light_button_sends'], send_button_ab_test.users['dark_button_sends']],
     "first-clicking users": [first_clicks_button_ab_test['count']['light_button_first_clicks'],\
                        first_clicks_button_ab_test['count']['dark_button_first_clicks']],
     "sending users": [send_button_ab_test['count']['light_button_sends'],\
                       send_button_ab_test['count']['dark_button_sends']]
    }, index=['light_button', 'dark_button'])
In [146]:
click_to_send_conversions_funnel['conversion'] \
   = round(click_to_send_conversions_funnel['sending users'] / \
        click_to_send_conversions_funnel['first-clicking users'],2)
In [147]:
click_to_send_conversions_funnel
Out[147]:
users first-clicking users sending users conversion
light_button 16593 2957 1321 0.45
dark_button 16488 3194 712 0.22
In [148]:
#now we use this to feed into the stats test
#note: attribute access like click_to_send_conversions_funnel.count returns a DataFrame method, not a column...
#so you have to use bracket indexing, e.g. click_to_send_conversions_funnel['sending users']
print('Z-Score = %.3f, p-Value = %.10f' % statsmodels.stats.proportion.proportions_ztest(\
         click_to_send_conversions_funnel['sending users'], click_to_send_conversions_funnel['first-clicking users']))
#z score, p-value
Z-Score = 18.644, p-Value = 0.0000000000

👆🏾So, we've got our third and final conclusion!! 👆🏾😊✅⛪

  • The light button has the superior message-send conversion rate.
  • Not that this is surprising, but:
    • This 23-percentage-point difference is most likely not due to chance, haha!
In [149]:
click_button_ab_test
Out[149]:
count users proportion
light_button_clicks 3437 16593 0.21
dark_button_clicks 3811 16488 0.23
In [265]:
labels = ['Clicking Users', '']
light_button_clicks = [click_button_ab_test['count'][0], click_button_ab_test['users'][0]-\
                      click_button_ab_test['count'][0]]
dark_button_clicks = [click_button_ab_test['count'][1], click_button_ab_test['users'][1]-\
                     click_button_ab_test['count'][1]]
adg_font = {'fontname':'Adobe Garamond Pro'}

fig, axs = plt.subplots(1, 2,figsize=(11, 6))
colors_lb = ['darkgray','gainsboro']
colors_db = ['cornflowerblue','lightsteelblue']

axs[0].pie(light_button_clicks, labels=labels, autopct='%1.2f%%', shadow=False, colors=colors_lb, explode = (0, 0.08)\
          , startangle = 6)
axs[0].set_title('Light Button', fontsize = 19, **adg_font)

axs[1].pie(dark_button_clicks, labels=labels, autopct='%1.2f%%', shadow=False, colors=colors_db, explode = (0, 0.08))
axs[1].set_title('Dark Button', fontsize = 19, **adg_font)

plt.subplots_adjust(wspace=0.3, hspace=1)
plt.suptitle("    Anytime Clicks, By Button Type", fontsize = 24, fontweight = 'bold', **adg_font)
plt.savefig("ex_A_clicks.png",dpi=300, bbox_inches='tight')
plt.show()
In [266]:
send_button_ab_test
Out[266]:
count users proportion
light_button_sends 1321 16593 0.08
dark_button_sends 712 16488 0.04
In [267]:
labels = ['Sending Users', '']
light_button_sends = [send_button_ab_test['count'][0], send_button_ab_test['users'][0]-\
                     send_button_ab_test['count'][0]]
dark_button_sends = [send_button_ab_test['count'][1], send_button_ab_test['users'][1]-\
                    send_button_ab_test['count'][1]]
adg_font = {'fontname':'Adobe Garamond Pro'}

fig, axs = plt.subplots(1, 2,figsize=(11, 6))
colors_lb = ['darkgray','gainsboro']
colors_db = ['cornflowerblue','lightsteelblue']

axs[0].pie(light_button_sends, labels=labels, autopct='%1.2f%%', shadow=False, colors=colors_lb, explode = (0, 0.08))
axs[0].set_title('Light Button', fontsize = 19, **adg_font)

axs[1].pie(dark_button_sends, labels=labels, autopct='%1.2f%%', shadow=False, colors=colors_db, explode = (0, 0.08))
axs[1].set_title('Dark Button', fontsize = 19, **adg_font)

plt.subplots_adjust(wspace=0.3, hspace=1)
plt.suptitle("    Sends After Clicks, By Button Type", fontsize = 24, fontweight = 'bold', **adg_font)
plt.savefig("ex_B_send_it.png",dpi=300, bbox_inches='tight')
plt.show()
In [268]:
click_to_send_conversions_funnel
Out[268]:
users first-clicking users sending users conversion
light_button 16593 2957 1321 0.45
dark_button 16488 3194 712 0.22
In [276]:
labels = ['Converted Users', '']
light_button_converted_sends = \
  [click_to_send_conversions_funnel['sending users'][0], click_to_send_conversions_funnel['first-clicking users'][0]-\
  click_to_send_conversions_funnel['sending users'][0]]
dark_button_converted_sends = \
  [click_to_send_conversions_funnel['sending users'][1], click_to_send_conversions_funnel['first-clicking users'][1]-\
  click_to_send_conversions_funnel['sending users'][1]]
adg_font = {'fontname':'Adobe Garamond Pro'}

fig, axs = plt.subplots(1, 2,figsize=(11, 6))
colors_lb = ['darkgray','gainsboro']
colors_db = ['cornflowerblue','lightsteelblue']

axs[0].pie(light_button_converted_sends, labels=labels, autopct='%1.2f%%', shadow=False, \
           colors=colors_lb, explode = (0, 0.08), startangle = -43)
axs[0].set_title('Light Button', fontsize = 19, **adg_font)

axs[1].pie(dark_button_converted_sends, labels=labels, autopct='%1.2f%%', shadow=False, \
           colors=colors_db, explode = (0, 0.08))
axs[1].set_title('Dark Button', fontsize = 19, **adg_font)

plt.subplots_adjust(wspace=0.3, hspace=1)
plt.suptitle("      Conversion Rate, By Button Type", fontsize = 24, fontweight = 'bold', **adg_font)
plt.savefig("ex_C_converted_sends.png",dpi=300, bbox_inches='tight')
plt.show()

Prompt 2: Classifying Member Engagement Status


  • Background: At Oscar, we have analytical and business use cases that require a classification of how engaged members are with Oscar (targeting outreach campaigns to more or less engaged members, understanding how member engagement intersects with clinical profile to affect costs, etc). The attached dataset includes three years of simulated engagement data by member by month. We would like you to use this data to build a classification of member engagement.

4member-engagement

Step 0 - Read in The Data & K-Means Clustering Strategy Discussion

  • As I read in the data—and this is sort of a summary of some work I've done offline—I thought about a few things:
    • What leverage points did I see in the data
    • What kinds of business use cases might I glean from the data
      • Use those business cases for guidance
  • As I looked at the user information, I saw a familiar theme:
    • Most users are quite inactive.
  • Thus the first segmentation is to filter out those users who are not interacting with the platform.
  • Next, we look to the categorical and the continuous data.
  • After I looked through it, there was one thing that really stuck out to me:
    • you have users who are very active online
    • you have users who are very active offline (such as phone calls)
  • I isolated logins and inbound calls and plotted them against each other.
  • I found that the people who logged in the most tended to call the least
  • And the people who called the most tended to log in the least
  • Please note that all this is for only the most active users, after other users have been filtered out.
    • And I always do these filterings on the basis of standard deviation rank:
      • Meaning, for example, all users below +2 standard deviations in logins can be filtered out, because being below that threshold essentially means they have close to 0 logins. The same goes for inbound phone calls.
  • Then we bring in the K-means clustering algorithm from scikit-learn and see how it fares on the data.
  • In line with my hypothesis, I decided to use 2 clusters for K-means, and this worked best.
    • I could tell it worked best by looking at the hues on scatter plots according to how K-means divided the data (a quick inertia check is sketched just after this list).
  • The reason I thought of this kind of segmentation is the advertising use-case:
    • Tell people who login all the time, but never call that representatives are there to help them.
    • Tell people who call all the time and never log in, that there is an exciting app, which they can use.
  • Then for further research if time allows, I have provided a look at the low-call, high-online VS. high-call, low-online graphs split over other categoricals, such as:
    • region
    • exchange status
    • age bracket
  • My ultimate conclusion is that segmentation should be done as follows:

    • (1) - passive users
    • (2) - active users who are low-call, high-online
    • (3) - active users who are high-call, low-online.

    • From my analysis, it appears very few users, if any, are both high-call and high-online, or both low-call and low-online. It seems people really have a preference for how they communicate. If it turned out that there were large categories like these, we could segment into them.

      • But, as of now, the high-high and low-low categories do not seem to exist here.
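As referenced in the list above, here is a minimal sketch (not part of the original analysis, and assuming the member_engagement dataframe read in below) for double-checking the choice of 2 clusters with an inertia 'elbow' curve, on the same two features and the same kind of 2σ filter:

#illustrative sketch: an elbow check on the number of clusters for the 2-sigma active users
#assumes member_engagement has already been read in (see the cells below)
active = member_engagement[
    (np.abs(member_engagement.login_count - member_engagement.login_count.mean())
        >= 2 * member_engagement.login_count.std())
    | (np.abs(member_engagement.ib_call_count - member_engagement.ib_call_count.mean())
        >= 2 * member_engagement.ib_call_count.std())
][['login_count', 'ib_call_count']]

inertias = [KMeans(n_clusters=k, random_state=0).fit(active).inertia_ for k in range(1, 7)]

plt.plot(range(1, 7), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Within-cluster sum of squares (inertia)')
plt.title('Elbow check for the active-user clusters')
plt.show()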

Step 1 - Deploy K-Means!

  • I have written the K-means algorithm into a function.
    • As always, I want to thank Jake Vanderplas for his great tutorials on K-means.
      • I've taken what he's said about K-means, and made a function that works nicely for our purposes.
  • This function works with my standard deviation splitting strategy and matplotlib.
  • Here's what it does:
    • First, you tell the function how active you want your users to be. The more active the users, the better the function works.
      • So if you tell the function that you want a sigma of 6, it will try to cluster people who are above 6 standard deviations from the mean in EITHER a) number of logins or b) number of inbound calls.
    • Second, the function applies the k-means unsupervised machine learning algorithm.
    • Third, the function then plots a graph of logins on the x-axis and inbound calls on the y-axis for the specified standard-deviation 'cut', and colors the points by the two K-means clusters.

👇🏾Let's take a quick peek at the data.👇🏾

In [7]:
member_engagement = pd.read_csv('member_engagement.csv')
In [8]:
#this is just to take a look at the data
member_engagement.describe()
Out[8]:
ib_call_count ob_call_count ob_call_answered_count ib_message_count ob_message_count ob_message_read_count login_count telemedicine_count search_count grievance_count
count 351362.000000 351362.000000 351362.000000 351362.000000 351362.000000 351362.000000 351362.000000 351362.000000 351362.000000 351362.000000
mean 0.228969 0.038738 0.007368 0.144577 0.163319 0.134064 1.114802 0.027308 0.320049 0.007684
std 0.806874 0.353109 0.094743 0.956451 0.782292 0.745294 2.917814 0.191751 1.377393 0.099852
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000
max 90.000000 15.000000 4.000000 62.000000 40.000000 40.000000 108.000000 6.000000 36.000000 2.000000
In [9]:
member_engagement.dtypes
Out[9]:
member_id                 object
policy_id                 object
month                     object
policy_relation           object
age_group                 object
region                    object
enrollment_type           object
ib_call_count              int64
ob_call_count              int64
ob_call_answered_count     int64
ib_message_count           int64
ob_message_count           int64
ob_message_read_count      int64
login_count                int64
telemedicine_count         int64
search_count               int64
grievance_count            int64
dtype: object
In [10]:
member_engagement.enrollment_type.unique()
Out[10]:
array(['Off Exchange', 'On Exchange'], dtype=object)
In [11]:
member_engagement.policy_relation.unique()
Out[11]:
array(['Subscriber', 'Dependent', 'Spouse'], dtype=object)
In [12]:
member_engagement.region.unique()
Out[12]:
array(['New York', 'Orlando', 'San Antonio', 'Austin', 'New Jersey'],
      dtype=object)
In [13]:
member_engagement.month.unique()
Out[13]:
array(['2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01',
       '2017-05-01', '2017-06-01', '2017-07-01', '2017-08-01',
       '2017-09-01', '2017-10-01', '2017-11-01', '2017-12-01',
       '2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01',
       '2019-05-01', '2019-06-01', '2018-01-01', '2018-02-01',
       '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01',
       '2018-07-01', '2018-08-01', '2018-09-01', '2018-10-01',
       '2018-11-01', '2018-12-01'], dtype=object)
In [15]:
member_engagement.sample(4)
Out[15]:
member_id policy_id month policy_relation age_group region enrollment_type ib_call_count ob_call_count ob_call_answered_count ib_message_count ob_message_count ob_message_read_count login_count telemedicine_count search_count grievance_count
145277 28193a93-0caa-4e06-b4c1-831c328d8ece 33b792fe-d263-4466-8b5b-f388a2c740c9 2019-01-01 Spouse 36-45 New York On Exchange 0 0 0 0 0 0 3 0 2 0
307812 a4637f25-1c78-4bfd-829e-0e34e9be82e8 6dbbfc8b-3be5-41ef-9adb-fa819a2ca889 2019-05-01 Subscriber 36-45 Orlando On Exchange 0 0 0 0 0 0 8 1 1 0
151349 185aeac7-f717-42ca-9152-c08f952229e0 b6e95fbd-6358-4506-9d6c-43e4582f9f8d 2018-04-01 Subscriber 56+ San Antonio On Exchange 1 0 0 0 0 0 0 0 0 0
228915 d391bdce-70f5-4a0e-9f5a-0acfe6cd3d0a d7841194-a7e5-43cf-92e6-b93f9117cb59 2019-04-01 Subscriber 27-35 New York Off Exchange 0 0 0 0 0 0 5 1 1 0

🔬 Introducing My Function: K-Means-THINK_Binary 🔬

  • Special thanks to Jake VanderPlas for inspiring me and showing the foundational code.
In [16]:
def k_means_think_binary(sigma):
    super_peeps =  member_engagement[
    (np.abs(member_engagement.login_count-\
                    member_engagement.login_count.mean())\
       >= (sigma*member_engagement.login_count.std())) 
                                 |\
      (np.abs(member_engagement.ib_call_count-\
                    member_engagement.ib_call_count.mean())\
       >= (sigma*member_engagement.ib_call_count.std()))                           
                                ]
    super_peeps = super_peeps[['login_count','ib_call_count']]
    kmeans = KMeans(n_clusters=2)
    kmeans.fit(super_peeps)
    y_kmeans = kmeans.predict(super_peeps)
    #y_kmeans = ['Loves To Call, But Not Online' if x == 0 else 'Always Online, Never Calls' for x in y_kmeans]
    if sigma < 3: #this is a hacky bit to make sure the legend colors are correctly assigned
        y_kmeans = ['Loves To Call, But Not Online' if x == 0 else 'Always Online, Never Calls' for x in y_kmeans] 
                #^^^thanks to @arboc, https://stackoverflow.com/a/4406777/11736959
        sns.set_palette("husl", 8)
    #elif sigma in (3,10,15):
     #   y_kmeans = ['Loves To Call, But Not Online' if x == 1 else 'Always Online, Never Calls' for x in y_kmeans] 
                #^^^thanks to @arboc, https://stackoverflow.com/a/4406777/11736959
      #  sns.set_palette("Set2")
    else:
        y_kmeans = ['Loves To Call, But Not Online' if x == 1 else 'Always Online, Never Calls' for x in y_kmeans] 
                #^^^thanks to @arboc, https://stackoverflow.com/a/4406777/11736959 
        flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
        sns.set_palette(flatui)
    super_peeps['Classification Concept'] = y_kmeans
    super_peeps
    #sns.set_palette("Paired")
    #sns.set_palette("husl", 8)
    sns.lmplot( x="login_count", y="ib_call_count", data=super_peeps, fit_reg=False, hue="Classification Concept", \
               legend=True, legend_out=True)
    #plt.legend(loc='upper right')
    user_count = super_peeps.login_count.count()
    plt.ylabel('Number of Inbound Calls Per Month')
    plt.xlabel('Number of Customer Logins Per Month')
    plt.suptitle('For Users With Inbound Calls or Logins Above {}σ'.format(sigma),\
              fontweight = 'bold', y=1.05)
    plt.title('                              {:,} customers'.format(user_count))
    plt.savefig('analyzer_{}-std-kmeans.png'.format(sigma),dpi=400, bbox_inches='tight')
    plt.ylim(top=30, bottom=0) 
    plt.xlim(right=80, left=0)
    plt.show()
In [17]:
for standard_deviation in [0,0.5,0.75,1,2,3,15,20]:
        k_means_think_binary(standard_deviation)

👆🏾Look out for this! 👆🏾

  • Sometimes K-means alters the 1 or 0 it is using for classification, and throws off the labels in the legend.
  • The left side of this graph is always 'Loves To Call, But Not Online'
    • the right side of this graph is always 'Always Online, Never Calls'
    • But sometimes the K-means ML algorithm swaps the identifiers
  • So just please be aware that the labels may get swapped once in a while, randomly.
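One hedged way to pin the labels for good, instead of the sigma-based color hack inside the function, is to name each cluster by its center rather than by the arbitrary 0/1 id K-means assigns. This is just a sketch; it assumes the fitted kmeans object and the raw integer predictions y_kmeans inside k_means_think_binary, before any relabeling.

#illustrative sketch: derive labels from the cluster centers so they never swap
#assumes kmeans = KMeans(n_clusters=2).fit(super_peeps) and y_kmeans = kmeans.predict(super_peeps)
centers = kmeans.cluster_centers_        #column 0 = login_count, column 1 = ib_call_count
online_cluster = centers[:, 0].argmax()  #the cluster whose center has the most logins
y_kmeans = ['Always Online, Never Calls' if cluster_id == online_cluster
            else 'Loves To Call, But Not Online' for cluster_id in y_kmeans]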

Step 2 - Further Exploration

  • This takes the same filtering approach as the function above and, one by one, breaks the data down across different categorical variables in the user base.
  • This is a foundation for further analysis.

🔬 Introducing My Other Function: The Analyzer 🔬

  • Special thanks to Jake VanderPlas for inspiring me and showing the foundational code.
In [20]:
def the_analyzer(num, groupby):
    super_peeps =  member_engagement[
    (np.abs(member_engagement.login_count-\
                    member_engagement.login_count.mean())\
       >= (num*member_engagement.login_count.std())) 
                                 |\
      (np.abs(member_engagement.ib_call_count-\
                    member_engagement.ib_call_count.mean())\
       >= (num*member_engagement.ib_call_count.std()))                           
                                ]
    sns.set_palette("Dark2")
    sns.lmplot( x="login_count", y="ib_call_count", data=super_peeps, fit_reg=False, hue=groupby, \
               legend=True)
    user_count = super_peeps.member_id.count()
    plt.ylabel('Number of Inbound Calls Per Month')
    plt.xlabel('Number of Customer Logins Per Month')
    plt.suptitle('            For Users With Inbound Calls or Logins Above {}σ'.format(num),\
              fontweight = 'bold', y=1.05)
    plt.title('     {:,} customers'.format(user_count))
    plt.savefig('analyzer_{}-std_{}-groupby.png'.format(num, groupby),dpi=400, bbox_inches='tight')
    plt.ylim(top=30, bottom=0) 
    plt.xlim(right=80, left=0)
    plt.show()
In [21]:
for i in [0,0.5,0.75,1,2,3,10,15, 20]:
    for q in ['enrollment_type', 'age_group', 'region']:
        the_analyzer(i, q)

Best,

George John Jordan Thomas Aquinas Hayward, Optimist

george-hayward-data-scientist