June 28, 2019

Table of Contents

Part I. - Beginning Notes

Not shown here: I cleaned up the data the team provided. I deleted the null values for simplicity's sake (though in the real world I would talk with the team about why we have the nulls, and we'd decide whether to fill them in), re-formatted the datetime columns, replaced the whitespace in the column headers with underscores, and finally loaded the fixed CSV into MySQL Workbench to run my SQL queries.
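For reference, here is a minimal sketch of that cleanup in pandas. The file names and the exact datetime column list are assumptions (reconstructed from the columns my SQL queries use below), not the exact script I ran:

import pandas as pd

df = pd.read_csv('doordash_raw.csv')  # hypothetical file name

# replace the whitespace in the column headers with underscores
df.columns = [c.strip().replace(' ', '_') for c in df.columns]

# drop the null rows for simplicity's sake (in the real world, discuss with the team first)
df = df.dropna()

# re-format the datetime columns (in my actual cleanup I also assigned every
# timestamp to the same month, since the raw data had no month or year)
datetime_cols = ['Customer_placed_order_datetime',
                 'Placed_order_with_restaurant_datetime',
                 'Driver_at_restaurant_datetime',
                 'Delivered_to_consumer_datetime']
for col in datetime_cols:
    df[col] = pd.to_datetime(df[col])

# write the fixed csv, which I then loaded into MySQL Workbench
df.to_csv('doordash_clean.csv', index=False)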

Two data notes: I will definitely want to talk to the team about (1) using a more standard datetime format going forward and (2) better understanding why there were so many nulls in the “Driver at restaurant datetime” column.

Part I. - Question 1

I analyzed the data per delivery region, unless otherwise noted. The metrics that stood out to me were:

  1. Tip Percentage,
  2. Average Order Size,
  3. Average Number of Minutes to Complete Order,
  4. Number of Orders Per Time of Day,
  5. New Orders Out of Total Orders,
  6. Top Restaurants,
  7. Highest Frequency Customers,
  8. Refund Rate,
  9. Orders Per Customer, &
  10. Orders Per Returning Customer

Quick Look: A Young Business with a Lot of Promise

Even though this data is from one of DoorDash’s earliest months, the business shows a lot of promise, with over $925,000 in topline revenue (Insight 8c).

Further, there are roughly 2.7 orders per customer, which indicates that we are providing a service people like enough to order again (Insight 6a). In fact, in this month’s data, over 63% of our users were returning customers (Insight 8b).

We operate in three regions: Palo Alto, Mountain View, and San Jose. Most of our orders are in Palo Alto (11k+), followed by Mountain View (almost 4k) and San Jose (just under 3k) (Insight 7). The average order value seems steady in the $40-45 range (Insight 1).

I’ve also noticed a few interesting trends. First, our order frequency is bi-modal, with orders bunching around 11am (lunch) and 6pm (dinner) every day. We have very few orders before 10am or after 9pm, which can guide the way we allocate our resources (Insight 2).

I’ve also looked at how long it takes us to get the order to the customer. It’s usually just under 50 minutes, and Insight 1 breaks down what we do in those 50 minutes: Step 1 (Customer -> Restaurant), Step 2 (Restaurant -> Dasher), and Step 3 (Dasher -> Customer). Regarding refunds, our refund rate is very good. It hovers around 2-3%, varying from delivery region to delivery region, and this rate is in line with my understanding of e-commerce businesses (Insight 7).

Additionally, I’ve analyzed data on our top restaurants, for which I considered a short total delivery time and a popularity of more than 50 orders. For instance, Restaurant ID 12 seems like a particularly outstanding performer (Insight 3). Along similar lines, I've also analyzed our power users. Customer ID 514 has already ordered over 60 times this month (Insight 5).

Finally, I noticed something quirky in the tip comparison (Insight 1). Although Palo Alto is one of the wealthiest cities in the United States, its customers seem to be our stingiest tippers. This may be an area for us to work on; in fact, it's where I will focus my experiment.

Some Recommendations.

Using what I’ve seen so far, here are a few recommendations:

  1. Higher Tips. Tips are great because they put more money in the pockets of our workers, which will help them with their expenses and bills. Since community impact is so key to our mission, we may want to run an awareness campaign, starting in our wealthiest markets (i.e., Palo Alto), to increase customer tipping.
  2. Faster Customer-to-Restaurant Times. There’s not much we can do about how long the restaurant takes to cook the meal or how long it takes the driver to navigate traffic to deliver it, but we may be able to cut down on the time between the customer placing an order and the restaurant getting it in the system. Note: I believe DoorDash has since addressed this issue with API embedding into restaurants’ own websites.
  3. Incentives for Prime Hours. We definitely have hot hours and cooler hours for food delivery. We should optimize as much as we can around 11am and 6pm. We may consider an incentive-based model for drivers and restaurants around these hours, to give customers the fastest food experience when they are hungriest and when logistics are most complicated.
  4. Rewards Program to Keep Our Churn Low. We'll want to monitor churn; it’s very important for any business. Since the data here covers only one month, churn is hard to compute, but the fact that over 63% of our customers are returning customers is excellent! To grow the business, not only will we want to monitor churn continuously, but we may also institute a rewards program to keep people ordering on our platform.
  5. Extra Focus on Power Users and Top Restaurants. Reach out to our power users and top restaurants. We have some customers who order much more than average, and some restaurants that people love much more than average. For our power users, we should be high-touch and give them the best service our budget allows; for top restaurants, we should talk with them about advertising opportunities and profile them on our platform.

Part I. - Question 2

Experimental Recommendation.

There are many parts of the business we can focus on for improvement, but one that stands out to me is our low tipping rate.

I think that places like Palo Alto can tip more. If we can raise that tip rate, I think it will (a) be positive for our dashers, (b) attract more workers to our business, (c) lead to good press coverage, and (d) increase employee morale, because we know we are doing the right thing.

Further, I would only want to make the ask when people are at the final stage of the sales funnel. This is similar to being at a checkout register when someone asks if you would like to add a small charitable donation to the end of your purchase. As such, we would not expect it to decrease our number of orders, because the customer is already committed to the order at that stage.

Experiment Question:

Can we increase tips in Palo Alto by adding a photo of the dasher, a short bio of the dasher, and a “15% Tip” button on the screen?

Design: A/B test, evaluated with a two-sample t-test on the difference of means.

Key Metric: Mean Tip Percentage

Control: Status quo; no changes.

Treatment: Status quo + the photo/bio/button combo at checkout.

Null Hypothesis: H_0: μ_control - μ_treatment = 0

Alternative Hypothesis: H_A: μ_control - μ_treatment ≠ 0

Conditions: Randomized assignment, individuals independent, groups independent, samples at least nearly normal (check with a histogram).

Best-Practice Steps:

  1. Decide what our statistical threshold will be. Likely a significance level of 0.05.
  2. Decide what our practical threshold will be. Likely a 3% increase on average, but this can be discussed.
  3. Decide how long we want to run the study. This would be a team discussion, but I think a month would be a good length of time.
  4. Decide whether we want a two-tailed or a one-tailed test. I’d say two-tailed, because we also want to detect whether the treatment group somehow tips at a lower rate than our control.
  5. Determine the sample size. We can use an online calculator like Optimizely's official one.
  6. Note the degrees of freedom.
  7. Determine the pooled standard error between the means.
  8. Use the pooled standard error to find the t-statistic.
  9. Assess whether this t-statistic, for the degrees of freedom we’ve specified, leaves us at a p-value below 0.05.
  10. If so, we can reject the null hypothesis and move ahead. (A short code sketch of steps 5-10 follows below.)
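For reference, here is a hedged sketch of how steps 5-10 could look in Python, assuming we already have per-order tip percentages for the control and treatment groups. The effect size and the placeholder samples are assumptions for illustration only, and statsmodels stands in for the online calculator:

import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

# Step 5: sample size per group for a two-tailed test at alpha = 0.05 and power = 0.8.
# The effect size (Cohen's d) is a placeholder we would refine with pilot data.
n_per_group = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(f"~{n_per_group:.0f} orders needed per group")

# Placeholder tip-percentage samples; in the real experiment these would be the
# observed tip percentages for the Palo Alto control and treatment orders.
control_tips = np.random.normal(6.8, 4.0, 2000)
treatment_tips = np.random.normal(7.5, 4.0, 2000)

# Steps 6-9: two-sample t-test using a pooled standard error (equal_var=True).
t_stat, p_value = stats.ttest_ind(treatment_tips, control_tips, equal_var=True)

# Step 10: reject the null hypothesis if the two-tailed p-value is below 0.05.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject H0: {p_value < 0.05}")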

Part I. - Question 3

Because this change is software-based, I think we can implement it on a wider scale. By wider scale, I still mean Palo Alto for now.

I think it’s the kind of thing which can be a win-win, so I don’t anticipate too many things going wrong. One operational challenge may be that dashers are protective of their information, so we would make sure to get their consent and talk to the lawyers. (Note: I am also a lawyer.) We would want to be in great communication with the drivers, make sure privacy laws are followed, and measure whether the change succeeds.

The engineers who are building the functionality may have more pressing concerns, so we would want to work with them respectfully.

The great thing about digital A/B testing is that when you do have statistical information showing an improvement, the change can be scaled rapidly.

Part I - Code (SQL and Python) and Data Visualizations

I now connect to my local MySQL database below, then I reproduce the queries I ran as strings, and lastly, I load their output tables as pandas dataframes.

In [1]:
import pandas as pd
import mysql.connector as mysql
from matplotlib import pyplot as plt
In [2]:
#connecting to the local sql database I set up
db = mysql.connect(
    host = "localhost",
    user = "newuser",
    passwd = "data",
    auth_plugin='mysql_native_password',
    database = 'sys')

Insight 1. Tip Percentage and Average Time Spent in Each Step of Delivery Process Per Region

In [3]:
tips_and_supplychain_per_region = pd.read_sql("""
WITH supply_flow AS (
select
    Delivery_Region,
    Is_ASAP,
    Restaurant_ID,
    Is_New,
    Order_total,
    round(100*(Amount_of_tip/Order_total),2) as 'Tip_Percent',
    (timestampdiff(MINUTE, Customer_placed_order_datetime, Placed_order_with_restaurant_datetime)) as 'Step_1_Time',
    (timestampdiff(MINUTE, Placed_order_with_restaurant_datetime, Driver_at_restaurant_datetime)) as 'Step_2_Time',
    (timestampdiff(MINUTE, Driver_at_restaurant_datetime, Delivered_to_consumer_datetime)) as 'Step_3_Time',
    (timestampdiff(MINUTE, Customer_placed_order_datetime, Delivered_to_consumer_datetime)) as 'Total_Time'
    from sys.doordash
    where Is_Asap = 'True'
    )  
    select
        Delivery_Region,
        round(avg(Tip_Percent),2) as 'Tip Pct',
        round(avg(Order_total),2) as 'Avg Order',
        round(avg(Step_1_Time),1) as 'Cust->Rest Time',
        round(avg(Step_2_Time),1) as 'Rest->Driv Time',
        round(avg(Step_3_Time),1) as 'Driv->Cust Time',
        round(avg(Total_Time),1) as 'Total Time',
        count(*) as 'Orders'
    from supply_flow
    where Step_1_Time > 0 # to remove day-31-to-day-1 rollover outliers, since the raw data came with no month or year
       and Step_2_Time > 0
       and Step_3_Time > 0
       and Total_Time > 0
  group by 1
  having orders > 50
  order by 8 asc
    ;
""", con=db)
In [4]:
tips_and_supplychain_per_region
#time of the supply chain is in minutes
Out[4]:
Delivery_Region Tip Pct Avg Order Cust->Rest Time Rest->Driv Time Driv->Cust Time Total Time Orders
0 San Jose 8.26 41.10 7.4 16.0 24.3 48.6 1559
1 Mountain View 7.24 45.09 8.8 14.6 22.5 46.9 1845
2 Palo Alto 6.81 44.28 8.1 13.9 23.8 46.8 4912

Two Notes Here:

  1. All the times of the 'supply chain' are in minutes.
  2. There are fewer orders analyzed here because of the 'where Step_1_Time > 0', etc. clauses. Essentially, the data originally came with a day and an hours-minutes-seconds indicator, but NOT a month or year. So in pandas, when I was reformatting the datetimes, I had to assign all the datetimes to the same month. However, some orders started on the 31st day and finished on the 1st of the next month, and these rolled over to the 1st of the same month because of my 'all-in-the-same-calendar-month' assignment. These end-of-the-month dashes would lead to bizarre results such as negative delivery times, so for simplicity's sake I excluded negative times from this aggregation. This is permissible because in the real world, if we have a day, we will also have a month and a year, and thus the problem I worked around here would not occur. (A tiny illustration of the rollover artifact follows below.)
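To make note 2 concrete, here is a tiny illustration (with an arbitrary year) of why a 31st-to-1st order turns negative once every timestamp is forced into the same calendar month:

import pandas as pd

# An order placed late on the 31st and delivered 30 minutes later, early on the
# 1st of the next month. With no month or year in the raw data, both timestamps
# were mapped into the same month, so the delivery appears to precede the order.
placed = pd.Timestamp('2015-01-31 23:50:00')
delivered = pd.Timestamp('2015-01-01 00:20:00')   # "should" be February 1st

print((delivered - placed).total_seconds() / 60)  # a large negative number of minutes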

In [5]:
labels = ['Tip', '']
# rows are ordered by order count ascending: 0 = San Jose, 1 = Mountain View, 2 = Palo Alto
pa_tip = [tips_and_supplychain_per_region['Tip Pct'][2], (100-tips_and_supplychain_per_region['Tip Pct'][2])]
mv_tip = [tips_and_supplychain_per_region['Tip Pct'][1], (100-tips_and_supplychain_per_region['Tip Pct'][1])]
sj_tip = [tips_and_supplychain_per_region['Tip Pct'][0], (100-tips_and_supplychain_per_region['Tip Pct'][0])]
national_tip = [15,85]

fig, axs = plt.subplots(2, 2,figsize=(9, 6))
colors_pa = ['limegreen','lightgreen']
colors_mv = ['paleturquoise','lightcyan']
colors_sj = ['sandybrown','bisque']
colors_nt = ['hotpink','pink']

axs[0,0].pie(pa_tip, labels=labels, autopct='%1.2f%%', shadow=True, colors=colors_pa, explode = (0, 0.1))
axs[0,0].set_title('Palo Alto',fontweight = 'bold')

axs[0,1].pie(mv_tip, labels=labels, autopct='%1.2f%%', shadow=True, colors=colors_mv, explode = (0, 0.1))
axs[0,1].set_title('Mountain View',fontweight = 'bold')

axs[1,0].pie(sj_tip, labels=labels, autopct='%1.2f%%', shadow=True, colors=colors_sj, explode = (0, 0.1))
axs[1,0].set_title('San Jose',fontweight = 'bold')

axs[1,1].pie(national_tip, labels=labels, autopct='%1.2f%%', shadow=True, colors=colors_nt, explode = (0, 0.1))
axs[1,1].set_title('National',fontweight = 'bold')

plt.subplots_adjust(wspace=0, hspace=0.1)
plt.suptitle("Tip Percentage Comparison", fontsize = 14, fontweight = 'bold')
plt.savefig('ex_1_tips.png')
plt.show()

In [6]:
f, ax = plt.subplots(figsize=(8, 6))
step1_time = tips_and_supplychain_per_region['Cust->Rest Time']
step2_time = tips_and_supplychain_per_region['Rest->Driv Time']
step3_time = tips_and_supplychain_per_region['Driv->Cust Time']
total_time = tips_and_supplychain_per_region['Total Time']

p1 = plt.bar(range(len(step1_time)),
 step1_time, color = 'cornflowerblue')
p2 = plt.bar(range(len(step2_time)),
 step2_time, bottom = step1_time, color = 'sandybrown')
p3 = plt.bar(range(len(step3_time)),
 step3_time, bottom = (step2_time+step1_time), color = 'lightgreen')

plt.locator_params(axis='x', nbins=4)
ax.set_xticklabels(['0','San Jose', 'Mountain View', 'Palo Alto'], style='italic')


plt.xlabel("Delivery Region", fontsize = 13)
plt.ylabel("Time (minutes)", fontsize = 13)
plt.legend((p1[0], p2[0], p3[0]), ('Customer to Restaurant', 'Restaurant to Dasher',\
           'Dasher to Customer'),loc='center left', bbox_to_anchor=(1, 0.5))
plt.title("DoorDash Time To Serve Avg. Breakdown", fontsize = 14, fontweight = 'bold')

plt.savefig('ex_2_time_to_serve.png')
plt.show()

Insight 2. What time of day do most orders happen?

In [7]:
time_of_order = pd.read_sql("""
select
    extract(HOUR from date_add(Customer_placed_order_datetime, INTERVAL -7 HOUR)) as 'Hour',
    count(*) as 'Number of Orders',
    round(100* count(*) / 13524,2) as 'Percentage' # 13524 = hard-coded total order count used for the percentage
    from sys.doordash
    group by 1
    order by 2 desc
    ;
""", con=db)
In [8]:
time_of_order.head(10)
Out[8]:
Hour Number of Orders Percentage
0 18 2549 18.85
1 19 2165 16.01
2 17 2111 15.61
3 20 1387 10.26
4 11 1099 8.13
5 12 887 6.56
6 16 839 6.20
7 13 632 4.67
8 10 626 4.63
9 21 487 3.60

In [9]:
f, ax = plt.subplots(figsize=(8, 6))
plt.bar(time_of_order['Hour'],time_of_order['Number of Orders'])
plt.xlabel("Hour of Day", fontsize = 13)
plt.locator_params(axis='x', nbins=17)
ax.set_xticklabels(['0','12am', '2am', '4am','6am', '8am',\
                   '10am', '12pm','2pm', '4pm',\
                   '6pm', '8pm','10pm', 'Midnight'], style='italic')
plt.xticks(rotation=-25)
plt.ylabel("Number of Orders", fontsize = 13)
plt.title("Order Frequency Throughout the Day (Month Aggregation)", fontsize = 13, fontweight = 'bold')
plt.savefig("ex_3_order_frequency.png")
plt.show()

Insight 3. Who are our best restaurants?

In [10]:
best_restaurants = pd.read_sql("""
WITH supply_flow AS (
select
    Delivery_Region,
    Is_ASAP,
    Restaurant_ID,
    Is_New,
    Order_total,
    round(100*(Amount_of_tip/Order_total),2) as 'Tip_Percent',
    (timestampdiff(MINUTE, Customer_placed_order_datetime, Placed_order_with_restaurant_datetime)) as 'Step_1_Time',
    (timestampdiff(MINUTE, Placed_order_with_restaurant_datetime, Driver_at_restaurant_datetime)) as 'Step_2_Time',
    (timestampdiff(MINUTE, Driver_at_restaurant_datetime, Delivered_to_consumer_datetime)) as 'Step_3_Time',
    (timestampdiff(MINUTE, Customer_placed_order_datetime, Delivered_to_consumer_datetime)) as 'Total_Time'
    from sys.doordash
    where Is_Asap = 'True'
    ) 
    select
        Restaurant_ID,
        Delivery_Region,
        round(avg(Tip_Percent),2) as 'Tip Pct',
        round(avg(Order_total),2) as 'Avg Order',
        round(avg(Step_1_Time),1) as 'Cust->Rest Time',
        round(avg(Step_2_Time),1) as 'Rest->Driv Time',
        round(avg(Step_3_Time),1) as 'Driv->Cust Time',
        round(avg(Total_Time),1) as 'Total Time',
        count(*) as 'Orders'    
    from supply_flow
    where Step_1_Time > 0 # to remove day-31-to-day-1 rollover outliers, since the raw data came with no month or year
       and Step_2_Time > 0
       and Step_3_Time > 0
       and Total_Time > 0
  group by 1, 2
  having orders > 50
  order by 8 asc;
""", con=db)
In [11]:
best_restaurants.head()
Out[11]:
Restaurant_ID Delivery_Region Tip Pct Avg Order Cust->Rest Time Rest->Driv Time Driv->Cust Time Total Time Orders
0 12 Palo Alto 6.68 31.35 8.2 9.7 18.4 37.4 257
1 72 Palo Alto 8.01 42.83 8.9 9.5 20.6 39.9 78
2 233 Palo Alto 7.43 32.85 7.1 9.7 22.5 40.2 99
3 43 Mountain View 7.37 29.51 10.1 8.6 20.7 40.4 127
4 18 Palo Alto 5.75 34.50 9.4 10.1 20.3 40.8 51

Please note: To make sure I analyzed full numbers for the 'Insight 4-8' sections, I created a new table with the 'null' rows added back in. The nulls generally occurred in the datetime fields, so I removed those rows for my logistical time analysis; however, to get a better picture of user counts and topline revenue, I've brought them back. You can see that these SQL queries reference a 'doordash_no_dt' table. The previous results came from queries that, at least partially, relied on the datetime values, but the following results come from queries that do not.

Insight 4. By delivery region, how many of our orders are new?

In [12]:
new_orders_by_region = pd.read_sql("""
select
    Delivery_Region,
    sum(case when Is_New = 'True' then 1 else 0 end) as new_orders,
    count(*) as orders,
    round(100*(sum(case when Is_New = 'True' then 1 else 0 end)) / (count(*)),2) as pct_new
    from sys.doordash_no_dt
    group by 1
    ;
""", con=db)
In [13]:
new_orders_by_region
Out[13]:
Delivery_Region new_orders orders pct_new
0 Palo Alto 2123.0 11433 18.57
1 Mountain View 720.0 3760 19.15
2 San Jose 666.0 2859 23.29
3 None 3.0 26 11.54

In [14]:
labels = ['New Orders', 'Repeat Orders']
pa_data = [new_orders_by_region.new_orders[0], (new_orders_by_region.orders[0] - new_orders_by_region.new_orders[0])]
mv_data = [new_orders_by_region.new_orders[1], (new_orders_by_region.orders[1] - new_orders_by_region.new_orders[1])]
sj_data = [new_orders_by_region.new_orders[2], (new_orders_by_region.orders[2] - new_orders_by_region.new_orders[2])]

fig, axs = plt.subplots(1, 3,figsize=(12, 3))
colors_pa = ['limegreen','lightgreen']
colors_mv = ['paleturquoise','lightcyan']
colors_sj = ['sandybrown','bisque']

axs[0].pie(pa_data, labels=labels, autopct='%1.2f%%', shadow=True, colors=colors_pa, explode = (0, 0.1))
axs[0].set_title('Palo Alto',fontweight = 'bold')

axs[1].pie(mv_data, labels=labels, autopct='%1.2f%%', shadow=True, colors=colors_mv, explode = (0, 0.1))
axs[1].set_title('Mountain View',fontweight = 'bold')

axs[2].pie(sj_data, labels=labels, autopct='%1.2f%%', shadow=True, colors=colors_sj, explode = (0, 0.1))
axs[2].set_title('San Jose',fontweight = 'bold')

plt.savefig("ex_4_new_vs_repeat_orders.png")
plt.show()

Insight 5. Who are our power customers?

In [15]:
#this is a little trickier because there are two sql statements to be run sequentially
#we split them up in the following way

sql1="set @total_orders = (select count(*) from sys.doordash_no_dt);" 
sql2="""
select
    Consumer_ID as "Customer_ID",
    count(*) as 'Num_Orders',
    round(100*(count(*) / @total_orders),3) as "Pct_of_all"
    from sys.doordash_no_dt
    group by 1
    order by 2 desc
    limit 50;
"""
cursor=db.cursor(buffered=True)
cursor.execute(sql1)
cursor.close()

power_customers = pd.read_sql(sql2, con=db)
In [16]:
power_customers.head(10)
Out[16]:
Customer_ID Num_Orders Pct_of_all
0 514 66 0.365
1 929 50 0.277
2 11956 47 0.260
3 2469 43 0.238
4 4087 39 0.216
5 2435 37 0.205
6 22183 36 0.199
7 8000 35 0.194
8 6245 33 0.183
9 14043 27 0.149

Insight 6a. What's our overall orders-to-customer metric?

In [17]:
orders_to_customer = pd.read_sql("""
select
    count(distinct Consumer_ID) as num_customers,
    count(*) as num_orders,
    round((count(*) / count(distinct Consumer_ID)),2) as orders_per_customer
    from sys.doordash_no_dt
    ;
""", con=db)
In [18]:
orders_to_customer
Out[18]:
num_customers num_orders orders_per_customer
0 6701 18078 2.7

Insight 6b. And what would happen if we only counted orders per returning customer?

In [19]:
orders_to_returning_customer = pd.read_sql("""
select
    count(distinct Consumer_ID) as num_customers,
    count(*) as num_orders,
    round((count(*) / count(distinct Consumer_ID)),2) as orders_per_returning_customer
    from sys.doordash_no_dt
    where Is_New = "False"
    ;
""", con=db)
In [20]:
orders_to_returning_customer
#this makes sense because if we start with 2/3 = 0.67
#but then take 1 from the numerator and one from the denominator
#we get (2-1)/(3-1) = 1/2 = 0.50
#this is what we're effectively doing by taking out the 1-time-only customers
Out[20]:
num_customers num_orders orders_per_returning_customer
0 5848 14566 2.49

Insight 7. How are we doing on our refund rate?

In [21]:
refund_rate_by_region = pd.read_sql("""
select
    Delivery_Region as region,
    sum(case when Refunded_amount > 0 then 1 else 0 end) as refund_count, 
    count(*) as order_count,
    round(100*(sum(case when Refunded_amount > 0 then 1 else 0 end)) / (count(*)),2) as refund_rate_pct
    from sys.doordash_no_dt
    group by 1
    order by 4 asc;
""", con=db)
In [22]:
refund_rate_by_region
Out[22]:
region refund_count order_count refund_rate_pct
0 Palo Alto 291.0 11433 2.55
1 San Jose 87.0 2859 3.04
2 Mountain View 115.0 3760 3.06
3 None 3.0 26 11.54

Insight 8a. Some Final Counts - Total Users

Reminder: as noted before Insight 4, the 'Insight 8' queries use the 'doordash_no_dt' table, which keeps the rows whose datetime fields were null, so the user counts and revenue totals reflect the full dataset.

In [23]:
total_users =  pd.read_sql("""
select
    count(distinct Consumer_ID) as total_users_this_month
from sys.doordash_no_dt;
""", con=db)
In [24]:
total_users
Out[24]:
total_users_this_month
0 6701

Insight 8b. Some Final Counts - New vs. Returning Users

In [25]:
total_returning_users =  pd.read_sql("""
select
    count(distinct Consumer_ID) as total_new_users_this_month
from sys.doordash_no_dt
where Is_New = "True";
""", con=db)
In [26]:
total_returning_users
Out[26]:
total_new_users_this_month
0 2431

2431 users / 6701 users = 36.28% new users this month

100% - 36.28% = 63.72% returning users this month

Insight 8c. Some Final Sums - Total Topline Revenue, Total Discount Applied, Total Amount Tipped

In [27]:
total_financial_sums =  pd.read_sql("""
select
    sum(Order_total) as total_revenue,
    sum(Amount_of_tip) as total_tip,
    sum(Amount_of_discount) as total_discount_applied
from sys.doordash_no_dt;
""", con=db)
In [28]:
total_financial_sums
Out[28]:
total_revenue total_tip total_discount_applied
0 926705.33 63128.77 24526.54

Part II - Question 1

• MySQL has a nifty function called DAYOFWEEK(), so I used that here. The heavy lifting was just making sure the date was in the right DATETIME format. Once you’ve got that, there’s a lot you can do with SQL.

• For the average pay per hour, I find the difference in hours between the dasher’s start and end times, then divide their total pay by that difference. Please note that because TIMESTAMPDIFF returns whole numbers (so hours would be truncated), I found it easier to take the output in minutes and divide by 60 to get hours as a float.

• Finally, I grouped by the day of the week.

• The output will label the days of the week as 1, 2, 3, 4, etc., but you can easily reformat that (perhaps with a CASE WHEN). A minimal sketch of the query follows below.
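Here is a minimal sketch of that query, following the pd.read_sql pattern from Part I. The table and column names (dasher_shifts, shift_start, shift_end, total_pay) are assumptions, since the Part II schema is not reproduced in this writeup:

avg_pay_by_day = pd.read_sql("""
select
    dayofweek(shift_start) as 'Day_of_Week',   # 1 = Sunday ... 7 = Saturday
    round(avg(total_pay / (timestampdiff(MINUTE, shift_start, shift_end) / 60)), 2)
        as 'Avg_Pay_Per_Hour'
    from sys.dasher_shifts                     # hypothetical table name
    where timestampdiff(MINUTE, shift_start, shift_end) > 0   # guard against zero-length shifts
    group by 1
    order by 1;
""", con=db)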

Part II - Question 2

• The core of this query is the same as in the previous part (earnings per hour), but here you need to get information from another table.

• I make an assumption to limit the data to dashes that occur between 11am and 2pm. This would miss dashes that started before or ended after that time block. However, I think this assumption is valid: if a dash only partially overlapped the 11am-2pm block, we would not be able to correctly assess the pay per hour during that window, because we only have total pay and no way of knowing how much of it came from orders outside the block. Therefore, I limited the query to dashes beginning after 11am and ending before 2pm, using an HOUR extract in the where clause.

• To join the tables, I joined on dasher_id via a standard inner join.

• I then made sure we had the correct submarket_id, also in the where clause. A hedged sketch follows below.
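A hedged sketch of that query, again with assumed table and column names (dashes, dasher_info, dash_start, dash_end, total_pay) and a placeholder submarket_id:

lunch_pay = pd.read_sql("""
select
    round(avg(d.total_pay / (timestampdiff(MINUTE, d.dash_start, d.dash_end) / 60)), 2)
        as 'Avg_Pay_Per_Hour_Lunch'
    from sys.dashes d
    inner join sys.dasher_info di                 # join to pull in the submarket
        on d.dasher_id = di.dasher_id
    where di.submarket_id = 5                     # placeholder submarket_id
      and extract(HOUR from d.dash_start) >= 11   # dash begins at or after 11am
      and extract(HOUR from d.dash_end) < 14      # dash ends before 2pm
      and timestampdiff(MINUTE, d.dash_start, d.dash_end) > 0;
""", con=db)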

Part II - Question 3

• There are several ways to attack this problem, and I would like to start with a common SQL gripe: the median is very important in EDA, and there is basically no easy way to get the median in SQL (contrast that with Microsoft Excel).

• The plan was to first aggregate average hourly pay using the methods above. I noticed a slight semantic difference in this problem (“total pay per hour” vs. “average earnings per hour”), and I interpreted “total pay per hour” as meaning the same thing “average earnings per hour” did before. So I first aggregated average hourly pay, just as I did above.

• Then I grouped the pay by dasher_id and the dasher's email address (which we can do because they have a 1:1 relationship).

    o I had to join the tables again to get this information.

• I then used a datediff in the where clause to make sure we are only tracking the last 30 days, as specified by the prompt.

• Finally, and this was the key, I decided to use a window function to rank the dashers by their hourly pay.

    o The one I used was NTILE(), because it fits this problem well. NTILE buckets the rows into as many buckets as you specify; here I would specify 2 buckets. We then order the buckets, ascending, on the basis of the average hourly pay that is also in the SELECT clause.

    o Here’s where NTILE shines:

        - The issue with the median is that if you have an odd number of rows, the median is the middle row, but if you have an even number of rows, the median becomes an average of the two middle rows.

        - NTILE handles this beautifully for our problem. When the total number of rows is even, it buckets evenly; when the total number of rows is odd, it makes the first group one row bigger than the others.

        - For us, that is perfect.

        - The goal is to give a promotion to the dashers in the bottom 50th percentile. So here, in a ‘tie goes to the runner’ mindset, we would want to err on the side of giving the mid-point dasher the promotion, because the goal is to help the dasher!

        - So as long as you set up the NTILE to follow an ascending hourly pay order, you will have 1’s next to everyone who needs the promotion.

• To simplify things, I then nested the query. The WITH clause query gives me the dasher, their email, their pay, and their bucket.

• All I have to do is query the table this WITH clause made, and set the bucket number = 1 in a where clause.

• This produces the desired list. (A hedged sketch of the full query follows below.)
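As a reference, here is a hedged sketch of the full query in the same pd.read_sql style. All table and column names (dashes, dashers, email, dash_start, dash_end, total_pay) are assumptions, and it assumes MySQL 8.0+ for the NTILE() window function:

bottom_half_dashers = pd.read_sql("""
with hourly_pay as (
    select
        d.dasher_id,
        da.email,
        round(sum(d.total_pay)
              / sum(timestampdiff(MINUTE, d.dash_start, d.dash_end) / 60), 2) as avg_hourly_pay,
        ntile(2) over (
            order by sum(d.total_pay)
                     / sum(timestampdiff(MINUTE, d.dash_start, d.dash_end) / 60) asc
        ) as pay_bucket                              # bucket 1 = bottom half of earners
    from sys.dashes d
    inner join sys.dashers da                        # join again to get the email address
        on d.dasher_id = da.dasher_id
    where datediff(now(), d.dash_start) <= 30        # only the last 30 days
    group by d.dasher_id, da.email
)
select dasher_id, email, avg_hourly_pay
from hourly_pay
where pay_bucket = 1;                                # the dashers who get the promotion
""", con=db)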

Thank you so much for your time and consideration.

Best,

George John Jordan Thomas Aquinas Hayward, Optimist