
How to Improve Decision-Making with Hypothesis Testing in Python
Learn how hypothesis testing in Python can help you make confident, data-driven choices.
By Daniel Builescu
In the whirlwind of corporate or product decisions, we commonly ask:
- “Does our fresh marketing spin actually surpass the old?”
- “Will a different courier unequivocally reduce shipping duration?”
- “Might a marginal price bump hurt sales, or can we proceed unscathed?”
Rather than relying on pure instinct, hypothesis testing brings method to the madness: collect data, run it through statistical checks, and find out whether an apparent difference stands on real ground or on fleeting randomness.
Core Concepts in Simple Terms
- Null Hypothesis (H0): The staid baseline — “no difference,” “no effect,” or “everything’s the same.” E.g., “Email A equals Email B’s performance.”
- Alternative Hypothesis (H1): The claim that something actually changes or diverges.
- p-value: A number between 0 and 1 giving the probability of seeing results at least this extreme if H0 were true. A low p-value suggests your data isn't mere random happenstance.
- Significance Level (α): The cutoff for risk tolerance. 0.05 = 5% risk, 0.01 = 1% risk, etc. If p-value < α, we label the result “significant.”
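In code, that decision boils down to a single comparison. A minimal sketch, with a made-up p-value standing in for real test output:
alpha = 0.05    # risk tolerance chosen before running the test
p_value = 0.03  # hypothetical p-value from some statistical test
if p_value < alpha:
    print("Significant: results this extreme would be unlikely if H0 were true")
else:
    print("Not significant: the data is consistent with H0")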
Scenario 1: Testing Two Email Designs
New job, naive marketing team. They presented me with two email template variants: A vs. B. Conventional practice? Pick whichever resonates personally.
But I proposed an A/B check:
- Send Email A to half the recipients.
- Send Email B to the remaining.
- Measure clicks. Then see if we detect a genuine difference.
1. Gather Data
import numpy as np
# A vs. B performance data
clicks_A = np.array([12, 14, 8, 10, 9, 11, 15, 13])
clicks_B = np.array([16, 18, 14, 17, 19, 20, 15, 18])
Each array lists the clicks from separate mini-samples. We want to see whether B truly edges out A in mean performance.
2. Apply a T-test
from scipy.stats import ttest_ind
t_stat, p_val = ttest_ind(clicks_A, clicks_B)
print("T-statistic:", t_stat)
print("P-value:", p_val)
- T-statistic: Magnitude of difference relative to inherent variability.
- p_val: If it skulks below 0.05, the gap is unlikely to be chance; given B's higher average clicks, that suggests B genuinely outperforms A.
3. Interpret
p_val < 0.05 => "Statistically significant": the difference is unlikely to be chance alone.
p_val >= 0.05 => Not significant: the difference could plausibly be random noise.
We discovered B hammered A. We deployed B wholeheartedly.
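One caveat before moving on: by default, ttest_ind treats both groups as having equal variances. If that assumption feels shaky, a common alternative is Welch's t-test, enabled with equal_var=False. A quick sketch, reusing clicks_A and clicks_B from above:
# Welch's t-test: drops the equal-variance assumption of the standard t-test
t_stat_w, p_val_w = ttest_ind(clicks_A, clicks_B, equal_var=False)
print("Welch T-statistic:", t_stat_w)
print("Welch P-value:", p_val_w)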
Scenario 2: Shipping Method Check
Speed matters. Our operations lead discovered a fresh courier and hypothesized it would slash delivery times. But forging a partnership with them would cost money.
We tested:
- Old courier for half the orders.
- New courier for the other half.
- Track each package’s delivery days.
import pandas as pd
from scipy.stats import ttest_ind
data = {
    "method": ["old"] * 6 + ["new"] * 6,                     # courier used for each order
    "delivery_days": [5, 6, 7, 5, 6, 7, 3, 4, 4, 4, 5, 4]    # days each package took
}
df = pd.DataFrame(data)
# Split delivery times by courier
old_days = df[df["method"] == "old"]["delivery_days"]
new_days = df[df["method"] == "new"]["delivery_days"]
t_stat2, p_val2 = ttest_ind(old_days, new_days)
print("T-stat:", t_stat2)
print("P-value:", p_val2)
If p_val2 is tiny (0.01, for instance), the new courier probably does speed up deliveries. If p_val2 sits around 0.4, the difference may be illusory. Ours came in under 0.05, so we switched couriers.
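One refinement worth mentioning: our hypothesis was directional (the new courier is faster), so a one-sided test fits. SciPy's ttest_ind accepts an alternative argument in recent versions (1.6+). A sketch reusing old_days and new_days from above:
# One-sided test: H1 says the old courier's mean delivery time is greater
t_stat_dir, p_val_dir = ttest_ind(old_days, new_days, alternative="greater")
print("One-sided T-stat:", t_stat_dir)
print("One-sided P-value:", p_val_dir)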
Scenario 3: A Price Increase Experiment
Finance mulled a 5% price bump. Risk? Alienating customers.
We tested:
- Raise prices for half of the items (test).
- Keep old prices for the other half (control).
- Compare average units sold.
import numpy as np
from scipy.stats import ttest_ind
# Units sold at the old price (control) vs. the 5% higher price (test)
control_sales = np.array([100, 102, 98, 105, 99, 101])
raised_price_sales = np.array([94, 90, 95, 92, 89, 91])
t_stat3, p_val3 = ttest_ind(control_sales, raised_price_sales)
print("T-stat:", t_stat3)
print("P-value:", p_val3)
A minuscule p_val3 => a real drop in sales. A large p_val3 => maybe no significant impact. We got p_val3 = 0.02, so we concluded the hike harmed sales and pivoted, either softening the increase or pairing it with added perks.
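The p-value says the drop is real; it doesn't say how big it is. Alongside the test, a simple effect estimate keeps the business impact in view. A sketch reusing the arrays above:
# Rough effect estimate: relative change in average units sold
mean_control = control_sales.mean()
mean_raised = raised_price_sales.mean()
pct_change = (mean_raised - mean_control) / mean_control * 100
print(f"Average sales changed by {pct_change:.1f}% at the higher price")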
Scenario 4: The Retention Conundrum
I once tackled a scenario where a newly launched “recommended articles” widget was believed to keep users on a platform longer.
We tested:
- Before data: user session lengths pre-widget.
- After data: same user group, now with widget.
- Paired T-test because it’s the same individuals.
import numpy as np
from scipy.stats import ttest_rel
# Average session length per user, before vs. after the widget
session_before = np.array([4.5, 5.0, 3.8, 4.2, 4.1])
session_after = np.array([5.2, 5.6, 4.5, 5.3, 5.1])
t_stat4, p_val4 = ttest_rel(session_before, session_after)
print("T-stat4:", t_stat4)
print("P-value4:", p_val4)
If p_val4 slumps below 0.05, the widget likely improved retention; if it sits well above that, the difference is too murky to act on. We found p_val4 = 0.01, which suggested a genuine uptick in session duration.
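Because the test is paired, it can also help to look at the per-user differences directly; that is what ttest_rel is built on. A sketch reusing the arrays above:
# Per-user change in session length (the quantity ttest_rel actually tests)
diffs = session_after - session_before
print("Per-user change:", diffs)
print("Average change:", diffs.mean())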
Scenario 5: The Product Quality Check
In manufacturing, minor process tweaks can reduce defect rates.
We tested the old vs. new method:
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
counts = np.array([10, 4]) # Defects in old vs. new
nobs = np.array([200, 200]) # Each run had 200 items
stat5, p_val5 = proportions_ztest(counts, nobs)
print("Z-stat:", stat5)
print("P-value:", p_val5)
A small p_val5 (like 0.01) => the new approach probably yields fewer defects. A big p_val5 (like 0.4) => not enough proof. Ours was 0.005, so we embraced the updated process.
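To make that result concrete, I also like to report the raw defect rates the z-test is comparing. A quick sketch with the same counts:
# Observed defect rates behind the z-test
rate_old, rate_new = counts / nobs
print(f"Old process defect rate: {rate_old:.1%}")
print(f"New process defect rate: {rate_new:.1%}")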
Key Steps for Non-Tech Readers
- Formulate a Question: “Is there a difference?”
- Define Hypotheses: H0 (no difference), H1 (some difference).
- Select a Test & Gather Data: T-tests for continuous metrics, proportion tests for pass/fail.
- Run the Test: Python’s SciPy, Pandas, and NumPy handle the math.
- Interpret p-value: If it’s below α (0.05 or 0.01), we generally call it significant.
- Act: Choose the better or keep tinkering.
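Put together, the whole loop fits in a few lines. A minimal end-to-end sketch with made-up numbers (group_a and group_b are hypothetical metrics, not data from the scenarios above):
import numpy as np
from scipy.stats import ttest_ind

# 1-2. Question and hypotheses: H0 says A and B perform the same
group_a = np.array([10, 12, 9, 11, 10, 13])
group_b = np.array([13, 15, 12, 14, 16, 13])

# 3-4. Select a test and run it
alpha = 0.05
t_stat, p_val = ttest_ind(group_a, group_b)

# 5-6. Interpret and act
if p_val < alpha:
    print(f"p = {p_val:.3f}: significant difference, consider adopting B")
else:
    print(f"p = {p_val:.3f}: no clear difference, keep testing")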
Understanding p-values & Alpha Levels
- p-value < 0.05: If nothing had really changed, results at least this extreme would appear less than 5% of the time. Often deemed "significant."
- p-value < 0.01: The same idea with a stricter bar: less than 1%.
- p-value near 0.5: Results like yours would be entirely unremarkable if nothing had changed. Not strong evidence of any effect.
A minuscule p-value doesn’t guarantee a massive effect — it just implies we’re fairly sure something’s not random.
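One way to keep effect size in the picture is a standardized measure such as Cohen's d. It isn't used in the scenarios above, but it's easy to compute by hand; a sketch using the Scenario 1 email data:
import numpy as np

def cohens_d(x, y):
    # Cohen's d for two independent samples, using the pooled standard deviation
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

clicks_A = [12, 14, 8, 10, 9, 11, 15, 13]
clicks_B = [16, 18, 14, 17, 19, 20, 15, 18]
print("Cohen's d:", cohens_d(clicks_B, clicks_A))  # positive => B's mean is higher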
My Personal Takeaways
- Confidence: Data science classes showed me the underlying equations, but real-world applications hammered the lesson home.
- Clarity: Instead of heated debates, we rely on numbers.
- Actionable: T-statistics, p-values — they inform us whether to adopt or abandon.
- Continuous Learning: Some data sets require different tests. But the pattern remains: form a question, gather numbers, interpret results, proceed.
Final Thoughts
Hypothesis testing turns uncertainty into clarity. It answers the question: “Is this truly better?” Instead of relying on gut instinct, use data. Python’s tools — NumPy, Pandas, SciPy — handle the calculations, so you can focus on the decisions that matter.
When in doubt, test. Gather data, analyze results, check the p-value, then act with confidence. No more guesswork — just smarter choices.