
How to Improve Decision-Making with Hypothesis Testing in Python
Learn how hypothesis testing in Python can help you make confident, data-driven choices.
By Daniel Builescu
In the whirlwind of corporate or product decisions, we commonly ask:
- “Does our fresh marketing spin actually surpass the old?”
- “Will a different courier unequivocally reduce shipping duration?”
- “Might a marginal price bump hurt sales, or can we proceed unscathed?”
Rather than relying on pure instinct, hypothesis testing brings method to the madness: collect data, run it through statistical checks, and find out whether an apparent difference stands on real ground or on fleeting randomness.
Core Concepts in Simple Terms
- Null Hypothesis (H0): The staid baseline — “no difference,” “no effect,” or “everything’s the same.” E.g., “Email A equals Email B’s performance.”
- Alternative Hypothesis (H1): The claim that something actually changes or diverges.
- p-value: A number between 0 and 1 giving the probability of seeing results at least this extreme if H0 were true. A low p-value suggests your data isn't mere random happenstance.
- Significance Level (α): The cutoff for risk tolerance. 0.05 = 5% risk, 0.01 = 1% risk, etc. If p-value < α, we label the result “significant.”
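In code, that decision boils down to a single comparison. A minimal sketch, with a made-up p-value standing in for real test output:
alpha = 0.05    # risk tolerance chosen before running the test
p_value = 0.03  # hypothetical p-value from some statistical test
if p_value < alpha:
    print("Significant: results this extreme would be unlikely if H0 were true")
else:
    print("Not significant: the data is consistent with H0")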
Scenario 1: Testing Two Email Designs
New job, naive marketing team. They presented me with two email template variants: A vs. B. Conventional practice? Pick whichever resonates personally.
But I proposed an A/B check:
- Send Email A to half the recipients.
- Send Email B to the remaining.
- Measure clicks. Then see if we detect a genuine difference.
1. Gather Data
import numpy as np
# A vs. B performance data
clicks_A = np.array([12, 14, 8, 10, 9, 11, 15, 13])
clicks_B = np.array([16, 18, 14, 17, 19, 20, 15, 18])
Each array lists the clicks from separate mini-samples. We want to see whether B truly edges out A in mean performance.
2. Apply a T-test
from scipy.stats import ttest_ind
t_stat, p_val = ttest_ind(clicks_A, clicks_B)
print("T-statistic:", t_stat)
print("P-value:", p_val)
- T-statistic: Magnitude of difference relative to inherent variability.
- p_val: If it skulks below 0.05, the gap is unlikely to be chance; given B's higher average clicks, that suggests B genuinely outperforms A.
3. Interpret
p_val < 0.05 => "Statistically significant": the difference is unlikely to be chance alone.
p_val >= 0.05 => Not significant: the difference could plausibly be random noise.
We discovered B hammered A. We deployed B wholeheartedly.
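One caveat before moving on: by default, ttest_ind treats both groups as having equal variances. If that assumption feels shaky, a common alternative is Welch's t-test, enabled with equal_var=False. A quick sketch, reusing clicks_A and clicks_B from above:
# Welch's t-test: drops the equal-variance assumption of the standard t-test
t_stat_w, p_val_w = ttest_ind(clicks_A, clicks_B, equal_var=False)
print("Welch T-statistic:", t_stat_w)
print("Welch P-value:", p_val_w)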
Scenario 2: Shipping Method Check
Speed matters. Our operations lead discovered a fresh courier and hypothesized it would slash delivery times. But forging a partnership with them would cost money.
We tested:
- Old courier for half the orders.
- New courier for the other half.
- Track each package’s delivery days.
import pandas as pd
from scipy.stats import ttest_ind
data = {
    "method": ["old"] * 6 + ["new"] * 6,                     # courier used for each order
    "delivery_days": [5, 6, 7, 5, 6, 7, 3, 4, 4, 4, 5, 4]    # days each package took
}
df = pd.DataFrame(data)
# Split delivery times by courier
old_days = df[df["method"] == "old"]["delivery_days"]
new_days = df[df["method"] == "new"]["delivery_days"]
t_stat2, p_val2 = ttest_ind(old_days, new_days)
print("T-stat:", t_stat2)
print("P-value:", p_val2)
If p_val2 is tiny (0.01, for instance), the new courier probably does speed up deliveries. If p_val2 sits around 0.4, the difference may be illusory. Ours came in under 0.05, so we switched couriers.
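One refinement worth mentioning: our hypothesis was directional (the new courier is faster), so a one-sided test fits. SciPy's ttest_ind accepts an alternative argument in recent versions (1.6+). A sketch reusing old_days and new_days from above:
# One-sided test: H1 says the old courier's mean delivery time is greater
t_stat_dir, p_val_dir = ttest_ind(old_days, new_days, alternative="greater")
print("One-sided T-stat:", t_stat_dir)
print("One-sided P-value:", p_val_dir)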
Scenario 3: A Price Increase Experiment
Finance mulled a 5% price bump. Risk? Alienating customers.
We tested:
- Raise prices for half of the items (test).
- Keep old prices for the other half (control).
- Compare average units sold.
import numpy as np
from scipy.stats import ttest_ind
# Units sold at the old price (control) vs. the 5% higher price (test)
control_sales = np.array([100, 102, 98, 105, 99, 101])
raised_price_sales = np.array([94, 90, 95, 92, 89, 91])
t_stat3, p_val3 = ttest_ind(control_sales, raised_price_sales)
print("T-stat:", t_stat3)
print("P-value:", p_val3)
A minuscule p_val3 => a real drop in sales. A large p_val3 => maybe no significant impact. We got p_val3 = 0.02, so we concluded the hike harmed sales and pivoted, either softening the increase or pairing it with added perks.
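The p-value says the drop is real; it doesn't say how big it is. Alongside the test, a simple effect estimate keeps the business impact in view. A sketch reusing the arrays above:
# Rough effect estimate: relative change in average units sold
mean_control = control_sales.mean()
mean_raised = raised_price_sales.mean()
pct_change = (mean_raised - mean_control) / mean_control * 100
print(f"Average sales changed by {pct_change:.1f}% at the higher price")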
Scenario 4: The Retention Conundrum
I once tackled a scenario where a newly launched “recommended articles” widget was believed to keep users on a platform longer.
We tested:
- Before data: user session lengths pre-widget.
- After data: same user group, now with widget.
- Paired T-test because it’s the same individuals.
import numpy as np
from scipy.stats import ttest_rel
# Average session length per user, before vs. after the widget
session_before = np.array([4.5, 5.0, 3.8, 4.2, 4.1])
session_after = np.array([5.2, 5.6, 4.5, 5.3, 5.1])
t_stat4, p_val4 = ttest_rel(session_before, session_after)
print("T-stat4:", t_stat4)
print("P-value4:", p_val4)
If p_val4 slumps below 0.05, the widget likely improved retention; if it sits well above that, the difference is too murky to act on. We found p_val4 = 0.01, which suggested a genuine uptick in session duration.
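Because the test is paired, it can also help to look at the per-user differences directly; that is what ttest_rel is built on. A sketch reusing the arrays above:
# Per-user change in session length (the quantity ttest_rel actually tests)
diffs = session_after - session_before
print("Per-user change:", diffs)
print("Average change:", diffs.mean())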
Scenario 5: The Product Quality Check
In manufacturing, minor process tweaks can reduce defect rates.
We tested the old vs. new method:
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
counts = np.array([10, 4]) # Defects in old vs. new
nobs = np.array([200, 200]) # Each run had 200 items
stat5, p_val5 = proportions_ztest(counts, nobs)
print("Z-stat:", stat5)
print("P-value:", p_val5)
A small p_val5 (like 0.01) => the new approach probably yields fewer defects. A big p_val5 (like 0.4) => not enough proof. Ours was 0.005, so we embraced the updated process.
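To make that result concrete, I also like to report the raw defect rates the z-test is comparing. A quick sketch with the same counts:
# Observed defect rates behind the z-test
rate_old, rate_new = counts / nobs
print(f"Old process defect rate: {rate_old:.1%}")
print(f"New process defect rate: {rate_new:.1%}")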
Key Steps for Non-Tech Readers
- Formulate a Question: “Is there a difference?”
- Define Hypotheses: H0 (no difference), H1 (some difference).
- Select a Test & Gather Data: T-tests for continuous metrics, proportion tests for pass/fail.
- Run the Test: Python’s SciPy, Pandas, and NumPy handle the math.
- Interpret p-value: If it’s below α (0.05 or 0.01), we generally call it significant.
- Act: Choose the better or keep tinkering.
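Put together, the whole loop fits in a few lines. A minimal end-to-end sketch with made-up numbers (group_a and group_b are hypothetical metrics, not data from the scenarios above):
import numpy as np
from scipy.stats import ttest_ind

# 1-2. Question and hypotheses: H0 says A and B perform the same
group_a = np.array([10, 12, 9, 11, 10, 13])
group_b = np.array([13, 15, 12, 14, 16, 13])

# 3-4. Select a test and run it
alpha = 0.05
t_stat, p_val = ttest_ind(group_a, group_b)

# 5-6. Interpret and act
if p_val < alpha:
    print(f"p = {p_val:.3f}: significant difference, consider adopting B")
else:
    print(f"p = {p_val:.3f}: no clear difference, keep testing")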
Understanding p-values & Alpha Levels
- p-value < 0.05: If nothing had really changed, results at least this extreme would appear less than 5% of the time. Often deemed "significant."
- p-value < 0.01: The same idea with a stricter bar: less than 1%.
- p-value near 0.5: Results like yours would be entirely unremarkable if nothing had changed. Not strong evidence of any effect.
A minuscule p-value doesn’t guarantee a massive effect — it just implies we’re fairly sure something’s not random.
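One way to keep effect size in the picture is a standardized measure such as Cohen's d. It isn't used in the scenarios above, but it's easy to compute by hand; a sketch using the Scenario 1 email data:
import numpy as np

def cohens_d(x, y):
    # Cohen's d for two independent samples, using the pooled standard deviation
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

clicks_A = [12, 14, 8, 10, 9, 11, 15, 13]
clicks_B = [16, 18, 14, 17, 19, 20, 15, 18]
print("Cohen's d:", cohens_d(clicks_B, clicks_A))  # positive => B's mean is higher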
My Personal Takeaways
- Confidence: Data science classes showed me the underlying equations, but real-world applications hammered the lesson home.
- Clarity: Instead of heated debates, we rely on numbers.
- Actionable: T-statistics, p-values — they inform us whether to adopt or abandon.
- Continuous Learning: Some data sets require different tests. But the pattern remains: form a question, gather numbers, interpret results, proceed.
Final Thoughts
Hypothesis testing turns uncertainty into clarity. It answers the question: “Is this truly better?” Instead of relying on gut instinct, use data. Python’s tools — NumPy, Pandas, SciPy — handle the calculations, so you can focus on the decisions that matter.
When in doubt, test. Gather data, analyze results, check the p-value, then act with confidence. No more guesswork — just smarter choices.