The Weirdness of Data Analytics

Why does this happen?


Recently, during one of my Executive MBA classes on data analytics, I stumbled upon some fascinating and rather strange facts about data. While this wasn’t my first introduction to data analytics, I had never really paused to explore its weird and counterintuitive aspects before.

There are so many phenomena around us that we can explain with math, but the “why” behind them remains elusive. Here are a few examples that I found particularly amusing—not necessarily because they’re practical for daily life or work, but because I genuinely can’t wrap my head around why they happen. If you have an explanation, I’d love to hear it!

The Central Limit Theorem: A Strange Truth of Data

The Central Limit Theorem (CLT) is one of the most fascinating ideas in data analytics. It tells us that if you take random samples from any population—no matter how irregular the population distribution—the sample means will form a normal distribution (a bell curve) as the sample size increases.

What makes this truly fascinating is that it works even when the original population data is far from normal.
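For the mathematically inclined, here is one common way to write the result (a standard textbook statement, assuming independent observations with population mean μ and finite variance σ²):

```latex
% Classical CLT: the standardized sample mean converges in distribution to a normal,
% so for large n the sample mean itself is approximately normally distributed.
\sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr) \;\xrightarrow{\ d\ }\; \mathcal{N}\!\left(0,\ \sigma^2\right),
\qquad \text{so} \qquad
\bar{X}_n \approx \mathcal{N}\!\left(\mu,\ \frac{\sigma^2}{n}\right) \text{ for large } n.
```

In plain terms: the average of a large enough random sample behaves like a draw from a bell curve centered on the true mean, with a spread that shrinks like 1/√n.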

Example: Tree Heights in a Forest

Imagine a forest with two species of trees, each with a distinctly different average height.

If you measure the heights of thousands of trees in the forest, the data won’t follow a simple bell curve. Instead, it will look bimodal, with two peaks—one for each species.

Now, if you repeatedly take random samples of, say, 10, 30, 50, or 100 trees and calculate the average height for each sample, something amazing happens: the distribution of these sample means starts to form a normal distribution. The larger the sample size, the closer it gets to a perfect bell curve, no matter how complex the original data is.

Graph 1: Original Tree Height Distribution

The first graph shows the bimodal distribution of tree heights, reflecting two species with distinct average heights. This represents the raw data.

Distribution Of Tree Heights (Bimodal)

Graphs 2–5: Sample Means

The subsequent graphs demonstrate the CLT in action: as the sample size grows from 10 to 100 trees, the distribution of the sample means becomes progressively narrower and ever closer to a bell curve.

Distribution Of Sample Means
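If you'd like to see this without a forest at hand, here is a minimal Python sketch of the same experiment. The species means, spreads, and sample counts are made-up illustrative numbers, not values taken from the graphs above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative bimodal "forest": two species with different typical heights.
# The means and spreads are assumptions for this sketch, not data from the post.
species_a = rng.normal(loc=10, scale=1.5, size=10_000)   # shorter species (metres)
species_b = rng.normal(loc=20, scale=2.5, size=10_000)   # taller species (metres)
forest = np.concatenate([species_a, species_b])           # clearly bimodal overall

def sample_means(population, sample_size, n_samples=2_000):
    """Draw many random samples and return the mean height of each sample."""
    return np.array([
        rng.choice(population, size=sample_size, replace=False).mean()
        for _ in range(n_samples)
    ])

for n in (10, 30, 50, 100):
    means = sample_means(forest, n)
    print(f"sample size {n:>3}: mean of sample means = {means.mean():6.2f}, "
          f"spread of sample means = {means.std():5.2f}")
# The spread shrinks roughly like 1/sqrt(n), and a histogram of `means`
# looks more and more like a bell curve even though `forest` has two peaks.
```

Plotting a histogram of `means` for each sample size reproduces a progression much like the one in graphs 2–5.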

Why Is This Useful?

The CLT allows us to make reliable predictions about averages, even when the population data is messy or unpredictable. For example, a forester could estimate the average tree height of the whole forest from a modest number of random samples, and quantify how confident to be in that estimate, without measuring every tree.

Simpson’s Paradox: When Data Plays Tricks on You

Simpson’s Paradox is a fascinating phenomenon where trends observed in individual groups reverse when the groups are combined. It’s a powerful reminder that data can be misleading if we don’t analyze it at the right level. Let’s explore this with a classic example involving university admissions.

The Scenario: University Admissions

A university has two departments (A and B), and their acceptance rates for male and female applicants are as follows:

Department   Male Applicants   Male Acceptance Rate   Female Applicants   Female Acceptance Rate
A            100               80%                    400                 70%
B            300               30%                    100                 20%

Within Departments:

  • Department A: 80% of male applicants are accepted versus 70% of female applicants.
  • Department B: 30% of male applicants are accepted versus 20% of female applicants.

At first glance, it looks like males are consistently favored over females in both departments.

Combined Data:

Now, let’s calculate the overall acceptance rates. This is where the number of applicants in each department starts to play a significant role.

  • Males: 80 accepted in Department A (80% of 100) + 90 accepted in Department B (30% of 300) = 170 out of 400 applicants, or 42.5%.
  • Females: 280 accepted in Department A (70% of 400) + 20 accepted in Department B (20% of 100) = 300 out of 500 applicants, or 60%.

Surprisingly, when the data is combined, females now appear to have a higher acceptance rate (60%) than males (42.5%), reversing the trend seen within the departments!

Why Does This Happen?

The paradox occurs because the number of applicants in each department is uneven:

  • Most female applicants (400 of 500) applied to Department A, which accepts a large share of its applicants.
  • Most male applicants (300 of 400) applied to Department B, which accepts far fewer.

This imbalance skews the combined results, creating an apparent reversal.
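If you want to check the arithmetic yourself, a short Python sketch using only the numbers from the table above makes the reversal easy to reproduce:

```python
# Admissions data from the example above: (applicants, acceptance rate).
admissions = {
    "A": {"male": (100, 0.80), "female": (400, 0.70)},
    "B": {"male": (300, 0.30), "female": (100, 0.20)},
}

totals = {"male": [0.0, 0], "female": [0.0, 0]}   # [accepted, applicants]

# Within-department rates, accumulating overall totals as we go.
for dept, groups in admissions.items():
    for sex, (applicants, rate) in groups.items():
        accepted = applicants * rate
        totals[sex][0] += accepted
        totals[sex][1] += applicants
        print(f"Dept {dept}, {sex:6}: {rate:.0%} of {applicants} applicants accepted")

# Combined (aggregated) rates: the within-department trend reverses.
for sex, (accepted, applicants) in totals.items():
    print(f"Overall {sex:6}: {accepted / applicants:.1%} ({accepted:.0f} of {applicants})")
# Within each department males have the higher rate, yet overall the female rate
# (60.0%) exceeds the male rate (42.5%), because most female applicants applied
# to the department that accepts a larger share of its applicants.
```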

Lesson Learned:

Simpson’s Paradox is a critical reminder that aggregated data can be deceptive. To avoid being misled:

  • Always examine the data at the group level before drawing conclusions from the aggregate.
  • Watch for confounding variables, like the department in this example, whose sizes differ sharply across groups.

This paradox isn’t just a mathematical curiosity—it has real implications in fields like medicine, social sciences, and business. Have you encountered a similar scenario where data seemed to play tricks on you? Let me know your thoughts!

Mixing Low-Risk and High-Risk Assets: A Portfolio Advantage

In portfolio management, it’s counterintuitive but true: combining a small portion of a high-risk asset with a low-risk asset can both increase returns and reduce risk. Let’s break this down with an example using T-bills (low-risk) and small stocks (high-risk), inspired by the attached class notes.

Portfolio Setup

Let’s assume return and risk figures in line with the table below: T-bills with an expected return of about 3.75% and a standard deviation of about 3.3%, small stocks with a much higher expected return (roughly 17.5%) and far greater volatility, and a slightly negative correlation between the two assets.

We’ll explore three scenarios:

  1. 100% in T-bills (low risk).
  2. 98% in T-bills and 2% in small stocks.
  3. 50% in T-bills and 50% in small stocks.

Portfolio Expected Return and Risk

Using the formulas for portfolio expected return and variance,

E[R_p] = w_b·E[R_b] + w_s·E[R_s]
Var(R_p) = w_b²·σ_b² + w_s²·σ_s² + 2·w_b·w_s·ρ·σ_b·σ_s

we calculate the portfolio metrics for the three scenarios:

Weight in T-Bills (w_b)   Weight in Small Stocks (w_s)   Expected Return E[R]   Variance   Standard Deviation
1.00                      0.00                           0.03754                0.00111    0.03339
0.98                      0.02                           0.04029                0.00107    0.03278
0.50                      0.50                           0.10621                0.03038    0.17430

Observations

  1. Moving 2% into Small Stocks (98% T-Bills, 2% Small Stocks):
    • Expected Return increases from 3.754% to 4.029%.
    • Standard Deviation decreases slightly from 0.03339 to 0.03278.
    • This creates a portfolio with both higher returns and lower risk than 100% in T-bills.
  2. Equal Allocation (50% T-Bills, 50% Small Stocks):
    • The expected return jumps to 10.621%, but the standard deviation increases to 0.17430. This shows a higher risk-return tradeoff for more aggressive allocations.
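A quick way to experiment with these numbers is a small Python sketch of the two formulas. Note that the inputs below (T-bill return ≈ 3.75%, standard deviation ≈ 3.3%; small-stock return ≈ 17.5%, standard deviation ≈ 35%; correlation ≈ -0.10) are approximations chosen to be consistent with the table above; the exact figures in the class notes may differ.

```python
import math

# Assumed inputs, roughly consistent with the table above (not the notes' exact values).
r_b, s_b = 0.03754, 0.0334    # T-bills: expected return, standard deviation
r_s, s_s = 0.1750, 0.3500     # small stocks: expected return, standard deviation
rho = -0.10                   # assumed slightly negative correlation

def portfolio(w_b):
    """Expected return, variance, and standard deviation for a T-bill weight w_b."""
    w_s = 1.0 - w_b
    exp_ret = w_b * r_b + w_s * r_s
    var = (w_b ** 2) * s_b ** 2 + (w_s ** 2) * s_s ** 2 + 2 * w_b * w_s * rho * s_b * s_s
    return exp_ret, var, math.sqrt(var)

for w_b in (1.00, 0.98, 0.50):
    exp_ret, var, sd = portfolio(w_b)
    print(f"{w_b:.0%} T-bills: E[R] = {exp_ret:.4f}, variance = {var:.5f}, std dev = {sd:.5f}")
# With these inputs, the 98/2 mix shows a higher expected return AND a lower
# standard deviation than 100% T-bills, matching the pattern in the table.
```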

Why Does This Happen?

This phenomenon occurs because of negative correlation and the diversification effect:

  • T-bills and small stocks do not move in lockstep; when one underperforms, the other often cushions the loss.
  • When the correlation ρ is negative, the covariance term 2·w_b·w_s·ρ·σ_b·σ_s in the variance formula is negative, so a small allocation to the volatile asset can lower total portfolio variance while raising the expected return.

Key Takeaway

A carefully diversified portfolio can deliver higher returns with lower risk than holding a single low-risk asset. Even small changes in allocation, like moving 2% from T-bills to small stocks, can significantly enhance portfolio performance while maintaining minimal risk.

Have you experimented with portfolio diversification? Let me know how it worked for you!