Perhaps, above all, I like the exploratory nature of visualizing data. We must be mindful of clustering illusions and type I errors, but it’s fun to explore unbridled, feeding our intuition and the part of our brain that seeks patterns.
Let’s consider an example. Take a look at this list of numbers.
If you look at it hard enough, chances are your brain will come up with something. Don’t feel bad if it didn’t as this is just a list of 100 random numbers (technically, pseudo-random numbers).
If we chart these points, we get a representation of this randomness.
There is no pattern to be had. Charting these scattered points is not entirely useless, however.
Knowing full well that my brain will try to find a pattern, I let it play and noticed something. Nothing too surprising, but some dots appear to line up. This can easily be seen above y = 40, for example.
Well, big deal, some numbers are randomly repeated. Yes, but… now, I’m curious. How often are numbers repeated?
Let’s chart once more, this time using the frequency of occurrence on the y-axis.
This looks a little less random, doesn’t it? With this set of 100 draws, the range appears to be between 0 (the number wasn’t drawn) and 4 repetitions (in the case of number 63).
(This conjured some unrelated thoughts about the survivorship bias. I humorously imagined the top dots on the chart selling books to the lower dots about how to succeed. But I digress.)
It’s also visually obvious that there are fewer and fewer numbers as the frequency of repetition goes up. This makes sense. The odds of a number being present at least once are fairly high:
While the odds of it being repeated 100 times are virtually zero:
Every other frequency falls in between these two bounds.
Okay, cool, with 100 draws we seem to have quite a few numbers repeating 3 times, with one number even repeating 4 times.
A hypothesis emerges
We can hypothesize that as the number of draws increases, we might see a larger number of maximum repetitions as well. With 100, we comfortably hit 3 repetitions with several numbers, and we even had a 4. What happens with 1000, 10,000, 100,000 draws?
It feels like the odds of a number appearing even more frequently by chance should go up.
In other words, the maximum number of repetitions, as a function of the number of draws, should diverge. Very slowly, mind you, but it should diverge.
Hmm, slowly diverge. Logarithmically, perhaps? Just a hunch, but let’s throw caution to the wind, and hazard an even stronger hypothesis.
With n=100, observation leads us to a fairly safe guess. Namely, it’s very likely that at least one number will repeat at least 3 times. 3, in our case, happens to be .
I bet the odds aren’t too bad for either, but seems extremely likely.
So our guess/hypothesis becomes:
Given n integers randomly selected from 1 to n, there is a high probability that at least one number will be repeated at least log n + 1 times.
This would imply:
Let’s fire up good ol’ Python and see where we land.
When I ran the code (shown at the end) for n = 1000 I got the following chart.
Look at all those points on y = 4. Even is holding up. Methinks, we are onto something.
Let’s try n = 10,000.
Yup. We expected some numbers to be repeated 5 times, and we got quite a few of those.
Okay, time to bring out the big guns. n=100,000.
It looks like it still holds. There is a multitude of numbers with 6 repetitions.
OK, so we started off with random data and through visualization and a bit of intuition, we came up with a pretty neat (if conservative) rule about how often numbers repeat by chance. Along with some room to refine our guess and experiment further.
This is the essence of science. You observe phenomena, formulate a hypothesis, and then try to prove or disprove it through experimentation.
Now, before the rigorous mathematicians among my readers have a conniption, all this doesn’t prove the hypothesis above. Of course.
What does highly likely even mean in this context? Is it 90%, or 99.999% How does it vary as n varies? You’ll want to analytically verify this and calculate what the actual odds for the occurrence of at least and repetitions are.
I have not had the time to take this further step (the solution is likely not as trivial as it first looks). At the risk of enraging you with a flashback from your college days, I’m afraid I’ll have to leave this as an exercise for the reader. 😉
But this fun exploration is what got us to the hypothesis and these follow up questions in the first place. Historically, people have often figured out some mathematical relation or hint of it before they could rigorously prove it or had the exact answer.
The Babylonians come to mind, as they might have figured out the Pythagorean Theorem before they had a rigorous proof:
The famous and controversial Plimpton 322 clay tablet, believed to date from around 1800 BCE, suggests that the Babylonians may well have known the secret of right-angled triangles (that the square of the hypotenuse equals the sum of the square of the other two sides) many centuries before the Greek Pythagoras. The tablet appears to list 15 perfect Pythagorean triangles with whole number sides, although some claim that they were merely academic exercises, and not deliberate manifestations of Pythagorean triples.
Antonio Cangiano is a Software Development Manager at IBM. He authored Ruby on Rails for Microsoft Developers (Wrox, 2009) and Technical Blogging (The Pragmatic Bookshelf, 2012, 2019). He is also the Marketing Lead for Cognitive Class, an educational initiative which he helped grow from zero to over 1 Million students. You can follow him on Twitter.