Isn’t it the dream to win the lottery? Imagine what you could do with $100 million. Buy a big house, get that nice car you’ve always wanted, go on a shopping spree. All you would need are those magic numbers before they are drawn. Is there a way to predict the lottery numbers?

After learning about Symbolic Regression with Genetic Programming, one of the first things I thought about applying it towards was the lottery. I thought, if all the algorithm needed was some data points in a time series, that would fit the lottery problem perfectly. It turns out, the whole random thing put a bit of a kink in my plan.

Obviously, if the lottery could be predicted accurately and consistently, it wouldn’t be around for long, as the states would lose money. The national lotteries (in the United States) use a mechanical system, which makes them fairly random, and they use a format that makes the jackpot extremely hard to win. (As of this writing, the odds of winning the Powerball jackpot are 1 in 292,201,338.)

So then, why should I bother trying to predict it? Well, I’m not going after Powerball or Mega Millions. They are really random, and the odds are too long to eke out any kind of meaningful advantage (more on that later). Instead, I focus on the smaller state lotteries that are computer generated. Yes, if you didn’t know, many small lotteries don’t use a mechanical setup that spins the balls around. They use a computer program to randomly pick the numbers. The problem with this, if you know about software and computers, is that you can’t achieve true randomness with a computer program alone. The best you can get is pseudo-randomness.

Now, many will argue that pseudo-randomness is “good enough”, and they might be right. But when the odds are so much better than Powerball’s, I believe it might leave enough of an opening to find a pattern or strategy for predicting a drawing.

Before I get into my initial research and findings, let’s go over some “lotto picking theory”.

People actually spend a lot of time trying to find patterns in the lottery, or really anything that would allow them to predict a number or two. What has emerged is some really interesting theory on how to analyze lottery numbers.

## Draw Analysis

I live in Washington state, so let’s use the Washington Lotto, which uses a Pick 6-49 format (pick 6 numbers from 1-49).

The drawing from Monday, November 14, 2016 was: **09-19-31-32-43-46**

If I asked you, what kind of properties could you extract from this set of numbers, what would you come up with? Here’s a few:

- Sum of numbers
- Width of numbers
- Odds / Evens
- Has consecutive numbers
- Area of the Convex Hull created if the numbers were plotted on a 2D grid (for 49 numbers, a 7×7 grid works well).

Using just these five, we get the following:

[table id=2 /]

If you applied this to all drawings in the past three months, you’d get a time series of data points, which is the input model for symbolic regression.

Note that the Area of Convex Hull is an idea I came up with during my testing. I was not able to find anyone, anywhere doing this, so I presume it to be a novel approach.
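To make the five properties concrete, here is a minimal Python sketch that computes them for a single drawing. Several details are assumptions, not taken from the original tool: Width is taken to be the largest number minus the smallest, each number `k` is placed on the grid row by row (`divmod(k - 1, 7)`), and the hull is built with Andrew's monotone chain and measured with the shoelace formula. The function names are hypothetical.

```python
def to_grid(number, cols=7):
    """Map a 1-based lotto number to (x, y) on a grid with `cols` columns.
    Assumption: numbers fill the grid row by row."""
    row, col = divmod(number - 1, cols)
    return (col, row)


def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula; 0.0 for fewer than three vertices."""
    if len(vertices) < 3:
        return 0.0
    total = 0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def draw_properties(numbers, cols=7):
    """Compute the five draw properties for one drawing."""
    nums = sorted(numbers)
    odds = sum(n % 2 for n in nums)
    return {
        "sum": sum(nums),
        "width": nums[-1] - nums[0],  # assumption: max minus min
        "odds": odds,
        "evens": len(nums) - odds,
        "has_consecutive": any(b - a == 1 for a, b in zip(nums, nums[1:])),
        "hull_area": polygon_area(convex_hull([to_grid(n, cols) for n in nums])),
    }


props = draw_properties([9, 19, 31, 32, 43, 46])
print(props)
```

Running `draw_properties` over every drawing in a date range gives one row per drawing, which is the kind of time series described above.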

## Initial Experiments

So, I ran my symbolic regression tool on a bunch of time series data using properties derived from my draw analysis. I used more properties than I showed above, but you get the idea. I tweaked my tool to add a prediction tester, or what is more commonly known as out-of-sample testing. Alongside the symbolic regression outputs, I used a quick random pick generator as a benchmark.

So, did symbolic regression produce functions that predicted the next data point? In short, the results did not perform much better than the random pick generator (only ~1% better).
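For reference, a random-pick benchmark of the kind mentioned above might look like the following sketch. This is not the original tool: the function names, the seeded `random.Random` instance, and the choice to score a pick by how many numbers it shares with the actual draw are all assumptions made for illustration.

```python
import random


def random_pick(k=6, pool=49, rng=random):
    """Pick k distinct numbers from 1..pool, like a quick pick."""
    return sorted(rng.sample(range(1, pool + 1), k))


def count_matches(pick, draw):
    """How many numbers the pick shares with the actual draw."""
    return len(set(pick) & set(draw))


def average_matches(draws, k=6, pool=49, trials=1000, seed=0):
    """Average matches per drawing for random picks over many trials.
    A predictor only beats the benchmark if it scores above this."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    total = 0
    for _ in range(trials):
        for draw in draws:
            total += count_matches(random_pick(k, pool, rng), draw)
    return total / (trials * len(draws))


baseline = average_matches([[9, 19, 31, 32, 43, 46]])
print(f"random baseline: {baseline:.3f} matches per drawing")
```

Any candidate function from symbolic regression can then be scored the same way on held-out drawings and compared against `baseline`.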

Determined to find an answer, I looked at the lotto draw history. I wrote out the numbers by hand. I started drawing circles, lines, and shapes everywhere, trying to extrapolate some kind of pattern. It always looked like I was close, that there was something, but nothing definitive. I had definitely veered off from letting Genetic Programming find the solution to taking over and doing it myself.

## Digging For Gold

I eventually moved down to the Match 4 lotto. Match 4 is pick 4-24 (pick 4 numbers from 1-24), with jackpot odds of about 1 in 10,600. This is even better than the Washington Lotto, though the jackpot prize is $10,000 instead of $1,000,000 (base). But you have the chance to win Match 4 every day, as opposed to only 3 times a week for the Lotto.
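The odds quoted so far come straight from counting combinations. The short Python check below uses `math.comb` (Python 3.8+). The Powerball format of 5 numbers from 69 plus 1 Powerball from 26 is not stated in this post and is brought in here as an outside fact; the Washington Lotto line only counts the possible 6-of-49 combinations and says nothing about how many plays a ticket buys.

```python
from math import comb

# Match 4: choose 4 numbers out of 24
match4_combos = comb(24, 4)

# Washington Lotto: choose 6 numbers out of 49
lotto_combos = comb(49, 6)

# Powerball (assumed format): 5 of 69 white balls times 1 of 26 Powerballs
powerball_combos = comb(69, 5) * 26

print(f"Match 4:   1 in {match4_combos:,}")     # 1 in 10,626
print(f"Lotto:     1 in {lotto_combos:,}")      # 1 in 13,983,816
print(f"Powerball: 1 in {powerball_combos:,}")  # 1 in 292,201,338
```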

Here I thought with Match 4’s much better odds, maybe the Genetic Programming algorithm had a better chance to find the function with my limited hardware (I’m running an old Athlon X6 1055T, 8.0 GB RAM). The results were the same; symbolic regression predicted no better than random guessing.

So again, I played with the numbers by hand, but then I looked at a different draw property, something I hadn’t focused on before. It’s called Hot Numbers.

The critics out there are now really shaking their heads and throwing up their hands, thinking I’m a lost cause at this point, but bear with me.

When I say hot numbers, in general, I mean the set of numbers that have been drawn in the past *n* drawings, where *n* can be anywhere from 2 to 10. The numbers that are left are considered Cold Numbers.
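In code, that definition is just a set union over the last *n* drawings. Here is a minimal Python sketch; the function names are hypothetical, the history is assumed to be ordered oldest first, and the sample drawings are made up for illustration.

```python
def hot_numbers(history, n):
    """Numbers drawn in the last n drawings (history is oldest first)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    hot = set()
    for draw in history[-n:]:
        hot.update(draw)
    return hot


def cold_numbers(history, n, pool=24):
    """Every number in 1..pool that is not currently hot."""
    return set(range(1, pool + 1)) - hot_numbers(history, n)


# Made-up Match 4 style drawings, oldest first
history = [
    [3, 8, 15, 22],
    [1, 8, 12, 19],
    [5, 12, 19, 24],
]

hot = hot_numbers(history, n=2)
cold = cold_numbers(history, n=2)
print(sorted(hot))  # [1, 5, 8, 12, 19, 24]
print(len(cold))    # 18
```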

For Match 4, I experimented with *n* to see what worked best for my new strategy. Ideally, the set of hot numbers is much smaller than the set of cold numbers, and for computer generated lotteries, they tend to draw hot numbers more often than cold numbers (e.g. Powerball and Mega Millions). This appeared to be the slight advantage I was looking for. So, using just the Hot Number property, I analyzed over a year of drawings to see how many had 0, 1, 2, 3, or 4 hot numbers. This was the result:

Unique Numbers (Start Date: 11/21/2015, End Date: 2/4/2016):

| Window Size | Window Count | Avg | Max | Min | All | {UQ/ALL} |
|---|---|---|---|---|---|---|
| 6 | 70 | 16.3428571428571 | 20 | 12 | 24 | .681 |
| 5 | 71 | 14.7887323943662 | 18 | 12 | 20 | .7395 |
| 4 | 72 | 12.8055555555556 | 16 | 10 | 16 | .8000 |
| 3 | 73 | 10.4246575342466 | 12 | 7 | 12 | .868 |
| 2 | 74 | 7.51351351351351 | 8 | 5 | 8 | .938 |

The data above is the result of sliding a window of *n* drawings (the “Window Size”) across the draw history and recording the Max, Min, and Average number of unique (hot) numbers per window. At a Window Size of 6, each window covers 24 drawn numbers (6 drawings × 4 numbers, the All value), and the average size of the hot number set was ~16. That means if I were to buy a ticket for every combination of hot numbers, I would need C(16,4) = 1,820 tickets. At $2 per ticket, that’s $3,640, a bit too much to spend on a daily basis. What about *n* = 2? Even with an average of 8 hot numbers, C(8,4) = 70, so 70 * $2 = $140 a day, which is still pricey.
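The ticket arithmetic is easy to check with a few lines of Python. This is only a sketch: `cover_cost` is a hypothetical helper name, and the $2 price and pick size of 4 are the figures quoted in this post.

```python
from math import comb


def cover_cost(hot_count, pick=4, price=2):
    """Tickets and dollars needed to cover every combination of the hot set."""
    tickets = comb(hot_count, pick)
    return tickets, tickets * price


for hot_count in (16, 13, 8):
    tickets, dollars = cover_cost(hot_count)
    print(f"{hot_count} hot numbers: {tickets:,} tickets, ${dollars:,}")
```

Because the ticket count grows with C(n, 4), trimming even a few numbers from the hot set cuts the daily cost sharply.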

The next step is to find out the optimal value for *n* with respect to the average set size for Hot Number.

Hot Numbers (Start Date: 1/1/2015, End Date: 2/4/2016):

| Window Size | Window Count | Avg | Max | Min | (Cnt:Tot:%) |
|---|---|---|---|---|---|
| 5 | 395 | 2.39240506329114 | 4 | 0 | (0:7:0.01772152), (1:61:0.1544304), (2:140:0.3544304), (3:144:0.364557), (4:43:0.1088608) |
| 4 | 396 | 2.02272727272727 | 4 | 0 | (0:22:0.05555556), (1:90:0.2272727), (2:166:0.4191919), (3:93:0.2348485), (4:25:0.06313131) |
| 3 | 397 | 1.63476070528967 | 4 | 0 | (0:46:0.115869), (1:126:0.3173803), (2:160:0.4030227), (3:57:0.1435768), (4:8:0.02015113) |
| 2 | 398 | 1.14321608040201 | 4 | 0 | (0:98:0.2462312), (1:175:0.4396985), (2:96:0.241206), (3:28:0.07035176), (4:1:0.002512563) |
| 1 | 399 | 0.606516290726817 | 3 | 0 | (0:195:0.4887218), (1:168:0.4210526), (2:34:0.08521304), (3:2:0.005012531), (4:0:0) |

A description of the report above:

- Window Size: This is *n*.
- Window Count: The number of drawings evaluated between Start Date and End Date, which is every drawing in the range that has *n* prior drawings available.
- Avg: The average number of Hot Numbers for each drawing.
- Max: The maximum number of Hot Numbers to show up for any drawing.
- Min: The minimum number of Hot Numbers to show up for any drawing.
- (Cnt:Tot:%): A breakdown of drawings by how many Hot Numbers they contained. Cnt is the number of Hot Numbers in a drawing, Tot is how many drawings had that count, and % is their share of all drawings in the time frame. For example, with a Window Size of 2, 175 drawings had exactly 1 Hot Number, which is 43.97% of the drawings: (Cnt:Tot:%) => (1:175:0.43969…)
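A report like this can be produced with a short loop over the draw history. The Python below is a sketch of one way to do it, not the original tool: the function name and dictionary keys are hypothetical, each drawing is compared against the hot set built from the *n* drawings before it, and the sample data is made up.

```python
from collections import Counter


def hot_hit_report(draws, n):
    """For each drawing with n prior drawings, count how many of its
    numbers were hot, then summarize the counts like the report above."""
    if n < 1 or n >= len(draws):
        raise ValueError("need 1 <= n < number of drawings")
    pick = len(draws[0])
    hits = []
    for i in range(n, len(draws)):
        hot = set().union(*draws[i - n:i])  # numbers from the n prior drawings
        hits.append(len(hot & set(draws[i])))
    counts = Counter(hits)
    total = len(hits)
    return {
        "window_size": n,
        "window_count": total,
        "avg": sum(hits) / total,
        "max": max(hits),
        "min": min(hits),
        # (Cnt, Tot, %) for every possible hot-number count 0..pick
        "cnt_tot_pct": [(k, counts[k], counts[k] / total) for k in range(pick + 1)],
    }


# Made-up drawings, oldest first
draws = [[1, 2, 3, 4], [1, 2, 5, 6], [5, 6, 7, 8], [9, 10, 11, 12]]
report = hot_hit_report(draws, n=1)
print(report)
```

Calling `hot_hit_report` once per window size gives one row of the table per call.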

Based on this data, I zeroed in on a Window Size (*n*) of 4 because there were 25 occurrences of all 4 numbers being hot numbers. The average hot set size for *n* = 4 was about 13, which means C(13,4) = 715 tickets * $2/ticket = $1,430 per drawing. But if I could predict when those 25 four-Hot-Number drawings would occur, the cost wouldn’t matter ($10,000 > $1,430).

So I charted the 4 Hot Number occurrences to try and find a pattern. There seemed to be a pair pattern where, when one 4 Hot Number draw occurred, a second one occurred in the next few days. This wasn’t consistent enough to be a predictor though. Then there was something odd about the Sum of Line property in comparison to the previous draw. Many of the times that a 4 Hot Number draw occurred, its Sum of Line was greater than the one before it. This wasn’t 100% of the time, but it happened often enough that I thought I could test a strategy around taking the 4 Hot Number draws and filtering them down based on their Sum of Line to reduce the daily costs.

At this point, my goal changed from trying to predict a jackpot, to going after profit. Could I come up with a strategy for daily play that minimized my costs, but came away with enough jackpots to turn a profit over time?

To cut this short and skip more details, I did come up with a strategy that simulated a profit from 2013 to 2016. Prior to 2013, there was a distinct change in my simulation results that always produced a loss. My only conclusion was that the Lottery’s number generator changed sometime in 2013 to make this strategy start working from 2013 onward.

The strategy went like this:

1. Get the set of hot numbers with *n* = 4.
2. Generate all ticket combinations from the set of hot numbers derived in (1).
3. Find the Sum of Line value for each ticket.
4. Find the Area of the Convex Hull for each ticket.
5. Filter all tickets from (2) such that: Sum Of Line (Min, Max) => (32, Prior Sum), and Area (Min, Max) => (2, 3.5)
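The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the code that produced the results: the grid layout for the 24 numbers (6 columns, filled row by row) is a guess, the Min/Max bounds are treated as inclusive, and all function names are hypothetical.

```python
from itertools import combinations


def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_area(numbers, cols=6):
    """Convex hull area of the numbers on a grid (assumed: 6 columns,
    row by row), using Andrew's monotone chain and the shoelace formula."""
    pts = sorted({((k - 1) % cols, (k - 1) // cols) for k in numbers})
    if len(pts) < 3:
        return 0.0
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    hull = lower[:-1] + upper[:-1]
    total = 0
    for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def candidate_tickets(hot, prior_sum, sum_min=32, area_min=2.0, area_max=3.5):
    """Steps 2-5: every 4-number ticket from the hot set whose Sum of Line
    is within [sum_min, prior_sum] and whose hull area is within
    [area_min, area_max] (inclusive bounds are an assumption)."""
    tickets = []
    for ticket in combinations(sorted(hot), 4):
        if not sum_min <= sum(ticket) <= prior_sum:
            continue
        if area_min <= hull_area(ticket) <= area_max:
            tickets.append(ticket)
    return tickets


# Made-up hot set and prior Sum of Line, for illustration only
tickets = candidate_tickets({8, 10, 14, 16, 21}, prior_sum=50)
print(len(tickets), "tickets to play")
```

The hot set itself (step 1) would come from the previous 4 drawings, and `prior_sum` from the Sum of Line of the most recent drawing.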

Applying this strategy to Match 4 over the time frame 1/15/2015 – 2/5/2016 gave the following results:

- Total Winnings: $68,326
- Total Costs: $51,478
- Profit: $16,858

I tweaked the parameters and got even greater profit, but at greater cost. The best run produced $53,128 in profit at a cost of $94,286. I also ran this on different time frames, such as all of 2015, 2014, 2013, and 2012.

Since the strategy didn’t work prior to 2013, I figured that if the number generator had changed before, it could change again, so I didn’t bother trying it. Also, having to spend $50,000 – $100,000 a year to make a profit seemed a bit crazy to me, and I wasn’t going to test it with real money to find out if it would work.

Now, if I could figure this out for the lottery, predicting the stock market should be a lot easier, right?