Better thinking through technology

How computationally demanding is baseball?

I wanted to toss something out there for folks to ponder. The MLB model, which was built by letting a computer system assemble its own vocabulary from studying MLB data, is a beast.

For reference, these are the data storage requirements of the other bots:

NFL
31 Mb of data, spread out over 24 database tables. Contains data from 1,537 games.

The NFL model is only that demanding because it is trying to predict the future as far as eight weeks in advance. It would be much smaller if not for the accelerated model.

NBA
18.5 Mb of data, spread out over 6 database tables. Contains data from 9,239 games.

Yup, that NBA bot is lean.

NCAA football
43 Mb of data, spread out across 15 tables. Contains data from 2,944 games.

7.5 Mb of data, spread out across 6 tables. Contains data from 4,201 games.

NCAA basketball
270 Mb of data, spread out across 5 tables. Contains data from 23,940 games.

It's worth noting here that the NCAA BB bot gets this big because projecting possible outcomes for nearly 300 different teams produces a 300 by 300 grid. On balance, the NCAA BB bot is fairly lean when accounting for the massive number of potential games it has to anticipate. Not that you really need a projection for Vermont playing against Kentucky, but if you ever did, the bot's willing to make an honest try.
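For a sense of scale, here's a back-of-the-envelope count of that grid (the "nearly 300" figure is from above; 300 is used here as a round number):

```python
# Size of the NCAA basketball projection grid described above.
# "Nearly 300" teams is the post's figure; 300 keeps the math round.
teams = 300

# Every team projected against every other team yields a 300 x 300 grid;
# subtracting the diagonal removes teams playing themselves.
potential_matchups = teams * teams - teams

print(potential_matchups)  # 89,700 potential pairings
```

Nearly ninety thousand pairings, the overwhelming majority of which will never be scheduled.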


MLB
493.5 Mb spread out across 11 tables. Contains data from 10,291 games.

When you realize that baseball has one-tenth as many teams as NCAA basketball, that's just an insane figure for storage overhead. Why is it so bad?

1. Pitchers
If you realize that about 180 starting pitchers have played significant roles in MLB games since the middle of last season, then you start seeing that the numbers look a lot closer to the NCAA BB model than might first be anticipated.

2. Data Points
Because the bot was self-taught about what data matters, there's a lot less pruning involved.

The current version of the bot employs slightly more than 1,300 data points from each game. Mind you, these data points represent a compressed version of the 180 to 250 events that take place over the course of a game. Those events are described in the MLB XML files by a ridiculous number of descriptors, including x-y-z data for pitch locations, spin rates, spin directions and more and more and more.

In fact, the bot's first data sheet compressed the game into 6,400 data points. That was proving to be computationally unrealistic, so I lobotomized the poor bastard, trimming that down to the 1,300 data points that I couldn't do without.
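As a loose illustration of what that kind of compression looks like, here's a toy sketch: collapsing per-pitch event records into a small per-game summary. Every field name below is invented for illustration; the real MLB XML uses its own descriptors.

```python
# Hypothetical sketch: collapsing per-pitch event records into a compact
# per-game feature summary. Field names are invented for illustration;
# the real MLB XML carries its own descriptors (pitch x-y-z, spin, etc.).
from statistics import mean

events = [
    {"type": "pitch", "speed": 94.2, "spin_rate": 2310},
    {"type": "pitch", "speed": 88.7, "spin_rate": 2480},
    {"type": "pitch", "speed": 95.1, "spin_rate": 2295},
]

# Each of the ~200 raw events in a real game carries dozens of
# descriptors; the model keeps only the aggregates it can't do without.
features = {
    "n_pitches": len(events),
    "avg_speed": round(mean(e["speed"] for e in events), 1),
    "avg_spin": round(mean(e["spin_rate"] for e in events)),
}
print(features)
```

The real pipeline obviously keeps far more than three aggregates, but the principle is the same: hundreds of richly described events boiled down to a fixed-width row per game.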

Before all is said and done, those will be boiled down to 19 data points, including team-pitcher data, team data and projected clusters for both the home and away teams. Arrange that as 180 pitchers x 30 teams, then repeat going the other way, 30 teams x 180 pitchers, to generate home and away predictions for how Pitcher X will do against Team Y and how Pitcher Y will do against Team X.

Final tally? More than 370,000 different possible outcomes.
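As a quick sanity check on the grid described above (counts taken from the post; the per-matchup outcome clusters that push the total past 370,000 aren't enumerated here):

```python
# Sketch of the pitcher-vs-team matchup grid described above.
# The outcome clusters layered on each matchup, which drive the total
# past 370,000, aren't broken out in the post, so only the grid is shown.
pitchers = 180  # starting pitchers in significant roles
teams = 30      # MLB teams

# Pitcher X vs. Team Y, then the reverse direction for the opponent:
home_projections = pitchers * teams
away_projections = teams * pitchers
total_matchups = home_projections + away_projections

print(total_matchups)  # 10,800 directional pitcher-vs-team matchups
```

That's the grid before any outcomes are projected onto it; the 370,000+ figure is what falls out once each matchup carries its projected clusters.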

Intriguingly, the league-wide model is super slim. Despite using an 11x11 set of clusters, it only produces 71 combinations. Despite the possibility that an additional 50 outcomes might occur, none of them have happened since the beginning of the 2011 season.
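A minimal sketch of that cluster bookkeeping (the 71 observed combinations are the post's figure; the rest follows from the 11x11 grid):

```python
# The league-wide model's cluster grid: 11 clusters per side.
clusters_per_side = 11
possible_combinations = clusters_per_side ** 2  # 121 in theory

observed_combinations = 71  # combinations actually seen since 2011
never_observed = possible_combinations - observed_combinations

print(possible_combinations, observed_combinations, never_observed)
# 121 possible, 71 observed, 50 never seen in 4+ seasons
```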

Yes, baseball truly is a game of infinite possibilities. So many, in fact, that many potential games don't even occur during the course of a 4+ year span.

That's a hell of a lot of data just to produce between three and ten bets a day.