Baseball API
Yesterday I was hardening the negroleagues endpoints and adding materialized views. I also made a banner for the project with Canva. I'm really liking how this project is turning out. My "vision" is for it to be very similar to Nominatim, in that it pulls from openly available data and provides a phenomenal way to prototype and learn. Nominatim places some reasonable restrictions on its public API (it's really made for their own UI) but offers a detailed guide on self-hosting.
Based on their example, I implemented rate limiting & caching pretty early in this project's development.
This API draws primarily from these sources:
Lahman (Society for American Baseball Research)
MLB Stats API (the integration used the wiki for the eponymous third-party Python library)
Fangraphs Guts! for park factors
There's a small ETL "pipeline"/CLI toolbox I made for this. I'm considering making a little data viewer TUI as well.
Math
Here I'd like to share a handful of formulas that are baked into the API's repository layer. There are a ton of things I could dive into, so let me know if you'd like me to elaborate on any of them or talk in more detail!
Most counting stats are precomputed in these datasets, but some commonly used stats require a little computation: OPS (on-base plus slugging) is a simple sum, while OPS+ also requires a park factor to correct for offensive environments (the extremes being mostly Seattle and Colorado).
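For reference, the conventional public definitions (the repository's exact expressions may differ in detail):

```latex
\mathrm{OPS} = \mathrm{OBP} + \mathrm{SLG},
\qquad
\mathrm{OPS^{+}} = 100 \cdot \left( \frac{\mathrm{OBP}}{\mathrm{lgOBP}} + \frac{\mathrm{SLG}}{\mathrm{lgSLG}} - 1 \right)
```

where the league OBP and SLG terms are the ones adjusted by the park factor, which is how OPS+ corrects for offensive environment.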
wOBA (weighted on-base average), wRC (weighted runs created), wRAA (weighted runs above average), and WAR (wins above replacement) required the most careful database work.
If you couldn't guess, this was not easy to debug.
Here the wOBA scale comes from a season-specific constant chosen so that league wOBA ≈ league OBP.
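For context, the standard FanGraphs-style linear-weights formulation looks like this (the season-specific weights $w_x$ come from that year's run environment; I'm quoting the public form, not necessarily the repository's exact expression):

```latex
\mathrm{wOBA} = \frac{w_{BB}\,\mathrm{uBB} + w_{HBP}\,\mathrm{HBP} + w_{1B}\,\mathrm{1B} + w_{2B}\,\mathrm{2B} + w_{3B}\,\mathrm{3B} + w_{HR}\,\mathrm{HR}}{\mathrm{AB} + \mathrm{BB} - \mathrm{IBB} + \mathrm{SF} + \mathrm{HBP}}
```

and the wOBA scale is what converts wOBA back into runs:

```latex
\mathrm{wRAA} = \frac{\mathrm{wOBA} - \mathrm{lgwOBA}}{\mathrm{wOBA\ scale}} \times \mathrm{PA}
```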
I also added a leverage index, which for a given game state "is computed based on the potential swing in win expectancy." You really will not believe how much data Retrosheet gives you. It's incredible.
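One common formulation of leverage index (Tom Tango's): the expected absolute swing in win expectancy from the current inning/score/base-out state, normalized by the average swing across all states, so that LI = 1 is an average-pressure situation. This is the textbook definition, not necessarily the API's exact computation:

```latex
\mathrm{LI}(s) = \frac{\mathbb{E}\bigl[\,\lvert \Delta \mathrm{WE} \rvert \;\big\vert\; \text{state } s\,\bigr]}{\mathbb{E}\bigl[\,\lvert \Delta \mathrm{WE} \rvert\,\bigr]}
```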
Partitioning
Before working on the Negro League and Federal League endpoints, much of my work was isolated to the last few seasons, because they have the most complete play-by-play data (and I could confirm I'd seen some of the games I was looking into). This, a couple of indexes I added after the fact, and some more personal issues are why I delayed the deployment and "moved my right field wall." Once the database contains closer to the full range of available data, these kinds of database changes get riskier and harder to debug.
The Implicit Date Filter
To really take advantage of the partitions and stop the Postgres planner from checking every partition, we add a date filter whenever a league is explicit and bounded to a known range, so the database only touches the right partitions. This doesn't happen for the AL or NL league IDs because they span the whole database (I'm banking heavily on the assumption that games will be queried rather than the plays endpoint being called directly).
Basically, this exists because, as it stands now, partitioning is a bit of a premature optimization.
What's the use case for this?
I'm excited to see what use cases people come up with for this API. There are some neat projects built with the MLB Stats API, like a terminal version of Gameday. I made some small examples with Go's built-in templating and alpine.js + chart.js; they're simple views of the returned data.
There's so much for me to talk about with this project. The README talks about some of the "cooler" endpoints and running the application locally. Speaking of locally, I don't think I can procrastinate deployment any further and will be working on solidifying the containerization. I'll probably have to implement local auth too.
Beacon
Nothing to report here aside from the fact that I plan to get back into the swing of this project once the baseball API is released in earnest.
If you have any thoughts or questions, feel free to comment or DM me on bluesky @desertthunder.dev!