FFDB, my local Statcast database, is now on GitHub

baseball
programming
Author

Harper Wicker-Lenseigne

Published

May 21, 2026

tl;dr:

I’m publishing the code behind my local Statcast database, which I’m calling FFDB (because it’s Four-seam Fast), in case anyone else would like to use it as well. You can find it on GitHub. Instructions for usage can be found there! Enjoy! Contact me if you have any questions. I’m happy to help!

Overview

One of the first things that really drew me to baseball was sabermetrics. I spent hours experimenting with the MLB API, creating rudimentary Python scripts to loop through game log JSON files that I had downloaded. Of course, this was wildly inefficient. So frustratingly inefficient, in fact, that the struggle it presented led to a year-long on-and-off journey of creating and iterating on what is now FFDB. I had originally just created this as an internal tool for myself to play with, but after putting some decent time into making it faster, cleaner, and more efficient, I figured I may as well put it out there, in case it helps anyone who was in the same situation I was.

If you have any experience with the MLB game data API, the schema I’ve defined should feel mostly familiar – the major conceptual difference is that I’m using a relational database-style setup instead of nesting JSON objects. Using SQL queries to interact with the data is much more effective, and having all the data stored locally makes it significantly faster as well. Admittedly, it does take up a lot of space – the JSON files from 2008 to 2026 are currently taking up more than 60 GB on my hard drive right now. However, you can delete the JSON files after generating the Parquet files (which are ~30x smaller!), but this means you can’t regenerate the Parquet files if a change to the schema is made without redownloading all of the JSON files again.

Speaking of downloading all the JSON files, I would like to mention that you should be somewhat careful downloading the JSON files from the MLB API. While I haven’t personally experienced this, you could risk an IP ban or rate limiting.

Note that this project is certainly still a work in progress! Additionally, changes are made to the MLB API all the time, so I will likely be updating this continuously. I will do my best to avoid any changes that aren’t backwards-compatible, but no promises…

Example queries

Get the average velocity of Paul Skenes’ four-seam fastballs in the 2024 regular season:

SELECT AVG(start_speed)                    -- average the speed column in our results
FROM events e                              -- grab rows from the pitches column
JOIN games g ON g.game_id = e.game_id      -- add info from the corresponding game row
JOIN ref.players rp ON rp.id = e.pitcher   -- add info from the player row that matches the pitcher's ID
WHERE rp.full_name = 'Paul Skenes'         -- match the name of the pitcher
    AND g.game_type = 'R'                      -- regular season only
    AND g.season = 2024                        -- filter by season for only 2024
    AND e.pitch_type = 'FF'                    -- filter by pitch type for four-seamers

Get the team OPS for each team in 2025:

SELECT
    rt.team_name,                                                                     -- label each row with the proper team
    SUM(bl.hits + bl.base_on_balls + bl.hit_by_pitch)                                 -- OBP calculation
        / SUM(bl.at_bats + bl.base_on_balls + bl.hit_by_pitch + bl.sac_flies) AS obp,
    SUM(bl.total_bases) / SUM(bl.at_bats) AS slg,                                     -- SLG calculation
    obp + slg AS ops                                                                  -- add 'em together
FROM batting_logs bl                                                                  -- easiest way to aggregate offensive stats
JOIN games g ON bl.game_id = g.game_id                                                -- get game-level info like season
JOIN player_logs pl ON pl.game_id = bl.game_id AND bl.player = pl.player              -- get player context info from ID, specifically their team ID
JOIN ref.teams rt ON pl.parent_team_id = rt.id                                        -- get team info from ID, specifically the name
WHERE g.season = 2025
    AND g.game_type = 'R'
GROUP BY rt.team_name

Query tips

The documentation itself can be found in the GitHub repository, but I’m going to put some tips for building queries on this page as well.

Regular season “game-type” filtering

The game file downloader will download every “major league” game each year (more exactly, it’s downloading every game under the “sportId” 1, which corresponds to the MLB). This means it includes spring training (which you almost never want to include) and the postseason (which you often do not want to include). If you just want regular season data, you should do the following:

SELECT * 
FROM games g
WHERE g.game_type = 'R'

To more finely control the game types that are included, you can reference the relevant MLB Stats API endpoint /gameTypes.

Only returning pitches from “events” table queries

The “events” table represents pitch-level events, which usually means each row is a pitch, but not exactly. Batter timeouts, game advisories, and other supplemental pitch-level events are included as rows. This can cause issues with null values that might not be super obvious! The solution is as follows:

SELECT *
FROM events e
WHERE e.is_pitch = TRUE