
Pitch Type Classifier

Machine Learning-Powered Pitch Classification System with Interactive Data Cleaning

🎯 Project Overview

An advanced R Shiny application that automatically classifies baseball pitch types from TrackMan data using hierarchical XGBoost models, with intelligent arsenal correction and an interactive interface for manual data cleaning. The system achieves 93.6% accuracy in 5-fold cross-validation while giving analysts complete control over the final dataset through visual verification and manual pitch-tag editing.

Built to solve the real-world problem of cleaning large volumes of TrackMan data for scouting reports and machine learning model training.

93.6% 5-Fold Cross-Validation
5 XGBoost Classifiers
4D Outlier Detection
68 Engineered Features

Key Features

1. Hierarchical Classification

Five-model XGBoost architecture with specialized classifiers for different pitch type groups, achieving 93.6% accuracy in 5-fold cross-validation on Big Ten conference data.

2. Intelligent Arsenal Correction

Similarity-based centroid comparison merges pitch types with nearly identical movement profiles (within 4 mph, 4" IVB/HB, 400 RPM), further cleaning the data based on arsenal-specific context.

3. 4D Outlier Detection

Mahalanobis distance-based flagging using velocity, vertical break, horizontal break, and spin rate. An outlier is flagged only when it sits closer to another pitch type's cluster than to its own, which reduces false positives.

4. Interactive Data Explorer

Complete filtering interface with multi-select capabilities, range sliders, and "Show Only Flagged" mode. Export to CSV/Excel with one click.

5. Pitcher Movement Plot Explorer

Interactive Plotly movement charts with click-and-drag selection, team-specific and pitcher-specific filtering, and instant navigation from flagged pitches to visual context. Allows for easy manual data cleaning directly on the movement plot.

6. Flexible Editing Workflow

Full manual editing of both flagged and regular pitches in the data explorer and on movement plots. Changes can be undone through a 50-change history.
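As a rough illustration of how a 50-change undo history can work, here is a minimal Python sketch (the application itself is built in R Shiny, and its actual implementation is not shown on this page). The `EditHistory` class and its method names are hypothetical; the idea is simply a bounded stack that forgets the oldest edit once the limit is reached.

```python
from collections import deque


class EditHistory:
    """Bounded undo stack for pitch-tag edits (hypothetical helper)."""

    def __init__(self, max_changes=50):
        # deque(maxlen=...) silently drops the oldest entry once it is full
        self._changes = deque(maxlen=max_changes)

    def apply(self, tags, pitch_id, new_tag):
        """Remember the previous tag, then overwrite it."""
        self._changes.append((pitch_id, tags[pitch_id]))
        tags[pitch_id] = new_tag

    def undo(self, tags):
        """Revert the most recent edit; return False if nothing is left."""
        if not self._changes:
            return False
        pitch_id, old_tag = self._changes.pop()
        tags[pitch_id] = old_tag
        return True

    def __len__(self):
        return len(self._changes)
```

A batch edit made by click-and-drag selection would presumably be stored as one history entry holding several pitches; the sketch keeps one pitch per entry for brevity.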

🔧 Technical Implementation

Machine Learning Architecture

The classification system uses a hierarchical approach with five XGBoost models, each specialized for a different pitch type group.

After classification, an optional arsenal correction step can be applied, followed by a 4D Mahalanobis distance-based flagging step that identifies potentially mislabeled pitches. During training, rare pitch types (Splitter, Cutter, Sweeper) receive boosted class weights to address class imbalance.
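The class-weight boost for rare pitch types can be sketched as follows. This is a Python illustration only (the project is written in R), and the weighting scheme is an assumption: balanced inverse-frequency weights multiplied by a hypothetical `rare_boost` factor. The page does not state the actual weights used.

```python
from collections import Counter

RARE_TYPES = {"Splitter", "Cutter", "Sweeper"}  # rare types named in the text


def sample_weights(labels, rare_boost=2.0):
    """Per-pitch training weights.

    Assumed scheme: balanced inverse-frequency weights, multiplied by
    rare_boost (an illustrative value) for the rare pitch types.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    class_weight = {}
    for label, count in counts.items():
        w = n / (k * count)  # equals 1.0 when all classes are equally common
        if label in RARE_TYPES:
            w *= rare_boost
        class_weight[label] = w
    return [class_weight[label] for label in labels]
```

XGBoost accepts per-row instance weights at training time, so a vector like this one could be attached to the training data for each of the five models.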

Feature Engineering

The models use 68 engineered features, including movement characteristics (IVB, HB, spin), velocity profiles, pitcher-relative metrics (Z-scores, velocity ratios), release point characteristics, and spatial clustering indicators. Movement features are weighted more heavily than velocity for better pitch type discrimination.
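To make the pitcher-relative metrics concrete, here is a small Python sketch (illustrative only; the project is written in R). The exact feature definitions are not given on this page, so the two features below are assumptions: `velo_z` is a pitch's velocity Z-score within its pitcher, and `velo_ratio` is its velocity divided by that pitcher's maximum. The field names `pitcher` and `velo` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def add_pitcher_relative_features(pitches):
    """Return copies of the pitch dicts with velo_z and velo_ratio added.

    Both feature definitions are assumed for illustration.
    """
    by_pitcher = defaultdict(list)
    for p in pitches:
        by_pitcher[p["pitcher"]].append(p["velo"])

    # One (mean, std, max) triple per pitcher
    stats = {name: (mean(v), pstdev(v), max(v)) for name, v in by_pitcher.items()}

    out = []
    for p in pitches:
        mu, sd, top = stats[p["pitcher"]]
        out.append({
            **p,
            # guard against a pitcher whose pitches all share one velocity
            "velo_z": (p["velo"] - mu) / sd if sd > 0 else 0.0,
            "velo_ratio": p["velo"] / top,
        })
    return out
```

Scaling each pitch against its own pitcher means an 84 mph pitch is treated differently for a pitcher who tops out at 88 than for one who tops out at 97.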

Arsenal Correction (4D Similarity-Based)

For each pitcher with 30+ pitches, the system:

  1. Calculates 4D centroids for each pitch type (mean velocity, IVB, HB, spin rate)
  2. Compares all pairs of pitch types within the pitcher's arsenal using Euclidean distance
  3. Flags pairs as "too similar" if ALL four differences are below thresholds (4 mph, 4", 4", 400 RPM)
  4. Merges the less common type into the more common type
  5. Reduces arsenals with too many pitch types to a realistic arsenal (~1% of pitches affected)
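The steps above can be sketched in a few lines of Python (illustrative only; the project is written in R). The thresholds and the 30-pitch minimum come from the text. The field names, the repeat-until-stable loop, and the tie-breaking rule are assumptions. The sketch applies the per-dimension test from step 3 directly and does not compute the Euclidean distance from step 2.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

DIMS = ("velo", "ivb", "hb", "spin")
THRESHOLDS = {"velo": 4.0, "ivb": 4.0, "hb": 4.0, "spin": 400.0}  # from the text
MIN_PITCHES = 30


def correct_arsenal(pitches):
    """pitches: dicts for ONE pitcher, each with a 'type' and the four DIMS.

    Returns the pitch-type labels after merging too-similar types.
    """
    labels = [p["type"] for p in pitches]
    if len(pitches) < MIN_PITCHES:
        return labels

    merged = True
    while merged:  # assumed: repeat until no pair is too similar
        merged = False
        groups = defaultdict(list)
        for p, lab in zip(pitches, labels):
            groups[lab].append(p)
        # 4D centroid of each pitch type
        centroids = {
            t: {d: mean(p[d] for p in ps) for d in DIMS}
            for t, ps in groups.items()
        }
        for a, b in combinations(sorted(centroids), 2):
            too_similar = all(
                abs(centroids[a][d] - centroids[b][d]) < THRESHOLDS[d]
                for d in DIMS
            )
            if too_similar:
                # merge the less common type into the more common one
                keep, drop = (a, b) if len(groups[a]) >= len(groups[b]) else (b, a)
                labels = [keep if lab == drop else lab for lab in labels]
                merged = True
                break
    return labels
```

Each merge removes one pitch type, so the loop always terminates. Centroids are recomputed after every merge so later comparisons use the combined cluster.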

Outlier Detection & Flagging

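The system takes a conservative 4D Mahalanobis distance approach: a pitch is flagged only if it is both an outlier within its own pitch type and closer to a different pitch type's cluster.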

Note on spin rate: While spin is included in the 4D detection, TrackMan spin measurements can occasionally be noisy and include misreads. The system's conservative "closer to another cluster" requirement helps filter out spin-driven false positives where movement and velocity profiles are normal.
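The conservative rule can be sketched in plain Python (illustrative only; the project is written in R, where a built-in Mahalanobis function is available). The cutoff of 13.28 is an assumption: it is roughly the 99th percentile of a chi-square distribution with 4 degrees of freedom, and the page does not state the real threshold. The function names are hypothetical, and each pitch type is assumed to have enough varied pitches for a non-singular covariance matrix.

```python
from statistics import mean


def _covariance(rows, mu):
    """Sample covariance matrix (n - 1 denominator)."""
    n, d = len(rows), len(mu)
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]


def _solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance of x from a cluster (mu, cov)."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    return sum(d * s for d, s in zip(diff, _solve(cov, diff)))


def flag_pitches(pitches, threshold_sq=13.28):
    """pitches: (pitch_type, [velo, ivb, hb, spin]) tuples for one pitcher.

    Returns one (flagged, nearest_type) tuple per pitch. A pitch is flagged
    only if it is an outlier in its own cluster AND nearer to another one.
    """
    clusters = {}
    for t in {t for t, _ in pitches}:
        rows = [x for tt, x in pitches if tt == t]
        mu = [mean(col) for col in zip(*rows)]
        clusters[t] = (mu, _covariance(rows, mu))

    flags = []
    for t, x in pitches:
        dist = {k: mahalanobis_sq(x, mu, cov) for k, (mu, cov) in clusters.items()}
        nearest = min(dist, key=dist.get)
        is_outlier = dist[t] > threshold_sq
        flags.append((is_outlier and nearest != t, nearest))
    return flags
```

The `nearest` value doubles as the suggested replacement label. A pitch with a spin misread but normal velocity and movement will usually still be nearest to its own cluster, so it is not flagged.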

R / R Shiny, XGBoost, Tidyverse, Plotly, DT (DataTables), Centroid Clustering, Mahalanobis Distance, 5-Fold Cross-Validation

Application Demonstration

Watch the Pitch Type Classifier in action, showcasing the complete workflow from data upload to final export:

Coming soon: a demo video showing data upload, classification, flagging review, and the manual editing workflow.

Features Showcase

Key interface components and functionality:

Data Explorer Interface

Data Explorer

Advanced filtering, multi-select editing, and "Show Only Flagged" mode for efficient data review.

Interactive Movement Plot

Visual Verification

Interactive Plotly charts showing pitch movement with click-and-drag selection for manual corrections.

Flagging System

4D Outlier Flagging

Conservative flagging with detailed reasons and suggestions, plus Accept/Reject options for batch processing.

Data Summary Stats

Real-Time Statistics

Live stats showing total pitches, arsenal corrections, flagged count, and manual edits.

📊 Complete Workflow

  1. Upload TrackMan CSV: System automatically loads and processes data (max 100 MB)
  2. Classification: Hierarchical XGBoost models classify all pitches (< 10 seconds)
  3. Arsenal Correction: Centroid-based similarity merging reduces inflated arsenals (~1% of pitches)
  4. Outlier Flagging: 4D Mahalanobis detection flags suspicious pitches (~0.7% of pitches)
  5. Interactive Review: Data Explorer shows all flagged pitches with suggestions
  6. Visual Verification: Click "View in Pitcher Explorer" to see the pitch in movement plot context
  7. Batch or Individual Editing: Accept/Reject all or selected flagged pitches, or edit manually in the Data Explorer or directly on the movement plot
  8. Download Clean Data: Export final classified dataset with all corrections applied

Key Innovations

Conservative Flagging Logic

Unlike aggressive outlier detection that flags everything outside normal ranges, this system only flags pitches that are BOTH outliers AND closer to a different pitch type's cluster. This reduces false positives while maintaining high recall for true mislabels.

4D Arsenal Correction

Automatically corrects pitcher arsenals to remove redundant model predictions (such as a pitcher predicted to throw both a slider and a cutter with nearly identical profiles).

Integrated Visual Workflow

Rather than forcing users to choose between automated classification and manual editing, the system seamlessly integrates both. Analysts can trust the algorithm for obvious cases, verify uncertain ones visually, and override when domain knowledge suggests otherwise.

📈 Performance & Validation

Training Approach: 5-fold cross-validation on Big Ten conference data (18,872 pitches)
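For readers unfamiliar with the procedure, here is a minimal Python sketch of 5-fold cross-validation (illustrative only; the project is written in R). The function names are hypothetical, and the simple shuffled split is an assumption. The page does not say whether folds were stratified by pitch type or grouped by pitcher.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle range(n) and deal it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit, k=5, seed=0):
    """Mean held-out accuracy across k folds.

    fit(X_train, y_train) must return a predict(x) callable.
    """
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        test = set(held_out)
        train = [i for i in range(len(X)) if i not in test]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k
```

Every pitch is held out exactly once, so the reported figure averages five models that were each scored on data they did not see during training.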

93.6% Cross-Validation Accuracy
~1% Flagging Rate

Learning & Development

Key insights from the iterative development process:

Future Enhancements