Mike Ion
I study how to tell when a language model is actually doing what we asked it to, and where common evaluation metrics quietly fail. My work lives at the intersection of measurement theory and statistical ML, with applications in education and conversational AI.
Current research
- Evaluation of LLM output distributions. Per-prompt similarity metrics (cosine, BERTScore) can systematically misrank instruction-tuned models. Instruction tuning shifts the center of the output distribution while preserving its spread; cosine tracks the spread and misses the shift. This blind spot has direct implications for training-data filtering pipelines.
- Fidelity of synthetic conversational data. Adapting propensity-score MSE from tabular synthetic data to conversations, with L1-regularized classifiers that yield per-feature diagnostics and dimensional decomposition. Practitioners get not just a quality score but a directive on what to fix next.
- Statistical representations of dialogue. Feature pipelines that map variable-length conversations to fixed-dimensional vectors capturing surface patterns and cognitive dynamics, built on MathMentorDB.
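A minimal sketch of what such a pipeline does, with invented surface features (the actual MathMentorDB pipeline is richer and also covers cognitive dynamics):

```python
import numpy as np

def featurize(messages):
    """Map a variable-length conversation, a list of (speaker, text) pairs,
    to a fixed-dimensional vector of surface features (illustrative only)."""
    texts = [text for _, text in messages]
    lengths = np.array([len(t.split()) for t in texts])
    return np.array([
        len(messages),                                               # turn count
        lengths.mean(),                                              # mean message length
        lengths.std(),                                               # length variability
        sum(t.rstrip().endswith("?") for t in texts) / len(texts),   # question rate
        sum(s == "student" for s, _ in messages) / len(messages),    # student turn share
    ])

convo = [("tutor", "What is 3x + 2 when x = 4?"),
         ("student", "14"),
         ("tutor", "Right. Why?")]
vec = featurize(convo)  # every conversation, whatever its length, maps to 5 numbers
```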
Selected papers
Across 8,290 paired human and model continuations, cosine similarity ranks instruction-tuned models as most human-like in 5 of 6 genres, while kernel MMD² ranks those same models as least human-like. A 67-feature linguistic decomposition explains the disagreement: instruction tuning shifts the distribution's center without changing its spread. Per-prompt similarity is insensitive to the shift; population-level distance responds to both.
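The mechanism behind the disagreement can be reproduced on toy Gaussian data (an assumed setup, not the paper's 8,290 continuations): shifting one sample's center while keeping its spread leaves the average per-pair cosine similarity at its baseline, while a biased RBF-kernel MMD² estimate moves clearly off the same-distribution floor.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 500
human = rng.normal(0.0, 1.0, (n, d))
shift = np.full(d, 0.3)                       # center moves, spread unchanged
model = rng.normal(0.0, 1.0, (n, d)) + shift  # stand-in "instruction-tuned" sample
control = rng.normal(0.0, 1.0, (n, d))        # second human-like sample for baselines

def mean_pair_cosine(a, b):
    # per-prompt style metric: cosine of each matched pair, averaged
    num = (a * b).sum(axis=1)
    return (num / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))).mean()

def mmd2_rbf(x, y, gamma):
    # population-level metric: biased MMD^2 estimate with an RBF kernel
    def k(a, b):
        sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
        return np.exp(-gamma * sq)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

gap_cos = mean_pair_cosine(human, model) - mean_pair_cosine(human, control)
gap_mmd = mmd2_rbf(human, model, 1.0 / d) - mmd2_rbf(human, control, 1.0 / d)
# gap_cos stays near zero (cosine misses the shift); gap_mmd is clearly positive
```

In the paper the per-prompt metric runs over continuation embeddings; this toy only mirrors the shift-versus-spread mechanism.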
A propensity-score MSE framework for assessing whether LLM-generated conversations match the distribution of real ones. L1-regularized classifiers on interpretable feature subsets reveal that surface and cognitive fidelity are coupled: improving one without attending to the other can worsen overall fidelity. The framework turns a scalar quality score into a diagnostic for iterative prompt refinement.
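A minimal sketch of the pMSE idea, assuming conversations are already featurized; the feature names are hypothetical and scikit-learn's L1-penalized logistic regression stands in for the paper's classifiers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 400
names = ["turn_count", "mean_msg_len", "question_rate", "hedging_rate"]  # hypothetical
real = rng.normal(0.0, 1.0, (n, 4))
synth = rng.normal(0.0, 1.0, (n, 4))
synth[:, 2] += 0.8                  # planted mismatch: synthetic data over-asks questions

X = np.vstack([real, synth])
y = np.r_[np.zeros(n), np.ones(n)]  # 1 = synthetic

# propensity model: can a sparse classifier tell synthetic from real?
clf = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
p = clf.predict_proba(X)[:, 1]
pmse = ((p - 0.5) ** 2).mean()      # 0 if indistinguishable, up to 0.25 if separable

# the L1 coefficients localize the mismatch to specific features
diagnostics = dict(zip(names, clf.coef_[0]))
```

Here `diagnostics` should load almost all of its weight onto `question_rate`, which is exactly the "what to fix next" signal the scalar score alone cannot provide.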
A hierarchical Bayesian Hidden Markov Model fit to 2,437 math tutoring conversations (51k+ messages) from MathMentorDB, using LLM-classified discourse moves. Four latent pedagogical states emerge unsupervised (Problem Introduction, Exploration, Lecturing, Working). Resolved and unresolved sessions differ less in what tutors say than in whether students are actively attempting inference, replicating impasse-driven learning findings at two orders of magnitude greater scale than prior manually annotated studies.
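A toy forward-algorithm sketch of the latent-state idea; the transition and emission matrices, state labels, and move vocabulary below are invented for illustration (the paper fits a hierarchical Bayesian HMM rather than fixing these parameters):

```python
import numpy as np

states = ["Problem Introduction", "Exploration", "Lecturing", "Working"]
moves = {"greet": 0, "probe": 1, "explain": 2, "attempt": 3}

pi = np.array([0.7, 0.1, 0.1, 0.1])      # initial state distribution (invented)
A = np.array([[0.50, 0.30, 0.10, 0.10],  # state transitions (invented)
              [0.05, 0.50, 0.20, 0.25],
              [0.05, 0.20, 0.50, 0.25],
              [0.05, 0.25, 0.20, 0.50]])
B = np.array([[0.70, 0.10, 0.10, 0.10],  # P(move | state) (invented)
              [0.05, 0.60, 0.15, 0.20],
              [0.05, 0.15, 0.70, 0.10],
              [0.05, 0.10, 0.15, 0.70]])

def filter_states(obs):
    """Forward algorithm: P(state_t | moves up to t), normalized at each step."""
    alpha = pi * B[:, obs[0]]
    alpha /= alpha.sum()
    out = [alpha]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        alpha /= alpha.sum()
        out.append(alpha)
    return np.array(out)

seq = [moves[m] for m in ["greet", "probe", "attempt", "attempt"]]
post = filter_states(seq)
final_state = states[int(post[-1].argmax())]  # repeated student attempts -> "Working"
```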
Interactive tool operationalizing the pMSE framework above. Traces a synthetic conversation from generation prompt through feature extraction to L1 coefficients, with side-by-side real/synthetic contrast and a scorecard exposing the surface–cognitive coupling directly.
News
- Apr 2026: Measuring Simulation Fidelity via Statistical Detectability (with Kevyn Collins-Thompson) accepted to ACM Learning@Scale 2026 (Seoul, July 2026; 22% acceptance rate). A propensity-score MSE diagnostic for whether LLM-generated tutoring conversations match the distribution of real ones, showing that surface and cognitive fidelity are coupled.
- Apr 2026: From Moves to Pathways (with Michael Light and Kevyn Collins-Thompson) accepted to AIED 2026 (Late-Breaking Results poster track; 29% acceptance rate). A hierarchical Bayesian HMM over 2,437 math tutoring conversations recovers four latent pedagogical states unsupervised; resolved and unresolved sessions differ less in what tutors say than in whether students are actively attempting inference.
- Mar 2026: Talk at Infinicon (San Luis Obispo): a walkthrough of getting LLMs to return structured outputs, from typical prompting approaches for specific fields and categories to fine-tuning an open-weight model on a free Colab instance, with a comparison of the two. [notebook]
- Oct 2025: Preprint Chip-Firing and the Sandpile Group of the R10 Matroid (with Alex McDonough) posted on arXiv and since submitted to Galois Journal of Algebra. A description of chip-firing on R10 using complex numbers, with representatives for the 162 equivalence classes of its sandpile group.
- Sep 2025: Started as Lecturer in the Statistics Department at California Polytechnic State University, San Luis Obispo.
- Aug 2025: Talk at JSM 2025 (Nashville): Bayesian Hierarchical Modeling of Large-Scale Math Tutoring Dialogues, a Bayesian framework for analyzing cognitive load in math tutoring, applied to MathMentorDB's 5.4M messages across 200K+ conversations.
- Feb 2025: Awarded $12,435 from the Academic Innovation Fund to develop AI-powered technical-interview practice tools for data science students.
- Jan 2025: Talk at JMM 2025 (Seattle): Text-as-Data in Mathematics Education: Harnessing LLMs to Analyze Student Conversations at Scale.
- Sep 2024: Started as a postdoctoral fellow at the University of Michigan School of Information.
- Mar 2024: Defended my PhD at the University of Michigan with the dissertation Beyond the Classroom: Exploring Mathematics Engagement in Online Communities with Natural Language Processing.