Mike Ion

Postdoctoral Research Fellow · University of Michigan School of Information

I study how to tell when a language model is actually doing what we asked it to, and where common evaluation metrics quietly fail. My work lives at the intersection of measurement theory and statistical ML, with applications in education and conversational AI.

Current research

  1. Evaluation of LLM output distributions. Per-prompt similarity metrics (cosine, BERTScore) can systematically misrank instruction-tuned models: instruction tuning shifts the center of the output distribution while preserving its spread, and cosine tracks the spread but misses the shift. This blind spot has direct implications for training-data filtering pipelines.
  2. Fidelity of synthetic conversational data. Adapting propensity-score MSE from tabular synthetic data to conversations, with L1-regularized classifiers that yield per-feature diagnostics and a decomposition of fidelity into surface and cognitive dimensions. Practitioners get not just a quality score but concrete guidance on what to fix next.
  3. Statistical representations of dialogue. Feature pipelines that map variable-length conversations to fixed-dimensional vectors capturing both surface patterns and cognitive dynamics, built on MathMentorDB. A minimal featurization sketch follows this list.
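
To make item 3 concrete, here is a minimal featurization sketch. It is illustrative only: the message format and the six features below are placeholder assumptions, not the pipeline's actual schema or feature set.

```python
# Minimal sketch: map a variable-length conversation to a fixed vector.
# Message format and features are illustrative, not the real pipeline's.
import numpy as np

def featurize(conversation):
    """conversation: list of {'speaker': 'tutor'|'student', 'text': str}."""
    texts = [m["text"] for m in conversation]
    lengths = np.array([len(t.split()) for t in texts], dtype=float)
    student = np.array([m["speaker"] == "student" for m in conversation], dtype=float)
    return np.array([
        len(texts),                                           # number of messages
        lengths.mean(),                                       # mean message length (words)
        lengths.std(),                                        # length variability
        np.mean([t.rstrip().endswith("?") for t in texts]),   # question ratio
        student.mean(),                                       # share of student turns
        np.abs(np.diff(student)).mean(),                      # speaker alternation rate
    ])

convo = [
    {"speaker": "student", "text": "How do I factor x^2 - 9?"},
    {"speaker": "tutor", "text": "What pattern do you notice in x^2 - 9?"},
    {"speaker": "student", "text": "A difference of squares!"},
]
print(featurize(convo))  # one fixed-dimensional vector per conversation
```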

Selected papers

NeurIPS 2026 · in preparation
The Evaluation Blind Spot: Per-Prompt Similarity Can Systematically Misrank Instruction-Tuned Text Distributions

Across 8,290 paired human and model continuations, cosine similarity ranks instruction-tuned models as most human-like in 5 of 6 genres, while kernel MMD² (squared maximum mean discrepancy) ranks those same models as least human-like. A 67-feature linguistic decomposition explains the disagreement: instruction tuning shifts the distribution's center without changing its spread. Per-prompt similarity is insensitive to the shift; population-level distance responds to both.

Ion, M. & Godfrey, J. · 2026
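
The mechanism is easy to reproduce in a toy simulation (my illustration, not the paper's data or code): a "tuned" model whose outputs are a systematically shifted copy of the human embedding cloud, and a "base" model that adds random per-prompt noise around the human center. Mean per-prompt cosine prefers the shifted model; kernel MMD² penalizes it.

```python
# Toy illustration of the blind spot on synthetic Gaussian "embeddings".
# Numbers are chosen so the two metrics disagree; nothing here is real data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 50
human = rng.normal(size=(n, d))
tuned = human + 0.3                                  # shifted center, same spread
base = human + rng.normal(scale=0.4, size=(n, d))    # same center, extra per-prompt noise

def mean_paired_cosine(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

def mmd2_rbf(x, y, gamma):
    def k(a, b):  # RBF kernel matrix via squared Euclidean distances
        sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2 * a @ b.T
        return np.exp(-gamma * np.maximum(sq, 0.0))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Per-prompt cosine ranks "tuned" as more human-like...
print(mean_paired_cosine(human, tuned), mean_paired_cosine(human, base))
# ...while population-level MMD^2 ranks it as less human-like.
print(mmd2_rbf(human, tuned, 1.0 / (2 * d)), mmd2_rbf(human, base, 1.0 / (2 * d)))
```
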
ACM Learning@Scale 2026 · accepted (22%)
Measuring Simulation Fidelity via Statistical Detectability: A Diagnostic Framework for AI-Generated Tutoring Conversations

A propensity-score MSE framework for assessing whether LLM-generated conversations match the distribution of real ones. L1-regularized classifiers on interpretable feature subsets reveal that surface and cognitive fidelity are coupled: improving one without attending to the other can worsen overall fidelity. The framework turns a scalar quality score into a diagnostic for iterative prompt refinement.

Ion, M. & Collins-Thompson, K. · 2026
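
A minimal sketch of the pMSE computation, in the spirit of the tabular synthetic-data literature. The data, feature count, and regularization strength below are placeholders; in practice the inputs are feature vectors from the dialogue pipeline above.

```python
# pMSE sketch: can an L1-regularized classifier tell real from synthetic?
# Placeholder data: 6 features per conversation, one feature off by design.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 6))
synth = rng.normal(size=(500, 6))
synth[:, 2] += 0.8                     # the generator gets feature 2 wrong

X = StandardScaler().fit_transform(np.vstack([real, synth]))
y = np.array([0] * len(real) + [1] * len(synth))     # 1 = synthetic

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
p = clf.predict_proba(X)[:, 1]         # propensity scores (in-sample, for brevity)
c = y.mean()                           # share of synthetic rows (0.5 here)
print("pMSE:", np.mean((p - c) ** 2))  # 0 means indistinguishable
print("per-feature diagnostics:", clf.coef_.ravel().round(2))  # nonzeros = what to fix
```
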
AIED 2026 · accepted (Late-Breaking Results poster, 29%)
From Moves to Pathways: Characterizing Pedagogical Discourse Dynamics in Online Tutoring with Bayesian Generative Modeling

A hierarchical Bayesian hidden Markov model fit to 2,437 math tutoring conversations (51k+ messages) from MathMentorDB, using LLM-classified discourse moves. Four latent pedagogical states emerge unsupervised (Problem Introduction, Exploration, Lecturing, Working). Resolved and unresolved sessions differ less in what tutors say than in whether students are actively making solution attempts, replicating impasse-driven learning findings at two orders of magnitude greater scale than prior manually annotated studies.

Light, M., Ion, M. & Collins-Thompson, K. · 2026
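
The core inference step is standard HMM filtering over a conversation's move sequence. Below is a toy forward pass with invented transition and emission matrices; the paper learns these hierarchically with full Bayesian uncertainty, and the state and move labels here are only loosely modeled on its four states.

```python
# Toy forward pass for a 4-state HMM over LLM-classified discourse moves.
# All probabilities are made up for illustration; the paper learns them.
import numpy as np

states = ["Intro", "Exploration", "Lecturing", "Working"]
moves = ["states_problem", "asks_question", "explains", "attempts_step"]

A = np.array([[0.10, 0.50, 0.30, 0.10],    # state-to-state transition probs
              [0.00, 0.50, 0.20, 0.30],
              [0.00, 0.30, 0.50, 0.20],
              [0.00, 0.20, 0.20, 0.60]])
B = np.array([[0.70, 0.20, 0.05, 0.05],    # P(move | state)
              [0.05, 0.60, 0.15, 0.20],
              [0.05, 0.10, 0.80, 0.05],
              [0.05, 0.15, 0.10, 0.70]])
pi = np.array([0.85, 0.05, 0.05, 0.05])    # initial state distribution

obs = [0, 1, 2, 1, 3, 3]                   # one conversation's classified moves
print("moves:", [moves[i] for i in obs])

alpha = pi * B[:, obs[0]]                  # forward recursion (unscaled; fine for short sequences)
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]

posterior = alpha / alpha.sum()
print("filtered state posterior:", dict(zip(states, posterior.round(3))))
```
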
ACM Learning@Scale 2026 · demo
A Fidelity Diagnostic Workbench for Synthetic Tutoring Conversations

An interactive tool operationalizing the pMSE framework above. It traces a synthetic conversation from generation prompt through feature extraction to L1 coefficients, with a side-by-side real/synthetic comparison and a scorecard that exposes the surface–cognitive coupling directly.

Ion, M. & Collins-Thompson, K. · 2026

Full publication list →

News