
Kaggle Competition Silver Medal: MAP - Charting Student Math Misunderstandings

October 15, 2025

kaggle

I received a Silver Medal in the MAP - Charting Student Math Misunderstandings Kaggle competition, marking my first participation in an LLM-based competition.

The challenge required identifying whether a student’s free-text explanation exhibited a mathematical misconception and, if so, classifying the specific misconception type. Predictions were evaluated using MAP@3, emphasizing high-precision top-rank retrieval.
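For context, MAP@3 gives full credit only when the true label is ranked first, and partial credit at ranks two and three. A minimal sketch of the metric (the label strings are illustrative, not the exact competition labels):

```python
def map_at_3(true_labels, ranked_predictions):
    """Mean Average Precision @ 3: credit 1, 1/2, or 1/3 depending on
    the rank at which the true label first appears (0 otherwise)."""
    total = 0.0
    for truth, preds in zip(true_labels, ranked_predictions):
        for rank, pred in enumerate(preds[:3], start=1):
            if pred == truth:
                total += 1.0 / rank
                break
    return total / len(true_labels)

# e.g. the truth at rank 2 earns half credit:
# map_at_3(["True_Correct:NA"],
#          [["True_Neither:NA", "True_Correct:NA", "False_Neither:NA"]]) == 0.5
```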

The task required building NLP models capable of generalizing across diverse math problems, handling long-tail misconception categories, and integrating question context, student choices, and open-ended explanations.

Our solution pipeline consisted of four key stages:

Stage 1 — Data Preprocessing

Unified Joint Label Space

All training targets were mapped into a single multi-class space:
Category:Misconception (with NA for non-misconception cases).
This avoided cascading errors inherent to multi-stage models.
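A minimal sketch of the mapping, assuming the competition's Category and Misconception columns (treat the file and column names here as assumptions):

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Joint label: Category:Misconception, with NA filling non-misconception rows
train["target"] = train["Category"] + ":" + train["Misconception"].fillna("NA")

# One flat multi-class space; a single classifier avoids cascading errors
label2id = {lab: i for i, lab in enumerate(sorted(train["target"].unique()))}
train["label"] = train["target"].map(label2id)
```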

Question-Level Correctness Prior

From the training data, the MC answer most frequently labeled as correct was computed for each QuestionId.
This enabled:

  • A strong structural feature: Is Correct Answer: Yes/No
  • Later family prefix filtering (True_ vs False_)
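A sketch of how this prior can be computed, assuming the Category column carries the True_/False_ correctness prefix and MC_Answer holds the student's choice:

```python
# Rows whose Category starts with "True_" indicate the student picked the
# correct answer, so the modal MC_Answer among them is taken as the
# question's correct answer.
correct_rows = train[train["Category"].str.startswith("True_")]
correct_answer = (
    correct_rows.groupby("QuestionId")["MC_Answer"]
    .agg(lambda s: s.mode().iloc[0])
)

# Structural feature used both in the input text and for prefix filtering
train["is_correct"] = train["MC_Answer"] == train["QuestionId"].map(correct_answer)
```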

Input Construction

Each model input concatenated:

  • Question text
  • Selected answer
  • Correctness hint
  • Student explanation

Inputs were tokenized with max_length=256 using dynamic truncation.
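A sketch of the input construction and tokenization; the exact prompt template and field wording are assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

def build_input(row):
    """Concatenate question, selected answer, correctness hint, and explanation."""
    hint = "Yes" if row["is_correct"] else "No"
    return (
        f"Question: {row['QuestionText']}\n"
        f"Selected Answer: {row['MC_Answer']}\n"
        f"Is Correct Answer: {hint}\n"
        f"Explanation: {row['StudentExplanation']}"
    )

encoded = tokenizer(
    [build_input(r) for _, r in train.iterrows()],
    max_length=256,
    truncation=True,   # dynamic truncation to the 256-token budget
    padding=False,     # padding deferred to a dynamic collator at batch time
)
```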


Stage 2 — Model Architecture and Training

LLM Backbone (Qwen3 4B / 8B / 14B)

Since student explanations contain free-form reasoning, linguistic variation, and mathematical logic, we used a single unified classifier that captures all of this in one pass and avoids misalignment across multi-step pipelines. Models were fine-tuned using Hugging Face AutoModelForSequenceClassification, producing a single-head classifier over the full joint label set.
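A minimal loading sketch; num_labels spans the joint label space built in Stage 1:

```python
from transformers import AutoModelForSequenceClassification

# Single-head classifier over the full Category:Misconception label space
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-4B",
    num_labels=len(label2id),
)
# Decoder-only backbones need the pad token wired up for classification
model.config.pad_token_id = tokenizer.pad_token_id
```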

Parameter-Efficient LoRA Training

All experiments used LoRA applied to Q/V/O/gate/up/down projections.

Key training details:

  • Precision: 4-bit (nf4) or BF16
  • LR schedule: cosine with warmup
  • Gradient checkpointing + dynamic padding
  • 2–3 epochs, batch size 8–16
  • LoRA rank scaled inversely with model size (e.g., r=512 for 4B down to r=16 for 14B)
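A configuration sketch under these settings; the alpha and dropout values are assumptions, since the post only fixes the rank, target modules, and precision:

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-4B", num_labels=len(label2id), quantization_config=bnb
)
model = prepare_model_for_kbit_training(model)  # also enables gradient checkpointing

lora = LoraConfig(
    r=512,                                   # r=512 for 4B; r=16 for 14B
    lora_alpha=1024,                         # assumption: alpha ~ 2*r
    target_modules=["q_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,                       # assumed value
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora)
```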

Stage 3 — Inference and Postprocessing

Top-25 Candidate Preservation

Each model exported a Top-25 probability table, allowing richer uncertainty modeling in the ensemble stage.
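A sketch of the export step:

```python
import torch

def export_top25(logits: torch.Tensor):
    """Keep the 25 most probable labels per row instead of the argmax,
    so the ensemble stage can reason over each model's uncertainty.
    logits: [num_rows, num_labels] from the model's forward pass."""
    probs = torch.softmax(logits, dim=-1)
    return torch.topk(probs, k=25, dim=-1)   # (top_probs, top_ids)
```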

Family Prefix Filtering (Key Innovation)

Using the correctness prior:

  • If a row’s prefix should be True_, filter out all False_*
  • If it should be False_, filter out all True_*

This sharply constrained the error space and reflected domain structure.
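A sketch of the filter, assuming each candidate label carries its True_/False_ family prefix:

```python
def family_filter(candidates, is_correct):
    """Drop candidates whose True_/False_ prefix contradicts the
    correctness prior for this row. candidates: [(label, prob), ...]"""
    prefix = "True_" if is_correct else "False_"
    return [(lab, p) for lab, p in candidates if lab.startswith(prefix)]
```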

Cross-Model Consistency Weighted Ensemble

Outputs from 4B, 8B, and 14B models were fused using:

  1. Weighted total probability (0.34)
  2. Cross-model consistency ratio (0.33)
  3. Max confidence across models (0.33)

This balanced global support, agreement, and peak reliability.
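A sketch of the fusion for a single row; the exact normalization of each term is an assumption:

```python
from collections import defaultdict

def fuse(model_candidates, w_total=0.34, w_consistency=0.33, w_max=0.33):
    """Fuse per-model Top-25 tables for one row into a Top-3 prediction.
    model_candidates: list of {label: prob} dicts, one per model."""
    n_models = len(model_candidates)
    total, peak, votes = defaultdict(float), defaultdict(float), defaultdict(int)
    for cand in model_candidates:
        for lab, p in cand.items():
            total[lab] += p
            peak[lab] = max(peak[lab], p)
            votes[lab] += 1
    scores = {
        lab: (w_total * total[lab] / n_models          # global support
              + w_consistency * votes[lab] / n_models  # cross-model agreement
              + w_max * peak[lab])                     # peak reliability
        for lab in total
    }
    return sorted(scores, key=scores.get, reverse=True)[:3]
```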

Completion Heuristics

If fewer than 3 predictions remained:

  • Add Neither:NA
  • For True_ cases, optionally add Correct:NA

This ensured fully valid MAP@3 submissions.
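A sketch of the padding logic (label strings as given in the list above):

```python
def complete_predictions(preds, is_correct):
    """Pad a row's prediction list to three entries so the MAP@3
    submission is always valid."""
    if len(preds) < 3 and "Neither:NA" not in preds:
        preds.append("Neither:NA")
    if len(preds) < 3 and is_correct and "Correct:NA" not in preds:
        preds.append("Correct:NA")
    return preds[:3]
```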


Stage 4 — Evaluation and Results

Leaderboard Scores

| Model | Precision | Method | Private LB |
|---|---|---|---|
| Qwen3-14B | 4-bit + LoRA | single model | 0.945 |
| Qwen3-8B | 4-bit + LoRA | single model | 0.944 |
| Qwen3-4B | FP16 + LoRA | single model | 0.944 |
| Ensemble (4B + 8B + 14B) + family filtering | – | fused | 0.946 |

The ensemble improved upon the best individual model by +0.001, validating the benefit of structural priors and cross-model agreement.


Code Reference

Full implementation of training, inference, LoRA configuration, quantization setup, and ensemble logic is available on GitHub.