Samples · lm_eval_harness.gsm8k
Run #75 · Adapter v1.0.0+humaneval-removed+gen-kwargs-pairing · 73/1319 samples shown
· Score 90.4%
AI evaluation
Generated 2026-05-13 21:38 · claude-sonnet-4-6
Summary
The model Qwen3-Coder-Next reaches a pass rate of 92.5% on GSM8K (score 90.4%), a solid but not outstanding result for multi-step grade-school math.
Strengths
- Easy and medium-difficulty arithmetic problems are solved reliably, with clean working.
- Unit conversions and linear multi-step problems (groceries, pool-filling costs, percentages) are handled consistently.
- Zero errors (errors=0): the model never aborts or returns invalid output.
Weaknesses
- Problems with indirect or implicit references are misinterpreted; for example, "running 10% faster" is treated as a time reduction by dividing by 1.1 rather than as a direct subtraction.
- Off-by-one errors on inclusive time spans (e.g. the Gene quilt-block problem: 12 instead of 11 years).
- Ambiguous problem wording leads to over-analysis, so the model sometimes introduces wrong relations (e.g. Lylah's salary).
- Probability problems: the model computes correctly but misreads the question (relative instead of absolute difference).
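Three of the error patterns listed above come down to small arithmetic differences between two readings of the same sentence. The sketch below uses made-up numbers that are not taken from the GSM8K items, and all function names are hypothetical; it only shows how far each pair of readings diverges.

```python
# Illustrative values only; none of these numbers come from the benchmark items.

def time_by_division(base_time: float, pct_faster: float) -> float:
    """Read 'pct_faster % faster' as a speed increase: time = base / (1 + p)."""
    return base_time / (1 + pct_faster / 100)


def time_by_subtraction(base_time: float, pct_faster: float) -> float:
    """Read 'pct_faster % faster' as removing that share of the time directly."""
    return base_time * (1 - pct_faster / 100)


def span_exclusive(first_year: int, last_year: int) -> int:
    """Count the gaps between the two years."""
    return last_year - first_year


def span_inclusive(first_year: int, last_year: int) -> int:
    """Count both endpoint years as well."""
    return last_year - first_year + 1


def absolute_difference(p_a: float, p_b: float) -> float:
    """Difference in percentage points when both inputs are percentages."""
    return p_a - p_b


def relative_difference(p_a: float, p_b: float) -> float:
    """Difference expressed as a fraction of the second value."""
    return (p_a - p_b) / p_b


if __name__ == "__main__":
    print(time_by_division(60, 10), time_by_subtraction(60, 10))
    print(span_exclusive(2010, 2021), span_inclusive(2010, 2021))
    print(absolute_difference(30, 20), relative_difference(30, 20))
```

With a base time of 60, the division reading gives roughly 54.55 while the subtraction reading gives 54.0, so an exact-match grader would count one of them as a failure even though both calculations are internally correct.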
Observations
Recurring pattern: on problems that require a single, short answer, the model produces lengthy alternative lines of reasoning and misses the simple result being asked for. This points to a tendency toward over-answering (verbosity bias).
Recommendation
Lower the sampling temperature (e.g. to 0.0, i.e. greedy decoding) to keep the model from exploring speculative alternative paths on clear-cut numeric problems and to push the pass rate further toward 95%+.
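To make the difference between greedy decoding and temperature sampling concrete, here is a minimal toy sketch of a single decoding step. It is not the adapter's or the harness's code: `sample_token` and the example logits are hypothetical, and how this run actually passes its generation settings is not shown in the source.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick one token index from raw logits.

    temperature == 0.0 is treated as greedy decoding (argmax);
    any positive temperature samples from the softmax of logits / temperature.
    """
    if temperature == 0.0:
        # Greedy: always the highest-scoring token, fully deterministic.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = [2.0, 1.5, 0.1]  # hypothetical scores for three candidate tokens
    greedy = {sample_token(toy_logits, 0.0, rng) for _ in range(100)}
    sampled = {sample_token(toy_logits, 1.0, rng) for _ in range(500)}
    print("greedy picks:", greedy)
    print("sampled picks:", sampled)
```

Greedy decoding returns the same token on every call, whereas sampling at a positive temperature can pick lower-ranked tokens. Whether this actually reduces the over-answering pattern described above would need to be checked with a rerun.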
Overview
1319 samples
Distribution
Score histogram (axis from 0.0 to 1.0)
| Question ID | Status | Score | Prompt | Latency | Tokens/s | TTFT |
|---|---|---|---|---|---|---|
| 58 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: In a 50-… | — | — | — |
| 85 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Katie ha… | — | — | — |
| 93 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Lena is … | — | — | — |
| 119 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Michael … | — | — | — |
| 137 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: The bask… | — | — | — |
| 147 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mr. Smit… | — | — | — |
| 159 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Ben has … | — | — | — |
| 184 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Gilbert … | — | — | — |
| 187 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: It takes… | — | — | — |
| 219 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Albert h… | — | — | — |
| 227 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Edward h… | — | — | — |
| 255 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A gift s… | — | — | — |
| 298 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Three pl… | — | — | — |
| 304 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: When thr… | — | — | — |
| 306 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A pet sh… | — | — | — |
| 340 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Three lo… | — | — | — |
| 368 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: There ar… | — | — | — |
| 371 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Abe find… | — | — | — |
| 380 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: On Monda… | — | — | — |
| 403 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Traci an… | — | — | — |
| 409 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: There ar… | — | — | — |
| 425 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Tim has … | — | — | — |
| 428 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Verna we… | — | — | — |
| 450 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Nick hid… | — | — | — |
| 454 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Nicky we… | — | — | — |
| 494 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Tom buys… | — | — | — |
| 505 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Reynald … | — | — | — |
| 507 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Tatuya, … | — | — | — |
| 539 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mr. Rock… | — | — | — |
| 542 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Daniel h… | — | — | — |
| 590 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mandy is… | — | — | — |
| 607 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Earl sta… | — | — | — |
| 641 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Every Ha… | — | — | — |
| 649 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Two frie… | — | — | — |
| 652 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: On Satur… | — | — | — |
| 675 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: There we… | — | — | — |
| 689 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: James is… | — | — | — |
| 710 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Among th… | — | — | — |
| 749 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: At a bir… | — | — | — |
| 752 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Olaf is … | — | — | — |
| 768 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mark is … | — | — | — |
| 777 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: There ar… | — | — | — |
| 780 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Wyatt's … | — | — | — |
| 782 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Tim host… | — | — | — |
| 796 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Freddie … | — | — | — |
| 814 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Paddy's … | — | — | — |
| 823 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: In four … | — | — | — |
| 832 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Elsa sta… | — | — | — |
| 835 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Matthias… | — | — | — |
| 858 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: In a cer… | — | — | — |
| 894 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A clerk … | — | — | — |
| 920 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: There ar… | — | — | — |
| 951 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mandy is… | — | — | — |
| 952 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Stephani… | — | — | — |
| 962 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A family… | — | — | — |
| 984 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: John car… | — | — | — |
| 1001 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Thomas h… | — | — | — |
| 1016 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Ian is l… | — | — | — |
| 1019 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A farmer… | — | — | — |
| 1035 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Kobe and… | — | — | — |
| 1038 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A bag of… | — | — | — |
| 1042 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Kobe and… | — | — | — |
| 1048 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: John cli… | — | — | — |
| 1059 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Smaug th… | — | — | — |
| 1088 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mark buy… | — | — | — |
| 1161 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mike beg… | — | — | — |
| 1176 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: James bu… | — | — | — |
| 1183 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: While ch… | — | — | — |
| 1193 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Hershel … | — | — | — |
| 1227 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A questi… | — | — | — |
| 1306 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mary is … | — | — | — |
| 1309 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: If Clove… | — | — | — |
| 1313 | failed | {…} | {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: OpenAI r… | — | — | — |
73 of 1319 samples · limit 200