Samples · lm_eval_harness.gsm8k

Run #75 · Adapter v1.0.0+humaneval-removed+gen-kwargs-pairing · 73/1319 samples shown · Score 90.4%

AI evaluation

Generated 2026-05-13 21:38 · claude-sonnet-4-6

Summary

The model Qwen3-Coder-Next achieves a pass rate of 92.5% (score 90.4%) on GSM8K, a solid but not outstanding result for multi-step grade-school math.

Strengths

  • Simple and moderately difficult arithmetic problems are solved reliably, with clean working.
  • Unit conversions and linear multi-step problems (groceries, pool-filling costs, percentages) are handled consistently.
  • Zero errors (errors=0): the model never aborts and never returns invalid output.

Weaknesses

  • Problems with indirect or implicit references are misinterpreted; for example, "running 10% faster" is treated as a time reduction via division by 1.1 rather than as a direct subtraction.
  • Off-by-one errors with inclusive time spans (e.g. the Gene quilt-block problem: 12 years instead of 11).
  • Ambiguous problem statements invite over-analysis, which sometimes leads the model to introduce incorrect relations (e.g. Lylah's salary).
  • Probability problems: the model computes correctly but misreads the question (relative instead of absolute difference).
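The three arithmetic misreadings named above can be made concrete with a minimal Python sketch. All function names and input values here are hypothetical illustrations; they are not taken from the GSM8K items or from the model's outputs.

```python
def time_with_speed_increase(base_time: float, pct: float) -> float:
    """Read 'pct % faster' as a speed increase: time is divided by (1 + pct/100)."""
    return base_time / (1 + pct / 100)


def time_with_direct_subtraction(base_time: float, pct: float) -> float:
    """Read 'pct % faster' as cutting pct % off the time itself."""
    return base_time * (1 - pct / 100)


def span_years(start: int, end: int, inclusive: bool) -> int:
    """Count years between two points, optionally counting both endpoints."""
    return end - start + (1 if inclusive else 0)


def absolute_difference(p_a: float, p_b: float) -> float:
    """Difference between two probabilities in absolute terms."""
    return p_a - p_b


def relative_difference(p_a: float, p_b: float) -> float:
    """Difference between two probabilities relative to the second one."""
    return (p_a - p_b) / p_b


if __name__ == "__main__":
    # Illustrative values only (assumptions, not benchmark data).
    print(time_with_speed_increase(60, 10), time_with_direct_subtraction(60, 10))
    print(span_years(2010, 2021, inclusive=False), span_years(2010, 2021, inclusive=True))
    print(absolute_difference(0.5, 0.25), relative_difference(0.5, 0.25))
```

In each pair both readings are internally consistent arithmetic; the sample only counts as passed when the reading matches the one behind the reference answer.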

Notable observations

Recurring pattern: on problems that call for a single, short answer, the model produces lengthy alternative lines of reasoning and misses the simple result being asked for. This points to a tendency to over-answer (verbosity bias).

Recommendation

Lower the sampling temperature (e.g. to 0.0, or use greedy decoding) to keep the model from pursuing speculative alternative paths on clear-cut numerical problems and to push the pass rate further toward 95%+.
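As a rough illustration of why a lower temperature narrows the model's choices, the following Python sketch applies a temperature-scaled softmax to a hypothetical logit vector and compares it with a greedy (argmax) pick. The logits and function names are assumptions made for this example; this is not the decoding code of the harness or of the model server.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy_choice for the limit case")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_choice(logits: list[float]) -> int:
    """Greedy decoding: always pick the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


if __name__ == "__main__":
    logits = [2.0, 1.5, 0.5]  # hypothetical next-token logits
    for t in (1.0, 0.5, 0.1):
        probs = softmax_with_temperature(logits, t)
        print(f"T={t}: {[round(p, 3) for p in probs]}")
    print("greedy index:", greedy_choice(logits))
```

As the temperature shrinks, probability mass concentrates on the top-scoring token, so sampling approaches the greedy choice and alternative continuations become less likely.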

Overview

1319 samples
Distribution
Passed: 1246
Failed: 73
Score histogram
  0.0 – 0.1: 73
  0.1 – 0.2: 0
  0.2 – 0.3: 0
  0.3 – 0.4: 0
  0.4 – 0.5: 0
  0.5 – 0.6: 0
  0.6 – 0.7: 0
  0.7 – 0.8: 0
  0.8 – 0.9: 0
  0.9 – 1.0: 1246
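The bucket counts can be split at a score threshold with a few lines of Python. This is a sketch that assumes ten equal-width buckets and a 0.5 cut-off; the variable and function names are hypothetical. The share it prints is derived only from the counts listed here and may be defined differently from the pass rate and score quoted elsewhere on this page.

```python
# Counts per bucket, copied from the histogram (0.0-0.1 first, 0.9-1.0 last).
bucket_counts = [73, 0, 0, 0, 0, 0, 0, 0, 0, 1246]


def split_at_threshold(counts: list[int], threshold: float) -> tuple[int, int]:
    """Return (below, at_or_above) sample counts for equal-width buckets on [0, 1]."""
    cut = round(threshold * len(counts))  # index of the first bucket at/above threshold
    return sum(counts[:cut]), sum(counts[cut:])


if __name__ == "__main__":
    below, above = split_at_threshold(bucket_counts, 0.5)  # 0.5 is an assumed cut-off
    total = below + above
    print(f"below: {below}, at or above: {above}, total: {total}")
    print(f"share at or above threshold: {above / total:.1%}")
```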
Filter: Score < 0.5
Question ID Status Score Prompt Latency Tokens/s TTFT
58 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: In a 50-…
85 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Katie ha…
93 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Lena is …
119 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Michael …
137 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: The bask…
147 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mr. Smit…
159 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Ben has …
184 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Gilbert …
187 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: It takes…
219 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Albert h…
227 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Edward h…
255 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A gift s…
298 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Three pl…
304 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: When thr…
306 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A pet sh…
340 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Three lo…
368 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: There ar…
371 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Abe find…
380 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: On Monda…
403 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Traci an…
409 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: There ar…
425 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Tim has …
428 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Verna we…
450 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Nick hid…
454 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Nicky we…
494 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Tom buys…
505 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Reynald …
507 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Tatuya, …
539 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mr. Rock…
542 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Daniel h…
590 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mandy is…
607 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Earl sta…
641 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Every Ha…
649 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Two frie…
652 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: On Satur…
675 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: There we…
689 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: James is…
710 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Among th…
749 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: At a bir…
752 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Olaf is …
768 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mark is …
777 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: There ar…
780 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Wyatt's …
782 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Tim host…
796 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Freddie …
814 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Paddy's …
823 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: In four …
832 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Elsa sta…
835 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Matthias…
858 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: In a cer…
894 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A clerk …
920 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: There ar…
951 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mandy is…
952 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Stephani…
962 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A family…
984 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: John car…
1001 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Thomas h…
1016 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Ian is l…
1019 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A farmer…
1035 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Kobe and…
1038 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A bag of…
1042 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Kobe and…
1048 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: John cli…
1059 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Smaug th…
1088 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mark buy…
1161 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mike beg…
1176 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: James bu…
1183 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: While ch…
1193 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Hershel …
1227 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A questi…
1306 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mary is …
1309 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: If Clove…
1313 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: OpenAI r…
73 of 1319 samples · Limit 200