Samples · lm_eval_harness.gsm8k

Run #75 · Adapter v1.0.0+humaneval-removed+gen-kwargs-pairing · 119/1319 Samples angezeigt · Score 90.4%
‹ Zurück zum Run-Detail

KI-Auswertung

Generiert 2026-05-13 21:38 · claude-sonnet-4-6

Zusammenfassung

Das Modell Qwen3-Coder-Next erreicht auf GSM8K eine Pass-Rate von 92,5 % (Score 90,4 %), was ein solides, aber nicht herausragendes Ergebnis für mehrstufige Grundschulmathematik darstellt.

Stärken

  • Einfache und mittelschwere Rechenaufgaben werden zuverlässig und mit sauberem Rechenweg gelöst.
  • Umrechnung von Einheiten sowie lineare Mehrstufenprobleme (Groceries, Pool-Füllungskosten, Prozentsätze) gelingen konsistent.
  • Null Fehler (errors=0), das Modell bricht nie ab oder liefert ungültige Ausgaben.

Schwächen

  • Aufgaben mit indirekten oder impliziten Bezügen werden falsch interpretiert, z. B. „10 % schneller laufen" wird als Zeitreduktion durch Divisor 1,1 statt als direkte Subtraktion behandelt.
  • Off-by-one-Fehler bei inklusiven Zeiträumen (z. B. Gene-Quiltblock-Aufgabe: 12 statt 11 Jahre).
  • Mehrdeutige Problemformulierungen verleiten zu Überanalyse, wodurch das Modell teils falsche Relationen (z. B. Lylah's Gehalt) einführt.
  • Wahrscheinlichkeitsaufgaben: Das Modell berechnet korrekt, interpretiert die Frage jedoch falsch (relative statt absolute Differenz).

Auffälligkeiten

Wiederkehrendes Muster: Bei Aufgaben, die eine eindeutige, kurze Antwort erfordern, produziert das Modell ausführliche Alternativüberlegungen und verfehlt dabei das gesuchte einfache Ergebnis. Dies deutet auf eine Tendenz zur Überantwortung (verbosity bias) hin.

Empfehlung

Sampling-Temperatur senken (z. B. auf 0.0 oder greedy decoding), um das Modell bei klaren Zahlenaufgaben von spekulativen Alternativpfaden abzuhalten und die Pass-Rate weiter in Richtung 95 %+ zu treiben.

Übersicht

1319 Samples
Verteilung
1246
73
Score-Histogramm
0 – 0.1: 73 0.1 – 0.2: 0 0.2 – 0.3: 0 0.3 – 0.4: 0 0.4 – 0.5: 0 0.5 – 0.6: 0 0.6 – 0.7: 0 0.7 – 0.8: 0 0.8 – 0.9: 0 0.9 – 1: 1246
0.0 ────── 1.0
Status Score-Schwelle Score < 0.5
Frage-ID Status Score Prompt Latenz Tokens/s TTFT
1200 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: James is…
Lade Detail …
1201 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Josh and…
Lade Detail …
1202 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: John tak…
Lade Detail …
1203 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Winston …
Lade Detail …
1204 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Marissa …
Lade Detail …
1205 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Monika w…
Lade Detail …
1206 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mikaela …
Lade Detail …
1207 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: 1 chocol…
Lade Detail …
1208 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Betty ha…
Lade Detail …
1209 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Brian ca…
Lade Detail …
1210 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Frank ha…
Lade Detail …
1211 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Joyce, M…
Lade Detail …
1212 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A pad of…
Lade Detail …
1213 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Carson n…
Lade Detail …
1214 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Percy wa…
Lade Detail …
1215 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A window…
Lade Detail …
1216 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A 10 met…
Lade Detail …
1217 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Miggy's …
Lade Detail …
1218 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A bicycl…
Lade Detail …
1219 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: On the i…
Lade Detail …
1220 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Jerry is…
Lade Detail …
1221 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Paul, a …
Lade Detail …
1222 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Two bowl…
Lade Detail …
1223 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Kurt's o…
Lade Detail …
1224 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mary is …
Lade Detail …
1225 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Juanita …
Lade Detail …
1226 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Gunther …
Lade Detail …
1227 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A questi…
Lade Detail …
1228 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A bird i…
Lade Detail …
1229 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Sam is s…
Lade Detail …
1230 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Marj has…
Lade Detail …
1231 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Tina dec…
Lade Detail …
1232 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Justice …
Lade Detail …
1233 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: James bu…
Lade Detail …
1234 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Two year…
Lade Detail …
1235 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Brandon …
Lade Detail …
1236 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Ali and …
Lade Detail …
1237 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Dan spen…
Lade Detail …
1238 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Luther m…
Lade Detail …
1239 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Laura lo…
Lade Detail …
1240 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Jackson …
Lade Detail …
1241 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Felix is…
Lade Detail …
1242 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: The Peri…
Lade Detail …
1243 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Theresa …
Lade Detail …
1244 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Cassy pa…
Lade Detail …
1245 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: John wri…
Lade Detail …
1246 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Lorie ha…
Lade Detail …
1247 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Five cow…
Lade Detail …
1248 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Ivan has…
Lade Detail …
1249 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Jean has…
Lade Detail …
1250 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: John man…
Lade Detail …
1251 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Bill is …
Lade Detail …
1252 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Kimberly…
Lade Detail …
1253 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: There ar…
Lade Detail …
1254 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Dianne r…
Lade Detail …
1255 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A compan…
Lade Detail …
1256 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: There ar…
Lade Detail …
1257 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A plane …
Lade Detail …
1258 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Hani sai…
Lade Detail …
1259 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Ember is…
Lade Detail …
1260 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Emmett d…
Lade Detail …
1261 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Viviana …
Lade Detail …
1262 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: To earn …
Lade Detail …
1263 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: The jour…
Lade Detail …
1264 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: John's p…
Lade Detail …
1265 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: One batc…
Lade Detail …
1266 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Charlie …
Lade Detail …
1267 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Two days…
Lade Detail …
1268 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Kim has …
Lade Detail …
1269 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Jose bou…
Lade Detail …
1270 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: The batt…
Lade Detail …
1271 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Ahmed is…
Lade Detail …
1272 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Anika ha…
Lade Detail …
1273 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mark con…
Lade Detail …
1274 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Hattie a…
Lade Detail …
1275 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: There we…
Lade Detail …
1276 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A bond p…
Lade Detail …
1277 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: 9 years …
Lade Detail …
1278 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A grocer…
Lade Detail …
1279 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Kim driv…
Lade Detail …
1280 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Joe goes…
Lade Detail …
1281 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Archie h…
Lade Detail …
1282 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A rectan…
Lade Detail …
1283 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A bag of…
Lade Detail …
1284 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A buildi…
Lade Detail …
1285 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Chris ha…
Lade Detail …
1286 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Kendra m…
Lade Detail …
1287 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: An apple…
Lade Detail …
1288 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A tank w…
Lade Detail …
1289 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: James ha…
Lade Detail …
1290 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Elsa sta…
Lade Detail …
1291 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Flies ar…
Lade Detail …
1292 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Celina e…
Lade Detail …
1293 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Hannah h…
Lade Detail …
1294 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: An artis…
Lade Detail …
1295 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Jon make…
Lade Detail …
1296 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Jeannie …
Lade Detail …
1297 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: The bake…
Lade Detail …
1298 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: The teac…
Lade Detail …
1299 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Sally to…
Lade Detail …
1300 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: John's s…
Lade Detail …
1301 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Maria is…
Lade Detail …
1302 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Jason is…
Lade Detail …
1303 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Each wee…
Lade Detail …
1304 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Beth is …
Lade Detail …
1305 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: The stad…
Lade Detail …
1306 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Mary is …
Lade Detail …
1307 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A portab…
Lade Detail …
1308 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: In ten y…
Lade Detail …
1309 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: If Clove…
Lade Detail …
1310 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A pen is…
Lade Detail …
1311 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Jake buy…
Lade Detail …
1312 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: You draw…
Lade Detail …
1313 failed 0% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: OpenAI r…
Lade Detail …
1314 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: If Jason…
Lade Detail …
1315 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Jessica'…
Lade Detail …
1316 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A packag…
Lade Detail …
1317 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: Magdalen…
Lade Detail …
1318 passed 100% {…} {"gen_args_0":{"arg_0":["[{\"role\": \"user\", \"content\": \"Question: A car us…
Lade Detail …
119 von 1319 Samples · Limit 200 ‹ Vorherige