Samples · lm_eval_harness.humaneval
AI Evaluation
Generated 2026-05-10 17:12 · claude-sonnet-4-6
Summary
The model qwen3.6-35b-a3b-tq3 reaches a pass rate of 56.7% on HumanEval, a mediocre result for a quantized model of this size: about 43% of the 164 tasks are not solved correctly.
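The arithmetic behind these figures can be checked with a short sketch. It assumes that the 71 rows marked "failed" in the overview table are the complete set of failures and that no sample errored out, so the remaining samples count as passed.

```python
total_samples = 164   # sample count stated in the overview
failed_samples = 71   # assumption: number of "failed" rows in the overview table
passed_samples = total_samples - failed_samples

pass_rate = 100 * passed_samples / total_samples
fail_rate = 100 * failed_samples / total_samples

print(f"passed {passed_samples}/{total_samples} = {pass_rate:.1f}%")
print(f"failed {failed_samples}/{total_samples} = {fail_rate:.1f}%")
```

Under these assumptions the pass rate rounds to the reported 56.7%.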
Strengths
- Simple algorithmic tasks (bit operations, palindrome counting, sorting by binary representation) are solved reliably.
- No runtime errors (0 errors); the model always returns syntactically valid Python code.
- Short, concise implementations without unnecessary overhead.
Weaknesses
- Logic errors in subtasks: `largest_divisor` iterates upward from 1 instead of downward from n-1 and therefore returns the smallest divisor instead of the largest.
- Missing helper functions: `sum_product` calls `product()`, which is not defined.
- Algorithmic understanding: `fizz_buzz` counts numbers instead of digits; `how_many_times` contains an off-by-one error in the substring search.
- `decode_cyclic` returns the same code as `encode_cyclic` without implementing the inverse operation.
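For orientation, the intended behavior of three of the functions named above could look roughly like the following sketch. The task semantics (digit 7, divisibility by 11 or 13, overlapping matches) are assumptions taken from the public HumanEval task descriptions, since the prompts are truncated in the table; these are not the model's outputs.

```python
def largest_divisor(n: int) -> int:
    """Largest divisor of n that is strictly smaller than n."""
    # Search downward from n - 1 so the first hit is the largest divisor.
    for candidate in range(n - 1, 0, -1):
        if n % candidate == 0:
            return candidate
    raise ValueError("n must be at least 2")


def fizz_buzz(n: int) -> int:
    """Count occurrences of the digit 7 in integers below n divisible by 11 or 13."""
    # Count digits, not numbers: 77 contributes two occurrences.
    return sum(
        str(i).count("7")
        for i in range(n)
        if i % 11 == 0 or i % 13 == 0
    )


def how_many_times(string: str, substring: str) -> int:
    """Count occurrences of substring in string, including overlapping ones."""
    if not substring:
        return 0
    # The last valid start index is len(string) - len(substring), hence the + 1.
    return sum(
        string.startswith(substring, start)
        for start in range(len(string) - len(substring) + 1)
    )
```

Dropping the `+ 1` in the range bound of `how_many_times` skips the final window, which is one way an off-by-one error of the kind described can arise.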
Observations
The failures show a clear pattern: the model misunderstands the task specification at a conceptual level (counting digits vs. numbers, smallest vs. largest divisor, the inverse of a function). In addition, for simple utility functions it relies on nonexistent built-ins instead of implementing the logic itself. The `parse_nested_parens` failures point to weaknesses in stateful string processing.
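Two of these patterns can be made concrete with a minimal sketch: computing the product locally rather than calling an undefined `product()`, and writing `decode_cyclic` as a true inverse. The `encode_cyclic` body shown here is an assumption based on the public HumanEval task (rotating each full three-character group), because the prompt is truncated in the table.

```python
from typing import List, Tuple


def sum_product(numbers: List[int]) -> Tuple[int, int]:
    """Return (sum, product); the empty list gives (0, 1)."""
    product = 1  # computed locally instead of calling an undefined product()
    for value in numbers:
        product *= value
    return sum(numbers), product


def encode_cyclic(s: str) -> str:
    """Assumed encoder: rotate each full three-character group one step left."""
    groups = [s[i:i + 3] for i in range(0, len(s), 3)]
    return "".join(g[1:] + g[0] if len(g) == 3 else g for g in groups)


def decode_cyclic(s: str) -> str:
    """Invert encode_cyclic by rotating each full group one step right."""
    groups = [s[i:i + 3] for i in range(0, len(s), 3)]
    return "".join(g[-1] + g[:-1] if len(g) == 3 else g for g in groups)
```

A round-trip check such as `decode_cyclic(encode_cyclic(s)) == s` distinguishes a real inverse from a copy of the encoder, which fails for any string with at least one full group of non-identical characters.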
Recommendation
Investigate the sub-areas "algorithmic correctness for inverted or mirrored operations" and "specification understanding (digit vs. number, min vs. max)" specifically with few-shot prompting or chain-of-thought guidance; alternatively, raise the sampling temperature slightly (e.g. 0.2–0.4) and evaluate pass@k with k > 1.
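For the pass@k evaluation, a common choice is the unbiased estimator introduced with HumanEval: draw n samples per task, count the c that pass, and estimate the chance that at least one of k randomly chosen samples passes. The sketch below assumes this estimator; the function name `pass_at_k` is illustrative and not part of any harness API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is the mean of this value over all tasks. With k = 1 it reduces to the fraction of passing samples, c / n.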
Overview
164 samples

| Question ID | Status | Score | Prompt | Latency | Tokens/s | TTFT | |
|---|---|---|---|---|---|---|---|
| 6 | failed | {…} {"gen_args_0":{"arg_0":"from typing import List\n\n\ndef parse_nested_parens(par… | — | — | — | ||
| 8 | failed | {…} {"gen_args_0":{"arg_0":"from typing import List, Tuple\n\n\ndef sum_product(numb… | — | — | — | ||
| 10 | failed | {…} {"gen_args_0":{"arg_0":"\n\ndef is_palindrome(string: str) -> bool:\n \"\"\" … | — | — | — | ||
| 18 | failed | {…} {"gen_args_0":{"arg_0":"\n\ndef how_many_times(string: str, substring: str) -> i… | — | — | — | ||
| 24 | failed | {…} {"gen_args_0":{"arg_0":"\n\ndef largest_divisor(n: int) -> int:\n \"\"\" For … | — | — | — | ||
| 26 | failed | {…} {"gen_args_0":{"arg_0":"from typing import List\n\n\ndef remove_duplicates(numbe… | — | — | — | ||
| 32 | failed | {…} {"gen_args_0":{"arg_0":"import math\n\n\ndef poly(xs: list, x: float):\n \"\"… | — | — | — | ||
| 36 | failed | {…} {"gen_args_0":{"arg_0":"\n\ndef fizz_buzz(n: int):\n \"\"\"Return the number … | — | — | — | ||
| 37 | failed | {…} {"gen_args_0":{"arg_0":"\n\ndef sort_even(l: list):\n \"\"\"This function tak… | — | — | — | ||
| 38 | failed | {…} {"gen_args_0":{"arg_0":"\n\ndef encode_cyclic(s: str):\n \"\"\"\n returns … | — | — | — | ||
| 39 | failed | {…} {"gen_args_0":{"arg_0":"\n\ndef prime_fib(n: int):\n \"\"\"\n prime_fib re… | — | — | — | ||
| 59 | failed | {…} {"gen_args_0":{"arg_0":"\n\ndef largest_prime_factor(n: int):\n \"\"\"Return … | — | — | — | ||
| 65 | failed | {…} {"gen_args_0":{"arg_0":"\ndef circular_shift(x, shift):\n \"\"\"Circular shif… | — | — | — | ||
| 67 | failed | {…} {"gen_args_0":{"arg_0":"\ndef fruit_distribution(s,n):\n \"\"\"\n In this … | — | — | — | ||
| 69 | failed | {…} {"gen_args_0":{"arg_0":"\ndef search(lst):\n '''\n You are given a non-emp… | — | — | — | ||
| 73 | failed | {…} {"gen_args_0":{"arg_0":"\ndef smallest_change(arr):\n \"\"\"\n Given an ar… | — | — | — | ||
| 75 | failed | {…} {"gen_args_0":{"arg_0":"\ndef is_multiply_prime(a):\n \"\"\"Write a function … | — | — | — | ||
| 76 | failed | {…} {"gen_args_0":{"arg_0":"\ndef is_simple_power(x, n):\n \"\"\"Your task is to … | — | — | — | ||
| 77 | failed | {…} {"gen_args_0":{"arg_0":"\ndef iscube(a):\n '''\n Write a function that tak… | — | — | — | ||
| 81 | failed | {…} {"gen_args_0":{"arg_0":"\ndef numerical_letter_grade(grades):\n \"\"\"It is t… | — | — | — | ||
| 83 | failed | {…} {"gen_args_0":{"arg_0":"\ndef starts_one_ends(n):\n \"\"\"\n Given a posit… | — | — | — | ||
| 84 | failed | {…} {"gen_args_0":{"arg_0":"\ndef solve(N):\n \"\"\"Given a positive integer N, r… | — | — | — | ||
| 86 | failed | {…} {"gen_args_0":{"arg_0":"\ndef anti_shuffle(s):\n \"\"\"\n Write a function… | — | — | — | ||
| 87 | failed | {…} {"gen_args_0":{"arg_0":"\ndef get_row(lst, x):\n \"\"\"\n You are given a … | — | — | — | ||
| 89 | failed | {…} {"gen_args_0":{"arg_0":"\ndef encrypt(s):\n \"\"\"Create a function encrypt t… | — | — | — | ||
| 91 | failed | {…} {"gen_args_0":{"arg_0":"\ndef is_bored(S):\n \"\"\"\n You'll be given a st… | — | — | — | ||
| 92 | failed | {…} {"gen_args_0":{"arg_0":"\ndef any_int(x, y, z):\n '''\n Create a function … | — | — | — | ||
| 93 | failed | {…} {"gen_args_0":{"arg_0":"\ndef encode(message):\n \"\"\"\n Write a function… | — | — | — | ||
| 94 | failed | {…} {"gen_args_0":{"arg_0":"\n\ndef skjkasdkd(lst):\n \"\"\"You are given a list … | — | — | — | ||
| 95 | failed | {…} {"gen_args_0":{"arg_0":"\ndef check_dict_case(dict):\n \"\"\"\n Given a di… | — | — | — | ||
| 96 | failed | {…} {"gen_args_0":{"arg_0":"\ndef count_up_to(n):\n \"\"\"Implement a function th… | — | — | — | ||
| 98 | failed | {…} {"gen_args_0":{"arg_0":"\ndef count_upper(s):\n \"\"\"\n Given a string s,… | — | — | — | ||
| 99 | failed | {…} {"gen_args_0":{"arg_0":"\ndef closest_integer(value):\n '''\n Create a fun… | — | — | — | ||
| 101 | failed | {…} {"gen_args_0":{"arg_0":"\ndef words_string(s):\n \"\"\"\n You will be give… | — | — | — | ||
| 105 | failed | {…} {"gen_args_0":{"arg_0":"\ndef by_length(arr):\n \"\"\"\n Given an array of… | — | — | — | ||
| 106 | failed | {…} {"gen_args_0":{"arg_0":"\ndef f(n):\n \"\"\" Implement the function f that ta… | — | — | — | ||
| 108 | failed | {…} {"gen_args_0":{"arg_0":"\ndef count_nums(arr):\n \"\"\"\n Write a function… | — | — | — | ||
| 109 | failed | {…} {"gen_args_0":{"arg_0":"\ndef move_one_ball(arr):\n \"\"\"We have an array 'a… | — | — | — | ||
| 110 | failed | {…} {"gen_args_0":{"arg_0":"\ndef exchange(lst1, lst2):\n \"\"\"In this problem, … | — | — | — | ||
| 111 | failed | {…} {"gen_args_0":{"arg_0":"\ndef histogram(test):\n \"\"\"Given a string represe… | — | — | — | ||
| 112 | failed | {…} {"gen_args_0":{"arg_0":"\ndef reverse_delete(s,c):\n \"\"\"Task\n We are g… | — | — | — | ||
| 113 | failed | {…} {"gen_args_0":{"arg_0":"\ndef odd_count(lst):\n \"\"\"Given a list of strings… | — | — | — | ||
| 114 | failed | {…} {"gen_args_0":{"arg_0":"\ndef minSubArraySum(nums):\n \"\"\"\n Given an ar… | — | — | — | ||
| 115 | failed | {…} {"gen_args_0":{"arg_0":"\ndef max_fill(grid, capacity):\n import math\n \"… | — | — | — | ||
| 117 | failed | {…} {"gen_args_0":{"arg_0":"\ndef select_words(s, n):\n \"\"\"Given a string s an… | — | — | — | ||
| 119 | failed | {…} {"gen_args_0":{"arg_0":"\ndef match_parens(lst):\n '''\n You are given a l… | — | — | — | ||
| 120 | failed | {…} {"gen_args_0":{"arg_0":"\ndef maximum(arr, k):\n \"\"\"\n Given an array a… | — | — | — | ||
| 123 | failed | {…} {"gen_args_0":{"arg_0":"\ndef get_odd_collatz(n):\n \"\"\"\n Given a posit… | — | — | — | ||
| 125 | failed | {…} {"gen_args_0":{"arg_0":"\ndef split_words(txt):\n '''\n Given a string of … | — | — | — | ||
| 126 | failed | {…} {"gen_args_0":{"arg_0":"\ndef is_sorted(lst):\n '''\n Given a list of numb… | — | — | — | ||
| 127 | failed | {…} {"gen_args_0":{"arg_0":"\ndef intersection(interval1, interval2):\n \"\"\"You… | — | — | — | ||
| 128 | failed | {…} {"gen_args_0":{"arg_0":"\ndef prod_signs(arr):\n \"\"\"\n You are given an… | — | — | — | ||
| 130 | failed | {…} {"gen_args_0":{"arg_0":"\ndef tri(n):\n \"\"\"Everyone knows Fibonacci sequen… | — | — | — | ||
| 131 | failed | {…} {"gen_args_0":{"arg_0":"\ndef digits(n):\n \"\"\"Given a positive integer n, … | — | — | — | ||
| 132 | failed | {…} {"gen_args_0":{"arg_0":"\ndef is_nested(string):\n '''\n Create a function… | — | — | — | ||
| 134 | failed | {…} {"gen_args_0":{"arg_0":"\ndef check_if_last_char_is_a_letter(txt):\n '''\n … | — | — | — | ||
| 136 | failed | {…} {"gen_args_0":{"arg_0":"\ndef largest_smallest_integers(lst):\n '''\n Crea… | — | — | — | ||
| 137 | failed | {…} {"gen_args_0":{"arg_0":"\ndef compare_one(a, b):\n \"\"\"\n Create a funct… | — | — | — | ||
| 139 | failed | {…} {"gen_args_0":{"arg_0":"\ndef special_factorial(n):\n \"\"\"The Brazilian fac… | — | — | — | ||
| 140 | failed | {…} {"gen_args_0":{"arg_0":"\ndef fix_spaces(text):\n \"\"\"\n Given a string … | — | — | — | ||
| 141 | failed | {…} {"gen_args_0":{"arg_0":"\ndef file_name_check(file_name):\n \"\"\"Create a fu… | — | — | — | ||
| 145 | failed | {…} {"gen_args_0":{"arg_0":"\ndef order_by_points(nums):\n \"\"\"\n Write a fu… | — | — | — | ||
| 148 | failed | {…} {"gen_args_0":{"arg_0":"\ndef bf(planet1, planet2):\n '''\n There are eigh… | — | — | — | ||
| 149 | failed | {…} {"gen_args_0":{"arg_0":"\ndef sorted_list_sum(lst):\n \"\"\"Write a function … | — | — | — | ||
| 150 | failed | {…} {"gen_args_0":{"arg_0":"\ndef x_or_y(n, x, y):\n \"\"\"A simple program which… | — | — | — | ||
| 151 | failed | {…} {"gen_args_0":{"arg_0":"\ndef double_the_difference(lst):\n '''\n Given a … | — | — | — | ||
| 155 | failed | {…} {"gen_args_0":{"arg_0":"\ndef even_odd_count(num):\n \"\"\"Given an integer. … | — | — | — | ||
| 158 | failed | {…} {"gen_args_0":{"arg_0":"\ndef find_max(words):\n \"\"\"Write a function that … | — | — | — | ||
| 160 | failed | {…} {"gen_args_0":{"arg_0":"\ndef do_algebra(operator, operand):\n \"\"\"\n Gi… | — | — | — | ||
| 162 | failed | {…} {"gen_args_0":{"arg_0":"\ndef string_to_md5(text):\n \"\"\"\n Given a stri… | — | — | — | ||
| 163 | failed | {…} {"gen_args_0":{"arg_0":"\ndef generate_integers(a, b):\n \"\"\"\n Given tw… | — | — | — | ||