Samples · lm_eval_harness.humaneval

Run #26 · Adapter v1.0.0+humaneval-unsafe-flag · 71/164 samples shown · Score 56.7%

AI evaluation

Generated 2026-05-10 17:12 · claude-sonnet-4-6

Summary

The model qwen3.6-35b-a3b-tq3 reaches a pass rate of 56.7% on HumanEval (93 of 164 tasks), a middling result for a quantized model of this size; 71 of the 164 tasks (about 43%) are not solved correctly.
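The headline figures follow directly from the pass/fail counts shown in the score histogram on this page. A minimal sketch of the arithmetic, with variable names chosen purely for illustration:

```python
# Counts taken from the score histogram on this page
# (bin 0.9 - 1: 93 samples, bin 0 - 0.1: 71 samples).
passed = 93
failed = 71
total = passed + failed  # 164 samples

pass_rate = passed / total
fail_rate = failed / total

print(f"pass rate: {pass_rate:.1%}")  # 56.7%
print(f"fail rate: {fail_rate:.1%}")  # 43.3%
```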

Strengths

  • Simple algorithmic tasks (bit operations, palindrome counting, sorting by binary representation) are solved reliably.
  • No runtime errors (0 errors); the model always produces syntactically valid Python code.
  • Short, concise implementations without unnecessary overhead.

Weaknesses

  • Logic errors in individual tasks: `largest_divisor` iterates upward from 1 instead of downward from n-1 and therefore returns the smallest divisor instead of the largest.
  • Missing helper functions: `sum_product` calls `product()`, which is not defined.
  • Algorithm comprehension: `fizz_buzz` counts numbers instead of digits; `how_many_times` contains an off-by-one error in the substring search.
  • `decode_cyclic` returns the same code as `encode_cyclic` without implementing the inverse operation.
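The model's generated code is not shown on this page, so the following does not reproduce its output. It is a minimal sketch of what correct solutions to the flagged tasks could look like, written against the task descriptions in the public HumanEval dataset. The behavior for `n < 2` in `largest_divisor` and for an empty substring in `how_many_times` is an assumption, not part of the task description.

```python
from typing import List, Tuple


def largest_divisor(n: int) -> int:
    """Largest divisor of n that is smaller than n (searches downward from n-1)."""
    for candidate in range(n - 1, 0, -1):
        if n % candidate == 0:
            return candidate
    raise ValueError("n must be at least 2")  # assumption: no divisor exists below 2


def sum_product(numbers: List[int]) -> Tuple[int, int]:
    """Sum and product of a list, computed without any undefined helper."""
    total, prod = 0, 1
    for value in numbers:
        total += value
        prod *= value
    return total, prod


def fizz_buzz(n: int) -> int:
    """Count occurrences of the digit 7 (not the numbers) below n divisible by 11 or 13."""
    return sum(str(i).count("7") for i in range(n) if i % 11 == 0 or i % 13 == 0)


def how_many_times(string: str, substring: str) -> int:
    """Count occurrences of substring in string, including overlapping ones."""
    if not substring:
        return 0  # assumption: an empty substring counts as zero matches
    last_start = len(string) - len(substring)  # last valid start index (inclusive)
    return sum(
        1
        for start in range(last_start + 1)
        if string[start:start + len(substring)] == substring
    )


def encode_cyclic(s: str) -> str:
    """Rotate every complete group of three characters one step to the left."""
    groups = [s[3 * i:3 * i + 3] for i in range((len(s) + 2) // 3)]
    return "".join(g[1:] + g[0] if len(g) == 3 else g for g in groups)


def decode_cyclic(s: str) -> str:
    """Inverse of encode_cyclic: rotate every complete group one step to the right."""
    groups = [s[3 * i:3 * i + 3] for i in range((len(s) + 2) // 3)]
    return "".join(g[-1] + g[:-1] if len(g) == 3 else g for g in groups)


print(largest_divisor(15))                        # 5
print(fizz_buzz(78), fizz_buzz(79))               # 2 3
print(how_many_times("aaaa", "aa"))               # 3
print(decode_cyclic(encode_cyclic("abcdefgh")))   # abcdefgh
```

Each function targets one of the error classes listed above: search direction, a self-contained product, digit counting, the inclusive upper bound of the start index, and an explicit inverse rotation.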

Notable observations

A clear pattern in the failures: the model misunderstands the task specification at a conceptual level (counting digits vs. numbers, smallest vs. largest divisor, the inverse of a function). In addition, for simple utility functions it calls non-existent built-ins instead of implementing the logic itself. The `parse_nested_parens` failures point to weaknesses in stateful string processing.
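To illustrate what "stateful string processing" means for `parse_nested_parens`, here is a minimal sketch based on the public HumanEval task description (space-separated groups of parentheses, return the deepest nesting level per group). It is not the model's output, which is not shown on this page, and it assumes well-formed input.

```python
from typing import List


def parse_nested_parens(paren_string: str) -> List[int]:
    """Return the maximum nesting depth of each space-separated group."""
    depths = []
    for group in paren_string.split():
        depth = 0       # state: current nesting level while scanning
        max_depth = 0   # state: deepest level seen so far in this group
        for ch in group:
            if ch == "(":
                depth += 1
                max_depth = max(max_depth, depth)
            elif ch == ")":
                depth -= 1
        depths.append(max_depth)
    return depths


print(parse_nested_parens("(()()) ((())) () ((())()())"))  # [2, 3, 1, 3]
```

The two counters must be reset per group and updated per character; dropping either step is a typical way for such a solution to fail.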

Recommendation

Investigate the sub-areas "algorithmic correctness for inverted or mirrored operations" and "specification comprehension (digit vs. number, min vs. max)" in a targeted way, using few-shot prompting or chain-of-thought guidance; alternatively, raise the sampling temperature slightly (e.g. 0.2–0.4) and evaluate pass@k for k > 1.
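For the pass@k part of the recommendation, a commonly used approach is the unbiased estimator introduced together with HumanEval: draw n samples per task, count the c correct ones, and compute 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; whether the harness used for this run computes pass@k the same way is an assumption, and the sample counts in the usage lines are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples generated for the task
    c: samples that passed the tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every draw of k samples contains at least one correct one
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 samples for one task, 3 of them correct.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
print(round(pass_at_k(10, 3, 5), 4))  # 0.9167
```

The benchmark-level score is the mean of this value over all tasks; with k = 1 and a single sample per task it reduces to the plain pass rate reported above.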

Overview

164 samples
Distribution: 93 passed · 71 failed
Score histogram
0 – 0.1: 71
0.1 – 0.2: 0
0.2 – 0.3: 0
0.3 – 0.4: 0
0.4 – 0.5: 0
0.5 – 0.6: 0
0.6 – 0.7: 0
0.7 – 0.8: 0
0.8 – 0.9: 0
0.9 – 1: 93
Filter: score < 0.5
Question ID Status Score Prompt Latency Tokens/s TTFT
6 failed 0% {…} {"gen_args_0":{"arg_0":"from typing import List\n\n\ndef parse_nested_parens(par…
8 failed 0% {…} {"gen_args_0":{"arg_0":"from typing import List, Tuple\n\n\ndef sum_product(numb…
10 failed 0% {…} {"gen_args_0":{"arg_0":"\n\ndef is_palindrome(string: str) -> bool:\n \"\"\" …
18 failed 0% {…} {"gen_args_0":{"arg_0":"\n\ndef how_many_times(string: str, substring: str) -> i…
24 failed 0% {…} {"gen_args_0":{"arg_0":"\n\ndef largest_divisor(n: int) -> int:\n \"\"\" For …
26 failed 0% {…} {"gen_args_0":{"arg_0":"from typing import List\n\n\ndef remove_duplicates(numbe…
32 failed 0% {…} {"gen_args_0":{"arg_0":"import math\n\n\ndef poly(xs: list, x: float):\n \"\"…
36 failed 0% {…} {"gen_args_0":{"arg_0":"\n\ndef fizz_buzz(n: int):\n \"\"\"Return the number …
37 failed 0% {…} {"gen_args_0":{"arg_0":"\n\ndef sort_even(l: list):\n \"\"\"This function tak…
38 failed 0% {…} {"gen_args_0":{"arg_0":"\n\ndef encode_cyclic(s: str):\n \"\"\"\n returns …
39 failed 0% {…} {"gen_args_0":{"arg_0":"\n\ndef prime_fib(n: int):\n \"\"\"\n prime_fib re…
59 failed 0% {…} {"gen_args_0":{"arg_0":"\n\ndef largest_prime_factor(n: int):\n \"\"\"Return …
65 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef circular_shift(x, shift):\n \"\"\"Circular shif…
67 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef fruit_distribution(s,n):\n \"\"\"\n In this …
69 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef search(lst):\n '''\n You are given a non-emp…
73 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef smallest_change(arr):\n \"\"\"\n Given an ar…
75 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef is_multiply_prime(a):\n \"\"\"Write a function …
76 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef is_simple_power(x, n):\n \"\"\"Your task is to …
77 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef iscube(a):\n '''\n Write a function that tak…
81 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef numerical_letter_grade(grades):\n \"\"\"It is t…
83 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef starts_one_ends(n):\n \"\"\"\n Given a posit…
84 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef solve(N):\n \"\"\"Given a positive integer N, r…
86 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef anti_shuffle(s):\n \"\"\"\n Write a function…
87 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef get_row(lst, x):\n \"\"\"\n You are given a …
89 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef encrypt(s):\n \"\"\"Create a function encrypt t…
91 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef is_bored(S):\n \"\"\"\n You'll be given a st…
92 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef any_int(x, y, z):\n '''\n Create a function …
93 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef encode(message):\n \"\"\"\n Write a function…
94 failed 0% {…} {"gen_args_0":{"arg_0":"\n\ndef skjkasdkd(lst):\n \"\"\"You are given a list …
95 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef check_dict_case(dict):\n \"\"\"\n Given a di…
96 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef count_up_to(n):\n \"\"\"Implement a function th…
98 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef count_upper(s):\n \"\"\"\n Given a string s,…
99 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef closest_integer(value):\n '''\n Create a fun…
101 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef words_string(s):\n \"\"\"\n You will be give…
105 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef by_length(arr):\n \"\"\"\n Given an array of…
106 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef f(n):\n \"\"\" Implement the function f that ta…
108 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef count_nums(arr):\n \"\"\"\n Write a function…
109 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef move_one_ball(arr):\n \"\"\"We have an array 'a…
110 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef exchange(lst1, lst2):\n \"\"\"In this problem, …
111 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef histogram(test):\n \"\"\"Given a string represe…
112 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef reverse_delete(s,c):\n \"\"\"Task\n We are g…
113 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef odd_count(lst):\n \"\"\"Given a list of strings…
114 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef minSubArraySum(nums):\n \"\"\"\n Given an ar…
115 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef max_fill(grid, capacity):\n import math\n \"…
117 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef select_words(s, n):\n \"\"\"Given a string s an…
119 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef match_parens(lst):\n '''\n You are given a l…
120 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef maximum(arr, k):\n \"\"\"\n Given an array a…
123 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef get_odd_collatz(n):\n \"\"\"\n Given a posit…
125 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef split_words(txt):\n '''\n Given a string of …
126 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef is_sorted(lst):\n '''\n Given a list of numb…
127 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef intersection(interval1, interval2):\n \"\"\"You…
128 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef prod_signs(arr):\n \"\"\"\n You are given an…
130 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef tri(n):\n \"\"\"Everyone knows Fibonacci sequen…
131 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef digits(n):\n \"\"\"Given a positive integer n, …
132 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef is_nested(string):\n '''\n Create a function…
134 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef check_if_last_char_is_a_letter(txt):\n '''\n …
136 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef largest_smallest_integers(lst):\n '''\n Crea…
137 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef compare_one(a, b):\n \"\"\"\n Create a funct…
139 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef special_factorial(n):\n \"\"\"The Brazilian fac…
140 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef fix_spaces(text):\n \"\"\"\n Given a string …
141 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef file_name_check(file_name):\n \"\"\"Create a fu…
145 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef order_by_points(nums):\n \"\"\"\n Write a fu…
148 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef bf(planet1, planet2):\n '''\n There are eigh…
149 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef sorted_list_sum(lst):\n \"\"\"Write a function …
150 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef x_or_y(n, x, y):\n \"\"\"A simple program which…
151 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef double_the_difference(lst):\n '''\n Given a …
155 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef even_odd_count(num):\n \"\"\"Given an integer. …
158 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef find_max(words):\n \"\"\"Write a function that …
160 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef do_algebra(operator, operand):\n \"\"\"\n Gi…
162 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef string_to_md5(text):\n \"\"\"\n Given a stri…
163 failed 0% {…} {"gen_args_0":{"arg_0":"\ndef generate_integers(a, b):\n \"\"\"\n Given tw…
71 of 164 samples · Limit 200