Samples · bfcl.single_turn

Run #75 · Adapter v1.0.0+v4-layout-aliases+reasoning · 107/707 samples shown · Score 49.1%

AI evaluation

Generated 2026-05-13 21:38 · claude-sonnet-4-6

Summary

The model reaches a pass rate of ~49%, meaning that a narrow majority of samples fail. Performance is therefore well below a practical threshold for single-turn function calling.

Strengths

  • No errors (0 of 1390 calls); the model always produces structured output.
  • Arguments and values are mostly passed correctly in substance (correct numbers, correct parameter names).

Weaknesses

  • Dominant pattern: the model replaces dots in function names with underscores (`math.factorial` → `math_factorial`), which systematically leads to `wrong_func_name` failures.
  • Occasional type errors occur (e.g. `interval` as a flat array instead of an array of arrays), which points to a weak grasp of schemas with nested types.
  • Optional parameters are often omitted, even though the expected output would accept them as an empty string; this could cause further silent failures.
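
To estimate how many `wrong_func_name` failures come purely from the dot-to-underscore substitution, a small diagnostic helper can map a predicted name back to its schema name. This is a minimal sketch, not part of the benchmark's checker; the function name `resolve_function_name` and its inputs are hypothetical.

```python
from typing import List, Optional


def resolve_function_name(predicted: str, schema_names: List[str]) -> Optional[str]:
    """Map a predicted function name back to a schema name.

    Tolerates only the '.' -> '_' substitution described above.
    Returns None when there is no match or the match is ambiguous.
    """
    # An exact match always wins.
    if predicted in schema_names:
        return predicted
    # Otherwise compare against each schema name with dots replaced by underscores.
    candidates = [name for name in schema_names if name.replace(".", "_") == predicted]
    # Accept only an unambiguous match.
    return candidates[0] if len(candidates) == 1 else None


# Example with the name pair quoted in the report.
print(resolve_function_name("math_factorial", ["math.factorial", "math.gcd"]))
```

Re-scoring the failed samples with such a mapping (for analysis only) would separate naming failures from genuine argument or type failures.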

Observations

Almost all visible failures stem from a single, consistent error: dot notation in function names is not used. This suggests that the model either saw few training examples with namespace separators (`.`) or that the system prompt / tool schema does not specify this convention clearly enough. The type error with `interval` is a separate but likewise reproducible problem with array-of-floats schemas.
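
The `interval` type error can be detected mechanically by comparing the nesting depth of the passed value with the depth the schema declares. The sketch below assumes a JSON-Schema-like parameter description with `type` and `items` keys; the helper names and the example schema are illustrative, not taken from the run.

```python
def nesting_depth(value) -> int:
    """Return the list nesting depth: 0 for scalars, 1 for [a, b], 2 for [[a, b]]."""
    if not isinstance(value, list):
        return 0
    if not value:
        return 1
    return 1 + max(nesting_depth(item) for item in value)


def expected_depth(schema: dict) -> int:
    """Count nested 'array' levels in a JSON-Schema-like parameter description."""
    depth = 0
    while schema.get("type") == "array":
        depth += 1
        schema = schema.get("items", {})
    return depth


# Assumed schema for illustration: an array of arrays of numbers.
interval_schema = {
    "type": "array",
    "items": {"type": "array", "items": {"type": "number"}},
}

flat = [0.0, 1.0]        # shape of the failing calls described above
nested = [[0.0, 1.0]]    # shape the schema asks for

print(nesting_depth(flat) == expected_depth(interval_schema))    # False
print(nesting_depth(nested) == expected_depth(interval_schema))  # True
```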

Recommendation

Extend the tool-schema prompt with an explicit note that function names must be copied exactly from the schema, including dot notation; alternatively, add few-shot examples with dot namespaces to the system prompt, then re-run the benchmark.
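
One way to implement the first option is to generate the naming hint directly from the tool list, so the prompt always names the exact identifiers. This is a sketch under assumptions: `build_name_hint` is a hypothetical helper, the tools are assumed to be dicts with a `name` key, and the wording of the hint is only a suggestion.

```python
from typing import Dict, List


def build_name_hint(tools: List[Dict]) -> str:
    """Build a system-prompt snippet that lists the exact function names."""
    names = [tool["name"] for tool in tools]
    lines = [
        "Call functions using their names exactly as listed, including any dots:",
        *[f"- {name}" for name in names],
        "Do not replace '.' with '_' or alter a name in any other way.",
    ]
    return "\n".join(lines)


# Example with the function name quoted in the report.
hint = build_name_hint([{"name": "math.factorial"}])
print(hint)
```

Whether such a hint is enough, or whether few-shot examples are needed, can only be judged by re-running the benchmark with it.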

Overview

707 samples
Distribution: 707
Score histogram
0 – 0.1: 707 · 0.1 – 0.2: 0 · 0.2 – 0.3: 0 · 0.3 – 0.4: 0 · 0.4 – 0.5: 0 · 0.5 – 0.6: 0 · 0.6 – 0.7: 0 · 0.7 – 0.8: 0 · 0.8 – 0.9: 0 · 0.9 – 1: 0
Active filter: Score < 0.5
Question ID Status Score Prompt Latency Tokens/s TTFT
parallel_multiple_108 failed 0% {…} [[{"role":"user","content":"\"Could you please analyze my personality based on t…
parallel_multiple_109 failed 0% {…} [[{"role":"user","content":"\"Can you tell me about the monarchs of France durin…
parallel_multiple_110 failed 0% {…} [[{"role":"user","content":"\"What was the population of California in 1980 and …
parallel_multiple_111 failed 0% {…} [[{"role":"user","content":"\"Could you please provide me with the origin and fo…
parallel_multiple_112 failed 0% {…} [[{"role":"user","content":"\"Could you please help me find the price of the art…
parallel_multiple_113 failed 0% {…} [[{"role":"user","content":"\"Could you please help me with some information? I …
parallel_multiple_114 failed 0% {…} [[{"role":"user","content":"\"Could you please help me order a custom sculpture …
parallel_multiple_115 failed 0% {…} [[{"role":"user","content":"\"Can you help me plan my trip to New York? I would …
parallel_multiple_117 failed 0% {…} [[{"role":"user","content":"\"Can you help me book a ticket for a concert of Tay…
parallel_multiple_118 failed 0% {…} [[{"role":"user","content":"\"Can you help me create a piece of music in D Minor…
parallel_multiple_119 failed 0% {…} [[{"role":"user","content":"\"Can you tell me how many all-time goals Cristiano …
parallel_multiple_120 failed 0% {…} [[{"role":"user","content":"\"Can you tell me the scores for the Manchester Unit…
parallel_multiple_121 failed 0% {…} [[{"role":"user","content":"\"I'm planning a game night and I need some board ga…
parallel_multiple_122 failed 0% {…} [[{"role":"user","content":"\"Could you please find the latest updates for the g…
parallel_multiple_123 failed 0% {…} [[{"role":"user","content":"\"Can you tell me how many active players were engag…
parallel_multiple_124 failed 0% {…} [[{"role":"user","content":"\"Can you help me plan my meals for the day? I want …
parallel_multiple_125 failed 0% {…} [[{"role":"user","content":"\"Can you help me plan a day out in Seattle, WA for …
parallel_multiple_126 failed 0% {…} [[{"role":"user","content":"\"Can you help me find a recipe that uses chicken as…
parallel_multiple_127 failed 0% {…} [[{"role":"user","content":"\"Can you help me plan my trip? I need to book a hot…
parallel_multiple_128 failed 0% {…} [[{"role":"user","content":"\"Could you help me plan my vacation? I need to know…
parallel_multiple_129 failed 0% {…} [[{"role":"user","content":"\"Can you help me with a couple of conversions? Firs…
parallel_multiple_131 failed 0% {…} [[{"role":"user","content":"\"Could you please help me with a couple of calculat…
parallel_multiple_132 failed 0% {…} [[{"role":"user","content":"\"Could you first calculate the derivative of the fu…
parallel_multiple_133 failed 0% {…} [[{"role":"user","content":"\"Imagine you are a music producer and you are worki…
parallel_multiple_134 failed 0% {…} [[{"role":"user","content":"\"Could you please help me with two tasks? First, I'…
parallel_multiple_136 failed 0% {…} [[{"role":"user","content":"\"Can you help me with a few things? First, I'm inte…
parallel_multiple_137 failed 0% {…} [[{"role":"user","content":"\"Can you tell me the function of the molecule ATP i…
parallel_multiple_138 failed 0% {…} [[{"role":"user","content":"\"Could you help me with a few tasks? Firstly, I am …
parallel_multiple_139 failed 0% {…} [[{"role":"user","content":"\"Imagine you are a teacher preparing for a science …
parallel_multiple_141 failed 0% {…} [[{"role":"user","content":"\"Can you help me with a few things? First, I'm stud…
parallel_multiple_142 failed 0% {…} [[{"role":"user","content":"\"In the game 'Animal Crossing', I am interested in …
parallel_multiple_143 failed 0% {…} [[{"role":"user","content":"\"Can you help me with a few tasks? First, I need to…
parallel_multiple_145 failed 0% {…} [[{"role":"user","content":"\"Can you help me with a few tasks? First, I am play…
parallel_multiple_146 failed 0% {…} [[{"role":"user","content":"You are an art curator and a part-time biologist. Yo…
parallel_multiple_147 failed 0% {…} [[{"role":"user","content":"\"Can you help me plan a day out? I want to start by…
parallel_multiple_148 failed 0% {…} [[{"role":"user","content":"\"Can you tell me the net worth of the famous footba…
parallel_multiple_149 failed 0% {…} [[{"role":"user","content":"\"Can you help me with a few tasks? First, I need to…
parallel_multiple_150 failed 0% {…} [[{"role":"user","content":"\"Could you help me with a few things? First, I'm in…
parallel_multiple_151 failed 0% {…} [[{"role":"user","content":"\"Imagine you are planning a vacation to Paris, Fran…
parallel_multiple_152 failed 0% {…} [[{"role":"user","content":"\"Can you help me with a few things? First, I'm tryi…
parallel_multiple_153 failed 0% {…} [[{"role":"user","content":"\"Could you help me plan a trip? I want to go to Par…
parallel_multiple_155 failed 0% {…} [[{"role":"user","content":"\"Can you help me with a few calculations? First, I …
parallel_multiple_156 failed 0% {…} [[{"role":"user","content":"\"Could you please help me with the following tasks?…
parallel_multiple_157 failed 0% {…} [[{"role":"user","content":"\"Could you help me with a few calculations and sear…
parallel_multiple_158 failed 0% {…} [[{"role":"user","content":"\"Could you help me with a few tasks? First, I'm int…
parallel_multiple_159 failed 0% {…} [[{"role":"user","content":"\"Could you help me with a few things? First, I'd li…
parallel_multiple_160 failed 0% {…} [[{"role":"user","content":"\"Can you help me with two tasks? First, I want to c…
parallel_multiple_161 failed 0% {…} [[{"role":"user","content":"\"Can you help me with a few calculations? First, I'…
parallel_multiple_162 failed 0% {…} [[{"role":"user","content":"\"Imagine you are planning your finances and you wan…
parallel_multiple_163 failed 0% {…} [[{"role":"user","content":"\"John is planning to invest in a mutual fund. He ha…
parallel_multiple_165 failed 0% {…} [[{"role":"user","content":"\"Can you help me plan my week? I'm interested in at…
parallel_multiple_166 failed 0% {…} [[{"role":"user","content":"\"Could you please help me with the following tasks?…
parallel_multiple_167 failed 0% {…} [[{"role":"user","content":"\"In the game 'Animal Crossing' during the 'Summer' …
parallel_multiple_168 failed 0% {…} [[{"role":"user","content":"\"Could you help me with a two-part request? First, …
parallel_multiple_169 failed 0% {…} [[{"role":"user","content":"\"Could you please tell me the latest game score, in…
parallel_multiple_170 failed 0% {…} [[{"role":"user","content":"\"Imagine you are playing a role-playing game and yo…
parallel_multiple_172 failed 0% {…} [[{"role":"user","content":"\"Can you help me with the following tasks? First, I…
parallel_multiple_173 failed 0% {…} [[{"role":"user","content":"\"Can you help me find a Thai restaurant in New York…
parallel_multiple_174 failed 0% {…} [[{"role":"user","content":"\"Could you help me with a few tasks? First, I need …
parallel_multiple_177 failed 0% {…} [[{"role":"user","content":"\"Can you first find out the key historical events r…
parallel_multiple_178 failed 0% {…} [[{"role":"user","content":"\"Can you help me with a few things? First, I'm plan…
parallel_multiple_179 failed 0% {…} [[{"role":"user","content":"\"Can you help me with a few things? First, I need t…
parallel_multiple_180 failed 0% {…} [[{"role":"user","content":"\"Can you tell me who discovered the Higgs Boson and…
parallel_multiple_181 failed 0% {…} [[{"role":"user","content":"\"Could you help me with a few tasks? First, I need …
parallel_multiple_182 failed 0% {…} [[{"role":"user","content":"\"Imagine you are a musician who also loves to paint…
parallel_multiple_183 failed 0% {…} [[{"role":"user","content":"\"Could you first calculate the probability of drawi…
parallel_multiple_185 failed 0% {…} [[{"role":"user","content":"\"Could you first fetch the top 10 popular artworks …
parallel_multiple_186 failed 0% {…} [[{"role":"user","content":"\" I'm trying to figure out the RGB values of the co…
parallel_multiple_188 failed 0% {…} [[{"role":"user","content":"\"Could you help me with a few calculations and sear…
parallel_multiple_189 failed 0% {…} [[{"role":"user","content":"\"Can you help me plan a trip? I want to start by fi…
parallel_multiple_190 failed 0% {…} [[{"role":"user","content":"\"Could you please help me with the following tasks?…
parallel_multiple_191 failed 0% {…} [[{"role":"user","content":"\"Imagine you are planning a cozy evening at home. Y…
parallel_multiple_192 failed 0% {…} [[{"role":"user","content":"\"Can you help me plan a dinner? I am looking for a …
parallel_multiple_193 failed 0% {…} [[{"role":"user","content":"\"Can you tell me the name of the scientist who is c…
parallel_multiple_194 failed 0% {…} [[{"role":"user","content":"\"Could you help me with a few tasks? First, I would…
parallel_multiple_195 failed 0% {…} [[{"role":"user","content":"\"Can you help me find a multiplayer game that is co…
parallel_multiple_196 failed 0% {…} [[{"role":"user","content":"\"Could you please help me with some information? Fi…
parallel_multiple_198 failed 0% {…} [[{"role":"user","content":"\"Can you first find me a vegan, main course recipe …
parallel_multiple_199 failed 0% {…} [[{"role":"user","content":"\"Can you help me with two things? First, I am curre…
irrelevance_26 failed 0% {…} [[{"role":"user","content":"How much gas is generated from heating a 2 m\u00b3 c…
irrelevance_27 failed 0% {…} [[{"role":"user","content":"What will be the energy needed to increase the tempe…
irrelevance_33 failed 0% {…} [[{"role":"user","content":"Identify the genetic code sequence \"ATCG\"."}]]
irrelevance_53 failed 0% {…} [[{"role":"user","content":"Who won the world series in 2018?"}]]
irrelevance_60 failed 0% {…} [[{"role":"user","content":"What is the final price of a product after a 25% dis…
irrelevance_65 failed 0% {…} [[{"role":"user","content":"How many red marbles are there in a bag of 20, given…
irrelevance_74 failed 0% {…} [[{"role":"user","content":"What is the rate of return for a business with $1500…
irrelevance_86 failed 0% {…} [[{"role":"user","content":"What's the penalty for burglary in California?"}]]
irrelevance_91 failed 0% {…} [[{"role":"user","content":"Can I report noise complaint to my local council in …
irrelevance_101 failed 0% {…} [[{"role":"user","content":"What is the best month to visit Hawaii?"}]]
irrelevance_111 failed 0% {…} [[{"role":"user","content":"Find a GMO yoga mat that I can buy in-store."}]]
irrelevance_113 failed 0% {…} [[{"role":"user","content":"Find me restaurants in London"}]]
irrelevance_115 failed 0% {…} [[{"role":"user","content":"How long would it take to travel from Boston to New …
irrelevance_118 failed 0% {…} [[{"role":"user","content":"Who won the 1996 NBA championships?"}]]
irrelevance_121 failed 0% {…} [[{"role":"user","content":"Find the information on motor neuron diseases"}]]
irrelevance_124 failed 0% {…} [[{"role":"user","content":"What's the latest trend in technology?"}]]
irrelevance_127 failed 0% {…} [[{"role":"user","content":"What's the general mood of twitter regarding the new…
irrelevance_165 failed 0% {…} [[{"role":"user","content":"What type of instrument is a cello?"}]]
irrelevance_180 failed 0% {…} [[{"role":"user","content":"Who are in the cricket matches scheduled for today?"…
irrelevance_182 failed 0% {…} [[{"role":"user","content":"How many championships did Michael Jordan win in his…
irrelevance_188 failed 0% {…} [[{"role":"user","content":"Who won the championship of the World Series in 2020…
irrelevance_213 failed 0% {…} [[{"role":"user","content":"Where is a good place for pizza in Boston?"}]]
irrelevance_222 failed 0% {…} [[{"role":"user","content":"How many calories are in a tomato?"}]]
irrelevance_223 failed 0% {…} [[{"role":"user","content":"Find a bakery that sells sourdough bread in Chicago.…
irrelevance_226 failed 0% {…} [[{"role":"user","content":"What's timezone is it in London?"}]]
irrelevance_228 failed 0% {…} [[{"role":"user","content":"What is the current time in Sydney, Australia?"}]]
irrelevance_236 failed 0% {…} [[{"role":"user","content":"What is the quickest way to get to Tokyo from London…
irrelevance_239 failed 0% {…} [[{"role":"user","content":"Find the distance in kilometers from San Francisco t…
107 of 707 samples · Limit 200