Samples · bfcl.single_turn
KI-Auswertung
Generiert 2026-05-13 21:38 · claude-sonnet-4-6Zusammenfassung
Das Modell erreicht eine Pass-Rate von ~49 %, was einer knappen Mehrheit an Fehlern entspricht. Die Performance liegt damit deutlich unter einem praxistauglichen Schwellenwert für Single-Turn Function Calling.
Stärken
- Keine Errors (0 von 1390 Aufrufen), das Modell produziert stets strukturierte Ausgaben.
- Argumente und Werte werden inhaltlich meist korrekt übergeben (richtige Zahlen, richtige Parameter-Namen).
Schwächen
- Dominantes Muster: Das Modell ersetzt Punkte in Funktionsnamen durch Unterstriche (`math.factorial` → `math_factorial`), was systematisch zu `wrong_func_name`-Fehlern führt.
- Vereinzelt treten Typ-Fehler auf (z. B. `interval` als flaches Array statt Array-of-Arrays), was auf mangelndes Schema-Verständnis bei verschachtelten Typen hindeutet.
- Optionale Parameter werden häufig weggelassen, obwohl sie im Expected-Output als leerer String akzeptiert würden — dies könnte weitere stille Fehler verursachen.
Auffälligkeiten
Fast alle sichtbaren Failures basieren auf einem einzigen, konsistenten Fehler: Punkt-Notation in Funktionsnamen wird nicht verwendet. Dies deutet darauf hin, dass das Modell entweder im Training kaum Beispiele mit Namespace-Trennzeichen (`.`) gesehen hat oder der System-Prompt / Tool-Schema diese Konvention nicht ausreichend vorgibt. Der Typ-Fehler bei `interval` ist ein separates, aber ebenfalls reproduzierbares Problem bei Array-of-Floats-Schemas.
Empfehlung
Den Tool-Schema-Prompt explizit um den Hinweis ergänzen, dass Funktionsnamen exakt — inklusive Punkt-Notation — aus dem Schema übernommen werden müssen; alternativ Few-Shot-Beispiele mit Dot-Namespaces in den System-Prompt integrieren und anschließend den Benchmark erneut auswerten.
Übersicht
707 Samples| Frage-ID | Status | Score | Prompt | Latenz | Tokens/s | TTFT | |
|---|---|---|---|---|---|---|---|
| parallel_multiple_108 | failed | {…} [[{"role":"user","content":"\"Could you please analyze my personality based on t… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_109 | failed | {…} [[{"role":"user","content":"\"Can you tell me about the monarchs of France durin… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_110 | failed | {…} [[{"role":"user","content":"\"What was the population of California in 1980 and … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_111 | failed | {…} [[{"role":"user","content":"\"Could you please provide me with the origin and fo… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_112 | failed | {…} [[{"role":"user","content":"\"Could you please help me find the price of the art… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_113 | failed | {…} [[{"role":"user","content":"\"Could you please help me with some information? I … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_114 | failed | {…} [[{"role":"user","content":"\"Could you please help me order a custom sculpture … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_115 | failed | {…} [[{"role":"user","content":"\"Can you help me plan my trip to New York? I would … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_117 | failed | {…} [[{"role":"user","content":"\"Can you help me book a ticket for a concert of Tay… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_118 | failed | {…} [[{"role":"user","content":"\"Can you help me create a piece of music in D Minor… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_119 | failed | {…} [[{"role":"user","content":"\"Can you tell me how many all-time goals Cristiano … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_120 | failed | {…} [[{"role":"user","content":"\"Can you tell me the scores for the Manchester Unit… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_121 | failed | {…} [[{"role":"user","content":"\"I'm planning a game night and I need some board ga… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_122 | failed | {…} [[{"role":"user","content":"\"Could you please find the latest updates for the g… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_123 | failed | {…} [[{"role":"user","content":"\"Can you tell me how many active players were engag… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_124 | failed | {…} [[{"role":"user","content":"\"Can you help me plan my meals for the day? I want … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_125 | failed | {…} [[{"role":"user","content":"\"Can you help me plan a day out in Seattle, WA for … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_126 | failed | {…} [[{"role":"user","content":"\"Can you help me find a recipe that uses chicken as… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_127 | failed | {…} [[{"role":"user","content":"\"Can you help me plan my trip? I need to book a hot… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_128 | failed | {…} [[{"role":"user","content":"\"Could you help me plan my vacation? I need to know… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_129 | failed | {…} [[{"role":"user","content":"\"Can you help me with a couple of conversions? Firs… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_131 | failed | {…} [[{"role":"user","content":"\"Could you please help me with a couple of calculat… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_132 | failed | {…} [[{"role":"user","content":"\"Could you first calculate the derivative of the fu… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_133 | failed | {…} [[{"role":"user","content":"\"Imagine you are a music producer and you are worki… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_134 | failed | {…} [[{"role":"user","content":"\"Could you please help me with two tasks? First, I'… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_136 | failed | {…} [[{"role":"user","content":"\"Can you help me with a few things? First, I'm inte… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_137 | failed | {…} [[{"role":"user","content":"\"Can you tell me the function of the molecule ATP i… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_138 | failed | {…} [[{"role":"user","content":"\"Could you help me with a few tasks? Firstly, I am … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_139 | failed | {…} [[{"role":"user","content":"\"Imagine you are a teacher preparing for a science … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_141 | failed | {…} [[{"role":"user","content":"\"Can you help me with a few things? First, I'm stud… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_142 | failed | {…} [[{"role":"user","content":"\"In the game 'Animal Crossing', I am interested in … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_143 | failed | {…} [[{"role":"user","content":"\"Can you help me with a few tasks? First, I need to… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_145 | failed | {…} [[{"role":"user","content":"\"Can you help me with a few tasks? First, I am play… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_146 | failed | {…} [[{"role":"user","content":"You are an art curator and a part-time biologist. Yo… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_147 | failed | {…} [[{"role":"user","content":"\"Can you help me plan a day out? I want to start by… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_148 | failed | {…} [[{"role":"user","content":"\"Can you tell me the net worth of the famous footba… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_149 | failed | {…} [[{"role":"user","content":"\"Can you help me with a few tasks? First, I need to… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_150 | failed | {…} [[{"role":"user","content":"\"Could you help me with a few things? First, I'm in… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_151 | failed | {…} [[{"role":"user","content":"\"Imagine you are planning a vacation to Paris, Fran… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_152 | failed | {…} [[{"role":"user","content":"\"Can you help me with a few things? First, I'm tryi… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_153 | failed | {…} [[{"role":"user","content":"\"Could you help me plan a trip? I want to go to Par… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_155 | failed | {…} [[{"role":"user","content":"\"Can you help me with a few calculations? First, I … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_156 | failed | {…} [[{"role":"user","content":"\"Could you please help me with the following tasks?… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_157 | failed | {…} [[{"role":"user","content":"\"Could you help me with a few calculations and sear… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_158 | failed | {…} [[{"role":"user","content":"\"Could you help me with a few tasks? First, I'm int… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_159 | failed | {…} [[{"role":"user","content":"\"Could you help me with a few things? First, I'd li… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_160 | failed | {…} [[{"role":"user","content":"\"Can you help me with two tasks? First, I want to c… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_161 | failed | {…} [[{"role":"user","content":"\"Can you help me with a few calculations? First, I'… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_162 | failed | {…} [[{"role":"user","content":"\"Imagine you are planning your finances and you wan… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_163 | failed | {…} [[{"role":"user","content":"\"John is planning to invest in a mutual fund. He ha… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_165 | failed | {…} [[{"role":"user","content":"\"Can you help me plan my week? I'm interested in at… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_166 | failed | {…} [[{"role":"user","content":"\"Could you please help me with the following tasks?… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_167 | failed | {…} [[{"role":"user","content":"\"In the game 'Animal Crossing' during the 'Summer' … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_168 | failed | {…} [[{"role":"user","content":"\"Could you help me with a two-part request? First, … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_169 | failed | {…} [[{"role":"user","content":"\"Could you please tell me the latest game score, in… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_170 | failed | {…} [[{"role":"user","content":"\"Imagine you are playing a role-playing game and yo… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_172 | failed | {…} [[{"role":"user","content":"\"Can you help me with the following tasks? First, I… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_173 | failed | {…} [[{"role":"user","content":"\"Can you help me find a Thai restaurant in New York… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_174 | failed | {…} [[{"role":"user","content":"\"Could you help me with a few tasks? First, I need … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_177 | failed | {…} [[{"role":"user","content":"\"Can you first find out the key historical events r… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_178 | failed | {…} [[{"role":"user","content":"\"Can you help me with a few things? First, I'm plan… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_179 | failed | {…} [[{"role":"user","content":"\"Can you help me with a few things? First, I need t… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_180 | failed | {…} [[{"role":"user","content":"\"Can you tell me who discovered the Higgs Boson and… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_181 | failed | {…} [[{"role":"user","content":"\"Could you help me with a few tasks? First, I need … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_182 | failed | {…} [[{"role":"user","content":"\"Imagine you are a musician who also loves to paint… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_183 | failed | {…} [[{"role":"user","content":"\"Could you first calculate the probability of drawi… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_185 | failed | {…} [[{"role":"user","content":"\"Could you first fetch the top 10 popular artworks … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_186 | failed | {…} [[{"role":"user","content":"\" I'm trying to figure out the RGB values of the co… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_188 | failed | {…} [[{"role":"user","content":"\"Could you help me with a few calculations and sear… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_189 | failed | {…} [[{"role":"user","content":"\"Can you help me plan a trip? I want to start by fi… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_190 | failed | {…} [[{"role":"user","content":"\"Could you please help me with the following tasks?… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_191 | failed | {…} [[{"role":"user","content":"\"Imagine you are planning a cozy evening at home. Y… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_192 | failed | {…} [[{"role":"user","content":"\"Can you help me plan a dinner? I am looking for a … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_193 | failed | {…} [[{"role":"user","content":"\"Can you tell me the name of the scientist who is c… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_194 | failed | {…} [[{"role":"user","content":"\"Could you help me with a few tasks? First, I would… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_195 | failed | {…} [[{"role":"user","content":"\"Can you help me find a multiplayer game that is co… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_196 | failed | {…} [[{"role":"user","content":"\"Could you please help me with some information? Fi… | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_198 | failed | {…} [[{"role":"user","content":"\"Can you first find me a vegan, main course recipe … | — | — | — | ||
|
Lade Detail …
|
|||||||
| parallel_multiple_199 | failed | {…} [[{"role":"user","content":"\"Can you help me with two things? First, I am curre… | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_26 | failed | {…} [[{"role":"user","content":"How much gas is generated from heating a 2 m\u00b3 c… | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_27 | failed | {…} [[{"role":"user","content":"What will be the energy needed to increase the tempe… | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_33 | failed | {…} [[{"role":"user","content":"Identify the genetic code sequence \"ATCG\"."}]] | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_53 | failed | {…} [[{"role":"user","content":"Who won the world series in 2018?"}]] | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_60 | failed | {…} [[{"role":"user","content":"What is the final price of a product after a 25% dis… | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_65 | failed | {…} [[{"role":"user","content":"How many red marbles are there in a bag of 20, given… | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_74 | failed | {…} [[{"role":"user","content":"What is the rate of return for a business with $1500… | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_86 | failed | {…} [[{"role":"user","content":"What's the penalty for burglary in California?"}]] | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_91 | failed | {…} [[{"role":"user","content":"Can I report noise complaint to my local council in … | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_101 | failed | {…} [[{"role":"user","content":"What is the best month to visit Hawaii?"}]] | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_111 | failed | {…} [[{"role":"user","content":"Find a GMO yoga mat that I can buy in-store."}]] | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_113 | failed | {…} [[{"role":"user","content":"Find me restaurants in London"}]] | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_115 | failed | {…} [[{"role":"user","content":"How long would it take to travel from Boston to New … | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_118 | failed | {…} [[{"role":"user","content":"Who won the 1996 NBA championships?"}]] | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_121 | failed | {…} [[{"role":"user","content":"Find the information on motor neuron diseases"}]] | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_124 | failed | {…} [[{"role":"user","content":"What's the latest trend in technology?"}]] | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_127 | failed | {…} [[{"role":"user","content":"What's the general mood of twitter regarding the new… | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_165 | failed | {…} [[{"role":"user","content":"What type of instrument is a cello?"}]] | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_180 | failed | {…} [[{"role":"user","content":"Who are in the cricket matches scheduled for today?"… | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_182 | failed | {…} [[{"role":"user","content":"How many championships did Michael Jordan win in his… | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_188 | failed | {…} [[{"role":"user","content":"Who won the championship of the World Series in 2020… | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_213 | failed | {…} [[{"role":"user","content":"Where is a good place for pizza in Boston?"}]] | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_222 | failed | {…} [[{"role":"user","content":"How many calories are in a tomato?"}]] | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_223 | failed | {…} [[{"role":"user","content":"Find a bakery that sells sourdough bread in Chicago.… | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_226 | failed | {…} [[{"role":"user","content":"What's timezone is it in London?"}]] | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_228 | failed | {…} [[{"role":"user","content":"What is the current time in Sydney, Australia?"}]] | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_236 | failed | {…} [[{"role":"user","content":"What is the quickest way to get to Tokyo from London… | — | — | — | ||
|
Lade Detail …
|
|||||||
| irrelevance_239 | failed | {…} [[{"role":"user","content":"Find the distance in kilometers from San Francisco t… | — | — | — | ||
|
Lade Detail …
|
|||||||