Add support for skills by prernakakkar-google · Pull Request #253 · GoogleCloudPlatform/evalbench

prernakakkar-google · 2026-03-03T10:35:42Z

Example run:

(.venv) prernakakkar@prernakakkar:~/senseai/evalbench$ PATH="$HOME/.local/bin:$PATH" EVAL_CONFIG=datasets/gemini-cli-tools/example_run_skills_config.yaml ./evalbench/run.sh
I0303 10:45:57.191980 140027355738944 evalbench.py:36] EvalBench v1.0.0
I0303 10:45:57.196259 140027355738944 evalbench.py:50] Loaded Configurations in datasets/gemini-cli-tools/example_run_skills_config.yaml
I0303 10:45:57.196668 140027355738944 __init__.py:11] Orchestrator Type: geminicli
I0303 10:45:57.196756 140027355738944 agentorchestrator.py:30] Starting Gemini CLI evaluation
I0303 10:45:57.198062 140027355738944 gemini_cli.py:79] Fetching new access token via gcloud auth command
I0303 10:45:57.783076 140027355738944 gemini_cli.py:99] Updating /usr/local/google/home/prernakakkar/senseai/evalbench/.venv/fake_home/.npmrc with new token...
I0303 10:45:57.783606 140027355738944 gemini_cli.py:141] NPM authentication updated successfully at /usr/local/google/home/prernakakkar/senseai/evalbench/.venv/fake_home/.npmrc
I0303 10:46:01.594072 140027355738944 agentevaluator.py:67] Running Gemini CLI evaluation
I0303 10:46:01.638494 140023295899328 agentevaluator.py:117] Turn 1/3 - Prompt: list all instances in project astana-evaluation
I0303 10:46:18.476078 140023295899328 agentevaluator.py:180] Turn 1/3 - Gemini CLI exit code: 0
I0303 10:46:18.476193 140023295899328 agentevaluator.py:182] Turn 1/3 - Gemini CLI stdout: {
  "session_id": "7d028761-f891-4d1d-b789-39587c761d5d",
  "response": "The following Cloud SQL instances were found in project `astana-evaluation`:\n\n*   `clone-agd`\n*   `test56`\n*   `test110`\n*   `staging-nl2code-5`\n*   `trte`\n*   `staging-nl2code-2`\n*   `test-instance`\n*   `staging-nl2code-v2`\n*   `nl2code-staging`\n*   `test-postgres-instance`\n*   `tesy`\n*   `my-pg-app`\n*   `magic`\n*   `nl2code`\n*   `staging-nl2code`\n*   `testing-instance`\n*   `nl2code-clone`\n*   `agd`\n*   `test-cloudsql-mysql-instance`\n*   `test300`\n*   `gemini-postgresql-instance`\n*   `test400`\n*   `staging-nl2code-3`\n*   `staging-nl2code-6`\n*   `test19`\n*   `staging-nl2code-clone`\n*   `staging-nl2code-4`\n*   `test-cloudsql-sql-server-instance`",
  "stats": {
    "models": {
      "gemini-2.5-flash": {
        "api": {
          "totalRequests": 1,
          "totalErrors": 0,
          "totalLatencyMs": 9997
        },
        "tokens": {
          "input": 47741,
          "prompt": 47741,
          "candidates": 374,
          "total": 48441,
          "cached": 0,
          "thoughts": 0,
          "tool": 0
        },
        "roles": {
          "main": {
            "totalRequests": 1,
            "totalErrors": 0,
            "totalLatencyMs": 9997,
            "tokens": {
              "input": 47741,
              "prompt": 47741,
              "candidates": 374,
              "total": 48441,
              "cached": 0,
              "thoughts": 0,
              "tool": 0
            }
          }
        }
      }
    },
    "tools": {
      "totalCalls": 2,
      "totalSuccess": 2,
      "totalFail": 0,
      "totalDurationMs": 833,
      "decisions": {
        "accept": 2,
        "reject": 0,
        "modify": 0,
        "auto_accept": 2
      },
      "byName": {
        "activate_skill": {
          "count": 1,
          "success": 1,
          "fail": 0,
          "durationMs": 18,
          "parameters": [
            {
              "name": "cloudsql-postgres-admin"
            }
          ],
          "decisions": {
            "accept": 1,
            "reject": 0,
            "modify": 0,
            "auto_accept": 1
          }
        },
        "run_shell_command": {
          "count": 1,
          "success": 1,
          "fail": 0,
          "durationMs": 815,
          "parameters": [
            {
              "command": "node /usr/local/google/home/prernakakkar/senseai/evalbench/.venv/fake_home/.gemini/skills/cloudsql-postgres-admin/scripts/list_instances.js '{\"project\": \"astana-evaluation\"}'",
              "description": "Listing Cloud SQL instances in project astana-evaluation"
            }
          ],
          "decisions": {
            "accept": 1,
            "reject": 0,
            "modify": 0,
            "auto_accept": 1
          }
        }
      }
    }
  }
}
I0303 10:46:18.476258 140023295899328 agentevaluator.py:184] Turn 1/3 - Gemini CLI stderr: YOLO mode is enabled. All tool calls will be automatically approved.
YOLO mode is enabled. All tool calls will be automatically approved.
I0303 10:46:34.143607 140023295899328 agentevaluator.py:117] Turn 2/3 - Prompt: what is the state of the nl2code instance?
I0303 10:46:47.765748 140023295899328 agentevaluator.py:180] Turn 2/3 - Gemini CLI exit code: 0
I0303 10:46:47.765861 140023295899328 agentevaluator.py:182] Turn 2/3 - Gemini CLI stdout: {
  "session_id": "7d028761-f891-4d1d-b789-39587c761d5d",
  "response": "The `nl2code` instance in project `astana-evaluation` is in the `RUNNABLE` state.",
  "stats": {
    "models": {
      "gemini-2.5-flash": {
        "api": {
          "totalRequests": 1,
          "totalErrors": 0,
          "totalLatencyMs": 6930
        },
        "tokens": {
          "input": 38341,
          "prompt": 38341,
          "candidates": 114,
          "total": 38619,
          "cached": 0,
          "thoughts": 0,
          "tool": 0
        },
        "roles": {
          "main": {
            "totalRequests": 1,
            "totalErrors": 0,
            "totalLatencyMs": 6930,
            "tokens": {
              "input": 38341,
              "prompt": 38341,
              "candidates": 114,
              "total": 38619,
              "cached": 0,
              "thoughts": 0,
              "tool": 0
            }
          }
        }
      }
    },
    "tools": {
      "totalCalls": 1,
      "totalSuccess": 1,
      "totalFail": 0,
      "totalDurationMs": 702,
      "decisions": {
        "accept": 1,
        "reject": 0,
        "modify": 0,
        "auto_accept": 1
      },
      "byName": {
        "run_shell_command": {
          "count": 1,
          "success": 1,
          "fail": 0,
          "durationMs": 702,
          "parameters": [
            {
              "description": "Getting details for Cloud SQL instance nl2code in project astana-evaluation",
              "command": "node /usr/local/google/home/prernakakkar/senseai/evalbench/.venv/fake_home/.gemini/skills/cloudsql-postgres-admin/scripts/get_instance.js '{\"projectId\": \"astana-evaluation\", \"instanceId\": \"nl2code\"}'"
            }
          ],
          "decisions": {
            "accept": 1,
            "reject": 0,
            "modify": 0,
            "auto_accept": 1
          }
        }
      }
    }
  }
}
I0303 10:46:47.765929 140023295899328 agentevaluator.py:184] Turn 2/3 - Gemini CLI stderr: YOLO mode is enabled. All tool calls will be automatically approved.
YOLO mode is enabled. All tool calls will be automatically approved.
I0303 10:47:02.860255 140023295899328 agentevaluator.py:162] Simulated user terminated conversation.
I0303 10:47:47.222726 140027355738944 report.py:25] Total Prompts: 1.
I0303 10:47:47.224263 140027355738944 report.py:43] Prompt Errors: 0.
I0303 10:47:47.224896 140027355738944 report.py:44] SQLGen Errors: 0.
I0303 10:47:47.225466 140027355738944 report.py:45] SQLExec Gen Errors: 0.
I0303 10:47:47.226002 140027355738944 report.py:46] Golden Errors: 0.
I0303 10:47:47.228504 140027355738944 analyzer.py:105] \n--- goal_completion Analysis ---
I0303 10:47:47.228600 140027355738944 analyzer.py:107] PASS
Reasoning: The agent successfully completed all parts of the conversation plan. First, it listed all instances in the astana-evaluation project. This list included the nl2code instance. Second, when asked about the state of the nl2code instance, the agent correctly retrieved the state and confirmed it was RUNNABLE, thus fulfilling the entire intent of the plan.
I0303 10:47:47.230248 140027355738944 analyzer.py:105] \n--- behavioral_metrics Analysis ---
I0303 10:47:47.230379 140027355738944 analyzer.py:107] Hallucination Count: 0
Clarification Count: 0
Reasoning: The agent performed the task perfectly. It did not hallucinate any tools, parameters, or resources. In the first turn, it correctly identified the user's intent and listed the instances in the specified project. In the second turn, it correctly maintained context by remembering the project ID from the first turn and used it to get the state of the nl2code instance without asking for redundant clarification.
I0303 10:47:47.232178 140027355738944 analyzer.py:58] turn_count:       Average = 2.00 turns
I0303 10:47:47.233671 140027355738944 analyzer.py:58] end_to_end_latency:       Average = 18462.00 ms
I0303 10:47:47.235287 140027355738944 analyzer.py:58] tool_call_latency:        Average = 1535.00 ms
I0303 10:47:47.236948 140027355738944 analyzer.py:58] token_consumption:        Average = 87060.00 tokens
I0303 10:47:47.238546 140027355738944 analyzer.py:72] executable:       1/1 = 100.0%
I0303 10:47:47.246451 140027355738944 csv.py:31] Created csv configs.csv for StoreType.CONFIGS in directory results/6f90c078-b88b-427c-9947-ecd1e36c4b5e
I0303 10:47:47.248256 140027355738944 csv.py:31] Created csv evals.csv for StoreType.EVALS in directory results/6f90c078-b88b-427c-9947-ecd1e36c4b5e
I0303 10:47:47.249567 140027355738944 csv.py:31] Created csv scores.csv for StoreType.SCORES in directory results/6f90c078-b88b-427c-9947-ecd1e36c4b5e
I0303 10:47:47.250493 140027355738944 csv.py:31] Created csv summary.csv for StoreType.SUMMARY in directory results/6f90c078-b88b-427c-9947-ecd1e36c4b5e
Finished Job ID 6f90c078-b88b-427c-9947-ecd1e36c4b5e
(.venv) prernakakkar@prernakakkar:~/senseai/evalbench$
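The reported averages in the run above are consistent with summing the per-turn `stats` blocks that the Gemini CLI prints to stdout. The sketch below reproduces that arithmetic from trimmed copies of the two turns. The field names (`tokens.total`, `api.totalLatencyMs`, `tools.totalDurationMs`) are taken from the log. The helper `summarize_turns` is hypothetical, and this PR does not show how `analyzer.py` actually derives its metrics, so treat this as a consistency check rather than the real implementation.

```python
import json


def summarize_turns(turn_outputs):
    """Sum token and latency stats across per-turn Gemini CLI JSON outputs.

    Hypothetical helper: assumes each element is the JSON text printed on
    stdout for one turn, shaped like the log above.
    """
    total_tokens = 0
    api_latency_ms = 0
    tool_latency_ms = 0
    for raw in turn_outputs:
        stats = json.loads(raw)["stats"]
        for model_stats in stats["models"].values():
            total_tokens += model_stats["tokens"]["total"]
            api_latency_ms += model_stats["api"]["totalLatencyMs"]
        tool_latency_ms += stats["tools"]["totalDurationMs"]
    return {
        "token_consumption": total_tokens,
        "tool_call_latency": tool_latency_ms,
        # Assumption: end-to-end latency = model API latency + tool latency.
        "end_to_end_latency": api_latency_ms + tool_latency_ms,
    }


def make_turn(api_ms, tokens, tool_ms):
    """Build a trimmed stand-in for one turn's stdout, keeping only the fields used."""
    return json.dumps({
        "stats": {
            "models": {
                "gemini-2.5-flash": {
                    "api": {"totalLatencyMs": api_ms},
                    "tokens": {"total": tokens},
                }
            },
            "tools": {"totalDurationMs": tool_ms},
        }
    })


# Values copied from Turn 1 and Turn 2 of the example run.
turns = [make_turn(9997, 48441, 833), make_turn(6930, 38619, 702)]
summary = summarize_turns(turns)
print(summary)
# {'token_consumption': 87060, 'tool_call_latency': 1535, 'end_to_end_latency': 18462}
```

Because the run contains a single conversation (`Total Prompts: 1`), each "Average" in the report equals the per-conversation sum computed here.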

prernakakkar-google added 8 commits February 26, 2026 06:45

feat: Add support for syncing Gemini CLI skills to fake home

7e2265b

feat: Sync Gemini CLI skills into fake_home

93e6265

Add skills support

4f117af

Merge branch 'main' into gemini-cli-evals

c00d0e8

Merge branch 'gemini-cli-evals' of github.com:GoogleCloudPlatform/evalbench into gemini-cli-evals

9b3e373

revert

2df0cd3

revert

81fecc3

fix

70ed713
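
The first commits above describe syncing Gemini CLI skills into the fake home directory, and the example run invokes scripts from `fake_home/.gemini/skills/cloudsql-postgres-admin/scripts/`. A minimal sketch of such a sync step follows. It is not the code in `gemini_cli.py`: the function name `sync_skills` and the source directory layout are assumptions for illustration, and only the destination path shape is taken from the log.

```python
import shutil
import tempfile
from pathlib import Path


def sync_skills(skills_src, fake_home):
    """Copy every skill directory under skills_src into fake_home/.gemini/skills.

    Hypothetical helper; returns the names of the skills that were copied.
    """
    dest_root = Path(fake_home) / ".gemini" / "skills"
    dest_root.mkdir(parents=True, exist_ok=True)
    synced = []
    for skill_dir in sorted(p for p in Path(skills_src).iterdir() if p.is_dir()):
        # dirs_exist_ok lets a repeated sync overwrite an earlier copy (Python 3.8+).
        shutil.copytree(skill_dir, dest_root / skill_dir.name, dirs_exist_ok=True)
        synced.append(skill_dir.name)
    return synced


# Demonstration in a throwaway directory with a placeholder skill script.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "skills_src"
    script = src / "cloudsql-postgres-admin" / "scripts" / "list_instances.js"
    script.parent.mkdir(parents=True)
    script.write_text("// placeholder script\n")

    fake_home = Path(tmp) / "fake_home"
    synced = sync_skills(src, fake_home)
    copied = (fake_home / ".gemini" / "skills" / "cloudsql-postgres-admin"
              / "scripts" / "list_instances.js")
    copied_exists = copied.exists()

print(synced, copied_exists)
```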

prernakakkar-google requested review from IsmailMehdi and mahyareb as code owners March 3, 2026 10:35

github-code-quality bot found potential problems Mar 3, 2026

evalbench/generators/models/gemini_cli.py (Fixed)

fix lint

75a0bbe

github-code-quality bot found potential problems Mar 3, 2026

evalbench/generators/models/gemini_cli.py (Fixed)

Potential fix for pull request finding 'Empty except'

561ddf1
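
For context on the 'Empty except' finding: code-quality checks of this kind typically flag handlers that swallow an exception with a bare `pass`. The sketch below shows one common remedy, which is to catch a specific exception type and log it. It is a generic illustration with a hypothetical function name, not the change made in this commit.

```python
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def remove_if_present(path):
    """Delete path; log the reason instead of silently ignoring a failure."""
    try:
        os.remove(path)
        return True
    except OSError as exc:
        # An empty handler would hide this; record why the cleanup was skipped.
        logger.warning("Could not remove %s: %s", path, exc)
        return False


# The first call deletes the file; the second finds nothing to delete and logs it.
fd, tmp_path = tempfile.mkstemp()
os.close(fd)
first = remove_if_present(tmp_path)
second = remove_if_present(tmp_path)
print(first, second)
```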

prernakakkar-google force-pushed the gemini-cli-evals branch from 2f6dabc to 561ddf1 Compare March 3, 2026 11:15

fix

1dd833c

prernakakkar-google wants to merge 11 commits into main from gemini-cli-evals