Add support for skills #253

Open
prernakakkar-google wants to merge 11 commits into main from gemini-cli-evals
prernakakkar-google (Collaborator) commented Mar 3, 2026

Example run:

(.venv) prernakakkar@prernakakkar:~/senseai/evalbench$ PATH="$HOME/.local/bin:$PATH" EVAL_CONFIG=datasets/gemini-cli-tools/example_run_skills_config.yaml ./evalbench/run.sh
I0303 10:45:57.191980 140027355738944 evalbench.py:36] EvalBench v1.0.0
I0303 10:45:57.196259 140027355738944 evalbench.py:50] Loaded Configurations in datasets/gemini-cli-tools/example_run_skills_config.yaml
I0303 10:45:57.196668 140027355738944 __init__.py:11] Orchestrator Type: geminicli
I0303 10:45:57.196756 140027355738944 agentorchestrator.py:30] Starting Gemini CLI evaluation
I0303 10:45:57.198062 140027355738944 gemini_cli.py:79] Fetching new access token via gcloud auth command
I0303 10:45:57.783076 140027355738944 gemini_cli.py:99] Updating /usr/local/google/home/prernakakkar/senseai/evalbench/.venv/fake_home/.npmrc with new token...
I0303 10:45:57.783606 140027355738944 gemini_cli.py:141] NPM authentication updated successfully at /usr/local/google/home/prernakakkar/senseai/evalbench/.venv/fake_home/.npmrc
I0303 10:46:01.594072 140027355738944 agentevaluator.py:67] Running Gemini CLI evaluation
I0303 10:46:01.638494 140023295899328 agentevaluator.py:117] Turn 1/3 - Prompt: list all instances in project astana-evaluation
I0303 10:46:18.476078 140023295899328 agentevaluator.py:180] Turn 1/3 - Gemini CLI exit code: 0
I0303 10:46:18.476193 140023295899328 agentevaluator.py:182] Turn 1/3 - Gemini CLI stdout: {
  "session_id": "7d028761-f891-4d1d-b789-39587c761d5d",
  "response": "The following Cloud SQL instances were found in project `astana-evaluation`:\n\n*   `clone-agd`\n*   `test56`\n*   `test110`\n*   `staging-nl2code-5`\n*   `trte`\n*   `staging-nl2code-2`\n*   `test-instance`\n*   `staging-nl2code-v2`\n*   `nl2code-staging`\n*   `test-postgres-instance`\n*   `tesy`\n*   `my-pg-app`\n*   `magic`\n*   `nl2code`\n*   `staging-nl2code`\n*   `testing-instance`\n*   `nl2code-clone`\n*   `agd`\n*   `test-cloudsql-mysql-instance`\n*   `test300`\n*   `gemini-postgresql-instance`\n*   `test400`\n*   `staging-nl2code-3`\n*   `staging-nl2code-6`\n*   `test19`\n*   `staging-nl2code-clone`\n*   `staging-nl2code-4`\n*   `test-cloudsql-sql-server-instance`",
  "stats": {
    "models": {
      "gemini-2.5-flash": {
        "api": {
          "totalRequests": 1,
          "totalErrors": 0,
          "totalLatencyMs": 9997
        },
        "tokens": {
          "input": 47741,
          "prompt": 47741,
          "candidates": 374,
          "total": 48441,
          "cached": 0,
          "thoughts": 0,
          "tool": 0
        },
        "roles": {
          "main": {
            "totalRequests": 1,
            "totalErrors": 0,
            "totalLatencyMs": 9997,
            "tokens": {
              "input": 47741,
              "prompt": 47741,
              "candidates": 374,
              "total": 48441,
              "cached": 0,
              "thoughts": 0,
              "tool": 0
            }
          }
        }
      }
    },
    "tools": {
      "totalCalls": 2,
      "totalSuccess": 2,
      "totalFail": 0,
      "totalDurationMs": 833,
      "decisions": {
        "accept": 2,
        "reject": 0,
        "modify": 0,
        "auto_accept": 2
      },
      "byName": {
        "activate_skill": {
          "count": 1,
          "success": 1,
          "fail": 0,
          "durationMs": 18,
          "parameters": [
            {
              "name": "cloudsql-postgres-admin"
            }
          ],
          "decisions": {
            "accept": 1,
            "reject": 0,
            "modify": 0,
            "auto_accept": 1
          }
        },
        "run_shell_command": {
          "count": 1,
          "success": 1,
          "fail": 0,
          "durationMs": 815,
          "parameters": [
            {
              "command": "node /usr/local/google/home/prernakakkar/senseai/evalbench/.venv/fake_home/.gemini/skills/cloudsql-postgres-admin/scripts/list_instances.js '{\"project\": \"astana-evaluation\"}'",
              "description": "Listing Cloud SQL instances in project astana-evaluation"
            }
          ],
          "decisions": {
            "accept": 1,
            "reject": 0,
            "modify": 0,
            "auto_accept": 1
          }
        }
      }
    }
  }
}
I0303 10:46:18.476258 140023295899328 agentevaluator.py:184] Turn 1/3 - Gemini CLI stderr: YOLO mode is enabled. All tool calls will be automatically approved.
YOLO mode is enabled. All tool calls will be automatically approved.
I0303 10:46:34.143607 140023295899328 agentevaluator.py:117] Turn 2/3 - Prompt: what is the state of the nl2code instance?
I0303 10:46:47.765748 140023295899328 agentevaluator.py:180] Turn 2/3 - Gemini CLI exit code: 0
I0303 10:46:47.765861 140023295899328 agentevaluator.py:182] Turn 2/3 - Gemini CLI stdout: {
  "session_id": "7d028761-f891-4d1d-b789-39587c761d5d",
  "response": "The `nl2code` instance in project `astana-evaluation` is in the `RUNNABLE` state.",
  "stats": {
    "models": {
      "gemini-2.5-flash": {
        "api": {
          "totalRequests": 1,
          "totalErrors": 0,
          "totalLatencyMs": 6930
        },
        "tokens": {
          "input": 38341,
          "prompt": 38341,
          "candidates": 114,
          "total": 38619,
          "cached": 0,
          "thoughts": 0,
          "tool": 0
        },
        "roles": {
          "main": {
            "totalRequests": 1,
            "totalErrors": 0,
            "totalLatencyMs": 6930,
            "tokens": {
              "input": 38341,
              "prompt": 38341,
              "candidates": 114,
              "total": 38619,
              "cached": 0,
              "thoughts": 0,
              "tool": 0
            }
          }
        }
      }
    },
    "tools": {
      "totalCalls": 1,
      "totalSuccess": 1,
      "totalFail": 0,
      "totalDurationMs": 702,
      "decisions": {
        "accept": 1,
        "reject": 0,
        "modify": 0,
        "auto_accept": 1
      },
      "byName": {
        "run_shell_command": {
          "count": 1,
          "success": 1,
          "fail": 0,
          "durationMs": 702,
          "parameters": [
            {
              "description": "Getting details for Cloud SQL instance nl2code in project astana-evaluation",
              "command": "node /usr/local/google/home/prernakakkar/senseai/evalbench/.venv/fake_home/.gemini/skills/cloudsql-postgres-admin/scripts/get_instance.js '{\"projectId\": \"astana-evaluation\", \"instanceId\": \"nl2code\"}'"
            }
          ],
          "decisions": {
            "accept": 1,
            "reject": 0,
            "modify": 0,
            "auto_accept": 1
          }
        }
      }
    }
  }
}
I0303 10:46:47.765929 140023295899328 agentevaluator.py:184] Turn 2/3 - Gemini CLI stderr: YOLO mode is enabled. All tool calls will be automatically approved.
YOLO mode is enabled. All tool calls will be automatically approved.
I0303 10:47:02.860255 140023295899328 agentevaluator.py:162] Simulated user terminated conversation.
I0303 10:47:47.222726 140027355738944 report.py:25] Total Prompts: 1.
I0303 10:47:47.224263 140027355738944 report.py:43] Prompt Errors: 0.
I0303 10:47:47.224896 140027355738944 report.py:44] SQLGen Errors: 0.
I0303 10:47:47.225466 140027355738944 report.py:45] SQLExec Gen Errors: 0.
I0303 10:47:47.226002 140027355738944 report.py:46] Golden Errors: 0.
I0303 10:47:47.228504 140027355738944 analyzer.py:105] \n--- goal_completion Analysis ---
I0303 10:47:47.228600 140027355738944 analyzer.py:107] PASS
Reasoning: The agent successfully completed all parts of the conversation plan. First, it listed all instances in the astana-evaluation project. This list included the nl2code instance. Second, when asked about the state of the nl2code instance, the agent correctly retrieved the state and confirmed it was RUNNABLE, thus fulfilling the entire intent of the plan.
I0303 10:47:47.230248 140027355738944 analyzer.py:105] \n--- behavioral_metrics Analysis ---
I0303 10:47:47.230379 140027355738944 analyzer.py:107] Hallucination Count: 0
Clarification Count: 0
Reasoning: The agent performed the task perfectly. It did not hallucinate any tools, parameters, or resources. In the first turn, it correctly identified the user's intent and listed the instances in the specified project. In the second turn, it correctly maintained context by remembering the project ID from the first turn and used it to get the state of the nl2code instance without asking for redundant clarification.
I0303 10:47:47.232178 140027355738944 analyzer.py:58] turn_count:       Average = 2.00 turns
I0303 10:47:47.233671 140027355738944 analyzer.py:58] end_to_end_latency:       Average = 18462.00 ms
I0303 10:47:47.235287 140027355738944 analyzer.py:58] tool_call_latency:        Average = 1535.00 ms
I0303 10:47:47.236948 140027355738944 analyzer.py:58] token_consumption:        Average = 87060.00 tokens
I0303 10:47:47.238546 140027355738944 analyzer.py:72] executable:       1/1 = 100.0%
I0303 10:47:47.246451 140027355738944 csv.py:31] Created csv configs.csv for StoreType.CONFIGS in directory results/6f90c078-b88b-427c-9947-ecd1e36c4b5e
I0303 10:47:47.248256 140027355738944 csv.py:31] Created csv evals.csv for StoreType.EVALS in directory results/6f90c078-b88b-427c-9947-ecd1e36c4b5e
I0303 10:47:47.249567 140027355738944 csv.py:31] Created csv scores.csv for StoreType.SCORES in directory results/6f90c078-b88b-427c-9947-ecd1e36c4b5e
I0303 10:47:47.250493 140027355738944 csv.py:31] Created csv summary.csv for StoreType.SUMMARY in directory results/6f90c078-b88b-427c-9947-ecd1e36c4b5e
Finished Job ID 6f90c078-b88b-427c-9947-ecd1e36c4b5e
(.venv) prernakakkar@prernakakkar:~/senseai/evalbench$ 
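
For anyone cross-checking the summary lines against the per-turn JSON, here is a minimal sketch that recomputes the aggregates from the `stats` blocks printed above. The per-turn dictionaries are trimmed to the fields used, with values copied from this run. The `summarize` helper and the output key names are hypothetical, and the formulas are an assumption: they reproduce the reported numbers for this single conversation, but this is not necessarily how `analyzer.py` derives them.

```python
# Per-turn "stats" objects, trimmed to the fields used below.
# All values are copied from the Gemini CLI stdout in the run log.
turns = [
    {
        "models": {
            "gemini-2.5-flash": {
                "api": {"totalLatencyMs": 9997},
                "tokens": {"total": 48441},
            }
        },
        "tools": {
            "totalDurationMs": 833,
            "byName": {
                "activate_skill": {"count": 1},
                "run_shell_command": {"count": 1},
            },
        },
    },
    {
        "models": {
            "gemini-2.5-flash": {
                "api": {"totalLatencyMs": 6930},
                "tokens": {"total": 38619},
            }
        },
        "tools": {
            "totalDurationMs": 702,
            "byName": {"run_shell_command": {"count": 1}},
        },
    },
]


def summarize(turn_stats):
    """Hypothetical helper: sum per-turn stats into conversation totals."""
    models = [m for t in turn_stats for m in t["models"].values()]
    tokens = sum(m["tokens"]["total"] for m in models)
    api_ms = sum(m["api"]["totalLatencyMs"] for m in models)
    tool_ms = sum(t["tools"]["totalDurationMs"] for t in turn_stats)

    # Count how many times each tool was called across all turns.
    tool_calls = {}
    for t in turn_stats:
        for name, info in t["tools"]["byName"].items():
            tool_calls[name] = tool_calls.get(name, 0) + info["count"]

    return {
        "token_consumption": tokens,
        "tool_call_latency_ms": tool_ms,
        # Assumption: end-to-end latency = model API latency + tool duration.
        "end_to_end_latency_ms": api_ms + tool_ms,
        "tool_calls": tool_calls,
    }


summary = summarize(turns)
print(summary)
```

With these inputs the totals line up with the `token_consumption`, `tool_call_latency`, and `end_to_end_latency` averages in the log, and the tool counts show that `activate_skill` ran once (turn 1 only) while `run_shell_command` ran in both turns.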
