
Update benchmarking scaffold #256

Merged

kaavee315 merged 11 commits into master from kaavee/benchmark_template on Jul 6, 2024
Conversation

Contributor

@kaavee315 kaavee315 commented Jul 5, 2024

PR Type

Enhancement, Tests


Description

  • Refactored ComposioToolSet to improve workspace handling and logging.
  • Introduced run_and_get_scores function for running agents and retrieving scores.
  • Updated run function in run_evaluation.py to accept an agent_func parameter.
  • Enhanced logging setup and error handling in run_evaluation.py.
  • Removed obsolete benchmark.template file.
  • Added new run_benchmark.template for running benchmark evaluations.

Changes walkthrough

Relevant files

Enhancement

toolset.py: Refactor workspace handling and logging in `ComposioToolSet`
python/composio/tools/toolset.py

  • Removed workspace_id and workspace_env attributes from the constructor.
  • Added set_workspace_id method to set and retrieve the workspace.
  • Adjusted logging and workspace retrieval logic.

  +8/-4
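
To make the change concrete, here is a minimal sketch of how the lazy workspace handling might look after this refactor. The class internals, including the _load_workspace helper, are illustrative assumptions, not code from the diff:

    import typing as t

    class ComposioToolSet:
        def __init__(self) -> None:
            # workspace_id and workspace_env no longer live in the constructor.
            self._workspace_id: t.Optional[str] = None
            self._workspace = None

        def set_workspace_id(self, workspace_id: str) -> None:
            """Record the workspace id and resolve the workspace it names."""
            self._workspace_id = workspace_id
            self._workspace = self._load_workspace(workspace_id)

        def _load_workspace(self, workspace_id: str):
            # Hypothetical helper: fetch or create the workspace for this id.
            ...

        @property
        def workspace(self):
            # The workspace must now be set explicitly before use; see the
            # "Refactoring Concern" in the reviewer guide below.
            if self._workspace is None:
                raise RuntimeError("call set_workspace_id() before using the workspace")
            return self._workspace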
run_evaluation.py: Refactor benchmark evaluation and logging setup
python/swe/benchmark/run_evaluation.py

  • Added run_and_get_scores function to run agent and get scores.
  • Refactored run function to accept agent_func as a parameter.
  • Updated logging setup to use get_logger.
  • Enhanced error handling and logging for patch retrieval.

  +63/-13
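
A sketch of the new entry point, reconstructed from the snippets quoted in the code suggestions below. It relies on module-level names (logger, run); the _make_logs_dir and _evaluate helpers are hypothetical stand-ins for the actual logs-dir setup and scoring logic:

    import typing as t

    def run_and_get_scores(
        agent_func: t.Callable,
        test_split: str = "1:50",
        include_hints: bool = True,
    ):
        logs_dir = _make_logs_dir()  # hypothetical: prepare a fresh logs directory
        logger.info("Running agent with logs_dir: %s", logs_dir)
        run(
            agent_func=agent_func,
            test_split=test_split,
            include_hints=include_hints,
            logs_dir=logs_dir,
        )
        return _evaluate(logs_dir)  # hypothetical: compute scores from the run logs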
Miscellaneous

benchmark.template: Remove obsolete benchmark template
python/swe/composio_swe/scaffold/templates/crewai/benchmark.template

  • Removed the benchmark.template file.

  +0/-5
Tests

run_benchmark.template: Add new benchmark template for running evaluations
python/swe/composio_swe/scaffold/templates/crewai/run_benchmark.template

  • Added new run_benchmark.template file.
  • Defined agent_func for workspace setup and issue kickoff.
  • Implemented main function to run benchmark.

  +25/-0
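
Putting the pieces together, the new template plausibly looks something like the following. The crew construction (build_crew, composio_toolset) is an assumption; the kickoff payload matches the snippet quoted in the code suggestions below:

    def agent_func(workspace_id, issue_config):
        # Point the toolset at the benchmark workspace, then hand the issue
        # to the agent.
        composio_toolset.set_workspace_id(workspace_id)
        crew = build_crew()  # hypothetical: assemble the CrewAI agent and task
        return crew.kickoff(
            {
                "issue_id": issue_config.issue_id,
                "issue": issue_config.issue_desc,
            }
        )

    def main() -> None:
        # Run the benchmark evaluation with the agent defined above.
        run_and_get_scores(agent_func)

    if __name__ == "__main__":
        main()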

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions


    ellipsis-dev bot commented Jul 5, 2024

    Your free trial has expired. To keep using Ellipsis, sign up at https://app.ellipsis.dev for $20/seat/month or reach us at help@ellipsis.dev


    PR Reviewer Guide 🔍

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Key issues to review

    Refactoring Concern:
    The removal of workspace_id and workspace_env from the constructor in toolset.py and the addition of set_workspace_id might introduce issues if the workspace ID is used before being set in scenarios not covered by the PR.

    Error Handling:
    The new error handling and logging enhancements in run_evaluation.py need to be carefully reviewed to ensure that they correctly capture and log errors without missing any exceptions.

    Functionality Change:
    The change from SHELL_EXECUTE_COMMAND to SHELL_EXEC_COMMAND in run_evaluation.py could potentially alter the behavior of command execution, depending on the implementation details of these actions.
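
    For context, a call site affected by this rename would change roughly as follows; execute_action is the usual ComposioToolSet entry point, but the params payload here is an assumption, not taken from the diff:

        # Before (as quoted above): Action.SHELL_EXECUTE_COMMAND
        # After: Action.SHELL_EXEC_COMMAND
        result = composio_toolset.execute_action(
            action=Action.SHELL_EXEC_COMMAND,
            params={"cmd": "python -m pytest"},  # payload shape is an assumption
        )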


    PR Code Suggestions ✨

    Possible issue
    Add a check to ensure logs_dir is not None before using it

    Add a check to ensure logs_dir is not None before using it in the run_and_get_scores
    function to avoid potential NoneType errors.

    python/swe/benchmark/run_evaluation.py [206-212]

    +if logs_dir is None:
    +    raise ValueError("logs_dir cannot be None")
     logger.info("Running agent with logs_dir: %s", logs_dir)
     run(
         agent_func=agent_func,
         test_split=test_split,
         include_hints=include_hints,
         logs_dir=logs_dir,
     )
     
    Suggestion importance[1-10]: 8

    Why: The suggestion to check if logs_dir is not None before using it in logging and function calls is crucial to prevent runtime errors, making it a significant improvement.

    Add error handling to catch and log exceptions during the kickoff process

    Add error handling in the agent_func to catch and log any exceptions that occur during the
    kickoff process.

    python/swe/composio_swe/scaffold/templates/crewai/run_benchmark.template [10-14]

    -return crew.kickoff(
    -    {
    -        "issue_id": issue_config.issue_id,
    -        "issue": issue_config.issue_desc,
    -    }
    -)
    +try:
    +    return crew.kickoff(
    +        {
    +            "issue_id": issue_config.issue_id,
    +            "issue": issue_config.issue_desc,
    +        }
    +    )
    +except Exception as e:
    +    logger.error(f"Error during kickoff: {e}")
    +    raise
     
    Suggestion importance[1-10]: 7

    Why: Adding error handling in the agent_func to catch exceptions during the kickoff process is a good practice to ensure robustness and easier debugging.

    Enhancement
    Add a type hint for the agent_func parameter to improve code clarity and type safety

    Add a type hint for the agent_func parameter in the run_and_get_scores function to improve
    code clarity and type safety.

    python/swe/benchmark/run_evaluation.py [204]

    -def run_and_get_scores(agent_func, test_split="1:50", include_hints=True):
    +def run_and_get_scores(agent_func: t.Callable, test_split="1:50", include_hints=True):
     
    Suggestion importance[1-10]: 7

    Why: Adding a type hint for agent_func as t.Callable in run_and_get_scores function enhances code clarity and helps with type checking, which is a good practice in Python.

    Best practice
    Use the not operator to check for None values in a more Pythonic way

    Instead of checking workspace_id for None directly, use the not operator for a more
    Pythonic approach.

    python/composio/tools/toolset.py [73]

    -if workspace_id is None:
    +if not workspace_id:
     
    Suggestion importance[1-10]: 6

    Why: The suggestion to use if not workspace_id: instead of if workspace_id is None: is a valid Pythonic improvement for readability, but it's not crucial.

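    One caveat worth noting before applying this suggestion: not workspace_id is broader than workspace_id is None, since it also treats falsy values such as the empty string as unset. A quick illustration:

        workspace_id = ""
        print(workspace_id is None)  # False: an empty string is not None
        print(not workspace_id)      # True: empty strings are falsy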

    codiumai-pr-agent-pro bot commented Jul 5, 2024

    CI Failure Feedback 🧐

    (Checks updated until commit 19fe761)

    Action: test (ubuntu-latest, 3.10)

    Failed stage: Unittests [❌]

    Failed test name: test_list_all

    Failure summary:

    The action failed because the test test_list_all in the file tests/test_cli/test_actions.py failed.
    The test failed due to an AssertionError indicating a server error response:

  • The expected exit code was 0, but the actual exit code was 1.
  • The error message included an HTML response indicating a "503 Service Unavailable" error, suggesting
    that the application failed to respond.

  • Relevant error logs:
    1:  ##[group]Operating System
    2:  Ubuntu
    ...
    
    496:  * [new branch]        featembed-tool                           -> origin/featembed-tool
    497:  * [new branch]        fix/readme                               -> origin/fix/readme
    498:  * [new branch]        fix/readme-logo                          -> origin/fix/readme-logo
    499:  * [new branch]        fix/swe-agent                            -> origin/fix/swe-agent
    500:  * [new branch]        ft-add-better-help-text                  -> origin/ft-add-better-help-text
    501:  * [new branch]        ft-apps-id                               -> origin/ft-apps-id
    502:  * [new branch]        ft-bring-back-core-sdk                   -> origin/ft-bring-back-core-sdk
    503:  * [new branch]        ft-did-you-mean                          -> origin/ft-did-you-mean
    504:  * [new branch]        ft-error-tracking                        -> origin/ft-error-tracking
    ...
    
    892:  tests/test_example.py::test_example[example0] SKIPPED (Testing in CI
    893:  will lead to too much LLM API usage)                                     [  4%]
    894:  tests/test_example.py::test_example[example1] SKIPPED (Testing in CI
    895:  will lead to too much LLM API usage)                                     [  6%]
    896:  tests/test_example.py::test_example[example2] SKIPPED (Testing in CI
    897:  will lead to too much LLM API usage)                                     [  9%]
    898:  tests/test_cli/test_actions.py::TestListActions::test_list_all[arguments0-exptected_outputs0-unexptected_outputs0] PASSED [ 11%]
    899:  tests/test_cli/test_actions.py::TestListActions::test_list_all[arguments1-exptected_outputs1-unexptected_outputs1] PASSED [ 13%]
    900:  tests/test_cli/test_actions.py::TestListActions::test_list_all[arguments2-exptected_outputs2-unexptected_outputs2] FAILED [ 15%]
    901:  tests/test_cli/test_actions.py::TestListActions::test_list_all[arguments3-exptected_outputs3-unexptected_outputs3] PASSED [ 18%]
    902:  tests/test_cli/test_actions.py::TestListActions::test_tag_not_found PASSED [ 20%]
    903:  tests/test_cli/test_actions.py::TestListActions::test_limit SKIPPED      [ 22%]
    904:  tests/test_cli/test_actions.py::TestListActions::test_copy PASSED        [ 25%]
    905:  tests/test_cli/test_add.py::TestComposioAdd::test_no_auth PASSED         [ 27%]
    906:  tests/test_cli/test_apps.py::TestList::test_list PASSED                  [ 29%]
    907:  tests/test_cli/test_apps.py::TestUpdate::test_update_not_required PASSED [ 31%]
    908:  tests/test_cli/test_apps.py::TestUpdate::test_update SKIPPED (Needs
    909:  investigation, this test fails in CI)                                    [ 34%]
    ...
    
    931:  tests/test_tools/test_toolset.py::test_find_actions_by_tags PASSED       [ 84%]
    932:  tests/test_tools/test_toolset.py::test_uninitialize_app PASSED           [ 86%]
    933:  tests/test_utils/test_decorators.py::test_deprecated PASSED              [ 88%]
    934:  tests/test_utils/test_git.py::test_get_git_user_info PASSED              [ 90%]
    935:  tests/test_utils/test_shared.py::test_get_pydantic_signature_format_from_schema_params PASSED [ 93%]
    936:  tests/test_utils/test_shared.py::test_json_schema_to_pydantic_field PASSED [ 95%]
    937:  tests/test_utils/test_shared.py::test_json_schema_to_fields_dict PASSED  [ 97%]
    938:  tests/test_utils/test_url.py::test_get_web_url PASSED                    [100%]
    939:  =================================== FAILURES ===================================
    ...
    
    965:  patch: t.Any,
    966:  arguments: t.Tuple[str, ...],
    967:  exptected_outputs: t.Tuple[str, ...],
    968:  unexptected_outputs: t.Tuple[str, ...],
    969:  ) -> None:
    970:  """Test list all actions."""
    971:  result = self.run("actions", *arguments)
    972:  >       assert result.exit_code == 0, result.stderr
    973:  E       AssertionError: Error: <!DOCTYPE html>
    974:  E         <html>
    975:  E           <head>
    976:  E             <meta name="viewport" content="width=device-width, initial-scale=1" />
    977:  E             <meta charset="utf-8" />
    978:  E             <title>Server Error</title>
    ...
    
    1031:  E               }
    1032:  E         
    1033:  E               h1 {
    1034:  E                 display: none;
    1035:  E                 margin-top: 2rem;
    1036:  E                 margin-bottom: 2rem;
    1037:  E               }
    1038:  E         
    1039:  E               .error-503 {
    ...
    
    1055:  E                   fill="#fff"
    1056:  E                 />
    1057:  E                 <path
    1058:  E                   d="M511.454 0C319.953 0 153.311 105.16 65.31 260.612c68.771-.144 202.704-.226 202.704-.226h.031v-.051c158.309 0 164.193.707 195.118 1.998l19.149.706c66.7 2.224 148.683 9.384 213.19 58.19 35.015 26.471 85.571 84.896 115.708 126.52 27.861 38.499 35.876 82.756 16.933 125.158-17.436 38.97-54.952 62.215-100.383 62.215H16.69s4.233 17.944 10.58 37.751h970.632A510.385 510.385 0 0 0 1024 512.218C1024.01 229.355 794.532 0 511.454 0Z"
    1059:  E                   fill="#fff"
    1060:  E                 />
    1061:  E               </svg>
    1062:  E         
    1063:  E               <h1 class="error-404">Nothing here... yet</h1>
    1064:  E               <h1 class="error-503">Application failed to respond</h1>
    ...
    
    1066:  E               <a href="https://railway.app" target="_blank"> Go to Railway </a>
    1067:  E             </main>
    1068:  E           </body>
    1069:  E         </html>
    1070:  E         
    1071:  E         
    1072:  E       assert 1 == 0
    1073:  E        +  where 1 = <Result SystemExit(1)>.exit_code
    1074:  tests/test_cli/test_actions.py:44: AssertionError
    ...
    
    1218:  composio/utils/shared.py                                                           117     83    29%   44, 47-51, 54-58, 61-77, 83, 101-104, 153-158, 174-221, 247-292
    1219:  composio/utils/url.py                                                               10      1    90%   35
    1220:  examples/crewai_ci_chart.py                                                         15     15     0%   1-38
    1221:  --------------------------------------------------------------------------------------------------------------
    1222:  TOTAL                                                                             7809   1139    85%
    1223:  Coverage HTML written to dir htmlcov
    1224:  Coverage XML written to file coverage.xml
    1225:  =========================== short test summary info ============================
    1226:  FAILED tests/test_cli/test_actions.py::TestListActions::test_list_all[arguments2-exptected_outputs2-unexptected_outputs2] - AssertionError: Error: <!DOCTYPE html>
    1227:  <html>
    1228:  <head>
    1229:  <meta name="viewport" content="width=device-width, initial-scale=1" />
    1230:  <meta charset="utf-8" />
    1231:  <title>Server Error</title>
    ...
    
    1276:  main a:hover {
    1277:  background-color: hsl(270, 60%, 42%);
    1278:  }
    1279:  h1 {
    1280:  display: none;
    1281:  margin-top: 2rem;
    1282:  margin-bottom: 2rem;
    1283:  }
    1284:  .error-503 {
    ...
    
    1298:  d="M4.756 438.175A520.713 520.713 0 0 0 0 489.735h777.799c-2.716-5.306-6.365-10.09-10.045-14.772-132.97-171.791-204.498-156.896-306.819-161.26-34.114-1.403-57.249-1.967-193.037-1.967-72.677 0-151.688.185-228.628.39-9.96 26.884-19.566 52.942-24.243 74.14h398.571v51.909H4.756ZM783.93 541.696H.399c.82 13.851 2.112 27.517 3.978 40.999h723.39c32.248 0 50.299-18.297 56.162-40.999ZM45.017 724.306S164.941 1018.77 511.46 1024c207.112 0 385.071-123.006 465.907-299.694H45.017Z"
    1299:  fill="#fff"
    1300:  />
    1301:  <path
    1302:  d="M511.454 0C319.953 0 153.311 105.16 65.31 260.612c68.771-.144 202.704-.226 202.704-.226h.031v-.051c158.309 0 164.193.707 195.118 1.998l19.149.706c66.7 2.224 148.683 9.384 213.19 58.19 35.015 26.471 85.571 84.896 115.708 126.52 27.861 38.499 35.876 82.756 16.933 125.158-17.436 38.97-54.952 62.215-100.383 62.215H16.69s4.233 17.944 10.58 37.751h970.632A510.385 510.385 0 0 0 1024 512.218C1024.01 229.355 794.532 0 511.454 0Z"
    1303:  fill="#fff"
    1304:  />
    1305:  </svg>
    1306:  <h1 class="error-404">Nothing here... yet</h1>
    1307:  <h1 class="error-503">Application failed to respond</h1>
    1308:  <a href="https://railway.app" target="_blank"> Go to Railway </a>
    1309:  </main>
    1310:  </body>
    1311:  </html>
    1312:  assert 1 == 0
    1313:  +  where 1 = <Result SystemExit(1)>.exit_code
    1314:  ============= 1 failed, 38 passed, 5 skipped, 1 warning in 22.91s ==============
    1315:  unittests: exit 1 (23.88 seconds) /home/runner/work/composio/composio/python> pytest -vvv -rfE --doctest-modules composio/ tests/ --cov=composio --cov=examples --cov-report=html --cov-report=xml --cov-report=term --cov-report=term-missing --cov-config=.coveragerc pid=5668
    1316:  .pkg: _exit> python /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
    1317:  unittests: FAIL code 1 (48.55=setup[19.02]+cmd[5.65,23.88] seconds)
    1318:  evaluation failed :( (48.79 seconds)
    1319:  ##[error]Process completed with exit code 1.
    

    ✨ CI feedback usage guide:

    The CI feedback tool (/checks) automatically triggers when a PR has a failed check.
    The tool analyzes the failed checks and provides several feedbacks:

    • Failed stage
    • Failed test name
    • Failure summary
    • Relevant error logs

    In addition to being automatically triggered, the tool can also be invoked manually by commenting on a PR:

    /checks "https://github.com/{repo_name}/actions/runs/{run_number}/job/{job_number}"
    

    where {repo_name} is the name of the repository, {run_number} is the run number of the failed check, and {job_number} is the job number of the failed check.

    Configuration options

    • enable_auto_checks_feedback - if set to true, the tool will automatically provide feedback when a check is failed. Default is true.
    • excluded_checks_list - a list of checks to exclude from the feedback, for example: ["check1", "check2"]. Default is an empty list.
    • enable_help_text - if set to true, the tool will provide a help message with the feedback. Default is true.
    • persistent_comment - if set to true, the tool will overwrite a previous checks comment with the new feedback. Default is true.
    • final_update_message - if persistent_comment is true and updating a previous checks message, the tool will also create a new message: "Persistent checks updated to latest commit". Default is true.

    See more information about the checks tool in the docs.

    @kaavee315 kaavee315 enabled auto-merge (squash) July 6, 2024 10:08
    Contributor

    @shubhras01 shubhras01 left a comment


    lgtm

    @kaavee315 kaavee315 merged commit 4cea4ce into master Jul 6, 2024
    5 checks passed
    @kaavee315 kaavee315 deleted the kaavee/benchmark_template branch July 6, 2024 10:13