
Update benchmarking scaffold #256

Merged

kaavee315 merged 11 commits into master from kaavee/benchmark_template on Jul 6, 2024
Conversation

Contributor

@kaavee315 kaavee315 commented Jul 5, 2024

PR Type

Enhancement, Tests


Description

  • Refactored ComposioToolSet to improve workspace handling and logging.
  • Introduced run_and_get_scores function for running agents and retrieving scores.
  • Updated run function in run_evaluation.py to accept an agent_func parameter.
  • Enhanced logging setup and error handling in run_evaluation.py.
  • Removed obsolete benchmark.template file.
  • Added new run_benchmark.template for running benchmark evaluations.

Changes walkthrough

Relevant files

Enhancement

toolset.py: Refactor workspace handling and logging in `ComposioToolSet`
python/composio/tools/toolset.py

  • Removed workspace_id and workspace_env attributes from the constructor.
  • Added set_workspace_id method to set and retrieve the workspace.
  • Adjusted logging and workspace retrieval logic.

  +8/-4
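
To make the change concrete, here is a minimal sketch of how the lazy workspace handling might look after this refactor. The class internals, including the _load_workspace helper, are illustrative assumptions, not code from the diff:

    import typing as t

    class ComposioToolSet:
        def __init__(self) -> None:
            # workspace_id and workspace_env no longer live in the constructor.
            self._workspace_id: t.Optional[str] = None
            self._workspace = None

        def set_workspace_id(self, workspace_id: str) -> None:
            """Record the workspace id and resolve the workspace it names."""
            self._workspace_id = workspace_id
            self._workspace = self._load_workspace(workspace_id)

        def _load_workspace(self, workspace_id: str):
            # Hypothetical helper: fetch or create the workspace for this id.
            ...

        @property
        def workspace(self):
            # The workspace must now be set explicitly before use; see the
            # "Refactoring Concern" in the reviewer guide below.
            if self._workspace is None:
                raise RuntimeError("call set_workspace_id() before using the workspace")
            return self._workspace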
run_evaluation.py: Refactor benchmark evaluation and logging setup
python/swe/benchmark/run_evaluation.py

  • Added run_and_get_scores function to run agent and get scores.
  • Refactored run function to accept agent_func as a parameter.
  • Updated logging setup to use get_logger.
  • Enhanced error handling and logging for patch retrieval.

  +63/-13
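
A sketch of the new entry point, reconstructed from the snippets quoted in the code suggestions below. It relies on module-level names (logger, run); the _make_logs_dir and _evaluate helpers are hypothetical stand-ins for the actual logs-dir setup and scoring logic:

    import typing as t

    def run_and_get_scores(
        agent_func: t.Callable,
        test_split: str = "1:50",
        include_hints: bool = True,
    ):
        logs_dir = _make_logs_dir()  # hypothetical: prepare a fresh logs directory
        logger.info("Running agent with logs_dir: %s", logs_dir)
        run(
            agent_func=agent_func,
            test_split=test_split,
            include_hints=include_hints,
            logs_dir=logs_dir,
        )
        return _evaluate(logs_dir)  # hypothetical: compute scores from the run logs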
Miscellaneous

benchmark.template: Remove obsolete benchmark template
python/swe/composio_swe/scaffold/templates/crewai/benchmark.template

  • Removed the benchmark.template file.

  +0/-5
Tests

run_benchmark.template: Add new benchmark template for running evaluations
python/swe/composio_swe/scaffold/templates/crewai/run_benchmark.template

  • Added new run_benchmark.template file.
  • Defined agent_func for workspace setup and issue kickoff.
  • Implemented main function to run benchmark.

  +25/-0
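
Putting the pieces together, the new template plausibly looks something like the following. The crew construction (build_crew, composio_toolset) is an assumption; the kickoff payload matches the snippet quoted in the code suggestions below:

    def agent_func(workspace_id, issue_config):
        # Point the toolset at the benchmark workspace, then hand the issue
        # to the agent.
        composio_toolset.set_workspace_id(workspace_id)
        crew = build_crew()  # hypothetical: assemble the CrewAI agent and task
        return crew.kickoff(
            {
                "issue_id": issue_config.issue_id,
                "issue": issue_config.issue_desc,
            }
        )

    def main() -> None:
        # Run the benchmark evaluation with the agent defined above.
        run_and_get_scores(agent_func)

    if __name__ == "__main__":
        main()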

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions


    ellipsis-dev bot commented Jul 5, 2024

    Your free trial has expired. To keep using Ellipsis, sign up at https://app.ellipsis.dev for $20/seat/month or reach us at help@ellipsis.dev


    PR Reviewer Guide 🔍

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Key issues to review

    Refactoring Concern:
    The removal of workspace_id and workspace_env from the constructor in toolset.py and the addition of set_workspace_id might introduce issues if the workspace ID is used before being set in scenarios not covered by the PR.

    Error Handling:
    The new error handling and logging enhancements in run_evaluation.py need to be carefully reviewed to ensure that they correctly capture and log errors without missing any exceptions.

    Functionality Change:
    The change from SHELL_EXECUTE_COMMAND to SHELL_EXEC_COMMAND in run_evaluation.py could potentially alter the behavior of command execution, depending on the implementation details of these actions.
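
    For context, a call site affected by this rename would change roughly as follows; execute_action is the usual ComposioToolSet entry point, but the params payload here is an assumption, not taken from the diff:

        # Before (as quoted above): Action.SHELL_EXECUTE_COMMAND
        # After: Action.SHELL_EXEC_COMMAND
        result = composio_toolset.execute_action(
            action=Action.SHELL_EXEC_COMMAND,
            params={"cmd": "python -m pytest"},  # payload shape is an assumption
        )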


    PR Code Suggestions ✨

    Possible issue
    Add a check to ensure logs_dir is not None before using it

    Add a check to ensure logs_dir is not None before using it in the run_and_get_scores
    function to avoid potential NoneType errors.

    python/swe/benchmark/run_evaluation.py [206-212]

    +if logs_dir is None:
    +    raise ValueError("logs_dir cannot be None")
     logger.info("Running agent with logs_dir: %s", logs_dir)
     run(
         agent_func=agent_func,
         test_split=test_split,
         include_hints=include_hints,
         logs_dir=logs_dir,
     )
     
    Suggestion importance[1-10]: 8

    Why: The suggestion to check if logs_dir is not None before using it in logging and function calls is crucial to prevent runtime errors, making it a significant improvement.

    Add error handling to catch and log exceptions during the kickoff process

    Add error handling in the agent_func to catch and log any exceptions that occur during the
    kickoff process.

    python/swe/composio_swe/scaffold/templates/crewai/run_benchmark.template [10-14]

    -return crew.kickoff(
    -    {
    -        "issue_id": issue_config.issue_id,
    -        "issue": issue_config.issue_desc,
    -    }
    -)
    +try:
    +    return crew.kickoff(
    +        {
    +            "issue_id": issue_config.issue_id,
    +            "issue": issue_config.issue_desc,
    +        }
    +    )
    +except Exception as e:
    +    logger.error(f"Error during kickoff: {e}")
    +    raise
     
    Suggestion importance[1-10]: 7

    Why: Adding error handling in the agent_func to catch exceptions during the kickoff process is a good practice to ensure robustness and easier debugging.

    Enhancement
    Add a type hint for the agent_func parameter to improve code clarity and type safety

    Add a type hint for the agent_func parameter in the run_and_get_scores function to improve
    code clarity and type safety.

    python/swe/benchmark/run_evaluation.py [204]

    -def run_and_get_scores(agent_func, test_split="1:50", include_hints=True):
    +def run_and_get_scores(agent_func: t.Callable, test_split="1:50", include_hints=True):
     
    Suggestion importance[1-10]: 7

    Why: Adding a type hint for agent_func as t.Callable in run_and_get_scores function enhances code clarity and helps with type checking, which is a good practice in Python.

    Best practice
    Use the not operator to check for None values in a more Pythonic way

    Instead of checking workspace_id for None directly, use the not operator for a more
    Pythonic approach.

    python/composio/tools/toolset.py [73]

    -if workspace_id is None:
    +if not workspace_id:
     
    Suggestion importance[1-10]: 6

    Why: The suggestion to use if not workspace_id: instead of if workspace_id is None: is a valid Pythonic improvement for readability, but it's not crucial.

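    One caveat worth noting before applying this suggestion: not workspace_id is broader than workspace_id is None, since it also treats falsy values such as the empty string as unset. A quick illustration:

        workspace_id = ""
        print(workspace_id is None)  # False: an empty string is not None
        print(not workspace_id)      # True: empty strings are falsy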

    codiumai-pr-agent-pro bot commented Jul 5, 2024

    CI Failure Feedback 🧐

    (Checks updated until commit 19fe761)

    Action: test (ubuntu-latest, 3.10)

    Failed stage: Unittests [❌]

    Failed test name: test_list_all

    Failure summary:

    The action failed because the test test_list_all in the file tests/test_cli/test_actions.py failed.
    The test failed due to an AssertionError indicating a server error response:

  • The expected exit code was 0, but the actual exit code was 1.
  • The error message included an HTML response indicating a "503 Service Unavailable" error, suggesting
    that the application failed to respond.

  • Relevant error logs:
    1:  ##[group]Operating System
    2:  Ubuntu
    ...
    
    496:  * [new branch]        featembed-tool                           -> origin/featembed-tool
    497:  * [new branch]        fix/readme                               -> origin/fix/readme
    498:  * [new branch]        fix/readme-logo                          -> origin/fix/readme-logo
    499:  * [new branch]        fix/swe-agent                            -> origin/fix/swe-agent
    500:  * [new branch]        ft-add-better-help-text                  -> origin/ft-add-better-help-text
    501:  * [new branch]        ft-apps-id                               -> origin/ft-apps-id
    502:  * [new branch]        ft-bring-back-core-sdk                   -> origin/ft-bring-back-core-sdk
    503:  * [new branch]        ft-did-you-mean                          -> origin/ft-did-you-mean
    504:  * [new branch]        ft-error-tracking                        -> origin/ft-error-tracking
    ...
    
    892:  tests/test_example.py::test_example[example0] SKIPPED (Testing in CI
    893:  will lead to too much LLM API usage)                                     [  4%]
    894:  tests/test_example.py::test_example[example1] SKIPPED (Testing in CI
    895:  will lead to too much LLM API usage)                                     [  6%]
    896:  tests/test_example.py::test_example[example2] SKIPPED (Testing in CI
    897:  will lead to too much LLM API usage)                                     [  9%]
    898:  tests/test_cli/test_actions.py::TestListActions::test_list_all[arguments0-exptected_outputs0-unexptected_outputs0] PASSED [ 11%]
    899:  tests/test_cli/test_actions.py::TestListActions::test_list_all[arguments1-exptected_outputs1-unexptected_outputs1] PASSED [ 13%]
    900:  tests/test_cli/test_actions.py::TestListActions::test_list_all[arguments2-exptected_outputs2-unexptected_outputs2] FAILED [ 15%]
    901:  tests/test_cli/test_actions.py::TestListActions::test_list_all[arguments3-exptected_outputs3-unexptected_outputs3] PASSED [ 18%]
    902:  tests/test_cli/test_actions.py::TestListActions::test_tag_not_found PASSED [ 20%]
    903:  tests/test_cli/test_actions.py::TestListActions::test_limit SKIPPED      [ 22%]
    904:  tests/test_cli/test_actions.py::TestListActions::test_copy PASSED        [ 25%]
    905:  tests/test_cli/test_add.py::TestComposioAdd::test_no_auth PASSED         [ 27%]
    906:  tests/test_cli/test_apps.py::TestList::test_list PASSED                  [ 29%]
    907:  tests/test_cli/test_apps.py::TestUpdate::test_update_not_required PASSED [ 31%]
    908:  tests/test_cli/test_apps.py::TestUpdate::test_update SKIPPED (Needs
    909:  investigation, this test fails in CI)                                    [ 34%]
    ...
    
    931:  tests/test_tools/test_toolset.py::test_find_actions_by_tags PASSED       [ 84%]
    932:  tests/test_tools/test_toolset.py::test_uninitialize_app PASSED           [ 86%]
    933:  tests/test_utils/test_decorators.py::test_deprecated PASSED              [ 88%]
    934:  tests/test_utils/test_git.py::test_get_git_user_info PASSED              [ 90%]
    935:  tests/test_utils/test_shared.py::test_get_pydantic_signature_format_from_schema_params PASSED [ 93%]
    936:  tests/test_utils/test_shared.py::test_json_schema_to_pydantic_field PASSED [ 95%]
    937:  tests/test_utils/test_shared.py::test_json_schema_to_fields_dict PASSED  [ 97%]
    938:  tests/test_utils/test_url.py::test_get_web_url PASSED                    [100%]
    939:  =================================== FAILURES ===================================
    ...
    
    965:  patch: t.Any,
    966:  arguments: t.Tuple[str, ...],
    967:  exptected_outputs: t.Tuple[str, ...],
    968:  unexptected_outputs: t.Tuple[str, ...],
    969:  ) -> None:
    970:  """Test list all actions."""
    971:  result = self.run("actions", *arguments)
    972:  >       assert result.exit_code == 0, result.stderr
    973:  E       AssertionError: Error: <!DOCTYPE html>
    974:  E         <html>
    975:  E           <head>
    976:  E             <meta name="viewport" content="width=device-width, initial-scale=1" />
    977:  E             <meta charset="utf-8" />
    978:  E             <title>Server Error</title>
    ...
    
    1031:  E               }
    1032:  E         
    1033:  E               h1 {
    1034:  E                 display: none;
    1035:  E                 margin-top: 2rem;
    1036:  E                 margin-bottom: 2rem;
    1037:  E               }
    1038:  E         
    1039:  E               .error-503 {
    ...
    
    1055:  E                   fill="#fff"
    1056:  E                 />
    1057:  E                 <path
    1058:  E                   d="M511.454 0C319.953 0 153.311 105.16 65.31 260.612c68.771-.144 202.704-.226 202.704-.226h.031v-.051c158.309 0 164.193.707 195.118 1.998l19.149.706c66.7 2.224 148.683 9.384 213.19 58.19 35.015 26.471 85.571 84.896 115.708 126.52 27.861 38.499 35.876 82.756 16.933 125.158-17.436 38.97-54.952 62.215-100.383 62.215H16.69s4.233 17.944 10.58 37.751h970.632A510.385 510.385 0 0 0 1024 512.218C1024.01 229.355 794.532 0 511.454 0Z"
    1059:  E                   fill="#fff"
    1060:  E                 />
    1061:  E               </svg>
    1062:  E         
    1063:  E               <h1 class="error-404">Nothing here... yet</h1>
    1064:  E               <h1 class="error-503">Application failed to respond</h1>
    ...
    
    1066:  E               <a href="https://railway.app" target="_blank"> Go to Railway </a>
    1067:  E             </main>
    1068:  E           </body>
    1069:  E         </html>
    1070:  E         
    1071:  E         
    1072:  E       assert 1 == 0
    1073:  E        +  where 1 = <Result SystemExit(1)>.exit_code
    1074:  tests/test_cli/test_actions.py:44: AssertionError
    ...
    
    1218:  composio/utils/shared.py                                                           117     83    29%   44, 47-51, 54-58, 61-77, 83, 101-104, 153-158, 174-221, 247-292
    1219:  composio/utils/url.py                                                               10      1    90%   35
    1220:  examples/crewai_ci_chart.py                                                         15     15     0%   1-38
    1221:  --------------------------------------------------------------------------------------------------------------
    1222:  TOTAL                                                                             7809   1139    85%
    1223:  Coverage HTML written to dir htmlcov
    1224:  Coverage XML written to file coverage.xml
    1225:  =========================== short test summary info ============================
    1226:  FAILED tests/test_cli/test_actions.py::TestListActions::test_list_all[arguments2-exptected_outputs2-unexptected_outputs2] - AssertionError: Error: <!DOCTYPE html>
    1227:  <html>
    1228:  <head>
    1229:  <meta name="viewport" content="width=device-width, initial-scale=1" />
    1230:  <meta charset="utf-8" />
    1231:  <title>Server Error</title>
    ...
    
    1276:  main a:hover {
    1277:  background-color: hsl(270, 60%, 42%);
    1278:  }
    1279:  h1 {
    1280:  display: none;
    1281:  margin-top: 2rem;
    1282:  margin-bottom: 2rem;
    1283:  }
    1284:  .error-503 {
    ...
    
    1298:  d="M4.756 438.175A520.713 520.713 0 0 0 0 489.735h777.799c-2.716-5.306-6.365-10.09-10.045-14.772-132.97-171.791-204.498-156.896-306.819-161.26-34.114-1.403-57.249-1.967-193.037-1.967-72.677 0-151.688.185-228.628.39-9.96 26.884-19.566 52.942-24.243 74.14h398.571v51.909H4.756ZM783.93 541.696H.399c.82 13.851 2.112 27.517 3.978 40.999h723.39c32.248 0 50.299-18.297 56.162-40.999ZM45.017 724.306S164.941 1018.77 511.46 1024c207.112 0 385.071-123.006 465.907-299.694H45.017Z"
    1299:  fill="#fff"
    1300:  />
    1301:  <path
    1302:  d="M511.454 0C319.953 0 153.311 105.16 65.31 260.612c68.771-.144 202.704-.226 202.704-.226h.031v-.051c158.309 0 164.193.707 195.118 1.998l19.149.706c66.7 2.224 148.683 9.384 213.19 58.19 35.015 26.471 85.571 84.896 115.708 126.52 27.861 38.499 35.876 82.756 16.933 125.158-17.436 38.97-54.952 62.215-100.383 62.215H16.69s4.233 17.944 10.58 37.751h970.632A510.385 510.385 0 0 0 1024 512.218C1024.01 229.355 794.532 0 511.454 0Z"
    1303:  fill="#fff"
    1304:  />
    1305:  </svg>
    1306:  <h1 class="error-404">Nothing here... yet</h1>
    1307:  <h1 class="error-503">Application failed to respond</h1>
    1308:  <a href="https://railway.app" target="_blank"> Go to Railway </a>
    1309:  </main>
    1310:  </body>
    1311:  </html>
    1312:  assert 1 == 0
    1313:  +  where 1 = <Result SystemExit(1)>.exit_code
    1314:  ============= 1 failed, 38 passed, 5 skipped, 1 warning in 22.91s ==============
    1315:  unittests: exit 1 (23.88 seconds) /home/runner/work/composio/composio/python> pytest -vvv -rfE --doctest-modules composio/ tests/ --cov=composio --cov=examples --cov-report=html --cov-report=xml --cov-report=term --cov-report=term-missing --cov-config=.coveragerc pid=5668
    1316:  .pkg: _exit> python /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
    1317:  unittests: FAIL code 1 (48.55=setup[19.02]+cmd[5.65,23.88] seconds)
    1318:  evaluation failed :( (48.79 seconds)
    1319:  ##[error]Process completed with exit code 1.
    

    ✨ CI feedback usage guide:

    The CI feedback tool (/checks) automatically triggers when a PR has a failed check.
    The tool analyzes the failed checks and provides several feedbacks:

    • Failed stage
    • Failed test name
    • Failure summary
    • Relevant error logs

    In addition to being automatically triggered, the tool can also be invoked manually by commenting on a PR:

    /checks "https://github.com/{repo_name}/actions/runs/{run_number}/job/{job_number}"
    

    where {repo_name} is the name of the repository, {run_number} is the run number of the failed check, and {job_number} is the job number of the failed check.

    Configuration options

    • enable_auto_checks_feedback - if set to true, the tool will automatically provide feedback when a check is failed. Default is true.
    • excluded_checks_list - a list of checks to exclude from the feedback, for example: ["check1", "check2"]. Default is an empty list.
    • enable_help_text - if set to true, the tool will provide a help message with the feedback. Default is true.
    • persistent_comment - if set to true, the tool will overwrite a previous checks comment with the new feedback. Default is true.
    • final_update_message - if persistent_comment is true and updating a previous checks message, the tool will also create a new message: "Persistent checks updated to latest commit". Default is true.

    See more information about the checks tool in the docs.

    @kaavee315 kaavee315 enabled auto-merge (squash) July 6, 2024 10:08
    Contributor

    @shubhras01 shubhras01 left a comment


    lgtm

    @kaavee315 kaavee315 merged commit 4cea4ce into master Jul 6, 2024
    5 checks passed
    @kaavee315 kaavee315 deleted the kaavee/benchmark_template branch July 6, 2024 10:13