Skip to content

Add gcov-based test pruning with coverage cache#1284

Draft
sbryngelson wants to merge 2 commits intoMFlowCode:masterfrom
sbryngelson:gcov-test-pruning-v3
Draft

Add gcov-based test pruning with coverage cache#1284
sbryngelson wants to merge 2 commits intoMFlowCode:masterfrom
sbryngelson:gcov-test-pruning-v3

Conversation

@sbryngelson
Copy link
Member

@sbryngelson sbryngelson commented Mar 2, 2026

Summary

  • File-level gcov coverage cache maps 555 test UUIDs to exercised .fpp source files (gzip JSON, 11KB, committed to repo)
  • --only-changes flag prunes tests by intersecting PR-changed files against coverage cache
  • --build-coverage-cache flag + 3-phase parallel cache builder (prepare, run, gcov collect)
  • New rebuild-cache CI job runs on Phoenix via SLURM when cases.py or Fortran dependency graph changes
  • Dep-change detection greps PR/push diffs for added use/include statements
  • Conservative fallbacks: missing cache runs all, missing sim coverage includes test, ALWAYS_RUN_ALL files trigger full suite
  • Remove continue-on-error from github CI job (fixes auto-cancellation)
  • 53 unit tests cover core coverage logic
  • TEMP: duplicate use in m_bubbles.fpp + remove CMakeLists.txt from ALWAYS_RUN_ALL to test the full cache rebuild pipeline in CI

Replaces #1283.

Test plan

  • CI lint checks pass
  • rebuild-cache job triggers (dep_changed detection)
  • Github test jobs download cache artifact and prune tests via --only-changes
  • Revert TEMP changes before merge

🤖 Generated with Claude Code

@sbryngelson sbryngelson marked this pull request as ready for review March 2, 2026 03:49
Copilot AI review requested due to automatic review settings March 2, 2026 03:49
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a gcov-derived, file-level coverage cache to enable pruning the MFC test suite to only those tests that exercise files changed in a PR, with CI support to rebuild and distribute the cache when it becomes stale.

Changes:

  • Introduces toolchain/mfc/test/coverage.py (cache build/load + changed-file detection + coverage-based test filtering) and commits a gzipped JSON cache.
  • Integrates new CLI flags --only-changes and --build-coverage-cache into the test runner and CLI schema.
  • Updates CI workflows/scripts to rebuild the cache on dependency graph changes and to run pruned test suites on PRs.

Reviewed changes

Copilot reviewed 14 out of 16 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
toolchain/mfc/test/coverage.py Implements coverage cache build/load + changed-file diffing + pruning logic (ALWAYS_RUN_ALL safeguards).
toolchain/mfc/test/test_coverage_unit.py Adds offline unit tests covering diff parsing, gcov JSON parsing, cache normalization, and pruning behavior.
toolchain/mfc/test/test.py Wires --only-changes pruning and --build-coverage-cache cache generation into the test command.
toolchain/mfc/cli/commands.py Adds CLI arguments/examples for coverage cache building and pruning.
toolchain/mfc/test/case.py Factors post-process output parameter mods into reusable constants/helpers used by the cache builder.
toolchain/mfc/test/test_coverage_cache.json.gz Adds committed baseline coverage cache artifact.
.github/workflows/test.yml Adds rebuild-cache job + dependency-change detection + artifact flow; enables pruning on PRs.
.github/workflows/phoenix/rebuild-cache.sh SLURM-executed rebuild script: clean, gcov build, build cache in parallel.
.github/workflows/phoenix/test.sh Enables pruning on PRs and scales CPU thread count based on allocation.
.github/workflows/phoenix/submit.sh Adjusts CPU SLURM submission options for cache rebuild workload.
.github/workflows/frontier/test.sh Enables pruning on PRs.
.github/workflows/frontier_amd/test.sh Enables pruning on PRs.
.github/file-filter.yml Adds cases_py filter to trigger cache rebuilds when test definitions change.
CMakeLists.txt Improves gcov build reliability (disable -march=native/LTO for gcov; add Fypp line-marker flag).
.gitignore Ignores legacy raw (non-gz) cache file.
src/simulation/m_bubbles.fpp TEMP change: duplicated use for pipeline exercise.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (3)
src/simulation/m_bubbles.fpp (1)

19-19: Remove duplicate module import.

Line 19 repeats use m_helper_basic already declared on Line 18. This is redundant and can cause avoidable compiler warnings/noise across toolchains.

Proposed fix
-    use m_helper_basic
.github/workflows/phoenix/test.sh (1)

54-56: Potential issue with unset variable in arithmetic comparison.

When SLURM_CPUS_ON_NODE is unset, the comparison SLURM_CPUS_ON_NODE > 64 evaluates the unset variable as 0, but the default :-8 is only applied in the else branch. This works but is fragile. Consider applying the default first:

Suggested improvement
-# Use up to 64 parallel test threads on CPU (GNR nodes have 192 cores).
-# Cap at 64 to avoid overwhelming MPI's ORTE daemons with concurrent launches.
-n_test_threads=$(( SLURM_CPUS_ON_NODE > 64 ? 64 : ${SLURM_CPUS_ON_NODE:-8} ))
+# Use up to 64 parallel test threads on CPU (GNR nodes have 192 cores).
+# Cap at 64 to avoid overwhelming MPI's ORTE daemons with concurrent launches.
+_cpus=${SLURM_CPUS_ON_NODE:-8}
+n_test_threads=$(( _cpus > 64 ? 64 : _cpus ))
toolchain/mfc/test/test_coverage_unit.py (1)

74-75: Use underscore-prefixed stub args to keep lint clean.

Optional cleanup to silence ARG005 without changing behavior.

♻️ Suggested tweak
-_case_stub.input_bubbles_lagrange = lambda case: None
-_case_stub.get_post_process_mods = lambda params: {}
+_case_stub.input_bubbles_lagrange = lambda _case: None
+_case_stub.get_post_process_mods = lambda _params: {}

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ee892c and dbe7098.

⛔ Files ignored due to path filters (1)
  • toolchain/mfc/test/test_coverage_cache.json.gz is excluded by !**/*.gz
📒 Files selected for processing (15)
  • .github/file-filter.yml
  • .github/workflows/frontier/test.sh
  • .github/workflows/frontier_amd/test.sh
  • .github/workflows/phoenix/rebuild-cache.sh
  • .github/workflows/phoenix/submit.sh
  • .github/workflows/phoenix/test.sh
  • .github/workflows/test.yml
  • .gitignore
  • CMakeLists.txt
  • src/simulation/m_bubbles.fpp
  • toolchain/mfc/cli/commands.py
  • toolchain/mfc/test/case.py
  • toolchain/mfc/test/coverage.py
  • toolchain/mfc/test/test.py
  • toolchain/mfc/test/test_coverage_unit.py

@sbryngelson sbryngelson force-pushed the gcov-test-pruning-v3 branch 2 times, most recently from fcfa309 to d6e922c Compare March 3, 2026 23:14
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from qodo-code-review bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from qodo-code-review bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from codecov bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from Copilot AI Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from Copilot AI Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from qodo-code-review bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from qodo-code-review bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from qodo-code-review bot Mar 5, 2026
@github-actions
Copy link

github-actions bot commented Mar 5, 2026

Claude Code Review

Head SHA: b0e6e8a98f9143c57ad442721bd2bb21a4ae17a7

Files changed: 26 — key paths:

  • .github/workflows/test.yml (+139/-20)
  • toolchain/mfc/test/coverage.py (new, +729 lines)
  • toolchain/mfc/test/test_coverage_unit.py (new, +664 lines)
  • toolchain/mfc/test/test_coverage_cache.json.gz (binary)
  • CMakeLists.txt (+33/-11)
  • toolchain/mfc/test/test.py (+79/-2)
  • toolchain/mfc/test/cases.py, toolchain/mfc/test/case.py
  • .github/workflows/phoenix/rebuild-cache.sh (new)
  • .github/workflows/phoenix/test.sh, frontier/test.sh

Summary

  • Introduces file-level gcov coverage caching (555 tests → .fpp file mapping, 11 KB gzip JSON committed to repo) to skip unaffected tests on PRs.
  • Adds --only-changes / --build-coverage-cache CLI flags to ./mfc.sh test and a new coverage.py toolchain module.
  • New rebuild-cache CI job runs on Phoenix via SLURM, committing updated cache to master on push events.
  • CMake changes disable -march=native and LTO for gcov builds, use -O1 and gfortran5 line markers.
  • 53 unit tests cover the core coverage logic.
  • PR body explicitly flags TEMP changes that must be reverted before merge (see critical finding Error on READE.md #1).

Findings

🔴 Critical

1. TEMP changes left in the PR — must be reverted before merge
(toolchain/mfc/test/cases.py, .github/workflows/test.yml)

The PR description states:

TEMP: duplicate use in m_bubbles.fpp + remove CMakeLists.txt from ALWAYS_RUN_ALL to test the full cache rebuild pipeline in CI

cases.py has a live change ("1D_qbmm" added to casesToSkip), and CMakeLists.txt is not in ALWAYS_RUN_ALL. Unit test at test_coverage_unit.py line ~1900 explicitly asserts should_run_all_tests({"CMakeLists.txt"}) is False. Whether CMakeLists should trigger full suite is a real design question — but the PR body says this was temporary, so the intent needs to be resolved before merge.


🟠 Issues

2. Direct push from CI to refs/heads/master
(.github/workflows/test.yml, "Commit Cache to Master" step, line ~321)

git push origin HEAD:refs/heads/master

The rebuild-cache job commits the regenerated .json.gz directly to master on every push event (and workflow_dispatch) where cases.py or Fortran deps changed. This bypasses code review, and if branch protection rules require PRs on master, this push will fail and break CI. It also means a malicious or broken coverage cache could land on master without review. Consider pushing to a dedicated branch and opening an auto-PR, or using an Action designed for this pattern (e.g., stefanzweifel/git-auto-commit-action with proper safety guards).

3. dep_changed regex matches +++/--- diff header lines
(.github/workflows/test.yml, "Check for Fortran dependency changes" step, line ~262)

grep -qP '^[+-]\s*(use[\s,]+\w|#:include\s|include\s+['"'"'"])'

The pattern ^[+-] also matches the +++ / --- diff header lines (e.g., +++ b/src/simulation/m_rhs.fpp). A file named include_foo.fpp or any diff with a +++ line followed by space would not match, but --- a/use_mpi.fpp or similar filenames could. More importantly, +++ b/toolchain/mfc/test/cases.py starts with + — though \s*(use...) wouldn't match +++ b/.... In practice this is low-risk but the fix is simply ^[+-][^+-] to exclude diff header lines.

4. CMAKE_Fortran_FLAGS_RELEASE override is aggressive
(CMakeLists.txt, line ~142)

set(CMAKE_Fortran_FLAGS_RELEASE "-O1 -DNDEBUG" CACHE STRING "" FORCE)

Using FORCE overwrites any flags the user or CI set before calling CMake. Standard practice is to append: string(APPEND CMAKE_Fortran_FLAGS_RELEASE " -O1"). The -DNDEBUG is already in gfortran's Release defaults. This also removes -O2 / -O3 entirely from release gcov builds even if a future caller adds flags before the FORCE write — it's unconditional.

5. New .fpp file with no existing coverage can slip through undetected
(toolchain/mfc/test/coverage.py, filter_tests_by_coverage)

The unit test test_new_fpp_file_no_coverage_skips explicitly asserts this:

# Brand new .fpp file has no coverage in cache -> 0 tests run
assert len(to_run) == 0

If a developer adds a new .fpp file and includes it from an existing module with a new #:include or use statement, dep_changed would fire and rebuild the cache (correct). But if the new file is included via an existing include/use that was already in the code (just a new file being included), dep_changed won't detect it. In that case, --only-changes will skip all tests because no existing test covers the new file — even though the new file could break things. This is a subtle gap; at minimum it warrants a note in ALWAYS_RUN_ALL docs or a warning when filter_tests_by_coverage finds zero coverage for a changed file.


🟡 Minor / Nits

6. rebuild-cache 'skipped' state not excluded from test job guard
(.github/workflows/test.yml, line ~333)

needs.rebuild-cache.result != 'cancelled' &&

When rebuild-cache is skipped (its if: condition is false), the result is 'skipped', not 'cancelled'. So != 'cancelled' passes, tests still run, and the artifact download step correctly no-ops (if: needs.rebuild-cache.result == 'success' is false). The behavior is correct but the intent reads oddly — != 'cancelled' should arguably be not in ('cancelled', 'failure') to also block on cache-build failures. Currently a broken cache build silently falls back to full suite; that may be acceptable, but is worth documenting.

7. FYPP_GCOV_OPTS variable initialized only in the gcov branch
(CMakeLists.txt, HANDLE_SOURCES macro, line ~402)

${FYPP_GCOV_OPTS}

FYPP_GCOV_OPTS is only set(...) inside the gcov if block; for non-gcov builds it's empty/undefined. CMake expands undefined variables to empty string, so this is safe in practice, but a set(FYPP_GCOV_OPTS "" CACHE INTERNAL "") default at the top of the file would make it explicit.

8. get_changed_files uses shallow clone on CI without deepen guard
(toolchain/mfc/test/coverage.py, get_changed_files)

The test job YAML fetches --depth=1 for master then --deepen=200. In get_changed_files, git merge-base is tried against both local and origin/master. If the deepen fetch fails (network issue), merge-base will fail silently and return None → full suite runs. This is the correct conservative fallback, but the continue-on-error: true on the fetch step (test.yml line ~353) means a network failure during fetch is silent. Consider logging when fallback fires.


Overall this is a well-engineered feature with good test coverage (53 unit tests, comprehensive edge cases). The conservative fallbacks (missing cache → full suite, missing sim coverage → include) are appropriate for a safety-critical CFD code. The main blockers before merge are: resolving the explicit TEMP items (#1), and reviewing the direct push to master (#2).

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Walkthrough

This pull request introduces file-level coverage-based test filtering for pull requests, complemented by CI/CD workflow optimizations and build system changes. A new coverage module analyzes which source files each test exercises, enabling selective test execution on PR changes. Benchmark case configurations reduce default simulation step counts. GitHub workflow files are updated with parallelism caps, partition adjustments, pruning flags, and coverage cache management. CMakeLists.txt adds gcov instrumentation support with line-marker mapping. New CLI test command options enable coverage cache building and change-based filtering. The test runner integrates coverage loading and filtering logic to skip unaffected tests during pull requests while maintaining full test runs for other events.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.62% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add gcov-based test pruning with coverage cache' accurately captures the main feature being introduced—a gcov-based coverage cache system for test pruning.
Description check ✅ Passed The description provides comprehensive details about the changes, includes test plan items and commit summaries, but omits the required 'Type of change' checkbox section and does not fully follow the template structure.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
toolchain/mfc/test/cases.py (1)

1074-1074: Consider adding a comment explaining why 1D_qbmm is skipped.

The rationale for skipping this example isn't immediately clear. If there's a specific issue (e.g., coverage-related failures, excessive runtime, or known instability), a brief inline comment would help future maintainers understand the exclusion.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cc9039fb-c9a7-4cbb-b465-9d02f54a165b

📥 Commits

Reviewing files that changed from the base of the PR and between 31899ad and b0e6e8a.

⛔ Files ignored due to path filters (1)
  • toolchain/mfc/test/test_coverage_cache.json.gz is excluded by !**/*.gz
📒 Files selected for processing (25)
  • .github/file-filter.yml
  • .github/workflows/bench.yml
  • .github/workflows/frontier/bench.sh
  • .github/workflows/frontier/submit.sh
  • .github/workflows/frontier/test.sh
  • .github/workflows/lint-toolchain.yml
  • .github/workflows/phoenix/bench.sh
  • .github/workflows/phoenix/rebuild-cache.sh
  • .github/workflows/phoenix/submit.sh
  • .github/workflows/phoenix/test.sh
  • .github/workflows/test.yml
  • .gitignore
  • CMakeLists.txt
  • benchmarks/5eq_rk3_weno3_hllc/case.py
  • benchmarks/hypo_hll/case.py
  • benchmarks/ibm/case.py
  • benchmarks/igr/case.py
  • benchmarks/viscous_weno5_sgb_acoustic/case.py
  • toolchain/mfc/build.py
  • toolchain/mfc/cli/commands.py
  • toolchain/mfc/test/case.py
  • toolchain/mfc/test/cases.py
  • toolchain/mfc/test/coverage.py
  • toolchain/mfc/test/test.py
  • toolchain/mfc/test/test_coverage_unit.py
💤 Files with no reviewable changes (1)
  • .github/workflows/lint-toolchain.yml

Comment on lines +109 to +123
for candidate in candidates:
path = shutil.which(candidate)
if path is None:
continue
try:
result = subprocess.run(
[path, "--version"],
capture_output=True, text=True, timeout=10, check=False
)
version_out = result.stdout
if "Apple LLVM" in version_out or "Apple clang" in version_out:
continue # Apple's gcov cannot parse GCC-generated .gcda files
if "GCC" in version_out or "GNU" in version_out:
return path
except Exception:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the coverage.py file
cd toolchain && find . -name "coverage.py" -type f

Repository: MFlowCode/MFC

Length of output: 110


🏁 Script executed:

# Read the full function to understand context
head -c 50000 toolchain/mfc/test/coverage.py | tail -c 50000

Repository: MFlowCode/MFC

Length of output: 28642


🏁 Script executed:

# Get line count to understand file size
wc -l toolchain/mfc/test/coverage.py

Repository: MFlowCode/MFC

Length of output: 91


🏁 Script executed:

# Read the function around lines 109-123
sed -n '65,135p' toolchain/mfc/test/coverage.py

Repository: MFlowCode/MFC

Length of output: 2356


🏁 Script executed:

# Search for usage of find_gcov_binary to see if there are downstream assumptions
rg "find_gcov_binary" toolchain/

Repository: MFlowCode/MFC

Length of output: 183


🏁 Script executed:

# Check if there are any comments or git history explaining the "GCC"/"GNU" check intent
git -C . log --all -S '"GCC" in version_out or "GNU" in version_out' --oneline --max-count=10

Repository: MFlowCode/MFC

Length of output: 143


🏁 Script executed:

# Check for any version-specific feature usage of gcov in the code
rg "gcov_bin|gcov --" toolchain/mfc/test/coverage.py | head -30

Repository: MFlowCode/MFC

Length of output: 555


🏁 Script executed:

# Look for gcov feature usage (e.g., --json-format which was added in gcov 13)
rg "\-\-json-format" toolchain/

Repository: MFlowCode/MFC

Length of output: 151


🏁 Script executed:

# Search for gcov version outputs and documentation
web_search "gcov --version output GCC GNU format different distributions"

Repository: MFlowCode/MFC

Length of output: 107


🏁 Script executed:

# Check if --json-format requires a specific gcov version
web_search "gcov --json-format minimum version requirement gcov 13"

Repository: MFlowCode/MFC

Length of output: 107


🏁 Script executed:

# Look for any error handling or fallback for old gcov versions
rg "json-format|gcov.*version" toolchain/mfc/test/coverage.py

Repository: MFlowCode/MFC

Length of output: 350


🏁 Script executed:

# Check if the code handles gcov version validation anywhere
rg "gcov.*13|json.?format|version" toolchain/mfc/test/coverage.py -A 2 -B 2

Repository: MFlowCode/MFC

Length of output: 2127


🏁 Script executed:

# Check if there are any tests or documentation about gcov version requirements
fd -type f -name "*.md" | xargs grep -l "gcov\|coverage" 2>/dev/null | head -5

Repository: MFlowCode/MFC

Length of output: 227


🏁 Script executed:

# Search for any version checks of gcov elsewhere in the codebase
rg "version|gcov" toolchain/mfc/build/ | grep -i "gcov\|version"

Repository: MFlowCode/MFC

Length of output: 117


🏁 Script executed:

# Check the build system for gcov requirements
fd -type f \( -name "*.cmake" -o -name "*.py" \) | xargs grep -l "gcov\|coverage" 2>/dev/null

Repository: MFlowCode/MFC

Length of output: 229


🏁 Script executed:

# Look at the git diff to understand what changed in this PR
git diff HEAD~1..HEAD -- toolchain/mfc/test/coverage.py 2>/dev/null | head -100

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

# Check if there's any CI configuration that specifies gcov version requirements
fd -type f \( -name ".github" -o -name ".gitlab-ci" -o -name "*.yml" -o -name "*.yaml" \) | head -20

Repository: MFlowCode/MFC

Length of output: 227


🏁 Script executed:

# Search for any documentation or README about gcov requirements
find . -name "README*" -o -name "CONTRIBUTING*" -o -name "BUILD*" | head -10

Repository: MFlowCode/MFC

Length of output: 413


🏁 Script executed:

# Check CMake files for gcov configuration
find . -name "*.cmake" 2>/dev/null | xargs grep -l "gcov" 2>/dev/null

Repository: MFlowCode/MFC

Length of output: 41


🏁 Script executed:

# Look for any version-related checks in Python code that mention gcov
rg "major|minor|version" toolchain/mfc/test/coverage.py | head -20

Repository: MFlowCode/MFC

Length of output: 825


🏁 Script executed:

# Check if the code actually validates gcov version before using --json-format
sed -n '280,320p' toolchain/mfc/test/coverage.py

Repository: MFlowCode/MFC

Length of output: 1686


🏁 Script executed:

# Check the _parse_gcov_json_output function to understand what it does
sed -n '150,220p' toolchain/mfc/test/coverage.py

Repository: MFlowCode/MFC

Length of output: 2626


🏁 Script executed:

# Check if there are real-world examples of distro gcov versions without GCC/GNU
python3 << 'EOF'
import subprocess
import re

# Check what a typical gcov --version output looks like when parsed
test_outputs = [
    "gcov (GCC 11.2.0) 11.2.0",  # Standard GCC format
    "gcov (GCC) 9.3.0",           # Some distro format
    "gcov version 11.0 (GCC 11)",  # Alternative format
    "gcov 11.2",                   # Minimal format
]

for output in test_outputs:
    has_gcc = "GCC" in output
    has_gnu = "GNU" in output
    accepted = has_gcc or has_gnu
    print(f"Output: {output!r}")
    print(f"  Has GCC/GNU: {accepted}\n")
EOF

Repository: MFlowCode/MFC

Length of output: 260


🏁 Script executed:

# Verify the code does NOT require specific gcov version features before using --json-format
grep -B 5 -A 5 "json-format" toolchain/mfc/test/coverage.py

Repository: MFlowCode/MFC

Length of output: 504


🏁 Script executed:

# Check if there's any version validation in the codebase
rg "check.*version|version.*check|json.?format.*version" toolchain/mfc/test/

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

# Check if lines 109-123 are actually changed in this PR
git diff HEAD -- toolchain/mfc/test/coverage.py | grep -A 5 -B 5 "GCC\|GNU" | head -50

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

# Get the actual diff of the coverage.py file
git diff HEAD -- toolchain/mfc/test/coverage.py | head -200

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

# Check git status to see what's been modified
git status toolchain/mfc/test/coverage.py

Repository: MFlowCode/MFC

Length of output: 122


🏁 Script executed:

# Get exact line numbers for the function
sed -n '67,133p' toolchain/mfc/test/coverage.py | cat -n | grep -E "^[ ]*[109]|^[ ]*11[0-9]|^[ ]*12[0-3]"

Repository: MFlowCode/MFC

Length of output: 415


🏁 Script executed:

# Show the specific lines 109-123
sed -n '109,123p' toolchain/mfc/test/coverage.py

Repository: MFlowCode/MFC

Length of output: 658


🏁 Script executed:

# Check what the actual line numbers are in the function
awk 'NR>=67 && NR<=133 { print NR": "$0 }' toolchain/mfc/test/coverage.py

Repository: MFlowCode/MFC

Length of output: 2611


🏁 Script executed:

# Check if there's a PR or if we're looking at a specific commit
git log --oneline -1

Repository: MFlowCode/MFC

Length of output: 124


🏁 Script executed:

# Get the diff to see what lines actually changed
git show --stat

Repository: MFlowCode/MFC

Length of output: 50370


🏁 Script executed:

# Get full diff of coverage.py to see what actually changed
git show -- toolchain/mfc/test/coverage.py | head -300

Repository: MFlowCode/MFC

Length of output: 11061


🏁 Script executed:

# Verify gcov is used without version validation before calling --json-format
rg "json.format" toolchain/mfc/test/coverage.py -B 5 -A 5

Repository: MFlowCode/MFC

Length of output: 504


🏁 Script executed:

# Check if there's any error handling for failed --json-format calls
sed -n '280,310p' toolchain/mfc/test/coverage.py

Repository: MFlowCode/MFC

Length of output: 1361


Remove strict gcov version string matching in find_gcov_binary().

The current logic only accepts gcov binaries whose --version output contains "GCC" or "GNU", but some legitimate distro-provided GNU gcov binaries lack these strings (e.g., gcov 11.2). This causes false negatives and raises a spurious "GNU gcov not found" error, breaking cache builds. Since the code already gracefully handles gcov failures downstream and rejects Apple LLVM/clang separately, the string check is unnecessarily strict.

🔧 Proposed fix
             if "Apple LLVM" in version_out or "Apple clang" in version_out:
                 continue  # Apple's gcov cannot parse GCC-generated .gcda files
-            if "GCC" in version_out or "GNU" in version_out:
-                return path
+            return path
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for candidate in candidates:
path = shutil.which(candidate)
if path is None:
continue
try:
result = subprocess.run(
[path, "--version"],
capture_output=True, text=True, timeout=10, check=False
)
version_out = result.stdout
if "Apple LLVM" in version_out or "Apple clang" in version_out:
continue # Apple's gcov cannot parse GCC-generated .gcda files
if "GCC" in version_out or "GNU" in version_out:
return path
except Exception:
for candidate in candidates:
path = shutil.which(candidate)
if path is None:
continue
try:
result = subprocess.run(
[path, "--version"],
capture_output=True, text=True, timeout=10, check=False
)
version_out = result.stdout
if "Apple LLVM" in version_out or "Apple clang" in version_out:
continue # Apple's gcov cannot parse GCC-generated .gcda files
return path
except Exception:
🧰 Tools
🪛 Ruff (0.15.2)

[error] 114-114: subprocess call: check for execution of untrusted input

(S603)


[warning] 123-123: Do not catch blind exception: Exception

(BLE001)

Comment on lines +687 to +689
changed_fpp = {f for f in changed_files if f.endswith(".fpp")}
if not changed_fpp:
return [], list(cases)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Non-.fpp Fortran changes are currently treated as “no source changes,” so tests get skipped.

This branch skips all tests when no changed .fpp files are present. A PR that changes .f90/.f source can therefore be incorrectly treated as unaffected under --only-changes.

🔧 Proposed fix
-    changed_fpp = {f for f in changed_files if f.endswith(".fpp")}
-    if not changed_fpp:
+    changed_fortran = {
+        f for f in changed_files if f.endswith((".fpp", ".f90", ".f"))
+    }
+    if not changed_fortran:
         return [], list(cases)
+
+    # Cache is file-level .fpp today; non-.fpp Fortran edits cannot be pruned safely.
+    if any(f.endswith((".f90", ".f")) for f in changed_fortran):
+        return list(cases), []

-    changed_sim = any(f.startswith("src/simulation/") for f in changed_fpp)
+    changed_sim = any(f.startswith("src/simulation/") for f in changed_fortran)
@@
-        if test_file_set & changed_fpp:
+        if test_file_set & changed_fortran:
             to_run.append(case)

Comment on lines +118 to +132
# Example-based tests cover no unique files beyond non-example tests,
# so they add no value to coverage-based pruning. Skip them entirely.
example_skipped = [c for c in cases if "Example" in c.trace]
cases = [c for c in cases if "Example" not in c.trace]
skipped_cases += example_skipped
if example_skipped:
cons.print(f"[dim]Skipped {len(example_skipped)} example tests "
f"(redundant coverage)[/dim]")

cache = load_coverage_cache(common.MFC_ROOT_DIR)
if cache is None:
cons.print("[yellow]Coverage cache missing or stale.[/yellow]")
cons.print("[yellow]Run: ./mfc.sh build --gcov -j 8 && ./mfc.sh test --build-coverage-cache[/yellow]")
cons.print("[yellow]Falling back to full test suite.[/yellow]")
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

“Fallback to full suite” path still excludes Example tests.

Examples are removed before cache/diff checks, so the cache-miss and run-all fallbacks do not actually run the full suite.

🔧 Proposed fix
-        # Example-based tests cover no unique files beyond non-example tests,
-        # so they add no value to coverage-based pruning. Skip them entirely.
-        example_skipped = [c for c in cases if "Example" in c.trace]
-        cases = [c for c in cases if "Example" not in c.trace]
-        skipped_cases += example_skipped
-        if example_skipped:
-            cons.print(f"[dim]Skipped {len(example_skipped)} example tests "
-                       f"(redundant coverage)[/dim]")
+        example_cases = [c for c in cases if "Example" in c.trace]
+        non_example_cases = [c for c in cases if "Example" not in c.trace]

         cache = load_coverage_cache(common.MFC_ROOT_DIR)
         if cache is None:
             cons.print("[yellow]Coverage cache missing or stale.[/yellow]")
             cons.print("[yellow]Run: ./mfc.sh build --gcov -j 8 && ./mfc.sh test --build-coverage-cache[/yellow]")
             cons.print("[yellow]Falling back to full test suite.[/yellow]")
@@
             else:
+                # Exclude example tests only when pruning is actually applied.
+                cases = non_example_cases
+                skipped_cases += example_cases
+                if example_cases:
+                    cons.print(f"[dim]Skipped {len(example_cases)} example tests "
+                               f"(redundant coverage)[/dim]")
                 changed_fpp = {f for f in changed_files if f.endswith(".fpp")}
                 if not changed_fpp:

Also applies to: 135-143

@sbryngelson sbryngelson force-pushed the gcov-test-pruning-v3 branch from b0e6e8a to 66c792a Compare March 5, 2026 14:05
@github-actions
Copy link

github-actions bot commented Mar 5, 2026

Claude Code Review

Head SHA: 66c792a23ab4709c45979432efda463ad24e541a
Files changed: 25

Changed files
  • .github/file-filter.yml
  • .github/workflows/bench.yml
  • .github/workflows/frontier/bench.sh, submit.sh, test.sh
  • .github/workflows/phoenix/bench.sh, rebuild-cache.sh, submit.sh, test.sh
  • .github/workflows/test.yml
  • .gitignore, CMakeLists.txt
  • benchmarks/*/case.py (5 files — benchmark step fix)
  • toolchain/mfc/build.py, toolchain/mfc/cli/commands.py
  • toolchain/mfc/test/case.py, cases.py, coverage.py (new), test.py
  • toolchain/mfc/test/test_coverage_cache.json.gz (binary, new)
  • toolchain/mfc/test/test_coverage_unit.py (new, 664 lines)

Summary

  • Adds gcov-based test pruning: a file-level coverage cache maps each of 555 test UUIDs to the .fpp source files it exercises, and --only-changes intersects PR-changed files against the cache to skip irrelevant tests.
  • Three-phase parallel cache builder (prepare → run → gcov collect), new rebuild-cache CI job on Phoenix/SLURM, and conservative fallbacks (missing cache → run all; no sim coverage → include conservatively).
  • 53 unit tests in test_coverage_unit.py cover the core filtering logic without requiring a build or network access.
  • Benchmarks refactored to use a ternary for t_step_stop/t_step_save when --steps is provided.

Findings

🔴 Security: Fork code runs with contents: write on self-hosted runner

.github/workflows/test.yml, rebuild-cache job (~line 274–306)

The rebuild-cache job checks out the PR head SHA from the fork (ref: github.event.pull_request.head.sha) and then executes:

bash .github/workflows/phoenix/submit.sh .github/workflows/phoenix/rebuild-cache.sh cpu none

Because the checkout fetches the fork's .github/workflows/phoenix/rebuild-cache.sh, a malicious fork can inject arbitrary commands into that script. The job runs on a persistent self-hosted Phoenix runner with contents: write permission, which allows pushing to refs/heads/master.

The trigger is pull_request (not pull_request_target), so the workflow file itself runs from the base branch — but the shell scripts invoked by that workflow come from the checked-out worktree, which is the fork's code. This is a supply-chain risk.

Mitigation options:

  1. Only trigger rebuild-cache on push (master) and workflow_dispatch, never on pull_request.
  2. Or: pin the scripts to run from the base-branch copy before checking out the PR head (two-step checkout, or use a separate step that downloads the scripts from the base).
  3. Or: add explicit github.repository == 'MFlowCode/MFC' + github.event.pull_request.head.repo.full_name == 'MFlowCode/MFC' guards so the job is skipped entirely for forks.

🟡 TEMP changes must be reverted before merge (as noted in PR body)

toolchain/mfc/test/cases.py and toolchain/mfc/test/test_coverage_unit.py

The PR explicitly calls out two temporary changes that test the pipeline:

  1. A duplicate use statement in m_bubbles.fpp to trigger dep-change detection.
  2. Removal of CMakeLists.txt from ALWAYS_RUN_ALL, confirmed by the new unit test test_cmakelists_does_not_trigger_all which asserts should_run_all_tests({"CMakeLists.txt"}) is False.

The root-level CMakeLists.txt controls which source files are compiled and with what flags. A change there (e.g., adding a new .fpp to the build) can affect all tests without touching any .fpp file directly. Once the TEMP changes are reverted, CMakeLists.txt (and/or CMakeLists.txt prefix) should be back in ALWAYS_RUN_ALL.


🟡 Variable naming confusion in build_coverage_cache

toolchain/mfc/test/coverage.py, ~line 1189–1212

phase1_jobs = min(n_jobs, 16)
cons.print(f"... ({phase1_jobs} test workers, {n_jobs} gcov workers)...")
# ...
# Phase 1: single-threaded — no parallelism here
# Phase 2:
with ThreadPoolExecutor(max_workers=phase1_jobs) as pool:  # ← uses phase1_jobs
# Phase 3:
with ThreadPoolExecutor(max_workers=n_jobs) as pool:

Phase 1 is explicitly single-threaded; phase1_jobs is actually used as the Phase 2 worker count. The name is misleading — consider n_test_workers or max_concurrent_tests.


🟡 cons.raw.file redirect in _prepare_test is fragile

toolchain/mfc/test/coverage.py, ~lines 1143–1153

orig_file = cons.raw.file
cons.raw.file = io.StringIO()
try:
    ...
finally:
    cons.raw.file = orig_file

This suppresses Rich console output by swapping out its internal file handle. It depends on Rich's internal raw.file attribute being stable across Rich versions, which is not guaranteed. A cleaner approach would be to pass a quiet flag to get_inp() or capture at the subprocess level. The comment acknowledges it's not thread-safe, which is consistent with Phase 1 being single-threaded — just note this would be a silent regression if Phase 1 is ever parallelized.


🟢 Binary blob growth over time

toolchain/mfc/test/test_coverage_cache.json.gz

The gzip cache (11 KB now) is committed to the repo and regenerated on each push to master that touches cases.py or Fortran deps. Each regeneration creates a new git object. At ~11 KB per commit, this is negligible currently, but over hundreds of cache rebuilds it could accumulate. Git LFS or downloading the artifact separately (rather than committing to the repo) would avoid the accumulation. Not a blocker but worth tracking.


Minor nits

  • toolchain/mfc/test/coverage.py line 893: double blank line before _parse_gcov_json_output (style).
  • _prepare_test comment says # NOTE: not thread-safe — Phase 1 must remain single-threaded but doesn't explain why (the cons.raw.file swap). A one-liner cross-reference would help future readers.
  • _run_single_test_direct: ppn = test_info["ppn"] always resolves to 1 via getattr(case, 'ppn', 1) since TestCase objects don't have a ppn attribute. If multi-rank coverage matters, this should use the same ppn logic as the normal test runner.

Overall: The architecture is well thought-out and the conservative fallbacks are sound. The primary concern before merging is the fork-code execution issue in the rebuild-cache CI job, and ensuring the TEMP changes are reverted.

@github-actions
Copy link

github-actions bot commented Mar 5, 2026

Claude Code Review

Head SHA: 1bac876a65c0480ab4201bf795a07b00fd84148b
Files changed: 27

Changed files
.github/file-filter.yml
.github/scripts/retry-build.sh
.github/scripts/run-tests-with-retry.sh
.github/workflows/bench.yml
.github/workflows/frontier/{bench,submit,test}.sh
.github/workflows/phoenix/{bench,rebuild-cache,submit,test}.sh
.github/workflows/test.yml
.gitignore
CMakeLists.txt
benchmarks/*/case.py  (5 files)
toolchain/mfc/build.py
toolchain/mfc/cli/commands.py
toolchain/mfc/test/{case,cases,test}.py
toolchain/mfc/test/coverage.py  (new, 767 lines)
toolchain/mfc/test/test_coverage_cache.json.gz  (new)
toolchain/mfc/test/test_coverage_unit.py  (new, 664 lines)

Summary

  • Adds gcov-based file-level coverage cache (555 test UUIDs -> exercised .fpp files, gzip JSON) committed to the repo
  • Introduces --only-changes flag to prune tests by intersecting changed .fpp files against the coverage cache; conservative fallbacks on missing/stale cache
  • Adds rebuild-cache CI job that builds MFC with --gcov, runs all tests to collect per-test coverage, and auto-commits the updated cache on push-to-master
  • Dep-change detection in the lint-gate step greps the PR/push diff for added/removed use/include statements to trigger a cache rebuild
  • 53 unit tests cover core coverage-cache logic without requiring a real build or git state

Findings

1. TEMP code must be reverted before merge (blocker)

The PR description explicitly flags:

TEMP: duplicate use in m_bubbles.fpp + remove CMakeLists.txt from ALWAYS_RUN_ALL to test the full cache rebuild pipeline in CI

The test plan marks this as a required step ("Revert TEMP changes before merge"). The cases.py hunk also adds "1D_qbmm" to the example-skip list — it is unclear whether that is also TEMP. These must be cleaned up before merging.

2. Push-event dep-change detection silently fails on shallow clones

.github/workflows/test.yml — dep-check step:

DIFF=$(git diff "$BEFORE".."$AFTER" 2>/dev/null || echo "")

On the standard actions/checkout shallow clone (--depth 1), $BEFORE is not present in local history, so git diff fails, DIFF is empty, and dep_changed is set to false. The cache will not be rebuilt after a push that adds a use statement, silently leaving it stale.

For PR events the fallback is safe (dep_changed=true). The push-event path should match:

DIFF=$(git diff "$BEFORE".."$AFTER" 2>/dev/null) || {
  echo "git diff failed on push — defaulting to dep_changed=true for safety."
  echo "dep_changed=true" >> "$GITHUB_OUTPUT"
  exit 0
}

3. --gcov passthrough to ./mfc.sh test may cause argument errors

.github/scripts/run-tests-with-retry.sh:

--test-all|--single|--debug|--gcov) PASSTHROUGH="$PASSTHROUGH $arg" ;;

--gcov is passed through to ./mfc.sh test, but test does not define a --gcov argument in cli/commands.py (only build does). The rebuild-cache.sh script also calls ./mfc.sh test --build-coverage-cache --gcov, which would fail if the test command rejects unknown flags. Verify that ARG("gcov") resolves correctly during a test run, or add a no-op --gcov argument to the test command definition.

4. Benchmark step count reductions change benchmark semantics

All 5 benchmark case.py files reduce t_step_stop/t_step_save from 7x(5*size+5) to 2x(5*size+5) (viscous from 6x to 2x). This 3-3.5x reduction in simulation length may reduce warmup time and steady-state sampling quality. If intentional for CI speed, add a comment to avoid confusion when interpreting benchmark results later.

5. Bare SLURM_CPUS_ON_NODE in arithmetic is fragile

.github/workflows/phoenix/test.sh:

n_test_threads=$(( SLURM_CPUS_ON_NODE > 64 ? 64 : ${SLURM_CPUS_ON_NODE:-8} ))

The bare SLURM_CPUS_ON_NODE in the ternary guard evaluates to 0 when unset, so the fallback to 8 works by accident. Prefer consistent usage:

n_test_threads=$(( ${SLURM_CPUS_ON_NODE:-8} > 64 ? 64 : ${SLURM_CPUS_ON_NODE:-8} ))

Positive observations

  • The github.repository == 'MFlowCode/MFC' guard on rebuild-cache correctly prevents untrusted fork code from running on the self-hosted Phoenix runner with write permissions.
  • Conservative fallbacks throughout coverage.py (missing cache -> full suite, UUID not in cache -> include test, missing sim coverage -> include test) are well thought out.
  • The _normalize_cache function for backwards compat with old line-level format is a nice touch.
  • Removing continue-on-error: true from the GitHub test job restores auto-cancellation correctly; Phoenix/Frontier already had retry logic.
  • 664 lines of offline unit tests for coverage logic without requiring a real build or git state is good practice.

Generated by Claude Sonnet 4.6 via Claude Code

@codecov
Copy link

codecov bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.84%. Comparing base (2c3590c) to head (a7cfc5c).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1284      +/-   ##
==========================================
- Coverage   44.95%   44.84%   -0.12%     
==========================================
  Files          70       70              
  Lines       20503    20511       +8     
  Branches     1946     1946              
==========================================
- Hits         9217     9198      -19     
- Misses      10164    10191      +27     
  Partials     1122     1122              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions
Copy link

github-actions bot commented Mar 6, 2026

Claude Code Review

Head SHA: df97a8a
Files changed: 27

Files
  • .github/file-filter.yml
  • .github/scripts/retry-build.sh
  • .github/scripts/run-tests-with-retry.sh
  • .github/workflows/bench.yml
  • .github/workflows/frontier/bench.sh, submit.sh, test.sh
  • .github/workflows/phoenix/bench.sh, rebuild-cache.sh, submit.sh, test.sh
  • .github/workflows/test.yml
  • CMakeLists.txt
  • toolchain/mfc/cli/commands.py
  • toolchain/mfc/test/case.py, cases.py, coverage.py, test.py
  • toolchain/mfc/test/test_coverage_cache.json.gz (binary)
  • toolchain/mfc/test/test_coverage_unit.py
  • 5 benchmark case.py files

Summary

  • Introduces file-level gcov coverage cache (555 tests → .fpp files) and --only-changes flag to prune CI test runs on PRs
  • New rebuild-cache CI job on Phoenix/SLURM runs a 3-phase parallel cache builder (prepare → run → gcov collect)
  • Dep-change detection greps PR/push diffs for use/include additions to detect stale coverage cache
  • 53 unit tests cover core coverage logic with full offline mocking
  • Removes continue-on-error: true from the github matrix job; downstream jobs now depend on rebuild-cache via always() guard

Findings

[Must fix before merge] TEMP changes need revert (toolchain/mfc/test/cases.py, src/simulation/m_bubbles.fpp)
The PR description explicitly flags two TEMP changes that must be reverted before merge:

"duplicate use in m_bubbles.fpp + remove CMakeLists.txt from ALWAYS_RUN_ALL to test the full cache rebuild pipeline in CI"

Neither the m_bubbles.fpp change nor the CMakeLists.txt removal from ALWAYS_RUN_ALL appears in the diff listing, suggesting they may already be reverted or were only in a prior iteration. Confirm both are cleaned up before merging. cases.py has a +1/-1 change (excluded test list tweak) that also warrants a second look.


[Security / CI] rebuild-cache runs PR head code on self-hosted runner with contents: write (.github/workflows/test.yml, lines ~305–358)

    permissions:
      contents: write   # Required for Commit Cache to Master on push events
    steps:
      - name: Clone
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}

The job checks out github.event.pull_request.head.sha (untrusted fork code for external PRs) and executes it on the phoenix self-hosted runner. The contents: write permission is declared at the job level, so it's held even when triggered by pull requests. Because GITHUB_TOKEN permissions for pull_request from forks are automatically downgraded by GitHub, the write permission itself is inert for fork PRs — but untrusted code still runs on the self-hosted runner.

If this repo ever receives PRs from external contributors (not just org members), this is a supply-chain risk. Consider scoping contents: write to only the push/dispatch path, or adding a environment: restricted gate that requires maintainer approval before running the cache rebuild for fork PRs.


[Behavior change] continue-on-error removal from github job (.github/workflows/test.yml)

Previously the github matrix job (Ubuntu + macOS × compiler variants) had continue-on-error: true, which prevented a single flaky matrix entry from marking the overall workflow as failed. This is now removed. fail-fast: false is still set so all matrix entries still run, but the github job will now be marked as failed if any single matrix entry fails. If this job is a required status check for PR merging, intermittent macOS flakiness could start blocking merges. Intentional or worth a comment?


[Minor] phase2_jobs cap hardcoded at 16 (toolchain/mfc/test/coverage.py, line ~1264)

phase2_jobs = min(n_jobs, 16)

The 16-worker limit is reasonable for OOM protection but is not configurable. On nodes with large RAM (Phoenix GNR nodes have significant memory), the cap may be unnecessarily conservative. A comment explaining the memory estimate (e.g. "each MPI+gcov test uses ~2–5 GB") is present and helpful — consider making it an environment variable or CLI parameter for future tuning without a code change.


[Minor] Thread-safety note on console redirect in _prepare_test (toolchain/mfc/test/coverage.py, lines ~1216–1226)

orig_file = cons.raw.file
cons.raw.file = io.StringIO()
try:
    ...
finally:
    cons.raw.file = orig_file

The comment says "NOTE: not thread-safe — Phase 1 must remain single-threaded." This is correct and documented. Just flagging it as a fragility: if Phase 1 is ever parallelized (e.g., to speed up the 555-case prepare loop), this will produce garbled/lost console output. A thread-local or lock-based approach would be more robust long-term.


[Nit] Double space in YAML runner group (.github/workflows/test.yml, line ~321)

    runs-on:
      group:  phoenix   # two spaces before "phoenix"

Cosmetic, but inconsistent with the rest of the file.

@github-actions
Copy link

github-actions bot commented Mar 6, 2026

Claude Code Review

Head SHA: a010a9a34d2afecb397ba0e432d6e203d28a6b59

Files changed: 28

Changed files
  • .github/file-filter.yml
  • .github/scripts/retry-build.sh
  • .github/scripts/run-tests-with-retry.sh
  • .github/workflows/bench.yml
  • .github/workflows/frontier/bench.sh, build.sh, submit.sh, test.sh
  • .github/workflows/phoenix/bench.sh, rebuild-cache.sh (new), submit.sh, test.sh
  • .github/workflows/test.yml
  • CMakeLists.txt
  • benchmarks/*/case.py (5 files)
  • toolchain/mfc/build.py, cli/commands.py
  • toolchain/mfc/test/case.py, cases.py, coverage.py (new), test.py, test_coverage_cache.json.gz (new), test_coverage_unit.py (new)

Summary

  • Adds gcov-based file-level coverage cache (11 KB gzip JSON, 555 test UUIDs) to prune CI test runs to only those touching changed .fpp files
  • New --only-changes CLI flag intersects PR-changed files against cached coverage map; conservative fallbacks (missing cache → run all, ALWAYS_RUN_ALL files → full suite)
  • New rebuild-cache CI job on Phoenix runs the full instrumented suite via SLURM and commits the updated cache to master on push events
  • Removes continue-on-error: true from the github matrix job (fixes auto-cancellation suppression); caps parallel jobs at 64 to avoid MPI daemon exhaustion on GNR nodes
  • 53 unit tests cover the core coverage filter/parse logic in test_coverage_unit.py

Findings

1. TEMP changes must be reverted before merge (acknowledged in PR description)

The PR description explicitly lists two temporary items needed only to exercise the cache rebuild pipeline in CI:

  • Duplicate use statement in src/simulation/m_bubbles.fpp
  • Removal of CMakeLists.txt from ALWAYS_RUN_ALL

These must not be merged as-is. Consider using a separate test branch or a dummy file change that does not touch production Fortran source.


2. git diff "$BEFORE".."$AFTER" on push events has no safe fallback (test.yml, ~line 321)

The PR fallback is correctly dep_changed=true when gh pr diff fails on pull_request events. But for push events:

DIFF=$(git diff "$BEFORE".."$AFTER" 2>/dev/null || echo "")

\.event.before is the all-zeros SHA on the first push to a new branch or after a force push to master. git diff 0000000..HEAD fails silently and DIFF becomes empty, so dep_changed=false is emitted. If cases.py also did not change, rebuild-cache is skipped on master push — cache could remain stale after a force-push that restructures Fortran imports. Suggested fix: mirror the PR fallback:

DIFF=$(git diff "$BEFORE".."$AFTER" 2>/dev/null) || {
  echo "dep_changed=true" >> "$GITHUB_OUTPUT"; exit 0
}

3. rebuild-cache job checks out PR HEAD on a self-hosted runner (test.yml, rebuild-cache steps)

ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}

This runs code from the PR author's fork on the Phoenix self-hosted runner, which has contents: write permission. For pull_request events (not pull_request_target) from forks, GitHub restricts write-scope secrets automatically, so the push-to-master step is safely blocked. However, any collaborator with write access to MFlowCode/MFC who opens a PR could execute arbitrary code on Phoenix hardware. This is consistent with the existing self job pattern but is worth noting. Consider constraining this job to push/workflow_dispatch only (build the cache only post-merge, never from PR HEAD), or requiring a maintainer label before running.


4. github and self jobs now block on rebuild-cache (up to 4 hours) on any PR touching cases.py or Fortran imports (test.yml)

rebuild-cache has timeout-minutes: 240. When triggered, every downstream test job in the matrix must wait for it. For a PR that tweaks a test parameter in cases.py, this means up to 4 hours of additional CI delay before any tests run. Consider whether a shorter timeout or a pre-existing stale-but-valid cache is acceptable (i.e., only rebuild after merge, not on every PR).


5. Missing != 'failure' guard for rebuild-cache in downstream job conditions (test.yml, ~lines 395-455)

The conditions for github and self check:

needs.rebuild-cache.result != 'cancelled'

This correctly handles the skipped case ('skipped' != 'cancelled' → jobs run using the committed cache). However, it also allows the jobs to proceed when rebuild-cache has result == 'failure'. In that case, --only-changes is still passed and the stale committed cache is used. This is the intended conservative fallback, but it means a cache-rebuild failure is silent from the test job perspective — the jobs run but may run fewer tests than expected. A warning log step or an explicit comment to that effect would aid debugging.


Improvement Opportunities

  • GCOV_PREFIX_STRIP computation (coverage.py, _compute_gcov_prefix_strip): The comment says -1 excludes root '/' but the GCC manual counts the leading slash as one of the N stripped components. Validate on both Linux absolute paths and any unusual runner mount points (e.g., symlinked /home → /storage), as os.path.realpath is correctly used to resolve symlinks first.
  • export GITHUB_EVENT_NAME="$GITHUB_EVENT_NAME" in frontier/submit.sh and phoenix/submit.sh heredocs: the variable is already exported in the environment; the self-referential export is a no-op (though harmless). A comment clarifying that this propagates the variable into the SLURM job's inline script environment would improve readability.
  • Benchmark step-count reduction (5 benchmark case.py files, 7→2 multiplier): this is a 71% reduction in default benchmark steps. Confirm that the shortened runs still produce representative throughput numbers and that CI timing comparisons remain valid.

@github-actions
Copy link

github-actions bot commented Mar 6, 2026

Claude Code Review

Head SHA: baa49bf4b8cca8b9d9102e4d547d5fc2d5be9f3c
Files changed: 30 | +1903 / -91

Files (30)
.github/file-filter.yml
.github/scripts/retry-build.sh
.github/scripts/run-tests-with-retry.sh
.github/scripts/run_monitored_slurm_job.sh  (new)
.github/workflows/bench.yml
.github/workflows/frontier/bench.sh
.github/workflows/frontier/build.sh
.github/workflows/frontier/submit.sh
.github/workflows/frontier/test.sh
.github/workflows/phoenix/bench.sh
.github/workflows/phoenix/rebuild-cache.sh  (new)
.github/workflows/phoenix/submit.sh
.github/workflows/phoenix/test.sh
.github/workflows/test.yml
.gitignore
CMakeLists.txt
benchmarks/*/case.py  (5 files, t_step_stop multiplier 7x -> 2x)
toolchain/mfc/build.py
toolchain/mfc/cli/commands.py
toolchain/mfc/test/case.py
toolchain/mfc/test/cases.py
toolchain/mfc/test/coverage.py  (new, 795 lines)
toolchain/mfc/test/test.py
toolchain/mfc/test/test_coverage_cache.json.gz  (binary, committed)
toolchain/mfc/test/test_coverage_unit.py  (new, 664 lines)

Summary

  • Introduces gcov-based file-level test pruning: build once with --gcov, map each of ~555 tests to the .fpp files it executes, cache that mapping, then intersect against PR diffs to skip unaffected tests.
  • Adds a rebuild-cache CI job (Phoenix SLURM, 240-min timeout) that regenerates the cache whenever cases.py or Fortran dependency graph (use/include diffs) changes.
  • Adds run_monitored_slurm_job.sh for resilient SLURM monitoring (handles runner SIGKILL before job completion).
  • Switches Phoenix CPU-test nodes to cpu-gnr (exclusive, 192-core Granite Rapids) and caps parallelism at 64 to avoid overwhelming OpenMPI daemons.
  • 53 unit tests for core coverage logic (fully offline, no build required).

Findings

Must fix before merge

1. TEMP changes are committed — must be reverted (acknowledged in PR checklist)

The PR body explicitly marks two items as TEMP:

  • toolchain/mfc/test/cases.py"1D_qbmm" added to casesToSkip to reduce cache-rebuild load in CI.
  • coverage.pyCMakeLists.txt removed from ALWAYS_RUN_ALL to trigger the dep-changed path during CI testing.

Both alter production behavior and must be reverted before merge.

2. Committed binary artifact built under TEMP conditions

toolchain/mfc/test/test_coverage_cache.json.gz was generated while 1D_qbmm was excluded and CMakeLists.txt was absent from ALWAYS_RUN_ALL. The cases_hash stored in its _meta will immediately mismatch the post-revert cases.py, causing load_coverage_cache to return None on every PR and silently fall back to the full test suite until a fresh rebuild-cache job runs. Consider committing a clean cache (built after reverting the TEMP changes) or documenting the expected cold-start window.


Notable issues

3. n_test_threads arithmetic inconsistency — phoenix/test.sh

# current (inconsistent references):
n_test_threads=$(( SLURM_CPUS_ON_NODE > 64 ? 64 : ${SLURM_CPUS_ON_NODE:-8} ))

If SLURM_CPUS_ON_NODE is unset the condition evaluates 0 > 64 = false and the fallback :-8 fires correctly, so behavior is right. But the two references to the variable are inconsistent. Cleaner:

n_test_threads=$(( ${SLURM_CPUS_ON_NODE:-8} > 64 ? 64 : ${SLURM_CPUS_ON_NODE:-8} ))

4. rebuild-cache.sh hardcodes -j 8 for the build — phoenix/rebuild-cache.sh line 19

retry_build ./mfc.sh build --gcov -j 8

On an exclusive 192-core GNR node the build phase uses only 8 of 192 cores. Using $NJOBS (already capped at 64) would cut build time significantly with no correctness risk for a CPU-only gfortran build.

5. Global state mutation to suppress console output in _prepare_testcoverage.py

orig_file = cons.raw.file
cons.raw.file = io.StringIO()
try:
    ...
finally:
    cons.raw.file = orig_file

This mutates a module-level singleton. Phase 1 is intentionally single-threaded so there is no race today, but redirecting cons.raw.file is undocumented Rich behavior. If cons.raw is ever replaced with a different Rich Console object the attribute may not exist. Consider rich.console.Console(quiet=True) as a context-local sink instead.

6. GITHUB_EVENT_NAME export in SLURM submit heredoc — phoenix/submit.sh line 80, frontier/submit.sh line 82

export GITHUB_EVENT_NAME="$GITHUB_EVENT_NAME"

Exporting the variable so SLURM propagates it into the batch job is correct (SLURM strips most CI env vars). If the variable is unset on a manual submit, an empty string is exported and test.sh conservatively runs the full suite — the right fallback. A brief inline comment explaining why the explicit re-export is needed would help future readers.

7. Post-revert CMakeLists.txt / ALWAYS_RUN_ALL alignment

After the TEMP revert, CMakeLists.txt presumably returns to ALWAYS_RUN_ALL. But the unit test test_cmakelists_does_not_trigger_all asserts the opposite. Either (a) revert the test to expect True, or (b) document the deliberate decision to exclude root CMakeLists.txt (covered by toolchain/cmake/ prefix) and keep the test as-is.


Minor observations

  • Phase 2 parallelism cap of 16 (coverage.py, phase2_jobs = min(n_jobs, 16)): with ~555 tests × 3 executables at ~10× gcov slowdown, validate empirically that the 240-minute SLURM timeout is sufficient after the TEMP revert restores the full test set.
  • Unit test import mechanism (importlib.util + exec fallback with string-substituted imports): functional but fragile if import lines in coverage.py are reformatted. A conftest.py that adds the repo root to sys.path would make the imports straightforward.
  • Benchmark t_step_stop reduction (7× → 2×): logical change, but unrelated to the coverage feature. Worth a separate commit/note if benchmark timing is being tracked historically.

@github-actions
Copy link

github-actions bot commented Mar 6, 2026

Claude Code Review

Head SHA: a84cabc
Files changed: 30

Changed files
  • .github/file-filter.yml
  • .github/scripts/retry-build.sh
  • .github/scripts/run-tests-with-retry.sh
  • .github/scripts/run_monitored_slurm_job.sh (new)
  • .github/scripts/submit_and_monitor_bench.sh
  • .github/workflows/bench.yml
  • .github/workflows/frontier/{bench,build,submit,test}.sh
  • .github/workflows/phoenix/{bench,rebuild-cache,submit,test}.sh
  • .github/workflows/test.yml
  • CMakeLists.txt, .gitignore
  • benchmarks/*/case.py (5 files)
  • toolchain/mfc/build.py, toolchain/mfc/cli/commands.py
  • toolchain/mfc/test/{case,cases,coverage,test}.py + unit tests + cache gz

Summary

  • Introduces a 3-phase gcov-based coverage cache (prepare → run → collect) over 555 tests to enable PR-level test pruning via --only-changes.
  • Adds a rebuild-cache CI job on Phoenix that runs on cases.py or Fortran dep-graph changes; uploads artifact for PRs, commits to master on push.
  • Adds run_monitored_slurm_job.sh wrapping SLURM monitoring with sacct-based SIGKILL recovery.
  • Replaces build cache with clean rm -rf build on Phoenix/Frontier; adds GNR-partition SBATCH settings.
  • Ships 53-unit-test suite for the coverage module; includes 11 KB pre-built cache artifact.

Findings

🔴 Must fix before merge

1. TEMP changes not reverted — test plan checkbox unchecked

The PR body explicitly flags two TEMP changes that must be reverted before merge, and marks the step unchecked:

[ ] Revert TEMP changes before merge

Both are still present:

  • CMakeLists.txt missing from ALWAYS_RUN_ALL — The list in toolchain/mfc/test/coverage.py (added lines ~947–960) does not include CMakeLists.txt or toolchain/cmake/ entries. The PR body says this was removed temporarily to trigger the cache rebuild pipeline in CI. Without it, a PR that only changes CMakeLists.txt will skip all tests despite the build system changing.

    Suggested fix: add "CMakeLists.txt" to ALWAYS_RUN_ALL (or add "toolchain/cmake/" to ALWAYS_RUN_ALL_PREFIXES):

    ALWAYS_RUN_ALL = frozenset([
        ...
        "CMakeLists.txt",
    ])
  • "1D_qbmm" added to casesToSkip in cases.py:1074 — This silently suppresses the 1D_qbmm example test. If this was the duplicate-use TEMP change, it must be reverted; if it's a legitimate bug workaround it should be documented separately.


🟡 Design concerns

2. toolchain/mfc/test/test.py not in ALWAYS_RUN_ALL
toolchain/mfc/test/coverage.py (lines ~947–960) includes case.py, cases.py, coverage.py in ALWAYS_RUN_ALL, but not test.py. A PR that modifies the main test runner (test.py) while changing no .fpp files would skip all tests. Low-probability but worth closing.

3. Benchmark step counts reduced 3–3.5× without explanation
All 5 benchmark cases (5eq_rk3_weno3_hllc, hypo_hll, ibm, igr, viscous_weno5_sgb_acoustic) change the t_step_stop/t_step_save multiplier from 6–7 → 2. This is not mentioned in the PR summary. Fewer steps means less wall-clock time but weaker benchmark signal and potentially less statistical reliability. Is this intentional and permanent, or another temporary change?

4. rebuild-cache.sh: build parallelism hardcoded to -j 8
phoenix/rebuild-cache.sh line ~272 uses retry_build ./mfc.sh build --gcov -j 8 while NJOBS is computed from SLURM_CPUS_ON_NODE for tests. On GNR nodes with 192 cores, the build step uses only 8 threads. Consider using NJOBS (already capped at 64) or at least a higher fixed value to avoid a slow build step.


🟢 Minor / informational

5. GITHUB_EVENT_NAME propagation to SLURM jobs
Both phoenix/submit.sh and frontier/submit.sh add export GITHUB_EVENT_NAME="$GITHUB_EVENT_NAME" so that test.sh/rebuild-cache.sh can conditionally set prune_flag. If the variable is unset in the outer shell, it will be exported as empty, and prune_flag will remain "" (safe). The pattern works but relies on submit.sh being the entry point — direct SLURM invocations without it would silently fall back to full-suite (also safe).

6. retry-build.sh shellcheck disable: intentional
The change from eval "$clean_cmd" to $clean_cmd with a shellcheck disable=SC2086 comment is correct — word-splitting on RETRY_CLEAN_CMD is intended. The comment explains this clearly.


Before merging

  • Revert TEMP: add CMakeLists.txt back to ALWAYS_RUN_ALL (or an appropriate prefix)
  • Revert TEMP: remove "1D_qbmm" from casesToSkip (or document separately if intentional)
  • Clarify/confirm benchmark step-count reduction is intentional
  • Consider adding toolchain/mfc/test/test.py to ALWAYS_RUN_ALL

@github-actions
Copy link

github-actions bot commented Mar 7, 2026

Claude Code Review

Head SHA: a7cfc5c8b0cf07a434f9f9c548d5f26564965513

Files changed: 30

.github/file-filter.yml
.github/scripts/retry-build.sh
.github/scripts/run-tests-with-retry.sh
.github/scripts/run_monitored_slurm_job.sh
.github/scripts/submit_and_monitor_bench.sh
.github/workflows/bench.yml
.github/workflows/frontier/{bench,build,submit,test}.sh
.github/workflows/phoenix/{bench,rebuild-cache,submit,test}.sh
.github/workflows/test.yml
.gitignore / CMakeLists.txt
benchmarks/*/case.py (5 files, t_step_stop 7→2)
toolchain/mfc/{build,cli/commands,test/case,test/cases,test/test}.py
toolchain/mfc/test/coverage.py (new, 795 lines)
toolchain/mfc/test/test_coverage_unit.py (new, 664 lines)
toolchain/mfc/test/test_coverage_cache.json.gz (new binary)

Summary

  • Introduces file-level gcov coverage-based test pruning: builds MFC once with --gcov, runs all 555 tests, records which .fpp files each test touches, stores the mapping as a gzip JSON cache (11 KB), and uses it to skip tests on PRs whose changed files don't overlap with any test's covered set.
  • New rebuild-cache CI job runs on Phoenix SLURM when cases.py or Fortran dependency graph (USE/include) changes; commits the updated cache to master on push events.
  • Conservative fallbacks: missing/stale cache → full suite; ALWAYS_RUN_ALL set covers GPU macro files and toolchain scripts; test not in cache → include.
  • New run_monitored_slurm_job.sh wraps SLURM monitor with a sacct recovery path when the GitHub runner sends SIGKILL mid-wait.
  • 53 unit tests cover core pruning logic offline (no build/git/gcov required).

Findings

1. TEMP changes must be reverted before merge (blocker)
The PR description explicitly flags two TEMP changes and the checklist item is unchecked. Specifically:

  • A duplicate use statement injected in m_bubbles.fpp (not visible in this diff but noted in the PR body) — would trigger dependency-change detection to force a cache rebuild in CI.
  • CMakeLists.txt removed from ALWAYS_RUN_ALL (visible implicitly — CMakeLists.txt is not currently in the ALWAYS_RUN_ALL frozenset in coverage.py). CMake changes can alter which files compile and thus invalidate coverage; this belongs in ALWAYS_RUN_ALL.

Neither revert has been done. Do not merge until both are addressed.

2. CMakeLists.txt absent from ALWAYS_RUN_ALL (, lines ~48–60)
Changes to CMakeLists.txt (e.g., adding a new compile definition, changing FYPP_GCOV_OPTS) alter the build and can shift coverage attribution. CMakeLists.txt should be added to ALWAYS_RUN_ALL (or ALWAYS_RUN_ALL_PREFIXES for the whole root). The PR body says this was intentionally removed as a TEMP change to test the pipeline, confirming it needs to be restored.

3. Direct push to master from CI with contents: write (, lines ~462–473)
The rebuild-cache job commits and pushes directly to refs/heads/master using github-actions[bot]. If branch protection requires PRs for master, this push will be blocked silently (the if git diff --cached --quiet path exits 0 cleanly, masking the failure). Recommend testing this path explicitly and ensuring the bot has the bypass permission — or using a dedicated commit bot rather than the generic Actions bot. The [skip ci] tag in the commit message is correct to avoid infinite rebuild loops.

4. cases.py: "1D_qbmm" added to casesToSkip (, line ~1074)
The PR description groups this with the TEMP changes but the code has no # TEMP comment. If this is intentional (the example case is too expensive for coverage runs), it should be permanent and documented. If it's a TEMP workaround, it needs reversion and a comment.

5. Phase 2 test parallelism cap at 16 may be fragile (, build_coverage_cache, line ~450)
phase2_jobs = min(n_jobs, 16) is hardcoded. The comment explains OOM concerns ("~2–5 GB each under gcov"), but the Phoenix GNR nodes may have 192 cores/1.5 TB RAM. Consider making this configurable (e.g., a --phase2-jobs flag or environment variable) rather than hardcoding 16, to avoid under-utilization on memory-rich nodes.

6. No staleness check on the committed cache's gcov version (, load_coverage_cache)
The staleness check compares cases.py SHA256 against the cache's stored hash — but not the gcov version. If the system gcov is upgraded between cache builds, the coverage data format or attribution could change silently. Minor, but worth noting.

7. retry-build.sh: eval → direct variable expansion (, line ~11)
Good fix — direct variable expansion is safer than eval. However, callers must pass $clean_cmd as a single command word with no shell metacharacters. The existing callers (RETRY_CLEAN_CMD="./mfc.sh clean") are safe, but future callers should be aware of this constraint (a comment would help).


Minor / Non-blocking

  • _compute_gcov_prefix_strip: The len(parts) - 1 formula for GCOV_PREFIX_STRIP looks correct for POSIX paths (root / is one part but not a component to strip). Worth a brief inline comment explaining the -1.
  • GITHUB_EVENT_NAME export in SLURM job scripts: Passed to the SLURM batch environment so test.sh/bench.sh can detect PR vs push. Clean design — just ensure the SLURM epilog doesn't strip custom env vars.
  • Benchmark t_step_stop 7→2: The halved step counts reduce benchmark runtime significantly. Ensure this doesn't affect benchmark statistical validity.
  • test_coverage_cache.json.gz committed to repo: Small (11 KB), gzip-compressed, auto-updated by CI. Acceptable trade-off for portability, but the repo history will accumulate binary diffs on every cache rebuild.

@sbryngelson sbryngelson marked this pull request as draft March 10, 2026 04:25
@github-actions
Copy link

Claude Code Review

Head SHA: 1032dbc10a87721e7334e473d2f948a344d07778
Files changed: 17

Changed files
  • .github/file-filter.yml
  • .github/scripts/run-tests-with-retry.sh
  • .github/workflows/bench.yml
  • .github/workflows/frontier/submit.sh
  • .github/workflows/frontier/test.sh
  • .github/workflows/phoenix/rebuild-cache.sh
  • .github/workflows/phoenix/test.sh
  • .github/workflows/test.yml
  • .gitignore
  • CMakeLists.txt
  • toolchain/mfc/cli/commands.py
  • toolchain/mfc/test/case.py
  • toolchain/mfc/test/cases.py
  • toolchain/mfc/test/coverage.py (new, 804 lines)
  • toolchain/mfc/test/test.py
  • toolchain/mfc/test/test_coverage_cache.json.gz (new binary)
  • toolchain/mfc/test/test_coverage_unit.py (new, 662 lines)

Summary:

  • Adds a 3-phase gcov coverage cache builder (prepare → run → collect) that maps 555 test UUIDs to exercised .fpp source files
  • New --only-changes flag prunes tests by intersecting PR-changed files against the cache; conservative fallbacks for missing cache, missing simulation coverage, and infrastructure files
  • New rebuild-cache CI job on Phoenix SLURM; downstream github and self jobs are gated on it via always() + skipped-check pattern
  • 53 unit tests cover core coverage logic with offline mocking

Findings

[CRITICAL] Untrusted fork PR code executes on self-hosted runner with elevated permissions

File: .github/workflows/test.yml, rebuild-cache job, lines ~213–217

The rebuild-cache job runs on pull_request events and checks out the PR head SHA:

ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}

It then runs bash .github/workflows/phoenix/rebuild-cache.sh (which calls ./mfc.sh) — entirely from the forked repository's code. A malicious PR that changes cases.py or adds a use statement can trigger this job and execute arbitrary code on the self-hosted Phoenix runner, which has permissions: contents: write and access to GITHUB_TOKEN. The github.repository == 'MFlowCode/MFC' guard only prevents runs from forks' own CI, not fork PRs targeting the main repo.

Recommended fix: Switch this job to use pull_request_target and check out the base branch's workflow scripts separately from the PR code, or disable fork-triggering by gating on github.event.pull_request.head.repo.full_name == 'MFlowCode/MFC'.


[HIGH] TEMP changes must be reverted before merge

File: toolchain/mfc/test/cases.py, line ~1543; toolchain/mfc/test/coverage.py, line ~595

The PR body explicitly lists two TEMP changes:

  1. A duplicate use in m_bubbles.fpp (not visible in the diff — may be a separate commit)
  2. CMakeLists.txt removed from ALWAYS_RUN_ALL to test the rebuild pipeline

In coverage.py, ALWAYS_RUN_ALL does not contain CMakeLists.txt, and there is no top-level CMakeLists.txt in ALWAYS_RUN_ALL_PREFIXES (toolchain/cmake/ is covered but CMakeLists.txt at root is not). CMake flag changes (e.g., -O1 vs -O3, new FYPP_GCOV_OPTS) can silently break coverage without triggering a full suite run.

In cases.py line 1543, "1D_qbmm" is added to casesToSkip — also a TEMP item per the PR body.

The PR checklist says "Revert TEMP changes before merge" but they are not yet reverted.


[HIGH] Benchmark timeout halved without justification

File: .github/workflows/bench.yml, line 34

timeout-minutes changed from 480 → 240 with no explanation in PR body or comments. If benchmark jobs regularly approach 4 hours, this change will cause spurious CI failures.


[MEDIUM] _prepare_test permanently mutates case params

File: toolchain/mfc/test/coverage.py, lines 945–969

_prepare_test calls case.params.update(get_post_process_mods(...)), sets case.params['t_step_stop'] = 1, and clamps t_stop for adaptive-dt cases. These mutations are permanent on the case object. If the caller iterates cases after build_coverage_cache returns, or if test infra holds references to cases, they will observe mutated params. The function should operate on a copy:

import copy
params = copy.deepcopy(case.params)
params.update(get_post_process_mods(params))
params['t_step_stop'] = 1
# … use params, not case.params

[MEDIUM] rebuild-cache blocks downstream tests even when skipped

File: .github/workflows/test.yml, lines 248 and 302

github and self jobs now declare needs: [lint-gate, file-changes, rebuild-cache]. When rebuild-cache is skipped (result == 'skipped'), the condition needs.rebuild-cache.result != 'cancelled' passes correctly. However, if rebuild-cache is running (queued on Phoenix for up to 240 min), all downstream jobs must wait — introducing significant latency for PRs that touch Fortran files but don't need the cache rebuilt (e.g., editing src/simulation/ without adding use statements skips rebuild-cache, so downstream jobs start immediately, but editing with an added use statement triggers rebuild and blocks all tests for up to 4 hours). Consider making downstream jobs start in parallel with rebuild-cache and only download the cache artifact if available (continue-on-error: true on the artifact download step already handles this), removing the blocking dependency.


[LOW] Stale export in Frontier submit.sh

File: .github/workflows/frontier/submit.sh, line 46

export GITHUB_EVENT_NAME="$GITHUB_EVENT_NAME" re-exports a variable that is already in the environment. This is benign but redundant. The intent is presumably to ensure it propagates into the SLURM job; confirm it is actually needed by checking whether Frontier's SLURM propagates GITHUB_* variables by default.


[LOW] Double blank line in coverage.py

File: toolchain/mfc/test/coverage.py, line 705

Two consecutive blank lines appear between find_gcno_files and _parse_gcov_json_output. Precheck/flake8 may flag this (E303).


Positive notes

  • Conservative fallbacks throughout (missing cache → run all, missing sim coverage → include, ALWAYS_RUN_ALL for infra files) make the pruning safe.
  • The 3-phase design (prepare single-threaded, run parallel, collect parallel) correctly avoids Python import overhead in the hot path.
  • _parse_gcov_json_output correctly handles concatenated JSON objects and gzip vs. raw format differences across gcov versions.
  • Unit tests are thorough and fully offline (no build/git/gcov required).
  • The [skip ci] tag on the auto-commit message correctly prevents infinite CI loops.

@sbryngelson sbryngelson force-pushed the gcov-test-pruning-v3 branch from 1032dbc to 2210805 Compare March 11, 2026 20:45
@github-actions
Copy link

Claude Code Review

Head SHA: 221080513b69b2544885fa29282c57495e0f9266

Files changed: 17

  • .github/file-filter.yml
  • .github/scripts/run-tests-with-retry.sh
  • .github/workflows/bench.yml
  • .github/workflows/frontier/submit.sh
  • .github/workflows/frontier/test.sh
  • .github/workflows/phoenix/rebuild-cache.sh (new)
  • .github/workflows/phoenix/test.sh
  • .github/workflows/test.yml
  • .gitignore
  • CMakeLists.txt
  • toolchain/mfc/cli/commands.py
  • toolchain/mfc/test/case.py
  • toolchain/mfc/test/cases.py
  • toolchain/mfc/test/coverage.py (new, 811 lines)
  • toolchain/mfc/test/test.py
  • toolchain/mfc/test/test_coverage_cache.json.gz (new binary)
  • toolchain/mfc/test/test_coverage_unit.py (new, 662 lines)

Summary

  • Adds a gcov-based test-pruning system: a committed .json.gz cache maps 555 test UUIDs to the .fpp source files each test exercises; --only-changes intersects PR-changed files against the cache to run only affected tests.
  • A new rebuild-cache CI job on Phoenix re-generates the cache when cases.py or the Fortran dependency graph changes, committing the result back to master.
  • The github and self CI jobs now depend on rebuild-cache, download the artifact, and pass --only-changes on PR runs.
  • 53 unit tests are added for the core pruning logic.

Findings

1. TEMP changes must be reverted before merge (critical, self-noted)

The PR description explicitly flags these as temporary. They will make the cache stale or skip tests incorrectly if merged as-is:

toolchain/mfc/test/cases.py"1D_qbmm" added to casesToSkip to avoid a crash during the CI cache-build pipeline test. Must be removed.

toolchain/mfc/test/coverage.py, ALWAYS_RUN_ALLCMakeLists.txt (root-level) is absent from both ALWAYS_RUN_ALL and ALWAYS_RUN_ALL_PREFIXES. The PR notes it was temporarily removed to force the cache rebuild to run in CI. It must be re-added so that root CMakeLists.txt changes always trigger the full suite.

2. rebuild-cache job checks out PR HEAD on a self-hosted runner with contents: write (test.yml, lines 212–242)

- name: Clone
  uses: actions/checkout@v4
  with:
    ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
permissions:
  contents: write   # Required for Commit Cache to Master on push events

On pull_request events the job checks out the fork/PR author's code on a Phoenix self-hosted runner and runs it (./mfc.sh clean, ./mfc.sh build --gcov, ./mfc.sh test --build-coverage-cache). The github.repository == 'MFlowCode/MFC' guard prevents this for external forks, but PRs from branches within the same org still run untrusted code with contents: write permission. Consider:

  • Moving the contents: write permission to a separate, isolated commit-only job that is never triggered by PR events, or
  • Pinning the checkout to github.base_ref (trusted) when event_name == 'pull_request' and only running gcov on the base, accepting a slight staleness.

3. export GITHUB_EVENT_NAME="$GITHUB_EVENT_NAME" is a no-op (frontier/submit.sh, line 89)

export GITHUB_EVENT_NAME="$GITHUB_EVENT_NAME"

$GITHUB_EVENT_NAME is already set and exported by GitHub Actions. Re-exporting to itself is a no-op in the current shell. More importantly, SLURM jobs launched via sbatch do not automatically inherit the calling environment unless --export=ALL (or an explicit --export=GITHUB_EVENT_NAME=...) is passed in the sbatch invocation. If the intent is to propagate this variable into the SLURM job so frontier/test.sh can read it, the variable needs to be passed explicitly via sbatch --export or written to a file. Verify that frontier/test.sh actually receives the correct value at runtime.

4. Non-.fpp changes silently skip all tests (coverage.py, test.py)

When --only-changes is active and the changed files contain no .fpp files (e.g., a Python-only PR, a root-level CMakeLists.txt change, or a YAML-only change), all tests are skipped:

if not changed_fpp:
    skipped_cases += cases
    cases = []

This means a broken change to toolchain/mfc/run/input.py (which generates Fortran namelist input) would run zero tests if the PR touches no .fpp files. toolchain/mfc/run/input.py is already in ALWAYS_RUN_ALL, so it's handled — but CMakeLists.txt (root) and toolchain/mfc/case_validator.py (which is in ALWAYS_RUN_ALL but is a .py file that won't show up in changed_fpp) are correctly covered by the should_run_all_tests check. Double-check that all .py toolchain files that can silently break test correctness are in ALWAYS_RUN_ALL.

5. bench.yml timeout halved to 240 min without explanation (line 34)

-    timeout-minutes: 480
+    timeout-minutes: 240

This change is unrelated to test pruning and has no explanation. If any benchmark case legitimately requires >4 hours, this will cause silent failures. Please document the rationale (e.g., "no benchmark has ever needed more than 3h") or revert to avoid accidental bench-job cancellations.

6. _prepare_test mutates case.params despite the "work on a copy" comment (coverage.py, line 977)

# Work on a copy so we don't permanently mutate the case object.
params = dict(case.params)
# ... modifications to params ...
case.params = params   # <-- permanently replaces the attribute

The copy is taken for params, but the final assignment replaces case.params with the modified copy. The comment is misleading. In the current flow this is harmless (cases are only used for delete_output() afterward), but it will silently break any future code that re-uses the case objects after Phase 1. Consider either reassigning at the call site without touching case.params, or explicitly document that _prepare_test is destructive.

7. git push origin HEAD:refs/heads/master without pull (rebuild-cache step, test.yml lines 237–241)

git commit -m "Regenerate gcov coverage cache [skip ci]"
git push origin HEAD:refs/heads/master

If two push events land close together (or a workflow_dispatch overlaps with a push), both jobs can race to push to master. The second push will fail with a non-fast-forward error. Consider pulling before committing or using a dedicated branch + PR, or adding a retry / pull-rebase before the push.


Minor / Improvement Opportunities

  • test_coverage_unit.py dual import strategy (importlib + exec fallback, lines 1559–1630): The fallback exec path does fragile string-replacement on source code. If the module structure or import spelling ever changes, the fallback silently produces a half-loaded module. Prefer making the module importable via PYTHONPATH manipulation in the test setup rather than patching source text.

  • Phase 2 parallelism cap is hard-coded (coverage.py, line 1038): phase2_jobs = min(n_jobs, 16) is a fixed cap with a comment about OOM. The Phoenix GNR nodes have 192 cores, so the cap is reasonable, but consider making it a named constant (e.g., _MAX_PHASE2_WORKERS = 16) with the memory-per-process justification inline.

  • --only-changes skips example tests unconditionally (test.py, lines 1393–1398): Even if an example-specific file changed (e.g., examples/2D_ibm_stl_MFCCharacter/case.py), example tests are skipped. This is documented as an intentional optimization but may hide regressions in example cases during PRs that touch example files.

- File-level gcov coverage cache maps test UUIDs to exercised .fpp source
  files (gzip JSON, committed to repo)
- --only-changes flag prunes tests by intersecting PR-changed files against
  coverage cache; conservative fallbacks for missing cache/coverage
- --build-coverage-cache flag + 3-phase parallel cache builder
  (prepare, run, gcov collect)
- New rebuild-cache CI job on Phoenix via SLURM when cases.py or Fortran
  dependency graph changes
- Dep-change detection greps PR/push diffs for added use/include statements
- 53 unit tests cover core coverage logic
- Rebased onto PR MFlowCode#1299 unified CI architecture (submit-slurm-job.sh,
  common/test.sh)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sbryngelson sbryngelson force-pushed the gcov-test-pruning-v3 branch from 2210805 to c7a7725 Compare March 11, 2026 21:06
@github-actions
Copy link

Claude Code Review

Head SHA: c7a7725
Files changed: 14

Changed files
  • .github/file-filter.yml
  • .github/scripts/submit-slurm-job.sh
  • .github/workflows/common/rebuild-cache.sh (new)
  • .github/workflows/common/test.sh
  • .github/workflows/test.yml
  • .gitignore
  • CMakeLists.txt
  • toolchain/mfc/cli/commands.py
  • toolchain/mfc/test/case.py
  • toolchain/mfc/test/cases.py
  • toolchain/mfc/test/coverage.py (new, 811 lines)
  • toolchain/mfc/test/test.py
  • toolchain/mfc/test/test_coverage_cache.json.gz (binary, committed)
  • toolchain/mfc/test/test_coverage_unit.py (new, 662 lines)

Summary

  • Adds a gcov-based file-level coverage cache (gzip JSON, committed) mapping 555 test UUIDs to the .fpp source files they exercise
  • New --build-coverage-cache flag runs all 555 tests with gcov instrumentation via 3-phase parallel pipeline (prepare, run, gcov collect)
  • New --only-changes flag prunes tests by intersecting PR-changed .fpp files against the cache; falls back to full suite when the cache is missing/stale or infrastructure files change
  • New rebuild-cache CI job on Phoenix triggers when cases.py or the Fortran dependency graph changes, committing the updated cache to master on pushes
  • Removes continue-on-error from the GitHub test job to restore auto-cancellation on failure; 53 unit tests cover core coverage logic

Findings

1. TEMP changes must be reverted before merge (blocker)
The PR description and test plan both explicitly flag two TEMP changes that are not yet reverted:

  • toolchain/mfc/test/cases.py: "1D_qbmm" added to the casesToSkip list — this permanently excludes a real test from the example-based test suite.
  • A duplicate use statement in m_bubbles.fpp (presumably in the fork code, not visible in the diff here) used to force dep_changed=true in CI.
  • CMakeLists.txt is missing from ALWAYS_RUN_ALL in coverage.py (only toolchain/cmake/ prefix is listed, not the root CMakeLists.txt). If this was intentionally removed as the TEMP change, it must be restored — changes to the root CMakeLists.txt (e.g. new compile flags, LTO settings) can alter which code paths are compiled in and should trigger a full cache rebuild.

2. _prepare_test mutates case.params despite comment claiming otherwise
toolchain/mfc/test/coverage.py, line 915:

# Work on a copy so we don't permanently mutate the case object.
params = dict(case.params)
# ...
case.params = params   # ← BUG: this mutates the case object

The shallow copy on line 884 is made correctly, but line 915 reassigns case.params to the modified copy, permanently replacing the original params on the case object (e.g. t_step_stop is now 1 for every case after Phase 1). Since cleanup at Phase 3 (line 1124) calls case.delete_output() which only uses the UUID, this is benign here — but it's a correctness hazard if cases is ever reused after build_coverage_cache returns. Consider not reassigning case.params at all, and instead passing params directly to get_inp() via a temporary copy of the case.

3. rebuild-cache job always in needs — cost on every non-triggering PR
test.yml, lines 194 and 240: Both github and self jobs declare rebuild-cache in their needs list unconditionally. When rebuild-cache is skipped (the common case for PRs that don't touch cases.py or Fortran use/include statements), GitHub must still evaluate the DAG edge and the needs.rebuild-cache.result != 'cancelled' guard. This works correctly (skipped ≠ cancelled), but the coupling means every test run now waits for the file-changes job to finish and the rebuild-cache skip to propagate before github and self can start. The net latency impact is small, but worth noting.

4. Dep-change detection on push may be unreliable for force-pushes or initial branch pushes
test.yml, lines 118–125: For push events, github.event.before is the all-zeros SHA (0000000000000000000000000000000000000000) on the very first push to a new branch, making git diff "$BEFORE".."$AFTER" fail. This is safely handled by the || { ... echo "dep_changed=true" } fallback. However, force-pushes rewrite history and also produce a non-ancestor BEFORE, so git diff may succeed but produce a diff that spans many unrelated commits. The conservative fallback of defaulting to true on failure is correct; consider adding a comment noting this edge case.

5. Minor: double blank line in coverage.py (style)
toolchain/mfc/test/coverage.py, around line 643: There is a stray blank line between find_gcno_files and _parse_gcov_json_output (two blank lines where the style elsewhere uses one). Not a correctness issue, but will fail ./mfc.sh precheck if pylint or fprettify catches it.

- Phase 3: omit failed tests from cache instead of storing [] (which
  permanently skips them); absent entries are conservatively included
- _prepare_test: save/restore case.params in finally block so callers
  can safely reuse the case list after build_coverage_cache returns
- _parse_gcov_json_output: return None on decode failure so caller
  omits the test from cache rather than storing empty coverage
- Missing build dir: log warning and return None instead of []
- Lagrange bubble input failure: re-raise so test is skipped from cache
- CI: replace always() with !cancelled() to respect workflow cancellation
- CI: grep -P -> grep -E for POSIX portability (PCRE not universal)
- Narrow except Exception to (OSError, SubprocessError) in find_gcov_binary

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link

Claude Code Review

Head SHA: a7afc06

Files changed: 14

  • .github/file-filter.yml
  • .github/scripts/submit-slurm-job.sh
  • .github/workflows/common/rebuild-cache.sh (new)
  • .github/workflows/common/test.sh
  • .github/workflows/test.yml
  • .gitignore
  • CMakeLists.txt
  • toolchain/mfc/cli/commands.py
  • toolchain/mfc/test/case.py
  • toolchain/mfc/test/cases.py
  • toolchain/mfc/test/coverage.py (new, 834 lines)
  • toolchain/mfc/test/test.py
  • toolchain/mfc/test/test_coverage_cache.json.gz (new binary)
  • toolchain/mfc/test/test_coverage_unit.py (new, 662 lines)

Summary:

  • Adds gcov-based test pruning: a file-level coverage cache maps 555 test UUIDs to exercised .fpp source files
  • --only-changes prunes tests by intersecting PR-changed files against the cache; falls back to full suite if cache is absent or stale
  • 3-phase parallel cache builder (prepare → run → gcov collect) runnable via ./mfc.sh test --build-coverage-cache
  • New rebuild-cache CI job fires on Phoenix/SLURM when cases.py or Fortran dependency graphs change; auto-commits the updated cache to master on push events
  • 53 unit tests cover the core coverage logic (offline, no gcov or build required)

Findings

1. TEMP changes must be reverted before merge (blocker)

The PR body explicitly calls these out, but they're still present:

toolchain/mfc/test/cases.py (line 1074): "1D_qbmm" was added to casesToSkip. This silently skips a regression test permanently if merged as-is.

toolchain/mfc/test/coverage.py (ALWAYS_RUN_ALL, line ~39): CMakeLists.txt is absent from ALWAYS_RUN_ALL. The unit test at test_coverage_unit.py:1702 asserts should_run_all_tests({"CMakeLists.txt"}) is False, confirming the omission is intentional for testing but unsafe for production — changes to the root CMakeLists.txt (e.g., new compile flags, gcov toggle) would not trigger a full test run.

Both must be reverted before merge.

2. rebuild-cache failure does not block test jobs — confirm this is intentional

test.yml, github and self jobs:

needs.rebuild-cache.result != 'cancelled'

This condition is satisfied for 'success', 'skipped', and 'failure'. If the cache build job fails (e.g., SLURM job errors, gcov crashes), test jobs will proceed without an updated cache and fall back to the full suite — which is the desired graceful degradation. However, a persistent rebuild-cache failure would go unnoticed because test jobs still pass. Consider adding an explicit annotation or step summary when tests fall back to full suite due to a stale/missing cache.

3. Direct push to master from CI without pulling first

test.yml, Commit Cache to Master step (around line 145):

git add toolchain/mfc/test/test_coverage_cache.json.gz
git commit -m "Regenerate gcov coverage cache [skip ci]"
git push origin HEAD:refs/heads/master

If another commit has landed on master since the rebuild-cache job started (which takes up to ~4 hours), the push will fail with a non-fast-forward error. The step has no retry or pull-rebase before pushing, so intermittent failures are expected on active branches. Suggest:

git pull --rebase origin master
git push origin HEAD:refs/heads/master

4. New .fpp file with no existing test coverage skips all tests

toolchain/mfc/test/coverage.py, filter_tests_by_coverage: if a PR adds a brand-new .fpp source file that is not yet covered by any test (e.g., a new m_feature.fpp in src/simulation/), the changed set contains the new file but the cache has no entry for it, so all tests are skipped (no test overlaps with the new file). The unit test test_new_fpp_file_no_coverage_skips confirms this.

This is partially mitigated — a new module added via use m_feature in another file would be caught by the dep-change grep and trigger a cache rebuild before any test run. But if the new file is only #:included (not used), the grep misses it and tests are incorrectly pruned. Worth documenting.

5. Minor: redundant export in submit-slurm-job.sh

.github/scripts/submit-slurm-job.sh, new line:

export GITHUB_EVENT_NAME="$GITHUB_EVENT_NAME"

GITHUB_EVENT_NAME is already an exported environment variable in GitHub Actions runners. The export is a no-op here. The assignment is needed if the variable isn't already in the job environment that sources this script, but export VAR="$VAR" is redundant. Use export GITHUB_EVENT_NAME (no assignment) or just leave the existing value in the environment. Not a bug, just noise.


Improvement opportunities (no issues)

  • Cache staleness on gcov_version change: The cache metadata stores gcov_version but load_coverage_cache only checks cases_hash for staleness. A gcov upgrade that changes JSON format or line attribution could produce inconsistent cache entries without triggering a stale detection. Consider adding a gcov_version staleness check.
  • phase2_jobs = min(n_jobs, 16) cap comment says "~2-5 GB each under gcov" but this depends heavily on the test. On Phoenix nodes with SLURM_CPUS_ON_NODE=32, the cap of 16 concurrent tests × 3 executables × possible MPI ranks could still be significant. The cap is reasonable but the rationale in code comments could be more precise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants