
CLI for measuring execute_cuda encoding perf #6381

Merged
a10y merged 14 commits into develop from aduffy/gpu-scan-measure on Feb 17, 2026
Conversation

@a10y (Contributor) commented on Feb 9, 2026

Overview of changes

ergonomics/API focused changes

  • Introduced a new LaunchStrategy on the execution context. By default it launches kernels without tracking any timing information, but it is pluggable: in benchmarks, for example, we replace it with a TimedLaunchedStrategy that executes the kernels in blocking mode and logs their execution time.
  • Centralized the entrypoint for launching all kernels. They are now forced to be dispatched off of the execution context using the ctx.launch_kernel() method, which accepts a closure used to populate the kernel arguments (see the sketch below).

A lot of test and benchmark code needed to be updated to use the new launch methods.
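
To make the split concrete, here is a minimal sketch of the two changes together, using stand-in types. Only the LaunchStrategy trait name and the closure-based ctx.launch_kernel() entrypoint come from this PR; Kernel, KernelArgs, and every signature below are assumptions for illustration.

use std::fmt::Debug;
use std::sync::Arc;
use std::time::Instant;

#[derive(Debug)]
pub struct Kernel { pub name: &'static str }

#[derive(Debug, Default)]
pub struct KernelArgs { pub words: Vec<u64> }

pub trait LaunchStrategy: Debug + Send + Sync + 'static {
    fn launch(&self, kernel: &Kernel, args: &KernelArgs);
}

// Default strategy: fire-and-forget, no timing overhead.
#[derive(Debug)]
pub struct DefaultLaunch;
impl LaunchStrategy for DefaultLaunch {
    fn launch(&self, _kernel: &Kernel, _args: &KernelArgs) {
        // enqueue the kernel asynchronously on the stream
    }
}

// Benchmark variant: launch in blocking mode and log elapsed time.
#[derive(Debug)]
pub struct TimedLaunch;
impl LaunchStrategy for TimedLaunch {
    fn launch(&self, kernel: &Kernel, _args: &KernelArgs) {
        let start = Instant::now();
        // ... a synchronous launch plus device sync would happen here ...
        println!("{} took {:?}", kernel.name, start.elapsed());
    }
}

pub struct ExecutionCtx { strategy: Arc<dyn LaunchStrategy> }

impl ExecutionCtx {
    // The single entrypoint for all launches; the closure populates the args.
    pub fn launch_kernel(&self, kernel: &Kernel, fill: impl FnOnce(&mut KernelArgs)) {
        let mut args = KernelArgs::default();
        fill(&mut args);
        self.strategy.launch(kernel, &args);
    }
}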

Fused FOR + BP

This has been shelved for a follow-up (FLUP) since this PR was already too big.

  • I've updated the BP kernel generator to generate BP as FFOR, i.e. bit-packing fused with FOR. In practice this is just an added const T reference parameter: by default the BitPackedArray execution passes zero, but there is a specialization in the ForArray execution tree where, if it detects that one of its descendants is BP, it fuses its own reference into the bit unpacking (sketched below).
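
For illustration, here is a minimal CPU-side sketch of the shelved fusion with stand-in types. BitPacked, Array, unpack_with_reference, and execute are hypothetical, not Vortex APIs (the real kernels are generated CUDA), but the shape matches the description: the unpack path takes a reference that defaults to zero, and a FOR parent with a BP child folds its reference in.

struct BitPacked {
    packed: Vec<u32>,
    bit_width: u32, // 1..=32
    len: usize,
}

// The generated BP kernel now takes a `reference` that is added to every
// decoded value; passing 0 recovers plain bit-unpacking.
fn unpack_with_reference(bp: &BitPacked, reference: u64) -> Vec<u64> {
    (0..bp.len)
        .map(|i| {
            let bit_off = i as u64 * bp.bit_width as u64;
            let (word, shift) = ((bit_off / 32) as usize, (bit_off % 32) as u32);
            // An element spans at most two 32-bit words (shift + width <= 64).
            let lo = bp.packed[word] as u64;
            let hi = *bp.packed.get(word + 1).unwrap_or(&0) as u64;
            let mask = (1u64 << bp.bit_width) - 1;
            (((lo | (hi << 32)) >> shift) & mask) + reference
        })
        .collect()
}

enum Array {
    BitPacked(BitPacked),
    For { reference: u64, child: Box<Array> },
}

fn execute(a: &Array) -> Vec<u64> {
    match a {
        // Plain BP execution passes a reference of zero.
        Array::BitPacked(bp) => unpack_with_reference(bp, 0),
        Array::For { reference, child } => match child.as_ref() {
            // Fusion: a BP child absorbs the FOR reference into the unpack,
            // skipping the separate add-reference pass entirely.
            Array::BitPacked(bp) => unpack_with_reference(bp, *reference),
            other => execute(other).into_iter().map(|v| v + reference).collect(),
        },
    }
}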

GPU tracing tool

There's a new binary, vortex-test-e2e-cuda-scan, which takes a Vortex file as input.

It will recompress the file using only GPU-supported encodings, scan it back, and collect timings for how long each column scan took. The results are printed either as pretty text or as JSON on stdout, which can be piped into duckdb or similar for analysis.

Example usage:

FLAT_LAYOUT_INLINE_ARRAY_NODE=true RUST_LOG=vortex_cuda=trace,info cargo run --release --bin vortex-test-e2e-cuda-scan -- ./vortex-bench/data/tpch/1.0/vortex-file-compressed/lineitem_0.vortex
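
Assuming the tool has been told to emit JSON on stdout (how the output format is selected isn't shown in this excerpt, so treat the first command as schematic), the results can be fed straight into duckdb:

cargo run --release --bin vortex-test-e2e-cuda-scan -- lineitem_0.vortex \
    | duckdb -c "SELECT * FROM read_json_auto('/dev/stdin') ORDER BY 1"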

@a10y force-pushed the aduffy/gpu-scan-measure branch from 0249a54 to a4c923c on February 9, 2026 22:51
@a10y marked this pull request as ready for review on February 9, 2026 22:51
@a10y added the changelog/chore (A trivial change) and changelog/skip (Do not list PR in the changelog) labels and removed the changelog/chore label on Feb 9, 2026
@a10y requested a review from joseph-isaacs on February 9, 2026 22:51
@a10y force-pushed the aduffy/gpu-scan-measure branch 5 times, most recently from e52bb67 to 21537a6 on February 10, 2026 20:08
@a10y force-pushed the aduffy/gpu-scan-measure branch 3 times, most recently from 052da59 to 249c24c on February 12, 2026 14:33
let mut total_time = Duration::ZERO;
let mut cuda_ctx = CudaSession::create_execution_ctx(&VortexSession::empty())
    .vortex_expect("failed to create execution context")
    .with_launch_strategy(Arc::new(timed));
@a10y (author):
See here: instead of replicating the full launch setup in benchmark code, we can just stub in a launcher that collects timing information across runs.

/// Implementations can add tracing, async callbacks, or other behavior
/// around kernel launches.
pub trait LaunchStrategy: Debug + Send + Sync + 'static {
@a10y (author):
This is where LaunchStrategy is defined and implemented.

@a10y force-pushed the aduffy/gpu-scan-measure branch from 780efdb to 7b61bd6 on February 12, 2026 20:05
@a10y force-pushed the aduffy/gpu-scan-measure branch 4 times, most recently from f02a1b2 to b283260 on February 15, 2026 18:58
@0ax1 (Contributor) left a comment:

Couple of questions.

@a10y force-pushed the aduffy/gpu-scan-measure branch from 3a7d146 to 7ac5cda on February 16, 2026 18:39
a10y added 2 commits on February 16, 2026 13:40 (one titled "fixup")
a10y added 8 commits on February 16, 2026 13:40
@a10y force-pushed the aduffy/gpu-scan-measure branch from 7ac5cda to a6a3917 on February 16, 2026 18:41
a10y added 2 commits on February 16, 2026 14:06
@a10y enabled auto-merge (squash) on February 16, 2026 21:16
a10y and others added 2 commits on February 16, 2026 16:23
tx.send(())
    // A send should never fail. Panic otherwise.
    .expect("CUDA callback receiver dropped unexpectedly");
// NOTE: send can fail if the CudaEvent is dropped by the caller, in which case the receiver
@0ax1 (Contributor):
Why did we change this? I think this should never fail, as the send can only error if the channel does not have sufficient capacity, which should never be the case.

@a10y (author):
I was getting panics during testing because the last receiver had been dropped. I forget which operation this was for, but making this not panic fixed it.
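
A minimal illustration of the failure mode, using std::sync::mpsc as a stand-in for whatever channel the callback path actually uses (an assumption): send() errors when the receiving side has been dropped, not on capacity.

use std::sync::mpsc;

fn main() {
    let (tx, rx) = mpsc::channel::<()>();
    // e.g. the caller dropped the CudaEvent that owned the receiver
    drop(rx);
    // Old behavior: panics, because send() errors once the receiver is gone.
    // tx.send(()).expect("CUDA callback receiver dropped unexpectedly");
    // New behavior: ignore the error; nobody is listening anymore.
    let _ = tx.send(());
}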

@codspeed-hq (bot) commented on Feb 17, 2026

Merging this PR will degrade performance by 45.76%

❌ 16 regressed benchmarks
✅ 1041 untouched benchmarks
⏩ 1346 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|------|-----------|------|------|------------|
| Simulation | bench_dict_mask[(0.01, 0.9)] | 1.2 ms | 2.2 ms | -44.22% |
| Simulation | bench_dict_mask[(0.01, 0.01)] | 1.2 ms | 1.7 ms | -29.6% |
| Simulation | bench_dict_mask[(0.01, 0.1)] | 1.2 ms | 1.8 ms | -31.49% |
| Simulation | bench_dict_mask[(0.01, 0.5)] | 1.2 ms | 2 ms | -38.64% |
| Simulation | bench_dict_mask[(0.1, 0.01)] | 1.2 ms | 1.7 ms | -29.73% |
| Simulation | bench_dict_mask[(0.1, 0.5)] | 1.2 ms | 2 ms | -38.65% |
| Simulation | bench_dict_mask[(0.5, 0.9)] | 1.2 ms | 2.2 ms | -44.21% |
| Simulation | bench_dict_mask[(0.1, 0.9)] | 1.2 ms | 2.2 ms | -44.13% |
| Simulation | bench_dict_mask[(0.9, 0.01)] | 1.2 ms | 1.7 ms | -29.57% |
| Simulation | bench_dict_mask[(0.5, 0.01)] | 1.2 ms | 1.7 ms | -29.67% |
| Simulation | bench_dict_mask[(0.1, 0.1)] | 1.2 ms | 1.8 ms | -31.55% |
| Simulation | bench_dict_mask[(0.9, 0.1)] | 1.2 ms | 1.8 ms | -31.48% |
| Simulation | bench_dict_mask[(0.5, 0.5)] | 1.2 ms | 2 ms | -38.63% |
| Simulation | bench_dict_mask[(0.5, 0.1)] | 1.2 ms | 1.8 ms | -31.51% |
| Simulation | bench_dict_mask[(0.9, 0.9)] | 1.3 ms | 2.3 ms | -45.76% |
| Simulation | bench_dict_mask[(0.9, 0.5)] | 1.2 ms | 2.1 ms | -41.79% |

Comparing aduffy/gpu-scan-measure (3884af8) with develop (1585f08)


Footnotes

  1. 1346 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them on CodSpeed to remove them from the performance reports.

@a10y merged commit dadaa93 into develop on Feb 17, 2026
87 of 89 checks passed
@a10y deleted the aduffy/gpu-scan-measure branch on February 17, 2026 13:44