
CLI for measuring execute_cuda encoding perf #6381

Merged
a10y merged 14 commits into develop from aduffy/gpu-scan-measure on Feb 17, 2026
Conversation

@a10y (Contributor) commented on Feb 9, 2026

Overview of changes

ergonomics/API focused changes

  • Introduced a new LaunchStrategy on the execution context. By default it launches kernels without tracking any timing information, but it is pluggable: in benchmarks, for example, we replace it with a TimedLaunchedStrategy that executes the kernels in blocking mode and logs their execution time.
  • Centralized the entrypoint for launching all kernels. They are now forced to be dispatched off of the execution context using the ctx.launch_kernel() method, which accepts a closure used to populate the kernel arguments (see the sketch below).

A lot of test and benchmark code needed to be updated to use the new launch methods.
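
To make the split concrete, here is a minimal sketch of the two changes together, using stand-in types. Only the LaunchStrategy trait name and the closure-based ctx.launch_kernel() entrypoint come from this PR; Kernel, KernelArgs, and every signature below are assumptions for illustration.

use std::fmt::Debug;
use std::sync::Arc;
use std::time::Instant;

#[derive(Debug)]
pub struct Kernel { pub name: &'static str }

#[derive(Debug, Default)]
pub struct KernelArgs { pub words: Vec<u64> }

pub trait LaunchStrategy: Debug + Send + Sync + 'static {
    fn launch(&self, kernel: &Kernel, args: &KernelArgs);
}

// Default strategy: fire-and-forget, no timing overhead.
#[derive(Debug)]
pub struct DefaultLaunch;
impl LaunchStrategy for DefaultLaunch {
    fn launch(&self, _kernel: &Kernel, _args: &KernelArgs) {
        // enqueue the kernel asynchronously on the stream
    }
}

// Benchmark variant: launch in blocking mode and log elapsed time.
#[derive(Debug)]
pub struct TimedLaunch;
impl LaunchStrategy for TimedLaunch {
    fn launch(&self, kernel: &Kernel, _args: &KernelArgs) {
        let start = Instant::now();
        // ... a synchronous launch plus device sync would happen here ...
        println!("{} took {:?}", kernel.name, start.elapsed());
    }
}

pub struct ExecutionCtx { strategy: Arc<dyn LaunchStrategy> }

impl ExecutionCtx {
    // The single entrypoint for all launches; the closure populates the args.
    pub fn launch_kernel(&self, kernel: &Kernel, fill: impl FnOnce(&mut KernelArgs)) {
        let mut args = KernelArgs::default();
        fill(&mut args);
        self.strategy.launch(kernel, &args);
    }
}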

Fused FOR + BP

This has been shelved for a follow-up (FLUP) since this PR was already too big.

  • I've updated the BP kernel generator to generate BP as FFOR, i.e. bit-packing fused with FOR. In practice this is just an added const T reference parameter: by default the BitPackedArray execution passes zero, but there is a specialization in the ForArray execution tree where, if it detects that one of its descendants is BP, it fuses its own reference into the bit unpacking (sketched below).
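
For illustration, here is a minimal CPU-side sketch of the shelved fusion with stand-in types. BitPacked, Array, unpack_with_reference, and execute are hypothetical, not Vortex APIs (the real kernels are generated CUDA), but the shape matches the description: the unpack path takes a reference that defaults to zero, and a FOR parent with a BP child folds its reference in.

struct BitPacked {
    packed: Vec<u32>,
    bit_width: u32, // 1..=32
    len: usize,
}

// The generated BP kernel now takes a `reference` that is added to every
// decoded value; passing 0 recovers plain bit-unpacking.
fn unpack_with_reference(bp: &BitPacked, reference: u64) -> Vec<u64> {
    (0..bp.len)
        .map(|i| {
            let bit_off = i as u64 * bp.bit_width as u64;
            let (word, shift) = ((bit_off / 32) as usize, (bit_off % 32) as u32);
            // An element spans at most two 32-bit words (shift + width <= 64).
            let lo = bp.packed[word] as u64;
            let hi = *bp.packed.get(word + 1).unwrap_or(&0) as u64;
            let mask = (1u64 << bp.bit_width) - 1;
            (((lo | (hi << 32)) >> shift) & mask) + reference
        })
        .collect()
}

enum Array {
    BitPacked(BitPacked),
    For { reference: u64, child: Box<Array> },
}

fn execute(a: &Array) -> Vec<u64> {
    match a {
        // Plain BP execution passes a reference of zero.
        Array::BitPacked(bp) => unpack_with_reference(bp, 0),
        Array::For { reference, child } => match child.as_ref() {
            // Fusion: a BP child absorbs the FOR reference into the unpack,
            // skipping the separate add-reference pass entirely.
            Array::BitPacked(bp) => unpack_with_reference(bp, *reference),
            other => execute(other).into_iter().map(|v| v + reference).collect(),
        },
    }
}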

GPU tracing tool

There's a new binary, vortex-test-e2e-cuda-scan, which takes a Vortex file as input.

It will recompress the file using only GPU-supported encodings, scan it back, and collect timings for how long each column scan took. The results are printed either as pretty text or as JSON on stdout, which can be piped into duckdb or similar for analysis.

Example usage:

FLAT_LAYOUT_INLINE_ARRAY_NODE=true RUST_LOG=vortex_cuda=trace,info cargo run --release --bin vortex-test-e2e-cuda-scan -- ./vortex-bench/data/tpch/1.0/vortex-file-compressed/lineitem_0.vortex
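
Assuming the tool has been told to emit JSON on stdout (how the output format is selected isn't shown in this excerpt, so treat the first command as schematic), the results can be fed straight into duckdb:

cargo run --release --bin vortex-test-e2e-cuda-scan -- lineitem_0.vortex \
    | duckdb -c "SELECT * FROM read_json_auto('/dev/stdin') ORDER BY 1"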

@a10y force-pushed the aduffy/gpu-scan-measure branch from 0249a54 to a4c923c on February 9, 2026 22:51
@a10y marked this pull request as ready for review on February 9, 2026 22:51
@a10y added the changelog/chore (A trivial change) and changelog/skip (Do not list PR in the changelog) labels and removed the changelog/chore label on Feb 9, 2026
@a10y requested a review from joseph-isaacs on February 9, 2026 22:51
@a10y force-pushed the aduffy/gpu-scan-measure branch 5 times, most recently from e52bb67 to 21537a6 on February 10, 2026 20:08
@a10y force-pushed the aduffy/gpu-scan-measure branch 3 times, most recently from 052da59 to 249c24c on February 12, 2026 14:33
let mut total_time = Duration::ZERO;
let mut cuda_ctx = CudaSession::create_execution_ctx(&VortexSession::empty())
    .vortex_expect("failed to create execution context")
    .with_launch_strategy(Arc::new(timed));
@a10y (author):
See here: instead of replicating the full launch setup in benchmark code, we can just stub in a launcher that collects timing information across runs.

/// Implementations can add tracing, async callbacks, or other behavior
/// around kernel launches.
pub trait LaunchStrategy: Debug + Send + Sync + 'static {
@a10y (author):
This is where LaunchStrategy is defined and implemented.

@a10y force-pushed the aduffy/gpu-scan-measure branch from 780efdb to 7b61bd6 on February 12, 2026 20:05
@a10y force-pushed the aduffy/gpu-scan-measure branch 4 times, most recently from f02a1b2 to b283260 on February 15, 2026 18:58
@0ax1 (Contributor) left a comment:

Couple of questions.

@a10y force-pushed the aduffy/gpu-scan-measure branch from 3a7d146 to 7ac5cda on February 16, 2026 18:39
a10y added 2 commits on February 16, 2026 13:40 (one titled "fixup")
a10y added 8 commits on February 16, 2026 13:40
@a10y force-pushed the aduffy/gpu-scan-measure branch from 7ac5cda to a6a3917 on February 16, 2026 18:41
a10y added 2 commits on February 16, 2026 14:06
@a10y enabled auto-merge (squash) on February 16, 2026 21:16
a10y and others added 2 commits on February 16, 2026 16:23
tx.send(())
    // A send should never fail. Panic otherwise.
    .expect("CUDA callback receiver dropped unexpectedly");
// NOTE: send can fail if the CudaEvent is dropped by the caller, in which case the receiver
@0ax1 (Contributor):
Why did we change this? I think this should never fail, as the send can only error if the channel does not have sufficient capacity, which should never be the case.

@a10y (author):
I was getting panics during testing because the last receiver had been dropped. I forget which operation this was for, but making this not panic fixed it.
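
A minimal illustration of the failure mode, using std::sync::mpsc as a stand-in for whatever channel the callback path actually uses (an assumption): send() errors when the receiving side has been dropped, not on capacity.

use std::sync::mpsc;

fn main() {
    let (tx, rx) = mpsc::channel::<()>();
    // e.g. the caller dropped the CudaEvent that owned the receiver
    drop(rx);
    // Old behavior: panics, because send() errors once the receiver is gone.
    // tx.send(()).expect("CUDA callback receiver dropped unexpectedly");
    // New behavior: ignore the error; nobody is listening anymore.
    let _ = tx.send(());
}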

@codspeed-hq (bot) commented on Feb 17, 2026

Merging this PR will degrade performance by 45.76%

❌ 16 regressed benchmarks
✅ 1041 untouched benchmarks
⏩ 1346 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|------|-----------|------|------|------------|
| Simulation | bench_dict_mask[(0.01, 0.9)] | 1.2 ms | 2.2 ms | -44.22% |
| Simulation | bench_dict_mask[(0.01, 0.01)] | 1.2 ms | 1.7 ms | -29.6% |
| Simulation | bench_dict_mask[(0.01, 0.1)] | 1.2 ms | 1.8 ms | -31.49% |
| Simulation | bench_dict_mask[(0.01, 0.5)] | 1.2 ms | 2 ms | -38.64% |
| Simulation | bench_dict_mask[(0.1, 0.01)] | 1.2 ms | 1.7 ms | -29.73% |
| Simulation | bench_dict_mask[(0.1, 0.5)] | 1.2 ms | 2 ms | -38.65% |
| Simulation | bench_dict_mask[(0.5, 0.9)] | 1.2 ms | 2.2 ms | -44.21% |
| Simulation | bench_dict_mask[(0.1, 0.9)] | 1.2 ms | 2.2 ms | -44.13% |
| Simulation | bench_dict_mask[(0.9, 0.01)] | 1.2 ms | 1.7 ms | -29.57% |
| Simulation | bench_dict_mask[(0.5, 0.01)] | 1.2 ms | 1.7 ms | -29.67% |
| Simulation | bench_dict_mask[(0.1, 0.1)] | 1.2 ms | 1.8 ms | -31.55% |
| Simulation | bench_dict_mask[(0.9, 0.1)] | 1.2 ms | 1.8 ms | -31.48% |
| Simulation | bench_dict_mask[(0.5, 0.5)] | 1.2 ms | 2 ms | -38.63% |
| Simulation | bench_dict_mask[(0.5, 0.1)] | 1.2 ms | 1.8 ms | -31.51% |
| Simulation | bench_dict_mask[(0.9, 0.9)] | 1.3 ms | 2.3 ms | -45.76% |
| Simulation | bench_dict_mask[(0.9, 0.5)] | 1.2 ms | 2.1 ms | -41.79% |

Comparing aduffy/gpu-scan-measure (3884af8) with develop (1585f08)


Footnotes

  1. 1346 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them on CodSpeed to remove them from the performance reports.

@a10y merged commit dadaa93 into develop on Feb 17, 2026
87 of 89 checks passed
@a10y deleted the aduffy/gpu-scan-measure branch on February 17, 2026 13:44