CLI for measuring execute_cuda encoding perf #6381
let mut total_time = Duration::ZERO;
let mut cuda_ctx = CudaSession::create_execution_ctx(&VortexSession::empty())
    .vortex_expect("failed to create execution context")
    .with_launch_strategy(Arc::new(timed));
See here: instead of replicating the full launch setup in benchmark code, we can just stub in a launcher that collects timing information across runs.
}};
/// Implementations can add tracing, async callbacks, or other behavior
/// around kernel launches.
pub trait LaunchStrategy: Debug + Send + Sync + 'static {
This is where `LaunchStrategy` is defined and implemented.
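To make the pluggable-launcher idea concrete, here is a minimal, CPU-only sketch. Only the trait name and its bounds come from the diff above; the `launch` method, `PlainLaunch`, `TimedLaunch`, and their helpers are hypothetical stand-ins, and a Rust closure stands in for a real CUDA kernel launch.

```rust
use std::fmt::Debug;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical shape of the trait: the method signature is an assumption.
pub trait LaunchStrategy: Debug + Send + Sync + 'static {
    fn launch(&self, kernel: &mut dyn FnMut());
}

/// Default behaviour: run the launch and record nothing.
#[derive(Debug, Default)]
pub struct PlainLaunch;

impl LaunchStrategy for PlainLaunch {
    fn launch(&self, kernel: &mut dyn FnMut()) {
        kernel();
    }
}

/// Benchmark behaviour: run the launch to completion and record its wall time.
#[derive(Debug, Default)]
pub struct TimedLaunch {
    timings: Mutex<Vec<Duration>>,
}

impl TimedLaunch {
    /// Sum of all recorded launch durations.
    pub fn total(&self) -> Duration {
        self.timings.lock().unwrap().iter().sum()
    }

    /// Number of launches recorded so far.
    pub fn launches(&self) -> usize {
        self.timings.lock().unwrap().len()
    }
}

impl LaunchStrategy for TimedLaunch {
    fn launch(&self, kernel: &mut dyn FnMut()) {
        let start = Instant::now();
        kernel();
        self.timings.lock().unwrap().push(start.elapsed());
    }
}

fn main() {
    // Keep a typed handle for reading timings; hand the trait object to the "context".
    let timed = Arc::new(TimedLaunch::default());
    let strategy: Arc<dyn LaunchStrategy> = timed.clone();

    let mut sum = 0u64;
    for i in 0..3 {
        strategy.launch(&mut || sum += i);
    }
    println!("{} launches, total {:?}, sum {}", timed.launches(), timed.total(), sum);
}
```

Keeping one `Arc` to the concrete type and another as the trait object is what lets benchmark code read the collected timings after the run without the execution context knowing anything about timing.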
Signed-off-by: Andrew Duffy <andrew@a10y.dev> fixup Signed-off-by: Andrew Duffy <andrew@a10y.dev>
tx.send(())
    // A send should never fail. Panic otherwise.
    .expect("CUDA callback receiver dropped unexpectedly");
// NOTE: send can fail if the CudaEvent is dropped by the caller, in which case the receiver
Why did we change this? This should never fail I think, as the send can only error if the channel does not have sufficient capacity which should never be the case.
I was getting panics during testing because the last receiver had been dropped. I forgot which operation this was for, but making this not panic fixed it.
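The failure mode described here can be reproduced in isolation. The sketch below assumes `std::sync::mpsc`, which may not be the channel type the PR actually uses; `notify_done` is a hypothetical helper. For an unbounded `std::sync::mpsc` channel, `send` returns `Err` only when the receiver has been dropped, so a completion callback can treat that as "nobody is waiting" rather than panicking.

```rust
use std::sync::mpsc;

/// Signal completion. Returns false if the receiver has already been dropped.
fn notify_done(tx: &mpsc::Sender<()>) -> bool {
    tx.send(()).is_ok()
}

fn main() {
    let (tx, rx) = mpsc::channel::<()>();

    // Receiver alive: the send succeeds and the message can be received.
    assert!(notify_done(&tx));
    rx.recv().unwrap();

    // Receiver dropped (e.g. the caller discarded its event handle):
    // the send reports an error instead of panicking.
    drop(rx);
    assert!(!notify_done(&tx));
}
```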
Merging this PR will degrade performance by 45.76%
Overview of changes
ergonomics/API focused changes
* `LaunchStrategy` on the execution context. By default this launches kernels without tracking any timing information, but it is pluggable. For example, in benchmarks we replace it with a `TimedLaunchedStrategy`, which executes the kernels in blocking mode and logs their execution time.
* `ctx.launch_kernel()` method, which accepts a closure that is used to populate kernel arguments.

A lot of test and benchmark code needed to be updated to use the new launch methods.
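As a rough illustration of the closure-based `ctx.launch_kernel()` shape, here is a self-contained sketch. Everything in it is an assumption made for illustration: `Ctx`, `KernelArgs`, the `arg` builder method, the kernel name, and the signature of `launch_kernel` are hypothetical, and the "launch" only records its arguments instead of running anything on a GPU.

```rust
/// Hypothetical argument list; a real one would hold device pointers and scalars.
#[derive(Debug, Default)]
pub struct KernelArgs {
    values: Vec<u64>,
}

impl KernelArgs {
    /// Append one argument and return the builder for chaining.
    pub fn arg(&mut self, value: u64) -> &mut Self {
        self.values.push(value);
        self
    }
}

/// Hypothetical context that records launches rather than executing them.
#[derive(Debug, Default)]
pub struct Ctx {
    launched: Vec<(String, Vec<u64>)>,
}

impl Ctx {
    /// The caller supplies a closure that fills in the kernel arguments;
    /// the context owns everything else about the launch.
    pub fn launch_kernel(&mut self, name: &str, fill: impl FnOnce(&mut KernelArgs)) {
        let mut args = KernelArgs::default();
        fill(&mut args);
        self.launched.push((name.to_string(), args.values));
    }
}

fn main() {
    let mut ctx = Ctx::default();
    let len = 1024u64;

    // Call sites only describe the arguments, not the launch mechanics.
    ctx.launch_kernel("example_kernel", |args| {
        args.arg(0xdead_beef).arg(len);
    });

    println!("{:?}", ctx.launched);
}
```

The point of this shape is that launch configuration, timing, and stream handling live in one place, so swapping the launch behaviour does not require touching each call site.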
Fused FOR + BP
This has been shelved for a follow-up (FLUP) since it was too big.
* I've updated the BP kernel generator to generate BP as FFOR, i.e. fused bitpacking with FOR. In practice, this just adds a `const T reference` param. By default the execution for `BitPackedArray` passes zero, but there is a specialization in the `ForArray` execution tree: if it detects that one of its descendants is BP, it fuses itself with the bit unpacking.

GPU tracing tool
There's a new binary in `vortex-test-e2e-cuda-scan` which takes a Vortex file as input. It recompresses the file using only GPU-supported encodings, scans it back, and collects timings for how long each column scan took. The results are printed to stdout as either pretty text or JSON, which can be piped into duckdb or similar for analysis.
Example usage: