Releases: triton-lang/triton
Triton 3.5.1: bug-fix release
Triton 3.5.0 release
Triton Release Notes
Table of Contents
- Dialect & Frontend
- Backend & Compiler
- AMD/HIP Backend
- NVIDIA Backend
- Gluon & Layout Improvements
- Kernels & Benchmarks
- Testing & CI
- Build & Infrastructure
- Documentation
- Breaking Changes
Dialect & Frontend
New Features
- Warp Specialization Enhancements (#8005): Made warp specialization require at least 4 warps with proper error messaging to prevent compiler crashes
- Ragged TMA Support (#7792, #7783): Added support for write-only and general ragged TMAs with automatic bounds checking using higher-dimensional TMA descriptors
- Device Assert Mask Support (#7905): Added a `mask` parameter to `tl.device_assert` for easier debugging with masked operations
- Padding Option for TMA Loads (#7993): Added support for a padding option (including NaN) in TMA descriptor creation and fallback paths
- Implicit Downcast in TMA Descriptor Store (#6236): Fixed missing implicit downcast when storing blocks through TMA descriptors
- Mutations Disallowed (#7762): Disabled all mutations to address semantic issues in the language
- Specialized Recursion (#7468): Enabled functions to recurse on specialized versions of themselves
- Constexpr Function Cache Invalidation (#7802): Reworked `constexpr_function` to support cache invalidation and capability checks
Bug Fixes
- Floating Point Argument Passing (#7439): Fixed floating-point argument passing for `tl.float16` and other FP types
- Non-Associative Reduce Rematerialization (#7272): Avoided rematerialization for non-associative reduce operations to prevent data-consistency issues
- PDL Issue Fix (#7379): Fixed PDL-related issues in the frontend
- Constexpr in Tuples (#7442): Improved handling of constexpr in tuples, fixing type mismatches and in-place mutations
- Loop Carry Detection (#7200): Improved detection of loop carries when `@builtin` or `@core.extern` functions modify their arguments
- Liveouts in Conditionals (#7318): Fixed detection of liveouts in conditional blocks
Improvements
- MLIR Verifier After Parsing (#7999): Ran the MLIR verifier after parsing to catch errors early
- Better Error for num_cta > 1 on sm < 90 (#7812): Improved error messaging for unsupported configurations
- Extern Elementwise Type Handling (#7930): Fixed mismatched type handling for `core.extern_elementwise`
- Libdevice Exposure in Gluon (#7890): Exposed libdevice functions with improved layout propagation
Backend & Compiler
LLVM Updates
- LLVM Bump (#7881): Updated to llvm/llvm-project@bc773632355b with multiple API changes including:
  - Switched `Constant{Int|Float}Op` type and value order
  - Provided triple for `TargetLibraryInfoImpl`
  - Fixed atomic sync scope for NVIDIA
  - Updated MLIR lib names and ops
Code Generation
- Generic Swizzling for convert_layout (#6982, #7565): Implemented a generalized swizzling algorithm for `convert_layout` that:
  - Finds the optimal shared memory layout maximizing read/write vectorization
  - Minimizes bank conflicts
  - Supports `ldmatrix`/`stmatrix` and their transpose versions
  - Uses columns and diagonals for better performance
- Warp-Local Layout Conversion (#7558): Improved the warp-local layout conversion algorithm using shuffles, with:
  - Better handling of broadcasting in layouts
  - Fewer `select` and `shuffle` instructions
  - Register packing for sub-32-bit data types
- Byte Permutes in Intra-Warp Conversion (#7809): Used byte permute instructions for better performance in layout conversions
- Tmem Alloc Hoisting (#7568): Hoisted tmem alloc outside of if statements to reduce register pressure
- CP.Async Lowering Improvements (#7314): Moved cp.async to better lowering sequence reusing previous optimizations
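The bank-conflict avoidance behind the generic swizzling work above can be sketched in a few lines of pure Python. This is an illustration only, not Triton's actual implementation; the bank count, element size, and tile shape are assumptions:

```python
NUM_BANKS = 32  # shared memory banks per SM/CU (assumed)

def swizzle(row: int, col: int) -> int:
    """XOR the column index with the row index so that column-wise
    accesses from a warp land in distinct banks."""
    return col ^ (row % NUM_BANKS)

def bank_of(row: int, col: int) -> int:
    # With one 4-byte element per bank slot, the bank is the swizzled column mod 32.
    return swizzle(row, col) % NUM_BANKS

# Without swizzling, 32 threads reading column 0 of a row-major 32x32 tile
# would all hit bank 0 (a 32-way conflict); with swizzling, every access
# lands in a different bank.
banks = [bank_of(t, 0) for t in range(32)]
assert len(set(banks)) == NUM_BANKS  # conflict-free
```

The same XOR pattern keeps any fixed column conflict-free, which is why the generalized algorithm can serve both the normal and transposed access paths from one shared memory layout.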
Optimizations
- Simpler Codegen for Linear Layouts (#7201): Simplified code generation for linear layouts
- Vectorization Fixes (#7845): Fixed vectorization for `PaddedSharedEncoding` with non-default order
- XOR Trick Refactoring (#7397): Refactored the XOR trick into a helper function for better code reuse
- Shared Memory Offset Fixes (#7949): Fixed various issues with smem base offsets
- Min/Max Redux Optimization for Blackwell (#7465): Implemented new redux.sync optimization
Bug Fixes
- Atomic RMW Broadcasting (#7460): Fixed atomic rmw ops to broadcast results when necessary
- TMA Load with Multiple Users (#7398): Fixed lowering of TMA load when users have differing encodings
- Subview Padding (#7404): Fixed subview padding for PaddedSharedEncoding
- Memdesc Subview Fixes (#7480, #7515): Properly handled memdesc_subview with slicing and offsets
- FP16 to FP32 Conversion (#7585): Fixed fp16 to fp32 conversion issues
- Barrier Synchronization (#7993): Added bar.sync before deallocating tmem to prevent race conditions
AMD/HIP Backend
New Features
- GFX950 (MI350) Support: Added comprehensive support for AMD's latest architecture, including:
  - ChainedDot Schedule (#7601, #7638): Added a new scheduling variant for loops with 2 chained dots
  - Ping-Pong Transformation (#7638, #7458): Added ping-pong support for:
    - Chained dot schedules
    - Async load with num_stages=3
    - MXFP types
- Buffer Atomic CAS (#7292): Added support for buffer atomic compare-and-swap
- FP64 MFMA Support (#7461): Added support for fp64 dot operations using MFMA intrinsics
Layout & Memory Optimizations
- General Swizzling Support (#7482, #7606): Enabled ConvertLayoutOp general swizzling
- Padded vs Swizzled Allocation (#7328, #7750): Introduced specialized allocation pass with proper layout selection strategy
- Improved LDS Usage (#7750, #7813): Optimized LDS usage by:
  - Preferring swizzled layouts when LDS limits allow
  - Using a single LDS allocation for both transposed and non-transposed access
  - Better layout selection in the optimize-lds-usage pass
- TilesPerWarp Parameter (#7283): Added tilesPerWarp parameter to MFMA layout for contiguous tile computation
- Extract Slice Rewrite (#7128): Refactored extract_slice to support:
  - Arbitrary tensor ranks
  - Relaxed layout constraints
  - CTA tile boundary alignment
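The padded-vs-swizzled trade-off driving the allocation work above can be sketched with simple arithmetic. This is illustrative only (element size, padding amount, and tile shape are assumptions, not the pass's actual heuristics): padding each row costs extra LDS, while XOR swizzling avoids conflicts within the tile's exact footprint, so swizzled layouts are preferred whenever they fit the LDS budget:

```python
ELEM_BYTES = 4   # fp32 elements (assumed)
PAD_ELEMS = 1    # one extra element per row to stagger banks (assumed)

def padded_lds_bytes(rows: int, cols: int) -> int:
    """LDS footprint when bank conflicts are avoided by padding each row."""
    return rows * (cols + PAD_ELEMS) * ELEM_BYTES

def swizzled_lds_bytes(rows: int, cols: int) -> int:
    """LDS footprint with XOR swizzling: no padding, exact tile size."""
    return rows * cols * ELEM_BYTES

# For a 64x64 fp32 tile, swizzling saves one padding column per row.
saved = padded_lds_bytes(64, 64) - swizzled_lds_bytes(64, 64)
assert saved == 64 * PAD_ELEMS * ELEM_BYTES
```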
Code Generation Improvements
- PermlaneSwap Pattern (#7825, #7861): Added general permlane_swap pattern for ConvertLayoutOp
- Register Broadcast (#7407): Added support for register broadcast in slice/concat ops
- Shared Memory Ops for FP4 (#7626): Added support for M/N packed FP4 with transposition
- Direct-to-LDS Loads (#7829): Refactored lowering via the common `lowerLdSt` path
- Local Load/Store Lowering (#7355): Enabled a common code path for local_load/store operations
FP8 & Numeric Support
- FP8 Variant Support:
  - Dot Scaled Support: Enabled on gfx11 (#7954) and gfx12 (#7644) with emulation via decomposition
  - True16 Handling: Disabled on gfx11 due to test failures (#7953)
Stream Pipeliner Enhancements
- Refactoring (#7526, #7556): Refactored to use more common pipeliner functionality
- Async Wait Handling (#7577): Restricted merging async_wait when pipelining with num_stages=3
- Mask Operation Support (#7620): Added ttg.mask handling in stream pipeliner
Build & Driver
- LLD Library API (#7548): Replaced shell-out to lld with direct library API calls
- hipGetProcAddress (#7350): Switched to using hipGetProcAddress for querying HIP symbols
- Driver Version Check (#7501): Added runtime driver version check with descriptive errors
- AOT Compilation (#7007): Added HIP AOT compilation support to compile.py tool
Bug Fixes
- Pointer Canonicalizer (#7242): Fixed attribute propagation when ranks don't match
- Global Atomic Optimization (#7496): Optimized global atomic operations following memory model semantics
- FP32/FP16 to OCP FP8 (#7382): Fixed conversion for subnormal numbers
- Async Copy Vectorization (#7250): Fixed the async load pipeline for loads smaller than 32 bits
- OptimizeLDSUtility Crash (#7434): Fixed nullptr crash in createTmpLayout
- Memrealtime on GFX11/12 (#7357): Added proper support using s_sendmsg_rtn_b64
NVIDIA Backend
Hopper/Blackwell Features
- Warp Specialization:
  - Enabled for persistent matmul and FA (#7642, #7623)
  - Assigned the final try_wait to a partition (#7757)
  - Tightened the user critical section with the accumulator (#7509)
  - Fixed a rematerialization bug in the partitioner (#7427)
  - Optimized partitioning by hoisting above broadcasts (#7692)
  - Enabled 1 buffer for SSA partition dependencies (#7686)
  - Added control flow support in TMEM allocation (#7698)
- WGMMA Support in Gluon (#7300, #7313): Added Hopper WGMMA with async wait support
- Aref Operations (#7479, #7561, #7645): Updated aref ops and the lower_aref pass with:
  - Multi-consumer support
  - Stage/cluster attribute passing
  - TMA load aref insertion
  - Control flow handling
- Partition Loops Rewrite (#7415): Reimplemented supporting general control flow using mutual recursion
Blackwell-Specific
- TMEM Support:
  - Fixed codegen for Nx1xf32 (#7234)
  - Fixed tmem_su...
Triton 3.4.0 Release
Highlights
Gluon Framework Comprehensive Enhancement
The Gluon framework has received major enhancements across all areas including new APIs, tensor memory management, layout operations, and synchronization primitives. Key additions include static_assert functionality, TensorDescriptor kernel arguments, async TMA operations, tensor memory implementation, thread synchronization barriers, and comprehensive tensor operations like split/join/reshape and reductions. (#7172, #7168, #7165, #7160, #7152, #7151, #7149, #7145, #7142, #7122, #7121, #7120, #7115, #7114, #7106, #7102, #7099, #7097, #7091, #7089, #7080, #7061, #7057, #7022, #7020, #7009, #7006, #7004, #7001, #6998, #6997, #6994, #6992, #6989, #6985, #6971, #6950)
Hardware Support Expansion
- AMD GFX950 Architecture Support - Comprehensive support for GFX950 including WMMA operations, performance optimizations, and architectural-specific features (#7175, #7171, #7127, #6744, #6594)
- Blackwell Enhanced TMEM Support - Improved tensor memory operations with better register usage and performance optimizations (#7160, #7079, #6817)
- Hopper WGMMA Improvements - Enhanced matrix multiplication with subtiling and prefetching optimizations (#7136, #6130)
Performance Optimizations
- Automatic Warp Specialization - Introduced automatic warp specialization optimization for enhanced kernel performance on NVIDIA GPUs (#6289, #6246, #6217)
- MMAv5 Pipelining - Re-enabled and improved MMAv5 pipelining with better performance and scheduling (#6732, #6613, #6256)
- TMA Operations Enhancement - Improved tensor memory access with better layout support and reduced register pressure (#6725, #6238, #6580)
New Features
Language and Frontend
- Aggregate Type Support - Added the `@tl.aggregate` decorator for autogenerating Triton types from Python classes (#6970)
- JITFunction Constexpr Support - Enhanced constexpr support for function lists and improved JIT functionality (#6988, #6963, #7105)
- Enhanced Boolean Operations - Improved handling of boolean operators and scalars with chained operations (#6769)
- Bitonic Top-k and Sorting - Added support for bitonic top-k operations and improved sort implementations (#6461, #6486)
- Masked Histograms - Added support for masked histogram operations (#6695)
- Syntactic Sugar Additions - Added `.item()` as syntactic sugar for `.reshape([])` (#6873)
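As a rough illustration of the bitonic network behind the top-k support above, here is a pure-Python sketch (not Triton's GPU implementation; `top_k` is a hypothetical helper named for this example). Top-k falls out of a descending bitonic sort:

```python
def bitonic_sort(a: list, descending: bool = False) -> list:
    """In-place bitonic sort of a list whose length is a power of two."""
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:                 # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # compare-exchange stride within each merge
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    up = (i & k) == 0  # direction of this subsequence
                    if (a[i] > a[partner]) if up else (a[i] < a[partner]):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a[::-1] if descending else a

def top_k(a: list, k: int) -> list:
    """Top-k via a full descending sort (illustrative, not how a GPU would batch it)."""
    return bitonic_sort(list(a), descending=True)[:k]
```

Every compare-exchange pairs indices that differ in a single bit (`i ^ j`), which is what makes the network map cleanly onto warp shuffles on a GPU.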
Backend and Compilation
- Generic Swizzling Implementation - Implemented generic swizzling algorithm for convert_layout lowering (#6982)
- Enhanced Register Allocation - Improved dynamic register reallocation for warp specialization (#6877, #6694, #6407)
- TMA Reduce Operations - Added TMA reduce operations for descriptor-based reducing stores (#6580)
- Improved Subtiling - Enhanced subtiling code generation for tensor memory loading (#6415)
- BF16 Atomic Operations - Added support for BF16 atomic add operations (#6519)
- Stmatrix Support - Added comprehensive stmatrix support including transpose operations (#6910, #6899)
Hardware-Specific Features
- AMD AsyncCopy Optimizations - Enhanced AsyncCopy support in StreamPipeliner with improved memory operations (#6270, #6639, #6382)
- AMD Buffer Operations - Comprehensive improvements to buffer operations with better vectorization and alignment (#6126, #6145, #6329)
- AMD Ping-pong Scheduler - Enhanced ping-pong scheduler for better memory operation handling (#6254, #6301, #6198)
- NVIDIA PDL Support - Enabled Programmatic Dependent Launch for overlapping kernel execution (#6394)
- AMD HIP AOT Support - Added HIP Ahead-of-Time compilation support (#7007)
Improvements
Performance
- Routing Kernel Optimizations - Multiple performance improvements achieving up to 5% runtime reduction (#6866, #6546, #7040)
- Matrix Multiplication Enhancements - Enhanced persistent TMA matmul with epilogue subtiling and metadata alignment (#6724, #6882, #7123)
- SwiGLU Optimizations - Improved SwiGLU kernel performance and fused activation functions (#6797, #6553)
- Attention Kernel Fixes - Fixed and optimized attention tutorials with better performance metrics (#7037, #6839)
Developer Experience
- Enhanced CI/CD - Improved continuous integration with better caching and timeout handling (#6815, #6816, #6582)
- Testing Infrastructure - Enhanced test coverage and organization (#7109, #6867)
- Documentation Updates - Improved documentation for installation and new features (#7103, #6778, #6235)
- Build System Improvements - Better CMake support and dependency management ([#6330](https://github.com/tri...