Releases: triton-lang/triton
Triton 3.5.1: bug-fix release
Triton 3.5.0 release
Triton Release Notes
Table of Contents
- Dialect & Frontend
- Backend & Compiler
- AMD/HIP Backend
- NVIDIA Backend
- Gluon & Layout Improvements
- Kernels & Benchmarks
- Testing & CI
- Build & Infrastructure
- Documentation
- Breaking Changes
Dialect & Frontend
New Features
- Warp Specialization Enhancements (#8005): Made warp specialization require at least 4 warps with proper error messaging to prevent compiler crashes
- Ragged TMA Support (#7792, #7783): Added support for write-only and general ragged TMAs with automatic bounds checking using higher-dimensional TMA descriptors
- Device Assert Mask Support (#7905): Added a `mask` parameter to `tl.device_assert` for easier debugging with masked operations
- Padding Option for TMA Loads (#7993): Added support for a padding option (including NaN) in TMA descriptor creation and fallback paths
- Implicit Downcast in TMA Descriptor Store (#6236): Fixed missing implicit downcast when storing blocks through TMA descriptors
- Mutations Disallowed (#7762): Disabled all mutations to address semantic issues in the language
- Specialized Recursion (#7468): Enabled functions to recurse on specialized versions of themselves
- Constexpr Function Cache Invalidation (#7802): Reworked `constexpr_function` to support cache invalidation and capability checks
Bug Fixes
- Floating Point Argument Passing (#7439): Fixed floating-point argument passing for `tl.float16` and other FP types
- Non-Associative Reduce Rematerialization (#7272): Avoided rematerialization for non-associative reduce operations to prevent data-consistency issues
- PDL Issue Fix (#7379): Fixed PDL-related issues in the frontend
- Constexpr in Tuples (#7442): Improved handling of constexpr in tuples, fixing type mismatches and in-place mutations
- Loop Carry Detection (#7200): Improved detection of loop carries when `@builtin` or `@core.extern` functions modify their arguments
- Liveouts in Conditionals (#7318): Fixed detection of liveouts in conditional blocks
Improvements
- MLIR Verifier After Parsing (#7999): Ran the MLIR verifier after parsing to catch errors early
- Better Error for num_cta > 1 on sm < 90 (#7812): Improved error messaging for unsupported configurations
- Extern Elementwise Type Handling (#7930): Fixed mismatched type handling for `core.extern_elementwise`
- Libdevice Exposure in Gluon (#7890): Exposed libdevice functions with improved layout propagation
Backend & Compiler
LLVM Updates
- LLVM Bump (#7881): Updated to llvm/llvm-project@bc773632355b with multiple API changes including:
  - Switched `Constant{Int|Float}Op` type and value order
  - Provided triple for `TargetLibraryInfoImpl`
  - Fixed atomic sync scope for NVIDIA
  - Updated MLIR lib names and ops
Code Generation
- Generic Swizzling for convert_layout (#6982, #7565): Implemented a generalized swizzling algorithm for `convert_layout` that:
  - Finds the optimal shared memory layout maximizing read/write vectorization
  - Minimizes bank conflicts
  - Supports `ldmatrix`/`stmatrix` and their transpose versions
  - Uses columns and diagonals for better performance
- Warp-Local Layout Conversion (#7558): Improved the warp-local layout conversion algorithm using shuffles, with:
  - Better handling of broadcasting in layouts
  - Fewer `select` and `shuffle` instructions
  - Register packing for sub-32-bit data types
- Byte Permutes in Intra-Warp Conversion (#7809): Used byte permute instructions for better performance in layout conversions
- Tmem Alloc Hoisting (#7568): Hoisted tmem alloc outside of if statements to reduce register pressure
- CP.Async Lowering Improvements (#7314): Moved cp.async to better lowering sequence reusing previous optimizations
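The bank-conflict avoidance behind the generic swizzling work above can be sketched in a few lines of pure Python. This is an illustration only, not Triton's actual implementation; the bank count, element size, and tile shape are assumptions:

```python
NUM_BANKS = 32  # shared memory banks per SM/CU (assumed)

def swizzle(row: int, col: int) -> int:
    """XOR the column index with the row index so that column-wise
    accesses from a warp land in distinct banks."""
    return col ^ (row % NUM_BANKS)

def bank_of(row: int, col: int) -> int:
    # With one 4-byte element per bank slot, the bank is the swizzled column mod 32.
    return swizzle(row, col) % NUM_BANKS

# Without swizzling, 32 threads reading column 0 of a row-major 32x32 tile
# would all hit bank 0 (a 32-way conflict); with swizzling, every access
# lands in a different bank.
banks = [bank_of(t, 0) for t in range(32)]
assert len(set(banks)) == NUM_BANKS  # conflict-free
```

The same XOR pattern keeps any fixed column conflict-free, which is why the generalized algorithm can serve both the normal and transposed access paths from one shared memory layout.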
Optimizations
- Simpler Codegen for Linear Layouts (#7201): Simplified code generation for linear layouts
- Vectorization Fixes (#7845): Fixed vectorization for `PaddedSharedEncoding` with non-default order
- XOR Trick Refactoring (#7397): Refactored the XOR trick into a helper function for better code reuse
- Shared Memory Offset Fixes (#7949): Fixed various issues with smem base offsets
- Min/Max Redux Optimization for Blackwell (#7465): Implemented new redux.sync optimization
Bug Fixes
- Atomic RMW Broadcasting (#7460): Fixed atomic rmw ops to broadcast results when necessary
- TMA Load with Multiple Users (#7398): Fixed lowering of TMA load when users have differing encodings
- Subview Padding (#7404): Fixed subview padding for PaddedSharedEncoding
- Memdesc Subview Fixes (#7480, #7515): Properly handled memdesc_subview with slicing and offsets
- FP16 to FP32 Conversion (#7585): Fixed fp16 to fp32 conversion issues
- Barrier Synchronization (#7993): Added bar.sync before deallocating tmem to prevent race conditions
AMD/HIP Backend
New Features
- GFX950 (MI350) Support: Added comprehensive support for AMD's latest architecture, including:
  - ChainedDot Schedule (#7601, #7638): Added a new scheduling variant for loops with 2 chained dots
  - Ping-Pong Transformation (#7638, #7458): Added ping-pong support for:
    - Chained dot schedules
    - Async load with num_stages=3
    - MXFP types
- Buffer Atomic CAS (#7292): Added support for buffer atomic compare-and-swap
- FP64 MFMA Support (#7461): Added support for fp64 dot operations using MFMA intrinsics
Layout & Memory Optimizations
- General Swizzling Support (#7482, #7606): Enabled ConvertLayoutOp general swizzling
- Padded vs Swizzled Allocation (#7328, #7750): Introduced specialized allocation pass with proper layout selection strategy
- Improved LDS Usage (#7750, #7813): Optimized LDS usage by:
  - Preferring swizzled layouts when LDS limits allow
  - Using a single LDS allocation for both transposed and non-transposed access
  - Better layout selection in the optimize-lds-usage pass
- TilesPerWarp Parameter (#7283): Added tilesPerWarp parameter to MFMA layout for contiguous tile computation
- Extract Slice Rewrite (#7128): Refactored extract_slice to support:
  - Arbitrary tensor ranks
  - Relaxed layout constraints
  - CTA tile boundary alignment
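The padded-vs-swizzled trade-off driving the allocation work above can be sketched with simple arithmetic. This is illustrative only (element size, padding amount, and tile shape are assumptions, not the pass's actual heuristics): padding each row costs extra LDS, while XOR swizzling avoids conflicts within the tile's exact footprint, so swizzled layouts are preferred whenever they fit the LDS budget:

```python
ELEM_BYTES = 4   # fp32 elements (assumed)
PAD_ELEMS = 1    # one extra element per row to stagger banks (assumed)

def padded_lds_bytes(rows: int, cols: int) -> int:
    """LDS footprint when bank conflicts are avoided by padding each row."""
    return rows * (cols + PAD_ELEMS) * ELEM_BYTES

def swizzled_lds_bytes(rows: int, cols: int) -> int:
    """LDS footprint with XOR swizzling: no padding, exact tile size."""
    return rows * cols * ELEM_BYTES

# For a 64x64 fp32 tile, swizzling saves one padding column per row.
saved = padded_lds_bytes(64, 64) - swizzled_lds_bytes(64, 64)
assert saved == 64 * PAD_ELEMS * ELEM_BYTES
```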
Code Generation Improvements
- PermlaneSwap Pattern (#7825, #7861): Added general permlane_swap pattern for ConvertLayoutOp
- Register Broadcast (#7407): Added support for register broadcast in slice/concat ops
- Shared Memory Ops for FP4 (#7626): Added support for M/N packed FP4 with transposition
- Direct-to-LDS Loads (#7829): Refactored lowering via the common `lowerLdSt` path
- Local Load/Store Lowering (#7355): Enabled a common code path for local_load/store operations
FP8 & Numeric Support
- FP8 Variant Support:
  - Dot Scaled Support: Enabled on gfx11 (#7954) and gfx12 (#7644) with emulation via decomposition
  - True16 Handling: Disabled on gfx11 due to test failures (#7953)
Stream Pipeliner Enhancements
- Refactoring (#7526, #7556): Refactored to use more common pipeliner functionality
- Async Wait Handling (#7577): Restricted merging async_wait when pipelining with num_stages=3
- Mask Operation Support (#7620): Added ttg.mask handling in stream pipeliner
Build & Driver
- LLD Library API (#7548): Replaced shell-out to lld with direct library API calls
- hipGetProcAddress (#7350): Switched to using hipGetProcAddress for querying HIP symbols
- Driver Version Check (#7501): Added runtime driver version check with descriptive errors
- AOT Compilation (#7007): Added HIP AOT compilation support to compile.py tool
Bug Fixes
- Pointer Canonicalizer (#7242): Fixed attribute propagation when ranks don't match
- Global Atomic Optimization (#7496): Optimized global atomic operations following memory model semantics
- FP32/FP16 to OCP FP8 (#7382): Fixed conversion for subnormal numbers
- Async Copy Vectorization (#7250): Fixed the async load pipeline for loads smaller than 32 bits
- OptimizeLDSUtility Crash (#7434): Fixed nullptr crash in createTmpLayout
- Memrealtime on GFX11/12 (#7357): Added proper support using s_sendmsg_rtn_b64
NVIDIA Backend
Hopper/Blackwell Features
- Warp Specialization:
  - Enabled for persistent matmul and FA (#7642, #7623)
  - Assigned the final try_wait to a partition (#7757)
  - Tightened the user critical section with the accumulator (#7509)
  - Fixed a rematerialization bug in the partitioner (#7427)
  - Optimized partitioning by hoisting above broadcasts (#7692)
  - Enabled 1 buffer for SSA partition dependencies (#7686)
  - Added control flow support in TMEM allocation (#7698)
- WGMMA Support in Gluon (#7300, #7313): Added Hopper WGMMA with async wait support
- Aref Operations (#7479, #7561, #7645): Updated aref ops and the lower_aref pass with:
  - Multi-consumer support
  - Stage/cluster attribute passing
  - TMA load aref insertion
  - Control flow handling
- Partition Loops Rewrite (#7415): Reimplemented supporting general control flow using mutual recursion
Blackwell-Specific
- TMEM Support:
  - Fixed codegen for Nx1xf32 (#7234)
  - Fixed tmem_su...
Triton 3.4.0 Release
Highlights
Gluon Framework Comprehensive Enhancement
The Gluon framework has received major enhancements across all areas including new APIs, tensor memory management, layout operations, and synchronization primitives. Key additions include static_assert functionality, TensorDescriptor kernel arguments, async TMA operations, tensor memory implementation, thread synchronization barriers, and comprehensive tensor operations like split/join/reshape and reductions. (#7172, #7168, #7165, #7160, #7152, #7151, #7149, #7145, #7142, #7122, #7121, #7120, #7115, #7114, #7106, #7102, #7099, #7097, #7091, #7089, #7080, #7061, #7057, #7022, #7020, #7009, #7006, #7004, #7001, #6998, #6997, #6994, #6992, #6989, #6985, #6971, #6950)
Hardware Support Expansion
- AMD GFX950 Architecture Support - Comprehensive support for GFX950 including WMMA operations, performance optimizations, and architectural-specific features (#7175, #7171, #7127, #6744, #6594)
- Blackwell Enhanced TMEM Support - Improved tensor memory operations with better register usage and performance optimizations (#7160, #7079, #6817)
- Hopper WGMMA Improvements - Enhanced matrix multiplication with subtiling and prefetching optimizations (#7136, #6130)
Performance Optimizations
- Automatic Warp Specialization - Introduced automatic warp specialization optimization for enhanced kernel performance on NVIDIA GPUs (#6289, #6246, #6217)
- MMAv5 Pipelining - Re-enabled and improved MMAv5 pipelining with better performance and scheduling (#6732, #6613, #6256)
- TMA Operations Enhancement - Improved tensor memory access with better layout support and reduced register pressure (#6725, #6238, #6580)
New Features
Language and Frontend
- Aggregate Type Support - Added the `@tl.aggregate` decorator for autogenerating Triton types from Python classes (#6970)
- JITFunction Constexpr Support - Enhanced constexpr support for function lists and improved JIT functionality (#6988, #6963, #7105)
- Enhanced Boolean Operations - Improved handling of boolean operators and scalars with chained operations (#6769)
- Bitonic Top-k and Sorting - Added support for bitonic top-k operations and improved sort implementations (#6461, #6486)
- Masked Histograms - Added support for masked histogram operations (#6695)
- Syntactic Sugar Additions - Added `.item()` as syntactic sugar for `.reshape([])` (#6873)
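As a rough illustration of the bitonic network behind the top-k support above, here is a pure-Python sketch (not Triton's GPU implementation; `top_k` is a hypothetical helper named for this example). Top-k falls out of a descending bitonic sort:

```python
def bitonic_sort(a: list, descending: bool = False) -> list:
    """In-place bitonic sort of a list whose length is a power of two."""
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:                 # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # compare-exchange stride within each merge
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    up = (i & k) == 0  # direction of this subsequence
                    if (a[i] > a[partner]) if up else (a[i] < a[partner]):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a[::-1] if descending else a

def top_k(a: list, k: int) -> list:
    """Top-k via a full descending sort (illustrative, not how a GPU would batch it)."""
    return bitonic_sort(list(a), descending=True)[:k]
```

Every compare-exchange pairs indices that differ in a single bit (`i ^ j`), which is what makes the network map cleanly onto warp shuffles on a GPU.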
Backend and Compilation
- Generic Swizzling Implementation - Implemented generic swizzling algorithm for convert_layout lowering (#6982)
- Enhanced Register Allocation - Improved dynamic register reallocation for warp specialization (#6877, #6694, #6407)
- TMA Reduce Operations - Added TMA reduce operations for descriptor-based reducing stores (#6580)
- Improved Subtiling - Enhanced subtiling code generation for tensor memory loading (#6415)
- BF16 Atomic Operations - Added support for BF16 atomic add operations (#6519)
- Stmatrix Support - Added comprehensive stmatrix support including transpose operations (#6910, #6899)
Hardware-Specific Features
- AMD AsyncCopy Optimizations - Enhanced AsyncCopy support in StreamPipeliner with improved memory operations (#6270, #6639, #6382)
- AMD Buffer Operations - Comprehensive improvements to buffer operations with better vectorization and alignment (#6126, #6145, #6329)
- AMD Ping-pong Scheduler - Enhanced ping-pong scheduler for better memory operation handling (#6254, #6301, #6198)
- NVIDIA PDL Support - Enabled Programmatic Dependent Launch for overlapping kernel execution (#6394)
- AMD HIP AOT Support - Added HIP Ahead-of-Time compilation support (#7007)
Improvements
Performance
- Routing Kernel Optimizations - Multiple performance improvements achieving up to 5% runtime reduction (#6866, #6546, #7040)
- Matrix Multiplication Enhancements - Enhanced persistent TMA matmul with epilogue subtiling and metadata alignment (#6724, #6882, #7123)
- SwiGLU Optimizations - Improved SwiGLU kernel performance and fused activation functions (#6797, #6553)
- Attention Kernel Fixes - Fixed and optimized attention tutorials with better performance metrics (#7037, #6839)
Developer Experience
- Enhanced CI/CD - Improved continuous integration with better caching and timeout handling (#6815, #6816, #6582)
- Testing Infrastructure - Enhanced test coverage and organization (#7109, #6867)
- Documentation Updates - Improved documentation for installation and new features (#7103, #6778, #6235)
- Build System Improvements - Better CMake support and dependency management ([#6330](https://github.com/tri...