TensorFlow changelog
Here's the latest scoop on some exciting changes and updates! 🎉 We've been busy improving, fixing, and adding new features to make everything work smoother and faster. Let's dive into the details:
- New Feature: Shared Ownership in tstring 🛠️
  We've added a cool new feature to the tstring class! Now, with tstring::owner<T>, you can enjoy shared ownership of data. This means tstring can manage shared views of data, ensuring it stays alive as long as needed. It's all about safe memory management and keeping things tidy!
- Improvement: Reduction Offloading to YNNPACK 🚀
  We've turbocharged the XLA CPU backend by offloading suitable reductions to YNNPACK. This enhancement checks whether a reduction operation fits the bill and, if so, hands it over to the high-performance library for a speed boost. It's all about making things run faster and smoother.
- New Feature: V3 to V2/V1 Conversion 🔄
  Say hello to conversion functions that bridge the gap between different replica-group versions in the XLA library. We've also refactored some logic to make it easier to handle reshape dimensions and transpose permutations. This means better interoperability and seamless transitions between versions.
- Improvement: Direct Tensor Reading 📈
  The XLA CPU backend just got a performance boost! We now read directly from input tensors in vectorized reduce operations, eliminating the need for intermediate vectors. This change paves the way for future enhancements like tree-reduction and better bufferization.
- New Feature: Triton Collective Fusion 🔧
  We've introduced a new fusion kind, kTritonCollectiveFusionKind, in the XLA GPU backend. This sets the stage for handling collective operations more efficiently. Actual code emission will follow, but for now, we're laying the groundwork for future improvements.
- New Feature: InprocessSymbolSpecs Serialization 📦
  Serialization just got a whole lot easier with the new KernelSymbolRegistry. It maps unique names to function pointers, making cross-process usage a breeze. This update enhances interoperability and functionality for CUDA and ROCm kernels.
- New Feature: CudaGraph Node Annotations 🖋️
  We've enhanced the trace viewer to show per-node framework names in CudaGraph nodes. By rewriting annotations, you get more context and clarity in profiling data. It's all about making debugging and profiling a more insightful experience.
- Bugfix: Race Condition in PullTable 🐛
  We squashed a pesky race condition in the PullTable::Handle function. Now, entries are safely checked before erasure, ensuring stability and reliability in concurrent scenarios.
- New Feature: kDotDependent in DimensionInfo 📊
  We've added kDotDependent to DimensionInfo, helping differentiate between gradient types and improving scheduling decisions. The HloDimensionAnalysis class now collects this info, enhancing the analysis capabilities of the HLO framework.
- Bugfix: Disable Legacy Emitter Path 🚫
  The legacy emitter path for XLA GPU is now disabled by default. This streamlines the emission process and removes reliance on outdated functionality, making things more efficient.
- Chore: Remove Stride Functions 🧹
  We've cleaned up the XLA types module by removing stride-related functions. This simplifies the codebase and hints at a shift in how strides are handled within the library.
- Bugfix: StableHLO Patch File 🔧
  We've fixed the StableHLO patch file, improving include paths and adding necessary headers. These changes enhance integration and operation within the XLA framework.
That's it for now! Keep an eye out for more updates as we continue to enhance and refine our features. 🚀
Included Commits
This commit introduces a new fusion kind, kTritonCollectiveFusionKind, to the XLA (Accelerated Linear Algebra) GPU backend. When this configuration is set, the FusionEmitter() function will return an instance of the Triton Fusion emitter, which is designed to handle collective operations. The actual code emission for these operations will be implemented in subsequent updates, indicating that this commit lays the groundwork for future enhancements in collective operation handling within the XLA framework.
In addition to the new fusion kind, the commit also includes modifications to various files, such as adding dependencies and updating existing code to support the new functionality. Key changes involve the introduction of functions to set GPU backend configurations for collective operations and updates to the testing framework to validate the new features. Overall, this commit enhances the capabilities of the XLA GPU backend by enabling better support for collective operations through the Triton framework.
Files changed
- third_party/xla/xla/backends/gpu/codegen/triton/BUILD
- third_party/xla/xla/backends/gpu/codegen/triton/collective_emitter.cc
- third_party/xla/xla/backends/gpu/codegen/triton/collective_emitter.h
- third_party/xla/xla/backends/gpu/codegen/triton/collective_emitter_test.cc
- third_party/xla/xla/backends/gpu/codegen/triton/fusion.cc
- third_party/xla/xla/backends/gpu/codegen/triton/fusion_test.cc
- third_party/xla/xla/service/gpu/hlo_fusion_analysis.cc
- third_party/xla/xla/service/gpu/ir_emission_utils.h
The commit focuses on optimizing the vectorized reduce operation in the XLA (Accelerated Linear Algebra) CPU backend by enabling direct reading from the input tensor instead of utilizing an intermediate vector. This significant change eliminates the need for the dynamic vector extract pass, streamlining the code and potentially enhancing performance. Additionally, this modification lays the groundwork for implementing tree-reduction techniques and improving bufferization processes in future developments.
The update involves multiple file modifications within the XLA codebase, including changes to the fusion compiler and various transformation files. Notably, it removes the dynamic vector extract implementation and updates related tests, reflecting the shift towards a more efficient approach in handling vectorized reductions. Overall, the commit contributes to the ongoing efforts to refine the XLA CPU backend for better computational efficiency and flexibility.
Files changed
- third_party/xla/xla/backends/cpu/codegen/fusion_compiler.cc
- third_party/xla/xla/backends/cpu/codegen/tiled/transforms/BUILD
- third_party/xla/xla/backends/cpu/codegen/tiled/transforms/passes.h
- third_party/xla/xla/backends/cpu/codegen/tiled/transforms/passes.td
- third_party/xla/xla/backends/cpu/codegen/tiled/transforms/rewrite_dynamic_vector_extract.cc
- third_party/xla/xla/backends/cpu/codegen/tiled/transforms/shlo_to_vector.cc
- third_party/xla/xla/backends/cpu/codegen/tiled/transforms/tests/rewrite_dynamic_vector_extract.mlir
- third_party/xla/xla/backends/cpu/codegen/tiled/transforms/tests/shlo_to_vector.mlir
- third_party/xla/xla/backends/cpu/codegen/tiled/transforms/vectorized_reduce_emitter.cc
- third_party/xla/xla/backends/cpu/codegen/tiled/transforms/vectorized_reduce_emitter.h
This commit introduces a new enumeration value, kDotDependent, to the DimensionInfo class, which signifies that a dot operation can access its operands through def-use chains. This enhancement aims to differentiate between two types of gradients—ActivationGradient and WeightGradient—based on their relationship to dot operations, facilitating better scheduling decisions during computation. The HloDimensionAnalysis class is also extended to incorporate methods that collect and evaluate this new information, allowing it to determine whether an instruction is dot-dependent or known to be a weight.
The changes include modifications to various methods within the HloDimensionAnalysis class, enabling it to set and retrieve dimension information related to dot dependencies. New helper functions were added to check if an instruction is dot-dependent or has known dimension information. Furthermore, tests were updated to validate the functionality of the new kDotDependent feature, ensuring that the analysis correctly identifies dot dependencies in various scenarios, while maintaining the integrity of the existing functionality. Overall, these enhancements improve the analysis capabilities of the HLO (High-Level Operations) framework, ultimately contributing to more efficient computation scheduling.
Files changed
- third_party/xla/xla/hlo/analysis/hlo_dimension_analysis.cc
- third_party/xla/xla/hlo/analysis/hlo_dimension_analysis.h
- third_party/xla/xla/hlo/analysis/hlo_dimension_analysis_test.cc
The commit focuses on fixing the StableHLO patch file, specifically within the third-party XLA library's StableHLO integration. The changes span several files, with 22 additions and 4 deletions. Key updates involve correcting include paths in the InterpreterOps.td file and adding necessary headers in the StablehloBroadcastLowering.cpp file.
Notably, the commit replaces references to third-party includes with direct paths, enhancing the clarity and maintainability of the code. Additional headers related to utility functions and data structures from LLVM are also included, which may facilitate improved functionality and performance in the StableHLO operations. Overall, these modifications aim to streamline the integration and operation of StableHLO within the XLA framework.
Files changed
- third_party/xla/third_party/stablehlo/temporary.patch
This commit introduces functions for converting data structures from version 3 (V3) to both version 2 (V2) and version 1 (V1) representations within the XLA (Accelerated Linear Algebra) library. It also includes a refactor of the existing V3 implementation to streamline the extraction of reshape dimensions and transpose permutations necessary for V2 replica groups. The modifications encompass changes to several files, with notable updates to the MeshAxesReplicaGroupList class, which now features methods to convert to IotaReplicaGroupList and CollectiveDeviceList.
In addition to the conversion functions, the commit enhances the testing suite to validate the new functionalities. New test cases ensure that the flattened replica groups generated from V3 structures align with those produced by their V2 counterparts, confirming the integrity and accuracy of the conversion process. Overall, this update improves the interoperability of different version representations within the XLA framework while maintaining robust testing to safeguard against potential regressions.
Files changed
- third_party/xla/xla/hlo/ir/replica_group.cc
- third_party/xla/xla/hlo/ir/replica_group.h
- third_party/xla/xla/hlo/ir/replica_group_test.cc
The recent commit involves the removal of stride-related functions from the XLA (Accelerated Linear Algebra) types module. Specifically, functions such as ByteStridesForShape and StridesForShape, which were responsible for calculating the strides for given shapes and layouts, have been deleted from the codebase. This change also includes modifications to the associated header files, ensuring that any references to these functions are eliminated.
In addition to the deletion of the stride functions, the commit updates the BUILD files to reflect the changes in dependencies. The removal of these functions suggests a potential refactoring or simplification of the XLA module, possibly indicating a shift in how strides are handled within the library or a move towards alternative implementations. Overall, this commit streamlines the code by removing unnecessary complexity related to stride calculations.
Files changed
- third_party/xla/xla/python/BUILD
- third_party/xla/xla/python/types.cc
- third_party/xla/xla/python/types.h
This commit introduces enhancements to the XLA (Accelerated Linear Algebra) backend by adding support for offloading suitable reduction operations to YNNPACK, a high-performance library for neural network computations. The changes primarily involve the introduction of a new function, IsReduceOpOffloadedToYnn, which checks if a reduction operation can be offloaded based on its input characteristics and specific criteria, such as the number of elements in the input shape and the type of operations involved. The commit modifies several files, including the YNNPACK matcher and support files, to integrate this new functionality.
Additionally, the commit updates the TreeReductionRewriter class to incorporate a filter that prevents the offloading of reductions to YNNPACK if they do not meet the specified criteria, enhancing the efficiency of the computation pipeline. This integration aims to optimize performance by leveraging YNNPACK for applicable reduction operations, thereby improving the overall execution speed of XLA compiled programs.
Files changed
- third_party/xla/xla/backends/cpu/transforms/ynn_matcher.h
- third_party/xla/xla/backends/cpu/ynn_support.cc
- third_party/xla/xla/backends/cpu/ynn_support.h
- third_party/xla/xla/hlo/transforms/simplifiers/tree_reduction_rewriter.cc
- third_party/xla/xla/hlo/transforms/simplifiers/tree_reduction_rewriter.h
- third_party/xla/xla/service/cpu/cpu_compiler.cc
This commit introduces support for the serialization of InprocessSymbolSpecs within the KernelLoaderSpec::InprocessSymbol, which previously referenced CUDA or ROCm kernels via function pointers that were only valid in the current process. The limitation of these pointers made it challenging to serialize and utilize them across different processes. To address this, the commit implements a KernelSymbolRegistry, which maps unique string names to their corresponding function pointers. This allows for the serialization of the names, enabling the recovery of the function pointers in different processes.
In addition to the new registry, the commit includes modifications to several files within the stream_executor directory, such as updates to the kernel specification files and the addition of tests for the new KernelSymbolRegistry. These changes collectively enhance the ability to manage kernel symbols across process boundaries, facilitating better interoperability and functionality in the execution of CUDA and ROCm kernels.
Files changed
- third_party/xla/xla/stream_executor/BUILD
- third_party/xla/xla/stream_executor/kernel_spec.cc
- third_party/xla/xla/stream_executor/kernel_spec.h
- third_party/xla/xla/stream_executor/kernel_spec.proto
- third_party/xla/xla/stream_executor/kernel_spec_test.cc
- third_party/xla/xla/stream_executor/kernel_symbol_registry.cc
- third_party/xla/xla/stream_executor/kernel_symbol_registry.h
- third_party/xla/xla/stream_executor/kernel_symbol_registry_test.cc
This commit disables the legacy emitter path for XLA GPU by enabling a default flag that triggers an error when the legacy emitter is used for Triton. Specifically, the debug option --xla_gpu_unsupported_generic_triton_emitter_features=+disable_legacy_gemm is now set by default (via XLA_FLAGS), which prevents the legacy emitter from being used, except in certain legacy-emitter tests that still override the flag.
The changes were made in the debug_options_flags.cc file, where the new flag is incorporated into the list of unsupported features for the generic Triton emitter. Additionally, the debug_options_parsers_test.cc file was updated to include tests that verify the presence of this new flag in the enabled features. Overall, this commit aims to streamline the emission process by removing reliance on legacy functionality.
Files changed
- third_party/xla/xla/debug_options_flags.cc
- third_party/xla/xla/debug_options_parsers_test.cc
This commit addresses a potential race condition in the PullTable::Handle function within the streaming.cc file of the XLA library. The modification ensures that before an entry is erased from the entries_ map, a check is performed to confirm that the looked-up iterator is valid and points to an existing entry. This change is crucial to prevent undefined behavior when the Handle function is called concurrently with the new Reset() method, which could lead to the erasure of entries that are still being processed.
Additionally, the commit introduces a new test case in streaming_test.cc to validate the behavior of the PullTable class under conditions that could trigger this race condition. A subclass of PullTable::Entry, named SelfResettingPullTableEntry, is created to invoke the Reset() method on the PullTable instance during the handling of a request. This test aims to ensure that the race condition is effectively managed, reinforcing the stability and reliability of the PullTable implementation in concurrent scenarios.
Files changed
- third_party/xla/xla/python/transfer/streaming.cc
- third_party/xla/xla/python/transfer/streaming_test.cc
This commit introduces a feature to the CudaGraph Node in the trace viewer, allowing it to display per-node framework names by rewriting annotations to reflect their corresponding values at creation time. This enhancement aims to improve the clarity and utility of profiling data by providing more context about the framework associated with each node.
The changes affect multiple files within the XLA (Accelerated Linear Algebra) backend for GPU profiling, including modifications to the CUDA test files, CUPTI buffer events, and collector components. The updates ensure that the trace viewer can accurately represent the framework names for each node, enhancing the overall profiling and debugging experience for developers working with GPU computations.
Files changed
- third_party/xla/xla/backends/profiler/gpu/BUILD
- third_party/xla/xla/backends/profiler/gpu/cuda_test.cu.cc
- third_party/xla/xla/backends/profiler/gpu/cupti_buffer_events.cc
- third_party/xla/xla/backends/profiler/gpu/cupti_buffer_events.h
- third_party/xla/xla/backends/profiler/gpu/cupti_collector.cc
- third_party/xla/xla/backends/profiler/gpu/cupti_collector.h
- third_party/xla/xla/backends/profiler/gpu/cupti_tracer.cc
- third_party/xla/xla/backends/profiler/gpu/cupti_tracer_options_utils.cc
This commit introduces a significant enhancement to the tstring class by adding shared ownership functionality through a new tstring::owner<T> class. This reference-counted wrapper allows tstring instances to manage shared views of underlying data owned by an instance of tstring::owner. The new method assign_as_shared_view enables tstring to take a shared view of a data buffer while ensuring that the data remains valid as long as any tstring references it. The implementation involves modifying the TF_TString_View structure to include a pointer to the TStringOwnerCApi, which facilitates the reference counting by incrementing and decrementing the owner's reference count appropriately during assignment and deallocation.
Additionally, the commit updates various functions within the TF_TString API to handle the new ownership semantics, ensuring safe memory management. The changes include new inline functions to manage the assignment and deallocation of views with ownership, as well as modifications to existing functions to accommodate the reference counting mechanism. A series of tests have also been added to validate the correct behavior of the shared ownership model, ensuring that the underlying data is not prematurely deleted while still being referenced by tstring instances.
Files changed
- tensorflow/compiler/tf2xla/BUILD
- third_party/xla/third_party/tsl/tsl/platform/BUILD
- third_party/xla/third_party/tsl/tsl/platform/ctstring.h
- third_party/xla/third_party/tsl/tsl/platform/ctstring_internal.h
- third_party/xla/third_party/tsl/tsl/platform/tstring.h
- third_party/xla/third_party/tsl/tsl/platform/tstring_test.cc