TensorFlow changelog


Hey there, code enthusiasts! We've got a fresh batch of updates that are sure to make your TensorFlow experience even more exciting. Dive into the latest changes and enhancements that have been made to improve performance, add new features, and fix those pesky bugs. Let's take a closer look at what's new and improved:

  • New feature 🚀: We've introduced a new XlaOp for a custom combiner backward pass, enhancing TensorFlow's capabilities in handling sparse-dense matrix multiplication operations. This update is a big win for those optimizing deep learning models on TPUs with sparse data structures.

  • New feature 🌟: Direct translations for unary elementwise operations from StableHLO to HLO are now available, streamlining the process and improving performance for numerical computations in XLA. Say hello to seamless handling of operations like cosine, sine, and tangent!

  • Improvement 🎉: A progress bar has been added to stdout for long-running matcher processes, giving you visual feedback and making those waiting times a bit more bearable. Keep an eye on the progress and know exactly where you stand!

  • New feature 🆕: We've expanded direct translation support for BroadcastOp, BroadcastInDimOp, and DynamicBroadcastInDimOp from StableHLO to HLO. This enhancement ensures better handling of broadcast dimensions and shapes, making your operations run smoother.

  • Bugfix 🔧: We've fixed an integer overflow issue in the TFL::FullyConnectedOp::verify() function by switching to int64_t for storing num_elements. This fix prevents erroneous outputs and ensures accurate calculations even with large tensor sizes.

  • Improvement 🚀: Hot array iterations just got a performance boost with a new templated Array::Each API variation. This change eliminates type-erasure and virtual calls, optimizing those critical code paths for better efficiency.

  • New feature 🌟: Scoped alternate memory allocations can now expand to the biggest free chunk at the end of MSA, improving memory utilization and reducing fragmentation for optimized execution performance.

  • New feature 🆕: Binary elementwise operations can now be directly translated from StableHLO to HLO, broadening the scope of operations and enhancing the efficiency of machine learning models relying on these operations.

  • Improvement 🎉: GPU command buffers are now smarter with automatic inference of command dependencies using an execution graph. Enable xla_gpu_graph_enable_concurrent_region and enjoy more efficient command execution!

  • Chore 🧹: We've removed the pipelining pass from XLA GPU emitters, simplifying the codebase and shifting towards alternative optimization strategies for loop execution.

  • Bugfix 🔧: We've addressed undefined behavior in PJRT by fixing pointer casting issues between unrelated types. This update enhances code safety and correctness, ensuring smooth operations across CPU and GPU implementations.

  • Bugfix 🔄: A regression fix in the XLA collective pipeliner ensures proper handling of scalar constants for padding values. This prevents unnecessary broadcasting and improves the efficiency of dynamic tensor operations.

That's all for now, folks! Enjoy the new features and improvements, and keep coding like a rockstar! 🌟

Included Commits

2025-04-11T00:17:31 See commit

This commit addresses undefined behavior in the PJRT codebase by correcting how pointers are cast between unrelated types. Specifically, it resolves the use of reinterpret_cast to convert a pointer to a custom PJRT extension struct (call it Foo) into a PJRT_Extension_Base*, which is invalid because Foo does not inherit from PJRT_Extension_Base. Since PJRT extension structs are plain C structs and C has no inheritance, the solution is to restructure Foo so that a PJRT_Extension_Base is its first field, making a pointer to the struct safely convertible to a pointer to its base.

The changes made in this commit include updates to various PJRT extension structures and functions, ensuring that they utilize the new base field for proper type casting. This adjustment enhances the safety and correctness of the PJRT code by preventing potential issues arising from invalid pointer conversions. The modifications span several files, impacting both the CPU and GPU implementations of PJRT, and include the addition of necessary fields and adjustments to function calls to reference the new base structure correctly.

Files changed

  • third_party/xla/xla/pjrt/c/pjrt_c_api_cpu_internal.cc
  • third_party/xla/xla/pjrt/c/pjrt_c_api_custom_partitioner_extension.h
  • third_party/xla/xla/pjrt/c/pjrt_c_api_ffi_extension.h
  • third_party/xla/xla/pjrt/c/pjrt_c_api_ffi_internal.cc
  • third_party/xla/xla/pjrt/c/pjrt_c_api_gpu_extension.h
  • third_party/xla/xla/pjrt/c/pjrt_c_api_gpu_internal.cc
  • third_party/xla/xla/pjrt/c/pjrt_c_api_gpu_test.cc
  • third_party/xla/xla/pjrt/c/pjrt_c_api_stream_extension.h
  • third_party/xla/xla/pjrt/c/pjrt_c_api_triton_extension.h
  • third_party/xla/xla/pjrt/c/pjrt_c_api_triton_internal.h
  • third_party/xla/xla/pjrt/pjrt_c_api_client.cc
  • third_party/xla/xla/pjrt/plugin/example_plugin/myplugin_c_pjrt.cc
2025-04-11T19:24:50 See commit

This commit introduces a progress bar to the standard output for long-running matcher processes within the HLO (High-Level Operations) gumgraph matcher. The addition aims to enhance user experience by providing visual feedback on the progress of the matching process, which can be particularly useful during extensive computations. The progress bar is implemented as a simple console output that updates in real-time, displaying the percentage of completion along with a visual representation of the progress.

In terms of code changes, the commit modifies the hlo_gumgraph_matcher.cc file by adding 25 lines of code while removing 2 lines. Key additions include the definition of constants for the progress bar's width and characters, as well as a function to print the progress. The matching loop is updated to calculate and display the current progress based on the number of nodes processed, ensuring that users can monitor the operation's status effectively.

Files changed

  • third_party/xla/xla/hlo/tools/hlo_diff/matchers/hlo_gumgraph_matcher.cc
2025-04-12T00:32:43 See commit

This commit introduces direct translation for several StableHLO operations, specifically BroadcastOp, BroadcastInDimOp, and DynamicBroadcastInDimOp, into the High-Level Operations (HLO) format. Modifications were made to various files, including updates to the HLO conversion allowed operations and the implementation of specific export functions for these operations. The changes ensure that the new operations are recognized and properly handled during the translation process, with specific attention given to the handling of broadcast dimensions and shapes, particularly for the BroadcastInDimOp.

Additionally, the commit includes the addition of a new test file, stablehlo_invalid.mlir, which is designed to check for errors when unsupported operations are encountered during translation. This ensures that the system can gracefully handle cases where certain StableHLO operations do not have a corresponding representation in HLO. Overall, the changes enhance the capability of the translation framework to support a broader range of StableHLO operations while maintaining compliance with HLO's requirements.

Files changed

  • third_party/xla/xla/hlo/translate/mhlo_to_hlo/BUILD
  • third_party/xla/xla/hlo/translate/mhlo_to_hlo/gen_hlo_op_writer.td
  • third_party/xla/xla/hlo/translate/mhlo_to_hlo/mlir_hlo_to_hlo.cc
  • third_party/xla/xla/hlo/translate/tests/BUILD
  • third_party/xla/xla/hlo/translate/tests/stablehlo.mlir
  • third_party/xla/xla/hlo/translate/tests/stablehlo_invalid.mlir
  • third_party/xla/xla/mlir_hlo/mhlo/transforms/stablehlo_legalize_to_hlo/stablehlo_legalize_to_hlo_pass.cc
2025-04-12T03:03:39 See commit

This commit introduces direct translation of unary elementwise operations from StableHLO to HLO in the XLA (Accelerated Linear Algebra) framework. It modifies the translation definitions to include additional StableHLO operations that can be converted directly, without an intermediate hop through the MHLO (MLIR HLO) dialect. Specifically, new operations such as StableHLO_AbsOp, StableHLO_CbrtOp, StableHLO_CeilOp, and several others are added to the list of allowed conversions, enhancing the framework's ability to handle a wider range of unary operations efficiently.

Additionally, the commit includes updates to the implementation files to support these operations, including the conversion of result accuracy attributes and the addition of specific unary operations like cosine, sine, and tangent. The changes are accompanied by new test cases to validate the direct translation functionality, ensuring that the operations yield the expected results in the HLO format. Overall, this enhancement aims to streamline the translation process and improve the performance of numerical computations in XLA by leveraging the capabilities of StableHLO.

Files changed

  • third_party/xla/xla/hlo/translate/mhlo_to_hlo/gen_hlo_op_writer.td
  • third_party/xla/xla/hlo/translate/mhlo_to_hlo/mlir_hlo_to_hlo.cc
  • third_party/xla/xla/hlo/translate/tests/BUILD
  • third_party/xla/xla/hlo/translate/tests/stablehlo_unary_elementwise.mlir
  • third_party/xla/xla/mlir_hlo/mhlo/transforms/stablehlo_legalize_to_hlo/stablehlo_legalize_to_hlo_pass.cc
2025-04-14T07:33:37 See commit

This commit removes the pipelining pass from the XLA GPU emitters, specifically from the file responsible for optimizing loops in the GPU backend. The changes involve deleting a significant amount of code, including functions and structures related to pipelining operations, such as checking dependencies on induction variables and replacing induction variables within loop constructs. The removal of this functionality indicates a shift in the approach to optimizing loop execution, potentially simplifying the codebase and focusing on alternative optimization strategies.

Additionally, the commit modifies test files related to loop optimization, reflecting the removal of pipelining checks and structures. This suggests that the testing framework has been updated to align with the new state of the codebase, ensuring that it accurately evaluates the performance and correctness of loop optimizations without the previously implemented pipelining approach. Overall, the commit signifies a significant change in the optimization strategy for GPU backends in XLA, moving away from pipelining.

Files changed

  • third_party/xla/xla/backends/gpu/codegen/emitters/transforms/optimize_loops.cc
  • third_party/xla/xla/backends/gpu/codegen/emitters/transforms/tests/optimize_loops.mlir
2025-04-15T01:00:50 See commit

This commit introduces direct translation for binary elementwise operations from StableHLO to HLO (High-Level Operations) in the XLA (Accelerated Linear Algebra) library. The changes primarily involve modifying the gen_hlo_op_writer.td file to include several binary operations such as Atan2Op, ComplexOp, DivOp, MaxOp, MinOp, MulOp, PowOp, RemOp, ShiftLeftOp, ShiftRightArithmeticOp, ShiftRightLogicalOp, and SubtractOp into the list of allowed operations for conversion. Additionally, the mlir_hlo_to_hlo.cc file has been updated to implement the logic for the SubtractOp, ensuring that it correctly translates to its HLO equivalent.

The commit also includes updates to the test files to validate the new operations and their translations, ensuring that the binary elementwise operations are correctly represented in the HLO format. The changes enhance the functionality of the XLA library by broadening the scope of operations that can be directly translated, ultimately improving the efficiency and capabilities of machine learning models that rely on these operations.

Files changed

  • third_party/xla/xla/hlo/translate/mhlo_to_hlo/gen_hlo_op_writer.td
  • third_party/xla/xla/hlo/translate/mhlo_to_hlo/mlir_hlo_to_hlo.cc
  • third_party/xla/xla/hlo/translate/tests/stablehlo.mlir
  • third_party/xla/xla/mlir_hlo/mhlo/transforms/stablehlo_legalize_to_hlo/stablehlo_legalize_to_hlo_pass.cc
2025-04-15T01:12:53 See commit

This commit enhances the GPU command buffer functionality within the XLA (Accelerated Linear Algebra) framework by implementing automatic inference of command dependencies using an execution graph. When the configuration option xla_gpu_graph_enable_concurrent_region is enabled, the system will utilize the ExecutionGraph to create a Directed Acyclic Graph (DAG) for the command buffer, improving the efficiency and organization of command execution.

The changes involve modifications across several files, including the command buffer implementation and its associated tests, as well as updates to the execution graph header. This update aims to streamline the handling of command dependencies, potentially leading to better performance in GPU computations by allowing for more concurrent execution of commands.

Files changed

  • third_party/xla/xla/backends/gpu/runtime/BUILD
  • third_party/xla/xla/backends/gpu/runtime/command_buffer_cmd.cc
  • third_party/xla/xla/backends/gpu/runtime/command_buffer_cmd.h
  • third_party/xla/xla/backends/gpu/runtime/command_buffer_cmd_test.cc
  • third_party/xla/xla/backends/gpu/runtime/command_buffer_thunk_test.cc
  • third_party/xla/xla/runtime/execution_graph.h
  • third_party/xla/xla/service/gpu/tests/command_buffer_test.cc
  • third_party/xla/xla/stream_executor/gpu/gpu_command_buffer.cc
2025-04-15T18:50:50 See commit

This commit introduces a new feature to the memory space assignment (MSA) algorithm that allows for the expansion of scoped alternate memory allocations to utilize the largest contiguous free chunk available at the end of the memory space. The implementation iterates through all scoped allocations, identifies live nodes that overlap in time, and calculates the largest available free chunk. It then determines whether to extend the current allocation boundaries or move the allocation to a more optimal position based on the available memory. The feature aims to enhance memory utilization and reduce fragmentation, thereby optimizing performance during execution.

Additionally, changes to the codebase include modifications to several files, such as algorithm.cc and algorithm.h, where new functions and methods are added to facilitate this expansion feature. The commit also includes updates to the protocol definitions and test cases to ensure that the new functionality is properly integrated and validated. Tests are provided to confirm that the MSA correctly allocates and expands memory according to the new logic, ensuring that the expected behavior aligns with the intended optimizations.

Files changed

  • third_party/xla/xla/service/memory_space_assignment/algorithm.cc
  • third_party/xla/xla/service/memory_space_assignment/algorithm.h
  • third_party/xla/xla/service/memory_space_assignment/memory_space_assignment.proto
  • third_party/xla/xla/service/memory_space_assignment/memory_space_assignment_test.cc
  • third_party/xla/xla/service/memory_space_assignment/options.h
2025-04-15T19:36:09 See commit

This commit reverts a previous change (c2a3c368e79be0292faeac380086c42169763908) to address a regression related to the handling of scalar constants in the XLA collective pipeliner. The fix specifically ensures that when dealing with padding values, the code explicitly checks if the constant is meant for padding, thus preventing unnecessary broadcasting of scalar constants.

The modifications involve changes to the logic in various parts of the collective pipeliner code, particularly in functions that handle dynamic slices and shape computations. The code updates aim to improve the handling of zero-dimensional shapes and ensure that padding values are treated correctly without broadcasting, which could lead to inefficiencies or errors in the computation pipeline. The changes include adjustments to how dimensions are accessed and how constants are processed, reflecting a more robust approach to managing scalar constants in the context of dynamic tensor operations.

Files changed

  • third_party/xla/xla/service/collective_pipeliner.cc
2025-04-15T19:37:54 See commit

This commit addresses an integer overflow issue in the verify() function of the TFL::FullyConnectedOp class in TensorFlow Lite. The problem arises when calculating the number of elements in tensors, particularly with large dimensions, which can exceed the maximum value representable by a 32-bit signed integer (2,147,483,647). In a specific case involving a tfl.fully_connected operation, the calculation resulted in 2,148,532,224, leading to incorrect error messages during the TFL Converter process. To resolve this, the code has been modified to use int64_t for storing the number of input and output elements, thus preventing overflow and ensuring accurate calculations.

Additionally, the commit includes updates to the test cases to accommodate the changes made in the data type. A new function, fully_connected_with_int64_num_elements, has been added to demonstrate the handling of large tensor dimensions without encountering overflow issues. Overall, this fix enhances the robustness of the FullyConnectedOp operation by ensuring it can handle larger tensor sizes without generating erroneous outputs.

Files changed

  • tensorflow/compiler/mlir/lite/ir/tfl_ops.cc
  • tensorflow/compiler/mlir/lite/tests/ops.mlir
2025-04-16T00:25:07 See commit

The recent commit introduces a new templated variant of the Array::Each API designed to enhance performance by eliminating type-erasure and the overhead of virtual function calls for each element during array iterations. This change is particularly beneficial for "hot" code paths where efficiency is critical. The new TemplatedEach function allows users to pass a callback that operates on the indices and values of the array elements, both for mutable and immutable contexts.

In addition to the core implementation in the array.h file, the commit updates existing code in hlo_sharding.cc and tile_assignment.h to utilize the new TemplatedEach method instead of the traditional Each method. This transition aims to improve performance in various components of the XLA (Accelerated Linear Algebra) library by reducing unnecessary overhead during array manipulations. Overall, these changes reflect a focused effort to optimize performance in the XLA codebase.

Files changed

  • third_party/xla/xla/array.h
  • third_party/xla/xla/hlo/ir/hlo_sharding.cc
  • third_party/xla/xla/hlo/ir/tile_assignment.h
2025-04-16T23:18:25 See commit

This commit introduces a new XlaOp for a custom combiner backward (BWD) pass in TensorFlow, enhancing the framework's capabilities for handling sparse-dense matrix multiplication operations. The changes primarily involve modifications to several files related to MLIR (Multi-Level Intermediate Representation) and XLA (Accelerated Linear Algebra), including updates to operation definitions, legalization configurations, and tests to ensure functionality. Additionally, a new protocol buffer definition for the custom combiner operation is added, which facilitates its integration within the TensorFlow ecosystem.

Overall, the commit reflects ongoing improvements to TensorFlow's support for advanced operations in machine learning, particularly in the context of TPU (Tensor Processing Unit) optimizations and sparse tensor handling. The updates span multiple components of the codebase, underscoring the importance of this feature for efficient computation in deep learning models that leverage sparse data structures.

Files changed

  • tensorflow/compiler/mlir/tensorflow/ir/tf_ops.td
  • tensorflow/compiler/mlir/tf2xla/transforms/legalization_op_config.cc
  • tensorflow/compiler/mlir/tf2xla/transforms/legalization_op_config_test.cc
  • tensorflow/core/api_def/base_api/api_def_XlaSparseDenseMatmulCustomCombinerOnTcGradWithCsrInput.pbtxt
  • tensorflow/core/tpu/kernels/BUILD
  • tensorflow/core/tpu/kernels/sparse_core_xla_ops.cc
  • tensorflow/core/tpu/ops/sparse_core_ops.cc
  • tensorflow/python/tpu/ops/BUILD
  • tensorflow/tools/api/golden/v1/tensorflow.raw_ops.pbtxt