TensorFlow Changelog


Here's a summary of the recent updates and changes we've made. We've been busy enhancing performance, adding new features, and squashing bugs to make everything run more smoothly and efficiently. Let's dive into the details! πŸš€

  • Improvement: Enhanced Resource Calculation for Scheduling Groups
    We've fine-tuned the resource calculation for scheduling groups with the "keep_original_sequence_order_in_group" attribute. Now, the scheduler maintains the original sequence of instructions while accurately tracking resource usage, thanks to the new GetNumResourcesNeededForAnnotationWithKeepOriginalOrderAttrs function. Comprehensive tests ensure precision in resource calculations, even under different resource limits. 🎯

  • Improvement: Optimized Hadamard Rotation Algorithm
    The Hadamard rotation algorithm in TensorFlow Lite got a turbo boost! The introduction of FWHTGeneral and FWHTFast functions has enhanced performance, especially for larger sizes. These changes mean faster Hadamard rotations, making your TensorFlow Lite applications zippier than ever. ⚑️

  • New Feature: Dynamic Registration Helper
    Say hello to REGISTER_DYNAMIC, a new helper that complements REGISTER_PJRT_PLUGIN. It allows developers to dynamically load shared object files based on environment variables, simplifying plugin integration and management. πŸŽ‰

  • New Feature: Precision Test for XLA:GPU
    We've added a test that checks how precision drops with increasing K dimension sizes in dot algorithms. This test helps us understand precision degradation and ensures computations remain accurate as the size scales. πŸ“Š

  • New Feature: HloProgram Serialization Methods
    Introducing HloProgram::ToBytes() and HloProgram::FromBytes(), ensuring an exact serialization/deserialization roundtrip. These methods are perfect for specific use cases that require identical program results, although they aren't version-compatible. πŸ”„

  • Improvement: XLA:CPU Exponential Function Optimization
    The xla.exp operation now runs like a dream in the XLA CPU backend. We've optimized the exponential function calls for F64, resulting in massive performance improvementsβ€”up to 85% faster in some cases! πŸš€

  • New Feature: NCCL Net Plugin XPlane IDs Reserved
    We've reserved the last 100 custom XPlane IDs for the NCCL Net Plugin, improving profiling capabilities and ID space management within the XLA framework. πŸ› οΈ

  • New Feature: ComputePeakMemory Method
    The buffer assignment API now includes a ComputePeakMemory method, accurately calculating peak memory usage. This addition enhances memory management and robustness with extensive unit tests covering various scenarios. 🧠

  • Bugfix: Revert Host Platform Configuration
    We reverted a previous change to address Android build issues, refining platform-specific settings to ensure a stable and predictable build outcome across environments. πŸ”§

  • Bugfix: 256-Byte Alignment for cuBLAS Compatibility
    To avoid breakages with cuBLAS 12.9.1.4, we've implemented a 256-byte alignment across tests and components, ensuring compatibility and performance in GPU operations. πŸ–₯️

  • Bugfix: Delegate Closure Order
    We've reordered operations in TensorFlow Lite's close() method to prevent use-after-free errors by ensuring delegates are closed before the model handle is deleted. πŸ”’

  • Chore: Removed Empty test_macros.h
    We've cleaned up the codebase by deleting all uses of the now-empty test_macros.h file, keeping things neat and tidy. 🧹

These updates reflect our ongoing commitment to delivering a robust, high-performance experience. Keep an eye out for more exciting changes coming your way! 🌟

Included Commits

2025-06-13T17:50:56 See commit

This commit removes the now-empty test_macros.h file, along with every remaining reference to it, from the codebase. The deletion affects numerous test files across various directories, including those related to XLA (Accelerated Linear Algebra) in both GPU and CPU contexts, as well as several other components and utilities within the project.

As a result of this change, multiple test files have been modified to eliminate dependencies on the now-redundant header file. This streamlining effort helps to clean up the codebase, ensuring that it remains efficient and maintainable by removing unnecessary components. The commit includes updates to a wide range of test files, indicating a thorough review of the affected areas to ensure functionality is preserved without the obsolete header.

Files changed

  • third_party/xla/xla/backends/gpu/runtime/gpublas_lt_matmul_thunk_test.cc
  • third_party/xla/xla/hlo/builder/lib/BUILD
  • third_party/xla/xla/hlo/builder/lib/constants_test.cc
  • third_party/xla/xla/hlo/builder/lib/logdet_test.cc
  • third_party/xla/xla/hlo/builder/lib/math_test.cc
  • third_party/xla/xla/hlo/builder/lib/qr_test.cc
  • third_party/xla/xla/hlo/builder/lib/quantize_test.cc
  • third_party/xla/xla/hlo/builder/lib/self_adjoint_eig_test.cc
  • third_party/xla/xla/hlo/builder/lib/slicing_test.cc
  • third_party/xla/xla/hlo/builder/lib/sorting_test.cc
  • third_party/xla/xla/hlo/builder/lib/svd_test.cc
  • third_party/xla/xla/hlo/builder/lib/tridiagonal_test.cc
  • third_party/xla/xla/hlo/builder/lib/tuple_test.cc
  • third_party/xla/xla/service/BUILD
  • third_party/xla/xla/service/compiler_test.cc
  • third_party/xla/xla/service/cpu/tests/BUILD
  • third_party/xla/xla/service/cpu/tests/onednn_convolution_test.cc
  • third_party/xla/xla/service/cpu/tests/onednn_matmul_test.cc
  • third_party/xla/xla/service/cpu/tests/onednn_softmax_test.cc
  • third_party/xla/xla/service/dynamic_padder_test.cc
  • third_party/xla/xla/service/dynamic_update_slice_test.cc
  • third_party/xla/xla/service/elemental_ir_emitter_test.cc
  • third_party/xla/xla/service/gather_expander_test.cc
  • third_party/xla/xla/service/gpu/BUILD
  • third_party/xla/xla/service/gpu/conv_layout_normalization_test.cc
  • third_party/xla/xla/service/gpu/tests/BUILD
  • third_party/xla/xla/service/gpu/tests/gpu_fused_mha_test.cc
  • third_party/xla/xla/service/gpu/transforms/collectives/BUILD
  • third_party/xla/xla/service/gpu/transforms/collectives/async_collective_annotator_test.cc
  • third_party/xla/xla/tests/BUILD
  • third_party/xla/xla/tests/all_reduce_test.cc
  • third_party/xla/xla/tests/array_elementwise_ops_test.cc
  • third_party/xla/xla/tests/batch_norm_grad_test.cc
  • third_party/xla/xla/tests/batch_norm_training_test.cc
  • third_party/xla/xla/tests/batch_normalization_test.cc
  • third_party/xla/xla/tests/bfloat16_test.cc
  • third_party/xla/xla/tests/broadcast_simple_test.cc
  • third_party/xla/xla/tests/broadcast_test.cc
  • third_party/xla/xla/tests/cholesky_test.cc
  • third_party/xla/xla/tests/client_test.cc
  • third_party/xla/xla/tests/collective_ops_test.cc
  • third_party/xla/xla/tests/collective_pipeline_parallelism_test.cc
  • third_party/xla/xla/tests/compute_constant_test.cc
  • third_party/xla/xla/tests/concat_test.cc
  • third_party/xla/xla/tests/conditional_test.cc
  • third_party/xla/xla/tests/constants_test.cc
  • third_party/xla/xla/tests/conv_depthwise_backprop_filter_test.cc
  • third_party/xla/xla/tests/conv_depthwise_test.cc
  • third_party/xla/xla/tests/convert_test.cc
  • third_party/xla/xla/tests/convolution_cudnn_test.cc
  • third_party/xla/xla/tests/convolution_test.cc
  • third_party/xla/xla/tests/convolution_test_1d.cc
  • third_party/xla/xla/tests/copy_test.cc
  • third_party/xla/xla/tests/cpu_gpu_fusion_test.cc
  • third_party/xla/xla/tests/custom_call_test.cc
  • third_party/xla/xla/tests/deallocation_test.cc
  • third_party/xla/xla/tests/deconstruct_tuple_test.cc
  • third_party/xla/xla/tests/dot_operation_test.cc
  • third_party/xla/xla/tests/dynamic_reshape_test.cc
  • third_party/xla/xla/tests/exhaustive/BUILD
  • third_party/xla/xla/tests/exhaustive/exhaustive_binary_test_definitions.h
  • third_party/xla/xla/tests/exhaustive/exhaustive_unary_complex_test.cc
  • third_party/xla/xla/tests/exhaustive/exhaustive_unary_test_definitions.h
  • third_party/xla/xla/tests/fft_test.cc
  • third_party/xla/xla/tests/float8_test.cc
  • third_party/xla/xla/tests/gather_operation_test.cc
  • third_party/xla/xla/tests/get_dimension_size_test.cc
  • third_party/xla/xla/tests/grouped_convolution_test.cc
  • third_party/xla/xla/tests/half_test.cc
  • third_party/xla/xla/tests/int4_test.cc
  • third_party/xla/xla/tests/iota_test.cc
  • third_party/xla/xla/tests/local_client_allocation_test.cc
  • third_party/xla/xla/tests/local_client_execute_test.cc
  • third_party/xla/xla/tests/map_test.cc
  • third_party/xla/xla/tests/matmul_test.cc
  • third_party/xla/xla/tests/matrix_ops_simple_test.cc
  • third_party/xla/xla/tests/multioutput_fusion_test.cc
  • third_party/xla/xla/tests/multithreaded_compilation_test.cc
  • third_party/xla/xla/tests/nccl_group_execution_test.cc
  • third_party/xla/xla/tests/numerics_test.cc
  • third_party/xla/xla/tests/outfeed_in_nested_computation_test.cc
  • third_party/xla/xla/tests/pad_test.cc
  • third_party/xla/xla/tests/params_test.cc
  • third_party/xla/xla/tests/plugin.bzl
  • third_party/xla/xla/tests/prng_test.cc
  • third_party/xla/xla/tests/reduce_hlo_test.cc
  • third_party/xla/xla/tests/reduce_precision_test.cc
  • third_party/xla/xla/tests/reduce_test.cc
  • third_party/xla/xla/tests/reduce_window_rewriter_execution_test.cc
  • third_party/xla/xla/tests/reduce_window_test.cc
  • third_party/xla/xla/tests/replay_test.cc
  • third_party/xla/xla/tests/reshape_motion_test.cc
  • third_party/xla/xla/tests/rng_test.cc
  • third_party/xla/xla/tests/round_trip_packed_literal_test.cc
  • third_party/xla/xla/tests/round_trip_transfer_test.cc
  • third_party/xla/xla/tests/runtime_topk_test.cc
  • third_party/xla/xla/tests/scalar_computations_test.cc
  • third_party/xla/xla/tests/scatter_test.cc
  • third_party/xla/xla/tests/select_and_scatter_test.cc
  • third_party/xla/xla/tests/select_test.cc
  • third_party/xla/xla/tests/set_dimension_size_test.cc
  • third_party/xla/xla/tests/slice_test.cc
  • third_party/xla/xla/tests/stochastic_convert_test.cc
  • third_party/xla/xla/tests/test_utils_test.cc
  • third_party/xla/xla/tests/token_hlo_test.cc
  • third_party/xla/xla/tests/transfer_manager_test.cc
  • third_party/xla/xla/tests/transpose_test.cc
  • third_party/xla/xla/tests/vector_ops_reduce_test.cc
  • third_party/xla/xla/tests/while_test.cc
2025-06-13T18:01:36 See commit

This commit reverts a previous change (commit 0acc54bd1bb16186eb04b4681395b58935da52d0), modifying the .bazelrc and tensorflow/BUILD files to adjust the build configuration for several platforms: Android, Emscripten, iOS, and ChromiumOS. The primary purpose of the revert is to remove the --host_platform lines added earlier, which were causing incorrect dependency resolution in the Android build.

In addition to reverting these changes, the commit introduces conditional configuration settings using the if_google and if_oss constructs to better manage platform-specific settings based on whether the build is occurring in a Google or open-source environment. This adjustment aims to streamline the build process and mitigate potential issues arising from platform-specific dependencies, ensuring a more stable and predictable build outcome across different environments.

Files changed

  • .bazelrc
  • tensorflow/BUILD
2025-06-13T18:59:10 See commit

This commit enhances the XLA (Accelerated Linear Algebra) CPU backend by implementing the xla.exp operation within the legacy pipeline. The changes include modifications to the elemental_ir_emitter to support the emission of exponential function calls directly, significantly optimizing the performance of the exp operation for double precision floating-point numbers (F64). The new implementation demonstrates substantial performance improvements across various benchmark tests, with reductions in processing time ranging from approximately 62% to 85%, depending on the input size.

Additionally, the commit introduces corresponding changes to the build configuration and test cases to ensure proper integration and validation of the new functionality. The benchmarks indicate that the new implementation not only optimizes the execution speed but also maintains compatibility with existing functionalities, as evidenced by the thorough testing of intrinsic calls. Overall, this commit represents a significant step forward in enhancing the efficiency of mathematical operations within the XLA framework for CPU execution.

Files changed

  • third_party/xla/xla/service/cpu/BUILD
  • third_party/xla/xla/service/cpu/elemental_ir_emitter.cc
  • third_party/xla/xla/service/cpu/elemental_ir_emitter.h
  • third_party/xla/xla/service/cpu/tests/cpu_intrinsic_test.cc
2025-06-13T22:25:47 See commit

This commit introduces a refined method for calculating resource requirements in scheduling groups that have the "keep_original_sequence_order_in_group" attribute. It modifies the existing resource calculation logic to ensure that when this attribute is present, the scheduler maintains the original sequence of instructions while accurately tracking resource usage. The new function, GetNumResourcesNeededForAnnotationWithKeepOriginalOrderAttrs, sorts the instructions based on their original positions and updates resource counts accordingly. This ensures that the resource allocation reflects the constraints imposed by maintaining the original sequence order.

Additionally, the commit includes comprehensive tests to validate this new functionality. The tests ensure that resource calculations are precise for both overlapping and non-overlapping collective permutations. These tests check that the scheduler behaves correctly under different resource limits, confirming that the original sequence order is preserved in the scheduling output. Overall, this enhancement aims to improve the efficiency and correctness of scheduling in scenarios where instruction order is critical.
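The core idea can be sketched with a minimal model. This is not XLA's actual implementation: the `Instr` struct, the acquire/release maps, and the function name below are illustrative stand-ins. When the original order must be kept, the group's requirement per resource is simply the peak simultaneous usage over that fixed order, so overlapping collective permutes (start/start/done/done) need two units while non-overlapping ones (start/done/start/done) need one:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-ins for the scheduler's types.
struct Instr {
  int original_position;        // position in the unscheduled sequence
  std::map<int, int> acquires;  // resource id -> units acquired (e.g. a start op)
  std::map<int, int> releases;  // resource id -> units released (e.g. a done op)
};

// Sketch: sort by original position, then walk the fixed order tracking how
// many units of each resource are simultaneously live, keeping the maximum.
std::map<int, int> PeakResourcesInOriginalOrder(std::vector<Instr> group) {
  std::sort(group.begin(), group.end(),
            [](const Instr& a, const Instr& b) {
              return a.original_position < b.original_position;
            });
  std::map<int, int> live, peak;
  for (const Instr& instr : group) {
    for (auto [res, n] : instr.acquires) {
      live[res] += n;
      peak[res] = std::max(peak[res], live[res]);
    }
    for (auto [res, n] : instr.releases) live[res] -= n;
  }
  return peak;
}
```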

Files changed

  • third_party/xla/xla/service/latency_hiding_scheduler.cc
  • third_party/xla/xla/service/latency_hiding_scheduler_test.cc
2025-06-16T15:37:14 See commit

This commit reserves the last 100 custom XPlane IDs specifically for the NCCL (NVIDIA Collective Communications Library) Net Plugin. It introduces new constants in the trace_utils.h file that define the range of these IDs, ensuring they do not conflict with existing identifiers. The new constants include kMaxNcclPlanes, kFirstNcclPlaneId, and kLastNcclPlaneId, which are calculated based on the maximum number of custom plane devices allowed per host.

Additionally, the commit modifies the BUILD file to include a new visibility entry for the CoMMA package, which is likely related to the NCCL functionality. Overall, these changes enhance the profiling capabilities of the NCCL plugin by clearly delineating the ID space allocated for its use, thereby improving the organization and management of custom plane IDs within the XLA (Accelerated Linear Algebra) framework.
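The reservation pattern looks roughly like the following. The numeric values here are assumptions for illustration; the real bounds live in trace_utils.h and may differ. Only the derivation matters: the last 100 IDs of the custom-plane range are carved out for NCCL, so the two ranges cannot overlap:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative values, not the real ones from trace_utils.h.
constexpr uint32_t kMaxCustomPlaneDevicesPerHost = 1000;  // assumed
constexpr uint32_t kFirstCustomPlaneId = 1000;            // assumed base
constexpr uint32_t kLastCustomPlaneId =
    kFirstCustomPlaneId + kMaxCustomPlaneDevicesPerHost - 1;

// The commit's idea: reserve the *last* 100 IDs of the custom range for the
// NCCL Net Plugin so they cannot collide with other custom planes.
constexpr uint32_t kMaxNcclPlanes = 100;
constexpr uint32_t kLastNcclPlaneId = kLastCustomPlaneId;
constexpr uint32_t kFirstNcclPlaneId = kLastNcclPlaneId - kMaxNcclPlanes + 1;

static_assert(kFirstNcclPlaneId > kFirstCustomPlaneId,
              "NCCL range must stay inside the custom-plane ID space");
static_assert(kLastNcclPlaneId - kFirstNcclPlaneId + 1 == kMaxNcclPlanes,
              "range covers exactly kMaxNcclPlanes IDs");
```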

Files changed

  • third_party/xla/xla/tsl/profiler/utils/BUILD
  • third_party/xla/xla/tsl/profiler/utils/trace_utils.h
2025-06-16T22:40:46 See commit

The commit addresses a potential use-after-free (UAF) issue in the TensorFlow Lite Java implementation by modifying the order of operations in the close() method of the NativeInterpreterWrapper class. Specifically, it ensures that delegates are closed before the model handle is deleted. This change is crucial because delegates may reference the model, and closing them after the model has been deleted could lead to undefined behavior.

In the updated code, the delegates are cleared and closed first, followed by the deletion of the model handle and other related resources. This adjustment improves the safety and stability of the code by preventing any lingering references to the model after its deletion, thereby mitigating the risk of UAF errors. The commit touches 11 lines in total (6 additions, 5 deletions), reflecting this reordering of operations.
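The actual fix is in Java (NativeInterpreterWrapper.close()), but the ordering invariant is language-independent. Here is a hedged C++ analogue, with all types invented for illustration: delegates hold non-owning references into the model, so teardown must close them before the model is freed:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy model of the ownership relationship; not TFLite's real types.
struct Model {
  bool alive = true;
};

struct Delegate {
  Model* model;  // non-owning reference into the model
  void Close() {
    // If the model were freed first, this access would be a use-after-free.
    assert(model->alive && "delegate closed after model was freed");
  }
};

struct InterpreterWrapper {
  std::unique_ptr<Model> model = std::make_unique<Model>();
  std::vector<Delegate> delegates;

  void Close() {
    // Correct order (mirroring the fix): close delegates first, then free
    // the model handle and related resources.
    for (Delegate& d : delegates) d.Close();
    delegates.clear();
    model->alive = false;
    model.reset();
  }
};
```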

Files changed

  • tensorflow/lite/java/src/main/java/org/tensorflow/lite/NativeInterpreterWrapper.java
2025-06-17T03:33:52 See commit

The recent commit introduces a new method, ComputePeakMemory, to the buffer assignment API, designed to calculate the peak memory usage based on a provided BufferAssignmentProto. This method implements a BufferMap structure that maps logical buffer IDs to their respective sizes and reference counts. It processes events from the heap simulator traces, tracking memory allocation, deallocation, and sharing of buffers. The peak memory usage is updated dynamically as events are processed, ensuring accurate reporting of memory consumption during the simulation.

Additionally, the commit includes extensive unit tests for the ComputePeakMemory function, covering various scenarios such as simple allocations, multiple buffers, shared buffers, and edge cases like empty protos or protos without events. These tests validate the correctness of the memory calculations, ensuring that the new method behaves as expected under different conditions, thereby enhancing the robustness of the buffer assignment API.
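A simplified version of the computation might look like this. The event and buffer types are illustrative stand-ins (the real method consumes a BufferAssignmentProto with heap simulator traces): walk the trace, keep live bytes with a refcount so shared buffers are counted once, and track the running maximum:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Simplified stand-ins for the heap-simulator trace events.
enum class EventKind { kAlloc, kFree, kShare };

struct Event {
  EventKind kind;
  int64_t buffer_id;
  int64_t size;         // used by kAlloc
  int64_t shared_with;  // used by kShare: id of the canonical buffer
};

// Sketch of the peak-memory computation.
int64_t ComputePeakMemorySketch(const std::vector<Event>& trace) {
  struct Buf {
    int64_t size = 0;
    int refs = 0;
  };
  std::map<int64_t, Buf> live;           // canonical buffer id -> state
  std::map<int64_t, int64_t> canonical;  // buffer id -> canonical id
  int64_t current = 0, peak = 0;
  for (const Event& e : trace) {
    switch (e.kind) {
      case EventKind::kAlloc:
        canonical[e.buffer_id] = e.buffer_id;
        live[e.buffer_id] = {e.size, 1};
        current += e.size;
        break;
      case EventKind::kShare:
        // Sharing adds a reference but no new bytes.
        canonical[e.buffer_id] = canonical[e.shared_with];
        live[canonical[e.buffer_id]].refs++;
        break;
      case EventKind::kFree: {
        Buf& b = live[canonical[e.buffer_id]];
        if (--b.refs == 0) current -= b.size;  // last reference frees bytes
        break;
      }
    }
    peak = std::max(peak, current);
  }
  return peak;
}
```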

Files changed

  • third_party/xla/xla/service/buffer_assignment.cc
  • third_party/xla/xla/service/buffer_assignment.h
  • third_party/xla/xla/service/buffer_assignment_test.cc
2025-06-17T17:27:49 See commit

This commit addresses compatibility issues with cuBLAS version 12.9.1.4 by implementing a 256-byte alignment across various tests and components within the XLA project. The changes were necessary to prevent failures in several tests, including gpublas_lt_matmul_thunk_test_nvgpu_any and gpu_compiler_test_nvgpu_any, which were identified as problematic without this adjustment.

Dimitris Vardoulakis made multiple modifications to ensure that the alignment is consistently set to 256 bytes in relevant test files, including calling_convention.hlo and gpu_alignment_test.cc. The adjustments were made to facilitate proper functionality and performance in GPU-related operations, ultimately leading to the successful resolution of the issues outlined in pull request #27784.
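The change itself is a constant bump inside XLA (gpu_constants.h and the tests that hard-code the value); this host-side sketch only illustrates what a 256-byte alignment requirement means for an allocator and how one would verify it. The constant name is borrowed for illustration and may not match the real one:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

constexpr size_t kXlaAllocatedBufferAlignBytes = 256;  // value per the commit

void* AllocAligned(size_t bytes) {
  // std::aligned_alloc requires the size to be a multiple of the alignment,
  // so round the request up first.
  size_t rounded =
      (bytes + kXlaAllocatedBufferAlignBytes - 1) /
      kXlaAllocatedBufferAlignBytes * kXlaAllocatedBufferAlignBytes;
  return std::aligned_alloc(kXlaAllocatedBufferAlignBytes, rounded);
}

bool IsAligned(const void* p) {
  return reinterpret_cast<uintptr_t>(p) % kXlaAllocatedBufferAlignBytes == 0;
}
```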

Files changed

  • third_party/xla/xla/service/gpu/gpu_constants.h
  • third_party/xla/xla/service/gpu/tests/calling_convention.hlo
  • third_party/xla/xla/service/gpu/tests/gpu_alignment_test.cc
  • third_party/xla/xla/service/gpu/tests/gpu_noalias_test.cc
  • third_party/xla/xla/service/gpu/tests/kernel_reuse.hlo
  • third_party/xla/xla/service/gpu/transforms/dynamic_slice_fusion_rewriter_test.cc
2025-06-17T19:24:06 See commit

The recent commit focuses on optimizing the Hadamard rotation algorithm within TensorFlow Lite's kernel implementation. Key changes include the introduction of a new function, FWHTGeneral, which allows for an optional normalization of the Fast Walsh Hadamard Transform (FWHT). This function checks for input size validity and applies normalization if specified. Additionally, a new FWHTFast function has been implemented, which utilizes loop unrolling for sizes of 16 or greater, enhancing performance by processing data in chunks and optimizing computational efficiency.

The modifications resulted in 54 lines of code added and 5 lines removed, reflecting a significant update to the algorithm's structure. The commit also includes adjustments in the Prepare and Eval functions, where the new FWHT methods are conditionally called based on the size of the Hadamard transform, ensuring that smaller sizes utilize the general method while larger sizes benefit from the optimized fast method. Overall, these changes aim to improve the speed and efficiency of Hadamard rotations in TensorFlow Lite applications.
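For reference, the plain (non-unrolled) Fast Walsh-Hadamard Transform looks like the sketch below; this is what FWHTGeneral computes conceptually, while the commit's FWHTFast adds loop unrolling for sizes of 16 or greater and produces the same values. Normalization conventions vary (1/n vs 1/sqrt(n)); the 1/n scaling here is an assumption for illustration. `data.size()` must be a power of two:

```cpp
#include <cstddef>
#include <vector>

// In-place FWHT via the standard butterfly recursion.
void Fwht(std::vector<float>& data, bool normalize) {
  const size_t n = data.size();
  for (size_t h = 1; h < n; h *= 2) {
    for (size_t i = 0; i < n; i += 2 * h) {
      for (size_t j = i; j < i + h; ++j) {
        float x = data[j], y = data[j + h];
        data[j] = x + y;      // butterfly: sum
        data[j + h] = x - y;  // butterfly: difference
      }
    }
  }
  if (normalize) {
    for (float& v : data) v /= static_cast<float>(n);  // optional 1/n scaling
  }
}
```

A useful sanity check is that applying the unnormalized transform twice multiplies the input by n, since the Hadamard matrix H satisfies H·H = n·I.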

Files changed

  • tensorflow/lite/kernels/hadamard_rotation.cc
2025-06-17T22:44:24 See commit

This commit introduces a new dynamic registration helper, designated as REGISTER_DYNAMIC, to complement the existing REGISTER_PJRT_PLUGIN. This enhancement allows developers to create a build target that dynamically loads their shared object (.so) files based on an environment variable that specifies the path to the PJRT plugin. The addition aims to simplify the integration of plugins, making it easier for developers to manage their custom implementations.

The commit includes modifications to several files within the third_party/xla/xla/pjrt/plugin directory, such as the creation of new source and header files for dynamic registration and static registration, as well as updates to existing plugin examples. The changes reflect a significant improvement in the plugin system, facilitating more flexible and dynamic plugin loading capabilities for developers working with PJRT.
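At its core, a helper like this resolves the plugin path from an environment variable and dlopens it. The function name, environment-variable convention, and return type below are illustrative assumptions, not the real REGISTER_DYNAMIC API:

```cpp
#include <cstdlib>
#include <dlfcn.h>
#include <string>

// Hypothetical sketch of the dynamic-loading step behind a
// REGISTER_DYNAMIC-style helper.
void* LoadPjrtPluginFromEnv(const char* env_var) {
  const char* path = std::getenv(env_var);
  if (path == nullptr || std::string(path).empty()) {
    return nullptr;  // no plugin configured
  }
  // RTLD_NOW surfaces unresolved symbols at load time rather than on call.
  return dlopen(path, RTLD_NOW | RTLD_LOCAL);
}
```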

Files changed

  • third_party/xla/xla/pjrt/plugin/BUILD
  • third_party/xla/xla/pjrt/plugin/dynamic_registration.cc
  • third_party/xla/xla/pjrt/plugin/dynamic_registration.h
  • third_party/xla/xla/pjrt/plugin/example_plugin/BUILD
  • third_party/xla/xla/pjrt/plugin/example_plugin/myplugin_c_pjrt.cc
  • third_party/xla/xla/pjrt/plugin/example_plugin/myplugin_c_pjrt_internal.cc
  • third_party/xla/xla/pjrt/plugin/example_plugin/myplugin_c_pjrt_internal.h
  • third_party/xla/xla/pjrt/plugin/example_plugin/myplugin_c_pjrt_test.cc
  • third_party/xla/xla/pjrt/plugin/example_plugin/myplugin_cpp_pjrt.cc
  • third_party/xla/xla/pjrt/plugin/example_plugin/myplugin_cpp_pjrt_test.cc
  • third_party/xla/xla/pjrt/plugin/example_plugin/myplugin_dynamic_registration.cc
  • third_party/xla/xla/pjrt/plugin/static_registration.cc
  • third_party/xla/xla/pjrt/plugin/static_registration.h
  • third_party/xla/xla/pjrt/plugin/xla_cpu/BUILD
2025-06-19T03:25:05 See commit

This commit introduces two new methods, HloProgram::ToBytes() and HloProgram::FromBytes(), to the HloProgram class, aimed at providing precise serialization and deserialization of HLO programs. These methods ensure that the roundtrip process results in an identical program, which is crucial for specific use cases. However, they are not intended to replace the existing HloProgramSerDes, as the serialized outputs from these new methods lack version compatibility. Therefore, users are encouraged to continue using HloProgramSerDes for general serialization needs.

The implementation includes modifications to several files, adding functionality to serialize HloProgram objects into a byte format and reconstruct them from this format. The commit also includes a test case to verify that the serialization and deserialization process maintains the integrity of the original program by comparing their fingerprints. Overall, this addition enhances the flexibility of HloProgram handling while maintaining a clear distinction between different serialization methods.
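The contract being tested can be illustrated with a toy program type; HloProgram's real serialization is far richer, and everything below is a stand-in. The invariant is the same, though: FromBytes(ToBytes(p)) must reproduce a byte-identical program, which a fingerprint comparison (as in the commit's test) catches:

```cpp
#include <functional>
#include <string>

// Toy stand-in for HloProgram; only the roundtrip contract is modeled.
struct Program {
  std::string text;  // stand-in for the program contents

  std::string ToBytes() const { return text; }
  static Program FromBytes(const std::string& bytes) {
    return Program{bytes};
  }
  // Fingerprints of the original and roundtripped program must match.
  size_t Fingerprint() const { return std::hash<std::string>{}(text); }
};
```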

Files changed

  • third_party/xla/xla/python/ifrt/hlo/BUILD
  • third_party/xla/xla/python/ifrt/hlo/hlo_program.cc
  • third_party/xla/xla/python/ifrt/hlo/hlo_program.h
  • third_party/xla/xla/python/ifrt/hlo/hlo_program_test.cc
2025-06-19T08:03:52 See commit

This commit introduces a new test for the XLA GPU backend that assesses how the precision of dot product operations is affected by increasing sizes of the K dimension. The test, titled "CheckPrecisionDegradationAlongKDimension," aims to quantify the maximum absolute relative errors as the K dimension scales from a minimum of 64 to a maximum of 1,048,576, while keeping the M and N dimensions fixed at 32. The results of this test will help understand the degradation of precision in computations as the size of the contracting dimension increases.

To facilitate this, the code modifications include enhancements to error calculation and reporting, as well as ensuring that test arguments remain non-negative to avoid issues with relative error calculations. The test outputs its findings in a CSV format, detailing the maximum and standard deviation of the relative errors for different K sizes. Additionally, it includes conditions to skip the test under certain configurations, such as when using the ROCM backend or when the backend is not Triton. This structured approach aims to provide valuable insights into the performance and precision characteristics of different dot algorithms under varying conditions.
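The measurement itself can be sketched as follows (a simplification of what the test does; the real test runs the dot through the Triton backend, whereas this sketch just contrasts f32 accumulation with an f64 reference). Keeping inputs non-negative, as the test does, keeps the reference sum away from zero so the relative error stays well-defined:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Relative error of an f32-accumulated dot product against an f64 reference.
// Error grows with K because each of the K additions rounds in f32.
double DotRelativeError(const std::vector<float>& a,
                        const std::vector<float>& b) {
  float approx = 0.0f;
  double exact = 0.0;
  for (size_t i = 0; i < a.size(); ++i) {
    approx += a[i] * b[i];  // f32 accumulation
    exact += static_cast<double>(a[i]) * static_cast<double>(b[i]);
  }
  return std::abs((static_cast<double>(approx) - exact) / exact);
}
```

Sweeping K from 64 up to 1,048,576 with fixed inputs shows the error climbing with the contracting dimension, which is the degradation curve the test reports in CSV form.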

Files changed

  • third_party/xla/xla/backends/gpu/codegen/triton/dot_algorithms_test.cc