TensorFlow changelog
Here's a fresh batch of updates and enhancements to keep your codebase running smoother than ever!
New Features
- BufferFromHostLiteral in CommonPjRtClient: A shiny new method is here to create buffers from host literals, complete with error handling and device memory management. Perfect for those who love seamless data transitions in machine learning tasks!
- Scheduling Annotation in XLA Collective Pipeliner: Now you can schedule operations across loop iterations with the new _scheduling_group_id=<group_id>:<iteration_id> attribute. This makes optimizing performance a breeze!
- CommonAsyncHostToDeviceTransferManager: Say goodbye to redundant implementations! This new manager handles asynchronous transfers using raw buffers, simplifying backend processes.
- DiffResult Serialization/Deserialization: The HLO diff tool gets a boost with new serialization capabilities, making data interchange and storage a piece of cake.
- XLA Microbenchmarking Utilities: A set of C++ utilities has been added to set up a microbenchmarking pipeline, ensuring your performance evaluations are top-notch.
Improvements
- Sharding Devices in XlaCompileOptions: Enhancements to support MPMD parallelism in McJAX, ensuring complex parallelism scenarios are handled with finesse.
- Variable Ops in XNNPACK Delegate: The implementation now uses TFLite storage, fixing visibility issues and simplifying the architecture.
- Profiling Context in Runtime Library: Enhanced profiling capabilities with a new context to manage profiling info across device execution threads.
Bugfixes
- cuDNN Command Buffer: Fixed incorrect updates in the XLA GPU backend, ensuring GPU operations run smoothly.
- Data Race in ObjectPool: Resolved data race issues with a new marking mechanism for safer push/pop operations.
- Buffer Donation Events: Donation now waits on usage and definition events, preventing premature buffer donations.
Chores
- Internal Proto Change: Streamlined protocol buffer definitions in TensorFlow Lite's profiling module for improved consistency.
These updates are sure to enhance your coding experience, making everything from data management to performance optimization more efficient and fun! Keep coding, keep thriving!
Included Commits
This commit addresses a data race issue in the ObjectPool implementation within the XLA (Accelerated Linear Algebra) framework. The changes primarily involve modifying the way entries are managed in the object pool by implementing a marking mechanism to prevent concurrent access issues during push and pop operations. The new implementation uses pointer tagging to logically delete entries, which helps to avoid data races and the ABA problem, thereby improving the overall performance of the object pool under low contention scenarios. Benchmark results indicate that the updated PopEntry and PushEntry functions perform significantly better than the previous implementation, particularly for non-contended situations.
Additionally, the commit introduces new testing benchmarks to evaluate the performance of the GetOrCreate method, both under normal and contended conditions. The tests ensure that the object pool behaves correctly when multiple threads attempt to access it simultaneously, confirming that the changes maintain the integrity of the object pool while enhancing its efficiency. Overall, this commit lays the groundwork for further optimizations in future iterations, with a focus on minimizing spin-waiting during operations.
Files changed
- third_party/xla/xla/runtime/BUILD
- third_party/xla/xla/runtime/object_pool.h
- third_party/xla/xla/runtime/object_pool_test.cc
This commit introduces the BufferFromHostLiteral method to the CommonPjRtClient class within the XLA (Accelerated Linear Algebra) library. The new function is designed to create a buffer from a host literal, accommodating specific memory space and layout requirements. It includes error handling for unsupported tuple shapes and utilizes profiling tools to trace the operation, ensuring efficient management of device memory allocation and transfer events.
The implementation of BufferFromHostLiteral involves several key steps: validating the shape of the input literal, determining the appropriate device layout, allocating raw buffer memory, and scheduling the transfer of data from host to device. This functionality enhances the capabilities of the CommonPjRtClient by facilitating the seamless transition of data between host and device environments, which is essential for optimized performance in machine learning and computational tasks.
Files changed
- third_party/xla/xla/pjrt/common_pjrt_client.cc
- third_party/xla/xla/pjrt/common_pjrt_client.h
The commit associated with PR #25707 addresses an issue within the XLA GPU backend concerning the incorrect updating of the cuDNN command buffer. The changes ensure that the cuDNN CUDA graph is created or updated correctly to reflect the necessary command buffer actions. Key modifications include enhancements to the CuDnnCmd::Record method, which now properly handles both the creation and updating of DNN graph commands based on the execution parameters and dependencies. This fix is crucial for maintaining the efficiency and correctness of GPU operations in machine learning workloads.
Additionally, the commit introduces a new test file, cuda_command_buffer_thunk_test.cc, which includes a comprehensive test suite for the CuDnnCmd. This test validates the functionality of the command buffer by ensuring that it can execute operations correctly and update the underlying command buffer as needed. The overall changes involve 40 lines added to the build configuration and modifications to existing source files, ultimately improving the reliability of cuDNN operations within the XLA framework.
Files changed
- third_party/xla/xla/backends/gpu/runtime/BUILD
- third_party/xla/xla/backends/gpu/runtime/command_buffer_cmd.cc
- third_party/xla/xla/backends/gpu/runtime/cuda_command_buffer_thunk_test.cc
This commit introduces sharding devices to the XlaCompileOptions in order to facilitate MPMD (Multiple Program Multiple Data) parallelism within the McJAX framework. The changes are necessary because the output shardings of the PjRt-IFRT executable can no longer be constructed using the addressable devices from the PJRT executable when there are no addressable devices available. This enhancement ensures that the compilation and execution processes can effectively handle complex parallelism scenarios.
The commit involves modifications across multiple files within the TensorFlow and XLA codebase, including updates to various components such as the IFRT compilation cache, backend implementations, and related tests. Overall, these updates are aimed at improving the flexibility and capability of the JAX framework in handling advanced parallel execution patterns.
Files changed
- tensorflow/core/tfrt/ifrt/ifrt_persistent_compilation_cache.cc
- third_party/xla/xla/backends/cpu/nanort/ifrt_client_test.cc
- third_party/xla/xla/python/ifrt/ir/BUILD
- third_party/xla/xla/python/ifrt/ir/ifrt_ir_program.cc
- third_party/xla/xla/python/ifrt/ir/tests/executable_impl_test_lib.cc
- third_party/xla/xla/python/ifrt_proxy/server/ifrt_backend.cc
- third_party/xla/xla/python/ifrt_proxy/server/ifrt_backend_test.cc
- third_party/xla/xla/python/pjrt_ifrt/BUILD
- third_party/xla/xla/python/pjrt_ifrt/pjrt_compiler.cc
- third_party/xla/xla/python/pjrt_ifrt/pjrt_executable.cc
- third_party/xla/xla/python/pjrt_ifrt/pjrt_executable.h
- third_party/xla/xla/python/pjrt_ifrt/xla_compiler.cc
- third_party/xla/xla/python/pjrt_ifrt/xla_compiler.h
- third_party/xla/xla/python/pjrt_ifrt/xla_executable_impl_test_lib.cc
- third_party/xla/xla/python/version.h
This commit introduces serialization and deserialization functionalities for the DiffResult structure in the XLA HLO diff tool. The changes include modifications to several files, notably adding new methods to convert DiffResult to and from a protocol buffer representation (DiffResultProto). The ToProto method constructs a protobuf message that encapsulates matched and unmatched instructions between two HLO modules, while the FromProto method reconstructs a DiffResult object from a given protobuf message. This enhancement allows for better data interchange and storage of diff results, facilitating easier integration and manipulation of the diff data.
Additionally, the commit includes updates to the build configuration to support the new protocol buffer definitions, as well as the implementation of corresponding unit tests to verify the correctness of the serialization and deserialization processes. The tests ensure that the conversion functions work as expected by checking the consistency of the data before and after the serialization process, thus improving the robustness of the HLO diff tool's functionality.
Files changed
- third_party/xla/xla/hlo/tools/hlo_diff/BUILD
- third_party/xla/xla/hlo/tools/hlo_diff/hlo_diff_result.cc
- third_party/xla/xla/hlo/tools/hlo_diff/hlo_diff_result.h
- third_party/xla/xla/hlo/tools/hlo_diff/hlo_diff_result_test.cc
- third_party/xla/xla/hlo/tools/hlo_diff/proto/BUILD
- third_party/xla/xla/hlo/tools/hlo_diff/proto/diff_result.proto
This commit introduces support for scheduling annotations in the XLA (Accelerated Linear Algebra) Collective Pipeliner, allowing operations to be scheduled across loop iterations using a new frontend attribute format: _scheduling_group_id=<group_id>:<iteration_id>. The changes span multiple files within the XLA service, including modifications to the Collective Pipeliner implementation and its associated tests, as well as updates to the GPU compiler and related utilities.
Additionally, new utility headers and tests were added to enhance the functionality and ensure the correctness of the scheduling annotations. The commit reflects a significant enhancement in handling scheduling across iterations, which is crucial for optimizing performance in XLA's execution of collective operations. Overall, this update aims to improve the efficiency and flexibility of scheduling within the XLA framework.
Files changed
- third_party/xla/xla/service/BUILD
- third_party/xla/xla/service/collective_pipeliner.cc
- third_party/xla/xla/service/collective_pipeliner.h
- third_party/xla/xla/service/collective_pipeliner_test.cc
- third_party/xla/xla/service/collective_pipeliner_utils.h
- third_party/xla/xla/service/gpu/BUILD
- third_party/xla/xla/service/gpu/gpu_compiler.cc
- third_party/xla/xla/service/gpu/gpu_p2p_pipeliner.cc
- third_party/xla/xla/service/gpu/transforms/collectives/BUILD
- third_party/xla/xla/service/gpu/transforms/collectives/gpu_collective_combiner_utils_test.cc
- third_party/xla/xla/service/legalize_scheduling_annotations.cc
- third_party/xla/xla/service/legalize_scheduling_annotations.h
- third_party/xla/xla/service/legalize_scheduling_annotations_test.cc
- third_party/xla/xla/service/scheduling_annotations_util.cc
- third_party/xla/xla/service/scheduling_annotations_util.h
- third_party/xla/xla/service/scheduling_annotations_util_test.cc
- third_party/xla/xla/service/while_loop_unroller.cc
- third_party/xla/xla/service/while_loop_unroller_test.cc
- third_party/xla/xla/tests/BUILD
- third_party/xla/xla/tests/collective_pipeliner_execution_test.cc
The commit introduces a new component called CommonAsyncHostToDeviceTransferManager, which is designed to handle asynchronous transfers from host to device using raw buffers and linearization for device-to-host (d2h) transfers. This implementation streamlines the process for backends that utilize the CommonPjRtClient, as they are no longer required to create their own versions of the AsyncHostToDeviceTransferManager.
In addition to the new transfer manager, several files were modified to accommodate these changes, including updates to the BUILD file and the header file for common_pjrt_client. New source and header files for the host_to_device_transfer_manager were also added, indicating a significant enhancement to the XLA project's handling of host-device data transfers.
Files changed
- third_party/xla/xla/pjrt/BUILD
- third_party/xla/xla/pjrt/common_pjrt_client.h
- third_party/xla/xla/pjrt/host_to_device_transfer_manager.cc
- third_party/xla/xla/pjrt/host_to_device_transfer_manager.h
This commit introduces a significant update to the implementation of variable operations within the XNNPACK delegate for TensorFlow Lite. Previously, variable operations were managed as "persistent tensors" stored in the XNNPACK workspace, leading to several issues including a lack of visibility of resources between the TFLite client and the XNNPACK delegate, incorrect handling of VAR_HANDLE operations, and fragility due to the requirement for consistent persistent tensors across subgraphs. The new implementation simplifies the architecture by utilizing storage provided directly by TFLite, ensuring that variables are accessible via the standard resource API and eliminating the need for special handling within XNNPACK.
The updated approach resolves the aforementioned problems, making the integration more robust and straightforward. However, it does come with a limitation: the XNNPACK delegate cannot manage variables that change values within a single subgraph, which should not be delegated in any case. Overall, this change enhances the functionality and reliability of variable operations in XNNPACK while maintaining compatibility with existing frameworks.
Files changed
- tensorflow/lite/delegates/xnnpack/BUILD
- tensorflow/lite/delegates/xnnpack/README.md
- tensorflow/lite/delegates/xnnpack/xnnpack_delegate.cc
This commit introduces a set of C++ utility functions aimed at establishing a microbenchmarking pipeline within the XLA (Accelerated Linear Algebra) framework. It includes the addition of several files that implement functionalities for parsing benchmark registries, generating benchmark matrices, and resolving registry paths. The core functionalities are encapsulated in the generate_benchmark_matrices.cc and generate_benchmark_matrices.h files, which provide methods to parse a TextProto registry file into a BenchmarkSuite and generate a corresponding JSON matrix. Additionally, the commit adds a test suite to validate the correctness of these functions, ensuring they handle various scenarios such as file existence checks and parsing errors.
The new utility code is structured to support effective benchmarking by allowing users to specify registry paths both in absolute and relative terms, with appropriate error handling for cases where files are not found. The tests cover a range of expected behaviors, including successful parsing of valid registry content, handling of non-existent files, and validation of the parsing format. Overall, this commit lays the groundwork for a robust benchmarking framework within XLA, enhancing its capabilities for performance evaluation.
Files changed
- third_party/xla/xla/tools/benchmarks/utils/BUILD
- third_party/xla/xla/tools/benchmarks/utils/generate_benchmark_matrices.cc
- third_party/xla/xla/tools/benchmarks/utils/generate_benchmark_matrices.h
- third_party/xla/xla/tools/benchmarks/utils/generate_benchmark_matrices_test.cc
This commit makes internal changes to the protocol buffer definitions in TensorFlow Lite's profiling module, specifically in model_runtime_info.proto and profiling_info.proto. The modifications remove the java_api_version option and change the java_package name from "com.google.tflite.profiling" to "tflite.profiling". Each file sees three line changes in total: one addition and two deletions.
These updates streamline the package naming convention for the profiling module, potentially improving consistency and reducing complexity in the codebase. The changes reflect an effort to simplify the API structure while maintaining the functionality necessary for profiling TensorFlow Lite models.
Files changed
- tensorflow/lite/profiling/proto/model_runtime_info.proto
- tensorflow/lite/profiling/proto/profiling_info.proto
This commit introduces a profiling context to the runtime profiling library within the XLA (Accelerated Linear Algebra) framework. The changes include modifications to several files, notably adding a new profiling_context library and related header files that define the ProfilingContext and WithProfilingContext classes. These classes provide a mechanism for managing profiling information across device execution threads, allowing for more granular performance measurements during the execution of XLA programs.
Additionally, the commit updates existing test cases to incorporate checks for non-zero GPU device time measurements, ensuring that the profiling context is effectively utilized during execution. The new context is designed to enhance the profiling capabilities of the runtime, enabling developers to better understand and optimize the performance of their computational workloads on GPU devices. Overall, this enhancement aims to improve the profiling infrastructure within the XLA ecosystem, facilitating more efficient debugging and performance tuning.
Files changed
- third_party/xla/xla/pjrt/BUILD
- third_party/xla/xla/pjrt/gpu/se_gpu_pjrt_client_test.cc
- third_party/xla/xla/pjrt/pjrt_stream_executor_client.cc
- third_party/xla/xla/pjrt/profiling/BUILD
- third_party/xla/xla/pjrt/profiling/profiling_context.h
- third_party/xla/xla/pjrt/profiling/profiling_context_no_op.cc
- third_party/xla/xla/pjrt/profiling/profiling_context_no_op.h
The recent commit addresses a critical issue in the buffer donation process by implementing a wait mechanism for usage and definition events before proceeding with the donation. This change ensures that the buffer is not donated prematurely, which could lead to complications if the usage and definition events are not fully completed. By incorporating the LockUseAndTransferUsageEvents() method and blocking until both the usage and definition events are ready, the code now guarantees that the buffer's state is stable before it is donated.
The modifications were made in the tfrt_gpu_client.cc file, with a total of eight lines added to enhance the buffer management process. Although this approach may not be the most optimal in terms of performance, it effectively resolves the risk of donating a buffer that is still in use or not fully defined, thereby improving the overall reliability of the buffer donation mechanism.
Files changed
- third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_client.cc