TensorFlow Changelog


Here's a fresh batch of updates to keep your TensorFlow projects running smoothly and efficiently! 🚀

  • Improvement: The XLA Latency Hiding Scheduler now dumps its stats to a proto, giving you a clearer picture of performance metrics like wasted cycles and memory pressure. This makes debugging and optimizing your scheduling process a breeze! 📊

  • Improvement: Say hello to non-blocking NCCL communicators! This update boosts the performance of collective operations in GPU backends by allowing tasks to run concurrently. Faster, smoother, and more efficient GPU operations are now at your fingertips! ⚡️

  • Improvement: Multiple compilation configs are now supported in the TensorFlow Lite experimental LiteRT compiler plugin. Plus, you can now track partition stats in your compiled models, making performance tuning a lot easier. 🛠️

  • New Feature: The ifrt::Client interface gets a makeover with two new methods! CreateContext helps you capture the runtime context, and a variant of MakeArrayFromHostBuffer uses this context to streamline performance analysis and debugging. 🕵️‍♂️

  • New Feature: Introducing the TfrtGpuAsyncHostToDeviceTransferManager and TfrtGpuClient::CreateBuffersForAsyncHostToDevice() for managing async transfers from host to device. More unit tests mean more reliability and correctness! 🧪

  • New Feature: We've integrated StableHLO from OpenXLA, bringing significant updates and enhancements to the StableHLO framework within the project. 🛡️

  • New Feature: Check out TfrtGpuBuffer::CopyRawToHostFuture and TfrtGpuClient::BufferFromHostLiteral for efficient and asynchronous data transfers in the TensorFlow XLA GPU backend. 🚀

  • New Feature: Quantization functionalities have been copied over to TensorFlow Lite, optimizing models for resource-constrained devices. More tests mean more reliability! 📈

  • Bugfix: A critical bug fix for the FloorDiv operation in TF/TFL lowering to TOSA ensures correct rounding behavior. Accuracy restored! 🔧

  • Bugfix: Addressed a use-after-move issue in CpuCompiler::CompileAheadOfTime within the XLA CPU module. This fix enhances stability and reliability. 🛠️

  • Bugfix: Reverted a previous change affecting TensorFlow profiler's error handling, ensuring that any issues are flagged and not ignored. Error management just got stricter! 🚨

  • Chore: Renamed serialization_base.fbs to tflite_serialization_base.fbs for better clarity and organization within the TensorFlow Lite framework. 🗂️

Enjoy the new features and improvements, and happy coding! 🎉

Included Commits

2025-03-21T18:36:10 See commit

This commit integrates the StableHLO project from the OpenXLA repository, specifically from the commit identified by the hash 66f90d5c. The integration involves significant modifications, including the addition of 95 lines and the deletion of 5,120 lines across the codebase, indicating a substantial update or overhaul of the existing StableHLO implementation.

Additionally, the workspace.bzl file was updated to reflect the new commit and SHA256 hash for the StableHLO library, ensuring that the build system references the latest version correctly. This change is part of an ongoing effort to maintain and enhance the functionality of the StableHLO framework within the project.

Files changed

  • third_party/stablehlo/temporary.patch
  • third_party/stablehlo/workspace.bzl
2025-03-21T19:30:25 See commit

This commit introduces the TfrtGpuAsyncHostToDeviceTransferManager class, which is designed to manage asynchronous transfers of data from the host to the device in GPU contexts. Additionally, it implements the TfrtGpuClient::CreateBuffersForAsyncHostToDevice() method, which facilitates the creation of buffers specifically for these asynchronous transfers. The commit also includes the implementation of the TfrtGpuBuffer::ToLiteral() function, enhancing the functionality of buffer management.

Furthermore, the commit enhances the codebase by adding more unit tests to ensure the reliability and correctness of the new features. The modifications affect several files, including BUILD, tfrt_gpu_client.cc, tfrt_gpu_client.h, and tfrt_gpu_client_test.cc, reflecting a comprehensive update to the GPU client implementation within the TensorFlow XLA framework.
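The async-transfer shape described above can be sketched in plain C++; the types below are illustrative stand-ins (a std::vector plays the role of device memory, and a future plays the role of the transfer's completion event), not the real PJRT API:

```cpp
#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

// Illustrative sketch of an async host-to-device transfer: the caller
// receives the buffer immediately, while the copy completes in the
// background and fulfills a future.
struct DeviceBuffer {
  std::vector<char> storage;       // stands in for device memory
  std::shared_future<void> ready;  // resolves once the copy lands
};

DeviceBuffer TransferToDeviceAsync(const void* host_data, size_t size) {
  DeviceBuffer buf;
  buf.storage.resize(size);
  // Launch the copy asynchronously; the real manager would enqueue a
  // DMA transfer on a GPU stream instead of spawning a thread.
  buf.ready = std::async(std::launch::async,
                         [dst = buf.storage.data(), host_data, size] {
                           std::memcpy(dst, host_data, size);
                         })
                  .share();
  return buf;
}
```

A consumer waits on `ready` before reading `storage`, mirroring how a real transfer manager sequences downstream work behind the pending copy.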

Files changed

  • third_party/xla/xla/pjrt/gpu/tfrt/BUILD
  • third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_client.cc
  • third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_client.h
  • third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_client_test.cc
2025-03-21T19:54:23 See commit

The commit addresses a use-after-move issue in the CpuCompiler::CompileAheadOfTime function within the XLA (Accelerated Linear Algebra) CPU compilation module. It introduces a modification to ensure that the target options are correctly set up before being passed to the target machine builder, which prevents potential access to moved objects that could lead to undefined behavior.

Specifically, the change involves creating a llvm::TargetOptions object from the module's configuration and utilizing it in the creation of the target machine. This adjustment not only resolves the use-after-move problem but also enhances the clarity of the code by separating the configuration of compiler options from the target machine instantiation. Overall, this commit improves the stability and reliability of the compilation process in the XLA CPU backend.
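The bug class is easy to reproduce in miniature. In the hedged sketch below, the types are simplified stand-ins rather than the real LLVM API; what matters is the ordering the commit enforces:

```cpp
#include <string>
#include <utility>

// Simplified stand-ins for the real LLVM types; the point is the
// ordering, not the API.
struct TargetOptions {
  std::string abi;
};

struct TargetMachineBuilder {
  TargetOptions opts;
  explicit TargetMachineBuilder(TargetOptions o) : opts(std::move(o)) {}
};

// Buggy shape:
//   TargetOptions options = MakeOptions(config);
//   TargetMachineBuilder builder(std::move(options));
//   Use(options.abi);  // use-after-move: reads a moved-from object
//
// Fixed shape, as described in the commit: fully configure the options
// from the module config first, then hand them off exactly once.
TargetMachineBuilder BuildMachine(const std::string& abi_from_config) {
  TargetOptions options;
  options.abi = abi_from_config;  // configure before the move
  return TargetMachineBuilder(std::move(options));
}
```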

Files changed

  • third_party/xla/xla/service/cpu/cpu_compiler.cc
2025-03-21T22:35:19 See commit

This commit introduces two significant enhancements to the ifrt::Client interface. The first new method, CreateContext, enables users to generate a UserContext object that captures the current runtime context, including elements like the call stack. This context can be passed to various IFRT operations, allowing them to be associated with the client context that initiated them. This feature aims to greatly streamline performance analysis and debugging processes. The second addition is a variant of the existing MakeArrayFromHostBuffer method, which now accepts a UserContext as its final parameter, leveraging the newly created context functionality.

The changes affect multiple files across the codebase, including modifications to header and implementation files for both the ifrt client and related Python interfaces. This broad update ensures that the new context management capabilities are integrated throughout the relevant components, enhancing the overall functionality and usability of the IFRT framework.
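The shape of the new API can be sketched roughly as follows. The types and signatures here are illustrative guesses based on the description above, not the actual IFRT interface:

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical sketch of the UserContext pattern: a client-side context
// object captures where an operation originated, and the operation
// carries it so profilers and debuggers can attribute work back to the
// caller. The real context also captures a call stack.
struct UserContext {
  std::string origin;
};

class Client {
 public:
  std::shared_ptr<UserContext> CreateContext(std::string origin) {
    return std::make_shared<UserContext>(UserContext{std::move(origin)});
  }

  // Variant that tags the created array with the initiating context.
  std::string MakeArrayFromHostBuffer(const std::vector<float>& /*data*/,
                                      std::shared_ptr<UserContext> ctx) {
    return "array created from: " + (ctx ? ctx->origin : "<unknown>");
  }
};
```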

Files changed

  • third_party/xla/xla/backends/cpu/nanort/BUILD
  • third_party/xla/xla/backends/cpu/nanort/ifrt_client.cc
  • third_party/xla/xla/backends/cpu/nanort/ifrt_client.h
  • third_party/xla/xla/backends/cpu/nanort/ifrt_client_test.cc
  • third_party/xla/xla/python/compile_only_ifrt/BUILD
  • third_party/xla/xla/python/compile_only_ifrt/client.h
  • third_party/xla/xla/python/ifrt/BUILD
  • third_party/xla/xla/python/ifrt/client.h
  • third_party/xla/xla/python/ifrt/mock.cc
  • third_party/xla/xla/python/ifrt/mock.h
  • third_party/xla/xla/python/ifrt/user_context.h
  • third_party/xla/xla/python/ifrt_proxy/client/BUILD
  • third_party/xla/xla/python/ifrt_proxy/client/client.cc
  • third_party/xla/xla/python/ifrt_proxy/client/client.h
  • third_party/xla/xla/python/ifrt_proxy/integration_tests/BUILD
  • third_party/xla/xla/python/ifrt_proxy/integration_tests/mock_array_test.cc
  • third_party/xla/xla/python/ifrt_proxy/server/ifrt_backend_test.cc
  • third_party/xla/xla/python/pjrt_ifrt/BUILD
  • third_party/xla/xla/python/pjrt_ifrt/pjrt_client.cc
  • third_party/xla/xla/python/pjrt_ifrt/pjrt_client.h
2025-03-21T23:14:40 See commit

This commit introduces the ability to set multiple compilation configurations in the TensorFlow Lite experimental LiteRT compiler plugin. It enhances the partitioning process by incorporating detailed logging that provides insights into the number of selected operations and total operations within a subgraph. The logging now captures the number of partitions created, allowing developers to better understand the partitioning results during model compilation.

Additionally, the changes include modifications to the PartitionModel function, where it now calculates and logs both the number of selected operations and the total operations in a subgraph. This improvement aims to facilitate debugging and performance tuning by providing more granular statistics about the partitioning process, ultimately enhancing the efficiency of compiled models.
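The statistics described can be illustrated with a small stand-alone computation. A bool-per-op mask replaces the real subgraph walk, and the names are illustrative rather than the plugin's own:

```cpp
#include <cstddef>
#include <vector>

// Illustrative version of the statistics the updated partitioning now
// logs: how many ops were selected out of the subgraph total, and how
// many contiguous partitions those selections form.
struct PartitionStats {
  size_t selected_ops = 0;
  size_t total_ops = 0;
  size_t num_partitions = 0;  // runs of consecutive selected ops
};

PartitionStats ComputePartitionStats(const std::vector<bool>& selected) {
  PartitionStats stats;
  stats.total_ops = selected.size();
  bool in_partition = false;
  for (bool s : selected) {
    if (s) {
      ++stats.selected_ops;
      if (!in_partition) ++stats.num_partitions;  // a new run starts
    }
    in_partition = s;
  }
  return stats;
}
```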

Files changed

  • tensorflow/lite/experimental/litert/compiler/plugin/compiler_plugin.cc
2025-03-22T08:54:12 See commit

This commit reverts a previous change identified by the hash 62098d7dfc801b45e3c08c31c5cc3bd159c76998, primarily affecting the TensorFlow profiler's options and behavior. The modifications include the removal of the ignore_start_error parameter from the ProfilerOptions class, which previously allowed the profiler to ignore errors when starting. Consequently, the code now directly handles the status of the profiler's start operation without ignoring errors, ensuring that any issues are appropriately flagged.

Additionally, the commit involves several updates across different files, including changes to the CuptiTracer class and related components, where error handling has been adjusted to remove the .IgnoreError() method calls. This ensures that any errors encountered during the profiling process are not silently ignored, which may enhance debugging and error reporting in the profiler's functionality. Overall, the commit emphasizes stricter error management and refines the options available for profiling within TensorFlow.
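The before/after contrast can be sketched with a minimal Status stand-in (this is not TSL's actual Status class; the names are illustrative):

```cpp
#include <string>
#include <utility>

// Minimal stand-in for a Status type, to contrast the two styles the
// commit moves between: before, start errors could be swallowed with
// .IgnoreError(); after, the status is checked and surfaced.
class Status {
 public:
  Status() = default;
  explicit Status(std::string msg) : ok_(false), msg_(std::move(msg)) {}
  bool ok() const { return ok_; }
  const std::string& message() const { return msg_; }
  void IgnoreError() const {}  // the kind of call the commit removes

 private:
  bool ok_ = true;
  std::string msg_;
};

Status StartProfiler(bool device_available) {
  return device_available ? Status() : Status("CUPTI init failed");
}

// Stricter shape after the revert: the caller inspects the status
// instead of discarding it, so failures are reported, not hidden.
bool StartOrReport(bool device_available, std::string* error_out) {
  Status s = StartProfiler(device_available);
  if (!s.ok()) {
    *error_out = s.message();  // surfaced to the caller
    return false;
  }
  return true;
}
```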

Files changed

  • tensorflow/python/profiler/profiler_v2.py
  • tensorflow/tools/api/golden/v2/tensorflow.profiler.experimental.-profiler-options.pbtxt
  • third_party/xla/third_party/tsl/tsl/profiler/lib/profiler_session.cc
  • third_party/xla/xla/backends/gpu/codegen/triton/kernel_name_tracer_cuda.cc
  • third_party/xla/xla/backends/profiler/gpu/cupti_error_manager_test.cc
  • third_party/xla/xla/backends/profiler/gpu/cupti_tracer.cc
  • third_party/xla/xla/backends/profiler/gpu/cupti_tracer.h
  • third_party/xla/xla/backends/profiler/gpu/device_tracer_cuda.cc
  • third_party/xla/xla/service/gpu/model/hlo_op_profiler.cc
2025-03-22T21:41:08 See commit

This commit introduces two significant features to the TensorFlow XLA (Accelerated Linear Algebra) GPU backend: TfrtGpuBuffer::CopyRawToHostFuture and TfrtGpuClient::BufferFromHostLiteral. The CopyRawToHostFuture function allows for asynchronous copying of raw data from GPU memory to host memory, implementing checks for valid offsets and transfer sizes, and ensuring that the buffer remains alive during the transfer process. Meanwhile, BufferFromHostLiteral provides a mechanism to create a GPU buffer directly from a host literal, enabling efficient data transfers between host and device.

The changes include modifications across several files, with substantial additions in the tfrt_gpu_client.cc file where the core logic for the new functionalities is implemented. The commit also updates the build configurations and includes new unit tests to validate the correct behavior of the implemented features. These tests cover various scenarios, including full buffer transfers, sub-buffer transfers, and error handling for out-of-range conditions, ultimately enhancing the robustness and usability of the GPU client in TensorFlow's XLA framework.
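The offset and size validation described above can be sketched as a pure range check. This is a hypothetical helper that only shows the arithmetic; the real method also manages buffer lifetime and async completion:

```cpp
#include <cstddef>
#include <string>

// A raw device-to-host copy must stay within the on-device buffer.
// Writing the check as `transfer_size > buffer_size - offset` (after
// verifying offset <= buffer_size) avoids overflow in offset + size.
bool ValidateRawCopy(size_t buffer_size, size_t offset,
                     size_t transfer_size, std::string* error) {
  if (offset > buffer_size) {
    *error = "offset past end of buffer";
    return false;
  }
  if (transfer_size > buffer_size - offset) {
    *error = "transfer extends past end of buffer";
    return false;
  }
  return true;
}
```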

Files changed

  • third_party/xla/xla/pjrt/gpu/tfrt/BUILD
  • third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_client.cc
  • third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_client.h
  • third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_client_test.cc
2025-03-26T18:43:13 See commit

This commit copies the quantization_driver and quantization_utils components from their original location into the TensorFlow Lite directory. The changes add several new files, such as quantization_driver.cc, quantization_driver.h, and their corresponding test and utility files, which are essential for the quantization process in machine learning models. The BUILD file and quantization.td were also modified to accommodate these new components.

Overall, this update enhances the TensorFlow Lite framework by integrating key quantization functionalities, which are crucial for optimizing models for deployment on resource-constrained devices. The inclusion of tests ensures that the new features are validated, promoting reliability and performance in the quantization process.

Files changed

  • tensorflow/compiler/mlir/lite/quantization/common/quantization_lib/BUILD
  • tensorflow/compiler/mlir/lite/quantization/common/quantization_lib/quantization.td
  • tensorflow/compiler/mlir/lite/quantization/common/quantization_lib/quantization_driver.cc
  • tensorflow/compiler/mlir/lite/quantization/common/quantization_lib/quantization_driver.h
  • tensorflow/compiler/mlir/lite/quantization/common/quantization_lib/quantization_driver_test.cc
  • tensorflow/compiler/mlir/lite/quantization/common/quantization_lib/quantization_traits.h
  • tensorflow/compiler/mlir/lite/quantization/common/quantization_lib/quantization_utils.cc
  • tensorflow/compiler/mlir/lite/quantization/common/quantization_lib/quantization_utils.h
2025-03-26T21:18:10 See commit

The commit addresses a critical bug in the TensorFlow (TF) and TensorFlow Lite (TFL) lowering process to TOSA (Tensor Operator Set Architecture) regarding the FloorDiv operation for integer types. Previously, the FloorDiv was incorrectly lowered to IntDiv, which rounds towards zero, while FloorDiv should round towards negative infinity. This inconsistency in rounding behavior could lead to incorrect results in computations. The patch introduces a new method to correctly implement the FloorDiv operation by utilizing a series of TOSA operations that account for the proper rounding behavior.

The changes include modifications to several test files and the implementation of a helper function, floorIntDiv, which accurately performs floor division for integer inputs. This function ensures that the output aligns with the expected behavior of FloorDiv by adjusting the results based on the signs of the inputs. Additionally, the patch updates the lowering logic in the convertFloorDivOp function to incorporate this new helper function, thereby enhancing the accuracy and reliability of the operation within the TOSA framework.
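The arithmetic behind the fix is easy to state on its own. The commit's helper builds the equivalent out of TOSA ops; as scalar integer code, the adjustment looks like the sketch below (a statement of the semantics, not the actual lowering):

```cpp
#include <cstdint>

// C-style integer division (and TOSA's IntDiv) truncates toward zero,
// while FloorDiv must round toward negative infinity. When the operand
// signs differ and the division is inexact, truncation rounded up, so
// step the quotient down by one to get the floor.
int64_t FloorIntDiv(int64_t a, int64_t b) {
  int64_t q = a / b;  // truncates toward zero
  int64_t r = a % b;
  if (r != 0 && ((a < 0) != (b < 0))) --q;
  return q;
}
```

For example, -7 / 2 truncates to -3, but floor(-3.5) is -4; the sign check catches exactly these cases while leaving exact and same-sign divisions untouched.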

Files changed

  • tensorflow/compiler/mlir/tosa/tests/tf-to-tosa-pipeline.mlir
  • tensorflow/compiler/mlir/tosa/tests/tf-unequal-ranks.mlir
  • tensorflow/compiler/mlir/tosa/tests/tfl-to-tosa-pipeline.mlir
  • tensorflow/compiler/mlir/tosa/tests/tfl-unequal-ranks.mlir
  • tensorflow/compiler/mlir/tosa/transforms/legalize_common.cc
  • tensorflow/compiler/mlir/tosa/transforms/legalize_common.h
2025-03-27T00:17:54 See commit

This commit renames serialization_base.fbs to tflite_serialization_base.fbs, along with corresponding updates in related files to reflect the change. The renaming affects multiple components within the TensorFlow Lite GPU delegate, including the inclusion of the renamed file in headers and source files, ensuring that all references to the old file name are updated accordingly.

The changes also include modifications to the build configuration files to ensure that the new file name is correctly referenced in the build process. Overall, this commit aims to enhance clarity and organization within the codebase by providing a more descriptive name for the serialization base file, which may help in better identifying its purpose within the TensorFlow Lite framework.

Files changed

  • tensorflow/lite/delegates/gpu/cl/serialization.fbs
  • tensorflow/lite/delegates/gpu/cl/serialization_generated.h
  • tensorflow/lite/delegates/gpu/common/gpu_model.fbs
  • tensorflow/lite/delegates/gpu/common/gpu_model_generated.h
  • tensorflow/lite/delegates/gpu/common/task/BUILD
  • tensorflow/lite/delegates/gpu/common/task/arguments.h
  • tensorflow/lite/delegates/gpu/common/task/gpu_object_desc.h
  • tensorflow/lite/delegates/gpu/common/task/gpu_operation.h
  • tensorflow/lite/delegates/gpu/common/task/serialization_base.cc
  • tensorflow/lite/delegates/gpu/common/task/serialization_base.h
  • tensorflow/lite/delegates/gpu/common/task/tflite_serialization_base.fbs
  • tensorflow/lite/delegates/gpu/common/task/tflite_serialization_base_generated.h
  • tensorflow/lite/delegates/gpu/metal/inference_context.fbs
  • tensorflow/opensource_only.files
2025-03-28T00:11:04 See commit

This commit implements non-blocking NCCL (NVIDIA Collective Communications Library) communicators, which enhances the performance of collective operations in GPU backends by allowing concurrent execution of tasks. The changes involve modifications to several files, including the NCCL collectives implementation and associated headers, as well as the addition of new error handling functionalities.

Additionally, this commit reverts a previous change identified by the hash 5c93e12b85fda11f14ef1a511ec61efc99e34694, indicating a shift back to a prior implementation approach. The updates aim to improve the efficiency and responsiveness of collective communications in GPU environments, which is crucial for optimizing parallel processing tasks.
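The non-blocking pattern itself can be sketched without NCCL: initialization returns immediately and callers poll for completion instead of blocking, freeing them to make progress on other work in the meantime. The class below is purely illustrative plain C++, not the NCCL API or the commit's implementation:

```cpp
#include <atomic>
#include <thread>

// Illustrative non-blocking "communicator": construction kicks off
// setup on a background thread and returns at once; callers poll the
// async state rather than waiting inside the constructor.
enum class CommState { kInProgress, kReady };

class NonBlockingComm {
 public:
  NonBlockingComm() {
    worker_ = std::thread([this] { state_.store(CommState::kReady); });
  }
  ~NonBlockingComm() { worker_.join(); }

  CommState Poll() const { return state_.load(); }

 private:
  std::atomic<CommState> state_{CommState::kInProgress};
  std::thread worker_;
};
```

A caller interleaves other work while `Poll()` reports in-progress, which is the concurrency benefit the changelog entry describes.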

Files changed

  • third_party/xla/xla/backends/gpu/collectives/BUILD
  • third_party/xla/xla/backends/gpu/collectives/nccl_collectives.cc
  • third_party/xla/xla/backends/gpu/collectives/nccl_collectives.h
  • third_party/xla/xla/backends/gpu/collectives/nccl_communicator.cc
  • third_party/xla/xla/backends/gpu/collectives/nccl_communicator.h
  • third_party/xla/xla/backends/gpu/collectives/nccl_communicator_test.cc
  • third_party/xla/xla/backends/gpu/collectives/nccl_errors.cc
  • third_party/xla/xla/backends/gpu/collectives/nccl_errors.h
2025-03-28T02:29:40 See commit

This commit introduces enhancements to the Latency Hiding Scheduler in XLA (Accelerated Linear Algebra) by implementing the capability to dump scheduler statistics into a protocol buffer (proto). The changes include modifications to the existing code that allow for the collection and serialization of various statistics related to the scheduler's performance, such as wasted cycles for different operations (e.g., all-gather, all-reduce, collective broadcasts) and total cycles used. These statistics are now encapsulated in a new SchedulerStatisticsProto message, which is included in the scheduler's output when the corresponding debug option is enabled.

Additionally, the commit refines the existing methods for logging and displaying these statistics, ensuring that they are more structured and easily accessible. The update simplifies the way statistics are calculated and presented, enhancing the overall debugging and performance analysis capabilities of the scheduler. By providing a detailed breakdown of wasted cycles and memory pressure, this commit aims to facilitate better optimization and understanding of the scheduling process within the XLA framework.
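As a rough illustration of what such a statistics dump carries, the struct below mirrors the described contents; the field names are guesses based on the description above, not the actual SchedulerStatisticsProto schema:

```cpp
#include <map>
#include <string>

// Illustrative mirror of latency-hiding scheduler statistics: wasted
// cycles broken down per collective kind, plus the total cycle count,
// from which a tool can derive how much latency went unhidden.
struct SchedulerStatistics {
  std::map<std::string, double> wasted_cycles;  // e.g. "all-gather" -> 120
  double total_cycles = 0;

  double TotalWastedCycles() const {
    double sum = 0;
    for (const auto& kv : wasted_cycles) sum += kv.second;
    return sum;
  }

  // Fraction of the schedule spent waiting on unhidden latency.
  double WastedFraction() const {
    return total_cycles > 0 ? TotalWastedCycles() / total_cycles : 0;
  }
};
```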

Files changed

  • third_party/xla/xla/service/latency_hiding_scheduler.cc
  • third_party/xla/xla/service/latency_hiding_scheduler.h
  • third_party/xla/xla/xla.proto