TensorFlow changelog
Here's the latest scoop on our codebase updates! We've been busy bees, buzzing around to bring you some fantastic new features, improvements, and bug fixes. Let's dive right in!
New feature: We've jazzed up the XLA framework by using the CUDA runtime API to accurately determine if two ranks are on the same host. This ensures more reliable local communication during collective operations, especially in multi-GPU setups.
New feature: A new transformation pass is here! We've added a pass to outline an IFRT IR atom program into a module, enhancing the XLA framework's capabilities in handling IR atom programs.
Improvement: The TensorFlow Lite compiler now checks for infinity when folding max and min ops. This ensures that operations handle extreme floating-point values correctly, boosting robustness.
New feature: You can now save output data from TFLite models as TensorFlow Example protocol buffers and output them to a file. This makes model evaluation and debugging a breeze!
Improvement: We've added profiling to the ifrt-proxy client, enabling request-response trace tracking. This makes monitoring and analyzing RPC calls a piece of cake.
New feature: Direct legalization for `min` and `max` operations is now available in TensorFlow Lite, streamlining the conversion process and enhancing performance.
New feature: We introduced a pattern to reorder gather and cast ops in TensorFlow Lite for more efficient execution. Less work, more play!
New feature: A new optimization pattern simplifies broadcasting and reshaping operations in TensorFlow MLIR, enhancing efficiency. Who doesn't love a good optimization?
Bugfix: We fixed a critical issue in JAX where input arrays weren't reshaped correctly, preventing crashes on TPU and ensuring correct outputs on GPU. Phew!
Bugfix: Memory leaks in `cuda_executor.cc` error paths are now a thing of the past. We've improved memory management to keep things running smoothly.
Bugfix: Compatibility issues with Numpy 2.x in TensorFlow's numpy-like operations have been resolved. We're all set for the future!
Chore: We tidied up by deleting `status_test_util.h` after migrating all its users. A cleaner codebase is a happier codebase!
That's all for now, folks! Stay tuned for more exciting updates and improvements. Keep coding and keep smiling!
Included Commits
This commit introduces functionality to save output data from TensorFlow Lite (TFLite) models as TensorFlow Example protocol buffers and to output this data to a specified file. Key changes include the addition of new CMake configurations for generating protocol buffer headers and source files, as well as modifications to the benchmark tool to support the new output formats. Specifically, two parameters are introduced: `output_filepath`, which saves the output tensor data as binary, and `output_proto_filepath`, which serializes the output tensors into TensorFlow Example format and writes them to a file.
The changes also involve the implementation of utility functions to convert TFLite tensors into appropriate formats for storage, alongside tests to ensure the correctness of these conversions. The modifications enhance the benchmarking capabilities of TFLite by allowing users to easily export and analyze model outputs in a structured format, facilitating better model evaluation and debugging.
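The conversion step can be pictured with a small sketch. This is not the actual `utils.cc` API; the function name and dict layout are illustrative stand-ins for flattening an output tensor into an Example-style feature (float values into a float list, integer values into an int64 list):

```python
import numpy as np

def tensor_to_feature(name, tensor):
    """Illustrative sketch: flatten an output tensor into a feature entry
    shaped like a tf.train.Example feature (float_list vs. int64_list).
    The function name and dict layout are assumptions, not the real API."""
    flat = np.asarray(tensor).ravel()  # Example features are flat lists
    if np.issubdtype(flat.dtype, np.floating):
        return {name: {"float_list": [float(v) for v in flat]}}
    return {name: {"int64_list": [int(v) for v in flat]}}
```

The real tool serializes such features into a `tf.train.Example` proto and writes it to the file given by `output_proto_filepath`.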
Files changed
- tensorflow/core/example/CMakeLists.txt
- tensorflow/lite/CMakeLists.txt
- tensorflow/lite/tools/BUILD
- tensorflow/lite/tools/benchmark/BUILD
- tensorflow/lite/tools/benchmark/CMakeLists.txt
- tensorflow/lite/tools/benchmark/README.md
- tensorflow/lite/tools/benchmark/benchmark_tflite_model.cc
- tensorflow/lite/tools/utils.cc
- tensorflow/lite/tools/utils.h
- tensorflow/lite/tools/utils_test.cc
This commit introduces profiling capabilities to the ifrt-proxy client, enabling the tracking of request-response traces. The modifications include updates to the `BUILD` file to incorporate additional dependencies related to profiling, such as `traceme` and `xplane_schema`. The core of the change is in the `rpc_helper.cc` file, where the `DoRpc` function has been modified to accept profiling names for both sending and receiving requests. This allows for the creation of trace entries that capture the flow of requests and responses, enhancing the ability to monitor and analyze the performance of RPC calls within the client.
The implementation utilizes a random flow ID to uniquely identify each request-response cycle and records the profiling data using `TraceMe` objects. These objects are configured to log the flow direction and associated profiling names, providing a structured way to analyze the interactions with the IFRT proxy server. Overall, the enhancements aim to improve observability and debugging capabilities for developers working with the ifrt-proxy client.
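The flow-ID idea can be sketched in a few lines. This is a minimal Python analogue, not the ifrt-proxy C++ API: the function name, the `trace_log` list, and the tuple layout are all illustrative. The key point is that one random ID ties the outgoing request entry to the incoming response entry:

```python
import random

def do_rpc(send_name, recv_name, send_fn, trace_log):
    """Sketch of request-response tracing: a random flow id links the
    outgoing and incoming trace entries, loosely mirroring the TraceMe-based
    flow tracking described above (names here are assumptions)."""
    flow_id = random.getrandbits(64)               # unique id per request-response cycle
    trace_log.append((send_name, flow_id, "out"))  # flow direction: outgoing request
    response = send_fn()                           # the actual RPC call
    trace_log.append((recv_name, flow_id, "in"))   # flow direction: incoming response
    return response
```

A trace viewer can then join the two entries on the shared flow ID to reconstruct each RPC's lifetime.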
Files changed
- third_party/xla/xla/python/ifrt_proxy/client/BUILD
- third_party/xla/xla/python/ifrt_proxy/client/rpc_helper.cc
This commit introduces checks for infinity values in the folding operations of maximum and minimum functions within the TensorFlow Lite compiler's MLIR (Multi-Level Intermediate Representation) code. Specifically, the `MaximumOp::fold` and `MinimumOp::fold` methods have been modified to account for cases where the input values are either the largest representable floating-point numbers or infinity. If either condition is met for the left-hand side (lhs) or right-hand side (rhs) inputs, the function will return the other operand, thus optimizing the folding operation and ensuring correct behavior when dealing with extreme floating-point values.
Additionally, the commit updates the test cases to include scenarios where negative infinity and positive infinity are involved in maximum and minimum operations. New functions, `@max_with_neg_inf` and `@min_with_inf`, have been added to validate that the folding logic correctly handles these edge cases, ensuring that the expected outputs are returned when infinity is part of the input tensors. This enhancement improves the robustness of the TensorFlow Lite compiler's handling of floating-point operations.
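The folding rule itself is simple enough to sketch in Python (the function names are illustrative, not the C++ fold methods): `max(x, v)` folds to `x` when `v` is negative infinity or the lowest finite float, and `min(x, v)` folds to `x` when `v` is positive infinity or the largest finite float.

```python
import math
import sys

_LOWEST, _LARGEST = -sys.float_info.max, sys.float_info.max

def fold_maximum(lhs, rhs):
    """max(x, -inf) == x and max(x, lowest finite) == x for representable x."""
    if lhs in (-math.inf, _LOWEST):
        return rhs
    if rhs in (-math.inf, _LOWEST):
        return lhs
    return None  # no fold applies

def fold_minimum(lhs, rhs):
    """min(x, +inf) == x and min(x, largest finite) == x for representable x."""
    if lhs in (math.inf, _LARGEST):
        return rhs
    if rhs in (math.inf, _LARGEST):
        return lhs
    return None  # no fold applies
```

Returning `None` here stands in for the MLIR convention of a fold method declining to fold.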
Files changed
- tensorflow/compiler/mlir/lite/ir/tfl_ops.cc
- tensorflow/compiler/mlir/lite/tests/const-fold.mlir
This commit introduces a new transformation pass to outline an IFRT IR atom program into a module within the XLA (Accelerated Linear Algebra) framework. The changes include the addition of a new source file for the outlining pass and a test file, as well as modifications to several existing files to integrate this functionality. Specifically, updates were made to headers, build files, and utility files to support the new pass.
Key additions and modifications include the creation of `ifrt_outline_atom_program_to_module_pass.cc`, which implements the outlining logic, and the addition of a test file to verify its functionality. Additionally, several related files were modified to ensure compatibility with the new transformation, highlighting a significant enhancement to the IFRT's capabilities in handling IR atom programs.
Files changed
- third_party/xla/xla/python/ifrt/ir/constants.h
- third_party/xla/xla/python/ifrt/ir/tests/BUILD
- third_party/xla/xla/python/ifrt/ir/tests/ifrt_outline_atom_program_to_module.mlir
- third_party/xla/xla/python/ifrt/ir/transforms/BUILD
- third_party/xla/xla/python/ifrt/ir/transforms/ifrt_outline_atom_program_to_module_pass.cc
- third_party/xla/xla/python/ifrt/ir/transforms/passes.h
- third_party/xla/xla/python/ifrt/ir/transforms/passes.td
- third_party/xla/xla/python/ifrt/ir/transforms/spmd_expansion_pass.cc
- third_party/xla/xla/python/ifrt/ir/transforms/utils.cc
- third_party/xla/xla/python/ifrt/ir/transforms/utils.h
This commit introduces direct legalization for the `min` and `max` operations within the TensorFlow Lite (TFL) framework, enhancing the conversion process from MHLO (the MLIR HLO dialect) to TFL operations. Specifically, the commit modifies the MLIR (Multi-Level Intermediate Representation) files to include new functions for `maximum` and `minimum`, ensuring that these operations are correctly translated to their TFL equivalents during the legalize process. The changes include the addition of 28 lines to the `tfl_legalize_hlo.mlir` file, which defines the new operations, and updates to other files to incorporate these changes into the overall legalization strategy.
Additionally, the commit updates the transformation patterns for the legalize pass, adding the `MaxOp` and `MinOp` to the list of operations that can be directly legalized. This ensures that when the TFL compiler encounters these operations, they will be transformed into their corresponding TFL implementations (`TFL_MaximumOp` and `TFL_MinimumOp`), streamlining the compilation process and improving performance. Overall, this commit enhances the MLIR framework's capabilities for handling binary element-wise operations, contributing to more efficient model execution in TensorFlow Lite.
Files changed
- tensorflow/compiler/mlir/lite/stablehlo/tests/tfl_legalize_hlo.mlir
- tensorflow/compiler/mlir/lite/stablehlo/transforms/tflite_legalize_hlo.cc
- tensorflow/compiler/mlir/lite/stablehlo/transforms/tflite_legalize_hlo_patterns.td
This commit introduces modifications to the TensorFlow MLIR (Multi-Level Intermediate Representation) codebase, specifically focusing on optimizing the handling of broadcasting and reshaping tensor operations. The primary change involves the removal of a dependency on the MLIR Transforms library within the BUILD file, indicating a simplification of the code structure. The commit also adds a significant number of lines (125) to the `optimize.cc` file, where a new optimization pattern is implemented. This pattern, named `SimplifyBroadcastInDimsReshape`, allows for the minimization of unit dimensions in operations involving reshaping and broadcasting tensors.
The new optimization pattern enhances the efficiency of tensor operations by enabling the removal of unnecessary unit dimensions from both the input and output shapes of broadcasts, as long as the relative broadcast dimensions are preserved. This is particularly useful in scenarios where the reshaped output retains the same non-unit dimensions as the broadcast input. The changes ensure that the semantics of the computations remain intact while potentially reducing computational complexity and improving performance in tensor manipulations. Additionally, the commit includes a series of new test functions that validate the correctness of these optimizations across various tensor configurations.
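A NumPy analogue makes the simplification concrete (the real pass rewrites MLIR ops, not NumPy calls; the shapes here are arbitrary): broadcasting through a unit dimension and then reshaping that unit dimension away yields the same tensor as broadcasting directly, so the unit dimension and the reshape can both be dropped.

```python
import numpy as np

x = np.arange(2.0).reshape(2, 1)

# Original form: broadcast [2, 1] -> [1, 2, 3] (unit leading dim),
# then reshape [1, 2, 3] -> [2, 3] to strip the unit dim again.
via_unit_dim = np.broadcast_to(x, (1, 2, 3)).reshape(2, 3)

# Simplified form: broadcast straight to [2, 3]; no unit dim, no reshape.
direct = np.broadcast_to(x, (2, 3))

assert np.array_equal(via_unit_dim, direct)
```

The rewrite is valid exactly because the reshape preserves the non-unit dimensions of the broadcast, which is the condition the pattern checks.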
Files changed
- tensorflow/compiler/mlir/lite/stablehlo/BUILD
- tensorflow/compiler/mlir/lite/stablehlo/tests/optimize.mlir
- tensorflow/compiler/mlir/lite/stablehlo/transforms/optimize.cc
The commit removes the file `status_test_util.h` from the TensorFlow library as all its users have been successfully migrated to alternative implementations. This deletion is reflected in updates to the BUILD files, where references to `status_test_util.h` have been eliminated, resulting in a total of 13 deletions across various configurations.
Additionally, the actual content of `status_test_util.h` has been completely removed, which included copyright information and header guards. This cleanup simplifies the codebase by eliminating unused files, thereby streamlining maintenance and reducing potential confusion among developers regarding the status of the now-obsolete utility.
Files changed
- third_party/xla/third_party/tsl/tsl/lib/core/BUILD
- third_party/xla/third_party/tsl/tsl/lib/core/status_test_util.h
This commit introduces a critical fix in the JAX framework regarding the handling of input layouts specified via `in_shardings` when using the `jit` function. Specifically, it ensures that when an input array is uncommitted, it is reshaped to match the layout provided by the user. This change addresses issues previously encountered on GPU, where incorrect outputs were generated, and on TPU, where the system would crash. The update is essential for maintaining the integrity and reliability of computations across different hardware platforms.
The modifications involve changes across several files, including updates to the build configuration, C++ implementation files, and the Python client interface. Key additions include new logic for checking and applying input layouts during the preparation of inputs for JAX operations, ensuring that the input arrays are correctly reshaped before execution. This commit also updates the version number in the Python client to reflect these changes, signifying an incremental improvement in the library's functionality. The issue addressed by this commit is tracked in the repository under the specified issue link.
Files changed
- third_party/xla/xla/python/BUILD
- third_party/xla/xla/python/pjit.cc
- third_party/xla/xla/python/xla_client.py
The commit associated with PR #15630 introduces a significant enhancement to the XLA (Accelerated Linear Algebra) framework by utilizing the CUDA runtime API to accurately determine whether two ranks are located on the same host. Previously, the logic used to identify local participants in NCCL (NVIDIA Collective Communications Library) clique groups was based on the number of local participants relative to the total devices in the clique, which did not always reflect the actual local communication capabilities. This discrepancy was highlighted in scenarios where groups included ranks that were not on the same host, leading to potential misconfigurations in local communication.
With this update, the commit modifies the existing logic to leverage the CUDA runtime API to ascertain the number of devices on a host and assess the current rank ID. This allows for a more reliable determination of local communication capabilities during collective operations, specifically in the context of the collective permute thunk. The changes include updates to various parts of the codebase, such as the `ExecutableRunOptions` class and methods related to the NCCL collective operations, ensuring that local device counts are accurately passed and utilized. This enhancement not only improves the correctness of local communications in multi-GPU setups but also addresses potential issues related to device synchronization and communication deadlocks.
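The host-locality check reduces to simple arithmetic once the device count per host is known. This sketch assumes a contiguous rank-to-host mapping with one rank per device (an assumption for illustration; the real code obtains the device count from the CUDA runtime rather than as a parameter, and the function name is invented):

```python
def on_same_host(rank_a, rank_b, devices_per_host):
    """Illustrative sketch: with one rank per device and a fixed number of
    devices per host, two ranks are local to each other iff they fall in
    the same host-sized block of ranks."""
    return rank_a // devices_per_host == rank_b // devices_per_host
```

With 4 GPUs per host, ranks 0-3 share a host while rank 4 lives on the next one, which is exactly the distinction the old participant-count heuristic could get wrong.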
Files changed
- third_party/xla/xla/executable_run_options.cc
- third_party/xla/xla/executable_run_options.h
- third_party/xla/xla/pjrt/pjrt_stream_executor_client.cc
- third_party/xla/xla/service/gpu/gpu_executable.cc
- third_party/xla/xla/service/gpu/runtime/nccl_collective_permute_thunk.cc
- third_party/xla/xla/service/gpu/runtime/nccl_collective_permute_thunk.h
- third_party/xla/xla/service/gpu/runtime/thunk.h
- third_party/xla/xla/service/hlo_runner.cc
- third_party/xla/xla/service/hlo_runner.h
- third_party/xla/xla/service/service.cc
- third_party/xla/xla/service/service_executable_run_options.h
This commit addresses compatibility issues between TensorFlow's numpy-like operations and the upcoming Numpy 2.x version. Specifically, it modifies the `tf.numpy.sign`, `tf.numpy.linspace`, and `tf.numpy.logspace` functions to ensure they behave correctly with the new definitions introduced in Numpy 2.x. The changes include adjusting the way the `dtype` is determined in the `linspace` and `logspace` functions, aligning it with the behavior of Numpy 2.x by basing it on the `start` and `stop` parameters instead of a fixed dtype.
Additionally, the `sign` function has been updated to directly return the result of `math_ops.sign(x)` when the Numpy version is 2.0 or higher, reflecting the unified definition of the sign function in the newer version. These updates enhance the compatibility of TensorFlow's numpy operations with future Numpy releases, ensuring smoother integration and functionality.
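The dtype-inference idea can be approximated with `np.result_type` (a sketch of the general approach, not TensorFlow's actual implementation; the helper name is invented): derive the result dtype from `start` and `stop`, promoted through a Python float so integer inputs still land on a floating dtype.

```python
import numpy as np

def linspace_dtype(start, stop):
    """Sketch of Numpy 2.x-style dtype inference for linspace/logspace:
    base the dtype on start and stop rather than hardcoding one. The
    trailing 0.0 forces promotion to at least a float dtype."""
    return np.result_type(start, stop, 0.0)
```

Integer endpoints thus yield `float64`, while narrower float endpoints can keep their precision instead of being forced up to a fixed dtype.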
Files changed
- tensorflow/python/ops/numpy_ops/np_array_ops.py
- tensorflow/python/ops/numpy_ops/np_math_ops.py
This commit introduces a new optimization pattern for TensorFlow Lite, specifically targeting the reordering of gather and cast operations within the MLIR (Multi-Level Intermediate Representation) framework. The changes include the addition of a new function, `@reorder_gather_cast`, which performs a cast operation on an input tensor followed by a gather operation. The optimization replaces the sequence "Gather(Cast(input), indices)" with "Cast(Gather(input, indices))", which improves efficiency by reducing the number of tensor elements that need to be converted during these operations.
Additionally, the commit modifies existing test cases and patterns in the `optimize.mlir` and `optimize_patterns.td` files to incorporate this new optimization. The pattern effectively streamlines the data processing pipeline by ensuring that the gather operation is performed before casting, thereby minimizing computational overhead and enhancing performance in tensor manipulations. Overall, these changes contribute to more efficient execution of TensorFlow Lite models.
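The rewrite is easy to see in NumPy terms (a stand-in for the TFLite ops; the arrays here are arbitrary examples): casting after the gather converts only the gathered elements, two here instead of all six, and the results are identical.

```python
import numpy as np

x = np.arange(6, dtype=np.int8)   # 6-element input tensor
idx = np.array([0, 3])            # gather 2 of them

cast_then_gather = x.astype(np.float32)[idx]  # original: Gather(Cast(input), indices) - casts 6 elements
gather_then_cast = x[idx].astype(np.float32)  # rewritten: Cast(Gather(input, indices)) - casts 2 elements

assert np.array_equal(cast_then_gather, gather_then_cast)
```

The reordering is safe whenever the cast is element-wise and value-preserving on the gathered elements, which is the situation the pattern targets.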
Files changed
- tensorflow/compiler/mlir/lite/tests/optimize.mlir
- tensorflow/compiler/mlir/lite/transforms/optimize_patterns.td
This commit addresses a memory leak issue in the error handling paths of the `cuda_executor.cc` file within the XLA (Accelerated Linear Algebra) project. The changes involve modifying the way `DeviceMemoryBase` objects are allocated and deallocated when creating or sharing constants on the GPU. Specifically, the code has been updated to use `std::make_unique` for memory allocation, which ensures that the memory is properly managed and released, thus preventing leaks when errors occur during operations like memory copying.
Additionally, the commit includes adjustments to how the custom deleter for shared pointers is defined, ensuring that the allocated memory is correctly deallocated without leaving dangling pointers. These changes enhance the robustness of the memory management in the CUDA executor, contributing to better resource handling and stability in GPU operations.
Files changed
- third_party/xla/xla/stream_executor/cuda/cuda_executor.cc