TensorFlow changelog


Here's the scoop on our latest updates, where we've been busy adding new features, squashing bugs, and refining our systems to make everything run smoother than ever. Check out the highlights below and see how we're making things better for you! 🚀


New Features:

  • xla::Collectives API: We've rolled out the new xla::Collectives API, setting the stage for NVIDIA Collective Communications Library (NCCL) integration. This makes XLA more robust for parallel processing on GPUs, with support for both host and device-initiated collective operations. 🌟

  • Greater Op Legalization: TensorFlow Lite's LiteRT framework now supports the "greater" operation, complete with new test data and build configurations. This addition enhances tensor comparison capabilities. 📈

  • Dynamic Shapes in Convolutions: StableHLO now supports dynamic shapes in 1D convolutions, offering more flexibility and aligning with modern machine learning needs. 🌀

  • Ragged All-to-All in XLA: We've added asynchronous start and done phases for the "ragged all-to-all" operation, boosting XLA's efficiency in handling complex collective operations. 🚀

  • Custom Options in IFRT: Users can now specify custom_options for runtime-specific execution, allowing more tailored execution parameters. 🛠️

  • Multi XSpace to InferenceStats Conversion: A new function transforms multiple XSpace instances into InferenceStats, enhancing TensorFlow's profiling framework for better inference performance insights. 🔍

  • HLO Stats Tool: Introducing the HLO Stats Tool in TensorFlow's profiler for deeper performance analysis of high-level operations. 📊

Improvements:

  • C++ Tree with Path API: We've transitioned the tree_util.tree_flatten_with_path and tree_map_with_path APIs to C++, speeding up the pytree flattening process. ⚡

Bug Fixes:

  • Triton Dot Product Bug: Fixed a bug in Triton's TF32_TF32_F32_X3 dot algorithm for dot(inf, 1.0), ensuring correct results by addressing non-finite partial-product summation. 🔧

  • Wheel Creation Logic: Resolved issues in TensorFlow's wheel creation logic when using pywrap rules, improving the packaging process. 📦

  • Graph Output Tensor Recognition: Corrected logic in TensorFlow Lite to ensure graph output tensors are recognized even when used by other Ops. 🛠️

Chores:

  • Obsolete TODO Removal: Cleaned up outdated TODO comments in the TensorFlow XLA compiler codebase, streamlining and clarifying the code. 🧹

These updates are all about making your experience smoother, faster, and more efficient. Stay tuned for more exciting improvements and keep the feedback coming! 😊

Included Commits

2024-12-01T05:26:07 See commit

This commit enhances the performance of the pytree flattening process by transitioning the tree_util.tree_flatten_with_path and tree_map_with_path APIs to a C++ implementation. While the public-facing APIs remain unchanged, the underlying key classes have been moved to the C++ level, which may introduce minor issues such as the loss of Python dataclass functionality and potential compatibility concerns with pytype due to pattern matching.

Additionally, the commit includes the registration of defaultdict and OrderedDict through the keypath API, further improving the functionality of the pytree system. The modifications are reflected in several files, including pytree.cc, pytree.h, and related Python files, indicating a significant refactor aimed at optimizing the performance and capabilities of the tree structure within the XLA framework.
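
For a quick look at what these path-aware APIs do, here is a minimal sketch using jax.tree_util, the Python surface over this same XLA pytree implementation:

```python
import jax.tree_util as jtu

tree = {"bias": 3.0, "weights": [1.0, 2.0]}

# Flatten, pairing each leaf with the key path that reaches it.
path_leaf_pairs, treedef = jtu.tree_flatten_with_path(tree)
for path, leaf in path_leaf_pairs:
    print(jtu.keystr(path), "->", leaf)
# ['bias'] -> 3.0
# ['weights'][0] -> 1.0
# ['weights'][1] -> 2.0

# tree_map_with_path gives the callback access to the same key paths.
labeled = jtu.tree_map_with_path(
    lambda path, leaf: f"{jtu.keystr(path)}={leaf}", tree)
```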

Files changed

  • third_party/xla/xla/python/pytree.cc
  • third_party/xla/xla/python/pytree.h
  • third_party/xla/xla/python/xla_client.py
  • third_party/xla/xla/python/xla_extension/pytree.pyi
2024-12-02T23:49:35 See commit

This commit introduces the xla::Collectives API, which sets the groundwork for implementing the NVIDIA Collective Communications Library (NCCL) within the XLA (Accelerated Linear Algebra) framework. The changes include the addition of a new header file, nccl_collectives.h, that defines a class for host-initiated collectives based on NCCL, extending the base Collectives class. The commit also modifies existing BUILD files to incorporate the new NCCL-related components, ensuring they are properly linked with the necessary dependencies for CUDA and ROCm configurations.

Furthermore, the commit updates the documentation to include forward-looking statements regarding the use of NVSHMEM, highlighting the dual support for both host-initiated and device-initiated collective operations within XLA. The overall aim is to enhance the collective operation capabilities of XLA on GPU platforms, paving the way for more efficient parallel processing in machine learning and other computational tasks.
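
For context, these are the collectives that back user-level operations such as JAX's psum; a minimal sketch (on NVIDIA GPUs, XLA dispatches this all-reduce through NCCL, the backend the new API abstracts):

```python
import jax
import jax.numpy as jnp

# One shard per local device; psum is an all-reduce across them.
n = jax.local_device_count()
x = jnp.arange(n * 4, dtype=jnp.float32).reshape(n, 4)

summed = jax.pmap(lambda v: jax.lax.psum(v, axis_name="i"),
                  axis_name="i")(x)
print(summed)  # every device holds the same reduced result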

Files changed

  • third_party/xla/xla/backends/gpu/collectives/BUILD
  • third_party/xla/xla/backends/gpu/collectives/nccl_collectives.h
  • third_party/xla/xla/core/collectives/BUILD
  • third_party/xla/xla/core/collectives/collectives.h
2024-12-04T03:01:02 See commit

The recent commit introduces a new feature in the ifrt::ExecuteOptions structure, allowing users to specify custom_options when executing a loaded executable via ifrt::LoadedExecutable::Execute(). This enhancement is designed to accommodate runtime-specific metadata that can be utilized internally by the implementation, without interfering with the existing IFRT API semantics. A practical example of this feature is the ability to propagate profiling keys that are not currently managed by the IFRT APIs. The implementation includes modifications to several files, including the addition of a new test case to validate the functionality of the custom options.

In terms of implementation, the commit modifies the ExecuteOptions structure to include an optional AttributeMap for custom_options, and updates the associated protocol buffer definition to serialize and deserialize this new field. The changes also include the addition of a new test file to ensure that the serialization and deserialization processes for ExecuteOptions work correctly, confirming that the custom_options are correctly handled during the round-trip conversion. Overall, this enhancement provides greater flexibility for users to pass additional execution parameters that can be leveraged by the runtime.
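
The new field lives in the C++ ifrt::ExecuteOptions struct; as a rough Python-flavored analogue of the shape of the change (the names below are illustrative, not the real API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExecuteOptions:
    """Toy stand-in for ifrt::ExecuteOptions, for illustration only."""
    # New: optional runtime-specific key/value metadata. A runtime that
    # recognizes a key may act on it; all others simply ignore it, so the
    # core IFRT execution semantics are unchanged.
    custom_options: Optional[dict] = None

opts = ExecuteOptions(custom_options={"profiling_key": "step_42"})
```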

Files changed

  • third_party/xla/xla/python/ifrt/BUILD
  • third_party/xla/xla/python/ifrt/executable.cc
  • third_party/xla/xla/python/ifrt/executable.h
  • third_party/xla/xla/python/ifrt/executable_test.cc
  • third_party/xla/xla/python/ifrt/execute_options.proto
2024-12-04T22:16:28 See commit

This commit introduces support for dynamic shapes in convolution operations within the StableHLO framework, specifically enhancing the handling of 1D convolutions. It modifies several files to replace reshaping operations with "expand_dims" and "squeeze" operations, which are more suitable for accommodating dynamic tensor shapes. The changes involve updating the function signatures, adjusting the way tensors are reshaped, and ensuring that the convolution operations can now handle inputs of varying dimensions more effectively.

Additionally, the commit refines the legality checks for convolution operations by removing constraints that required static dimensions, thus allowing for a broader range of tensor shapes to be processed. This shift not only enhances flexibility in model design but also aligns with the evolving needs of machine learning applications that often work with dynamic data. Overall, the modifications aim to improve the robustness and usability of convolution operations within the MLIR (Multi-Level Intermediate Representation) framework.
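
A minimal TensorFlow sketch of why this works: expand_dims and squeeze add and remove unit dimensions without touching a dynamic width, whereas a reshape would need concrete sizes:

```python
import tensorflow as tf

def conv1d_via_2d(x, kernel):
    # x: [batch, width, channels] with a possibly dynamic width;
    # kernel: [k_width, in_channels, out_channels].
    x4 = tf.expand_dims(x, axis=1)        # [batch, 1, width, channels]
    k4 = tf.expand_dims(kernel, axis=0)   # [1, k_width, in, out]
    y4 = tf.nn.conv2d(x4, k4, strides=1, padding="SAME")
    return tf.squeeze(y4, axis=1)         # [batch, width, out_channels]

# Traces cleanly even when the width dimension is unknown (None).
fn = tf.function(conv1d_via_2d).get_concrete_function(
    tf.TensorSpec([None, None, 8]), tf.TensorSpec([3, 8, 16]))
```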

Files changed

  • tensorflow/compiler/mlir/lite/stablehlo/tests/prepare_hlo.mlir
  • tensorflow/compiler/mlir/lite/stablehlo/transforms/legalize_hlo_conversions/conv.cc
  • tensorflow/compiler/mlir/lite/stablehlo/transforms/legalize_hlo_conversions/conv_util.cc
2024-12-04T22:27:00 See commit

This commit introduces the legalization of the "greater" operation within TensorFlow Lite's experimental LiteRT framework. It includes the addition of a new MLIR test data file, simple_greater_op.mlir, which defines a function that utilizes the tfl.greater operation to compare two tensors. The commit also modifies multiple source files, including the dump.cc file to support output for the new operation, and updates build configurations to include the new greater operation legalization.

Additionally, the commit adds the implementation files greater_op_legalization.cc and greater_op_legalization.h, which define the logic for legalizing the greater operation within the QNN (Qualcomm Neural Network) compiler plugin. The changes ensure that the greater operation can be recognized and processed appropriately, improving TensorFlow Lite's handling of tensor comparisons. Overall, this commit extends the LiteRT compiler with the greater operation, along with the necessary testing and build configurations.
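
At the model level, the op in question is the ordinary element-wise comparison; a small conversion sketch (delegating it to QNN additionally requires the Qualcomm vendor plugin):

```python
import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec([4], tf.float32),
                              tf.TensorSpec([4], tf.float32)])
def greater(a, b):
    return tf.greater(a, b)  # becomes tfl.greater in the converted model

converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [greater.get_concrete_function()])
tflite_model = converter.convert()
```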

Files changed

  • tensorflow/lite/experimental/litert/test/testdata/simple_greater_op.mlir
  • tensorflow/lite/experimental/litert/tools/dump.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/BUILD
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/legalizations/BUILD
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/legalizations/greater_op_legalization.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/legalizations/greater_op_legalization.h
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/qnn_compiler_plugin.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/qnn_compiler_plugin_test.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/qnn_compose_graph.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/dispatch/litert_dispatch_invocation_context.cc
2024-12-04T23:06:47 See commit

This commit addresses a bug in the TensorFlow Lite framework related to how output tensors are handled when utilized by other operations (Ops). The previous implementation had a loose logic in the isOutput() method, which caused certain tensors to not be recognized as output tensors. To fix this, the commit introduces a check to ensure that tensors used by other Ops are correctly set as graph outputs. This is accomplished by modifying the GraphMapper class to include a method that registers output tensors and updates the tensor type accordingly.

Additionally, the commit includes changes to the build configuration and header files to accommodate the new logic. Specifically, it adds a new dependency on absl/container/flat_hash_set and implements the registration of graph outputs in the MapGraph function. These modifications collectively enhance the handling of output tensors within the graph, ensuring they are properly recognized and utilized in the framework's operations.
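
The gist of the fix can be shown with a self-contained toy (the names are illustrative, not the actual GraphMapper API): whether a tensor is a graph output must be decided by the graph's declared outputs, not by whether it has downstream consumers:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Tensor:
    name: str

@dataclass
class Graph:
    outputs: frozenset                           # declared graph outputs
    users: dict = field(default_factory=dict)    # tensor -> consuming ops

def is_output(graph, t):
    # Loose heuristic (the old bug): treat any tensor with consumers as
    # internal, misclassifying outputs that also feed other ops:
    #   return t not in graph.users
    # Fixed: consult the declared output set directly.
    return t in graph.outputs

t = Tensor("relu_out")
g = Graph(outputs=frozenset({t}), users={t: ["concat"]})
assert is_output(g, t)  # recognized even though another op consumes it
```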

Files changed

  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/BUILD
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/graph_mapper.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/graph_mapper.h
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/qnn_compose_graph.cc
2024-12-05T02:04:21 See commit

The recent commit introduces a new tool in TensorFlow's profiler called the HLO Stats Tool, aimed at enhancing performance analysis of high-level operations (HLOs). This addition includes the creation of new source files (op_stats_to_hlo_stats.cc and op_stats_to_hlo_stats.h) that define how operation statistics are converted into HLO statistics. The tool captures various metrics such as execution times, operational intensity, memory bandwidth, and identifies whether operations are autotuned or involve rematerialization. Additionally, a new protobuf definition for HLO stats is included, allowing structured representation of the collected data.

The commit also updates several existing files to integrate the new HLO Stats Tool into the overall profiling framework. This includes modifications to the build configurations and the addition of HLO stats to the list of available tools in the profiler. The updates ensure that the new tool can be utilized effectively within TensorFlow's profiling ecosystem, thus providing developers with deeper insights into the performance characteristics of their models during execution.
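
Capturing input for the tool uses the standard profiler API; a minimal sketch (once a profile is collected, the new HLO stats view should appear alongside the existing tools):

```python
import tensorflow as tf

# Capture a profile; the converters (op stats -> HLO stats) consume the
# resulting data when the profile is viewed.
tf.profiler.experimental.start("/tmp/tf_profile")
x = tf.random.normal([256, 256])
y = tf.matmul(x, x)  # stand-in for real model steps
tf.profiler.experimental.stop()
```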

Files changed

  • tensorflow/core/profiler/convert/BUILD
  • tensorflow/core/profiler/convert/op_stats_to_hlo_stats.cc
  • tensorflow/core/profiler/convert/op_stats_to_hlo_stats.h
  • tensorflow/core/profiler/convert/xplane_to_tool_names.cc
  • tensorflow/core/profiler/convert/xplane_to_tool_names_test.cc
  • tensorflow/core/profiler/convert/xplane_to_tools_data.cc
  • tensorflow/core/profiler/protobuf/BUILD
  • tensorflow/core/profiler/protobuf/hlo_stats.proto
2024-12-05T05:01:25 See commit

This commit addresses issues in the wheel creation logic for TensorFlow when utilizing pywrap rules. The modifications span several files, including the BUILD files and various Python scripts within the TensorFlow repository, indicating a comprehensive update to improve the packaging process.

Key changes were made to the build scripts and utility functions related to pip package creation, ensuring that the pywrap rules integrate correctly. The fix is aimed at making TensorFlow wheel builds more reliable and efficient, since the wheels are essential for distribution and installation via pip.

Files changed

  • tensorflow/python/BUILD
  • tensorflow/tensorflow.default.bzl
  • tensorflow/tools/pip_package/BUILD
  • tensorflow/tools/pip_package/build_pip_package.py
  • tensorflow/tools/pip_package/utils/data_deps.bzl
  • tensorflow/tools/pip_package/utils/tf_wheel.bzl
  • tensorflow/tools/pip_package/utils/utils.py
  • third_party/xla/third_party/tsl/third_party/py/rules_pywrap/pywrap.bzl
  • third_party/xla/third_party/tsl/third_party/py/rules_pywrap/pywrap.default.bzl
  • third_party/xla/third_party/tsl/third_party/py/rules_pywrap/pywrap.impl.bzl
2024-12-05T08:12:16 See commit

The recent commit introduces a new function, ConvertMultiXSpaceToInferenceStats, which is designed to transform multiple XSpace instances into InferenceStats within TensorFlow's profiling framework. This function iterates through the session snapshot's XSpaces, processes each one to extract relevant inference statistics, and combines these statistics into a unified InferenceStats object. The implementation leverages other existing functions for preprocessing and grouping metadata, ensuring that the data is organized effectively for subsequent analysis.

Additionally, the commit includes modifications to various files, such as updating build configurations to include the new library and adjusting related components to support the new functionality. The changes enhance the profiling capabilities of TensorFlow by allowing for more comprehensive insights into inference performance across multiple devices, particularly in contexts like TPU utilization. Overall, this enhancement aims to improve the efficiency and effectiveness of performance profiling in machine learning workflows.

Files changed

  • tensorflow/core/profiler/convert/BUILD
  • tensorflow/core/profiler/convert/inference_stats.cc
  • tensorflow/core/profiler/convert/inference_stats_combiner.cc
  • tensorflow/core/profiler/convert/multi_xspace_to_inference_stats.cc
  • tensorflow/core/profiler/convert/multi_xspace_to_inference_stats.h
  • tensorflow/core/profiler/convert/preprocess_single_host_xplane.cc
  • tensorflow/core/profiler/convert/preprocess_single_host_xplane.h
  • tensorflow/core/profiler/convert/xplane_to_tools_data.cc
2024-12-05T23:50:08 See commit

This commit primarily focuses on the removal of obsolete TODO comments and those associated with closed bugs within the TensorFlow XLA (Accelerated Linear Algebra) compiler codebase. The changes are concentrated in files under the third_party/xla/ tree, where a total of 54 lines were deleted across multiple files. The updates aim to streamline the code by eliminating outdated notes that are no longer relevant, thus improving code clarity and maintainability.

By cleaning up these comments, the commit enhances the overall quality of the codebase, ensuring that developers are not misled by references to tasks that have already been addressed or are no longer applicable. This kind of maintenance is essential in large projects like TensorFlow, where legacy comments can clutter the code and hinder development efforts.

Files changed

  • third_party/xla/xla/hlo/builder/lib/math_test.cc
  • third_party/xla/xla/hlo/translate/hlo_to_mhlo/hlo_function_importer.cc
  • third_party/xla/xla/pjrt/gpu/se_gpu_pjrt_client.h
  • third_party/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc
  • third_party/xla/xla/service/gpu/fusions/triton/triton_support.cc
  • third_party/xla/xla/service/gpu/fusions/triton/triton_support_legacy_test.cc
  • third_party/xla/xla/service/gpu/gpu_fusible.cc
  • third_party/xla/xla/service/gpu/split_k_gemm_rewriter_test.cc
  • third_party/xla/xla/service/gpu/transforms/gemm_rewriter.cc
  • third_party/xla/xla/service/p2p_schedule_preparation.cc
  • third_party/xla/xla/stream_executor/host/jit_host_kernel_function.cc
  • third_party/xla/xla/tsl/framework/convolution/BUILD
  • third_party/xla/xla/tsl/protobuf/error_codes.proto
2024-12-06T06:55:36 See commit

This commit enhances the XLA (Accelerated Linear Algebra) library by adding support for splitting the "ragged all-to-all" operation into asynchronous start and done phases. Specifically, it introduces a new predicate for converting ragged all-to-all operations and modifies the AsyncCollectiveCreator class to handle this new operation type alongside existing collective operations like all-to-all and reduce scatter. The changes include updates to the matching logic for collectives and the implementation of the asynchronous start-done pattern for the ragged all-to-all operation.

Additionally, the commit includes extensive updates to the testing framework, renaming existing tests to reflect the new collective creator structure and adding new tests specifically for the ragged all-to-all functionality. The new tests verify that the operation is correctly transformed into an asynchronous format, ensuring that the start and done phases are accurately represented in the computation graph. Overall, this commit strengthens the XLA's capabilities in handling more complex collective operations in a more efficient and asynchronous manner.
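
The start/done split is the classic pattern of overlapping communication with compute: launch the collective, do independent work, and block only where the result is consumed. A plain asyncio analogy of the idea (not the XLA API):

```python
import asyncio

async def ragged_all_to_all(shards):
    await asyncio.sleep(0.01)  # stand-in for the network exchange
    return shards

async def step():
    # "start" phase: launch the collective without blocking.
    handle = asyncio.create_task(ragged_all_to_all([1, 2, 3]))
    # Independent compute overlaps with the in-flight communication.
    local = sum(i * i for i in range(10_000))
    # "done" phase: wait only at the point the result is consumed.
    exchanged = await handle
    return local, exchanged

asyncio.run(step())
```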

Files changed

  • third_party/xla/xla/hlo/transforms/collectives/async_collective_creator.cc
  • third_party/xla/xla/hlo/transforms/collectives/async_collective_creator.h
  • third_party/xla/xla/hlo/transforms/collectives/async_collective_creator_test.cc
2024-11-29T15:23:15 See commit

This commit addresses a critical bug in the Triton implementation of the dot product algorithm, specifically for the case of dot(inf, 1.0) when using the TF32_TF32_F32_X3 algorithm. The issue arose from the summation of partial products in the dot product computation, where the presence of NaN in any partial product could lead to an incorrect result. The correct output for dot(inf, 1.0) should be infinity, but the previous implementation could erroneously yield NaN due to the summation of non-finite values.

To resolve this, the fix involves overriding any accumulated partial product if it is non-finite before summing it with the final result. This ensures that the computation reflects the correct mathematical behavior. Additional modifications were made to the testing framework to enable validation of this fix, indicating that the algorithm is now supported and functions correctly in scenarios involving infinity. The changes include updates to relevant source files and test cases to ensure comprehensive coverage and correctness of the implementation.
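
The failure mode is easy to reproduce with a small NumPy sketch of the three-product splitting scheme (float64 -> float32 rounding stands in for the real f32 -> tf32 truncation; the filtering at the end mirrors the spirit of the fix, not its exact code):

```python
import numpy as np

def split(x):
    hi = np.float32(x)  # "high" part (stand-in for tf32 rounding)
    lo = x - hi         # residual; inf - inf == nan is the bug's source
    return hi, lo

x, y = np.float64(np.inf), np.float64(1.0)
x_hi, x_lo = split(x)   # (inf, nan)
y_hi, y_lo = split(y)   # (1.0, 0.0)

partials = [x_hi * y_hi, x_hi * y_lo, x_lo * y_hi]
print(sum(partials))    # nan: the nan partials poison the accumulation

# The fix's idea: discard non-finite correction terms so the dominant
# product determines the result.
main, corrections = partials[0], partials[1:]
fixed = main + sum(c for c in corrections if np.isfinite(c))
print(fixed)            # inf, the mathematically correct dot(inf, 1.0)
```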

Files changed

  • third_party/triton/temporary/dot_TF32x3_fix.patch
  • third_party/triton/temporary/series.bzl
  • third_party/xla/xla/service/gpu/fusions/triton/dot_algorithms_test.cc
  • third_party/xla/xla/service/gpu/fusions/triton/triton_support_legacy.cc
  • third_party/xla/xla/service/gpu/transforms/gemm_fusion.cc