TensorFlow changelog


Here's a delightful summary of the recent updates and improvements. Get ready to dive into the world of new features, bug fixes, and more! ๐Ÿš€

New Features

  • Flatten-Tuple Pass Migration: We've migrated from MHLO to StableHLO with a new transformation pass that flattens tuples in HLO operations. This makes tuple handling more efficient and includes robust test cases to ensure everything is ship-shape. 🛠️
  • kCpu Property Tag: Say hello to the kCpu property tag in the HloRunner class, which helps distinguish between CPU and GPU environments, paving the way for targeted optimizations. 🖥️
  • LiteRt C Runtime Shared Library: A new rule to generate a shared library for the LiteRt C runtime is here, making the TensorFlow Lite framework more versatile and organized. 📚
  • SourceTargetPairs Class: Introducing the SourceTargetPairs class to the XLA service, enhancing the structure and functionality of collective operations. 🎉
  • Pack Op Legalization: The LiteRT framework now supports the Pack operation, crucial for tensor manipulations in deep learning models. 📦

Improvements

  • HostOffloader Enhancements: DynamicUpdateSlice operations that work with host memory are now marked as host compute, improving the efficiency and correctness of memory management. 🧠
  • Reshard Optimization: In the IFRT framework, multiple reshards are now merged into a single operation when possible, reducing redundancy and boosting performance. 🔄
  • Persistent Workers for Parallel Loops: Persistent workers are now used for pthreadpool parallel loops, significantly improving execution times and efficiency in the XLA CPU backend. 🚀

Bug Fixes

  • CUDA Driver Compatibility: Fixed XLA builds on CUDA Driver versions lower than 12.3; unsupported tracing features now return a clear error instead of failing. 🛠️
  • SparseCore Device ID Fix: Resolved incorrect SparseCore device IDs in the TensorFlow profiler's trace viewer, making performance profiles more reliable. 📊
  • Timeline v1 Timestamp Compatibility: Relative XPlane timestamps are now converted to absolute times in the profiler's timeline version 1, ensuring correct timing for GPU events. ⏱️

Chores

  • Cleanup of Deprecated References: We've cleaned up references to the deprecated global_data.h in XLA, streamlining the codebase for clarity and future improvements. 🧹

These updates bring a mix of new capabilities, optimizations, and fixes, making the TensorFlow ecosystem more robust and ready for the future! 🌟

Included Commits

2025-01-17T15:03:29 See commit

The recent commit focused on cleaning up references to global_data.h, which has been marked as deprecated within the XLA (Accelerated Linear Algebra) framework of TensorFlow. The changes involved removing all instances where global_data.h was included or referenced across various files, including build configurations and test cases. This cleanup is part of an effort to streamline the codebase and eliminate reliance on outdated components.

Overall, the commit reflects a proactive approach to maintaining the code quality and ensuring that the TensorFlow XLA module is aligned with current best practices. By removing deprecated references, the developers aim to enhance the clarity and maintainability of the code, paving the way for future improvements and optimizations.

Files changed

  • tensorflow/compiler/jit/BUILD
  • tensorflow/compiler/jit/xla_device_context.h
  • third_party/xla/xla/client/lib/BUILD
  • third_party/xla/xla/client/lib/testing.h
  • third_party/xla/xla/service/cpu/BUILD
  • third_party/xla/xla/service/cpu/sample_harness.cc
  • third_party/xla/xla/tests/BUILD
  • third_party/xla/xla/tests/check_execution_arity_test.cc
  • third_party/xla/xla/tests/client_library_test_base.h
  • third_party/xla/xla/tests/client_test.cc
  • third_party/xla/xla/tests/complex_unary_op_test.cc
  • third_party/xla/xla/tests/compute_constant_test.cc
  • third_party/xla/xla/tests/convolution_test_1d.cc
  • third_party/xla/xla/tests/deallocation_test.cc
  • third_party/xla/xla/tests/deconstruct_tuple_test.cc
  • third_party/xla/xla/tests/map_test.cc
  • third_party/xla/xla/tests/params_test.cc
  • third_party/xla/xla/tests/reduce_precision_test.cc
  • third_party/xla/xla/tests/reduce_test.cc
  • third_party/xla/xla/tests/replay_test.cc
  • third_party/xla/xla/tests/reshape_motion_test.cc
  • third_party/xla/xla/tests/round_trip_packed_literal_test.cc
  • third_party/xla/xla/tests/round_trip_transfer_test.cc
  • third_party/xla/xla/tests/scalar_computations_test.cc
  • third_party/xla/xla/tests/select_test.cc
  • third_party/xla/xla/tests/unary_op_test.cc
  • third_party/xla/xla/tests/value_inference_test.cc
  • third_party/xla/xla/tests/vector_ops_simple_test.cc
2025-01-17T21:24:03 See commit

The commit titled "Pass flatten-tuple: Migrate from MHLO to StableHLO" introduces a new transformation pass to the StableHLO framework, specifically aimed at flattening tuples in the operands and results of certain HLO operations. This involves the addition of a new file, stablehlo_flatten_tuple.cpp, which contains the implementation logic for the flattening process, alongside modifications to various header files to register the new pass and define its behavior. The new pass allows for improved handling of tuple types in HLO operations, ensuring that they can be processed more efficiently by flattening nested tuples and variadic types.

In addition to the implementation, the commit also includes test cases that validate the functionality of the new flattening pass. These tests ensure that the transformation works correctly for both custom calls and tupled operands, confirming that the output matches expected results after applying the flattening logic. Overall, this commit enhances the StableHLO framework's capabilities by providing a mechanism to streamline tuple handling, thereby improving the efficiency of HLO operations.
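Conceptually, the pass rewrites a nested tuple into a flat sequence of leaf values. A minimal Python sketch of that flattening idea (the helper name is hypothetical; the real pass operates on MLIR types and values in stablehlo_flatten_tuple.cpp):

```python
def flatten_tuple(value):
    """Recursively flatten a nested tuple into a flat list of leaf values.

    Illustrative only: the actual pass flattens tuple-typed operands and
    results of HLO operations rather than Python tuples.
    """
    if isinstance(value, tuple):
        flat = []
        for item in value:
            flat.extend(flatten_tuple(item))
        return flat
    return [value]
```

For example, `flatten_tuple((1, (2, 3), ((4,), 5)))` yields `[1, 2, 3, 4, 5]`, mirroring how nested tuple results become a flat list of individual values.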

Files changed

  • third_party/xla/xla/mlir_hlo/BUILD
  • third_party/xla/xla/mlir_hlo/stablehlo_ext/transforms/passes.h
  • third_party/xla/xla/mlir_hlo/stablehlo_ext/transforms/passes.td
  • third_party/xla/xla/mlir_hlo/stablehlo_ext/transforms/stablehlo_flatten_tuple.cpp
  • third_party/xla/xla/mlir_hlo/tests/stablehlo_ext/stablehlo_flatten_tuple.mlir
2025-01-18T00:49:31 See commit

This commit modifies the HostOffloader component to ensure that all DynamicUpdateSlice operations that work with host memory are marked as host compute, with the exception of those utilized for offloading Direct Memory Access (DMA) operations. The changes involve the addition of logic to track DynamicUpdateSlice instructions and determine their memory space after all host memory propagation has been completed. The commit introduces new vectors to store seen DynamicUpdateSlice instructions and their annotations, which allows the code to handle these operations more effectively during the offloading process.

In addition to the code modifications, the commit also includes updates to the associated header files and test cases to validate the new behavior. A specific test case is added to ensure that DynamicUpdateSlice operations are correctly identified and marked as host compute when applicable. Overall, this change enhances the handling of DynamicUpdateSlice operations within the host offloading framework, improving the efficiency and correctness of memory management in the XLA (Accelerated Linear Algebra) compiler.
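The deferred decision described above can be sketched in Python: collect the DynamicUpdateSlice instructions first, then, once host-memory propagation has finished, mark as host compute those that touch host memory and are not used for DMA offloading. All names below are illustrative, not the actual XLA API:

```python
def mark_host_dynamic_update_slices(dynamic_update_slices,
                                    is_host_memory,
                                    used_for_dma):
    """Decide, after propagation, which tracked DynamicUpdateSlice ops
    should be marked as host compute (hypothetical helper).

    is_host_memory / used_for_dma are predicates standing in for the
    memory-space and DMA-annotation checks in HostOffloader.
    """
    host_compute = []
    for op in dynamic_update_slices:
        if is_host_memory(op) and not used_for_dma(op):
            host_compute.append(op)
    return host_compute
```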

Files changed

  • third_party/xla/xla/hlo/transforms/host_offloader.cc
  • third_party/xla/xla/hlo/transforms/host_offloader.h
  • third_party/xla/xla/hlo/transforms/host_offloader_test.cc
2025-01-18T00:53:03 See commit

This commit addresses an issue related to the handling of SparseCore device IDs within the TensorFlow profiler's trace viewer. The specific changes were made in the file xplane_to_step_events.cc, where the logic for converting device trace data into step events was modified.

The primary adjustment involves the calculation of stream_step_events for SparseCore devices, where the device ID is now correctly offset by kSparseCoreIndexStart. This change ensures that the SparseCore device IDs are accurately represented in the trace viewer, enhancing the reliability of performance profiling for TensorFlow applications. Overall, the commit improves the functionality of the profiler by refining the conversion process for device-specific trace data.
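The fix amounts to a simple offset when computing the device ID. A sketch of the arithmetic (the constant's value here is illustrative, not the real kSparseCoreIndexStart):

```python
K_SPARSE_CORE_INDEX_START = 1_000_000  # illustrative value only

def sparse_core_device_id(core_index):
    """Offset a SparseCore index so its device ID does not collide with
    regular device IDs in the trace viewer (conceptual sketch)."""
    return K_SPARSE_CORE_INDEX_START + core_index
```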

Files changed

  • tensorflow/core/profiler/convert/xplane_to_step_events.cc
2025-01-18T01:11:15 See commit

This commit introduces a new class called SourceTargetPairs within the XLA (Accelerated Linear Algebra) service. The commit includes the creation of a header file (source_target_pairs.h) and a corresponding implementation file (source_target_pairs.cc), along with a test file (source_target_pairs_test.cc) to ensure the functionality of the new class.

Additionally, several existing files related to the XLA service have been modified, including build files and components involved in collective operations. Notably, a source file was renamed as part of this update, indicating a reorganization or enhancement of the codebase to accommodate the new class. Overall, these changes aim to improve the structure and functionality of the XLA service.
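A rough Python analogue of what such a container might look like — a wrapper around (source, target) replica pairs with HLO-attribute-style formatting, as used by collective-permute. This is a sketch under assumptions, not the real C++ class in source_target_pairs.h:

```python
class SourceTargetPairs:
    """Minimal sketch of a container for collective-permute
    (source, target) replica pairs."""

    def __init__(self, pairs):
        self.pairs = list(pairs)

    def to_string(self):
        # Formats pairs in the HLO attribute style, e.g. {{0,1},{1,2}}.
        return "{" + ",".join(f"{{{s},{t}}}" for s, t in self.pairs) + "}"
```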

Files changed

  • third_party/xla/xla/service/BUILD
  • third_party/xla/xla/service/collective_permute_decomposer.cc
  • third_party/xla/xla/service/gpu/transforms/BUILD
  • third_party/xla/xla/service/gpu/transforms/collective_select_folder.cc
  • third_party/xla/xla/service/gpu/transforms/collective_select_folder_test.cc
  • third_party/xla/xla/service/source_target_pairs.cc
  • third_party/xla/xla/service/source_target_pairs.h
  • third_party/xla/xla/service/source_target_pairs_test.cc
2025-01-20T07:51:43 See commit

This commit addresses compatibility issues with timestamps in the TensorFlow profiler's timeline version 1. The primary change involves modifying the SetNodeTimes function to accept an additional parameter for the start time, allowing for the conversion of relative timestamps from XPlane to absolute timestamps. This update ensures that the event timestamps, which are recorded in nanoseconds, are accurately translated into microseconds by adding the start time to the event's timestamp.

Additionally, the commit includes adjustments in the ConvertGpuXSpaceToStepStats function, where the start time is retrieved from the environment plane and passed to SetNodeTimes. This ensures that all node execution statistics reflect the correct timing, leading to improved accuracy in profiling GPU events. The changes enhance the overall functionality of the TensorFlow profiler by ensuring that the timeline data is correctly interpreted and displayed.
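The conversion described above is simple arithmetic: add the absolute start time to the event's relative offset, then scale nanoseconds down to microseconds. A conceptual sketch (function name is hypothetical):

```python
def to_absolute_micros(start_time_ns, event_offset_ns):
    """Convert an XPlane-relative event timestamp (ns) into an absolute
    timeline-v1 timestamp in microseconds (conceptual arithmetic)."""
    return (start_time_ns + event_offset_ns) // 1000
```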

Files changed

  • tensorflow/core/profiler/convert/xplane_to_step_stats.cc
  • tensorflow/core/profiler/utils/xplane_schema.h
2025-01-20T16:52:12 See commit

This commit addresses compatibility issues with the XLA build when using CUDA Driver versions lower than 12.3. Specifically, it introduces checks in the CudaCommandBuffer class to ensure that certain functionalities, such as StreamBeginCaptureToGraph, are not attempted when the CUDA version is below 12.3. If users attempt to use tracing features on unsupported versions, an absl::UnimplementedError is returned to indicate that these features are not available.

Additionally, the commit modifies the header file to conditionally define a type for CUgraphConditionalHandle based on the CUDA version, ensuring that the code remains functional across different driver versions. Overall, these changes improve the robustness of the XLA framework by preventing unsupported operations and clarifying version dependencies.
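The version gate can be sketched as follows. CUDA encodes version 12.3 as the integer 12030 (major * 1000 + minor * 10); the real code returns an absl::UnimplementedError rather than raising an exception, so this Python sketch only mirrors the control flow:

```python
def begin_capture_to_graph(cuda_version):
    """Illustrative version gate: graph-capture tracing is only attempted
    on CUDA 12.3+ (encoded as 12030); older drivers get an
    'unimplemented' error instead of a crash."""
    if cuda_version < 12030:
        raise NotImplementedError(
            "StreamBeginCaptureToGraph requires CUDA 12.3 or newer")
    return "capturing"
```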

Files changed

  • third_party/xla/xla/stream_executor/cuda/cuda_command_buffer.cc
  • third_party/xla/xla/stream_executor/cuda/cuda_command_buffer.h
2025-01-20T17:55:58 See commit

This commit introduces the use of persistent workers to enhance the execution of pthreadpool parallel loops within the XLA CPU backend, specifically optimizing performance metrics across various benchmarks. The changes lead to significant reductions in execution time and instruction counts for multiple benchmarks, including BM_SingleTask1DLoop, BM_Parallelize2DTile1D, and BM_Parallelize3DTile2D. For instance, BM_SingleTask1DLoop saw a decrease in execution time from 6.55ns to 6.20ns, while BM_Parallelize2DTile1D improved from 18.6µs to 13.4µs, marking a performance gain of 27.84%.

The modifications span several files, including updates to the parallel loop runner and related test files, indicating a comprehensive approach to implementing and validating these optimizations. The overall results demonstrate a marked improvement in efficiency, with reductions in both the number of CPU operations and instructions executed, suggesting that the integration of persistent workers can lead to more efficient parallel processing in XLA's CPU backend.
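The core idea of a persistent worker — one long-lived thread pulling tasks from a queue, instead of paying thread startup cost per parallel loop — can be shown with a toy Python sketch. This illustrates the pattern only and is not the actual pthreadpool or XLA implementation:

```python
import queue
import threading

class PersistentWorker:
    """Toy persistent worker: a single long-lived thread drains a task
    queue, amortizing thread creation across many parallel loops."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            task = self.tasks.get()
            if task is None:  # sentinel to shut the worker down
                break
            task()
            self.tasks.task_done()

    def submit(self, task):
        self.tasks.put(task)

    def join(self):
        # Block until all submitted tasks have completed.
        self.tasks.join()
```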

Files changed

  • third_party/xla/xla/backends/cpu/runtime/xnnpack/BUILD
  • third_party/xla/xla/backends/cpu/runtime/xnnpack/parallel_loop_runner.cc
  • third_party/xla/xla/backends/cpu/runtime/xnnpack/parallel_loop_runner.h
  • third_party/xla/xla/backends/cpu/runtime/xnnpack/parallel_loop_runner_test.cc
  • third_party/xla/xla/backends/cpu/runtime/xnnpack/xnn_threadpool.cc
  • third_party/xla/xla/tsl/concurrency/async_value_ref.h
2025-01-20T21:18:26 See commit

The commit introduces a new optimization pass in the IFRT framework that merges multiple reshards into a single reshard operation when they share the same source and destination. This enhancement reduces redundancy in array transfers, improving performance and efficiency in tensor operations.

In the code, the modification is made in the passes.cc file, where the new merging pass, CreateIfrtMergeReshardsPass, is added to the pass manager under certain conditions. This change optimizes the handling of reshard operations within the IFRT layer, contributing to better resource utilization and execution speed during tensor manipulations.
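The merging idea can be sketched as a grouping step: reshards with the same (source, destination) pair are collected so that each group can be replaced by one combined operation. This is a conceptual sketch, not the actual pass implementation:

```python
def merge_reshards(reshards):
    """Group reshard ops sharing the same (source, destination) pair so
    each group can later be replaced by a single combined reshard.

    `reshards` is a list of (source, destination, array) triples; names
    are illustrative, not the IFRT IR representation.
    """
    groups = {}
    for src, dst, array in reshards:
        groups.setdefault((src, dst), []).append(array)
    return groups
```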

Files changed

  • third_party/xla/xla/python/ifrt/ir/transforms/passes.cc
2025-01-22T23:15:42 See commit

This commit introduces a new rule to generate a shared library for the LiteRt C runtime in the TensorFlow Lite experimental framework. Specifically, it modifies the BUILD file within the tensorflow/lite/experimental/litert/c directory to include the litert_dynamic_lib function, which defines the shared library named libLiteRtRuntimeCApi.so. The rule specifies dependencies on various components of the LiteRt C API, consolidating them into a list called LITERT_C_API_COMMON_DEPS for better maintainability.

Additionally, the commit updates the existing C API common test to utilize this new dependency list, streamlining the code by reducing redundancy. The shared library build process is configured to account for different environments, such as Android, by including appropriate linking options. Overall, these changes enhance the structure and functionality of the LiteRt C runtime, facilitating its use in various applications.

Files changed

  • tensorflow/lite/experimental/litert/c/BUILD
2025-01-23T22:08:35 See commit

This commit introduces a new property tag, kCpu, to the HloRunner class within the XLA (Accelerated Linear Algebra) service. The addition modifies several files, including hlo_runner.cc, hlo_runner_interface.h, and hlo_runner_pjrt.cc, to accommodate the new tag. The kCpu tag is designed to identify when the runner is operating on a CPU, specifically checking if the backend platform name is "Host" for the HloRunner and "CpuName" for the HloRunnerPjRt.

The changes enhance the functionality of the HloRunner class by allowing it to explicitly recognize and differentiate between CPU and GPU execution environments. This enhancement could potentially improve the performance and adaptability of the XLA service by enabling more targeted optimizations based on the execution platform. The commit includes three additions in each of the modified files without any deletions, indicating a straightforward implementation of the new property tag.
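A rough sketch of the tagging logic: the runner reports the kCpu property when its backend platform name indicates a CPU. The platform-name strings below are assumptions for illustration; the real checks live in hlo_runner.cc and hlo_runner_pjrt.cc:

```python
def runner_properties(platform_name):
    """Return the set of property tags for a runner, adding 'kCpu' when
    the backend platform looks like a CPU (illustrative sketch)."""
    props = set()
    if platform_name in ("Host", "cpu"):  # assumed CPU platform names
        props.add("kCpu")
    return props
```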

Files changed

  • third_party/xla/xla/service/hlo_runner.cc
  • third_party/xla/xla/service/hlo_runner_interface.h
  • third_party/xla/xla/service/hlo_runner_pjrt.cc
2025-01-24T00:25:55 See commit

This commit introduces the legalization of the Pack operation within TensorFlow Lite's experimental LiteRT framework. It includes modifications to various files, specifically adding functions to retrieve the axis option for the Pack operation and ensuring that the operation can be properly legalized and registered. A new test case is also added to validate the correct functionality of the Pack operation, ensuring that it can retrieve the axis parameter as expected. The commit enhances the overall functionality of the LiteRT framework by supporting this additional operation, which is crucial for tensor manipulation in deep learning models.

Additionally, the commit involves the creation of a dedicated legalizer for the Pack operation, which includes detailed configurations and the handling of input and output tensors. It ensures that the operation is appropriately translated into the QNN (Qualcomm Neural Network) format, taking into account specific requirements like axis parameters and tensor dimensions. The inclusion of this functionality allows for improved performance and compatibility when deploying models that utilize the Pack operation on supported hardware.
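Semantically, Pack stacks N rank-R tensors into one rank-(R+1) tensor along the given axis — the same behavior as NumPy's `np.stack`. The sketch below shows the op's semantics only, not the LiteRT legalization code:

```python
import numpy as np

def pack(tensors, axis=0):
    """Stack N tensors of identical shape into one tensor with an extra
    dimension at `axis` (the semantics of the Pack op)."""
    return np.stack(tensors, axis=axis)
```

For two (2, 3) inputs, `pack([...], axis=0)` produces a (2, 2, 3) result, with the new dimension indexing the packed tensors.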

Files changed

  • tensorflow/lite/experimental/litert/c/litert_options.cc
  • tensorflow/lite/experimental/litert/c/litert_options.h
  • tensorflow/lite/experimental/litert/c/litert_options_test.cc
  • tensorflow/lite/experimental/litert/test/testdata/simple_pack_op.mlir
  • tensorflow/lite/experimental/litert/tools/dump.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/BUILD
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/graph_mapper.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/legalizations/BUILD
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/legalizations/pack_op_legalization.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/legalizations/pack_op_legalization.h
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/legalizations/util.h
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/qnn_compiler_plugin.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/qnn_compiler_plugin_test.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/qnn_compose_graph.cc