TensorFlow changelog


In this update, we've got a bunch of exciting new features and improvements that will make your developer life a whole lot easier. From enhanced benchmarking workflows to new operation builders for Qualcomm's AI Engine, we've got it all. Plus, we've squashed some pesky bugs to keep things running smoothly. Let's dive into the details! 🚀

  • New Feature: Benchmark Presubmit Workflow
    We've rolled out a shiny new presubmit workflow for benchmarking performance to catch potential regressions before they sneak into the main codebase. This new setup runs tests across various configurations and helps keep the performance top-notch. Plus, we've renamed existing benchmark workflows to make it crystal clear which ones are for nightly runs and which are for presubmit checks. 🕵️‍♂️

  • Improvement: StableHLO Integration
    Integrated an updated pinned revision of StableHLO to streamline tensor operations and improve compatibility within the MLIR framework. This update brings a more concise syntax for operations such as transpose and introduces new tests to ensure everything's running smoothly.

  • New Feature: TraceMe for Thunk Execution
    Added a new tracing mechanism to the Thunk execution process in the XLA CPU backend. This feature provides detailed execution traces, making it easier to monitor and debug performance. 🎯

  • Improvement: PjRtClient::Compile for TFRT GPU
    Implemented the PjRtClient::Compile function for enhanced GPU support in TensorFlow Runtime, optimizing resource utilization and boosting performance for TensorFlow applications.

  • New Feature: Qualcomm AI Engine Direct Op Builders
    Introduced new operation builders for Qualcomm's AI Engine Direct, including Conv2d, DepthwiseConv2d, and more. These additions come with unit tests to ensure robust functionality and improved machine learning model performance. 🤖

  • New Feature: LiteRT GPU Accelerator Integration
    Added the ml_drift_cl_litert feature for better TensorBuffer integration in GPU-accelerated models, enhancing the TensorFlow Lite experimental framework.

  • New Feature: Elementwise Ops in Collective Pipeliner
    Enabled support for elementwise operations in the collective pipeliner, improving the efficiency of GPU computations, especially in scaled FP8 GEMMs.

  • Bugfix: Cross-Module Instruction References
    Fixed an issue where instructions were referencing computations across different modules, which was causing some test failures. This update strengthens module encapsulation and code robustness.

  • Improvement: LiteRT Google Implementation
    Updated the LiteRT Google implementation to try loading the newer libedgetpu_litert.so library first, ensuring compatibility with recent Android builds while maintaining backward compatibility.

  • Chore: Logging Cleanup
    Removed excessive logging in parallel_batch_dataset_op.cc to prevent log spamming and enhance user experience.

  • Bugfix: VhloToVersion Reversion
    Reverted a previous change in the VhloToVersion transformation to simplify version compatibility checks within the StableHLO framework.

  • Bugfix: Trace Events Reversion
    Reverted a change in the trace_events.proto file to clarify the handling of flow events, ensuring the trace event framework functions smoothly.

That's all for now, folks! Keep coding, and stay awesome! 😎

Included Commits

2025-03-05T22:17:41 See commit

The recent commit in PR #20399 enhances the collective pipeliner by enabling support for elementwise operations, particularly in the context of dynamic-update-slice operations, such as those used in scaled FP8 General Matrix Multiplications (GEMMs). This change modifies the RunCollectiveOptimizationPasses function to utilize a tree-based pipeline approach, which is expected to improve the efficiency of collective operations in GPU computations. The update involves adjustments to the codebase, including modifications to the GPU compiler and the addition of new tests to ensure the correct implementation of elementwise operations within the collective pipeliner.

In addition to the code changes, the commit also introduces new test cases that validate the functionality of the updated collective pipeliner with elementwise operations. These tests are designed to compare the performance of pipelined and non-pipelined executions, ensuring that the new features work as intended. The successful integration of these changes is intended to enhance the overall performance of collective operations in the XLA (Accelerated Linear Algebra) framework, thereby benefiting applications that rely on efficient matrix computations.

Files changed

  • third_party/xla/xla/service/gpu/gpu_compiler.cc
  • third_party/xla/xla/tests/collective_ops_e2e_test.cc
2025-03-05T22:29:20 See commit

The commit updates the LiteRT Google implementation by modifying the Southbound API library loading process. It changes the logic to attempt loading the libedgetpu_litert.so library first, which is the newer implementation for Edge TPU in recent Android builds. If this attempt fails, the code falls back to the older libedgetpu_util.so library, ensuring compatibility with older Android versions.

In terms of code changes, the commit includes the addition of new constants for the library paths and updates the LoadSymbols function to reflect the new loading strategy. This modification enhances the flexibility of the library loading process, allowing the application to leverage newer features while maintaining backward compatibility. Overall, the update improves the robustness of the Southbound API implementation within the LiteRT framework.
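The try-the-newer-library-first strategy described here is a common fallback-loading pattern. The sketch below is an illustrative Python model, not the actual C++ Southbound code: the library names come from the commit description, while `load_first_available` and `fake_loader` are hypothetical stand-ins for the real dlopen()-based loading.

```python
# Illustrative sketch of the newer-first library loading strategy.
# Library names are from the commit; the loader callable is a stand-in
# for dlopen()-style loading in the real C++ implementation.

SOUTHBOUND_CANDIDATES = [
    "libedgetpu_litert.so",  # newer Edge TPU implementation (recent Android)
    "libedgetpu_util.so",    # older implementation, kept for compatibility
]

def load_first_available(candidates, loader):
    """Return (name, handle) for the first library the loader accepts.

    `loader` raises OSError for libraries that cannot be loaded, mirroring
    a failed dlopen(). Returns (None, None) when every candidate fails.
    """
    for name in candidates:
        try:
            return name, loader(name)
        except OSError:
            continue  # fall through to the next, older candidate
    return None, None

# Simulated loader: only the older library is present on this "device".
def fake_loader(name):
    if name != "libedgetpu_util.so":
        raise OSError(f"{name}: cannot open shared object file")
    return object()  # opaque handle

name, handle = load_first_available(SOUTHBOUND_CANDIDATES, fake_loader)
print(name)  # the newer library is missing, so loading falls back
```

The ordered candidate list keeps the policy (prefer newest) separate from the mechanism (try, catch, fall through), which is what makes adding further fallbacks cheap.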

Files changed

  • tensorflow/lite/experimental/litert/vendors/google_tensor/dispatch/southbound.cc
2025-03-05T23:20:33 See commit

This commit introduces a new tracing mechanism to the Thunk execution process within the XLA CPU backend, enhancing profiling capabilities. It modifies the existing ThunkExecutor class by adding a TracedExecute method, which wraps the execution of thunks with profiling annotations to capture the start and end events. The implementation uses TraceMeProducer and TraceMeConsumer to log these events, ensuring that overhead is minimized when profiling is not active.

Additionally, the commit updates the relevant build configurations and header files to include necessary dependencies for the new tracing functionality. The changes aim to improve performance monitoring and debugging by providing detailed execution traces for thunks, ultimately aiding developers in understanding and optimizing the execution flow within the XLA framework.
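The producer/consumer tracing idea can be modeled compactly. The following is a minimal Python sketch of the pattern, not TensorFlow's actual TraceMe API: a scope records begin/end events linked by a correlation id (so a profiler can connect the scope that launches work with the scope that runs it), and does nothing when tracing is disabled. All names here are hypothetical.

```python
import itertools

TRACING_ENABLED = True  # assumption: a global toggle; real code checks profiler state
EVENTS = []             # collected (event, name, correlation_id) tuples
_ids = itertools.count(1)

class TraceScope:
    """Record begin/end events around a region, linked by a correlation id.

    Mirrors the producer/consumer idea: the launching scope and the
    executing scope share an id, so a profiler can draw a flow arrow
    between them. When tracing is off, __enter__/__exit__ record nothing.
    """
    def __init__(self, name, correlation_id=None):
        self.name = name
        self.id = correlation_id if correlation_id is not None else next(_ids)

    def __enter__(self):
        if TRACING_ENABLED:
            EVENTS.append(("begin", self.name, self.id))
        return self

    def __exit__(self, *exc):
        if TRACING_ENABLED:
            EVENTS.append(("end", self.name, self.id))
        return False

def execute_thunk(thunk_name, fn, producer_id):
    # "Consumer" side: reuses the producer's id so the events correlate.
    with TraceScope(f"Thunk::{thunk_name}", producer_id):
        return fn()

# "Producer" side: the executor wraps the launch in its own scope.
with TraceScope("ThunkExecutor::Execute") as producer:
    result = execute_thunk("add", lambda: 1 + 2, producer.id)

print(result)       # 3
print(len(EVENTS))  # 4: outer begin, inner begin, inner end, outer end
```

Checking the toggle before recording is what keeps overhead negligible when profiling is inactive, which the commit calls out as a design goal.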

Files changed

  • third_party/xla/xla/backends/cpu/runtime/BUILD
  • third_party/xla/xla/backends/cpu/runtime/thunk_executor.cc
  • third_party/xla/xla/backends/cpu/runtime/thunk_executor.h
2025-03-06T00:57:58 See commit

This commit reverts a previous change identified by the hash f6439c5f481f2769d9be994995bf2c176a99ea8a, specifically in the trace_events.proto file located in the TensorFlow core profiler directory. The modifications include the removal of comments regarding flow events and their classification within the EventType enum, effectively restoring the previous state of the code.

The changes made in this commit involve deleting four lines and modifying five, with a focus on clarifying the handling of flow events in the context of trace events. Notably, the comment indicating that flow events are part of the EVENT_TYPE_COMPLETE category has been removed, and the EVENT_TYPE_FLOW has been marked as deprecated, suggesting a shift in how flow events are managed within the trace event framework.

Files changed

  • tensorflow/core/profiler/protobuf/trace_events.proto
2025-03-06T01:22:45 See commit

This commit addresses user complaints regarding excessive logging by removing the log statement that reports the minimum parallelism setting in the parallel_batch_dataset_op.cc file of the TensorFlow data kernel. The specific log entry that was removed was intended to inform users about the minimum parallelism value being set but has been deemed unnecessary and spammy by some users.

The changes made involve the deletion of two lines of code that logged the minimum parallelism, which is determined by comparing the autotune minimum and the maximum parallelism available in the context. By eliminating this logging, the commit aims to enhance the user experience by reducing log clutter without affecting the functionality of the parallel batch dataset operation.

Files changed

  • tensorflow/core/kernels/data/parallel_batch_dataset_op.cc
2025-03-06T02:43:56 See commit

This commit integrates a specific version of StableHLO from the repository at openxla/stablehlo, identified by the commit hash 7b7d6ad4. The update involves significant modifications, including a large number of deletions (353 lines) and some additions to various files within the StableHLO project. Key changes include updates to the handling of tensor operations, such as the transpose operation, which now uses a more streamlined syntax for specifying permutations. Additionally, new tests and patterns have been introduced to enhance compatibility and functionality within the MLIR framework, specifically addressing the transformation of operations with dynamic shapes and memory effects.

The commit also updates the workspace configuration to reflect the new commit and SHA256 hash for StableHLO, ensuring that the integration aligns with the latest changes in the upstream repository. Overall, this integration aims to improve the performance and compatibility of StableHLO operations within the MLIR environment, while also refining the codebase to facilitate future development and maintenance.
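As an illustration of the "more streamlined syntax" for transpose, StableHLO's pretty-printed assembly spells the permutation inline rather than as a generic attribute. This is a hedged sketch: the tensor types are made up, and the older generic form shown is approximate.

```mlir
// Older generic form: the permutation is an attribute (approximate).
%0 = "stablehlo.transpose"(%arg0) {permutation = array<i64: 1, 0>}
    : (tensor<2x3xf32>) -> tensor<3x2xf32>

// Newer pretty-printed form: the permutation reads inline.
%1 = stablehlo.transpose %arg0, dims = [1, 0]
    : (tensor<2x3xf32>) -> tensor<3x2xf32>
```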

Files changed

  • third_party/stablehlo/temporary.patch
  • third_party/stablehlo/workspace.bzl
2025-03-06T20:24:34 See commit

This commit reverses a previous change made to the VhloToVersion transformation as part of the StableHLO integration, specifically from the changes introduced in cl/733942948. The modifications involve a patch to the VhloToVersion.cpp file, where several lines of code related to the legality of locations in the context of version compatibility have been removed. This includes the removal of a function that assessed whether a Location object was legal for a target version, particularly focusing on FileLineColRange locations, which were deemed a forward incompatibility.

By reverting these changes, the commit simplifies the version compatibility checks for operations and attributes within the StableHLO framework. The removed code aimed to ensure that certain location types were compatible with specific versions, but the decision to revert suggests a shift in approach, potentially indicating that the previous implementation was either unnecessary or problematic in the context of the overall system functionality.

Files changed

  • third_party/stablehlo/temporary.patch
2025-03-06T21:32:09 See commit

The commit implements the PjRtClient::Compile function within the TensorFlow Runtime (TFRT) for GPU support. This enhancement is part of the ongoing development to improve the integration of the XLA (Accelerated Linear Algebra) compiler with the TFRT framework, specifically targeting GPU operations.

The changes include modifications to the BUILD file, as well as updates to the tfrt_gpu_client.cc and tfrt_gpu_client.h files, which add the implementation and interface for the new compile functionality. This commit aims to optimize GPU resource utilization and enhance performance for TensorFlow applications leveraging the XLA compiler.

Files changed

  • third_party/xla/xla/pjrt/gpu/tfrt/BUILD
  • third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_client.cc
  • third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_client.h
2025-03-06T23:29:49 See commit

This commit addresses an issue in the XLA (Accelerated Linear Algebra) library where instructions inadvertently referenced computations across different modules. Specifically, it removes a call to HloComputation::set_parent() in the instruction cloning process, which was creating temporary cross-module dependencies. This change is motivated by a desire to maintain the integrity of module boundaries, as the cloned instruction has not been fully integrated into a computation and thus should not have a parent. The removal of this call also resolves several downstream test failures that arose from these improper references.

In addition to the main change, the commit includes a refactoring of how computations are handled within instructions. A new helper function, HloInstruction::set_called_computation(), is introduced to ensure that computations are set more consistently. This refactoring aims to enhance code clarity and maintainability while enforcing the invariant that called computations should belong to the same module as their parent instruction. Overall, the commit strengthens the module encapsulation within the XLA framework and improves the robustness of the codebase.
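The invariant being enforced — a called computation must live in the same module as the instruction that calls it — can be sketched with a toy model. This is an illustrative Python sketch, not XLA's C++ classes; all names and structure here are simplified stand-ins.

```python
# Toy model of the same-module invariant from the commit; not XLA's real API.

class Module:
    def __init__(self, name):
        self.name = name

class Computation:
    def __init__(self, name, module):
        self.name = name
        self.module = module

class Instruction:
    """Toy instruction whose setter enforces the same-module invariant."""
    def __init__(self, name, module):
        self.name = name
        self.module = module
        self.called_computations = []

    def set_called_computation(self, computation):
        # A called computation must belong to the same module as the
        # instruction that calls it; cross-module references are rejected.
        if computation.module is not self.module:
            raise ValueError(
                f"{computation.name} is in module {computation.module.name}, "
                f"but {self.name} is in module {self.module.name}")
        self.called_computations.append(computation)

m1, m2 = Module("m1"), Module("m2")
call = Instruction("call", m1)
call.set_called_computation(Computation("body", m1))  # same module: accepted

try:
    call.set_called_computation(Computation("other", m2))  # cross-module
    crossed = True
except ValueError:
    crossed = False
print(crossed)  # the cross-module reference was rejected
```

Centralizing the check in one setter, rather than scattering parent assignments around the cloning code, is the maintainability point the refactoring makes.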

Files changed

  • third_party/xla/xla/hlo/evaluator/hlo_evaluator.cc
  • third_party/xla/xla/hlo/experimental/auto_sharding/auto_sharding.cc
  • third_party/xla/xla/hlo/ir/hlo_computation.cc
  • third_party/xla/xla/hlo/ir/hlo_instruction.cc
  • third_party/xla/xla/hlo/ir/hlo_instruction.h
  • third_party/xla/xla/service/collective_pipeliner.cc
  • third_party/xla/xla/service/gpu/gpu_p2p_pipeliner_test.cc
  • third_party/xla/xla/service/gpu/transforms/gemm_rewriter.cc
2025-03-06T23:56:19 See commit

This commit introduces a new presubmit workflow for benchmarking performance to track potential regressions in the XLA project. The workflow is defined in a new YAML file (benchmark_presubmit.yml) and is designed to run tests on various configurations, including different CPU architectures and vCPU counts. It utilizes a matrix strategy to ensure that tests are executed across multiple environments, and it includes steps for checking out the OpenXLA repository and running specific benchmark tests. This new workflow aims to enhance the continuous integration process by providing early feedback on performance issues before code changes are merged.

Additionally, existing benchmark workflows have been renamed to clearly differentiate between nightly benchmarks and the newly created presubmit benchmarks. The changes include updates to the cpu_benchmarks_nightly.yml and gpu_benchmarks_nightly.yml files, which now reflect the new naming convention. Overall, this commit enhances the project's ability to monitor performance effectively while maintaining clarity in the workflow structure.
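A matrix-based presubmit workflow of the kind described might look like the fragment below. This is an illustrative GitHub Actions sketch, not the actual benchmark_presubmit.yml: the job name, runner labels, and run command are assumptions (the build.py entry point is taken from the changed-files list).

```yaml
# Illustrative sketch only -- not the real benchmark_presubmit.yml.
# Runner labels below are hypothetical.
name: Benchmark Presubmit
on:
  pull_request:

jobs:
  benchmarks:
    strategy:
      matrix:
        include:
          - runner: linux-x86-n2-16    # x86, 16 vCPUs (hypothetical label)
          - runner: linux-arm64-c4a-8  # Arm, 8 vCPUs (hypothetical label)
    runs-on: ${{ matrix.runner }}
    steps:
      - name: Checkout OpenXLA
        uses: actions/checkout@v4
      - name: Run benchmark tests
        run: python build_tools/ci/build.py  # hypothetical invocation
```

The matrix strategy expands one job definition into one run per entry, which is how a single workflow covers multiple CPU architectures and vCPU counts.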

Files changed

  • third_party/xla/.github/workflows/benchmark_presubmit.yml
  • third_party/xla/.github/workflows/cpu_benchmarks_nightly.yml
  • third_party/xla/.github/workflows/gpu_benchmarks_nightly.yml
  • third_party/xla/build_tools/ci/build.py
  • third_party/xla/build_tools/ci/golden_commands.txt
2025-03-07T00:03:45 See commit

The commit associated with PR #88221 introduces new operation builders for Qualcomm's AI Engine Direct, enhancing support for first-party models. The newly added operations include Conv2d, DepthwiseConv2d, AveragePool, MaxPool, DepthToSpace, SpaceToDepth, HardSwish, LeakyRelu, and ResizeBilinear. Alongside these additions, the commit also provides unit tests for the new operation builders and updates the existing qnn_compiler_plugin_test to incorporate these changes, ensuring robust testing of the new functionalities.

The testing phase for the changes was successful, with all 115 tests from the qnn_compiler_plugin_test and 22 tests from the litert_options_test passing without any issues. The commit includes modifications and additions to various source files, specifically targeting the Qualcomm vendor directory within the TensorFlow Lite experimental framework. Overall, this update significantly enhances the capabilities of TensorFlow Lite's integration with Qualcomm's AI Engine, facilitating improved performance for machine learning models.
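An op-builder layer like the one added here typically dispatches each framework op to a per-op builder. The sketch below is an illustrative Python model of that registry pattern, not the actual Qualcomm AI Engine Direct code; the dict-based "ops" and "nodes" are stand-ins for the real graph types.

```python
# Illustrative op-builder registry; not the actual QNN compiler plugin code.

OP_BUILDERS = {}

def register_op_builder(op_name):
    """Decorator that maps an op name to its builder function."""
    def wrap(fn):
        OP_BUILDERS[op_name] = fn
        return fn
    return wrap

@register_op_builder("Conv2d")
def build_conv2d(op):
    # A real builder would translate strides, padding, dilation, etc.
    return {"type": "Conv2d", "inputs": op["inputs"]}

@register_op_builder("DepthwiseConv2d")
def build_depthwise_conv2d(op):
    return {"type": "DepthwiseConv2d", "inputs": op["inputs"]}

def compose_graph(ops):
    """Translate each op via its registered builder; fail on unsupported ops."""
    nodes = []
    for op in ops:
        builder = OP_BUILDERS.get(op["type"])
        if builder is None:
            raise NotImplementedError(f"no builder for {op['type']}")
        nodes.append(builder(op))
    return nodes

graph = compose_graph([
    {"type": "Conv2d", "inputs": ["x", "w"]},
    {"type": "DepthwiseConv2d", "inputs": ["y", "w2"]},
])
print(len(graph))  # 2
```

Adding support for a new op (as this commit does for nine ops) then reduces to registering one more builder plus its unit test, without touching the dispatch loop.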

Files changed

  • tensorflow/lite/experimental/litert/c/litert_options.cc
  • tensorflow/lite/experimental/litert/c/litert_options.h
  • tensorflow/lite/experimental/litert/c/litert_options_test.cc
  • tensorflow/lite/experimental/litert/tools/dump.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/BUILD
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/qnn_compiler_plugin_test.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/qnn_compose_graph.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/BUILD
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/conv2d_op_builder.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/conv2d_op_builder.h
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/depthwise_conv2d_op_builder.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/depthwise_conv2d_op_builder.h
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/hard_swish_op_builder.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/hard_swish_op_builder.h
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/leaky_relu_op_builder.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/leaky_relu_op_builder.h
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/op_builder.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/op_builder.h
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/pool2d_op_builder.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/pool2d_op_builder.h
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/resize_op_builder.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/resize_op_builder.h
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/spatial_transform_op_builder.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/builders/spatial_transform_op_builder.h
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/wrappers/quantize_params_wrapper.cc
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/wrappers/quantize_params_wrapper.h
  • tensorflow/lite/experimental/litert/vendors/qualcomm/core/wrappers/tensor_wrapper.h
2025-03-07T06:10:10 See commit

The commit introduces the ml_drift_cl_litert feature for the LiteRT GPU Accelerator, enhancing TensorBuffer integration through the addition of the DelegateKernelLiteRt. Key changes include the publication of TensorBufferRequirements under kLiteRtTensorBufferTypeOpenCl, the implementation of the BindTensorBuffers() function for binding TensorBuffers, and a streamlined version of the Invoke() method.

Additionally, modifications were made to the TensorFlow Lite experimental LiteRT codebase, including updates to the BUILD file and test cases in litert_compiled_model_gpu_test.cc. These updates ensure that the code correctly identifies and utilizes OpenCL as the buffer type for input and output tensors, thereby improving the overall functionality and testing of the GPU-accelerated models.

Files changed

  • tensorflow/lite/experimental/litert/cc/BUILD
  • tensorflow/lite/experimental/litert/cc/litert_compiled_model_gpu_test.cc