TensorFlow changelog


Welcome to the latest round of updates! We've been busy bees 🐝, adding some slick new features, squashing pesky bugs, and tidying up the codebase. Here's a rundown of what’s new and improved:

  • New feature: 🎉 We've added support for overriding cross-program prefetch behavior and for filtering buffer intervals based on their usage in XLA's TPU memory space assignment (MSA). These enhancements make memory management more flexible and efficient. Plus, we've included tests to make sure everything runs smoothly.

  • New feature: 🚀 The HLO evaluator now supports explicit batch dimensions for gather and scatter operations. This change reserves the necessary dimensions in all tensors involved in these operations, making them more flexible and robust.

  • Improvement: 🛠️ Introducing the AssertEq wrapper! This nifty tool helps ensure function outputs match expected results, enhancing our assertion framework. We've also improved error checking in the TensorFlow Lite runtime by validating tensor types more reliably.

  • New feature: 🧩 Say hello to HloModuleInterface and HloInstructionInterface! These new interfaces provide a more organized way to manage HLO data and retrieve performance metrics efficiently.

  • New feature: ⚙️ We've added a RuntimeConfig for loading SavedModels, which lets you disable the tf2xla MLIR bridge, part of a broader effort to optimize graph execution.

  • Bugfix: 🐞 Fixed a critical issue in CalculatePostOrderScheduleHelper(), ensuring kAsyncStart instructions are included in scheduling so that ordinal and priority values are initialized correctly. This fix prevents instructions from being processed out of order.

  • New feature: 🔍 The new HloUnaryInstruction class adds support for result accuracy on specific unary functions, enhancing precision in computations.

  • Improvement: 🔧 Enhanced GPU GEMM fusions by allowing effective parameters and their broadcasts to be fused in the epilogues, optimizing performance.

  • New feature: 🎛️ A new ToolParam for the XNNPACK TFLite delegate lets you easily toggle the Slinky optimizer via command-line flags, giving you more control over performance tuning.

  • Bugfix: 🛡️ Addressed a crucial issue in the GPU dot algorithm rewriter to handle infinity and NaN values correctly, ensuring accurate results in BF16 operations.

  • Bugfix: 🔧 Fixed the AlgebraicSimplifier to ensure it doesn't eliminate host offloading copies, maintaining the integrity of host memory operations.

  • Chore: 🧹 We've cleaned up by removing an unnecessary gpu_types.h inclusion in topk_kernel_test.cc, streamlining the code and reducing compilation time.

We hope these updates make your experience even better! Keep exploring and enjoy the improvements. 🌟

Included Commits

2024-11-15T03:40:23 See commit

This commit introduces two new interfaces, HloModuleInterface and HloInstructionInterface, to enhance the structure and interaction with HLO (High-Level Optimizer) modules and instructions in TensorFlow. The HloInstructionWrapper and HloModuleWrapper classes are implemented to conform to these interfaces, providing a more organized and extensible way to manage HLO data. The interfaces define essential methods for retrieving various properties of HLO instructions and modules, such as their names, op codes, and performance metrics like FLOPs and bytes accessed.

In addition to the interface implementations, the commit includes significant modifications to the HloModuleWrapper class, enhancing its functionality to gather fusion instructions and manage nested computations more effectively. The changes also improve the organization of the code by introducing helper functions and ensuring that the wrappers cache results for efficient access. Overall, these updates aim to streamline the handling of HLO data structures within TensorFlow's profiling utilities, ultimately contributing to more efficient computation and analysis.
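
The interfaces themselves aren't shown in the summary, but a minimal sketch of the shape they describe might look like this (method names are illustrative assumptions, not the exact TensorFlow signatures):

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch of the two interfaces; the real declarations live in
// tensorflow/core/profiler/utils/hlo_module_map.h and may differ in detail.
class HloInstructionInterface {
 public:
  virtual ~HloInstructionInterface() = default;
  virtual const std::string& Name() const = 0;   // instruction name
  virtual std::string OpcodeString() const = 0;  // e.g. "fusion", "dot"
  virtual int64_t Flops() const = 0;             // cached cost-model FLOPs
  virtual int64_t BytesAccessed() const = 0;     // cached memory traffic
};

class HloModuleInterface {
 public:
  virtual ~HloModuleInterface() = default;
  virtual const std::string& Name() const = 0;  // module name
};
```

The wrapper classes mentioned above would then implement these interfaces and cache metric values so repeated profiling queries stay cheap.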

Files changed

  • tensorflow/core/profiler/utils/BUILD
  • tensorflow/core/profiler/utils/hlo_module_map.cc
  • tensorflow/core/profiler/utils/hlo_module_map.h
2024-11-15T05:09:13 See commit

This commit addresses an issue in the AlgebraicSimplifier component of the XLA (Accelerated Linear Algebra) library, ensuring that it does not eliminate copies related to host offloading. The changes include modifications to the algebraic_simplifier.cc file, where new checks were added to prevent the simplifier from processing copies that involve synchronous transfers to or from the host. This is crucial for maintaining the integrity of computations that rely on host memory operations.

In addition to the main code changes, the commit also updates the host_offload_utils.cc file to refine the logic for identifying synchronous copies involving host memory. A new test case was added to verify that the simplifier correctly retains these host offloading copies during its operations. Overall, this commit enhances the functionality of the AlgebraicSimplifier by ensuring that it respects the requirements of host memory operations in the XLA framework.
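
As a rough illustration of the guard described above, consider a standalone sketch (the types and the IsSyncHostTransfer helper are stand-ins, not the actual XLA API):

```cpp
// Stand-in instruction type for illustration; the real pass inspects XLA's
// HloInstruction and its memory-space metadata.
struct HloInstructionStub {
  bool is_sync_host_transfer = false;  // hypothetical flag
};

// Hypothetical helper mirroring the host_offload_utils check for synchronous
// copies to or from host memory.
bool IsSyncHostTransfer(const HloInstructionStub& copy) {
  return copy.is_sync_host_transfer;
}

// Sketch of the added guard: host-offloading copies are skipped rather than
// elided, preserving the host transfer.
bool TrySimplifyCopy(const HloInstructionStub& copy) {
  if (IsSyncHostTransfer(copy)) {
    return false;  // leave the copy in place
  }
  // ... normal copy-elision logic would run here ...
  return true;
}
```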

Files changed

  • third_party/xla/xla/hlo/transforms/BUILD
  • third_party/xla/xla/hlo/transforms/simplifiers/algebraic_simplifier.cc
  • third_party/xla/xla/hlo/transforms/simplifiers/algebraic_simplifier_test.cc
  • third_party/xla/xla/service/host_offload_utils.cc
2024-11-15T20:15:02 See commit

This commit introduces enhancements to the GPU GEMM fusions in the XLA (Accelerated Linear Algebra) library by allowing effective parameters and their corresponding broadcasts to be fused in the epilogues. The key addition is a new function, IsEffectiveParameter, which identifies both plain parameters and parameters reached through no-op operations such as bitcasts or tuple-element accesses. This change aims to optimize the performance of GEMM operations by improving the handling of parameter broadcasts during fusion.

Additionally, the commit includes modifications to various files, such as adding tests to verify that broadcasts of effective parameters are correctly fused as inputs in the epilogue. The updates also refine the criteria for accepting inputs in output fusion, now explicitly allowing effective parameters and their broadcasts, alongside scalar broadcasts. Overall, this enhancement is expected to improve the efficiency of GEMM computations on GPU by leveraging better fusion strategies.
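
To make the IsEffectiveParameter idea concrete, here is a toy version over a stand-in instruction type (the real function walks XLA's HloInstruction graph and checks HloOpcode values):

```cpp
#include <string>

// Simplified instruction node for illustration only.
struct Instr {
  std::string opcode;              // e.g. "parameter", "bitcast", "get-tuple-element"
  const Instr* operand = nullptr;  // single operand for the no-op case
};

// An instruction is an "effective parameter" if it is a parameter, or a
// no-op (bitcast / get-tuple-element) whose chain bottoms out at a parameter.
bool IsEffectiveParameter(const Instr& instr) {
  if (instr.opcode == "parameter") return true;
  if ((instr.opcode == "bitcast" || instr.opcode == "get-tuple-element") &&
      instr.operand != nullptr) {
    return IsEffectiveParameter(*instr.operand);
  }
  return false;
}
```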

Files changed

  • third_party/xla/xla/hlo/utils/hlo_query.cc
  • third_party/xla/xla/hlo/utils/hlo_query.h
  • third_party/xla/xla/service/gpu/transforms/gemm_fusion_test.cc
  • third_party/xla/xla/service/gpu/triton_tiling_propagation.cc
2024-11-15T20:34:02 See commit

This commit introduces a new ToolParam to the XNNPACK TensorFlow Lite delegate, enabling users to easily activate the Slinky optimizer through the existing command-line interface. With this addition, users can now specify the --xnnpack_slinky=true|false flag in conjunction with the existing --use_xnnpack=true|false flag, allowing for more granular control over how XNNPACK operates. Importantly, the Slinky flag will be disregarded if XNNPACK is compiled without Slinky support.

The changes involve modifications across several files, including the addition of the new parameter in the BenchmarkPerformanceOptions and updates in the XnnpackDelegateProvider to handle the new flag appropriately. The commit also includes logging enhancements to provide feedback on the Slinky and FP16 settings when the XNNPACK delegate is created, ensuring users are informed of the configurations being applied. Overall, this enhancement streamlines the process of utilizing the Slinky optimizer within the XNNPACK framework, improving flexibility for performance tuning.
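
For example, assuming the standard TFLite benchmark binary, an invocation might look like `benchmark_model --graph=model.tflite --use_xnnpack=true --xnnpack_slinky=true`; in builds without Slinky support, the `--xnnpack_slinky` flag simply has no effect.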

Files changed

  • tensorflow/lite/tools/benchmark/benchmark_performance_options.cc
  • tensorflow/lite/tools/delegates/BUILD
  • tensorflow/lite/tools/delegates/xnnpack_delegate_provider.cc
  • tensorflow/lite/tools/evaluation/BUILD
  • tensorflow/lite/tools/evaluation/utils.cc
2024-11-15T20:34:11 See commit

The commit addresses a critical issue in the CalculatePostOrderScheduleHelper() function within the XLA (Accelerated Linear Algebra) library, specifically regarding the handling of kAsyncStart instructions. Previously, this instruction was not included in the scheduling process, leading to incorrect initialization of ordinal and priority values for certain instructions. Consequently, this oversight could disrupt the functionality of the priority-queue-based worklist, resulting in instructions being processed out of order.

To rectify this, the commit modifies the code to include kAsyncStart in the relevant section of the scheduling helper function. Additionally, it introduces a new test case to ensure that the data flow analysis correctly handles asynchronous calls alongside conditionals. The changes involve adding several lines of code to enhance the scheduling logic and improve the accuracy of the worklist processing, thereby maintaining the integrity of the computation order in the XLA framework.
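
A minimal sketch of the fix's shape (the opcode names follow XLA's HloOpcode, but the helper below is hypothetical):

```cpp
// Call-like opcodes whose subcomputations must be visited so that every
// nested instruction receives an ordinal and priority before the
// priority-queue worklist runs.
enum class Opcode { kCall, kConditional, kWhile, kAsyncStart, kOther };

bool EntersSubcomputation(Opcode op) {
  switch (op) {
    case Opcode::kCall:
    case Opcode::kConditional:
    case Opcode::kWhile:
    case Opcode::kAsyncStart:  // previously missing: async computations were
                               // skipped, leaving ordinals uninitialized
      return true;
    default:
      return false;
  }
}
```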

Files changed

  • third_party/xla/xla/hlo/analysis/hlo_dataflow_analysis.cc
  • third_party/xla/xla/hlo/analysis/hlo_dataflow_analysis_test.cc
2024-11-16T01:21:30 See commit

This commit introduces a new RuntimeConfig feature designed to enhance the loading of SavedModels in TensorFlow by allowing the disabling of the tf2xla MLIR bridge. The configuration for the MLIR bridge is defined in a newly added mlir_bridge_config_v1.proto file, which is part of the broader effort to optimize graph execution and improve performance within the TensorFlow ecosystem.

Several files across the TensorFlow codebase have been modified to implement this feature, including updates to various headers and source files related to graph execution, optimization, and the handling of SavedModels. The changes aim to streamline the integration of the new configuration into existing components, ensuring that the system can effectively manage the execution of models without relying on the MLIR bridge when specified.
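
While the exact proto fields aren't spelled out here, a stand-in sketch of the decision point might look like this (all names hypothetical):

```cpp
// Stand-in config types; the real definitions live in the new
// mlir_bridge_config_v1.proto and the TFRT RuntimeConfig plumbing.
struct MlirBridgeConfig {
  bool enable_tf2xla_mlir_bridge = true;  // hypothetical field
};

struct RuntimeConfig {
  MlirBridgeConfig mlir_bridge;
};

// During SavedModel loading, the executor consults the config and skips the
// tf2xla MLIR passes when the bridge is disabled.
bool ShouldRunTf2XlaBridge(const RuntimeConfig& config) {
  return config.mlir_bridge.enable_tf2xla_mlir_bridge;
}
```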

Files changed

  • tensorflow/compiler/mlir/BUILD
  • tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc
  • tensorflow/compiler/mlir/tf2xla/api/v1/BUILD
  • tensorflow/compiler/mlir/tf2xla/api/v1/mlir_bridge_config_v1.proto
  • tensorflow/core/common_runtime/graph_execution_state.cc
  • tensorflow/core/common_runtime/graph_execution_state.h
  • tensorflow/core/common_runtime/optimization_registry.h
  • tensorflow/core/tfrt/fallback/fallback_state.cc
  • tensorflow/core/tfrt/fallback/fallback_state.h
  • tensorflow/core/tfrt/fallback/fallback_state_test.cc
  • tensorflow/core/tfrt/graph_executor/BUILD
  • tensorflow/core/tfrt/graph_executor/graph_executor.cc
  • tensorflow/core/tfrt/graph_executor/graph_executor.h
  • tensorflow/core/tfrt/graph_executor/graph_executor_test.cc
  • tensorflow/core/tfrt/saved_model/BUILD
  • tensorflow/core/tfrt/saved_model/saved_model.cc
  • tensorflow/core/tfrt/saved_model/saved_model_import_input.cc
  • tensorflow/core/tfrt/saved_model/saved_model_import_input.h
  • tensorflow/core/tfrt/saved_model/saved_model_util.cc
  • tensorflow/core/tfrt/saved_model/saved_model_util.h
  • tensorflow/core/tfrt/utils/BUILD
  • tensorflow/core/tfrt/utils/tfrt_graph_execution_state.cc
  • tensorflow/core/tfrt/utils/tfrt_graph_execution_state.h
  • tensorflow/core/tfrt/utils/tfrt_graph_execution_state_test.cc
2024-11-16T06:24:59 See commit

This commit introduces the HloUnaryInstruction class to enhance support for result accuracy in specific unary functions within the XLA (Accelerated Linear Algebra) library. The modifications span several files, including updates to the hlo_instruction.* and hlo_instructions.* files that define the new instruction and its properties. Additionally, changes have been made to the parser and related test files to ensure that the new functionality is properly integrated and validated.

The updates also include adjustments in the service files and protocol buffer definitions, indicating a comprehensive enhancement of the unary instruction handling in XLA. Overall, this commit aims to improve the precision of certain unary operations, which could be beneficial for applications requiring high accuracy in their computations.
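
As a rough sketch of what carrying a result-accuracy request on a unary op could look like (field and method names are assumptions, not the exact XLA API):

```cpp
#include <cstdint>

// Tolerances a caller can request for a unary op such as exp or log; the
// real spec is defined alongside xla/service/hlo.proto.
struct ResultAccuracy {
  double atol = 0.0;  // absolute tolerance
  double rtol = 0.0;  // relative tolerance
  int64_t ulps = 0;   // units in the last place
};

// Sketch of a unary instruction that records the requested accuracy so
// backends can choose between faster and more precise implementations.
class HloUnaryInstruction /* : public HloInstruction */ {
 public:
  explicit HloUnaryInstruction(ResultAccuracy accuracy)
      : result_accuracy_(accuracy) {}
  const ResultAccuracy& result_accuracy() const { return result_accuracy_; }

 private:
  ResultAccuracy result_accuracy_;
};
```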

Files changed

  • third_party/xla/xla/hlo/ir/hlo_instruction.cc
  • third_party/xla/xla/hlo/ir/hlo_instruction.h
  • third_party/xla/xla/hlo/ir/hlo_instructions.cc
  • third_party/xla/xla/hlo/ir/hlo_instructions.h
  • third_party/xla/xla/hlo/parser/BUILD
  • third_party/xla/xla/hlo/parser/hlo_parser.cc
  • third_party/xla/xla/hlo/parser/hlo_parser_test.cc
  • third_party/xla/xla/service/BUILD
  • third_party/xla/xla/service/hlo.proto
  • third_party/xla/xla/service/hlo_instruction_test.cc
2024-11-18T15:57:04 See commit

This commit addresses a critical issue in the dot algorithm rewriter within the XLA GPU framework, specifically regarding the handling of infinity and NaN (Not a Number) values. The previous implementation did not correctly manage these cases, leading to incorrect results in dot operations involving BF16 data types when rewritten for cuBLAS. To resolve this, the commit introduces a masking mechanism to filter out NaN values, ensuring that computations involving infinity yield the expected results.

Additionally, the commit enhances the testing framework by comparing the outputs of the modified algorithms against the standard dot operation, which does not specify a particular algorithm. An example scenario illustrates the previous failure, where operations involving infinity and NaN resulted in NaN instead of the expected infinity. The changes span multiple files, including updates to build configurations and test cases, ensuring that the new logic is effectively integrated and validated within the overall system.
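
The failure mode is easy to reproduce outside XLA. Splitting x into a bf16-representable high part and a residual low part, as the multi-pass BF16 algorithms do, turns an infinite input into a NaN residual, which the mask then has to clear; a self-contained sketch (helper name illustrative):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Keep only the top bf16 mantissa bits of a float, mimicking the hi/lo split
// used by multi-pass BF16 dot algorithms.
float TruncateToBf16(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0xFFFF0000u;  // zero the low mantissa bits
  float hi;
  std::memcpy(&hi, &bits, sizeof(hi));
  return hi;
}

int main() {
  float x = INFINITY;
  float hi = TruncateToBf16(x);  // inf
  float lo = x - hi;             // inf - inf = NaN: the poisoned residual
  std::printf("lo = %f\n", lo);  // nan

  // The rewriter's masking step, in spirit: clear NaN residuals so the
  // overall result stays inf instead of becoming NaN.
  if (std::isnan(lo)) lo = 0.0f;
  std::printf("hi + lo = %f\n", hi + lo);  // inf, as expected
  return 0;
}
```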

Files changed

  • third_party/xla/xla/service/gpu/fusions/triton/BUILD
  • third_party/xla/xla/service/gpu/fusions/triton/dot_algorithms_test.cc
  • third_party/xla/xla/service/gpu/transforms/BUILD
  • third_party/xla/xla/service/gpu/transforms/dot_algorithm_rewriter.cc
  • third_party/xla/xla/tests/new_hlo_test_base.cc
  • third_party/xla/xla/tests/new_hlo_test_base.h
2024-11-18T19:39:10 See commit

This commit introduces support for explicit batch dimensions in gather and scatter operations within the HLO (High-Level Optimizer) evaluator. By adding explicit batch dimensions, the code reserves necessary dimensions in all tensors involved in gather/scatter operations, enhancing the functionality and flexibility of these operations. The changes include modifications to several files, notably the HLO evaluator implementation and its associated tests, to accommodate the new batching feature.

In detail, the commit modifies the HLO evaluator's logic to handle explicit batch dimensions by updating how input and output indices are mapped during gather and scatter operations. It introduces new methods to propagate these dimensions correctly, ensuring that the evaluation process respects the specified batch dimensions. Additionally, the commit adds corresponding tests to verify the correctness of the implementation, ensuring that the new functionality behaves as expected in various scenarios.
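
In spirit, the new index propagation copies batch coordinates straight from the output index into the operand index; a toy version (dimension-number layout simplified relative to XLA's GatherDimensionNumbers):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// For each explicit batch dimension, the operand coordinate comes directly
// from the matching output coordinate rather than from start_indices.
std::vector<int64_t> ApplyBatchDims(
    const std::vector<int64_t>& output_index,
    const std::vector<int64_t>& operand_batching_dims,
    const std::vector<int64_t>& output_batching_dims,
    int64_t operand_rank) {
  std::vector<int64_t> operand_index(operand_rank, 0);
  for (size_t i = 0; i < operand_batching_dims.size(); ++i) {
    operand_index[operand_batching_dims[i]] =
        output_index[output_batching_dims[i]];
  }
  return operand_index;
}

int main() {
  // Batch dim 0 of a rank-3 operand is tied to output dim 0.
  auto idx = ApplyBatchDims({2, 5, 7}, {0}, {0}, 3);
  std::printf("operand index[0] = %lld\n", static_cast<long long>(idx[0]));  // 2
  return 0;
}
```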

Files changed

  • third_party/xla/xla/hlo/evaluator/BUILD
  • third_party/xla/xla/hlo/evaluator/hlo_evaluator.cc
  • third_party/xla/xla/hlo/evaluator/hlo_evaluator_test.cc
2024-11-19T23:58:37 See commit

The recent commit focuses on cleaning up the code in the XLA (Accelerated Linear Algebra) library by removing an unnecessary inclusion of the gpu_types.h header file in the topk_kernel_test.cc file. This change simplifies the dependencies of the test file, making the codebase cleaner and potentially reducing compilation time.

In addition to the removal of the header file, the commit also updates the corresponding BUILD file to reflect this change. Overall, this modification enhances the maintainability of the code by eliminating redundant components without affecting the functionality of the tests.

Files changed

  • third_party/xla/xla/service/gpu/kernels/BUILD
  • third_party/xla/xla/service/gpu/kernels/topk_kernel_test.cc
2024-11-20T22:48:04 See commit

This commit introduces a new utility function called AssertEq, which serves to assert that the return value of a given function matches an expected value. This change enhances the existing assertion framework by allowing for more generalized function pointers, enabling developers to check function outputs against various expected results. The AssertOk function is also updated to utilize AssertEq, maintaining its original purpose of verifying that a function returns a successful status.

In addition, the commit includes modifications to the Tensor class, where AssertEq is employed to validate the type of tensors by comparing the results of the TypeId() function against expected tensor types. This update not only improves the robustness of error checking within the TensorFlow Lite runtime but also ensures that union types are correctly handled in the tensor API. Overall, the changes enhance the clarity and reliability of assertions in the codebase.
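
A minimal sketch of an AssertEq-style wrapper along the lines described (the real helper lives in litert_detail.h; the exact signature and the status type below are assumptions):

```cpp
#include <cstdlib>
#include <utility>

// Assert that invoking fn(args...) yields the expected value; aborts on
// mismatch, mirroring a hard assertion.
template <typename Fn, typename Expected, typename... Args>
void AssertEq(Fn&& fn, const Expected& expected, Args&&... args) {
  auto actual = std::forward<Fn>(fn)(std::forward<Args>(args)...);
  if (!(actual == expected)) {
    std::abort();
  }
}

// AssertOk then becomes a special case: expect the "ok" status.
enum class Status { kOk, kError };  // hypothetical status type
template <typename Fn, typename... Args>
void AssertOk(Fn&& fn, Args&&... args) {
  AssertEq(std::forward<Fn>(fn), Status::kOk, std::forward<Args>(args)...);
}
```

Used this way, a tensor accessor can assert its element type by calling AssertEq against the result of TypeId(), as the summary describes.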

Files changed

  • tensorflow/lite/experimental/litert/cc/litert_detail.h
  • tensorflow/lite/experimental/litert/cc/litert_model.h
2024-11-21T01:15:17 See commit

This commit introduces enhancements to the XLA (Accelerated Linear Algebra) memory space assignment system specifically for TPU (Tensor Processing Unit) support. It allows for the overriding of cross-program prefetch behavior, enabling more flexible memory management across different programs. Additionally, the update includes the capability to filter buffer intervals based on their usage, which can improve efficiency in memory allocation.

To ensure the robustness of these new features, the commit also adds corresponding tests that validate both the overriding of cross-program prefetch behavior and the expanded filtering criteria. Several files related to memory space assignment algorithms and utilities have been modified to implement these changes, ensuring that the system can effectively manage memory resources in a more controlled manner.
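
The summary doesn't name the new knobs, but conceptually they amount to a pair of user-supplied hooks in the MSA options; a stand-in sketch (field names hypothetical):

```cpp
#include <cstdint>
#include <functional>

// Simplified stand-in for an MSA buffer interval.
struct BufferInterval {
  int64_t size = 0;
  int64_t num_uses = 0;  // hypothetical usage statistic
};

// Hypothetical option hooks: one overrides whether a buffer may be
// cross-program prefetched, the other filters which intervals are considered
// at all based on how they are used.
struct MsaOptions {
  std::function<bool(const BufferInterval&)> cross_program_prefetch_override;
  std::function<bool(const BufferInterval&)> buffer_interval_filter;
};
```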

Files changed

  • third_party/xla/xla/service/memory_space_assignment/algorithm.cc
  • third_party/xla/xla/service/memory_space_assignment/buffer_interval_comparator.cc
  • third_party/xla/xla/service/memory_space_assignment/buffer_interval_comparator.h
  • third_party/xla/xla/service/memory_space_assignment/memory_space_assignment.proto
  • third_party/xla/xla/service/memory_space_assignment/memory_space_assignment_test.cc
  • third_party/xla/xla/service/memory_space_assignment/options.h
  • third_party/xla/xla/service/memory_space_assignment/utils.cc
  • third_party/xla/xla/service/memory_space_assignment/utils.h