TensorFlow changelog
Here's the latest scoop on the updates and improvements made to the TensorFlow and XLA frameworks. We've got some exciting new features, important bug fixes, and a sprinkle of organizational tidying. Let's dive in!
New Feature: Dynamic GELU Composite Lowerings
Say hello to dynamic composite lowerings for the GELU operation in TensorFlow's MLIR framework. This update brings two new patterns to the table, LegalizeCompositeGELUDynamicShaped and LegalizeCompositeGELUDynamicShaped2, which handle dynamic input shapes with grace and style. Now, TensorFlow can flexibly manage varying input dimensions, making your machine learning models even more robust!
Improvement: Custom Op for odml.detector
We've waved our magic wand and transformed the odml.detector composite operation into a custom operation within TensorFlow Lite. This makeover streamlines integration and boosts performance by allowing complex operations to be executed as custom operations. A win for flexibility and speed!
New Feature: Explicit Collectives Grouping in JAX
Introducing an explicit collectives grouping pass for jitted JAX methods! This feature ensures computations run within a single NCCL group, optimizing NVLink systems for multi-directional communications. With this addition, expect improved performance and fewer NCCL kernels during execution. Go team efficiency!
Bugfix: Shape Representation Safety
We've tightened the bolts on shape representation by using std::variant<> to ensure a shape holds only one exclusive state at a time. This fix prevents misuse and potential crashes, making your code safer and more reliable. Safety first!
New Feature: Direct StableHLO to HLO Conversion
Get ready for a smoother ride with direct conversion from StableHLO to HLO for AddOp and ConstantOp. This prototype skips the MHLO step, paving the way for more efficient conversion processes in the future. Streamlining for the win!
New Feature: GetDefaultLayout API in IFRT Proxy
Meet the new GetDefaultLayout API method in the IFRT Proxy, your go-to for retrieving default layouts for specified data types. This enhancement optimizes data placement and access patterns, making your computational tasks more efficient. Layouts made easy!
Improvement: Scheduler Statistics in XLA
We've added the ability to dump scheduler statistics into a proto, giving you a detailed breakdown of wasted cycles and memory pressure. This enhancement boosts debugging and performance analysis, helping you optimize your scheduling process. Knowledge is power!
Improvement: CommandBuffer API Update
The CommandBuffer class in the XLA GPU backend now features an explicit command update API for the If command. This update allows for more complex command management and resource optimization. Command and conquer!
New Feature: HLO Test for Command Buffers
Introducing a new end-to-end HLO test for command buffers in the XLA GPU service. This test simplifies the process of verifying complex command buffers, strengthening the testing framework and laying the groundwork for future developments. Testing made easy!
Bugfix: Post-Order Traversal Non-Determinism
We've tackled the non-determinism bug in post-order traversal, ensuring correct instruction ordering by allowing pre-computed post-orders. This fix enhances robustness and prevents potential errors in instruction execution. Order restored!
Bugfix: Determinism in SHARD_AS/SHARD_LIKE
We've addressed non-determinism in SHARD_AS and SHARD_LIKE operations by switching to std::vector for consistent ordering. This fix enhances the reliability of sharding operations, ensuring predictable outputs in parallel computations. Consistency is key!
Chore: Kernel Generation Passes Reorganization
We've tidied up the TensorFlow MLIR codebase by moving kernel generation-specific passes to a dedicated directory. This reorganization improves code clarity and maintainability, paving the way for future enhancements. Organization FTW!
That's a wrap on the latest updates! Keep coding, keep innovating, and as always, stay awesome!
Included Commits
This commit focuses on reorganizing the structure of the TensorFlow MLIR (Multi-Level Intermediate Representation) codebase by moving kernel generation-specific passes from the MHLO (MLIR HLO) directory to a dedicated kernel_gen directory. The changes include modifications to several files, such as kernel_creator.cc and various transformation passes, which have been renamed and relocated to improve code organization and maintainability.
Additionally, updates were made to the build files and configuration scripts to reflect this new structure. The commit aims to streamline the development process for kernel generation by isolating related components, thereby enhancing clarity and facilitating future enhancements within the TensorFlow MLIR framework.
Files changed
- tensorflow/compiler/mlir/tools/kernel_gen/kernel_creator.cc
- tensorflow/compiler/mlir/tools/kernel_gen/tests/broadcast_propagation.mlir
- tensorflow/compiler/mlir/tools/kernel_gen/tests/merge_assuming_ops.mlir
- tensorflow/compiler/mlir/tools/kernel_gen/tests/shape_simplification.mlir
- tensorflow/compiler/mlir/tools/kernel_gen/transforms/BUILD
- tensorflow/compiler/mlir/tools/kernel_gen/transforms/broadcast_propagation_pass.cc
- tensorflow/compiler/mlir/tools/kernel_gen/transforms/merge_assuming_ops_pass.cc
- tensorflow/compiler/mlir/tools/kernel_gen/transforms/passes.h
- tensorflow/compiler/mlir/tools/kernel_gen/transforms/passes.td
- tensorflow/compiler/mlir/tools/kernel_gen/transforms/shape_simplification_pass.cc
- third_party/xla/xla/mlir_hlo/BUILD
- third_party/xla/xla/mlir_hlo/mhlo/transforms/CMakeLists.txt
- third_party/xla/xla/mlir_hlo/mhlo/transforms/mhlo_passes.td
- third_party/xla/xla/mlir_hlo/mhlo/transforms/passes.h
- third_party/xla/xla/mlir_hlo/mhlo/transforms/rewriters.h
This commit introduces dynamic composite lowerings for the Gaussian Error Linear Unit (GELU) operation in TensorFlow's MLIR (Multi-Level Intermediate Representation) framework. Specifically, it modifies the composite_lowering_patterns.td file by adding two new patterns, LegalizeCompositeGELUDynamicShaped and LegalizeCompositeGELUDynamicShaped2. These patterns are designed to handle the dynamic shapes of inputs when lowering the composite GELU operation, allowing for more flexibility in the handling of varying input dimensions.
The new patterns utilize the MHLO_CompositeOp to define how the GELU operation should be transformed based on the attributes provided. They ensure that the transformation correctly interprets the approximation attribute associated with the GELU operation, facilitating its integration into TensorFlow's execution model for dynamic input shapes. Overall, these changes enhance the capability of TensorFlow's MLIR to support dynamic tensor operations, improving performance and usability in machine learning applications.
Files changed
- tensorflow/compiler/mlir/lite/stablehlo/transforms/composite_lowering_patterns.td
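For context, here is a minimal Python sketch of the math these patterns lower: the GELU activation in both its exact form and the tanh approximation that the op's approximation attribute selects between. The function name and signature are illustrative, not part of the commit.

```python
import math

def gelu(x: float, approximate: bool = False) -> float:
    """GELU activation. `approximate=True` selects the tanh approximation
    that the composite op's approximation attribute refers to."""
    if approximate:
        # tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
        inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
        return 0.5 * x * (1.0 + math.tanh(inner))
    # exact form: x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

The two forms agree closely for typical activations, which is why the lowering only needs to thread the attribute through rather than change the surrounding graph.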
This commit introduces a prototype for direct conversion from StableHLO to HLO, specifically implementing the AddOp and ConstantOp. The changes include modifications to the conversion pipeline, allowing these StableHLO operations to be converted directly to HLO without the intermediate MHLO step. The commit provides a demonstration of this functionality with a code example, where the StableHLO operations are preserved during the conversion process. However, it is important to note that this direct conversion path is currently disabled in production until all StableHLO operations are integrated into the code generation.
The modifications encompass updates to various files, including the addition of new dependencies, changes to the code generation logic, and the introduction of a new set of legal operations for direct conversion. The commit also includes new test cases to validate the conversion functionality. Overall, this work lays the groundwork for more efficient conversion processes in the future by eliminating unnecessary intermediate steps.
Files changed
- third_party/xla/xla/hlo/translate/mhlo_to_hlo/BUILD
- third_party/xla/xla/hlo/translate/mhlo_to_hlo/gen_hlo_op_writer.cc
- third_party/xla/xla/hlo/translate/mhlo_to_hlo/gen_hlo_op_writer.td
- third_party/xla/xla/hlo/translate/mhlo_to_hlo/mlir_hlo_to_hlo.cc
- third_party/xla/xla/hlo/translate/tests/BUILD
- third_party/xla/xla/hlo/translate/tests/stablehlo.mlir
- third_party/xla/xla/mlir_hlo/mhlo/transforms/stablehlo_legalize_to_hlo/stablehlo_legalize_to_hlo_pass.cc
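The shape of the change can be sketched as a registry of ops that are legal for the direct path, with everything else falling back through the intermediate dialect. This is a hypothetical illustration; the op-name strings and function are not from the commit.

```python
# Hypothetical sketch: ops registered for direct StableHLO -> HLO conversion
# bypass the intermediate MHLO step; all other ops take the legacy path.
DIRECT_LEGAL_OPS = {"stablehlo.add", "stablehlo.constant"}

def conversion_stages(op_name: str) -> list[str]:
    """Return the dialect stages an op passes through during conversion."""
    if op_name in DIRECT_LEGAL_OPS:
        return [op_name, "hlo"]        # direct path
    return [op_name, "mhlo", "hlo"]    # legacy path via MHLO
```

As the summary notes, the direct path stays disabled in production until every StableHLO op is covered, so the fallback branch remains load-bearing for now.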
This commit addresses a bug related to the non-deterministic nature of post-order traversal in the XLA (Accelerated Linear Algebra) framework, which can lead to incorrect instruction ordering when post-order is recalculated multiple times. To mitigate this issue, the patch introduces a mechanism that allows callers to provide a pre-computed post-order of instructions, reducing unnecessary recomputation and maintaining the correct instruction sequence. The core change involves modifying the HloReachabilityMap::Build function to accept an optional parameter for the post-order instructions.
Additionally, the commit includes a new test case designed to ensure that the definition-use order of instructions remains intact when using the new functionality. This test verifies that the order of operations is preserved in various scenarios involving collective operations, thereby enhancing the robustness of the code and preventing potential errors in instruction execution. The changes affect multiple files in the XLA codebase, with modifications to the reachability map, collective schedule linearizer, and the addition of the new test case.
Files changed
- third_party/xla/xla/hlo/analysis/hlo_reachability.cc
- third_party/xla/xla/hlo/analysis/hlo_reachability.h
- third_party/xla/xla/hlo/transforms/collectives/collectives_schedule_linearizer.cc
- third_party/xla/xla/hlo/transforms/collectives/collectives_schedule_linearizer_test.cc
This commit addresses non-determinism issues in the SHARD_AS and SHARD_LIKE operations within the XLA (Accelerated Linear Algebra) framework. The primary change involves altering the data structures used to represent shard groups from absl::flat_hash_set<HloInstruction*> to std::vector<HloInstruction*>. This modification aims to ensure consistent ordering and behavior when processing sharding instructions, which is crucial for maintaining deterministic outputs in parallel computations.
Additionally, the commit includes updates to various functions and method signatures to accommodate the new vector type for shard groups. The changes enhance the sharding propagation mechanism, allowing for more reliable inference of unspecified dimensions and sharding from groups of instructions. Overall, this fix contributes to the stability and predictability of sharding operations in XLA, which is essential for optimizing performance in machine learning and other computational tasks.
Files changed
- third_party/xla/xla/service/sharding_propagation.cc
- third_party/xla/xla/service/sharding_propagation.h
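The set-to-vector switch boils down to preserving insertion order while still rejecting duplicates. A minimal Python analogue of that pattern (names are illustrative):

```python
def collect_shard_group(instructions: list[str]) -> list[str]:
    """Collect a shard group with stable, insertion-order semantics --
    the effect of moving from absl::flat_hash_set<HloInstruction*>
    to std::vector<HloInstruction*>."""
    seen: set[str] = set()
    group: list[str] = []
    for instr in instructions:
        if instr not in seen:   # a side set still gives O(1) dedup
            seen.add(instr)
            group.append(instr)  # the vector fixes the iteration order
    return group
```

With a hash set keyed on pointers, iteration order can vary run to run; the vector makes the order a function of program order alone, which is what determinism requires.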
The recent commit introduces a new API method, GetDefaultLayout, to the IFRT Proxy within the XLA (Accelerated Linear Algebra) library. This addition involves modifications to several files, including the client and server components of the IFRT Proxy. The GetDefaultLayout method allows users to retrieve the default layout for a specified data type, dimensions, device, and memory kind. The implementation includes creating a request with the necessary parameters, invoking the corresponding RPC method, and handling the response to deserialize the layout.
In addition to the API implementation, the commit also includes updates to the protocol buffer definitions to accommodate the new request and response types for GetDefaultLayout. The backend has been modified to process these requests, ensuring that the necessary logic is in place to handle the retrieval of layout information effectively. Furthermore, tests have been added to validate the functionality of the new API, confirming that it correctly returns the expected layout based on the input parameters. Overall, this commit enhances the capabilities of the IFRT Proxy by providing a mechanism to obtain layout information, which is crucial for optimizing data placement and access patterns in computational tasks.
Files changed
- third_party/xla/xla/python/ifrt_proxy/client/BUILD
- third_party/xla/xla/python/ifrt_proxy/client/client.cc
- third_party/xla/xla/python/ifrt_proxy/client/client.h
- third_party/xla/xla/python/ifrt_proxy/client/rpc_helper.cc
- third_party/xla/xla/python/ifrt_proxy/client/rpc_helper.h
- third_party/xla/xla/python/ifrt_proxy/common/ifrt_service.proto
- third_party/xla/xla/python/ifrt_proxy/server/ifrt_backend.cc
- third_party/xla/xla/python/ifrt_proxy/server/ifrt_backend.h
- third_party/xla/xla/python/ifrt_proxy/server/ifrt_backend_test.cc
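The request/response round trip behind the new RPC can be sketched as follows. The field names and the stand-in layout format are assumptions for illustration, not the real proto definitions.

```python
from dataclasses import dataclass

@dataclass
class GetDefaultLayoutRequest:
    dtype: str
    dims: tuple[int, ...]
    device_id: int
    memory_kind: str

@dataclass
class GetDefaultLayoutResponse:
    serialized_layout: str

def handle_get_default_layout(req: GetDefaultLayoutRequest) -> GetDefaultLayoutResponse:
    # A real backend queries the runtime for the device's preferred layout;
    # here we fabricate a descending minor-to-major order as a stand-in.
    minor_to_major = ",".join(str(d) for d in reversed(range(len(req.dims))))
    return GetDefaultLayoutResponse(serialized_layout=f"{{{minor_to_major}}}")
```

The client side mirrors this: build the request, invoke the RPC, and deserialize the layout from the response, which is exactly the plumbing the commit adds on both ends of the proxy.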
This commit introduces enhancements to the Latency Hiding Scheduler in XLA (Accelerated Linear Algebra) by implementing the capability to dump scheduler statistics into a protocol buffer (proto). The changes include modifications to the existing code that allow for the collection and serialization of various statistics related to the scheduler's performance, such as wasted cycles for different operations (e.g., all-gather, all-reduce, collective broadcasts) and total cycles used. These statistics are now encapsulated in a new SchedulerStatisticsProto message, which is included in the scheduler's output when the corresponding debug option is enabled.
Additionally, the commit refines the existing methods for logging and displaying these statistics, ensuring that they are more structured and easily accessible. The update simplifies the way statistics are calculated and presented, enhancing the overall debugging and performance analysis capabilities of the scheduler. By providing a detailed breakdown of wasted cycles and memory pressure, this commit aims to facilitate better optimization and understanding of the scheduling process within the XLA framework.
Files changed
- third_party/xla/xla/service/latency_hiding_scheduler.cc
- third_party/xla/xla/service/latency_hiding_scheduler.h
- third_party/xla/xla/xla.proto
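A toy version of the aggregation makes the shape of the new statistics concrete. The field names below are assumptions standing in for the SchedulerStatisticsProto fields, which the summary does not enumerate.

```python
def scheduler_statistics(cycles: dict[str, dict[str, int]]) -> dict:
    """Aggregate wasted vs. total cycles per collective op -- a stand-in
    for the data the scheduler now serializes into a proto.
    `cycles` maps op kind -> {"total": ..., "overlapped": ...}."""
    return {
        # cycles not hidden behind compute are "wasted" latency
        "wasted_cycles": {op: c["total"] - c["overlapped"]
                          for op, c in cycles.items()},
        "total_cycles": sum(c["total"] for c in cycles.values()),
    }
```

Dumping such a breakdown per op kind (all-gather, all-reduce, collective broadcast, and so on) is what lets you see at a glance where latency hiding is failing.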
This commit introduces a significant update by converting the existing odml.detector composite operation into a custom operation within the TensorFlow Lite framework. The changes primarily involve modifications to the MLIR (Multi-Level Intermediate Representation) files, where the odml.detector is now represented as a composite operation that utilizes a custom implementation function. The update includes the definition of the @test_odml_detector function, which demonstrates how the new custom operation can be used, along with its corresponding attributes and parameters.
Additionally, the commit enhances the logic for legalizing composite operations by adding support for the odml.detector in the legalize_stablehlo_composite_to_tfl_custom transformation. This not only streamlines the integration of the custom operation within the existing framework but also ensures that the operation can be effectively serialized and utilized in various contexts. Overall, this change aims to improve the flexibility and performance of the TensorFlow Lite framework by allowing more complex operations to be represented and executed as custom operations.
Files changed
- tensorflow/compiler/mlir/lite/stablehlo/tests/legalize-stablehlo-tfl-composite.mlir
- tensorflow/compiler/mlir/lite/stablehlo/transforms/legalize_stablehlo_composite_to_tfl_custom.cc
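The legalization can be pictured as a rewrite that swaps a composite op record for a custom op record, carrying the attributes over as options. This is a schematic Python sketch, not the MLIR pattern itself; the dict keys are invented for illustration.

```python
# Composites with a registered custom-op lowering; the commit adds
# odml.detector to the supported set.
SUPPORTED_COMPOSITES = {"odml.detector"}

def legalize_composite(op: dict) -> dict:
    """Rewrite a composite-op record into a TFLite custom-op record,
    carrying the composite's attributes over as the custom op's options."""
    name = op["composite_name"]
    if name not in SUPPORTED_COMPOSITES:
        return op  # leave unsupported composites untouched
    return {"kind": "tfl.custom",
            "custom_code": name,
            "options": op.get("attributes", {})}
```

Keeping the attributes intact through the rewrite is what allows the custom op to be serialized and executed with the same parameters the composite carried.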
The commit introduces an explicit collectives grouping pass to the OpenXLA project, specifically for jitted JAX methods, which enforces that computations run within a single NCCL (NVIDIA Collective Communications Library) group. This enhancement aims to facilitate multi-directional communications, thereby improving the utilization of NVLink systems. The implementation leverages existing NCCLGroupThunk logic during the Intermediate Representation (IR) emitter stage, requiring the introduction of an asynchronous wrapper and an inliner to manage the new functionality.
The new feature is exemplified with a JAX code snippet that demonstrates how to use the explicit collectives grouping by annotating a jitted function to ensure it executes within a NCCL group. The commit also includes various modifications across multiple files, such as adding new source files for the explicit collectives group async wrapper, updating build configurations, and implementing tests to verify the correct functionality of the new feature. This change is expected to optimize performance by reducing the number of NCCL kernels launched during execution, as evidenced by the provided performance traces comparing annotated and non-annotated executions.
Files changed
- third_party/xla/xla/service/BUILD
- third_party/xla/xla/service/gpu/BUILD
- third_party/xla/xla/service/gpu/gpu_compiler.cc
- third_party/xla/xla/service/gpu/transforms/BUILD
- third_party/xla/xla/service/gpu/transforms/explicit_collectives_group_async_wrapper.cc
- third_party/xla/xla/service/gpu/transforms/explicit_collectives_group_async_wrapper.h
- third_party/xla/xla/service/gpu/transforms/explicit_collectives_group_async_wrapper_test.cc
- third_party/xla/xla/service/hlo_verifier.cc
- third_party/xla/xla/side_effect_util.cc
- third_party/xla/xla/side_effect_util.h
- third_party/xla/xla/tests/nccl_group_execution_test.cc
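The grouping idea itself, independent of the HLO machinery, is small: batch consecutive collectives so they can be issued inside one NCCL group call instead of one kernel launch each. The sketch below is a schematic illustration only; the annotation API the commit adds to JAX is not reproduced here since its exact spelling is not given in the summary.

```python
def group_collectives(ops: list[str]) -> list[list[str]]:
    """Batch consecutive collective ops into one group so they can be
    launched inside a single NCCL group; other ops stay singleton."""
    COLLECTIVES = {"all-gather", "all-reduce", "all-to-all"}
    groups: list[list[str]] = []
    for op in ops:
        if op in COLLECTIVES and groups and groups[-1][0] in COLLECTIVES:
            groups[-1].append(op)   # extend the current collective group
        else:
            groups.append([op])     # start a new group
    return groups
```

Fewer groups means fewer NCCL kernel launches, which is the performance effect the commit's traces demonstrate on NVLink systems.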
The recent commit in the XLA GPU backend introduces significant changes to the CommandBuffer class, specifically transitioning the If command to utilize an explicit command update API. This modification enhances the functionality of the If command by allowing it to accept additional parameters, including a span of dependencies, which facilitates more complex command management. The changes involve updating the method signatures in the command_buffer.h and gpu_command_buffer.h files, along with adjustments in the implementation of the If command in the gpu_command_buffer.cc file.
Additionally, the commit includes modifications to the associated test cases to reflect these changes, ensuring that the new API is correctly integrated and tested. The overall impact of this commit is to improve the flexibility and robustness of command execution on the GPU, allowing for more sophisticated conditional operations and better resource management during command buffer updates.
Files changed
- third_party/xla/xla/backends/gpu/runtime/command_buffer_cmd.cc
- third_party/xla/xla/stream_executor/command_buffer.h
- third_party/xla/xla/stream_executor/gpu/gpu_command_buffer.cc
- third_party/xla/xla/stream_executor/gpu/gpu_command_buffer.h
- third_party/xla/xla/stream_executor/gpu/gpu_command_buffer_test.cc
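The essence of an explicit update API is that a recorded command can be mutated in place through a handle instead of re-recording the buffer. The Python sketch below illustrates that pattern under assumed names; it is not the CommandBuffer C++ interface.

```python
from dataclasses import dataclass, field

@dataclass
class IfCommand:
    predicate: bool
    dependencies: list[int] = field(default_factory=list)  # handle span

class CommandBuffer:
    def __init__(self) -> None:
        self.commands: list[IfCommand] = []

    def create_if(self, predicate: bool, dependencies: list[int]) -> int:
        """Record an If command with its dependencies; return its handle."""
        self.commands.append(IfCommand(predicate, list(dependencies)))
        return len(self.commands) - 1

    def update_if(self, handle: int, predicate: bool) -> None:
        """Explicitly update a recorded If command in place, rather than
        rebuilding the whole buffer."""
        self.commands[handle].predicate = predicate
```

Passing the dependency span at creation time is what enables the richer command management the summary describes: the runtime knows which commands must complete before the conditional executes.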
The commit introduces a new end-to-end High-Level Operation (HLO) test specifically designed for command buffers within the XLA (Accelerated Linear Algebra) GPU service. This test aims to simplify the process of verifying non-trivial command buffers by explicitly incorporating them into the HLO. The addition includes a new test file, command_buffer_test.cc, which defines a test class and a specific test case that validates the execution of HLO operations involving fusions and command buffers.
In the command_buffer_test.cc, the test constructs an HLO module that includes several operations, such as addition and multiplication, encapsulated within a command buffer. The test verifies that the output of the module matches the expected results when executed with a given input. This enhancement not only strengthens the testing framework for command buffers but also lays the groundwork for further development and debugging within the XLA GPU service.
Files changed
- third_party/xla/xla/service/gpu/tests/BUILD
- third_party/xla/xla/service/gpu/tests/command_buffer_test.cc
This commit addresses a critical issue in the representation of shapes within the codebase, which are defined as mutually exclusive cases: invalid shape, token, opaque, array, or tuple. Previously, the system did not enforce this exclusivity, leading to potential bugs where a shape could be incorrectly accessed as a different case, such as trying to access dimension fields of a tuple shape. To enhance safety, the Shape class has been modified to utilize std::variant<>, ensuring that it can only hold the state of one case at a time, thus preventing misuse that could lead to crashes.
The changes introduced are conservative due to the identification of numerous existing bugs; in situations where a caller violates the precondition of a Shape method, the code now opts to return a default value instead of crashing outright. For example, if dimensions() is called on a non-array shape, it will return an empty span. The commit indicates a commitment to tightening enforcement of these preconditions in future updates, aiming to improve the robustness of shape handling in the system.
Files changed
- third_party/xla/xla/client/executable_build_options_test.cc
- third_party/xla/xla/hlo/builder/xla_builder.cc
- third_party/xla/xla/layout_util.cc
- third_party/xla/xla/service/hlo_cse.cc
- third_party/xla/xla/shape.cc
- third_party/xla/xla/shape.h
- third_party/xla/xla/shape_util.cc
- third_party/xla/xla/shape_util.h
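The exclusivity guarantee and the conservative fallback can both be mirrored in a small Python tagged union. The class and state names here are illustrative analogues of the C++ std::variant<> change, not the real Shape API.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class ArrayState:
    dimensions: tuple[int, ...]

@dataclass
class TupleState:
    elements: tuple["Shape", ...]

class Shape:
    """Holds exactly one of the mutually exclusive shape states at a
    time, so an array shape can never be confused with a tuple shape."""
    def __init__(self, state: Union[ArrayState, TupleState]) -> None:
        self._state = state

    def dimensions(self) -> tuple[int, ...]:
        # Conservative behavior from the commit: violating the
        # precondition returns a default (empty) value, not a crash.
        if isinstance(self._state, ArrayState):
            return self._state.dimensions
        return ()
```

The same discipline applies to token, opaque, and invalid states; the variant makes illegal cross-case access a type error at the representation level rather than a latent runtime bug.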