tensorflow changelog
Hey there, code wranglers! We've got some exciting updates for you. Check out the latest and greatest changes that are making our codebase even more awesome.
Improvements
- Streamlined Kernel Management: Combined `StreamExecutor::GetKernel` and `StreamExecutor::CreateKernel` into a single method, `StreamExecutor::LoadKernel`. This simplifies the interface and enhances memory management.
- Efficient Operand Resharding: Optimized the partitioning of dot operations by directly resharding the rhs operand to match the lhs and result tensor shardings, eliminating redundant rematerialization.
- Enhanced GPU Operations: Introduced `IndexingMapAttr` to `ApplyIndexingOp`, improving the efficiency and correctness of GPU fusions in XLA.
New Features
- String Shape Kernel: Added registration for a Shape kernel that handles string tensors, enhancing TensorFlow's capabilities for string data processing on GPUs.
- ASCII Art Memory Map: Introduced a function to print a compact 2D map of occupied heap memory over time as ASCII art, making debugging easier and more fun!
- Long Polling for Error Propagation: Added long polling as a new way to propagate errors in the coordination service, improving robustness and responsiveness.
- Gloo Support on macOS: Enabled Gloo to function on macOS using the libuv transport mechanism, expanding its compatibility.
- Experimental Command Buffers: Added a flag to enable command buffers during profiling sessions in the XLA GPU backend, providing more flexibility.
Bugfixes
- HLO Evaluator Stability: Fixed an issue where the HLO evaluator would dereference a disengaged optional, preventing potential runtime errors.
- Coordination Service Test: Addressed a data race in `coordination_service_test.cc` by implementing notifications for proper thread synchronization.
- oneDNN Crashes: Fixed crashes in oneDNN matmul, convolution, and layer norm tests by ensuring proper initialization of the `operands_stack_alloca` arrays.
Chores
- Model Builder Relocation: Moved `model_builder` from TensorFlow Lite core to the TensorFlow compiler/converter module, streamlining the directory structure.
That's all for now, folks! Keep coding and stay awesome!
Included Commits
This commit introduces an optimization in the partitioning of dot operations in a distributed computing environment, specifically addressing the handling of operands with matched sharding on non-contracting dimensions. Previously, the partitioner would redundantly reshard the right-hand side (rhs) operand twice, which was inefficient. The new approach allows for a more efficient resharding of the rhs directly from its original sharding to a more suitable configuration that matches the left-hand side (lhs) and the resulting tensor, thereby eliminating unnecessary rematerialization of the rhs operand.
The mechanism works by checking if the lhs and the result of the dot operation have matching sharding axes along non-contracting dimensions. If they do, the partitioner attempts to reshard the rhs operand to an expected sharding configuration rather than defaulting to replication if the initial attempt fails. This change not only improves the efficiency of the partitioning process but also enhances the overall performance of dot operations in distributed settings, as demonstrated by the addition of corresponding tests that validate the new behavior.
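To make that decision flow concrete, here is a minimal, self-contained C++ sketch of the control flow described above. `Sharding`, `TryReshard`, and `PartitionRhs` are toy stand-ins invented for illustration, not XLA's actual SPMD partitioner types:

```cpp
// Toy model of the new rhs-resharding path: if lhs and the dot's result
// agree on their non-contracting sharding axes, reshard rhs once to the
// expected sharding instead of falling back to full replication.
#include <iostream>
#include <optional>
#include <string>

struct Sharding {
  std::string spec;  // a real HLO sharding would be e.g. "{devices=[2,1]0,1}"
  bool operator==(const Sharding& o) const { return spec == o.spec; }
};

// Pretend reshard: succeeds unless the target is marked unreachable.
std::optional<Sharding> TryReshard(const Sharding& from, const Sharding& to) {
  if (to.spec == "unreachable") return std::nullopt;
  return to;
}

Sharding PartitionRhs(const Sharding& rhs, const Sharding& lhs,
                      const Sharding& result, const Sharding& expected) {
  if (lhs == result) {
    // New path: one direct reshard rhs -> expected, no rematerialization.
    if (auto resharded = TryReshard(rhs, expected)) return *resharded;
  }
  return Sharding{"replicated"};  // old fallback: replicate the operand
}

int main() {
  Sharding rhs{"rhs"}, lhs{"batch"}, result{"batch"}, expected{"batch"};
  std::cout << PartitionRhs(rhs, lhs, result, expected).spec << "\n";  // batch
}
```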
Files changed
- third_party/xla/xla/service/spmd/dot_handler.cc
- third_party/xla/xla/service/spmd/spmd_partitioner_test.cc
The commit introduces the `IndexingMapAttr` to the `ApplyIndexingOp` within the XLA (Accelerated Linear Algebra) framework, specifically targeting GPU operations. This enhancement involves modifications across various test files and implementation files, indicating a broad impact on the functionality and testing of GPU fusions related to MLIR (Multi-Level Intermediate Representation).
The changes include updates to multiple test cases and the core implementation files, such as `xla_gpu_ops.cc` and `xla_gpu_ops.h`, ensuring that the new indexing map attribute is properly integrated and tested across different scenarios. The modifications also encompass a range of MLIR test files that validate the behavior of operations like concatenation, dynamic updates, reductions, and vectorization, highlighting the commit's focus on improving the efficiency and correctness of GPU fusions in XLA.
Files changed
- third_party/xla/xla/service/gpu/fusions/concatenate_mlir_test.cc
- third_party/xla/xla/service/gpu/fusions/in_place_dynamic_update_slice_mlir_test.cc
- third_party/xla/xla/service/gpu/fusions/loop_mlir_test.cc
- third_party/xla/xla/service/gpu/fusions/mlir/elemental_hlo_to_mlir_test.cc
- third_party/xla/xla/service/gpu/fusions/mlir/ir/xla_gpu_ops.cc
- third_party/xla/xla/service/gpu/fusions/mlir/ir/xla_gpu_ops.h
- third_party/xla/xla/service/gpu/fusions/mlir/ir/xla_gpu_ops.td
- third_party/xla/xla/service/gpu/fusions/mlir/tests/canonicalize.mlir
- third_party/xla/xla/service/gpu/fusions/mlir/tests/flatten_tensors.mlir
- third_party/xla/xla/service/gpu/fusions/mlir/tests/invalid.mlir
- third_party/xla/xla/service/gpu/fusions/mlir/tests/lower_tensors.mlir
- third_party/xla/xla/service/gpu/fusions/mlir/tests/ops.mlir
- third_party/xla/xla/service/gpu/fusions/mlir/tests/optimize_loops.mlir
- third_party/xla/xla/service/gpu/fusions/mlir/tests/simplify_affine.mlir
- third_party/xla/xla/service/gpu/fusions/mlir/tests/simplify_arith.mlir
- third_party/xla/xla/service/gpu/fusions/mlir/tests/vectorize_loads_stores.mlir
- third_party/xla/xla/service/gpu/fusions/reduction_mlir_test.cc
- third_party/xla/xla/service/gpu/fusions/scatter_mlir_test.cc
- third_party/xla/xla/service/gpu/fusions/triton/triton_fusion_emitter_device_test.cc
This commit introduces long polling as a new mechanism for error propagation within the coordination service of the TensorFlow framework. The changes are extensive, affecting various components including coordination service agents, client implementations, and associated tests, indicating a comprehensive integration of this feature across the codebase.
Key files modified include the coordination service protocol definitions, client and service implementations, and related test cases, ensuring that the new long polling mechanism is well-supported and thoroughly tested. This enhancement aims to improve the robustness and responsiveness of the coordination service, enabling better error handling and communication in distributed runtime environments.
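For intuition, here is a minimal sketch of the long-polling pattern itself, using a toy service class invented for this example rather than the real coordination-service API: the client parks one blocking call, and the service completes it only when an error actually arrives, so errors propagate without periodic status checks.

```cpp
// Toy long-poll: one outstanding PollForError call per client, completed
// only when an error is recorded on the service side.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <string>
#include <thread>

class ToyCoordinationService {
 public:
  // Blocks the polling thread until an error has been reported.
  std::string PollForError() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return error_.has_value(); });
    return *error_;
  }
  void ReportError(std::string msg) {
    { std::lock_guard<std::mutex> lock(mu_); error_ = std::move(msg); }
    cv_.notify_all();
  }
 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::optional<std::string> error_;
};

int main() {
  ToyCoordinationService service;
  std::thread poller([&] {
    // The "long poll": a single blocking call replaces repeated checks.
    std::cout << "propagated: " << service.PollForError() << "\n";
  });
  service.ReportError("task 3 missed heartbeat");
  poller.join();
}
```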
Files changed
- tensorflow/core/common_runtime/next_pluggable_device/c_plugin_coordination_service_agent_test.cc
- third_party/xla/third_party/tsl/tsl/protobuf/coordination_config.proto
- third_party/xla/third_party/tsl/tsl/protobuf/coordination_service.proto
- third_party/xla/xla/tsl/distributed_runtime/coordination/BUILD
- third_party/xla/xla/tsl/distributed_runtime/coordination/coordination_client.h
- third_party/xla/xla/tsl/distributed_runtime/coordination/coordination_service.cc
- third_party/xla/xla/tsl/distributed_runtime/coordination/coordination_service.h
- third_party/xla/xla/tsl/distributed_runtime/coordination/coordination_service_agent.cc
- third_party/xla/xla/tsl/distributed_runtime/coordination/coordination_service_agent_test.cc
- third_party/xla/xla/tsl/distributed_runtime/coordination/coordination_service_rpc_handler.cc
- third_party/xla/xla/tsl/distributed_runtime/coordination/coordination_service_rpc_handler.h
- third_party/xla/xla/tsl/distributed_runtime/coordination/coordination_service_test.cc
- third_party/xla/xla/tsl/distributed_runtime/rpc/coordination/grpc_coordination_client.cc
- third_party/xla/xla/tsl/distributed_runtime/rpc/coordination/grpc_coordination_service_impl.cc
- third_party/xla/xla/tsl/distributed_runtime/rpc/coordination/grpc_coordination_service_impl.h
The recent commit titled "hlo_evaluator: Don't dereference a disengaged optional" addresses a critical issue in the HLO evaluator by ensuring that the code does not attempt to dereference an optional value that may not be present. The change primarily focuses on the function that parses evaluation error details from an `absl::Status` object. The original implementation incorrectly assumed that the optional value would always contain a valid entry, which could lead to runtime errors if the value was absent. The updated code now checks for the presence of the value before attempting to access it, thus improving the robustness of error handling.
In addition to this safeguard, the commit also includes modifications to the associated header and test files to reflect the changes made in the evaluation error parsing logic. New tests have been added to validate the behavior of the parsing function under various conditions, including cases where the payload is absent or present. Overall, this commit enhances the reliability of the HLO evaluator by preventing potential dereferencing errors and ensuring that the system handles evaluation errors more gracefully.
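The pattern behind the fix is easy to show in isolation. This is a generic illustration of guarding a `std::optional` before access, with a hypothetical `ParsePayload` helper standing in for the evaluator's internal parsing function:

```cpp
// Dereferencing a disengaged std::optional is undefined behavior; the
// guarded version mirrors the commit's check-before-access approach.
#include <iostream>
#include <optional>
#include <string>

std::optional<std::string> ParsePayload(bool present) {
  if (!present) return std::nullopt;
  return std::string("error details");
}

int main() {
  std::optional<std::string> payload = ParsePayload(/*present=*/false);
  // Before: std::string details = *payload;  // UB when payload is empty.
  // After: check for a value first and fall back gracefully.
  std::string details = payload.has_value() ? *payload : "<no payload>";
  std::cout << details << "\n";
}
```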
Files changed
- third_party/xla/xla/hlo/evaluator/BUILD
- third_party/xla/xla/hlo/evaluator/hlo_evaluator.cc
- third_party/xla/xla/hlo/evaluator/hlo_evaluator.h
- third_party/xla/xla/hlo/evaluator/hlo_evaluator_test.cc
The commit addresses a data race issue in the `coordination_service_test.cc` file by implementing notifications to ensure the correct ordering of operations across threads. The changes involve modifying the error polling mechanism for tasks to utilize `absl::Notification`, which allows for synchronization between threads. This ensures that the heartbeat errors are properly propagated to all tasks and that the status callbacks are executed in the expected order.
In addition to enhancing the synchronization, the commit also removes redundant code related to storing statuses in a vector, streamlining the test logic. The modifications help to ensure that errors are reported accurately when tasks fail to send heartbeats, thereby improving the reliability of the coordination service's error handling during testing. Overall, the changes contribute to more robust and thread-safe unit tests for the coordination service.
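As an illustration, here is the synchronization pattern in miniature, using the real `absl::Notification` API but with a simplified stand-in for the service's status callback:

```cpp
// The notification guarantees the callback has run before the test asserts
// on its result, removing the data race on the shared string.
#include <iostream>
#include <string>
#include <thread>

#include "absl/synchronization/notification.h"

int main() {
  absl::Notification error_delivered;
  std::string observed_error;

  // Stand-in for the status callback invoked by the service thread.
  std::thread service_thread([&] {
    observed_error = "heartbeat timeout for task 1";
    error_delivered.Notify();  // publish-then-notify: no race on the string
  });

  // Test thread blocks until the callback has definitely finished.
  error_delivered.WaitForNotification();
  std::cout << "callback observed: " << observed_error << "\n";
  service_thread.join();
  return 0;
}
```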
Files changed
- third_party/xla/xla/tsl/distributed_runtime/coordination/coordination_service_test.cc
This commit involves the relocation of the `model_builder` component from the TensorFlow Lite core directory to the TensorFlow compiler/converter module. The changes include the addition of new files `model_builder_base.cc` and `model_builder_base.h` in the `tensorflow/compiler/mlir/lite/core` directory, which likely serve as the foundational implementation for the model builder within the new context.
Additionally, modifications were made to several build files to reflect this change, including updates to `BUILD` files and `CMakeLists.txt` in both the compiler and Lite core directories. The original `model_builder.cc` file has been removed from the TensorFlow Lite core, indicating a complete transition of the model-building functionality to the compiler/converter framework.
Files changed
- tensorflow/compiler/mlir/lite/core/BUILD
- tensorflow/compiler/mlir/lite/core/model_builder_base.cc
- tensorflow/compiler/mlir/lite/core/model_builder_base.h
- tensorflow/lite/CMakeLists.txt
- tensorflow/lite/core/BUILD
- tensorflow/lite/core/model_builder.cc
- tensorflow/lite/core/model_builder.h
This commit introduces an experimental feature in the XLA (Accelerated Linear Algebra) GPU backend, allowing command buffers to be enabled during profiling sessions. By default, when profiling is active, the system switches from utilizing command buffers to an op-by-op execution mode to prevent potential issues such as memory corruption. The new flag, `xla_enable_command_buffers_during_profiling`, is added to the debug options, allowing users to toggle this behavior. The default setting for this flag is false, meaning that command buffers will not be used during profiling unless explicitly enabled.
Additionally, the commit includes modifications across several files to incorporate this feature, such as updating the `CommandBufferThunk` class to accept the new flag and adjusting the logic in the execution process to respect the flag's state. The commit also adds tests to ensure the correct behavior of command buffers when the flag is toggled on and off during profiling. Overall, this change aims to enhance flexibility in the profiling process of GPU operations while still maintaining safeguards against potential issues.
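A toy rendering of the gating logic might look like the following; `DebugOptions` and `ShouldUseCommandBuffer` here are simplified stand-ins rather than the actual XLA types, but the default-off, opt-in behavior matches the description above:

```cpp
#include <iostream>

struct DebugOptions {
  // Mirrors xla_enable_command_buffers_during_profiling; defaults to false.
  bool enable_command_buffers_during_profiling = false;
};

bool ShouldUseCommandBuffer(bool profiling_active, const DebugOptions& opts) {
  if (!profiling_active) return true;  // normal path: command buffers on
  return opts.enable_command_buffers_during_profiling;  // opt-in override
}

int main() {
  DebugOptions opts;
  std::cout << ShouldUseCommandBuffer(true, opts) << "\n";  // 0: op-by-op
  opts.enable_command_buffers_during_profiling = true;
  std::cout << ShouldUseCommandBuffer(true, opts) << "\n";  // 1: opted in
}
```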
Files changed
- third_party/xla/xla/debug_options_flags.cc
- third_party/xla/xla/service/gpu/ir_emitter_unnested.cc
- third_party/xla/xla/service/gpu/runtime/BUILD
- third_party/xla/xla/service/gpu/runtime/command_buffer_thunk.cc
- third_party/xla/xla/service/gpu/runtime/command_buffer_thunk.h
- third_party/xla/xla/service/gpu/runtime/command_buffer_thunk_test.cc
- third_party/xla/xla/xla.proto
This commit merges two methods, `StreamExecutor::GetKernel` and `StreamExecutor::CreateKernel`, into a single method called `StreamExecutor::LoadKernel`. The change is reflected across multiple files, including the header and implementation files for different executor classes such as `GpuExecutor`, `HostExecutor`, and `XlaInterpreterExecutor`. This consolidation simplifies the interface for loading kernels, allowing for a unified approach to kernel management.
The `LoadKernel` method returns a `std::unique_ptr<Kernel>` instead of using an output parameter, enhancing memory management and ownership clarity. The modifications include adjustments to existing logic for loading kernels from various sources, ensuring that kernel metadata is correctly set during the loading process. Overall, this change streamlines the codebase, reduces redundancy, and improves the maintainability of the kernel loading functionality within the XLA (Accelerated Linear Algebra) framework.
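The ownership change is the interesting part, so here is a before/after sketch with placeholder types (the real signatures involve kernel specs and `absl::StatusOr`, which are omitted here):

```cpp
#include <iostream>
#include <memory>

struct Kernel { const char* name = "saxpy"; };

struct OldStyleExecutor {
  // GetKernel/CreateKernel era: out-parameter, ownership left implicit.
  bool GetKernel(Kernel* out) { *out = Kernel{}; return true; }
};

struct NewStyleExecutor {
  // LoadKernel era: the returned unique_ptr owns the kernel outright.
  std::unique_ptr<Kernel> LoadKernel() { return std::make_unique<Kernel>(); }
};

int main() {
  NewStyleExecutor executor;
  std::unique_ptr<Kernel> kernel = executor.LoadKernel();
  std::cout << kernel->name << "\n";  // caller unambiguously owns `kernel`
}
```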
Files changed
- third_party/xla/xla/backends/interpreter/executor.h
- third_party/xla/xla/stream_executor/cuda/cuda_executor.cc
- third_party/xla/xla/stream_executor/gpu/gpu_executor.h
- third_party/xla/xla/stream_executor/host/host_executor.cc
- third_party/xla/xla/stream_executor/host/host_executor.h
- third_party/xla/xla/stream_executor/mock_stream_executor.h
- third_party/xla/xla/stream_executor/rocm/rocm_executor.cc
- third_party/xla/xla/stream_executor/stream_executor.h
This commit introduces support for Gloo on macOS, utilizing the libuv transport mechanism. This enhancement serves as an alternative to a previous implementation (#7726) and allows Gloo to function on macOS platforms, thereby expanding its compatibility. The commit closes a related pull request and involves modifications across several files, including the addition of necessary references to libuv in the build configurations.
Key changes include updates to the build files to incorporate libuv, modifications to the Gloo transport definitions to enable its use on macOS, and adjustments in the XLA Python bindings to support Gloo's collective operations using libuv. The commit enhances the overall functionality of the system by enabling macOS users to leverage Gloo's capabilities for collective communications, which is particularly beneficial for distributed computing applications.
Files changed
- tensorflow/tools/pip_package/THIRD_PARTY_NOTICES.txt
- tensorflow/workspace2.bzl
- third_party/gloo/gloo.BUILD
- third_party/uv/uv.BUILD
- third_party/uv/workspace.bzl
- third_party/xla/workspace2.bzl
- third_party/xla/xla/python/BUILD
- third_party/xla/xla/python/xla.cc
This commit addresses crashes that occurred when running oneDNN matmul, convolution, and layer normalization tests in libc++ hardened mode. The issue stemmed from the improper initialization of the `operands_stack_alloca` arrays in the emitters. To remedy this, the commit introduces a new method, `EmitOneDnnOperandsAlloca`, which efficiently allocates memory for the operands of oneDNN custom calls. This change not only fixes the crashes but also includes minor refactoring for better code organization.
In addition to the main fix, the commit modifies several files related to the XLA (Accelerated Linear Algebra) service, specifically within the CPU backend. The changes enhance the handling of operand allocation for oneDNN calls, ensuring that the arguments are correctly managed during the emission of LLVM IR. The updates also include adjustments to existing tests to ensure compatibility with the new implementation, ultimately improving the stability and performance of oneDNN operations within the XLA framework.
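The bug class is straightforward to demonstrate outside the IR emitter. The following standalone analogy (not the emitter code itself, which builds these operand arrays in LLVM IR) shows the size-exactly-and-initialize-every-slot discipline that libc++ hardened mode enforces:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

struct OperandInfo { const void* data; size_t size; };

// Buggy shape: a fixed-size stack array only partially written, with the
// callee later reading past the initialized prefix.
// Fixed shape: allocate exactly the needed count and fill every slot
// before handing the array off.
std::vector<OperandInfo> CollectOperands(
    const std::vector<std::vector<float>>& xs) {
  std::vector<OperandInfo> out;
  out.reserve(xs.size());
  for (const auto& x : xs)
    out.push_back({x.data(), x.size() * sizeof(float)});
  return out;  // every entry initialized before use
}

int main() {
  std::vector<std::vector<float>> operands = {{1, 2}, {3, 4, 5}};
  for (const auto& op : CollectOperands(operands))
    std::cout << op.size << " bytes\n";
}
```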
Files changed
- third_party/xla/xla/service/cpu/ir_emitter.cc
- third_party/xla/xla/service/cpu/ir_emitter.h
- third_party/xla/xla/service/cpu/tests/onednn_matmul_test.cc
This commit introduces the registration of a Shape kernel for the string data type in TensorFlow, enhancing the framework's capabilities to handle string tensors in GPU operations. Specifically, the code modifications in `shape_ops.cc` include the addition of the string type to the GPU kernel registration process, allowing for more efficient processing of string data.
Additionally, the commit updates the testing module `shape_ops_test.py` to include comprehensive tests for string tensors, ensuring that various operations such as shape, rank, and size comparisons are thoroughly validated for both CPU and GPU executions. The new tests utilize random string arrays of varying dimensions to confirm the correctness of the shape operations, thereby improving the robustness of TensorFlow's functionality when dealing with string types.
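For reference, a GPU kernel registration of this kind typically uses TensorFlow's standard `REGISTER_KERNEL_BUILDER` macro, roughly as sketched below; this is a guess at the shape of the change, and the exact constraint list in `shape_ops.cc` may differ:

```cpp
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/types.h"
#include "tensorflow/core/kernels/shape_ops.h"

namespace tensorflow {

// Shape only reads tensor metadata, so its int32 output lives in host
// memory even when the kernel is registered for the GPU device.
REGISTER_KERNEL_BUILDER(Name("Shape")
                            .Device(DEVICE_GPU)
                            .HostMemory("output")
                            .TypeConstraint<int32>("out_type")
                            .TypeConstraint<tstring>("T"),
                        ShapeOp<int32>);

}  // namespace tensorflow
```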
Files changed
- tensorflow/core/kernels/shape_ops.cc
- tensorflow/python/kernel_tests/array_ops/shape_ops_test.py
This commit introduces a new function that generates a compact ASCII art representation of occupied heap memory over time, enhancing the debugging capabilities of the heap simulator in the XLA (Accelerated Linear Algebra) service. The function calculates the best memory block size for visual representation and constructs a 2D memory map that indicates occupied and free memory blocks at various time intervals. It also includes logic to handle scenarios where the memory map exceeds a predefined dimension size, in which case it falls back to printing the details of the buffer nodes instead.
Additionally, the commit modifies several files in the XLA repository, including the heap simulator's core files and related test cases. It adds new functions to compute and format the memory map, as well as tests to validate the functionality of the ASCII art representation. These tests ensure that the output correctly reflects the memory usage based on buffer intervals, and they cover edge cases such as free memory and large memory requests. Overall, this enhancement aims to provide developers with a clearer visual understanding of memory allocation patterns during runtime.
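The core idea is simple enough to show in a self-contained toy (the real implementation also chooses a best block size and falls back to per-buffer details when the map would exceed the predefined dimensions; the types below are invented for illustration):

```cpp
#include <iostream>
#include <string>
#include <vector>

struct BufferInterval { std::string name; int start; int end; };  // [start, end)

// One row per buffer, one column per time step; '#' marks occupancy.
void PrintAsciiMemoryMap(const std::vector<BufferInterval>& buffers, int steps) {
  for (const auto& b : buffers) {
    std::string row(steps, '.');
    for (int t = b.start; t < b.end && t < steps; ++t) row[t] = '#';
    std::cout << row << "  " << b.name << "\n";
  }
}

int main() {
  PrintAsciiMemoryMap({{"param0", 0, 6}, {"dot.1", 2, 4}, {"add.2", 4, 8}}, 8);
  // Output:
  // ######..  param0
  // ..##....  dot.1
  // ....####  add.2
}
```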
Files changed
- third_party/xla/xla/service/heap_simulator/BUILD
- third_party/xla/xla/service/heap_simulator/heap_simulator.cc
- third_party/xla/xla/service/heap_simulator/heap_simulator.h
- third_party/xla/xla/service/heap_simulator/heap_simulator_test.cc