TensorFlow changelog


Here's a delightful summary of the latest updates and improvements, packed with exciting new features and crucial bug fixes! 🎉

  • New Feature: Host Memory Support in StreamExecutor
    We've rolled out support for MemoryType::kHost in the CreateStreamExecutor function across multiple executor types. This means you can now allocate and deallocate host memory with ease, thanks to the new GenericMemoryAllocator. Plus, we've added tests to ensure everything runs smoothly. 🚀

  • New Feature: ARM64 CPU Builds in XLA
    Say hello to ARM64 CPU builds for the XLA project via GitHub Actions! This nifty addition enhances our CI workflow, allowing for comprehensive testing across x86 and ARM64 architectures. 🛠️

  • Improvement: Custom Fusion Integrity in XLA
    We've improved instruction fusion by ensuring that custom fusions and calls remain intact. This update enhances the robustness of the fusion process, maintaining the integrity of custom operations. 🔧

  • New Feature: DMA Operations in PJRT C API
    Introducing PJRT_Client_DmaMap and DmaUnmap functions to the PJRT C API! These additions boost our direct memory access capabilities, complete with thorough testing to ensure seamless integration. 💾

  • New Feature: PyTorch Conversion in tf_tfl_translate
    We've added new flags to the tf_tfl_translate tool, making it easier to convert PyTorch saved models. Now you can specify the model's origin and enable direct lowering of composite operations. 🔄

  • Improvement: Cross-Compile Architecture Support
    Developers can now specify target machine architectures in cross-compile scenarios for CUDA, CUDNN, and NCCL. This update ensures smooth redistributions across various platforms. 🌐

  • Improvement: Bitcast Handling in XLA
    We've enhanced the handling of bitcasts in the XLA framework by allowing split dimension mapping. This change optimizes memory allocations and boosts performance. ⚡

  • New Feature: Attribute Management in HloInstruction
    Streamline your code with new methods for managing frontend attributes in HloInstruction. These functions simplify attribute handling, making your code more efficient and readable. 📈

  • Bugfix: Memory Crash in NcclAllToAllStartThunk
    We've fixed a rare crash issue in the memcpy implementation by switching from absl::flat_hash_map to arrays, ensuring stable and performant memory handling. 🐞

  • Bugfix: Executable Creation in HloRunnerPjRt
    A critical bug causing segmentation faults has been squashed by properly managing the ownership of executables in HloRunnerPjRt. 🛠️

  • Bugfix: Synchronous Dispatch for CPU Callbacks
    To prevent deadlocks, CPU executables with host callbacks will now dispatch synchronously. This temporary fix ensures resources are allocated effectively. 🔄

  • Chore: Clean-Up in StreamExecutor
    We've tidied up by removing the unused HostMemoryDeallocate method, enhancing code maintainability and clarity. 🧹

These updates are sure to enhance your experience and keep everything running smoothly. Happy coding! 🎈

Included Commits

2025-01-24T00:19:06 See commit

This commit introduces two new functions, PJRT_Client_DmaMap and PJRT_Client_DmaUnmap, to the PJRT C API, enhancing its capabilities for managing direct memory access (DMA) operations. The DmaMap function allows the mapping of host memory to the device, while DmaUnmap handles the unmapping process. Alongside the implementation, the commit bumps the API version to 0.67 and updates the changelog to reflect these additions.

The commit also includes extensive testing for the new DMA functions, ensuring their proper integration into the existing API structure. Tests were added to verify the functionality of mapping and unmapping operations, including checks for error handling when the operations are unimplemented. Additionally, the changes were documented in the header files and relevant source files to maintain consistency across the API. Overall, this update significantly enhances the PJRT C API's functionality by providing essential DMA capabilities.
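As a rough illustration of the calling convention involved, here is a minimal sketch in the spirit of the PJRT C API, where every call takes a zero-initialized args struct carrying its own struct_size and returns a PJRT_Error* that is null on success. The struct fields and validation logic below are assumptions for illustration, not the actual definitions in pjrt_c_api.h.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of an error object; in this toy, a non-null return means failure.
struct PJRT_Error { const char* message; };
static PJRT_Error kInvalidArgs{"invalid arguments"};

// Hypothetical arg structs mirroring the PJRT C API style (illustrative).
struct DmaMapArgs {
  std::size_t struct_size;
  void* data;        // host buffer to map to the device
  std::size_t size;  // buffer size in bytes
};

struct DmaUnmapArgs {
  std::size_t struct_size;
  void* data;  // previously mapped host buffer
};

// Toy implementations that only validate their inputs.
PJRT_Error* DmaMap(DmaMapArgs* args) {
  if (args == nullptr || args->data == nullptr || args->size == 0) {
    return &kInvalidArgs;
  }
  return nullptr;  // success
}

PJRT_Error* DmaUnmap(DmaUnmapArgs* args) {
  if (args == nullptr || args->data == nullptr) {
    return &kInvalidArgs;
  }
  return nullptr;  // success
}
```

The real API additionally threads a PJRT_Client through the args struct; the sketch omits that to stay self-contained.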

Files changed

  • third_party/xla/xla/pjrt/c/CHANGELOG.md
  • third_party/xla/xla/pjrt/c/pjrt_c_api.h
  • third_party/xla/xla/pjrt/c/pjrt_c_api_gpu_test.cc
  • third_party/xla/xla/pjrt/c/pjrt_c_api_test.cc
  • third_party/xla/xla/pjrt/c/pjrt_c_api_wrapper_impl.cc
  • third_party/xla/xla/pjrt/c/pjrt_c_api_wrapper_impl.h
  • third_party/xla/xla/pjrt/pjrt_c_api_client.cc
  • third_party/xla/xla/pjrt/pjrt_c_api_client.h
2025-01-24T21:13:49 See commit

The commit introduces enhancements to the memory space assignment in the XLA (Accelerated Linear Algebra) framework, specifically focusing on how bitcasts are handled in relation to split dimensions. It modifies the existing code to allow for the mapping of split dimensions when processing bitcast positions. Key changes include the addition of a BitcastSplitFn function type that determines the appropriate output split for a bitcast based on its input split. This function is integrated into various allocation processing methods, such as PinnedAllocation, CopyAllocation, and SlicedCopyAllocation, allowing these methods to leverage the bitcast split functionality when updating uses of allocations.

In total, the commit modifies several files, adding 35 lines of code and deleting 16. The updates ensure that when allocations are processed, the bitcast split function is invoked to manage the dimensions appropriately. This change is expected to improve the handling of memory allocations and optimize performance by ensuring that the correct layout and split configurations are applied during memory space assignments.
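To make the idea of a split-mapping callback concrete, here is a toy sketch (not the actual XLA types or signature): a BitcastSplitFn-style function that, given the split dimension of a bitcast's operand, returns the corresponding split dimension of the bitcast's output. The bitcast here is modeled as a plain dimension permutation.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative callback type: maps an input split dimension to the
// matching output split dimension, or -1 if none exists.
using BitcastSplitFn = std::function<int64_t(int64_t input_split_dim)>;

BitcastSplitFn MakePermutationSplitFn(std::vector<int64_t> perm) {
  // perm[i] = which input dimension output dimension i comes from.
  return [perm](int64_t input_split_dim) -> int64_t {
    for (int64_t out = 0; out < static_cast<int64_t>(perm.size()); ++out) {
      if (perm[out] == input_split_dim) return out;
    }
    return -1;  // split dimension not representable after the bitcast
  };
}
```

In the real pass the callback also receives the bitcast instruction itself and can fail for layouts that do not preserve the split; the toy keeps only the dimension-mapping core.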

Files changed

  • third_party/xla/xla/service/memory_space_assignment/allocation.cc
  • third_party/xla/xla/service/memory_space_assignment/allocation.h
  • third_party/xla/xla/service/memory_space_assignment/allocation_test.cc
  • third_party/xla/xla/service/memory_space_assignment/memory_space_assignment.cc
  • third_party/xla/xla/service/memory_space_assignment/options.h
2025-01-27T07:40:55 See commit

The commit addresses a critical issue in the HloRunnerPjRt component of the XLA (Accelerated Linear Algebra) library, specifically regarding the creation of executables. The problem stemmed from the unspecified order of evaluating function arguments, which caused the pjrt_executable to be released prematurely, leading to segmentation faults during execution. This issue resulted in multiple test failures, including all_reduce_test, broadcast_test, and others.

To resolve this, the commit modifies the PjRtWrappedExecutable class to accept a std::unique_ptr<PjRtLoadedExecutable> instead of a raw pointer, ensuring that the ownership of the executable is properly managed. This change not only prevents the premature release of the executable but also enhances the safety and stability of the code. The adjustments made in the codebase include both additions and deletions in the hlo_runner_pjrt.cc file, ultimately leading to the successful execution of the affected tests. Merging this commit closes the associated pull request (#21687).
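The bug class involved here is worth a minimal sketch (names are illustrative, not the real HloRunnerPjRt code). If a constructor takes a raw pointer alongside a moved unique_ptr, as in `Wrapper w(exe.get(), std::move(exe))`, the two arguments may be evaluated in either order, and if the move happens first the raw pointer dangles. Taking std::unique_ptr by value makes the ownership transfer explicit and removes the hazard:

```cpp
#include <memory>
#include <utility>

// Toy stand-in for a loaded executable.
struct Executable {
  int id = 42;
};

// The wrapper takes ownership via unique_ptr-by-value, so there is no
// raw-pointer argument whose evaluation order could race with the move.
class WrappedExecutable {
 public:
  explicit WrappedExecutable(std::unique_ptr<Executable> exe)
      : exe_(std::move(exe)) {}
  const Executable* get() const { return exe_.get(); }

 private:
  std::unique_ptr<Executable> exe_;
};
```

The unspecified order of evaluation of function arguments is a long-standing C++ pitfall; passing ownership through a single argument sidesteps it entirely.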

Files changed

  • third_party/xla/xla/service/hlo_runner_pjrt.cc
2025-01-27T19:53:32 See commit

This commit introduces support for specifying the target machine architecture in cross-compile scenarios for CUDA, CUDNN, and NCCL repositories. When downloading redistributions for different architectures, developers can now provide the target architecture name using the --repo_env=CUDA_REDIST_TARGET_PLATFORM flag. This enhancement allows for greater flexibility in building and deploying applications across various hardware platforms by ensuring that the appropriate architecture is recognized during the repository initialization process.

The changes primarily involve modifications to the cuda_redist_init_repositories.bzl and nccl_redist_init_repository.bzl files, where logic was added to handle the target architecture. If a target architecture is specified and is valid, it is used as the host architecture; otherwise, the system defaults to the current host architecture. This commit improves the usability of the build system in cross-compilation contexts and ensures that developers can seamlessly work with multiple architectures.
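For reference, an invocation might look like the following. Only the flag name comes from this commit; the build target and the architecture value are placeholders for illustration.

```shell
# Cross-compile for an ARM64 target while downloading matching
# CUDA/CUDNN/NCCL redistributions (target and value are illustrative).
bazel build //your:target \
  --repo_env=CUDA_REDIST_TARGET_PLATFORM=aarch64
```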

Files changed

  • third_party/gpus/cuda/hermetic/cuda_redist_init_repositories.bzl
  • third_party/nccl/hermetic/nccl_redist_init_repository.bzl
2025-01-28T01:51:45 See commit

This commit introduces new flags to the tf_tfl_translate tool, enhancing its functionality for converting PyTorch saved models. Specifically, it adds the model_origin_framework flag, which allows users to specify the source model type (e.g., PyTorch, JAX, TensorFlow), and the enable_composite_direct_lowering flag, which determines whether to directly lower composite operations into TensorFlow Lite (TFLite) operations. These changes are reflected in the modified source files where the new flags are integrated into the conversion process.

The update includes modifications to the command-line options and the internal configuration for the TFLite converter, ensuring that the newly added flags are properly parsed and utilized. The adjustments aim to improve the flexibility and usability of the conversion tool, making it more adaptable to various model origins and lowering strategies. Overall, this commit enhances the capabilities of the tf_tfl_translate tool for handling diverse machine learning frameworks.
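A hypothetical invocation is sketched below. The two flag names come from this commit; the flag values, input path, and output handling are assumptions for illustration and may not match the tool's actual syntax.

```shell
# Convert a PyTorch-originated saved model, lowering composite ops
# directly to TFLite ops (values and paths are illustrative).
tf_tfl_translate \
  --model_origin_framework=PYTORCH \
  --enable_composite_direct_lowering \
  /path/to/saved_model -o /tmp/model.tflite
```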

Files changed

  • tensorflow/compiler/mlir/lite/tf_tfl_translate.cc
  • tensorflow/compiler/mlir/lite/tf_tfl_translate_cl.cc
  • tensorflow/compiler/mlir/lite/tf_tfl_translate_cl.h
2025-01-29T02:22:55 See commit

This commit addresses an issue with deadlocks that can occur when using host callbacks in asynchronously dispatched CPU executables within the JAX library. The problem arises when the execution of a callback itself triggers additional asynchronous JAX CPU code, leading to a situation where the XLA intra-op thread pool becomes overwhelmed with pending callbacks. To mitigate this, the commit introduces a temporary fix that mandates synchronous dispatch for CPU executables that contain host callbacks, ensuring that resources are adequately allocated to prevent deadlocks.

The change is implemented in the PjRtLoadedExecutable::Execute function, where it checks for the presence of host callbacks and sets the execution mode to synchronous if any are found. This adjustment is crucial to avoid the complications that can arise from reentrant callbacks. The author notes that while this change is a step towards a more robust solution, a follow-up will address additional conditions related to executing programs within the body of callbacks. The commit also updates the version number of the library to reflect these changes.
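The shape of that check can be sketched as follows (illustrative types only; the real logic lives in PjRtLoadedExecutable::Execute): if the executable carries any host callbacks, force synchronous dispatch so a callback that itself launches CPU work cannot exhaust the thread pool.

```cpp
#include <vector>

enum class DispatchMode { kAsynchronous, kSynchronous };

struct HostCallback { /* opaque in this sketch */ };

// Sketch: downgrade to synchronous dispatch whenever host callbacks are
// present, to avoid the reentrant-callback deadlock described above.
DispatchMode ChooseDispatchMode(const std::vector<HostCallback>& callbacks,
                                DispatchMode requested) {
  if (!callbacks.empty()) {
    return DispatchMode::kSynchronous;
  }
  return requested;
}
```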

Files changed

  • third_party/xla/xla/python/pjrt_ifrt/pjrt_executable.cc
  • third_party/xla/xla/python/xla_client.py
2025-01-29T22:12:18 See commit

This commit introduces support for the MemoryType::kHost in the CreateStreamExecutor function across relevant executor types in TensorFlow's stream executor module. The changes involve implementing a GenericMemoryAllocator that allows for host memory allocation, enabling the allocation and deallocation of memory on the host using provided functions. The new functionality includes error handling for failed memory allocations and ensures that appropriate cleanup occurs when memory is no longer needed.

Additionally, the commit includes new tests to verify the correct behavior of the host memory allocator, ensuring that memory can be allocated and subsequently deallocated as expected. This enhancement is reflected in various executors, including CUDA and ROCm executors, which now support host memory operations, thereby improving the flexibility and usability of TensorFlow's memory management capabilities.
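The allocator concept can be modeled roughly as below. This is an illustrative sketch in the spirit of the GenericMemoryAllocator, not the real StreamExecutor API: it is built from caller-provided allocate/deallocate functions and hands back memory that frees itself, with a null result signaling allocation failure.

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <memory>
#include <utility>

// Sketch of a generic host-memory allocator parameterized by the
// allocation and deallocation functions it should use.
class GenericHostAllocator {
 public:
  using Deleter = std::function<void(void*)>;
  using OwnedBuffer = std::unique_ptr<void, Deleter>;

  GenericHostAllocator(std::function<void*(std::size_t)> alloc,
                       Deleter dealloc)
      : alloc_(std::move(alloc)), dealloc_(std::move(dealloc)) {}

  // Returns a null-holding buffer on allocation failure; otherwise the
  // buffer deallocates itself through the provided deleter.
  OwnedBuffer Allocate(std::size_t bytes) {
    void* ptr = alloc_(bytes);
    return OwnedBuffer(ptr, ptr ? dealloc_ : Deleter([](void*) {}));
  }

 private:
  std::function<void*(std::size_t)> alloc_;
  Deleter dealloc_;
};
```

Wiring in `std::malloc`/`std::free` yields a plain host allocator; a CUDA or ROCm executor would instead pass pinned-memory routines.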

Files changed

  • tensorflow/c/experimental/stream_executor/stream_executor.cc
  • tensorflow/c/experimental/stream_executor/stream_executor_test.cc
  • third_party/xla/xla/backends/interpreter/BUILD
  • third_party/xla/xla/backends/interpreter/executor.h
  • third_party/xla/xla/stream_executor/cuda/cuda_executor.cc
  • third_party/xla/xla/stream_executor/cuda/cuda_executor_test.cc
  • third_party/xla/xla/stream_executor/host/BUILD
  • third_party/xla/xla/stream_executor/host/host_executor.cc
  • third_party/xla/xla/stream_executor/host/host_executor.h
  • third_party/xla/xla/stream_executor/rocm/rocm_executor.cc
  • third_party/xla/xla/stream_executor/rocm/rocm_executor_test.cc
2025-01-30T00:30:51 See commit

This commit introduces new methods for managing frontend attributes in the HloInstruction class, specifically the get_*, add_*, and set_frontend_attribute functions. These enhancements streamline the process of adding or modifying attributes by eliminating the need for temporary attribute objects, thereby simplifying the code and improving performance. The changes are accompanied by the addition of tests to ensure the correctness of these new functionalities, which include conditional addition of attributes and direct setting of values.

In addition to the new methods, the commit also updates various parts of the codebase to utilize these functions, replacing previous patterns of creating temporary FrontendAttributes objects. This not only reduces code complexity but also enhances readability by making the intent clearer when setting or adding attributes directly. Overall, the changes contribute to a more efficient and maintainable code structure within the XLA (Accelerated Linear Algebra) framework.
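A toy model of such accessors is sketched below (the real methods live on HloInstruction; the class, method names, and semantics here are illustrative). The point is that callers set, conditionally add, or read one attribute directly instead of building a temporary attribute bag and merging it.

```cpp
#include <map>
#include <optional>
#include <string>

// Illustrative instruction carrying a string-keyed attribute map.
class Instruction {
 public:
  // Sets the attribute unconditionally, overwriting any existing value.
  void set_frontend_attribute(const std::string& key,
                              const std::string& value) {
    attributes_[key] = value;
  }

  // Adds only if absent; returns true when the attribute was inserted.
  bool add_frontend_attribute(const std::string& key,
                              const std::string& value) {
    return attributes_.emplace(key, value).second;
  }

  // Returns the value if present, std::nullopt otherwise.
  std::optional<std::string> get_frontend_attribute(
      const std::string& key) const {
    auto it = attributes_.find(key);
    if (it == attributes_.end()) return std::nullopt;
    return it->second;
  }

 private:
  std::map<std::string, std::string> attributes_;
};
```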

Files changed

  • third_party/xla/xla/frontend_attributes.cc
  • third_party/xla/xla/hlo/ir/BUILD
  • third_party/xla/xla/hlo/ir/hlo_instruction.h
  • third_party/xla/xla/hlo/ir/hlo_instruction_test.cc
  • third_party/xla/xla/service/collective_permute_decomposer.cc
  • third_party/xla/xla/service/gpu/transforms/collective_permute_valid_iteration_annotator.cc
  • third_party/xla/xla/service/gpu/transforms/windowed_einsum_handler.cc
  • third_party/xla/xla/service/spmd/stateful_rng_spmd_partitioner.cc
2025-01-30T01:54:22 See commit

This commit introduces support for a self-hosted ARM64 CPU build for the XLA (Accelerated Linear Algebra) project through GitHub Actions. It modifies the existing build configuration by adding a new enumeration type, CPU_ARM64_SELF_HOSTED, to the build types, which allows for the integration of ARM64 CPU builds into the continuous integration workflow. The changes include updates to the build.py script to define the new build type, modifications to the golden commands to include specific build commands for ARM64, and additions to the CI workflow configuration file to set up the necessary jobs for building and testing on ARM64 architecture.

Additionally, the commit enhances the CI workflow by establishing a matrix job that runs on both x86 and ARM64 environments, allowing for comprehensive testing across different architectures. The new ARM64 job will utilize a dedicated Docker container for building, ensuring that the project can be effectively compiled and tested in an ARM64 environment. This update not only broadens the compatibility of the XLA project but also improves the robustness of the CI process by enabling more extensive testing capabilities.

Files changed

  • third_party/xla/.github/workflows/ci.yml
  • third_party/xla/build_tools/ci/build.py
  • third_party/xla/build_tools/ci/golden_commands.txt
2025-01-30T21:47:07 See commit

This commit addresses a rare crash issue in the memcpy implementation of the NcclAllToAllStartThunk class, which was caused by using pointers from an absl::flat_hash_map. The original implementation registered these pointers with cuMemHostRegister, but since flat_hash_map does not guarantee pointer stability, a rehash could lead to pointer changes and unregistered memory, resulting in crashes with errors like CUDA_ERROR_ILLEGAL_ADDRESS. To resolve this, the code has been modified to use a pointer to an array instead of a hash map, which offers better stability and performance while avoiding the pitfalls associated with pointer rehashing.

Additionally, the commit introduces an 8-GPU all-to-all test to ensure that the issue does not manifest with a larger number of GPUs, as it was not consistently reproducible with only 2 GPUs. The changes include updates to the memory management logic, replacing the hash map structure with arrays and ensuring proper registration and unregistration of host memory. This enhancement aims to improve the robustness of the memory handling in GPU collective operations.
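The stability property at the heart of the fix can be sketched as follows (types are illustrative). Registering the address of a value stored inside an absl::flat_hash_map is unsafe because a rehash moves the values, so a pointer handed to cuMemHostRegister can go stale; storing the buffers in a heap array sized up front keeps every address fixed for the lifetime of the thunk.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Toy stand-in for a host buffer registered with the driver.
struct HostBuffer {
  uint64_t payload = 0;
};

// Sketch: a fixed-size table whose element addresses never move, so each
// address can be registered exactly once and stays valid thereafter.
class StableBufferTable {
 public:
  explicit StableBufferTable(std::size_t num_ranks)
      : buffers_(std::make_unique<HostBuffer[]>(num_ranks)),
        size_(num_ranks) {}

  // The returned pointer is stable for the table's lifetime.
  HostBuffer* buffer_for_rank(std::size_t rank) {
    return rank < size_ ? &buffers_[rank] : nullptr;
  }

 private:
  std::unique_ptr<HostBuffer[]> buffers_;
  std::size_t size_;
};
```

By contrast, std::unordered_map would have been pointer-stable but slower; the array keeps both the stability guarantee and the performance.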

Files changed

  • third_party/xla/xla/backends/gpu/runtime/BUILD
  • third_party/xla/xla/backends/gpu/runtime/nccl_all_to_all_thunk.cc
  • third_party/xla/xla/backends/gpu/runtime/nccl_all_to_all_thunk.h
  • third_party/xla/xla/tests/BUILD
  • third_party/xla/xla/tests/collective_ops_e2e_test.cc
2025-01-31T02:31:41 See commit

This commit removes the HostMemoryDeallocate method and its overrides from various classes within the TensorFlow codebase, specifically in the StreamExecutor and its derived classes. The method was previously used to deallocate host memory allocated by HostMemoryAllocate, but it has become unnecessary and is thus being cleaned up from the code. The changes affect several files, including implementations for CUDA, ROCm, and other executors, reflecting a broader effort to streamline the codebase by eliminating unused functionality.

By deleting these method declarations and their implementations across multiple executors, the commit enhances code maintainability and reduces potential confusion regarding memory management within the TensorFlow framework. This change is part of ongoing efforts to optimize the code and ensure that only relevant and utilized methods remain in the system.

Files changed

  • tensorflow/c/experimental/stream_executor/stream_executor.cc
  • third_party/xla/xla/backends/interpreter/executor.h
  • third_party/xla/xla/stream_executor/cuda/cuda_executor.cc
  • third_party/xla/xla/stream_executor/cuda/cuda_executor.h
  • third_party/xla/xla/stream_executor/host/host_executor.h
  • third_party/xla/xla/stream_executor/mock_stream_executor.h
  • third_party/xla/xla/stream_executor/rocm/rocm_executor.cc
  • third_party/xla/xla/stream_executor/rocm/rocm_executor.h
  • third_party/xla/xla/stream_executor/stream_executor.h
  • third_party/xla/xla/stream_executor/tpu/tpu_executor.h
2025-01-31T02:55:11 See commit

This commit introduces functionality to prevent the fusion of instructions that are part of custom fusions or calls within the CPU instruction fusion implementation in the XLA (Accelerated Linear Algebra) library. The CpuInstructionFusion class now includes a method called ComputeInstructionsToSkip, which identifies and tracks instructions that should not be fused, specifically those linked to CustomFusion or CustomCall operations. The ShouldSkip method checks if a given instruction is in the set of instructions to skip, and the ShouldFuse method has been updated to respect this logic, explicitly forbidding the fusion of such instructions.

Additionally, the commit includes new tests to validate this behavior. Two test cases ensure that the fusion process correctly skips computations attached to custom fusions and custom calls, confirming that the intended instructions are not fused and the overall fusion decision is accurately reflected in the output. These changes enhance the robustness of the instruction fusion process by maintaining the integrity of custom operations in the computational graph.
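The skip-set pattern can be modeled in miniature as follows (the real logic is CpuInstructionFusion::ComputeInstructionsToSkip and ShouldSkip; the structs and traversal depth here are simplified assumptions). Instructions reached from a custom fusion or custom call are collected into a set, and the fusion decision consults that set first.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy instruction: a name, a custom-op flag, and outgoing call edges.
struct Instr {
  std::string name;
  bool is_custom = false;            // custom fusion / custom call
  std::vector<const Instr*> called;  // instructions it calls into
};

// Collects instructions reachable from a custom op into the skip set.
// (The real pass walks whole called computations; the toy walks two
// levels of call edges, which is enough to show the idea.)
void CollectSkips(const Instr& instr, std::set<const Instr*>& skip) {
  if (!instr.is_custom) return;
  for (const Instr* callee : instr.called) {
    skip.insert(callee);
    for (const Instr* nested : callee->called) skip.insert(nested);
  }
}

// Fusion is forbidden for skipped instructions and for custom ops.
bool ShouldFuse(const Instr& instr, const std::set<const Instr*>& skip) {
  return skip.count(&instr) == 0 && !instr.is_custom;
}
```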

Files changed

  • third_party/xla/xla/service/cpu/cpu_instruction_fusion.cc
  • third_party/xla/xla/service/cpu/cpu_instruction_fusion.h
  • third_party/xla/xla/service/cpu/cpu_instruction_fusion_test.cc