TensorFlow Changelog
Here's a delightful update on the latest changes and enhancements:
New Features
- XLA:CPU Thunk Serialization: We've jazzed up the XLA CPU backend with initial thunk serialization. Thunks, those nifty units of computation, can now be serialized and deserialized, making computation saving and restoring a breeze. This is particularly handy for distributed computing scenarios.
- NCCL ncclCommInitRankScalable API Support: The XLA GPU framework now supports the NCCL ncclCommInitRankScalable API. This allows NCCL communicators to be initialized using multiple root ranks, boosting performance in large-scale environments. You can tweak the ranks per root with a snazzy flag too!
- Dispatch Op Custom Options: Introducing functions for managing custom options in TensorFlow Lite's LiteRT core using the FlexBuffers API. This adds a structured, efficient way to handle dispatch operation options. Flexibility, meet efficiency!
- Data Lineage Logging: TensorFlow now sports a data lineage logging mechanism, helping you track and manage data like a pro. Perfect for those who love to keep things organized!
- IFRT Atom Programs Utility Pass: A new utility pass writes atom programs and the main IFRT function to files, enhancing the management and output of atom programs in XLA.
Improvements
- Coordination Service Task Reconnection: Restartable tasks can now reconnect to a cluster, provided they maintain the same local topology. This boosts stability and reliability.
- Gather/Scatter Operand Overlap Handling: We've added functionality to create copies of operands in gather and scatter instructions when they overlap, ensuring smooth operation without memory conflicts.
- StreamExecutor Memory Allocation Unification: A first step towards unifying memory allocation methods, with new classes for streamlined management. Future-proofing memory handling like a boss!
Bug Fixes
- XLA:Python GIL Scoping: Fixed the scoping of the GIL release in the XLA Python extension during nb::bytes object construction. No more threading hiccups!
- PjitFunction Locking: Ensured the lock on cache_ is held when destroying executables_ in PjitFunction, maintaining thread safety in free-threading mode.
- TransposePlan Overflow: Resolved an overflow issue by widening data types to handle larger dimensions without a hitch. No more overflow woes!
Chores
- Refcounting Hashmap Cleanup: Removed an unused refcounting hashmap from the XLA codebase, making things cleaner and simpler. Out with the old!
These updates bring a mix of new features, improvements, bug fixes, and cleanup that enhance the overall performance and functionality of the framework. Keep exploring and enjoy the new capabilities!
Included Commits
The commit titled "[xla] Delete unused refcounting hashmap" involves the removal of the refcounting_hash_map library and its associated test from the XLA (Accelerated Linear Algebra) codebase. The changes affect two files: third_party/xla/xla/backends/cpu/collectives/BUILD and third_party/xla/xla/BUILD. Specifically, the commit removes references to the refcounting_hash_map in the dependencies of two CPU collective libraries and eliminates the entire refcounting_hash_map library along with its test case.
This cleanup streamlines the codebase by removing unused components, which helps improve maintainability and reduces potential confusion for developers. Removing 23 lines from the BUILD files reflects a focused effort to eliminate unnecessary code and keep the XLA project lean.
Files changed
- third_party/xla/xla/BUILD
- third_party/xla/xla/backends/cpu/collectives/BUILD
This commit introduces functionality in the Coordination Service that allows restartable tasks to reconnect to a cluster, provided they maintain the same local topology as before. The primary changes include the addition of checks to ensure that the local topology remains consistent across restarts. If a task attempts to reconnect with a different topology, an error will be raised, preventing potential inconsistencies within the cluster. This enhancement ensures that the system can handle node restarts more gracefully, thereby improving overall stability and reliability.
In terms of implementation, the commit modifies several files, including the addition of utility functions to compare device and local topology configurations. It also updates the topology exchange logic to incorporate these checks, ensuring that the existing topology is validated against the new one. The changes are accompanied by extensive tests that verify both successful reconnections with matching topologies and failures when discrepancies are detected, thereby reinforcing the robustness of the new feature.
Files changed
- third_party/xla/xla/pjrt/distributed/BUILD
- third_party/xla/xla/pjrt/distributed/topology_util.cc
- third_party/xla/xla/pjrt/distributed/topology_util_test.cc
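The idea behind the check can be sketched in a few lines of Python. All names here are hypothetical stand-ins (the real logic lives in C++ in topology_util.cc): a restarted task may rejoin only if its local topology equals the one the cluster recorded for it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LocalTopology:
    """Hypothetical stand-in for a task's local device layout."""
    hostname: str
    device_ids: tuple  # e.g. local accelerator ordinals

def may_reconnect(recorded: LocalTopology, incoming: LocalTopology) -> bool:
    """A restarted task may rejoin only if its topology is unchanged."""
    return recorded == incoming

before = LocalTopology("host0", (0, 1))
assert may_reconnect(before, LocalTopology("host0", (0, 1)))      # same layout: OK
assert not may_reconnect(before, LocalTopology("host0", (0, 1, 2)))  # new device: rejected
```

In the real implementation a mismatch surfaces as an error from the topology exchange rather than a boolean, but the comparison is the heart of it.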
This commit addresses a concurrency issue in the PjitFunction class of the XLA Python module by ensuring that the lock on the cache_ object is held when destroying the executables_. In a free-threading environment, the cache_ lock is essential for protecting the integrity of the executables_, which could otherwise lead to race conditions or undefined behavior during destruction.
The modification involves changing the destructor of the PjitFunction class to include a lock guard that secures access to the cache_ while setting executables_ to nullptr. This small but critical change enhances thread safety and ensures that the destructors operate correctly in a multi-threaded context, thereby maintaining the stability of the system.
Files changed
- third_party/xla/xla/python/pjit.cc
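The pattern is easy to illustrate outside C++. A minimal Python sketch with invented names (the actual fix is a lock guard in PjitFunction's C++ destructor): the executable cache is cleared while its own lock is held, so a concurrent reader can never observe a half-destroyed cache.

```python
import threading

class CompiledCache:
    """Hypothetical sketch of a cache whose teardown must hold its own lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._executables = {"f": object()}

    def lookup(self, key):
        with self._lock:
            if self._executables is None:
                return None
            return self._executables.get(key)

    def destroy(self):
        # The fix: clear the executables *while holding the lock*, so no
        # concurrent lookup races with destruction.
        with self._lock:
            self._executables = None

cache = CompiledCache()
assert cache.lookup("f") is not None
cache.destroy()
assert cache.lookup("f") is None
```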
This commit introduces a new utility pass designed to facilitate the writing of atom programs and the main function of IFRT (Interim Framework Runtime) IR to files. It adds a new pass called IfrtDumpAtomProgramsPass, which extracts atom programs from a given module and outputs them into specified files. The implementation includes modifications to several files, notably the addition of the pass's logic in ifrt_dump_atom_programs_pass.cc, which handles the dumping of the main function and associated atom programs.
Additionally, the commit updates various build configurations and header files to integrate this new functionality. It includes the necessary imports and declarations to ensure that the pass can be utilized effectively within the existing framework. The changes enhance the capabilities of the XLA (Accelerated Linear Algebra) project, allowing for better management and output of atom programs in a structured manner.
Files changed
- third_party/xla/xla/python/ifrt/ir/tests/BUILD
- third_party/xla/xla/python/ifrt/ir/tests/ifrt-opt.cc
- third_party/xla/xla/python/ifrt/ir/transforms/BUILD
- third_party/xla/xla/python/ifrt/ir/transforms/ifrt_dump_atom_programs_pass.cc
- third_party/xla/xla/python/ifrt/ir/transforms/passes.h
- third_party/xla/xla/python/ifrt/ir/transforms/passes.td
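What such a dump pass does can be sketched as follows, in Python with invented names and file layout (the real pass is an MLIR pass in C++): each named atom program, plus the main function, is written to its own file under a dump directory.

```python
import os
import tempfile

def dump_atom_programs(dump_dir: str, main_fn: str, atom_programs: dict) -> list:
    """Write the main function and each atom program to its own file.

    Hypothetical helper mirroring the behaviour described above; returns
    the paths written, main first, atom programs in name order.
    """
    os.makedirs(dump_dir, exist_ok=True)
    paths = []
    for name, body in [("main", main_fn)] + sorted(atom_programs.items()):
        path = os.path.join(dump_dir, f"{name}.mlir")
        with open(path, "w") as f:
            f.write(body)
        paths.append(path)
    return paths

with tempfile.TemporaryDirectory() as d:
    written = dump_atom_programs(d, "func @main() {}", {"prog0": "func @prog0() {}"})
    assert [os.path.basename(p) for p in written] == ["main.mlir", "prog0.mlir"]
```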
This commit addresses an overflow issue in the TransposePlan component of the XLA (Accelerated Linear Algebra) library. The primary modification involves changing the data type of the is_negative lambda function from int to int64_t to accommodate larger dimensions and prevent potential overflow when handling large integer values. This change is crucial for ensuring that the dimensions passed to the TransposePlan are non-negative, as negative dimensions would lead to invalid arguments.
Additionally, the commit introduces a new test case in transpose_test.cc to validate the handling of large dimensions, specifically testing a scenario where the dimensions exceed typical integer limits. This test ensures that the TransposePlan::Create function can successfully process large inputs without encountering errors, thereby enhancing the robustness of the library against overflow-related issues. Overall, these changes improve the reliability and functionality of the TransposePlan in handling large data dimensions.
Files changed
- third_party/xla/xla/pjrt/BUILD
- third_party/xla/xla/pjrt/transpose.cc
- third_party/xla/xla/pjrt/transpose_test.cc
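The failure mode is simple to reproduce: the element count of a large transpose no longer fits in a signed 32-bit int, so a sign check computed in `int` sees a wrapped-around value. A small Python sketch simulating 32-bit versus 64-bit arithmetic:

```python
def wrap_int32(x: int) -> int:
    """Simulate C's signed 32-bit wraparound (the behaviour of plain `int`)."""
    return (x + 2**31) % 2**32 - 2**31

# A plausible large transpose: dimensions whose product exceeds INT32_MAX.
dims = (65536, 65536)  # 2**32 elements
num_elements = dims[0] * dims[1]

# In 32-bit arithmetic the count wraps, so an "is_negative"-style check on
# int can misfire; with 64-bit arithmetic (int64_t) it cannot.
assert wrap_int32(num_elements) == 0   # wrapped: looks like an empty array
assert wrap_int32(3 * 2**30) < 0       # wrapped: looks negative
assert num_elements > 0                # int64-style arithmetic stays correct
```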
This commit introduces functionality to create copies of operands in gather and scatter instructions when there is overlap between them. For gather instructions, if the input and indices point to the same operand, a copy of the indices is created to avoid conflicts. In the case of scatter instructions, which can have multiple inputs and updates, the commit specifies conditions under which copies should be made: if the indices overlap with any input or update, or if an update overlaps with any input. These copies are designed to be removed in subsequent memory-related passes if they are deemed redundant.
The changes are implemented in the gather_scatter_handler.cc file, where the logic for handling gather and scatter operations is modified to check for overlaps and create copies as necessary. Additionally, tests have been added in spmd_partitioner_test.cc to validate the new behavior, ensuring that the system correctly handles cases where all operands are the same instruction. This enhancement aims to maintain the integrity of operations while optimizing memory usage during computation.
Files changed
- third_party/xla/xla/service/spmd/gather_scatter_handler.cc
- third_party/xla/xla/service/spmd/spmd_partitioner_test.cc
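The gather-side rule can be sketched with plain Python object identity standing in for HLO operand aliasing (names invented; the real logic operates on HLO instructions): if the indices alias the input, partition on a copy of the indices instead.

```python
import copy

def prepare_gather_operands(operand, indices):
    """If input and indices are the same object (overlap), copy the indices.

    Sketch of the rule described above, using object identity as a stand-in
    for HLO operand aliasing.
    """
    if operand is indices:
        indices = copy.copy(indices)
    return operand, indices

data = [1, 2, 3]
op, idx = prepare_gather_operands(data, data)     # overlapping: copy made
assert op is not idx and op == idx
op2, idx2 = prepare_gather_operands(data, [0, 2])  # disjoint: left untouched
assert op2 is data and idx2 == [0, 2]
```

The scatter case applies the same move to each overlapping (indices, input, update) pair, and later memory passes delete any copy that turns out to be redundant.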
This commit represents a significant initial step toward unifying the various memory allocation and deallocation methods within the StreamExecutor framework, specifically targeting the Allocate and Deallocate functions. The changes introduce several new classes, including GenericMemoryAllocator and GenericMemoryAllocation, which streamline memory management by allowing custom allocation and deallocation mechanisms via the absl::AnyInvocable interface. This approach is designed to facilitate the addition of new memory types in the future without requiring the creation of unique allocation/deallocation pairs for each type.
In addition to the new classes, the commit includes corresponding test files to ensure the functionality and correctness of the new memory allocation mechanisms. The GenericMemoryAllocator class is responsible for memory allocation, while the GenericMemoryAllocation class manages the allocated memory and its cleanup. The introduction of these abstractions aims to enhance code maintainability and reduce redundancy in memory management throughout the StreamExecutor codebase.
Files changed
- third_party/xla/xla/stream_executor/BUILD
- third_party/xla/xla/stream_executor/generic_memory_allocation.h
- third_party/xla/xla/stream_executor/generic_memory_allocation_test.cc
- third_party/xla/xla/stream_executor/generic_memory_allocator.h
- third_party/xla/xla/stream_executor/generic_memory_allocator_test.cc
- third_party/xla/xla/stream_executor/memory_allocator.h
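The shape of the abstraction can be sketched in Python (the real classes are C++ and take absl::AnyInvocable callables): the allocator is parameterized by allocate/deallocate functions, and the allocation object runs the deleter on release, so a new memory kind needs only a pair of callables rather than a new class.

```python
class GenericMemoryAllocation:
    """Owns a region and runs a caller-supplied deleter on release."""
    def __init__(self, buffer, deleter):
        self.buffer = buffer
        self._deleter = deleter

    def release(self):
        if self._deleter is not None:
            self._deleter(self.buffer)
            self._deleter = None  # release is idempotent

class GenericMemoryAllocator:
    """Allocator parameterized by callables; sketch of the idea above."""
    def __init__(self, allocate, deallocate):
        self._allocate, self._deallocate = allocate, deallocate

    def allocate(self, size):
        return GenericMemoryAllocation(self._allocate(size), self._deallocate)

freed = []
alloc = GenericMemoryAllocator(bytearray, lambda buf: freed.append(len(buf)))
a = alloc.allocate(64)
assert len(a.buffer) == 64
a.release()
assert freed == [64]
```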
This commit addresses a critical issue in the XLA Python extension by correcting the scoping of the Global Interpreter Lock (GIL) release during the construction of nb::bytes objects. Previously, the GIL was released while creating these objects, which is not permitted and could lead to potential inconsistencies or errors. The update encapsulates the GIL release within a dedicated scope, ensuring that it is held during the construction of the std::string result, thus maintaining thread safety.
Additionally, the commit optimizes the code by eliminating unnecessary string copies. This is achieved by modifying the way strings are handled when converting std::string values to nb::bytes, allowing for more efficient memory usage and performance. Overall, these changes enhance the robustness and efficiency of the XLA Python extension, ensuring proper GIL management while also streamlining string operations.
Files changed
- third_party/xla/xla/python/xla.cc
The commit associated with PR #21273 introduces support for the NCCL ncclCommInitRankScalable API within the XLA (Accelerated Linear Algebra) framework for GPU. This new functionality allows for the initialization of NCCL communicators using multiple root ranks, which significantly enhances initialization performance, particularly in large-scale environments. Users can customize the maximum number of ranks that can be associated with a root rank during this initialization process through the --xla_gpu_nccl_init_max_rank_per_root_ratio flag, with a default setting of 128 ranks per root.
In addition to implementing the new API, the commit includes modifications across various files to support this feature, as well as the addition of unit tests to ensure its functionality. The changes encompass updates to collective operations for both CPU and GPU backends, ensuring compatibility and performance improvements across the XLA framework. This commit effectively addresses performance bottlenecks in communicator initialization, making it a significant enhancement for users working with large-scale distributed systems.
Files changed
- third_party/xla/xla/backends/cpu/collectives/gloo_collectives.cc
- third_party/xla/xla/backends/cpu/collectives/gloo_collectives.h
- third_party/xla/xla/backends/cpu/collectives/in_process_collectives.cc
- third_party/xla/xla/backends/cpu/collectives/in_process_collectives.h
- third_party/xla/xla/backends/cpu/collectives/mpi_collectives.cc
- third_party/xla/xla/backends/cpu/collectives/mpi_collectives.h
- third_party/xla/xla/backends/gpu/collectives/gpu_clique.cc
- third_party/xla/xla/backends/gpu/collectives/gpu_clique.h
- third_party/xla/xla/backends/gpu/collectives/gpu_clique_key.cc
- third_party/xla/xla/backends/gpu/collectives/gpu_clique_key.h
- third_party/xla/xla/backends/gpu/collectives/gpu_clique_key_test.cc
- third_party/xla/xla/backends/gpu/collectives/gpu_cliques.cc
- third_party/xla/xla/backends/gpu/collectives/gpu_collectives_stub.h
- third_party/xla/xla/backends/gpu/collectives/nccl_collectives.cc
- third_party/xla/xla/backends/gpu/collectives/nccl_collectives.h
- third_party/xla/xla/core/collectives/clique_id.cc
- third_party/xla/xla/core/collectives/clique_id.h
- third_party/xla/xla/core/collectives/collectives.h
- third_party/xla/xla/debug_options_flags.cc
- third_party/xla/xla/pjrt/gpu/nccl_id_store.cc
- third_party/xla/xla/tsl/cuda/nccl.symbols
- third_party/xla/xla/xla.proto
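The flag's effect can be described arithmetically: with N total ranks and at most R ranks per root (default 128, per the summary above), a scalable init would use about ceil(N / R) root ranks. A sketch of that arithmetic; the exact rounding policy is my assumption, not taken from the commit:

```python
def num_root_ranks(num_ranks: int, max_ranks_per_root: int = 128) -> int:
    """Hypothetical helper: how many root ranks a scalable init would use,
    assuming at most `max_ranks_per_root` ranks are assigned to each root."""
    if num_ranks <= 0:
        raise ValueError("num_ranks must be positive")
    return -(-num_ranks // max_ranks_per_root)  # ceiling division

assert num_root_ranks(128) == 1    # fits under a single root
assert num_root_ranks(129) == 2    # spills into a second root
assert num_root_ranks(1024) == 8   # large job: many roots share the init load
```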
This commit introduces new functions for managing custom options related to dispatch operations within TensorFlow Lite's experimental LiteRT core, utilizing the flexbuffer API for enhanced flexibility. The changes include the addition of a new library, dispatch_op_schema, which consists of two primary source files (dispatch_op_schema.cc and dispatch_op_schema.h). These files define a structure, DispatchOpOptions, that holds metadata such as bytecode size, bytecode offset, and the name of the dispatch operation. Additionally, functions are provided to create, update, and retrieve these options in a serialized format, allowing for in-place modifications without resizing the underlying buffer.
To ensure the functionality of the new features, a corresponding test file (dispatch_op_schema_test.cc) has been added. This test suite verifies the creation, retrieval, and updating of dispatch operation options, confirming that the operations behave as expected. Overall, this commit enhances the capabilities of TensorFlow Lite's dispatch operation management by providing a structured and efficient way to handle custom options.
Files changed
- tensorflow/lite/experimental/litert/core/BUILD
- tensorflow/lite/experimental/litert/core/dispatch_op_schema.cc
- tensorflow/lite/experimental/litert/core/dispatch_op_schema.h
- tensorflow/lite/experimental/litert/core/dispatch_op_schema_test.cc
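The "in-place modification without resizing" property comes from serializing the options at a fixed size. A Python sketch with the struct module; the field layout here is invented for illustration (the real code uses FlexBuffers):

```python
import struct

# Invented fixed layout: bytecode_size (u64), bytecode_offset (u64),
# null-padded op name (32 bytes). A fixed size means in-place updates
# never resize the underlying buffer.
_FMT = "<QQ32s"

def pack_dispatch_options(size: int, offset: int, name: str) -> bytearray:
    return bytearray(struct.pack(_FMT, size, offset, name.encode()))

def update_bytecode_offset(buf: bytearray, new_offset: int) -> None:
    struct.pack_into("<Q", buf, 8, new_offset)  # overwrite field 2 in place

def unpack_dispatch_options(buf: bytearray):
    size, offset, raw = struct.unpack(_FMT, bytes(buf))
    return size, offset, raw.rstrip(b"\0").decode()

buf = pack_dispatch_options(4096, 0, "npu_dispatch")
n = len(buf)
update_bytecode_offset(buf, 512)        # e.g. patch offset after relocation
assert len(buf) == n                    # no resize
assert unpack_dispatch_options(buf) == (4096, 512, "npu_dispatch")
```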
This commit introduces new functionality to the TensorFlow library by adding a logging mechanism for data lineage. Specifically, it includes two new files: log_lineage.cc and log_lineage.h, which define a LogLineage function that takes a vector of file names as input. The purpose of this function is to log the lineage of the specified files, facilitating better tracking and management of data within TensorFlow's data processing framework.
Additionally, the commit modifies several existing files to integrate this new logging functionality. It updates the BUILD files to include the newly created log_lineage library and includes the necessary headers in relevant source files, such as tf_record_dataset_op.cc. This integration ensures that the lineage logging is triggered during dataset operations, specifically when handling TFRecord datasets, thereby enhancing the overall data management capabilities of TensorFlow.
Files changed
- tensorflow/core/data/BUILD
- tensorflow/core/data/log_lineage.cc
- tensorflow/core/data/log_lineage.h
- tensorflow/core/kernels/data/BUILD
- tensorflow/core/kernels/data/tf_record_dataset_op.cc
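A minimal sketch of the described hook, with hypothetical names (the real LogLineage is C++): a function that takes the file names a dataset is about to read, called from the dataset op.

```python
import logging

logger = logging.getLogger("data_lineage")

def log_lineage(filenames):
    """Record which input files a dataset touched; sketch of the hook above."""
    for name in filenames:
        logger.info("dataset reads: %s", name)
    return list(filenames)  # returned here only to make the sketch inspectable

def tf_record_dataset(filenames):
    # The integration point: the dataset op invokes the lineage hook with
    # the files it is about to read, then proceeds to read them.
    log_lineage(filenames)
    return iter(filenames)  # stand-in for actually yielding records

seen = log_lineage(["train-00000.tfrecord", "train-00001.tfrecord"])
assert seen == ["train-00000.tfrecord", "train-00001.tfrecord"]
```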
This commit introduces initial thunk serialization to the XLA (Accelerated Linear Algebra) CPU backend, enhancing the framework's ability to serialize and deserialize thunks, which are units of computation. The changes affect various files related to different types of thunks, including all-gather, all-reduce, and convolution thunks, among others. New files for serialization and deserialization protocols (serdes_base.h, thunk.proto, thunk_serdes_proto.cc, and thunk_serdes_proto.h) have been added, while existing thunk implementations have been modified to accommodate these new serialization features.
The modifications aim to improve the efficiency and flexibility of the XLA runtime, particularly in scenarios where computation needs to be saved and restored, such as in distributed computing environments. By enabling thunk serialization, the XLA framework can better manage resources and optimize performance across various CPU architectures, thereby enhancing its overall capability in executing complex linear algebra operations.
Files changed
- third_party/xla/xla/backends/cpu/runtime/BUILD
- third_party/xla/xla/backends/cpu/runtime/all_gather_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/all_gather_thunk.h
- third_party/xla/xla/backends/cpu/runtime/all_reduce_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/all_reduce_thunk.h
- third_party/xla/xla/backends/cpu/runtime/all_to_all_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/call_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/call_thunk.h
- third_party/xla/xla/backends/cpu/runtime/collective_permute_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/collective_permute_thunk.h
- third_party/xla/xla/backends/cpu/runtime/collective_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/collective_thunk.h
- third_party/xla/xla/backends/cpu/runtime/conditional_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/conditional_thunk.h
- third_party/xla/xla/backends/cpu/runtime/convolution_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/convolution_thunk.h
- third_party/xla/xla/backends/cpu/runtime/copy_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/copy_thunk.h
- third_party/xla/xla/backends/cpu/runtime/custom_call_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/custom_call_thunk.h
- third_party/xla/xla/backends/cpu/runtime/dot_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/dot_thunk.h
- third_party/xla/xla/backends/cpu/runtime/fft_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/fft_thunk.h
- third_party/xla/xla/backends/cpu/runtime/infeed_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/infeed_thunk.h
- third_party/xla/xla/backends/cpu/runtime/kernel_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/kernel_thunk.h
- third_party/xla/xla/backends/cpu/runtime/logical_id_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/logical_id_thunk.h
- third_party/xla/xla/backends/cpu/runtime/outfeed_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/outfeed_thunk.h
- third_party/xla/xla/backends/cpu/runtime/reduce_scatter_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/reduce_scatter_thunk.h
- third_party/xla/xla/backends/cpu/runtime/resource_use.cc
- third_party/xla/xla/backends/cpu/runtime/rng_state_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/rng_state_thunk.h
- third_party/xla/xla/backends/cpu/runtime/serdes_base.h
- third_party/xla/xla/backends/cpu/runtime/sort_thunk.h
- third_party/xla/xla/backends/cpu/runtime/thunk.cc
- third_party/xla/xla/backends/cpu/runtime/thunk.proto
- third_party/xla/xla/backends/cpu/runtime/thunk_executor.h
- third_party/xla/xla/backends/cpu/runtime/thunk_serdes_proto.cc
- third_party/xla/xla/backends/cpu/runtime/thunk_serdes_proto.h
- third_party/xla/xla/backends/cpu/runtime/topk_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/topk_thunk.h
- third_party/xla/xla/backends/cpu/runtime/while_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/while_thunk.h
- third_party/xla/xla/backends/cpu/runtime/xnnpack/BUILD
- third_party/xla/xla/backends/cpu/runtime/xnnpack/xnn_dot_thunk.cc
- third_party/xla/xla/backends/cpu/runtime/xnnpack/xnn_dot_thunk.h
- third_party/xla/xla/service/BUILD
- third_party/xla/xla/service/buffer_assignment.cc
- third_party/xla/xla/service/buffer_assignment.h
- third_party/xla/xla/xla.bzl
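The architecture can be sketched as a small serializer interface plus one implementation per thunk kind, round-tripping a thunk through a serialized form. Names are invented and the serialized form here is a plain dict (the real code serializes to protobuf via thunk.proto and serdes_base.h):

```python
from abc import ABC, abstractmethod

class ThunkSerDes(ABC):
    """Sketch of a SerDes base class: one serializer per thunk kind."""
    @abstractmethod
    def serialize(self, thunk) -> dict: ...
    @abstractmethod
    def deserialize(self, blob: dict): ...

class CopyThunk:
    """Toy thunk: copies one buffer to another."""
    def __init__(self, src, dst):
        self.src, self.dst = src, dst

class CopyThunkSerDes(ThunkSerDes):
    def serialize(self, thunk) -> dict:
        return {"kind": "copy", "src": thunk.src, "dst": thunk.dst}

    def deserialize(self, blob: dict):
        assert blob["kind"] == "copy"
        return CopyThunk(blob["src"], blob["dst"])

# Round-trip: save a thunk, restore it elsewhere (e.g. on another host).
serdes = CopyThunkSerDes()
blob = serdes.serialize(CopyThunk("buf0", "buf1"))
restored = serdes.deserialize(blob)
assert (restored.src, restored.dst) == ("buf0", "buf1")
```

The long list of touched thunk files above reflects exactly this shape: each thunk kind gains its own serialize/deserialize hooks against the shared base.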