tensorflow changelog
Hey team! Check out the latest and greatest updates to our codebase. We've got some cool new features, important improvements, and essential bug fixes. Dive in and see what's new!
New Features
- Support for conditional() with manual subgroups in spmd_partitioner: Now you can handle conditional operations with manual subgroups, maintaining manual sharding where needed. This update includes changes to SpmdPartitioningVisitor and new test cases to validate this functionality.
- Basic DAG Executor Implementation for XLA CPU: Introducing a basic Directed Acyclic Graph (DAG) executor for the XLA CPU service. It executes thunks concurrently in a thread pool while ensuring correct ordering.
- Initial Implementation of ThunkExecutor: A new ThunkExecutor class is here! It builds a DAG defining execution order based on buffer uses, complete with methods and tests to ensure everything runs smoothly.
- Runtime Simulator for HLO Module Execution Time: A new simulator predicts execution time for HLO modules, taking nested loop trip counts into account for more accurate execution time estimates.
- ScratchAllocator in External FFI API: Introducing ScratchAllocator for efficient device memory allocation and deallocation in XLA's external FFI API. This improves overall usability and performance.
Improvements
- Simplified Code in dynamic_update_slice: We've streamlined the code by removing unnecessary template usage and converting indices into int64 before processing. This reduces the target binary size and optimizes performance.
- Export XLA:FFI Handlers as C Function Symbols: A new macro allows exporting XLA:FFI handlers as C function symbols, making it easier to work with FFI implementations in shared libraries.
- Using Eigen Thread Pool for ThunkExecutor Tasks: ThunkExecutor tasks now use the Eigen thread pool, removing mutex contention points so that performance scales nearly linearly with the number of threads.
Bug Fixes
- Correct Propagation of Deserialization Errors: We've fixed the deserialization process to correctly propagate errors from HloProgramSerDes, ensuring better error handling and message communication.
- Vectorization with Modulo Operations: Fixed an issue where vectorization didn't work properly with modulo operations. Now both (a mod b) * x and (a * x) mod b are handled correctly.
- Hash Function Compatibility with Numpy 2.0: Addressed a failure in the hash function with Numpy 2.0. Hash calculations now use Numpy's uint64 data type for better compatibility.
Chores
- Removed Dead Code in XLA:GPU: Cleaned up the codebase by removing unused code related to MockNcclTopoModel from GpuExecutableRunOptions, making the code cleaner and easier to maintain.
That's all for now! Keep coding and stay awesome!
Included Commits
This commit removes dead code related to the MockNcclTopoModel enum and associated functions from the GpuExecutableRunOptions class in the gpu_executable_run_options.h file. The removed code includes the enum declaration, getter, setter, and member variable for mock_nccl_topo_model. These functions were no longer used, so they were removed to clean up the codebase and improve maintainability.
Overall, this change reduces the complexity of the GpuExecutableRunOptions class by eliminating unused code, making the codebase cleaner and easier to understand. The commit introduces no new functionality; it simply removes redundant code, shrinking the modified file by 12 lines.
Files changed
- third_party/xla/xla/service/gpu/gpu_executable_run_options.h
This commit adds a macro to export XLA:FFI handlers as C function symbols. It includes changes to the api.h file, adding a macro to register decoding for a user-defined enum class type. Additionally, it introduces macros for declaring and defining C functions that implement FFI handlers, allowing users to export XLA:FFI handlers from a shared library as C function symbols. The commit also includes modifications to the BUILD file and ffi_test.cc, where tests are added to verify the static handler registration and handler symbol registration functionalities.
Overall, this commit enhances the functionality of XLA:FFI by providing a mechanism to export handlers as C function symbols, making it easier for users to work with FFI implementations in a shared library. It also includes tests to ensure the proper functioning of these new features.
Files changed
- third_party/xla/xla/ffi/BUILD
- third_party/xla/xla/ffi/api/api.h
- third_party/xla/xla/ffi/ffi_test.cc
This commit adds the initial implementation of ThunkExecutor to the xla:cpu library. ThunkExecutor is a dataflow-style executor for a ThunkSequence that depends on buffer uses to build a DAG defining execution order. The ThunkExecutor class includes methods to create the executor, get information about the execution order, and convert the executor to a string representation. Additionally, a test file ThunkExecutorTest is added to test the basic functionality of ThunkExecutor using a test-only Thunk implementation called BufferUseThunk.
Overall, this commit introduces a new class ThunkExecutor to handle the execution order of thunks based on buffer uses, along with associated methods and tests to ensure the functionality of the executor.
Files changed
- third_party/xla/xla/runtime/BUILD
- third_party/xla/xla/runtime/buffer_use.cc
- third_party/xla/xla/runtime/buffer_use.h
- third_party/xla/xla/runtime/buffer_use_test.cc
- third_party/xla/xla/service/cpu/runtime/BUILD
- third_party/xla/xla/service/cpu/runtime/thunk_executor.cc
- third_party/xla/xla/service/cpu/runtime/thunk_executor.h
- third_party/xla/xla/service/cpu/runtime/thunk_executor_test.cc
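The buffer-use-to-DAG idea described above is general: two thunks must be ordered whenever they touch the same buffer and at least one of them writes it. A minimal illustrative sketch in Python (names like BufferUse and build_dag are hypothetical stand-ins, not XLA's actual C++ API):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BufferUse:
    buffer: str        # buffer identifier
    is_write: bool     # a write conflicts with any other use of the buffer

@dataclass
class Thunk:
    name: str
    uses: list[BufferUse] = field(default_factory=list)

def conflicts(a: BufferUse, b: BufferUse) -> bool:
    # Two uses of the same buffer conflict unless both are reads.
    return a.buffer == b.buffer and (a.is_write or b.is_write)

def build_dag(thunks: list[Thunk]) -> dict[int, set[int]]:
    """Return edges: thunk index -> set of earlier thunks it must wait for."""
    deps: dict[int, set[int]] = {i: set() for i in range(len(thunks))}
    for i, later in enumerate(thunks):
        for j in range(i):
            earlier = thunks[j]
            if any(conflicts(u, v) for u in later.uses for v in earlier.uses):
                deps[i].add(j)
    return deps
```

Under this model, a thunk that only reads a buffer no earlier thunk writes has no predecessors and can run immediately, which is what enables the concurrent execution described in the DAG executor commit below.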
This commit adds a very basic Directed Acyclic Graph (DAG) executor implementation to the XLA CPU service. The implementation includes changes to the BUILD file, thunk_executor.cc, thunk_executor.h, and thunk_executor_test.cc. It introduces structures like NodeDef and Node to represent nodes in the dataflow graph and implements methods for executing the thunk sequence using the prepared dataflow graph. The commit also includes test cases to ensure the correct ordering and execution of thunks in the executor.
Overall, this commit enhances the functionality of the XLA CPU service by adding a DAG executor implementation that can execute thunks concurrently in a given thread pool. It introduces new structures and methods to handle the execution of thunks based on the dataflow graph, and includes test cases to validate the ordering and execution of thunks in the executor.
Files changed
- third_party/xla/xla/service/cpu/runtime/BUILD
- third_party/xla/xla/service/cpu/runtime/thunk_executor.cc
- third_party/xla/xla/service/cpu/runtime/thunk_executor.h
- third_party/xla/xla/service/cpu/runtime/thunk_executor_test.cc
This commit simplifies the code in dynamic_update_slice by removing unnecessary template usage, which helps reduce the target binary size. The changes involve converting indices into int64 before processing, which is done to optimize the code. The modifications include updating the ClampStartIndices function to take int64_t indices_data instead of a template, and adjusting the DynamicUpdateSlice function to work with int64_t indices_data. Additionally, the Eval function now converts indices into int64 before passing them to the DynamicUpdateSlice function for processing.
Overall, these changes aim to streamline the code in dynamic_update_slice, making it more efficient and reducing the binary size. By simplifying the template usage and ensuring indices are converted into int64 before processing, the code is optimized for better performance. The commit includes modifications to specific functions and switches in the code to handle different data types, ensuring compatibility with 1-bit/8-bit/32-bit/64-bit integer or float types.
Files changed
- tensorflow/lite/kernels/dynamic_update_slice.cc
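The clamp-then-slice behavior that ClampStartIndices and DynamicUpdateSlice implement can be sketched in NumPy. This is an illustrative model of the operation's documented semantics (start indices are clamped so the update stays in bounds), not the TF Lite kernel itself:

```python
import numpy as np

def dynamic_update_slice(operand: np.ndarray, update: np.ndarray,
                         start_indices) -> np.ndarray:
    """Write `update` into a copy of `operand` at `start_indices`.
    Each start index is first converted to int64, then clamped to
    [0, operand_dim - update_dim] so the update region stays in bounds."""
    starts = [
        int(np.clip(np.int64(s), 0, operand.shape[d] - update.shape[d]))
        for d, s in enumerate(start_indices)
    ]
    region = tuple(slice(s, s + update.shape[d]) for d, s in enumerate(starts))
    result = operand.copy()
    result[region] = update
    return result
```

For example, writing a length-2 update at start index 3 of a length-4 operand clamps the start to 2, so the update lands in the last two positions rather than running off the end.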
This commit introduces the use of Eigen thread pool to execute ThunkExecutor tasks in the xla:cpu library. The purpose of this change is to address mutex contention points that were hindering the expected performance improvement. By removing these contention points, the performance improvement should now be nearly linear in terms of speedup compared to the baseline, scaling with the number of threads used.
The modifications include changes to various files such as cpu_client.cc, cpu_executable.cc, and ThunkExecutor files to implement the execution of ThunkExecutor tasks using Eigen thread pool. Additionally, the ThunkExecutor class now includes a TaskRunner function to execute tasks concurrently, improving the efficiency of executing ThunkExecutor tasks. This enhancement aims to optimize the execution of tasks within the xla:cpu library for better performance and scalability.
Files changed
- third_party/xla/xla/pjrt/cpu/BUILD
- third_party/xla/xla/pjrt/cpu/cpu_client.cc
- third_party/xla/xla/service/cpu/BUILD
- third_party/xla/xla/service/cpu/cpu_executable.cc
- third_party/xla/xla/service/cpu/cpu_executable.h
- third_party/xla/xla/service/cpu/runtime/thunk_executor.cc
- third_party/xla/xla/service/cpu/runtime/thunk_executor.h
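The counter-based scheduling a task runner like this typically uses can be sketched as follows: each node tracks how many predecessors are still pending, and a node is submitted to the pool the moment its counter hits zero. This is an illustrative model, not XLA's implementation, and error handling is omitted:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def execute_dag(deps: dict[int, set[int]], run_node, pool: ThreadPoolExecutor) -> None:
    """Run every node once; deps maps node -> set of predecessor nodes.
    A node is submitted as soon as all its predecessors have finished."""
    if not deps:
        return
    pending = {node: len(preds) for node, preds in deps.items()}
    successors: dict[int, list[int]] = {node: [] for node in deps}
    for node, preds in deps.items():
        for p in preds:
            successors[p].append(node)

    lock = threading.Lock()
    all_done = threading.Event()
    remaining = [len(deps)]

    def run(node: int) -> None:
        run_node(node)
        ready = []
        with lock:
            remaining[0] -= 1
            if remaining[0] == 0:
                all_done.set()
            for s in successors[node]:
                pending[s] -= 1
                if pending[s] == 0:
                    ready.append(s)
        for s in ready:
            pool.submit(run, s)

    for node, count in list(pending.items()):
        if count == 0:          # roots have no predecessors
            pool.submit(run, node)
    all_done.wait()
```

Because ready successors are discovered while holding a single short-lived lock and then submitted outside it, independent chains proceed in parallel, which is the property behind the near-linear scaling claimed above.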
This commit adds a new feature to the external FFI API in XLA. Specifically, it introduces a ScratchAllocator interface for device memory allocation. The ScratchAllocator deallocates all buffers it has allocated upon destruction. The commit includes changes to the BUILD file, c_api.h, ffi.h, ffi_test.cc, and ffi_api.cc files. It also adds new functions for allocating and freeing device memory within the XLA FFI API, along with the necessary implementation details for handling memory allocation and deallocation.
Overall, this commit enhances the functionality of the external FFI API in XLA by providing a mechanism for efficient device memory allocation and deallocation through the ScratchAllocator interface. It introduces new structures and functions to support this feature, improving the overall usability and performance of the XLA framework for handling device memory operations.
Files changed
- third_party/xla/xla/ffi/BUILD
- third_party/xla/xla/ffi/api/BUILD
- third_party/xla/xla/ffi/api/c_api.h
- third_party/xla/xla/ffi/api/ffi.h
- third_party/xla/xla/ffi/api/ffi_test.cc
- third_party/xla/xla/ffi/ffi_api.cc
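The deallocate-everything-on-destruction contract can be modeled with a context manager. This is a hypothetical Python sketch of the ownership pattern only; the real interface is a C++ API operating on device memory, and the allocate/free callables here are placeholders:

```python
class ScratchAllocator:
    """Toy scratch allocator: hands out buffers and frees everything it
    allocated when the scope ends, mirroring the free-on-destruction
    behavior described above."""
    def __init__(self, allocate, free):
        self._allocate = allocate   # size -> buffer handle (placeholder)
        self._free = free           # buffer handle -> None (placeholder)
        self._owned = []

    def allocate(self, size: int):
        buf = self._allocate(size)
        self._owned.append(buf)     # remember ownership for cleanup
        return buf

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Free in reverse allocation order, like a stack allocator.
        while self._owned:
            self._free(self._owned.pop())
        return False
```

The point of the pattern is that a handler can allocate temporary workspace freely and never leak it, even on early exit.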
The commit focuses on correctly propagating deserialization errors from HloProgramSerDes in the codebase. The mlir::stablehlo::deserializePortableArtifact function now returns nullptr if parsing fails, and the code has been updated to handle this explicitly. Additionally, the ScopedDiagnosticHandler is used to propagate error messages from MLIR to the caller, ensuring that any errors during deserialization are properly communicated.
The changes include modifications to multiple files, such as mlir_to_hlo.cc, hlo_program_serdes.cc, hlo_program_serdes_test.cc, and the BUILD files. They add necessary dependencies like absl/status:statusor and absl/strings and adjust error handling and error message propagation in the deserialization process. The modifications provide more detailed error messages when deserialization fails, enhancing the overall reliability and maintainability of the codebase.
Files changed
- third_party/xla/xla/pjrt/BUILD
- third_party/xla/xla/pjrt/mlir_to_hlo.cc
- third_party/xla/xla/python/ifrt/hlo/BUILD
- third_party/xla/xla/python/ifrt/hlo/hlo_program_serdes.cc
- third_party/xla/xla/python/ifrt/hlo/hlo_program_serdes_test.cc
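The pattern of turning a null parse result plus collected diagnostics into an error the caller can act on can be sketched like this. It is a hypothetical Python analogue of the C++ StatusOr-based flow, with parse and diagnostics as stand-ins for the MLIR pieces:

```python
class DeserializationError(Exception):
    pass

def deserialize(artifact: bytes, parse, diagnostics: list[str]):
    """parse() returns None on failure, as
    mlir::stablehlo::deserializePortableArtifact returns nullptr.
    Instead of passing the null along, surface the diagnostics that a
    handler (akin to ScopedDiagnosticHandler) collected during parsing."""
    module = parse(artifact)
    if module is None:
        detail = "; ".join(diagnostics) or "unknown parse error"
        raise DeserializationError(f"failed to deserialize module: {detail}")
    return module
```

The improvement described in the commit is exactly this shape: the failure is no longer silent, and the caller sees the underlying parser message rather than a generic null-result error.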
This commit addresses a failure in the hash function with Numpy 2.0 in TensorFlow Lite. The changes include modifying the BUILD file to add a reference to the third-party Numpy library, as well as making modifications to the util.py file. In util.py, the hashing implementation for model structures has been updated to use a C++ layer instead of relying solely on the Python API. Additionally, the update_hash_with_primitive_value function now uses Numpy's uint64 data type for hash calculations, ensuring compatibility with the new Numpy version.
Furthermore, the conversion_metadata.fbs file has been modified to change the data type of the model_hash field from int64 to uint64, aligning it with the updated hash calculations in util.py. These changes aim to fix the hash function failure and improve compatibility with Numpy 2.0 in TensorFlow Lite.
Files changed
- tensorflow/lite/python/BUILD
- tensorflow/lite/python/util.py
- tensorflow/lite/schema/conversion_metadata.fbs
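The core compatibility concern is keeping intermediate hash values inside the 64-bit range before handing them to NumPy, since NumPy 2.0 raises OverflowError when converting an out-of-range Python int to uint64. A sketch of the mask-then-cast pattern using FNV-1a (chosen purely for illustration; it is not necessarily the hash TF Lite uses):

```python
import numpy as np

MASK64 = (1 << 64) - 1
FNV_OFFSET = 0xCBF29CE484222325   # FNV-1a 64-bit offset basis
FNV_PRIME = 0x100000001B3         # FNV-1a 64-bit prime

def fnv1a_64(data: bytes) -> np.uint64:
    """64-bit FNV-1a in pure Python ints, masked to 64 bits each step so
    the final value always fits when cast to np.uint64 (NumPy 2.0 would
    raise OverflowError on an out-of-range conversion)."""
    h = FNV_OFFSET
    for b in data:
        h = ((h ^ b) * FNV_PRIME) & MASK64
    return np.uint64(h)
```

Doing the wraparound arithmetic explicitly in Python and only casting the masked result is what keeps the computation deterministic across NumPy versions.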
This commit adds a runtime simulator to predict the execution time of an HLO module based on a given memory space assignment. The simulator takes into account the trip counts of outer loops when calculating the execution time of an instruction, which is particularly important for nested loops, where the execution time can vary significantly with the number of times an instruction is executed. Prior to this update, the simulator assumed all nested layers had the same trip count, a user-provided configuration with a default value of 5.
With this patch, a new simulator is implemented that uses static analysis to determine the trip count for each nested layer and then multiplies them to calculate the total trip counts. This update affects several files in the memory space assignment module, including adding new files for the simulator implementation and testing.
Files changed
- third_party/xla/xla/service/memory_space_assignment/BUILD
- third_party/xla/xla/service/memory_space_assignment/cost_analysis.cc
- third_party/xla/xla/service/memory_space_assignment/cost_analysis.h
- third_party/xla/xla/service/memory_space_assignment/memory_space_assignment.cc
- third_party/xla/xla/service/memory_space_assignment/memory_space_assignment.h
- third_party/xla/xla/service/memory_space_assignment/simulator.cc
- third_party/xla/xla/service/memory_space_assignment/simulator.h
- third_party/xla/xla/service/memory_space_assignment/simulator_test.cc
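The trip-count aggregation the new simulator performs reduces to multiplying per-loop trip counts along the nest: an instruction inside nested loops runs once per combination of iterations. A minimal sketch (function names are illustrative, not the simulator's API):

```python
def total_trip_count(loop_nest: list[int]) -> int:
    """Product of the statically determined trip counts of each
    enclosing loop, outermost to innermost. An instruction outside
    any loop (empty nest) runs exactly once."""
    total = 1
    for trips in loop_nest:
        total *= trips
    return total

def estimated_cost(per_execution_cost: float, loop_nest: list[int]) -> float:
    """Total estimated time = single-execution cost x total trip count."""
    return per_execution_cost * total_trip_count(loop_nest)
```

This is what replaces the old fixed assumption: instead of every nesting level defaulting to 5 iterations, each level contributes its own statically analyzed trip count to the product.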
This commit addresses an issue where vectorization was not working properly when the index computation contained a modulo operation. Specifically, (a mod b) * x was handled correctly, but (a * x) mod b was not. The changes modify certain functions and expressions to ensure that vectorization with modulo operations works as expected.
In addition to fixing the vectorization issue, the commit modifies the code in the vectorize_loads_stores.mlir and vectorize_loads_stores.cc files, adding new functions and expressions for handling modulo operations in index computations. Overall, the commit improves the vectorization process when dealing with modulo operations, ensuring it functions correctly in all relevant scenarios.
Files changed
- third_party/xla/xla/service/gpu/fusions/mlir/tests/vectorize_loads_stores.mlir
- third_party/xla/xla/service/gpu/fusions/mlir/vectorize_loads_stores.cc
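Why modulo needs special care when vectorizing: a vector load or store is only valid if consecutive lanes map to consecutive addresses, and a modulo in the index expression can break that at wrap-around points. A hypothetical lane-contiguity check illustrating the property a vectorizer must prove (not the actual pass's analysis):

```python
def lane_indices(index_fn, base: int, width: int) -> list[int]:
    """Indices the `width` lanes of a vector starting at `base` would access."""
    return [index_fn(base + lane) for lane in range(width)]

def is_contiguous(indices: list[int]) -> bool:
    # Safe to vectorize only if lanes hit consecutive addresses.
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))
```

For index i -> i mod 8 with a vector width of 4, lanes starting at base 0 give [0, 1, 2, 3] (contiguous), but lanes starting at base 6 give [6, 7, 0, 1]: the wrap the modulo introduces is exactly the discontinuity the pass has to rule out before vectorizing.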
This commit adds support for conditional operations with manual subgroups in the spmd_partitioner in the xla service. Previously, the conditional operation replicated the predicate so that all partitions followed the same control flow. With this update, if the conditional operation's first operand has manual subgroups, the sharding is adjusted accordingly to maintain the manual subgroups. Otherwise, the replication process remains the same. The changes involve modifications to the SpmdPartitioningVisitor in spmd_partitioner.cc and corresponding test cases in spmd_partitioner_test.cc.
In addition to updating the conditional operation handling in the spmd_partitioner, this commit also includes test cases for conditional operations with partial manual sharding. The test cases verify that the conditional operation correctly handles manual subgroups in the sharding configuration of the operands, ensuring that the control flow is maintained according to the manual subgroup settings. These test cases help validate the functionality of the updated conditional operation support with manual subgroups in the spmd_partitioner.
Files changed
- third_party/xla/xla/service/spmd/spmd_partitioner.cc
- third_party/xla/xla/service/spmd/spmd_partitioner_test.cc
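The sharding decision described above reduces to a simple rule: keep the operand's sharding when it has manual subgroups, otherwise replicate the predicate so every partition takes the same branch. A toy sketch with shardings modeled as plain dicts (not XLA's HloSharding API):

```python
def predicate_sharding(operand_sharding: dict) -> dict:
    """Choose the sharding for a conditional's predicate.

    If the first operand carries manual subgroups, preserve its sharding
    so the manual subgroups survive partitioning; otherwise fall back to
    the previous behavior of replicating the predicate so that all
    partitions follow the same control flow."""
    if operand_sharding.get("manual_subgroups"):
        return operand_sharding
    return {"replicated": True}
```

The test cases added in spmd_partitioner_test.cc exercise both arms of this decision: operands with partial manual sharding keep their subgroup structure, and fully automatic operands still get a replicated predicate.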