TensorFlow Changelog

Welcome to the latest changelog! We've been busy making some exciting updates and improvements. Here's a rundown of what's new, fixed, and improved:

New Features

  • Freeze API for Device Tensors 🧊: Introducing a Freeze() API that releases host memory for device tensors in TensorFlow. Whether a restored tensor's host copy can be released depends on whether any CPU/host operation uses it, so memory held solely for the device is freed.

  • Shard-as Propagation Support 🚀: Added support for shard-as propagation with unspecified dimensions in the XLA:SPMD framework, improving how sharding instructions are handled and how sharding information propagates across members of a shard group.

  • GemmDegenerateDimRemover Pass: A new pass called GemmDegenerateDimRemover has been added to the XLA GPU service. It removes the degenerate dimension introduced by GemvRewriter, turning the gemm back into a genuine matrix-vector multiplication.

  • Remove Unused Dimensions in IndexingMap: The IndexingMap class in the XLA:GPU service gained a method that detects and removes dimensions unused by its affine map and constraints, compressing the representation.

  • HloAnyOf Function 🌟: Added a new traversal function called HloAnyOf to the XLA:GPU codebase. It provides a flexible way to check a condition across HLO nodes without requiring instruction and fusion adaptors.

Improvements

  • Multi-threading in tf.data Module 🧵: The flat map function in TensorFlow's tf.data module can now build its input datasets using multiple threads, boosting the efficiency of processing input datasets.

  • Memory Term Reduction Algorithm: A simpler and more effective algorithm for reducing memory terms has been implemented. This update uses ActivePrim pairs instead of LiveAndPrim pairs, making the merging of overlapping intervals more efficient.

  • Remove Unused Dims and Symbols in XLA:GPU: A method to remove both unused dimensions and unused symbols in a single pass has been added to the XLA:GPU IndexAnalysis module, avoiding the need to run the removal twice.

Bug Fixes

  • Early Error for Coordination Service Shutdown: Fixed an issue where a barrier request after the coordination service shutdown would proceed. Now, it returns an error early, ensuring proper handling of such requests.

  • Close Host Callback Queues: Explicitly closing host callback queues inside IfrtBackend destruction to avoid potential deadlocks caused by blocked executions.

  • Unpropagatable Dots in Space-to-Batch Conversion: Marked dots as unpropagatable during space-to-batch conversion to prevent issues related to dot propagation post layout assignment.

Chores

  • Remove Deprecated MLIR Codegen: Removed deprecated XLA:CPU MLIR-based codegen parts to clean up the codebase and streamline the compilation pipeline.

That's all for now! Stay tuned for more updates and improvements. 🌟

Included Commits

2024-05-01T00:27:53 See commit

The commit introduces a new Freeze() API to release host memory for device tensors in TensorFlow. The decision to release a restored tensor is based on whether it is used by CPU/Host operations, with this information coming from the compiler. The implementation includes modifications in various files, such as ifrt_ops_kernel.cc, ifrt_restore_tensor_registry.cc, and others, to handle the freezing of resources used only by the device and not the host.

Specific changes include adding a used_by_host attribute in the IfrtLoadVariableOp and related operations to determine if the host tensor can be released. The IfrtRestoreTensorRegistry class now includes methods to set a tensor as used by the host and to freeze the model by releasing host tensors used only by the device. Additionally, the MlrtIfrtLoadVariableKernel class now checks the used_by_host attribute and sets the tensor as used by the host accordingly. Tests have been updated to reflect these changes and ensure the correct behavior of the freezing mechanism.
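
To make the mechanism concrete, here is a minimal, self-contained C++ sketch of the idea. All names besides used_by_host and Freeze() are hypothetical; the real implementation lives in IfrtRestoreTensorRegistry and the MLIR kernels listed below.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a restored host-side tensor buffer.
struct HostTensor {
  std::string name;
  std::vector<float> data;
};

// Toy registry mirroring the idea behind IfrtRestoreTensorRegistry:
// tensors flagged as used_by_host keep their host copy; everything
// else is released when Freeze() is called.
class RestoreTensorRegistry {
 public:
  void Register(const std::string& name, std::vector<float> data) {
    Entry e;
    e.tensor = std::make_unique<HostTensor>(HostTensor{name, std::move(data)});
    tensors_[name] = std::move(e);
  }

  // Called when the compiler reports that a CPU/host op reads the tensor.
  void SetUsedByHost(const std::string& name) {
    auto it = tensors_.find(name);
    if (it != tensors_.end()) it->second.used_by_host = true;
  }

  // Releases host memory for tensors consumed only by the device.
  void Freeze() {
    for (auto& kv : tensors_) {
      if (!kv.second.used_by_host) kv.second.tensor.reset();
    }
  }

 private:
  struct Entry {
    std::unique_ptr<HostTensor> tensor;
    bool used_by_host = false;
  };
  std::map<std::string, Entry> tensors_;
};

int main() {
  RestoreTensorRegistry registry;
  registry.Register("w0", {1.f, 2.f});  // consumed only by the device
  registry.Register("table", {3.f});    // also read by a host op
  registry.SetUsedByHost("table");
  registry.Freeze();  // host buffer for "w0" is released
}
```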

Files changed

  • tensorflow/compiler/mlir/tensorflow/ir/host_runtime/tfrt_ops.td
  • tensorflow/compiler/mlir/tfrt/ir/mlrt/tf_mlrt_ops.td
  • tensorflow/compiler/mlir/tfrt/ir/mlrt/tf_ops.td
  • tensorflow/compiler/mlir/tfrt/tests/ifrt/sink_variable_as_named_array.mlir
  • tensorflow/compiler/mlir/tfrt/tests/mlrt/rewrite_ifrt_load_variable.mlir
  • tensorflow/compiler/mlir/tfrt/tests/mlrt/tf_to_mlrt.mlir
  • tensorflow/compiler/mlir/tfrt/transforms/ifrt/sink_variable_as_named_array.cc
  • tensorflow/compiler/mlir/tfrt/transforms/mlrt/tf_to_mlrt.cc
  • tensorflow/core/tfrt/ifrt/ifrt_loaded_variable_utils_test.cc
  • tensorflow/core/tfrt/ifrt/ifrt_model_context.cc
  • tensorflow/core/tfrt/ifrt/ifrt_model_context.h
  • tensorflow/core/tfrt/ifrt/ifrt_restore_tensor_registry.cc
  • tensorflow/core/tfrt/ifrt/ifrt_restore_tensor_registry.h
  • tensorflow/core/tfrt/mlrt/kernel/BUILD
  • tensorflow/core/tfrt/mlrt/kernel/ifrt_ops_kernel.cc
  • tensorflow/core/tfrt/mlrt/kernel/ifrt_ops_kernel_test.cc
2024-05-01T01:08:26 See commit

This commit adds a new function called HloAnyOf to the hlo_traversal.h file in the XLA:GPU codebase. This function is a variant of HloFindIf and is designed to traverse HLO nodes without requiring instruction and fusion adaptors. It takes a span of HloInstruction pointers as input, along with a visit function that determines whether to return true for a given node, and a boolean flag to specify whether to visit operands or users of the nodes. The function returns true if the visit function returns true for any of the nodes in the traversal.

Overall, this commit enhances the GPU side of XLA by giving developers a more flexible and efficient way to check a condition across HLO nodes without constructing additional adaptors.
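
The traversal shape can be illustrated with a toy graph (the node struct and AnyOf below are invented for illustration and are not the actual HloInstruction API):

```cpp
#include <functional>
#include <queue>
#include <unordered_set>
#include <vector>

// Toy node standing in for HloInstruction; the real HloAnyOf operates
// on spans of HloInstruction pointers, but the traversal is the same.
struct Node {
  int opcode;
  std::vector<Node*> operands;
  std::vector<Node*> users;
};

// Returns true if `visit` returns true for any node reachable from
// `roots`, walking operands or users depending on `visit_operands`.
bool AnyOf(const std::vector<Node*>& roots,
           const std::function<bool(const Node*)>& visit,
           bool visit_operands = true) {
  std::queue<const Node*> worklist;
  std::unordered_set<const Node*> seen;
  for (Node* r : roots) {
    worklist.push(r);
    seen.insert(r);
  }
  while (!worklist.empty()) {
    const Node* n = worklist.front();
    worklist.pop();
    if (visit(n)) return true;  // early exit on the first match
    const std::vector<Node*>& next = visit_operands ? n->operands : n->users;
    for (Node* m : next) {
      if (seen.insert(m).second) worklist.push(m);
    }
  }
  return false;
}

int main() {
  Node a{/*opcode=*/0, {}, {}};
  Node b{/*opcode=*/1, {&a}, {}};
  a.users.push_back(&b);
  // Does any node reachable from b (via operands) have opcode 0?
  bool found = AnyOf({&b}, [](const Node* n) { return n->opcode == 0; });
  (void)found;  // true: node `a` matches
}
```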

Files changed

  • third_party/xla/xla/service/gpu/hlo_traversal.cc
  • third_party/xla/xla/service/gpu/hlo_traversal.h
2024-05-01T03:19:56 See commit

This commit adds a new pass called GemmDegenerateDimRemover to the XLA GPU service. The purpose of this pass is to remove the degenerate dimension introduced by GemvRewriter. The commit adds a new cc_library for GemmDegenerateDimRemover to the BUILD file, along with the implementation files for the pass (gemm_degenerate_dim_remover.cc and gemm_degenerate_dim_remover.h) and a test file (gemm_degenerate_dim_remover_test.cc) covering its functionality.

The GemmDegenerateDimRemover pass is designed to rewrite a gemm with a degenerate dimension to a matrix-vector multiplication. The pass identifies the degenerate dimension introduced by the GemvRewriter and removes it after GemmFusion is run. The implementation includes a visitor class (GemmDegenerateDimRemoverVisitor) that handles the rewriting of dot instructions to remove the degenerate dimension. The test cases in gemm_degenerate_dim_remover_test.cc validate the functionality of the GemmDegenerateDimRemover pass by testing different scenarios of matrix-vector multiplications and ensuring the correct rewriting of dimensions.
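
In shape terms, the rewrite drops the size-1 dimension that GemvRewriter added, e.g. a [m, 1] gemm result becomes an [m] vector. A trivial sketch of that shape transformation (not the pass itself, which rewrites HLO dot instructions):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Removes size-1 (degenerate) dimensions from a shape, e.g. the
// trailing 1 that turns a matrix-vector product into a gemm:
// [m, 1] -> [m]. Toy version of the shape change the pass performs.
std::vector<int64_t> RemoveDegenerateDims(const std::vector<int64_t>& shape) {
  std::vector<int64_t> out;
  for (int64_t d : shape) {
    if (d != 1) out.push_back(d);
  }
  return out;
}

int main() {
  for (int64_t d : RemoveDegenerateDims({128, 1})) std::cout << d << ' ';
  // prints: 128
}
```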

Files changed

  • third_party/xla/xla/service/gpu/BUILD
  • third_party/xla/xla/service/gpu/gemm_degenerate_dim_remover.cc
  • third_party/xla/xla/service/gpu/gemm_degenerate_dim_remover.h
  • third_party/xla/xla/service/gpu/gemm_degenerate_dim_remover_test.cc
2024-05-01T22:15:37 See commit

In this commit, part #1 of the deprecated XLA:CPU MLIR-based codegen was removed. The changes touch files such as BUILD, register_common_dialects.cc, tf_mlir_opt_main.cc, compilation_pipeline_cpu.cc, and cpu_compiler.cc. The HLO XLA runtime pipeline code was removed from hlo_xla_runtime_pipeline.cc and hlo_xla_runtime_pipeline.h, and the dependency on it was dropped from the BUILD file. The removed code was the MLIR-based codegen path that lowered modules from HLO to Linalg on buffers, including its pipeline creation and dialect registration.

Files changed

  • tensorflow/compiler/mlir/BUILD
  • tensorflow/compiler/mlir/register_common_dialects.cc
  • tensorflow/compiler/mlir/tf_mlir_opt_main.cc
  • third_party/xla/xla/mlir/runtime/transforms/BUILD
  • third_party/xla/xla/mlir/runtime/transforms/compilation_pipeline_cpu.cc
  • third_party/xla/xla/service/cpu/BUILD
  • third_party/xla/xla/service/cpu/cpu_compiler.cc
  • third_party/xla/xla/service/cpu/hlo_xla_runtime_pipeline.cc
  • third_party/xla/xla/service/cpu/hlo_xla_runtime_pipeline.h
  • third_party/xla/xla/translate/BUILD
2024-05-01T22:31:28 See commit

This commit introduces support for shard-as propagation with unspecified dimensions in the XLA:SPMD framework. The changes modify sharding_propagation.cc, adding 79 lines and deleting 5. The commit introduces a new function, InferUnspecifiedDimsFromShardGroup, which processes sharding instructions and propagates sharding information from members of a shard group under specified conditions. The ShardingPropagation class is also updated to handle shard groups and sharding propagation in SPMD mode.

In the accompanying test file, sharding_propagation_test.cc, new test cases are added to verify the functionality of shard-as propagation with shard barriers. The tests validate the behavior of sharding propagation with unspecified dimensions and shard barriers in different scenarios, ensuring the correctness and effectiveness of the implemented changes related to shard-as propagation in the XLA:SPMD framework.
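
As a rough intuition for what "shard-as with unspecified dimensions" means, the toy sketch below propagates a known per-dimension sharding to group members that left that dimension unspecified. The data structures are invented for illustration and do not reflect the actual XLA:SPMD implementation:

```cpp
#include <iostream>
#include <vector>

// Toy model of "shard-as": tensors in the same shard group should end
// up with compatible shardings. An entry of -1 means the sharding for
// that dimension is unspecified; propagation fills it in from another
// member of the group.
struct TensorSharding {
  std::vector<int> dims_sharding;  // shards per dimension, -1 = unspecified
};

void PropagateShardGroup(std::vector<TensorSharding>& group) {
  const size_t rank = group.front().dims_sharding.size();
  for (size_t d = 0; d < rank; ++d) {
    int known = -1;
    for (const auto& t : group) {
      if (t.dims_sharding[d] != -1) known = t.dims_sharding[d];
    }
    if (known == -1) continue;  // no member specifies this dimension
    for (auto& t : group) {
      if (t.dims_sharding[d] == -1) t.dims_sharding[d] = known;
    }
  }
}

int main() {
  std::vector<TensorSharding> group = {{{4, -1}}, {{-1, 2}}};
  PropagateShardGroup(group);
  // Both members are now sharded 4 ways on dim 0 and 2 ways on dim 1.
  for (const auto& t : group) {
    for (int s : t.dims_sharding) std::cout << s << ' ';
    std::cout << '\n';
  }
}
```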

Files changed

  • third_party/xla/xla/service/sharding_propagation.cc
  • third_party/xla/xla/service/sharding_propagation_test.cc
2024-05-01T23:39:50 See commit

This commit makes changes to the coordination service in the XLA library to return an error early if a barrier is requested after the coordination service has shut down. The changes include modifying the coordination service's implementation to check if the service has stopped before processing a barrier request, and if it has, immediately return an error. This ensures that any barrier requests made after the service has shut down are handled appropriately by returning an error instead of proceeding with the request.

Additionally, tests have been added to verify that barriers fail when called after the service has stopped. The tests simulate scenarios where the service stops, for example due to missed heartbeats, and verify that subsequent barrier requests return internal errors indicating that the coordination service has shut down.
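
The guard pattern looks roughly like this (a minimal sketch assuming Abseil's absl::Status; the real tsl coordination service is considerably more involved):

```cpp
#include <mutex>
#include <string>

#include "absl/status/status.h"

// Sketch of the early-exit guard (not the real tsl coordination
// service): once the service has stopped, barrier requests fail
// immediately instead of being enqueued and left to hang.
class CoordinationService {
 public:
  void Stop() {
    std::lock_guard<std::mutex> lock(mu_);
    stopped_ = true;
  }

  absl::Status Barrier(const std::string& barrier_id) {
    std::lock_guard<std::mutex> lock(mu_);
    if (stopped_) {
      // Early error: the service can no longer coordinate participants.
      return absl::InternalError("Barrier '" + barrier_id +
                                 "' requested after service shutdown");
    }
    // ... normal barrier bookkeeping would go here ...
    return absl::OkStatus();
  }

 private:
  std::mutex mu_;
  bool stopped_ = false;
};
```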

Files changed

  • third_party/xla/xla/tsl/distributed_runtime/coordination/BUILD
  • third_party/xla/xla/tsl/distributed_runtime/coordination/coordination_service.cc
  • third_party/xla/xla/tsl/distributed_runtime/coordination/coordination_service.h
  • third_party/xla/xla/tsl/distributed_runtime/coordination/coordination_service_test.cc
2024-05-02T08:01:47 See commit

This commit adds a method to remove unused dimensions and symbols in the XLA:GPU IndexAnalysis module. The commit includes changes to the indexing_map.cc, indexing_map.h, and indexing_map_test.cc files. The new method, CompressVars, efficiently removes unused dimensions and symbols from the affine map and constraints. It combines the functionality of RemoveUnusedSymbols and RemoveUnusedDims to avoid running the removal process twice when both symbols and dimensions need to be removed. The commit also includes test cases to ensure the proper removal of unused dimensions and symbols from the indexing map.

The accompanying test cases verify that unused dimensions and symbols are removed correctly. By folding the two removal passes into one, the change avoids walking the map and its constraints twice, reducing redundant work when both kinds of variables need to be removed.
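
A toy illustration of the single-pass compression idea (an invented term representation, not the actual IndexingMap or affine map types):

```cpp
#include <cstdio>
#include <vector>

// Terms reference either a dimension (is_symbol=false) or a symbol
// (is_symbol=true) by index. Unused indices are dropped and the rest
// renumbered in one pass, mirroring the idea behind CompressVars.
struct Term {
  bool is_symbol;
  int index;
};

void CompressVars(std::vector<Term>& terms, int& num_dims, int& num_symbols) {
  std::vector<bool> dim_used(num_dims, false), sym_used(num_symbols, false);
  for (const Term& t : terms) {
    (t.is_symbol ? sym_used : dim_used)[t.index] = true;
  }
  // Build old-index -> new-index maps for dims and symbols together.
  std::vector<int> dim_map(num_dims, -1), sym_map(num_symbols, -1);
  int nd = 0, ns = 0;
  for (int i = 0; i < num_dims; ++i)
    if (dim_used[i]) dim_map[i] = nd++;
  for (int i = 0; i < num_symbols; ++i)
    if (sym_used[i]) sym_map[i] = ns++;
  for (Term& t : terms) {
    t.index = t.is_symbol ? sym_map[t.index] : dim_map[t.index];
  }
  num_dims = nd;
  num_symbols = ns;
}

int main() {
  std::vector<Term> terms = {{false, 0}, {false, 2}, {true, 1}};
  int num_dims = 3, num_symbols = 2;  // dim 1 and symbol 0 are unused
  CompressVars(terms, num_dims, num_symbols);
  std::printf("dims=%d symbols=%d\n", num_dims, num_symbols);  // dims=2 symbols=1
}
```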

Files changed

  • third_party/xla/xla/service/gpu/model/BUILD
  • third_party/xla/xla/service/gpu/model/indexing_map.cc
  • third_party/xla/xla/service/gpu/model/indexing_map.h
  • third_party/xla/xla/service/gpu/model/indexing_map_test.cc
2024-04-30T00:24:41 See commit

The commit introduces multi-threading to run the flat map function in TensorFlow's tf-data module. This change involves modifying the flat_map_utils.cc and flat_map_utils.h files to use a deque data structure for input datasets and implement a method to create input datasets using multiple threads. The MakeInputDatasets method now returns an absl::StatusOr object containing a deque of DatasetBase pointers, and a new MakeInputDataset method is added to handle the creation of individual input datasets by applying the map function to input tensors in a thread-safe manner.

Overall, the commit improves the efficiency and performance of the flat map function in TensorFlow's tf.data module by building input datasets on multiple threads while keeping the application of the map function to input tensors thread-safe.
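
The fan-out/collect shape of the change can be sketched like this (hypothetical names; the real code builds DatasetBase objects, not vectors of ints):

```cpp
#include <deque>
#include <future>
#include <vector>

// Toy stand-in for building flat_map input datasets in parallel: each
// input element is mapped to a "dataset" (here just a vector of ints)
// on its own async task, and the results are collected in order into a
// deque, echoing the deque returned by MakeInputDatasets.
std::deque<std::vector<int>> MakeInputsParallel(
    const std::vector<int>& inputs) {
  std::vector<std::future<std::vector<int>>> futures;
  futures.reserve(inputs.size());
  for (int x : inputs) {
    futures.push_back(std::async(std::launch::async, [x] {
      // Stand-in for applying the user's map function to one element.
      return std::vector<int>{x, x * x};
    }));
  }
  std::deque<std::vector<int>> datasets;
  for (auto& f : futures) datasets.push_back(f.get());  // keep input order
  return datasets;
}

int main() {
  auto datasets = MakeInputsParallel({1, 2, 3});
  // datasets: {1,1}, {2,4}, {3,9}
}
```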

Files changed

  • tensorflow/core/data/BUILD
  • tensorflow/core/data/flat_map_utils.cc
  • tensorflow/core/data/flat_map_utils.h
2024-04-30T00:54:33 See commit

This commit introduces a simpler and more effective algorithm for reducing memory terms. The changes include modifications to the auto_sharding_memory.cc file, with 11 additions and 14 deletions. The algorithm now uses ActivePrim pairs instead of LiveAndPrim pairs, improving the efficiency of merging large overlaps. Additionally, the MemoryTermReducer class has been updated to include a function called SweepAndMerge, which helps in merging overlapping intervals more effectively. The commit also includes modifications to the auto_sharding_memory_test.cc file, with 116 additions, to test the new algorithm with different scenarios of memory term reduction.

Overall, this commit simplifies the memory term reduction algorithm while making interval merging more efficient. The new test cases confirm that the updated approach behaves as expected across a range of memory term reduction scenarios.
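
The core sweep-and-merge idea, shown on plain intervals (the real MemoryTermReducer merges pairs that also carry a primitive, but the sweep itself looks like this):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Interval {
  int start, end;  // inclusive endpoints
};

// Sort by start, then fold each interval into the previous one
// whenever they overlap.
std::vector<Interval> SweepAndMerge(std::vector<Interval> ivals) {
  std::sort(ivals.begin(), ivals.end(),
            [](const Interval& a, const Interval& b) { return a.start < b.start; });
  std::vector<Interval> merged;
  for (const Interval& iv : ivals) {
    if (!merged.empty() && iv.start <= merged.back().end) {
      merged.back().end = std::max(merged.back().end, iv.end);  // extend
    } else {
      merged.push_back(iv);  // no overlap: start a new interval
    }
  }
  return merged;
}

int main() {
  auto m = SweepAndMerge({{0, 3}, {2, 5}, {7, 8}});
  for (const auto& iv : m) std::printf("[%d, %d] ", iv.start, iv.end);
  // prints: [0, 5] [7, 8]
}
```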

Files changed

  • third_party/xla/xla/hlo/experimental/auto_sharding/auto_sharding_memory.cc
  • third_party/xla/xla/hlo/experimental/auto_sharding/auto_sharding_memory_test.cc
2024-04-30T04:46:02 See commit

The commit explicitly closes the host callback queues inside the IfrtBackend destruction process. Previously, if there were host callback executions blocked within RemoteLoadedHostCallbackQueue::Pop(), they would not be automatically cancelled unless RemoteLoadedHostCallbackQueue::Close() was called. This issue led to a deadlock as IfrtBackend also waited for all in-flight operations to finish. The changes made in the commit include adding code to close the host callback queues in the IfrtBackend destructor, ensuring that all in-flight host callback executions are cancelled properly before the destruction of the backend.

In the code, a loop was added to the IfrtBackend destructor that iterates through all host callback queues and calls Close() on each one, cancelling any pending executions. This guarantees that blocked host callback executions are unblocked before destruction proceeds, eliminating the deadlock.
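
Why closing the queue unblocks a waiter can be seen in a minimal closable blocking queue (a generic sketch of the pattern, not RemoteLoadedHostCallbackQueue itself): Pop() blocks until an item arrives or the queue is closed, so Close() wakes every pending Pop() instead of leaving it stuck.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>

template <typename T>
class CallbackQueue {
 public:
  void Push(T item) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      items_.push(std::move(item));
    }
    cv_.notify_one();
  }

  // Returns nullopt once the queue is closed and drained.
  std::optional<T> Pop() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return closed_ || !items_.empty(); });
    if (items_.empty()) return std::nullopt;  // closed, nothing left
    T item = std::move(items_.front());
    items_.pop();
    return item;
  }

  // Unblocks all pending Pop() calls; this is the call the IfrtBackend
  // destructor now makes explicitly on each queue.
  void Close() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      closed_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<T> items_;
  bool closed_ = false;
};
```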

Files changed

  • third_party/xla/xla/python/ifrt_proxy/server/grpc_service_impl.cc
  • third_party/xla/xla/python/ifrt_proxy/server/ifrt_backend.cc
2024-04-30T21:47:48 See commit

This commit adds a method to remove unused dimensions from the IndexingMap class in the XLA:GPU service. The method detects unused dimensions and symbols in the affine map and constraints, and removes them, compressing the map and updating the constraints accordingly. It returns a bit vector of the dimensions that were removed. Tests were added to ensure the correct removal of unused dimensions, including cases where constraints contain used or only unused dimensions.

Overall, this commit enhances the IndexingMap class by providing a way to clean up its representation, removing dimensions that appear neither in the affine map nor in the constraints and thereby making indexing operations in the XLA:GPU service leaner.

Files changed

  • third_party/xla/xla/service/gpu/model/indexing_map.cc
  • third_party/xla/xla/service/gpu/model/indexing_map.h
  • third_party/xla/xla/service/gpu/model/indexing_map_test.cc
2024-04-30T23:41:39 See commit

This commit marks dots as unpropagatable during space-to-batch conversion by adding a case for HloOpcode::kDot to the ConvolutionVisitor in space_to_batch_converter.cc. Because dots are only converted to convolutions after layout assignment, propagating space-to-batch through them earlier caused problems. Additionally, a test case, NoPropagateThroughDot, is added in space_to_batch_converter_test.cc to ensure that space-to-batch conversion does not start on conv->dot chains.

Overall, by marking dots as unpropagatable in the ConvolutionVisitor, the commit ensures that space-to-batch conversion interacts correctly with the dot-to-convolution rewrite that happens after layout assignment, and the new test verifies that no propagation occurs through dots.
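
Structurally, the change amounts to one more case in the visitor's opcode dispatch. A toy sketch of that shape (invented enum; the real code switches on HloOpcode inside ConvolutionVisitor):

```cpp
#include <cstdio>

enum class Opcode { kConvolution, kDot, kAdd };

// When deciding whether an instruction supports space-to-batch
// propagation, dots now answer "no".
bool SupportedOpForPropagation(Opcode op) {
  switch (op) {
    case Opcode::kConvolution:
    case Opcode::kAdd:
      return true;
    case Opcode::kDot:
      return false;  // dots are marked unpropagatable
  }
  return false;
}

int main() {
  std::printf("%d\n", SupportedOpForPropagation(Opcode::kDot));  // prints 0
}
```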

Files changed

  • third_party/xla/xla/service/space_to_batch_converter.cc
  • third_party/xla/xla/service/space_to_batch_converter_test.cc