TensorFlow changelog


Hey there, awesome developers! We've got some exciting updates and improvements to share with you. Check out the latest changes below:

New Features 🚀

  • Integrate StableHLO at openxla/stablehlo@dd48ec58: We've integrated StableHLO, introducing new operations like UniformDequantizeOp and UniformQuantizeOp along with their inference and verification functions. This brings enhancements to uniform quantization and all-to-all operations. 🎉

  • Add num_warps to BlockLevelFusionConfig: A new field, "num_warps," has been added to the BlockLevelFusionConfig message in the GPU backend, along with a method to convert the struct to proto. This improves GPU backend settings configuration. 🛠️

  • Support for CollectivePermute thunk: We've added support for the CollectivePermute thunk in XLA for CPU, enabling all collective operations to be executed using thunks. 🙌

  • Shardings for CaseOp and IfOp: This update adds shardings for implicit operands and return values of CaseOp and IfOp, ensuring correct sharding settings based on input parameters. 🔄

  • Layout method for BasicStringArray: Implemented the layout method for the BasicStringArray class, adding functionality to handle the layout of BasicStringArray objects. 📝

Improvements ✨

  • Split DotThunk for parallel compilation: The DotThunk implementation in XLA CPU service now supports parallel compilation, optimizing matrix multiplication operations. 💪

  • Profiling enhancements with NVTX: Threads, CUDA devices, and CUDA streams are now named in the Nsight Systems UI for a better profiling experience. 🖥️

  • Memcpy function restructuring: Moved the StreamExecutor::Memcpy function to the Stream and its derived classes, streamlining the code and improving efficiency. 🔄

Bugfixes 🐛

  • Prevent XLA crash if PATH variable not set: Addressed an issue where XLA would crash if the PATH environment variable was not set, now providing an error message instead. 🚫

  • Hashable Interval & IndexingMap: Made the Interval and IndexingMap classes properly hashable, ensuring they can be used in containers and other data structures. 🔍

  • Stop using xla/statusor.h: Updated various files to directly include tsl/platform/statusor.h instead of xla/statusor.h, which now only contains an alias for absl::StatusOr. 🔄

Chores 🧹

  • Clean-up before removing tiling: Cleaned up code related to XLA:GPU and MLIR-based indexing in preparation for removing tiling functionality. 🧽

Stay awesome and keep coding! 👩‍💻👨‍💻

Included Commits

2024-06-14T01:51:24 See commit

This commit integrates StableHLO at openxla/stablehlo@dd48ec58. It updates the commit and SHA256 values in third_party/stablehlo/workspace.bzl and modifies third_party/xla/xla/mlir_hlo/mhlo/IR/hlo_ops.cc to add new operations such as UniformDequantizeOp and UniformQuantizeOp, along with their inference and verification functions. It also adds a verifier for UniformQuantizeOp in third_party/xla/xla/mlir_hlo/mhlo/IR/hlo_ops.td and new uniform-quantization functions in third_party/xla/xla/mlir_hlo/tests/Dialect/mhlo/mhlo_quantized.mlir.

Furthermore, third_party/xla/xla/mlir_hlo/tests/Dialect/mhlo/ops.mlir gains new functions such as all_to_all_same_split_concat_dim, demonstrating the mhlo.all_to_all operation with specific parameters. Overall, this commit introduces new operations, functions, and verifiers for uniform quantization and all-to-all operations in StableHLO.

Files changed

  • third_party/stablehlo/workspace.bzl
  • third_party/xla/xla/mlir_hlo/mhlo/IR/hlo_ops.cc
  • third_party/xla/xla/mlir_hlo/mhlo/IR/hlo_ops.td
  • third_party/xla/xla/mlir_hlo/tests/Dialect/mhlo/mhlo_quantized.mlir
  • third_party/xla/xla/mlir_hlo/tests/Dialect/mhlo/ops.mlir
2024-06-14T10:19:56 See commit

This commit makes the Interval and IndexingMap classes properly hashable by implementing the necessary comparison operators and hash functions. The changes include modifying the Interval class to define comparison operators for greater than, less than, greater than or equal to, less than or equal to, equal to, and not equal to, as well as implementing hash functions for AbslHashValue and llvm::hash_combine. Additionally, the IndexingMap class is updated to support hashing by implementing the AbslHashValue function and ensuring consistency in hash values for different instances.

The modifications involve updating the Interval and IndexingMap classes in the gpu model to include the necessary comparison operators, hash functions, and supporting methods. These changes ensure that instances of Interval and IndexingMap can be properly hashed for use in containers and other data structures that rely on hashing. The commit also includes test cases to verify the correctness and consistency of the hashing implementation for Interval, DimVar, RangeVar, RTVar, and IndexingMap instances.

Files changed

  • third_party/xla/xla/service/gpu/fusions/mlir/simplify_arith.cc
  • third_party/xla/xla/service/gpu/model/BUILD
  • third_party/xla/xla/service/gpu/model/indexing_map.cc
  • third_party/xla/xla/service/gpu/model/indexing_map.h
  • third_party/xla/xla/service/gpu/model/indexing_map_test.cc
2024-06-14T10:38:16 See commit

This commit addresses an issue where XLA crashes if the PATH environment variable is not set, which is a rare occurrence but can happen due to misconfiguration. Instead of the program crashing with a SIGSEGV error in XLA internals, this change ensures that an error indicating CUDA was not found is received, preventing the crash. The commit modifies the code in cuda_asm_compiler.cc to handle the case where the PATH variable is not set, adding checks and error handling to prevent the crash.

The fix adds a check for the PATH environment variable; when the variable is unset, the situation is handled gracefully and the program reports that CUDA was not found rather than crashing unexpectedly. The commit closes the related issue #13513 and ensures that XLA does not crash if PATH is not properly configured.

Files changed

  • third_party/xla/xla/stream_executor/cuda/cuda_asm_compiler.cc
2024-06-14T12:44:58 See commit

This commit cleans up code related to XLA:GPU and MLIR-based indexing in preparation for removing tiling. The changes touch several files under third_party/xla/xla/service/gpu, including reduction_base.cc, reduction_base.h, reduction.cc, reduction.h, reduction_mlir.cc, reduction_mlir.h, indexing_analysis.cc, and indexing_analysis.h.

Tidying this code ahead of the tiling removal is part of a larger effort to streamline XLA:GPU and MLIR-based indexing, and leaves the codebase in a more maintainable state for the follow-up change.

Files changed

  • third_party/xla/xla/service/gpu/fusions/reduction.cc
  • third_party/xla/xla/service/gpu/fusions/reduction.h
  • third_party/xla/xla/service/gpu/fusions/reduction_base.cc
  • third_party/xla/xla/service/gpu/fusions/reduction_base.h
  • third_party/xla/xla/service/gpu/fusions/reduction_mlir.cc
  • third_party/xla/xla/service/gpu/fusions/reduction_mlir.h
  • third_party/xla/xla/service/gpu/model/indexing_analysis.cc
  • third_party/xla/xla/service/gpu/model/indexing_analysis.h
2024-06-16T22:01:41 See commit

This commit introduces changes to the DotThunk implementation in the XLA CPU service to enable parallel compilation. The DotThunk class now includes template functions for matrix multiplication operations, with specific implementations for different data types like half, float, double, int32_t, and complex numbers. These template functions utilize Eigen contraction for matrix multiplication and include alignment considerations for improved performance. Additionally, new files for specific data types like dot_thunk_c128.cc, dot_thunk_f16.cc, dot_thunk_f32.cc, dot_thunk_f64.cc, and dot_thunk_s32.cc have been added to provide specialized implementations for these data types.

The changes also include modifications to the dot_benchmark_test.cc file to include additional argument pairs for benchmarking the performance of batched matrix multiplication operations. The BUILD file has been updated to include new source files for dot_thunk, and the dot_thunk.cc file has been modified to remove unnecessary code and include the template functions for matrix multiplication. Overall, these changes aim to optimize the compilation and execution of matrix multiplication operations in the XLA CPU service.

Files changed

  • third_party/xla/xla/service/cpu/benchmarks/dot_benchmark_test.cc
  • third_party/xla/xla/service/cpu/runtime/BUILD
  • third_party/xla/xla/service/cpu/runtime/dot_thunk.cc
  • third_party/xla/xla/service/cpu/runtime/dot_thunk.h
  • third_party/xla/xla/service/cpu/runtime/dot_thunk_c128.cc
  • third_party/xla/xla/service/cpu/runtime/dot_thunk_c64.cc
  • third_party/xla/xla/service/cpu/runtime/dot_thunk_f16.cc
  • third_party/xla/xla/service/cpu/runtime/dot_thunk_f32.cc
  • third_party/xla/xla/service/cpu/runtime/dot_thunk_f64.cc
  • third_party/xla/xla/service/cpu/runtime/dot_thunk_s32.cc
2024-06-18T21:18:54 See commit

This commit adds shardings for implicit operands and return values of CaseOp and IfOp in the XLA library. It introduces arg shardings only if result shardings are present, indicating that sharding propagation has not yet been run. The commit modifies the mlir_hlo_to_hlo.cc file, adding 77 lines and deleting 23 (100 changed lines in total), with new functions to create tuple shardings, get result shardings, get XlaOp shardings, and handle conditional operations in IfOp and CaseOp.

Additionally, the commit includes changes to the sharding.mlir test file, adding 140 lines of code. The test cases in the file verify the sharding configurations for different scenarios involving IfOp and CaseOp operations with multiple operands and return values, ensuring that the sharding settings are correctly applied based on the input parameters.

Files changed

  • third_party/xla/xla/translate/mhlo_to_hlo/mlir_hlo_to_hlo.cc
  • third_party/xla/xla/translate/mhlo_to_hlo/tests/sharding.mlir
2024-06-19T17:47:17 See commit

This commit adds a new field, "num_warps," to the BlockLevelFusionConfig message in the gpu backend_configs.proto file. Additionally, a method is implemented to convert the BlockLevelParameters struct to a BlockLevelFusionConfig proto in the tiled_hlo_computation.h file. The changes include modifications to the BlockLevelParameters struct to include the new num_warps field and the implementation of the ToBlockLevelFusionConfig method. Unit tests are also added in the tiled_hlo_computation_test.cc file to ensure the conversion functions correctly.

Overall, this commit enhances the XLA:GPU functionality by allowing the specification of the number of warps to use for the kernel in the BlockLevelFusionConfig message and provides a method to convert BlockLevelParameters to BlockLevelFusionConfig. The changes aim to improve the configuration and handling of GPU backend settings within the XLA service.

Files changed

  • third_party/xla/xla/service/gpu/backend_configs.proto
  • third_party/xla/xla/service/gpu/model/BUILD
  • third_party/xla/xla/service/gpu/model/tiled_hlo_computation.h
  • third_party/xla/xla/service/gpu/model/tiled_hlo_computation_test.cc
2024-06-20T05:54:37 See commit

This commit improves the profiling experience by naming threads, CUDA devices, and CUDA streams in the Nsight Systems UI. The device names, stream names, and thread names are now displayed in the UI, providing a better understanding of the profiling data. Additionally, a missing link between replica IDs in the HLO and the physical devices in the profile is addressed.

The changes include adding functions to assign human-readable names to threads, devices, and streams, as well as updating various files to incorporate these changes. Names are assigned to threads that launch work on devices, CUDA devices, and execution streams, enhancing the profiling information available. The commit also includes adjustments based on review comments to improve the implementation.

Files changed

  • third_party/xla/third_party/tsl/tsl/profiler/lib/nvtx_utils.cc
  • third_party/xla/third_party/tsl/tsl/profiler/lib/nvtx_utils.h
  • third_party/xla/third_party/tsl/tsl/profiler/lib/nvtx_utils_stub.cc
  • third_party/xla/xla/pjrt/BUILD
  • third_party/xla/xla/pjrt/gpu/BUILD
  • third_party/xla/xla/pjrt/gpu/se_gpu_pjrt_client.cc
  • third_party/xla/xla/pjrt/local_device_state.cc
  • third_party/xla/xla/service/gpu/infeed_manager.cc
  • third_party/xla/xla/service/stream_pool.cc
  • third_party/xla/xla/stream_executor/BUILD
  • third_party/xla/xla/stream_executor/gpu/BUILD
  • third_party/xla/xla/stream_executor/gpu/gpu_stream.cc
  • third_party/xla/xla/stream_executor/gpu/gpu_stream.h
  • third_party/xla/xla/stream_executor/stream.h
  • third_party/xla/xla/stream_executor/stream_common.cc
  • third_party/xla/xla/stream_executor/stream_common.h
  • third_party/xla/xla/stream_executor/stream_executor_memory_allocator.cc
  • third_party/xla/xla/stream_executor/trace_command_buffer_factory.cc
2024-06-20T19:10:46 See commit

The commit moves the StreamExecutor::Memcpy function to the Stream and its derived classes. The changes include modifying the CStreamExecutor and XlaInterpreterExecutor classes to handle the Memcpy function differently. The GpuExecutor class has also been updated to remove the Memcpy function and handle asynchronous memcpy differently. Additionally, the HostExecutor and TpuExecutor classes have been modified to adjust the Memcpy function accordingly. Changes have been made in various files to accommodate this restructuring, including stream_executor.cc, stream_executor_internal.h, executor.cc, executor.h, cuda_executor.cc, gpu_executor.h, gpu_stream.cc, host_executor.cc, host_stream.cc, mock_stream_executor.h, stream_common.cc, stream_common.h, stream_executor.h, tpu_executor.cc, and tpu_stream.h.

Overall, the commit centralizes the handling of the Memcpy function in the Stream and its derived classes, streamlining the code and improving efficiency by moving the processing to the appropriate classes.

Files changed

  • tensorflow/c/experimental/stream_executor/stream_executor.cc
  • tensorflow/c/experimental/stream_executor/stream_executor_internal.h
  • third_party/xla/xla/backends/interpreter/executor.cc
  • third_party/xla/xla/backends/interpreter/executor.h
  • third_party/xla/xla/stream_executor/cuda/cuda_executor.cc
  • third_party/xla/xla/stream_executor/gpu/gpu_executor.h
  • third_party/xla/xla/stream_executor/gpu/gpu_stream.cc
  • third_party/xla/xla/stream_executor/gpu/gpu_stream.h
  • third_party/xla/xla/stream_executor/host/host_executor.cc
  • third_party/xla/xla/stream_executor/host/host_executor.h
  • third_party/xla/xla/stream_executor/host/host_stream.cc
  • third_party/xla/xla/stream_executor/host/host_stream.h
  • third_party/xla/xla/stream_executor/mock_stream_executor.h
  • third_party/xla/xla/stream_executor/stream_common.cc
  • third_party/xla/xla/stream_executor/stream_common.h
  • third_party/xla/xla/stream_executor/stream_executor.h
  • third_party/xla/xla/stream_executor/tpu/tpu_executor.cc
  • third_party/xla/xla/stream_executor/tpu/tpu_executor.h
  • third_party/xla/xla/stream_executor/tpu/tpu_stream.h
2024-06-20T20:13:08 See commit

The commit involves stopping the use of xla/statusor.h as it now only contains an alias for absl::StatusOr. This change required modifying various files to directly include tsl/platform/statusor.h to access definitions for TF_ASSIGN_OR_RETURN and other elements that were previously included transitively. The commit includes modifications to multiple files in the third_party/xla/xla/client directory, such as BUILD, client.h, client_library.h, compile_only_client.h, executable_build_options.cc, local_client.h, padding.h, value_inference.cc, and xla_builder.h, as well as files in the third_party/xla/xla/pjrt/gpu and third_party/xla/xla/service/llvm_ir directories.

Overall, the commit updates the include statements in the affected files so that TF_ASSIGN_OR_RETURN and absl::StatusOr are reached through tsl/platform/statusor.h directly, ensuring a smooth transition away from xla/statusor.h. This change affects multiple files across different directories within the project.

Files changed

  • third_party/xla/xla/client/BUILD
  • third_party/xla/xla/client/client.h
  • third_party/xla/xla/client/client_library.h
  • third_party/xla/xla/client/compile_only_client.h
  • third_party/xla/xla/client/executable_build_options.cc
  • third_party/xla/xla/client/local_client.h
  • third_party/xla/xla/client/padding.h
  • third_party/xla/xla/client/value_inference.cc
  • third_party/xla/xla/client/xla_builder.h
  • third_party/xla/xla/pjrt/gpu/BUILD
  • third_party/xla/xla/pjrt/gpu/gpu_helpers.cc
  • third_party/xla/xla/pjrt/gpu/gpu_helpers.h
  • third_party/xla/xla/pjrt/gpu/nccl_id_store.cc
  • third_party/xla/xla/pjrt/gpu/nccl_id_store.h
  • third_party/xla/xla/pjrt/gpu/se_gpu_pjrt_client.cc
  • third_party/xla/xla/pjrt/gpu/se_gpu_pjrt_client.h
  • third_party/xla/xla/pjrt/gpu/se_gpu_pjrt_client_test.cc
  • third_party/xla/xla/service/llvm_ir/BUILD
  • third_party/xla/xla/service/llvm_ir/fused_ir_emitter.cc
  • third_party/xla/xla/service/llvm_ir/fused_ir_emitter.h
  • third_party/xla/xla/service/llvm_ir/ir_array.cc
  • third_party/xla/xla/service/llvm_ir/llvm_util.cc
  • third_party/xla/xla/service/llvm_ir/llvm_util.h
  • third_party/xla/xla/service/llvm_ir/loop_emitter.cc
  • third_party/xla/xla/service/llvm_ir/loop_emitter.h
2024-06-20T21:37:53 See commit

This commit adds support for CollectivePermute thunk in XLA for CPU. The CollectivePermute thunk is created with the necessary parameters and buffers, and it can execute the CollectivePermute operation with source and target pairs. The implementation includes the creation of the CollectivePermuteThunk class, its constructor, and the Execute method to perform the CollectivePermute operation. The necessary changes are made in various files such as BUILD files, thunk files, and tests to support this new feature.

Additionally, the commit enables the collective ops test with thunks as all collective operations are now supported, showcasing that the new CollectivePermute thunk functionality has been successfully integrated and tested. Overall, this commit enhances the XLA for CPU by adding support for the CollectivePermute thunk and ensuring that all collective operations can now be executed using thunks.

Files changed

  • third_party/xla/xla/service/cpu/BUILD
  • third_party/xla/xla/service/cpu/runtime/BUILD
  • third_party/xla/xla/service/cpu/runtime/collective_permute_thunk.cc
  • third_party/xla/xla/service/cpu/runtime/collective_permute_thunk.h
  • third_party/xla/xla/service/cpu/runtime/thunk.cc
  • third_party/xla/xla/service/cpu/runtime/thunk.h
  • third_party/xla/xla/service/cpu/thunk_emitter.cc
  • third_party/xla/xla/service/cpu/thunk_emitter.h
  • third_party/xla/xla/tests/BUILD
  • third_party/xla/xla/tests/collective_ops_test.cc
2024-06-20T22:18:18 See commit

This commit implements the layout method for the BasicStringArray class in the basic_string_array.cc file. It includes the addition of the BasicStringArrayLayout class to describe the layout of a BasicStringArray, with methods for serialization, equality comparison, and hashing. The commit also includes test cases in the basic_string_array_test.cc file to test the serialization, equality, and functionality of the BasicStringArrayLayout and BasicStringArray classes.

Overall, the commit adds functionality to handle the layout of BasicStringArray objects and includes corresponding test cases to ensure the correct implementation and behavior of the layout-related methods. Additionally, there are modifications in the basic_string_array.h and BUILD files to support these changes.

Files changed

  • third_party/xla/xla/python/pjrt_ifrt/BUILD
  • third_party/xla/xla/python/pjrt_ifrt/basic_string_array.cc
  • third_party/xla/xla/python/pjrt_ifrt/basic_string_array.h
  • third_party/xla/xla/python/pjrt_ifrt/basic_string_array_test.cc