TensorFlow changelog
Hey there, fabulous developers! We've got some exciting updates and tweaks to share with you. Let's dive right into the latest changes:
New feature: Add support for atomic_rmw fadd for bf16 on HOPPER
- Summary: This update brings in the magic of atomic_rmw fadd for the bf16 data type on the HOPPER CUDA compute capability within XLA:GPU and the MLIR-based emitters. Now you can perform atomic operations on bf16 data with ease. A test case has been added to ensure everything runs smoothly on the HOPPER architecture.
Improvement: Avoid building hlo_runner_main.cc twice
- Summary: We've streamlined the build process by moving the actual build into a shared library target and creating two binary targets that depend on it. This makes maintaining dependencies easier and more explicit. Say goodbye to redundant builds!
Improvement: Run fusion-wrapper pass before scheduling in XLA:GPU
- Summary: The fusion-wrapper pass now runs before scheduling in the GPU compiler. This change enhances the fusion and scheduling process, making it more efficient. Plus, there's a new test to ensure non-fused instructions are wrapped correctly.
New feature: Open source XLA passes for Shardy
- Summary: Shardy just got a major upgrade with new XLA passes! We've added new files, headers, and functions for exporting and importing operations and shardings. Test files are also included to ensure everything works perfectly.
Improvement: Port concatenate instruction to Thunks in XLA:CPU
- Summary: Concatenate instructions are now ported to Thunks, with a fast concatenate option for better performance. Benchmarks show a 4% improvement in parallel concatenate performance and an 11% boost in CPU time. Fast concatenate without parallel processing shows a slight performance dip.
New feature: Add a basic test case for circular pipeline collective permute
- Summary: A new test case for circular pipeline collective permute has been added. It involves a simple computation using collective permute with source-target pairs and verifies the results. A more complex test case is outlined for future implementation.
New feature: Add a toy example for using Shardy
- Summary: A toy example for using Shardy in the XLA pipeline is now available. This includes changes to workspace files, BUILD files, a main file for Shardy optimization, and a test file with a simple MLIR test case. Perfect for getting started with Shardy!
New feature: Add Thunk::ExecuteSession to control concurrent workers
- Summary: Control the number of concurrent workers processing XLA execute requests with Thunk::ExecuteSession. This helps manage task scheduling overheads for XLA programs with many tiny thunks. Unit tests ensure the locking mechanism works as expected.
Bugfix: Remove support for CUDA versions below 12.3 in XLA
- Summary: We've streamlined XLA by removing support for CUDA versions below 12.3. This update affects multiple files related to GPU functionality, profiling, and testing, aligning XLA with the latest CUDA technology for improved performance.
Bugfix: Revert fix for 3 DeadCode findings
- Summary: Reverted a previous fix that addressed 3 DeadCode findings related to DelayKernelIsSupported, LaunchDelayKernel, and UnsupportedGpuFeature. The revert undoes changes made to gpu_timer_kernel_rocm.cc and gpu_types.h.
Bugfix: Only use the kernel threadpool if it is enabled
- Summary: Added a conditional check to use the kernel threadpool only if it is enabled. This ensures optimal performance and resource utilization when working with TensorFlow Lite delegates.
Chore: Make stablehlo tests private
- Summary: The visibility of stablehlo tests has been changed from public to private. This keeps these tests restricted to their intended scope, maintaining the integrity and organization of the codebase.
That's all for now, folks! Keep coding and stay awesome!
Included Commits
This commit modifies the xnnpack_delegate.cc file in the TensorFlow Lite delegates directory. It adds a conditional check to only use the kernel threadpool if it is enabled. Specifically, it checks if the TFLITE_KERNEL_USE_XNNPACK flag is defined and if a valid threadpool is passed via the context. If these conditions are met, the xnnpack threadpool will be used with the same number of threads as the kernel threadpool.
Overall, this change ensures that the kernel threadpool is only used when it is explicitly enabled, providing more control over the threadpool usage in the xnnpack delegate. This modification helps optimize performance and resource utilization when working with TensorFlow Lite delegates.
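For illustration, here is a minimal sketch of what such a guard can look like. The TFLITE_KERNEL_USE_XNNPACK flag is named in the commit; the helper function and its parameters below are hypothetical stand-ins, not the actual delegate code.

```cpp
// Hypothetical sketch of the guard described above -- not the real
// xnnpack_delegate.cc code. TFLITE_KERNEL_USE_XNNPACK is the flag named
// in the commit; PickThreadpool and its parameters are illustrative.
#include <cstddef>

#include <pthreadpool.h>  // XNNPACK's threadpool abstraction

pthreadpool_t PickThreadpool(pthreadpool_t kernel_threadpool,
                             size_t num_threads) {
#ifdef TFLITE_KERNEL_USE_XNNPACK
  // Reuse the kernel threadpool only when the build flag is set and a
  // valid pool was passed in via the context.
  if (kernel_threadpool != nullptr) {
    return kernel_threadpool;
  }
#endif
  // Otherwise fall back to a pool owned by the delegate itself.
  return pthreadpool_create(num_threads);
}
```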
Files changed
- tensorflow/lite/delegates/xnnpack/xnnpack_delegate.cc
This commit runs the fusion-wrapper pass before scheduling in the XLA:GPU compiler. The GpuCompiler::RunPreSchedulingPasses method now adds the FusionWrapper pass to its HloPassPipeline before running the pipeline on the module, and GpuCompiler::CompileToBackendResult is updated to call RunPreSchedulingPasses. In addition, a CollectiveScheduleLinearizer pass is added in the GpuCompiler class via an HloPassPipeline named "collective-schedule-linearizer".
In the GpuCompilerTest class, a new test, NonFusedInstructionsAreWrapped, checks that non-fused instructions are wrapped correctly by the FusionWrapper pass: it builds an HloModule with a specific structure, compiles it with the backend compiler, and verifies that the root instruction of the compiled module matches the Fusion instruction pattern. Overall, this commit tightens the fusion and scheduling process in XLA:GPU by running the fusion-wrapper pass before scheduling and updating the compiler and its tests accordingly.
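As a rough sketch (assuming XLA's usual pass-pipeline API; the real method lives in gpu_compiler.cc and takes more parameters), the change amounts to something like:

```cpp
// Sketch only: the actual GpuCompiler::RunPreSchedulingPasses has a
// different signature and may add further passes.
#include "absl/status/status.h"
#include "xla/hlo/ir/hlo_module.h"
#include "xla/service/gpu/fusion_wrapper.h"
#include "xla/service/hlo_pass_pipeline.h"

namespace xla::gpu {

absl::Status RunPreSchedulingPasses(HloModule* module) {
  HloPassPipeline pipeline("pre-scheduling-passes");
  // Wrap every instruction that is not yet inside a fusion computation
  // in a trivial fusion, so the scheduler only ever sees fusion ops.
  pipeline.AddPass<FusionWrapper>();
  return pipeline.Run(module).status();
}

}  // namespace xla::gpu
```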
Files changed
- third_party/xla/xla/service/gpu/gpu_compiler.cc
- third_party/xla/xla/service/gpu/gpu_compiler_test.cc
This commit avoids building hlo_runner_main.cc twice by moving the actual build into a shared library target and creating two binary targets that depend on it. The hlo_runner_main binary target now depends on the hlo_runner_main_lib library target, so the code is compiled only once. The change modifies the BUILD file, adding 22 lines and deleting 30, for 52 changed lines overall.
Overall, this streamlines the build process and improves dependency management: splitting the binaries off from a shared library target keeps the code organized and easier to maintain, and the BUILD file now states its dependencies more clearly and explicitly.
Files changed
- third_party/xla/xla/tools/multihost_hlo_runner/BUILD
This commit adds support for the atomic_rmw fadd operation on the bf16 data type for the HOPPER CUDA compute capability within XLA:GPU and the MLIR-based emitters. The changes modify lower_tensors.cc and lower_tensors.mlir, checking whether the target architecture supports bf16 atomic operations and, on HOPPER, emitting the atomic_rmw fadd operation for bf16 values directly.
A test case added in lower_tensors.mlir verifies that a direct atomic_rmw fadd on bf16 data is handled correctly on the HOPPER architecture. Overall, this commit extends the capabilities of XLA:GPU and the MLIR-based emitters by supporting atomic operations on bf16 data specifically for the HOPPER compute capability.
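To illustrate the gating logic, here is a sketch under the assumption that the emitter keys off the CUDA compute capability; the helper name and structure are hypothetical, not the emitter's actual code.

```cpp
// Illustrative predicate, not the emitter's code: a direct atomic_rmw
// fadd on bf16 is only emitted when the target is Hopper (sm_90) or
// newer; older architectures need a fallback lowering.
#include "xla/stream_executor/device_description.h"
#include "xla/xla_data.pb.h"

bool SupportsDirectAtomicFadd(
    const stream_executor::CudaComputeCapability& cc,
    xla::PrimitiveType type) {
  switch (type) {
    case xla::F32:
      return true;  // f32 atomic add is available on all targeted GPUs
    case xla::BF16:
      return cc.IsAtLeastHopper();  // sm_90+ has hardware bf16 atomic add
    default:
      return false;
  }
}
```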
Files changed
- third_party/xla/xla/service/gpu/fusions/mlir/lower_tensors.cc
- third_party/xla/xla/service/gpu/fusions/mlir/tests/lower_tensors.mlir
This commit ports the concatenate instruction to Thunks in the XLA:CPU backend, including a fast concatenate option for improved performance. Benchmarks show an overall improvement of about 4% in parallel concatenate performance, with roughly 11% better CPU time. Fast concatenate without parallel processing, however, shows a slight degradation: about 0.1% in overall performance and roughly 0.11% in CPU time.
The commit touches various files in the XLA:CPU backend, adding concatenate benchmark tests, modifying the benchmark runners, and adjusting the emitter files. The benchmarks compare fast concatenation with and without Thunks across different batch sizes, widths, heights, and axes, with varying results in time and CPU usage. Overall, Thunks have a clearly positive impact on parallel concatenate performance, while the impact on non-parallel fast concatenate is more nuanced.
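The intuition behind the fast path: for dense row-major buffers, concatenating along the outermost axis reduces to copying each operand's bytes back to back into the result. A self-contained illustration (not XLA's code):

```cpp
// Standalone illustration of the "fast concatenate" idea, not XLA's
// implementation: along the outermost axis of row-major buffers,
// concatenation is just back-to-back bulk copies into the result.
#include <cstddef>
#include <cstring>
#include <vector>

void ConcatOutermost(const std::vector<const char*>& srcs,
                     const std::vector<size_t>& src_sizes,  // bytes
                     char* dst) {
  size_t offset = 0;
  for (size_t i = 0; i < srcs.size(); ++i) {
    std::memcpy(dst + offset, srcs[i], src_sizes[i]);
    offset += src_sizes[i];
  }
}
```

A parallel variant can hand each copy to a separate worker thread, which is presumably where the reported parallel gains come from.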
Files changed
- third_party/xla/xla/service/cpu/BUILD
- third_party/xla/xla/service/cpu/benchmarks/BUILD
- third_party/xla/xla/service/cpu/benchmarks/concatenate_benchmark_test.cc
- third_party/xla/xla/service/cpu/benchmarks/hlo_benchmark_runner.cc
- third_party/xla/xla/service/cpu/benchmarks/hlo_benchmark_runner.h
- third_party/xla/xla/service/cpu/ir_emitter.cc
- third_party/xla/xla/service/cpu/ir_emitter.h
- third_party/xla/xla/service/cpu/ir_emitter2.cc
- third_party/xla/xla/service/cpu/ir_emitter2.h
- third_party/xla/xla/service/cpu/thunk_emitter.cc
- third_party/xla/xla/service/cpu/thunk_emitter.h
- third_party/xla/xla/tests/BUILD
- third_party/xla/xla/tests/concatenate_test.cc
This commit reverts a previous fix that addressed three DeadCode findings related to DelayKernelIsSupported, LaunchDelayKernel, and UnsupportedGpuFeature. The revert restores the functions DelayKernelIsSupported and LaunchDelayKernel in gpu_timer_kernel_rocm.cc and the UnsupportedGpuFeature structure in gpu_types.h, undoing the earlier removal of these dead and untested features. The commit message references the reverted commit and its origin revision ID.
Files changed
- third_party/xla/xla/stream_executor/gpu/gpu_timer_kernel_rocm.cc
- third_party/xla/xla/stream_executor/gpu/gpu_types.h
This commit removes support for CUDA versions below 12.3 in XLA. Files related to CUDA, GPU profiling, the GPU service, kernels, tests, and stream executors have been modified to reflect the change.
Overall, this streamlines XLA's CUDA compatibility story: dropping older versions suggests a focused effort on aligning XLA with recent CUDA releases for improved performance and efficiency across GPU functionality, profiling, testing, and execution.
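The actual change edits many call sites rather than adding a single check, but the new floor could be expressed as a compile-time guard like the hypothetical one below (CUDA's cuda.h encodes version 12.3 as 12030):

```cpp
// Hypothetical compile-time guard illustrating the new minimum version;
// the commit itself removes per-version branches rather than adding this.
#include <cuda.h>  // defines CUDA_VERSION, e.g. 12030 for CUDA 12.3

#if CUDA_VERSION < 12030
#error "XLA requires CUDA 12.3 or newer."
#endif
```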
Files changed
- third_party/xla/xla/backends/profiler/gpu/cupti_tracer.cc
- third_party/xla/xla/backends/profiler/gpu/cupti_wrapper.cc
- third_party/xla/xla/backends/profiler/gpu/device_tracer_cuda.cc
- third_party/xla/xla/pjrt/gpu/se_gpu_pjrt_client.cc
- third_party/xla/xla/service/gpu/cudnn_fused_conv_rewriter_test.cc
- third_party/xla/xla/service/gpu/cudnn_fused_mha_rewriter.cc
- third_party/xla/xla/service/gpu/cudnn_fused_mha_rewriter_test.cc
- third_party/xla/xla/service/gpu/cudnn_norm_rewriter_test.cc
- third_party/xla/xla/service/gpu/kernels/cutlass_gemm_custom_kernel.cc
- third_party/xla/xla/service/gpu/runtime/command_buffer_thunk_test.cc
- third_party/xla/xla/service/gpu/tests/gemm_rewrite_test.cc
- third_party/xla/xla/service/gpu/tests/gpu_fused_mha_test.cc
- third_party/xla/xla/stream_executor/cuda/cuda_blas.cc
- third_party/xla/xla/stream_executor/cuda/cuda_blas_lt.cc
- third_party/xla/xla/stream_executor/cuda/cuda_blas_utils.cc
- third_party/xla/xla/stream_executor/cuda/cuda_driver.cc
- third_party/xla/xla/stream_executor/cuda/cuda_fft.cc
- third_party/xla/xla/stream_executor/cuda/ptx_compiler_impl.cc
- third_party/xla/xla/stream_executor/gpu/gpu_command_buffer_test.cc
- third_party/xla/xla/stream_executor/gpu/gpu_types.h
- third_party/xla/xla/tsl/cuda/cupti_stub.cc
This commit adds a basic test case for circular pipeline collective permute in the third_party/xla/xla/tests/collective_ops_test.cc file. The test performs a simple computation using collective permute with source-target pairs and verifies the results; a more complex circular-pipelining test case is outlined in the code but marked as a TODO for future implementation. The tests run with a specified number of replicas, and the results are compared against expected values.
The commit adds 53 lines to collective_ops_test.cc, with no deletions. The new test sets up a while loop for the computation and verifies the results using collective permute; a degenerate collective-permute test case covering a different scenario is included in the same file.
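For a sense of what such a test exercises, here is a hedged sketch of a circular collective permute in HLO text, embedded as a C++ string constant; the module is illustrative, not the committed test.

```cpp
// Illustrative HLO for a circular collective permute over four replicas:
// each replica forwards its replica-id one step around a ring, so
// replica i receives the value (i + 3) % 4.
const char* const kCircularPermuteHlo = R"(
HloModule circular_pipeline

ENTRY entry {
  id = u32[] replica-id()
  ROOT permute = u32[] collective-permute(id),
      source_target_pairs={{0,1},{1,2},{2,3},{3,0}}
}
)";
```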
Files changed
- third_party/xla/xla/tests/collective_ops_test.cc
This commit makes the stablehlo tests in the tensorflow compiler MLIR lite directory private instead of public. This change is reflected in the BUILD file where the visibility is updated from public to private. The modifications in this commit are focused on changing the visibility setting for the stablehlo tests, making them accessible only within the appropriate scope.
Overall, this commit ensures that the stablehlo tests are kept private, restricting their visibility to only the necessary components. This adjustment helps maintain the integrity and organization of the codebase by ensuring that these specific tests are not accessible outside of their intended scope.
Files changed
- tensorflow/compiler/mlir/lite/stablehlo/tests/BUILD
This commit adds a toy example for using Shardy in the XLA pipeline. It updates the workspace files to load the Shardy repository and adds BUILD files for Shardy integration, a main file for Shardy optimization, and a test file with a simple MLIR test case. The changes add dependencies, register passes, and set up lit tests for Shardy in XLA.
Together, these configurations and files establish the infrastructure for Shardy integration and testing within the XLA framework and demonstrate how Shardy can be used for optimization.
Files changed
- tensorflow/workspace2.bzl
- third_party/shardy/BUILD
- third_party/shardy/shardy.patch
- third_party/shardy/workspace.bzl
- third_party/xla/workspace2.bzl
- third_party/xla/xla/service/spmd/shardy/BUILD
- third_party/xla/xla/service/spmd/shardy/sdy_opt_main.cc
- third_party/xla/xla/service/spmd/shardy/test/BUILD
- third_party/xla/xla/service/spmd/shardy/test/simple.mlir
This commit introduces a new feature in XLA for controlling the number of concurrent workers processing XLA execute requests. By adding a new function called Thunk::ExecuteSession, the maximum number of concurrent workers can be controlled, which helps in managing task scheduling overheads for XLA programs with a large number of tiny thunks. The commit includes changes to Thunk::ExecuteState, ThunkExecutor, ThunkExecutorTest, Thunk, and ThunkTest files to implement this feature. The Execute function in ThunkExecutor now takes an additional parameter for the lock, and a new function SplitReadyQueue is added to offload thunks processing to the task runner.
Additionally, the ExecuteSession class in the Thunk file is introduced to control the number of task runner threads that can execute thunks concurrently. The class includes methods to join the execute session, increment the number of session workers, and try to join the session while ensuring the maximum number of workers is not exceeded. Unit tests are added to test the functionality of the ExecuteSession class, ensuring that the locking mechanism works correctly and the number of workers is managed as expected.
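The described semantics can be modeled with an atomic counter. The sketch below mirrors the names from the summary but is a guess at the mechanics, not XLA's implementation:

```cpp
// Minimal model of the described behavior: an atomic counter caps how
// many task-runner threads may process thunks concurrently. Join()
// always admits the caller; TryJoin() admits it only under the cap.
#include <atomic>
#include <cstdint>

class ExecuteSession {
 public:
  explicit ExecuteSession(int64_t max_workers) : max_workers_(max_workers) {}

  // Unconditionally become a worker (e.g. the thread that owns the
  // execute request and must always make progress).
  void Join() { workers_.fetch_add(1, std::memory_order_relaxed); }

  // Become a worker only if the concurrency cap has not been reached.
  bool TryJoin() {
    int64_t current = workers_.load(std::memory_order_relaxed);
    while (current < max_workers_) {
      if (workers_.compare_exchange_weak(current, current + 1,
                                         std::memory_order_relaxed)) {
        return true;
      }
      // compare_exchange_weak reloaded `current`; the loop re-checks the cap.
    }
    return false;
  }

  void Leave() { workers_.fetch_sub(1, std::memory_order_relaxed); }

 private:
  const int64_t max_workers_;
  std::atomic<int64_t> workers_{0};
};
```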
Files changed
- third_party/xla/xla/service/cpu/runtime/thunk.cc
- third_party/xla/xla/service/cpu/runtime/thunk.h
- third_party/xla/xla/service/cpu/runtime/thunk_executor.cc
- third_party/xla/xla/service/cpu/runtime/thunk_executor.h
- third_party/xla/xla/service/cpu/runtime/thunk_executor_test.cc
- third_party/xla/xla/service/cpu/runtime/thunk_test.cc
This commit open-sources XLA passes for Shardy, with changes throughout the third_party/xla/xla directory. The Shardy BUILD file, mlir_to_hlo.cc, and sdy_opt_main.cc are modified, and new files such as the mhlo_round_trip and round_trip_common directories are added, containing functions and headers for exporting and importing operations and shardings. Test files are included to verify the functionality of the new passes.
Overall, the commit adds new XLA passes for Shardy: existing files are modified and renamed, new headers and functions for exporting and importing operations are introduced, and test files ensure the passes work as intended.
Files changed
- third_party/xla/xla/pjrt/BUILD
- third_party/xla/xla/pjrt/mlir_to_hlo.cc
- third_party/xla/xla/service/spmd/shardonnay/BUILD
- third_party/xla/xla/service/spmd/shardonnay/README.md
- third_party/xla/xla/service/spmd/shardy/BUILD
- third_party/xla/xla/service/spmd/shardy/README.md
- third_party/xla/xla/service/spmd/shardy/constants.h
- third_party/xla/xla/service/spmd/shardy/mhlo_round_trip/BUILD
- third_party/xla/xla/service/spmd/shardy/mhlo_round_trip/export_ops.cc
- third_party/xla/xla/service/spmd/shardy/mhlo_round_trip/export_ops.h
- third_party/xla/xla/service/spmd/shardy/mhlo_round_trip/export_shardings.cc
- third_party/xla/xla/service/spmd/shardy/mhlo_round_trip/export_shardings.h
- third_party/xla/xla/service/spmd/shardy/mhlo_round_trip/mhlo_export.cc
- third_party/xla/xla/service/spmd/shardy/mhlo_round_trip/mhlo_export.h
- third_party/xla/xla/service/spmd/shardy/mhlo_round_trip/mhlo_import.cc
- third_party/xla/xla/service/spmd/shardy/mhlo_round_trip/mhlo_import.h
- third_party/xla/xla/service/spmd/shardy/mhlo_round_trip/shard_map_export.cc
- third_party/xla/xla/service/spmd/shardy/mhlo_round_trip/shard_map_export.h
- third_party/xla/xla/service/spmd/shardy/round_trip_common/BUILD
- third_party/xla/xla/service/spmd/shardy/round_trip_common/convert_sharding_custom_calls.cc
- third_party/xla/xla/service/spmd/shardy/round_trip_common/convert_sharding_custom_calls.h
- third_party/xla/xla/service/spmd/shardy/round_trip_common/identity_to_pass_through_while_args.cc
- third_party/xla/xla/service/spmd/shardy/round_trip_common/identity_to_pass_through_while_args.h
- third_party/xla/xla/service/spmd/shardy/round_trip_common/import_constants.cc
- third_party/xla/xla/service/spmd/shardy/round_trip_common/import_constants.h
- third_party/xla/xla/service/spmd/shardy/round_trip_common/pipeline_passes.cc
- third_party/xla/xla/service/spmd/shardy/round_trip_common/pipeline_passes.h
- third_party/xla/xla/service/spmd/shardy/round_trip_common/shard_map_import.cc
- third_party/xla/xla/service/spmd/shardy/round_trip_common/shard_map_import.h
- third_party/xla/xla/service/spmd/shardy/sdy_opt_main.cc
- third_party/xla/xla/service/spmd/shardy/sdy_round_trip/BUILD
- third_party/xla/xla/service/spmd/shardy/sdy_round_trip/export_ops.cc
- third_party/xla/xla/service/spmd/shardy/sdy_round_trip/export_ops.h
- third_party/xla/xla/service/spmd/shardy/sdy_round_trip/export_shardings.cc
- third_party/xla/xla/service/spmd/shardy/sdy_round_trip/export_shardings.h
- third_party/xla/xla/service/spmd/shardy/sdy_round_trip/import_shardings.cc
- third_party/xla/xla/service/spmd/shardy/sdy_round_trip/import_shardings.h
- third_party/xla/xla/service/spmd/shardy/sdy_round_trip/pipelines.cc
- third_party/xla/xla/service/spmd/shardy/sdy_round_trip/pipelines.h
- third_party/xla/xla/service/spmd/shardy/sdy_round_trip/test_utils/BUILD
- third_party/xla/xla/service/spmd/shardy/sdy_round_trip/test_utils/mhlo_to_hlo_to_mhlo.cc
- third_party/xla/xla/service/spmd/shardy/sdy_round_trip/test_utils/mhlo_to_hlo_to_mhlo.h
- third_party/xla/xla/service/spmd/shardy/sdy_round_trip/test_utils/testing_pipeline.cc
- third_party/xla/xla/service/spmd/shardy/sdy_round_trip/test_utils/testing_pipeline.h
- third_party/xla/xla/service/spmd/shardy/shardonnay_call_inliner.cc
- third_party/xla/xla/service/spmd/shardy/shardonnay_call_inliner.h
- third_party/xla/xla/service/spmd/shardy/shardonnay_call_inliner_test.cc
- third_party/xla/xla/service/spmd/shardy/shardonnay_xla_pass.cc
- third_party/xla/xla/service/spmd/shardy/shardonnay_xla_pass.h
- third_party/xla/xla/service/spmd/shardy/shardonnay_xla_pass_test.cc
- third_party/xla/xla/service/spmd/shardy/test/BUILD
- third_party/xla/xla/service/spmd/shardy/test/import_shardings.mlir
- third_party/xla/xla/service/spmd/shardy/test/mhlo_export_pipeline.mlir
- third_party/xla/xla/service/spmd/shardy/test/mhlo_import_pipeline.mlir
- third_party/xla/xla/service/spmd/shardy/test/round_trip_pipeline.mlir
- third_party/xla/xla/service/spmd/shardy/test/round_trip_pipeline_manual_computation.mlir
- third_party/xla/xla/service/spmd/shardy/test/sdy_round_trip_export_pipeline.mlir
- third_party/xla/xla/service/spmd/shardy/test/sdy_round_trip_import_pipeline.mlir
- third_party/xla/xla/service/spmd/shardy/test/shard_map_export.mlir
- third_party/xla/xla/service/spmd/shardy/test/shard_map_import.mlir
- third_party/xla/xla/service/spmd/shardy/test/shard_map_import_failure.mlir
- third_party/xla/xla/service/spmd/shardy/utils.cc
- third_party/xla/xla/service/spmd/shardy/utils.h