TensorFlow changelog


Hey there, awesome coders! 🎉 We've got some exciting updates and bug fixes in the latest release. Here's a quick rundown of what's new and improved:

  • New Feature 🚀: GroupExecute API for XLA GPU Collectives
    We've added a shiny new GroupExecute API to the XLA GPU collectives. This nifty feature supercharges group-based execution for collective communication patterns, making collective communication in parallel machine-learning and other computational workloads more efficient.

  • New Feature 🛠️: Pre-Calibration Magic in TensorFlow MLIR
    Say hello to tf_pre_calibration! This new component in the TensorFlow MLIR quantization framework is all about pre-calibration transformations during post-training static-range quantization. It collects quantization statistics and processes quantizable functions like a champ.

  • New Feature 💡: Device Assignment in XLA CPU Backend
    We've empowered the XLA CPU backend to let you pass device assignments to NanoRt. This means more flexibility and efficiency in managing computations across multiple devices. Yay for smarter resource allocation!

  • Improvement 🔧: Memory Space Allocation in TfrtGpuClient
    Our TfrtGpuClient just got a boost with platform and memory space allocator support. This upgrade means better GPU resource management, making your GPU applications run smoother than ever.

  • New Feature 🚀: Data Transfer in PJRT Async GPU Client
    Introducing TransferToInfeed and TransferFromOutfeed in the PJRT async GPU client! These functions make data transfers to and from infeed and outfeed buffers a breeze, enhancing GPU data handling within the XLA framework.

  • Improvement 🛠️: Refactored TensorFlow MLIR Passes
    We've reorganized the TensorFlow MLIR quantization passes into a new namespace, tf_passes. This refactor improves modularity and maintainability, making future development and enhancements a walk in the park.

  • Improvement 🔒: Shutdown Method in PreemptionSyncManager
    A new Shutdown method in PreemptionSyncManager ensures a smooth and controlled shutdown process, enhancing system stability and reliability.

  • Bugfix 🐛: CUDA Graph Launch Callback
    We've squashed a bug related to missing CUDA graph launch callbacks in the latest CUDA versions. Now, your GPU profiling should be as accurate as ever!

  • New Feature 📈: Post-Calibration in TensorFlow MLIR
    Meet tf_post_calibration! This new library in the TensorFlow MLIR quantization framework performs post-calibration graph transformations, optimizing model performance after quantization.

  • Chore 🔄: Internal Directory Restructure
    We've reorganized the TensorFlow codebase, focusing on directory structure and build configurations. This cleanup aims to streamline development and improve maintainability.

  • Bugfix 🔧: Deadlock in Tracked Device Buffer
    We've fixed a potential deadlock issue in the XLA framework by replacing on_ready_tasks_callback_ with AndThen callbacks. This change ensures reliable task execution without any hiccups.

  • Bugfix 🐞: Concurrent Collective Creation
    We've tackled a bug in the XLA library related to concurrent collective creation. Now, communicators are created safely in a multi-threaded environment, making collective communication more robust.

Enjoy these updates, and happy coding! 🎉

Included Commits

2025-05-02T16:55:35 See commit

This commit addresses a bug related to concurrent collective creation in the XLA (Accelerated Linear Algebra) library. The changes primarily involve modifications to the cpu_cliques.cc file, where new mechanisms have been introduced to ensure that communicators are created safely in a multi-threaded environment. Specifically, the implementation now utilizes absl::once_flag to guarantee that each communicator for a given rank is constructed only once, preventing race conditions that could arise when multiple threads attempt to create the same communicator simultaneously.

Additionally, the commit adds the necessary Abseil headers, including absl/container/flat_hash_map.h and absl/base/call_once.h, to support the new functionality. The updates introduce data structures to track the status of communicator creation and ensure thread safety through mutex locks. Overall, these changes make the collective communication features in the XLA library more robust in concurrent execution scenarios.
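The once-per-rank construction pattern the commit describes with absl::once_flag can be sketched conceptually (here in Python for brevity; the names CliqueState and get_or_create are illustrative, not the real cpu_cliques API):

```python
# Sketch: each rank's communicator is built exactly once, even when many
# threads request it concurrently. A per-rank lock plays the role of
# absl::once_flag; an outer mutex guards the map of per-rank slots.
import threading

class Communicator:
    def __init__(self, rank):
        self.rank = rank

class CliqueState:
    def __init__(self):
        self._mu = threading.Lock()   # guards the dict of per-rank slots
        self._slots = {}              # rank -> (once_lock, [communicator])

    def get_or_create(self, rank):
        with self._mu:
            slot = self._slots.setdefault(rank, (threading.Lock(), [None]))
        once, cell = slot
        with once:                    # stands in for absl::call_once
            if cell[0] is None:
                cell[0] = Communicator(rank)
        return cell[0]

state = CliqueState()
results = [None] * 8
threads = [threading.Thread(target=lambda i=i: results.__setitem__(i, state.get_or_create(0)))
           for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
assert all(r is results[0] for r in results)  # exactly one instance for rank 0
```

Without the once-guard, two threads could both observe "no communicator yet" and construct duplicates; the guard closes that race.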

Files changed

  • third_party/xla/xla/backends/cpu/collectives/BUILD
  • third_party/xla/xla/backends/cpu/collectives/cpu_cliques.cc
2025-05-02T17:00:11 See commit

This commit introduces a new Shutdown method to the PreemptionSyncManager class, enhancing its functionality by allowing for a controlled shutdown process. The method ensures that the manager can safely terminate its operations without interference from other methods, which is crucial for maintaining system stability. The implementation includes mutex locking to protect shared resources and prevent race conditions, along with logging to provide visibility into the shutdown process.

Additionally, the commit updates relevant header and source files to accommodate these changes, ensuring that the shutdown functionality is integrated smoothly into the existing architecture. The commit also includes new unit tests to validate the behavior of the Shutdown method under various scenarios, including cases where shutdown occurs without preemption or initialization. Overall, this enhancement aims to improve the robustness and reliability of the PreemptionSyncManager.
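The mutex-guarded, idempotent shutdown behavior described above can be sketched as follows (hypothetical names; this is not the real PreemptionSyncManager interface):

```python
# Sketch: Shutdown takes the same lock as other methods, so no operation can
# interleave with it, and work requested after shutdown is refused.
import threading

class PreemptionSyncManager:
    def __init__(self):
        self._mu = threading.Lock()
        self._shut_down = False
        self.sync_calls = 0

    def reached_sync_point(self, step):
        with self._mu:
            if self._shut_down:
                return False          # refuse work after shutdown
            self.sync_calls += 1
            return True

    def shutdown(self):
        with self._mu:                # other methods cannot interleave here
            if self._shut_down:
                return                # idempotent: a second call is a no-op
            self._shut_down = True

mgr = PreemptionSyncManager()
assert mgr.reached_sync_point(1)
mgr.shutdown()
mgr.shutdown()                        # safe to call twice
assert not mgr.reached_sync_point(2)  # rejected after shutdown
assert mgr.sync_calls == 1
```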

Files changed

  • third_party/xla/xla/tsl/distributed_runtime/preemption/BUILD
  • third_party/xla/xla/tsl/distributed_runtime/preemption/preemption_sync_manager.cc
  • third_party/xla/xla/tsl/distributed_runtime/preemption/preemption_sync_manager.h
  • third_party/xla/xla/tsl/distributed_runtime/preemption/preemption_sync_manager_test.cc
2025-05-03T23:04:05 See commit

This commit addresses a deadlock issue in the tracked_device_buffer component of the XLA (Accelerated Linear Algebra) framework by replacing the existing on_ready_tasks_callback_ mechanism with AndThen callbacks on the AsyncValueRef. The previous implementation could lead to a race condition where a task's callback might not be executed if the defined_status_ was set to available before the callback was added, potentially resulting in a deadlock situation.

To resolve this, the commit modifies the task execution logic to ensure that tasks are executed immediately when the definition event is available. If the event is not yet ready, the task is added to the defined_status_ async value, which will handle its execution once it becomes available. This change simplifies the code by eliminating the need for a separate callback map and enhances the overall reliability of task execution within the buffer sequencing event handling.
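The run-now-or-defer pattern that replaces the callback map can be sketched with a toy async value (illustrative names; a stand-in for AsyncValueRef, not the XLA type):

```python
# Sketch: if the value is already available, run the task immediately;
# otherwise attach it with an AndThen-style callback. Checking availability
# and registering the callback happen under one lock, closing the window in
# which a callback added "too late" would never fire.
import threading

class AsyncValue:
    def __init__(self):
        self._mu = threading.Lock()
        self._available = False
        self._callbacks = []

    def and_then(self, fn):
        with self._mu:
            if not self._available:
                self._callbacks.append(fn)   # defer until set_available
                return
        fn()                                  # already available: run now

    def set_available(self):
        with self._mu:
            self._available = True
            pending, self._callbacks = self._callbacks, []
        for fn in pending:                    # run outside the lock
            fn()

ran = []
ev = AsyncValue()
ev.and_then(lambda: ran.append("early"))      # deferred until availability
ev.set_available()
ev.and_then(lambda: ran.append("late"))       # runs immediately
assert ran == ["early", "late"]
```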

Files changed

  • third_party/xla/xla/pjrt/tracked_device_buffer.cc
  • third_party/xla/xla/pjrt/tracked_device_buffer.h
2025-05-05T22:02:14 See commit

This commit implements two functions, TransferToInfeed and TransferFromOutfeed, in the TfrtGpuDevice class of the PJRT async GPU client. These functions transfer data to and from the infeed and outfeed buffers, respectively. The implementation includes a helper method, GetTransferManager, which retrieves the appropriate transfer manager from the client so that data transfers are managed effectively. In tfrt_gpu_client.cc, the previous placeholder "Unimplemented" responses are replaced with functional code that performs the intended data transfer operations.

Additionally, the header file tfrt_gpu_client.h has been updated to declare the new GetTransferManager method, enhancing the class's interface. This commit is part of ongoing development to improve the efficiency and functionality of GPU data handling within the XLA (Accelerated Linear Algebra) framework, specifically enhancing the capabilities of the GPU client within the TensorFlow ecosystem.
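Conceptually, infeed and outfeed are host-visible queues feeding and draining a running device program. A minimal host-side sketch (plain queues stand in for the device buffers; ToyDevice and its methods are illustrative, not the PJRT API):

```python
# Sketch: transfer_to_infeed pushes host data toward the device; the program
# consumes infeed and produces outfeed; transfer_from_outfeed pulls results
# back to the host.
import queue

class ToyDevice:
    def __init__(self):
        self._infeed = queue.Queue()
        self._outfeed = queue.Queue()

    def transfer_to_infeed(self, literal):
        self._infeed.put(literal)            # host -> device-bound buffer

    def transfer_from_outfeed(self):
        return self._outfeed.get()           # device-produced -> host

    def run_step(self):
        # Stand-in for a compiled program that consumes infeed and
        # produces outfeed (here it simply doubles every element).
        data = self._infeed.get()
        self._outfeed.put([2 * x for x in data])

dev = ToyDevice()
dev.transfer_to_infeed([1, 2, 3])
dev.run_step()
assert dev.transfer_from_outfeed() == [2, 4, 6]
```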

Files changed

  • third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_client.cc
  • third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_client.h
2025-05-06T04:24:39 See commit

This commit introduces support for a platform and memory space allocator within the TfrtGpuClient, enhancing its functionality in GPU resource management. The changes affect several files, including modifications to the Tfrt GPU client and its associated tests, as well as updates to the integration with the TensorFlow allocator adapter.

Key files impacted by this commit include tfrt_gpu_client.cc, tracked_tfrt_gpu_device_buffer.cc, and the corresponding test files, which have been updated to reflect the new memory allocation capabilities. The modifications aim to improve the efficiency and flexibility of memory management in GPU applications using the TensorFlow runtime (TFRT).

Files changed

  • third_party/xla/xla/pjrt/gpu/tfrt/BUILD
  • third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_buffer_test.cc
  • third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_client.cc
  • third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_client.h
  • third_party/xla/xla/pjrt/gpu/tfrt/tfrt_gpu_client_test.cc
  • third_party/xla/xla/pjrt/gpu/tfrt/tracked_tfrt_gpu_device_buffer.cc
  • third_party/xla/xla/pjrt/gpu/tfrt/tracked_tfrt_gpu_device_buffer.h
  • third_party/xla/xla/pjrt/gpu/tfrt/tracked_tfrt_gpu_device_buffer_test.cc
  • third_party/xla/xla/stream_executor/integrations/BUILD
  • third_party/xla/xla/stream_executor/integrations/tf_allocator_adapter.cc
  • third_party/xla/xla/stream_executor/integrations/tf_allocator_adapter.h
2025-05-06T23:45:18 See commit

This commit focuses on refactoring the TensorFlow MLIR quantization passes by forking the remaining passes into a dedicated namespace, specifically tf_passes. The changes include the addition of multiple new files related to various quantization operations, optimizations, and utility functions, which are crucial for enhancing the MLIR framework's capabilities. Additionally, several existing files have been modified to accommodate the new structure, ensuring a clear organization of the quantization-related functionalities.

The commit also encompasses a comprehensive suite of new MLIR tests designed to validate the functionality of the newly added passes and modifications. These tests cover a wide range of operations, including weight quantization, function insertion, and optimization processes. Overall, this update aims to improve the modularity and maintainability of the quantization passes within TensorFlow's MLIR infrastructure, facilitating future development and enhancements.

Files changed

  • tensorflow/compiler/mlir/quantization/tensorflow/BUILD
  • tensorflow/compiler/mlir/quantization/tensorflow/ops/tf_op_quant_spec.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/ops/tf_quantize_op.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_add_dump_tensor_op.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_cast_bf16_ops_to_f32.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_cast_bf16_ops_to_f32.td
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_convert_custom_aggregation_op_to_quant_stats.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_convert_tf_xla_op_to_tf_op.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_convert_tf_xla_op_to_tf_op.td
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_convert_tpu_model_to_cpu.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_convert_tpu_model_to_cpu.td
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_duplicate_shape_determining_constants.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_insert_custom_aggregation_ops.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_insert_main_function.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_insert_quantized_functions.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_insert_restore_op.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_insert_save_op.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_lift_hashtable_ops_as_args.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_lift_quantizable_spots_as_functions.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_lift_quantizable_spots_as_functions_drq.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_lift_quantizable_spots_as_functions_drq.td
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_mark_functions_noinline.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_merge_duplicate_resource_ops.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_merge_initializer_function_ops_to_main.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_merge_save_function_ops_to_main.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_optimize.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_optimize.td
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_passes.h
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_prepare_lifting.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_propagate_quantize_type.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_quantize_composite_functions.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_quantize_composite_functions.td
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_quantize_weights.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_remove_var_init_by_const.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_replace_cast_hacks_with_tf_xla_ops.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_replace_cast_hacks_with_tf_xla_ops.td
  • tensorflow/compiler/mlir/quantization/tensorflow/passes/tf_unfreeze_constants.cc
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/quantize_weights.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_add_dump_tensor_op.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_cast_bf16_ops_to_f32.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_convert_custom_aggregation_op_to_quant_stats.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_convert_tf_xla_op_to_tf_op.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_convert_tpu_model_to_cpu.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_duplicate_shape_determining_constants.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_insert_custom_aggregation_ops.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_insert_main_function.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_insert_quantized_functions.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_insert_restore_op.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_insert_save_op.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_lift_hashtable_ops_as_args.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_lift_quantizable_spots_as_functions_drq.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_mark_functions_noinline.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_merge_duplicate_resource_ops.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_merge_initializer_function_ops_to_main.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_merge_save_function_ops_to_main.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_optimize.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_propagate_quantize_type.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_quantize_composite_functions.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_quantize_weights.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_remove_var_init_by_const.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_replace_cast_hacks_with_tf_xla_ops.mlir
  • tensorflow/compiler/mlir/quantization/tensorflow/tests/tf_unfreeze_constants.mlir
2025-05-07T19:40:02 See commit

This commit introduces a new library, tf_post_calibration, to the TensorFlow MLIR quantization framework, specifically under the stablehlo directory. The new library consists of two files: tf_post_calibration.cc and tf_post_calibration.h, which implement the PostCalibrationComponent class. This component is responsible for performing post-calibration graph transformations as part of post-training static-range quantization. It utilizes collected statistics from the calibration step to generate quantized StableHLO operations, which are serialized in TF::XlaCallModuleOps.

Additionally, the commit modifies the BUILD file to include the new library and its dependencies, enhancing the overall functionality of the quantization process within TensorFlow. The changes reflect a significant step towards improving the quantization capabilities, ensuring that the resulting modules are optimized based on the calibration data, which is crucial for effective model performance after quantization.

Files changed

  • tensorflow/compiler/mlir/quantization/stablehlo/cc/BUILD
  • tensorflow/compiler/mlir/quantization/stablehlo/cc/tf_post_calibration.cc
  • tensorflow/compiler/mlir/quantization/stablehlo/cc/tf_post_calibration.h
2025-05-07T20:18:20 See commit

The commit addresses a critical issue related to the integration of CUDA graph launch callback events in the latest versions of CUDA. Specifically, it modifies the device_tracer_cuda.cc file within the XLA (Accelerated Linear Algebra) backend profiler to ensure that necessary CUDA graph callbacks, such as cuGraphLaunch, are included. This adjustment is essential because, without these callbacks, the CUPTI (CUDA Profiling Tools Interface) would fail to send the appropriate events, potentially leading to incomplete profiling and performance analysis.

The changes involve adding several CUDA graph-related callback identifiers that are now required in CUDA versions 12.8 and above. The commit also includes updates to existing callback identifiers for kernel launches and memory operations, reflecting a more comprehensive approach to tracking GPU activities. Overall, this enhancement aims to improve the accuracy and effectiveness of GPU profiling in the context of XLA, ensuring that developers can gather detailed performance metrics when utilizing CUDA's advanced features.

Files changed

  • third_party/xla/xla/backends/profiler/gpu/device_tracer_cuda.cc
2025-05-08T01:22:09 See commit

The commit titled "Internal dir restructure" involves a significant reorganization of the TensorFlow codebase, specifically focusing on the directory structure and build configurations. Multiple files across various components, including CI scripts, TensorFlow's core modules, and third-party dependencies, have been modified to reflect this new structure. Notably, many BUILD files have been updated, indicating changes in how the build system is configured for different TensorFlow components, such as MLIR, Lite, and Python interfaces.

In addition to modifications, the commit also includes the removal of numerous files related to third-party libraries, suggesting a cleanup or consolidation of dependencies. This restructuring aims to streamline the development process and improve the maintainability of the codebase, potentially enhancing the overall performance and organization of TensorFlow's components. The extensive changes across both modified and removed files highlight a comprehensive effort to refine the project's architecture.

Files changed

  • ci/official/containers/linux_arm64/devel.usertools/code_check_full.bats
  • ci/official/utilities/code_check_full.bats
  • tensorflow/BUILD
  • tensorflow/compiler/mlir/lite/experimental/tac/py_wrapper/BUILD
  • tensorflow/compiler/mlir/lite/integrations/BUILD
  • tensorflow/compiler/mlir/lite/python/BUILD
  • tensorflow/compiler/mlir/lite/python/interpreter_wrapper/BUILD
  • tensorflow/compiler/mlir/quantization/tensorflow/python/BUILD
  • tensorflow/compiler/mlir/stablehlo/BUILD
  • tensorflow/compiler/mlir/tensorflow_to_stablehlo/python/BUILD
  • tensorflow/core/tfrt/graph_executor/python/BUILD
  • tensorflow/core/tfrt/saved_model/python/BUILD
  • tensorflow/lite/experimental/genai/BUILD
  • tensorflow/lite/kernels/BUILD
  • tensorflow/lite/python/interpreter_wrapper/BUILD
  • tensorflow/lite/python/metrics/BUILD
  • tensorflow/lite/python/optimize/BUILD
  • tensorflow/lite/python/testdata/BUILD
  • tensorflow/lite/testing/BUILD
  • tensorflow/lite/toco/python/BUILD
  • tensorflow/lite/tools/optimize/sparsity/BUILD
  • tensorflow/opensource_only.files
  • tensorflow/python/BUILD
  • tensorflow/python/client/BUILD
  • tensorflow/python/data/experimental/service/BUILD
  • tensorflow/python/eager/BUILD
  • tensorflow/python/framework/BUILD
  • tensorflow/python/lib/core/BUILD
  • tensorflow/python/platform/BUILD
  • tensorflow/python/tpu/BUILD
  • tensorflow/python/util/BUILD
  • tensorflow/tensorflow.bzl
  • tensorflow/tools/lib_package/BUILD
  • tensorflow/tools/pip_package/BUILD
  • tensorflow/tools/tf_sig_build_dockerfiles/devel.usertools/code_check_full.bats
  • tensorflow/tools/toolchains/cpus/aarch64/aarch64.bzl
  • tensorflow/tools/toolchains/cpus/aarch64/aarch64_compiler_configure.bzl
  • tensorflow/tools/toolchains/remote_config/rbe_config.bzl
  • tensorflow/workspace0.bzl
  • tensorflow/workspace2.bzl
  • third_party/FP16/FP16.BUILD
  • third_party/FP16/workspace.bzl
  • third_party/absl/com_google_absl.BUILD
  • third_party/absl/invert_the_is_inline_bin.patch
  • third_party/absl/nvidia_jetson.patch
  • third_party/absl/system.BUILD
  • third_party/absl/system.absl.algorithm.BUILD
  • third_party/absl/system.absl.base.BUILD
  • third_party/absl/system.absl.cleanup.BUILD
  • third_party/absl/system.absl.container.BUILD
  • third_party/absl/system.absl.debugging.BUILD
  • third_party/absl/system.absl.flags.BUILD
  • third_party/absl/system.absl.functional.BUILD
  • third_party/absl/system.absl.hash.BUILD
  • third_party/absl/system.absl.memory.BUILD
  • third_party/absl/system.absl.meta.BUILD
  • third_party/absl/system.absl.numeric.BUILD
  • third_party/absl/system.absl.random.BUILD
  • third_party/absl/system.absl.status.BUILD
  • third_party/absl/system.absl.strings.BUILD
  • third_party/absl/system.absl.synchronization.BUILD
  • third_party/absl/system.absl.time.BUILD
  • third_party/absl/system.absl.types.BUILD
  • third_party/absl/system.absl.utility.BUILD
  • third_party/benchmark/workspace.bzl
  • third_party/clang_toolchain/cc_configure_clang.bzl
  • third_party/clang_toolchain/download_clang.bzl
  • third_party/compute_library/build_defs.bzl
  • third_party/compute_library/compute_library.patch
  • third_party/cudnn_frontend.BUILD
  • third_party/cutlass.BUILD
  • third_party/cython.BUILD
  • third_party/dlpack/dlpack.BUILD
  • third_party/dlpack/workspace.bzl
  • third_party/ducc/ducc.BUILD
  • third_party/ducc/ducc0_custom_lowlevel_threading.h
  • third_party/ducc/fft.cc
  • third_party/ducc/fft.h
  • third_party/ducc/threading.cc
  • third_party/ducc/threading.h
  • third_party/ducc/workspace.bzl
  • third_party/eigen3/LICENSE
  • third_party/eigen3/eigen_archive.BUILD
  • third_party/eigen3/workspace.bzl
  • third_party/farmhash/farmhash.BUILD
  • third_party/farmhash/farmhash_gpu.BUILD
  • third_party/farmhash/farmhash_support_cuda.patch
  • third_party/farmhash/workspace.bzl
  • third_party/gemmlowp/workspace.bzl
  • third_party/git/BUILD.tpl
  • third_party/gloo/gloo.BUILD
  • third_party/gloo/workspace.bzl
  • third_party/googletest/BUILD.bazel
  • third_party/googletest/googletest.patch
  • third_party/gpus/check_cuda_libs.py
  • third_party/gpus/crosstool/BUILD.rocm.tpl
  • third_party/gpus/crosstool/BUILD.sycl.tpl
  • third_party/gpus/crosstool/BUILD.tpl
  • third_party/gpus/crosstool/LICENSE
  • third_party/gpus/crosstool/cc_toolchain_config.bzl.tpl
  • third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl
  • third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_rocm.tpl
  • third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_sycl.tpl
  • third_party/gpus/crosstool/hipcc_cc_toolchain_config.bzl.tpl
  • third_party/gpus/crosstool/sycl_cc_toolchain_config.bzl.tpl
  • third_party/gpus/crosstool/windows/msvc_wrapper_for_nvcc.py.tpl
  • third_party/gpus/cuda/BUILD.tpl
  • third_party/gpus/cuda/BUILD.windows.tpl
  • third_party/gpus/cuda/build_defs.bzl.tpl
  • third_party/gpus/cuda/cuda_config.h.tpl
  • third_party/gpus/cuda/cuda_config.py.tpl
  • third_party/gpus/cuda/hermetic/BUILD.tpl
  • third_party/gpus/cuda/hermetic/cuda_driver.BUILD.tpl
  • third_party/gpus/cuda/hermetic/cuda_nvprune.BUILD.tpl
  • third_party/gpus/cuda/hermetic/cuda_nvrtc.BUILD.tpl
  • third_party/gpus/find_cuda_config.py
  • third_party/gpus/find_rocm_config.py
  • third_party/gpus/find_sycl_config.py
  • third_party/gpus/local_config_cuda.BUILD
  • third_party/gpus/rocm/build_defs.bzl.tpl
  • third_party/gpus/rocm/rocm_config.h.tpl
  • third_party/gpus/rocm/rocm_redist.bzl
  • third_party/gpus/rocm/rocm_redist_ubuntu_20_04.bzl
  • third_party/gpus/rocm/rocm_redist_ubuntu_22_04.bzl
  • third_party/gpus/rocm/rocm_redist_ubuntu_24_04.bzl
  • third_party/gpus/sycl/BUILD.tpl
  • third_party/gpus/sycl/build_defs.bzl.tpl
  • third_party/gpus/sycl_configure.bzl
  • third_party/grpc/generate_cc_env_fix.patch
  • third_party/grpc/register_go_toolchain.patch
  • third_party/grpc/upb_platform_fix.patch
  • third_party/hwloc/BUILD
  • third_party/hwloc/BUILD.system
  • third_party/hwloc/hwloc.BUILD
  • third_party/hwloc/static-components.h
  • third_party/hwloc/workspace.bzl
  • third_party/implib_so/get_symbols.py
  • third_party/implib_so/implib_so.BUILD
  • third_party/implib_so/make_stub.py
  • third_party/implib_so/workspace.bzl
  • third_party/llvm_openmp/cmake_vars.bzl
  • third_party/llvm_openmp/expand_cmake_vars.py
  • third_party/llvm_openmp/openmp_switch_default_patch.patch
  • third_party/mkl_dnn/LICENSE
  • third_party/mkl_dnn/mkldnn_acl.BUILD
  • third_party/mkl_dnn/mkldnn_v1.BUILD
  • third_party/mkl_dnn/setting_init.patch
  • third_party/mpitrampoline/mpitrampoline.BUILD
  • third_party/nanobind/nanobind.BUILD
  • third_party/nanobind/workspace.bzl
  • third_party/nasm/BUILD
  • third_party/nasm/BUILD.system
  • third_party/nasm/config.h
  • third_party/nasm/nasm.BUILD
  • third_party/nasm/workspace.bzl
  • third_party/nccl/LICENSE
  • third_party/nccl/archive.patch
  • third_party/nccl/build_defs.bzl.tpl
  • third_party/nccl/generated_names.bzl.tpl
  • third_party/nccl/hermetic/cuda_nccl.BUILD.tpl
  • third_party/nccl/nccl_configure.bzl
  • third_party/nvshmem/nvshmem.BUILD
  • third_party/nvshmem/workspace.bzl
  • third_party/nvtx.BUILD
  • third_party/nvtx/LICENSE
  • third_party/ortools/bliss.BUILD
  • third_party/ortools/glpk.BUILD
  • third_party/ortools/ortools.patch
  • third_party/ortools/scip.BUILD
  • third_party/ortools/scip.patch
  • third_party/py/python_configure.bzl
  • third_party/pybind11.BUILD
  • third_party/pybind11_abseil/remove_license.patch
  • third_party/pybind11_bazel/pybind11_bazel.patch
  • third_party/pybind11_bazel/workspace.bzl
  • third_party/remote_config/BUILD.tpl
  • third_party/remote_config/remote_platform_configure.bzl
  • third_party/robin_map/robin_map.BUILD
  • third_party/robin_map/workspace.bzl
  • third_party/six.BUILD
  • third_party/snappy.BUILD
  • third_party/spirv_llvm_translator/BUILD
  • third_party/spirv_llvm_translator/spirv_llvm_translator.BUILD
  • third_party/spirv_llvm_translator/spirv_llvm_translator.patch
  • third_party/systemlibs/pybind11.BUILD
  • third_party/tensorrt/tensorrt_configure.bzl
  • third_party/uv/uv.BUILD
  • third_party/xla/opensource_only.files
  • third_party/xla/third_party/FP16/BUILD
  • third_party/xla/third_party/gpus/rocm_configure.bzl
  • third_party/xla/third_party/hwloc/BUILD
  • third_party/xla/third_party/mkl_dnn/BUILD
  • third_party/xla/third_party/nasm/BUILD
  • third_party/xla/third_party/pybind11_abseil/BUILD
  • third_party/xla/third_party/spirv_llvm_translator/BUILD
  • third_party/xla/tools/toolchains/cpus/aarch64/aarch64.bzl
  • third_party/xla/tools/toolchains/cpus/aarch64/aarch64_compiler_configure.bzl
2025-05-08T17:49:47 See commit

This commit introduces a new component, tf_pre_calibration, to the TensorFlow MLIR quantization framework, specifically under the stablehlo directory. The new library includes source and header files (tf_pre_calibration.cc and tf_pre_calibration.h), which define a PreCalibrationComponent class responsible for performing pre-calibration transformations during post-training static-range quantization. The component integrates with existing quantization options and configurations, enabling the collection of quantization statistics and facilitating the processing of quantizable functions through the use of TF::CustomAggregatorOp and TF::XlaCallModuleOp.

Additionally, the commit modifies the BUILD file to register the new tf_pre_calibration library, specifying its dependencies and visibility. This enhancement is aimed at improving the quantization capabilities of TensorFlow by streamlining the pre-calibration process, ultimately contributing to more efficient model quantization. The changes reflect a focus on modularity and compatibility within the TensorFlow quantization ecosystem.

Files changed

  • tensorflow/compiler/mlir/quantization/stablehlo/cc/BUILD
  • tensorflow/compiler/mlir/quantization/stablehlo/cc/tf_pre_calibration.cc
  • tensorflow/compiler/mlir/quantization/stablehlo/cc/tf_pre_calibration.h
2025-05-08T18:18:16 See commit

This commit introduces enhancements to the XLA (Accelerated Linear Algebra) CPU backend, specifically allowing users to pass device assignments to the NanoRt (Nano Runtime) execution environment. The changes involve modifications to several files, including updates to the nanort_executable and nanort_client_test components. Key additions include new options in the ExecuteOptions class to set local and global device IDs, as well as device assignments. This functionality is crucial for managing computations across multiple devices, improving the flexibility and efficiency of resource allocation during execution.

Furthermore, the commit includes a new test case in nanort_client_test.cc that verifies the correct functioning of the device assignment feature. The test checks that the correct replica and partition IDs are returned when executing a computation with specified device assignments. Overall, these changes enhance the capability of the XLA CPU backend to handle complex multi-device scenarios, thereby optimizing performance for users working with distributed computations.
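A device assignment of the kind the ExecuteOptions above accept is essentially a (replica, partition) grid of device IDs. A minimal sketch, with illustrative names (not the XLA DeviceAssignment class):

```python
# Sketch: forward lookup maps (replica, partition) to a device id; the
# inverse lookup recovers which logical (replica, partition) slot a given
# device is running -- the kind of check the new test performs.
class DeviceAssignment:
    def __init__(self, grid):
        self.grid = grid                     # grid[replica][partition] = device id

    def device_id(self, replica, partition):
        return self.grid[replica][partition]

    def logical_ids(self, device):
        for r, row in enumerate(self.grid):
            for p, d in enumerate(row):
                if d == device:
                    return (r, p)
        raise ValueError(f"device {device} not in assignment")

# 2 replicas x 2 partitions over devices 0..3.
da = DeviceAssignment([[0, 1], [2, 3]])
assert da.device_id(1, 0) == 2
assert da.logical_ids(3) == (1, 1)
```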

Files changed

  • third_party/xla/xla/backends/cpu/nanort/BUILD
  • third_party/xla/xla/backends/cpu/nanort/nanort_client_test.cc
  • third_party/xla/xla/backends/cpu/nanort/nanort_executable.cc
  • third_party/xla/xla/backends/cpu/nanort/nanort_executable.h
  • third_party/xla/xla/backends/cpu/runtime/thunk.h
2025-05-08T20:39:56 See commit

This commit introduces a new GroupExecute API to the XLA GPU collectives, enhancing the functionality for collective operations in GPU computing. The addition aims to improve the efficiency and performance of group-based execution for collective communication patterns, which are crucial for parallel processing tasks in machine learning and other computational applications.

Several files across the XLA GPU backend have been modified to accommodate this new API, including updates to the core collective functions and related headers. Notably, new header files have been added, and existing implementations for various collective operations like all-gather, all-reduce, and collective broadcasts have been updated to integrate the GroupExecute functionality. This update signifies a step forward in optimizing GPU collective operations, potentially benefiting users who rely on XLA for high-performance computing tasks.
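Group-based execution of collectives can be sketched in the spirit of NCCL's ncclGroupStart/ncclGroupEnd semantics: operations issued inside a group are buffered and launched together when the group closes. This is an illustration of the pattern, not the GroupExecute API itself:

```python
# Sketch: collectives issued outside a group launch one at a time; inside
# group_execute they are buffered and submitted as a single fused batch,
# reducing launch overhead for patterns like all-gather + broadcast.
class GroupExecutor:
    def __init__(self):
        self._group = None                   # None => launch immediately
        self.launched = []                   # batches actually launched

    def group_execute(self, ops):
        self._group = []
        for op in ops:                       # each op issues collectives
            op(self)
        batch, self._group = self._group, None
        self.launched.append(batch)          # one launch for the whole group

    def collective(self, name):
        if self._group is not None:
            self._group.append(name)         # buffered until the group closes
        else:
            self.launched.append([name])     # ungrouped: one launch per op

ex = GroupExecutor()
ex.collective("all-reduce")                  # launched on its own
ex.group_execute([lambda e: e.collective("all-gather"),
                  lambda e: e.collective("broadcast")])
assert ex.launched == [["all-reduce"], ["all-gather", "broadcast"]]
```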

Files changed

  • third_party/xla/xla/backends/gpu/collectives/BUILD
  • third_party/xla/xla/backends/gpu/collectives/gpu_collectives.cc
  • third_party/xla/xla/backends/gpu/collectives/gpu_collectives.h
  • third_party/xla/xla/backends/gpu/collectives/gpu_collectives_stub.h
  • third_party/xla/xla/backends/gpu/collectives/gpu_communicator.h
  • third_party/xla/xla/backends/gpu/collectives/nccl_collectives.cc
  • third_party/xla/xla/backends/gpu/collectives/nccl_collectives.h
  • third_party/xla/xla/backends/gpu/collectives/nccl_communicator.cc
  • third_party/xla/xla/backends/gpu/collectives/nccl_communicator.h
  • third_party/xla/xla/backends/gpu/collectives/nccl_communicator_test.cc
  • third_party/xla/xla/backends/gpu/collectives/nvshmem_collectives.h
  • third_party/xla/xla/backends/gpu/runtime/BUILD
  • third_party/xla/xla/backends/gpu/runtime/all_gather_thunk.cc
  • third_party/xla/xla/backends/gpu/runtime/all_reduce_thunk.cc
  • third_party/xla/xla/backends/gpu/runtime/all_to_all_thunk.cc
  • third_party/xla/xla/backends/gpu/runtime/collective_broadcast_thunk.cc
  • third_party/xla/xla/backends/gpu/runtime/collective_group_thunk.cc
  • third_party/xla/xla/backends/gpu/runtime/collective_group_thunk.h
  • third_party/xla/xla/backends/gpu/runtime/collective_permute_thunk.cc
  • third_party/xla/xla/backends/gpu/runtime/collective_thunk.cc
  • third_party/xla/xla/backends/gpu/runtime/collective_thunk.h
  • third_party/xla/xla/backends/gpu/runtime/ragged_all_to_all_thunk.cc
  • third_party/xla/xla/backends/gpu/runtime/thunk.h