TensorFlow Changelog

Welcome to the latest change log! We've got some exciting updates and improvements to share with you. From new features that enhance performance to bug fixes that ensure smoother operations, here's a rundown of what's new and improved. 🎉

  • New feature: Introduced F4E2M1FN and F8E8M0FNU types to the XLA framework, enabling microscaling formats like MXFP4. This addition extends the framework's data-type support to narrow floating-point formats. 💾

  • New feature: Added RecordBatchTaskSizeSum in TensorFlow's batching utility to track the cumulative size of tasks within a batch. This function enhances task size analysis during batch processing, offering better insights into task handling. 📊

  • New feature: Open-sourced ProfileTimeBreakdown, enabling detailed execution-time analysis of HLO instructions within TensorFlow and strengthening profiling for performance monitoring. 🔍

  • New feature: Added free-threading support to WeakrefLRUCache, improving its functionality in multithreaded environments. The update ensures thread safety with proper locking mechanisms, validated by a new multithreaded test. 🔒

  • New feature: Introduced a generic XnnFusionThunk for the XLA CPU backend and ported XnnDotThunk to it, optimizing fusion operations for improved performance. 🚀

  • Improvement: Enhanced the XLA GPU framework by using NCCL thunk for RaggedAllToAll operations, even in scenarios without inter-replica communication. This update improves handling of ragged data structures. 🤝

  • Improvement: Enabled sorted scatters in the XLA GPU backend, optimizing scatter operations with sorted indices for better performance. 📈

  • Improvement: Added locking around lazily-initialized fields in PyDeviceList to ensure thread safety in the XLA Python interface, enhancing robustness in multi-threaded environments. 🛡️

  • Bugfix: Fixed a crash due to out-of-memory errors in XLA's custom CPU convolution algorithm by introducing a threshold on the convolution matrix size, ensuring memory constraints are respected. 🛠️

  • Bugfix: Corrected kernel launch dimensions for ROCm to comply with platform-specific checks, enhancing compatibility and performance for ROCm applications. 🎯

  • Bugfix: Resolved a Bazel code check error by updating the BUILD file to use the correct namespace for platform compatibility, ensuring smoother build processes. 🔧

  • Chore: Integrated the Triton library up to commit 88c704e, including patch files to address issues and improve compatibility. This ongoing effort refines the Triton integration for enhanced functionality. ⚙️

We hope these updates make your experience even better! Stay tuned for more improvements and features. 🚀

Included Commits

2024-12-20T20:05:32 See commit

The commit associated with PR #19096 introduces two new primitive types to the XLA framework: F4E2M1FN, a 4-bit floating-point format with 2 bits for the exponent and 1 bit for the mantissa, and F8E8M0FNU, an 8-bit floating-point format with 8 bits for the exponent but no mantissa or sign. This addition enables the implementation of microscaling formats, such as MXFP4, enhancing the framework's capability to handle various data types. The F4E2M1FN type supports positive and negative zeros but lacks representations for infinity and NaNs, while the F8E8M0FNU type does not support zero or negative values, with NaNs encoded distinctly.

The commit includes extensive modifications across multiple files, addressing the need for literal support, conversion code generation, and Python interface integration for the new types. It also involves adding tests to ensure functionality and correctness. The changes are organized into several commits to facilitate review, and the merging of this PR is expected to close the associated issue. Overall, these enhancements aim to broaden the range of data types supported by XLA, thereby improving its performance and flexibility in numerical computations.
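
To make the two formats concrete, here is a small NumPy sketch using the ml_dtypes package, which provides the Python-side implementations of these types (this assumes an ml_dtypes version recent enough to include both dtypes):

```python
import numpy as np
import ml_dtypes

# F4E2M1FN: 1 sign bit, 2 exponent bits, 1 mantissa bit; finite-only
# (no inf, no NaN), with both +0 and -0. Its 16 code points are
# {0, 0.5, 1, 1.5, 2, 3, 4, 6} and their negatives.
x = np.array([0.5, 1.0, 1.5, 6.0], dtype=ml_dtypes.float4_e2m1fn)
print(x)                                             # all exactly representable
print(ml_dtypes.finfo(ml_dtypes.float4_e2m1fn).max)  # 6.0

# F8E8M0FNU: 8 exponent bits, no mantissa, no sign -- unsigned powers of
# two only, with no zero and no negatives; NaN has a dedicated encoding.
# This is the shared-scale type used by MX formats such as MXFP4.
s = np.array([0.25, 1.0, 4.0], dtype=ml_dtypes.float8_e8m0fnu)
print(s)
```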

Files changed

  • tensorflow/core/BUILD
  • third_party/xla/third_party/tsl/tsl/platform/BUILD
  • third_party/xla/third_party/tsl/tsl/platform/ml_dtypes.h
  • third_party/xla/xla/array2d_test.cc
  • third_party/xla/xla/backends/gpu/codegen/transforms/expand_float_ops.cc
  • third_party/xla/xla/backends/gpu/codegen/transforms/lower_tensors.cc
  • third_party/xla/xla/backends/gpu/codegen/transforms/tests/expand_float_ops.mlir
  • third_party/xla/xla/backends/gpu/codegen/transforms/tests/lower_tensors.mlir
  • third_party/xla/xla/comparison_util.h
  • third_party/xla/xla/ffi/api/api.h
  • third_party/xla/xla/ffi/api/c_api.h
  • third_party/xla/xla/ffi/api/ffi.h
  • third_party/xla/xla/ffi/api/ffi_test.cc
  • third_party/xla/xla/ffi/call_frame.cc
  • third_party/xla/xla/fp_util_test.cc
  • third_party/xla/xla/hlo/builder/lib/math.cc
  • third_party/xla/xla/hlo/builder/lib/math_test.cc
  • third_party/xla/xla/hlo/evaluator/BUILD
  • third_party/xla/xla/hlo/evaluator/hlo_evaluator.cc
  • third_party/xla/xla/hlo/evaluator/hlo_evaluator_typed_visitor.h
  • third_party/xla/xla/hlo/evaluator/hlo_evaluator_typed_visitor_mxfloat.cc
  • third_party/xla/xla/hlo/transforms/expanders/comparison_expander.cc
  • third_party/xla/xla/hlo/transforms/simplifiers/float_normalization.cc
  • third_party/xla/xla/hlo/transforms/simplifiers/float_normalization_test.cc
  • third_party/xla/xla/hlo/translate/hlo_to_mhlo/hlo_utils.cc
  • third_party/xla/xla/hlo/translate/hlo_to_mhlo/tests/import.hlo
  • third_party/xla/xla/hlo/translate/mhlo_to_hlo/literal_exporter.cc
  • third_party/xla/xla/hlo/translate/mhlo_to_hlo/tests/export.mlir
  • third_party/xla/xla/literal.cc
  • third_party/xla/xla/literal.h
  • third_party/xla/xla/literal_comparison.cc
  • third_party/xla/xla/literal_comparison_test.cc
  • third_party/xla/xla/literal_test.cc
  • third_party/xla/xla/mlir/utils/type_util.cc
  • third_party/xla/xla/mlir/utils/type_util_test.cc
  • third_party/xla/xla/mlir_hlo/tests/Dialect/mhlo/ops.mlir
  • third_party/xla/xla/pjrt/c/CHANGELOG.md
  • third_party/xla/xla/pjrt/c/pjrt_c_api.h
  • third_party/xla/xla/pjrt/c/pjrt_c_api_helpers.cc
  • third_party/xla/xla/primitive_util.cc
  • third_party/xla/xla/primitive_util.h
  • third_party/xla/xla/primitive_util_test.cc
  • third_party/xla/xla/python/ifrt/dtype.cc
  • third_party/xla/xla/python/ifrt/dtype.h
  • third_party/xla/xla/python/ifrt/dtype.proto
  • third_party/xla/xla/python/ifrt/dtype_test.cc
  • third_party/xla/xla/python/pjrt_ifrt/pjrt_dtype.cc
  • third_party/xla/xla/python/py_values.cc
  • third_party/xla/xla/python/types.cc
  • third_party/xla/xla/python/types.h
  • third_party/xla/xla/python/xla.cc
  • third_party/xla/xla/python/xla_client.py
  • third_party/xla/xla/python/xla_client.pyi
  • third_party/xla/xla/python/xla_client_test.py
  • third_party/xla/xla/python/xla_extension/__init__.pyi
  • third_party/xla/xla/service/cpu/cpu_compiler.cc
  • third_party/xla/xla/service/cpu/onednn_memory_util.h
  • third_party/xla/xla/service/elemental_ir_emitter.cc
  • third_party/xla/xla/service/elemental_ir_emitter_test.cc
  • third_party/xla/xla/service/float8_fnuz_ir_emitter.cc
  • third_party/xla/xla/service/gpu/fusions/triton/triton_support_test.cc
  • third_party/xla/xla/service/gpu/gpu_compiler.cc
  • third_party/xla/xla/service/gpu/tests/float_conversions_test.cc
  • third_party/xla/xla/service/hlo_verifier.cc
  • third_party/xla/xla/service/llvm_ir/llvm_util.cc
  • third_party/xla/xla/stream_executor/data_type.h
  • third_party/xla/xla/stream_executor/dnn.cc
  • third_party/xla/xla/stream_executor/gpu/gpu_blas_lt.cc
  • third_party/xla/xla/stream_executor/rocm/hip_blas_utils.cc
  • third_party/xla/xla/tests/BUILD
  • third_party/xla/xla/tests/array_elementwise_ops_test.cc
  • third_party/xla/xla/tests/constants_test.cc
  • third_party/xla/xla/tests/convert_test.cc
  • third_party/xla/xla/tools/driver.cc
  • third_party/xla/xla/tsl/framework/BUILD
  • third_party/xla/xla/tsl/framework/type_traits.h
  • third_party/xla/xla/tsl/protobuf/dnn.proto
  • third_party/xla/xla/tsl/python/lib/core/ml_dtypes.cc
  • third_party/xla/xla/tsl/python/lib/core/ml_dtypes.h
  • third_party/xla/xla/types.h
  • third_party/xla/xla/util.cc
  • third_party/xla/xla/util.h
  • third_party/xla/xla/util_test.cc
  • third_party/xla/xla/xla_data.proto

2024-12-20T20:21:51 See commit

This commit moves the ProfileTimeBreakdown functionality into TensorFlow's open-source repository. The change adds new source files (profile_time_breakdown.cc and profile_time_breakdown.h) and modifies the BUILD configuration to incorporate the new library. The ProfileTimeBreakdown class manages and analyzes profiling data for different HLO instruction categories, allowing a detailed breakdown of execution times and performance metrics.

The newly added class includes methods for setting and retrieving category-specific execution times, calculating fractions of time spent on various tasks, and generating debug strings for easier inspection of profiling data. The implementation leverages the Abseil library for efficient data structures and string manipulation, ensuring that the profiling capabilities are robust and efficient for performance analysis within TensorFlow.
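
The interface described above is roughly the following shape; this Python sketch paraphrases the C++ class (method names and the picosecond unit are illustrative, not the actual API):

```python
from collections import defaultdict

class ProfileTimeBreakdown:
    """Python paraphrase of the C++ profiling helper described above."""

    def __init__(self):
        self._category_time_ps = defaultdict(int)  # HLO category -> time

    def add_time(self, category: str, time_ps: int) -> None:
        self._category_time_ps[category] += time_ps

    def total_time_ps(self) -> int:
        return sum(self._category_time_ps.values())

    def fraction(self, category: str) -> float:
        total = self.total_time_ps()
        return self._category_time_ps[category] / total if total else 0.0

    def debug_string(self) -> str:
        return ", ".join(
            f"{cat}: {ps}ps" for cat, ps in sorted(self._category_time_ps.items())
        )

breakdown = ProfileTimeBreakdown()
breakdown.add_time("convolution", 7_000)
breakdown.add_time("all-reduce", 3_000)
print(breakdown.fraction("convolution"))  # 0.7
print(breakdown.debug_string())
```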

Files changed

  • tensorflow/core/profiler/convert/BUILD
  • tensorflow/core/profiler/convert/profile_time_breakdown.cc
  • tensorflow/core/profiler/convert/profile_time_breakdown.h

2024-12-20T21:06:14 See commit

This commit enhances the thread safety of the PyDeviceList class in the XLA Python interface by introducing locking mechanisms around several lazily-initialized fields, specifically is_fully_addressable_, addressable_device_list_, memory_kind_info_, and hash_. The commit modifies these fields to be protected by the associated lock of the PyDeviceList object, ensuring that concurrent access does not lead to race conditions. Additionally, it updates the DefaultMemoryKind and MemoryKinds methods to be static, allowing them to take a Python object reference and access the lock more easily. Several other methods have been made private to encapsulate functionality better, and the module registration function has been moved into a static method for improved access to these private methods.

Furthermore, the changes include refactoring several comparison operators to static methods, which also utilize the locking mechanism for thread safety. The commit enhances the overall structure and maintainability of the PyDeviceList class by ensuring that operations that may modify shared state are properly synchronized, thus improving its robustness in multi-threaded environments. The modifications involve a total of 47 additions and 28 deletions across the relevant source files, reflecting a focused effort to enhance the safety and clarity of the code.
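
The core pattern here is lock-guarded lazy initialization: each cached field is computed at most once, and the lock makes the check-then-compute step atomic. A minimal Python sketch of the idea (the real code is C++ and guards four separate fields with the object's associated lock; the class and attribute names below are illustrative):

```python
import threading

class DeviceList:
    """Sketch of lock-guarded lazy initialization, as in PyDeviceList."""

    def __init__(self, devices):
        self._devices = tuple(devices)
        self._lock = threading.Lock()
        self._addressable = None  # lazily initialized, guarded by _lock

    def addressable_devices(self):
        with self._lock:
            # Without the lock, two threads hitting the first access could
            # both observe None and race to initialize the field.
            if self._addressable is None:
                self._addressable = tuple(
                    d for d in self._devices if d.is_addressable
                )
            return self._addressable
```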

Files changed

  • third_party/xla/xla/python/py_device_list.cc
  • third_party/xla/xla/python/py_device_list.h
  • third_party/xla/xla/python/sharding.cc
  • third_party/xla/xla/python/xla.cc

2024-12-20T22:17:13 See commit

The commit introduces free-threading support to the WeakrefLRUCache class within the XLA project, enhancing its functionality for multithreaded environments. It includes modifications to ensure proper locking mechanisms are in place when accessing the cache, which helps maintain thread safety. The changes involve adding a critical section for Python-associated objects and updating the class methods to lock the instance during calls, ensuring that concurrent access does not lead to race conditions.

Additionally, a new multithreaded test case has been added to validate the functionality of the WeakrefLRUCache under concurrent operations. This test simulates multiple threads adding and clearing cache entries, thereby ensuring that the cache behaves correctly in a multithreaded context. The overall enhancements improve the robustness and reliability of the caching mechanism when used in environments with multiple threads, addressing issues that could arise from simultaneous access.
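
A stress test of this kind can be sketched from Python. The snippet below assumes the xla_client.weakref_lru_cache binding and the (context_fn, call, maxsize) calling convention JAX uses internally; the import path and details are illustrative:

```python
import threading
from jax._src.lib import xla_client as xc  # assumed import path

class Key:
    """First argument to the cached call; the cache holds it by weak reference."""

def double(key, x):
    return x * 2

cache = xc.weakref_lru_cache(lambda: None, double, 2048)

def worker():
    # Hammer the cache from several threads: insert entries and clear
    # concurrently, which is what the new multithreaded test exercises.
    for i in range(1_000):
        cache(Key(), i)
        if i % 100 == 0:
            cache.cache_clear()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```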

Files changed

  • third_party/xla/xla/python/BUILD
  • third_party/xla/xla/python/weakref_lru_cache.cc
  • third_party/xla/xla/python/weakref_lru_cache_test.py

2024-12-20T23:42:12 See commit

This commit introduces a new function, RecordBatchTaskSizeSum, to track the cumulative size of tasks within a batch in TensorFlow's batching utility. The function utilizes a monitoring counter to record the sizes of both batched and unbatched tasks associated with a specific model and operation. It increments the counter for batched tasks based on the provided batch_task_size and for unbatched tasks based on unbatched_task_size, allowing for better tracking and analysis of task sizes during batch processing.

The implementation modifies the batch_resource_base.cc file, adding 14 lines of code without any deletions. Additionally, the RecordBatchTaskSizeSum function is called within the ConcatInputTensors method to log the sizes of the current batch and any unbatched tasks, enhancing the metrics available for performance monitoring. This change aims to improve insights into how tasks are processed in batches and may serve as a foundation for further enhancements in task size metrics in the future.
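
Semantically, the new function bumps a cumulative counter whose cells are keyed by model, op, and whether the tasks were batched. A Python paraphrase of that behavior (the real code uses a tsl monitoring counter in C++; the key layout and example names below are illustrative):

```python
from collections import defaultdict

# (model_name, op_name, is_batched) -> cumulative size of processed tasks
_batch_task_size_sum = defaultdict(int)

def record_batch_task_size_sum(model_name, op_name,
                               batch_task_size, unbatched_task_size):
    _batch_task_size_sum[(model_name, op_name, True)] += batch_task_size
    _batch_task_size_sum[(model_name, op_name, False)] += unbatched_task_size

# ConcatInputTensors would call this with the sizes of the current batch
# and of any tasks left unbatched:
record_batch_task_size_sum("my_model", "BatchFunction", 96, 32)
print(_batch_task_size_sum)
```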

Files changed

  • tensorflow/core/kernels/batching_util/batch_resource_base.cc

2024-12-23T12:57:13 See commit

This commit integrates the Triton library up to commit 88c704e. It adds two patch files, const_signature_fixes.patch and revert_67ea999.patch, which, as their names suggest, fix constant-signature handling and revert Triton commit 67ea999.

Additionally, the commit modifies the workspace.bzl file, which is typically used for managing dependencies and configurations within the project. Overall, these changes suggest ongoing efforts to refine and stabilize the integration of the Triton library, ensuring compatibility and improved functionality.

Files changed

  • third_party/triton/temporary/const_signature_fixes.patch
  • third_party/triton/temporary/revert_67ea999.patch
  • third_party/triton/workspace.bzl

2024-12-23T18:33:49 See commit

This commit addresses a Bazel code check error that arose from a previous commit (eae4b03), which caused a failure due to a missing BUILD file in the 'third_party/bazel_platforms' directory. The specific error indicated that the Bazel build system could not locate the required package, resulting in a disruption of the build process. The error was reproducible using the provided Bazel query command.

To resolve the issue, modifications were made to the BUILD file located in tensorflow/lite/experimental/litert/vendors/mediatek/compiler/. The changes involved updating the target compatibility definitions from referencing 'third_party/bazel_platforms' to using the '@platforms' namespace, which is a more appropriate reference for platform compatibility. This adjustment not only fixes the immediate error but also aligns the code with the intended Bazel structure, ensuring smoother build processes moving forward.

Files changed

  • tensorflow/lite/experimental/litert/vendors/mediatek/compiler/BUILD

2024-12-23T19:01:07 See commit

This commit addresses a crash issue caused by out-of-memory (OOM) errors in XLA's custom convolution algorithm on CPU. To mitigate this, a threshold for the convolution matrix size has been introduced. If the size of the convolution matrix exceeds 8 GiB, the implementation falls back to a more generic convolution algorithm, preventing crashes related to excessive memory usage. This change involves modifications to the convolution implementation, ensuring that memory constraints are respected while still allowing for efficient computations when possible.

The commit includes substantial changes to the code, such as the introduction of a new constant for the maximum convolution matrix size and adjustments to the functions handling convolution operations. Specifically, the logic for memory allocation and the conditions under which the custom algorithm is executed have been updated to incorporate this new threshold. Overall, this enhances the robustness of the XLA library by preventing crashes due to memory limitations while maintaining performance where feasible.
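
The guard itself is simple arithmetic: estimate the bytes the convolution matrix would occupy and compare against the cap. A Python paraphrase (the real check lives in the C++ thunk; the size formula below is illustrative):

```python
MAX_CONV_MATRIX_BYTES = 8 * 1024**3  # the 8 GiB threshold from this commit

def use_custom_algorithm(rows: int, cols: int, bytes_per_element: int) -> bool:
    # The custom algorithm materializes a large intermediate matrix; fall
    # back to the generic convolution when it would blow past the threshold.
    return rows * cols * bytes_per_element <= MAX_CONV_MATRIX_BYTES

# A 50_000 x 50_000 float32 matrix is ~9.3 GiB, so the generic path is used:
print(use_custom_algorithm(50_000, 50_000, 4))  # False
```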

Files changed

  • third_party/xla/xla/backends/cpu/runtime/convolution_thunk_internal.h

2024-12-23T21:03:17 See commit

This commit introduces a new generic XnnFusionThunk to the XLA CPU backend, enhancing the system's capability to handle optimized fusion operations. It also ports the existing XnnDotThunk functionality to this new generic thunk, allowing for improved performance and maintainability of the codebase.

The changes involve modifications to several files, including updates to header and implementation files for both the new XnnFusionThunk and the ported XnnDotThunk. Additionally, a new test file has been added to ensure the functionality of the XnnFusionThunk, contributing to the overall robustness of the XLA framework.

Files changed

  • third_party/xla/xla/backends/cpu/runtime/thunk.cc
  • third_party/xla/xla/backends/cpu/runtime/thunk.h
  • third_party/xla/xla/backends/cpu/runtime/xnnpack/BUILD
  • third_party/xla/xla/backends/cpu/runtime/xnnpack/xnn_dot_thunk.cc
  • third_party/xla/xla/backends/cpu/runtime/xnnpack/xnn_dot_thunk.h
  • third_party/xla/xla/backends/cpu/runtime/xnnpack/xnn_fusion_thunk.cc
  • third_party/xla/xla/backends/cpu/runtime/xnnpack/xnn_fusion_thunk.h
  • third_party/xla/xla/backends/cpu/runtime/xnnpack/xnn_fusion_thunk_test.cc

2024-12-23T21:25:51 See commit

This commit, merged via PR #19582, fixes the kernel launch dimensions for ROCm (Radeon Open Compute) so that they take the form ((block.x, 1, 1), (thread.x, thread.y, 1)). This shape is required by checks in the parallel_loop_emitter.cc file, which stipulate that the product of block dimensions and thread dimensions must not exceed 0xFFFFFFFF. The modifications update the calculations of the number of blocks and threads, particularly for the ROCm platform, to ensure they adhere to these constraints.

The implementation involves a conditional check for the ROCm platform that adjusts the thread and block dimensions accordingly, ensuring that the total number of threads does not exceed the device's limits. The changes were made in the launch_dimensions.cc file and included additional imports to support the new calculations. This merge resolves the issue documented in the original pull request, enhancing compatibility and performance for applications utilizing ROCm.
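
The shape and the limit are easy to illustrate (a Python paraphrase of the constraint; the actual computation in launch_dimensions.cc is more involved):

```python
MAX_TOTAL = 0xFFFFFFFF  # product of block and thread dims may not exceed this

def rocm_launch_dims(block_x: int, thread_x: int, thread_y: int):
    # ROCm launch dimensions must take the shape
    # ((block.x, 1, 1), (thread.x, thread.y, 1)).
    if block_x * thread_x * thread_y > MAX_TOTAL:
        raise ValueError("launch dimensions exceed the device limit")
    return (block_x, 1, 1), (thread_x, thread_y, 1)

grid, block = rocm_launch_dims(block_x=65_536, thread_x=256, thread_y=4)
print(grid, block)  # (65536, 1, 1) (256, 4, 1)
```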

Files changed

  • third_party/xla/xla/service/gpu/BUILD
  • third_party/xla/xla/service/gpu/launch_dimensions.cc

2024-12-24T17:41:36 See commit

This commit modifies the XLA (Accelerated Linear Algebra) framework for GPU by implementing the NCCL (NVIDIA Collective Communications Library) thunk for the RaggedAllToAll operation, even in degenerate cases where communication between replicas is not required. Traditionally, collective operations would default to a simple copy when no communication is necessary; however, for RaggedAllToAll, this approach is inadequate because it translates to a DynamicUpdateSlice operation that cannot be expressed in HLO (High-Level Operations). As a result, the commit establishes that using the NCCL thunk is the best solution for handling these scenarios.

Additionally, the commit introduces a new test case for the RaggedAllToAll operation to ensure its functionality in a degenerate context across multiple GPUs. The test validates that the operation behaves as expected when given specific input and output configurations, confirming that the implementation correctly handles cases where the operation should not default to a simple copy. Overall, this update enhances the efficiency and correctness of collective operations in the XLA framework, particularly for scenarios involving ragged data structures.

Files changed

  • third_party/xla/xla/service/gpu/ir_emitter_unnested.cc
  • third_party/xla/xla/tests/collective_ops_e2e_test.cc

2024-12-24T21:59:39 See commit

The commit introduces support for sorted scatters in the XLA (Accelerated Linear Algebra) GPU backend, enhancing the efficiency of scatter operations when the indices are sorted. Key changes include modifications to the LowerTensorsPass class to streamline tensor pattern applications, and adjustments in the ScatterWithDistributedIndices class to better handle the distribution of indices across warps. The logic for determining the number of blocks and warps has been refined to optimize performance, especially when working with sorted indices.

Additionally, a new test case has been added to ensure the correctness of the sorted scatter functionality, while previously excluded test files have been re-included to validate the implementation. The changes reflect a significant improvement in handling scatter operations, particularly under conditions that leverage sorted indices, thus potentially improving performance for certain workloads in GPU computations.

Files changed

  • third_party/xla/xla/backends/gpu/codegen/transforms/lower_tensors.cc
  • third_party/xla/xla/service/gpu/fusions/scatter_mlir.cc
  • third_party/xla/xla/service/gpu/fusions/tests/BUILD
  • third_party/xla/xla/service/gpu/fusions/tests/scatter/sorted_indices_small.hlo