tensorflow changelog
Here's what's hot off the press! We've got a bunch of shiny new features and some bug fixes that'll make your code run smoother than a cat on a Roomba. Let's dive in and see what's new:
New Feature: Flexible Quantization in BATCH_MATMUL
We've jazzed up the BATCH_MATMUL operation in TensorFlow Lite! You can now use any integer divisor of batch_size * n for the number of quantization parameters, making per-channel quantization more flexible and robust. This means more options for handling quantized inputs and a better fit for your model's needs.
New Feature: Model Transformations Flag
Introducing the apply_model_transformations flag in the TensorFlow Lite GPU delegate. This nifty flag lets you decide whether model transformations should be applied during the building process. It's like having a choice between a smoothie and a milkshake: both are great, but now you get to choose!
New Feature: PjRtDeviceEventOrPromise Class
Say hello to PjRtDeviceEventOrPromise, a new class for managing device events and promises. It's all about tracking asynchronous operations in the XLA framework, making your device event handling as smooth as a buttered slide.
New Feature: Enhanced Quantization Passes
We've merged quantization/stablehlo into a new version that leans on the non-lite QuantizeUtils.h and the TFQuantDialect. This means better quantization capabilities and more comprehensive test coverage to boot.
New Feature: TargetMetric in XLA Benchmarking
Benchmarking just got cooler with TargetMetric in the XLA benchmarking tool. Now you can specify metrics like wall time, GPU device time, and peak memory usage, giving you a detailed view of your benchmarks.
New Feature: Support for jax.lax.optimization_barrier
Our TFL converter now supports jax.lax.optimization_barrier, ensuring that certain operations are isolated from optimization passes. It's like setting up cones around your precious computations, keeping them safe and sound.
Improvement: Preserved Weights in Custom BWD Ops
You can now pass preserved weights to custom backward operations in TensorFlow's sparse-dense matrix multiplication. This makes custom combiners more flexible and efficient, perfect for those working with sparse data structures.
Improvement: Multi-Pair Support in sdy_all_to_all
The sdy_all_to_all function now supports multiple source/target dimension pairs, offering more flexibility in tensor operations. It's like having a Swiss Army knife for your dimensions!
Bugfix: Race Condition in TileAssignment
We've squashed a pesky race condition in the TileAssignment class. Mutation is now protected by a mutex, ensuring thread safety and peace of mind.
Bugfix: Multi-Type Transpose Handling
Fixed an issue where multiple transposes with different types weren't handled correctly in the XLA GPU backend. Now your transposes should be as smooth as a synchronized swim team.
Bugfix: Debug Options Dumping
We've fixed an issue with dumping non-default debug options, ensuring all relevant options are included in the output. Debugging just got a little less frustrating!
Chore: Profiler Client Cleanup
The profiler_client has been removed from the public package namespace, streamlining TensorFlow and focusing on core features.
These updates are all about making your experience more flexible, efficient, and robust. Enjoy the new features and happy coding!
Included Commits
This commit introduces support for jax.lax.optimization_barrier in the TensorFlow Lite (TFL) converter, which now accepts the shlo.optimization_barrier operation to prevent optimizations from occurring across these barriers. This functionality is essential for maintaining the integrity of computations that require certain operations to be isolated from optimization passes. Before exporting the model to a flatbuffer, the converter will automatically remove these barriers, ensuring that the final output is optimized for performance while respecting the intended computational boundaries.
The changes include the addition of a new library for the cleanup pass that handles the removal of optimization barriers, as well as updates to the build files and the introduction of a test case for the new functionality. The cleanup pass is implemented in two new files, cleanup_optimization_barrier_pass.cc and cleanup_optimization_barrier_pass.h, which define the logic to replace the shlo.optimization_barrier operation with its input, thereby streamlining the model for further processing. This enhancement is expected to facilitate more efficient model conversion while preserving the necessary operational semantics dictated by the barriers.
Files changed
- tensorflow/compiler/mlir/lite/BUILD
- tensorflow/compiler/mlir/lite/tests/cleanup_optimization_barrier.mlir
- tensorflow/compiler/mlir/lite/tf_tfl_passes.cc
- tensorflow/compiler/mlir/lite/transforms/cleanup_optimization_barrier_pass.cc
- tensorflow/compiler/mlir/lite/transforms/cleanup_optimization_barrier_pass.h
- tensorflow/compiler/mlir/lite/transforms/passes.h
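At export time the cleanup pass treats each barrier as an identity op: the barrier node is replaced by its input and consumers are rewired. The real pass operates on MLIR in C++; here is a minimal Python sketch over a hypothetical toy IR (the `Node` class and function name are illustrative, not TensorFlow APIs):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str
    inputs: list = field(default_factory=list)

def cleanup_optimization_barriers(nodes):
    """Replace every 'optimization_barrier' node with its input,
    rewiring consumers so the barrier disappears from the graph."""
    replacement = {}
    kept = []
    for n in nodes:  # nodes are assumed topologically ordered
        if n.op == "optimization_barrier":
            # A barrier is an identity at export time: forward its input.
            replacement[id(n)] = n.inputs[0]
        else:
            n.inputs = [replacement.get(id(i), i) for i in n.inputs]
            kept.append(n)
    return kept
```

So for a graph `const -> barrier -> add`, the cleanup leaves `add` consuming `const` directly, which is the shape the flatbuffer exporter expects.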
This commit merges the quantization/stablehlo branch into quantization/stablehlo:tf_passes, transitioning to a new version that utilizes the non-lite QuantizeUtils.h and the mlir::quant::ir::TFQuantDialect instead of the previously used mlir::quantfork::QuantizationForkDialect. This change aims to enhance the quantization capabilities within the TensorFlow framework.
The commit introduces numerous additions and modifications across various files, including new headers and source files for handling TensorFlow attributes, constraints, and quantization passes. Notable files added include those for functions like tf_convert_func_to_bfloat16, tf_optimize_graph, and tf_quantize, among others. Additionally, several test cases are included to validate the new quantization functionalities, ensuring comprehensive coverage of the updated features and their correct integration within the TensorFlow ecosystem.
Files changed
- tensorflow/compiler/mlir/quantization/common/BUILD
- tensorflow/compiler/mlir/quantization/common/tf_attrs_and_constraints.cc
- tensorflow/compiler/mlir/quantization/common/tf_attrs_and_constraints.h
- tensorflow/compiler/mlir/quantization/common/tf_lift_as_function_call.cc
- tensorflow/compiler/mlir/quantization/common/tf_lift_as_function_call.h
- tensorflow/compiler/mlir/quantization/common/tf_uniform_quantized_types.cc
- tensorflow/compiler/mlir/quantization/common/tf_uniform_quantized_types.h
- tensorflow/compiler/mlir/quantization/stablehlo/BUILD
- tensorflow/compiler/mlir/quantization/stablehlo/ops/BUILD
- tensorflow/compiler/mlir/quantization/stablehlo/ops/tf_stablehlo_op_quant_spec.cc
- tensorflow/compiler/mlir/quantization/stablehlo/ops/tf_stablehlo_op_quant_spec.h
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_convert_func_to_bfloat16.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_convert_shape_constraint_to_assert.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_convert_xla_call_module_op_to_bfloat16.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_defer_activation_transpose.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_fold_constant_transpose.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_insert_calibration_statistics_saver.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_insert_weight_param.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_lift_quantizable_spots_as_functions.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_merge_fusion_with_dequantize.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_nchw_convolution_to_nhwc.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_optimize_graph.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_passes.h
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_passes.td
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_post_quantize.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_prepare_quantize.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_quantization_patterns.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_quantization_patterns.h
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_quantize.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_quantize_composite_functions.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_quantize_weight.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_remove_sharding_custom_call.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_replace_stablehlo_ops_in_main_function_with_xla_call_module_ops.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_restore_function_name.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_unfuse_mhlo_batch_norm.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_unwrap_xla_call_module_op.cc
- tensorflow/compiler/mlir/quantization/stablehlo/passes/tf_xla_call_module_to_call.cc
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/prepare_quantize/tf_prepare_quantize.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/prepare_quantize/tf_prepare_quantize_int4.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/prepare_quantize/tf_prepare_quantize_per_channel.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/quantize/tf_quantize.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/quantize/tf_quantize_op_with_region.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/quantize/tf_quantize_same_scale.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/quantize/tf_quantize_weight_only.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_convert_func_to_bfloat16.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_convert_xla_call_module_op_to_bfloat16.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_defer_activation_transpose.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_fold_constant_transpose.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_insert_calibration_statistics_saver.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_insert_calibration_statistics_saver_with_skipping.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_insert_weight_param.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_lift_quantizable_spots_as_functions.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_merge-fusion-with-dequantize.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_nchw_convolution_to_nhwc.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_optimize_graph.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_post_quantize.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_quantize_composite_functions.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_quantize_composite_functions_weight_only.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_remove_sharding_custom_call.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_replace_stablehlo_ops_in_main_function_with_xla_call_module_ops.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_restore_function_name.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_shape_cstr_legalize_to_hlo.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_unfuse_mhlo_batch_norm.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_unwrap_xla_call_module_op.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tests/passes/tf_xla_call_module_to_call.mlir
- tensorflow/compiler/mlir/quantization/stablehlo/tools/stablehlo_quant_opt.cc
This commit addresses a race condition in the TileAssignment class within the XLA (Accelerated Linear Algebra) library, which can lead to concurrent access issues when TileAssignment objects are mutated. To ensure thread safety, a mutex has been introduced to protect the shared resources during mutation operations. The changes include the addition of mutex locks in the constructors, assignment operators, and various member functions to safeguard against concurrent modifications.
In total, the commit modifies several files, including the tile_assignment.cc and tile_assignment.h, to implement these mutex protections. The adjustments ensure that the internal state of TileAssignment objects is consistently managed even when accessed from multiple threads, thereby fixing the reported issue in the JAX GitHub repository. The implementation includes new member functions for copying and moving the TileAssignment objects, ensuring that the mutex is properly locked during these operations.
Files changed
- third_party/xla/xla/hlo/ir/BUILD
- third_party/xla/xla/hlo/ir/tile_assignment.cc
- third_party/xla/xla/hlo/ir/tile_assignment.h
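The fix follows the standard pattern of guarding lazily materialized shared state with a mutex. A minimal Python analogue of that pattern (the class name and fields here are illustrative; the actual fix is in the C++ TileAssignment):

```python
import threading

class TileAssignmentSketch:
    """Sketch of guarding lazily materialized state with a mutex,
    analogous to the C++ TileAssignment fix described above."""
    def __init__(self, dims):
        self._dims = dims
        self._array = None           # lazily materialized, shared state
        self._mu = threading.Lock()  # protects _array during mutation

    def array(self):
        # Lock so two threads can't both observe _array as None and
        # materialize it twice (the race being fixed).
        with self._mu:
            if self._array is None:
                self._array = [0] * len(self._dims)
            return self._array
```

Without the lock, concurrent first calls to `array()` could each build and install their own copy; with it, every caller sees the same object.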
This commit removes the profiler_client from the public package namespace in TensorFlow, specifically from the tensorflow/python module. The changes involve modifications to two BUILD files: tensorflow/BUILD and tensorflow/python/BUILD. In the tensorflow/BUILD file, the profiler_client is added as a dependency, while in the tensorflow/python/BUILD file, the reference to profiler_client is deleted, indicating that it will no longer be publicly accessible.
This adjustment is part of a broader effort to streamline the TensorFlow package and potentially improve its usability by reducing the number of public-facing components. The move suggests a shift towards focusing on more stable and essential features, while also possibly preparing for future updates or restructuring within the profiling tools available in TensorFlow.
Files changed
- tensorflow/BUILD
- tensorflow/python/BUILD
This commit introduces a significant enhancement to the BATCH_MATMUL operation in TensorFlow Lite by allowing the number of quantization parameters to be any integer divisor of the product of batch_size and n. Previously, the operation was limited in how it handled quantized inputs, particularly for per-channel quantization. The changes made in this commit include modifications to the test suite and the core implementation, enabling more flexible handling of quantization parameters and improving the overall functionality of the BatchMatrixMultiply operation.
In addition to the core changes, the commit also updates several test cases to reflect the new capabilities, including the re-enabling of tests that were previously disabled due to limitations in handling per-channel quantized inputs. This update not only enhances the robustness of the BATCH_MATMUL operation but also ensures that it can accommodate a wider range of quantization configurations, thereby improving its applicability in various model scenarios. Overall, this commit marks a step forward in optimizing TensorFlow Lite's performance and flexibility in dealing with quantized operations.
Files changed
- tensorflow/lite/delegates/xnnpack/batch_matrix_multiply_test.cc
- tensorflow/lite/delegates/xnnpack/xnnpack_delegate.cc
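The relaxed constraint can be stated simply: the number of per-channel quantization parameters must be an integer divisor of batch_size * n. A small hypothetical helper illustrating the rule (these function names are not TFLite APIs):

```python
def valid_quant_param_counts(batch_size, n):
    """Enumerate the channel counts accepted for BATCH_MATMUL
    per-channel quantization: any integer divisor of batch_size * n."""
    total = batch_size * n
    return [d for d in range(1, total + 1) if total % d == 0]

def is_valid_scale_count(num_scales, batch_size, n):
    # The number of quantization scales must evenly divide batch_size * n.
    return (batch_size * n) % num_scales == 0
```

For example, with batch_size=2 and n=6 the operation can now carry 1, 2, 3, 4, 6, or 12 scales, rather than being restricted to a single fixed layout.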
This commit addresses an issue in the XLA GPU backend related to handling multiple transposes with differing data types. It modifies the PackedTranspose class to ensure that the correct element type is used when allocating shared memory and reading from it. Specifically, the code now iterates over the transposes to dynamically determine the element type for each transpose operation, allowing for proper memory allocation and data handling.
In addition to the code changes, a new test case has been added to validate the functionality of the modified code. This test ensures that the transposes of different types are correctly processed and that the resulting outputs are verified against expected values. The changes enhance the robustness of the GPU emitters by accommodating various data types in transpose operations, thereby improving overall performance and correctness in tensor computations.
Files changed
- third_party/xla/xla/backends/gpu/codegen/emitters/tests/transpose/packed_transpose_two_heroes.hlo
- third_party/xla/xla/backends/gpu/codegen/emitters/transpose.cc
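The essence of the fix is that staging storage is typed per transpose rather than once for the whole group. A rough Python sketch of that idea, using plain lists in place of GPU shared memory (names are illustrative, not XLA APIs):

```python
def run_transposes(transposes):
    """Each (dtype, matrix) pair gets staging storage in its *own*
    element type, instead of one type assumed for all transposes."""
    results = []
    for dtype, matrix in transposes:
        rows, cols = len(matrix), len(matrix[0])
        # Allocate "shared memory" with this transpose's own element type.
        staging = [[dtype(0)] * rows for _ in range(cols)]
        for r in range(rows):
            for c in range(cols):
                staging[c][r] = dtype(matrix[r][c])
        results.append(staging)
    return results
```

Mixing, say, a float transpose and an int transpose in one call now yields correctly typed outputs for both, which mirrors what the new two-heroes test exercises.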
The recent commit enhances the sdy_all_to_all function by allowing it to accept multiple source and target dimension pairs through a list of AllToAllParam tuples. Each tuple consists of axes, source dimension, and target dimension, and the parameters are formatted as a list of mappings. It is important to note that the source and target dimensions must not overlap, ensuring that each dimension appears only once in the parameter list.
An example provided illustrates the new syntax, demonstrating how to use the sdy.all_to_all with multiple mappings for different dimensions. The changes also include updates to the codebase, specifically in the stablehlo_export_pipeline.mlir file, where the previous single mapping syntax has been replaced with the new list format, reflecting the broader functionality of the function. This update aims to improve flexibility and usability in tensor operations within the framework.
Files changed
- third_party/xla/xla/service/spmd/shardy/test/stablehlo_export_pipeline.mlir
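The non-overlap constraint on the parameter list is easy to capture in code. A hypothetical validator sketching the rule (the tuple shape follows the (axes, source dim, target dim) description above; this is not the actual Shardy implementation):

```python
def validate_all_to_all_params(params):
    """Check that across all (axes, src_dim, tgt_dim) tuples no
    dimension appears more than once, as the new syntax requires."""
    seen = set()
    for axes, src, tgt in params:
        for dim in (src, tgt):
            if dim in seen:
                raise ValueError(f"dimension {dim} appears more than once")
            seen.add(dim)
    return True
```

So `[(("x",), 0, 1), (("y",), 2, 3)]` is a legal multi-pair parameter list, while reusing dimension 1 in a second pair is rejected.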
The recent commit introduces a new class, PjRtDeviceEventOrPromise, which serves as a base for representing both device events and promises that signify future device events. This class includes methods for handling asynchronous values and tracking metadata related to device events. Additionally, it establishes a structure for two derived classes: PjRtDeviceEvent, which represents an event that can be waited upon or passed between APIs, and PjRtDeviceEventPromise, which allows for the fulfillment of promises at a later time.
The changes also involve modifications to the build configuration and header files, incorporating new dependencies and expanding the functionality of device events. The new class structure enhances event tracking capabilities and provides a flexible mechanism for managing asynchronous operations in the XLA (Accelerated Linear Algebra) framework. Overall, this commit aims to improve the handling of device events and promises within the XLA ecosystem.
Files changed
- third_party/xla/xla/pjrt/BUILD
- third_party/xla/xla/pjrt/device_event.h
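The event/promise split is a familiar pattern: one side fulfills, the other waits. A minimal Python analogue built on a future (the real PjRtDeviceEvent / PjRtDeviceEventPromise classes are C++; this class and its method names are illustrative only):

```python
import concurrent.futures

class DeviceEventPromiseSketch:
    """Sketch of the promise/event split: the promise side fulfills,
    the event side blocks until the device work completes."""
    def __init__(self):
        self._future = concurrent.futures.Future()

    # Promise side: mark the underlying device work as done.
    def set_ready(self, value=None):
        self._future.set_result(value)

    # Event side: wait for the device work, returning its value.
    def wait(self, timeout=None):
        return self._future.result(timeout=timeout)
```

In the XLA setting the promise is typically fulfilled by the runtime when a device operation finishes, while consumers hold only the event half and wait on it.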
The commit introduces a new enumeration called TargetMetric to the BenchmarkConfig within the XLA (Accelerated Linear Algebra) benchmarking tool. This addition allows for the specification of various benchmark metrics such as wall time, GPU device time, CPU time, and peak memory usage for both CPU and GPU. The BenchmarkConfig message structure is updated to include a repeated field for these target metrics, facilitating more detailed and categorized benchmarking capabilities based on hardware type.
Additionally, the commit updates the default registry files to incorporate new benchmark configurations that utilize the recently added target metrics. Two benchmark configurations are defined: one for a GPU-based benchmark using the Gemma3 model and another for a CPU-based benchmark using the Gemma2 model. Each configuration specifies various parameters such as run frequencies, update policies, and runtime flags, ensuring a comprehensive setup for both presubmit and postsubmit testing scenarios.
Files changed
- third_party/xla/xla/tools/benchmarks/proto/benchmark_config.proto
- third_party/xla/xla/tools/benchmarks/registries/default_registry.yml
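Conceptually, TargetMetric is an enum and the config carries a repeated field of them. A Python approximation of that shape (the enum values below are inferred from the metrics listed above; the authoritative definition is in benchmark_config.proto):

```python
from enum import Enum
from dataclasses import dataclass, field

class TargetMetric(Enum):
    WALL_TIME = 1
    GPU_DEVICE_TIME = 2
    CPU_TIME = 3
    PEAK_CPU_MEMORY = 4
    PEAK_GPU_MEMORY = 5

@dataclass
class BenchmarkConfig:
    name: str
    # Mirrors the proto's repeated target_metrics field.
    target_metrics: list = field(default_factory=list)

# A GPU-flavored config, in the spirit of the Gemma3 registry entry.
gpu_bench = BenchmarkConfig(
    name="gemma3_gpu",
    target_metrics=[TargetMetric.GPU_DEVICE_TIME, TargetMetric.PEAK_GPU_MEMORY],
)
```

A CPU-based entry would instead list metrics like CPU_TIME and PEAK_CPU_MEMORY, giving each hardware type its own categorized set of benchmark targets.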
This commit introduces the ability to pass preserved weights to custom backward (BWD) operations for TensorFlow's sparse-dense matrix multiplication. The modification affects various files, including TensorFlow's MLIR operations and several protobuf text files that document compatibility for different gradient computation scenarios, such as Adagrad, Adam, and SGD with CSR (Compressed Sparse Row) input formats.
Additionally, changes were made to the core XLA (Accelerated Linear Algebra) operations for sparse matrices, ensuring that these new functionalities are integrated into the TensorFlow ecosystem. This enhancement aims to improve the flexibility and efficiency of custom combiners in backward operations, potentially leading to better performance in machine learning tasks that utilize sparse data structures.
Files changed
- tensorflow/compiler/mlir/tensorflow/ir/tf_ops.td
- tensorflow/core/ops/compat/ops_history_v2/XlaSparseDenseMatmulCustomCombinerOnTcGradWithAdagradAndCsrInput.pbtxt
- tensorflow/core/ops/compat/ops_history_v2/XlaSparseDenseMatmulCustomCombinerOnTcGradWithAdagradMomentumAndCsrInput.pbtxt
- tensorflow/core/ops/compat/ops_history_v2/XlaSparseDenseMatmulCustomCombinerOnTcGradWithAdamAndCsrInput.pbtxt
- tensorflow/core/ops/compat/ops_history_v2/XlaSparseDenseMatmulCustomCombinerOnTcGradWithCsrInput.pbtxt
- tensorflow/core/ops/compat/ops_history_v2/XlaSparseDenseMatmulCustomCombinerOnTcGradWithFtrlAndCsrInput.pbtxt
- tensorflow/core/ops/compat/ops_history_v2/XlaSparseDenseMatmulCustomCombinerOnTcGradWithSgdAndCsrInput.pbtxt
- tensorflow/core/tpu/kernels/sparse_core_xla_ops.cc
- tensorflow/core/tpu/ops/sparse_core_ops.cc
- tensorflow/tools/api/golden/v1/tensorflow.raw_ops.pbtxt
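The idea of handing preserved forward-pass weights to a custom backward pass can be sketched with a simple closure (everything here is a hypothetical illustration of the concept, not the XLA sparse-core API):

```python
def make_custom_combiner_bwd(preserved_weights):
    """Build a backward function that closes over preserved
    forward-pass weights, so a custom combiner can reuse them
    when computing gradients for sparse rows."""
    def bwd(upstream_grad, sparse_rows):
        # Scale the upstream gradient by the preserved weight
        # for each contributing sparse row.
        return [upstream_grad * preserved_weights[r] for r in sparse_rows]
    return bwd
```

The point is that the backward op no longer has to recompute or re-fetch those weights; they are threaded through from the forward pass, which is what makes the custom combiners cheaper on sparse inputs.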
This commit introduces a new flag, apply_model_transformations, to the BuildFromFlatBuffer function within the TensorFlow Lite GPU delegate. The modification allows users to specify whether model transformations should be applied during the building process of a model from a FlatBuffer. If the flag is set to true, the function will create a ModelTransformer instance and attempt to apply the transformations; if the application fails, it returns an error. Conversely, when the flag is false, the transformation step is skipped.
In addition to the changes in the implementation file (model_builder.cc), the header file (model_builder.h) is also updated to include the new parameter in the function signature, with a default value of true for backward compatibility. Overall, this enhancement provides more flexibility in model building, allowing for easier experimentation with and without model transformations.
Files changed
- tensorflow/lite/delegates/gpu/common/model_builder.cc
- tensorflow/lite/delegates/gpu/common/model_builder.h
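The control flow the flag introduces can be sketched in a few lines. This is a hypothetical Python rendering of the logic described above (the real BuildFromFlatBuffer is C++ in model_builder.cc; the function and parameter here mimic its shape only):

```python
def build_from_flatbuffer_sketch(model, transformer,
                                 apply_model_transformations=True):
    """Defaults to True for backward compatibility. When the flag is
    set, run the transformer and fail the build if it fails; when
    unset, skip the transformation step entirely."""
    graph = dict(model)  # stand-in for parsing the FlatBuffer
    if apply_model_transformations:
        graph = transformer(graph)
        if graph is None:
            raise RuntimeError("model transformation failed")
    return graph
```

Callers that want to inspect or benchmark the untransformed graph simply pass the flag as False; existing callers see no behavior change.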
The commit addresses an issue in the XLA (Accelerated Linear Algebra) library where non-default debug options were not being correctly dumped when the default value of a flag was set to true, and the debug options explicitly set it to false. This situation led to the absence of these non-default options in the output, which could hinder debugging efforts. The patch modifies the logic in the GetNonDefaultDebugOptions function to ensure that all relevant non-default options are included in the dump, regardless of their default states.
In addition to the code changes, the commit updates tests to verify that the fix works as intended. Specifically, it adds assertions checking that both the xla_gpu_enable_shared_constants flag (explicitly set to false) and the xla_gpu_enable_nccl_user_buffers flag are correctly reflected in the output. This fix enhances the debug options dumping feature, ensuring that developers have access to the complete set of relevant debug information.
Files changed
- third_party/xla/xla/service/dump.cc
- third_party/xla/xla/service/dump_test.cc
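The corrected rule is simply "dump every option whose current value differs from its default," which covers the previously missed case of a true-by-default flag explicitly set to false. A small Python sketch of that comparison (a hypothetical stand-in for the C++ GetNonDefaultDebugOptions logic):

```python
def non_default_debug_options(defaults, current):
    """Return every option whose current value differs from its
    default -- including a flag whose default is True and which
    was explicitly set to False."""
    return {k: v for k, v in current.items() if defaults.get(k) != v}
```

Before the fix, an option like xla_gpu_enable_shared_constants (default true, set to false) could be silently dropped from the dump; with value-based comparison it is always included.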