TensorFlow changelog
Here's the latest scoop on what's new and improved in our codebase! We've been busy bees, adding some cool new features and squashing pesky bugs to make things run smoother than ever. Check out the highlights below!
- New Feature: Infeed and Outfeed Support for HloRunnerPjRt
  We've just rolled out infeed and outfeed support for HloRunnerPjRt in the XLA library. This means you can now transfer data into and out of computations in real time, making your workflows more dynamic and interactive. Plus, we've added some nifty functions for buffer conversions and threading to keep things running smoothly.
- Improvement: All-to-All Operation Enhancements
  Our latest update optimizes the handling of multiple source-target pairs during all-to-all operations. By merging and splitting sharding axes more efficiently, we've reduced the number of collective operations needed, boosting performance for distributed computations. Let's get those tensors reshaped and transposed like pros!
- New Feature: CreateFromAhwb Method in TensorBuffer
  Say hello to the CreateFromAhwb method in TensorFlow Lite's TensorBuffer class! This new addition allows you to create a TensorBuffer from an Android Hardware Buffer, making it easier to work with hardware-backed tensors. We've got tests in place to ensure everything works like a charm.
- New Feature: Pinning Tensors to Device Memory in XLA
  You can now pin tensors to device memory in XLA, keeping them from being prefetched to alternate memory. This feature enhances memory management and performance, especially for applications that need quick access to critical tensors.
- Improvement: Dynamic Slice Operation Optimization
  We've optimized the partitioning of dynamic-slice operations: instead of fully replicating the input whenever a slice dimension is sharded, the input is now replicated only along the slice dimensions. This eliminates unnecessary replication, leading to faster execution in distributed environments.
- New Feature: Lower Fake Quant Annotation
  Introducing the LowerQuantAnnotationsPass! This new pass transforms quant.fake_quant operations into tfl.Quantize and tfl.Dequantize ops, paving the way for better quantization handling in TensorFlow MLIR.
- New Feature: cuDNN Flash Attention Sequence Packing
  Our cuDNN flash attention now supports sequence packing, allowing multiple segments to be packed into one batch. This enhancement saves memory and speeds up both training and inference, making your workflows more efficient.
- Bugfix: Dispatch API Build Error
  We've fixed a build error in the TensorFlow Lite dispatch API by refining memory management and handling unknown C++ types. This ensures a smoother, error-free build process.
- Bugfix: 3D Input Quantization in Fully Connected Layers
  We've addressed an issue with per-channel quantization for 3D input tensors, ensuring that fully connected operations handle output shapes correctly. Now your models can process 3D inputs without a hitch!
- Bugfix: Operation Profile Improvements
  We've improved the TensorFlow profiler's operation profile by refining the deduplication process and enhancing the user interface. This makes it easier to manage and analyze operation profiles.
- Chore: Remove Unused Refcounting Hashmap
  We've cleaned up the codebase by removing an unused refcounting hashmap, streamlining the XLA project for better maintainability.
Stay tuned for more updates as we continue to enhance our codebase with awesome features and improvements!
Included Commits
This commit addresses an issue with per-channel quantization in fully connected layers when handling 3D input tensors in TensorFlow Lite. The changes include the addition of a new test case that validates the output shape of a per-channel quantized fully connected operation with 3D input. The test initializes a model with specified input dimensions and quantization scales, sets weights and bias values, and checks that the dequantized output matches expected values. It ensures that the quantization process adheres to the constraints of the input and output scales.
Additionally, modifications were made to the internal reference implementation of the fully connected operation to accommodate the new requirements for output shape handling. The code now allows for output shapes with more than two dimensions, adjusting the way batch sizes and output depths are calculated. This enhancement ensures that the fully connected operation can properly process 3D inputs while maintaining the integrity of the quantization process.
Files changed
- tensorflow/lite/kernels/fully_connected_test.cc
- tensorflow/lite/kernels/internal/reference/integer_ops/fully_connected.h
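A sketch of the shape arithmetic described above, as a hypothetical helper rather than the actual TFLite kernel code: every dimension except the last is folded into the batch count, and the last dimension is the output depth.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct FlattenedShape {
  int64_t batches;
  int64_t output_depth;
};

// Hypothetical helper: flatten an output shape of rank >= 2 into the
// (batches, output_depth) pair the fully connected kernel iterates over.
FlattenedShape FlattenFullyConnectedOutput(const std::vector<int64_t>& dims) {
  assert(dims.size() >= 2 && "fully connected output must be at least rank 2");
  int64_t batches = 1;
  for (size_t i = 0; i + 1 < dims.size(); ++i) batches *= dims[i];
  return {batches, dims.back()};
}

// Example: a 3-D output of shape [2, 4, 8] is processed as 2 * 4 = 8 batches,
// each with an output depth of 8.
```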
This commit introduces several internal changes to the TensorFlow MLIR (Multi-Level Intermediate Representation) codebase, specifically within the tensorflow/compiler/mlir/tfrt/transforms/ifrt directory. Key modifications include updates to the Tf2HloArg structure, which now includes a mutable std::vector<DtypeAndShape> for input_dtypes_and_shapes and a new member, variable_arg_indices. This change enhances the flexibility of the argument structure during the compilation process from TensorFlow to HLO (High-Level Operations). Additionally, various source files were modified to accommodate these changes, including updates to function signatures and the introduction of new include directives.

Moreover, the commit refines the handling of executable creation and execution within the IfrtServingExecutable class. It adds support for variable_arg_indices in relevant functions, ensuring that the executable can correctly process variable arguments during runtime. The modifications also include improvements to input shape handling, allowing for reshaping of tensors to match expected shapes after compilation. Overall, these changes aim to enhance the functionality and robustness of the TensorFlow MLIR framework, particularly in the context of serving and executing ML models.
Files changed
- tensorflow/compiler/mlir/tfrt/transforms/ifrt/BUILD
- tensorflow/compiler/mlir/tfrt/transforms/ifrt/tf2hlo.cc
- tensorflow/compiler/mlir/tfrt/transforms/ifrt/tf2hlo.h
- tensorflow/compiler/mlir/tfrt/transforms/ifrt/tf2hlo_test.cc
- tensorflow/core/tfrt/ifrt/ifrt_serving_executable.cc
- tensorflow/core/tfrt/ifrt/ifrt_serving_executable.h
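A schematic sketch of the updated argument struct, reconstructed from the description above. Only the two field names come from the commit; the contents of DtypeAndShape and everything else are assumptions.

```cpp
#include <cstdint>
#include <vector>

// Assumed stand-in for the real TensorFlow type: an element dtype plus
// dimension sizes.
struct DtypeAndShape {
  int dtype;                  // placeholder for a DataType enum
  std::vector<int64_t> dims;
};

// Sketch of the Tf2HloArg change: the dtype/shape list is declared mutable so
// it can be adjusted during compilation, and variable_arg_indices records
// which arguments of the computation are variables.
struct Tf2HloArg {
  mutable std::vector<DtypeAndShape> input_dtypes_and_shapes;
  std::vector<int> variable_arg_indices;
  // ... other fields elided ...
};
```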
This commit introduces a new flag-protected pass called LowerQuantAnnotationsPass, which is designed to transform quant.fake_quant composite operations into a pair of tfl.Quantize and tfl.Dequantize operations. These transformations are essential as they prepare the quantization annotations for further processing by subsequent converter quantization passes within the TensorFlow MLIR (Multi-Level Intermediate Representation) framework.
The changes include modifications to several files within the TensorFlow MLIR Lite directory, specifically in the build configuration, pass definitions, and the implementation of the new lowering functionality. New helper files were added to support the lowering process, ensuring that the quantization operations can be effectively handled in the conversion pipeline. Overall, this commit enhances the quantization capabilities of TensorFlow Lite by providing a structured way to lower fake quantization annotations.
Files changed
- tensorflow/compiler/mlir/lite/BUILD
- tensorflow/compiler/mlir/lite/tf_tfl_passes.cc
- tensorflow/compiler/mlir/lite/transforms/lower_quant_annotations_helper.cc
- tensorflow/compiler/mlir/lite/transforms/lower_quant_annotations_helper.h
- tensorflow/compiler/mlir/lite/transforms/lower_quant_annotations_pass.cc
- tensorflow/compiler/mlir/lite/transforms/passes.h
- tensorflow/compiler/mlir/lite/transforms/passes.td
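The pass itself rewrites IR, but the semantics of the emitted op pair are easy to state in plain C++. Here is a minimal sketch of the affine quantize/dequantize arithmetic that a tfl.Quantize / tfl.Dequantize pair implements for int8 values; this is an illustration of the math, not the converter code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Affine quantization: real -> int8, given a scale and zero point.
int8_t Quantize(float x, float scale, int32_t zero_point) {
  int32_t q = static_cast<int32_t>(std::round(x / scale)) + zero_point;
  return static_cast<int8_t>(std::clamp(q, -128, 127));
}

// Dequantization: int8 -> real.
float Dequantize(int8_t q, float scale, int32_t zero_point) {
  return (static_cast<int32_t>(q) - zero_point) * scale;
}

// A fake-quant annotation is equivalent to Dequantize(Quantize(x)) at the
// annotated tensor, which is exactly the op pair the new pass materializes.
```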
This commit addresses two key issues in the TensorFlow profiler's operation profile functionality. First, it enhances the deduplication process by ensuring that the root deduplication node is included even when its deduplicated operation name is an empty string. This change is crucial for accurately grouping operations that may not have duplicates but still need to be represented in the profiling hierarchy. The logic for handling nodes with a single child has also been refined to remove unnecessary deduplication layers, thus streamlining the tree structure.
Secondly, the commit fixes the operation limit control in the operation profile user interface. This improvement ensures that users can effectively manage the display and interaction with operation profiles, leading to a more user-friendly experience. Overall, these modifications not only enhance the accuracy of operation profiling but also improve usability within the TensorFlow profiling tools.
Files changed
- tensorflow/core/profiler/convert/op_profile_builder.cc
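The single-child refinement can be pictured with a small tree helper. This is a hedged sketch with a hypothetical node type, not the profiler's actual classes.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the profiler's tree node.
struct Node {
  std::string name;
  bool is_dedup_group = false;  // true for synthetic deduplication layers
  std::vector<std::unique_ptr<Node>> children;
};

// If a deduplication node groups only a single child, the extra layer adds no
// information, so it is replaced by that child. This mirrors the cleanup the
// commit describes.
void CollapseSingleChildDedupNodes(std::unique_ptr<Node>& node) {
  for (auto& child : node->children) CollapseSingleChildDedupNodes(child);
  if (node->is_dedup_group && node->children.size() == 1) {
    auto only_child = std::move(node->children.front());
    node = std::move(only_child);  // drop the redundant dedup layer
  }
}
```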
This commit introduces optimizations to the partitioning process for dynamic-slice operations within the XLA (Accelerated Linear Algebra) framework. The main change is more efficient handling of the input data: it is replicated only along the slice dimensions to create a temporary sharding configuration (temp_sharding), which allows the input to be reshaped accordingly before applying the dynamic slice operation. After the dynamic slice is performed under temp_sharding, the output is reshaped back to the final desired sharding configuration. This approach eliminates the previous suboptimal behavior of fully replicating the input whenever a sharded slice dimension exists.
Additionally, the commit includes updates to the associated tests to validate the new partitioning behavior, ensuring that the dynamic slice operations are now correctly partitioned across both non-partitioned and partitioned dimensions. The changes result in a more efficient execution of dynamic slice operations, as evidenced by the updated HLO (High-Level Operations) representations that reflect the improved partitioning strategy. The overall impact of this commit enhances performance by reducing the overhead associated with input replication and optimizing the dynamic slice handling in a distributed computing environment.
Files changed
- third_party/xla/xla/service/spmd/spmd_partitioner.cc
- third_party/xla/xla/service/spmd/spmd_partitioner_test.cc
This commit introduces support for pinning tensors to device memory in the XLA (Accelerated Linear Algebra) framework. When a tensor is pinned, it remains in device memory and is neither prefetched to nor assigned to alternate memory, as it otherwise could be when unpinned. The changes include modifications to multiple source files, particularly in the memory placement transformation logic, where new annotations for pinned device memory are added. This feature aims to enhance memory management and performance by allowing developers to specify that certain tensors should remain in device memory for efficient access.
Additionally, the commit includes modifications to the testing framework to ensure that the new functionality is properly validated. New test cases were added to check the behavior of tensors when pinned to device memory, ensuring that the expected memory space assignments occur as intended. Overall, this enhancement is expected to improve the efficiency of memory usage in XLA, particularly for applications that benefit from keeping critical tensors in device memory.
Files changed
- third_party/xla/xla/hlo/transforms/convert_memory_placement_to_internal_annotations.cc
- third_party/xla/xla/hlo/transforms/convert_memory_placement_to_internal_annotations_test.cc
- third_party/xla/xla/service/host_memory_offload_annotations.h
- third_party/xla/xla/service/memory_space_assignment/BUILD
- third_party/xla/xla/service/memory_space_assignment/memory_space_assignment_test.cc
- third_party/xla/xla/service/memory_space_assignment/memory_space_assignment_test_base.h
The recent commit introduces infeed and outfeed support for the HloRunnerPjRt component of the XLA (Accelerated Linear Algebra) library. This enhancement allows for the transfer of data into and out of the computation during execution, enabling more complex workflows that require real-time data interaction. The implementation includes the addition of new functions for handling buffer conversions and the management of infeed and outfeed operations, which are crucial for executing replicated computations across multiple devices.

In terms of code changes, the commit modifies several files, notably hlo_runner_pjrt.cc and hlo_runner_pjrt.h, where new methods for managing infeed and outfeed are integrated. The commit also leverages threading to handle these operations efficiently, ensuring that data can be fed into and out of the execution context without blocking the main computation flow. Overall, this update significantly enhances the functionality of HloRunnerPjRt, allowing it to better support diverse computational needs in machine learning and other applications.
Files changed
- third_party/xla/xla/service/BUILD
- third_party/xla/xla/service/hlo_runner_pjrt.cc
- third_party/xla/xla/service/hlo_runner_pjrt.h
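As a rough illustration of the pattern this enables, here is a hedged sketch against the public PjRt API. TransferToInfeed and TransferFromOutfeed are existing PjRtDevice methods, but the threading arrangement and the literals are illustrative, not the commit's actual code.

```cpp
#include <thread>

#include "tsl/platform/status.h"  // for TF_CHECK_OK
#include "xla/literal.h"
#include "xla/pjrt/pjrt_client.h"

// Feed data to a running computation via infeed and drain its outfeed on
// separate threads, so neither transfer blocks the main computation flow.
void FeedAndDrain(xla::PjRtDevice* device, const xla::Literal& input,
                  xla::MutableBorrowingLiteral result) {
  std::thread infeed([&] {
    // Enqueues the literal on the device's infeed.
    TF_CHECK_OK(device->TransferToInfeed(input));
  });
  std::thread outfeed([&] {
    // Blocks until the computation emits a value on its outfeed.
    TF_CHECK_OK(device->TransferFromOutfeed(result));
  });
  infeed.join();
  outfeed.join();
}
```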
The commit introduces support for sequence packing in cuDNN's flash attention mechanism, allowing multiple segments to be efficiently packed into a single batch. This enhancement is designed to optimize memory usage and improve the performance of both training and inference workloads by utilizing two additional tensors, q_offsets and kv_offsets, which define the starting and ending positions of each segment within a batch. By specifying these offsets, the layout of the query (Q), key (K), value (V), output (O), and their respective gradients can be effectively managed.
Additionally, the commit includes a new configuration option, max_segment_per_batch, which sets the maximum number of segments allowed in a batch. This is particularly important due to XLA's static memory allocation, enabling the compilation of a cuDNN graph with a predetermined size for the softmax_stat tensors. The implementation is accompanied by a test case that validates the functionality of the sequence packing feature, demonstrating its equivalence to using a segment mask by comparing the two approaches in the context of cuDNN.
Files changed
- third_party/xla/xla/service/gpu/backend_configs.proto
- third_party/xla/xla/service/gpu/tests/gpu_fused_mha_test.cc
- third_party/xla/xla/service/gpu/transforms/cudnn_custom_call_compiler.cc
- third_party/xla/xla/stream_executor/cuda/cuda_dnn.cc
- third_party/xla/xla/stream_executor/cuda/cuda_dnn.h
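For intuition, the offset tensors are essentially exclusive prefix sums of the per-segment sequence lengths. A minimal sketch follows; the helper name is hypothetical, and only q_offsets/kv_offsets come from the commit.

```cpp
#include <cstdint>
#include <vector>

// Packs segments back-to-back: offsets[i] is where segment i starts, and
// offsets[n] is the total packed length, so segment i spans
// [offsets[i], offsets[i + 1]).
std::vector<int64_t> SegmentOffsets(const std::vector<int64_t>& seq_lens) {
  std::vector<int64_t> offsets(seq_lens.size() + 1, 0);
  for (size_t i = 0; i < seq_lens.size(); ++i) {
    offsets[i + 1] = offsets[i] + seq_lens[i];
  }
  return offsets;
}

// Example: query segments of lengths {3, 5, 2} pack into one batch with
// q_offsets = {0, 3, 8, 10}; kv_offsets is built the same way from the
// key/value segment lengths.
```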
This commit introduces enhancements to the handling of multiple source-target pairs during the generation of all-to-all operations in the XLA (Accelerated Linear Algebra) framework. The primary focus is on optimizing the reshaping and transposition of data to facilitate efficient communication across distributed devices. The implementation details include the merging of sharding axes into a single dimension, executing the all-to-all operation, and subsequently splitting the sharding axes back into multiple dimensions. This approach aims to improve performance by reducing the number of collective operations needed when reshaping tensors for parallel processing.
The changes also include the addition of a new function, TryMultipleSourceTargetDims, which allows the handling of multiple source-target dimensions in one all-to-all operation. The commit modifies several files, including the core partitioning logic and updates to the associated tests, ensuring that the new functionality is thoroughly validated. The implementation allows for more flexible and efficient data distribution strategies, which can lead to significant performance improvements in distributed computing scenarios.
Files changed
- third_party/xla/xla/service/spmd/spmd_partitioner.cc
- third_party/xla/xla/service/spmd/spmd_partitioner.h
- third_party/xla/xla/service/spmd/spmd_partitioner_test.cc
This commit introduces a new method called CreateFromAhwb to the TensorBuffer class within TensorFlow Lite's experimental LiteRT module. This method allows for the creation of a TensorBuffer object that wraps an Android Hardware Buffer (AHWB), enabling the use of tensors backed by hardware buffers. The implementation ensures that the provided AHardwareBuffer is not owned by the TensorBuffer, meaning that the buffer must outlive the TensorBuffer instance. The method takes a tensor type, an AHardwareBuffer pointer, and an offset for the tensor data, and it handles the creation process, returning an expected TensorBuffer or an error status if the operation fails.
Additionally, the commit includes updates to the test suite to validate the new method. A test case is added to check the functionality of creating a TensorBuffer from an AHardwareBuffer, ensuring that the data can be correctly copied and accessed. The tests are conditioned on the availability of AHardwareBuffer support, and provisions are made to skip the tests if the platform does not support AHWB. Overall, this commit enhances the TensorBuffer functionality by integrating support for Android Hardware Buffers, along with corresponding tests to ensure reliability.
Files changed
- tensorflow/lite/experimental/litert/cc/litert_tensor_buffer.h
- tensorflow/lite/experimental/litert/cc/litert_tensor_buffer_test.cc
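A hedged usage sketch: the method name and its three parameters come from the description above, but the exact return and parameter types (litert::Expected, litert::RankedTensorType) are assumptions.

```cpp
#include <android/hardware_buffer.h>

#include "tensorflow/lite/experimental/litert/cc/litert_tensor_buffer.h"

// Sketch: wrap an existing AHardwareBuffer in a TensorBuffer. The buffer is
// not owned by the TensorBuffer, so it must outlive the returned object.
litert::Expected<litert::TensorBuffer> WrapAhwb(
    const litert::RankedTensorType& tensor_type, AHardwareBuffer* ahwb) {
  // Offset 0: the tensor data starts at the beginning of the buffer.
  return litert::TensorBuffer::CreateFromAhwb(tensor_type, ahwb,
                                              /*ahwb_offset=*/0);
}
```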
This commit addresses a build error related to the dispatch API in the TensorFlow Lite framework. Key modifications include the addition of a unique-pointer type alias for the Interpreter class, which enhances memory management and usability. The commit also introduces a workaround for a static assertion failure in the GetElementType function by implementing a dependent_false template, ensuring that unknown C++ types are properly handled during compilation.

Additionally, the commit updates the Uses function in the Tensor class to use a more structured approach by directly creating TensorUse objects, improving code clarity and maintainability. It also includes minor adjustments to header files, adding necessary includes for types like cstdint and vector, which support the overall functionality of the TensorFlow Lite experimental features. Overall, these changes contribute to a more robust and error-free build process for the dispatch API.
Files changed
- tensorflow/lite/core/interpreter.h
- tensorflow/lite/experimental/litert/cc/litert_element_type.h
- tensorflow/lite/experimental/litert/cc/litert_model.cc
- tensorflow/lite/experimental/litert/runtime/tensor_buffer.h
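The dependent_false workaround mentioned above is a standard C++ idiom. Here is a minimal, self-contained sketch, simplified rather than copied from the actual LiteRT GetElementType:

```cpp
#include <cstdint>
#include <type_traits>

// A 'false' that depends on T, so the static_assert below is evaluated only
// when the template is instantiated with an unsupported type.
template <typename T>
struct dependent_false : std::false_type {};

enum class ElementType { kInt32, kFloat32 };

template <typename T>
constexpr ElementType GetElementType() {
  if constexpr (std::is_same_v<T, int32_t>) {
    return ElementType::kInt32;
  } else if constexpr (std::is_same_v<T, float>) {
    return ElementType::kFloat32;
  } else {
    // static_assert(false, ...) would fire for every instantiation;
    // dependent_false<T> fires only for genuinely unknown types.
    static_assert(dependent_false<T>::value, "unknown C++ type");
  }
}
```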
The commit titled "[xla] Delete unused refcounting hashmap" removes the refcounting_hash_map library and its associated test from the XLA (Accelerated Linear Algebra) codebase. The changes affect two files: third_party/xla/xla/backends/cpu/collectives/BUILD and third_party/xla/xla/BUILD. Specifically, the commit removes references to refcounting_hash_map from the dependencies of two CPU collective libraries and eliminates the entire refcounting_hash_map library along with its test case.

This cleanup reflects an effort to streamline the codebase by removing unused components, which helps improve maintainability and reduce potential confusion for developers. The removal of 23 lines across the BUILD files is a small but focused step toward eliminating unnecessary code from the XLA project.
Files changed
- third_party/xla/xla/BUILD
- third_party/xla/xla/backends/cpu/collectives/BUILD