TensorFlow Changelog


Welcome to the latest update! We've been busy bees 🐝 making some exciting changes, adding new features, squashing bugs, and improving performance. Here's a rundown of what's new:

New Features

  • Original Value Tracking: Introduced a pass that adds the original_value field to each operation in the HLO graph. This is a game-changer for value tracking within the graph, making it easier to manage and analyze computations.
  • cuDNN Custom Call Conversions: Added a pass to convert specific cuDNN custom calls into custom fusion operations. This allows JAX users to run selected computations as cuDNN kernels, optimizing performance on GPUs.
  • Batch Dimension in Gather/Scatter: Now supporting batch dimensions in Gather and Scatter HLO syntax, enhancing data manipulation operations in XLA.
  • BatchFunction Operation: Updated protocol buffer text files to include a new "BatchFunction" operation, allowing for more flexible batching of input tensors.
  • AsyncWrapper: Introduced AsyncWrapper to wrap instructions into async blocks, enabling concurrent execution and potentially improving performance.

Improvements

  • Additional Batch Padding Policies: Exposed new batch padding policies like "BATCH_DOWN" and "MINIMIZE_TPU_COST_PER_REQUEST" for more efficient batch processing.
  • Async Dispatch for JAX CPU Backend: Enabled asynchronous dispatch for expensive computations on the JAX CPU backend, with an opt-out option for those who prefer the old synchronous behavior.

Bugfixes

  • Pipelining with Sequential Extracts: Fixed a bug related to pipelining sequential extracts, ensuring only the induction variable of a loop can be replaced.
  • Revert Changes in TensorFlow Lite GPU Delegate: Reverted a previous change to simplify the handling of the kClFastRelaxedMath compiler option, standardizing behavior across different GPU architectures.
  • Revert Changes in CUDA FFT Library: Reverted modifications to rename and update dependencies for the CUDA FFT library, ensuring proper initialization and integration.

Chores

  • Automated Code Cleanup: Removed unnecessary TensorFlow C API headers from pywrap_parallel_device.cc, streamlining the codebase.

We hope these updates make your development experience smoother and more efficient. Happy coding! 🚀

Included Commits

2024-08-02T14:22:41 See commit

This commit fixes a bug in the pipelining of sequential extracts within the XLA (Accelerated Linear Algebra) project. The changes center on the function CanReplaceInductionVar, which has been modified to ensure that only the induction variable of a loop can be replaced, while other loop-carried values are preserved. This adjustment is crucial for maintaining the integrity of the pipelining process, particularly when dealing with block arguments that are defined outside the loop.

Additionally, the commit introduces a new test case in the optimize_loops.mlir file that illustrates the behavior of the sequential_extract function. The test demonstrates how the extraction of tensor elements can be pipelined, while also highlighting a limitation where the second extraction cannot currently be pipelined due to the way it interacts with the first extraction. Overall, these changes enhance the robustness of the pipelining functionality in the XLA GPU service.
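
To make the constraint concrete, here is a minimal Python sketch, hypothetical code rather than the MLIR pass itself, of why an extract indexed by the induction variable can be prefetched one iteration ahead while an extract indexed by another loop-carried value cannot:

```python
# Illustrative only: the real pass operates on MLIR loops in the XLA GPU
# backend; this mimics the legality rule in plain Python.

def pipelined_sum(tensor):
    """Sums elements, prefetching the next extract one iteration early."""
    if not tensor:
        return 0
    prefetched = tensor[0]          # extract for iteration 0, hoisted out
    total = 0
    for i in range(len(tensor)):
        value = prefetched          # use the value fetched last iteration
        if i + 1 < len(tensor):
            # Safe: the next index is i + 1, a pure function of the
            # induction variable, so the extract can be issued early.
            prefetched = tensor[i + 1]
        total += value
    return total

def not_pipelinable_sum(tensor, index):
    """An extract indexed by a loop-carried value cannot be prefetched:
    the next index is only known after the loop body computes it."""
    total = 0
    for _ in range(len(tensor)):
        total += tensor[index]
        index = (index * 7 + 3) % len(tensor)  # next index depends on the body
    return total
```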

Files changed

  • third_party/xla/xla/service/gpu/fusions/mlir/optimize_loops.cc
  • third_party/xla/xla/service/gpu/fusions/mlir/tests/optimize_loops.mlir
2024-08-02T23:58:37 See commit

This commit introduces a new pass in the XLA (Accelerated Linear Algebra) service, named AddOriginalValue, which enhances the HLO (High-Level Operations) graph by adding an original_value attribute to each operation. This attribute is crucial for value tracking within the HLO graph, allowing for better management and analysis of the values processed by various computations. The implementation iterates through the instructions of each computation in the HLO module, determines the appropriate original value for each instruction based on its type, and sets that value accordingly.

In addition to the core functionality, the commit includes comprehensive unit tests to validate the behavior of the new pass. These tests cover various scenarios, including basic operations, tuple manipulations, and get-tuple-element instructions, ensuring that the original_value attribute is correctly assigned and reflects the expected structure. The new files added for this feature include the implementation and header files for the pass, as well as the corresponding test cases, contributing a total of 224 lines of new code to the codebase.
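
As a rough illustration of the traverse-and-annotate pattern the pass follows, here is a hypothetical Python sketch over a toy IR; the real pass is C++ over XLA's HloModule and derives each original value from the instruction's type rather than its name:

```python
# Toy stand-ins for HLO computations and instructions, for illustration only.
from dataclasses import dataclass, field

@dataclass
class Instruction:
    name: str
    opcode: str
    attrs: dict = field(default_factory=dict)

@dataclass
class Computation:
    instructions: list

def add_original_value(module):
    """Annotates every instruction in every computation; returns True if
    anything changed, mirroring the usual HLO-pass contract."""
    changed = False
    for computation in module:
        for instr in computation.instructions:
            if "original_value" not in instr.attrs:
                instr.attrs["original_value"] = instr.name  # simplified
                changed = True
    return changed

module = [Computation([Instruction("p0", "parameter"),
                       Instruction("add.1", "add")])]
print(add_original_value(module))       # True
print(module[0].instructions[1].attrs)  # {'original_value': 'add.1'}
```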

Files changed

  • third_party/xla/xla/service/BUILD
  • third_party/xla/xla/service/add_original_value.cc
  • third_party/xla/xla/service/add_original_value.h
  • third_party/xla/xla/service/add_original_value_test.cc
2024-08-05T19:01:44 See commit

This commit introduces a new class called AsyncWrapper, designed to enhance the concurrency of operations within the XLA (Accelerated Linear Algebra) framework by wrapping specific instructions in asynchronous blocks. The AsyncWrapper takes a predicate that identifies which instructions should be wrapped in async-start and async-done instructions, allowing them to run concurrently. The implementation includes methods to traverse HLO (High-Level Operations) computations and rewrite the instructions that match the predicate, potentially improving performance by enabling parallel execution of certain operations.

Additionally, the commit includes test cases to validate the functionality of the AsyncWrapper. The tests ensure that the wrapper correctly identifies and counts asynchronous instructions in a sample HLO module, confirming the expected behavior. The commit adds a total of 208 lines of code across three files, including the implementation of the AsyncWrapper, its interface, and the associated unit tests. This enhancement could significantly optimize GPU computations by leveraging asynchronous execution where applicable.
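
The start/finish split the wrapper introduces can be sketched in plain Python, with a thread pool standing in for XLA's async-start/async-done pair; everything below is illustrative and assumes nothing about the real C++ API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_with_async_wrapper(ops, should_wrap):
    """ops: zero-argument callables; should_wrap: predicate choosing which
    ops to run concurrently. Async results are appended after the
    synchronous ones, so ordering here is illustrative only."""
    results = []
    with ThreadPoolExecutor() as pool:
        pending = []
        for op in ops:
            if should_wrap(op):
                pending.append(pool.submit(op))   # "async-start"
            else:
                results.append(op())              # stays synchronous
        results.extend(f.result() for f in pending)  # "async-done"
    return results

def heavy():
    return sum(i * i for i in range(1_000_000))

def light():
    return 42

print(run_with_async_wrapper([heavy, light, heavy],
                             should_wrap=lambda op: op is heavy))
```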

Files changed

  • third_party/xla/xla/service/gpu/BUILD
  • third_party/xla/xla/service/gpu/async_wrapper.cc
  • third_party/xla/xla/service/gpu/async_wrapper.h
  • third_party/xla/xla/service/gpu/async_wrapper_test.cc
2024-08-05T22:51:10 See commit

This commit reverts a previous change identified by the commit hash f9e52c4ff0c8a891a2689b748de6d37021ed0cd2, making several modifications to the CUDA-related files in the XLA (Accelerated Linear Algebra) library. The changes include renaming the CUDA FFT library from "cuda_fft" to "cufft_plugin" and updating its dependencies to reflect this new structure. The commit also removes unnecessary references to the original "cuda_fft" library, ensuring that the new plugin architecture is correctly integrated with other components of the library.

Additionally, the commit introduces a new initialization function for the cuFFT support library, which registers a factory for creating instances of the cuFFT class within the plugin registry. This function is crucial for ensuring that the cuFFT functionality is properly set up when the library is initialized. Overall, the revert and modifications aim to streamline the integration of FFT support in the CUDA environment, improving the organization and functionality of the XLA library's CUDA components.
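
The factory-registration pattern described above can be sketched as follows; every name here is invented for illustration, while the real code registers a C++ factory with StreamExecutor's plugin registry:

```python
# Toy plugin registry: an init hook registers a factory so the FFT
# support object can be constructed on demand later.
_plugin_registry = {}

def register_plugin(kind, name, factory):
    _plugin_registry[(kind, name)] = factory

class ToyCuFft:
    def plan(self, shape):
        return f"fft plan for {shape}"

def initialize_cufft():
    # Runs once when the support library is initialized.
    register_plugin("fft", "cuFFT", ToyCuFft)

initialize_cufft()
fft = _plugin_registry[("fft", "cuFFT")]()
print(fft.plan((1024,)))   # fft plan for (1024,)
```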

Files changed

  • third_party/xla/xla/stream_executor/cuda/BUILD
  • third_party/xla/xla/stream_executor/cuda/cuda_executor.cc
  • third_party/xla/xla/stream_executor/cuda/cuda_fft.cc
2024-08-05T23:00:39 See commit

This commit reverts a previous change identified by the hash 91241fdc7c30fe9a033d6765cce045fdc7959b36 in the TensorFlow Lite GPU delegate code. The modification specifically affects the function CompilerOptionToString in the cl_program.cc file, where the handling of the kClFastRelaxedMath compiler option has been simplified.

With the revert in place, the logic that differentiated behavior based on GPU type (specifically for Mali GPUs with the Valhall architecture) is gone: the function now consistently returns the -cl-fast-relaxed-math flag without any conditional checks. This change reduces complexity in the code and may reflect a decision to standardize behavior across different GPU architectures.

Files changed

  • tensorflow/lite/delegates/gpu/cl/cl_program.cc
2024-08-06T00:40:11 See commit

The recent commit introduces a new option, remove_id, to the HloControlFlowFlattening::Options class, which allows for the removal of specific identifiers (partition-id, replica-id, and slice-id) without affecting collective communication operations. This enhancement enables developers to selectively manage the presence of these identifiers in their HLO (High-Level Operations) modules, providing greater flexibility in control flow flattening.

The changes include modifications to the control flow flattening implementation and updates to the relevant header files, as well as new test cases to validate the behavior of the remove_id option. The added test case specifically checks that the replica-id can be successfully removed while retaining an all-reduce operation, demonstrating the intended functionality of the new option. Overall, this commit enhances the existing control flow flattening capabilities by allowing finer control over identifier removal.
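
Conceptually, the option behaves like the sketch below, where identifier-producing ops are replaced by constants while collectives pass through untouched; the list-of-strings IR is a stand-in for a real HLO module:

```python
def flatten_ids(instructions, remove_id=True):
    """Replaces id-producing ops with a constant when remove_id is set."""
    ID_OPS = {"replica-id", "partition-id", "slice-id"}
    out = []
    for op in instructions:
        if remove_id and op in ID_OPS:
            out.append("constant(0)")   # identifier removed
        else:
            out.append(op)              # all-reduce etc. preserved
    return out

print(flatten_ids(["replica-id", "all-reduce", "partition-id"]))
# ['constant(0)', 'all-reduce', 'constant(0)']
```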

Files changed

  • third_party/xla/xla/tools/hlo_control_flow_flattening.cc
  • third_party/xla/xla/tools/hlo_control_flow_flattening.h
  • third_party/xla/xla/tools/hlo_control_flow_flattening_test.cc
2024-08-06T00:47:34 See commit

This commit introduces asynchronous dispatch for expensive computations on the JAX CPU backend, enhancing performance by allowing tasks to be executed concurrently. Users can opt out of the new asynchronous behavior with jax.config.update('jax_cpu_enable_async_dispatch', False), which reverts to the previous synchronous execution model.

The changes include modifications to the cpu_client.cc and cpu_client.h files, where the implementation of thread pools for running PjRtClient tasks has been updated. Specifically, the commit adjusts the initialization of the pjrt_client_thread_pool_ and async_work_runner_ to support the new asynchronous execution model while maintaining the existing functionality for intra-operation threading with Eigen. Overall, this update aims to improve the efficiency of computation dispatch on the JAX CPU backend.
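
For example, to opt out in user code (the config key comes from this commit; jax.config.update and block_until_ready are standard JAX APIs):

```python
import jax
import jax.numpy as jnp

# Revert to the old synchronous behavior on the CPU backend:
jax.config.update('jax_cpu_enable_async_dispatch', False)

x = jnp.ones((2000, 2000))
y = jnp.dot(x, x)       # with async dispatch enabled, this returns promptly
y.block_until_ready()   # explicitly wait for the computation to finish
```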

Files changed

  • third_party/xla/xla/pjrt/cpu/cpu_client.cc
  • third_party/xla/xla/pjrt/cpu/cpu_client.h
2024-08-08T00:03:06 See commit

This commit introduces support for batch dimensions in the Gather and Scatter High-Level Operations (HLO) syntax within the XLA (Accelerated Linear Algebra) framework. The modifications affect several files, including the core HLO instruction definitions, the HLO parser, and related shape inference and Python client interface files, indicating a comprehensive update to accommodate the new functionality.

By enabling batch dimensions, this enhancement aims to improve the versatility and efficiency of data manipulation operations in XLA, allowing for more complex and varied data structures to be processed. The changes are reflected across multiple components of the XLA codebase, ensuring that both the backend operations and the user-facing API are aligned with the new batch dimension capabilities.
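
In NumPy terms, a batch dimension on a gather means the indices are applied per batch element, roughly as shown below; this demonstrates only the intended semantics, not the HLO attribute names:

```python
import numpy as np

operand = np.arange(2 * 5).reshape(2, 5)   # a batch of 2 vectors
indices = np.array([[3, 1],                # per-batch gather indices
                    [0, 4]])

# Batched gather: result[b, i] = operand[b, indices[b, i]]
result = np.take_along_axis(operand, indices, axis=1)
print(result)   # [[3 1]
                #  [5 9]]
```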

Files changed

  • third_party/xla/xla/hlo/ir/hlo_instructions.cc
  • third_party/xla/xla/hlo/ir/hlo_instructions.h
  • third_party/xla/xla/python/xla_client.pyi
  • third_party/xla/xla/service/hlo_parser.cc
  • third_party/xla/xla/service/hlo_parser_test.cc
  • third_party/xla/xla/service/shape_inference.cc
  • third_party/xla/xla/xla_data.proto
2024-08-08T06:58:21 See commit

The commit associated with PR #15399 introduces a new pass in the XLA (Accelerated Linear Algebra) compiler that converts specific custom calls related to cuDNN (NVIDIA's CUDA Deep Neural Network library) into custom fusion operations. This enhancement allows JAX users to execute selected computations as cuDNN kernels, optimizing performance by leveraging the efficiencies of GPU processing. The change includes the addition of a CuDnnCustomCallConverter class, which identifies custom calls with a designated backend configuration and transforms them into fusion instructions that can be processed by the existing XLA pipeline.

The commit modifies several files, including the build configuration and source files for the new converter, and introduces corresponding tests to ensure the functionality works as intended. By integrating this converter, the XLA framework enhances its ability to handle deep learning workloads more efficiently, ultimately benefiting users who rely on JAX for high-performance computing tasks.
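
The rewrite rule can be sketched like this; the marker string and dictionary IR are placeholders, since the actual pass matches a designated backend configuration on HLO custom calls:

```python
def convert_cudnn_custom_calls(instructions, marker="cudnn-fusion-marker"):
    """Rewrites marked custom calls into fusion ops (toy IR, placeholder marker)."""
    rewritten = []
    for op in instructions:
        if op["kind"] == "custom-call" and op.get("backend_config") == marker:
            rewritten.append({"kind": "fusion", "calls": op["target"],
                              "backend_config": marker})
        else:
            rewritten.append(op)
    return rewritten

ir = [{"kind": "custom-call", "target": "my_kernel",
       "backend_config": "cudnn-fusion-marker"},
      {"kind": "add"}]
print(convert_cudnn_custom_calls(ir))
```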

Files changed

  • third_party/xla/xla/service/gpu/BUILD
  • third_party/xla/xla/service/gpu/gpu_compiler.cc
  • third_party/xla/xla/service/gpu/transforms/BUILD
  • third_party/xla/xla/service/gpu/transforms/cudnn_custom_call_converter.cc
  • third_party/xla/xla/service/gpu/transforms/cudnn_custom_call_converter.h
  • third_party/xla/xla/service/gpu/transforms/cudnn_custom_call_converter_test.cc
2024-08-08T20:48:14 See commit

The recent commit introduces additional batch padding policies for the TensorFlow framework, specifically enhancing the existing functionality related to batch processing. The new policies include "BATCH_DOWN" and "MINIMIZE_TPU_COST_PER_REQUEST," which expand the options available for managing how batches are padded during processing. The default behavior remains unchanged, still using the "PAD_UP" policy unless specified otherwise. The modifications were made in several files, including the tf_generated_ops.td, batch_kernels.cc, and related test files, ensuring that the new policies can be tested effectively.

In addition to the implementation, extensive unit tests were added to verify the behavior of these new padding policies. The tests check that the expected batch sizes are maintained under different conditions, providing a robust framework for validating the new features. This enhancement aims to improve batch processing efficiency, particularly in scenarios involving TPU usage, by allowing for more granular control over how requests are batched and managed.
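
Here is a hedged sketch of how the policies differ, with semantics paraphrased from the commit description rather than taken verbatim from the TensorFlow source:

```python
ALLOWED = [2, 4, 8]   # hypothetical allowed batch sizes; 5 requests queued

def pad_up(n):
    """PAD_UP (the default): pad the batch up to the next allowed size."""
    return min(s for s in ALLOWED if s >= n)

def batch_down(n):
    """BATCH_DOWN: run the largest allowed size <= n now, leaving the
    remaining requests for a later batch."""
    return max(s for s in ALLOWED if s <= n)

print(pad_up(5))      # 8 -> three padding entries are added
print(batch_down(5))  # 4 -> one request waits for the next batch
# MINIMIZE_TPU_COST_PER_REQUEST chooses between such options with a cost model.
```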

Files changed

  • tensorflow/compiler/mlir/tensorflow/ir/tf_generated_ops.td
  • tensorflow/core/kernels/batch_kernels.cc
  • tensorflow/core/kernels/batch_kernels_test.cc
  • tensorflow/core/ops/batch_ops.cc
  • tensorflow/core/runtime_fallback/runtime/runtime_fallback_batch_tf_opkernels.cc
2024-08-08T22:20:48 See commit

This commit updates the protocol buffer text (pbtxt) files related to operations in TensorFlow, specifically focusing on the addition of a new operation called "BatchFunction." The "BatchFunction" operation allows for the batching of input tensors with various configurable attributes such as the number of batch threads, maximum batch size, and timeout settings. The commit introduces a total of 149 changes in the BatchFunction.pbtxt file, defining input and output arguments along with multiple attributes that control its behavior, including options for priority handling and batch padding policies.

Additionally, the commit modifies the ops.pbtxt file to include new allowed values for the batch_padding_policy attribute, expanding the options available for users to optimize their batching strategies. Overall, these changes enhance the functionality and flexibility of batching operations within TensorFlow, catering to different performance and resource management needs in distributed computing environments.
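
From Python, the underlying BatchFunction op is commonly reached through the public tf.nondifferentiable_batch_function decorator, sketched below; whether the new batch_padding_policy attribute is exposed at this level depends on your TensorFlow version:

```python
import tensorflow as tf

@tf.nondifferentiable_batch_function(
    num_batch_threads=1,
    max_batch_size=8,
    batch_timeout_micros=5000,
    allowed_batch_sizes=[2, 4, 8])
def double(x):
    # Executed once per gathered batch; inputs from concurrent callers
    # are concatenated along dimension 0 before this runs.
    return 2.0 * x

print(double(tf.constant([[1.0, 2.0]])))   # [[2. 4.]]
```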

Files changed

  • tensorflow/core/ops/compat/ops_history_v2/BatchFunction.pbtxt
  • tensorflow/core/ops/ops.pbtxt
2024-08-09T04:48:36 See commit

The recent commit involves modifications to the pywrap_parallel_device.cc file within TensorFlow's Python distribution. The changes include the removal of six lines of code, specifically related to various TensorFlow C API headers that were deemed unnecessary. This cleanup likely aims to streamline the codebase by eliminating unused imports, which can enhance maintainability and reduce potential confusion for developers working on this part of the TensorFlow framework.

Overall, the commit reflects an effort to refine the code by removing redundant dependencies, specifically those associated with the TensorFlow C API, while retaining essential includes necessary for the functionality of parallel device operations. This kind of automated code change is part of ongoing maintenance practices to ensure the code remains efficient and relevant to current development needs.

Files changed

  • tensorflow/python/distribute/parallel_device/pywrap_parallel_device.cc