TensorFlow changelog
Here's the scoop on the latest updates to our favorite machine learning libraries. Get ready for some cool new features, bug fixes, and a sprinkle of optimizations. Let's dive in! 🚀
- New feature: TensorBoard now has an `inference_latency_chart`! 🎉 This new feature lets you visualize how long your model's inference takes, helping you make smarter optimization decisions.
- New feature: Say hello to per-channel quantization in LiteRT! This enhancement allows for more precise model optimization by applying a different quantization scale to each tensor channel, improving accuracy in resource-constrained environments.
- New feature: The Qualcomm compiler plugin for TensorFlow Lite now supports per-channel quantization parameters. This update brings greater flexibility and efficiency, especially for models that benefit from per-channel quantization techniques.
- New feature: The `WhileLoopAllReduceCodeMotion` pass is now part of the XLA optimization toolkit. This addition could boost the performance of while loops by enabling more efficient code motion techniques.
- Bugfix: The XLA latency hiding scheduler got a tune-up to better handle annotated no-op instructions. The fix ensures these instructions wait for the whole annotation set to be ready before scheduling, improving performance.
- Bugfix: We squashed a bug causing crashes in the XLA latency hiding scheduler with non-standard async ops. The scheduler now handles complex dependencies more effectively, ensuring smooth operation.
- Bugfix: Fixed a range analysis bug in XLA where operand ranges weren't multiplied correctly with constants. The updated logic ensures accurate range calculations, strengthening the reliability of the XLA service.
- Improvement: TensorFlow's profiler just got a boost! It now supports sampling for inference profiles, making it easier to analyze inference performance with more detailed statistics.
- Improvement: Essential StepEvents have been added for GPU inference profiles, enhancing the profiling capabilities of TensorFlow applications running on GPUs.
- Chore: Clean-up time! The `--xla_gpu_experimental_enable_triton_softmax_priority_fusion` flag has been removed from the XLA GPU compiler's API, simplifying the codebase by eliminating an unnecessary experimental feature.
That's all for now, folks! Keep those models running smoothly and efficiently. 🌟
Included Commits
This commit enhances the XLA (Accelerated Linear Algebra) library by adding support for splitting the "ragged all-to-all" operation into asynchronous start and done phases. Specifically, it introduces a new predicate for converting ragged all-to-all operations and modifies the `AsyncCollectiveCreator` class to handle this new operation type alongside existing collective operations like all-to-all and reduce-scatter. The changes include updates to the matching logic for collectives and the implementation of the asynchronous start-done pattern for the ragged all-to-all operation.
Additionally, the commit includes extensive updates to the testing framework, renaming existing tests to reflect the new collective creator structure and adding new tests specifically for the ragged all-to-all functionality. The new tests verify that the operation is correctly transformed into an asynchronous format, ensuring that the start and done phases are accurately represented in the computation graph. Overall, this commit strengthens XLA's ability to handle complex collective operations more efficiently and asynchronously.
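As an illustration of the start/done pattern, here is a minimal Python sketch of the rewrite on a toy op graph. The `Op` type and naming scheme are hypothetical stand-ins for illustration, not XLA's actual HLO data structures:

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    kind: str
    operands: list = field(default_factory=list)

def split_into_start_done(op: Op) -> tuple[Op, Op]:
    """Rewrite a blocking collective into an async start/done pair.

    The 'start' op launches the communication and the 'done' op waits
    for it, letting the scheduler overlap compute with the transfer.
    """
    start = Op(name=f"{op.name}-start", kind=f"{op.kind}-start",
               operands=op.operands)
    done = Op(name=f"{op.name}-done", kind=f"{op.kind}-done",
              operands=[start])
    return start, done

sync = Op("ragged-all-to-all.1", "ragged-all-to-all", operands=["data"])
start, done = split_into_start_done(sync)
print(start.kind, "->", done.kind)  # ragged-all-to-all-start -> ragged-all-to-all-done
```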
Files changed
- third_party/xla/xla/hlo/transforms/collectives/async_collective_creator.cc
- third_party/xla/xla/hlo/transforms/collectives/async_collective_creator.h
- third_party/xla/xla/hlo/transforms/collectives/async_collective_creator_test.cc
This commit introduces enhancements to the TensorFlow profiler by enabling sampling for inference profiles and making these samples accessible through the inference profile tool. Key modifications include the addition of a new `inference_stats_sampler` component and the integration of sampled inference statistics into existing data structures. Specifically, the commit updates several files, including the addition of new message types in the protobuf definitions to accommodate sampled inference statistics, which map model indices to their respective sampled request and batch details.
The changes also involve modifications to various utility functions that process session snapshots and convert them into inference statistics. The `ConvertMultiXSpaceToInferenceStats` function is updated to accept additional parameters for request and batch columns, allowing for a more refined extraction of sampled statistics. Overall, this commit enhances the profiling capabilities within TensorFlow, providing developers with improved tools for analyzing inference performance.
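The general idea of keeping a bounded, representative sample of a request stream can be sketched with classic reservoir sampling. Note that the use of reservoir sampling here is an assumption for illustration, not a statement about `inference_stats_sampler`'s actual selection strategy:

```python
import random

def reservoir_sample(events, k, seed=0):
    """Keep a uniform random sample of up to k events from a stream."""
    rng = random.Random(seed)
    sample = []
    for i, event in enumerate(events):
        if i < k:
            sample.append(event)
        else:
            j = rng.randint(0, i)  # replace an existing slot with prob k/(i+1)
            if j < k:
                sample[j] = event
    return sample

# Per-model sampling: map model index -> sampled request latencies (us).
requests_by_model = {0: range(10_000), 1: range(500)}
sampled = {m: reservoir_sample(reqs, k=100)
           for m, reqs in requests_by_model.items()}
print({m: len(s) for m, s in sampled.items()})  # {0: 100, 1: 100}
```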
Files changed
- tensorflow/core/profiler/convert/BUILD
- tensorflow/core/profiler/convert/inference_stats_sampler.h
- tensorflow/core/profiler/convert/multi_xspace_to_inference_stats.cc
- tensorflow/core/profiler/convert/multi_xspace_to_inference_stats.h
- tensorflow/core/profiler/convert/xplane_to_tools_data.cc
- tensorflow/core/profiler/protobuf/inference_stats.proto
The recent commit introduces the `inference_latency_chart` feature to TensorBoard, enhancing its capabilities for visualizing inference latency metrics. Modifications were made to the TensorFlow profiler's conversion components, specifically in the `BUILD` file and the `xplane_to_tools_data.cc` source file. The changes include the addition of a dependency on the `compute_inference_latency` module and the integration of inference latency statistics into the overview page generated by the profiler.
In the code, new logic was added to compute inference latency results based on the collected inference statistics and incorporate these results into the overview page. This enhancement aims to provide users with better insights into the performance of their models by visualizing how long inference processes take, thereby facilitating more informed optimization decisions.
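A latency chart is typically built from percentile statistics over sampled request durations. Here is a hedged sketch of that computation; the function name and output format are illustrative, not the actual `compute_inference_latency` API:

```python
def latency_percentiles(durations_us, percentiles=(50, 75, 90, 99)):
    """Compute latency percentiles with linear interpolation."""
    data = sorted(durations_us)
    results = {}
    for p in percentiles:
        rank = (p / 100) * (len(data) - 1)
        lo, hi = int(rank), min(int(rank) + 1, len(data) - 1)
        frac = rank - lo
        results[f"p{p}"] = data[lo] * (1 - frac) + data[hi] * frac
    return results

# Sampled per-request inference times in microseconds.
print(latency_percentiles([120, 95, 300, 110, 150, 101, 98, 2500]))
```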
Files changed
- tensorflow/core/profiler/convert/BUILD
- tensorflow/core/profiler/convert/xplane_to_tools_data.cc
This commit introduces essential StepEvents for enhancing inference profiles on GPU within the TensorFlow profiler framework. The modifications primarily occur in the `BUILD` file and the `multi_xspace_to_inference_stats.cc` source file. The changes include the addition of new utility functions and library dependencies, specifically focusing on the conversion of device and host trace data into StepEvents. The new functionality allows for the extraction and combination of non-overlapping StepEvents from GPU and host threads, which is crucial for accurate inference profiling.
In detail, the `GetNonOverlappedStepEvents` function has been implemented to gather StepEvents from GPU device traces and host threads, ensuring that overlapping events are managed correctly. This enhancement facilitates a more precise analysis of inference performance by providing a clearer view of execution timelines, thereby improving the overall profiling capabilities for TensorFlow applications running on GPUs. The commit ultimately aims to bolster the profiling tools available for developers working with TensorFlow, particularly in GPU contexts.
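Conceptually, "non-overlapped" means host-side events that collide with device events are dropped so each time span is counted once. A minimal sketch of that merge, assuming events are simple (start, end) intervals and device events take precedence (the precedence rule is our assumption):

```python
def non_overlapped_events(device_events, host_events):
    """Merge host events into device events, dropping any host event
    that overlaps a device event; device timelines take precedence."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    kept = [h for h in host_events
            if not any(overlaps(h, d) for d in device_events)]
    return sorted(device_events + kept)

device = [(10, 20), (30, 45)]
host = [(0, 5), (15, 25), (50, 60)]   # (15, 25) overlaps a GPU event
print(non_overlapped_events(device, host))
# [(0, 5), (10, 20), (30, 45), (50, 60)]
```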
Files changed
- tensorflow/core/profiler/convert/BUILD
- tensorflow/core/profiler/convert/multi_xspace_to_inference_stats.cc
This commit addresses an issue in the XLA latency hiding scheduler related to the handling of annotated no-op instructions. Previously, these instructions were incorrectly placed in a no-op set, which allowed them to be scheduled immediately upon availability. The fix modifies this behavior by ensuring that annotated no-op instructions are added to the ready set instead, requiring the entire annotation set to be ready before scheduling. The implementation changes include adjustments in the scheduling logic to accommodate this new handling.
Additionally, the commit introduces a new test case for annotated no-op instructions to validate the corrected scheduling behavior. The test checks that the scheduling of operations respects the dependencies imposed by the annotations, ensuring that the sequence of operations adheres to the expected execution order. This change enhances the scheduler's ability to effectively manage instruction dependencies, thereby improving overall performance in scenarios involving annotated no-ops.
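The gating rule can be illustrated with a small sketch: an annotated instruction becomes schedulable only once every instruction sharing its annotation is ready. The dict-based bookkeeping below is a simplified stand-in for the scheduler's internals, not its real data structures:

```python
def schedulable(instr, ready, groups):
    """An annotated instruction may only be scheduled once every
    instruction sharing its annotation is itself ready."""
    group = groups.get(instr.get("annotation"))
    if group is None:          # unannotated: ready means schedulable
        return instr["name"] in ready
    return group <= ready      # the whole annotation set must be ready

groups = {"a1": {"noop.1", "copy.2", "send.3"}}
ready = {"noop.1", "copy.2"}
noop = {"name": "noop.1", "annotation": "a1"}
print(schedulable(noop, ready, groups))               # False: send.3 not ready
print(schedulable(noop, ready | {"send.3"}, groups))  # True
```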
Files changed
- third_party/xla/xla/service/latency_hiding_scheduler.cc
- third_party/xla/xla/service/latency_hiding_scheduler_test.cc
This commit introduces the registration of the `WhileLoopAllReduceCodeMotion` optimization pass to the HLO (High-Level Optimizer) tool within the XLA (Accelerated Linear Algebra) framework. The changes involve modifications to two files: `BUILD` and `opt_lib.cc`. In the `BUILD` file, the new pass is added to the list of existing optimization passes, while in `opt_lib.cc`, the corresponding header file for the new pass is included and the pass is registered in the `OptProvider::RegisterAllHardwareIndependentPasses()` function.
By integrating this new pass, the commit enhances the optimization capabilities of the HLO tool, potentially improving the performance of while loops in XLA by enabling more efficient code motion techniques. This addition signifies ongoing efforts to refine and extend the optimization features available to developers using the XLA framework.
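For intuition, the kind of transformation a while-loop all-reduce code-motion pass enables looks like this: accumulate partial results locally inside the loop and perform a single collective afterwards. The `all_reduce` helper below is a hypothetical stand-in for the runtime collective, and the sketch shows the algebraic equivalence, not the pass's implementation:

```python
import functools
import operator

def all_reduce(values):
    """Stand-in for a cross-replica sum (hypothetical helper)."""
    return functools.reduce(operator.add, values)

replicas = [[1, 2, 3], [10, 20, 30]]  # per-replica values for 3 loop steps

# Before: an all-reduce inside every loop iteration (3 collectives).
acc_before = sum(all_reduce([r[i] for r in replicas]) for i in range(3))

# After code motion: accumulate locally, all-reduce once (1 collective).
local_sums = [sum(r) for r in replicas]
acc_after = all_reduce(local_sums)

assert acc_before == acc_after == 66
```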
Files changed
- third_party/xla/xla/tools/hlo_opt/BUILD
- third_party/xla/xla/tools/hlo_opt/opt_lib.cc
This commit introduces support for per-channel quantization in LiteRT, enhancing the framework's capabilities for model optimization. The changes span multiple files, including modifications to model definitions, utility functions, and associated tests to accommodate the new quantization method. Key files affected include `litert_model.cc`, `flatbuffer_to_litert.cc`, and various test files, ensuring that the new functionality is thoroughly validated.
By implementing per-channel quantization, LiteRT can potentially improve the accuracy of quantized models by allowing different quantization scales for each channel of a tensor, rather than applying a uniform scale across all channels. This enhancement is crucial for optimizing performance in resource-constrained environments, making the framework more versatile for deploying machine learning models efficiently.
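The underlying math is worth seeing concretely. The following numpy sketch shows symmetric per-channel quantization in general terms; it illustrates the technique, not LiteRT's internal representation:

```python
import numpy as np

def quantize_per_channel(weights, axis=0, bits=8):
    """Symmetric per-channel quantization: one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    reduce_axes = tuple(i for i in range(weights.ndim) if i != axis)
    scales = np.max(np.abs(weights), axis=reduce_axes) / qmax
    shape = [1] * weights.ndim
    shape[axis] = -1
    q = np.clip(np.round(weights / scales.reshape(shape)), -qmax - 1, qmax)
    return q.astype(np.int8), scales

w = np.array([[0.01, -0.02],    # channel 0: small dynamic range
              [5.00, -4.00]])   # channel 1: much larger range
q, scales = quantize_per_channel(w)
print(scales)  # a distinct scale per channel keeps channel 0 precise
```

With a single tensor-wide scale, channel 0's tiny weights would collapse to zero; per-channel scales preserve them, which is exactly the accuracy benefit described above.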
Files changed
- tensorflow/lite/experimental/litert/c/BUILD
- tensorflow/lite/experimental/litert/c/litert_model.cc
- tensorflow/lite/experimental/litert/c/litert_model.h
- tensorflow/lite/experimental/litert/c/litert_model_test.cc
- tensorflow/lite/experimental/litert/cc/litert_model.h
- tensorflow/lite/experimental/litert/cc/litert_model_test.cc
- tensorflow/lite/experimental/litert/core/model/BUILD
- tensorflow/lite/experimental/litert/core/model/flatbuffer_to_litert.cc
- tensorflow/lite/experimental/litert/core/model/flatbuffer_to_litert.h
- tensorflow/lite/experimental/litert/core/model/flatbuffer_to_litert_test.cc
- tensorflow/lite/experimental/litert/core/model/litert_to_flatbuffer.cc
- tensorflow/lite/experimental/litert/core/model/litert_to_flatbuffer.h
- tensorflow/lite/experimental/litert/core/model/litert_to_flatbuffer_test.cc
- tensorflow/lite/experimental/litert/core/model/model.h
- tensorflow/lite/experimental/litert/core/model/model_file_test_util.cc
- tensorflow/lite/experimental/litert/core/model/model_load.cc
- tensorflow/lite/experimental/litert/core/util/BUILD
- tensorflow/lite/experimental/litert/core/util/flatbuffer_tools.cc
- tensorflow/lite/experimental/litert/core/util/flatbuffer_tools.h
- tensorflow/lite/experimental/litert/core/util/flatbuffer_tools_test.cc
- tensorflow/lite/experimental/litert/tools/dump.cc
- tensorflow/lite/experimental/litert/tools/dump_test.cc
The commit removes the `--xla_gpu_experimental_enable_triton_softmax_priority_fusion` flag from the XLA GPU compiler's API, as it is deemed unnecessary. This change involves deleting the flag from various files, including the debug options and test cases, streamlining the code and eliminating redundancy. The removal is also reflected in the associated test classes, which have been renamed to better represent their purpose without the now-obsolete flag.
In addition to the deletion of the flag, the commit includes updates to several test cases and the GPU compiler logic to ensure compatibility with this change. The refactoring aims to enhance code clarity and maintainability, as the functionality associated with the flag is no longer required. Overall, this commit represents a step toward simplifying the XLA GPU codebase by removing outdated experimental features.
Files changed
- third_party/xla/xla/debug_options_flags.cc
- third_party/xla/xla/service/gpu/fusions/triton/triton_fusion_emitter_large_test.cc
- third_party/xla/xla/service/gpu/fusions/triton/triton_fusion_emitter_parametrized_test.cc
- third_party/xla/xla/service/gpu/gpu_compiler.cc
- third_party/xla/xla/service/gpu/transforms/triton_fusion_numerics_verifier.cc
- third_party/xla/xla/service/gpu/transforms/triton_fusion_numerics_verifier_test.cc
- third_party/xla/xla/xla.proto
This commit addresses a crash issue in the XLA Latency Hiding Scheduler related to non-standard asynchronous operations. Specifically, it resolves problems that arise when the "done" operation does not consume the corresponding "start" operation, particularly in scenarios involving partial pipeline parallelism. The crash occurs because the "done" operations from previous iterations can create reverse data dependencies that disrupt the expected scheduling order. To remedy this, the commit removes the requirement for the "done" operation to consume the "start" operation and allows for non-traditional traversal of these operations.
The changes include modifications to the scheduling logic in the `latency_hiding_scheduler.cc` file, which now accommodates cases where "recv-done" operations exist without corresponding "recv" operations. Additionally, new test cases have been added to ensure that the scheduler can handle out-of-order "start" and "done" operations without crashing. The commit enhances the robustness of the scheduler by allowing it to manage complex dependencies more effectively, thereby improving the overall functionality of the XLA framework.
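A simplified sketch of the defensive behavior: when resolving the "start" that matches a "done", a missing match inside the current region is now a handled case rather than a fatal one. The dict-based ops and matching rule below are illustrative assumptions, not the scheduler's real data structures:

```python
def find_start_for_done(done, scheduled_region):
    """Resolve the matching 'start' for an async 'done' op.

    Returns None instead of failing when the 'start' lives outside the
    region (e.g. a recv-done carried across pipeline iterations), so the
    scheduler can treat the pair as out-of-order rather than crash.
    """
    wanted_kind = done["kind"].replace("done", "start")  # simplified naming
    for op in scheduled_region:
        if op["kind"] == wanted_kind and op["channel"] == done["channel"]:
            return op
    return None  # no matching start in this region: handled, not fatal

region = [{"kind": "send-start", "channel": 1}]
recv_done = {"kind": "recv-done", "channel": 2}
print(find_start_for_done(recv_done, region))  # None
```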
Files changed
- third_party/xla/xla/service/latency_hiding_scheduler.cc
- third_party/xla/xla/service/latency_hiding_scheduler_test.cc
This commit introduces the functionality to combine different hardware types in the `CombineRunEnvironment` function within TensorFlow's profiling module. Specifically, it ensures that when merging two `RunEnvironment` instances, if there is a discrepancy in the hardware types, the function selects the highest hardware type. For example, if one instance indicates a TPU or GPU while the other indicates CPU_ONLY, the resulting combined environment will reflect the TPU or GPU as the dominant hardware type. This change is crucial for accurately representing the hardware capabilities in profiling scenarios.
Additionally, the commit includes a new test case to validate this behavior, ensuring that when combining operational statistics from different environments, the resulting hardware type correctly reflects the highest priority hardware. The test confirms that if a coordinator operation is set to CPU_ONLY and a device operation is set to TPU, the combined result will indicate TPU as the hardware type. Overall, these modifications enhance the robustness of the profiling system by accurately accounting for hardware configurations.
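The "highest hardware type wins" rule is easy to picture as a max over an ordered enum. The sketch below mirrors the behavior described above; the enum values are illustrative rather than the actual proto definitions:

```python
from enum import IntEnum

class HardwareType(IntEnum):       # illustrative ordering: higher wins
    UNKNOWN = 0
    CPU_ONLY = 1
    GPU = 2
    TPU = 3

def combine_hardware(a: HardwareType, b: HardwareType) -> HardwareType:
    """When merging two RunEnvironments, keep the dominant hardware."""
    return max(a, b)

coordinator, device = HardwareType.CPU_ONLY, HardwareType.TPU
print(combine_hardware(coordinator, device).name)  # TPU
```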
Files changed
- tensorflow/core/profiler/convert/op_stats_combiner.cc
- tensorflow/core/profiler/convert/op_stats_combiner_test.cc
This commit addresses a bug in the range analysis of operand multiplication with constants in the XLA (Accelerated Linear Algebra) service. The issue was that the step value was not being correctly multiplied when the operand was a constant, leading to incorrect range calculations. The fix involved modifying the logic in the `RecursivelyIdentifyRange` function to ensure that when multiplying operand ranges with a constant, all components (minimum, maximum, and step) are accurately multiplied by the constant value.
In addition to the code changes, the commit also includes updates to the corresponding unit tests to reflect the corrected behavior. The tests now properly validate the ranges after multiplication, ensuring that the minimum, maximum, and step values are computed as expected. This enhancement not only resolves the existing bug but also strengthens the overall reliability of the range analysis functionality within the XLA service.
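The corrected arithmetic is easy to state: when a range is multiplied by a constant, the minimum, maximum, and step must all be scaled (with the bounds swapped for a negative constant). A small sketch of that logic, independent of XLA's actual Range class:

```python
from dataclasses import dataclass

@dataclass
class Range:
    min: int
    max: int
    step: int  # stride between representable values

def multiply_by_constant(r: Range, c: int) -> Range:
    """Scale every component of the range, not just min and max.
    A negative constant also flips the bounds."""
    lo, hi = sorted((r.min * c, r.max * c))
    return Range(lo, hi, abs(r.step * c))

r = Range(min=0, max=8, step=2)       # represents {0, 2, 4, 6, 8}
print(multiply_by_constant(r, 3))     # Range(min=0, max=24, step=6)
```

The bug was precisely the missing `step * c`: leaving the step unscaled makes the analysis claim values (like 2 or 4 above) that the multiplied expression can never take.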
Files changed
- third_party/xla/xla/service/value_range.cc
- third_party/xla/xla/service/value_range_test.cc
This commit introduces support for per-channel quantization parameters within the Qualcomm compiler plugin for TensorFlow Lite. It modifies several files to accommodate this new functionality, including the addition of functions to handle per-channel quantization. Specifically, the `SetPerChannelQuantization` function is implemented to set the quantization parameters for tensors based on the number of channels and their respective scales and zero points. Additionally, a new function, `FreePerChannelQuantization`, is added to manage memory allocation for these parameters, ensuring that the system can handle the increased complexity of per-channel quantization without memory leaks.
Furthermore, the commit includes updates to the testing framework to validate the correct implementation of per-channel quantization. A new test case, `TestLegalizeTensor`, is created to check if the per-channel quantized tensors are processed correctly, confirming that the quantization parameters are set as expected. Overall, this change enhances the flexibility and efficiency of the TensorFlow Lite framework, particularly for models that benefit from per-channel quantization techniques.
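As a rough illustration of what such a legalization step must check, here is a Python sketch that validates per-channel parameter arrays against the channel count before attaching them to a tensor. The dict-based tensor and the function's shape are hypothetical, not the QNN plugin's C++ API:

```python
def set_per_channel_quantization(tensor, scales, zero_points, quantized_dim):
    """Attach per-channel params after validating array lengths
    against the size of the quantized dimension."""
    num_channels = tensor["dims"][quantized_dim]
    if len(scales) != num_channels or len(zero_points) != num_channels:
        raise ValueError(
            f"expected {num_channels} scales/zero points, "
            f"got {len(scales)}/{len(zero_points)}")
    tensor["quantization"] = {
        "quantized_dimension": quantized_dim,
        "scales": list(scales),
        "zero_points": list(zero_points),
    }
    return tensor

t = {"dims": [16, 3, 3, 8]}           # e.g. conv weights, 16 output channels
set_per_channel_quantization(t, scales=[0.1] * 16,
                             zero_points=[0] * 16, quantized_dim=0)
print(t["quantization"]["quantized_dimension"])  # 0
```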
Files changed
- tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/IR/BUILD
- tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/IR/qnn_tensor.cc
- tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/IR/qnn_tensor_test.cc
- tensorflow/lite/experimental/litert/vendors/qualcomm/compiler/qnn_compiler_plugin_test.cc