TensorFlow Changelog


Welcome to the latest and greatest update roundup! 🚀 We've been busy bees, buzzing around and making some awesome improvements to our beloved frameworks. Here's the lowdown on what's new, what's improved, and what's been squashed:

  • New feature: Nested Calls in XLA:CPU
    Our ElementalKernelEmitter has leveled up! It can now handle nested calls, enhancing the CPU backend's kernel generation capabilities. This means more efficient and flexible computations are on the horizon!

  • New feature: Pinning Tensors on TPU
    Introducing tensor pinning to device SRAM on TPUs via custom calls. This update optimizes memory management, ensuring your computations run smoother and faster.

  • Improvement: Automated Code Changes in TensorFlow MLIR
    We've unleashed a flurry of automated updates across TensorFlow's MLIR compiler, enhancing everything from variable initialization to layout optimization. It's like a turbo boost for model compilation and execution!

  • New feature: XLA:CPU Collectives API
    Say hello to the new collectives API for XLA:CPU! This fresh addition supports collective operations, paving the way for optimized machine learning performance on CPUs.

  • Improvement: HloInstruction & BufferAssignment in XLA:CPU
    We've supercharged the XLA CPU backend by refining the EmitKernelPrototype process, leading to more efficient memory handling and kernel execution. It's all about making things faster and cleaner!

  • New feature: XLA GPU Documentation
    We've added a comprehensive guide to the XLA GPU architecture, complete with visual aids and examples. This documentation is your new best friend for navigating the GPU compiler pipeline.

  • Improvement: Transposed Convolutions in XLA:CPU
    Our transposed convolution algorithm now supports multiple input and output channels, with jaw-dropping speedups to boot: processing time cut by over 99% in some 1D benchmarks!

  • New feature: TFLite Quantization Option
    TFLite users, rejoice! You can now disable per-channel quantization for dense layers, giving you more control over your model's quantization strategy.

  • Chore: Temporary Wheel Size Increase
    We've temporarily increased the wheel size limit to keep those nightly builds rolling smoothly. It's a quick fix while we sort out the underlying issues.

  • Bugfix: ShapeError Crashes in XLA
    We've tackled a pesky bug that caused crashes when element_type was out of bounds. Now, we print the integer value instead, making error reporting clearer and more robust.

That's all for now, folks! Keep those updates coming, and we'll keep making things better, faster, and more awesome. 🎉

Included Commits

2025-01-01T05:12:22 See commit

This commit introduces a series of automated modifications across various files within the TensorFlow MLIR (Multi-Level Intermediate Representation) compiler. The changes primarily affect transformation files related to TensorFlow's compilation processes, including variable initialization, layout optimization, and resource management. The updates span a wide range of components, such as initialize_variables_in_session_init.cc, lower_tf.cc, and optimize.cc, indicating a comprehensive overhaul aimed at enhancing the efficiency and functionality of the MLIR framework.

Because the modifications are automated, this is best read as a broad, mechanical refresh of the transformation passes rather than a targeted feature. Housekeeping at this scale still matters: it keeps the codebase consistent and paves the way for more effective model compilation and execution strategies in future versions.

Files changed

  • tensorflow/compiler/mlir/tensorflow/transforms/BUILD
  • tensorflow/compiler/mlir/tensorflow/transforms/init_text_file_to_import.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/init_text_file_to_import_test_pass.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/initialize_variables_in_session_init.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/initialize_variables_in_session_init_test_pass.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/launch_to_device_attribute.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/layout_optimization.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/lift_variables.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/lift_variables_test_pass.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/lower_quantized.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/lower_tf.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/lower_tf_test_pass.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/mark_initialized_variables.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/mark_input_output_aliases.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/merge_control_flow.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/mlprogram.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/optimize.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/optimize_global_tensors.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/order_by_dialect.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/prepare_tpu_computation_for_tf_export.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/promote_resources_to_args.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/readonly_references_to_resources.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/remove_unused_arguments.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/remove_unused_while_results.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/remove_vars_in_session_initializer.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/replica_id_to_device_ordinal.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/replicate_to_island.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/resource_device_inference.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/resource_op_lifting.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/rewrite_tpu_embedding_ops.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/rewrite_util.h
  • tensorflow/compiler/mlir/tensorflow/transforms/set_tpu_infeed_layout.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/shape_inference.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/sink_constant.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/stack_ops_decomposition.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/strip_noinline_attribute.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/tensor_array_ops_decomposition.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/tensor_device_copy_conversion.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/tensor_list_ops_decomposition.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/test_cluster_ops_by_policy.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/test_resource_alias_analysis.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/test_side_effect_analysis.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/tf_data_optimization_pass.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/tf_device_assignment.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/tf_functional_to_executor.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/tf_graph_optimization_pass.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/tf_graph_optimization_pass.h
  • tensorflow/compiler/mlir/tensorflow/transforms/tf_saved_model_freeze_variables.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/tf_saved_model_freeze_variables_test_pass.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/tfg-to-tfe.cc
  • tensorflow/compiler/mlir/tensorflow/transforms/tpu_annotate_dynamic_shape_inputs.cc
2025-01-02T11:36:03 See commit

This commit introduces the capability for the ElementalKernelEmitter in the XLA (Accelerated Linear Algebra) framework to emit nested calls, enhancing its functionality within the CPU backend. The changes span multiple files across the XLA codebase, including modifications to header files, source files, and test scripts, which collectively support this new feature.

Key updates include alterations to the kernel emitter and compiler files, as well as adjustments in test libraries to ensure that the new nested call functionality is thoroughly tested and validated. By enabling nested calls, the commit aims to improve the efficiency and flexibility of kernel generation for CPU operations, potentially leading to better performance in executing complex computations.
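
To make "nested calls" concrete, here's a minimal, self-contained sketch written against plain LLVM APIs rather than XLA's actual emitter classes (which this changelog doesn't show): an outer generated kernel calls a second generated function, the shape of IR an elemental emitter produces when one computation nests another.

```cpp
// Sketch only: plain LLVM, not XLA's ElementalKernelEmitter API.
#include <memory>

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  llvm::LLVMContext ctx;
  auto module = std::make_unique<llvm::Module>("nested_call_demo", ctx);
  llvm::IRBuilder<> b(ctx);
  llvm::Type* f32 = b.getFloatTy();
  auto* fn_ty = llvm::FunctionType::get(f32, {f32}, /*isVarArg=*/false);

  // Inner computation: float inner(float x) { return x; }
  auto* inner = llvm::Function::Create(
      fn_ty, llvm::Function::InternalLinkage, "inner", module.get());
  b.SetInsertPoint(llvm::BasicBlock::Create(ctx, "entry", inner));
  b.CreateRet(inner->getArg(0));

  // Outer elemental kernel: emits a *nested call* into the inner computation.
  auto* outer = llvm::Function::Create(
      fn_ty, llvm::Function::ExternalLinkage, "outer", module.get());
  b.SetInsertPoint(llvm::BasicBlock::Create(ctx, "entry", outer));
  llvm::Value* nested = b.CreateCall(inner, {outer->getArg(0)});
  b.CreateRet(nested);

  llvm::verifyModule(*module, &llvm::errs());
  module->print(llvm::outs(), nullptr);  // dump the generated IR
}
```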

Files changed

  • third_party/xla/xla/backends/cpu/codegen/target_machine_features.h
  • third_party/xla/xla/backends/cpu/testlib/BUILD
  • third_party/xla/xla/backends/cpu/testlib/__init__.py
  • third_party/xla/xla/backends/cpu/testlib/elemental_kernel_emitter.cc
  • third_party/xla/xla/backends/cpu/testlib/elemental_kernel_emitter.h
  • third_party/xla/xla/backends/cpu/testlib/elemental_kernel_emitter_test.py
  • third_party/xla/xla/backends/cpu/testlib/kernel_runner.cc
  • third_party/xla/xla/backends/cpu/testlib/kernel_runner.h
  • third_party/xla/xla/backends/cpu/testlib/kernel_runner_extension.cc
  • third_party/xla/xla/codegen/testlib/BUILD
  • third_party/xla/xla/codegen/testlib/__init__.py
  • third_party/xla/xla/codegen/testlib/kernel_runner_extension.cc
  • third_party/xla/xla/codegen/testlib/utilities.py
  • third_party/xla/xla/service/compiler.h
  • third_party/xla/xla/service/cpu/BUILD
  • third_party/xla/xla/service/cpu/cpu_compiler.cc
  • third_party/xla/xla/service/cpu/cpu_compiler.h
2025-01-02T16:40:11 See commit

This commit introduces the initial implementation of the XLA:CPU collectives API, adding the files and functionality needed to support collective operations on the CPU backend. It creates a new package for CPU collectives that defines a C++ library (cpu_collectives) spanning source and header files. The library integrates with existing XLA components, such as shape utilities and collectives registries, ensuring that it can function seamlessly within the broader XLA framework.

The implementation consists of a CpuCollectives class that extends the existing Collectives interface, providing a method to retrieve the default CPU collectives implementation. Additionally, the commit modifies the GPU collectives implementation by removing an unnecessary logging include, indicating a focus on maintaining clean and efficient code across the XLA backends. Overall, this update lays the groundwork for enhanced collective operations on CPU, which are critical for optimizing performance in machine learning tasks.
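
The class-plus-default-accessor pattern described above is simple enough to sketch. The toy below is illustrative only (stand-in types, not XLA's actual headers): a CPU-specific class derives from an abstract Collectives interface and exposes a static accessor for the default implementation, which in XLA would typically consult a registry keyed by platform.

```cpp
// Illustrative sketch, not xla::cpu::CpuCollectives itself.
#include <cstdio>

class Collectives {  // stand-in for XLA's abstract Collectives interface
 public:
  virtual ~Collectives() = default;
  virtual const char* Name() const = 0;
};

class CpuCollectives : public Collectives {
 public:
  const char* Name() const override { return "cpu"; }

  // Returns the default CPU collectives implementation; a real registry
  // would key this on platform name and allow overrides.
  static CpuCollectives* Default() {
    static auto* instance = new CpuCollectives();
    return instance;
  }
};

int main() {
  Collectives* collectives = CpuCollectives::Default();
  std::printf("default collectives: %s\n", collectives->Name());
}
```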

Files changed

  • third_party/xla/xla/backends/cpu/collectives/BUILD
  • third_party/xla/xla/backends/cpu/collectives/cpu_collectives.cc
  • third_party/xla/xla/backends/cpu/collectives/cpu_collectives.h
  • third_party/xla/xla/backends/gpu/collectives/gpu_collectives.cc
2024-12-27T19:03:52 See commit

The commit associated with PR #20587 enhances the documentation for the XLA GPU architecture with a comprehensive overview of its compiler pipeline. The new document is a critical resource for understanding how XLA functions as a domain-specific compiler optimized for linear algebra, detailing its interaction with frameworks such as JAX, TensorFlow, and PyTorch. It includes visual aids and practical examples, such as a JAX function that demonstrates the generation of HLO (High-Level Operations) and the optimization passes that follow, including SPMD partitioning, layout assignment, and fusion.

This addition is significant as it consolidates essential information about XLA GPU into the OpenXLA documentation repository, mitigating concerns about the potential loss of the original document. The commit adds 253 lines of new content without deletions, ensuring that users have access to a well-rounded understanding of the GPU architecture and its capabilities, while also emphasizing the importance of maintaining readily available documentation in the evolving landscape of machine learning frameworks and compilers.

Files changed

  • third_party/xla/docs/gpu_architecture.md
  • third_party/xla/docs/images/annotated_module.png
  • third_party/xla/docs/images/fused_module.png
  • third_party/xla/docs/images/gpu_pipeline.png
  • third_party/xla/docs/images/layout_assigned_module.png
  • third_party/xla/docs/images/lowered_hlo.png
  • third_party/xla/docs/images/partitioned_module.png
  • third_party/xla/docs/images/pre_layout_module.png
  • third_party/xla/docs/images/triton_opt_pipeline.png
  • third_party/xla/docs/images/xla_hardware.png
2024-12-27T19:07:35 See commit

This commit introduces a new option, disable_per_channel_quantization_for_dense_layers, into the TensorFlow Lite (TFLite) calibration and quantization pipeline. This feature provides users with the ability to disable per-channel quantization specifically for dense layers, which can be beneficial in certain scenarios where this type of quantization may not yield optimal results.

The changes affect multiple files within the TFLite codebase, including modifications to Python scripts for calibration and quantization, as well as updates to related C++ files and their headers. The adjustments aim to enhance the flexibility of the quantization process, allowing developers to tailor the quantization strategy to better suit their model requirements.
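
To see what's at stake, here's an illustrative computation (not the TFLite API) of symmetric int8 scales for a tiny dense layer's weights. Per-channel quantization gives each output channel its own scale; per-tensor quantization forces one scale, dominated by the largest-magnitude channel, which coarsens the grid for small-magnitude channels:

```cpp
// Illustrative only: shows the per-channel vs per-tensor trade-off,
// not TFLite's quantizer implementation.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  // Tiny dense-layer weight matrix: 2 output channels x 3 inputs.
  std::vector<std::vector<float>> w = {{0.02f, -0.05f, 0.04f},
                                       {1.50f, -0.80f, 1.20f}};

  // Per-channel: one symmetric int8 scale per output channel.
  for (size_t c = 0; c < w.size(); ++c) {
    float max_abs = 0.f;
    for (float v : w[c]) max_abs = std::max(max_abs, std::fabs(v));
    std::printf("channel %zu scale: %g\n", c, max_abs / 127.f);
  }

  // Per-tensor: a single scale for the whole tensor, set by the largest
  // channel, so channel 0's small weights land on a much coarser grid.
  float max_abs = 0.f;
  for (const auto& row : w)
    for (float v : row) max_abs = std::max(max_abs, std::fabs(v));
  std::printf("per-tensor scale: %g\n", max_abs / 127.f);
}
```

Opting out trades away that per-channel precision, which, as noted above, can be the right call for models where per-channel quantization of dense layers doesn't yield optimal results.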

Files changed

  • tensorflow/lite/python/lite.py
  • tensorflow/lite/python/optimize/_pywrap_tensorflow_lite_calibration_wrapper.pyi
  • tensorflow/lite/python/optimize/calibration_wrapper.cc
  • tensorflow/lite/python/optimize/calibration_wrapper.h
  • tensorflow/lite/python/optimize/calibration_wrapper_pybind11.cc
  • tensorflow/lite/python/optimize/calibrator.py
  • tensorflow/lite/tools/optimize/quantize_model.cc
  • tensorflow/lite/tools/optimize/quantize_model.h
  • tensorflow/lite/tools/optimize/quantize_model_test.cc
2024-12-28T16:21:56 See commit

This commit enhances the XLA (Accelerated Linear Algebra) library for CPU by extending the custom algorithm for transposed convolutions to support multiple input and output channels simultaneously. The implementation maintains performance for existing cases while achieving significant improvements for the newly supported configurations, particularly in 1D transposed convolutions, where processing time decreased by over 99% on certain benchmarks. The commit also updates the internal algorithm and the benchmarking tests to accommodate the new capabilities, ensuring that the performance metrics accurately reflect the enhancements.

Future improvements for the algorithm are planned, including support for grouped convolutions, parallel processing of patches, and exploration of kernel rotation effects on performance. The changes involve updates to several files within the XLA codebase, highlighting both the complexity and the potential for further optimization in convolution operations. The commit demonstrates a commitment to enhancing the efficiency and versatility of the library, which is crucial for various machine learning and deep learning applications.
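
For intuition about what the kernel computes, here's a dependency-free, single-channel 1D transposed convolution, the base case that the updated algorithm generalizes to multiple input and output channels (illustrative, not the XLA implementation):

```cpp
// Illustrative single-channel 1D transposed convolution.
#include <cstdio>
#include <vector>

std::vector<float> TransposedConv1D(const std::vector<float>& input,
                                    const std::vector<float>& kernel,
                                    int stride) {
  std::vector<float> out((input.size() - 1) * stride + kernel.size(), 0.f);
  // Each input element scatters a scaled copy of the kernel into the output,
  // the reverse of a forward convolution's gather.
  for (size_t i = 0; i < input.size(); ++i)
    for (size_t k = 0; k < kernel.size(); ++k)
      out[i * stride + k] += input[i] * kernel[k];
  return out;
}

int main() {
  auto out = TransposedConv1D({1.f, 2.f, 3.f}, {1.f, 1.f, 1.f}, /*stride=*/2);
  for (float v : out) std::printf("%g ", v);  // prints: 1 1 3 2 5 3 3
  std::printf("\n");
}
```

With multiple channels, the scatter gains two more loops (over input and output channels) plus an accumulation across input channels, which is the configuration the new implementation now handles efficiently.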

Files changed

  • third_party/xla/xla/backends/cpu/runtime/convolution_thunk_internal.h
  • third_party/xla/xla/service/cpu/benchmarks/convolution_benchmark_test.cc
2024-12-28T20:13:30 See commit

This commit introduces support for pinning tensors to device SRAM (Static Random Access Memory) through custom calls in the XLA (Accelerated Linear Algebra) framework for TPUs (Tensor Processing Units). The changes affect multiple files, including modifications to the memory placement conversion processes, host memory offload annotations, and memory space assignment algorithms.

Specifically, the updates include alterations to the build configurations and tests associated with memory space assignment, ensuring that the new functionality is both integrated and validated within the existing framework. This enhancement aims to optimize memory management on TPUs by allowing for more efficient tensor placement, potentially improving performance in computational tasks.
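
Here's the mechanism in miniature: the framework tags a value with a custom call, and a later compiler pass rewrites that tag into a memory-space assignment. Everything below is illustrative; the target name ("PinToDeviceSram") and memory-space numbering are hypothetical stand-ins, not XLA's actual constants:

```cpp
// Toy annotation-rewriting pass; names and numbering are hypothetical.
#include <cstdio>
#include <string>
#include <vector>

struct Instruction {
  std::string opcode;       // e.g. "custom-call"
  std::string call_target;  // e.g. "PinToDeviceSram" (hypothetical name)
  int memory_space = 0;     // 0 = default memory, 1 = device SRAM (toy values)
};

// Walk the computation and turn pinning annotations into assignments.
void AssignMemorySpaces(std::vector<Instruction>& computation) {
  for (auto& instr : computation) {
    if (instr.opcode == "custom-call" &&
        instr.call_target == "PinToDeviceSram") {
      instr.memory_space = 1;  // pin the annotated value to SRAM
    }
  }
}

int main() {
  std::vector<Instruction> computation = {
      {"add", "", 0}, {"custom-call", "PinToDeviceSram", 0}};
  AssignMemorySpaces(computation);
  std::printf("annotated value memory_space = %d\n",
              computation[1].memory_space);  // prints 1
}
```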

Files changed

  • third_party/xla/xla/hlo/transforms/BUILD
  • third_party/xla/xla/hlo/transforms/convert_memory_placement_to_internal_annotations.cc
  • third_party/xla/xla/hlo/transforms/convert_memory_placement_to_internal_annotations_test.cc
  • third_party/xla/xla/service/host_memory_offload_annotations.h
  • third_party/xla/xla/service/memory_space_assignment/BUILD
  • third_party/xla/xla/service/memory_space_assignment/algorithm.cc
  • third_party/xla/xla/service/memory_space_assignment/memory_space_assignment_test.cc
2024-12-30T11:09:51 See commit

This commit introduces significant enhancements to the XLA CPU backend by enabling the passing of HloInstruction and BufferAssignment to the EmitKernelPrototype function. The modifications involve updates to the kernel_api_ir_builder.cc file, where a new class, MemoryDependencyAnalyzer, is added to manage memory dependencies and aliasing for kernel parameters. This class helps construct metadata for buffer slices, allowing for better management of memory usage and optimizations during kernel execution. Additionally, the commit refines the process of gathering kernel arguments and results, ensuring that aliasing information is accurately computed when a buffer assignment is provided.

Changes also extend to the ir_emitter2.cc file, where the previous methods for obtaining allocation slices and kernel parameters have been removed in favor of the new approach that leverages the updated EmitKernelPrototype. This consolidation simplifies the code and enhances maintainability, while also improving the performance of the XLA CPU backend by ensuring more efficient memory handling and aliasing metadata generation. Overall, these changes contribute to more robust kernel generation and execution within the XLA framework.
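
The core question that the aliasing metadata answers reduces to a small check, sketched below with illustrative types (not XLA's actual buffer-assignment classes): two kernel arguments may alias only if their buffer slices overlap within the same allocation.

```cpp
// Toy alias check over buffer slices; types are illustrative.
#include <cstdio>

struct BufferSlice {
  int allocation_index;  // which underlying allocation the slice lives in
  long offset;           // byte offset within that allocation
  long size;             // byte size of the slice
};

bool MayAlias(const BufferSlice& a, const BufferSlice& b) {
  if (a.allocation_index != b.allocation_index) return false;
  // Same allocation: the slices alias iff their byte ranges overlap.
  return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}

int main() {
  BufferSlice arg{/*allocation_index=*/0, /*offset=*/0, /*size=*/256};
  BufferSlice result{/*allocation_index=*/0, /*offset=*/256, /*size=*/256};
  std::printf("may alias: %s\n", MayAlias(arg, result) ? "yes" : "no");  // no
}
```

When every argument/result pair is provably non-overlapping, the emitter can attach noalias-style metadata to the kernel prototype, giving LLVM far more freedom to vectorize and reorder memory operations.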

Files changed

  • third_party/xla/xla/backends/cpu/codegen/BUILD
  • third_party/xla/xla/backends/cpu/codegen/kernel_api_ir_builder.cc
  • third_party/xla/xla/backends/cpu/codegen/kernel_api_ir_builder.h
  • third_party/xla/xla/service/cpu/ir_emitter2.cc
  • third_party/xla/xla/service/cpu/ir_emitter2.h
2024-12-30T19:08:27 See commit

This commit modifies the configuration for a continuous integration environment, specifically for Linux x86 builds, to temporarily increase the wheel size limit from 240MB to 250MB. The change is intended as a quick fix to address issues with nightly builds that may arise due to the previous size constraint.

The commit includes a note that the limit will be reverted to 240MB once the underlying issue is resolved, keeping nightly builds green in the meantime.

Files changed

  • ci/official/envs/linux_x86
2024-12-31T09:35:33 See commit

The commit addresses a bug in the XLA (Accelerated Linear Algebra) framework that caused crashes due to ShapeError when the element_type of a shape was not defined within the expected enumeration. Instead of attempting to pretty-print the type's name, which was not feasible for undefined types, the code now prints the underlying integer value of the element_type. This change enhances the robustness of error reporting by providing clearer information about the invalid type encountered.

Modifications were made to include additional checks for the validity of element_type before attempting to retrieve its name. The code was updated to use a more reliable indexing method for the lowercase_name_ array, ensuring that only valid types are processed. Furthermore, new test cases were added to validate the behavior when encountering out-of-range element_type values, thereby improving the overall stability and reliability of shape validation within the XLA framework.
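
The fix pattern is easy to show in miniature (the enum and name table below are illustrative, not xla::PrimitiveType): index the name table only for in-range values, and otherwise report the raw integer instead of reading past the end of the array.

```cpp
// Toy version of the bounds-checked name lookup; not XLA's actual tables.
#include <cstdio>
#include <string>

enum ToyPrimitiveType { kF32 = 0, kS32 = 1, kTypeCount = 2 };
static const char* const kLowercaseNames[kTypeCount] = {"f32", "s32"};

std::string PrimitiveTypeName(int element_type) {
  if (element_type >= 0 && element_type < kTypeCount) {
    return kLowercaseNames[element_type];  // valid: pretty-print the name
  }
  // Out of range: report the integer value instead of crashing on a
  // bad table index.
  return "invalid primitive type: " + std::to_string(element_type);
}

int main() {
  std::printf("%s\n", PrimitiveTypeName(1).c_str());   // s32
  std::printf("%s\n", PrimitiveTypeName(42).c_str());  // invalid primitive type: 42
}
```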

Files changed

  • third_party/xla/xla/primitive_util.cc
  • third_party/xla/xla/shape_util.cc
  • third_party/xla/xla/shape_util_test.cc