TensorFlow changelog


Hey there, code enthusiasts! We've got a fresh batch of updates that are sure to make your TensorFlow experience even more exciting. Dive into the latest changes and enhancements that have been made to improve performance, add new features, and fix those pesky bugs. Let's take a closer look at what's new and improved:

  • New feature 🚀: We've introduced a new XlaOp for a custom combiner backward pass, enhancing TensorFlow's capabilities in handling sparse-dense matrix multiplication operations. This update is a big win for those optimizing deep learning models on TPUs with sparse data structures.

  • New feature 🌟: Direct translations for unary elementwise operations from StableHLO to HLO are now available, streamlining the process and improving performance for numerical computations in XLA. Say hello to seamless handling of operations like cosine, sine, and tangent!

  • Improvement 🎉: A progress bar has been added to stdout for long-running matcher processes, giving you visual feedback and making those waiting times a bit more bearable. Keep an eye on the progress and know exactly where you stand!

  • New feature 🆕: We've expanded direct translation support for BroadcastOp, BroadcastInDimOp, and DynamicBroadcastInDimOp from StableHLO to HLO. This enhancement ensures better handling of broadcast dimensions and shapes, making your operations run smoother.

  • Bugfix 🔧: We've fixed an integer overflow issue in the TFL::FullyConnectedOp::verify() function by switching to int64_t for storing num_elements. This fix prevents erroneous outputs and ensures accurate calculations even with large tensor sizes (see the sketch after this list).

  • Improvement 🚀: Hot array iterations just got a performance boost with a new templated Array::Each API variation. This change eliminates type-erasure and virtual calls, optimizing those critical code paths for better efficiency.

  • New feature 🌟: Scoped alternate memory allocations can now expand to the biggest free chunk at the end of MSA, improving memory utilization and reducing fragmentation for optimized execution performance.

  • New feature 🆕: Binary elementwise operations can now be directly translated from StableHLO to HLO, broadening the scope of operations and enhancing the efficiency of machine learning models relying on these operations.

  • Improvement 🎉: GPU command buffers are now smarter with automatic inference of command dependencies using an execution graph. Enable xla_gpu_graph_enable_concurrent_region and enjoy more efficient command execution!

  • Chore 🧹: We've removed the pipelining pass from XLA GPU emitters, simplifying the codebase and shifting towards alternative optimization strategies for loop execution.

  • Bugfix 🔧: We've addressed undefined behaviors in PJRT by fixing pointer casting issues between unrelated types. This update enhances code safety and correctness, ensuring smooth operations across CPU and GPU implementations.

  • Bugfix 🔄: A regression fix in the XLA collective pipeliner ensures proper handling of scalar constants for padding values. This prevents unnecessary broadcasting and improves the efficiency of dynamic tensor operations.
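For the FullyConnectedOp overflow fix, here's a minimal sketch of the failure mode, with illustrative names; this is not the actual TFLite verifier code:

```cpp
#include <cstdint>

// Multiplying large dimensions in 32-bit arithmetic silently wraps around,
// so a shape-consistency check can pass or fail incorrectly.
bool VerifyNumElements(int64_t batch, int64_t depth, int64_t output_depth) {
  // Buggy pattern: int overflows once the product exceeds ~2.1 billion.
  int num_elements_32 = static_cast<int>(batch) * static_cast<int>(depth);
  (void)num_elements_32;

  // Fixed pattern: accumulate in int64_t from the start.
  int64_t num_elements = batch * depth;
  return num_elements % output_depth == 0;
}
```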

That's all for now, folks! Enjoy the new features and improvements, and keep coding like a rockstar! 🌟


Hey there, code wranglers! We've been busy optimizing, fixing, and adding some cool new features to our codebase. Here's the latest scoop on what's new, improved, and bug-fixed. 🚀

  • New feature: We've added the GetDefaultLayout API to the IFRT Proxy, allowing you to easily retrieve default layouts for specified data types, dimensions, devices, and memory kinds. This is a big win for optimizing data placement and access patterns (see the usage sketch after this list)! 🎉

  • Improvement: Reinstated support for cuDNN explicit CUDA graph construction in the GPU backend, thanks to the release of cuDNN frontend v1.11.0. This enhancement is crucial for boosting performance in deep learning apps. 💪

  • New feature: Say hello to collect_symlink_data_aspect, a nifty addition for hunting down symlinked files in target runfiles. This makes file management in the build process more robust and efficient. 🔍

  • New feature: We've added a "copy" button for the full HLO instruction text format in HTML outputs. Now, you can easily copy HLO instruction text directly from the rendered output. Handy, right? 🖱️

  • New feature: Introducing IOPDDL utilities to XLA Auto Sharding's third-party directory. These tools are essential for tackling optimization problems and evaluating solutions. 🛠️

  • New feature: Simplified the ComputeAndCompareLiteral function with an overload that doesn't require an error_spec. This makes testing a breeze! 🌬️

  • Improvement: Enhanced the HLO diff tool to better visualize repetitive computation patterns. This makes it easier to spot and analyze patterns in computation differences. 🔍

  • Bugfix: Addressed a concurrency issue in GPU compiler tests by mutex-guarding the default_device_assignment_ pointer. No more race conditions here! 🏎️💨

  • Bugfix: Fixed undefined behaviors in PJRT by correcting how pointers are cast between unrelated types. Safety first! 🚦

  • Bugfix: Improved the conversion of HLO to StableHLO for programs with bounded dynamism. Now, the conversion process handles these programs more robustly. 🔄

  • Improvement: Integrated updates from LLVM, aligning with the latest changes and enhancing TensorFlow's capabilities and performance. ⚙️

  • Chore: We've moved tensorflow/lite/experimental/litert to the google-ai-edge/litert repository, streamlining the codebase for better organization. 📦
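Here's a hedged sketch of how the new GetDefaultLayout call might look from C++; the exact IFRT signature, header paths, and types here are assumptions, so check the headers before copying:

```cpp
#include <memory>

#include "absl/status/statusor.h"
#include "xla/python/ifrt/client.h"

// Hypothetical usage sketch (signature assumed): ask the IFRT client for
// the default layout of an f32 array of shape [128, 256] on one device.
absl::StatusOr<std::shared_ptr<const xla::PjRtLayout>> DefaultLayoutFor(
    xla::ifrt::Client* client, xla::ifrt::Device* device) {
  return client->GetDefaultLayout(
      xla::ifrt::DType(xla::ifrt::DType::Kind::kF32),
      /*dims=*/{128, 256}, device, xla::ifrt::MemoryKind());
}
```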

That's all for now, folks! Keep coding and stay awesome! 😎


Here's the latest scoop on the updates and improvements made to the TensorFlow and XLA frameworks. We've got some exciting new features, important bug fixes, and a sprinkle of organizational tidying. Let's dive in! 🚀

  • New Feature: Dynamic GELU Composite Lowerings
    Say hello to dynamic composite lowerings for the GELU operation in TensorFlow's MLIR framework. This update brings two new patterns to the table, LegalizeCompositeGELUDynamicShaped and LegalizeCompositeGELUDynamicShaped2, which handle dynamic input shapes with grace and style. Now, TensorFlow can flexibly manage varying input dimensions, making your machine learning models even more robust! 🎉

  • Improvement: Custom Op for odml.detector
    We've waved our magic wand and transformed the odml.detector composite operation into a custom operation within TensorFlow Lite. This makeover streamlines integration and boosts performance by allowing complex operations to be executed as custom operations. A win for flexibility and speed! 🧙‍♂️

  • New Feature: Explicit Collectives Grouping in JAX
    Introducing an explicit collectives grouping pass for jitted JAX methods! This feature ensures computations run within a single NCCL group, optimizing NVLink systems for multi-directional communications. With this addition, expect improved performance and fewer NCCL kernels during execution. Go team efficiency! 🏎️

  • Bugfix: Shape Representation Safety
    We've tightened the bolts on shape representation by using std::variant<> to ensure a shape holds only one exclusive state at a time. This fix prevents misuse and potential crashes, making your code safer and more reliable (see the sketch after this list). Safety first! 🛡️

  • New Feature: Direct StableHLO to HLO Conversion
    Get ready for a smoother ride with direct conversion from StableHLO to HLO for AddOp and ConstantOp. This prototype skips the MHLO step, paving the way for more efficient conversion processes in the future. Streamlining for the win! 🏆

  • New Feature: GetDefaultLayout API in IFRT Proxy
    Meet the new GetDefaultLayout API method in the IFRT Proxy, your go-to for retrieving default layouts for specified data types. This enhancement optimizes data placement and access patterns, making your computational tasks more efficient. Layouts made easy! 📐

  • Improvement: Scheduler Statistics in XLA
    We've added the ability to dump scheduler statistics into a proto, giving you a detailed breakdown of wasted cycles and memory pressure. This enhancement boosts debugging and performance analysis, helping you optimize your scheduling process. Knowledge is power! 📊

  • Improvement: CommandBuffer API Update
    The CommandBuffer class in the XLA GPU backend now features an explicit command update API for the If command. This update allows for more complex command management and resource optimization. Command and conquer! 💪

  • New Feature: HLO Test for Command Buffers
    Introducing a new end-to-end HLO test for command buffers in the XLA GPU service. This test simplifies the process of verifying complex command buffers, strengthening the testing framework and laying the groundwork for future developments. Testing made easy! 🧪

  • Bugfix: Post-Order Traversal Non-Determinism
    We've tackled the non-determinism bug in post-order traversal, ensuring correct instruction ordering by allowing pre-computed post-orders. This fix enhances robustness and prevents potential errors in instruction execution. Order restored! 🔄

  • Bugfix: Determinism in SHARD_AS/SHARD_LIKE
    We've addressed non-determinism in SHARD_AS and SHARD_LIKE operations by switching to std::vector for consistent ordering. This fix enhances the reliability of sharding operations, ensuring predictable outputs in parallel computations. Consistency is key! 🔧

  • Chore: Kernel Generation Passes Reorganization
    We've tidied up the TensorFlow MLIR codebase by moving kernel generation-specific passes to a dedicated directory. This reorganization improves code clarity and maintainability, paving the way for future enhancements. Organization FTW! 📂
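Since the shape-safety bullet above leans on std::variant<>, here's a tiny self-contained sketch of the exclusive-state idea. The types are illustrative stand-ins, not the actual XLA Shape class:

```cpp
#include <cstdint>
#include <variant>
#include <vector>

// A shape is exactly one of: array, tuple, or token.
struct ArrayState { std::vector<int64_t> dims; };
struct TupleState { int num_elements = 0; };
struct TokenState {};

using ShapeState = std::variant<ArrayState, TupleState, TokenState>;

bool IsArray(const ShapeState& s) {
  // std::variant makes the "one state at a time" invariant explicit:
  // reading the wrong alternative throws instead of silently misreading.
  return std::holds_alternative<ArrayState>(s);
}
```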

That's a wrap on the latest updates! Keep coding, keep innovating, and as always, stay awesome! 🌟


Here's a fresh batch of updates to keep your TensorFlow projects running smoothly and efficiently! 🚀

  • Improvement: The XLA Latency Hiding Scheduler now dumps its stats to a proto, giving you a clearer picture of performance metrics like wasted cycles and memory pressure. This makes debugging and optimizing your scheduling process a breeze! 📊

  • Improvement: Say hello to non-blocking NCCL communicators! This update boosts the performance of collective operations in GPU backends by allowing tasks to run concurrently. Faster, smoother, and more efficient GPU operations are now at your fingertips! ⚡️

  • Improvement: Multiple compilation configs are now supported in the TensorFlow Lite experimental LiteRT compiler plugin. Plus, you can now track partition stats in your compiled models, making performance tuning a lot easier. 🛠️

  • New Feature: The ifrt::Client interface gets a makeover with two new methods! CreateContext helps you capture the runtime context, and a variant of MakeArrayFromHostBuffer uses this context to streamline performance analysis and debugging. 🕵️‍♂️

  • New Feature: Introducing the TfrtGpuAsyncHostToDeviceTransferManager and TfrtGpuClient::CreateBuffersForAsyncHostToDevice() for managing async transfers from host to device. More unit tests mean more reliability and correctness! 🧪

  • New Feature: We've integrated StableHLO from OpenXLA, bringing significant updates and enhancements to the StableHLO framework within the project. 🛡️

  • New Feature: Check out TfrtGpuBuffer::CopyRawToHostFuture and TfrtGpuClient::BufferFromHostLiteral for efficient and asynchronous data transfers in the TensorFlow XLA GPU backend. 🚀

  • New Feature: Quantization functionalities have been copied over to TensorFlow Lite, optimizing models for resource-constrained devices. More tests mean more reliability! 📈

  • Bugfix: A critical bug fix for the FloorDiv operation in TF/TFL lowering to TOSA ensures correct rounding behavior. Accuracy restored! 🔧

  • Bugfix: Addressed a use-after-move issue in CpuCompiler::CompileAheadOfTime within the XLA CPU module (see the sketch after this list). This fix enhances stability and reliability. 🛠️

  • Bugfix: Reverted a previous change affecting TensorFlow profiler's error handling, ensuring that any issues are flagged and not ignored. Error management just got stricter! 🚨

  • Chore: Renamed serialization_base.fbs to tflite_serialization_base.fbs for better clarity and organization within the TensorFlow Lite framework. 🗂️
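The use-after-move fix follows a common C++ pattern; here's a generic sketch of the bug class (not the actual CpuCompiler code):

```cpp
#include <memory>
#include <string>
#include <utility>

struct Module { std::string name; };

// Buggy shape (what the fix removes):
//   auto consumed = std::move(module);
//   return module->name;   // use after move: module is now null
//
// Fixed shape: read what you need *before* handing ownership away.
std::string CompileFixed(std::unique_ptr<Module> module) {
  std::string name = module->name;    // read first
  auto consumed = std::move(module);  // then move
  (void)consumed;
  return name;
}
```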

Enjoy the new features and improvements, and happy coding! 🎉


Here's a delightful rundown of all the awesome changes that have been made recently. Get ready to dive into some cool new features and improvements! 🚀

New Features

  • Advanced Profiler Configuration: We've jazzed up the profiler with an advanced configuration option. Now, you can specify various settings with greater flexibility, like a pro! 🎛️
  • GPU Environment via C API: Say goodbye to singleton headaches! Access the GPU environment with our shiny new C API LiteRtGpuGlobalEnvironmentCreate(). It's all about smoother GPU operations now! 🖥️
  • TfrtGpuBuffer Debut: Introducing the TfrtGpuBuffer! This is the first step in supercharging GPU support within the XLA framework. Let's get that GPU party started! 🎉
  • Inlineable Attribute: The inlineable attribute is now a first-class citizen, giving you more control over which call operations get inlined. More power to you! 💪
  • CreateErrorBuffer Functionality: Meet CreateErrorBuffer, your new best friend for error handling in GPU operations. It keeps things running smoothly even when errors pop up. 🛠️

Improvements

  • Dynamic & Static GPU Accelerator Support: Whether you're dynamically or statically linking your GPU accelerators, we've got you covered. Flexibility at its finest! 🔗
  • More Op Builders: We've added more operation builders, especially for the ResizeNearestNeighbor operation. Your models just got a makeover! 🏗️
  • IFRT Arrays Layout Management: Layouts are now better managed with IFRT Arrays, thanks to some nifty tweaks and a roll-forward fix. It's all about keeping things neat and tidy! 📐

Bug Fixes

  • Shared Library Path Fix in QNN: No more wandering paths! We've fixed the library path issues in QNN, ensuring your shared libraries are right where they need to be. 🛠️
  • Layout Creation from Proto: Crashes are so yesterday. Now, Layout::CreateFromProto() handles invalid inputs gracefully, keeping your app running smoothly (sketched after this list). 🚫💥
  • GPU Model Execution Fixes: We've squashed bugs causing GPU model execution failures, including layout mishaps and memory leaks. Your GPU tasks just got a whole lot smoother! 🐛🔧
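For the Layout::CreateFromProto() fix, here's a hedged sketch of the graceful-failure pattern, with minimal stand-in types rather than the real XLA ones:

```cpp
#include <string>
#include <vector>

// Minimal stand-ins for the proto and result types (illustrative only).
struct LayoutProto { std::vector<int> minor_to_major; };
struct Layout { std::vector<int> minor_to_major; };
struct Result { bool ok = false; std::string error; Layout layout; };

// Before the fix, malformed input could hit a CHECK and crash the process.
// The sketched fix validates and reports, letting the caller recover.
Result CreateFromProto(const LayoutProto& proto) {
  const int rank = static_cast<int>(proto.minor_to_major.size());
  for (int d : proto.minor_to_major) {
    if (d < 0 || d >= rank) {
      return {false, "minor_to_major is not a permutation", {}};
    }
  }
  return {true, "", Layout{proto.minor_to_major}};
}
```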

Chore

  • Automated Code Cleanup: A little spring cleaning never hurt anybody! We've removed unnecessary header files to keep the codebase lean and mean. 🧹

Keep exploring these updates and enjoy the enhanced TensorFlow experience! Happy coding! 😃✨


Here's a fresh batch of updates for you, packed with new features, improvements, and bug fixes. Let's dive in! 🚀

  • New Feature: LiteRT GPU Accelerator
    The ml_drift_cl_litert feature has been unleashed, enhancing TensorBuffer integration via the DelegateKernelLiteRt. This includes publishing TensorBufferRequirements in kLiteRtTensorBufferTypeOpenCl, binding TensorBuffers with BindTensorBuffers(), and a simplified Invoke() implementation. The TensorFlow Lite experimental LiteRT codebase got some love too, with updates ensuring OpenCL is recognized as the buffer type for input and output tensors.

  • New Feature: XLA TopK Operation Semantics
    Added a detailed section in the XLA docs about the TopK operation, explaining how it identifies the largest or smallest elements in a tensor. Whether you're dealing with one-dimensional arrays or multi-dimensional tensors, this update has got your back!

  • Improvement: Unary Functions in XLA
    Enhanced the XLA builder by adding ResultAccuracy support for unary functions like Cbrt, Cos, Erf, and more. This comprehensive update spans multiple files to boost precision and reliability across the TensorFlow ecosystem.

  • New Feature: chlo.ragged_dot CAPI and Python API
    Say hello to the new CAPI and Python API for chlo.ragged_dot in the StableHLO framework. This includes a new RaggedDotDimensionNumbers attribute, allowing users to specify dimension configurations for matrix operations. Python bindings and test cases have been updated to ensure everything runs smoothly.

  • Improvement: cuDNN Fusion Compiler
    The cuDNN fusion compiler now processes graphs with assigned workspaces, optimizing High-Level Operations (HLO) for better GPU performance. This update includes test cleanups and improved resource management.

  • New Feature: TfrtGpuBuffer
    Introducing the TfrtGpuBuffer for GPU support in XLA. This initial version includes updates to the GPU client implementation and a new test file to ensure everything's running like a well-oiled machine.

  • New Feature: SmallWhileLoopHoistingPass
    A new optimization pass for the XLA CPU backend, SmallWhileLoopHoistingPass, improves small while loop performance by hoisting them into callable computations. This update includes unit tests and refinements to cost analysis.

  • Improvement: Dynamic Test Case Generation
    Dynamic test case generation for TensorFlow Lite's compiled models is here! This feature creates C++ test cases on-the-fly, adapting to different environments and consolidating testing into a single binary.

  • Bugfix: litert::Expected Assignment Operators
    Fixed a critical bug in the litert::Expected class assignment operators, ensuring proper handling of different value states and preventing data corruption. A generic sketch of the pattern appears after this list.

  • Bugfix: HloRunner Thread Safety
    Enhanced the thread safety of the HloRunner class by removing race conditions and introducing a mutex for safe resource management.

  • Bugfix: Model Round-Tripping
    Ensured buffers initially appended to the FlatBuffer remain correctly appended during serialization and deserialization in TensorFlow Lite's LiteRT.

  • Chore: NCCL References Removed
    Cleaned up the XLA GPU backend by removing NCCL references from CollectiveBroadcast and CollectivePermute functionalities, streamlining the codebase for better flexibility and performance.
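The litert::Expected fix above is a classic tagged-union hazard; here's a generic sketch of an assignment operator that must tear down the old alternative before constructing the new one. This illustrates the bug class, not the LiteRT implementation:

```cpp
#include <new>
#include <string>
#include <utility>

// Minimal Expected-like type (illustrative, not the LiteRT class).
class Expected {
 public:
  explicit Expected(int v) : has_value_(true) { new (&value_) int(v); }
  explicit Expected(std::string e) : has_value_(false) {
    new (&error_) std::string(std::move(e));
  }
  ~Expected() { Destroy(); }

  Expected& operator=(const Expected& other) {
    if (this == &other) return *this;
    Destroy();  // the fix: tear down the *current* alternative first
    has_value_ = other.has_value_;
    if (has_value_) new (&value_) int(other.value_);
    else new (&error_) std::string(other.error_);
    return *this;
  }

 private:
  void Destroy() {
    if (!has_value_) error_.~basic_string();  // non-trivial member
  }
  bool has_value_;
  union { int value_; std::string error_; };
};
```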

Stay tuned for more updates, and happy coding! 😄✨


In this update, we've got a bunch of exciting new features and improvements that will make your developer life a whole lot easier. From enhanced benchmarking workflows to new operation builders for Qualcomm's AI Engine, we've got it all. Plus, we've squashed some pesky bugs to keep things running smoothly. Let's dive into the details! 🚀

  • New Feature: Benchmark Presubmit Workflow
    We've rolled out a shiny new presubmit workflow for benchmarking performance to catch potential regressions before they sneak into the main codebase. This new setup runs tests across various configurations and helps keep the performance top-notch. Plus, we've renamed existing benchmark workflows to make it crystal clear which ones are for nightly runs and which are for presubmit checks. 🕵️‍♂️

  • Improvement: StableHLO Integration
    Integrated a specific version of StableHLO to streamline tensor operations and enhance compatibility within the MLIR framework. This update brings a more efficient syntax for operations and introduces new tests to ensure everything's running smoothly.

  • New Feature: TraceMe for Thunk Execution
    Added a new tracing mechanism to the Thunk execution process in the XLA CPU backend. This feature provides detailed execution traces, making it easier to monitor and debug performance (see the sketch after this list). 🎯

  • Improvement: PjRtClient::Compile for TFRT GPU
    Implemented the PjRtClient::Compile function for enhanced GPU support in TensorFlow Runtime, optimizing resource utilization and boosting performance for TensorFlow applications.

  • New Feature: Qualcomm AI Engine Direct Op Builders
    Introduced new operation builders for Qualcomm's AI Engine Direct, including Conv2d, DepthwiseConv2d, and more. These additions come with unit tests to ensure robust functionality and improved machine learning model performance. 🤖

  • New Feature: LiteRT GPU Accelerator Integration
    Added the ml_drift_cl_litert feature for better TensorBuffer integration in GPU-accelerated models, enhancing the TensorFlow Lite experimental framework.

  • New Feature: Elementwise Ops in Collective Pipeliner
    Enabled support for elementwise operations in the collective pipeliner, improving the efficiency of GPU computations, especially in scaled FP8 GEMMs.

  • Bugfix: Cross-Module Instruction References
    Fixed an issue where instructions were referencing computations across different modules, which was causing some test failures. This update strengthens module encapsulation and code robustness.

  • Improvement: LiteRT Google Implementation
    Updated the LiteRT Google implementation to try loading the newer libedgetpu_litert.so library first, ensuring compatibility with recent Android builds while maintaining backward compatibility.

  • Chore: Logging Cleanup
    Removed excessive logging in parallel_batch_dataset_op.cc to prevent log spamming and enhance user experience.

  • Bugfix: VhloToVersion Reversion
    Reverted a previous change in the VhloToVersion transformation to simplify version compatibility checks within the StableHLO framework.

  • Bugfix: Trace Events Reversion
    Reverted a change in the trace_events.proto file to clarify the handling of flow events, ensuring the trace event framework functions smoothly.
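The thunk tracing above presumably builds on TensorFlow's TraceMe facility; here's a hedged sketch of how such an annotation is typically attached in TSL-based code. Header paths and the thunk plumbing are assumptions:

```cpp
#include "tsl/profiler/lib/traceme.h"
#include "tsl/profiler/lib/traceme_encode.h"

// Illustrative: wrap a thunk's execution in a TraceMe so the profiler
// records one trace event per thunk, annotated with its name.
void ExecuteThunkTraced(const char* thunk_name) {
  tsl::profiler::TraceMe trace([&] {
    return tsl::profiler::TraceMeEncode("thunk", {{"name", thunk_name}});
  });
  // ... the thunk's real work runs while `trace` is alive ...
}
```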

That's all for now, folks! Keep coding, and stay awesome! 😎


Here's a delightful summary of the latest updates and improvements, packed with exciting new features and crucial bug fixes! 🎉

  • New Feature: Host Memory Support in StreamExecutor
    We've rolled out support for MemoryType::kHost in the CreateStreamExecutor function across multiple executor types. This means you can now allocate and deallocate host memory with ease, thanks to the new GenericMemoryAllocator. Plus, we've added tests to ensure everything runs smoothly. 🚀

  • New Feature: ARM64 CPU Builds in XLA
    Say hello to ARM64 CPU builds for the XLA project via GitHub Actions! This nifty addition enhances our CI workflow, allowing for comprehensive testing across x86 and ARM64 architectures. 🛠️

  • Improvement: Custom Fusion Integrity in XLA
    We've improved instruction fusion by ensuring that custom fusions and calls remain intact. This update enhances the robustness of the fusion process, maintaining the integrity of custom operations. 🔧

  • New Feature: DMA Operations in PJRT C API
    Introducing PJRT_Client_DmaMap and DmaUnmap functions to the PJRT C API! These additions boost our direct memory access capabilities, complete with thorough testing to ensure seamless integration. 💾

  • New Feature: PyTorch Conversion in tf_tfl_translate
    We've added new flags to the tf_tfl_translate tool, making it easier to convert PyTorch saved models. Now you can specify the model's origin and enable direct lowering of composite operations. 🔄

  • Improvement: Cross-Compile Architecture Support
    Developers can now specify target machine architectures in cross-compile scenarios for CUDA, CUDNN, and NCCL. This update ensures smooth redistributions across various platforms. 🌍

  • Improvement: Bitcast Handling in XLA
    We've enhanced the handling of bitcasts in the XLA framework by allowing split dimension mapping. This change optimizes memory allocations and boosts performance. ⚡

  • New Feature: Attribute Management in HloInstruction
    Streamline your code with new methods for managing frontend attributes in HloInstruction. These functions simplify attribute handling, making your code more efficient and readable. 📈

  • Bugfix: Memory Crash in NcclAllToAllStartThunk
    We've fixed a rare crash issue in the memcpy implementation by switching from absl::flat_hash_map to arrays, ensuring stable and performant memory handling (see the sketch after this list). 🐞

  • Bugfix: Executable Creation in HloRunnerPjRt
    A critical bug causing segmentation faults has been squashed by properly managing the ownership of executables in HloRunnerPjRt. 🛠️

  • Bugfix: Synchronous Dispatch for CPU Callbacks
    To prevent deadlocks, CPU executables with host callbacks will now dispatch synchronously. This temporary fix ensures resources are allocated effectively. 🔄

  • Chore: Clean-Up in StreamExecutor
    We've tidied up by removing the unused HostMemoryDeallocate method, enhancing code maintainability and clarity. 🧹
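The NcclAllToAllStartThunk fix trades a hash map for plain arrays on a hot path; here's a generic sketch of the idea, with an assumed device cap, not the actual thunk code:

```cpp
#include <array>

constexpr int kMaxDevices = 64;  // assumption for the sketch

// A hash map keyed by device ordinal can rehash or allocate on the hot
// path; a fixed-size array gives stable storage and cheap indexed reads.
struct PerDeviceState {
  std::array<void*, kMaxDevices> recv_buffers{};  // indexed by ordinal

  void* buffer_for(int device_ordinal) const {
    return recv_buffers[device_ordinal];  // no hashing, no allocation
  }
};
```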

These updates are sure to enhance your experience and keep everything running smoothly. Happy coding! 🎈


Here's a delightful summary of the recent updates and improvements. Get ready to dive into the world of new features, bug fixes, and more! 🚀

New Features

  • Flatten-Tuple Pass Migration: We've migrated from MHLO to StableHLO with a new transformation pass that flattens tuples in HLO operations. This makes tuple handling more efficient and includes robust test cases to ensure everything is ship-shape. 🛠️
  • kCpu Property Tag: Say hello to the kCpu property tag in the HloRunner class, which helps distinguish between CPU and GPU environments, paving the way for targeted optimizations. 🖥️
  • LiteRt C Runtime Shared Library: A new rule to generate a shared library for the LiteRt C runtime is here, making the TensorFlow Lite framework more versatile and organized. 📚
  • SourceTargetPairs Class: Introducing the SourceTargetPairs class to the XLA service, enhancing the structure and functionality of collective operations. 🎉
  • Pack Op Legalization: The LiteRT framework now supports the Pack operation, crucial for tensor manipulations in deep learning models. 📦

Improvements

  • HostOffloader Enhancements: We've improved the handling of DynamicUpdateSlice operations, marking them as host compute when working with host memory, enhancing memory management efficiency. 🧠
  • Reshard Optimization: In the IFRT framework, multiple reshards are now merged into a single operation when possible, reducing redundancy and boosting performance. 🔄
  • Persistent Workers for Parallel Loops: Persistent workers are now used for pthreadpool parallel loops, significantly improving execution times and efficiency in the XLA CPU backend. 🚀

Bug Fixes

  • CUDA Driver Compatibility: Fixed issues with XLA builds on CUDA Driver versions lower than 12.3, ensuring robust functionality across different versions. 🛠️
  • SparseCore Device ID Fix: Resolved issues with SparseCore device IDs in the TensorFlow profiler's trace viewer, enhancing performance profiling reliability. 📊
  • Timeline v1 Timestamp Compatibility: Improved timestamp accuracy in the TensorFlow profiler's timeline version 1, ensuring correct timing for GPU events. ⏱️

Chores

  • Cleanup of Deprecated References: We've cleaned up references to the deprecated global_data.h in XLA, streamlining the codebase for clarity and future improvements. 🧹

These updates bring a mix of new capabilities, optimizations, and fixes, making the TensorFlow ecosystem more robust and ready for the future! 🌟


Here's a delightful update on the latest changes and enhancements that have been made:

🚀 New Features

  • XLA:CPU Thunk Serialization: We've jazzed up the XLA CPU backend with initial thunk serialization. This means thunks, those nifty units of computation, can now be serialized and deserialized, making computation saving and restoring a breeze. This is particularly handy for distributed computing scenarios. 🎉

  • NCCL ncclCommInitRankScalable API Support: The XLA GPU framework now supports the NCCL ncclCommInitRankScalable API. This allows NCCL communicators to be initialized using multiple root ranks, boosting performance in large-scale environments. You can tweak the ranks per root with a snazzy flag too! 🌟

  • Dispatch Op Custom Options: Introducing functions for managing custom options in TensorFlow Lite's LiteRT core using the flexbuffer API. This adds a structured, efficient way to handle dispatch operation options. Flexibility, meet efficiency! 💪

  • Data Lineage Logging: TensorFlow now sports a data lineage logging mechanism, helping you track and manage data like a pro. Perfect for those who love to keep things organized! 📚

  • IFRT Atom Programs Utility Pass: New utility pass for writing atom programs and the main IFRT function to files. This enhances management and output of atom programs in XLA. 📜

🔧 Improvements

  • Coordination Service Task Reconnection: Restartable tasks can now reconnect to a cluster, provided they maintain the same local topology. This boosts stability and reliability. 🔄

  • Gather/Scatter Operand Overlap Handling: We've added functionality to create copies of operands in gather and scatter instructions when they overlap, ensuring smooth operations without memory conflicts. 🧩

  • StreamExecutor Memory Allocation Unification: A step towards unifying memory allocation methods with new classes for streamlined management. Future-proofing memory handling like a boss! 🛠️

🐛 Bug Fixes

  • XLA:Python GIL Scoping: Fixed the scoping of GIL release in the XLA Python extension during nb::bytes object construction. No more threading hiccups (see the sketch after this list)! 🐍

  • PjitFunction Locking: Ensured the lock on cache_ is held when destroying executables_ in PjitFunction, maintaining thread safety in a free-threading mode. 🔒

  • TransposePlan Overflow: Resolved an overflow issue by changing data types to handle larger dimensions without a hitch. No more overflow woes! 📈
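The GIL fix is about where a release scope ends; here's a hedged nanobind-style sketch (not the actual XLA extension code):

```cpp
#include <string>

#include <nanobind/nanobind.h>

namespace nb = nanobind;

nb::bytes SerializeWithGilDiscipline(const std::string& payload) {
  std::string serialized;
  {
    // Heavy, Python-free work may run without the GIL...
    nb::gil_scoped_release release;
    serialized = payload;  // stand-in for expensive serialization
  }
  // ...but constructing a Python object (nb::bytes) needs the GIL, so it
  // must happen after the release scope closes.
  return nb::bytes(serialized.data(), serialized.size());
}
```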

🧹 Chores

  • Refcounting Hashmap Cleanup: Removed an unused refcounting hashmap from the XLA codebase, making things cleaner and simpler. Out with the old! 🧹

These updates bring a mix of new features, improvements, bug fixes, and cleanup that enhance the overall performance and functionality of the framework. Keep exploring and enjoy the new capabilities! 🎊


Here's the latest scoop on what's new and improved in our codebase! We've been busy bees, adding some cool new features and squashing pesky bugs to make things run smoother than ever. Check out the highlights below! 🚀

  • New Feature: Infeed and Outfeed Support for HloRunnerPjRt
    We've just rolled out infeed and outfeed support for HloRunnerPjRt in the XLA library. This means you can now transfer data into and out of computations in real-time, making your workflows more dynamic and interactive. Plus, we've added some nifty functions for buffer conversions and threading to keep things running smoothly. 🏃‍♂️💨

  • Improvement: All-to-All Operation Enhancements
    Our latest update optimizes the handling of multiple source-target pairs during all-to-all operations. By merging and splitting sharding axes more efficiently, we've reduced the number of operations needed, boosting performance for distributed computations. Let's get those tensors reshaped and transposed like pros! 🔄

  • New Feature: CreateFromAhwb Method in TensorBuffer
    Say hello to the CreateFromAhwb method in TensorFlow Lite's TensorBuffer class! This new addition allows you to create a TensorBuffer from an Android Hardware Buffer, making it easier to work with hardware-backed tensors (see the sketch after this list). We've got tests in place to ensure everything works like a charm. 📱🔧

  • New Feature: Pinning Tensors to Device Memory in XLA
    You can now pin tensors to device memory in XLA, keeping them from being pre-fetched to alternate memory. This feature enhances memory management and performance, especially for applications that need quick access to critical tensors. 📌💾

  • Improvement: Dynamic Slice Operation Optimization
    We've optimized the partitioning process for dynamic-slice operations so that input data no longer has to be replicated along slice dimensions. This change eliminates unnecessary input replication, leading to faster execution in distributed environments. 🎯

  • New Feature: Lower Fake Quant Annotation
    Introducing the LowerQuantAnnotationsPass! This new pass transforms quant.fake_quant operations into tfl.Quantize and tfl.Dequantize ops, paving the way for better quantization handling in TensorFlow MLIR. 🧙‍♂️✨

  • New Feature: cuDNN Flash Attention Sequence Packing
    Our cuDNN flash attention now supports sequence packing, allowing multiple segments to be packed into one batch. This enhancement saves memory and speeds up both training and inference, making your workflows more efficient. 🧩⚡

  • Bugfix: Dispatch API Build Error
    We've fixed a build error in the TensorFlow Lite dispatch API by refining memory management and handling unknown C++ types. This ensures a smoother and error-free build process. 🛠️🐞

  • Bugfix: 3D Input Quantization in Fully Connected Layers
    We've addressed an issue with per-channel quantization for 3D input tensors, ensuring that fully connected operations handle output shapes correctly. Now, your models can process 3D inputs without a hitch! 📏🔍

  • Bugfix: Operation Profile Improvements
    We've improved the TensorFlow profiler's operation profile by refining the deduplication process and enhancing the user interface. This makes it easier to manage and analyze operation profiles. 📊🔧

  • Chore: Remove Unused Refcounting Hashmap
    We've cleaned up the codebase by removing an unused refcounting hashmap, streamlining the XLA project for better maintainability. 🧹🗑️
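For the CreateFromAhwb bullet, here's a hedged sketch of allocating an Android Hardware Buffer to feed such a factory; the NDK calls are real, but the CreateFromAhwb spelling is a hypothetical stand-in for the LiteRT API:

```cpp
#include <android/hardware_buffer.h>

// Allocate a BLOB AHardwareBuffer suitable for backing a tensor.
AHardwareBuffer* AllocateBlobBuffer(uint32_t size_bytes) {
  AHardwareBuffer_Desc desc = {};
  desc.width = size_bytes;  // for BLOB format, width is the byte size
  desc.height = 1;
  desc.layers = 1;
  desc.format = AHARDWAREBUFFER_FORMAT_BLOB;
  desc.usage = AHARDWAREBUFFER_USAGE_CPU_READ_OFTEN |
               AHARDWAREBUFFER_USAGE_CPU_WRITE_OFTEN;
  AHardwareBuffer* ahwb = nullptr;
  AHardwareBuffer_allocate(&desc, &ahwb);  // returns 0 on success
  // A buffer like this would then be handed to something like
  //   TensorBuffer::CreateFromAhwb(ahwb, ...);  // hypothetical signature
  return ahwb;
}
```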

Stay tuned for more updates as we continue to enhance our codebase with awesome features and improvements! 🌟


Welcome to the latest and greatest update roundup! 🚀 We've been busy bees, buzzing around and making some awesome improvements to our beloved frameworks. Here's the lowdown on what's new, what's improved, and what's been squashed:

  • New feature: Nested Calls in XLA:CPU
    Our ElementalKernelEmitter has leveled up! It can now handle nested calls, enhancing the CPU backend's kernel generation capabilities. This means more efficient and flexible computations are on the horizon!

  • New feature: Pinning Tensors on TPU
    Introducing tensor pinning to device SRAM on TPUs via custom calls. This update optimizes memory management, ensuring your computations run smoother and faster.

  • Improvement: Automated Code Changes in TensorFlow MLIR
    We've unleashed a flurry of automated updates across TensorFlow's MLIR compiler, enhancing everything from variable initialization to layout optimization. It's like a turbo boost for model compilation and execution!

  • New feature: XLA:CPU Collectives API
    Say hello to the new collectives API for XLA:CPU! This fresh addition supports collective operations, paving the way for optimized machine learning performance on CPUs.

  • Improvement: HloInstruction & BufferAssignment in XLA:CPU
    We've supercharged the XLA CPU backend by refining the EmitKernelPrototype process, leading to more efficient memory handling and kernel execution. It's all about making things faster and cleaner!

  • New feature: XLA GPU Documentation
    We've added a comprehensive guide to the XLA GPU architecture, complete with visual aids and examples. This documentation is your new best friend for navigating the GPU compiler pipeline.

  • Improvement: Transposed Convolutions in XLA:CPU
    Our transposed convolution algorithm now supports multiple input and output channels, with performance improvements that will make your jaw drop: over 99% faster in some cases!

  • New feature: TFLite Quantization Option
    TFLite users, rejoice! You can now disable per-channel quantization for dense layers, giving you more control over your model's quantization strategy.

  • Chore: Temporary Wheel Size Increase
    We've temporarily increased the wheel size limit to keep those nightly builds rolling smoothly. It's a quick fix while we sort out the underlying issues.

  • Bugfix: ShapeError Crashes in XLA
    We've tackled a pesky bug that caused crashes when element_type was out of bounds. Now, we print the integer value instead, making error reporting clearer and more robust (sketched below).
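Sketching the error-reporting idea, with an illustrative enum standing in for xla::PrimitiveType:

```cpp
#include <string>

// Illustrative enum; values are placeholders, not XLA's real ones.
enum class ElementType { kS32 = 4, kF32 = 11 };

std::string ElementTypeName(int element_type) {
  switch (static_cast<ElementType>(element_type)) {
    case ElementType::kF32: return "f32";
    case ElementType::kS32: return "s32";
    default:
      // The fix: an out-of-range value no longer crashes; the raw integer
      // is printed so the error message stays informative.
      return "<invalid element_type " + std::to_string(element_type) + ">";
  }
}
```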

That's all for now, folks! Keep those updates coming, and we'll keep making things better, faster, and more awesome. 🎉


Welcome to the latest change log! We've got some exciting updates and improvements to share with you. From new features that enhance performance to bug fixes that ensure smoother operations, here's a rundown of what's new and improved. 🎉

  • New feature: Introduced F4E2M1FN and F8E8M0FNU types to the XLA framework, enabling microscaling formats like MXFP4. This addition expands the framework's data type capabilities, providing support for unique floating-point formats. 💾

  • New feature: Added RecordBatchTaskSizeSum in TensorFlow's batching utility to track the cumulative size of tasks within a batch. This function enhances task size analysis during batch processing, offering better insights into task handling. 📊

  • New feature: Moved ProfileTimeBreakdown to open-source, allowing for detailed execution time analysis of HLO instructions within TensorFlow. This change enhances profiling capabilities for performance monitoring. 🔍

  • New feature: Added free-threading support to WeakrefLRUCache, improving its functionality in multithreaded environments. The update ensures thread safety with proper locking mechanisms, validated by a new multithreaded test. 🔒

  • New feature: Introduced a generic XnnFusionThunk for the XLA CPU backend and ported XnnDotThunk to it, optimizing fusion operations for improved performance. 🚀

  • Improvement: Enhanced the XLA GPU framework by using NCCL thunk for RaggedAllToAll operations, even in scenarios without inter-replica communication. This update improves handling of ragged data structures. 🤝

  • Improvement: Enabled sorted scatters in the XLA GPU backend, optimizing scatter operations with sorted indices for better performance. 📈

  • Improvement: Added locking around lazily-initialized fields in PyDeviceList to ensure thread safety in the XLA Python interface, enhancing robustness in multi-threaded environments. 🛡️

  • Bugfix: Fixed a crash due to out-of-memory errors in XLA's custom convolution algorithm by introducing a threshold for convolution matrix size, ensuring memory constraints are respected. 🛠️

  • Bugfix: Corrected kernel launch dimensions for ROCm to comply with platform-specific checks, enhancing compatibility and performance for ROCm applications. 🎯

  • Bugfix: Resolved a Bazel code check error by updating the BUILD file to use the correct namespace for platform compatibility, ensuring smoother build processes. 🔧

  • Chore: Integrated the Triton library up to a specific commit, including patch files to address issues and improve compatibility. This ongoing effort refines the Triton integration for enhanced functionality. ⚙️

We hope these updates make your experience even better! Stay tuned for more improvements and features. 🚀


Hey there, fabulous TensorFlow fans! 🎉 Get ready to dive into the latest and greatest updates that are making TensorFlow Lite even more awesome. We've got some cool new features, essential improvements, and a few bug fixes that are smoothing out the ride. Let's see what's new!

  • Improvement: Enhanced Compiler Plugin API
    The compiler plugin API now partitions at the subgraph level instead of the model level. This fine-tunes the association of operations with subgraphs, making the compilation process more precise and efficient. 🚀

  • Improvement: Improved Model Management
    Pre-allocated subgraphs can now be transferred into models, and metadata can be popped from the model's map. This boosts memory management and organization, ensuring smoother model operations. 🧠

  • Improvement: Model FLOPs Calculations
    Model-specific FLOPs are now part of the device operation metrics, providing deeper insights into model performance and helping you optimize better. 📈

  • New Feature: Per-Channel Quantization in QC Compiler Plugin
    The Qualcomm compiler plugin now supports per-channel quantization parameters, boosting flexibility and efficiency for models that need it. 🎛

  • New Feature: std::any to LiteRtAny Conversion
    Introducing conversion between std::any and LiteRtAny, enhancing data handling flexibility in TensorFlow Lite's experimental library. 🔄

  • New Feature: Per-Tensor Quantization in QNN IR
    QNN Intermediate Representation now supports per-tensor quantization, expanding its capabilities for handling diverse models. 📊

  • New Feature: Open Source TPU Step Utils
    Say hello to tpu_step_breakdown_utils and tpu_step_details_utils! These libraries provide detailed breakdowns of TPU performance metrics, helping you optimize your TPU workloads. 🖥

  • New Feature: HardwareType Combining
    Now, when merging RunEnvironment instances, the highest hardware type is selected, ensuring accurate profiling of hardware capabilities. 🖧

  • Bugfix: Range Analysis Fix
    Fixed an issue in operand range multiplication with constants. Now, all components are correctly multiplied, ensuring accurate range analysis. 🔧

  • Bugfix: Gather Operation Index Clamping
    Out-of-bound indices in gather operations are now clamped, preventing execution bugs in SPMD partitioners (see the sketch after this list). 🛠

  • Bugfix: Build Breakage Fix
    Resolved a build issue by aligning data types in flatbuffer tools for Android, ensuring smooth compilation and operation. 🏗
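The gather clamping rule is compact enough to show inline; this sketch mirrors standard HLO gather semantics rather than the partitioner's actual code:

```cpp
#include <algorithm>
#include <cstdint>

// A gather start index is clamped so the whole slice stays in bounds.
int64_t ClampGatherStartIndex(int64_t start, int64_t dim_size,
                              int64_t slice_size) {
  // e.g. start=9, dim_size=8, slice_size=2  ->  clamped to 6.
  return std::clamp<int64_t>(start, 0, dim_size - slice_size);
}
```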

These updates are designed to make your TensorFlow experience smoother, faster, and more powerful. Keep innovating and stay tuned for more exciting updates! 🚀


Here's the scoop on the latest updates to our favorite machine learning libraries. Get ready for some cool new features, bug fixes, and a sprinkle of optimizations. Let's dive in! 🚀

  • New feature: TensorBoard now has an inference_latency_chart! 🎉 This new feature lets you visualize how long your model's inference takes, helping you make smarter optimization decisions.

  • New feature: Say hello to per-channel quantization in LiteRT! This enhancement allows for more precise model optimization by applying different quantization scales for each tensor channel, improving accuracy in resource-constrained environments (see the sketch after this list).

  • New feature: The Qualcomm compiler plugin for TensorFlow Lite now supports per-channel quantization parameters. This update brings greater flexibility and efficiency, especially for models that benefit from per-channel quantization techniques.

  • New feature: The WhileLoopAllReduceCodeMotion pass is now part of the XLA optimization toolkit. This addition could boost the performance of while loops by enabling more efficient code motion techniques.

  • Bugfix: The XLA latency hiding scheduler got a tune-up to better handle annotated no-op instructions. The fix ensures these instructions wait for the whole annotation set to be ready before scheduling, improving performance.

  • Bugfix: We squashed a bug causing crashes in the XLA Latency Hiding Scheduler with non-standard async ops. The scheduler now handles complex dependencies more effectively, ensuring smooth operation.

  • Bugfix: Fixed a range analysis bug in XLA where operand ranges weren't multiplied correctly with constants. The updated logic ensures accurate range calculations, strengthening the reliability of the XLA service.

  • Improvement: TensorFlow's profiler just got a boost! It now supports sampling for inference profiles, making it easier to analyze inference performance with more detailed statistics.

  • Improvement: Essential StepEvents have been added for GPU inference profiles, enhancing the profiling capabilities of TensorFlow applications running on GPUs.

  • Chore: Clean-up time! The --xla_gpu_experimental_enable_triton_softmax_priority_fusion flag has been removed from the XLA GPU compiler's API, simplifying the codebase by eliminating unnecessary features.
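For the per-channel quantization items above, here's a minimal sketch of how symmetric int8 per-channel scales are typically derived; this is illustrative math, not the LiteRT implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Each output channel gets its own scale from that channel's max |w|,
// instead of one scale for the whole tensor.
std::vector<float> PerChannelScales(const std::vector<std::vector<float>>& w) {
  std::vector<float> scales;
  scales.reserve(w.size());
  for (const auto& channel : w) {
    float max_abs = 0.f;
    for (float v : channel) max_abs = std::max(max_abs, std::fabs(v));
    scales.push_back(max_abs > 0.f ? max_abs / 127.f : 1.f);  // int8 range
  }
  return scales;
}
```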

That's all for now, folks! Keep those models running smoothly and efficiently. 🌟


Here's the scoop on our latest updates, where we've been busy adding new features, squashing bugs, and refining our systems to make everything run smoother than ever. Check out the highlights below and see how we're making things better for you! 🚀


New Features:

  • xla::Collectives API: We've rolled out the new xla::Collectives API, setting the stage for NVIDIA Collective Communications Library (NCCL) integration. This makes XLA more robust for parallel processing on GPUs, with support for both host and device-initiated collective operations. 🌟

  • Greater OP Legalization: TensorFlow Lite's LiteRT framework now supports the "greater" operation, complete with new test data and build configurations. This addition enhances tensor comparison capabilities. 📈

  • Dynamic Shapes in Convolutions: StableHLO now supports dynamic shapes in 1D convolutions, offering more flexibility and aligning with modern machine learning needs. 🌀

  • Ragged All-to-All in XLA: We've added asynchronous start and done phases for the "ragged all-to-all" operation, boosting XLA's efficiency in handling complex collective operations. 🚀

  • Custom Options in IFRT: Users can now specify custom_options for runtime-specific execution, allowing more tailored execution parameters. 🛠️

  • Multi XSpace to InferenceStats Conversion: A new function transforms multiple XSpace instances into InferenceStats, enhancing TensorFlow's profiling framework for better inference performance insights. 🔍

  • HLO Stats Tool: Introducing the HLO Stats Tool in TensorFlow's profiler for deeper performance analysis of high-level operations. 📊

Improvements:

  • C++ Tree with Path API: We've transitioned the tree_util.tree_flatten_with_path and tree_map_with_path APIs to C++, speeding up the pytree flattening process. ⚡

Bug Fixes:

  • Triton Dot Product Bug: Fixed a bug in Triton's dot product algorithm for dot(inf, 1.0), ensuring correct results by addressing non-finite value summation. 🔧

  • Wheel Creation Logic: Resolved issues in TensorFlow's wheel creation logic when using pywrap rules, improving the packaging process. 📦

  • Graph Output Tensor Recognition: Corrected logic in TensorFlow Lite to ensure graph output tensors are recognized even when used by other Ops. 🛠️

Chores:

  • Obsolete TODO Removal: Cleaned up outdated TODO comments in the TensorFlow XLA compiler codebase, streamlining and clarifying the code. 🧹

These updates are all about making your experience smoother, faster, and more efficient. Stay tuned for more exciting improvements, and keep the feedback coming! 😊


Welcome to the latest updates! We've been busy adding some shiny new features and fixing pesky bugs to make your experience smoother and more efficient. Here's a rundown of what's new and improved:

  • New Feature 🚀: Parallel compilation is now live for the XLA CPU backend, thanks to our new ORC TaskDispatcher. This means faster and more efficient JIT compilation, leveraging multi-threading to get things done in a snap!

  • New Feature 🎉: TensorV1Attr support has been added to the flatbuffer_export and flatbuffer_operator, allowing for a more structured and efficient data representation in TensorFlow's MLIR framework. Now you can handle tensor attributes like a pro!

  • New Feature 🌟: Introducing the VIFRT pass for converting between VIFRT versions. This nifty addition ensures compatibility and flexibility across different versions, making your development process smoother than ever.

  • New Feature 🐍: Python bindings for VIFRT serialization are here! Now you can serialize and deserialize IFRT IR programs with ease, ensuring compatibility across versions and making advanced serialization features more accessible.

  • New Feature 🔧: Say hello to the experimental C++ graph builder for TensorFlow Lite! This tool empowers developers to construct and manipulate machine learning models programmatically, enhancing TFLite's flexibility and usability.

  • Improvement 🛠️: We've migrated the CpuCompiler from SimpleOrcJit to JitCompiler in the XLA backend for CPU. This upgrade promises better optimization and execution speeds, keeping things running like a well-oiled machine.

  • Improvement ⚙️: To prep for JIT compilation, we've enhanced the CpuCompiler by constructing the JitCompiler within it, setting the stage for more efficient compilation processes.

  • New Feature 💡: A sharding config has been added to XLA's HloModuleConfig, as part of the AutoFDO integration. This gives you better control over operation distribution, optimizing performance like never before.

  • Bugfix 🐛: We've squashed a bug in the MoveUserInstructionsIn function that was causing compilation errors with conditional operations. Now it handles multiple users like a champ!

  • Bugfix 🐞: Fixed an async execution bug in transposed convolution operations for XLA CPU. The intermediate buffer now stays in scope, preventing any memory mishaps.

  • Bugfix 🔧: The tune_ctas logic in GemmFusionAutotunerImpl has been restored, ensuring proper CTA tuning for GPU computations, especially on Hopper architectures.

  • Chore 🔍: Updated internal visibility settings for the registry library, ensuring access is managed effectively for Google-specific clients.

These updates are all about making your experience smoother, faster, and more powerful. Enjoy the new features and improvements, and keep an eye out for more exciting updates coming your way! 🎈


Welcome to the latest round of updates! We've been busy bees 🐝, adding some slick new features, squashing pesky bugs, and tidying up the codebase. Here's a rundown of what's new and improved:

  • New feature: 🎉 We've added support for overriding cross-program prefetch behavior and filtering buffer intervals based on their usage in XLA:TPU:MSA. These enhancements make memory management more flexible and efficient. Plus, we've included tests to make sure everything runs smoothly.

  • New feature: 🚀 The HLO evaluator now supports explicit batch dimensions for gather and scatter operations. This change reserves necessary dimensions for tensors, making these operations more flexible and robust.

  • Improvement: 🛠️ Introducing the AssertEq wrapper! This nifty tool helps ensure function outputs match expected results, enhancing our assertion framework (see the sketch after this list). We've also improved error checking in the TensorFlow Lite runtime by validating tensor types more reliably.

  • New feature: 🧩 Say hello to HloModuleInterface and HloInstructionInterface! These new interfaces provide a more organized way to manage HLO data, improving efficiency and performance metrics retrieval.

  • New feature: ⚙️ We've added a RuntimeConfig when loading SavedModels, allowing you to disable the tf2xla MLIR bridge. This update optimizes graph execution for better performance.

  • Bugfix: 🐞 Fixed a critical issue in CalculatePostOrderScheduleHelper(), ensuring kAsyncStart instructions are correctly initialized. This fix prevents instructions from being processed out of order.

  • New feature: 🔍 The HloUnaryInstruction class is here to boost result accuracy for specific unary functions, enhancing precision in computations.

  • Improvement: 🔧 Enhanced GPU GEMM fusions by allowing effective parameters and their broadcasts to be fused in the epilogues, optimizing performance.

  • New feature: 🎛️ A new ToolParam for the XNNPACK TFLite delegate lets you easily toggle the Slinky optimizer via command-line flags, giving you more control over performance tuning.

  • Bugfix: 🛡️ Addressed a crucial issue in the GPU dot algorithm rewriter to handle infinity and NaN values correctly, ensuring accurate results in BF16 operations.

  • Bugfix: 🔧 Fixed the AlgebraicSimplifier to ensure it doesn't eliminate host offloading copies, maintaining the integrity of host memory operations.

  • Chore: 🧹 We've cleaned up by removing an unnecessary gpu_types.h inclusion in topk_kernel_test.cc, streamlining the code and reducing compilation time.
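Here's a hedged sketch of what an AssertEq-style wrapper looks like; the real LiteRT helper's spelling and behavior may differ:

```cpp
#include <cstdlib>
#include <iostream>

// Compare a function's output with the expected value and fail loudly on
// mismatch (illustrative, not the actual assertion framework).
template <typename T>
void AssertEq(const T& actual, const T& expected, const char* what) {
  if (!(actual == expected)) {
    std::cerr << "AssertEq failed for " << what << "\n";
    std::abort();
  }
}

int main() {
  AssertEq(2 + 2, 4, "basic arithmetic");  // passes silently
}
```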

We hope these updates make your experience even better! Keep exploring and enjoy the improvements. 🌟


Welcome to the latest and greatest updates! We've been busy making some awesome improvements and squashing pesky bugs. Here's a rundown of the cool new features, improvements, and fixes we've rolled out:


New Features 🌟

  • PJRT Buffer Magic: Say hello to PJRT_Buffer_CopyRawToHost in the PJRT C API! This nifty feature lets you copy raw data from device to host memory, making your GPU app data handling smoother than ever. It's a game-changer for high-performance computing and machine learning aficionados.

  • HLO Interfaces: We've introduced HloModuleInterface and HloInstructionInterface to spice up your HLO module and instruction management. These interfaces bring organization and efficiency to your TensorFlow profiling utilities with enhanced data handling.

  • Dot Product Testing: The XLA GPU framework now includes a test for dot products with batch and contracting dimensions. This ensures robust backend support for your matrix operations, making sure everything runs like a well-oiled machine.

Improvements 🚀

  • LLVM Update: We've synced up with the latest LLVM updates, ensuring our project stays sharp and up-to-date with the latest features and improvements.

  • GEMM Fusion Flexibility: Our GPU GEMM fusion now supports broadcasts of trivially-sized dimensions, like [1,n] to [1,m,n], thanks to PR #19112. This means more flexibility and efficiency in your matrix operations.

  • TFL Pass Migration: The PushTransposeThroughEwisePass has migrated to the new TFL pass mechanism, streamlining the code and making it easier to maintain. Plus, we've updated the command-line argument for consistency.

Bugfixes 🐛

  • No Signature, No Problem: Fixed an issue in TensorFlow Lite where models without signatures were causing hiccups. Now, we pass a nullptr for models lacking function signatures, keeping everything running smoothly.

  • Algebraic Simplifier Tweaks: We've ensured the AlgebraicSimplifier in XLA respects host offloading copies, preventing any unwanted eliminations and maintaining computation integrity.

  • Developer Guide Tweak: Fixed a formatting blip in developer_guide.md where <USER> was misbehaving. It's now {USER}, and the guide looks fab!

Chore 🧹

  • Code Cleanup: Tidied up gpu_types.h by removing unused type aliases. This decluttering enhances clarity and makes room for future awesomeness.

That's all for now, folks! Keep your eyes peeled for more exciting updates and improvements coming your way. 🎉


Here's the latest scoop on our codebase updates! We've been busy bees, buzzing around to bring you some fantastic new features, improvements, and bug fixes. Let's dive right in! 🐝


New feature: We've jazzed up the XLA framework by using the CUDA runtime API to accurately determine if two ranks are on the same host. This ensures more reliable local communication during collective operations, especially in multi-GPU setups. 🚀

New feature: A new transformation pass is here! We've added a pass to outline an IFRT IR atom program into a module, enhancing the XLA framework's capabilities in handling IR atom programs. 🎉

Improvement: The TensorFlow Lite compiler now checks for infinity when folding max and min ops. This ensures that operations handle extreme floating-point values correctly, boosting robustness (see the sketch below). 💪
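The infinity check matters because infinities are identity elements for max and min; here's a hedged sketch of the folding rule (illustrative, not the TFLite compiler code):

```cpp
#include <cmath>

// max(x, -inf) == x and min(x, +inf) == x, so constant folding must treat
// infinities as identity elements rather than ordinary constants.
float FoldMax(float x, float c) {
  if (std::isinf(c) && c < 0) return x;  // -inf is the identity for max
  return std::fmax(x, c);
}

float FoldMin(float x, float c) {
  if (std::isinf(c) && c > 0) return x;  // +inf is the identity for min
  return std::fmin(x, c);
}
```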

New feature: You can now save output data from TFLite models as TensorFlow Example protocol buffers and output them to a file. This makes model evaluation and debugging a breeze! 📊

Improvement: We've added profiling to the ifrt-proxy client, enabling request-response trace tracking. This makes monitoring and analyzing RPC calls a piece of cake. 🍰

New feature: Direct legalization for min and max operations is now available in TensorFlow Lite, streamlining the conversion process and enhancing performance. ⚡️

New feature: We introduced a pattern to reorder gather and cast ops in TensorFlow Lite for more efficient execution. Less work, more play! 🎮

New feature: A new optimization pattern simplifies broadcasting and reshaping operations in TensorFlow MLIR, enhancing efficiency. Who doesn't love a good optimization? 🛠️

Bugfix: We fixed a critical issue in JAX where input arrays weren't reshaped correctly, preventing crashes on TPU and ensuring correct outputs on GPU. Phew! 😅

Bugfix: Memory leaks in cuda_executor.cc error paths are now a thing of the past. We've improved memory management to keep things running smoothly. 🧹

Bugfix: Compatibility issues with Numpy 2.x in TensorFlow's numpy-like operations have been resolved. We're all set for the future! 🔮

Chore: We tidied up by deleting status_test_util.h after migrating all its users. A cleaner codebase is a happier codebase! 🧼


That's all for now, folks! Stay tuned for more exciting updates and improvements. Keep coding and keep smiling! 😄
