tensorflow changelog

9 days ago

Here's the latest scoop on the updates and enhancements we've made recently. We've been busy bees, buzzing around to optimize, enhance, and squash those pesky bugs. Let's dive into the details! 🐝

New feature: We’ve rolled out the DynamicSliceCopyFusionCmd in the XLA GPU backend to make memory operations smoother than a fresh jar of Skippy! This command, derived from DynamicMemcpyThunk, now allows for efficient slice copying with static offsets. 🥜
Improvement: The RedzoneBuffers got a turbo boost! Now, you can create these buffers from an Executable, making memory management in GPU computations as easy as pie. 🍰
New feature: Say hello to the xla::PjRtPhaseCompiler! This addition is all about enhancing the XLA framework with a sprinkle of phase compilation magic. ✨
Improvement: We've supercharged the DotLibraryRewriter in the XLA CPU backend to support oneDNN fusions for dot operations with element-wise ops like Add, Mul, and Exp. It's like giving your CPU a shot of espresso! ☕️
New feature: Introducing PjRtFuture::Map! This nifty API lets you transform future results with ease, ensuring error propagation is handled like a pro. 🚀
New feature: The CreateRawAliasOfBuffer() method is now live, allowing for more efficient memory management in the XLA framework. It's all about sharing the love... and the buffers! 💾
New feature: With the TryMap method, you can now map futures with a functor that might fail, handling errors gracefully and keeping things running smoothly. 🎩
Improvement: Mesh deduplication just got a whole lot cooler with support for mapping a single sub-axis to multiple sub-axes. It's like a party for your axes! 🎉
Bugfix: We've squashed a bug in BufferFromHostLiteral to ensure events are fulfilled even when allocation fails. No more unhandled errors raining on your parade! ☔️
Bugfix: Crashes caused by unsupported operand types in XNNPACK have been fixed. Now, we check operands to keep things crash-free and smooth sailing! 🛳️
Chore: A little housekeeping with the forward compatibility horizon updated to June 28, 2025. Just keeping things fresh and up-to-date! 📅
Bugfix: We rolled back a previous change due to suspected correctness issues. Sometimes you gotta take a step back to move forward! 🔄
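
The Map and TryMap entries above both describe transforming a future's result while keeping errors flowing to the caller. Here is a minimal Python sketch of that idea built on the standard concurrent.futures.Future. It is not the XLA API: the helper names map_future and try_map_future are hypothetical, and cancellation is ignored for brevity.

```python
from concurrent.futures import Future


def map_future(src: Future, fn) -> Future:
    """Return a new future holding fn(result); errors from src pass through.

    Hypothetical helper. Assumes fn itself does not raise.
    """
    dst: Future = Future()

    def on_done(done: Future) -> None:
        err = done.exception()
        if err is not None:
            dst.set_exception(err)  # propagate the upstream error unchanged
            return
        dst.set_result(fn(done.result()))

    src.add_done_callback(on_done)
    return dst


def try_map_future(src: Future, fn) -> Future:
    """Like map_future, but fn may fail; its error is stored in the result."""
    dst: Future = Future()

    def on_done(done: Future) -> None:
        err = done.exception()
        if err is not None:
            dst.set_exception(err)
            return
        try:
            dst.set_result(fn(done.result()))
        except Exception as exc:  # the functor failed: surface it to the caller
            dst.set_exception(exc)

    src.add_done_callback(on_done)
    return dst
```

In both helpers an upstream failure skips the functor entirely, so the caller only ever inspects one future to learn whether the whole chain succeeded.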

That's all for now, folks! Keep those GPUs buzzing and CPUs humming. Until next time! 🎶

16 days ago

Here's a rundown of the latest changes, packed with exciting new features and crucial bug fixes to keep everything running smoothly. 🚀

New Feature: Python bindings for the HLO Diff tool are now live! This update makes it easier to compare HLO modules with new options for computing instruction fingerprints, enhancing the tool's flexibility and usability.
New Feature: Say hello to the TfLiteQuantizationClone function! This nifty addition lets you clone TfLiteQuantization structs in TensorFlow Lite, making it a breeze to duplicate quantization parameters without altering the originals. Handy, right?
New Feature: We've rolled out _XlaShardingV2 for tf.XlaShardOp, boosting TensorFlow's sharding capabilities during the XLA lowering process. This means better performance for distributed computing and TPU workloads!
New Feature: The SerDesDefaultVersionAccessor::Get() method is here to make your life easier by managing default SerDes versions in IFRT. It ensures robust version handling, especially useful for IFRT Proxy development.
New Feature: Two new methods, ToLiteral and LazyToLiteral, have been added to CommonPjRtBufferImpl. These methods provide more flexibility in handling data conversions, making asynchronous operations smoother in the XLA framework.
Improvement: We've integrated Triton up to version 0a4aa69, updating the LLVM integration patches and improving CUDA and ROCm compatibility. This makes the build process more streamlined and efficient.
Improvement: The RedzoneBuffers functionality in XLA:GPU has been enhanced. Now, RedzoneBuffers can be created from an Executable, improving memory management and flexibility in buffer creation.
Improvement: Upwards tile propagation for BroadcastOp in XLA:GPU is now implemented. This enhancement optimizes tensor operations by accurately propagating tile information to inputs of broadcast operations.
Bugfix: A pesky bug in the alternate memory allocation for XLA:MSA has been squashed! Chunks are now properly reserved and tracked, preventing issues with overlapping memory.
Bugfix: The integration of hermetic C++ toolchains in TensorFlow has been rolled back. This decision was made to avoid increased wheel sizes and maintain compliance with manylinux standards.
Bugfix: Due to timeouts on Linux, the worker_tags_test has been temporarily disabled for Python 3.13. We're on it and will get this sorted out soon!
Chore: The highwayhash library has been moved to a new location within the TensorFlow repository, tidying up the project's structure and improving maintainability.

We hope these updates make your development experience even better! Keep an eye out for more improvements and features coming your way. 🌟

23 days ago

Here's a summary of the recent updates and changes we've made. We've been busy enhancing performance, adding new features, and squashing bugs to make everything run smoother and more efficiently. Let's dive into the details! 🚀

Improvement: Enhanced Resource Calculation for Scheduling Groups
We've fine-tuned the resource calculation for scheduling groups with the "keep_original_sequence_order_in_group" attribute. Now, the scheduler maintains the original sequence of instructions while accurately tracking resource usage, thanks to the new GetNumResourcesNeededForAnnotationWithKeepOriginalOrderAttrs function. Comprehensive tests ensure precision in resource calculations, even under different resource limits. 🎯
Improvement: Optimized Hadamard Rotation Algorithm
The Hadamard rotation algorithm in TensorFlow Lite got a turbo boost! The introduction of FWHTGeneral and FWHTFast functions has enhanced performance, especially for larger sizes. These changes mean faster Hadamard rotations, making your TensorFlow Lite applications zippier than ever. ⚡️
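
For readers who have not met the transform before, here is a textbook sketch of a fast Walsh-Hadamard transform in plain Python. It assumes the "FWHT" in the function names above stands for this transform. It is not the FWHTGeneral or FWHTFast implementation from TensorFlow Lite, and the function name fwht is only illustrative.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform of a power-of-two-length list.

    Uses the classic butterfly scheme, which needs O(n log n) additions
    instead of the O(n^2) of a full Hadamard matrix multiply.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    out = list(values)  # work on a copy so the input is left untouched
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    return out
```

Applying the unnormalized transform twice returns the input scaled by its length, which makes for a handy sanity check.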
New Feature: Dynamic Registration Helper
Say hello to REGISTER_DYNAMIC, a new helper that complements REGISTER_PJRT_PLUGIN. It allows developers to dynamically load shared object files based on environment variables, simplifying plugin integration and management. 🎉
New Feature: Precision Test for XLA:GPU
We've added a test that checks how precision drops with increasing K dimension sizes in dot algorithms. This test helps us understand precision degradation and ensures computations remain accurate as the size scales. 📊
New Feature: HloProgram Serialization Methods
Introducing HloProgram::ToBytes() and HloProgram::FromBytes(), ensuring an exact serialization/deserialization roundtrip. These methods are perfect for specific use cases that require identical program results, although they aren't version-compatible. 🔄
Improvement: XLA:CPU Exponential Function Optimization
The xla.exp operation now runs like a dream in the XLA CPU backend. We've optimized the exponential function calls for F64, resulting in massive performance improvements—up to 85% faster in some cases! 🚀
New Feature: NCCL Net Plugin XPlane IDs Reserved
We've reserved the last 100 custom XPlane IDs for the NCCL Net Plugin, improving profiling capabilities and ID space management within the XLA framework. 🛠️
New Feature: ComputePeakMemory Method
The buffer assignment API now includes a ComputePeakMemory method, accurately calculating peak memory usage. This addition enhances memory management and robustness with extensive unit tests covering various scenarios. 🧠
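
To make "peak memory" concrete, here is a small sketch that computes it from buffer live ranges with a sweep over allocation and free events. The function name, the (start, end, size) tuple layout, and the inclusive end time are all assumptions for illustration. The actual ComputePeakMemory signature is not described in this changelog.

```python
def compute_peak_memory(live_ranges):
    """Return the largest total size that is live at any single time step.

    live_ranges: iterable of (start, end, size_bytes) with `end` inclusive.
    This is an illustrative model, not the XLA buffer assignment API.
    """
    events = []
    for start, end, size in live_ranges:
        events.append((start, size))      # buffer becomes live
        events.append((end + 1, -size))   # buffer is released after `end`
    peak = current = 0
    # Tuple sorting orders releases (negative deltas) before allocations
    # that happen at the same time step, so freed space is reused first.
    for _, delta in sorted(events):
        current += delta
        peak = max(peak, current)
    return peak
```

For example, three buffers with ranges (0, 2, 100), (1, 3, 50), and (3, 4, 200) peak at time step 3, when the 50-byte and 200-byte buffers are live together.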
Bugfix: Revert Host Platform Configuration
We reverted a previous change to address Android build issues, refining platform-specific settings to ensure a stable and predictable build outcome across environments. 🔧
Bugfix: 256-Byte Alignment for cuBLAS Compatibility
To avoid breakages with cuBLAS 12.9.1.4, we've implemented a 256-byte alignment across tests and components, ensuring compatibility and performance in GPU operations. 🖥️
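
As a quick illustration of what 256-byte alignment means in practice, here is a sketch that rounds sizes or addresses up to the next 256-byte boundary. The helper names are hypothetical and are not taken from the XLA codebase.

```python
ALIGNMENT = 256  # bytes, the value named in the entry above


def align_up(n: int, alignment: int = ALIGNMENT) -> int:
    """Round a non-negative n up to the next multiple of a power-of-two alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (n + alignment - 1) & ~(alignment - 1)


def is_aligned(address: int, alignment: int = ALIGNMENT) -> bool:
    """True when address sits exactly on an alignment boundary."""
    return address % alignment == 0
```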
Bugfix: Delegate Closure Order
We've reordered operations in TensorFlow Lite's close() method to prevent use-after-free errors by ensuring delegates are closed before the model handle is deleted. 🔒
Chore: Removed Empty test_macros.h
We've cleaned up the codebase by deleting all uses of the now-empty test_macros.h file, keeping things neat and tidy. 🧹

These updates reflect our ongoing commitment to delivering a robust, high-performance experience. Keep an eye out for more exciting changes coming your way! 🌟

1 month ago

Welcome to the latest and greatest updates! We've been busy bees 🐝 and have some exciting new features and improvements to share with you. Let's dive into the juicy details:

New Features

Raw Buffers FTW! 🎉: Say hello to use_raw_buffers, a feature that keeps raw buffer references alive and kicking until data transfer is complete. No more premature deletions! Just a heads up, it might sneakily read from donated arrays, but we're working on it!
XLA Scheduling Gets a Boost: We've added a ScheduleConfig to XLA's HloModuleConfig. Now you can manage instruction execution like a pro, making your computations smoother than ever.
Unified Model ID Metric: Track your loaded models with a new gauge metric that records the unified model ID. It's like a fingerprint for your models, ensuring observability is top-notch.
Weight-Only PTQ for TensorFlow: Introducing tf_weight_only_ptq for StableHLO. This library lets you perform int8 weight-only quantization on dot_general operations, streamlining model optimization without needing calibration.
Calibration Component in TensorFlow: Meet tf_component, a new addition to the TensorFlow MLIR quantization framework. It manages post-calibration transformations, improving the accuracy of your quantized models.
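
The weight-only PTQ entry above mentions int8 quantization that needs no calibration. A rough sketch of why that is possible: the scale can be derived from the weights alone. The symmetric per-tensor scheme and the helper names below are assumptions for illustration. The changelog does not say which scheme tf_weight_only_ptq actually uses.

```python
def quantize_weights_int8(weights):
    """Symmetric per-tensor int8 quantization of a list of float weights.

    Returns (quantized_ints, scale). The scale depends only on the weights,
    so no calibration data is required. Illustrative scheme, not the library's.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate float weights."""
    return [q * scale for q in quantized]
```

Each dequantized weight differs from the original by at most half a quantization step, which is the usual accuracy trade-off for this kind of scheme.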

Improvements

Low Latency Thread Pool in PJRT: We've optimized the PJRT GPU client with a low latency thread pool for async operations. Your GPU computations are about to get a whole lot zippier! 🚀
Allocator Magic During Compilation: The GPU client in XLA now uses a configured memory allocator during compilation. This means better memory management and performance across devices.
Fusion Flexibility: The MultiOutputFusion class now allows more flexibility for derived classes, making the fusion process efficient and tailored to your backend needs.

Bugfixes

Race Condition No More: We've squashed a race condition bug in flat_map_utils.cc. Now, threads won't step on each other's toes, ensuring smooth and stable dataset handling.
Crash-Free Layout Printing: Printing an invalid Layout in XLA won't crash your app anymore. Instead, it gracefully handles errors with a friendly "?" placeholder.
Memory Space Propagation Fix: Fixed an issue with the NVIDIA GPU's CollectiveColorer, ensuring memory spaces are correctly assigned and tests pass with flying colors.

Chore

Tidying Up: Removed an unnecessary header file in calibration_wrapper.cc, making the codebase leaner and meaner.

That's all for now, folks! Stay tuned for more updates and keep the feedback coming. Happy coding! 😄

2 months ago

Hey there, awesome coders! 🎉 We've got some exciting updates and bug fixes in the latest release. Here's a quick rundown of what's new and improved:

New Feature 🚀: GroupExecute API for XLA GPU Collectives
We've added a shiny new GroupExecute API to the XLA GPU collectives. This nifty feature supercharges group-based execution for collective communication patterns, making parallel processing tasks in machine learning and computational applications more efficient and performant.
New Feature 🛠️: Pre-Calibration Magic in TensorFlow MLIR
Say hello to tf_pre_calibration! This new component in the TensorFlow MLIR quantization framework is all about pre-calibration transformations during post-training static-range quantization. It collects quantization statistics and processes quantizable functions like a champ.
New Feature 💡: Device Assignment in XLA CPU Backend
We've empowered the XLA CPU backend to let you pass device assignments to NanoRt. This means more flexibility and efficiency in managing computations across multiple devices. Yay for smarter resource allocation!
Improvement 🔧: Memory Space Allocation in TfrtGpuClient
Our TfrtGpuClient just got a boost with platform and memory space allocator support. This upgrade means better GPU resource management, making your GPU applications run smoother than ever.
New Feature 🚀: Data Transfer in PJRT Async GPU Client
Introducing TransferToInfeed and TransferFromOutfeed in the PJRT async GPU client! These functions make data transfers to and from infeed and outfeed buffers a breeze, enhancing GPU data handling within the XLA framework.
Improvement 🛠️: Refactored TensorFlow MLIR Passes
We've reorganized the TensorFlow MLIR quantization passes into a new namespace, tf_passes. This refactor improves modularity and maintainability, making future development and enhancements a walk in the park.
Improvement 🔒: Shutdown Method in PreemptionSyncManager
A new Shutdown method in PreemptionSyncManager ensures a smooth and controlled shutdown process, enhancing system stability and reliability.
Bugfix 🐛: CUDA Graph Launch Callback
We've squashed a bug related to missing CUDA graph launch callbacks in the latest CUDA versions. Now, your GPU profiling should be as accurate as ever!
New Feature 📈: Post-Calibration in TensorFlow MLIR
Meet tf_post_calibration! This new library in the TensorFlow MLIR quantization framework performs post-calibration graph transformations, optimizing model performance after quantization.
Chore 🔄: Internal Directory Restructure
We've reorganized the TensorFlow codebase, focusing on directory structure and build configurations. This cleanup aims to streamline development and improve maintainability.
Bugfix 🔧: Deadlock in Tracked Device Buffer
We've fixed a potential deadlock issue in the XLA framework by replacing on_ready_tasks_callback_ with AndThen callbacks. This change ensures reliable task execution without any hiccups.
Bugfix 🐞: Concurrent Collective Creation
We've tackled a bug in the XLA library related to concurrent collective creation. Now, communicators are created safely in a multi-threaded environment, making collective communication more robust.

Enjoy these updates, and happy coding! 🎉

2 months ago

Here's a fresh batch of updates and enhancements to keep your codebase running smoother than ever! 🎉

New Features

BufferFromHostLiteral in CommonPjRtClient: A shiny new method is here to create buffers from host literals, complete with error handling and device memory management. Perfect for those who love seamless data transitions in machine learning tasks! 🚀
Scheduling Annotation in XLA Collective Pipeliner: Now you can schedule operations across loop iterations with the new _scheduling_group_id=<group_id>:<iteration_id> attribute. This makes optimizing performance a breeze! 🌀
CommonAsyncHostToDeviceTransferManager: Say goodbye to redundant implementations! This new manager handles asynchronous transfers using raw buffers, simplifying backend processes. 🎈
DiffResult Serialization/Deserialization: The HLO diff tool gets a boost with new serialization capabilities, making data interchange and storage a piece of cake. 🍰
XLA Microbenchmarking Utilities: A set of C++ utilities has been added to set up a microbenchmarking pipeline, ensuring your performance evaluations are top-notch. 🏋️
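
The scheduling annotation above packs two values into one attribute string, <group_id>:<iteration_id>. Here is a tiny sketch of splitting such a value. It assumes both parts are integers, which the changelog does not state, and parse_scheduling_group_id is a hypothetical helper rather than an XLA function.

```python
def parse_scheduling_group_id(value: str):
    """Split a '<group_id>:<iteration_id>' string into a pair of ints.

    Assumption: both fields are integers. Raises ValueError otherwise.
    """
    group_id, sep, iteration_id = value.partition(":")
    if not sep:
        raise ValueError("expected '<group_id>:<iteration_id>'")
    return int(group_id), int(iteration_id)
```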

Improvements

Sharding Devices in XlaCompileOptions: Enhancements to support MPMD parallelism in McJAX, ensuring complex parallelism scenarios are handled with finesse. 🎯
Variable Ops in XNNPACK Delegate: The implementation now uses TFLite storage, fixing visibility issues and simplifying architecture. 🛠️
Profiling Context in Runtime Library: Enhanced profiling capabilities with a new context to manage profiling info across device execution threads. 📈

Bugfixes

cuDNN Command Buffer: Fixed incorrect cuDNN command buffer updates in the XLA GPU backend, ensuring GPU operations run smoothly. 🛡️
Data Race in ObjectPool: Resolved data race issues with a new marking mechanism for safer push/pop operations. 🎢
Buffer Donation Events: Now waiting on usage and definition events before donation, preventing premature buffer donations. ⏳

Chores

Internal Proto Change: Streamlined protocol buffer definitions in TensorFlow Lite's profiling module for improved consistency. 🧹

These updates are sure to enhance your coding experience, making everything from data management to performance optimization more efficient and fun! Keep coding, keep thriving! 🌟

2 months ago

Here's what's hot off the press! 🚀 We've got a bunch of shiny new features and some bug fixes that'll make your code run smoother than a cat on a Roomba. Let's dive in and see what's new:

New Feature: Flexible Quantization in BATCH_MATMUL 🎉
We've jazzed up the BATCH_MATMUL operation in TensorFlow Lite! Now, you can use any integer divisor of batch_size * n for quantization parameters, making per-channel quantization more flexible and robust. This means more options for handling quantized inputs and a better fit for your model's needs.
New Feature: Model Transformations Flag 🏗️
Introducing the apply_model_transformations flag in the TensorFlow Lite GPU delegate. This nifty flag lets you decide if model transformations should be applied during the building process. It's like having a choice between a smoothie or a milkshake—both are great, but now you get to choose!
New Feature: PjRtDeviceEventOrPromise Class ⏳
Say hello to PjRtDeviceEventOrPromise, a new class for managing device events and promises. It's all about tracking asynchronous operations in the XLA framework, making your device event handling as smooth as a buttered slide.
New Feature: Enhanced Quantization Passes 🔧
We've merged quantization/stablehlo into a new version that leans on non-lite QuantizeUtils.h and TFQuantDialect. This means better quantization capabilities and more comprehensive test coverage to boot.
New Feature: TargetMetric in XLA Benchmarking 🏃‍♂️
Benchmarking just got cooler with TargetMetric in the XLA benchmarking tool. Now, you can specify metrics like wall time, GPU device time, and peak memory usage, giving you a detailed view of your benchmarks.
New Feature: Support for jax.lax.optimization_barrier 🚧
Our TFL converter now supports jax.lax.optimization_barrier, ensuring that certain operations are isolated from optimization passes. It's like setting up cones around your precious computations, keeping them safe and sound.
Improvement: Preserved Weights in Custom BWD Ops 🏋️‍♂️
You can now pass preserved weights to custom backward operations in TensorFlow's sparse-dense matrix multiplication. This makes custom combiners more flexible and efficient, perfect for those working with sparse data structures.
Improvement: Multi-Pair Support in sdy_all_to_all 🔄
The sdy_all_to_all function now supports multiple source/target dimension pairs, offering more flexibility in tensor operations. It's like having a Swiss Army knife for your dimensions!
Bugfix: Race Condition in TileAssignment 🔒
We've squashed a pesky race condition in the TileAssignment class. Now, mutation is protected by a mutex, ensuring thread safety and peace of mind.
Bugfix: Multi-Type Transpose Handling 🌀
Fixed an issue where multiple transposes with different types weren't handled correctly in the XLA GPU backend. Now, your transposes should be as smooth as a synchronized swim team.
Bugfix: Debug Options Dumping 🐛
We've fixed an issue with dumping non-default debug options, ensuring all relevant options are included in the output. Debugging just got a little less frustrating!
Chore: Profiler Client Cleanup 🧹
The profiler_client has been removed from the public package namespace, streamlining TensorFlow and focusing on core features.

These updates are all about making your experience more flexible, efficient, and robust. Enjoy the new features and happy coding! 🎈

2 months ago

Hey there, code enthusiasts! We've got a fresh batch of updates that are sure to make your TensorFlow experience even more exciting. Dive into the latest changes and enhancements that have been made to improve performance, add new features, and fix those pesky bugs. Let's take a closer look at what's new and improved:

New feature 🚀: We've introduced a new XlaOp for a custom combiner backward pass, enhancing TensorFlow's capabilities in handling sparse-dense matrix multiplication operations. This update is a big win for those optimizing deep learning models on TPUs with sparse data structures.
New feature 🌟: Direct translations for unary elementwise operations from StableHLO to HLO are now available, streamlining the process and improving performance for numerical computations in XLA. Say hello to seamless handling of operations like cosine, sine, and tangent!
Improvement 🎉: A progress bar has been added to stdout for long-running matcher processes, giving you visual feedback and making those waiting times a bit more bearable. Keep an eye on the progress and know exactly where you stand!
New feature 🆕: We've expanded direct translation support for BroadcastOp, BroadcastInDimOp, and DynamicBroadcastInDimOp from StableHLO to HLO. This enhancement ensures better handling of broadcast dimensions and shapes, making your operations run smoother.
Bugfix 🔧: We've fixed an integer overflow issue in the TFL::FullyConnectedOp::verify() function by switching to int64_t for storing num_elements. This fix prevents erroneous outputs and ensures accurate calculations even with large tensor sizes.
Improvement 🚀: Hot array iterations just got a performance boost with a new templated Array::Each API variation. This change eliminates type-erasure and virtual calls, optimizing those critical code paths for better efficiency.
New feature 🌟: Scoped alternate memory allocations can now expand to the biggest free chunk at the end of MSA, improving memory utilization and reducing fragmentation for optimized execution performance.
New feature 🆕: Binary elementwise operations can now be directly translated from StableHLO to HLO, broadening the scope of operations and enhancing the efficiency of machine learning models relying on these operations.
Improvement 🎉: GPU command buffers are now smarter with automatic inference of command dependencies using an execution graph. Enable xla_gpu_graph_enable_concurrent_region and enjoy a more efficient command execution!
Chore 🧹: We've removed the pipelining pass from XLA GPU emitters, simplifying the codebase and shifting towards alternative optimization strategies for loop execution.
Bugfix 🔧: We've addressed undefined behaviors in PJRT by fixing pointer casting issues between unrelated types. This update enhances code safety and correctness, ensuring smooth operations across CPU and GPU implementations.
Bugfix 🔄: A regression fix in the XLA collective pipeliner ensures proper handling of scalar constants for padding values. This prevents unnecessary broadcasting and improves the efficiency of dynamic tensor operations.
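
The FullyConnectedOp fix above swaps a 32-bit counter for int64_t. To see why that matters, here is a sketch that simulates 32-bit wraparound in Python, whose own integers never overflow. The helper names are hypothetical, and the wraparound models typical two's-complement behavior rather than anything the C++ standard guarantees for signed overflow.

```python
def wrap_int32(value: int) -> int:
    """Reinterpret an arbitrary Python int as a two's-complement int32."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


def num_elements(shape, wrap=None):
    """Multiply the dimensions of shape, optionally wrapping after each step."""
    total = 1
    for dim in shape:
        total *= dim
        if wrap is not None:
            total = wrap(total)  # simulate a fixed-width accumulator
    return total
```

With a shape of (46341, 46341) the true element count is 2147488281, just past the int32 maximum, so the simulated 32-bit accumulator ends up negative while the unbounded one stays correct.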

That's all for now, folks! Enjoy the new features and improvements, and keep coding like a rockstar! 🌟

3 months ago

Hey there, code wranglers! We've been busy optimizing, fixing, and adding some cool new features to our codebase. Here's the latest scoop on what's new, improved, and bug-fixed. 🚀

New feature: We've added the GetDefaultLayout API to the IFRT Proxy, allowing you to easily retrieve default layouts for specified data types, dimensions, devices, and memory kinds. This is a big win for optimizing data placement and access patterns! 🎉
Improvement: Reinstated support for cuDNN explicit CUDA graph construction in the GPU backend, thanks to the release of cuDNN frontend v1.11.0. This enhancement is crucial for boosting performance in deep learning apps. 💪
New feature: Say hello to collect_symlink_data_aspect, a nifty addition for hunting down symlinked files in target runfiles. This makes file management in the build process more robust and efficient. 🔍
New feature: We've added a "copy" button for the full HLO instruction text format in HTML outputs. Now, you can easily copy HLO instruction text directly from the rendered output. Handy, right? 🖱️
New feature: Introducing IOPDDL utilities to XLA Auto Sharding's third-party directory. These tools are essential for tackling optimization problems and evaluating solutions. 🛠️
New feature: Simplified the ComputeAndCompareLiteral function with an overload that doesn't require an error_spec. This makes testing a breeze! 🌬️
Improvement: Enhanced the HLO diff tool to better visualize repetitive computation patterns. This makes it easier to spot and analyze patterns in computation differences. 🔍
Bugfix: Addressed a concurrency issue in GPU compiler tests by mutex-guarding the default_device_assignment_ pointer. No more race conditions here! 🏎️💨
Bugfix: Fixed undefined behaviors in PJRT by correcting how pointers are cast between unrelated types. Safety first! 🚦
Bugfix: Improved the conversion of HLO to StableHLO for programs with bounded dynamism. Now, the conversion process handles these programs more robustly. 🔄
Improvement: Integrated updates from LLVM, aligning with the latest changes and enhancing TensorFlow's capabilities and performance. ⚙️
Chore: We've moved tensorflow/lite/experimental/litert to the google-ai-edge/litert repository, streamlining the codebase for better organization. 📦

That's all for now, folks! Keep coding and stay awesome! 😎

3 months ago

Here's the latest scoop on the updates and improvements made to the TensorFlow and XLA frameworks. We've got some exciting new features, important bug fixes, and a sprinkle of organizational tidying. Let's dive in! 🚀

New Feature: Dynamic GELU Composite Lowerings
Say hello to dynamic composite lowerings for the GELU operation in TensorFlow's MLIR framework. This update brings two new patterns to the table, LegalizeCompositeGELUDynamicShaped and LegalizeCompositeGELUDynamicShaped2, which handle dynamic input shapes with grace and style. Now, TensorFlow can flexibly manage varying input dimensions, making your machine learning models even more robust! 🎉
Improvement: Custom Op for odml.detector
We've waved our magic wand and transformed the odml.detector composite operation into a custom operation within TensorFlow Lite. This makeover streamlines integration and boosts performance by allowing complex operations to be executed as custom operations. A win for flexibility and speed! 🧙‍♂️
New Feature: Explicit Collectives Grouping in JAX
Introducing an explicit collectives grouping pass for jitted JAX methods! This feature ensures computations run within a single NCCL group, optimizing NVLink systems for multi-directional communications. With this addition, expect improved performance and fewer NCCL kernels during execution. Go team efficiency! 🏎️
Bugfix: Shape Representation Safety
We've tightened the bolts on shape representation by using std::variant<> to ensure a shape holds only one exclusive state at a time. This fix prevents misuse and potential crashes, making your code safer and more reliable. Safety first! 🛡️
New Feature: Direct StableHLO to HLO Conversion
Get ready for a smoother ride with direct conversion from StableHLO to HLO for AddOp and ConstantOp. This prototype skips the MHLO step, paving the way for more efficient conversion processes in the future. Streamlining for the win! 🏆
New Feature: GetDefaultLayout API in IFRT Proxy
Meet the new GetDefaultLayout API method in the IFRT Proxy, your go-to for retrieving default layouts for specified data types. This enhancement optimizes data placement and access patterns, making your computational tasks more efficient. Layouts made easy! 📐
Improvement: Scheduler Statistics in XLA
We've added the ability to dump scheduler statistics into a proto, giving you a detailed breakdown of wasted cycles and memory pressure. This enhancement boosts debugging and performance analysis, helping you optimize your scheduling process. Knowledge is power! 📊
Improvement: CommandBuffer API Update
The CommandBuffer class in the XLA GPU backend now features an explicit command update API for the If command. This update allows for more complex command management and resource optimization. Command and conquer! 💪
New Feature: HLO Test for Command Buffers
Introducing a new end-to-end HLO test for command buffers in the XLA GPU service. This test simplifies the process of verifying complex command buffers, strengthening the testing framework and laying the groundwork for future developments. Testing made easy! 🧪
Bugfix: Post-Order Traversal Non-Determinism
We've tackled the non-determinism bug in post-order traversal, ensuring correct instruction ordering by allowing pre-computed post-orders. This fix enhances robustness and prevents potential errors in instruction execution. Order restored! 🔄
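
As background for the entry above, here is a sketch of what a deterministic post-order looks like on a small instruction graph. It does not reproduce the XLA fix, which relies on pre-computed post-orders. The function name and the dictionary-based graph are assumptions for illustration, and the graph is assumed to be acyclic.

```python
def post_order(roots, operands):
    """Return nodes so that every operand appears before its user.

    operands maps a node to an ordered list of its operands. Walking lists
    (rather than sets) in a fixed order keeps the result deterministic.
    """
    order, visited = [], set()
    for root in roots:
        stack = [(root, False)]
        while stack:
            node, expanded = stack.pop()
            if expanded:
                order.append(node)  # all operands were emitted already
                continue
            if node in visited:
                continue
            visited.add(node)
            stack.append((node, True))
            # Push in reverse so operands are visited left to right.
            for operand in reversed(operands.get(node, [])):
                if operand not in visited:
                    stack.append((operand, False))
    return order
```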
Bugfix: Determinism in SHARD_AS/SHARD_LIKE
We've addressed non-determinism in SHARD_AS and SHARD_LIKE operations by switching to std::vector for consistent ordering. This fix enhances the reliability of sharding operations, ensuring predictable outputs in parallel computations. Consistency is key! 🔧
Chore: Kernel Generation Passes Reorganization
We've tidied up the TensorFlow MLIR codebase by moving kernel generation-specific passes to a dedicated directory. This reorganization improves code clarity and maintainability, paving the way for future enhancements. Organization FTW! 📂

That's a wrap on the latest updates! Keep coding, keep innovating, and as always, stay awesome! 🌟

3 months ago

Here's a fresh batch of updates to keep your TensorFlow projects running smoothly and efficiently! 🚀

Improvement: The XLA Latency Hiding Scheduler now dumps its stats to a proto, giving you a clearer picture of performance metrics like wasted cycles and memory pressure. This makes debugging and optimizing your scheduling process a breeze! 📊
Improvement: Say hello to non-blocking NCCL communicators! This update boosts the performance of collective operations in GPU backends by allowing tasks to run concurrently. Faster, smoother, and more efficient GPU operations are now at your fingertips! ⚡️
Improvement: Multiple compilation configs are now supported in the TensorFlow Lite experimental LiteRT compiler plugin. Plus, you can now track partition stats in your compiled models, making performance tuning a lot easier. 🛠️
New Feature: The ifrt::Client interface gets a makeover with two new methods! CreateContext helps you capture the runtime context, and a variant of MakeArrayFromHostBuffer uses this context to streamline performance analysis and debugging. 🕵️‍♂️
New Feature: Introducing the TfrtGpuAsyncHostToDeviceTransferManager and TfrtGpuClient::CreateBuffersForAsyncHostToDevice() for managing async transfers from host to device. More unit tests mean more reliability and correctness! 🧪
New Feature: We've integrated StableHLO from OpenXLA, bringing significant updates and enhancements to the StableHLO framework within the project. 🛡️
New Feature: Check out TfrtGpuBuffer::CopyRawToHostFuture and TfrtGpuClient::BufferFromHostLiteral for efficient and asynchronous data transfers in the TensorFlow XLA GPU backend. 🚀
New Feature: Quantization functionalities have been copied over to TensorFlow Lite, optimizing models for resource-constrained devices. More tests mean more reliability! 📈
Bugfix: A critical bug fix for the FloorDiv operation in TF/TFL lowering to TOSA ensures correct rounding behavior. Accuracy restored! 🔧
Bugfix: Addressed a use-after-move issue in CpuCompiler::CompileAheadOfTime within the XLA CPU module. This fix enhances stability and reliability. 🛠️
Bugfix: Reverted a previous change affecting TensorFlow profiler's error handling, ensuring that any issues are flagged and not ignored. Error management just got stricter! 🚨
Chore: Renamed serialization_base.fbs to tflite_serialization_base.fbs for better clarity and organization within the TensorFlow Lite framework. 🗂️
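
For readers wondering why rounding matters in the FloorDiv fix listed above: floor division rounds the quotient toward negative infinity, while C-style integer division truncates toward zero, so the two disagree whenever the operands have opposite signs. The changelog does not say what the incorrect behavior was. The sketch below only illustrates the two rounding modes in plain Python, and the helper names `floor_div` and `trunc_div` are made up for this example (they are not TOSA or TensorFlow APIs).

```python
def floor_div(a: int, b: int) -> int:
    """Round the quotient toward negative infinity (FloorDiv semantics)."""
    return a // b  # Python's // already floors


def trunc_div(a: int, b: int) -> int:
    """Round the quotient toward zero (C-style integer division)."""
    q = abs(a) // abs(b)
    # The result is negative only when the operands have opposite signs.
    return q if (a >= 0) == (b >= 0) else -q


# The two modes agree for same-sign operands and differ otherwise.
for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
    print(f"{a:>3} / {b:>2}: floor={floor_div(a, b):>2}  trunc={trunc_div(a, b):>2}")
```

A lowering that emits truncating division where floor division is expected would be off by one for inputs such as -7 and 2.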

Enjoy the new features and improvements, and happy coding! 🎉

3 months ago

Here's a delightful rundown of all the awesome changes that have been made recently. Get ready to dive into some cool new features and improvements! 🚀

New Features

Advanced Profiler Configuration: We've jazzed up the profiler with an advanced configuration option. Now, you can specify various settings with greater flexibility, like a pro! 🎛️
GPU Environment via C API: Say goodbye to singleton headaches! Access the GPU environment with our shiny new C API LiteRtGpuGlobalEnvironmentCreate(). It's all about smoother GPU operations now! 🖥️
TfrtGpuBuffer Debut: Introducing the TfrtGpuBuffer! This is the first step in supercharging GPU support within the XLA framework. Let's get that GPU party started! 🎉
Inlineable Attribute: The inlineable attribute is now a first-class citizen, giving you more control over which call operations get inlined. More power to you! 💪
CreateErrorBuffer Functionality: Meet CreateErrorBuffer, your new best friend for error handling in GPU operations. It keeps things running smoothly even when errors pop up. 🛠️

Improvements

Dynamic & Static GPU Accelerator Support: Whether you're dynamically or statically linking your GPU accelerators, we've got you covered. Flexibility at its finest! 🔗
More Op Builders: We've added more operation builders, especially for the ResizeNearestNeighbor operation. Your models just got a makeover! 🏗️
IFRT Arrays Layout Management: Layouts are now better managed with IFRT Arrays, thanks to some nifty tweaks and a roll-forward fix. It's all about keeping things neat and tidy! 📐

Bug Fixes

Shared Library Path Fix in QNN: No more wandering paths! We've fixed the library path issues in QNN, ensuring your shared libraries are right where they need to be. 🛠️
Layout Creation from Proto: Crashes are so yesterday. Now, Layout::CreateFromProto() handles invalid inputs gracefully, keeping your app running smoothly. 🚫💥
GPU Model Execution Fixes: We've squashed bugs causing GPU model execution failures, including layout mishaps and memory leaks. Your GPU tasks just got a whole lot smoother! 🐛🔧

Chore

Automated Code Cleanup: A little spring cleaning never hurt anybody! We've removed unnecessary header files to keep the codebase lean and mean. 🧹

Keep exploring these updates and enjoy the enhanced TensorFlow experience! Happy coding! 😃✨

4 months ago

Here's a fresh batch of updates for you, packed with new features, improvements, and bug fixes. Let's dive in! 🚀

New Feature: LiteRT GPU Accelerator
The ml_drift_cl_litert feature has been unleashed, enhancing TensorBuffer integration via the DelegateKernelLiteRt. This includes publishing TensorBufferRequirements in kLiteRtTensorBufferTypeOpenCl, binding TensorBuffers with BindTensorBuffers(), and a simplified Invoke() implementation. The TensorFlow Lite experimental LiteRT codebase got some love too, with updates ensuring OpenCL is recognized as the buffer type for input and output tensors.
New Feature: XLA TopK Operation Semantics
Added a detailed section in the XLA docs about the TopK operation, explaining how it identifies the largest or smallest elements in a tensor. Whether you're dealing with one-dimensional arrays or multi-dimensional tensors, this update has got your back!
Improvement: Unary Functions in XLA
Enhanced the XLA builder by adding ResultAccuracy support for unary functions like Cbrt, Cos, Erf, and more. This comprehensive update spans multiple files to boost precision and reliability across the TensorFlow ecosystem.
New Feature: chlo.ragged_dot CAPI and Python API
Say hello to the new CAPI and Python API for chlo.ragged_dot in the StableHLO framework. This includes a new RaggedDotDimensionNumbers attribute, allowing users to specify dimension configurations for matrix operations. Python bindings and test cases have been updated to ensure everything runs smoothly.
Improvement: cuDNN Fusion Compiler
The cuDNN fusion compiler now processes graphs with assigned workspaces, optimizing High-Level Operations (HLO) for better GPU performance. This update includes test cleanups and improved resource management.
New Feature: TfrtGpuBuffer
Introducing the TfrtGpuBuffer for GPU support in XLA. This initial version includes updates to the GPU client implementation and a new test file to ensure everything's running like a well-oiled machine.
New Feature: SmallWhileLoopHoistingPass
A new optimization pass for the XLA CPU backend, SmallWhileLoopHoistingPass, improves small while loop performance by hoisting them into callable computations. This update includes unit tests and refinements to cost analysis.
Improvement: Dynamic Test Case Generation
Dynamic test case generation for TensorFlow Lite's compiled models is here! This feature creates C++ test cases on-the-fly, adapting to different environments and consolidating testing into a single binary.
Bugfix: litert::Expected Assignment Operators
Fixed a critical bug in the litert::Expected class assignment operators, ensuring proper handling of different value states and preventing data corruption.
Bugfix: HloRunner Thread Safety
Enhanced the thread safety of the HloRunner class by removing race conditions and introducing a mutex for safe resource management.
Bugfix: Model Round-Tripping
Ensured buffers initially appended to the FlatBuffer remain correctly appended during serialization and deserialization in TensorFlow Lite's LiteRT.
Chore: NCCL References Removed
Cleaned up the XLA GPU backend by removing NCCL references from CollectiveBroadcast and CollectivePermute functionalities, streamlining the codebase for better flexibility and performance.
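
To make the TopK entry above a little more concrete, here is a minimal pure-Python sketch of what "largest or smallest elements" means for a one-dimensional input and for each row of a two-dimensional input. It is not the XLA implementation: the function names, the `largest` parameter, and the tie-breaking rule (earlier index wins) are assumptions made for illustration, so check the XLA operation semantics docs for the authoritative behavior.

```python
def top_k(values, k, largest=True):
    """Return (values, indices) of the k largest (or smallest) entries of a 1-D list."""
    # sorted() is stable, so equal elements keep their original index order.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=largest)[:k]
    return [values[i] for i in order], order


def top_k_last_dim(rows, k, largest=True):
    """Apply top_k independently to each row of a 2-D list (the last dimension)."""
    return [top_k(row, k, largest) for row in rows]


vals, idx = top_k([1, 5, 3, 9, 2], k=2)
print(vals, idx)

print(top_k_last_dim([[4, 8, 1], [7, 0, 6]], k=2, largest=False))
```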

Stay tuned for more updates, and happy coding! 😄✨

4 months ago

In this update, we've got a bunch of exciting new features and improvements that will make your developer life a whole lot easier. From enhanced benchmarking workflows to new operation builders for Qualcomm's AI Engine, we've got it all. Plus, we've squashed some pesky bugs to keep things running smoothly. Let's dive into the details! 🚀

New Feature: Benchmark Presubmit Workflow
We've rolled out a shiny new presubmit workflow for benchmarking performance to catch potential regressions before they sneak into the main codebase. This new setup runs tests across various configurations and helps keep the performance top-notch. Plus, we've renamed existing benchmark workflows to make it crystal clear which ones are for nightly runs and which are for presubmit checks. 🕵️‍♂️
Improvement: StableHLO Integration
Integrated a specific version of StableHLO to streamline tensor operations and enhance compatibility within the MLIR framework. This update brings a more efficient syntax for operations and introduces new tests to ensure everything's running smoothly.
New Feature: TraceMe for Thunk Execution
Added a new tracing mechanism to the Thunk execution process in the XLA CPU backend. This feature provides detailed execution traces, making it easier to monitor and debug performance. 🎯
Improvement: PjRtClient::Compile for TFRT GPU
Implemented the PjRtClient::Compile function for enhanced GPU support in TensorFlow Runtime, optimizing resource utilization and boosting performance for TensorFlow applications.
New Feature: Qualcomm AI Engine Direct Op Builders
Introduced new operation builders for Qualcomm's AI Engine Direct, including Conv2d, DepthwiseConv2d, and more. These additions come with unit tests to ensure robust functionality and improved machine learning model performance. 🤖
New Feature: LiteRT GPU Accelerator Integration
Added the ml_drift_cl_litert feature for better TensorBuffer integration in GPU-accelerated models, enhancing the TensorFlow Lite experimental framework.
New Feature: Elementwise Ops in Collective Pipeliner
Enabled support for elementwise operations in the collective pipeliner, improving the efficiency of GPU computations, especially in scaled FP8 GEMMs.
Bugfix: Cross-Module Instruction References
Fixed an issue where instructions were referencing computations across different modules, which was causing some test failures. This update strengthens module encapsulation and code robustness.
Improvement: LiteRT Google Implementation
Updated the LiteRT Google implementation to try loading the newer libedgetpu_litert.so library first, ensuring compatibility with recent Android builds while maintaining backward compatibility.
Chore: Logging Cleanup
Removed excessive logging in parallel_batch_dataset_op.cc to prevent log spamming and enhance user experience.
Bugfix: VhloToVersion Reversion
Reverted a previous change in the VhloToVersion transformation to simplify version compatibility checks within the StableHLO framework.
Bugfix: Trace Events Reversion
Reverted a change in the trace_events.proto file to clarify the handling of flow events, ensuring the trace event framework functions smoothly.

That's all for now, folks! Keep coding, and stay awesome! 😎

5 months ago

Here's a delightful summary of the latest updates and improvements, packed with exciting new features and crucial bug fixes! 🎉

New Feature: Host Memory Support in StreamExecutor
We've rolled out support for MemoryType::kHost in the CreateStreamExecutor function across multiple executor types. This means you can now allocate and deallocate host memory with ease, thanks to the new GenericMemoryAllocator. Plus, we've added tests to ensure everything runs smoothly. 🚀
New Feature: ARM64 CPU Builds in XLA
Say hello to ARM64 CPU builds for the XLA project via GitHub Actions! This nifty addition enhances our CI workflow, allowing for comprehensive testing across x86 and ARM64 architectures. 🛠️
Improvement: Custom Fusion Integrity in XLA
We've improved instruction fusion by ensuring that custom fusions and calls remain intact. This update enhances the robustness of the fusion process, maintaining the integrity of custom operations. 🔧
New Feature: DMA Operations in PJRT C API
Introducing PJRT_Client_DmaMap and DmaUnmap functions to the PJRT C API! These additions boost our direct memory access capabilities, complete with thorough testing to ensure seamless integration. 💾
New Feature: PyTorch Conversion in tf_tfl_translate
We've added new flags to the tf_tfl_translate tool, making it easier to convert PyTorch saved models. Now you can specify the model's origin and enable direct lowering of composite operations. 🔄
Improvement: Cross-Compile Architecture Support
Developers can now specify target machine architectures in cross-compile scenarios for CUDA, cuDNN, and NCCL. This update ensures smooth redistributions across various platforms. 🌐
Improvement: Bitcast Handling in XLA
We've enhanced the handling of bitcasts in the XLA framework by allowing split dimension mapping. This change optimizes memory allocations and boosts performance. ⚡
New Feature: Attribute Management in HloInstruction
Streamline your code with new methods for managing frontend attributes in HloInstruction. These functions simplify attribute handling, making your code more efficient and readable. 📈
Bugfix: Memory Crash in NcclAllToAllStartThunk
We've fixed a rare crash issue in the memcpy implementation by switching from absl::flat_hash_map to arrays, ensuring stable and performant memory handling. 🐞
Bugfix: Executable Creation in HloRunnerPjRt
A critical bug causing segmentation faults has been squashed by properly managing the ownership of executables in HloRunnerPjRt. 🛠️
Bugfix: Synchronous Dispatch for CPU Callbacks
To prevent deadlocks, CPU executables with host callbacks will now dispatch synchronously. This temporary fix ensures resources are allocated effectively. 🔄
Chore: Clean-Up in StreamExecutor
We've tidied up by removing the unused HostMemoryDeallocate method, enhancing code maintainability and clarity. 🧹

These updates are sure to enhance your experience and keep everything running smoothly. Happy coding! 🎈

5 months ago

Here's a delightful summary of the recent updates and improvements. Get ready to dive into the world of new features, bug fixes, and more! 🚀

New Features

Flatten-Tuple Pass Migration: We've migrated the flatten-tuple pass from MHLO to StableHLO. The new transformation pass flattens tuples in HLO operations, making tuple handling more efficient, and comes with robust test cases to ensure everything is ship-shape. 🛠️
kCpu Property Tag: Say hello to the kCpu property tag in the HloRunner class, which helps distinguish between CPU and GPU environments, paving the way for targeted optimizations. 🖥️
LiteRt C Runtime Shared Library: A new rule to generate a shared library for the LiteRt C runtime is here, making the TensorFlow Lite framework more versatile and organized. 📚
SourceTargetPairs Class: Introducing the SourceTargetPairs class to the XLA service, enhancing the structure and functionality of collective operations. 🎉
Pack Op Legalization: The LiteRT framework now supports the Pack operation, crucial for tensor manipulations in deep learning models. 📦

Improvements

HostOffloader Enhancements: We've improved the handling of DynamicUpdateSlice operations, marking them as host compute when working with host memory, enhancing memory management efficiency. 🧠
Reshard Optimization: In the IFRT framework, multiple reshards are now merged into a single operation when possible, reducing redundancy and boosting performance. 🔄
Persistent Workers for Parallel Loops: Persistent workers are now used for pthreadpool parallel loops, significantly improving execution times and efficiency in the XLA CPU backend. 🚀

Bug Fixes

CUDA Driver Compatibility: Fixed issues with XLA builds on CUDA Driver versions lower than 12.3, ensuring robust functionality across different versions. 🛠️
SparseCore Device ID Fix: Resolved issues with SparseCore device IDs in the TensorFlow profiler's trace viewer, enhancing performance profiling reliability. 📊
Timeline v1 Timestamp Compatibility: Improved timestamp accuracy in the TensorFlow profiler's timeline version 1, ensuring correct timing for GPU events. ⏱️

Chores

Cleanup of Deprecated References: We've cleaned up references to the deprecated global_data.h in XLA, streamlining the codebase for clarity and future improvements. 🧹

These updates bring a mix of new capabilities, optimizations, and fixes, making the TensorFlow ecosystem more robust and ready for the future! 🌟

5 months ago

Here's a delightful update on the latest changes and enhancements that have been made:

🚀 New Features

XLA:CPU Thunk Serialization: We've jazzed up the XLA CPU backend with initial thunk serialization. This means thunks, those nifty units of computation, can now be serialized and deserialized, making it a breeze to save and restore computations. This is particularly handy for distributed computing scenarios. 🎉
NCCL ncclCommInitRankScalable API Support: The XLA GPU framework now supports the NCCL ncclCommInitRankScalable API. This allows NCCL communicators to be initialized using multiple root ranks, boosting performance in large-scale environments. You can tweak the ranks per root with a snazzy flag too! 🌟
Dispatch Op Custom Options: Introducing functions for managing custom options in TensorFlow Lite's LiteRT core using the flexbuffer API. This adds a structured, efficient way to handle dispatch operation options. Flexibility, meet efficiency! 💪
Data Lineage Logging: TensorFlow now sports a data lineage logging mechanism, helping you track and manage data like a pro. Perfect for those who love to keep things organized! 📚
IFRT Atom Programs Utility Pass: A new utility pass writes atom programs and the main IFRT function to files. This enhances management and output of atom programs in XLA. 📜

🔧 Improvements

Coordination Service Task Reconnection: Restartable tasks can now reconnect to a cluster, provided they maintain the same local topology. This boosts stability and reliability. 🔄
Gather/Scatter Operand Overlap Handling: We've added functionality to create copies of operands in gather and scatter instructions when they overlap, ensuring smooth operations without memory conflicts. 🧩
StreamExecutor Memory Allocation Unification: We've taken a step towards unifying memory allocation methods by adding new classes for streamlined management. Future-proofing memory handling like a boss! 🛠️

🐛 Bug Fixes

XLA:Python GIL Scoping: Fixed the scoping of GIL release in the XLA Python extension during nb::bytes object construction. No more threading hiccups! 🐍
PjitFunction Locking: Ensured the lock on cache_ is held when destroying executables_ in PjitFunction, maintaining thread safety in free-threading mode. 🔒
TransposePlan Overflow: Resolved an overflow issue by changing data types to handle larger dimensions without a hitch. No more overflow woes! 📈
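
As a rough illustration of the kind of overflow the TransposePlan entry above describes, the sketch below shows how an element count computed in a signed 32-bit integer wraps around once the dimensions get large. The changelog does not say which types or values were involved. The shape here is hypothetical, and `wrap_int32` is a helper invented for this example to mimic 32-bit two's-complement arithmetic in Python (whose own integers never overflow).

```python
def wrap_int32(x: int) -> int:
    """Interpret x modulo 2**32 as a signed 32-bit two's-complement integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x


dims = (50_000, 50_000)  # hypothetical shape, chosen so the product exceeds 2**31 - 1

exact = 1
narrow = 1
for d in dims:
    exact *= d                       # arbitrary-precision result
    narrow = wrap_int32(narrow * d)  # what a 32-bit accumulator would hold

print("exact element count:", exact)
print("32-bit accumulator :", narrow)
```

Widening the accumulator (for example to 64 bits) keeps the count exact for shapes like this one.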

🧹 Chores

Refcounting Hashmap Cleanup: Removed an unused refcounting hashmap from the XLA codebase, making things cleaner and simpler. Out with the old! 🧹

These updates bring a mix of new features, improvements, bug fixes, and cleanup that enhance the overall performance and functionality of the framework. Keep exploring and enjoy the new capabilities! 🎊

6 months ago

Here's the latest scoop on what's new and improved in our codebase! We've been busy bees, adding some cool new features and squashing pesky bugs to make things run smoother than ever. Check out the highlights below! 🚀

New Feature: Infeed and Outfeed Support for HloRunnerPjRt
We've just rolled out infeed and outfeed support for HloRunnerPjRt in the XLA library. This means you can now transfer data into and out of computations in real-time, making your workflows more dynamic and interactive. Plus, we've added some nifty functions for buffer conversions and threading to keep things running smoothly. 🏃‍♂️💨
Improvement: All-to-All Operation Enhancements
Our latest update optimizes the handling of multiple source-target pairs during all-to-all operations. By merging and splitting sharding axes more efficiently, we've reduced the number of operations needed, boosting performance for distributed computations. Let's get those tensors reshaped and transposed like pros! 🔄
New Feature: CreateFromAhwb Method in TensorBuffer
Say hello to the CreateFromAhwb method in TensorFlow Lite's TensorBuffer class! This new addition allows you to create a TensorBuffer from an Android Hardware Buffer, making it easier to work with hardware-backed tensors. We've got tests in place to ensure everything works like a charm. 📱🔧
New Feature: Pinning Tensors to Device Memory in XLA
You can now pin tensors to device memory in XLA, keeping them from being pre-fetched to alternate memory. This feature enhances memory management and performance, especially for applications that need quick access to critical tensors. 📌💾
Improvement: Dynamic Slice Operation Optimization
We've optimized the partitioning of dynamic-slice operations: input data is now replicated only along the slice dimensions. This change eliminates unnecessary input replication, leading to faster execution in distributed environments. 🎯
New Feature: Lower Fake Quant Annotation
Introducing the LowerQuantAnnotationsPass! This new pass transforms quant.fake_quant operations into tfl.Quantize and tfl.Dequantize ops, paving the way for better quantization handling in TensorFlow MLIR. 🧙‍♂️✨
New Feature: cuDNN Flash Attention Sequence Packing
Our cuDNN flash attention now supports sequence packing, allowing multiple segments to be packed into one batch. This enhancement saves memory and speeds up both training and inference, making your workflows more efficient. 🧩⚡
Bugfix: Dispatch API Build Error
We've fixed a build error in the TensorFlow Lite dispatch API by refining memory management and handling unknown C++ types. This ensures a smoother and error-free build process. 🛠️🐞
Bugfix: 3D Input Quantization in Fully Connected Layers
We've addressed an issue with per-channel quantization for 3D input tensors, ensuring that fully connected operations handle output shapes correctly. Now, your models can process 3D inputs without a hitch! 📏🔍
Bugfix: Operation Profile Improvements
We’ve improved the TensorFlow profiler's operation profile by refining the deduplication process and enhancing the user interface. This makes it easier to manage and analyze operation profiles. 📊🔧
Chore: Remove Unused Refcounting Hashmap
We've cleaned up the codebase by removing an unused refcounting hashmap, streamlining the XLA project for better maintainability. 🧹🗑️
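
For the Lower Fake Quant Annotation entry above, it may help to see why a fake-quant operation can be expressed as a quantize step followed by a dequantize step. The sketch below is plain Python arithmetic, not the LowerQuantAnnotationsPass itself. The affine int8 scheme, the function names, and the parameters (`scale`, `zero_point`, `qmin`, `qmax`) are assumptions chosen for illustration.

```python
def quantize(x: float, scale: float, zero_point: int, qmin: int = -128, qmax: int = 127) -> int:
    """Map a real value to an integer code, clamped to the representable range."""
    q = round(x / scale) + zero_point  # note: Python's round() rounds halves to even
    return max(qmin, min(qmax, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an integer code back to the real value it represents."""
    return (q - zero_point) * scale


def fake_quant(x: float, scale: float, zero_point: int) -> float:
    """Simulate quantization error in floating point: quantize, then dequantize."""
    return dequantize(quantize(x, scale, zero_point), scale, zero_point)


# 1.3 snaps to the nearest representable value; 100.0 saturates at qmax.
print(fake_quant(1.3, scale=0.5, zero_point=0))
print(fake_quant(100.0, scale=0.5, zero_point=0))
```

Splitting the simulated op into an explicit quantize/dequantize pair exposes the integer representation in between, which is what downstream quantization handling typically needs.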

Stay tuned for more updates as we continue to enhance our codebase with awesome features and improvements! 🌟

6 months ago

Welcome to the latest and greatest update roundup! 🚀 We've been busy bees, buzzing around and making some awesome improvements to our beloved frameworks. Here's the lowdown on what's new, what's improved, and what's been squashed:

New feature: Nested Calls in XLA:CPU
Our ElementalKernelEmitter has leveled up! It can now handle nested calls, enhancing the CPU backend's kernel generation capabilities. This means more efficient and flexible computations are on the horizon!
New feature: Pinning Tensors on TPU
Introducing tensor pinning to device SRAM on TPUs via custom calls. This update optimizes memory management, ensuring your computations run smoother and faster.
Improvement: Automated Code Changes in TensorFlow MLIR
We've unleashed a flurry of automated updates across TensorFlow's MLIR compiler, enhancing everything from variable initialization to layout optimization. It's like a turbo boost for model compilation and execution!
New feature: XLA:CPU Collectives API
Say hello to the new collectives API for XLA:CPU! This fresh addition supports collective operations, paving the way for optimized machine learning performance on CPUs.
Improvement: HloInstruction & BufferAssignment in XLA:CPU
We've supercharged the XLA CPU backend by refining the EmitKernelPrototype process, leading to more efficient memory handling and kernel execution. It's all about making things faster and cleaner!
New feature: XLA GPU Documentation
We've added a comprehensive guide to the XLA GPU architecture, complete with visual aids and examples. This documentation is your new best friend for navigating the GPU compiler pipeline.
Improvement: Transposed Convolutions in XLA:CPU
Our transposed convolution algorithm now supports multiple input and output channels, with performance improvements that will make your jaw drop—over 99% faster in some cases!
New feature: TFLite Quantization Option
TFLite users, rejoice! You can now disable per-channel quantization for dense layers, giving you more control over your model's quantization strategy.
Chore: Temporary Wheel Size Increase
We've temporarily increased the wheel size limit to keep those nightly builds rolling smoothly. It's a quick fix while we sort out the underlying issues.
Bugfix: ShapeError Crashes in XLA
We've tackled a pesky bug that caused crashes when element_type was out of bounds. Now, we print the integer value instead, making error reporting clearer and more robust.
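
As background for the TFLite Quantization Option entry above, here is a small sketch of the difference between per-tensor and per-channel quantization scales for a dense layer's weights. It does not use or show the actual TFLite option, whose name the changelog does not give. The symmetric int8 scheme, the function names, and the choice of treating each row as one output channel are assumptions for illustration.

```python
def symmetric_scale(values, qmax: int = 127) -> float:
    """Scale that maps the largest magnitude in `values` onto qmax."""
    return max(abs(v) for v in values) / qmax


def per_tensor_scales(weights):
    """One shared scale for the whole weight matrix."""
    shared = symmetric_scale([v for row in weights for v in row])
    return [shared] * len(weights)


def per_channel_scales(weights):
    """One scale per row (assumed here to be one output channel per row)."""
    return [symmetric_scale(row) for row in weights]


# Hypothetical weights: the first channel is much smaller than the second.
weights = [[1.0, -2.0], [127.0, 64.0]]
print("per-tensor :", per_tensor_scales(weights))
print("per-channel:", per_channel_scales(weights))
```

With a shared scale, the small-magnitude channel is represented coarsely; per-channel scales give it finer resolution at the cost of storing one scale per channel, which is the trade-off an option like this lets you choose.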

That's all for now, folks! Keep those updates coming, and we'll keep making things better, faster, and more awesome. 🎉

6 months ago

Welcome to the latest change log! We've got some exciting updates and improvements to share with you. From new features that enhance performance to bug fixes that ensure smoother operations, here's a rundown of what's new and improved. 🎉

New feature: Introduced F4E2M1FN and F8E8M0FNU types to the XLA framework, enabling microscaling formats like MXFP4. This addition expands the framework's data type capabilities, providing support for unique floating-point formats. 💾
New feature: Added RecordBatchTaskSizeSum in TensorFlow's batching utility to track the cumulative size of tasks within a batch. This function enhances task size analysis during batch processing, offering better insights into task handling. 📊
New feature: Moved ProfileTimeBreakdown to open-source, allowing for detailed execution time analysis of HLO instructions within TensorFlow. This change enhances profiling capabilities for performance monitoring. 🔍
New feature: Added free-threading support to WeakrefLRUCache, improving its functionality in multithreaded environments. The update ensures thread safety with proper locking mechanisms, validated by a new multithreaded test. 🔒
New feature: Introduced a generic XnnFusionThunk for the XLA CPU backend and ported XnnDotThunk to it, optimizing fusion operations for improved performance. 🚀
Improvement: Enhanced the XLA GPU framework by using the NCCL thunk for RaggedAllToAll operations, even in scenarios without inter-replica communication. This update improves handling of ragged data structures. 🤝
Improvement: Enabled sorted scatters in the XLA GPU backend, optimizing scatter operations with sorted indices for better performance. 📈
Improvement: Added locking around lazily-initialized fields in PyDeviceList to ensure thread safety in the XLA Python interface, enhancing robustness in multi-threaded environments. 🛡️
Bugfix: Fixed a crash due to out-of-memory errors in XLA's custom convolution algorithm by introducing a threshold for convolution matrix size, ensuring memory constraints are respected. 🛠️
Bugfix: Corrected kernel launch dimensions for ROCm to comply with platform-specific checks, enhancing compatibility and performance for ROCm applications. 🎯
Bugfix: Resolved a Bazel code check error by updating the BUILD file to use the correct namespace for platform compatibility, ensuring smoother build processes. 🔧
Chore: Integrated Triton library up to a specific commit, including patch files to address issues and improve compatibility. This ongoing effort refines the Triton integration for enhanced functionality. ⚙️
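
To give a feel for the F4E2M1FN and F8E8M0FNU types mentioned above, here is a minimal decoding sketch. The bit layouts are assumptions based on my reading of the OCP Microscaling (MX) formats rather than anything stated in this changelog: F4E2M1FN as 1 sign bit, 2 exponent bits (bias 1), and 1 mantissa bit with no infinities or NaNs, and F8E8M0FNU as an unsigned 8-bit power-of-two exponent (bias 127) where 0xFF encodes NaN. The function names are invented for this example and are not XLA APIs.

```python
import math


def decode_f4e2m1fn(bits: int) -> float:
    """Decode a 4-bit value laid out as sign | 2-bit exponent | 1-bit mantissa."""
    sign = -1.0 if bits & 0b1000 else 1.0
    exp = (bits >> 1) & 0b11
    man = bits & 0b1
    if exp == 0:
        return sign * man * 0.5  # subnormal: 0 or 0.5
    return sign * (1.0 + man * 0.5) * 2.0 ** (exp - 1)


def decode_f8e8m0fnu(bits: int) -> float:
    """Decode an 8-bit exponent-only scale: 2**(bits - 127), with 0xFF as NaN."""
    if bits == 0xFF:
        return math.nan
    return 2.0 ** (bits - 127)


def decode_mx_block(scale_bits: int, element_bits):
    """Decode a block of 4-bit elements that share one power-of-two scale."""
    scale = decode_f8e8m0fnu(scale_bits)
    return [scale * decode_f4e2m1fn(b) for b in element_bits]


# All eight non-negative F4E2M1FN values.
print([decode_f4e2m1fn(b) for b in range(8)])

# A tiny block with shared scale 2**1.
print(decode_mx_block(128, [0b0011, 0b1010]))
```

The shared scale is what lets a 4-bit element type cover a wide dynamic range: each element contributes only a coarse value, and the block-level exponent shifts the whole group.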

We hope these updates make your experience even better! Stay tuned for more improvements and features. 🚀
