TensorFlow Changelog


Here's an update on the latest changes and enhancements:

New Features

  • XLA:CPU Thunk Serialization: The XLA CPU backend gains initial thunk serialization. Thunks, the backend's units of computation, can now be serialized and deserialized, making it easy to save and restore computations. This is particularly handy for distributed computing scenarios.

  • NCCL ncclCommInitRankScalable API Support: The XLA GPU backend now supports the NCCL ncclCommInitRankScalable API, which initializes NCCL communicators from multiple root ranks and improves initialization performance in large-scale environments. The number of ranks per root is configurable via a flag.
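Debug options like this are typically passed through the XLA_FLAGS environment variable before the first XLA client is created. A minimal sketch follows; the flag name is an assumption for illustration only, so check xla/debug_options_flags.cc for the real spelling:

```python
import os

# Hypothetical flag name: controls how many ranks each root handles
# when initializing NCCL communicators via ncclCommInitRankScalable.
os.environ["XLA_FLAGS"] = (
    os.environ.get("XLA_FLAGS", "")
    + " --xla_gpu_nccl_init_max_rank_per_root_ratio=512"
)

import jax  # import the XLA client only after setting XLA_FLAGS
```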

  • Dispatch Op Custom Options: New functions manage custom options for dispatch operations in TensorFlow Lite's LiteRT core using the flexbuffer API, giving dispatch options a structured, efficient representation.

  • Data Lineage Logging: TensorFlow gains a data lineage logging mechanism to help you track and manage the provenance of your data.

  • IFRT Atom Programs Utility Pass: A new utility pass writes atom programs and the main IFRT function to files, improving the management and output of atom programs in XLA.

Improvements

  • Coordination Service Task Reconnection: Restartable tasks can now reconnect to a cluster, provided they keep the same local topology. This improves stability and reliability.

  • Gather/Scatter Operand Overlap Handling: Gather and scatter instructions now copy operands that overlap, avoiding memory conflicts during execution.

  • StreamExecutor Memory Allocation Unification: A step toward unifying memory allocation methods, with new classes for streamlined management.

๐Ÿ› Bug Fixes

  • XLA:Python GIL Scoping: Fixed the scoping of GIL release in the XLA Python extension during nb::bytes object construction. No more threading hiccups! ๐Ÿ

  • PjitFunction Locking: Ensured the lock on cache_ is held when destroying executables_ in PjitFunction, maintaining thread safety in a free-threading mode. ๐Ÿ”’

  • TransposePlan Overflow: Resolved an overflow issue by changing data types to handle larger dimensions without a hitch. No more overflow woes! ๐Ÿ“ˆ

Chores

  • Refcounting Hashmap Cleanup: Removed an unused refcounting hashmap from the XLA codebase.

These updates bring a mix of new features, improvements, bug fixes, and cleanup that enhance the overall performance and functionality of the framework.


Here's the latest on what's new and improved in the codebase: new features, performance work, and bug fixes across XLA and TensorFlow Lite.

  • New Feature: Infeed and Outfeed Support for HloRunnerPjRt
    HloRunnerPjRt in XLA now supports infeed and outfeed, so data can be transferred into and out of computations while they run. Supporting functions for buffer conversion and threading keep the transfers smooth.

  • Improvement: All-to-All Operation Enhancements
    Handling of multiple source-target pairs during all-to-all operations is now optimized. Merging and splitting sharding axes more efficiently reduces the number of operations needed, improving performance for distributed computations.

  • New Feature: CreateFromAhwb Method in TensorBuffer
    TensorFlow Lite's TensorBuffer class gains a CreateFromAhwb method that creates a TensorBuffer from an Android Hardware Buffer, simplifying work with hardware-backed tensors. Tests cover the new path.

  • New Feature: Pinning Tensors to Device Memory in XLA
    Tensors can now be pinned to device memory in XLA, preventing them from being prefetched to alternate memory. This improves memory management and performance for applications that need fast access to critical tensors.

  • Improvement: Dynamic Slice Operation Optimization
    The partitioner for dynamic-slice operations now replicates input data only along the slice dimensions, eliminating unnecessary input replication and speeding up execution in distributed environments.

  • New Feature: Lower Fake Quant Annotation
    The new LowerQuantAnnotationsPass transforms quant.fake_quant operations into tfl.Quantize and tfl.Dequantize ops, paving the way for better quantization handling in TensorFlow MLIR.

  • New Feature: cuDNN Flash Attention Sequence Packing
    cuDNN flash attention now supports sequence packing, so multiple segments can be packed into one batch. This saves memory and speeds up both training and inference.

  • Bugfix: Dispatch API Build Error
    Fixed a build error in the TensorFlow Lite dispatch API by refining memory management and the handling of unknown C++ types.

  • Bugfix: 3D Input Quantization in Fully Connected Layers
    Fixed per-channel quantization for 3D input tensors so that fully connected operations compute output shapes correctly.

  • Bugfix: Operation Profile Improvements
    The TensorFlow profiler's operation profile gets a refined deduplication process and an improved user interface, making operation profiles easier to manage and analyze.

  • Chore: Remove Unused Refcounting Hashmap
    Removed an unused refcounting hashmap, streamlining the XLA codebase for better maintainability.

Stay tuned for more updates as we continue to enhance the codebase.


Welcome to the latest update roundup. Here's what's new, what's improved, and what's been fixed:

  • New feature: Nested Calls in XLA:CPU
    ElementalKernelEmitter can now handle nested calls, enhancing the CPU backend's kernel generation capabilities and enabling more efficient and flexible computations.

  • New feature: Pinning Tensors on TPU
    Tensors can now be pinned to device SRAM on TPUs via custom calls, improving memory management for computations that need fast access to specific buffers.

  • Improvement: Automated Code Changes in TensorFlow MLIR
    A batch of automated updates across TensorFlow's MLIR compiler improves everything from variable initialization to layout optimization, speeding up model compilation and execution.

  • New feature: XLA:CPU Collectives API
    A new collectives API for XLA:CPU supports collective operations, paving the way for optimized machine learning performance on CPUs.

  • Improvement: HloInstruction & BufferAssignment in XLA:CPU
    The XLA CPU backend's EmitKernelPrototype process has been refined, leading to more efficient memory handling and kernel execution.

  • New feature: XLA GPU Documentation
    A comprehensive guide to the XLA GPU architecture, complete with diagrams and examples, now documents the GPU compiler pipeline.

  • Improvement: Transposed Convolutions in XLA:CPU
    The transposed convolution algorithm now supports multiple input and output channels, with dramatic performance improvements, over 99% faster in some cases.

  • New feature: TFLite Quantization Option
    TFLite users can now disable per-channel quantization for dense layers, giving more control over a model's quantization strategy, as sketched below.
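From the Python converter, the toggle looks roughly like this sketch; the exact attribute name is an assumption based on the converter's experimental flags, so verify it against your TensorFlow version:

```python
import tensorflow as tf

# A trivial model with a dense layer to convert.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Assumed experimental attribute: request per-tensor instead of
# per-channel quantization for dense (fully connected) layers.
converter._experimental_disable_per_channel_quantization_for_dense_layers = True

tflite_model = converter.convert()
```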

  • Chore: Temporary Wheel Size Increase
    The wheel size limit has been temporarily increased to keep nightly builds running while the underlying size issues are sorted out.

  • Bugfix: ShapeError Crashes in XLA
    Fixed a crash when element_type was out of bounds; the integer value is now printed instead, making error reporting clearer and more robust.

That's all for this batch. More improvements are on the way.


Welcome to the latest change log: new features that enhance performance, improvements, and bug fixes for smoother operation.

  • New feature: Introduced the F4E2M1FN and F8E8M0FNU types to the XLA framework, enabling microscaling formats such as MXFP4 and expanding the framework's support for narrow floating-point formats.
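For a feel of what these types look like from Python, the ml_dtypes package (used by JAX and XLA's Python bindings) exposes NumPy dtypes with matching names; this sketch assumes ml_dtypes >= 0.5, which added the MX narrow-float types:

```python
import numpy as np
import ml_dtypes  # assumed >= 0.5 for the MX narrow-float dtypes

# float4_e2m1fn: 4-bit float (2 exponent bits, 1 mantissa bit).
# float8_e8m0fnu: 8-bit exponent-only scale type used by MX formats,
# representing only powers of two.
x = np.array([0.5, 1.0, 1.5, 6.0], dtype=ml_dtypes.float4_e2m1fn)
scale = np.array([2.0], dtype=ml_dtypes.float8_e8m0fnu)

print(x)  # all four values are exactly representable in e2m1
print(x.astype(np.float32) * scale.astype(np.float32))
```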

  • New feature: Added RecordBatchTaskSizeSum to TensorFlow's batching utility to track the cumulative size of tasks within a batch, giving better insight into task handling during batch processing.

  • New feature: Moved ProfileTimeBreakdown to open source, enabling detailed execution time analysis of HLO instructions within TensorFlow and enhancing profiling for performance monitoring.

  • New feature: Added free-threading support to WeakrefLRUCache, with proper locking validated by a new multithreaded test, improving behavior in multithreaded environments.

  • New feature: Introduced a generic XnnFusionThunk for the XLA CPU backend and ported XnnDotThunk to it, optimizing fusion operations.

  • Improvement: The XLA GPU backend now uses the NCCL thunk for RaggedAllToAll operations even when there is no inter-replica communication, improving handling of ragged data structures.

  • Improvement: Enabled sorted scatters in the XLA GPU backend, optimizing scatter operations whose indices are already sorted.

  • Improvement: Added locking around lazily-initialized fields in PyDeviceList to ensure thread safety in the XLA Python interface.

  • Bugfix: Fixed an out-of-memory crash in XLA's custom convolution algorithm by introducing a threshold on convolution matrix size so memory constraints are respected.

  • Bugfix: Corrected kernel launch dimensions for ROCm to comply with platform-specific checks, improving compatibility and performance on ROCm.

  • Bugfix: Resolved a Bazel code check error by updating the BUILD file to use the correct namespace for platform compatibility.

  • Chore: Integrated the Triton library up to a specific commit, including patch files that address issues and improve compatibility.

We hope these updates improve your experience. Stay tuned for more improvements and features.


Hello, TensorFlow fans! Here are the latest updates to TensorFlow Lite and the profiling tools: new features, improvements, and a few bug fixes.

  • Improvement: Enhanced Compiler Plugin API
    The compiler plugin API now partitions at the subgraph level instead of the model level, giving finer-grained association of operations with subgraphs and making compilation more precise and efficient.

  • Improvement: Improved Model Management
    Pre-allocated subgraphs can now be transferred into models, and metadata can be popped from the model's map, improving memory management and organization.

  • Improvement: Model FLOPs Calculations
    Model-specific FLOPs are now part of the device operation metrics, providing deeper insight into model performance for optimization.

  • New Feature: Per-Channel Quantization in QC Compiler Plugin
    The Qualcomm compiler plugin now supports per-channel quantization parameters, adding flexibility and efficiency for models that need them.

  • New Feature: std::any to LiteRtAny Conversion
    Conversion between std::any and LiteRtAny is now available, improving data handling flexibility in TensorFlow Lite's experimental library.

  • New Feature: Per-Tensor Quantization in QNN IR
    The QNN Intermediate Representation now supports per-tensor quantization, expanding the range of models it can handle.

  • New Feature: Open Source TPU Step Utils
    The new tpu_step_breakdown_utils and tpu_step_details_utils libraries provide detailed breakdowns of TPU performance metrics, helping you optimize TPU workloads.

  • New Feature: HardwareType Combining
    When merging RunEnvironment instances, the highest hardware type is now selected, ensuring accurate profiling of hardware capabilities.

  • Bugfix: Range Analysis Fix
    Fixed operand range multiplication with constants: all components are now multiplied correctly, ensuring accurate range analysis.

  • Bugfix: Gather Operation Index Clamping
    Out-of-bound indices in gather operations are now clamped, preventing execution bugs in SPMD partitioners.

  • Bugfix: Build Breakage Fix
    Resolved a build issue by aligning data types in the flatbuffer tools for Android, ensuring smooth compilation.

These updates are designed to make your TensorFlow experience smoother, faster, and more powerful. Stay tuned for more.


Here's the scoop on the latest updates to the libraries: new features, bug fixes, and a sprinkle of optimizations.

  • New feature: TensorBoard now has an inference_latency_chart. It visualizes how long a model's inference takes, helping you make better optimization decisions.

  • New feature: LiteRT now supports per-channel quantization. Applying a different quantization scale to each tensor channel allows more precise model optimization and improves accuracy in resource-constrained environments.
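To make the idea concrete, here is a minimal NumPy sketch of symmetric per-channel quantization; it illustrates the general technique, not LiteRT's internal implementation:

```python
import numpy as np

def quantize_per_channel(w: np.ndarray, axis: int = 0):
    """Symmetric int8 per-channel quantization along `axis`."""
    # One scale per channel, sized so the max magnitude maps to 127.
    reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
    max_abs = np.max(np.abs(w), axis=reduce_axes, keepdims=True)
    scales = max_abs / 127.0
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

w = np.random.randn(4, 16).astype(np.float32)  # 4 output channels
q, scales = quantize_per_channel(w, axis=0)
w_hat = q.astype(np.float32) * scales          # dequantize
print(np.max(np.abs(w - w_hat)))               # small per-channel error
```

Because each channel gets its own scale, a channel with small weights is no longer forced onto the coarse grid chosen for the largest channel, which is where the accuracy gain comes from.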

  • New feature: The Qualcomm compiler plugin for TensorFlow Lite now supports per-channel quantization parameters, bringing greater flexibility and efficiency to models that benefit from the technique.

  • New feature: The WhileLoopAllReduceCodeMotion pass is now part of the XLA optimization toolkit and can improve the performance of while loops through more efficient code motion.

  • Bugfix: The XLA latency hiding scheduler now handles annotated no-op instructions correctly: they wait for the whole annotation set to be ready before scheduling, improving performance.

  • Bugfix: Fixed crashes in the XLA Latency Hiding Scheduler with non-standard async ops; the scheduler now handles complex dependencies correctly.

  • Bugfix: Fixed a range analysis bug in XLA where operand ranges weren't multiplied correctly with constants. The updated logic ensures accurate range calculations, strengthening the reliability of the XLA service.

  • Improvement: The TensorFlow profiler now supports sampling for inference profiles, making it easier to analyze inference performance with more detailed statistics.

  • Improvement: Essential StepEvents have been added for GPU inference profiles, enhancing profiling of TensorFlow applications running on GPUs.

  • Chore: The --xla_gpu_experimental_enable_triton_softmax_priority_fusion flag has been removed from the XLA GPU compiler's API, simplifying the codebase.

That's all for now. Keep those models running smoothly and efficiently.


Here's the scoop on our latest updates: new features, bug fixes, and refinements across the stack.


New Features:

  • xla::Collectives API: The new xla::Collectives API sets the stage for NVIDIA Collective Communications Library (NCCL) integration, making XLA more robust for parallel processing on GPUs with support for both host- and device-initiated collective operations.

  • Greater Op Legalization: TensorFlow Lite's LiteRT framework now supports the "greater" operation, complete with new test data and build configurations, expanding tensor comparison capabilities.

  • Dynamic Shapes in Convolutions: StableHLO now supports dynamic shapes in 1D convolutions, offering more flexibility.

  • Ragged All-to-All in XLA: Asynchronous start and done phases for the "ragged all-to-all" operation improve XLA's efficiency in handling this collective.

  • Custom Options in IFRT: Users can now specify custom_options for runtime-specific execution, allowing more tailored execution parameters.

  • Multi XSpace to InferenceStats Conversion: A new function transforms multiple XSpace instances into InferenceStats, improving TensorFlow's profiling of inference performance.

  • HLO Stats Tool: The new HLO Stats Tool in TensorFlow's profiler enables deeper performance analysis of high-level operations.

Improvements:

  • C++ Tree with Path API: The tree_util.tree_flatten_with_path and tree_map_with_path APIs have been ported to C++, speeding up pytree flattening.
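These are the JAX pytree path APIs; their Python-facing behavior is unchanged by the C++ port. A quick usage reminder:

```python
import jax

tree = {"w": [1.0, 2.0], "b": 3.0}

# Flatten to (path, leaf) pairs plus a treedef; now backed by C++.
path_leaves, treedef = jax.tree_util.tree_flatten_with_path(tree)
for path, leaf in path_leaves:
    print(jax.tree_util.keystr(path), leaf)  # e.g. ['b'] 3.0

# Map a function that also sees each leaf's path.
labeled = jax.tree_util.tree_map_with_path(
    lambda path, x: (jax.tree_util.keystr(path), x), tree)
```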

Bug Fixes:

  • Triton Dot Product Bug: Fixed a bug in Triton's dot product algorithm for dot(inf, 1.0), ensuring correct results when summing non-finite values.

  • Wheel Creation Logic: Resolved issues in TensorFlow's wheel creation logic when using pywrap rules, improving packaging.

  • Graph Output Tensor Recognition: Corrected logic in TensorFlow Lite so graph output tensors are recognized even when they are also consumed by other ops.

Chores:

  • Obsolete TODO Removal: Cleaned up outdated TODO comments in the TensorFlow XLA compiler codebase.

These updates are all about making your experience smoother, faster, and more efficient. Keep the feedback coming!


Welcome to the latest updates: new features and bug fixes for a smoother, more efficient experience. Here's a rundown of what's new and improved:

  • New Feature: Parallel compilation is now live for the XLA CPU backend, built on the new ORC TaskDispatcher. JIT compilation can now use multiple threads, making it faster and more efficient.

  • New Feature: TensorV1Attr support has been added to flatbuffer_export and flatbuffer_operator, allowing a more structured and efficient representation of tensor attributes in TensorFlow's MLIR framework.

  • New Feature: A new VIFRT pass converts between VIFRT versions, ensuring compatibility and flexibility across different versions.

  • New Feature: Python bindings for VIFRT serialization are here. IFRT IR programs can now be serialized and deserialized from Python, with compatibility across versions.

  • New Feature: An experimental C++ graph builder for TensorFlow Lite lets developers construct and manipulate machine learning models programmatically, enhancing TFLite's flexibility and usability.

  • Improvement: The CpuCompiler in the XLA CPU backend has migrated from SimpleOrcJit to JitCompiler, promising better optimization and execution speed.

  • Improvement: In preparation for JIT compilation, the JitCompiler is now constructed within the CpuCompiler, setting the stage for more efficient compilation.

  • New Feature: A sharding config has been added to XLA's HloModuleConfig as part of the AutoFDO integration, giving better control over operation distribution.

  • Bugfix: Fixed a bug in the MoveUserInstructionsIn function that caused compilation errors with conditional operations; it now handles multiple users correctly.

  • Bugfix: Fixed an async execution bug in transposed convolution operations on XLA CPU: the intermediate buffer now stays in scope, preventing memory errors.

  • Bugfix: Restored the tune_ctas logic in GemmFusionAutotunerImpl, ensuring proper CTA tuning for GPU computations, especially on Hopper architectures.

  • Chore: Updated internal visibility settings for the registry library so access is managed correctly for Google-specific clients.

Enjoy the new features and improvements, and keep an eye out for more updates coming your way.


Welcome to the latest round of updates: new features, bug fixes, and codebase tidying. Here's a rundown of what's new and improved:

  • New feature: Added support for overriding cross-program prefetch behavior and filtering buffer intervals by usage in XLA:TPU:MSA. These changes make memory management more flexible and efficient, and tests cover the new behavior.

  • New feature: The HLO evaluator now supports explicit batch dimensions for gather and scatter operations, reserving the necessary tensor dimensions and making these operations more flexible and robust.

  • Improvement: The new AssertEq wrapper helps ensure function outputs match expected results, strengthening the assertion framework. Error checking in the TensorFlow Lite runtime also improves, with more reliable tensor type validation.

  • New feature: The new HloModuleInterface and HloInstructionInterface provide a more organized way to manage HLO data, improving efficiency and performance metrics retrieval.

  • New feature: A RuntimeConfig can now be supplied when loading SavedModels, allowing the tf2xla MLIR bridge to be disabled and graph execution to be tuned for better performance.

  • Bugfix: Fixed a critical issue in CalculatePostOrderScheduleHelper() so kAsyncStart instructions are correctly initialized, preventing instructions from being processed out of order.

  • New feature: The HloUnaryInstruction class improves result accuracy for specific unary functions, enhancing precision in computations.

  • Improvement: GPU GEMM fusions can now fuse effective parameters and their broadcasts into the epilogues, optimizing performance.

  • New feature: A new ToolParam for the XNNPACK TFLite delegate toggles the Slinky optimizer via command-line flags, giving more control over performance tuning.

  • Bugfix: The GPU dot algorithm rewriter now handles infinity and NaN values correctly, ensuring accurate results in BF16 operations.

  • Bugfix: Fixed the AlgebraicSimplifier so it no longer eliminates host offloading copies, maintaining the integrity of host memory operations.

  • Chore: Removed an unnecessary gpu_types.h inclusion from topk_kernel_test.cc, streamlining the code and reducing compilation time.

We hope these updates make your experience even better.


Welcome to the latest updates: new features, improvements, and bug fixes.


New Features

  • PJRT Buffer Copy: PJRT_Buffer_CopyRawToHost joins the PJRT C API, letting you copy raw data from device to host memory. This is useful for high-performance computing and machine learning applications that manage their own host buffers.

  • HLO Interfaces: The new HloModuleInterface and HloInstructionInterface improve HLO module and instruction management, bringing better-organized data handling to the TensorFlow profiling utilities.

  • Dot Product Testing: The XLA GPU framework now includes a test for dot products with batch and contracting dimensions, ensuring robust backend support for these matrix operations.

Improvements

  • LLVM Update: Synced with the latest LLVM, picking up its newest features and improvements.

  • GEMM Fusion Flexibility: GPU GEMM fusion now supports broadcasts of trivially-sized dimensions, such as [1,n] to [1,m,n], thanks to PR #19112, adding flexibility and efficiency to matrix operations.

  • TFL Pass Migration: The PushTransposeThroughEwisePass has migrated to the new TFL pass mechanism, streamlining the code and making it easier to maintain. The command-line argument has been updated for consistency.

Bugfixes

  • No Signature, No Problem: Fixed an issue in TensorFlow Lite where models without signatures caused failures; a nullptr is now passed for models lacking function signatures.

  • Algebraic Simplifier Tweaks: The AlgebraicSimplifier in XLA now respects host offloading copies, preventing unwanted eliminations and maintaining computation integrity.

  • Developer Guide Tweak: Fixed a formatting error in developer_guide.md where <USER> was rendered incorrectly; it is now {USER}.

Chore

  • Code Cleanup: Tidied up gpu_types.h by removing unused type aliases, improving clarity and making room for future work.

That's all for now. Keep your eyes peeled for more updates and improvements.


Here's the latest on the codebase: new features, improvements, and bug fixes. Let's dive right in.


New feature: The XLA framework now uses the CUDA runtime API to accurately determine whether two ranks are on the same host, ensuring more reliable local communication during collective operations in multi-GPU setups.

New feature: A new transformation pass outlines an IFRT IR atom program into a module, extending XLA's handling of IR atom programs.

Improvement: The TensorFlow Lite compiler now checks for infinity when folding max and min ops, so these operations handle extreme floating-point values correctly.

New feature: Output data from TFLite models can now be saved as TensorFlow Example protocol buffers and written to a file, simplifying model evaluation and debugging.

Improvement: Profiling has been added to the ifrt-proxy client, enabling request-response trace tracking for monitoring and analyzing RPC calls.

New feature: Direct legalization for min and max operations is now available in TensorFlow Lite, streamlining the conversion process and improving performance.

New feature: A new pattern reorders gather and cast ops in TensorFlow Lite for more efficient execution.

New feature: A new optimization pattern simplifies broadcasting and reshaping operations in TensorFlow MLIR, improving efficiency.

Bugfix: Fixed a critical issue in JAX where input arrays weren't reshaped correctly, preventing crashes on TPU and ensuring correct outputs on GPU.

Bugfix: Fixed memory leaks in cuda_executor.cc error paths, improving memory management.

Bugfix: Resolved compatibility issues with NumPy 2.x in TensorFlow's numpy-like operations.

Chore: Deleted status_test_util.h after migrating all its users, leaving a cleaner codebase.


That's all for now. Stay tuned for more updates and improvements.


Here's a rundown of the latest changes, improvements, and fixes in the codebase:

  • New feature: Integrated the StableHLO framework into TensorFlow's MLIR infrastructure. This major update focuses on transforming and legalizing quantization and HLO operations, improving compatibility and performance.

  • New feature: Added support for unary element-wise operations in the MHLO to TFL conversion, so operations such as absolute value and the trigonometric functions are now transformed seamlessly, extending TensorFlow Lite's coverage.

  • Improvement: When exporting MLIR modules, the HLO module name now matches the MLIR module name instead of defaulting to "main", avoiding confusion and conflicts.

  • New feature: Laid the groundwork for adding memory spaces to the CompileOnlyClient, paving the way for more sophisticated memory handling in XLA.

  • Improvement: FP8 windowed einsums with multiple all-gather dots are now supported, optimizing FP8 operations within XLA by shifting where dequantization happens.

  • Improvement: Casting between floats and integers in MLIR is now more efficient thanks to new folding optimizations, speeding up compilation.

  • New feature: The new GetSparseCoreId function in the TensorFlow profiler extracts Sparse Core IDs from plane names, improving TPU profiling.

  • New feature: Added a pass that opens the sharding of while-op free variables, helping optimize sharding strategies during HLO conversion.

  • Bugfix: Resolved an issue where "MakeExactCopy" didn't copy "known_graph_outputs_", ensuring all necessary output values are retained in copied graphs.

  • Bugfix: Fixed integer overflow issues after the NumPy 2.0 update by refining type casting and array creation, maintaining compatibility with NumPy 1.x behavior.

  • Chore: Cleaned up pywrap_parallel_device.cc by removing unnecessary TensorFlow C API headers.

  • Bugfix: Addressed test failures under NumPy 2.x by directly calling __array__() for objects that require a copy when converting to TF tensors.

These updates are all about making things run smoother, faster, and with fewer hiccups. Happy coding!


Welcome to the latest update: new features, bug fixes, and performance work. Here's a rundown of what's new:

New Features

  • Original Value Tracking: A new pass adds the original_value field to each operation in the HLO graph, making value tracking within the graph much easier to manage and analyze.
  • cuDNN Custom Call Conversions: A new pass converts specific cuDNN custom calls into custom fusion operations, letting JAX users run selected computations as cuDNN kernels for better GPU performance.
  • Batch Dimension in Gather/Scatter: Gather and Scatter HLO syntax now supports batch dimensions, enhancing data manipulation operations in XLA.
  • BatchFunction Operation: Updated protocol buffer text files to include a new "BatchFunction" operation, allowing more flexible batching of input tensors.
  • AsyncWrapper: The new AsyncWrapper wraps instructions into async blocks, enabling concurrent execution and potentially improving performance.

Improvements

  • Additional Batch Padding Policies: Exposed new batch padding policies, such as "BATCH_DOWN" and "MINIMIZE_TPU_COST_PER_REQUEST", for more efficient batch processing.
  • Async Dispatch for JAX CPU Backend: Expensive computations on the JAX CPU backend are now dispatched asynchronously, with an opt-out for those who prefer the old synchronous behavior (see the sketch below).
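A minimal sketch of the opt-out, assuming the config flag keeps the name used when this rolled out in JAX; verify against your JAX version's changelog:

```python
import jax

# Assumed flag name: opt out of asynchronous dispatch on the CPU
# backend and restore the old synchronous behavior.
jax.config.update("jax_cpu_enable_async_dispatch", False)

# CPU computations now block until the result is ready instead of
# returning immediately with a future-like array.
x = jax.numpy.ones((1024, 1024))
y = (x @ x).block_until_ready()
```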

Bugfixes

  • Pipelining with Sequential Extracts: Fixed a bug in pipelining with sequential extracts, ensuring only the induction variable of a loop can be replaced.
  • Revert Changes in TensorFlow Lite GPU Delegate: Reverted a previous change to simplify the handling of the kClFastRelaxedMath compiler option, standardizing behavior across GPU architectures.
  • Revert Changes in CUDA FFT Library: Reverted modifications that renamed and updated dependencies for the CUDA FFT library, ensuring proper initialization and integration.

Chores

  • Automated Code Cleanup: Removed unnecessary TensorFlow C API headers from pywrap_parallel_device.cc, streamlining the codebase.

We hope these updates make your development experience smoother and more efficient. Happy coding!


Hey there! Here's the latest from the codebase: improvements, new features, bug fixes, and chores.


Improvements

  • Streamlined Kernel Management: Combined StreamExecutor::GetKernel and StreamExecutor::CreateKernel into a single StreamExecutor::LoadKernel method, simplifying the interface and improving memory management.
  • Efficient Operand Resharding: Optimized the partitioning of dot operations by directly resharding the rhs operand to match the lhs and result tensor shardings, eliminating redundant rematerialization.
  • Enhanced GPU Operations: Introduced IndexingMapAttr to ApplyIndexingOp, improving the efficiency and correctness of GPU fusions in XLA.

New Features

  • String Shape Kernel: Registered a Shape kernel that handles string tensors, extending TensorFlow's string-processing capabilities on GPUs.
  • ASCII Art Memory Map: A new function prints a compact 2D map of occupied heap memory over time as ASCII art, making memory debugging easier (see the sketch after this list for the general idea).
  • Long Polling for Error Propagation: Long polling is now available as a way to propagate errors in the coordination service, improving robustness and responsiveness.
  • Gloo Support on macOS: Gloo now works on macOS using the libuv transport mechanism, expanding platform compatibility.
  • Experimental Command Buffers: A new flag enables command buffers during profiling sessions in the XLA GPU backend, providing more flexibility.
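The heap-map idea is easy to picture with a small self-contained illustration (not XLA's actual implementation): rows are address buckets, columns are time steps, and a glyph marks occupancy.

```python
# Illustrative sketch only: render heap occupancy over time as ASCII
# art. Each row is an address bucket, each column a time step; '#'
# marks buckets occupied at that time.
def ascii_heap_map(allocations, heap_size, width=64, height=16):
    """allocations: list of (offset, size, start_time, end_time)."""
    t_max = max(end for _, _, _, end in allocations)
    grid = [["." for _ in range(width)] for _ in range(height)]
    for offset, size, start, end in allocations:
        row_lo = offset * height // heap_size
        row_hi = max(row_lo + 1, (offset + size) * height // heap_size)
        col_lo = start * width // t_max
        col_hi = max(col_lo + 1, end * width // t_max)
        for r in range(row_lo, min(row_hi, height)):
            for c in range(col_lo, min(col_hi, width)):
                grid[r][c] = "#"
    return "\n".join("".join(row) for row in grid)

# (offset, size, start_time, end_time) in bytes / abstract time units.
print(ascii_heap_map([(0, 256, 0, 40), (256, 512, 10, 90)], heap_size=1024))
```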

Bugfixes

  • HLO Evaluator Stability: Fixed an issue where the HLO evaluator could dereference a disengaged optional, preventing potential runtime errors.
  • Coordination Service Test: Addressed a data race in coordination_service_test.cc by using notifications for proper thread synchronization.
  • oneDNN Crashes: Fixed crashes in oneDNN matmul, convolution, and layer norm tests by properly initializing the operands_stack_alloca arrays.

Chores

  • Model Builder Relocation: Moved the model_builder from TensorFlow Lite core to the TensorFlow compiler/converter module, streamlining the directory structure.

That's all for now. Keep coding!


Welcome to the latest updates: new features, crucial bug fixes, and a few handy improvements.

New Features

  • Integrate StableHLO at openxla/stablehlo@531816f0: Integrated the StableHLO project from the OpenXLA repository, improving the transformation of StableHLO to HLO operations and validating the conversion from CHLO to MHLO.

  • Graph Dumping in .pb Format: TensorFlow graphs can now be dumped in both text and binary formats, controlled by the TF_DUMP_GRAPH_FMT environment variable, adding flexibility and better integration options.
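In practice this sits alongside the existing TF_DUMP_GRAPH_PREFIX variable. A sketch of how the pair might be used from Python; the accepted values for TF_DUMP_GRAPH_FMT are assumptions here, since the entry only says text and binary are supported:

```python
import os

# Existing variable: directory where TensorFlow writes graph dumps.
os.environ["TF_DUMP_GRAPH_PREFIX"] = "/tmp/tf_graph_dumps"

# New variable: dump format. The values below are assumed for
# illustration; the entry only states text and binary are supported.
os.environ["TF_DUMP_GRAPH_FMT"] = "BIN"  # or "TXT" for text dumps

import tensorflow as tf  # import after setting the env vars
```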

  • Command-Line Flags for MLIR Lite Tools: Introduced a new command-line flags library for TensorFlow MLIR Lite tools. This simplified, dependency-free module is well suited to benchmarks and command-line argument handling.

  • Shardy Partitioner in ExecutableOptions: Added a new boolean field, use_shardy_partitioner, to ExecutableOptions, letting developers opt into the Shardy partitioning strategy.

  • UnfoldSplatConstantPass: Added the UnfoldSplatConstantPass to the MLIR framework before the HLO to TFLite legalization step. The pass prevents splat constants from being folded with broadcasts, which can bloat model size.

Bug Fixes

  • Reverted UniqueChannelIdEnforcer: Reverted the change that introduced the UniqueChannelIdEnforcer, reflecting a shift in strategy for managing unique channel IDs within XLA.

  • Fix acos Decomposition: Corrected the decomposition of the acos function for non-complex arguments. The previous implementation mishandled the case x == -1, which should return π (pi).
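For context, a standard non-complex decomposition of acos (assumed here; the entry doesn't spell out which form is used) goes through atan2 and needs a special case at the lower endpoint:

```latex
\arccos(x) = 2\,\operatorname{atan2}\!\left(\sqrt{1 - x^{2}},\; 1 + x\right),
\qquad x \in [-1, 1].
```

At x = -1 both atan2 arguments are zero, so the formula is ill-defined there; the correct value is arccos(-1) = π, which must be handled as a special case.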

  • AllReduceBlueConnect Crash Fix: Fixed a crash in AllReduceBlueConnect when multiple partitions are used; the pass now runs only for specific CollectiveOpGroupMode values, improving robustness.

Improvements

  • Runtime Pointer Sizes for Sorting: The XLA CPU backend now supports runtime pointer sizes when sorting elements, improving the flexibility and efficiency of sorting operations.

  • LLVM Integration: Updated the TensorFlow MLIR framework to align with the latest LLVM changes, improving performance and reliability in quantization and type conversion.

  • Automated Code Changes: Extensive automated modifications to the TensorFlow DTensor MLIR framework improve distributed processing capabilities and optimize performance.

Chores

  • Remove Unused cuda_stream.h: Removed the unused cuda_stream.h header file and associated functions, streamlining the framework and improving maintainability.

That's all for now. Stay tuned for more updates, and happy coding!


Hey there! Here's the latest from the codebase: updates, bug fixes, and improvements.

New Features

  • Support i4 EmbeddingLookup in TFLite reference: The EmbeddingLookup operation now works with TensorType_INT4 in TensorFlow Lite, adding flexibility and efficiency for low-bit models.
  • Add external KV cache op for GenAI: A new external key-value (KV) cache operation in TensorFlow Lite's experimental GenAI module improves the management of external KV caches, which is crucial for generative AI applications.
  • [XLA:UNSTACKER] Detect effectively static dynamic-slice instructions: A new function identifies dynamic slices that are effectively static, enabling better loop unrolling and improving performance.
  • Add a method for looking up the memory space of a pointer: StreamExecutor gains a method to determine the memory space of a pointer, improving memory management.
  • [XLA:FFI] Add instantiation handler to XLA_FFI_Handler_Bundle: The XLA FFI API now includes an instantiate handler, giving more control over the instantiation process.

Bugfixes

  • Fix race condition in sparse optimizers: EnsureSparseVariableAccess now takes exclusive locks when modifying var->tensor(), preventing segfaults and improving stability.
  • [XLA:GPU] Fix Triton codegen for BroadcastOps of scalars: Broadcasting rules are now correctly enforced in the Triton verifier, preventing potential errors.
  • Remove affine fuzz test: Temporarily removed due to build issues with the current version of fuzztest, keeping the build process smooth and error-free.

Improvements

  • Add physical device ordinal to buffers: Improves resource management and tracking across physical devices in the XLA framework.
  • Add support for non-trivial strides for conv in MHLO->TFL: Convolution operations in the MHLO->TFL conversion now support non-trivial strides, increasing flexibility and performance.
  • Automated Code Change: Streamlined dependencies and updated headers in the grappler module, improving optimization and performance.

Chore

  • Remove deprecated TfLiteOperatorCreateWithData function: Removed this deprecated function, simplifying the implementation.

Keep up the great work, and keep pushing the boundaries of what's possible!


Hello, developers! Here are the latest changes:


New feature: Add support for atomic_rmw fadd for bf16 on Hopper

  • Summary: Adds atomic_rmw fadd support for the bf16 data type on the Hopper CUDA compute capability in XLA:GPU and the MLIR-based emitters, enabling atomic operations on bf16. A test case verifies the behavior on the Hopper architecture.

Improvement: Avoid building hlo_runner_main.cc twice

  • Summary: The build process has been streamlined by moving the actual build into a shared library target and creating two binary targets that depend on it. This makes dependencies easier and more explicit to maintain, and avoids redundant builds.

Improvement: Run fusion-wrapper pass before scheduling in XLA:GPU

  • Summary: The fusion-wrapper pass now runs before scheduling in the GPU compiler, improving the fusion and scheduling process. A new test ensures non-fused instructions are wrapped correctly.

New feature: Open source XLA passes for Shardy

  • Summary: Shardy gains new XLA passes, with new files, headers, and functions for exporting and importing operations and shardings, plus test files to verify the behavior.

Improvement: Port concatenate instruction to Thunks in XLA:CPU

  • Summary: Concatenate instructions are now ported to Thunks, with a fast concatenate option for better performance. Benchmarks show a 4% improvement in parallel concatenate performance and an 11% improvement in CPU time; fast concatenate without parallel processing shows a slight performance dip.

New feature: Add a basic test case for circular pipeline collective permute

  • Summary: A new test case covers circular pipeline collective permute, performing a simple computation with source-target pairs and verifying the results. A more complex test case is outlined for future implementation.

New feature: Add a toy example for using Shardy

  • Summary: A toy example shows how to use Shardy in the XLA pipeline, including changes to workspace files, BUILD files, a main file for Shardy optimization, and a test file with a simple MLIR test case. Perfect for getting started with Shardy.

New feature: Add Thunk::ExecuteSession to control concurrent workers

  • Summary: Thunk::ExecuteSession controls the number of concurrent workers processing XLA execute requests, which helps manage task scheduling overheads for XLA programs with many tiny thunks. Unit tests ensure the locking mechanism works as expected.
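The underlying idea, capping how many workers may process tasks at once, looks roughly like this generic Python sketch (an illustration of the pattern, not XLA's C++ implementation):

```python
import concurrent.futures
import threading

class ExecuteSession:
    """Illustrative pattern: cap concurrent in-flight tasks with a semaphore."""

    def __init__(self, max_workers: int):
        self._slots = threading.Semaphore(max_workers)

    def run(self, pool, task, *args):
        self._slots.acquire()  # block while too many tasks are in flight
        def guarded():
            try:
                return task(*args)
            finally:
                self._slots.release()
        return pool.submit(guarded)

session = ExecuteSession(max_workers=4)
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    futures = [session.run(pool, lambda i=i: i * i) for i in range(100)]
    print(sum(f.result() for f in futures))
```

Bounding the session rather than the pool lets many small programs share one big thread pool while each stays within its own concurrency budget.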

Bugfix: Remove support for CUDA versions below 12.3 in XLA

  • Summary: XLA no longer supports CUDA versions below 12.3. This affects multiple files related to GPU functionality, profiling, and testing, aligning XLA with current CUDA releases for improved performance.

Bugfix: Revert fix for 3 DeadCode findings

  • Summary: Reverted a previous fix that addressed three DeadCode findings related to DelayKernelIsSupported, LaunchDelayKernel, and UnsupportedGpuFeature. The revert undoes changes made to gpu_timer_kernel_rocm.cc and gpu_types.h.

Bugfix: Only use the kernel threadpool if it is enabled

  • Summary: Added a conditional check so the kernel threadpool is used only when it is enabled, ensuring optimal performance and resource utilization with TensorFlow Lite delegates.

Chore: Make stablehlo tests private

  • Summary: The visibility of the stablehlo tests has changed from public to private, keeping them restricted to their intended scope.

That's all for now. Keep coding!


Here's a rundown of the latest changes and improvements:

New Features

  • [xla:ffi] API to Update CallFrame with Runtime Values: A new API updates a CallFrame with new runtime values (buffer pointers), adding flexibility to XLA's foreign function interface.
  • [XLA:GPU] Deterministic Flash Attention Backward Implementation: A deterministic flash attention backward implementation in XLA:GPU provides more control and run-to-run consistency.
  • [XLA:CPU][oneDNN] F16 Convolutions on Supported CPUs: Enabled F16 convolutions on supported Intel CPUs, improving performance and efficiency.
  • [XLA:CPU][oneDNN] Matmul-Bias-Add Fusion: Enabled fusion of matmul followed by bias-add and binary-add operations in XLA:CPU, optimizing performance.
  • Testing Utility for v2 API Test Data Path: A new utility manages test data paths for the v2 API in TensorFlow, laying the groundwork for future testing needs.
  • Support for uint8_t Dot Operation Tests: Added uint8_t dot operation tests and the corresponding HLO evaluator support, expanding coverage.

Improvements

  • HLO Deduplication and Execution Threads Test: A comprehensive test covers HLO deduplication and execution threads in XLA, ensuring robust functionality.
  • Recursive Work Splitting for Thunk Executor Tasks: Thunk executor tasks are now launched via recursive work splitting, improving performance and avoiding bottlenecks (the sketch below shows the general pattern).
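Recursive work splitting means each submitted task hands half of its remaining range back to the pool instead of one thread enqueueing every task. A generic Python sketch of the pattern (not the C++ thunk executor itself):

```python
import concurrent.futures
import threading

N = 64
results = [None] * N
done = threading.Semaphore(0)

def process(i):
    return i * i  # stand-in for executing one thunk

def run_range(pool, lo, hi):
    # Fork the upper half back into the pool at each step, so task
    # submission fans out in parallel instead of being serialized on
    # a single thread.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        pool.submit(run_range, pool, mid, hi)
        hi = mid
    results[lo] = process(lo)
    done.release()  # one unit per completed index

pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)
pool.submit(run_range, pool, 0, N)
for _ in range(N):
    done.acquire()  # wait until all N indices are processed
pool.shutdown()
print(sum(results))
```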

Bugfixes

  • [XLA:FFI] Catch Exceptions in User FFI Calls: Added a defensive try/catch mechanism to handle exceptions in user FFI calls, improving reliability.
  • Fix for Execution Stream Assignment Test: Fixed the constructor initialization error in execution_stream_assignment_test, so the test runs successfully.
  • Removal of mlir2exec Test: Removed the mlir-tflite-runner binary and related test utilities as part of a cleanup and restructuring of the MLIR Lite module.

Chores

  • Split Definitions from reduced_precision_support.h: Split definitions into a new file, reduced_precision_metadata.h, for better organization and maintainability.

These updates bring a mix of new features, improvements, bug fixes, and organizational changes aimed at the performance, reliability, and maintainability of the XLA and TensorFlow projects.


Hey there! Here are the latest changes and improvements:

New Features

  • Integrate StableHLO at openxla/stablehlo@dd48ec58: Integrated StableHLO, introducing new operations such as UniformDequantizeOp and UniformQuantizeOp along with their inference and verification functions, improving uniform quantization and all-to-all operations.

  • Add num_warps to BlockLevelFusionConfig: A new field, "num_warps", has been added to the BlockLevelFusionConfig message in the GPU backend, along with a method to convert the struct to proto, improving GPU backend configuration.

  • Support for CollectivePermute thunk: XLA for CPU now supports the CollectivePermute thunk, so all collective operations can be executed using thunks.

  • Shardings for CaseOp and IfOp: Shardings are now added for the implicit operands and return values of CaseOp and IfOp, ensuring correct sharding settings based on input parameters.

  • Layout method for BasicStringArray: Implemented the layout method for the BasicStringArray class, adding the ability to handle the layout of BasicStringArray objects.

Improvements

  • Split DotThunk for parallel compilation: The DotThunk implementation in the XLA CPU service now supports parallel compilation, optimizing matrix multiplication operations.

  • Profiling enhancements with NVTX: Threads, CUDA devices, and CUDA streams are now named in the Nsight Systems UI for a better profiling experience.

  • Memcpy function restructuring: Moved the StreamExecutor::Memcpy function to the Stream class and its derived classes, streamlining the code and improving efficiency.

Bugfixes

  • Prevent XLA crash if PATH variable not set: XLA no longer crashes when the PATH environment variable is not set; it now reports an error message instead.

  • Hashable Interval & IndexingMap: Made the Interval and IndexingMap classes properly hashable, so they can be used in containers and other data structures.

  • Stop using xla/statusor.h: Updated various files to include tsl/platform/statusor.h directly instead of xla/statusor.h, which now only contains an alias for absl::Status.

Chores

  • Clean-up before removing tiling: Cleaned up code related to XLA:GPU and MLIR-based indexing in preparation for removing the tiling functionality.

Stay awesome and keep coding!


Welcome to our latest update: new features, bug fixes, and improvements to keep everything running smoothly. Here's the lowdown on what's new and improved:

### New Features
- **Asynchronous Launch for HostKernel**: Introduced async launch for HostKernel and employed the Eigen device to parallelize kernel execution, improving resource utilization and speeding up computation on the CPU platform.
- **StableHLO Integration**: Integrated StableHLO at openxla/stablehlo@dd48ec58, adding new operations for uniform quantization and all-to-all operations.
- **Int4 Support in Dequantize Op**: Added int4 support to the dequantize operation, including per-channel dequantization, increasing TensorFlow Lite's flexibility.
- **'decompose_optionals' Pass**: A new pass decomposes optional operations into simpler identity operations, improving code readability and maintainability.
- **Aliasing Semantics for Nested Fusions**: Added aliasing semantics for nested fusions, improving the accuracy of fusion analysis in the XLA service.

### Improvements
- **Recursive Work Splitting for Host Tasks**: Host tasks are now submitted with recursive work splitting, significantly improving wall time for task submission into a thread pool.
- **JAX Builds Centralization**: Moved the JAX builds to build.py, streamlining the build process and improving the JAX_CPU and JAX_GPU test environments.
- **Stream Dependency Management**: Eliminated StreamExecutor::CreateStreamDependency by consolidating its code into Stream and its derived classes, optimizing stream dependency management.

### Bugfixes
- **Revert Changelist 641306427**: Reverted a previous change, updating tensor types in the CastOperationParser test to ensure correct operation.
- **Float Conversion Fixes**: Addressed float conversion issues for fp8 and u64, adding missing lowerings and correcting upper bounds to fix unary_ops_test_gpu.
- **Revert c2e7e9f6c3f4d4937d8145f988ea74818e000ecc**: Reverted changes that removed references to Google's Abseil library, restoring functionality related to remote tensor handles.

### Chores
- **LLVM Integration**: Updated LLVM usage to match the latest commit [7476c20c481c](https://github.com/llvm/llvm-project/commit/7476c20c481c), keeping development on an up-to-date toolchain.

Stay tuned for more updates, and happy coding!