TensorFlow Changelog


Here's the latest scoop on our codebase updates! We've been busy bees, buzzing around to bring you some fantastic new features, improvements, and bug fixes. Let's dive right in! ๐Ÿ


New feature: We've jazzed up the XLA framework by using the CUDA runtime API to accurately determine if two ranks are on the same host. This ensures more reliable local communication during collective operations, especially in multi-GPU setups. ๐Ÿš€

New feature: A new transformation pass is here! We've added a pass to outline an IFRT IR atom program into a module, enhancing the XLA framework's capabilities in handling IR atom programs. ๐ŸŽ‰

Improvement: TensorFlow Lite compiler now checks for infinity when folding max and min ops. This ensures that operations handle extreme floating-point values correctly, boosting robustness. ๐Ÿ’ช
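
For a concrete sense of why the infinity check matters, here's a tiny NumPy sketch of the identities a folder can rely on — max against -inf and min against +inf are no-ops. This only illustrates the math; it isn't the converter's actual folding code, and treating these cases as identities is the assumption being shown.

```python
import numpy as np

# max(x, -inf) == x and min(x, +inf) == x for every value, so those ops can be
# folded away -- but only if the constant operand is recognized as an infinity.
x = np.array([-3.0, 0.0, 2.5, np.inf, -np.inf], dtype=np.float32)

assert np.array_equal(np.maximum(x, np.float32(-np.inf)), x)
assert np.array_equal(np.minimum(x, np.float32(np.inf)), x)
```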

New feature: You can now save output data from TFLite models as TensorFlow Example protocol buffers and output them to a file. This makes model evaluation and debugging a breeze! ๐Ÿ“Š
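
If you want to try the idea by hand, here's a rough Python sketch that runs a TFLite model and wraps its output tensor in a tf.train.Example before appending it to a TFRecord file. The model path, feature name, and output file below are placeholders, and this doesn't invoke the new tooling flag itself — it just shows the shape of the data you'd end up with.

```python
import numpy as np
import tensorflow as tf

# "model.tflite", "output_0", and "outputs.tfrecord" are placeholder names.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
result = interpreter.get_tensor(out["index"]).flatten()

# Wrap the output tensor in a tf.train.Example and write it to a record file.
example = tf.train.Example(features=tf.train.Features(feature={
    "output_0": tf.train.Feature(
        float_list=tf.train.FloatList(value=result.astype(np.float32).tolist())),
}))
with tf.io.TFRecordWriter("outputs.tfrecord") as writer:
    writer.write(example.SerializeToString())
```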

Improvement: Weโ€™ve added profiling to the ifrt-proxy client, enabling request-response trace tracking. This makes monitoring and analyzing RPC calls a piece of cake. ๐Ÿฐ

New feature: Direct legalization for min and max operations is now available in TensorFlow Lite, streamlining the conversion process and enhancing performance. โšก๏ธ

New feature: We introduced a pattern to reorder gather and cast ops in TensorFlow Lite for more efficient execution. Less work, more play! ๐ŸŽฎ

New feature: A new optimization pattern simplifies broadcasting and reshaping operations in TensorFlow MLIR, enhancing efficiency. Who doesn't love a good optimization? ๐Ÿ› ๏ธ

Bugfix: We fixed a critical issue in JAX where input arrays weren't reshaped correctly, preventing crashes on TPU and ensuring correct outputs on GPU. Phew! ๐Ÿ˜…

Bugfix: Memory leaks in cuda_executor.cc error paths are now a thing of the past. We've improved memory management to keep things running smoothly. ๐Ÿงน

Bugfix: Compatibility issues with Numpy 2.x in TensorFlow's numpy-like operations have been resolved. We're all set for the future! ๐Ÿ”ฎ

Chore: We tidied up by deleting status_test_util.h after migrating all its users. A cleaner codebase is a happier codebase! ๐Ÿงผ


That's all for now, folks! Stay tuned for more exciting updates and improvements. Keep coding and keep smiling! ๐Ÿ˜„


Here's a delightful rundown of the latest and greatest changes, improvements, and fixes in our codebase. We've been busy integrating, optimizing, and squashing pesky bugs to make your experience smoother and more efficient. Let's dive into the details! ๐Ÿš€

  • New feature: We've integrated the StableHLO framework into TensorFlow's MLIR infrastructure. This major update focuses on transforming and legalizing quantization and HLO operations, enhancing compatibility and performance. ๐ŸŽ‰

  • New feature: Added support for unary element-wise operations in the MHLO to TFL conversion process. Now, operations like absolute value and trigonometric functions are seamlessly transformed, bolstering TensorFlow Lite's capabilities. ๐ŸŒŸ

  • Improvement: Exporting MLIR modules just got clearer! The name of the HLO module now matches the MLIR module name, ditching the default "main" to avoid confusion and conflicts. ๐Ÿ“›

  • New feature: Memory management in XLA is stepping up! We've laid the groundwork for adding memory spaces to the CompileOnlyClient, paving the way for more sophisticated memory handling. ๐Ÿง 

  • Improvement: FP8 windowed einsums with multiple all-gather dots are now supported. This enhancement optimizes FP8 operations within the XLA framework, thanks to a nifty shift in dequantization. ๐ŸŽฏ

  • Improvement: Casting operations between floats and integers in MLIR are now more efficient, thanks to new folding optimizations. Say hello to faster compilation! ๐Ÿ”„

  • New feature: Introducing GetSparseCoreId to the TensorFlow profiler! This function extracts Sparse Core IDs from plane names, boosting TPU profiling capabilities. ๐Ÿ•ต๏ธโ€โ™‚๏ธ

  • New feature: We've added a pass to open the sharding of while op free variables. This helps optimize sharding strategies during HLO conversion, enhancing operation efficiency. ๐Ÿงฉ

  • Bugfix: Resolved an issue where "MakeExactCopy" didn't copy "known_graph_outputs_", ensuring all necessary output values are retained in copied graphs. ๐Ÿ›

  • Bugfix: Fixed integer overflow issues post-NumPy 2.0 update by refining type casting and array creation operations, maintaining compatibility with NumPy 1.x behavior. ๐Ÿ”ง

  • Chore: Cleaned up pywrap_parallel_device.cc by removing unnecessary TensorFlow C API headers, streamlining the codebase. ๐Ÿงน

  • Bugfix: Addressed test failures under NumPy 2.x by directly calling __array__() for objects requiring a copy when converting to TF tensors. Compatibility restored! ๐Ÿ› ๏ธ

These updates are all about making things run smoother, faster, and with fewer hiccups. Keep those updates coming, and happy coding! ๐Ÿ˜Š


Welcome to the latest update! We've been busy bees ๐Ÿ making some exciting changes, adding new features, squashing bugs, and improving performance. Here's a rundown of what's new:

New Features

  • Original Value Tracking: Introduced a pass that adds the original_value field to each operation in the HLO graph. This is a game-changer for value tracking within the graph, making it easier to manage and analyze computations.
  • cuDNN Custom Call Conversions: Added a pass to convert specific cuDNN custom calls into custom fusion operations. This allows JAX users to run selected computations as cuDNN kernels, optimizing performance on GPUs.
  • Batch Dimension in Gather/Scatter: Now supporting batch dimensions in Gather and Scatter HLO syntax, enhancing data manipulation operations in XLA.
  • BatchFunction Operation: Updated protocol buffer text files to include a new "BatchFunction" operation, allowing for more flexible batching of input tensors.
  • AsyncWrapper: Introduced AsyncWrapper to wrap instructions into async blocks, enabling concurrent execution and potentially improving performance.

Improvements

  • Additional Batch Padding Policies: Exposed new batch padding policies like "BATCH_DOWN" and "MINIMIZE_TPU_COST_PER_REQUEST" for more efficient batch processing.
  • Async Dispatch for JAX CPU Backend: Enabled asynchronous dispatch for expensive computations on the JAX CPU backend, with an opt-out option for those who prefer the old synchronous behavior.
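
If you'd rather keep the old synchronous behavior, the opt-out is a JAX config flag. A minimal sketch, assuming the flag is exposed as jax_cpu_enable_async_dispatch — double-check the name against the release notes for your JAX version:

```python
import jax
import jax.numpy as jnp

# Assumed flag name -- confirm against the JAX release notes for your version.
jax.config.update("jax_cpu_enable_async_dispatch", False)

x = jnp.ones((2048, 2048))
y = (x @ x).block_until_ready()  # with async dispatch disabled, dispatch runs synchronously
```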

Bugfixes

  • Pipelining with Sequential Extracts: Fixed a bug related to pipelining sequential extracts, ensuring only the induction variable of a loop can be replaced.
  • Revert Changes in TensorFlow Lite GPU Delegate: Reverted a previous change to simplify the handling of the kClFastRelaxedMath compiler option, standardizing behavior across different GPU architectures.
  • Revert Changes in CUDA FFT Library: Reverted modifications to rename and update dependencies for the CUDA FFT library, ensuring proper initialization and integration.

Chores

  • Automated Code Cleanup: Removed unnecessary TensorFlow C API headers from pywrap_parallel_device.cc, streamlining the codebase.

We hope these updates make your development experience smoother and more efficient. Happy coding! ๐Ÿš€


Hey there, code wranglers! We've got some exciting updates for you. Check out the latest and greatest changes that are making our codebase even more awesome. ๐Ÿš€


Improvements

  • Streamlined Kernel Management: Combined StreamExecutor::GetKernel and StreamExecutor::CreateKernel into a single method StreamExecutor::LoadKernel. This simplifies the interface and enhances memory management. ๐ŸŒŸ
  • Efficient Operand Resharding: Optimized the partitioning of dot operations by directly resharding the rhs operand to match lhs and result tensor shardings, eliminating redundant rematerialization. ๐ŸŽฏ
  • Enhanced GPU Operations: Introduced IndexingMapAttr to ApplyIndexingOp, improving the efficiency and correctness of GPU fusions in XLA. ๐Ÿ–ฅ๏ธ

New Features

  • String Shape Kernel: Added registration for a Shape kernel that handles string tensors, enhancing TensorFlow's capabilities for string data processing on GPUs. ๐Ÿงต
  • ASCII Art Memory Map: Introduced a function to print a compact 2D map of occupied heap memory over time as ASCII art, making debugging easier and more fun! ๐ŸŽจ
  • Long Polling for Error Propagation: Added long polling as a new way to propagate errors in the coordination service, improving robustness and responsiveness. ๐Ÿ•ต๏ธโ€โ™‚๏ธ
  • Gloo Support on macOS: Enabled Gloo to function on macOS using the libuv transport mechanism, expanding its compatibility. ๐Ÿ
  • Experimental Command Buffers: Added a flag to enable command buffers during profiling sessions in the XLA GPU backend, providing more flexibility. ๐Ÿงช

Bugfixes

  • HLO Evaluator Stability: Fixed an issue where the HLO evaluator would dereference a disengaged optional, preventing potential runtime errors. ๐Ÿ› ๏ธ
  • Coordination Service Test: Addressed a data race in coordination_service_test.cc by implementing notifications for proper thread synchronization. ๐Ÿƒโ€โ™‚๏ธ
  • oneDNN Crashes: Fixed crashes in oneDNN matmul, convolution, and layer norm tests by ensuring proper initialization of operands_stack_alloca arrays. ๐Ÿš‘

Chores

  • Model Builder Relocation: Moved the model_builder from TensorFlow Lite core to the TensorFlow compiler/converter module, streamlining the directory structure. ๐Ÿ“ฆ

That's all for now, folks! Keep coding and stay awesome! ๐Ÿ’ปโœจ


Welcome to the latest updates! We've packed in some awesome new features, crucial bug fixes, and a few handy improvements. Let's dive into what's new!

New Features ๐Ÿš€

  • Integrate StableHLO at openxla/stablehlo@531816f0: We've integrated the StableHLO project from the OpenXLA repository. This update enhances the functionality and compatibility of the XLA framework with the StableHLO standard, improving the transformation of StableHLO to HLO operations and validating the conversion from CHLO to MHLO.

  • Graph Dumping in .pb Format: You can now dump TensorFlow graphs in both text and binary formats using the TF_DUMP_GRAPH_FMT environment variable. This feature adds flexibility and better integration options for users.

  • Command-Line Flags for MLIR Lite Tools: Introduced a new command-line flags library for TensorFlow MLIR Lite tools. This simplified and dependency-free module is perfect for benchmarks and easier command-line argument handling.

  • Shardy Partitioner in ExecutableOptions: Added a new boolean field use_shardy_partitioner in ExecutableOptions. This allows developers to opt for the Shardy partitioning strategy, enhancing flexibility in the XLA library.

  • UnfoldSplatConstantPass: Added the UnfoldSplatConstantPass to the MLIR framework before the HLO to TFLite legalization process. This pass prevents folding splat constants with broadcasts, which can cause bloated model sizes.
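
A quick NumPy illustration of why that folding is worth preventing: a splat constant is a single scalar plus a broadcast, but folding it through the broadcast bakes the full tensor into the model. (Illustrative only — the pass itself operates on MLIR, not NumPy.)

```python
import numpy as np

splat = np.float32(0.5)                                # splat constant: one 4-byte value
folded = np.broadcast_to(splat, (1024, 1024)).copy()   # folding materializes the result

print(folded.nbytes)  # 4_194_304 bytes baked into the model vs. 4 bytes plus a broadcast op
```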

Bug Fixes ๐Ÿž

  • Reverted UniqueChannelIdEnforcer: Reverted a previous change that introduced the UniqueChannelIdEnforcer. This reflects a shift in strategy for managing unique channel IDs within the XLA framework.

  • Fix acos Decomposition: Corrected the decomposition of the acos function for non-complex arguments. The previous implementation mishandled the case x == -1, which should return π (see the worked sketch after this list).

  • AllReduceBlueConnect Crash Fix: Addressed a crash issue in AllReduceBlueConnect when multiple partitions are used. Now, the pass runs only with specific values for CollectiveOpGroupMode, improving robustness.
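
For the curious, here's a worked sketch of the acos endpoint problem using a common atan2-based decomposition — whether XLA's expander uses exactly this form is an assumption, but the failure mode at x == -1 is the same: atan2(0, 0) is 0 in most math libraries, so the endpoint has to be special-cased to π.

```python
import math

def acos_decomposed(x: float) -> float:
    """acos(x) via a common atan2-based decomposition (illustrative only)."""
    # acos(x) = 2 * atan2(sqrt(1 - x^2), 1 + x) holds for x in (-1, 1];
    # at x == -1 it degenerates to atan2(0, 0) == 0, so return pi explicitly.
    if x == -1.0:
        return math.pi
    return 2.0 * math.atan2(math.sqrt(1.0 - x * x), 1.0 + x)

assert math.isclose(acos_decomposed(1.0), 0.0, abs_tol=1e-12)
assert math.isclose(acos_decomposed(0.0), math.pi / 2)
assert math.isclose(acos_decomposed(-1.0), math.pi)
```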

Improvements ๐ŸŒŸ

  • Runtime Pointer Sizes for Sorting: Enhanced the XLA CPU backend to support runtime pointer sizes for sorting elements. This update improves flexibility and efficiency in sorting operations.

  • LLVM Integration: Updated the TensorFlow MLIR framework to align with the latest LLVM changes. This integration enhances performance and reliability in quantization and type conversion functionalities.

  • Automated Code Changes: Made extensive modifications to the TensorFlow DTensor MLIR framework, improving distributed processing capabilities and optimizing performance.

Chores ๐Ÿงน

  • Remove Unused cuda_stream.h: Cleaned up the codebase by removing the unused cuda_stream.h header file and associated functions. This helps streamline the framework and improve maintainability.

That's all for now! Stay tuned for more updates and happy coding! ๐ŸŽ‰


Hey there, awesome devs! Here's the latest and greatest from our codebase. Check out these exciting updates, bug fixes, and improvements. ๐Ÿš€

New Features

  • Support i4 EmbeddingLookup in TFLite reference: Now you can use the EmbeddingLookup operation with TensorType_INT4 in TensorFlow Lite (TFLite). This means more flexibility and efficiency for your models. ๐ŸŽ‰
  • Add external KV cache op for GenAI: Introducing an external key-value (KV) cache operation for TensorFlow Lite's experimental GenAI module. This enhances the management of external KV caches, crucial for AI applications. ๐Ÿง 
  • [XLA:UNSTACKER] Detect effectively static dynamic-slice instructions: A new function to optimize loop unrolling by identifying static dynamic slices, boosting performance. ๐Ÿ”„
  • Add a method for looking up the memory space of a pointer: StreamExecutor now has a method to determine the memory space of a pointer, enhancing memory management. ๐Ÿ’พ
  • [XLA:FFI] Add instantiation handler to XLA_FFI_Handler_Bundle: Expanding the XLA FFI API with an instantiate handler, giving you more control over the instantiation process. ๐Ÿ› ๏ธ

Bugfixes

  • Fix race condition in sparse optimizers: Ensures exclusive locks when modifying var->tensor() in EnsureSparseVariableAccess to prevent segfaults and improve stability. ๐Ÿ”’
  • [XLA:GPU] Fix Triton codegen for BroadcastOps of scalars: Ensures broadcasting rules are correctly enforced in the Triton verifier, preventing potential errors. ๐Ÿ›ก๏ธ
  • Remove affine fuzz test: Temporarily removed due to build issues with the current version of fuzztest. This keeps our build process smooth and error-free. ๐Ÿงฉ

Improvements

  • Add physical device ordinal to buffers: Enhances resource management and tracking across different physical devices in the XLA framework. ๐Ÿ“ˆ
  • Add support for non-trivial strides for conv in MHLO->TFL: Convolution operations in MHLO->TFL now support non-trivial strides, increasing flexibility and performance. ๐Ÿƒโ€โ™‚๏ธ
  • Automated Code Change: Streamlined dependencies and updated headers in the grappler module, enhancing optimization and performance. โš™๏ธ

Chore

  • Remove deprecated TfLiteOperatorCreateWithData function: Cleaned up the codebase by removing this deprecated function, simplifying the implementation. ๐Ÿงน

Keep up the fantastic work, and let's keep pushing the boundaries of what's possible! ๐Ÿš€


Hey there, fabulous developers! ๐ŸŒŸ We've got some exciting updates and tweaks to share with you. Let's dive right into the latest changes:


New feature: ๐Ÿš€ Add support for atomic_rmw fadd for bf16 on HOPPER

  • Summary: This update brings in the magic of atomic_rmw fadd for bf16 data type on HOPPER CUDA compute capability within XLA:GPU and MLIR-based emitters. Now, you can perform atomic operations on bf16 data types with ease. A test case has been added to ensure everything runs smoothly on the HOPPER architecture.
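
Here's a small JAX-level sketch of the kind of program that can exercise this path — a bf16 scatter-add with duplicate indices. Whether a given program actually lowers to an atomic fadd depends on the emitter and on running on Hopper-class hardware, so treat that part as an assumption.

```python
import jax.numpy as jnp

# bf16 scatter-add with duplicate indices; on Hopper the accumulation may now
# be emitted as an atomic fadd on bf16 (hardware- and emitter-dependent).
acc = jnp.zeros(8, dtype=jnp.bfloat16)
idx = jnp.array([0, 3, 0, 3])
upd = jnp.ones(4, dtype=jnp.bfloat16)

result = acc.at[idx].add(upd)   # indices 0 and 3 each receive two contributions
print(result)                   # [2, 0, 0, 2, 0, 0, 0, 0]
```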

Improvement: ๐Ÿ›  Avoid building hlo_runner_main.cc twice

  • Summary: We've streamlined the build process by moving the actual build into a shared library target and creating two binary targets that depend on it. This makes maintaining dependencies easier and more explicit. Say goodbye to redundant builds!

Improvement: ๐ŸŽ๏ธ Run fusion-wrapper pass before scheduling in XLA:GPU

  • Summary: The fusion-wrapper pass now runs before scheduling in the GPU compiler. This change enhances the fusion and scheduling process, making it more efficient. Plus, there's a new test to ensure non-fused instructions are wrapped correctly.

New feature: ๐ŸŒŸ Open source XLA passes for Shardy

  • Summary: Shardy just got a major upgrade with new XLA passes! We've added new files, headers, and functions for exporting and importing operations and shardings. Test files are also included to ensure everything works perfectly.

Improvement: โšก๏ธ Port concatenate instruction to Thunks in XLA:CPU

  • Summary: Concatenate instructions are now ported to Thunks, with a fast concatenate option for better performance. Benchmarks show a 4% improvement in parallel concatenate performance and an 11% boost in CPU time. Fast concatenate without parallel processing shows a slight performance dip.

New feature: ๐ŸŽ‰ Add a basic test case for circular pipeline collective permute

  • Summary: A new test case for circular pipeline collective permute has been added. It involves a simple computation using collective permute with source-target pairs and verifies the results. A more complex test case is outlined for future implementation.

New feature: ๐Ÿงธ Add a toy example for using Shardy

  • Summary: A toy example for using Shardy in the XLA pipeline is now available. This includes changes to workspace files, BUILD files, a main file for Shardy optimization, and a test file with a simple MLIR test case. Perfect for getting started with Shardy!

New feature: ๐Ÿ”ง Add Thunk::ExecuteSession to control concurrent workers

  • Summary: Control the number of concurrent workers processing XLA execute requests with Thunk::ExecuteSession. This helps manage task scheduling overheads for XLA programs with many tiny thunks. Unit tests ensure the locking mechanism works as expected.

Bugfix: ๐Ÿ› Remove support for CUDA versions below 12.3 in XLA

  • Summary: Weโ€™ve streamlined XLA by removing support for CUDA versions below 12.3. This update affects multiple files related to GPU functionality, profiling, and testing, aligning XLA with the latest CUDA technology for improved performance.

Bugfix: ๐Ÿ›  Revert fix for 3 DeadCode findings

  • Summary: Reverted a previous fix that addressed 3 DeadCode findings related to DelayKernelIsSupported, LaunchDelayKernel, and UnsupportedGpuFeature. The revert undoes changes made to gpu_timer_kernel_rocm.cc and gpu_types.h.

Bugfix: โš™๏ธ Only use the kernel threadpool if it is enabled

  • Summary: Added a conditional check to use the kernel threadpool only if it is enabled. This ensures optimal performance and resource utilization when working with TensorFlow Lite delegates.

Chore: ๐Ÿงน Make stablehlo tests private

  • Summary: The visibility of stablehlo tests has been changed from public to private. This keeps these tests restricted to their intended scope, maintaining the integrity and organization of the codebase.

That's all for now, folks! Keep coding and stay awesome! โœจ


Here's a rundown of the latest changes and improvements:

New Features

  • [xla:ffi] API to Update CallFrame with Runtime Values: ๐Ÿš€ Added an API to update CallFrame with new runtime values (buffer pointers), enhancing the flexibility of XLA's foreign function interface.
  • [XLA:GPU] Deterministic Flash Attention Backward Implementation: ๐Ÿงฉ Introduced deterministic flash attention backward implementation in XLA:GPU, providing more control and consistency.
  • [XLA:CPU][oneDNN] F16 Convolutions on Supported CPUs: ๐ŸŽ‰ Enabled F16 convolutions on supported Intel CPUs, boosting performance and efficiency.
  • [XLA:CPU][oneDNN] Matmul-Bias-Add Fusion: ๐Ÿ”ฅ Enabled fusion of matmul followed by bias-add and binary-add operations in XLA:CPU, optimizing performance.
  • Testing Utility for v2 API Test Data Path: ๐Ÿงช Added a utility for managing test data paths for the v2 API in TensorFlow, laying the groundwork for future testing needs.
  • Support for uint8_t Dot Operation Tests: ๐Ÿค– Added support for uint8_t dot operation tests and corresponding HLO evaluator support, expanding the library's capabilities.

Improvements

  • HLO Deduplication and Execution Threads Test: ๐Ÿ› ๏ธ Added a comprehensive test for HLO deduplication and execution threads in XLA, ensuring robust functionality.
  • Recursive Work Splitting for Thunk Executor Tasks: ๐ŸŽ๏ธ Introduced recursive work splitting to launch thunk executor tasks, improving performance and avoiding bottlenecks.

Bugfixes

  • [XLA:FFI] Catch Exceptions in User FFI Calls: ๐Ÿ› Added a defensive try/catch mechanism to handle exceptions in user FFI calls, enhancing reliability.
  • Fix for Execution Stream Assignment Test: ๐Ÿ”ง Fixed the constructor initialization error in the execution_stream_assignment_test, ensuring the test runs successfully.
  • Removal of mlir2exec Test: ๐Ÿงน Removed the mlir-tflite-runner binary and related test utilities, indicating a cleanup or restructuring of the MLIR Lite module.

Chores

  • Split Definitions from reduced_precision_support.h: ๐Ÿ“‚ Split definitions into a new file, reduced_precision_metadata.h, for better organization and maintainability.

These updates bring a mix of new features, improvements, bug fixes, and organizational changes, aimed at enhancing the performance, reliability, and maintainability of the XLA and TensorFlow projects. ๐Ÿš€


Hey there, awesome developers! We've got some exciting updates and improvements to share with you. Check out the latest changes below:

New Features ๐Ÿš€

  • Integrate StableHLO at openxla/stablehlo@dd48ec58: We've integrated StableHLO, introducing new operations like UniformDequantizeOp and UniformQuantizeOp along with their inference and verification functions. This brings enhancements to uniform quantization and all-to-all operations. ๐ŸŽ‰

  • Add num_warps to BlockLevelFusionConfig: A new field, "num_warps," has been added to the BlockLevelFusionConfig message in the GPU backend, along with a method to convert the struct to proto. This improves GPU backend settings configuration. ๐Ÿ› ๏ธ

  • Support for CollectivePermute thunk: We've added support for the CollectivePermute thunk in XLA for CPU, enabling all collective operations to be executed using thunks. ๐Ÿ™Œ

  • Shardings for CaseOp and IfOp: This update adds shardings for implicit operands and return values of CaseOp and IfOp, ensuring correct sharding settings based on input parameters. ๐Ÿ”„

  • Layout method for BasicStringArray: Implemented the layout method for the BasicStringArray class, adding functionality to handle the layout of BasicStringArray objects. ๐Ÿ“

Improvements โœจ

  • Split DotThunk for parallel compilation: The DotThunk implementation in XLA CPU service now supports parallel compilation, optimizing matrix multiplication operations. ๐Ÿ’ช

  • Profiling enhancements with NVTX: Named threads, CUDA devices, and CUDA streams in the Nsight Systems UI for a better profiling experience. ๐Ÿ–ฅ๏ธ

  • Memcpy function restructuring: Moved the StreamExecutor::Memcpy function to the Stream and its derived classes, streamlining the code and improving efficiency. ๐Ÿ”„

Bugfixes ๐Ÿ›

  • Prevent XLA crash if PATH variable not set: Addressed an issue where XLA would crash if the PATH environment variable was not set, now providing an error message instead. ๐Ÿšซ

  • Hashable Interval & IndexingMap: Made the Interval and IndexingMap classes properly hashable, ensuring they can be used in containers and other data structures. ๐Ÿ”

  • Stop using xla/statusor.h: Updated various files to directly include tsl/platform/statusor.h instead of xla/statusor.h, which now only contains an alias for absl::Status. ๐Ÿ”„

Chores ๐Ÿงน

  • Clean-up before removing tiling: Cleaned up code related to XLA:GPU and MLIR-based indexing in preparation for removing tiling functionality. ๐Ÿงฝ

Stay awesome and keep coding! ๐Ÿ‘ฉโ€๐Ÿ’ป๐Ÿ‘จโ€๐Ÿ’ป


Welcome to our latest update! We've been busy adding some awesome new features, squashing pesky bugs, and making improvements to keep everything running smoothly. Here's the lowdown on what's new and improved:

### New Features
- **Asynchronous Launch for HostKernel** ๐Ÿš€: We've introduced async launch to HostKernel and employed Eigen device to parallelize kernel execution. This means better resource utilization and faster computations on the CPU platform.
- **StableHLO Integration**: Integrated StableHLO at openxla/stablehlo@dd48ec58, adding new operations for uniform quantization and all-to-all operations. This boosts the functionality and efficiency of our operations.
- **Int4 Support in Dequantize Op**: Added support for int4 in the dequantize operation, including per-channel dequantization. This enhances the flexibility and functionality of TensorFlow Lite; a numeric sketch of the per-channel arithmetic follows this list.
- **'decompose_optionals' Pass**: Introduced a new pass to decompose optional operations into simpler identity operations, improving code readability and maintainability.
- **Aliasing Semantics for Nested Fusions**: Added aliasing semantics for nested fusions, enhancing the accuracy and functionality of fusion analysis in the XLA service.
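
As promised above, a small numeric sketch of per-channel dequantization: real = scale[c] * (q - zero_point[c]), with one scale/zero-point pair per channel. NumPy has no 4-bit dtype, so the int4 values are simulated in int8 storage; this shows the arithmetic only, not the TFLite kernel.

```python
import numpy as np

q = np.array([[-8, -1, 0, 7],
              [ 3, -4, 5, -6]], dtype=np.int8)     # int4 range [-8, 7], stored as int8
scale = np.array([0.05, 0.10], dtype=np.float32)    # one scale per channel
zero_point = np.array([0, -2], dtype=np.int8)       # one zero point per channel

dequantized = scale[:, None] * (q.astype(np.float32) - zero_point[:, None])
print(dequantized)
# channel 0: 0.05 * q       -> [-0.40, -0.05, 0.00,  0.35]
# channel 1: 0.10 * (q + 2) -> [ 0.50, -0.20, 0.70, -0.40]
```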

### Improvements
- **Recursive Work Splitting for Host Tasks**: Implemented recursive work splitting to submit host tasks, significantly improving wall time for task submission into a thread pool.
- **JAX Builds Centralization**: Moved JAX builds to build.py, streamlining the build process and improving test environments for JAX_CPU and JAX_GPU.
- **Stream Dependency Management**: Eliminated StreamExecutor::CreateStreamDependency by consolidating its code into Stream and its derived classes, optimizing stream dependency management.

### Bugfixes
- **Revert Changelist 641306427**: Reverted a previous change, updating tensor types in the CastOperationParser test to ensure correct operation.
- **Float Conversion Fixes**: Addressed issues with float conversions for fp8 and u64, fixing missing lowerings and incorrect upper bounds to resolve unary_ops_test_gpu.
- **Revert c2e7e9f6c3f4d4937d8145f988ea74818e000ecc**: Reverted changes that removed references to Google's Abseil library, restoring functionality related to remote tensor handles.

### Chores
- **LLVM Integration**: Updated LLVM usage to match the latest commit [7476c20c481c](https://github.com/llvm/llvm-project/commit/7476c20c481c), ensuring we are using the most up-to-date version for development.

Stay tuned for more updates and happy coding! ๐ŸŽ‰

Hey team! Check out the latest and greatest updates to our codebase. We've got some cool new features, important improvements, and essential bug fixes. Dive in and see what's new! ๐Ÿš€

New Features

  • Support for conditional() with manual subgroups in spmd_partitioner: Now you can handle conditional operations with manual subgroups, maintaining manual sharding where needed. This update includes changes to SpmdPartitioningVisitor and new test cases to validate this functionality. ๐ŸŽ‰

  • Basic DAG Executor Implementation for XLA CPU: Introducing a basic Directed Acyclic Graph (DAG) executor for the XLA CPU service. This helps in executing thunks concurrently in a thread pool, ensuring correct ordering and execution. ๐Ÿงฉ

  • Initial Implementation of ThunkExecutor: A new ThunkExecutor class is here! It builds a DAG defining execution order based on buffer uses, complete with methods and tests to ensure everything runs smoothly. ๐Ÿ› ๏ธ

  • Runtime Simulator for HLO Module Execution Time: A new simulator predicts execution time for HLO modules, taking into account nested loop trip counts. This helps in optimizing execution time estimates. โฑ๏ธ

  • ScratchAllocator in External FFI API: Introducing ScratchAllocator for efficient device memory allocation and deallocation in XLA's external FFI API. This improves overall usability and performance. ๐Ÿ’พ

Improvements

  • Simplified Code in dynamic_update_slice: Weโ€™ve streamlined the code by removing unnecessary template usage and converting indices into int64 before processing. This reduces the target binary size and optimizes performance. ๐Ÿ“‰

  • Export XLA:FFI Handlers as C Function Symbols: A new macro allows exporting XLA:FFI handlers as C function symbols, making it easier to work with FFI implementations in shared libraries. ๐Ÿ”ง

  • Using Eigen Thread Pool for ThunkExecutor Tasks: ThunkExecutor tasks now utilize the Eigen thread pool, addressing mutex contention points and improving performance nearly linearly with the number of threads. ๐ŸŽ๏ธ

Bug Fixes

  • Correct Propagation of Deserialization Errors: Weโ€™ve fixed the deserialization process to correctly propagate errors from HloProgramSerDes, ensuring better error handling and message communication. ๐Ÿ› ๏ธ

  • Vectorization with Modulo Operations: Fixed an issue where vectorization didnโ€™t work properly with modulo operations. Now, both (a mod b) * x and (a * x) mod b are handled correctly. ๐Ÿงฎ

  • Hash Function Compatibility with Numpy 2.0: Addressed a failure in the hash function with Numpy 2.0. The hash calculations now use Numpy's uint64 data type for better compatibility. ๐Ÿ”
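
To illustrate the kind of change involved (the codebase's actual hash isn't shown here), keeping every operand of a rolling hash in np.uint64 makes the arithmetic wrap modulo 2**64 the same way under NumPy 1.x and 2.x, sidestepping the promotion-rule changes from NEP 50:

```python
import numpy as np

FNV_OFFSET = np.uint64(14695981039346656037)   # FNV-1a offset basis
FNV_PRIME = np.uint64(1099511628211)            # FNV-1a prime

def fnv1a(data: bytes) -> int:
    h = FNV_OFFSET
    with np.errstate(over="ignore"):            # uint64 wrap-around is intentional here
        for byte in data:
            h = (h ^ np.uint64(byte)) * FNV_PRIME
    return int(h)

print(hex(fnv1a(b"tensorflow")))
```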

Chores

  • Removed Dead Code in XLA:GPU: Cleaned up the codebase by removing unused code related to MockNcclTopoModel from GpuExecutableRunOptions. This makes the code cleaner and easier to maintain. ๐Ÿงน

That's all for now! Keep coding and stay awesome! ๐Ÿ’ปโœจ


Welcome to the latest change log! We've been busy making some exciting updates and improvements. Here's a rundown of what's new, fixed, and improved:


New Features

  • Freeze API for Device Tensors ๐ŸงŠ: Introducing a Freeze() API to release host memory for device tensors in TensorFlow. It decides whether to release a tensor based on its usage by CPU/Host operations. This helps in managing memory more efficiently by freeing up resources used solely by the device.

  • Shard-as Propagation Support ๐Ÿš€: Added support for shard-as propagation with unspecified dimensions in the XLA:SPMD framework. This update ensures better handling of sharding instructions and enhances the propagation process.

  • GemmDegenerateDimRemover Pass: A new pass called GemmDegenerateDimRemover has been added to the XLA service for GPU. This pass removes degenerate dimensions introduced by GemvRewriter, optimizing matrix-vector multiplications.

  • Remove Unused Dimensions in IndexingMap: A method to remove unused dimensions from the IndexingMap class in the XLA:GPU service has been introduced. This helps in cleaning up and optimizing representations by removing unused dimensions.

  • HloAnyOf Function ๐ŸŒŸ: Added a new traversal function called HloAnyOf to the XLA:GPU codebase. This function provides a flexible way to traverse HLO nodes without needing additional adaptors, making the codebase more user-friendly.

Improvements

  • Multi-threading in tf-data Module ๐Ÿงต: We've introduced multi-threading to run the flat map function in TensorFlow's tf-data module. This change boosts the efficiency and performance of processing input datasets by using multiple threads.

  • Memory Term Reduction Algorithm: A simpler and more effective algorithm for reducing memory terms has been implemented. This update uses ActivePrim pairs instead of LiveAndPrim pairs, making the merging of overlapping intervals more efficient.

  • Remove Unused Dims and Symbols in XLA:GPU: A method to remove both unused dimensions and symbols has been added to the XLA:GPU IndexAnalysis module. This optimization reduces redundancy and improves performance.

Bug Fixes

  • Early Error for Coordination Service Shutdown: Fixed an issue where a barrier request after the coordination service shutdown would proceed. Now, it returns an error early, ensuring proper handling of such requests.

  • Close Host Callback Queues: Explicitly closing host callback queues inside IfrtBackend destruction to avoid potential deadlocks caused by blocked executions.

  • Unpropagatable Dots in Space-to-Batch Conversion: Marked dots as unpropagatable during space-to-batch conversion to prevent issues related to dot propagation post layout assignment.

Chores

  • Remove Deprecated MLIR Codegen: Removed deprecated XLA:CPU MLIR-based codegen parts to clean up the codebase and streamline the compilation pipeline.

That's all for now! Stay tuned for more updates and improvements. ๐ŸŒŸ


Welcome to the latest change log! We've been busy adding some fantastic new features, improving existing functionalities, and squashing pesky bugs. Here's the scoop:

New Features ๐ŸŽ‰

  • Max IDs and Unique IDs Operation: Added a new operation called TF_GetStatsFromListOfSparseCoreCooTensorsOp to compute the max_ids and max_unique_ids from a list of SparseCoreCooTensors. This includes unit tests to ensure accuracy and functionality.
  • Convert to Sparse Core CSR Wrapped COO Format: Introduced the ConvertToSparseCoreCsrWrappedCooTensorOp operation. This converts a sorted COO tensor into a sparse core CSR wrapped COO format, optimizing the handling of sparse tensors (a minimal COO-to-CSR sketch follows this list).
  • PartialReduce Custom Call in Auto-Sharding: Added support for the PartialReduce custom call op in auto-sharding, enhancing the generation of strategies for PartialReduce operations.
  • Nested Tuples in BorrowingLiteral: Added support for nested tuples in BorrowingLiteral, allowing more flexibility when working with complex data structures in XLA.
  • Composite Ops in TFLite Flatbuffer Schema: Added support for Composite ops in the TFLite flatbuffer schema, introducing the necessary infrastructure for the StableHLOComposite operation.
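
As mentioned above, here's a minimal NumPy sketch of the plain COO-to-CSR step that underlies this conversion — the row indices of a sorted COO tensor are replaced by per-row offsets. The extra "wrapped COO" bookkeeping that SparseCore adds on top is not shown.

```python
import numpy as np

num_rows = 4
row = np.array([0, 0, 1, 3])           # sorted COO row ids
col = np.array([1, 3, 2, 0])
val = np.array([10., 20., 30., 40.])

# CSR keeps col/val and replaces `row` with offsets: entries of row r live in
# the half-open range row_ptr[r]:row_ptr[r + 1].
row_ptr = np.zeros(num_rows + 1, dtype=np.int64)
np.add.at(row_ptr, row + 1, 1)         # count entries per row
row_ptr = np.cumsum(row_ptr)

print(row_ptr)   # [0 2 3 3 4] -> row 0 has 2 entries, row 1 has 1, row 2 none, row 3 one
```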

Improvements ๐Ÿš€

  • Multiple Epilogues in Fusion Process: Now each reduction group can have its own epilogue in the fusion process, enhancing flexibility and customization.
  • Python Bindings for TensorFlow to StableHLO Tooling: Added Python bindings to enable the conversion of TensorFlow SavedModel to StableHLO, providing more flexibility in specifying input parameters and output paths.
  • Cache Dataset Random Access Iterators: Enhanced support for saving and loading cache dataset random access iterators, ensuring that cached elements can be accessed and restored efficiently.

Bug Fixes ๐Ÿ›

  • TPU Device Check in MlirBridgePass: Reintroduced the TPU device check in MlirBridgePass::GetPassState(), unblocking graphs that target TPU without replication.
  • GpuAlgebraicSimplifier: Fixed a bug in the GpuAlgebraicSimplifier related to determining if operands of a dot operation are vectors.
  • Replace absl::make_unique_for_overwrite: Updated the code to use std::make_unique instead of absl::make_unique_for_overwrite, aligning with standard C++ practices.

We hope you enjoy these updates and improvements! Keep coding and stay awesome! ๐Ÿš€โœจ


Hey there, code wranglers! We've got a bunch of updates to share with you. From new features to bug fixes, here's the latest scoop on what's been happening under the hood. ๐Ÿš€


New feature

  • Containers with CUDA 12.3 and CUDNN 8.9: Added new containers with CUDA 12.3 and CUDNN 8.9. This update makes sure you can build manylinux 2014 compliant cross-compilers targeting compatible glibc and system libstdc++. ๐Ÿš€
  • Weight-only quantization: Introduced weight-only quantization for convolution and dot_general operations. This adds support for the weight_only_ptq method, making your deep learning models leaner and meaner. ๐Ÿ‹๏ธโ€โ™‚๏ธ
  • CalibrationStatisticsSaver op: Added a new op definition to replace the CalibrationSingleton, aggregating and saving statistics to files. This op is stateful and designed to run on the CPU, making it easy to lift to outer functions. ๐Ÿ“Š
  • Async dynamic slicing: Implemented async dynamic slicing for host memory offloading on GPU. Dynamic slicing instructions are wrapped in a fusion node, allowing for asynchronous execution. ๐ŸŒ€
  • StableHLO integration: Integrated StableHLO at openxla/stablehlo@714d9aca, updating various functions and constants. ๐Ÿ› ๏ธ

Improvement

  • Variable dtype and shape storage: Enhanced IfrtRestoreTensorRegistry to store variable dtype and shape, improving tensor restoration and lookup during execution. ๐Ÿง 
  • Global shuffling for memory cache dataset: Added support for global shuffling in the memory cache dataset, improving data processing capabilities. ๐Ÿ”„
  • Memory Term Reducer: Augmented the Memory Term Reducer to merge both primitives and groups, enhancing memory management and optimization. ๐Ÿงฉ

Bugfix

  • Convert-memory-placement-to-internal-annotations: Removed a check for single user of an operand, allowing the program to process operands with multiple users. ๐Ÿ”ง
  • LLVM integration: Updated LLVM usage to match the latest commit version, ensuring compatibility and stability. ๐Ÿ›ก๏ธ
  • Duplicate dependency in TSL: Removed a duplicate 'clog' dependency, streamlining the code and optimizing dependency management. ๐Ÿ—‘๏ธ

Chore

  • Remove unused workflow: Cleaned up the codebase by removing an outdated "A/B Diff Performance Benchmarking" workflow. โœ‚๏ธ

That's all for now! Keep on coding and stay tuned for more updates. Happy coding! ๐Ÿ˜„


Here's the latest and greatest from our development team! Check out the awesome new features, improvements, and bug fixes we've rolled out:


New Features

  • IndexFlatMapDataset ๐ŸŽ‰

    • Introducing IndexFlatMapDataset, a new dataset operation in TensorFlow. It's like flat_map but with global shuffling! Users need to provide an index_map_fn function, which returns a tuple of (element index, offset) for the unflattened dataset. Enhances dataset manipulation with global shuffling support. A conceptual sketch of index_map_fn follows this list.
  • Unbounded Dynamism Tests ๐Ÿงช

    • Added tests for unbounded dynamism in ReducePrecisionOp, ShiftLeftOp, and ComplexOp. These tests ensure that these operations handle precision reduction, shifting, and complex number operations correctly, even with varying shapes and broadcast dimensions.
  • IfrtServingExecutable Host Callback Execution ๐Ÿš€

    • Added support for executing host callbacks in IfrtServingExecutable. This includes building, grouping, and executing host callbacks synchronously, along with necessary tests to ensure functionality.
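
As promised above, a plain-Python sketch of the index_map_fn contract (the idea, not the tf.data API itself): given a position in the flattened dataset, return the index of the unflattened element plus the offset inside it. That mapping is what lets global shuffling address arbitrary flattened positions.

```python
unflattened = [["a", "b"], ["c", "d", "e"], ["f"]]   # flattens to a b c d e f

def index_map_fn(flat_index):
    """Map a flattened position to (element index, offset) in the unflattened data."""
    for element_index, element in enumerate(unflattened):
        if flat_index < len(element):
            return element_index, flat_index
        flat_index -= len(element)
    raise IndexError(flat_index)

assert index_map_fn(0) == (0, 0)   # "a"
assert index_map_fn(3) == (1, 1)   # "d"
assert index_map_fn(5) == (2, 0)   # "f"
```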

Improvements

  • Unpack Quantized MHLO Ops ๐Ÿ”ง

    • Unpacked per-channel hybrid quantized MHLO ops to float ops. This includes extensive modifications and tests to ensure correct handling of scales and zero points in symmetric and asymmetric quantization cases.
  • Composite Lowering for aten.avg_pool2d ๐ŸŒŠ

    • Added a composite lowering pass for aten.avg_pool2d in the TensorFlow compiler MLIR Lite stablehlo module. This includes utility functions and updates to various files to handle average pooling operations.
  • Global Shuffling for IndexFlatMapDataset ๐ŸŒ

    • Enhanced IndexFlatMapDataset with global shuffling support. This includes updates to ensure compatibility with random access for all upstream transformations and new test cases to validate the functionality.

Bug Fixes

  • PjRtBuffer Dependency Handling ๐Ÿ› ๏ธ

    • Updated DonateWithControlDependency in PjRtBuffer to use PjRtFuture<> for passing dependencies. This includes temporary adaptor functions and changes across multiple files to ensure compatibility.
  • HloComputation Struct Optimization ๐Ÿ‹๏ธโ€โ™‚๏ธ

    • Removed the redundant instruction_indices_ from HloComputation, reducing the struct size and reorganizing it for better efficiency.
  • Attribute Fix for MSVC ๐Ÿ”ฉ

    • Replaced __attribute__((unused)) with [[maybe_unused]] in PluginProgramSerDes and PluginCompileOptionsSerDes to fix an MSVC error.

Chores

  • Internal Package Group Update ๐Ÿ“ฆ
    • Modified the internal package group in the tensorflow/BUILD file, adding a new package group for "//waymo/accelerator/...". This helps in better organizing and managing the codebase.

Stay tuned for more updates and keep coding! ๐Ÿš€


### Changelog

Hey there, awesome developers! We've got some exciting updates and fixes for you. Check out what's new and improved:

#### New feature ๐Ÿš€
- **PluginProgram in IFRT**: Introducing the 'PluginProgram' in IFRT, now accessible via `xla_client.compile_ifrt_program()`. This nifty feature wraps arbitrary byte-strings, giving IFRT backends the freedom to interpret them as they see fit. Plus, new functions to create XLA and plugin programs and compile options are now available.
- **Distributed Save and Load with Wait**: Say hello to `data.experimental.distributed_save` and the `wait` parameter in `load`! Save your distributed dataset snapshots non-blockingly and read them while they're being written. Backward compatibility? Check! (See the sketch after this list.)
- **Executable Wrapper for Host Callback**: Added a new C++ class `TfHostCallback` to run host callbacks in TensorFlow. Create, pass input tensors, execute, and retrieve output tensors with ease.
- **Force Early Scheduling**: Introducing `kForceEarly` to schedule nodes as early as possible, especially useful for GPU schedulers. Optimize your pipelined Recv nodes for better performance.
- **Get Default Layout in PyClient**: Added a method to retrieve the default layout for specific devices in the PyClient class. More control over your layouts now!
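
A rough sketch of the save-then-read-while-writing workflow, assuming the entry points are `tf.data.experimental.distributed_save` and `tf.data.Dataset.load(..., wait=True)` — check the release docs for the exact signatures. The dispatcher address and snapshot path below are placeholders.

```python
import tensorflow as tf

dispatcher_address = "grpc://dispatcher:5000"   # placeholder tf.data service dispatcher
snapshot_path = "/tmp/snapshot"                  # placeholder snapshot directory

ds = tf.data.Dataset.range(1_000)
# Kicks off a non-blocking distributed snapshot via the tf.data service.
tf.data.experimental.distributed_save(ds, snapshot_path, dispatcher_address)

# wait=True lets a reader start while the snapshot is still being written,
# blocking only until the data it needs becomes available.
loaded = tf.data.Dataset.load(snapshot_path, wait=True)
print(loaded.cardinality())
```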

#### Improvement ๐ŸŒŸ
- **Same Shape Bias for Convolution**: Lift the same shape bias for `stablehlo.convolution`. Explicitly give bias with the desired shape, and find operands of specific types with ease.
- **SourceLocation in xla::Internal Errors**: Enhanced error reporting and debugging by adding SourceLocation information to xla::Internal errors.
- **Rename WeightOnlyPreset**: Updated the naming convention from WeightOnlyPreset to WeightOnlyPtqPreset for clarity and uniformity across the codebase.

#### Bugfix ๐Ÿ›
- **Rollforward with Fix**: Resolved issues in "hlo_proto_to_memory_visualization_utils.cc" by rolling forward with necessary fixes. Shape indexes and descriptions are now accurately resolved.
- **Fake Quant Gradient Ops**: Registered fake quant gradient operations as not differentiable to maintain consistency and accuracy in gradient computations.
- **Async Copies Scheduling**: Corrected the scheduling of async copy operations with `start_after=-1` to hide latency effectively.

#### Chore ๐Ÿงน
- **Remove Stray Constant Folding Mutex**: Cleaned up and optimized the constant folding logic by removing an unnecessary mutex, resulting in more efficient code execution.

Enjoy these updates and keep on coding! ๐Ÿš€โœจ