TensorFlow changelog


Hey team! Check out the latest and greatest updates to our codebase. We've got some cool new features, important improvements, and essential bug fixes. Dive in and see what's new! 🚀

New Features

  • Support for conditional() with manual subgroups in spmd_partitioner: Now you can handle conditional operations with manual subgroups, maintaining manual sharding where needed. This update includes changes to SpmdPartitioningVisitor and new test cases to validate this functionality. 🎉

  • Basic DAG Executor Implementation for XLA CPU: Introducing a basic Directed Acyclic Graph (DAG) executor for the XLA CPU service. It runs thunks concurrently in a thread pool while preserving correct execution order. 🧩

  • Initial Implementation of ThunkExecutor: A new ThunkExecutor class is here! It builds a DAG that defines execution order based on buffer uses, complete with methods and tests to ensure everything runs smoothly. 🛠️

  • Runtime Simulator for HLO Module Execution Time: A new simulator predicts the execution time of HLO modules, taking nested loop trip counts into account. This improves the accuracy of execution-time estimates (see the sketch at the end of this list). ⏱️

  • ScratchAllocator in External FFI API: Introducing ScratchAllocator for efficient device memory allocation and deallocation in XLA's external FFI API. This improves overall usability and performance. 💾
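
To make the trip-count idea concrete, here is a minimal, purely illustrative sketch in Python; it is not the XLA simulator itself, and the costs and trip counts are made up. The point is just that an instruction's cost counts once per iteration of every loop that encloses it.

```python
# Toy cost model: nested trip counts multiply, so an instruction inside
# two loops contributes cost * outer_trip_count * inner_trip_count.

def estimate_execution_time(instructions):
    """instructions: list of (cost_seconds, enclosing_loop_trip_counts)."""
    total = 0.0
    for cost, trip_counts in instructions:
        multiplier = 1
        for trip_count in trip_counts:
            multiplier *= trip_count
        total += cost * multiplier
    return total

# A 2 ms op inside a 10-iteration loop nested in a 5-iteration loop,
# plus a 7 ms op at the top level: 2e-3 * 50 + 7e-3 = 0.107 s.
print(estimate_execution_time([(2e-3, [5, 10]), (7e-3, [])]))
```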

Improvements

  • Simplified Code in dynamic_update_slice: We've streamlined the code by removing unnecessary template usage and converting indices to int64 before processing. This reduces the target binary size and improves performance. 📉

  • Export XLA:FFI Handlers as C Function Symbols: A new macro allows exporting XLA:FFI handlers as C function symbols, making it easier to work with FFI implementations in shared libraries. 🔧

  • Using Eigen Thread Pool for ThunkExecutor Tasks: ThunkExecutor tasks now run on the Eigen thread pool, addressing mutex contention points so that performance scales nearly linearly with the number of threads. 🏎️

Bug Fixes

  • Correct Propagation of Deserialization Errors: We've fixed the deserialization process to correctly propagate errors from HloProgramSerDes, so failures surface with a proper error message instead of being swallowed. 🛠️

  • Vectorization with Modulo Operations: Fixed an issue where vectorization didn't work properly with modulo operations. Now, both (a mod b) * x and (a * x) mod b are handled correctly. 🧮

  • Hash Function Compatibility with NumPy 2.0: Addressed a failure in the hash function under NumPy 2.0. The hash calculations now use NumPy's uint64 data type for better compatibility (see the sketch below). 🔍
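
As a hedged illustration of that last item, and not TensorFlow's actual hash: keeping every intermediate value in np.uint64 turns the computation into plain 64-bit modular arithmetic, which sidesteps NumPy 2.0's stricter handling of out-of-range Python integers. The FNV-1a constants below are just a stand-in example.

```python
import numpy as np

FNV_OFFSET = np.uint64(14695981039346656037)
FNV_PRIME = np.uint64(1099511628211)

def fnv1a_64(data: bytes) -> np.uint64:
    """Illustrative 64-bit hash with every intermediate kept as np.uint64."""
    h = FNV_OFFSET
    with np.errstate(over="ignore"):  # uint64 arithmetic wraps mod 2**64
        for byte in data:
            h = (h ^ np.uint64(byte)) * FNV_PRIME
    return h

print(hex(int(fnv1a_64(b"tensorflow"))))
```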

Chores

  • Removed Dead Code in XLA:GPU: Cleaned up the codebase by removing unused code related to MockNcclTopoModel from GpuExecutableRunOptions. This makes the code cleaner and easier to maintain. 🧹

That's all for now! Keep coding and stay awesome! 💻✨


Welcome to the latest change log! We've been busy making some exciting updates and improvements. Here's a rundown of what's new, fixed, and improved:


New Features

  • Freeze API for Device Tensors 🧊: Introducing a Freeze() API to release host memory for device tensors in TensorFlow. It decides whether to release a tensor based on its usage by CPU/host operations. This helps in managing memory more efficiently by freeing up resources used solely by the device.

  • Shard-as Propagation Support 🚀: Added support for shard-as propagation with unspecified dimensions in the XLA:SPMD framework. This update ensures better handling of sharding instructions and enhances the propagation process.

  • GemmDegenerateDimRemover Pass: A new pass called GemmDegenerateDimRemover has been added to the XLA service for GPU. This pass removes degenerate dimensions introduced by GemvRewriter, optimizing matrix-vector multiplications.

  • Remove Unused Dimensions in IndexingMap: A method to remove unused dimensions from the IndexingMap class in the XLA:GPU service has been introduced. This helps in cleaning up and optimizing representations by removing unused dimensions.

  • HloAnyOf Function 🌟: Added a new traversal function called HloAnyOf to the XLA:GPU codebase. This function provides a flexible way to traverse HLO nodes without needing additional adaptors, making the codebase more user-friendly.

Improvements

  • Multi-threading in tf.data Module 🧵: We've introduced multi-threading to run the flat map function in TensorFlow's tf.data module. This boosts the efficiency of processing input datasets by using multiple threads (a usage sketch of the pattern that benefits follows this list).

  • Memory Term Reduction Algorithm: A simpler and more effective algorithm for reducing memory terms has been implemented. This update uses ActivePrim pairs instead of LiveAndPrim pairs, making the merging of overlapping intervals more efficient.

  • Remove Unused Dims and Symbols in XLA:GPU: A method to remove both unused dimensions and symbols has been added to the XLA:GPU IndexAnalysis module. This optimization reduces redundancy and improves performance.
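
For context on the flat-map item above, here is what such a pipeline looks like with the public tf.data API. The threading change itself is internal, and whether it is automatic or opt-in isn't spelled out here; the snippet only shows the pattern that benefits.

```python
import tensorflow as tf

# flat_map turns each input element into its own sub-dataset and then
# flattens those sub-datasets into a single stream. Evaluating the map
# function for several input elements in parallel speeds this step up.
ds = tf.data.Dataset.from_tensor_slices([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
ds = ds.flat_map(tf.data.Dataset.from_tensor_slices)

print(list(ds.as_numpy_iterator()))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```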

Bug Fixes

  • Early Error for Coordination Service Shutdown: Fixed an issue where a barrier request after the coordination service shutdown would proceed. Now, it returns an error early, ensuring proper handling of such requests.

  • Close Host Callback Queues: Explicitly closing host callback queues inside IfrtBackend destruction to avoid potential deadlocks caused by blocked executions.

  • Unpropagatable Dots in Space-to-Batch Conversion: Marked dots as unpropagatable during space-to-batch conversion to prevent issues related to dot propagation post layout assignment.

Chores

  • Remove Deprecated MLIR Codegen: Removed deprecated XLA:CPU MLIR-based codegen parts to clean up the codebase and streamline the compilation pipeline.

That's all for now! Stay tuned for more updates and improvements. 🌟


Welcome to the latest change log! We've been busy adding some fantastic new features, improving existing functionalities, and squashing pesky bugs. Here's the scoop:

New Features 🎉

  • Max IDs and Unique IDs Operation: Added a new operation called TF_GetStatsFromListOfSparseCoreCooTensorsOp to compute the max_ids and max_unique_ids from a list of SparseCoreCooTensors. This includes unit tests to ensure accuracy and functionality.
  • Convert to Sparse Core CSR Wrapped COO Format: Introduced the ConvertToSparseCoreCsrWrappedCooTensorOp operation. This converts a sorted COO tensor into a sparse core CSR wrapped COO format, optimizing the handling of sparse tensors.
  • PartialReduce Custom Call in Auto-Sharding: Added support for the PartialReduce custom call op in auto-sharding, enhancing the generation of strategies for PartialReduce operations.
  • Nested Tuples in BorrowingLiteral: Added support for nested tuples in BorrowingLiteral, allowing more flexibility when working with complex data structures in XLA.
  • Composite Ops in TFLite Flatbuffer Schema: Added support for Composite ops in the TFLite flatbuffer schema, introducing the necessary infrastructure for the StableHLOComposite operation.

Improvements 🚀

  • Multiple Epilogues in Fusion Process: Now each reduction group can have its own epilogue in the fusion process, enhancing flexibility and customization.
  • Python Bindings for TensorFlow to StableHLO Tooling: Added Python bindings to enable the conversion of TensorFlow SavedModel to StableHLO, providing more flexibility in specifying input parameters and output paths.
  • Cache Dataset Random Access Iterators: Enhanced support for saving and loading cache dataset random access iterators, ensuring that cached elements can be accessed and restored efficiently (see the checkpointing sketch below).
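
As a rough illustration of that item, the standard tf.data iterator checkpointing pattern looks like the following. The cache-specific bookkeeping is internal; the path is just an example, and on TensorFlow versions that predate this change the cached iterator may not checkpoint cleanly.

```python
import tensorflow as tf

# .cache() is the part the item above is about; the save/restore pattern
# itself is the usual tf.train.Checkpoint iterator checkpointing.
ds = tf.data.Dataset.range(10).cache()
it = iter(ds)
print(next(it).numpy(), next(it).numpy())  # 0 1

ckpt = tf.train.Checkpoint(iterator=it)
path = ckpt.write("/tmp/cache_iter_ckpt")  # save the iterator state

print(next(it).numpy())                    # 2 (iterator keeps advancing)
ckpt.restore(path)                         # rewind to the saved state
print(next(it).numpy())                    # 2 again, resumed after 0 and 1
```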

Bug Fixes 🐛

  • TPU Device Check in MlirBridgePass: Reintroduced the TPU device check in MlirBridgePass::GetPassState(), unblocking graphs that target TPU without replication.
  • GpuAlgebraicSimplifier: Fixed a bug in the GpuAlgebraicSimplifier related to determining if operands of a dot operation are vectors.
  • Replace absl::make_unique_for_overwrite: Updated the code to use std::make_unique instead of absl::make_unique_for_overwrite, aligning with standard C++ practices.

We hope you enjoy these updates and improvements! Keep coding and stay awesome! 🚀✨


Hey there, code wranglers! We've got a bunch of updates to share with you. From new features to bug fixes, here's the latest scoop on what's been happening under the hood. 🚀


New feature

  • Containers with CUDA 12.3 and cuDNN 8.9: Added new build containers with CUDA 12.3 and cuDNN 8.9. These let you build manylinux2014-compliant cross-compilers targeting a compatible glibc and system libstdc++. 🚀
  • Weight-only quantization: Introduced weight-only quantization for convolution and dot_general operations. This adds support for the weight_only_ptq method, making your deep learning models leaner and meaner. 🏋️‍♂️
  • CalibrationStatisticsSaver op: Added a new op definition to replace the CalibrationSingleton, aggregating and saving statistics to files. This op is stateful and designed to run on the CPU, making it easy to lift to outer functions. 📊
  • Async dynamic slicing: Implemented async dynamic slicing for host memory offloading on GPU. Dynamic slicing instructions are wrapped in a fusion node, allowing for asynchronous execution. 🌀
  • StableHLO integration: Integrated StableHLO at openxla/stablehlo@714d9aca, updating various functions and constants. 🛠️

Improvement

  • Variable dtype and shape storage: Enhanced IfrtRestoreTensorRegistry to store variable dtype and shape, improving tensor restoration and lookup during execution. 🧠
  • Global shuffling for memory cache dataset: Added support for global shuffling in the memory cache dataset, improving data processing capabilities. 🔄
  • Memory Term Reducer: Augmented the Memory Term Reducer to merge both primitives and groups, enhancing memory management and optimization. 🧩

Bugfix

  • Convert-memory-placement-to-internal-annotations: Removed a check requiring an operand to have a single user, allowing the pass to handle operands with multiple users. 🔧
  • LLVM integration: Updated LLVM usage to match the latest commit version, ensuring compatibility and stability. 🛡️
  • Duplicate dependency in TSL: Removed a duplicate 'clog' dependency, streamlining the code and simplifying dependency management. 🗑️

Chore

  • Remove unused workflow: Cleaned up the codebase by removing an outdated "A/B Diff Performance Benchmarking" workflow. ✂️

That's all for now! Keep on coding and stay tuned for more updates. Happy coding! 😄


Here's the latest and greatest from our development team! Check out the awesome new features, improvements, and bug fixes we've rolled out:


New Features

  • IndexFlatMapDataset 🎉

    • Introducing IndexFlatMapDataset, a new dataset operation in TensorFlow. It's like flat_map but with global shuffling! Users provide an index_map_fn function that, given a position in the flattened output, returns a tuple of (element index, offset) into the unflattened dataset (a pure-Python sketch of this contract follows this list).
  • Unbounded Dynamism Tests 🧪

    • Added tests for unbounded dynamism in ReducePrecisionOp, ShiftLeftOp, and ComplexOp. These tests ensure that these operations handle precision reduction, shifting, and complex number operations correctly, even with varying shapes and broadcast dimensions.
  • IfrtServingExecutable Host Callback Execution 🚀

    • Added support for executing host callbacks in IfrtServingExecutable. This includes building, grouping, and executing host callbacks synchronously, along with necessary tests to ensure functionality.
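
To make the index_map_fn contract concrete, here is a pure-Python sketch; the names and the standalone function are illustrative only, not the TensorFlow API.

```python
# Given a position in the *flattened* output, index_map_fn answers two
# questions: which element of the unflattened dataset does it come from,
# and at which offset inside that element?
unflattened = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]

def index_map_fn(flat_index):
    element_index = 0
    while flat_index >= len(unflattened[element_index]):
        flat_index -= len(unflattened[element_index])
        element_index += 1
    return element_index, flat_index  # (element index, offset)

# Flattened position 4 is the second value of the second element.
print(index_map_fn(4))    # (1, 1)
print(unflattened[1][1])  # 21
```

With such a mapping available, the dataset can answer random-access lookups into the flattened stream, which is what makes global shuffling possible.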

Improvements

  • Unpack Quantized MHLO Ops 🔧

    • Unpacked per-channel hybrid quantized MHLO ops to float ops. This includes extensive modifications and tests to ensure correct handling of scales and zero points in symmetric and asymmetric quantization cases (the underlying dequantization math is sketched after this list).
  • Composite Lowering for aten.avg_pool2d 🌊

    • Added a composite lowering pass for aten.avg_pool2d in the TensorFlow compiler MLIR Lite stablehlo module. This includes utility functions and updates to various files to handle average pooling operations.
  • Global Shuffling for IndexFlatMapDataset 🌐

    • Enhanced IndexFlatMapDataset with global shuffling support. This includes updates to ensure compatibility with random access for all upstream transformations and new test cases to validate the functionality.
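
The unpacking in the quantized-ops item boils down to standard dequantization arithmetic. The snippet below shows only that arithmetic with made-up values, not the MHLO pass itself.

```python
import numpy as np

# Per-channel dequantization: real ~= scale * (quantized - zero_point),
# with one scale and zero point per channel (here, per column).
q = np.array([[ 12,  -3],
              [127, -128]], dtype=np.int8)
scale = np.array([0.25, 0.5], dtype=np.float32)
zero_point = np.array([0.0, -8.0], dtype=np.float32)  # 0 => symmetric

dequantized = scale * (q.astype(np.float32) - zero_point)
print(dequantized)
# [[  3.     2.5 ]
#  [ 31.75 -60.  ]]
```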

Bug Fixes

  • PjRtBuffer Dependency Handling 🛠️

    • Updated DonateWithControlDependency in PjRtBuffer to use PjRtFuture<> for passing dependencies. This includes temporary adaptor functions and changes across multiple files to ensure compatibility.
  • HloComputation Struct Optimization 🏋️‍♂️

    • Removed the redundant instruction_indices_ from HloComputation, reducing the struct size and reorganizing it for better efficiency.
  • Attribute Fix for MSVC 🔩

    • Replaced __attribute__((unused)) with [[maybe_unused]] in PluginProgramSerDes and PluginCompileOptionsSerDes to fix an MSVC error.

Chores

  • Internal Package Group Update 📦
    • Modified the internal package group in the tensorflow/BUILD file, adding a new package group for "//waymo/accelerator/...". This helps in better organizing and managing the codebase.

Stay tuned for more updates and keep coding! 🚀


### Changelog

Hey there, awesome developers! We've got some exciting updates and fixes for you. Check out what's new and improved:

#### New feature 🚀
- **PluginProgram in IFRT**: Introducing the 'PluginProgram' in IFRT, now accessible via `xla_client.compile_ifrt_program()`. This nifty feature wraps arbitrary byte-strings, giving IFRT backends the freedom to interpret them as they see fit. Plus, new functions to create XLA and plugin programs and compile options are now available.
- **Distributed Save and Load with Wait**: Say hello to `data.experimental.distributed_save` and the `wait` parameter in `load`! Save your distributed dataset snapshots without blocking and read them while they're being written (see the sketch after this list). Backward compatibility? Check!
- **Executable Wrapper for Host Callback**: Added a new C++ class `TfHostCallback` to run host callbacks in TensorFlow. Create, pass input tensors, execute, and retrieve output tensors with ease.
- **Force Early Scheduling**: Introducing `kForceEarly` to schedule nodes as early as possible, especially useful for GPU schedulers. Optimize your pipelined Recv nodes for better performance.
- **Get Default Layout in PyClient**: Added a method to retrieve the default layout for specific devices in the PyClient class. More control over your layouts now!
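
Here is a hedged sketch of that distributed save/load flow; the argument order for `distributed_save`, the dispatcher address, and the paths are assumptions, and exact signatures may differ across TensorFlow versions.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(1_000_000)

# Placeholder address of a running tf.data service dispatcher.
dispatcher_address = "grpc://localhost:5050"

# Non-blocking: tf.data service workers write the snapshot in the
# background while this program keeps running.
tf.data.experimental.distributed_save(ds, "/tmp/snapshot", dispatcher_address)

# wait=True lets the reader consume elements as they become available
# instead of failing because the snapshot isn't finished yet.
loaded = tf.data.Dataset.load("/tmp/snapshot", wait=True)
for element in loaded.take(3):
    print(element.numpy())
```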

#### Improvement 🌟
- **Same Shape Bias for Convolution**: Lift the same shape bias for `stablehlo.convolution`. Explicitly give bias with the desired shape, and find operands of specific types with ease.
- **SourceLocation in xla::Internal Errors**: Enhanced error reporting and debugging by adding SourceLocation information to xla::Internal errors.
- **Rename WeightOnlyPreset**: Updated the naming convention from WeightOnlyPreset to WeightOnlyPtqPreset for clarity and uniformity across the codebase.

#### Bugfix 🐛
- **Rollforward with Fix**: Resolved issues in "hlo_proto_to_memory_visualization_utils.cc" by rolling forward with necessary fixes. Shape indexes and descriptions are now accurately resolved.
- **Fake Quant Gradient Ops**: Registered fake quant gradient operations as not differentiable to maintain consistency and accuracy in gradient computations (a sketch of that kind of registration follows this list).
- **Async Copies Scheduling**: Corrected the scheduling of async copy operations with `start_after=-1` to hide latency effectively.
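
On the Python side, the usual way an op type is declared non-differentiable is `tf.no_gradient`; whether this particular fix used that mechanism or its C++ equivalent isn't stated here, and the op name below is hypothetical.

```python
import tensorflow as tf

# Hypothetical op name, for illustration only. Once an op type is
# registered this way, automatic differentiation returns None (no
# gradient) for it instead of raising a "no gradient defined" error.
tf.no_gradient("MyFakeQuantGradient")
```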

#### Chore 🧹
- **Remove Stray Constant Folding Mutex**: Cleaned up and optimized the constant folding logic by removing an unnecessary mutex, resulting in more efficient code execution.

Enjoy these updates and keep on coding! 🚀✨