TensorFlow Changelog
Hey there, awesome devs! Here's the latest and greatest from our codebase. Check out these exciting updates, bug fixes, and improvements.
New Features
- Support i4 EmbeddingLookup in TFLite reference: You can now use the `EmbeddingLookup` operation with `TensorType_INT4` in TensorFlow Lite (TFLite), giving your models more flexibility and efficiency.
- Add external KV cache op for GenAI: Introduces an external key-value (KV) cache operation for TensorFlow Lite's experimental GenAI module, improving the management of external KV caches, which is crucial for AI applications.
- [XLA:UNSTACKER] Detect effectively static dynamic-slice instructions: A new function optimizes loop unrolling by identifying dynamic slices that are effectively static, boosting performance.
- Add a method for looking up the memory space of a pointer: StreamExecutor now has a method to determine the memory space of a pointer, enhancing memory management.
- [XLA:FFI] Add instantiation handler to XLA_FFI_Handler_Bundle: Expands the XLA FFI API with an instantiate handler, giving you more control over the instantiation process.
Bugfixes
- Fix race condition in sparse optimizers: Ensures an exclusive lock is held when modifying `var->tensor()` in `EnsureSparseVariableAccess`, preventing segfaults and improving stability.
- [XLA:GPU] Fix Triton codegen for `BroadcastOp`s of scalars: Ensures broadcasting rules are correctly enforced in the Triton verifier, preventing potential errors.
- Remove affine fuzz test: Temporarily removed due to build issues with the current version of fuzztest, keeping the build process smooth and error-free.
Improvements
- Add physical device ordinal to buffers: Enhances resource management and tracking across different physical devices in the XLA framework.
- Add support for non-trivial strides for conv in MHLO->TFL: Convolution operations in MHLO->TFL now support non-trivial strides, increasing flexibility and performance.
- Automated Code Change: Streamlined dependencies and updated headers in the `grappler` module, enhancing optimization and performance.
Chore
- Remove deprecated TfLiteOperatorCreateWithData function: Cleaned up the codebase by removing this deprecated function, simplifying the implementation.
Keep up the fantastic work, and let's keep pushing the boundaries of what's possible!
Included Commits
This commit includes automated code changes in the TensorFlow project, specifically in the `grappler` module. The `BUILD` file has been modified to update dependencies, removing unnecessary ones and adding new ones. Additionally, the `grappler.h` and `grappler_test.cc` files now include new headers and dependencies related to buffer handling, status, optimization, and performance data.
Overall, this commit streamlines dependencies, updates headers, and brings in new functionality related to buffer handling, optimization, and performance data, enhancing the efficiency and capabilities of the `grappler` module for users working with TensorFlow's `grappler` features.
Files changed
- tensorflow/c/experimental/grappler/BUILD
- tensorflow/c/experimental/grappler/grappler.h
- tensorflow/c/experimental/grappler/grappler_test.cc
This commit adds an instantiation handler to the XLA_FFI_Handler_Bundle in the XLA FFI API. The changes include modifications to the api.h, c_api.h, ffi_api.cc, xla_compiler.cc, and custom_call_test.cc files. The XLA_FFI_Handler_Bundle now includes an optional instantiate handler in addition to the existing prepare, initialize, and execute handlers. The modifications involve updating the handler registry, handler stages, and handler registration process to accommodate the new instantiate handler. This change enhances the flexibility and functionality of the XLA FFI API by allowing the registration of custom handlers for different execution stages.
Overall, this commit expands the capabilities of the XLA FFI API by introducing an instantiate handler and updating relevant files to support this new feature. The changes aim to improve the customization options available for handling different stages of execution within the XLA framework. The addition of the instantiate handler provides developers with more control over the instantiation process, complementing the existing prepare, initialize, and execute handlers in the XLA FFI API.
Files changed
- third_party/xla/xla/ffi/api/api.h
- third_party/xla/xla/ffi/api/c_api.h
- third_party/xla/xla/ffi/ffi_api.cc
- third_party/xla/xla/python/xla_compiler.cc
- third_party/xla/xla/service/gpu/custom_call_test.cc
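The staged-handler idea above can be sketched in a few lines. This is a conceptual Python model of a handler bundle with an optional instantiate stage alongside prepare, initialize, and execute; the class and method names are ours, not the actual XLA FFI C API.

```python
# Conceptual sketch (not the real XLA FFI API): a bundle mapping execution
# stages to callbacks, where only the execute handler is mandatory.
class HandlerBundle:
    def __init__(self, execute, prepare=None, initialize=None, instantiate=None):
        self.handlers = {
            "instantiate": instantiate,  # new optional stage
            "prepare": prepare,
            "initialize": initialize,
            "execute": execute,          # always required
        }

    def run(self, stage, *args):
        handler = self.handlers.get(stage)
        if handler is None:
            return None  # optional stage was not registered
        return handler(*args)

bundle = HandlerBundle(execute=lambda x: x * 2, instantiate=lambda: "state")
state = bundle.run("instantiate")   # runs the new optional stage
result = bundle.run("execute", 21)  # mandatory stage
```

The design point mirrored here is that unregistered optional stages are simply skipped rather than being errors.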
This commit adds a method for looking up the memory space of a pointer to stream_executor::StreamExecutor. It includes changes in multiple files, such as adding the method GetPointerMemorySpace to the GpuExecutor class, updating the MemorySpace enum in stream_executor.h, and modifying the cuda_driver.cc, cuda_driver.h, gpu_driver.h, gpu_executor.h, gpu_executor_test.cc, rocm_driver.cc, rocm_driver.h, and stream_executor.h files to implement and utilize this new functionality. The commit also includes unit tests in gpu_executor_test.cc to verify the correct behavior of the GetPointerMemorySpace method for both host and device memory spaces.
In summary, this commit enhances the stream_executor library by introducing a method to determine the memory space of a pointer, updating relevant files to incorporate this method, and adding tests to ensure its proper functionality for different memory spaces.
Files changed
- third_party/xla/xla/stream_executor/cuda/cuda_driver.cc
- third_party/xla/xla/stream_executor/cuda/cuda_driver.h
- third_party/xla/xla/stream_executor/gpu/BUILD
- third_party/xla/xla/stream_executor/gpu/gpu_driver.h
- third_party/xla/xla/stream_executor/gpu/gpu_executor.h
- third_party/xla/xla/stream_executor/gpu/gpu_executor_test.cc
- third_party/xla/xla/stream_executor/rocm/rocm_driver.cc
- third_party/xla/xla/stream_executor/rocm/rocm_driver.h
- third_party/xla/xla/stream_executor/stream_executor.h
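The kind of query this method provides can be illustrated with a toy model: given a "pointer", decide which memory space it belongs to. The class below is entirely hypothetical and only mimics the concept of a GetPointerMemorySpace-style lookup by checking registered address ranges.

```python
# Hypothetical sketch (not the StreamExecutor API): classify an address into a
# memory space by searching registered allocation ranges.
HOST, DEVICE = "host", "device"

class MemorySpaceTracker:
    def __init__(self):
        self.ranges = []  # list of (start, end, space) tuples

    def register(self, start, size, space):
        # Record an allocation so its addresses can be classified later.
        self.ranges.append((start, start + size, space))

    def memory_space(self, ptr):
        for start, end, space in self.ranges:
            if start <= ptr < end:
                return space
        raise ValueError(f"unknown pointer: {ptr:#x}")

tracker = MemorySpaceTracker()
tracker.register(0x1000, 0x100, HOST)    # a host allocation
tracker.register(0x2000, 0x100, DEVICE)  # a device allocation
```

In the real library the driver answers this question from the allocation itself; the range table here just makes the host-versus-device distinction concrete.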
This commit adds support for non-trivial strides for convolution in MHLO->TFL. The changes include modifying the test case for convolution with strides and updating the ConvData functions to check for supported shapes and padding. The commit also removes the check for supported strides in the IsConvLegal function, indicating that non-trivial strides are now supported for convolution in MHLO->TFL conversions.
Overall, the commit enhances the capabilities of convolution operations in MHLO->TFL by adding support for non-trivial strides, improving the test cases, and updating the ConvData functions to ensure compatibility with the changes.
Files changed
- tensorflow/compiler/mlir/lite/stablehlo/tests/tfl_legalize_hlo.mlir
- tensorflow/compiler/mlir/lite/stablehlo/transforms/legalize_hlo_conversions/conv.cc
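What a non-trivial stride changes is how many windows a convolution produces. A minimal sketch, using our own function names rather than anything from the converter, with VALID (zero) padding:

```python
# Illustrative sketch: output length and values of a 1-D convolution with a
# configurable stride (VALID padding). Names are ours, not the TFLite converter's.
def conv_out_len(in_len, kernel, stride, pad=0):
    # Standard conv output-size formula.
    return (in_len + 2 * pad - kernel) // stride + 1

def conv1d(x, w, stride):
    k = len(w)
    n = conv_out_len(len(x), k, stride)
    # Each output i sums the window starting at i * stride.
    return [sum(x[i * stride + j] * w[j] for j in range(k)) for i in range(n)]

result = conv1d([1, 2, 3, 4, 5], [1, 1], stride=2)
```

With stride 1 the same input yields four windows; stride 2 skips every other window, which is exactly the behavior the legality check previously rejected.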
This commit addresses an issue in the Triton code generation for `BroadcastOp`s involving scalars in XLA's GPU backend. Previously, the Triton verifier did not enforce the broadcasting rules correctly, which could lead to potential errors in the generated code. The commit introduces a function to add leading dimensions to input tensors and modifies the verification process by moving the call to `mlir::verify` to occur immediately after the TTIR is emitted, rather than after the MLIR canonicalization passes. This change ensures that any broadcasting discrepancies are caught earlier in the process, preventing future issues.
Additionally, the commit includes updates to the codebase, such as the addition of tests to verify the correct handling of 0D operands in broadcast operations. These tests ensure that the broadcasting logic behaves as expected, particularly for edge cases involving scalar inputs. Overall, the changes enhance the robustness of the Triton code generation and verification processes, contributing to improved reliability in the handling of broadcast operations in XLA's GPU implementation.
Files changed
- third_party/xla/xla/service/gpu/fusions/triton/triton_fusion_emitter.cc
- third_party/xla/xla/service/gpu/fusions/triton/triton_fusion_emitter_device_test.cc
- third_party/xla/xla/service/gpu/fusions/triton/triton_support_legacy_test.cc
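The core idea of the fix, adding leading dimensions so a scalar conforms to broadcasting rules, can be shown with nested lists standing in for tensors. Function names here are illustrative, not the Triton emitter's.

```python
# Conceptual sketch: give a 0-D (scalar) operand `rank` leading dimensions of
# size 1 so ordinary broadcasting rules can apply to it.
def add_leading_dims(value, rank):
    # Wrap the scalar in `rank` nested single-element lists: shape (1, ..., 1).
    for _ in range(rank):
        value = [value]
    return value

def shape(x):
    # Recover the shape of a nested-list "tensor".
    s = []
    while isinstance(x, list):
        s.append(len(x))
        x = x[0]
    return tuple(s)

expanded = add_leading_dims(3.0, 2)  # scalar -> shape (1, 1)
```

After this expansion a rank-checking verifier sees a well-formed operand instead of a bare 0-D value, which is why the discrepancy can be caught right after emission.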
This commit introduces an external key-value (KV) cache operation for TensorFlow Lite's experimental GenAI module. The new functionality is implemented in `external_kvcache.cc`, which includes the preparation and evaluation functions for the external KV cache operation, ensuring that it correctly handles input and output tensors. The operation requires five input tensors: two for the key and value caches, one for the position, and two for the slices of keys and values. The output consists of updated key and value caches. The commit also adds a corresponding test file, `external_kvcache_test.cc`, which contains unit tests to validate the new operation.
In addition to the new files, modifications were made to existing files to integrate the new operation into the TensorFlow Lite framework. Specifically, the `BUILD` file was updated to include the new source files, and registration functions were added to `genai_ops.cc` and `genai_ops.h` so the external KV cache operation can be registered and used within the GenAI operations. Overall, this commit enhances TensorFlow Lite's GenAI capabilities by enabling efficient management of external key-value caches, which is crucial for various AI applications.
Files changed
- tensorflow/lite/experimental/genai/BUILD
- tensorflow/lite/experimental/genai/external_kvcache.cc
- tensorflow/lite/experimental/genai/external_kvcache_test.cc
- tensorflow/lite/experimental/genai/genai_ops.cc
- tensorflow/lite/experimental/genai/genai_ops.h
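The five-input shape of the op described above can be sketched as plain Python: take the two caches, a position, and the new key/value slices, and write the slices into the caches at that position. This is our own simplification of the concept, not the TFLite kernel's actual signature.

```python
# Illustrative sketch of what an external KV-cache update conceptually does:
# splice the incoming key/value slices into the caches at `position`.
def update_kv_cache(k_cache, v_cache, position, k_slice, v_slice):
    for i, (k, v) in enumerate(zip(k_slice, v_slice)):
        k_cache[position + i] = k
        v_cache[position + i] = v
    return k_cache, v_cache  # outputs: the updated caches

# Caches with room for 4 positions; write two entries starting at position 1.
k_cache = [None] * 4
v_cache = [None] * 4
update_kv_cache(k_cache, v_cache, 1, ["k1", "k2"], ["v1", "v2"])
```

Keeping the cache external to the model graph is what lets a GenAI runtime grow and reuse it across decoding steps.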
This commit adds physical device ordinals to the `TrackedDeviceBuffer` and `ShapedBuffer` classes within the XLA (Accelerated Linear Algebra) compiler framework. The update allows the PJRT device pointer to be passed to the `TrackedDeviceBuffer`, enabling retrieval of device ordinals, which can enhance the management and tracking of resources across different physical devices.
Several files across the XLA codebase have been modified to accommodate this feature, including updates to the implementation and header files for `tracked_device_buffer` and `shaped_buffer`, as well as various service components. These changes aim to improve the functionality and interoperability of device buffers within the XLA ecosystem, potentially leading to better performance and resource utilization in machine learning tasks.
Files changed
- tensorflow/compiler/jit/xla_launch_util.cc
- third_party/xla/xla/pjrt/BUILD
- third_party/xla/xla/pjrt/pjrt_stream_executor_client.cc
- third_party/xla/xla/pjrt/tracked_device_buffer.cc
- third_party/xla/xla/pjrt/tracked_device_buffer.h
- third_party/xla/xla/pjrt/tracked_device_buffer_test.cc
- third_party/xla/xla/service/executable.h
- third_party/xla/xla/service/generic_transfer_manager.cc
- third_party/xla/xla/service/service.cc
- third_party/xla/xla/service/shaped_buffer.cc
- third_party/xla/xla/service/shaped_buffer.h
- third_party/xla/xla/service/transfer_manager.cc
- third_party/xla/xla/service/transfer_manager.h
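The benefit of carrying a device ordinal on each buffer is easiest to see in a tiny model: once buffers know which physical device they live on, a client can group and account for them per device. Class and function names below are illustrative, not XLA's.

```python
from collections import defaultdict

# Minimal sketch: buffers that record the physical device ordinal they live
# on, plus a helper that groups resources per device for tracking.
class TrackedBuffer:
    def __init__(self, data, device_ordinal):
        self.data = data
        self.device_ordinal = device_ordinal  # the new piece of metadata

def group_by_device(buffers):
    groups = defaultdict(list)
    for b in buffers:
        groups[b.device_ordinal].append(b)
    return dict(groups)

bufs = [TrackedBuffer("a", 0), TrackedBuffer("b", 1), TrackedBuffer("c", 0)]
groups = group_by_device(bufs)
```

Without the ordinal on the buffer itself, this kind of per-device accounting would require threading device identity through every call site separately.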
This commit addresses a race condition in the sparse optimizers of TensorFlow by ensuring that an exclusive lock is held when modifying `var->tensor()` in the `EnsureSparseVariableAccess` function. The previous implementation used a shared lock for this modification, which could lead to reading de-allocated memory during concurrent operations such as `ResourceGather`, potentially resulting in segmentation faults. The update involves a structural change where `EnsureSparseVariableAccess` is invoked only once per operation kernel, thereby preventing a deadlock scenario. This is accomplished by setting `copy_on_read_mode` to true during the initial call, which mitigates the risk of concurrent access issues that could affect model quality.
Additionally, the commit modifies several files to implement these changes, including updates to the locking mechanisms and the management of variable access. The changes also include a shift to using `tsl::mutex` types instead of the previous `mutex` types, enhancing the consistency and safety of concurrent operations. Overall, these modifications aim to improve the stability and reliability of TensorFlow's handling of sparse variables in a multi-threaded environment.
Files changed
- tensorflow/core/kernels/BUILD
- tensorflow/core/kernels/training_op_helpers.h
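The class of bug being fixed, a writer swapping shared state under only a shared lock while readers look on, can be modeled with a toy variable guarded by an exclusive lock. This models the idea only; TensorFlow's actual fix lives in `EnsureSparseVariableAccess`.

```python
import threading

# Toy illustration: replacing a shared buffer must happen under an exclusive
# lock, or a concurrent reader may observe a half-updated/dangling value.
class SharedVar:
    def __init__(self, tensor):
        self._lock = threading.Lock()
        self._tensor = tensor

    def read(self):
        with self._lock:  # readers serialize against the writer
            return list(self._tensor)

    def replace(self, new_tensor):
        with self._lock:  # exclusive access while swapping the buffer
            self._tensor = new_tensor

var = SharedVar([1, 2, 3])
writers = [threading.Thread(target=var.replace, args=([i] * 3,)) for i in range(8)]
for t in writers:
    t.start()
for t in writers:
    t.join()
snapshot = var.read()  # always one whole buffer, never a torn mix
```

Each writer installs a uniform buffer, so any consistent snapshot has all-equal elements; a shared (reader) lock around `replace` would not give that guarantee.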
This commit removes the deprecated function `TfLiteOperatorCreateWithData`, which served as an alias for `TfLiteOperatorCreate`. The removal is reflected in the modifications made to both `operator.cc` and `operator.h`, where the function's definitions and declarations have been deleted. The change simplifies the codebase by eliminating redundancy, as the functionality provided by `TfLiteOperatorCreateWithData` is already encompassed by the existing `TfLiteOperatorCreate`.
Overall, the commit streamlines the TensorFlow Lite code by cleaning up deprecated elements, thus enhancing maintainability and clarity. The removal of 12 lines of code across the two files contributes to a more concise implementation, aligning with best practices in software development by phasing out outdated functions.
Files changed
- tensorflow/lite/core/c/operator.cc
- tensorflow/lite/core/c/operator.h
This commit introduces support for the `EmbeddingLookup` operation with `TensorType_INT4` in TensorFlow Lite (TFLite). It modifies several files to accommodate this new feature, including updates to the operation's definition and evaluation logic in the TFLite kernel. The changes ensure that `EmbeddingLookup` can now handle 4-bit integer inputs, enhancing the flexibility and efficiency of the operation for various model types.
Additionally, the commit updates the versioning for the `EmbeddingLookup` operation to include this new capability, marking it as version 4. The changes are reflected in the operation's registration, evaluation functions, and corresponding tests, ensuring that the new functionality is properly integrated and tested within the TFLite framework. Overall, this enhancement aims to improve the performance and resource utilization of embedding lookups in machine learning models deployed on TFLite.
Files changed
- RELEASE.md
- tensorflow/compiler/mlir/lite/ir/tfl_ops.td
- tensorflow/lite/core/kernels/register.cc
- tensorflow/lite/kernels/embedding_lookup.cc
- tensorflow/lite/kernels/embedding_lookup_test.cc
- tensorflow/lite/tools/versioning/op_version.cc
- tensorflow/lite/tools/versioning/runtime_version.cc
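The reason INT4 embedding tables help is that two 4-bit values pack into one byte, halving table size. A minimal sketch of the packing idea and a row lookup over a packed table follows; the layout and function names here are our own illustration, not the TFLite kernel's actual scheme.

```python
# Illustrative sketch: pack signed int4 values two-per-byte, then look up one
# row of a packed embedding table.
def pack_int4(values):
    # Values must fit in [-8, 7]; low nibble first within each byte.
    out = bytearray()
    for i in range(0, len(values), 2):
        lo = values[i] & 0xF
        hi = (values[i + 1] & 0xF) if i + 1 < len(values) else 0
        out.append(lo | (hi << 4))
    return bytes(out)

def unpack_int4(packed, count):
    vals = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):
            vals.append(nib - 16 if nib >= 8 else nib)  # sign-extend the nibble
    return vals[:count]

def embedding_lookup(packed_table, row_len, index):
    # Return the `index`-th row of the packed int4 embedding table.
    all_vals = unpack_int4(packed_table, len(packed_table) * 2)
    return all_vals[index * row_len:(index + 1) * row_len]

table = pack_int4([1, -2, 3, -4])  # two rows of two int4 values each
row = embedding_lookup(table, row_len=2, index=1)
```

Real kernels also fold in a quantization scale per row or tensor; the storage-halving effect is the same.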
This commit introduces a new function to the XLA (Accelerated Linear Algebra) compiler that detects effectively static dynamic-slice instructions within unrollable loops. The function, `MatchEffectivelyStaticDynamicSliceInsideLoop`, checks whether a dynamic-slice instruction has a single dynamic index that is effectively static, meaning it is derived only from the loop's induction variable and constants. This enhancement aims to optimize loop unrolling by identifying cases where dynamic slices can be treated as static, thus potentially improving performance.
The changes include modifications to the `while_loop_unroller.cc` and `while_loop_unroller.h` files, where the new function is implemented alongside a helper function, `IsEffectivelyStatic`, which recursively checks the static nature of given instructions. Additionally, the commit adds tests to ensure the functionality works correctly, verifying that static dynamic slices are recognized while dynamic ones are not. Overall, this commit enhances the XLA compiler's ability to optimize loop unrolling by improving its analysis of dynamic-slice instructions.
Files changed
- third_party/xla/xla/service/while_loop_unroller.cc
- third_party/xla/xla/service/while_loop_unroller.h
- third_party/xla/xla/service/while_loop_unroller_test.cc
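The recursive check described above can be sketched over a tiny expression tree: an index is "effectively static" if every leaf is either a constant or the loop's induction variable. The expression encoding and names below are ours, not XLA's HLO representation.

```python
# Conceptual sketch of the analysis: recursively verify that an index
# expression is built only from constants and the induction variable.
def is_effectively_static(expr, induction_var):
    kind = expr[0]
    if kind == "const":
        return True
    if kind == "var":
        # Only the loop induction variable counts as "effectively static".
        return expr[1] == induction_var
    if kind in ("add", "mul"):
        # Arithmetic over static operands stays static.
        return all(is_effectively_static(op, induction_var) for op in expr[1:])
    return False  # anything else (e.g. a runtime parameter) is dynamic

# i * 4 + 1 depends only on the induction variable `i` and constants.
static_idx = ("add", ("mul", ("var", "i"), ("const", 4)), ("const", 1))
# i + n depends on a runtime value `n`, so it is truly dynamic.
dynamic_idx = ("add", ("var", "i"), ("var", "n"))
```

When the check succeeds, each unrolled iteration can substitute a concrete value for `i`, turning the dynamic-slice into a plain static slice.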
This commit temporarily removes the affine fuzz test from the codebase due to issues with the grammar bzl in the current version of fuzztest being used. The changes include the deletion of several files related to the affine fuzz test, including the grammar definition and associated test files. Specifically, the affine fuzz grammar and the corresponding simplifier fuzz test code have been removed, as well as related build configurations.
The commit reflects a necessary adjustment to maintain build integrity, as the fuzz test could not be compiled with the existing setup. By removing these components, the developers aim to streamline the build process until the underlying issues with the fuzztest grammar are resolved.
Files changed
- third_party/xla/build_tools/build.py
- third_party/xla/xla/service/gpu/model/fuzztest/BUILD
- third_party/xla/xla/service/gpu/model/fuzztest/affine_fuzz.g4
- third_party/xla/xla/service/gpu/model/fuzztest/affine_simplifier_fuzz_test.cc