
Conversation


AlexsanderHamir commented Nov 20, 2025

This PR is not meant to be merged; I will cherry-pick from here and merge into main gradually.

Title

Reduce memory cost of importing the completion function

Relevant issues

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory. Adding at least 1 test is a hard requirement - see details
  • I have added a screenshot of my new test passing locally
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

🧹 Refactoring

Context

Our current import strategy pulls in large portions of the codebase, even when only a single function is needed. Many modules perform heavy work at import time or bring in sizable dependencies, so importing the completion function triggers unnecessary initialization and memory allocation.

While this PR reduces the overhead for the completion function, it doesn’t fully resolve the underlying issue. A broader cleanup of our import structure is required for a complete fix.

Changes

  • Lazy-loaded the heaviest libraries identified in the memory profile during completion import.

Memory Differences

Before

Screenshot 2025-11-19 at 5 37 23 PM Screenshot 2025-11-19 at 5 38 03 PM

After

Screenshot 2025-11-19 at 5 37 41 PM Screenshot 2025-11-19 at 5 37 52 PM

This change removes 67 MB of memory consumption at import time, reducing memory usage when importing the LiteLLM completion function from 200 MB to 140 MB. Further work brings this down to 20 MB, but something is still being triggered that causes memory to spike.
Lazy-load most functions and response types from utils.py to avoid loading
tiktoken and other heavy dependencies at import time. This significantly
reduces memory usage when importing completion from litellm.

Changes:
- Made utils functions (exception_type, get_litellm_params, ModelResponse, etc.)
  lazy-loaded via __getattr__
- Made ALL_LITELLM_RESPONSE_TYPES lazy-loaded
- Fixed circular imports by updating files to import directly from litellm.utils
  or litellm.types.utils instead of from litellm
- Kept client decorator as immediate import since it's used at function
  definition time

Only client is now imported immediately from utils.py; all other utils
functions and response types are loaded on-demand when accessed.
Lazy-load tiktoken and default_encoding from litellm_core_utils to avoid
loading these heavy dependencies at import time. This further reduces memory
usage when importing completion from litellm.

Changes:
- Made tiktoken imports lazy-loaded in utils.py, main.py, and token_counter.py
- Made default_encoding lazy-loaded in token_counter.py and utils.py
- Made get_modified_max_tokens lazy-loaded in utils.py (only used internally)
- Made encoding attribute lazy-loaded via __getattr__ in __init__.py
- Removed top-level tiktoken and Encoding imports that were loading at module level

tiktoken and default_encoding are now only loaded when token counting or
encoding functions are actually called, not when importing completion.
Refactor repetitive lazy import and caching code into reusable helper
functions to improve code maintainability and readability.

Changes:
- Added _lazy_import_and_cache() generic helper for lazy importing with caching
- Added _lazy_import_from() convenience wrapper for common import pattern
- Replaced 4 repetitive code blocks with simple function calls
- Maintains same performance: imports cached after first access, zero
  overhead on subsequent calls

The helper functions eliminate code duplication while preserving the
performance benefits of cached lazy loading.
- Remove eager import of AsyncHTTPHandler and HTTPHandler from __init__.py
- Make module_level_aclient and module_level_client lazy-loaded via __getattr__
- HTTP handler clients are now instantiated on first access, not at import time
- Reduces memory footprint when importing completion from litellm
Lazy-load Cache, DualCache, RedisCache, and InMemoryCache from caching.caching
to avoid loading these dependencies at import time. This further reduces memory
usage when importing completion from litellm.

Changes:
- Made Cache, DualCache, RedisCache, and InMemoryCache lazy-loaded via __getattr__ in __init__.py
- Removed top-level caching class imports that were loading at module level
- Updated cache type annotation to use forward reference string to avoid runtime import
- Caching classes are now only loaded when actually accessed, not when importing completion

Performance:
- First access: 0.001-0.008ms (negligible latency)
- Cached access: 0.000ms (no latency penalty)
- Classes are cached in globals() after first access to avoid repeated import overhead

This follows the same pattern as HTTP handlers lazy loading and avoids latency
issues by caching imported classes after first access.

vercel bot commented Nov 20, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: litellm | Deployment: Error | Preview: Error | Comments: - | Updated (UTC): Nov 25, 2025 0:33am

1. Grouped lazy imports into the same functions.
2. Stopped importing more than one library when only a single name was actually used.
…e_index_from_tool_calls to reduce import-time memory cost
- Convert most types.utils imports to lazy loading via __getattr__
- Add _lazy_import_types_utils function for on-demand imports
- Keep LlmProviders and PriorityReservationSettings as direct imports (needed for module-level initialization)
- Add TYPE_CHECKING imports for type annotations (CredentialItem, BudgetConfig, etc.)
- Significantly reduces import cascade and memory usage at import time
- Make provider_list and priority_reservation_settings lazy-loaded via __getattr__
- Lazy load types.proxy.management_endpoints.ui_sso imports (DefaultTeamSSOParams, LiteLLM_UpperboundKeyGenerateParams)
- Keep LlmProviders and PriorityReservationSettings as direct imports (needed by other modules)
- Remove non-essential comments
- Significantly reduces import-time memory usage
- Make KeyManagementSystem fully lazy-loaded via __getattr__
- Make KeyManagementSettings lazy-loadable via __getattr__
- Keep KeyManagementSettings as direct import (needed for _key_management_settings initialization during import)
- Add TYPE_CHECKING imports for type annotations
- Significantly reduces import-time memory usage
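The TYPE_CHECKING technique used throughout these commits keeps classes available to type checkers without paying their import cost at runtime. A minimal sketch, with `OrderedDict` standing in for a class like KeyManagementSettings:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers; never imported at runtime.
    from collections import OrderedDict  # stand-in for a heavy LiteLLM type

def use_settings(settings: "OrderedDict") -> int:
    """The string annotation is a forward reference, so the class need not
    exist at runtime for this function to be defined or called."""
    return len(settings)
```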
- Move client import from line 1053 to right before main.py import (line 1328)
- This delays loading utils.py (which imports tiktoken) until after most other imports
- client cannot be fully lazy-loaded because main.py needs it at import time for @client decorator
- Reduces memory footprint during early import phase
AlexsanderHamir force-pushed the litellm_memory_import_issue branch from afc07ed to b03746b on November 22, 2025 at 20:15.
- Remove direct import of BytezChatConfig from early in __init__.py
- Add lazy loading via __getattr__ pattern
- Delays loading bytez transformation module until BytezChatConfig is accessed
- main.py still works (imports directly), utils.py works (accesses via litellm.BytezChatConfig)
- Remove direct import of CustomLLM from early in __init__.py
- Add lazy loading via __getattr__ pattern
- Delays loading custom_llm module until CustomLLM is accessed
- images/main.py still works (imports directly from source)
- Proxy examples still work (access via litellm.CustomLLM)
- Remove direct import of AmazonConverseConfig from early in __init__.py
- Add lazy loading via __getattr__ pattern
- Delays loading converse_transformation module until AmazonConverseConfig is accessed
- common_utils.py still works (accesses via litellm.AmazonConverseConfig())
- invoke_handler.py still works (imports directly from source)
Add azure_chat_completions to the _lazy_vars dictionary in __getattr__
to fix ImportError when other modules (e.g., images/main.py) try to
import it from litellm.main. This ensures backward compatibility with
modules that import these handlers directly.
… memory

Make openai and its submodules (_parsing, _pydantic, ResponseFormat, OpenAIError)
lazy-loaded in the @client decorator to avoid expensive import when importing
the decorator. This defers the openai import until the decorator actually runs,
significantly reducing import-time memory cost.

Changes:
- Remove top-level 'import openai' from utils.py
- Add lazy import helpers for openai module and submodules
- Replace openai.* references in @client decorator with lazy-loaded versions
- Update exception handling to use lazy-loaded openai.APIError, Timeout, etc.
Remove the unused import of litellm._service_logger from utils.py to reduce
import-time memory cost. The module is not used in utils.py and can be
imported directly where needed.
Make litellm.litellm_core_utils.audio_utils.utils lazy-loaded using a cached
helper function to avoid expensive import when importing the @client decorator.
The module is only loaded when actually needed (during transcription calls)
and cached for subsequent use to maintain performance.
…mory

Remove unused top-level imports of litellm.llms and litellm.llms.gemini from
utils.py. These are not used directly and submodule imports (from litellm.llms.*)
will automatically import the parent package when needed, avoiding expensive
imports at module load time.
Make CachingHandlerResponse and LLMCachingHandler lazy-loaded using cached
helper functions to avoid expensive import when importing the @client decorator.
These classes are only needed when the decorator actually runs, not at import time.
Make CustomGuardrail lazy-loaded using a cached helper function to avoid
expensive import when importing the @client decorator. The class is only
needed when get_applied_guardrails is called, not at import time.
Make CustomLogger lazy-loaded using a cached helper function and TYPE_CHECKING
for type hints to avoid expensive import when importing the @client decorator.
All type hints use string literals to support forward references. The class is
only loaded when actually needed (isinstance checks), not at import time.
Fix NameError by replacing direct LLMCachingHandler usage with lazy loader
function call in the async wrapper. This ensures the class is properly loaded
when needed rather than at import time.
Remove the unused import of BaseVectorStore from utils.py to reduce
import-time memory cost. The class is not used in utils.py and can be
imported directly where needed.
…-time memory

Make get_litellm_metadata_from_kwargs lazy-loaded using a cached helper
function to avoid expensive import when importing the @client decorator.
The function is only needed when get_end_user_id_for_cost_tracking is
called, not at import time.
Make CredentialAccessor lazy-loaded using a cached helper function to avoid
expensive import when importing the @client decorator. The class is only
needed when load_credentials_from_list is called, not at import time.
…t-time memory

Make _get_response_headers, exception_type, and get_error_message
lazy-loaded using cached helper functions to avoid expensive import
when importing the @client decorator. These functions are only needed
when exception handling occurs, not at import time.
Update main.py to use lazy-loaded exception_type from utils.py instead
of direct import. This fixes the ImportError when importing completion
from litellm, since exception_type is now lazy-loaded in utils.py.
Make get_llm_provider lazy-loaded using a cached helper function to avoid
expensive import when importing the @client decorator. The function is only
needed when provider logic is accessed, not at import time.
Update main.py to use lazy-loaded get_llm_provider from utils.py instead
of direct import. This fixes the ImportError when importing completion
from litellm, since get_llm_provider is now lazy-loaded in utils.py.
… memory

Make get_supported_openai_params lazy-loaded using a cached helper function
to avoid expensive import when importing the @client decorator. The function
is only needed when optional params are processed, not at import time.
…rt-time memory

Make LiteLLMResponseObjectHandler, _handle_invalid_parallel_tool_calls,
convert_to_model_response_object, convert_to_streaming_response, and
convert_to_streaming_response_async lazy-loaded using cached helper functions
and __getattr__ to avoid expensive import when importing the @client decorator.
These functions are only needed when response conversion occurs, not at import time.
Make get_api_base lazy-loaded using a cached helper function and __getattr__
to avoid expensive import when importing the @client decorator. The function
is only needed when API base resolution occurs, not at import time.
…to reduce import-time memory

Make get_formatted_prompt, get_response_headers, ResponseMetadata,
_parse_content_for_reasoning, LiteLLMLoggingObject, and
redact_message_input_output_from_logging lazy-loaded using cached helper
functions and __getattr__ to avoid expensive imports when importing the
@client decorator. These are only needed when response processing occurs,
not at import time.
Move the TYPE_CHECKING block for LiteLLMLoggingObject to after the typing
imports to fix the NameError: name 'TYPE_CHECKING' is not defined error.
Make CustomStreamWrapper lazy-loaded using a cached helper function and
__getattr__ to avoid expensive import when importing the @client decorator.
The class is only needed when streaming responses are processed, not at
import time. This is required since it's imported by litellm/llms/openai_like/chat/handler.py.
…port-time memory

Move BaseGoogleGenAIGenerateContentConfig to TYPE_CHECKING block since it's
only used in type annotations. Update the type hint to use a string literal
to avoid runtime import when importing the @client decorator.
Move BaseOCRConfig to TYPE_CHECKING block since it's only used in type
annotations. The type hint already uses a string literal, so no runtime
import is needed when importing the @client decorator.
Move BaseSearchConfig to TYPE_CHECKING block since it's only used in type
annotations. The type hint already uses a string literal, so no runtime
import is needed when importing the @client decorator.
Move Base*Config classes and related imports to TYPE_CHECKING block or
lazy load them to reduce import-time memory cost. This follows the same
pattern used in __init__.py.

Changes:
- Move all Base*Config classes used only in type hints to TYPE_CHECKING block
- Create lazy loader functions for runtime-used Base*Config classes
- Lazy load BedrockModelInfo, CohereModelInfo, MistralOCRConfig
- Lazy load HTTPHandler, AsyncHTTPHandler
- Lazy load get_num_retries_from_retry_policy, reset_retry_policy, get_secret
- Lazy load ANTHROPIC_API_ONLY_HEADERS and AnthropicThinkingParam
- Update all type hints to use string literals for forward references
- Update all runtime usages to call lazy loader functions
- Expose lazy-loaded items via __getattr__ for backward compatibility

This significantly reduces import-time memory footprint while maintaining
full backward compatibility.
