Implementing Caching with TGI
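We’re building a caching layer for TGI (Hugging Face’s Text Generation Inference server) so that repeated prompts are answered from memory instead of triggering a new inference call, because waiting for API calls is so last season.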
Prerequisites
- Python 3.11+
- pip install huggingface-hub
- pip install transformers
- pip install fastapi
- pip install uvicorn
- pip install requests
Step 1: Setting Up Your Environment
First, you need to set up your Python environment. You can use a virtual environment, which is always a good practice. This keeps your dependencies organized and avoids conflicts. Here’s how to create one:
python3 -m venv tgi-env
source tgi-env/bin/activate # On Windows use `tgi-env\Scripts\activate`
pip install huggingface-hub transformers fastapi uvicorn requests
If you have ever worked without a virtual environment, trust me, it gets messy fast.
Step 2: Implementing the Caching Layer
Now, let’s write the caching layer. In this step, we’ll write a simple FastAPI app that calls TGI for inference and caches the results in an in-memory dictionary. Here’s a basic implementation:
from fastapi import FastAPI
from fastapi.responses import JSONResponse
import requests
import time  # Used in Step 4 for cache expiry
import hashlib

app = FastAPI()
cache = {}  # In-memory cache: cache key -> TGI response
TGI_URL = "https://api.huggingface.co/tgi" # Placeholder: replace with your TGI server's /generate URL

def generate_cache_key(input_text):
    # Hash the prompt so every key has the same fixed length
    return hashlib.md5(input_text.encode()).hexdigest()

@app.post("/generate/")
def generate(input_text: str):
    # A plain def lets FastAPI run this handler in a threadpool,
    # so the blocking requests call does not stall the event loop.
    cache_key = generate_cache_key(input_text)
    if cache_key in cache:
        return JSONResponse(content=cache[cache_key], status_code=200)
    # TGI's /generate route expects the prompt under the "inputs" field
    response = requests.post(TGI_URL, json={"inputs": input_text}, timeout=30)
    if response.status_code != 200:
        return JSONResponse(content={"error": "API call failed"}, status_code=response.status_code)
    result = response.json()
    cache[cache_key] = result
    return result
This code defines an endpoint “/generate/” that checks the cache. If the result is there, it returns it. Otherwise, it makes a request to the TGI API, caches the response, and then returns it.
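To see the check-then-fetch flow in isolation, here is a minimal sketch that runs without FastAPI or a TGI server. The names `cached_generate` and `fake_backend` are hypothetical, and `fake_backend` is only a stand-in for the real TGI call; it records each invocation so you can confirm the second request never reaches the backend.

```python
import hashlib
from typing import Callable

cache: dict[str, dict] = {}


def generate_cache_key(input_text: str) -> str:
    # Same keying scheme as the app: MD5 of the UTF-8 encoded prompt.
    return hashlib.md5(input_text.encode()).hexdigest()


def cached_generate(input_text: str, backend: Callable[[str], dict]) -> dict:
    """Return the cached result for input_text, calling backend only on a miss."""
    key = generate_cache_key(input_text)
    if key in cache:
        return cache[key]
    result = backend(input_text)
    cache[key] = result
    return result


# Hypothetical stand-in for the TGI request; it logs every call it receives.
calls: list[str] = []


def fake_backend(prompt: str) -> dict:
    calls.append(prompt)
    return {"generated_text": prompt.upper()}


first = cached_generate("hello", fake_backend)   # miss: backend is called
second = cached_generate("hello", fake_backend)  # hit: served from the cache
print(len(calls))  # 1
```

Passing the backend in as a function is a choice made for this sketch only; it keeps the caching logic easy to exercise without network access.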
Step 3: Running Your FastAPI App
To run your FastAPI application, you’ll need Uvicorn. Here’s how to start it:
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
Now your application is live and accepts POST requests at http://localhost:8000/generate/. Keep an eye on the console; that’s where the magic happens. The `main:app` argument assumes your code is saved as `main.py`; if the file has a different name, Uvicorn will fail to import the app. Just double-check your filenames.
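Because the handler declares `input_text: str` as a plain function parameter, FastAPI reads it from the query string rather than from a JSON body. The sketch below shows one way to call the endpoint with only the standard library. The helper names `build_generate_url` and `call_generate` are hypothetical, and the base URL assumes the Uvicorn command above.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_generate_url(base_url: str, input_text: str) -> str:
    # input_text travels as a query parameter, so it must be URL-encoded.
    query = urlencode({"input_text": input_text})
    return f"{base_url.rstrip('/')}/generate/?{query}"


def call_generate(base_url: str, input_text: str, timeout: float = 30.0) -> dict:
    # POST with an empty body; the prompt is carried in the query string.
    request = urllib.request.Request(
        build_generate_url(base_url, input_text), method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


url = build_generate_url("http://localhost:8000", "Hello world")
print(url)  # http://localhost:8000/generate/?input_text=Hello+world

# With the server running, the same prompt sent twice should hit the cache
# on the second call:
# print(call_generate("http://localhost:8000", "Hello world"))
```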
Step 4: Adding Expiration to the Cache
Now, let’s make our cache a bit smarter. Caching is great, but if you hold onto data forever, you’ll run into stale responses. Here’s how to implement a basic expiration for cached results:
cache_expiry = 300 # Cache expiry time in seconds

def generate_cache_key(input_text):
    return hashlib.md5(input_text.encode()).hexdigest()

@app.post("/generate/")
def generate(input_text: str):
    cache_key = generate_cache_key(input_text)
    current_time = time.time()
    # Serve the cached entry only while it is younger than cache_expiry
    if cache_key in cache and (current_time - cache[cache_key]['timestamp']) < cache_expiry:
        return JSONResponse(content=cache[cache_key]['data'], status_code=200)
    # TGI's /generate route expects the prompt under the "inputs" field
    response = requests.post(TGI_URL, json={"inputs": input_text}, timeout=30)
    if response.status_code != 200:
        return JSONResponse(content={"error": "API call failed"}, status_code=response.status_code)
    result = response.json()
    # Store the response together with the time it was fetched
    cache[cache_key] = {'data': result, 'timestamp': current_time}
    return result
Now, your cache checks the timestamp before returning data. If the cache is older than the expiry time, it fetches new data from TGI. Seriously, no one wants stale data, right?
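The expiry rule is easier to verify when the clock can be controlled. Here is a minimal sketch of the same idea wrapped in a small class; `TTLCache` and its `clock` parameter are hypothetical names introduced for illustration. Unlike the handler above, this version also deletes an entry once it is found to be expired.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Dictionary-backed cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, data = entry
        # Same rule as the handler: valid only while age < ttl_seconds.
        if self.clock() - stored_at >= self.ttl_seconds:
            del self._entries[key]
            return None
        return data

    def set(self, key: str, data: dict) -> None:
        self._entries[key] = (self.clock(), data)


# A fake clock makes the expiry visible without waiting five minutes.
now = [0.0]
ttl_cache = TTLCache(ttl_seconds=300, clock=lambda: now[0])
ttl_cache.set("k", {"generated_text": "hi"})

now[0] = 299.0
fresh = ttl_cache.get("k")    # still inside the 300-second window

now[0] = 300.0
expired = ttl_cache.get("k")  # the window has closed, so this is None
```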
The Gotchas
Here are some common pitfalls you may run into:
- Memory Consumption: The cache can grow quickly, especially if you’re dealing with diverse inputs. You might want to implement cache size limits or an LRU (Least Recently Used) cache strategy.
- Error Handling: Make sure to handle different HTTP response statuses appropriately. If TGI is down, your application should respond gracefully instead of crashing.
- Concurrency Issues: If you’re using your app in a multi-threaded environment, consider using thread locks around your cache to avoid race conditions.
- Invalidation: Think about how you want to handle cache invalidation. If your data is updated frequently, you’ll need to clear the cache for specific keys.
- Rate Limiting: TGI may have rate limits. Caching helps, but you can still hit them if many clients send different prompts, or if many clients request the same uncached prompt at the same moment.
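The memory and concurrency points above can be addressed together. What follows is a minimal sketch, not a drop-in replacement: `LRUCache` and `max_entries` are hypothetical names, and the class uses `collections.OrderedDict` plus a `threading.Lock` from the standard library to cap the cache size and serialize access.

```python
import threading
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Size-bounded cache that evicts the least recently used entry."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._entries: OrderedDict[str, dict] = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key: str) -> Optional[dict]:
        with self._lock:
            if key not in self._entries:
                return None
            # Mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]

    def set(self, key: str, value: dict) -> None:
        with self._lock:
            self._entries[key] = value
            self._entries.move_to_end(key)
            # Drop the oldest entries until the cache fits the limit again.
            while len(self._entries) > self.max_entries:
                self._entries.popitem(last=False)

    def __len__(self) -> int:
        with self._lock:
            return len(self._entries)


lru = LRUCache(max_entries=2)
lru.set("a", {"generated_text": "A"})
lru.set("b", {"generated_text": "B"})
lru.get("a")                           # touch "a", so "b" is now the oldest
lru.set("c", {"generated_text": "C"})  # over the limit: "b" is evicted
```

A single lock is the simplest option and is usually enough for a small cache; it does not prevent two threads from both missing on the same key and both calling TGI.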
Full Code Example
Here’s everything we’ve built so far in one complete piece:
from fastapi import FastAPI
from fastapi.responses import JSONResponse
import requests
import time
import hashlib

app = FastAPI()
cache = {}  # In-memory cache: cache key -> {'data': ..., 'timestamp': ...}
cache_expiry = 300 # Cache expiry time in seconds
TGI_URL = "https://api.huggingface.co/tgi" # Placeholder: replace with your TGI server's /generate URL

def generate_cache_key(input_text):
    # Hash the prompt so every key has the same fixed length
    return hashlib.md5(input_text.encode()).hexdigest()

@app.post("/generate/")
def generate(input_text: str):
    # A plain def lets FastAPI run this handler in a threadpool,
    # so the blocking requests call does not stall the event loop.
    cache_key = generate_cache_key(input_text)
    current_time = time.time()
    # Serve the cached entry only while it is younger than cache_expiry
    if cache_key in cache and (current_time - cache[cache_key]['timestamp']) < cache_expiry:
        return JSONResponse(content=cache[cache_key]['data'], status_code=200)
    # TGI's /generate route expects the prompt under the "inputs" field
    response = requests.post(TGI_URL, json={"inputs": input_text}, timeout=30)
    if response.status_code != 200:
        return JSONResponse(content={"error": "API call failed"}, status_code=response.status_code)
    result = response.json()
    cache[cache_key] = {'data': result, 'timestamp': current_time}
    return result

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
What's Next
Consider implementing a persistent cache using Redis or similar. This way, your cache remains intact even if the FastAPI app restarts, so a restart doesn't send every request back to TGI while the cache warms up again.
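To illustrate what "survives a restart" means without requiring a Redis server, the sketch below swaps in SQLite from Python's standard library. It is a stand-in for Redis, not a recommendation over it. The class name `SQLiteCache`, the table layout, and the file name are all assumptions made for this example.

```python
import json
import os
import sqlite3
import tempfile
import time
from typing import Optional


class SQLiteCache:
    """File-backed cache with the same expiry rule as the in-memory version."""

    def __init__(self, path: str, ttl_seconds: float = 300):
        self.ttl_seconds = ttl_seconds
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "key TEXT PRIMARY KEY, data TEXT NOT NULL, stored_at REAL NOT NULL)"
        )
        self._conn.commit()

    def get(self, key: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT data, stored_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        data, stored_at = row
        # Wall-clock time is used because it stays meaningful across restarts.
        if time.time() - stored_at >= self.ttl_seconds:
            return None
        return json.loads(data)

    def set(self, key: str, data: dict) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO cache (key, data, stored_at) VALUES (?, ?, ?)",
            (key, json.dumps(data), time.time()),
        )
        self._conn.commit()

    def close(self) -> None:
        self._conn.close()


# Simulate a restart: write with one instance, read with a fresh one.
db_path = os.path.join(tempfile.mkdtemp(), "tgi_cache.db")
first_run = SQLiteCache(db_path)
first_run.set("k", {"generated_text": "hi"})
first_run.close()

second_run = SQLiteCache(db_path)
survived = second_run.get("k")
second_run.close()
```

Note that a `sqlite3` connection is bound to the thread that created it by default, so a real FastAPI deployment would need one connection per thread, or a shared store such as Redis.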
FAQ
Q: How do I clear the cache?
A: You can create a new endpoint in your FastAPI app that clears the cache dictionary. Be sure to protect this endpoint with some form of authentication.
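As a rough sketch of the logic such an endpoint could wrap, the function below clears the cache only when the caller presents the expected token. `ADMIN_TOKEN` and `clear_cache` are hypothetical names, the token value is a placeholder, and a shared token is only one simple form of authentication.

```python
import hmac

# Example cache contents so the effect of clearing is visible.
cache: dict[str, dict] = {"abc": {"generated_text": "hi"}}

# Hypothetical shared secret; load it from configuration in a real deployment.
ADMIN_TOKEN = "change-me"


def clear_cache(provided_token: str) -> int:
    """Empty the cache and return how many entries were removed.

    Raises PermissionError when the token does not match.
    """
    # compare_digest avoids leaking the token through timing differences.
    if not hmac.compare_digest(provided_token.encode(), ADMIN_TOKEN.encode()):
        raise PermissionError("invalid admin token")
    removed = len(cache)
    cache.clear()
    return removed
```

Inside a FastAPI handler, the `PermissionError` branch would typically be translated into a 401 or 403 response.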
Q: Can I cache results for longer than 5 minutes?
A: Sure! Just adjust the cache_expiry variable to your desired value.
Q: What if TGI goes down?
A: Your application should handle error responses from TGI gracefully. You can display a user-friendly message instead of crashing.
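One way to picture "gracefully" is to turn backend failures into explicit status codes instead of letting the exception escape. This is a minimal sketch under stated assumptions: `safe_generate` is a hypothetical helper, the backend is passed in as a function, and the built-in `ConnectionError` and `TimeoutError` stand in for network failures. With the `requests` library you would catch `requests.exceptions.RequestException` and its subclasses instead.

```python
from typing import Callable


def safe_generate(input_text: str, backend: Callable[[str], dict]) -> tuple[int, dict]:
    """Call the backend and map failures to an HTTP-style (status, body) pair."""
    try:
        return 200, backend(input_text)
    except TimeoutError:
        # The upstream server did not answer in time.
        return 504, {"error": "TGI timed out"}
    except ConnectionError:
        # The upstream server could not be reached at all.
        return 502, {"error": "TGI is unreachable"}


# Hypothetical backends that simulate the three outcomes.
def ok_backend(prompt: str) -> dict:
    return {"generated_text": prompt}


def down_backend(prompt: str) -> dict:
    raise ConnectionError("connection refused")


def slow_backend(prompt: str) -> dict:
    raise TimeoutError("read timed out")


ok_status, ok_body = safe_generate("hi", ok_backend)
down_status, down_body = safe_generate("hi", down_backend)
slow_status, slow_body = safe_generate("hi", slow_backend)
```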
Data Sources
To learn more about TGI and its capabilities, check out the official documentation. For community insights, you might find the Hugging Face GitHub repository helpful.
| Repository | Stars | Forks | Open Issues | License | Last Updated |
|---|---|---|---|---|---|
| huggingface/text-generation-inference | 10,854 | 1,266 | 324 | Apache-2.0 | 2026-03-21 |
Last updated May 08, 2026. Data sourced from official docs and community benchmarks.