How to Implement Caching with TGI (Step by Step)

📖 6 min read•1,038 words•Updated May 7, 2026

Implementing Caching with TGI

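We’re building a caching layer in front of TGI (Text Generation Inference, Hugging Face’s inference server) so that repeated prompts are answered from memory instead of triggering a new inference call, because waiting for API calls is so last season.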

Prerequisites

Python 3.11+
pip install huggingface-hub
pip install transformers
pip install fastapi
pip install uvicorn
pip install requests

Step 1: Setting Up Your Environment

First, you need to set up your Python environment. You can use a virtual environment, which is always a good practice. This keeps your dependencies organized and avoids conflicts. Here’s how to create one:


python3 -m venv tgi-env
source tgi-env/bin/activate # On Windows use `tgi-env\Scripts\activate`
pip install huggingface-hub transformers fastapi uvicorn requests

For those of you who have accidentally worked without a virtual environment—trust me, it gets messy fast.

Step 2: Implementing the Caching Layer

Now, let’s write the caching layer. In this step, we’ll build a simple FastAPI app that calls TGI for inference and caches the results. Here’s a basic implementation:


from fastapi import FastAPI
from fastapi.responses import JSONResponse
import requests
import time
import hashlib

app = FastAPI()
cache = {}  # In-memory cache: cache key -> TGI response

TGI_URL = "https://api.huggingface.co/tgi"  # Placeholder: replace with your actual TGI endpoint

def generate_cache_key(input_text):
    # MD5 is used only as a fast fingerprint here, not for security
    return hashlib.md5(input_text.encode()).hexdigest()

@app.post("/generate/")
def generate(input_text: str):
    # Plain `def`: FastAPI runs it in a thread pool, so the blocking
    # requests.post call below does not stall the event loop
    cache_key = generate_cache_key(input_text)
    if cache_key in cache:
        return JSONResponse(content=cache[cache_key], status_code=200)

    # TGI's generate route expects the prompt under the "inputs" key
    response = requests.post(TGI_URL, json={"inputs": input_text}, timeout=60)

    if response.status_code != 200:
        return JSONResponse(content={"error": "API call failed"}, status_code=response.status_code)

    result = response.json()
    cache[cache_key] = result
    return result

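This code defines a “/generate/” endpoint that checks the cache first. If the result is there, it returns it. Otherwise, it sends the prompt to TGI, caches the response, and then returns it. Because input_text is declared as a plain str parameter, FastAPI reads it from the query string (for example, POST /generate/?input_text=Hello) rather than from a JSON body.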
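
One thing the key in this step does not capture is generation settings. If you later forward parameters such as max_new_tokens or temperature to TGI, two requests with the same prompt but different settings would share one cache entry. Here is a minimal sketch of a parameter-aware key. The helper name make_cache_key and the idea of passing parameters as a plain dict are assumptions for illustration, and SHA-256 is swapped in for MD5 purely as a matter of preference:

```python
import hashlib
import json


def make_cache_key(input_text, parameters=None):
    """Hypothetical helper: fingerprint the prompt together with its parameters."""
    payload = {"inputs": input_text, "parameters": parameters or {}}
    # sort_keys makes the serialization independent of dict ordering
    serialized = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = make_cache_key("Hello", {"max_new_tokens": 20, "temperature": 0.7})
    b = make_cache_key("Hello", {"temperature": 0.7, "max_new_tokens": 20})
    c = make_cache_key("Hello", {"max_new_tokens": 50, "temperature": 0.7})
    print(a == b)  # same parameters in a different order share a key
    print(a == c)  # different parameters get a different key
```

Keep in mind that caching only makes sense when you are happy to return the same text for the same request; with sampling enabled, TGI would otherwise produce a different completion each time.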

Step 3: Running Your FastAPI App

To run your FastAPI application, you’ll need Uvicorn. Here’s how to start it:


uvicorn main:app --host 0.0.0.0 --port 8000 --reload

Now your application is live, and you can hit it at http://localhost:8000/generate/. Keep an eye on the console; that’s where the magic happens. The main:app argument assumes your code is saved as main.py and that the FastAPI instance is named app. If the file has a different name, Uvicorn will fail to import it, so double-check your filenames.
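
To try the endpoint from Python, something like the following should work. It is a small sketch using only the standard library; the function names are made up for this example, and it assumes the host and port from the Uvicorn command above. The prompt goes into the query string because the endpoint declares input_text as a plain str parameter:

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000/generate/"  # assumes the Step 3 host and port


def build_generate_url(input_text, base_url=BASE_URL):
    """Encode the prompt into the query string, where the endpoint expects it."""
    return base_url + "?" + urllib.parse.urlencode({"input_text": input_text})


def call_generate(input_text, base_url=BASE_URL, timeout=60):
    """POST to the running app and return the decoded JSON body."""
    request = urllib.request.Request(build_generate_url(input_text, base_url), method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(build_generate_url("Hello, world"))
    # call_generate("Hello, world")  # uncomment once the server is running
```

Calling call_generate twice with the same prompt is a quick way to see the cache at work: the second call should come back noticeably faster.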

Step 4: Adding Expiration to the Cache

Now, let’s make our cache a bit smarter. Caching is great, but if you hold onto data forever, you’ll run into stale responses. Here’s how to implement a basic expiration for cached results:


cache_expiry = 300  # Cache expiry time in seconds

# These definitions replace the Step 2 versions; do not keep both endpoints in main.py

def generate_cache_key(input_text):
    return hashlib.md5(input_text.encode()).hexdigest()

@app.post("/generate/")
def generate(input_text: str):
    cache_key = generate_cache_key(input_text)
    current_time = time.time()

    # Serve from the cache only while the entry is younger than cache_expiry
    if cache_key in cache and (current_time - cache[cache_key]['timestamp']) < cache_expiry:
        return JSONResponse(content=cache[cache_key]['data'], status_code=200)

    # TGI's generate route expects the prompt under the "inputs" key
    response = requests.post(TGI_URL, json={"inputs": input_text}, timeout=60)

    if response.status_code != 200:
        return JSONResponse(content={"error": "API call failed"}, status_code=response.status_code)

    result = response.json()
    # Store the response together with the time it was fetched
    cache[cache_key] = {'data': result, 'timestamp': current_time}
    return result

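Now your cache checks the timestamp before returning data. If the entry is older than the expiry time, it fetches new data from TGI and overwrites the entry. Seriously, no one wants stale data, right? Note that expired entries are only replaced when the same input is requested again; they are never deleted, so the dictionary keeps growing.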
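
If you would rather keep the expiry logic out of the endpoint, it can be wrapped in a small class. The sketch below is one possible shape, not part of FastAPI or TGI: TTLCache is a made-up name, and the clock argument exists only so the expiry can be tested without waiting five minutes. Unlike the inline version, it deletes an entry as soon as it is found to be expired:

```python
import time


class TTLCache:
    """Hypothetical helper: a dict-backed cache whose entries expire after ttl seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock  # injectable so tests can simulate the passage of time
        self._entries = {}   # key -> (stored_at, value)

    def get(self, key):
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self.ttl:
            # Expired: delete it so it stops occupying memory
            del self._entries[key]
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (self._clock(), value)

    def __len__(self):
        return len(self._entries)


if __name__ == "__main__":
    fake_now = [0.0]
    cache = TTLCache(ttl=300, clock=lambda: fake_now[0])
    cache.set("k", {"generated_text": "hi"})
    print(cache.get("k"))  # still fresh
    fake_now[0] = 301.0
    print(cache.get("k"))  # expired, so this is a miss
```

Because get returns None on a miss, this sketch cannot distinguish a cached None from an absent entry; that is fine for TGI responses, which are JSON objects.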

The Gotchas

Here are some common pitfalls you may run into:

Memory Consumption: The cache can grow quickly, especially if you’re dealing with diverse inputs. You might want to implement cache size limits or an LRU (Least Recently Used) cache strategy.
Error Handling: Make sure to handle different HTTP response statuses appropriately. If TGI is down, your application should respond gracefully instead of crashing.
Concurrency Issues: If you’re using your app in a multi-threaded environment, consider using thread locks around your cache to avoid race conditions.
Invalidation: Think about how you want to handle cache invalidation. If your data is updated frequently, you’ll need to clear the cache for specific keys.
Rate Limiting: Your TGI deployment or hosting provider may enforce rate limits. Caching helps, but you can still hit those limits if many clients send distinct inputs, or if many identical requests arrive at once before the first response has been cached.
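
The memory and concurrency points above can be addressed together. Below is a minimal sketch of a size-limited LRU cache guarded by a lock, built on the standard library’s OrderedDict. The class name BoundedLRUCache and the max_entries parameter are assumptions for this example, and it leaves out expiry to keep the idea visible:

```python
import threading
from collections import OrderedDict


class BoundedLRUCache:
    """Hypothetical helper: thread-safe cache that evicts the least recently used entry."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._entries = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            if key not in self._entries:
                return default
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]

    def set(self, key, value):
        with self._lock:
            self._entries[key] = value
            self._entries.move_to_end(key)
            while len(self._entries) > self.max_entries:
                self._entries.popitem(last=False)  # drop the least recently used entry

    def __len__(self):
        with self._lock:
            return len(self._entries)


if __name__ == "__main__":
    cache = BoundedLRUCache(max_entries=2)
    cache.set("a", 1)
    cache.set("b", 2)
    cache.get("a")      # touching "a" makes "b" the oldest entry
    cache.set("c", 3)   # exceeds the limit, so "b" is evicted
    print(cache.get("b"), cache.get("a"), cache.get("c"))
```

The lock only protects threads inside one process. If you run Uvicorn with several worker processes, each worker still has its own separate cache.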

Full Code Example

Here’s everything we’ve built so far in one complete piece:


from fastapi import FastAPI
from fastapi.responses import JSONResponse
import requests
import time
import hashlib

app = FastAPI()
cache = {}  # In-memory cache: cache key -> {'data': ..., 'timestamp': ...}
cache_expiry = 300  # Cache expiry time in seconds
TGI_URL = "https://api.huggingface.co/tgi"  # Placeholder: replace with your actual TGI endpoint

def generate_cache_key(input_text):
    # MD5 is used only as a fast fingerprint here, not for security
    return hashlib.md5(input_text.encode()).hexdigest()

@app.post("/generate/")
def generate(input_text: str):
    # Plain `def`: FastAPI runs it in a thread pool, so the blocking
    # requests.post call below does not stall the event loop
    cache_key = generate_cache_key(input_text)
    current_time = time.time()

    # Serve from the cache only while the entry is younger than cache_expiry
    if cache_key in cache and (current_time - cache[cache_key]['timestamp']) < cache_expiry:
        return JSONResponse(content=cache[cache_key]['data'], status_code=200)

    # TGI's generate route expects the prompt under the "inputs" key
    response = requests.post(TGI_URL, json={"inputs": input_text}, timeout=60)

    if response.status_code != 200:
        return JSONResponse(content={"error": "API call failed"}, status_code=response.status_code)

    result = response.json()
    # Store the response together with the time it was fetched
    cache[cache_key] = {'data': result, 'timestamp': current_time}
    return result

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

What's Next

Consider implementing a persistent cache using Redis or similar. This way, your cache remains intact even if the FastAPI app restarts, and it can be shared across several worker processes instead of each one keeping its own copy.
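
As a rough sketch of what that could look like, the two helpers below store responses as JSON and let the store handle expiry. They accept any client object that offers get(key) and setex(key, seconds, value), which is the shape of the redis-py client; the helper names, the InMemoryClient stand-in, and the connection details in the comment are all assumptions for illustration:

```python
import json


def cache_get(client, key):
    """Return the cached value for key, or None on a miss."""
    raw = client.get(key)
    if raw is None:
        return None
    return json.loads(raw)


def cache_set(client, key, value, ttl_seconds=300):
    """Store value as JSON and let the store expire it after ttl_seconds."""
    client.setex(key, ttl_seconds, json.dumps(value))


class InMemoryClient:
    """Stand-in with the same get/setex shape, for trying the helpers without Redis."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def setex(self, key, ttl_seconds, value):
        # This stand-in ignores the TTL; a real Redis server would expire the key
        self._data[key] = value.encode("utf-8")


if __name__ == "__main__":
    # With Redis available (pip install redis and a running server), you would
    # use something like: client = redis.Redis(host="localhost", port=6379)
    client = InMemoryClient()
    cache_set(client, "some-key", {"generated_text": "hi"})
    print(cache_get(client, "some-key"))
    print(cache_get(client, "missing-key"))
```

With expiry handled by the store, the timestamp bookkeeping from Step 4 is no longer needed in the application code.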

FAQ

Q: How do I clear the cache?

A: You can create a new endpoint in your FastAPI app that clears the cache dictionary. Be sure to protect this endpoint with some form of authentication.
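
The core of such an endpoint might look like the sketch below. It is framework-independent and only illustrates the idea: the helper name is made up, and how the token reaches your app (a header, for instance) and where the expected value is stored are left to you.

```python
import hmac


def clear_cache_if_authorized(cache, provided_token, expected_token):
    """Hypothetical helper: empty the cache only when the caller's token matches.

    Returns the number of entries removed, or None if the token was rejected.
    """
    # compare_digest avoids leaking the token through timing differences
    if not hmac.compare_digest(provided_token.encode(), expected_token.encode()):
        return None
    removed = len(cache)
    cache.clear()
    return removed


if __name__ == "__main__":
    cache = {"k1": {"data": "a"}, "k2": {"data": "b"}}
    print(clear_cache_if_authorized(cache, "wrong-token", "s3cret"))  # rejected
    print(clear_cache_if_authorized(cache, "s3cret", "s3cret"))       # cleared
```

In the endpoint itself, a None result would typically be turned into a 401 response.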

Q: Can I cache results for longer than 5 minutes?

A: Sure! Just adjust the cache_expiry variable to your desired value.

Q: What if TGI goes down?

A: Your application should handle error responses from TGI gracefully. You can display a user-friendly message instead of crashing.
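
One option, sketched below, is to fall back to an expired cache entry when the call to TGI fails, and to return a friendly error only when there is nothing cached at all. This is an assumption about how you might want the app to behave, not something TGI or FastAPI does for you. The helper name is hypothetical, the cache layout matches Step 4, and fetch stands for whatever function performs the actual request:

```python
import time


def get_with_fallback(cache, key, fetch, ttl=300, clock=time.time):
    """Hypothetical helper: return (data, source), falling back to stale data on failure.

    cache maps key -> {'data': ..., 'timestamp': ...}, as in Step 4.
    fetch is any zero-argument callable that returns fresh data or raises.
    """
    now = clock()
    entry = cache.get(key)
    if entry is not None and now - entry["timestamp"] < ttl:
        return entry["data"], "cache"
    try:
        data = fetch()
    except Exception:
        if entry is not None:
            # Upstream failed: an expired answer is better than none
            return entry["data"], "stale"
        return {"error": "TGI is currently unavailable"}, "error"
    cache[key] = {"data": data, "timestamp": now}
    return data, "fresh"


if __name__ == "__main__":
    def failing_fetch():
        raise ConnectionError("TGI is down")

    cache = {"k": {"data": {"generated_text": "old answer"}, "timestamp": 0.0}}
    print(get_with_fallback(cache, "k", failing_fetch))      # served stale
    print(get_with_fallback(cache, "other", failing_fetch))  # friendly error
```

The second element of the returned tuple tells the endpoint which status code to send, for example 503 when the source is "error".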

Data Sources

To learn more about TGI and its capabilities, check out the official documentation. For community insights, the Hugging Face GitHub repository listed below is a good place to look.

Repository	Stars	Forks	Open Issues	License	Last Updated
huggingface/text-generation-inference	10,854	1,266	324	Apache-2.0	2026-03-21

Last updated May 08, 2026. Data sourced from official docs and community benchmarks.

🕒 Published: May 7, 2026


Written by Jake Chen

AI educator passionate about making complex agent technology accessible. Created online courses reaching 10,000+ students.
