Flask Memory Leaks

The ones you will hit in production.
Not theory — real incidents.

April 2026  ·  ~15 min read

Most Flask memory leak guides look in the wrong place.

Flask makes it easy to write a web app. It does not make it easy to understand what happens to that app when it runs inside gunicorn workers for days at a time, under real traffic, on a container platform that charges for RAM.

Most guides focus on object reference cycles and the garbage collector. The leaks that will actually cost you money come from a different set of patterns: persistent connections, module-level state, cache TTL mismatches, and trusting user-supplied URLs. This article covers all of them.

1

Never proxy streaming content through Flask

This is the single most expensive mistake you can make. It is also the most tempting, because the code looks completely reasonable:

Dangerous — this will destroy your server under load

from flask import Response, stream_with_context
import requests

# external_url is the user-configured upstream, defined elsewhere
def stream_proxy():
    upstream = requests.get(external_url, stream=True)

    def generate():
        for chunk in upstream.iter_content(chunk_size=4096):
            yield chunk

    return Response(stream_with_context(generate()))

The problem is not the code itself. The problem is what happens at the OS level. Each concurrent listener holds open two persistent TCP sockets: one from the external server to your Flask process, and one from your Flask process to the browser. Each socket has a kernel receive buffer that the OS allocates regardless of whether your Python code is reading it. On Linux, these buffers can reach 8 MB per socket under load.
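
You can see this kernel-side memory directly. A minimal sketch: create a TCP socket and ask the OS what it allocated for the receive and send buffers (the default sizes vary by kernel and sysctl configuration, and Linux auto-tunes them upward under load):

```python
import socket

# The kernel allocates these buffers per socket, outside the Python heap.
# tracemalloc never sees them, but they count toward your container's RSS.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
sndbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print(f"receive buffer: {rcvbuf} bytes, send buffer: {sndbuf} bytes")
sock.close()
```

These are only the defaults; the per-socket ceiling under load comes from `net.ipv4.tcp_rmem` and can be far larger.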

The math is brutal

100 concurrent audio listeners: 100 × 16 MB = 1.6 GB of kernel buffer memory attributed to your container. 500 listeners: 8 GB. This is not Python heap memory — tracemalloc will not show it. But your container's RSS will keep climbing until the OOM killer arrives.

The rule:

Flask workers are not designed for thousands of persistent long-lived connections. If the content is HTTPS, redirect the browser directly to it. If it is HTTP, require the upstream to add HTTPS support — do not solve their infrastructure problem with your RAM.

Safe

from urllib.parse import urlparse
from flask import redirect, jsonify

parsed = urlparse(stream_url)

# For HTTPS streams: redirect and let the browser connect directly
if parsed.scheme == 'https':
    return redirect(stream_url)

# For HTTP streams: refuse and tell the user to fix their upstream
return jsonify({'error': 'HTTP streams not supported. Use HTTPS.'}), 400

If you absolutely must proxy HTTP streams for a business reason you cannot work around, use a purpose-built reverse proxy (nginx, Caddy, HAProxy) in front of Flask, and route streaming paths there. These tools are built for persistent connections. Flask is not.

2

Shared sessions and the connection pool trap

A module-level requests.Session() is a standard pattern for reusing HTTP connections efficiently:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Module level — shared across all requests in the worker process
_http = requests.Session()
_http.mount('https://', HTTPAdapter(max_retries=Retry(total=0)))

This is fine when you use it for short-lived API calls (JSON metadata, REST endpoints). It becomes a slow-motion disaster when any request in the session connects to a streaming server.

Here is what happens: you call _http.get(audio_stream_url, timeout=3) without stream=True. Requests buffers data for 3 seconds before the read timeout fires. The response is closed — but the underlying TCP connection to the audio server may be kept open by urllib3's connection pool, waiting to be reused. The stream server keeps pushing audio data. That data fills the kernel's TCP receive buffer indefinitely, attributed to your process as RSS.

The fix depends on what you are connecting to:

  • For short-lived API calls: the shared session is fine. Keep using it.
  • For anything that might be a streaming endpoint: use a dedicated, short-lived session or use requests.get() directly (which does not pool connections).
  • For ICY metadata fetches from audio streams: use a raw socket so you control exactly when the TCP connection closes.
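
The raw-socket approach can look like the sketch below. It is deliberately simplified: `host`, `port`, and `path` are placeholders, and a real Icecast/Shoutcast client would honour the `icy-metaint` header to find the metadata offset rather than scanning the first bytes as this sketch does. The point is the `finally`: the TCP connection closes exactly when you say so, with no pool keeping it alive.

```python
import re
import socket
from typing import Optional

def build_icy_request(host: str, path: str) -> bytes:
    # Ask the server to interleave ICY metadata into the stream
    return (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "Icy-MetaData: 1\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")

def parse_stream_title(metadata: bytes) -> Optional[str]:
    # Metadata blocks look like: StreamTitle='Artist - Song';StreamUrl='';
    match = re.search(rb"StreamTitle='([^']*)'", metadata)
    return match.group(1).decode("utf-8", "replace") if match else None

def fetch_stream_title(host: str, port: int, path: str) -> Optional[str]:
    sock = socket.create_connection((host, port), timeout=3)
    try:
        sock.sendall(build_icy_request(host, path))
        data = sock.recv(65536)  # bounded read: headers plus first chunk
        return parse_stream_title(data)
    finally:
        sock.close()  # connection closes here, no pooling, no lingering buffer
```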

The tell:

If your Railway/Fly/Render memory graph shows continuous growth that correlates perfectly with network ingress, but tracemalloc shows almost nothing, you are looking at kernel TCP buffer accumulation, not Python heap growth. The shared session is the first place to check.

3

The cache TTL / polling interval mismatch

This one is subtle. You add a server-side cache to an expensive endpoint. You set a TTL. You add a client-side polling interval. You ship. Memory climbs. The problem: your cache TTL is shorter than your polling interval.

Leaking — cache always cold when the client polls

# Server: cache expires after 10 seconds
_cache[tenant_id] = {'data': result, 'expires': time.time() + 10}

# Client: polls every 15 seconds
setInterval(fetchNowPlaying, 15000);

Every single browser poll hits an expired cache. The cache is never warm when the client asks. You might as well have no cache at all. Every poll triggers the full expensive operation — which in this case was an outbound HTTP connection to an external server, with all the socket overhead that entails.

Fixed — cache outlives the polling interval

# Server: cache expires after 45 seconds, three polling intervals
_cache[tenant_id] = {'data': result, 'expires': time.time() + 45}

# Client: polls every 15 seconds, so most polls hit a warm cache
setInterval(fetchNowPlaying, 15000);

With the corrected values: 100 concurrent listeners on the same station all share one cached response. One external fetch per 45 seconds instead of one per listener per poll. The connection overhead drops by roughly two orders of magnitude.

The rule:

Server cache TTL should be at least 2× the client polling interval. This ensures the cache is always warm when clients poll, and multiple concurrent clients share a single upstream fetch.
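
The rule fits in a small get-or-fetch helper. A sketch, assuming the same `{'data', 'expires'}` entry shape used above; `fetch_fn` and the TTL are yours to supply:

```python
import time

_cache: dict = {}

def cached(key, fetch_fn, ttl: float = 45.0):
    """Return a cached value, refreshing only when the TTL has lapsed."""
    entry = _cache.get(key)
    now = time.time()
    if entry is not None and entry['expires'] > now:
        return entry['data']          # warm hit: no upstream connection
    data = fetch_fn()                 # at most one upstream fetch per TTL window
    _cache[key] = {'data': data, 'expires': now + ttl}
    return data
```

Every concurrent client asking for the same key inside the TTL window shares the single cached result.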

4

Module-level state that grows forever

In Flask, module-level variables are shared across all requests handled by a single worker process. This is intentional and useful for caches. It becomes a problem when those data structures have no upper bound on their size and no cleanup mechanism.

Unbounded — these grow by one entry per unique key, forever

_rate_windows = defaultdict(deque)   # one deque per IP address
_results_cache = {}                  # one entry per unique query string
_fail_counts = {}                    # one entry per tenant, never deleted

A defaultdict(deque) used for rate limiting is a particularly common offender. Every unique IP address or tenant ID that ever hits your app creates a new deque entry. The deque entries are small, but they accumulate. On a public-facing app with organic traffic, you can have tens of thousands of entries within hours.

Three patterns to fix this:

Pattern 1: Periodic sweep on a counter

_cache = {}          # entries shaped like {'data': ..., 'expires': ...}
_cleanup_counter = 0

def maybe_cleanup():
    global _cleanup_counter
    _cleanup_counter += 1
    if _cleanup_counter < 1000:
        return
    _cleanup_counter = 0
    now = time.time()
    expired = [k for k, v in _cache.items() if v['expires'] <= now]
    for k in expired:
        _cache.pop(k, None)

Pattern 2: Bounded size with eviction

# Cap at N entries, evict oldest when full
_CACHE_MAX = 500

if len(_cache) >= _CACHE_MAX:
    oldest = sorted(_cache, key=lambda k: _cache[k]['expires'])[:100]
    for k in oldest:
        _cache.pop(k, None)

Pattern 3: Use plain dict, not defaultdict

defaultdict creates entries on read — easy to accumulate unknowingly

_windows = defaultdict(deque)
_ = _windows[ip]  # Creates an entry even if you never use it

plain dict only creates entries when you explicitly set them

_windows = {}
if ip not in _windows:
    _windows[ip] = deque()

5

User-supplied URLs that point to streams

This is the one nobody writes about. If your app lets users configure a URL that your backend then fetches (a webhook URL, a metadata endpoint, a feed URL), some users will paste the wrong thing. For audio-adjacent apps, they will paste their stream URL instead of their metadata API URL. Your backend will then faithfully fetch a never-ending audio stream on every request.

Validate user-supplied URLs before fetching them. At minimum, reject anything that looks like it will return binary streaming content:

_AUDIO_EXTENSIONS = ('.mp3', '.ogg', '.aac', '.m4a', '.opus')
_STREAM_PATH_SEGMENTS = ('/stream', '/live/', '/listen/', '/radio.mp3')

def looks_like_audio_stream(url: str, stream_url: str) -> bool:
    lower = url.lower().rstrip('/')
    return (
        lower == stream_url.lower().rstrip('/')  # user pasted stream URL into metadata field
        or any(lower.endswith(ext) for ext in _AUDIO_EXTENSIONS)
        or any(seg in lower for seg in _STREAM_PATH_SEGMENTS)
    )

if looks_like_audio_stream(metadata_url, stream_url):
    logger.warning("Rejecting audio stream URL from user: %s", metadata_url)
    return None  # skip the fetch entirely

The same principle applies everywhere:

  • If users configure feed URLs, they might point to a 500 MB XML file.
  • If they configure webhook URLs, they might accidentally loop back to your own server.
  • Always validate structure and apply timeouts before fetching anything a user gave you.
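
The loopback case in particular can be caught with a structural check before any DNS lookup or HTTP request. A sketch: it only handles IP-literal hosts plus the `localhost` name, so it is a cheap first filter, not a complete SSRF defence (a hostname that resolves to an internal address would slip past it):

```python
import ipaddress
from urllib.parse import urlparse

def is_obviously_internal(url: str) -> bool:
    """Reject URLs whose host is an IP literal in a private/loopback range."""
    host = urlparse(url).hostname or ''
    if host == 'localhost':
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname: would need DNS resolution to check properly
    return addr.is_private or addr.is_loopback or addr.is_link_local
```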

6

Unbounded database queries

Flask has no ORM forcing you to think about query shape. It is trivially easy to write a query that returns every row in a table into a Python list:

Dangerous at scale

# Fetches every user agent string ever recorded — unbounded
rows = db.execute(
    "SELECT ua FROM page_views WHERE tenant_id = %s",
    (tenant_id,)
).fetchall()

# Then classifies in Python
mobile = sum(1 for r in rows if 'Mobile' in r['ua'])

This looks fine in development with 200 rows. In production with 2 million rows, it loads every user agent string into a Python list in the worker's heap. Even after the list is freed at the end of the request, CPython's allocator rarely returns large allocations to the OS, so the worker's RSS stays elevated.

Correct — push classification into the database

# Returns 3-4 rows, always — regardless of table size
rows = db.execute("""
    SELECT
        CASE
            WHEN ua ~* '(Mobi|Android|iPhone)' THEN 'mobile'
            WHEN ua ~* '(iPad|Tablet)'          THEN 'tablet'
            ELSE 'desktop'
        END AS device_type,
        COUNT(*) AS cnt
    FROM page_views
    WHERE tenant_id = %s
    GROUP BY 1
""", (tenant_id,)).fetchall()

Any aggregation that can be expressed in SQL should be. COUNT, SUM, AVG, GROUP BY, CASE/WHEN — use them. The database is running on dedicated hardware optimised for this. Your Flask worker is not.

Two more query hygiene rules:

  • Always add LIMIT to queries that could return large result sets, even in development.
  • Use fetchone() instead of fetchall() whenever you only need the first result.
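
The corrected query above uses Postgres's `~*` regex operator, but the shape works on any SQL engine. A runnable sketch of the same idea on sqlite3 (LIKE instead of regex, with made-up user agent rows):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page_views (tenant_id INT, ua TEXT)")
db.executemany(
    "INSERT INTO page_views VALUES (1, ?)",
    [("Mozilla/5.0 (iPhone) Mobile",),
     ("Mozilla/5.0 (iPad; Tablet)",),
     ("Mozilla/5.0 (X11; Linux)",),
     ("Mozilla/5.0 (Android) Mobile",)],
)

# The database returns a handful of rows, whatever the table size
rows = db.execute("""
    SELECT CASE
             WHEN ua LIKE '%Mobile%' THEN 'mobile'
             WHEN ua LIKE '%Tablet%' THEN 'tablet'
             ELSE 'desktop'
           END AS device_type,
           COUNT(*) AS cnt
    FROM page_views
    WHERE tenant_id = ?
    GROUP BY 1
""", (1,)).fetchall()
print(rows)
```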

7

Flask-Limiter and the memory:// trap

Flask-Limiter is excellent. Its default storage backend in recent versions is memory://, which stores rate limit counters in process memory. This is fine for development. In production with gevent workers, it silently becomes a leak.

The memory:// backend uses a background cleanup timer to evict expired windows. Under gevent, this timer never fires correctly because gevent monkey-patches threading. Expired entries accumulate indefinitely. The dict grows without bound.

Leaks under gevent

limiter = Limiter(
    key_func=get_remote_address,
    storage_uri="memory://"  # silent leak with gevent workers
)

Safe — use Redis

import os
import logging

from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

logger = logging.getLogger(__name__)

redis_url = os.environ.get("REDIS_URL")

if not redis_url:
    logger.critical("REDIS_URL not set — rate limiting disabled to prevent memory leak")

limiter = Limiter(
    key_func=get_remote_address,
    storage_uri=redis_url or "memory://",
    enabled=bool(redis_url),  # disabled entirely without Redis
)

Add Redis to your stack early.

On Railway, Render, or Fly.io it is a one-click addition. The alternative is either a memory leak or no rate limiting — neither is acceptable in production.

8

Gunicorn as a safety net, not a fix

Gunicorn's --max-requests flag recycles workers after they have handled N requests. When a worker is recycled, all its heap memory is freed and a fresh process starts. This prevents any slow leak from accumulating indefinitely:

gunicorn main:app \
  --worker-class gevent \
  --workers 2 \
  --timeout 120 \
  --max-requests 1000 \
  --max-requests-jitter 100

The jitter parameter randomises when workers recycle (between 1000 and 1100 requests) so they do not all restart simultaneously and drop traffic.

This is a safety net, not a fix

If memory climbs from 200 MB to 30 GB between recycling cycles, --max-requests cannot help — the OOM kill happens first. Fix the actual leak, then use --max-requests to protect against slow residual growth you have not found yet.

A useful mental model:

Your baseline memory after a fresh worker start should be ~100–200 MB for a typical Flask app. By the time a worker hits 1000 requests, it should be maybe 300–400 MB. If it is 3 GB, you have a leak. If it is 250 MB, you have normal Python allocator behaviour.

9

How to diagnose a leak you can't explain

When memory climbs and you do not know why, work through this checklist in order:

1. Correlate with the network graph

Check your platform's memory and network traffic graphs on the same timeline. If memory growth tracks exactly with ingress traffic, you are looking at kernel TCP buffers from persistent connections — not Python heap. tracemalloc will show almost nothing. Look for streaming proxies or outbound connections to streaming servers.

2. Use tracemalloc

import gc
import tracemalloc

tracemalloc.start()   # start once at import time so snapshots are comparable

_last_snapshot = None

# Add to a protected admin route
def memory_debug():
    global _last_snapshot
    gc.collect()
    snapshot = tracemalloc.take_snapshot()
    if _last_snapshot is None:
        stats = snapshot.statistics('lineno')[:20]
    else:
        stats = snapshot.compare_to(_last_snapshot, 'lineno')[:20]
    _last_snapshot = snapshot
    return '\n'.join(str(s) for s in stats)

Hit this endpoint twice with a 10-minute gap. The lines that grew between snapshots are your leak. Cross-reference them with the patterns in this article.

3. Check RSS vs /proc/self/status

with open('/proc/self/status') as f:
    for line in f:
        if line.startswith('VmRSS'):
            print(line.strip())  # Resident Set Size — what the OS sees

If RSS is much higher than what tracemalloc reports as Python heap usage, the delta is kernel buffers or memory-mapped files. Go back to step 1 and look at the network graph.

4. Check module-level state size

import sys

# Note: sys.getsizeof is shallow — it counts the container itself,
# not the objects it holds. Watch len() over time; that is the signal.
print(len(_rate_windows), sys.getsizeof(_rate_windows))
print(len(_results_cache), sys.getsizeof(_results_cache))

5. Audit every module-level dict and the HTTP session

Search your codebase for these patterns at module scope:

  • = {} — is it bounded? does it get cleaned up?
  • = defaultdict( — entries created on read, easy to accumulate
  • = requests.Session() — what does it connect to? any streaming endpoints?
  • stream=True — is the response always closed, including on early returns?
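
For the last item, the shape that survives early returns is try/finally. A sketch using a stub response object so it runs without a network; a real `requests` response closes the same way via `.close()`:

```python
def read_first_chunk(resp, chunk_size=4096):
    """Read one chunk from a streaming response, guaranteeing the close."""
    try:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            return chunk       # the early return still passes through finally
        return None
    finally:
        resp.close()           # connection released no matter how we exit

# Stub standing in for requests.get(url, stream=True)
class StubResponse:
    def __init__(self):
        self.closed = False
    def iter_content(self, chunk_size):
        yield b"data"
    def close(self):
        self.closed = True

resp = StubResponse()
assert read_first_chunk(resp) == b"data"
assert resp.closed  # finally ran despite the early return
```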

The Production Checklist

Before you ship a Flask app, verify each of these.

No streaming proxies in Flask workers. Redirect HTTPS streams. Reject HTTP streams. Use nginx/Caddy for anything else.

Server cache TTL > client polling interval × 2. A cache that expires before the next poll is not a cache.

Module-level dicts have a cleanup path. Either a periodic sweep, a max-size eviction, or a TTL check on every access.

All stream=True responses are closed. Use try/finally. Early returns inside a try block still pass through finally — use it.

User-supplied URLs are validated before fetching. Reject binary content types, streaming paths, and self-referential URLs.

Aggregation happens in SQL, not Python. GROUP BY + CASE/WHEN returns 4 rows. SELECT * returns 2 million.

Flask-Limiter uses Redis, not memory://, in any production environment with gevent workers.

--max-requests is set on gunicorn. 1000 with 100 jitter is a reasonable default.

The pattern behind all of these:

Flask workers are designed for short-lived request-response cycles. Anything that persists beyond the lifetime of a single request — open connections, growing dicts, buffered responses — will accumulate over time. When in doubt, ask: what happens to this thing after the request ends?

Keep Going

Memory leaks solved. Now make sure the rest of your stack is production-ready.