- It's a GC'd language, which means objects don't necessarily get freed immediately
- It uses pool allocators (pymalloc), so freed memory isn't always returned to the OS right away
- the re module has a cache of compiled expressions
- tracemalloc may not give you good call stacks: https://bugs.python.org/issue33565
- the ThreadPoolExecutor creates a new thread per submit() until you hit max_workers; the default max_workers is os.cpu_count() * 5 (see the sketch after this list)
- tracemalloc itself consumes memory to store the traces it collects
- modules like requests/aiohttp/aiobotocore/etc. which use sockets typically keep a pool of connections whose size may fluctuate over time
- memory fragmentation
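To make the ThreadPoolExecutor point concrete, here is a minimal sketch (not from the original post) that just watches the thread count grow as jobs are submitted. The exact numbers depend on your Python version: 3.5-3.7 spin up a new thread per submit() until max_workers, while newer versions reuse idle workers, so the count may grow more slowly.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Default max_workers was os.cpu_count() * 5 on Python 3.5-3.7;
# later versions changed the default and reuse idle worker threads.
pool = ThreadPoolExecutor()

print("threads before any submit:", threading.active_count())
for i in range(5):
    pool.submit(lambda: None)
    # Each submit() may have created another worker thread, which
    # shows up as apparent memory growth early in the process' life.
    print("threads after submit %d: %d" % (i + 1, threading.active_count()))
pool.shutdown(wait=True)
```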
Here is a set of work-arounds for these issues:
- call gc.collect() from a place that isn't holding onto object references when you want a stable point (see the combined sketch after this list)
- from Python 3.6 onwards, use PYTHONMALLOC=malloc to bypass the pool allocator
- call re._cache.clear() from a similar place to #1
- no known work-around (I'm trying to help ensure it does something better in the future)
- when you start tracemalloc, ensure you start it after all the threads have been created; this means you've submitted at least max_workers jobs to the pools. Another hack is to temporarily change the ThreadPoolExecutor to create all of its threads on the first submit
- Don't rely on RSS when using tracemalloc
- Try to set the pool sizes to 1
- Run your leak tests for longer periods, or, if you're using large chunks of memory, try to reduce the chunk sizes
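As a rough illustration of work-arounds #1, #3, and #5 combined, here is a sketch of setting up a stable measurement point before starting tracemalloc. The pool size, warm-up job, and frame depth are arbitrary choices for illustration, and re._cache is a CPython implementation detail rather than a public API.

```python
import gc
import re
import tracemalloc
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4
pool = ThreadPoolExecutor(max_workers=MAX_WORKERS)

def stable_point():
    """Call from code that isn't holding extra object references."""
    gc.collect()       # force a collection so pending garbage isn't counted as a leak
    re._cache.clear()  # drop the re module's compiled-pattern cache (CPython detail)

# Warm up the pool so thread creation doesn't show up as growth later.
# (On newer Pythons idle workers are reused, so fewer threads may be created.)
for _ in range(MAX_WORKERS):
    pool.submit(lambda: None)

stable_point()
tracemalloc.start(25)  # start tracing only after the threads exist

# ... run the workload under test here ...

stable_point()
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)
```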
The way I approach it is two-fold:
- Try to use tracemalloc to figure out specifically where leaks are coming from; I use a helper like: https://gist.github.com/thehesiod/2f56f98370bea45f021d3704b21707a9
- Use the memory_profiler module to binary-search through the codebase and figure out what is causing a leak from a high level. This basically means disabling parts of your application until you find the trigger (see the sketch below).
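For that binary-search step, here is a small sketch using memory_profiler's memory_usage helper. suspected_feature() is a hypothetical stand-in for whichever part of the application you toggle on and off while comparing peak memory.

```python
from memory_profiler import memory_usage

def suspected_feature():
    # Hypothetical placeholder for the code path you're currently suspecting.
    data = [object() for _ in range(100000)]
    return len(data)

def run_iteration(enabled):
    if enabled:
        suspected_feature()

# memory_usage samples the process RSS (in MiB) while the callable runs.
for enabled in (False, True):
    samples = memory_usage((run_iteration, (enabled,)), interval=0.1)
    print("feature enabled=%s peak MiB=%.1f" % (enabled, max(samples)))
```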