Optimizing uwsgi for Many Many Threads and Processes

tl;dr: Consider optimizing uwsgi by setting `threads-stacksize = 64` (the value is in KB) or some other small value in your uwsgi config. Python apps which do not use many C modules do not use the C stack very much. A smaller stack size means each thread uses less memory, so you can safely have more of them servicing requests.
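For illustration, here is a minimal Python sketch that renders a uwsgi ini fragment with this setting. The helper name `render_uwsgi_ini` is made up for this example, the process and thread counts are just the ones discussed in this post, and a real config would also need the usual options pointing uwsgi at your app.

```python
import configparser
import io


def render_uwsgi_ini(processes, threads, stack_kb):
    """Render a [uwsgi] ini fragment (hypothetical helper, illustration only)."""
    config = configparser.ConfigParser()
    config["uwsgi"] = {
        "processes": str(processes),
        "threads": str(threads),
        # Per-thread stack size, in KB as used throughout this post.
        "threads-stacksize": str(stack_kb),
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(render_uwsgi_ini(processes=4, threads=100, stack_kb=64))
```

Treat 64 as a starting point to test against your own app, not a universally safe value.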

Long story:

Years ago I was deploying a new flask web service using uwsgi. I needed it to scale to thousands of connections. I read a blog post (I searched and cannot find it now) which suggested 10 processes with 10 threads each to be able to serve 100 concurrent connections. After testing and tuning this particular app, we settled on 10 processes and 100 threads per process. It ran well.

Recently, a production app, which I helped deploy, fell on its face. It was performing very poorly, seemingly out of nowhere. This app was originally deployed with the same 10 processes, 100 threads per process configuration which I had used so successfully in the past. The ops team had already reduced the process count to 4 due to excessive memory use of the application. This means the application was only able to service 400 concurrent connections.

I still cannot entirely explain why the app ran for many months and then suddenly had problems. I’m guessing it is because of recent announcements driving more traffic to the site. The 400 threads were actually being used instead of sitting idle waiting for connections.

In the process of trying to restore service, our ops team wisely used a tool which I was not likely to have used (huge thanks to them). The tool is pmap and it shows mapped memory for a given process. I noticed something interesting in the output of pmap:

00007fcf75061000   8192K rw---   [ anon ]
00007fcf75861000      4K -----   [ anon ]
00007fcf75862000   8192K rw---   [ anon ]
00007fcf76062000      4K -----   [ anon ]

This pattern was repeated at the same address increment 50 times, for a total of 100 of these 8192K mappings. It occurred to me that the default stack size of a thread in Linux is typically 8MB and that these memory maps were the stacks of the threads, one per thread. I was able to confirm this suspicion by running the app myself and adjusting the size by configuring uwsgi with `--threads-stacksize`.
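If you want to check this on your own app, a rough Python sketch like the following counts the mappings that look like thread stacks in pmap output. The function name is made up for this example, and the match is only a heuristic: it assumes the column layout shown above, and any other anonymous read-write mapping of exactly the same size would be counted too.

```python
def count_stack_mappings(pmap_output, stack_kb=8192):
    """Count anonymous rw mappings whose size equals the assumed stack size."""
    count = 0
    for line in pmap_output.splitlines():
        # Expected columns: address, size, permissions, mapping name.
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        _address, size, perms, mapping = fields
        if (
            size == f"{stack_kb}K"
            and perms.startswith("rw")
            and mapping.strip() == "[ anon ]"
        ):
            count += 1
    return count


# The four lines quoted in this post.
sample = """\
00007fcf75061000   8192K rw---   [ anon ]
00007fcf75861000      4K -----   [ anon ]
00007fcf75862000   8192K rw---   [ anon ]
00007fcf76062000      4K -----   [ anon ]
"""

print(count_stack_mappings(sample))  # prints 2
```

Run against the full pmap output of one uwsgi worker, the count should come out close to the configured thread count if the stacks really are the large mappings.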

I started by moving to 1MB, which I know is the default Windows thread stack size, guessing it would still be plenty. Then I started to play limbo to see how low I could go. I started to get pretty happy when I broke the 256KB mark and our app was still functioning. Our app has the luxury of not having any deep calls. I might have been able to go lower, but once I got to 64KB, I didn’t see the point. Every further halving was a smaller and smaller improvement in absolute terms.

Moving from 8MB to 1MB took memory usage from 3.2GB to 400MB. Every halving of stack size halved overall memory usage of the thread stacks by this app. First 512KB/thread for 200MB, then 256KB/thread for 100MB, then 128KB/thread for 50MB, then 64KB/thread for 25MB. At this point, everything about the app was running exactly the same, the only difference being that I wasn’t wasting 3.2GB of memory in unused thread stacks.
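The arithmetic behind those numbers is just processes × threads × stack size. Here is a small Python sketch of it, assuming the 4 processes and 100 threads per process described above; the function name is made up for this example. Note that the 3.2GB figure comes from treating 3200MB as 3.2GB.

```python
def stack_reservation_mb(processes, threads_per_process, stack_kb):
    """Total memory set aside for thread stacks, in MB."""
    return processes * threads_per_process * stack_kb / 1024


# 4 processes x 100 threads, at each stack size tried in this post.
for stack_kb in (8192, 1024, 512, 256, 128, 64):
    total_mb = stack_reservation_mb(4, 100, stack_kb)
    print(f"{stack_kb:>5} KB/thread -> {total_mb:,.0f} MB")
```

Plug in your own process and thread counts to see what a given stack size would cost before changing the config.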