Scaling Apache httpd as a Reverse Proxy

We recently needed to make sure our front-end Apache httpd reverse proxy and SSL termination server could handle the large number of WebSocket connections we are going to send through it. WebSockets are long-lived connections, so this is a different use of Apache httpd than usual, and we want to get it right. The proxied service itself can handle tens of thousands of concurrent connections, if not hundreds of thousands or more.

First, our testing tool is custom made: it opens all of the WebSocket connections first and then proceeds to ping over them. This matters because it exercises the concurrent-connection capability of httpd rather than request throughput. When using it, the client system needs to be able to create enough sockets. The first limit I hit was on the test client itself: the shell environment defaults to a limit of 1024 open files. It is a soft limit, so use ulimit -S to raise it. Even ab will show "socket: Too many open files (24)" if you use the -n 1050 and -c 1050 options.

$ ulimit -n
1024
$ ulimit -Hn
65536
$ ulimit -Sn 65536
$ ulimit -n
65536
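
For reference, this is the sort of ab invocation that trips the default 1024 descriptor limit; the hostname is the same placeholder test host used with slowhttptest below:

$ ab -n 1050 -c 1050 http://system.under.test.example.com/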

Now your testing tool can create more than 1024 connections. The next limit I ran into was the number of connections on the httpd server. Even mpm_event uses a thread per request (do not let the event name fool you). The default Ubuntu apache2 mpm_event configuration allows for 150 concurrent connections:

 StartServers 2
 MinSpareThreads 25
 MaxSpareThreads 75
 ThreadLimit 64
 ThreadsPerChild 25
 MaxRequestWorkers 150
 MaxConnectionsPerChild 0

A tool like ab won't reveal the 150-connection ceiling, because its requests complete quickly and workers are freed right away. A tool named slowhttptest, which holds connections open, is in xenial/universe; run apt install slowhttptest to install it. It is a flexible tool with a great man page and -h help output.

$ slowhttptest -c 1000 -H -g -o my_header_stats -i 10 -r 200 -t GET -u http://system.under.test.example.com/ -x 24 -p 3

slowhttptest version 1.6
– https://code.google.com/p/slowhttptest/ –
test type: SLOW HEADERS
number of connections: 1000
URL: http://system.under.test.example.com/
verb: GET
Content-Length header value: 4096
follow up data max size: 52
interval between follow up data: 10 seconds
connections per seconds: 200
probe connection timeout: 3 seconds
test duration: 240 seconds
using proxy: no proxy

Tue Sep 27 14:33:03 2016:
slow HTTP test status on 5th second:

initializing: 0
pending: 284
connected: 667
error: 0
closed: 0
service available: YES

This screen will update as connections are created until service available changes from YES to NO.

In my tests the closed: value was exactly 150. I can view the my_header_stats.csv file to see when the maximum was reached.
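
One quick way to skim that CSV from the terminal (any spreadsheet works just as well) is to line the columns up with column:

$ column -s, -t my_header_stats.csv | less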

Next, let's adjust Apache httpd to allow more concurrent connections. My target is 15,000 connections, so I'll scale the numbers up linearly: the default configuration caps out at 150 workers (MaxRequestWorkers 150), and 20 processes (ServerLimit) with 750 threads each (ThreadsPerChild) should give 15,000 connections.

Edit mpm_event.conf: ($ sudo vi /etc/apache2/mods-enabled/mpm_event.conf)

<IfModule mpm_event_module>
 StartServers 10
 MinSpareThreads 25
 MaxSpareThreads 750
 ThreadLimit 1000
 ThreadsPerChild 750
# MaxRequestWorkers (aka MaxClients) = ServerLimit * ThreadsPerChild
 MaxRequestWorkers 15000
 MaxConnectionsPerChild 0
 ServerLimit 20
 ThreadStackSize 524288
</IfModule>

Restart apache2 httpd (a full restart, not a graceful one; ServerLimit and ThreadLimit are only read at server startup) and retry the slowhttptest. Notice that service available now stays YES.
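
On Ubuntu, that full restart is the stock service command (the same one used later in this post when switching MPMs):

$ sudo service apache2 restart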

Now turn up the slowhttptest numbers: change the -c parameter to 15000 and -r to 1500, which should ramp up the connections in about 10 seconds. In my case I could not create that many connections that quickly; slowhttptest was maxing out a CPU core.
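
That is the same invocation as before with only -c and -r raised (same placeholder test host):

$ slowhttptest -c 15000 -H -g -o my_header_stats -i 10 -r 1500 -t GET -u http://system.under.test.example.com/ -x 24 -p 3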

All of the above Apache httpd configuration was done using the mpm_event processing module. The next issue I ran into was mpm_worker not behaving as I expected. I have a doubly proxied system, because this is the super real world where we route HTTP things all over the place, sometimes in ways we shouldn't, because we are lazy, or it is easier, or… anyway…

In ubuntu/trusty with Apache httpd 2.4.7, mpm_worker caps ThreadsPerChild at 64 even if you configure a larger number, and there is no warning. You'd never know unless you look at the number of threads in each worker process:

$ ps -u www-data -o pid,ppid,nlwp

The nlwp column is the per-process thread count. The fix is to switch from mpm_worker to mpm_event.

$ sudo a2dismod mpm_worker
$ sudo a2enmod mpm_event
$ sudo service apache2 restart
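
To double-check which MPM actually ended up loaded, listing the modules works; after the switch it should show mpm_event_module:

$ apache2ctl -M | grep mpm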

I thought that I’d need to do more, but this got me to where I needed to be.

Optimizing uwsgi for Many Many Threads and Processes

tl;dr: Consider optimizing uwsgi by setting `threads-stacksize = 64`, or some similarly small value, in your uwsgi config. Python apps which do not use many C modules do not use much of the C stack. A smaller stack size means threads use less memory, so you can safely have more of them servicing requests.
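
As a minimal sketch of where that setting lives, here is an ini-style uwsgi config using the 10-process, 100-thread shape discussed below; the module name and port are hypothetical placeholders:

[uwsgi]
; hypothetical Flask entry point and test port, adjust for your app
module = myapp:app
http = :8080
processes = 10
threads = 100
; per-thread C stack size in KB (the Linux default is 8MB)
threads-stacksize = 64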

Long story:

Years ago I was deploying a new Flask web service using uwsgi. I needed it to scale to thousands of connections. I read a blog post (I searched and cannot find it now) which suggested 10 processes with 10 threads each to serve 100 concurrent connections. After testing and tuning this particular app, we settled on 10 processes and 100 threads per process. It ran well.

Recently, a production app which I helped deploy fell on its face. It was performing very poorly, seemingly out of nowhere. This app was originally deployed with the same 10-process, 100-threads-per-process configuration I had used so successfully in the past. The ops team had already reduced the process count to 4 due to the application's excessive memory use, which meant the application could only service 400 concurrent connections.

I still cannot entirely explain why the app ran for many months and then suddenly had problems. I’m guessing it is because of recent announcements driving more traffic to the site. The 400 threads were actually being used instead of sitting idle waiting for connections.

In the process of trying to restore service, our ops team wisely used a tool which I was not likely to have used (huge thanks to them). The tool is pmap and it shows mapped memory for a given process. I noticed something interesting in the output of pmap:

00007fcf75061000   8192K rw---   [ anon ]
00007fcf75861000      4K -----   [ anon ]
00007fcf75862000   8192K rw---   [ anon ]
00007fcf76062000      4K -----   [ anon ]
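
For reference, getting this view yourself is just pmap pointed at one of the uwsgi worker pids; the pgrep lookup here is only one way to find a pid:

$ pmap $(pgrep -f uwsgi | head -n 1)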

This pair of 8MB mappings was repeated 50 times, for a total of 100: one per thread, with the 4K unreadable mappings acting as guard pages between them. It occurred to me that the default stack size of a thread on Linux is 8MB and that these memory maps were the stack of each thread. I was able to confirm this suspicion by running the app myself and adjusting the size by configuring uwsgi with --threads-stacksize.
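
For example, to test with a 1MB stack (the option takes a size in KB), the override can be passed straight on the command line; app.ini is a placeholder for your existing uwsgi config:

$ uwsgi --ini app.ini --threads-stacksize 1024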

I started by moving to 1MB, which I know is the default Windows thread stack size, guessing it would still be plenty. Then I started to play limbo and see how low I could go. I got pretty happy when I broke the 256KB mark and our app was still functioning. Our app has the luxury of not having any deep call chains. I might have been able to go lower, but once I got to 64KB I didn't see the point; each further decrease was a smaller and smaller absolute improvement.

Moving from 8MB to 1MB per thread took the stack memory usage from 3.2GB to 400MB (4 processes with 100 threads each is 400 threads in total). Every halving of the stack size after that halved this app's overall thread-stack memory: 512KB/thread for 200MB, then 256KB/thread for 100MB, then 128KB/thread for 50MB, then 64KB/thread for 25MB. At that point everything about the app ran exactly the same; the only difference was that I was no longer wasting 3.2GB of memory on unused thread stacks.