Since then I implemented my own M:N userspace:kernel threading library (which is what lthread boils down to) based on Russ Cox's BSD-licensed libtask. You know what I found?
- Many libc routines require a surprisingly large amount of stack. 64 KiB was the smallest power of 2 I could find where I wouldn't overflow the stack somewhere in a libc call. (This isn't a problem with stack-copying, but it is a problem if you have dedicated per-lthread stacks.)
- It was slow! Just dropping the M:N library and using blocking network I/O with 200+ kernel pthreads was vastly faster.
- Many pthread functions get very, very confused when the same logical "lthread" calls them from different pthreads. (This happens when a userspace scheduler swap happens between matching operations.) For example, pthread mutexes don't like to be locked in one pthread and unlocked in another pthread (same "lthread"). I ran into other things in this category, but can't remember them off the top of my head.
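To make the mutex point concrete, here is a minimal sketch, not code from lthread or libtask. It uses two ordinary pthreads to stand in for one "lthread" that got rescheduled onto a different kernel thread between lock and unlock. The function names are made up for illustration, and the mutex is deliberately created as PTHREAD_MUTEX_ERRORCHECK so the failure shows up as an error code; with a default mutex, unlocking from a non-owner is undefined behavior and may fail in far less obvious ways.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical helper: runs on a second pthread and tries to unlock a
   mutex that was locked by a different pthread. */
static void *unlock_from_other_thread(void *arg)
{
    pthread_mutex_t *m = arg;
    int rc = pthread_mutex_unlock(m);   /* caller is not the owner */
    return (void *)(intptr_t)rc;
}

/* Hypothetical demo: returns the error code the non-owning pthread got
   from pthread_mutex_unlock(), or -1 if setup failed. */
int cross_thread_unlock_result(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    pthread_t t;
    void *ret = NULL;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    /* Error-checking mutexes report ownership violations instead of
       leaving the outcome undefined. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (pthread_mutex_init(&m, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&m);             /* "lthread" locks on pthread A */
    if (pthread_create(&t, NULL, unlock_from_other_thread, &m) != 0) {
        pthread_mutex_unlock(&m);
        pthread_mutex_destroy(&m);
        return -1;
    }
    pthread_join(t, &ret);              /* unlock was attempted on pthread B */

    pthread_mutex_unlock(&m);           /* the real owner can still unlock */
    pthread_mutex_destroy(&m);
    return (int)(intptr_t)ret;
}
```

On a POSIX-conforming system the call from the second pthread should come back with EPERM, which is the polite version of the confusion described above.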
I ended up scrapping it and just using a pool of kernel pthreads. It works, the code is pretty clear (blocking I/O! whatever, man), and it's fast enough.
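For readers who haven't built one, a pool of kernel pthreads can be sketched roughly as below. This is an illustrative guess at the general shape, not the actual code: the names (`pool_start`, `pool_submit`, `pool_stop`, `pool_demo`), the pool size, and the queue capacity are all assumptions. Each worker pulls a job off a bounded queue and runs it to completion, so a job is free to make blocking calls and only that one pthread waits.

```c
#include <pthread.h>
#include <stdatomic.h>

#define POOL_THREADS 4   /* hypothetical; the setup described above used 200+ */
#define QUEUE_CAP    64  /* hypothetical bound on pending jobs */

typedef void (*job_fn)(void *);
struct job { job_fn fn; void *arg; };

struct pool {
    pthread_t threads[POOL_THREADS];
    int nthreads;                     /* threads actually started */
    struct job queue[QUEUE_CAP];      /* ring buffer of pending jobs */
    int head, tail, count;
    int stopping;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

static void *worker(void *p)
{
    struct pool *pool = p;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (pool->count == 0 && !pool->stopping)
            pthread_cond_wait(&pool->not_empty, &pool->lock);
        if (pool->count == 0) {       /* stopping and queue drained */
            pthread_mutex_unlock(&pool->lock);
            return NULL;
        }
        struct job j = pool->queue[pool->head];
        pool->head = (pool->head + 1) % QUEUE_CAP;
        pool->count--;
        pthread_cond_signal(&pool->not_full);
        pthread_mutex_unlock(&pool->lock);

        j.fn(j.arg);  /* may block in read()/send(); only this pthread waits */
    }
}

int pool_start(struct pool *pool)
{
    pool->head = pool->tail = pool->count = 0;
    pool->stopping = 0;
    pool->nthreads = 0;
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->not_empty, NULL);
    pthread_cond_init(&pool->not_full, NULL);
    for (int i = 0; i < POOL_THREADS; i++) {
        if (pthread_create(&pool->threads[i], NULL, worker, pool) != 0)
            return -1;                /* caller should still call pool_stop */
        pool->nthreads++;
    }
    return 0;
}

void pool_submit(struct pool *pool, job_fn fn, void *arg)
{
    pthread_mutex_lock(&pool->lock);
    while (pool->count == QUEUE_CAP)  /* back-pressure when the queue is full */
        pthread_cond_wait(&pool->not_full, &pool->lock);
    pool->queue[pool->tail] = (struct job){ fn, arg };
    pool->tail = (pool->tail + 1) % QUEUE_CAP;
    pool->count++;
    pthread_cond_signal(&pool->not_empty);
    pthread_mutex_unlock(&pool->lock);
}

void pool_stop(struct pool *pool)
{
    pthread_mutex_lock(&pool->lock);
    pool->stopping = 1;
    pthread_cond_broadcast(&pool->not_empty);
    pthread_mutex_unlock(&pool->lock);
    for (int i = 0; i < pool->nthreads; i++)
        pthread_join(pool->threads[i], NULL);   /* workers drain the queue first */
    pthread_cond_destroy(&pool->not_full);
    pthread_cond_destroy(&pool->not_empty);
    pthread_mutex_destroy(&pool->lock);
}

/* Stand-in job for the demo: a real job would do blocking disk/network I/O. */
static void count_job(void *arg)
{
    atomic_fetch_add((atomic_int *)arg, 1);
}

/* Runs njobs trivial jobs through the pool and returns how many completed. */
int pool_demo(int njobs)
{
    struct pool pool;
    atomic_int done = 0;

    if (pool_start(&pool) != 0) {
        pool_stop(&pool);
        return -1;
    }
    for (int i = 0; i < njobs; i++)
        pool_submit(&pool, count_job, &done);
    pool_stop(&pool);
    return atomic_load(&done);
}
```

The appeal is that every lock is taken and released on the same pthread, and each worker gets a normal kernel-managed stack, so neither of the problems listed above comes up.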
Edit: By "fast enough", I mean "can fill 2x 1Gbit pipes from disk" (without any perf-focused work thus far).