Thread-local storage and thread startup are largely inconsequential
implementation details. When it comes down to actual
parallel programming, of which I have done more than a little, the big thing
is thread synchronization, and its cost is rather hardware dependent. You can
pretty much entirely wipe out any parallelism gains with a synchronization
call that results in a context switch or even a serious cache impact. On
one hand you have machines like the Denelcor HEP, where every memory word had
a pair of semaphores on it; an instruction could stall its process while
waiting on them, and the hardware would schedule the other threads.
On the other hand you have your x86, where you can do a few clever things
with atomic operations and inline assembler, but a lot of the
"standard" (boost, pthread, etc.) synchronization primitives will kill you.
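
To make the x86 point a bit more concrete, here is a rough sketch of the two styles side by side. It uses standard C++11 std::atomic rather than hand-written inline assembler, which is my substitution, and the function names count_with_atomic and count_with_mutex are made up for illustration. Both bump a shared counter from several threads; the first uses a single atomic read-modify-write per increment, the second takes a std::mutex each time (typically a pthread mutex underneath on Linux, which is an assumption about the platform). How large the gap is depends entirely on your hardware and contention level, so time it yourself rather than trusting any number from me.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: each worker adds 1 to a shared counter `iters` times
// using a lock-free atomic read-modify-write. On x86 this typically becomes
// a single locked instruction, with no call into the OS.
std::uint64_t count_with_atomic(unsigned threads, std::uint64_t iters) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters] {
            for (std::uint64_t i = 0; i < iters; ++i) {
                // Relaxed ordering is enough here: we only need the increments
                // to be atomic, and join() below orders the final read.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}

// Hypothetical helper: same work, but every increment takes a mutex.
// Under contention a blocked thread may be put to sleep by the OS,
// which is the context-switch cost discussed above.
std::uint64_t count_with_mutex(unsigned threads, std::uint64_t iters) {
    std::uint64_t counter = 0;
    std::mutex m;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, &m, iters] {
            for (std::uint64_t i = 0; i < iters; ++i) {
                std::lock_guard<std::mutex> lock(m);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Both functions return threads * iters, so they are interchangeable for correctness; wrap each call in a std::chrono::steady_clock measurement to see what the difference looks like on your own machine. Note that even the atomic version bounces a cache line between cores, so it is not free either.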