cs161 2004 Lecture 9: SMP Event-driven programs

Why is this a problem?
  The event-driven model eschews threads, so it is harder to multiplex CPUs
  The event model doesn't say what is safe to run simultaneously

What will benefit?
  Crypto
  Compression
  CPU-intensive web services like MapQuest
  Dynamic content generated in slower languages

Why doesn't our current scheme work?
  Single select() loop, serial callbacks

How about using subprocesses?
  Heavyweight
  How many to use? (one per CPU, probably)
  Then each subprocess must be able to do many different things

Solicit ideas:
  Multiple select() loops?
    Each loop over all fds: just doesn't work; you end up doing more work
    Divvy up the fds: load-balancing and sharing problems
  Run callbacks in separate threads? How to do it safely?
  Multiple copies of the program? Again, sharing problems

Colors
  Different colors can run at the same time
  A color is a 32-bit number
  The default color is 0
  Common usage:
    A color per request (fd)
    A color per data structure (sort of like Java's lock on any object)
  How is this different from locking?
    Only one lock at a time... but aren't small fine-grained locks good?
    The speedup comes from the parts that DON'T need locks
    You identify safe spots, not unsafe spots
      Easier to be conservative
      Many "errors" are safe

API details
  wrap(), same as the NMSTL call of the same name
  The color is in the callback, not an argument to registration (modularity)
  Color is not inherited (no "current color")
    Otherwise modules would have to know their context
  cpucb()
    Runs a callback when the CPU is idle
    Used to change color

libasync details
  Uses a loopback NFS server
  Callback queue per thread, in two parts: cpucb's and fdcb's
  cpucb's of a given color run in the order they were registered. Why?
    x() { a(red); b(blue); }
    Suppose many x()s are called: the b()s should run in order
  cpucb's are preferred when scheduling. Why?
    Locality of reference
    Why is this different from I/O? I/O implied a pause anyway
  Cache locality comes from running a color on a given CPU
    This also preserves the ordering guarantee
    Problem? Statically dividing the program could lead to unbalanced load
    Fix?
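The per-color scheduling rules above (same-color callbacks run serially, in registration order; different colors may run in parallel) can be illustrated with a toy scheduler. This is a minimal Python sketch, not the real libasync-smp C++ API; the names `ColorScheduler`, `cpucb`, and `run_color` are illustrative assumptions.

```python
# Toy sketch of color-based scheduling (NOT the libasync-smp API).
# Callbacks sharing a color run serially, in registration order;
# callbacks with different colors may be drained by different threads.
import threading
from collections import defaultdict, deque

class ColorScheduler:
    def __init__(self):
        self.queues = defaultdict(deque)          # color -> FIFO of callbacks
        self.locks = defaultdict(threading.Lock)  # one lock per color

    def cpucb(self, color, fn):
        """Register fn to run under `color` (default color would be 0)."""
        self.queues[color].append(fn)

    def run_color(self, color):
        # Only one thread drains a given color at a time, so same-color
        # callbacks never interleave and run in registration order.
        with self.locks[color]:
            q = self.queues[color]
            while q:
                q.popleft()()
```

Because the serialization is per color rather than per shared variable, the programmer marks what is safe to run concurrently (different colors) rather than what is unsafe, which is the conservative stance the notes describe.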
Work stealing: a 1024-entry array records which thread each color currently goes to

Scheduler
  Prefers to keep running the same color (one request or structure; improves cache locality)
  select() doesn't block if there is anything else to do
  cpucb's are preferred (sequential code; improves cache locality)

Limitations
  Read/write locks are not expressible
  How about "thread join"?

Evaluation
  How much of the available parallelism was exposed?
  How much parallelism is there?
    N-copy is an upper bound: separate caches
    What is lost? Serialized accept(), cache checking, O/S locks held on a per-process basis
    Multiple copies of Flash do better. Does that mean this is the wrong approach?
  How much overhead did colors impose?
    Single-CPU throughput drops from 35.4 MB/s to 30.4 MB/s, about 14%
    For SFS, only a 4% loss
  SFS is sped up almost as much as Amdahl's Law allows (2.5 achieved vs. a bound of 2.85)
    That is, only 65% of cycles were available for parallelization
    But maybe fine-grained locking would have exposed more?
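The Amdahl's Law figures above can be checked in a few lines. This is my own arithmetic sketch (the helper name `amdahl` is not from the paper): with a parallelizable fraction p run on n CPUs, speedup is 1 / ((1 - p) + p/n), and the bound as n grows is 1 / (1 - p).

```python
# Amdahl's Law: with a parallelizable fraction p of the work on n CPUs,
#   speedup(n) = 1 / ((1 - p) + p / n)
# and the n -> infinity upper bound is 1 / (1 - p).
def amdahl(p: float, n: float) -> float:
    return 1.0 / ((1.0 - p) + p / n)

p = 0.65                 # only 65% of SFS cycles were parallelizable
limit = 1.0 / (1.0 - p)  # upper bound on speedup: 1/0.35 ~ 2.86
```

With p = 0.65 the bound is about 2.86, matching the ~2.85 figure quoted above, so the achieved 2.5 speedup is indeed close to the Amdahl ceiling.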