cs161 2004 Lecture 9: SMP Event-driven programs

Why is this a problem?
  The event-driven model eschews threads, so it is harder to multiplex CPUs
  The event model doesn't say what is safe to run simultaneously

What will benefit?
  Crypto
  Compression
  CPU-intensive web services like MapQuest
  Dynamic content generated in slower languages

Why doesn't our current scheme work?
  Single select() loop, serial callbacks

How about using subprocesses?
  Heavyweight
  How many to use? (one per CPU, probably)
  Then each subprocess must be able to do many different things

Solicit ideas:
  Multiple select() loops?
    Each loop over all fds: just doesn't work; you end up doing more work
    Divvy up the fds: load-balancing and sharing problems
  Run callbacks in separate threads? How to do it safely?
  Multiple copies of the program? Again, sharing problems

Colors
  Different colors can run at the same time
  A color is a 32-bit number
  The default color is 0
  Common usage:
    A color per request (fd)
    A color per data structure (sort of like Java's lock on any object)
  How is this different from locking?
    Only one lock at a time... but aren't small fine-grained locks good?
    The speedup comes from the parts that DON'T need locks
    You identify safe spots, not unsafe spots
      Easier to be conservative
      Many "errors" are safe

API details
  wrap(), same as the NMSTL call of the same name
  The color is in the callback, not an argument to registration (modularity)
  Color is not inherited (no "current color")
    Otherwise modules would have to know their context
  cpucb()
    Runs a callback when the CPU is idle
    Used to change color

libasync details
  Uses a loopback NFS server
  Callback queue per thread, in two parts: cpucb's and fdcb's
  cpucb's of a given color run in the order they were registered. Why?
    x() { a(red); b(blue); }
    Suppose many x()s are called: the b()s should run in order
  cpucb's are preferred when scheduling. Why?
    Locality of reference
    Why is this different from I/O? I/O implied a pause anyway
  Cache locality comes from running a color on a given CPU
    This also preserves the ordering guarantee
    Problem? Statically dividing the program could lead to unbalanced load
    Fix?
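The per-color scheduling rules above (same-color callbacks run serially, in registration order; different colors may run in parallel) can be illustrated with a toy scheduler. This is a minimal Python sketch, not the real libasync-smp C++ API; the names `ColorScheduler`, `cpucb`, and `run_color` are illustrative assumptions.

```python
# Toy sketch of color-based scheduling (NOT the libasync-smp API).
# Callbacks sharing a color run serially, in registration order;
# callbacks with different colors may be drained by different threads.
import threading
from collections import defaultdict, deque

class ColorScheduler:
    def __init__(self):
        self.queues = defaultdict(deque)          # color -> FIFO of callbacks
        self.locks = defaultdict(threading.Lock)  # one lock per color

    def cpucb(self, color, fn):
        """Register fn to run under `color` (default color would be 0)."""
        self.queues[color].append(fn)

    def run_color(self, color):
        # Only one thread drains a given color at a time, so same-color
        # callbacks never interleave and run in registration order.
        with self.locks[color]:
            q = self.queues[color]
            while q:
                q.popleft()()
```

Because the serialization is per color rather than per shared variable, the programmer marks what is safe to run concurrently (different colors) rather than what is unsafe, which is the conservative stance the notes describe.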
Work stealing: a 1024-entry array records which thread each color currently goes to

Scheduler
  Prefers to keep running the same color (one request or structure; improves cache locality)
  select() doesn't block if there is anything else to do
  cpucb's are preferred (sequential code; improves cache locality)

Limitations
  Read/write locks are not expressible
  How about "thread join"?

Evaluation
  How much of the available parallelism was exposed?
  How much parallelism is there?
    N-copy is an upper bound: separate caches
    What is lost? Serialized accept(), cache checking, O/S locks held on a per-process basis
    Multiple copies of Flash do better. Does that mean this is the wrong approach?
  How much overhead did colors impose?
    Single-CPU throughput drops from 35.4 MB/s to 30.4 MB/s, about 14%
    For SFS, only a 4% loss
  SFS is sped up almost as much as Amdahl's Law allows (2.5 achieved vs. a bound of 2.85)
    That is, only 65% of cycles were available for parallelization
    But maybe fine-grained locking would have exposed more?
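The Amdahl's Law figures above can be checked in a few lines. This is my own arithmetic sketch (the helper name `amdahl` is not from the paper): with a parallelizable fraction p run on n CPUs, speedup is 1 / ((1 - p) + p/n), and the bound as n grows is 1 / (1 - p).

```python
# Amdahl's Law: with a parallelizable fraction p of the work on n CPUs,
#   speedup(n) = 1 / ((1 - p) + p / n)
# and the n -> infinity upper bound is 1 / (1 - p).
def amdahl(p: float, n: float) -> float:
    return 1.0 / ((1.0 - p) + p / n)

p = 0.65                 # only 65% of SFS cycles were parallelizable
limit = 1.0 / (1.0 - p)  # upper bound on speedup: 1/0.35 ~ 2.86
```

With p = 0.65 the bound is about 2.86, matching the ~2.85 figure quoted above, so the achieved 2.5 speedup is indeed close to the Amdahl ceiling.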