cs161 2004 Lecture 15: Performance (Part 1: Livelock)

Performance of a single system under overload:
  Why this is interesting: it's deeper than just throwing CPU at the problem
  Shows up anytime a regularly scheduled job A is pre-empted by a load-dependent job B.

Specific problem: IP forwarding
  What's going on here (diagram):
    Network hardware (rx) -> Q -> IP stack -> Q -> App
                 (xmit) <- Q <-
    Interrupt generated by the networking hardware is handled immediately;
      the packet is placed on a queue, a soft interrupt runs the IP stack,
      the result is placed on a queue for the app, etc.
  Q: When can the xmit side starve?  What about the application?
  In general, if a Q is dropping, the pusher has higher priority than the puller
  Now we see what SEDA was trying to do by raising the number of threads

Explain Figure 6-1: (explain screend)
  What determines how far the curve goes up?
  Abstractly, what determines the slope of the line going down?
  Why does the screend line go down earlier than the line without screend?
  Ideally we should discard as soon as possible; the curve would then flatten out
  Mogul's point: we're burning CPU cycles exactly when they are most precious
  Q: Would infinite-length queues fix this?
  Q: What about reducing the interrupt priority level?
    Still have to deal with scheduling the work, and with not committing to
      work when there's no CPU time for it.
  Q: Any other ideas?  Maybe process each packet to completion?

Different scenario: can this livelock?
    while (1) { send NFS READ RPC; wait for response; }
  Q: What ends up happening?  It's self-limiting (a closed loop)
  Q: What about with lots of clients?  Many might be dropped, but we get progress
  Q: When is a router like/not like this?  TCP clients are (almost) closed; most UDP clients are not

The paper's solution (a code sketch follows at the end of this part):
  timeslice + round-robin scheduling
  when idle, use interrupts; when busy, poll and handle work in a kernel thread
  discard packets as early as possible (on the card!) when busy: take no new work
  another way to see it: change the push architecture into a pull architecture
  Q: What happens if packets arrive too quickly?  Too slowly?

Figure 6-3:
  Q: What's wrong with "polling, no quota"?  xmit is still starved
  Q: Why does it go to zero?  That's a lot worse!?
    We're doing more work per packet before the drop

Figure 6-4:
  Q: Why does "polling, no feedback" do poorly?  It can still end up starving screend
  Q: What's the basic idea behind feedback?  The thread yields when the queue is full
  Comment: feedback seems more elegant than quotas (no constants to tune)

Is this the final answer?
  What if we're running telnet and a web server on the same machine?
  How about a bunch of processing events - will feedback still work?
  Works well for the network (we can drop packets), but what about disk I/O?
    Maybe that's not open loop

Q: What's the general principle?
  Don't waste time on new work that you can't finish
    (or make new work lower priority)
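A minimal sketch of the solution above: interrupts only to wake a kernel thread,
polling with a per-round quota while busy, and feedback so we stop pulling
packets when the downstream queue fills.  Every function name here
(device_has_packet, poll_rx, ip_input, screend_queue_full, ...) is made up for
this sketch; these are not the real Digital UNIX interfaces from the paper.

    struct packet;
    /* Hypothetical kernel hooks -- stand-ins for illustration only: */
    extern int  device_has_packet(void);
    extern struct packet *poll_rx(void);
    extern void ip_input(struct packet *p);
    extern int  screend_queue_full(void);
    extern void enable_rx_interrupt(void);
    extern void disable_rx_interrupt(void);
    extern void sleep_until_rx_interrupt(void);
    extern void yield(void);

    #define RX_QUOTA 5                /* max packets handled per polling round */

    void rx_poll_thread(void)
    {
        for (;;) {
            int handled = 0;

            while (handled < RX_QUOTA && device_has_packet()) {
                /* Feedback: if the next stage can't keep up, take no new
                 * work; leave packets on the card, where dropping is cheap. */
                if (screend_queue_full())
                    break;
                struct packet *p = poll_rx();   /* no per-packet interrupt */
                ip_input(p);                    /* process to completion   */
                handled++;
            }

            if (device_has_packet()) {
                yield();               /* quota hit or queue full:
                                          give xmit/screend a turn */
            } else {
                enable_rx_interrupt(); /* idle: go back to interrupts */
                sleep_until_rx_interrupt();
                disable_rx_interrupt(); /* busy again: poll */
            }
        }
    }

The quota keeps xmit from starving even under constant input; the feedback
check keeps the thread from burning cycles on packets that screend would only
drop later anyway.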
cs161 2004 Lecture 15: Performance (Part 2: Black Boxes)

We've talked about performance for single-machine/single-process systems -
  what happens with multiple components?

Motivation:
  People tend to build systems out of components (web server, DB, etc.),
    often closed or very complex components
  Performance problems are difficult to diagnose:
    we don't want to open up every black box
  Example: a multi-tier web service with a front end, auth, DB, DNS, and storage;
    all we might know is that requests are taking 400ms on average.

Contribution: figure out which black boxes to open, reducing the time/cost of
  debugging the system
  Instead of focusing on instrumentation, use the minimum amount of
    information needed to infer what's going on

Goal: find "high-impact" causal paths - those which occur frequently and have high latency
  Think of a hierarchical profiler: like gprof for distributed systems
  Compare to a simple profiler: we want to know that B is slow when called from A

Basic idea: passively monitor messages such as RPCs (e.g., with tcpdump) and
  try to figure out which components are slow, and when.
  Diagram: three components A, B, C; calls a->b, b->c, c->b, b->a,
    with a slight delay at each component

One approach, using call-return/RPC ids: the nesting algorithm
  pair call/return messages
  figure out nesting relationships
  reconstruct call paths
  Complication: parallel calls A->B, A->B, B->C, B->C -- which belongs to which?
    multiple options; define heuristics:
      scoreboard the potential parallel calls, make a guess based on a histogram

Another approach, without call-return semantics: the convolution algorithm
  (a code sketch follows at the end of these notes)
  Create time series:
    s1(t) = a->b messages
    s2(t) = b->c messages
  Diagram with impulses at times 1,2,5,6,7 for s1 and 3,4,7,8,9 for s2:
    t:  1 2 3 4 5 6 7 8 9
    s1: | |     | | |
    s2:     | |     | | |
  Convolution: x(t) = s2(t) [x] s1(-t), i.e., the cross-correlation of s1 and s2
    shows a peak at t=2
    x: 1 3 5 3 1 1 2 2 1 0  (just count the overlaps as s2 slides left)
    an FFT does this quickly, and it detects the time shift
    inherently robust to noise, but periodic traffic causes problems
  Cool, but slow - see the experimental results: 2 hours vs. 2 seconds!
    If you have call-return semantics, nesting seems like the way to go.

Practical concerns:
  clock synchronization: OK if the skew is less than the minimum instrumentation delay
  message loss: OK up to 5%
  delay variance: robust up to 30% of the minimum period
  Q: other problems?  What about cache hits?

Q: How useful is this to a programmer?  gprof tells you which functions are slow;
  can this approach?
Q: Is this the right level?  Instrument the Java VM?  There's an invasiveness tradeoff.

Conclusion: passively collected traces have enough information to infer what's
  going on at a high level.  Also, this was intended as a tool, not a system for
  automagic optimization, so it's OK to have some false positives/errors.
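A small sketch of the convolution algorithm's core step on the toy traces
above, assuming we've already extracted the a->b and b->c message timestamps
from a sniffer trace.  The brute-force counting loop stands in for the FFT
the real tool uses, and only non-negative shifts are tried since a->b should
precede b->c.

    #include <stdio.h>

    int main(void)
    {
        /* Example impulse times from the notes. */
        int s1[] = {1, 2, 5, 6, 7};   /* times of a->b messages */
        int s2[] = {3, 4, 7, 8, 9};   /* times of b->c messages */
        int n1 = sizeof(s1) / sizeof(s1[0]);
        int n2 = sizeof(s2) / sizeof(s2[0]);
        int best_shift = 0, best_count = 0;

        /* Cross-correlate: for each candidate delay, count how many b->c
         * messages line up with an a->b message shifted by that delay. */
        for (int shift = 0; shift <= 8; shift++) {
            int count = 0;
            for (int i = 0; i < n1; i++)
                for (int j = 0; j < n2; j++)
                    if (s2[j] - s1[i] == shift)
                        count++;
            printf("shift %d: %d overlaps\n", shift, count);
            if (count > best_count) {
                best_count = count;
                best_shift = shift;
            }
        }

        /* Prints 2 with 5 overlaps -- the peak at t=2 from the notes,
         * i.e. the inferred processing delay inside B. */
        printf("most likely delay: %d (%d overlaps)\n", best_shift, best_count);
        return 0;
    }

With real traces the series would be binned message counts per time slice
rather than a handful of exact timestamps, but the peak-finding idea is the
same.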