cs161 2004 Lecture 15: Performance (Part 1: Livelock)

Performance of a single system under overload:
  Why this is interesting: it's deeper than just throwing CPU at the problem
  Shows up anytime a regularly scheduled job A is pre-empted by a load-dependent job B.

Specific problem: IP forwarding
  What's going on here (diagram):
    Network hardware (rx) -> Q -> IP stack -> Q -> App
                 (xmit) <- Q <-
    Interrupt generated by the networking hardware is handled immediately;
      the packet is placed on a queue, a soft interrupt runs the IP stack,
      the result is placed on a queue for the app, etc.
  Q: When can the xmit side starve?  What about the application?
  In general, if a Q is dropping, the pusher has higher priority than the puller
  Now we see what SEDA was trying to do by raising the number of threads

Explain Figure 6-1: (explain screend)
  What determines how far the curve goes up?
  Abstractly, what determines the slope of the line going down?
  Why does the screend line go down earlier than the line without screend?
  Ideally we should discard as soon as possible; the curve would then flatten out
  Mogul's point: we're burning CPU cycles exactly when they are most precious
  Q: Would infinite-length queues fix this?
  Q: What about reducing the interrupt priority level?
    Still have to deal with scheduling the work, and with not committing to
      work when there's no CPU time for it.
  Q: Any other ideas?  Maybe process each packet to completion?

Different scenario: can this livelock?
    while (1) { send NFS READ RPC; wait for response; }
  Q: What ends up happening?  It's self-limiting (a closed loop)
  Q: What about with lots of clients?  Many might be dropped, but we get progress
  Q: When is a router like/not like this?  TCP clients are (almost) closed; most UDP clients are not

The paper's solution (a code sketch follows at the end of this part):
  timeslice + round-robin scheduling
  when idle, use interrupts; when busy, poll and handle work in a kernel thread
  discard packets as early as possible (on the card!) when busy: take no new work
  another way to see it: change the push architecture into a pull architecture
  Q: What happens if packets arrive too quickly?  Too slowly?

Figure 6-3:
  Q: What's wrong with "polling, no quota"?  xmit is still starved
  Q: Why does it go to zero?  That's a lot worse!?
    We're doing more work per packet before the drop

Figure 6-4:
  Q: Why does "polling, no feedback" do poorly?  It can still end up starving screend
  Q: What's the basic idea behind feedback?  The thread yields when the queue is full
  Comment: feedback seems more elegant than quotas (no constants to tune)

Is this the final answer?
  What if we're running telnet and a web server on the same machine?
  How about a bunch of processing events - will feedback still work?
  Works well for the network (we can drop packets), but what about disk I/O?
    Maybe that's not open loop

Q: What's the general principle?
  Don't waste time on new work that you can't finish
    (or make new work lower priority)
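A minimal sketch of the solution above: interrupts only to wake a kernel thread,
polling with a per-round quota while busy, and feedback so we stop pulling
packets when the downstream queue fills.  Every function name here
(device_has_packet, poll_rx, ip_input, screend_queue_full, ...) is made up for
this sketch; these are not the real Digital UNIX interfaces from the paper.

    struct packet;
    /* Hypothetical kernel hooks -- stand-ins for illustration only: */
    extern int  device_has_packet(void);
    extern struct packet *poll_rx(void);
    extern void ip_input(struct packet *p);
    extern int  screend_queue_full(void);
    extern void enable_rx_interrupt(void);
    extern void disable_rx_interrupt(void);
    extern void sleep_until_rx_interrupt(void);
    extern void yield(void);

    #define RX_QUOTA 5                /* max packets handled per polling round */

    void rx_poll_thread(void)
    {
        for (;;) {
            int handled = 0;

            while (handled < RX_QUOTA && device_has_packet()) {
                /* Feedback: if the next stage can't keep up, take no new
                 * work; leave packets on the card, where dropping is cheap. */
                if (screend_queue_full())
                    break;
                struct packet *p = poll_rx();   /* no per-packet interrupt */
                ip_input(p);                    /* process to completion   */
                handled++;
            }

            if (device_has_packet()) {
                yield();               /* quota hit or queue full:
                                          give xmit/screend a turn */
            } else {
                enable_rx_interrupt(); /* idle: go back to interrupts */
                sleep_until_rx_interrupt();
                disable_rx_interrupt(); /* busy again: poll */
            }
        }
    }

The quota keeps xmit from starving even under constant input; the feedback
check keeps the thread from burning cycles on packets that screend would only
drop later anyway.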
cs161 2004 Lecture 15: Performance (Part 2: Black Boxes)

We've talked about performance for single-machine/single-process systems -
  what happens with multiple components?

Motivation:
  People tend to build systems out of components (web server, DB, etc.),
    often closed or very complex components
  Performance problems are difficult to diagnose:
    we don't want to open up every black box
  Example: a multi-tier web service with a front end, auth, DB, DNS, and storage;
    all we might know is that requests are taking 400ms on average.

Contribution: figure out which black boxes to open, reducing the time/cost of
  debugging the system
  Instead of focusing on instrumentation, use the minimum amount of
    information needed to infer what's going on

Goal: find "high-impact" causal paths - those which occur frequently and have high latency
  Think of a hierarchical profiler: like gprof for distributed systems
  Compare to a simple profiler: we want to know that B is slow when called from A

Basic idea: passively monitor messages such as RPCs (e.g., with tcpdump) and
  try to figure out which components are slow, and when.
  Diagram: three components A, B, C; calls a->b, b->c, c->b, b->a,
    with a slight delay at each component

One approach, using call-return/RPC ids: the nesting algorithm
  pair call/return messages
  figure out nesting relationships
  reconstruct call paths
  Complication: parallel calls A->B, A->B, B->C, B->C -- which belongs to which?
    multiple options; define heuristics:
      scoreboard the potential parallel calls, make a guess based on a histogram

Another approach, without call-return semantics: the convolution algorithm
  (a code sketch follows at the end of these notes)
  Create time series:
    s1(t) = a->b messages
    s2(t) = b->c messages
  Diagram with impulses at times 1,2,5,6,7 for s1 and 3,4,7,8,9 for s2:
    t:  1 2 3 4 5 6 7 8 9
    s1: | |     | | |
    s2:     | |     | | |
  Convolution: x(t) = s2(t) [x] s1(-t), i.e., the cross-correlation of s1 and s2
    shows a peak at t=2
    x: 1 3 5 3 1 1 2 2 1 0  (just count the overlaps as s2 slides left)
    an FFT does this quickly, and it detects the time shift
    inherently robust to noise, but periodic traffic causes problems
  Cool, but slow - see the experimental results: 2 hours vs. 2 seconds!
    If you have call-return semantics, nesting seems like the way to go.

Practical concerns:
  clock synchronization: OK if the skew is less than the minimum instrumentation delay
  message loss: OK up to 5%
  delay variance: robust up to 30% of the minimum period
  Q: other problems?  What about cache hits?

Q: How useful is this to a programmer?  gprof tells you which functions are slow;
  can this approach?
Q: Is this the right level?  Instrument the Java VM?  There's an invasiveness tradeoff.

Conclusion: passively collected traces have enough information to infer what's
  going on at a high level.  Also, this was intended as a tool, not a system for
  automagic optimization, so it's OK to have some false positives/errors.
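A small sketch of the convolution algorithm's core step on the toy traces
above, assuming we've already extracted the a->b and b->c message timestamps
from a sniffer trace.  The brute-force counting loop stands in for the FFT
the real tool uses, and only non-negative shifts are tried since a->b should
precede b->c.

    #include <stdio.h>

    int main(void)
    {
        /* Example impulse times from the notes. */
        int s1[] = {1, 2, 5, 6, 7};   /* times of a->b messages */
        int s2[] = {3, 4, 7, 8, 9};   /* times of b->c messages */
        int n1 = sizeof(s1) / sizeof(s1[0]);
        int n2 = sizeof(s2) / sizeof(s2[0]);
        int best_shift = 0, best_count = 0;

        /* Cross-correlate: for each candidate delay, count how many b->c
         * messages line up with an a->b message shifted by that delay. */
        for (int shift = 0; shift <= 8; shift++) {
            int count = 0;
            for (int i = 0; i < n1; i++)
                for (int j = 0; j < n2; j++)
                    if (s2[j] - s1[i] == shift)
                        count++;
            printf("shift %d: %d overlaps\n", shift, count);
            if (count > best_count) {
                best_count = count;
                best_shift = shift;
            }
        }

        /* Prints 2 with 5 overlaps -- the peak at t=2 from the notes,
         * i.e. the inferred processing delay inside B. */
        printf("most likely delay: %d (%d overlaps)\n", best_shift, best_count);
        return 0;
    }

With real traces the series would be binned message counts per time slice
rather than a handful of exact timestamps, but the peak-finding idea is the
same.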