cs161 2004 Lecture 7: Logging

What's the overall topic?
  Atomic updates of complex data w.r.t. failures.
  Today just a single system; we'll be seeing distributed versions later.

Terminology
  data buffer vs data storage
  log buffer vs log storage
  dirty vs clean
  sync write vs async

Last lecture: ordered synchronous writes, one of which is the commit;
the others are completed or un-done by recovery.
  Example: FFS file creation:
    (1. update i-free list)
    2. initialize i-node
    3. update directory block
  The good news:
    This approach is correct when it can be made to work.
    The file system will work after a crash even w/o recovery.
    Operations are "durable" when the sys call returns.
  The bad news:
    Synchronous writes are very slow.
    Crash recovery may be slow, e.g. to rebuild the i-free list.
    Not all operations have an obvious single committing write.

FFS rename
  An editor could use rename from a temp file for a careful update:
    echo a > d1/f1
    echo b > d2/f2
    mv d2/f2 d1/f1
  We need to update two directories, stored in two blocks on disk.
  Remove then add? Add then remove? Probably want add then remove.
  What if there's a crash? What does fsck do?
    It knows something is wrong, since the link count is 1 but there
      are two links.
    It can't roll back -- which link would it delete?
    It has to just increase the link count.
    This is *not* a legal result of rename!
    But at least we haven't lost the file.
  So FFS is slow *and* it doesn't get the semantics right.

You can push tree update one step further.
  Prepare a new copy of the entire affected sub-tree.
  Replace the old sub-tree in one final write.
  Very expensive if done in the obvious way.
  But you can share structure between the old and new trees:
    only need new storage between the change points and the sub-tree root.
  (NetApp WAFL does this and more.)
  This approach still uses >= 1 sync write,
    and only works for tree data structures.

What are the reasons to use logging?
  Atomic commit of compound operations, w.r.t. crashes.
  Fast recovery (unlike fsck).
  Well-defined post-recovery state: a serial prefix of the operations,
    as if they were synchronous and the crash had occurred a bit earlier.
  Can be applied to almost any existing data structure,
    e.g. database tables, free lists.
  Very useful for coordinating updates to distributed data structures:
    you need to be very systematic about the order of updates, and about
    which parts of which updates have completed, in case of partial failure.
  The representation is compact on disk, so appends are very fast.

Transactions
  The main point of a log is to make complex operations atomic,
  i.e. operations that involve many individual writes.
  You want all of the writes or none, even if there's a crash in the middle.
  A "transaction" is a multi-write operation that should be atomic.
  The logging system needs to know which sets of writes form a transaction,
  so re-organize code to mark the start/end of each atomic group:
    create()
      begin_transaction
        update free list
        update i-node
        update directory entry
      end_transaction
  Writes go to the buffer cache via the logging system.
  There may be multiple concurrent transactions,
    e.g. if two processes are making system calls.

naive redo log
  Keep a "log" of updates:
    Begin TID
    Data TID B# new-data
    Commit TID
  Example:
    Begin T1
    Data T1 B1 25
    Commit T1
    Begin T2
    Data T2 B1 30
    Begin T3
    Data T3 B2 99
    Data T3 B3 50
    Commit T3
  For now, the log lives on its own infinite disk.
  Note that we include records from uncommitted xactions in the log,
    and records from concurrent xactions may be inter-mingled.
  We can flush dirty cache blocks to disk any time we want.
  Recovery (a sketch follows this list):
    1. discard the on-disk DB
    2. scan the whole log and remember all committed TIDs
    3. scan the whole log again, ignoring non-committed TIDs, and
       replay the writes
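A minimal sketch of this naive redo recovery in Python (the names and
record format are illustrative, not from the lecture; records are tuples
like ("Begin", tid), ("Data", tid, blockno, newdata), ("Commit", tid)):

  def recover(log):
      # Pass 1: find the TIDs of all committed transactions.
      committed = {rec[1] for rec in log if rec[0] == "Commit"}
      # Pass 2: discard the on-disk DB (start from an empty dict), then
      # replay the writes of committed transactions in log order.
      db = {}
      for rec in log:
          if rec[0] == "Data" and rec[1] in committed:
              db[rec[2]] = rec[3]
      return db

  # On the example log above this yields B1=25, B2=99, B3=50:
  # T2 never committed, so its write of 30 to B1 is ignored.
  log = [("Begin", "T1"), ("Data", "T1", "B1", 25), ("Commit", "T1"),
         ("Begin", "T2"), ("Data", "T2", "B1", 30),
         ("Begin", "T3"), ("Data", "T3", "B2", 99),
         ("Data", "T3", "B3", 50), ("Commit", "T3")]
  assert recover(log) == {"B1": 25, "B2": 99, "B3": 50}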
  Why can't we use any of the DB's contents during recovery?
    We don't know whether a block came from an uncommitted xaction,
    i.e. whether it was flushed to disk before the commit.
  So what have we achieved?
    Atomic update of complex data structures.
  Problems:
    We have to store the whole log forever.
    Recovery has to replay from the beginning of time.

redo with checkpoint
  Most logs work like this, e.g. FSD.
  Allows much faster recovery: we can use the on-disk DB.
  Write-ahead rule:
    delay flushing dirty blocks from the disk cache until the
    corresponding commit record is on disk.
  So keep the updates of uncommitted xactions in buffers (not on disk);
    then there is no uncommitted data on disk.
  But the disk may be missing some committed data,
    so recovery needs to replay committed data from the log.
  How can we avoid re-playing the whole log on recovery?
    Recovery needs to know a point in the log at which it can start:
    a "checkpoint", a pointer into the log, stored on disk.
  How do we ensure recovery can ignore everything before the checkpoint?
    Checkpoint rule: all writes before the checkpoint must be stable on disk.
    The checkpoint may not advance beyond the first uncommitted Begin.
    In the background, flush a batch of early writes, then update the
      checkpoint pointer.
  On recovery, re-play committed updates from the checkpoint onward.
  It's ok if we flush but crash before updating the checkpoint pointer:
    we will just re-write exactly the same data during recovery.
  We can free log space before the checkpoint!
  Problem: uncommitted transactions sit in main-memory buffers.
    That's a problem for long-running DB transactions
    (though not a problem for file systems).

undo/redo with checkpoint
  Suppose we want to flush uncommitted writes to disk?
  Then we need to be able to un-do them during recovery,
  so include the old value in each log record:
    Data TID B# old-data new-data
  Now we can write a buffer out as soon as its log entry is on disk;
    there's no need to wait for the Commit to be on disk.
  Two pointers are stored on disk: checkpoint and tail.
    checkpoint: all buffers are flushed up to this point.
      No need to redo before this point,
      but we may need to undo before this point.
    tail: the start of the first uncommitted transaction.
      No need to undo before this point,
      so we can free the log before this point.
  It's ok if we crash just before updating the tail pointer itself:
    we would have advanced it over committed transaction(s),
    so we will redo them -- no problem.
  What if there's an undo record for a block that was never written?
    It's ok: the undo will re-write the same value that's already there.
  What if:
    Begin T1
    Data T1 B1 old=10 new=20
    Begin T2
    Data T2 B1 old=20 new=30
    crash
  The right answer is B1 = 10, since neither transaction committed.
  But it looks like we'll un-do B1 only back to 20.
  What went wrong? How do we fix it?
    Apply undo records in reverse log order, so B1 goes 30 -> 20 -> 10
    (see the undo/redo sketch at the end of these notes).
  No main-memory buffering is needed for data between the tail and the
    checkpoint, so uncommitted xactions don't consume memory.

careful disk writing
  The log is usually stored in a dedicated, known area of the disk,
    so it's easy to find after a reboot.
  Where's the start? The checkpoint (or the tail pointer, for undo/redo).
  Where's the end? That's hard if a crash interrupted a log append:
    append records in order,
    store a checksum with each record to make sure it is complete,
    and have recovery search forward for the last matching checksum
    (see the end-of-log sketch at the end of these notes).

why is logging fast?
  Group commit -- batched log writes.
    We could delay flushing the log -- that may lose committed
    transactions, but at least you are left with a prefix.
  A single write (or less) to implement a transaction.
  No seek if the log has a dedicated disk.
  Write-behind of data allows batched / scheduled writes:
    one buffer may reflect many transactions,
    e.g. creating many files in a directory.
    We don't have to be so careful, since the log holds the real
    information.

remaining problems?
  Selective syncing seems hard.
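A simplified sketch of the undo/redo recovery described above, in Python
(illustrative names; it ignores the checkpoint and tail pointers, and the
record format is invented to match the notation in the notes). The
backward undo pass is what fixes the T1/T2 puzzle:

  def recover_undo_redo(log, db):
      # db maps block -> value as found on disk after the crash.
      committed = {rec[1] for rec in log if rec[0] == "Commit"}
      # Undo pass: scan BACKWARD, restoring the old values written by
      # uncommitted transactions. In the puzzle, B1 goes 30 -> 20 -> 10.
      for rec in reversed(log):
          if rec[0] == "Data" and rec[1] not in committed:
              db[rec[2]] = rec[3]   # old-data
      # Redo pass: scan forward, re-applying committed writes.
      for rec in log:
          if rec[0] == "Data" and rec[1] in committed:
              db[rec[2]] = rec[4]   # new-data
      return db

  # The crash scenario above: neither T1 nor T2 committed, and B1=30
  # was flushed to disk before the crash.
  log = [("Begin", "T1"), ("Data", "T1", "B1", 10, 20),
         ("Begin", "T2"), ("Data", "T2", "B1", 20, 30)]
  assert recover_undo_redo(log, {"B1": 30}) == {"B1": 10}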
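And a minimal sketch of the end-of-log search from the careful-disk-writing
section (the record layout and the use of zlib.crc32 are assumptions for
illustration; a real log would pick its own checksum):

  import zlib

  def find_log_end(records):
      # records: (payload_bytes, stored_crc) pairs, read in order from the
      # dedicated log area. The log ends just before the first record whose
      # checksum doesn't match -- i.e. an append that a crash interrupted.
      good = 0
      for payload, stored_crc in records:
          if zlib.crc32(payload) != stored_crc:
              break
          good += 1
      return good   # number of complete records; recovery uses only these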