Lab 4: Caching
Due March 15, 2022, at 8:00pm EDT
Introduction
Computers often use a technique called caching to make a memory system composed of a hierarchy of storage devices appear both large and fast. In particular, when we build a cache, we use a small amount of relatively fast storage at one level of the memory hierarchy to speed up access to the larger, relatively slow storage at the next lower level of the memory hierarchy. By “lower”, we mean further from the processor.
Here’s an example picture of what the memory hierarchy in your computer might look like:
At the top of the memory hierarchy, the picture shows the processor cache divided into multiple levels, with the L1 cache (sometimes pronounced “level-one cache”) closer to the processor than the L4 cache. This reflects how processor caches actually work in practice (there really are 3-4 different caches in your processor!), but we often think of a processor cache as a single unit.
Different computers have different sizes and access costs for these hierarchy levels; the ones listed above are typical. For example, a common desktop computer with four cores (i.e., four independent processors) might have ~200 bytes of registers; ~9 MB of processor cache; 8 GB primary memory; 512 GB SSD; and 2 TB hard disk. The processor cache divides into three levels: e.g., there might be 128KB of L1 cache, divided into four 32KB components; 512KB of L2 cache, divided into two 256KB components (one per “socket”, or pair of cores); and 8MB of L3 cache shared by all cores.
And distributed systems have yet lower levels. For instance, Google and Facebook have petabytes or exabytes of data, distributed among many computers. You can think of this as another storage layer, networked storage, that is typically slower to access than any local storage.
Each layer in the storage hierarchy acts as a cache for the following layer.
How is caching used?
Caches are so critical to computer systems that it sometimes seems like caching is the only performance-improving idea in systems. Processors have caches for primary memory. The operating system uses most of primary memory as a cache for disks and other stable storage devices. Running programs reserve some of their private memory to cache the operating system’s cache of the disk.
People have made entire careers out of proposing different variants of caches for different use cases. When a program processes a file, the actual computation is done using registers, which are caching values from the processor cache, which is caching data from the running program’s memory, which is caching data from the operating system, which is caching data from the disk. And modern disks contain caches inside their hardware too!
Assignment
Assignment installation
First, ensure that your repository has a handout remote. Type:
$ git remote show handout
If this reports an error, run:
$ git remote add handout https://github.com/csci0300/cs300-s22-labs.git
Then run:
$ git pull
$ git pull handout master
This will merge our Lab 4 stencil code with your previous work. If you have any “conflicts” from Lab 3, resolve them before continuing further. Run git push to save your work back to your personal repository.
Note: This lab contains both an autograded and a manually-graded portion, so the grading server checkoff button will only ever give you partial credit for this lab.
- Autograded: The programs you create in Exercise 1 will be autograded when you run the check-off script. You’ll receive partial credit on the lab if you pass the check-off script.
- Manually graded: Please write answers to the written questions in answers.md, and we’ll manually check submissions off every couple days. You’ll receive full credit for the lab if you pass the autograding tests and your answers demonstrate reasonable effort.
Exercise 1: Why Caching Matters
In this exercise, we’ll use simple I/O benchmark programs to evaluate the real-world impact of caches. Specifically, we’ll run several programs that write data to disk (to your computer’s hard drive). Each of these programs accomplishes the same task of writing 5,120,000 bytes (5 MB) to a file called data, but they do so in several different ways.
Note: Your course container does not have a real disk attached; instead, its disk is emulated by your computer. Therefore, it’s important that you run Exercise 1 on the department machines. The remaining exercises can run in your course container.
You can clone your repo onto the department machines by ssh’ing or logging into a machine and running:
$ git clone https://github.com/csci0300/cs300-s22-labs-<GITHUB_USERNAME>.git
Synchronous writes one byte at a time
Take a look at w01-byte.c. Currently, w01-byte.c writes to a file called data:
- one byte at a time
- using direct system calls (write)
- synchronously to the disk
“Synchronously” means that each write operation completes only when the data has made its way through all the layers of caching mentioned above, out to the true disk hardware. We request synchronous writes by opening the file with the O_SYNC option to open (line 5).
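For reference, here is a minimal sketch of this synchronous, byte-at-a-time approach (the stencil’s w01-byte.c is the authoritative version; the file flags and fill byte here are illustrative assumptions):

```c
// Sketch only: open the file synchronously and issue one write() system call per byte.
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    // O_SYNC makes each write() return only once the data has reached the disk hardware.
    int fd = open("data", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0666);
    if (fd < 0)
        return 1;
    char byte = 'A';                        // illustrative fill byte
    for (long i = 0; i < 5120000; i++) {
        if (write(fd, &byte, 1) != 1)       // one byte per system call, flushed to disk each time
            return 1;
    }
    close(fd);
    return 0;
}
```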
You can compile and run w01-byte.c by running:
- make w01-byte to compile
- ./w01-byte to run
Task: Run w01-byte and record how many bytes per second it can write. Write your answer into answers.md.
(This may take a few minutes. If this takes longer than 5 minutes, feel free to stop the program with Ctrl-C. You can still record the program’s bytes/second.)
You’ll notice that this is quite slow.
Asynchronous writes one byte at a time
We can optimize the previous program by removing the O_SYNC flag, so that the program requests “asynchronous” writes and relies on the operating system to write the data to the disk when it’s convenient (but after your write call has returned).
Task:
- Copy and paste the contents of w01-byte.c into a new file called w02-byte.c, and add w02-byte.c to the EXEC := ... line in the Makefile.
- Remove the O_SYNC flag from the open call in w02-byte.c.
- Once again, compile (using make w02-byte), run the program, and record its write speed in bytes per second in your answers.md.
You should notice that the asynchronous program is ~1,500x faster than the synchronous one. (This may vary based on the machine you’re using.)
What's happening at a high level?
When this program writes bytes to the file:
- The application makes a write system call asking the operating system to write a byte.
- The operating system performs this write to a buffer cache in main memory.
- The operating system immediately returns to the application!
That is, the application gets control back before the data is written to disk. The data is written to disk eventually, either after some time goes by or because the buffer cache runs out of room, but such later writes can be overlapped with other processing. This gives us a performance speed-up, but it comes at the cost of data persistence. If, for example, our computer loses power before the data gets written from the cache to disk, it may appear as if we finished writing our data, because our program finished writing data to the cache, but it won’t actually be on our disk after the computer restarts.
Adding more caching layers
While this new asynchronous program is much faster, it’s still being slowed down by expensive operations – namely, system calls. These operating system invocations are slow because they require the processor to stop what it’s doing, save the user program’s state, give control to the operating system to perform a write or read call, and then switch back. (We’ll learn a lot more about system calls in the OS part of the course shortly!)
To get around this, we can use an additional layer of caching through library calls – function calls that are implemented in the standard library and keep their own cache. These functions avoid the cost of switching states to the operating system, because you’re writing to their in-memory cache rather than to the OS’s. By using the standard library call fwrite rather than the write system call, the program performs writes to a cache in program memory without involving the operating system.
Task:
- Create a copy of w01-byte.c and rename the file w03-byte.c.
- Modify the new file to use the fopen, fwrite, and fclose library calls (instead of open, write, and close), and then record the results. (Make sure to add w03-byte to the Makefile.) A minimal sketch of this approach appears after this list.
- Note that this may involve changing the return values and arguments of the open, write, and close calls.
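For orientation, here is a minimal sketch of the stdio-buffered approach (not the stencil code; your w03-byte.c should keep the structure of w01-byte.c, and the fill byte is an illustrative assumption):

```c
// Sketch only: fwrite puts bytes into the stdio library's in-memory cache,
// which is flushed to the operating system in larger chunks.
#include <stdio.h>

int main(void) {
    FILE* f = fopen("data", "w");
    if (!f)
        return 1;
    char byte = 'A';                         // illustrative fill byte
    for (long i = 0; i < 5120000; i++) {
        if (fwrite(&byte, 1, 1, f) != 1)     // usually no system call: the byte lands in stdio's buffer
            return 1;
    }
    fclose(f);                               // flushes whatever is still buffered
    return 0;
}
```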
Even though all three programs write one byte at a time, caching makes the fastest one write ~40,000x faster than the program we started with.
What's happening at a high level?
When this program writes bytes to the file:
- The application makes an fwrite library function call asking the stdio library to write a byte.
- The library performs this write to its cache and immediately returns to the application.
That is, the application gets control back even before a system call is made. The data is eventually written out of the stdio cache using a system call: when the stdio cache runs out of room, when it is explicitly flushed, or when the program exits.
Exercise 2: How the Processor Cache Works
Now that you have a better idea why caching is so important, let’s get into how one particular, important kind of cache in your computer works. We will be looking at the processor cache, which is a fast piece of memory that the processor uses to be able to access data in memory more quickly. We already encountered the processor cache briefly when we talked about alignment in lectures!
Note that the idea behind caching is general and applies both to the I/O caching you explored in Exercise 1 and to the caching we will explore in this exercise. But the processor cache is actually a specific piece of hardware: it is part of the processor chip, and you could see it if you pried the chip open and put it under a microscope!
Your course container behaves like a virtual machine, which makes it difficult to get reliable cache statistics from it, as it shares the processor cache with other programs on your computer. For this exercise, we therefore use a cache simulator tool, called Venus, to visualize changes to the cache.
Caches have several hit, miss, and eviction policies, each with their own trade-offs. In the Venus simulator, you will use a write-through hit policy, a write-allocate miss policy, and an LRU eviction policy. We explain what these terms mean below!
Write-through means that on a write “hit”, data is written to both the cache and main memory. Writing to the cache is fast, but writing to main memory is slow; this makes write latency in write-through caches slower than in a write-back cache (where the data is only written to main memory later, when the cache block is evicted). However, write-through caches mean simpler hardware, since we can assume in write-through caches that memory always has the most up-to-date data. (Sound familiar? It should remind you of Exercise 1, where we wrote bytes synchronously to disk!)
Write-allocate means that on a write miss (where the data you’re writing to isn’t already in the cache), you pull the block you missed on into the cache and perform the write on it.
LRU (least recently used) means that when a cache block must be evicted to make space for a new one, we select the block that has been used furthest back in time (“least recently”) and throw it out of the cache.
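To make these three policies concrete, here is a toy, self-contained model of a cache handling writes (an illustration only, not how Venus is implemented; each block holds a single word here, and the cache and memory sizes are arbitrary):

```c
// Toy model: a 2-block cache in front of a small "memory", with write-through,
// write-allocate, and LRU eviction. Since every write goes straight through to
// memory, the cache only needs to remember which addresses it currently holds.
#include <stdio.h>

#define NUM_BLOCKS 2

int memory[16];
struct { int addr; int valid; long last_used; } cache[NUM_BLOCKS];
long now = 0;

void cache_write(int addr, int value) {
    now++;
    int slot = -1;
    for (int i = 0; i < NUM_BLOCKS; i++)                  // is this address already cached?
        if (cache[i].valid && cache[i].addr == addr)
            slot = i;
    if (slot < 0) {                                        // write miss
        slot = 0;                                          // write-allocate: pick a victim slot...
        for (int i = 1; i < NUM_BLOCKS; i++)               // ...preferring an empty one, else the LRU one
            if (!cache[i].valid ||
                (cache[slot].valid && cache[i].last_used < cache[slot].last_used))
                slot = i;
        cache[slot].addr = addr;                           // pull the missed block into the cache
        cache[slot].valid = 1;
        printf("addr %d: miss\n", addr);
    } else {
        printf("addr %d: hit\n", addr);
    }
    cache[slot].last_used = now;                           // bookkeeping for the LRU policy
    memory[addr] = value;                                  // write-through: memory is updated immediately
}

int main(void) {
    cache_write(0, 1);   // miss (cache is empty)
    cache_write(4, 1);   // miss
    cache_write(0, 2);   // hit
    cache_write(8, 1);   // miss; evicts addr 4, the least recently used block
    return 0;
}
```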
Getting Set Up
This lab uses the cache.s file. This file contains RISC-V assembly code, an assembly language for a different architecture than the one you’ve worked with so far. You won’t need to understand how to read this assembly code, but if you’re curious about the differences between the x86-64 and RISC-V architectures, check out this post. Here is some C-like pseudocode for the assembly instructions in cache.s:
int array[];
for (k = 0; k < repcount; k++) {
    for (index = 0; index < arraysize; index += stepsize) {
        if (option == 0)
            array[index] = 0;
        else
            array[index] = array[index] + 1;
    }
}
Lines 24-28 of cache.s allow us to set the arraysize, stepsize, repcount, and option variables. li is a RISC-V instruction to load a constant into a register. Currently:
- the arraysize, represented by a0 in the assembly code, stores 256 bytes, which is 64 integers (because sizeof(int) = 4 in the Venus simulator).
- the stepsize, represented by a1, stores 2.
- the repcount, represented by a2, stores 1.
- the option, represented by a3, stores 1.
Make sure you understand what the pseudocode does and how you can edit the variables before you proceed to analyze cache configurations on it.
- The most important thing to understand is the pseudocode. When you run cache.s, instructions that perform this pseudocode will be executed.
- Which elements you access is determined by the step size (a1), and how many times you do so is determined by the repcount (a2). These two parameters will most directly affect how many cache hits or misses occur. The option (a3) affects whether you zero out some elements of the array (option 0) or increment them (option 1). The sketch after this list shows one way to count the resulting memory accesses.
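As a sanity check on your hit-rate predictions, here is a small standalone helper (not part of the lab; it assumes that option 1 counts the load and the store as two separate accesses, and that stepsize is measured in array elements):

```c
// Rough estimate of how many data-memory accesses the pseudocode above makes.
#include <stdio.h>

long estimate_accesses(long arraysize_bytes, long stepsize, long repcount, int option) {
    long elements      = arraysize_bytes / 4;                  // 4-byte integers, as in Venus
    long iterations    = (elements + stepsize - 1) / stepsize; // inner-loop iterations per repetition
    long per_iteration = (option == 0) ? 1 : 2;                // option 0: one store; option 1: load + store
    return repcount * iterations * per_iteration;
}

int main(void) {
    // Scenario 1 below: 128-byte array, step 1, 2 repetitions, option 0.
    printf("%ld accesses\n", estimate_accesses(128, 1, 2, 0)); // prints "64 accesses"
    return 0;
}
```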
Task: To get set up in Venus:
- Open the Venus Cache Simulator.
- Copy and paste the code from cache.s into the Editor tab.
- In the Simulator tab, click Assemble and Simulate from Editor to assemble the code.
- Once you’ve assembled the code, you can click Run to execute the code. You can also click on assembly instructions to set breakpoints in the code. Once you’ve run the program, you need to click Reset before you can run it again.
Simulate the following scenarios and record the final cache hit rates with the program in cache.s. You can find the cache hit rates in the “Cache” sidebar on the right-hand side of the Simulator tab (you may have to scroll down a bit to see the “Hit Rate” row). Try to reason out what the hit rate will be before running the code. After running each simulation, make sure you understand why you see what you see!
Scenario 1:
Program Parameters: Set these by initializing the registers in the code in the Editor tab, under lines 22-29.
- Array Size (a0): 128 (bytes)
- Step Size (a1): 1
- Rep Count (a2): 2
- Option (a3): 0
Cache Parameters: Set these in the Cache section of the Simulator tab.
- Cache Levels: 1
- Block Size: 8
- Number of Blocks: 1
- Enable?: Click on the button to make it green
- Placement Policy: Direct Mapped
- Associativity: 1 (Ignore this for this lab, or look it up if you’re curious.)
- Block Replacement Policy: LRU
After you assemble the program, you can visualize when blocks are getting pulled into the cache on each memory access by setting breakpoints at instructions 0x30 and 0x40 in the Simulator by clicking on them. Instruction 0x40 corresponds to the instruction that performs array[index] = 0; under option 0, and 0x30 corresponds to the instruction that performs array[index] = array[index] + 1; under option 1.
Here’s how this should look if you set everything up properly:
Task: Now run the program, and answer these questions in answers.md.
- How many array elements can fit into a cache block?
- What combination of parameters is producing (i.e., explains) the hit rate you observe? Think about the sizes of each of the parameters.
- What is our hit rate if we increase Rep Count arbitrarily? Why?
How can I interpret the hit rate in Venus?
Given the parameters of the program, over two repetitions our program will iterate through each of the 32 integers in the array and set its value to 0. Because our block size is 8 bytes, we can fit 2 integers in the cache. Originally, your cache is empty. If you assemble the program, set a breakpoint at instruction 0x40, and then run the program, you can watch your cache slowly accumulate misses and hits:
- Hit Run for the first time: your program will stop at instruction 0x44. It has just set array[0] = 0. Because this array element isn’t in the cache, you should see a cache miss on the right-hand side; but now that you’ve performed an access, your cache has pulled in a block of memory (containing 8 bytes – the first two integers in the array). Your program has performed 1 access, and your hit count and hit rate are 0 because your cache hasn’t had any cache hits.
- Hit Run a second time: your program will stop at instruction 0x44. It has just set array[1] = 0. Because this array element is in the cache, you should see a cache hit. Your program has now performed 2 accesses, your hit count is 1, and your hit rate is 0.5 because your cache just had a cache hit.
- If you keep hitting Run, you’ll see the number of accesses, hit count, and hit rate change according to the accesses being made. If you remove the breakpoint on 0x40 by clicking on it and then hit Run, you’ll see that the final accesses, hit count, and hit rate look something like the image below. Think about why this makes sense.
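If you want to double-check that reasoning, here is a tiny standalone sketch (not Venus; it hard-codes Scenario 1’s parameters and a single 8-byte block) that replays the same access pattern and computes the hit rate:

```c
// Replays Scenario 1: 32 ints accessed sequentially, twice, through one 8-byte block.
#include <stdio.h>

int main(void) {
    const int block_size = 8;          // bytes per cache block
    long cached_block = -1;            // which memory block the single cache slot holds (-1 = empty)
    long accesses = 0, hits = 0;

    for (int rep = 0; rep < 2; rep++) {                 // Rep Count = 2
        for (int index = 0; index < 32; index += 1) {   // 128-byte array = 32 ints, Step Size = 1
            long addr  = index * 4;                     // byte address of array[index]
            long block = addr / block_size;             // memory block this access falls in
            accesses++;
            if (block == cached_block)
                hits++;                                 // the neighboring int was just pulled in
            else
                cached_block = block;                   // miss: pull the block into the cache
        }
    }
    printf("hit rate: %.2f\n", (double)hits / accesses); // prints 0.50
    return 0;
}
```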
Scenario 2:
Program Parameters: Set these by initializing the registers in the code in the Editor tab.
- Array Size (a0): 128 (bytes)
- Step Size (a1): 27 (step by 27 elements of the array)
- Rep Count (a2): 2
- Option (a3): 0
Cache Parameters: Set these in the Cache section of the Simulator tab.
- Cache Levels: 1
- Block Size: 8
- Number of Blocks: 1
- Enable?: Should be green
- Placement Policy: Direct Mapped
- Associativity: 1
- Block Replacement Policy: LRU
Task:
- What combination of parameters is producing (i.e., explains) the hit rate you observe? Think about the sizes of each of the parameters.
- What happens to our hit rate if we increase the number of blocks and why?
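When you experiment with a larger number of blocks, it helps to work out which cache slot each access maps to. This small standalone snippet (not part of the lab; num_blocks = 4 is just a hypothetical value to play with) applies the standard direct-mapped mapping:

```c
// For each element Scenario 2 touches, print the memory block and the cache slot it maps to.
#include <stdio.h>

int main(void) {
    int block_size = 8;                             // bytes per block, as configured in Venus
    int num_blocks = 4;                             // hypothetical; Scenario 2 starts with 1
    for (int index = 0; index < 32; index += 27) {  // same elements as Scenario 2 (32 ints, step 27)
        long addr  = index * 4;                     // byte address of array[index]
        long block = addr / block_size;             // which memory block is touched
        long slot  = block % num_blocks;            // which cache slot it maps to (direct mapped)
        printf("array[%d] -> memory block %ld -> cache slot %ld\n", index, block, slot);
    }
    return 0;
}
```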
Scenario 3:
Program Parameters: Set these by initializing the registers in the code in the Editor tab.
- Array Size (a0): 256 (bytes)
- Step Size (a1): 2
- Rep Count (a2): 2
- Option (a3): 1
Cache Parameters: Set these in the Cache section of the Simulator tab.
- Cache Levels: 1
- Enable?: Should be green
- Placement Policy: Direct Mapped
- Associativity: 1
- Block Replacement Policy: LRU
Task:
- Choose a number of blocks greater than 1 and determine the smallest block size that uses every block and maximizes the hit rate, given the parameters above. Explain why.
(N.B.: Venus will throw an error if your number of blocks or your block size is not a power of 2.)
Exercise 3: Writing Cache Friendly Code
Hopefully, you realized in the last exercise that we can maximize our hit rate by making the cache large enough to hold all of the contents of the array we’re accessing. In practice, it would be nice if we could store all of the contents of the memory that we’re accessing in the fastest possible storage, but like many things in life, we’re often limited by financial resources (remember the memory hierarchy from lecture?). You can check how much space in bytes your machine is allocating for caching by typing getconf -a | grep CACHE or lscpu in your course container.
By understanding how caches work, you can optimize a program’s memory access patterns to obtain good performance from the memory hierarchy. As a concrete example of such optimization, let’s analyze the performance of various matrix multiplication functions. Matrix multiplication operations are at the heart of many linear algebra algorithms, and efficient matrix multiplication is critical for many applications within the applied sciences and deep learning fields (for example, much of what TensorFlow gets up to during model training and inference under the hood turns into matrix multiplications).
If you recall, matrices are 2-dimensional data structures wherein each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops, assuming that matrices A, B, and C are all n-by-n and stored in one-dimensional column-major arrays (see below for an explanation of what these terms mean):
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
Note: In this exercise, each 2D matrix is flattened into a one-dimensional array. For example, a 5x5 matrix becomes a 25-element 1D array. In a row-major layout, each row is laid out consecutively in memory: where you would access the 2nd row and 3rd column of a 5x5 matrix called A using A[1][2] (since it’s 0-indexed), you’d access the equivalent entry of the 1D array called B with B[1*5 + 2]. The matrices in this exercise use the column-major layout instead, in which each column is consecutive (see the sketch below).
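To make the two layouts concrete, here is a tiny illustration (not from matrixMultiply.c) of where the same logical element lands in a flattened array under each convention:

```c
// The element in row 1, column 2 of a 5x5 matrix, under the two common flattenings.
#include <stdio.h>

int main(void) {
    int n = 5, i = 1, j = 2;                 // 2nd row, 3rd column, 0-indexed (as in the note above)
    int row_major_index    = i * n + j;      // rows are consecutive in memory: 1*5 + 2 = 7
    int column_major_index = i + j * n;      // columns are consecutive: the layout the loop above uses
    printf("row-major: %d, column-major: %d\n", row_major_index, column_major_index);
    return 0;
}
```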
In the above code, note that the loops are ordered i, j, k. If we examine the innermost loop (the one that increments k), we see that during one full pass it…
- moves through a single column of B one element at a time (because the matrices are column-major, these are consecutive memory locations)
- moves through a single row of A, jumping n elements in memory between accesses
- accesses the same one entry of C on every iteration
Check out matrixMultiply.c. You’ll notice that the file contains multiple implementations of matrix multiply with three nested loops. You can compile and run matrixMultiply.c by running:
$ make matrixMultiply
$ ./matrixMultiply
While each of these implementations will correctly calculate the matrix multiplication, the order in which we choose to access the elements of the matrices can have a large impact on performance. From the previous exercises, you should realize that caches perform better (more cache hits, fewer cache misses) when memory accesses take advantage of temporal locality (recently accessed blocks are accessed again) and spatial locality (nearby blocks are accessed soon), utilizing blocks already contained within our cache.
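As a standalone illustration of that principle (separate from matrixMultiply.c), the two loops below touch exactly the same elements, but the first walks memory consecutively while the second jumps a whole row between accesses:

```c
// Same work, different access order: consecutive traversal vs. strided traversal.
#include <stdio.h>
#define N 1024

static int m[N * N];                     // an N-by-N matrix flattened into one array

int main(void) {
    long sum = 0;
    // Consecutive accesses: neighboring elements share cache blocks (spatial locality).
    for (int a = 0; a < N; a++)
        for (int b = 0; b < N; b++)
            sum += m[a * N + b];
    // Strided accesses: each access is N elements away from the last, so a freshly
    // loaded block is often evicted before its neighbors are ever used.
    for (int a = 0; a < N; a++)
        for (int b = 0; b < N; b++)
            sum += m[b * N + a];
    printf("%ld\n", sum);
    return 0;
}
```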
Task: Run the program a couple of times, order the functions from fastest to slowest, and explain why each function’s ranking makes sense using your understanding of how the cache works. Some functions might have similar runtimes. If this is the case, explain why.
Note: The unit “Gflops/s” reads, “Giga-floating-point-operations per second.” The bigger the number, the faster the program is running.
Handin instructions
Turn in your code by pushing your git repository to github.com/csci0300/cs300-s22-labs-YOURNAME.git.
Then, head to the grading server. On the “Labs” page, use the “Lab 4 checkoff” button to check off your lab.
Note: For this lab, you will only receive partial credit from the checkoff button, so don’t panic if it says that you received a partial checkoff (everybody does).
The partial credit corresponds to Exercise 1. We will grade the remainder of the lab manually; TAs will update your grade accordingly. Make sure your answers to the conceptual questions are in answers.md.
Acknowledgements: This lab is based on exercises in Harvard’s CS61 and UC Berkeley’s CS61c courses, which we have reused with permission.