Lab 4: Caching
Due March 15, 2022, at 8:00pm EDT
Introduction
Computers often use a technique called caching to make a memory system composed of a hierarchy of storage devices appear both large and fast. In particular, when we build a cache, we use a small amount of relatively fast storage at one level of the memory hierarchy to speed up access to the larger, relatively slow storage at the next lower level of the memory hierarchy. By “lower”, we mean further from the processor.
Here’s an example picture of what the memory hierarchy in your computer might look like:
At the top of the memory hierarchy, the picture shows the processor cache divided into multiple levels, with the L1 cache (sometimes pronounced “level-one cache”) closer to the processor than the L4 cache. This reflects how processor caches actually work in practice (there really are 3-4 different caches in your processor!), but we often think of a processor cache as a single unit.
Different computers have different sizes and access costs for these hierarchy levels; the ones listed above are typical. For example, a common desktop computer with four cores (i.e., four independent processors) might have ~200 bytes of registers; ~9 MB of processor cache; 8 GB primary memory; 512 GB SSD; and 2 TB hard disk. The processor cache divides into three levels: e.g., there might be 128KB of L1 cache, divided into four 32KB components; 512KB of L2 cache, divided into two 256KB components (one per “socket”, or pair of cores); and 8MB of L3 cache shared by all cores.
And distributed systems have yet lower levels. For instance, Google and Facebook have petabytes or exabytes of data, distributed among many computers. You can think of this as another storage layer, networked storage, that is typically slower to access than any local storage.
Each layer in the storage hierarchy acts as a cache for the following layer.
How is caching used?
Caches are so critical to computer systems that it sometimes seems like caching is the only performance-improving idea in systems. Processors have caches for primary memory. The operating system uses most of primary memory as a cache for disks and other stable storage devices. Running programs reserve some of their private memory to cache the operating system’s cache of the disk.
People have made entire careers out of proposing different variants of caches for different use cases. When a program processes a file, the actual computation is done using registers, which are caching values from the processor cache, which is caching data from the running program’s memory, which is caching data from the operating system, which is caching data from the disk. And modern disks contain caches inside their hardware too!
Assignment
Assignment installation
First, ensure that your repository has a handout remote. Type:
$ git remote show handout
If this reports an error, run:
$ git remote add handout https://github.com/csci0300/cs300-s22-labs.git
Then run:
$ git pull
$ git pull handout master
This will merge our Lab 4 stencil code with your previous work. If you have any “conflicts” from Lab 3, resolve them before continuing further. Run git push to save your work back to your personal repository.
Note: This lab contains both an autograded and a manually-graded portion, so the grading server checkoff button will only ever give you partial credit for this lab.
- Autograded: The programs you create in Exercise 1 will be autograded when you run the check-off script. You’ll receive partial credit on the lab if you pass the check-off script.
- Manually graded: Please write answers to the written questions in answers.md, and we’ll manually check submissions off every couple days. You’ll receive full credit for the lab if you pass the autograding tests and your answers demonstrate reasonable effort.
Exercise 1: Why Caching Matters
In this exercise, we’ll use simple I/O benchmark programs to evaluate the real-world impact of caches. Specifically, we’ll run several programs that write data to disk (to your computer’s hard drive). Each of these programs accomplishes the same task of writing 5,120,000 bytes (5 MB) to a file called data, but they do so in several different ways.
Note: Your course container does not have a real disk attached; instead, its disk is emulated by your computer. Therefore, it’s important that you run Exercise 1 on the department machines. The remaining exercises can run in your course container.
You can clone your repo onto the department machines by ssh’ing or logging into a machine and running:
$ git clone https://github.com/csci0300/cs300-s22-labs-<GITHUB_USERNAME>.git
Synchronous writes one byte at a time
Take a look at w01-byte.c. Currently, w01-byte.c writes to a file called data:
- one byte at a time
- using direct system calls (write)
- synchronously to the disk
“Synchronously” means that each write operation completes only when the data has made its way through all the layers of caching mentioned above, out to the true disk hardware. We request synchronous writes by opening the file with the O_SYNC option to open (line 5).
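For reference, here is a minimal sketch of this synchronous, byte-at-a-time approach (the stencil’s w01-byte.c is the authoritative version; the file flags and fill byte here are illustrative assumptions):

```c
// Sketch only: open the file synchronously and issue one write() system call per byte.
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    // O_SYNC makes each write() return only once the data has reached the disk hardware.
    int fd = open("data", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0666);
    if (fd < 0)
        return 1;
    char byte = 'A';                        // illustrative fill byte
    for (long i = 0; i < 5120000; i++) {
        if (write(fd, &byte, 1) != 1)       // one byte per system call, flushed to disk each time
            return 1;
    }
    close(fd);
    return 0;
}
```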
You can compile and run w01-byte.c by running:
- make w01-byte to compile
- ./w01-byte to run
Task: Run w01-byte and record how many bytes per second it can write. Write your answer into answers.md.
(This may take a few minutes. If this takes longer than 5 minutes, feel free to stop the program with Ctrl-C. You can still record the program’s bytes/second.)
You’ll notice that this is quite slow.
Asynchronous writes one byte at a time
We can optimize the previous program by removing the O_SYNC flag, so that the program requests “asynchronous” writes and relies on the operating system to write the data to the disk when it’s convenient (but after your write call has returned).
Task:
- Copy and paste the contents of w01-byte.c into a new file called w02-byte.c, and add w02-byte.c to the EXEC := ... line in the Makefile.
- Remove the O_SYNC flag from the open call in w02-byte.c.
- Once again, compile (using make w02-byte), run the program, and record its write speed in bytes per second in your answers.md.
You should notice that the asynchronous program is ~1,500x faster than the synchronous one. (This may vary based on the machine you’re using.)
What's happening at a high level?
When this program writes bytes to the file:
- The application makes a write system call asking the operating system to write a byte.
- The operating system performs this write to a buffer cache in main memory.
- The operating system immediately returns to the application!
That is, the application gets control back before the data is written to disk. The data is written to disk eventually, either after some time goes by or because the buffer cache runs out of room, but such later writes can be overlapped with other processing. This gives us a performance speed-up, but it comes at the cost of data persistence. If, for example, our computer loses power before the data gets written from the cache to disk, it may appear as if we finished writing our data, because our program finished writing data to the cache, but it won’t actually be on our disk after the computer restarts.
Adding more caching layers
While this new asynchronous program is much faster, it’s still being slowed down by expensive operations – namely, system calls. These operating system invocations are slow because they require the processor to stop what it’s doing, save the user program’s state, give control to the operating system to perform a write or read call, and then switch back. (We’ll learn a lot more about system calls in the OS part of the course shortly!)
To get around this, we can use an additional layer of caching through library calls – function calls that are implemented in the standard library and keep their own cache. These functions avoid the cost of switching states to the operating system, because you’re writing to their in-memory cache rather than to the OS’s. By using the standard library call fwrite rather than the write system call, the program performs writes to a cache in program memory without involving the operating system.
Task:
- Create a copy of w01-byte.c and rename the file w03-byte.c.
- Modify the new file to use the fopen, fwrite, and fclose library calls (instead of open, write, and close), and then record the results. (Make sure to add w03-byte to the Makefile.) A minimal sketch of this approach appears after this list.
- Note that this may involve changing the return values and arguments of the open, write, and close calls.
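For orientation, here is a minimal sketch of the stdio-buffered approach (not the stencil code; your w03-byte.c should keep the structure of w01-byte.c, and the fill byte is an illustrative assumption):

```c
// Sketch only: fwrite puts bytes into the stdio library's in-memory cache,
// which is flushed to the operating system in larger chunks.
#include <stdio.h>

int main(void) {
    FILE* f = fopen("data", "w");
    if (!f)
        return 1;
    char byte = 'A';                         // illustrative fill byte
    for (long i = 0; i < 5120000; i++) {
        if (fwrite(&byte, 1, 1, f) != 1)     // usually no system call: the byte lands in stdio's buffer
            return 1;
    }
    fclose(f);                               // flushes whatever is still buffered
    return 0;
}
```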
Even though all three programs write one byte at a time, caching makes the fastest one write ~40,000x faster than the program we started with.
What's happening at a high level?
When this program writes bytes to the file:
- The application makes an fwrite library function call asking the stdio library to write a byte.
- The library performs this write to its cache and immediately returns to the application.
That is, the application gets control back even before a system call is made. The data is eventually written out of the stdio cache using a system call: when the stdio cache runs out of room, when it is explicitly flushed, or when the program exits.
Exercise 2: How the Processor Cache Works
Now that you have a better idea why caching is so important, let’s get into how one particular, important kind of cache in your computer works. We will be looking at the processor cache, which is a fast piece of memory that the processor uses to be able to access data in memory more quickly. We already encountered the processor cache briefly when we talked about alignment in lectures!
Note that the idea behind caching is general and applies both to the I/O caching you explored in Exercise 1 and to the caching we will explore in this exercise. But the processor cache is actually a specific piece of hardware: it is part of the processor chip, and you could see it if you pried the chip open and put it under a microscope!
Your course container behaves like a virtual machine, which makes it difficult to get reliable cache statistics from it, as it shares the processor cache with other programs on your computer. For this exercise, we therefore use a cache simulator tool, called Venus, to visualize changes to the cache.
Caches have several hit, miss, and eviction policies, each with their own trade-offs. In the Venus simulator, you will use a write-through hit policy, a write-allocate miss policy, and an LRU eviction policy. We explain what these terms mean below!
Write-through means that on a write “hit”, data is written to both the cache and main memory. Writing to the cache is fast, but writing to main memory is slow; this makes write latency in write-through caches slower than in a write-back cache (where the data is only written to main memory later, when the cache block is evicted). However, write-through caches mean simpler hardware, since we can assume in write-through caches that memory always has the most up-to-date data. (Sound familiar? It should remind you of Exercise 1, where we wrote bytes synchronously to disk!)
Write-allocate means that on a write miss (where the data you’re writing to isn’t already in the cache), you pull the block you missed on into the cache and perform the write on it.
LRU (least recently used) means that when a cache block must be evicted to make space for a new one, we select the block that has been used furthest back in time (“least recently”) and throw it out of the cache.
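To make these three policies concrete, here is a toy, self-contained model of a cache handling writes (an illustration only, not how Venus is implemented; each block holds a single word here, and the cache and memory sizes are arbitrary):

```c
// Toy model: a 2-block cache in front of a small "memory", with write-through,
// write-allocate, and LRU eviction. Since every write goes straight through to
// memory, the cache only needs to remember which addresses it currently holds.
#include <stdio.h>

#define NUM_BLOCKS 2

int memory[16];
struct { int addr; int valid; long last_used; } cache[NUM_BLOCKS];
long now = 0;

void cache_write(int addr, int value) {
    now++;
    int slot = -1;
    for (int i = 0; i < NUM_BLOCKS; i++)                  // is this address already cached?
        if (cache[i].valid && cache[i].addr == addr)
            slot = i;
    if (slot < 0) {                                        // write miss
        slot = 0;                                          // write-allocate: pick a victim slot...
        for (int i = 1; i < NUM_BLOCKS; i++)               // ...preferring an empty one, else the LRU one
            if (!cache[i].valid ||
                (cache[slot].valid && cache[i].last_used < cache[slot].last_used))
                slot = i;
        cache[slot].addr = addr;                           // pull the missed block into the cache
        cache[slot].valid = 1;
        printf("addr %d: miss\n", addr);
    } else {
        printf("addr %d: hit\n", addr);
    }
    cache[slot].last_used = now;                           // bookkeeping for the LRU policy
    memory[addr] = value;                                  // write-through: memory is updated immediately
}

int main(void) {
    cache_write(0, 1);   // miss (cache is empty)
    cache_write(4, 1);   // miss
    cache_write(0, 2);   // hit
    cache_write(8, 1);   // miss; evicts addr 4, the least recently used block
    return 0;
}
```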
Getting Set Up
This lab uses the cache.s file. This file contains RISC-V assembly code, an assembly language for a different architecture than the one you’ve worked with so far. You won’t need to understand how to read this assembly code, but if you’re curious about the differences between the x86-64 and RISC-V architectures, check out this post. Here is some C-like pseudocode for the assembly instructions in cache.s:
int array[];
for (k = 0; k < repcount; k++) {
    for (index = 0; index < arraysize; index += stepsize) {
        if (option == 0)
            array[index] = 0;
        else
            array[index] = array[index] + 1;
    }
}
Lines 24-28 of cache.s allow us to set the arraysize, stepsize, repcount, and option variables. li is a RISC-V instruction to load a constant into a register. Currently:
- the arraysize, represented by a0 in the assembly code, stores 256 bytes, which is 64 integers (because sizeof(int) = 4 in the Venus simulator).
- the stepsize, represented by a1, stores 2.
- the repcount, represented by a2, stores 1.
- the option, represented by a3, stores 1.
Make sure you understand what the pseudocode does and how you can edit the variables before you proceed to analyze cache configurations on it.
- The most important thing to understand is the pseudocode. When you run cache.s, instructions that perform this pseudocode will be executed.
- Which elements you access is determined by the step size (a1), and how many times you do so is determined by the repcount (a2). These two parameters will most directly affect how many cache hits or misses occur. The option (a3) affects whether you zero out some elements of the array (option 0) or increment them (option 1). The sketch after this list shows one way to count the resulting memory accesses.
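As a sanity check on your hit-rate predictions, here is a small standalone helper (not part of the lab; it assumes that option 1 counts the load and the store as two separate accesses, and that stepsize is measured in array elements):

```c
// Rough estimate of how many data-memory accesses the pseudocode above makes.
#include <stdio.h>

long estimate_accesses(long arraysize_bytes, long stepsize, long repcount, int option) {
    long elements      = arraysize_bytes / 4;                  // 4-byte integers, as in Venus
    long iterations    = (elements + stepsize - 1) / stepsize; // inner-loop iterations per repetition
    long per_iteration = (option == 0) ? 1 : 2;                // option 0: one store; option 1: load + store
    return repcount * iterations * per_iteration;
}

int main(void) {
    // Scenario 1 below: 128-byte array, step 1, 2 repetitions, option 0.
    printf("%ld accesses\n", estimate_accesses(128, 1, 2, 0)); // prints "64 accesses"
    return 0;
}
```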
Task: To get set up in Venus:
- Open the Venus Cache Simulator.
- Copy and paste the code from cache.s into the Editor tab.
- In the Simulator tab, click Assemble and Simulate from Editor to assemble the code.
- Once you’ve assembled the code, you can click Run to execute the code. You can also click on assembly instructions to set breakpoints in the code. Once you’ve run the program, you need to click Reset before you can run it again.
Simulate the following scenarios and record the final cache hit rates with the program in cache.s. You can find the cache hit rates in the “Cache” sidebar on the right-hand side of the Simulator tab (you may have to scroll down a bit to see the “Hit Rate” row). Try to reason out what the hit rate will be before running the code. After running each simulation, make sure you understand why you see what you see!
Scenario 1:
Program Parameters: Set these by initializing the registers in the code in the Editor tab, under lines 22-29.
- Array Size (a0): 128 (bytes)
- Step Size (a1): 1
- Rep Count (a2): 2
- Option (a3): 0
Cache Parameters: Set these in the Cache section of the Simulator tab.
- Cache Levels: 1
- Block Size: 8
- Number of Blocks: 1
- Enable?: Click on the button to make it green
- Placement Policy: Direct Mapped
- Associativity: 1 (Ignore this for this lab, or look it up if you’re curious.)
- Block Replacement Policy: LRU
After you assemble the program, you can visualize when blocks are getting pulled into the cache on each memory access by setting breakpoints at instructions 0x30 and 0x40 in the Simulator by clicking on them. Instruction 0x40 corresponds to the instruction that performs array[index] = 0; under option 0, and 0x30 corresponds to the instruction that performs array[index] = array[index] + 1; under option 1.
Here’s how this should look if you set everything up properly:
Task: Now run the program, and answer these questions in answers.md.
- How many array elements can fit into a cache block?
- What combination of parameters is producing (i.e., explains) the hit rate you observe? Think about the sizes of each of the parameters.
- What is our hit rate if we increase Rep Count arbitrarily? Why?
How can I interpret the hit rate in Venus?
Given the parameters of the program, over two repetitions our program will iterate through each of the 32 integers in the array and set its value to 0. Because our block size is 8 bytes, we can fit 2 integers in the cache. Originally, your cache is empty. If you assemble the program, set a breakpoint at instruction 0x40, and then run the program, you can watch your cache slowly accumulate misses and hits:
- Hit Run for the first time: your program will stop at instruction 0x44. It has just set array[0] = 0. Because this array element isn’t in the cache, you should see a cache miss on the right-hand side; but now that you’ve performed an access, your cache has pulled in a block of memory (containing 8 bytes – the first two integers in the array). Your program has performed 1 access, and your hit count and hit rate are 0 because your cache hasn’t had any cache hits.
- Hit Run a second time: your program will stop at instruction 0x44. It has just set array[1] = 0. Because this array element is in the cache, you should see a cache hit. Your program has now performed 2 accesses, your hit count is 1, and your hit rate is 0.5 because your cache just had a cache hit.
- If you keep hitting Run, you’ll see the number of accesses, hit count, and hit rate change according to the accesses being made. If you remove the breakpoint on 0x40 by clicking on it and then hit Run, you’ll see that the final accesses, hit count, and hit rate look something like the image below. Think about why this makes sense.
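If you want to double-check that reasoning, here is a tiny standalone sketch (not Venus; it hard-codes Scenario 1’s parameters and a single 8-byte block) that replays the same access pattern and computes the hit rate:

```c
// Replays Scenario 1: 32 ints accessed sequentially, twice, through one 8-byte block.
#include <stdio.h>

int main(void) {
    const int block_size = 8;          // bytes per cache block
    long cached_block = -1;            // which memory block the single cache slot holds (-1 = empty)
    long accesses = 0, hits = 0;

    for (int rep = 0; rep < 2; rep++) {                 // Rep Count = 2
        for (int index = 0; index < 32; index += 1) {   // 128-byte array = 32 ints, Step Size = 1
            long addr  = index * 4;                     // byte address of array[index]
            long block = addr / block_size;             // memory block this access falls in
            accesses++;
            if (block == cached_block)
                hits++;                                 // the neighboring int was just pulled in
            else
                cached_block = block;                   // miss: pull the block into the cache
        }
    }
    printf("hit rate: %.2f\n", (double)hits / accesses); // prints 0.50
    return 0;
}
```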
Scenario 2:
Program Parameters: Set these by initializing the registers in the code in the Editor tab.
- Array Size (a0): 128 (bytes)
- Step Size (a1): 27 (step by 27 elements of the array)
- Rep Count (a2): 2
- Option (a3): 0
Cache Parameters: Set these in the Cache section of the Simulator tab.
- Cache Levels: 1
- Block Size: 8
- Number of Blocks: 1
- Enable?: Should be green
- Placement Policy: Direct Mapped
- Associativity: 1
- Block Replacement Policy: LRU
Task:
- What combination of parameters is producing (i.e., explains) the hit rate you observe? Think about the sizes of each of the parameters.
- What happens to our hit rate if we increase the number of blocks and why?
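When you experiment with a larger number of blocks, it helps to work out which cache slot each access maps to. This small standalone snippet (not part of the lab; num_blocks = 4 is just a hypothetical value to play with) applies the standard direct-mapped mapping:

```c
// For each element Scenario 2 touches, print the memory block and the cache slot it maps to.
#include <stdio.h>

int main(void) {
    int block_size = 8;                             // bytes per block, as configured in Venus
    int num_blocks = 4;                             // hypothetical; Scenario 2 starts with 1
    for (int index = 0; index < 32; index += 27) {  // same elements as Scenario 2 (32 ints, step 27)
        long addr  = index * 4;                     // byte address of array[index]
        long block = addr / block_size;             // which memory block is touched
        long slot  = block % num_blocks;            // which cache slot it maps to (direct mapped)
        printf("array[%d] -> memory block %ld -> cache slot %ld\n", index, block, slot);
    }
    return 0;
}
```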
Scenario 3:
Program Parameters: Set these by initializing the registers in the code in the Editor tab.
- Array Size (a0): 256 (bytes)
- Step Size (a1): 2
- Rep Count (a2): 2
- Option (a3): 1
Cache Parameters: Set these in the Cache section of the Simulator tab.
- Cache Levels: 1
- Enable?: Should be green
- Placement Policy: Direct Mapped
- Associativity: 1
- Block Replacement Policy: LRU
Task:
- Choose a number of blocks greater than 1 and determine the smallest block size that uses every block and maximizes the hit rate, given the parameters above. Explain why.
(N.B.: Venus will throw an error if your number of blocks or your block size is not a power of 2.)
Exercise 3: Writing Cache Friendly Code
Hopefully, you realized in the last exercise that we can maximize our hit rate by making the cache large enough to hold all of the contents of the array we’re accessing. In practice, it would be nice if we could store all of the contents of the memory that we’re accessing in the fastest possible storage, but like many things in life, we’re often limited by financial resources (remember the memory hierarchy from lecture?). You can check how much space in bytes your machine is allocating for caching by typing getconf -a | grep CACHE or lscpu in your course container.
By understanding how caches work, you can optimize a program’s memory access patterns to obtain good performance from the memory hierarchy. As a concrete example of such optimization, let’s analyze the performance of various matrix multiplication functions. Matrix multiplication operations are at the heart of many linear algebra algorithms, and efficient matrix multiplication is critical for many applications within the applied sciences and deep learning fields (for example, much of what TensorFlow gets up to during model training and inference under the hood turns into matrix multiplications).
If you recall, matrices are 2-dimensional data structures wherein each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops, assuming that matrices A, B, and C are all n-by-n and stored in one-dimensional column-major arrays (see below for an explanation of what these terms mean):
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
Note: In this exercise, each 2D matrix is flattened into a one-dimensional array. For example, a 5x5 matrix becomes a 25-element 1D array. In a row-major layout, each row is laid out consecutively in memory: where you would access the 2nd row and 3rd column of a 5x5 matrix called A using A[1][2] (since it’s 0-indexed), you’d access the equivalent entry of the 1D array called B with B[1*5 + 2]. The matrices in this exercise use the column-major layout instead, in which each column is consecutive (see the sketch below).
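To make the two layouts concrete, here is a tiny illustration (not from matrixMultiply.c) of where the same logical element lands in a flattened array under each convention:

```c
// The element in row 1, column 2 of a 5x5 matrix, under the two common flattenings.
#include <stdio.h>

int main(void) {
    int n = 5, i = 1, j = 2;                 // 2nd row, 3rd column, 0-indexed (as in the note above)
    int row_major_index    = i * n + j;      // rows are consecutive in memory: 1*5 + 2 = 7
    int column_major_index = i + j * n;      // columns are consecutive: the layout the loop above uses
    printf("row-major: %d, column-major: %d\n", row_major_index, column_major_index);
    return 0;
}
```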
In the above code, note that the loops are ordered i, j, k. If we examine the innermost loop (the one that increments k), we see that during one full pass it…
- moves through a single column of B one element at a time (because the matrices are column-major, these are consecutive memory locations)
- moves through a single row of A, jumping n elements in memory between accesses
- accesses the same one entry of C on every iteration
Check out matrixMultiply.c. You’ll notice that the file contains multiple implementations of matrix multiply with three nested loops. You can compile and run matrixMultiply.c by running:
$ make matrixMultiply
$ ./matrixMultiply
While each of these implementations will correctly calculate the matrix multiplication, the order in which we choose to access the elements of the matrices can have a large impact on performance. From the previous exercises, you should realize that caches perform better (more cache hits, fewer cache misses) when memory accesses take advantage of temporal locality (recently accessed blocks are accessed again) and spatial locality (nearby blocks are accessed soon), utilizing blocks already contained within our cache.
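As a standalone illustration of that principle (separate from matrixMultiply.c), the two loops below touch exactly the same elements, but the first walks memory consecutively while the second jumps a whole row between accesses:

```c
// Same work, different access order: consecutive traversal vs. strided traversal.
#include <stdio.h>
#define N 1024

static int m[N * N];                     // an N-by-N matrix flattened into one array

int main(void) {
    long sum = 0;
    // Consecutive accesses: neighboring elements share cache blocks (spatial locality).
    for (int a = 0; a < N; a++)
        for (int b = 0; b < N; b++)
            sum += m[a * N + b];
    // Strided accesses: each access is N elements away from the last, so a freshly
    // loaded block is often evicted before its neighbors are ever used.
    for (int a = 0; a < N; a++)
        for (int b = 0; b < N; b++)
            sum += m[b * N + a];
    printf("%ld\n", sum);
    return 0;
}
```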
Task: Run the program a couple of times, order the functions from fastest to slowest, and explain why each function’s ranking makes sense using your understanding of how the cache works. Some functions might have similar runtimes. If this is the case, explain why.
Note: The unit “Gflops/s” reads, “Giga-floating-point-operations per second.” The bigger the number, the faster the program is running.
Handin instructions
Turn in your code by pushing your git repository to github.com/csci0300/cs300-s22-labs-YOURNAME.git.
Then, head to the grading server. On the “Labs” page, use the “Lab 4 checkoff” button to check off your lab.
Note: For this lab, you will only receive partial credit from the checkoff button, so don’t panic if it says that you received a partial checkoff (everybody does).
The partial credit corresponds to Exercise 1. We will grade the remainder of the lab manually; TAs will update your grade accordingly. Make sure your answers to the conceptual questions are in answers.md.
Acknowledgements: This lab is based on exercises in Harvard’s CS61 and UC Berkeley’s CS61c courses, which we have reused with permission.