One can use the PPRO hardware counters to determine things like cache hit rates, the percentage of time spent in each level of the caches, MBytes/sec achieved from the bus, and the amount of time the cpu is idle. There are many events that we could talk about; we focus on three of the more significant ones:

    0x43  PP_DATA_MEM_REFS
    0x24  PP_L2_LINES_IN
    0x45  PP_DCU_LINES_IN

The claim is that one can obtain a general feel for the cache-level and memory transactions of a given code simply by monitoring these three counters. Unfortunately, one cannot get an exact picture, and we describe the caveats and assumptions we use below. Before demonstrating the power of these three events, we briefly describe each of them.

PP_DATA_MEM_REFS (or 0x43) reports the number of loads/stores from/to a memory location. This includes every memory reference, both cacheable and noncacheable. Because the caches are write-allocate write-back, writing to a single data location that is not already cached requires reading the entire cache line from memory. Because PP_DATA_MEM_REFS counts data elements, not cache lines, it may represent a gross underestimate of the data moved. The L1 cache can theoretically accommodate one double load (64 bits), one double store, and a quad instruction data load per cycle; in practice, however, most applications issue only one double load/store per cycle. Typically many more loads than stores are performed, and where stores are made, we are usually storing elements of an array, which means that all elements of a destination cache line are used.

PP_L2_LINES_IN (or 0x24) reports how many 32-byte L2 cache lines were allocated. It does not indicate how many references were made per cache line; the assumption that many of the formulas below make is that every element of a given cache line is referenced the same number of times.

PP_DCU_LINES_IN (or 0x45) is the total number of cache lines allocated in the L1 data cache.
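To make the underestimate concrete, the following is a small arithmetic sketch. The function name and the sample numbers are illustrative assumptions, not part of any counter library; only the 32-byte line size comes from the text above.

```python
LINE_BYTES = 32  # cache line size stated in the text


def bytes_referenced_vs_moved(num_refs, elem_bytes, lines_in):
    """Return (bytes named by the references, bytes moved as whole cache lines)."""
    referenced = num_refs * elem_bytes
    moved = lines_in * LINE_BYTES
    return referenced, moved


# Hypothetical strided loop: 1000 double loads, each landing on a new cache line.
referenced, moved = bytes_referenced_vs_moved(1000, 8, 1000)
print(referenced, moved)  # 8000 32000
```

In this made-up case the reference count accounts for only a quarter of the bytes that actually crossed into the cache.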
Our first example of a useful application of the counters is the calculation of aggregate memory bandwidth, which uses the bus event PP_BUS_TRAN_MEM:

    MillionBytes/sec = 32 * PP_BUS_TRAN_MEM / 1.E6 / WALL_CLOCK_TIME

We would now like to look at the fractions of hits from L1, L2 and Memory. Unfortunately, we don't know how many references are made per cache line. All we know is the number of references and various counts of cache lines. One could imagine a situation where every data reference was to an element already in L1_data, or conversely, a situation where every data reference was to an element not in L2. In either of these situations, we can monitor the events involving cache lines, but they do not show the big picture. We must therefore make the simplifying assumption that each element of every cache line referenced is referenced the same number of times. Now one can measure the fraction of data from Memory as:

            MIN(PP_L2_LINES_IN * N, PP_DATA_MEM_REFS)
    FracM = -----------------------------------------
                       PP_DATA_MEM_REFS

Here, N is the number of elements per 32-byte cache line: 4 for doubles and 8 for singles. The fraction of data from L2 and L1 together is taken as (1 - FracM).

We now estimate the combined number of L2 and L1 hits, using the same simplifying assumption we made to estimate the fraction of data from memory:

    NumberL2L1 = PP_DATA_MEM_REFS - PP_L2_LINES_IN * N

But what part of these combined hits is served by L2, and what part by L1? If we had this number, we could calculate the fractions from L2 and L1 separately. One way of getting a handle on the L2 hit rate is to look at PP_DCU_LINES_IN and PP_L2_LINES_IN. The first gives the number of cache lines brought into L1 from either L2 or Memory; the second gives the number of cache lines brought in from Memory. By subtracting them, we can get a feel for the total number of L2 hits in terms of cache lines, which can then be converted to the number of L2 hits in terms of references:
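The bandwidth, FracM and NumberL2L1 formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample counter values are assumptions, and clamping NumberL2L1 at zero (to stay consistent with the MIN in FracM) is an addition that the text does not state.

```python
LINE_BYTES = 32  # bytes per memory transaction, as in the bandwidth formula


def bandwidth_mbytes_per_sec(bus_tran_mem, wall_clock_seconds):
    """MillionBytes/sec = 32 * PP_BUS_TRAN_MEM / 1.E6 / WALL_CLOCK_TIME."""
    return LINE_BYTES * bus_tran_mem / 1.0e6 / wall_clock_seconds


def frac_from_memory(l2_lines_in, data_mem_refs, n):
    """FracM, with n = 4 for doubles and n = 8 for singles."""
    return min(l2_lines_in * n, data_mem_refs) / data_mem_refs


def number_l2_l1(l2_lines_in, data_mem_refs, n):
    """Estimated references served by L1 or L2.

    The clamp at zero is an assumption added here, not part of the original formula.
    """
    return max(data_mem_refs - l2_lines_in * n, 0)


# Hypothetical counter readings for a code working on doubles (n = 4).
print(bandwidth_mbytes_per_sec(2_000_000, 0.5))    # 128.0
print(frac_from_memory(25_000, 1_000_000, 4))      # 0.1
print(number_l2_l1(25_000, 1_000_000, 4))          # 900000
```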
    NumberL2hits = MIN( ABS(PP_DCU_LINES_IN - PP_L2_LINES_IN) * N, PP_DATA_MEM_REFS )

Although we expect PP_DCU_LINES_IN >= PP_L2_LINES_IN, the two may be counted in different iterations, so it is possible for PP_L2_LINES_IN to come out marginally larger. We throw in the ABS(*) to guard against this situation. The corresponding L2 hit fraction (out of the cache accesses) is:

            NumberL2hits
    L2hit = ------------
             NumberL2L1

Once L2hit is computed, the fraction of data from L2 is:

    FractionL2 = L2hit * (1.0 - FracM)

and the fraction from L1 is whatever remains:

    FractionL1 = 1.0 - FracM - FractionL2

We can also give a crude estimate of instruction cache overheads. This requires two counters: PP_CPU_CLK_UNHALTED and PP_L2_IFETCH. The former can be monitored with event 0 and the latter with event 1, so one can get a somewhat accurate picture in a single run, without sampling. PP_CPU_CLK_UNHALTED gives the total number of cpu cycles and PP_L2_IFETCH gives the number of instruction fetches from L2. Under most circumstances, each such instruction fetch forces the loss of one cycle, so the ratio PP_L2_IFETCH / PP_CPU_CLK_UNHALTED gives an estimate of the instruction cache overhead.

We summarize our assumptions as follows:

    (1) There are more loads of a single precision class (4-byte or 8-byte elements) than stores or loads of any other type.
    (2) Each element of almost every cache line referenced is referenced an equal number of times.
    (3) Since we can only monitor two events at once, we assume that across the different iterations in which the counters are read, the cache is in a nearly identical state and the run is nearly identical.

Given these assumptions, we note that our libraries provide only crude estimates.
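As a rough illustration of how the pieces fit together, the sketch below combines the FracM, NumberL2hits, L2hit, FractionL2 and FractionL1 formulas, plus the instruction fetch ratio. The function names and sample counter values are hypothetical, and two guards are assumptions added here rather than part of the formulas: the case where no references are left for the caches, and capping L2hit at 1.0.

```python
def cache_fractions(data_mem_refs, dcu_lines_in, l2_lines_in, n):
    """Return (FracM, FractionL2, FractionL1) under the equal-reference assumption.

    n is the number of elements per 32-byte line: 4 for doubles, 8 for singles.
    """
    frac_m = min(l2_lines_in * n, data_mem_refs) / data_mem_refs
    number_l2_l1 = data_mem_refs - l2_lines_in * n
    if number_l2_l1 <= 0:
        # Assumed handling: every reference is attributed to memory.
        return frac_m, 0.0, 0.0
    # ABS guards against the counters having been read in different iterations.
    number_l2_hits = min(abs(dcu_lines_in - l2_lines_in) * n, data_mem_refs)
    # Capping at 1.0 is an added assumption so FractionL1 cannot go negative.
    l2_hit = min(number_l2_hits / number_l2_l1, 1.0)
    fraction_l2 = l2_hit * (1.0 - frac_m)
    fraction_l1 = 1.0 - frac_m - fraction_l2
    return frac_m, fraction_l2, fraction_l1


def ifetch_overhead(l2_ifetch, cpu_clk_unhalted):
    """Crude instruction cache overhead: one lost cycle per L2 instruction fetch."""
    return l2_ifetch / cpu_clk_unhalted


# Hypothetical readings for a code working on doubles (n = 4).
frac_m, frac_l2, frac_l1 = cache_fractions(1_000_000, 75_000, 25_000, 4)
print(round(frac_m, 3), round(frac_l2, 3), round(frac_l1, 3))
print(ifetch_overhead(5_000, 1_000_000))
```

With these made-up readings, roughly 10% of references are attributed to memory, 20% to L2 and 70% to L1. As the text stresses, such numbers are only crude estimates.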