.. index:: single: Papi

.. _PAPI:
.. _mdoc_papi:

========
PAPI
========

PAPI is a multi-platform library for portably accessing hardware counters for
event profiling of software, including flop counts, cache efficiency, and
branch prediction rates. See the `PAPI website `_ for more information.

Usage
=====

The user must add PAPI function calls to their code and link against the PAPI
library in order to access the hardware counters. If the code is already
instrumented with timing calls, PAPI calls can often be added at the same
locations; otherwise, the user will need to identify the subset of the code to
be profiled. An example program that uses PAPI to identify poor cache
performance is given below.

Available Counters
------------------

The command :command:`papi_avail` reports which PAPI counters are accessible
on the current system. Some counters are supported natively, and others are
derived from natively supported counters. PAPI also supports multiplexing,
where a larger number of events can be profiled simultaneously using a
sampling technique. See the PAPI documentation for more details.

.. note::

   The number of counters that can be active at once is much smaller than the
   number of counters available on the system. Sandy Bridge nodes have 11
   registers that can be used for hardware counters, but PAPI requires a few
   of those registers for internal functions. In practice, about 4 independent
   PAPI events can be instrumented at one time, and valid combinations of
   events must be found by trial and error.

.. code:: bash

   $ module load papi/5.1
   $ papi_avail -a
   Available events and hardware information.
   --------------------------------------------------------------------------------
   PAPI Version             : 5.1.0.2
   Vendor string and code   : GenuineIntel (1)
   Model string and code    : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (45)
   CPU Revision             : 7.000000
   CPUID Info               : Family: 6 Model: 45 Stepping: 7
   CPU Max Megahertz        : 2599
   CPU Min Megahertz        : 2599
   Hdw Threads per core     : 2
   Cores per Socket         : 8
   NUMA Nodes               : 2
   CPUs per Node            : 16
   Total CPUs               : 32
   Running in a VM          : no
   Number Hardware Counters : 11
   Max Multiplex Counters   : 64
   --------------------------------------------------------------------------------

   Name          Code        Deriv  Description (Note)
   PAPI_L1_DCM   0x80000000  No     Level 1 data cache misses
   PAPI_L1_ICM   0x80000001  No     Level 1 instruction cache misses
   PAPI_L2_DCM   0x80000002  Yes    Level 2 data cache misses
   PAPI_L2_ICM   0x80000003  No     Level 2 instruction cache misses
   PAPI_L1_TCM   0x80000006  Yes    Level 1 cache misses
   PAPI_L2_TCM   0x80000007  No     Level 2 cache misses
   PAPI_L3_TCM   0x80000008  No     Level 3 cache misses
   PAPI_TLB_DM   0x80000014  Yes    Data translation lookaside buffer misses
   PAPI_TLB_IM   0x80000015  No     Instruction translation lookaside buffer misses
   PAPI_L1_LDM   0x80000017  No     Level 1 load misses
   PAPI_L1_STM   0x80000018  No     Level 1 store misses
   PAPI_L2_STM   0x8000001a  No     Level 2 store misses
   PAPI_STL_ICY  0x80000025  No     Cycles with no instruction issue
   PAPI_BR_UCN   0x8000002a  Yes    Unconditional branch instructions
   PAPI_BR_CN    0x8000002b  No     Conditional branch instructions
   PAPI_BR_TKN   0x8000002c  Yes    Conditional branch instructions taken
   PAPI_BR_NTK   0x8000002d  No     Conditional branch instructions not taken
   PAPI_BR_MSP   0x8000002e  No     Conditional branch instructions mispredicted
   PAPI_BR_PRC   0x8000002f  Yes    Conditional branch instructions correctly predicted
   PAPI_TOT_INS  0x80000032  No     Instructions completed
   PAPI_FP_INS   0x80000034  Yes    Floating point instructions
   PAPI_LD_INS   0x80000035  No     Load instructions
   PAPI_SR_INS   0x80000036  No     Store instructions
   PAPI_BR_INS   0x80000037  No     Branch instructions
   PAPI_TOT_CYC  0x8000003b  No     Total cycles
   PAPI_L2_DCH   0x8000003f  Yes    Level 2 data cache hits
   PAPI_L2_DCA   0x80000041  No     Level 2 data cache accesses
   PAPI_L3_DCA   0x80000042  Yes    Level 3 data cache accesses
   PAPI_L2_DCR   0x80000044  No     Level 2 data cache reads
   PAPI_L3_DCR   0x80000045  No     Level 3 data cache reads
   PAPI_L2_DCW   0x80000047  No     Level 2 data cache writes
   PAPI_L3_DCW   0x80000048  No     Level 3 data cache writes
   PAPI_L2_ICH   0x8000004a  No     Level 2 instruction cache hits
   PAPI_L2_ICA   0x8000004d  No     Level 2 instruction cache accesses
   PAPI_L3_ICA   0x8000004e  No     Level 3 instruction cache accesses
   PAPI_L2_ICR   0x80000050  No     Level 2 instruction cache reads
   PAPI_L3_ICR   0x80000051  No     Level 3 instruction cache reads
   PAPI_L2_TCA   0x80000059  Yes    Level 2 total cache accesses
   PAPI_L3_TCA   0x8000005a  No     Level 3 total cache accesses
   PAPI_L2_TCR   0x8000005c  Yes    Level 2 total cache reads
   PAPI_L3_TCR   0x8000005d  Yes    Level 3 total cache reads
   PAPI_L2_TCW   0x8000005f  No     Level 2 total cache writes
   PAPI_L3_TCW   0x80000060  No     Level 3 total cache writes
   PAPI_FDV_INS  0x80000063  No     Floating point divide instructions
   PAPI_FP_OPS   0x80000066  Yes    Floating point operations
   PAPI_SP_OPS   0x80000067  Yes    Floating point operations; optimized to count scaled single precision vector operations
   PAPI_DP_OPS   0x80000068  Yes    Floating point operations; optimized to count scaled double precision vector operations
   PAPI_VEC_SP   0x80000069  Yes    Single precision vector/SIMD instructions
   PAPI_VEC_DP   0x8000006a  Yes    Double precision vector/SIMD instructions
   PAPI_REF_CYC  0x8000006b  No     Reference clock cycles
   -------------------------------------------------------------------------
   Of 50 available events, 17 are derived.

   avail.c PASSED

Example
=======

Download :download:`matrixmult_papi.c` for an example using PAPI to measure
the L2 cache miss rate of a poorly written matrix multiplication program::

    $ module load papi/5.1
    $ gcc -O2 matrixmult_papi.c -lpapi
    $ ./a.out
    322761027 L2 cache misses (0.744% misses) in 5740137120 cycles

The precise output will vary with the system load. Reversing the order of the
inner two loops should produce a significant improvement in cache efficiency
and a corresponding speedup.