perf-c2c(1)
===========

NAME
----
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
--------
[verse]
'perf c2c record' [<options>] <command>
'perf c2c record' [<options>] -- [<record command options>] <command>
'perf c2c report' [<options>]

DESCRIPTION
-----------
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
you to track down cacheline contention.

The tool is based on x86's load latency and precise store facility events
provided by Intel CPUs. These events provide:
  - memory address of the access
  - type of the access (load and store details)
  - latency (in cycles) of the load access

The c2c tool provides means to record this data and report back access details
for the cachelines with the highest contention, that is, the highest number of
HITM accesses.

The basic workflow with this tool follows the standard record/report phases.
Use the record command to record event data and the report command to
display it.


RECORD OPTIONS
--------------
-e::
--event=::
        Select the PMU event. Use 'perf mem record -e list'
        to list available events.

-v::
--verbose::
        Be more verbose (show counter open errors, etc).

-l::
--ldlat::
        Configure mem-loads latency.

-k::
--all-kernel::
        Configure all used events to run in kernel space.

-u::
--all-user::
        Configure all used events to run in user space.

REPORT OPTIONS
--------------
-k::
--vmlinux=<file>::
        vmlinux pathname

-v::
--verbose::
        Be more verbose (show counter open errors, etc).

-i::
--input::
        Specify the input file to process.

-N::
--node-info::
        Show extra node info in report (see NODE INFO section).

-c::
--coalesce::
        Specify sorting fields for single cacheline display.
        The following fields are available: tid,pid,iaddr,dso
        (see COALESCE).

-g::
--call-graph::
        Set up callchain parameters.
        Please refer to the perf-report man page for details.

--stdio::
        Force the stdio output (see STDIO OUTPUT).

--stats::
        Display only statistics tables and force stdio mode.

--full-symbols::
        Display full length of symbols.

--no-source::
        Do not display Source:Line column.

--show-all::
        Show all captured HITM lines, regardless of the HITM % 0.0005 limit.

-f::
--force::
        Don't do ownership validation.

-d::
--display::
        Select the HITM type (rmt, lcl) to display and sort on. Total HITMs
        is the default.

C2C RECORD
----------
The perf c2c record command sets up options related to HITM cacheline analysis
and calls the standard perf record command.

The following perf record options are configured by default
(check the perf record man page for details):

  -W,-d,--sample-cpu

Unless specified otherwise with the '-e' option, the following events are
monitored by default:

  cpu/mem-loads,ldlat=30/P
  cpu/mem-stores/P

Any 'perf record' option can be passed after the '--' mark, for example to
enable callchains and system wide monitoring:

  $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.

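As a rough illustration of how the defaults above combine, the following Python sketch assembles the 'perf record' command line that a 'perf c2c record' invocation approximately corresponds to. The helper name `c2c_record_argv` and its parameters are hypothetical, and the option order used by the real tool may differ; the sketch only builds a list and runs nothing.

```python
# Sketch only: approximates the 'perf record' command line implied by the
# defaults documented for 'perf c2c record'. The helper name and parameters
# are hypothetical; the real tool may add or order options differently.
def c2c_record_argv(command, ldlat=30, events=None, extra_record_opts=()):
    """Return an argv list; nothing is executed."""
    # Default perf record options listed in the C2C RECORD section.
    argv = ["perf", "record", "-W", "-d", "--sample-cpu"]
    if events is None:
        # Default events, used unless '-e' selects something else.
        events = [f"cpu/mem-loads,ldlat={ldlat}/P", "cpu/mem-stores/P"]
    for event in events:
        argv += ["-e", event]
    # Options given after '--' are handed to perf record.
    argv += list(extra_record_opts)
    argv += list(command)
    return argv


if __name__ == "__main__":
    print(" ".join(c2c_record_argv(["sleep", "5"],
                                   extra_record_opts=["-g", "-a"])))
```

Raising `ldlat` in this sketch only changes the threshold embedded in the mem-loads event string, which is how the '-l' option is assumed to take effect here.
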
C2C REPORT
----------
The perf c2c report command displays shared data analysis. It comes in two
display modes: stdio and tui (default).

The report command workflow is as follows:
  - sort all the data based on the cacheline address
  - store access details for each cacheline
  - sort all cachelines based on user settings
  - display data

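The workflow above can be pictured with a small Python sketch. It is not the tool's implementation: the sample dictionaries, the 64-byte cacheline size and the `tot` display name are assumptions made for illustration, while `rmt` and `lcl` mirror the values accepted by '--display'.

```python
from collections import defaultdict

CACHELINE_SIZE = 64  # assumption: 64-byte cachelines


def build_report(samples, display="tot"):
    """Group samples per cacheline, then sort cachelines by HITM count.

    Each sample is a dict with an 'addr' key and optional 'rmt_hitm' and
    'lcl_hitm' counts (a hypothetical layout, not perf's data format).
    """
    lines = defaultdict(lambda: {"records": 0, "rmt": 0, "lcl": 0,
                                 "offsets": defaultdict(int)})
    for sample in samples:
        # Step 1: derive the cacheline address from the data address.
        base = sample["addr"] & ~(CACHELINE_SIZE - 1)
        # Step 2: store access details for that cacheline.
        entry = lines[base]
        entry["records"] += 1
        entry["rmt"] += sample.get("rmt_hitm", 0)
        entry["lcl"] += sample.get("lcl_hitm", 0)
        entry["offsets"][sample["addr"] - base] += 1

    # Step 3: sort cachelines on the selected HITM type.
    def sort_key(item):
        entry = item[1]
        return {"rmt": entry["rmt"], "lcl": entry["lcl"],
                "tot": entry["rmt"] + entry["lcl"]}[display]

    return sorted(lines.items(), key=sort_key, reverse=True)


if __name__ == "__main__":
    samples = [
        {"addr": 0x1000, "rmt_hitm": 1},
        {"addr": 0x1008, "lcl_hitm": 1},
        {"addr": 0x2010, "rmt_hitm": 1},
        {"addr": 0x2030},
        {"addr": 0x1038, "rmt_hitm": 1},
    ]
    # Step 4: display data.
    for index, (base, entry) in enumerate(build_report(samples)):
        print(index, hex(base), entry["records"], entry["rmt"], entry["lcl"])
```
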
In general, perf c2c report output consists of 2 basic views:
  1) most expensive cachelines list
  2) offsets details for each cacheline

For each cacheline in the 1) list we display the following data
(both stdio and TUI modes output the same fields):

  Index
  - zero based index to identify the cacheline

  Cacheline
  - cacheline address (hex number)

  Total records
  - sum of all cacheline accesses

  Rmt/Lcl Hitm
  - cacheline percentage of all Remote/Local HITM accesses

  LLC Load Hitm - Total, Lcl, Rmt
  - count of Total/Local/Remote load HITMs

  Store Reference - Total, L1Hit, L1Miss
    Total - all store accesses
    L1Hit - store accesses that hit L1
    L1Miss - store accesses that missed L1

  Load Dram
  - count of local and remote DRAM accesses

  LLC Ld Miss
  - count of all accesses that missed LLC

  Total Loads
  - sum of all load accesses

  Core Load Hit - FB, L1, L2
  - count of load hits in FB (Fill Buffer), L1 and L2 cache

  LLC Load Hit - Llc, Rmt
  - count of LLC and Remote load hits

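To make the 'Rmt/Lcl Hitm' column concrete, the following Python sketch computes each cacheline's share of all remote and local HITM accesses. The input layout and the helper name `hitm_share` are hypothetical, and the rounding applied by the real report is not reproduced.

```python
def hitm_share(per_line_counts):
    """Map {cacheline: (rmt_hitm, lcl_hitm)} to {cacheline: (rmt_pct, lcl_pct)}.

    Each percentage is the cacheline's share of the HITM accesses of that
    type summed over all cachelines.
    """
    total_rmt = sum(rmt for rmt, _ in per_line_counts.values())
    total_lcl = sum(lcl for _, lcl in per_line_counts.values())

    def pct(count, total):
        # Avoid dividing by zero when no HITM of this type was sampled.
        return 100.0 * count / total if total else 0.0

    return {line: (pct(rmt, total_rmt), pct(lcl, total_lcl))
            for line, (rmt, lcl) in per_line_counts.items()}


if __name__ == "__main__":
    shares = hitm_share({0x1000: (3, 1), 0x2000: (1, 3)})
    for line, (rmt_pct, lcl_pct) in shares.items():
        print(hex(line), rmt_pct, lcl_pct)
```
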
For each offset in the 2) list we display the following data:

  HITM - Rmt, Lcl
  - % of Remote/Local HITM accesses for given offset within cacheline

  Store Refs - L1 Hit, L1 Miss
  - % of store accesses that hit/missed L1 for given offset within cacheline

  Data address - Offset
  - offset address

  Pid
  - pid of the process responsible for the accesses

  Tid
  - tid of the thread responsible for the accesses

  Code address
  - code address responsible for the accesses

  cycles - rmt hitm, lcl hitm, load
  - sum of cycles for given accesses - Remote/Local HITM and generic load

  cpu cnt
  - number of cpus that participated in the access

  Symbol
  - code symbol related to the 'Code address' value

  Shared Object
  - shared object name related to the 'Code address' value

  Source:Line
  - source information related to the 'Code address' value

  Node
  - nodes participating in the access (see NODE INFO section)

NODE INFO
---------
The 'Node' field displays the nodes that access the given cacheline
offset. Its output comes in 3 flavors:
  - node IDs separated by ','
  - node IDs with stats for each ID, in the following format:
      Node{cpus %hitms %stores}
  - node IDs with a list of affected CPUs, in the following format:
      Node{cpu list}

You can switch between the above flavors with the -N option, or
use the 'n' key to switch interactively in TUI mode.

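The three flavors can be illustrated with a short Python formatter. This is a sketch under assumptions, not the tool's output code: the input layout, the flavor numbering, the separators between nodes, the number formatting, and the choice to express the percentages relative to the offset's own totals are all hypothetical.

```python
def format_nodes(nodes, flavor=0):
    """Render per-node data in one of the three 'Node' field flavors.

    'nodes' maps a node ID to {'cpus': [...], 'hitms': int, 'stores': int}
    (a hypothetical layout). Flavor 0 lists node IDs, flavor 1 adds stats,
    flavor 2 adds the CPU list.
    """
    total_hitms = sum(n["hitms"] for n in nodes.values())
    total_stores = sum(n["stores"] for n in nodes.values())

    def pct(value, total):
        return 100.0 * value / total if total else 0.0

    parts = []
    for node_id in sorted(nodes):
        node = nodes[node_id]
        if flavor == 0:
            parts.append(str(node_id))
        elif flavor == 1:
            cpus = len(node["cpus"])
            hitms = pct(node["hitms"], total_hitms)
            stores = pct(node["stores"], total_stores)
            parts.append(f"{node_id}{{{cpus} {hitms:.1f}% {stores:.1f}%}}")
        else:
            cpu_list = ",".join(str(cpu) for cpu in sorted(node["cpus"]))
            parts.append(f"{node_id}{{{cpu_list}}}")
    return ",".join(parts) if flavor == 0 else " ".join(parts)


if __name__ == "__main__":
    nodes = {0: {"cpus": [0, 2], "hitms": 3, "stores": 1},
             1: {"cpus": [9], "hitms": 1, "stores": 3}}
    for flavor in range(3):
        print(format_nodes(nodes, flavor))
```
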
COALESCE
--------
You can specify how to sort offsets for a cacheline.

The following fields are available and govern the final
set of output fields for the cacheline offsets output:

  tid   - coalesced by process TIDs
  pid   - coalesced by process PIDs
  iaddr - coalesced by code address, following fields are displayed:
             Code address, Code symbol, Shared Object, Source line
  dso   - coalesced by shared object

By default the coalescing is set up with 'pid,iaddr'.

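One way to picture coalescing is as grouping the accesses of a cacheline by offset plus the selected fields. The Python sketch below assumes a hypothetical record layout and helper name; it only counts accesses per group and does not reproduce the columns of the real report.

```python
from collections import Counter

VALID_FIELDS = ("tid", "pid", "iaddr", "dso")


def coalesce(records, fields=("pid", "iaddr")):
    """Count accesses per (offset, *fields) group within one cacheline.

    Each record is a dict holding 'offset' and the coalescing fields
    (a hypothetical layout used only for this illustration).
    """
    for field in fields:
        if field not in VALID_FIELDS:
            raise ValueError(f"unknown coalesce field: {field}")
    merged = Counter()
    for record in records:
        key = (record["offset"],) + tuple(record[f] for f in fields)
        merged[key] += 1
    return merged


if __name__ == "__main__":
    records = [
        {"offset": 0, "pid": 100, "tid": 101, "iaddr": 0x400a10, "dso": "app"},
        {"offset": 0, "pid": 100, "tid": 102, "iaddr": 0x400a10, "dso": "app"},
        {"offset": 8, "pid": 100, "tid": 101, "iaddr": 0x400b20, "dso": "app"},
    ]
    # Default 'pid,iaddr': the two threads hitting offset 0 merge into one row.
    print(coalesce(records))
    # Coalescing by 'tid' keeps them as separate rows.
    print(coalesce(records, fields=("tid",)))
```
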
STDIO OUTPUT
------------
The stdio output displays data on standard output.

Following tables are displayed:
  Trace Event Information
  - overall statistics of memory accesses

  Global Shared Cache Line Event Information
  - overall statistics on shared cachelines

  Shared Data Cache Line Table
  - list of most expensive cachelines

  Shared Cache Line Distribution Pareto
  - list of all accessed offsets for each cacheline

TUI OUTPUT
----------
The TUI output provides an interactive interface to navigate
through the cachelines list and to display offset details.

For details please refer to the help window by pressing the '?' key.

CREDITS
-------
Although Don Zickus, Dick Fowles and Joe Mario worked together
to get this implemented, we got lots of early help from Arnaldo
Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
--------
Check Joe's blog on c2c tool for detailed use case explanation:
  https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1]