Diffstat (limited to 'Documentation/admin-guide/perf')
 Documentation/admin-guide/perf/index.rst                                    |   3 +-
 Documentation/admin-guide/perf/{nvidia-pmu.rst => nvidia-tegra241-pmu.rst}  |   8 +-
 Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst                      | 522 +++++++
 3 files changed, 528 insertions(+), 5 deletions(-)
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
index 47d9a3df6329..aa12708ddb96 100644
--- a/Documentation/admin-guide/perf/index.rst
+++ b/Documentation/admin-guide/perf/index.rst
@@ -24,7 +24,8 @@ Performance monitor support
thunderx2-pmu
alibaba_pmu
dwc_pcie_pmu
- nvidia-pmu
+ nvidia-tegra241-pmu
+ nvidia-tegra410-pmu
meson-ddr-pmu
cxl
ampere_cspmu
diff --git a/Documentation/admin-guide/perf/nvidia-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra241-pmu.rst
index f538ef67e0e8..fad5bc4cee6c 100644
--- a/Documentation/admin-guide/perf/nvidia-pmu.rst
+++ b/Documentation/admin-guide/perf/nvidia-tegra241-pmu.rst
@@ -1,8 +1,8 @@
-=========================================================
-NVIDIA Tegra SoC Uncore Performance Monitoring Unit (PMU)
-=========================================================
+============================================================
+NVIDIA Tegra241 SoC Uncore Performance Monitoring Unit (PMU)
+============================================================
-The NVIDIA Tegra SoC includes various system PMUs to measure key performance
+The NVIDIA Tegra241 SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:
* Scalable Coherency Fabric (SCF)
diff --git a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
new file mode 100644
index 000000000000..0656223b61d4
--- /dev/null
+++ b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
@@ -0,0 +1,522 @@
+============================================================
+NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU)
+============================================================
+
+The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
+metrics like memory bandwidth, latency, and utilization:
+
+* Unified Coherence Fabric (UCF)
+* PCIE
+* PCIE-TGT
+* CPU Memory (CMEM) Latency
+* NVLink-C2C
+* NV-CLink
+* NV-DLink
+
+PMU Driver
+----------
+
+The PMU driver describes the available events and configuration of each PMU in
+sysfs. Please see the sections below for the sysfs path of each PMU. Like other
+uncore PMU drivers, the driver provides a "cpumask" sysfs attribute to show the
+CPU id used to handle the PMU events. There is also an "associated_cpus" sysfs
+attribute, which contains the list of CPUs associated with the PMU instance.
+
+UCF PMU
+-------
+
+The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a
+distributed system-level cache (SLC), the last level of cache for CPU memory
+and CXL memory, and as a cache-coherent interconnect that supports hardware
+coherence across multiple coherently caching agents, including:
+
+ * CPU clusters
+ * GPU
+ * PCIe Ordering Controller Unit (OCU)
+ * Other IO-coherent requesters
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>.
+
+Some of the events available in this PMU can be used to measure bandwidth and
+utilization:
+
+ * slc_access_rd: count the number of read requests to SLC.
+ * slc_access_wr: count the number of write requests to SLC.
+ * slc_bytes_rd: count the number of bytes transferred by slc_access_rd.
+ * slc_bytes_wr: count the number of bytes transferred by slc_access_wr.
+ * mem_access_rd: count the number of read requests to local or remote memory.
+ * mem_access_wr: count the number of write requests to local or remote memory.
+ * mem_bytes_rd: count the number of bytes transferred by mem_access_rd.
+ * mem_bytes_wr: count the number of bytes transferred by mem_access_wr.
+ * cycles: count the UCF cycles.
+
+The average bandwidth is calculated as::
+
+ AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
+ AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS
+ AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS
+ AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS
+
+The average request rate is calculated as::
+
+ AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES
+ AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES
+ AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES
+ AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES
+
+More details about the other available events can be found in the Tegra410 SoC
+Technical Reference Manual.
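+As a quick worked example of the bandwidth formula, using hypothetical counter
+values (not taken from real hardware): 2 GB transferred over a 1-second
+(1e9 ns) window works out to 2 GB/s:

```shell
# Hypothetical values: SLC_BYTES_RD = 2000000000, ELAPSED_TIME_IN_NS = 1000000000
awk 'BEGIN { printf "%.1f GB/s\n", 2000000000 / 1000000000 }'
```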
+
+The events can be filtered based on source or destination. The source filter
+indicates the traffic initiator to the SLC, e.g. local CPU, non-CPU device, or
+remote socket. The destination filter specifies the destination memory type,
+e.g. local system memory (CMEM), local GPU memory (GMEM), or remote memory. The
+local/remote classification of the destination filter is based on the home
+socket of the address, not where the data actually resides. The available
+filters are described in
+/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/.
+
+The list of UCF PMU event filters:
+
+* Source filter:
+
+ * src_loc_cpu: if set, count events from local CPU
+ * src_loc_noncpu: if set, count events from local non-CPU device
+ * src_rem: if set, count events from CPU, GPU, PCIE devices of remote socket
+
+* Destination filter:
+
+ * dst_loc_cmem: if set, count events to local system memory (CMEM) address
+ * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
+ * dst_loc_other: if set, count events to local CXL memory address
+ * dst_rem: if set, count events to CPU, GPU, and CXL memory address of remote socket
+
+If the source is not specified, the PMU will count events from all sources. If
+the destination is not specified, the PMU will count events to all destinations.
+
+Example usage:
+
+* Count event id 0x0 in socket 0 from all sources and to all destinations::
+
+ perf stat -a -e nvidia_ucf_pmu_0/event=0x0/
+
+* Count event id 0x0 in socket 0 with source filter = local CPU and destination
+ filter = local system memory (CMEM)::
+
+ perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/
+
+* Count event id 0x0 in socket 1 with source filter = local non-CPU device and
+ destination filter = remote memory::
+
+ perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/
+
+PCIE PMU
+--------
+
+This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and
+the memory subsystem. It monitors all read/write traffic from the root port(s)
+or from a particular BDF in a PCIE RC to local or remote memory. There is one
+PMU per PCIE RC in the SoC. Each RC can have up to 16 lanes, which can be
+bifurcated into up to 8 root ports. The traffic from each root port can be
+filtered using the RP or BDF filters. For example, specifying "src_rp_mask=0xFF"
+makes the PMU counter capture traffic from all RPs. Please see below for more
+details.
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>.
+
+The events in this PMU can be used to measure bandwidth, utilization, and
+latency:
+
+ * rd_req: count the number of read requests by PCIE device.
+ * wr_req: count the number of write requests by PCIE device.
+ * rd_bytes: count the number of bytes transferred by rd_req.
+ * wr_bytes: count the number of bytes transferred by wr_req.
+ * rd_cum_outs: accumulate the number of outstanding rd_req each cycle.
+ * cycles: count the clock cycles of SOC fabric connected to the PCIE interface.
+
+The average bandwidth is calculated as::
+
+ AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
+ AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS
+
+The average request rate is calculated as::
+
+ AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
+ AVG_WR_REQUEST_RATE = WR_REQ / CYCLES
+
+The average latency is calculated as::
+
+ FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+ AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
+ AVG_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
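+The latency formula can be checked with hypothetical counter values (not from
+real hardware): 2e9 cycles over 1e9 ns gives a 2 GHz fabric clock, and 3e6
+accumulated outstanding cycles across 1e5 read requests gives 30 cycles, i.e.
+15 ns of average read latency:

```shell
# Hypothetical counters: CYCLES=2e9, ELAPSED_TIME_IN_NS=1e9,
# RD_CUM_OUTS=3e6, RD_REQ=1e5
awk 'BEGIN {
    freq_ghz   = 2000000000 / 1000000000   # FREQ_IN_GHZ = 2
    lat_cycles = 3000000 / 100000          # AVG_LATENCY_IN_CYCLES = 30
    printf "%.1f ns\n", lat_cycles / freq_ghz
}'
```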
+
+The PMU events can be filtered based on the traffic source and destination.
+The source filter indicates the PCIE devices that will be monitored. The
+destination filter specifies the destination memory type, e.g. local system
+memory (CMEM), local GPU memory (GMEM), or remote memory. The local/remote
+classification of the destination filter is based on the home socket of the
+address, not where the data actually resides. These filters can be found in
+/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>/format/.
+
+The list of event filters:
+
+* Source filter:
+
+ * src_rp_mask: bitmask of the root ports that will be monitored. Each bit in
+   this bitmask represents the RP index in the RC. If a bit is set, all devices
+   under the associated RP will be monitored. E.g. "src_rp_mask=0xF" will
+   monitor devices in root ports 0 to 3.
+ * src_bdf: the BDF that will be monitored. This is a 16-bit value that follows
+   the formula: (bus << 8) + (device << 3) + (function). For example, the value
+   of BDF 27:01.1 is 0x2709.
+ * src_bdf_en: enable the BDF filter. If this is set, the BDF filter value in
+ "src_bdf" is used to filter the traffic.
+
+ Note that the Root-Port and BDF filters are mutually exclusive, and the PMU in
+ each RC has only a single BDF filter shared by all of its counters. If the BDF
+ filter is enabled, the BDF filter value is applied to all events.
+
+* Destination filter:
+
+ * dst_loc_cmem: if set, count events to local system memory (CMEM) address
+ * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
+ * dst_loc_pcie_p2p: if set, count events to local PCIE peer address
+ * dst_loc_pcie_cxl: if set, count events to local CXL memory address
+ * dst_rem: if set, count events to remote memory address
+
+If the source filter is not specified, the PMU will count events from all root
+ports. If the destination filter is not specified, the PMU will count events
+to all destinations.
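+The 16-bit "src_bdf" encoding can be reproduced with shell arithmetic. The BDF
+below (1a:03.2) is a made-up value for illustration, not a device on a real
+system:

```shell
# src_bdf = (bus << 8) + (device << 3) + (function)
# for the hypothetical BDF 1a:03.2: bus=0x1a, device=0x03, function=0x2
printf '0x%04x\n' $(( (0x1a << 8) | (0x03 << 3) | 0x2 ))
```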
+
+Example usage:
+
+* Count event id 0x0 from root port 0 of PCIE RC-0 on socket 0 targeting all
+ destinations::
+
+ perf stat -a -e nvidia_pcie_pmu_0_rc_0/event=0x0,src_rp_mask=0x1/
+
+* Count event id 0x1 from root port 0 and 1 of PCIE RC-1 on socket 0 and
+ targeting just local CMEM of socket 0::
+
+ perf stat -a -e nvidia_pcie_pmu_0_rc_1/event=0x1,src_rp_mask=0x3,dst_loc_cmem=0x1/
+
+* Count event id 0x2 from root port 0 of PCIE RC-2 on socket 1 targeting all
+ destinations::
+
+ perf stat -a -e nvidia_pcie_pmu_1_rc_2/event=0x2,src_rp_mask=0x1/
+
+* Count event id 0x3 from root port 0 and 1 of PCIE RC-3 on socket 1 and
+ targeting just local CMEM of socket 1::
+
+ perf stat -a -e nvidia_pcie_pmu_1_rc_3/event=0x3,src_rp_mask=0x3,dst_loc_cmem=0x1/
+
+* Count event id 0x4 from BDF 01:01.0 of PCIE RC-4 on socket 0 targeting all
+ destinations::
+
+ perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0108,src_bdf_en=0x1/
+
+.. _NVIDIA_T410_PCIE_PMU_RC_Mapping_Section:
+
+Mapping the RC# to lspci segment number
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Mapping the RC# to the lspci segment number can be non-trivial; hence, an NVIDIA
+Designated Vendor-Specific Extended Capability (DVSEC) register is added to the
+PCIE config space of each RP. This DVSEC has vendor ID 0x10de and DVSEC ID
+0x0004. The DVSEC register contains the following information to map PCIE
+devices under an RP back to their RC# :
+
+ - Bus# (byte 0xc) : bus number as reported by the lspci output
+ - Segment# (byte 0xd) : segment number as reported by the lspci output
+ - RP# (byte 0xe) : port number as reported by LnkCap attribute from lspci for a device with Root Port capability
+ - RC# (byte 0xf): root complex number associated with the RP
+ - Socket# (byte 0x10): socket number associated with the RP
+
+Example script for mapping lspci BDF to RC# and socket#::
+
+ #!/bin/bash
+ # For each NVIDIA device, locate the NVIDIA DVSEC (Vendor=10de ID=0004) in
+ # the config space, then read the Bus/Segment/RP/RC/Socket bytes from it.
+ while read -r bdf rest; do
+     dvsec4_reg=$(lspci -vv -s "$bdf" | awk '
+         /Designated Vendor-Specific: Vendor=10de ID=0004/ {
+             # POSIX awk: extract the capability offset inside the brackets
+             match($0, /\[[0-9a-fA-F]+/);
+             print "0x" substr($0, RSTART + 1, RLENGTH - 1);
+             exit
+         }
+     ')
+     if [ -n "$dvsec4_reg" ]; then
+         bus=$(setpci -s "$bdf" $(printf '0x%x' $((dvsec4_reg + 0xc))).b)
+         segment=$(setpci -s "$bdf" $(printf '0x%x' $((dvsec4_reg + 0xd))).b)
+         rp=$(setpci -s "$bdf" $(printf '0x%x' $((dvsec4_reg + 0xe))).b)
+         rc=$(setpci -s "$bdf" $(printf '0x%x' $((dvsec4_reg + 0xf))).b)
+         socket=$(setpci -s "$bdf" $(printf '0x%x' $((dvsec4_reg + 0x10))).b)
+         echo "$bdf: Bus=$bus, Segment=$segment, RP=$rp, RC=$rc, Socket=$socket"
+     fi
+ done < <(lspci -d 10de:)
+
+Example output::
+
+ 0001:00:00.0: Bus=00, Segment=01, RP=00, RC=00, Socket=00
+ 0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00
+ 0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00
+ 0002:c0:00.0: Bus=c0, Segment=02, RP=03, RC=01, Socket=00
+ 0002:e0:00.0: Bus=e0, Segment=02, RP=04, RC=01, Socket=00
+ 0003:00:00.0: Bus=00, Segment=03, RP=00, RC=02, Socket=00
+ 0004:00:00.0: Bus=00, Segment=04, RP=00, RC=03, Socket=00
+ 0005:00:00.0: Bus=00, Segment=05, RP=00, RC=04, Socket=00
+ 0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00
+ 0005:c0:00.0: Bus=c0, Segment=05, RP=02, RC=04, Socket=00
+ 0006:00:00.0: Bus=00, Segment=06, RP=00, RC=05, Socket=00
+ 0009:00:00.0: Bus=00, Segment=09, RP=00, RC=00, Socket=01
+ 000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01
+ 000a:a0:00.0: Bus=a0, Segment=0a, RP=02, RC=01, Socket=01
+ 000a:e0:00.0: Bus=e0, Segment=0a, RP=03, RC=01, Socket=01
+ 000b:00:00.0: Bus=00, Segment=0b, RP=00, RC=02, Socket=01
+ 000c:00:00.0: Bus=00, Segment=0c, RP=00, RC=03, Socket=01
+ 000d:00:00.0: Bus=00, Segment=0d, RP=00, RC=04, Socket=01
+ 000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
+ 000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
+ 000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01
+
+PCIE-TGT PMU
+------------
+
+This PMU is located in the SOC fabric connecting the PCIE root complex (RC) and
+the memory subsystem. It monitors traffic targeting PCIE BAR and CXL HDM ranges.
+There is one PCIE-TGT PMU per PCIE RC in the SoC. Each RC in the Tegra410 SoC
+can have up to 16 lanes, which can be bifurcated into up to 8 root ports (RP).
+The PMU provides an RP filter to count PCIE BAR traffic to each RP and an
+address filter to count accesses to PCIE BAR or CXL HDM ranges. The details of
+the filters are described in the following sections.
+
+Mapping the RC# to lspci segment number is similar to the PCIE PMU. Please see
+:ref:`NVIDIA_T410_PCIE_PMU_RC_Mapping_Section` for more info.
+
+The events and configuration options of this PMU device are available in sysfs,
+see /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>.
+
+The events in this PMU can be used to measure bandwidth and utilization:
+
+ * rd_req: count the number of read requests to PCIE.
+ * wr_req: count the number of write requests to PCIE.
+ * rd_bytes: count the number of bytes transferred by rd_req.
+ * wr_bytes: count the number of bytes transferred by wr_req.
+ * cycles: count the clock cycles of SOC fabric connected to the PCIE interface.
+
+The average bandwidth is calculated as::
+
+ AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
+ AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS
+
+The average request rate is calculated as::
+
+ AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
+ AVG_WR_REQUEST_RATE = WR_REQ / CYCLES
+
+The PMU events can be filtered based on the destination root port or target
+address range. Filtering based on RP is only available for PCIE BAR traffic.
+The address filter works for both PCIE BAR and CXL HDM ranges. These filters can be
+found in sysfs, see
+/sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>/format/.
+
+Destination filter settings:
+
+* dst_rp_mask: bitmask to select the root port(s) to monitor. E.g. "dst_rp_mask=0xFF"
+ corresponds to all root ports (from 0 to 7) in the PCIE RC. Note that this filter is
+ only available for PCIE BAR traffic.
+* dst_addr_base: BAR or CXL HDM filter base address.
+* dst_addr_mask: BAR or CXL HDM filter address mask.
+* dst_addr_en: enable BAR or CXL HDM address range filter. If this is set, the
+ address range specified by "dst_addr_base" and "dst_addr_mask" will be used to filter
+ the PCIE BAR and CXL HDM traffic address. The PMU uses the following comparison
+ to determine if the traffic destination address falls within the filter range::
+
+ (txn's addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask)
+
+ If the comparison succeeds, then the event will be counted.
+
+If the destination filter is not specified, the RP filter will be configured by default
+to count PCIE BAR traffic to all root ports.
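+The address comparison above can be sketched with shell arithmetic. The
+transaction address used here (0x100a0) is a hypothetical value chosen to fall
+inside the range selected by dst_addr_base=0x10000 and dst_addr_mask=0xfff00:

```shell
# A txn matches when (addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask).
# Here 0x100a0 & 0xfff00 == 0x10000, and 0x10000 & 0xfff00 == 0x10000, so it matches.
if [ $(( 0x100a0 & 0xfff00 )) -eq $(( 0x10000 & 0xfff00 )) ]; then
    echo "match"
fi
```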
+
+Example usage:
+
+* Count event id 0x0 to root port 0 and 1 of PCIE RC-0 on socket 0::
+
+ perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_0/event=0x0,dst_rp_mask=0x3/
+
+* Count event id 0x1 for accesses to PCIE BAR or CXL HDM address range
+ 0x10000 to 0x100FF on socket 0's PCIE RC-1::
+
+ perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_1/event=0x1,dst_addr_base=0x10000,dst_addr_mask=0xFFF00,dst_addr_en=0x1/
+
+CPU Memory (CMEM) Latency PMU
+-----------------------------
+
+This PMU monitors latency events of memory read requests from the edge of the
+Unified Coherence Fabric (UCF) to local CPU DRAM:
+
+ * RD_REQ counters: count read requests (32B per request).
+ * RD_CUM_OUTS counters: accumulated outstanding request counters, which track
+   how many cycles the read requests are in flight.
+ * CYCLES counter: counts the number of elapsed cycles.
+
+The average latency is calculated as::
+
+ FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+ AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
+ AVG_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_source/devices/nvidia_cmem_latency_pmu_<socket-id>.
+
+Example usage::
+
+ perf stat -a -e '{nvidia_cmem_latency_pmu_0/rd_req/,nvidia_cmem_latency_pmu_0/rd_cum_outs/,nvidia_cmem_latency_pmu_0/cycles/}'
+
+NVLink-C2C PMU
+--------------
+
+This PMU monitors latency events of memory read/write requests that pass through
+the NVIDIA Chip-to-Chip (C2C) interface. Bandwidth events are not available
+in this PMU, unlike the C2C PMU in Grace (Tegra241 SoC).
+
+The events and configuration options of this PMU device are available in sysfs,
+see /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>.
+
+The list of events:
+
+ * IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
+ * IN_RD_REQ: the number of incoming read requests.
+ * IN_WR_CUM_OUTS: accumulated outstanding request (in cycles) of incoming write requests.
+ * IN_WR_REQ: the number of incoming write requests.
+ * OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
+ * OUT_RD_REQ: the number of outgoing read requests.
+ * OUT_WR_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing write requests.
+ * OUT_WR_REQ: the number of outgoing write requests.
+ * CYCLES: NVLink-C2C interface cycle counts.
+
+The incoming events count the reads/writes from a remote device to the SoC.
+The outgoing events count the reads/writes from the SoC to a remote device.
+
+The sysfs /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>/peer
+contains the information about the connected device.
+
+When the C2C interface is connected to GPU(s), the "gpu_mask" parameter can be
+used to filter traffic to/from specific GPU(s). Each bit represents a GPU index,
+e.g. "gpu_mask=0x1" corresponds to GPU 0 and "gpu_mask=0x3" to GPUs 0 and 1.
+If "gpu_mask" is not specified, the PMU monitors all GPUs.
+
+When connected to another SoC, only the read events are available.
+
+The events can be used to calculate the average latency of the read/write requests::
+
+ C2C_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+
+ IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
+ IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
+
+ IN_WR_AVG_LATENCY_IN_CYCLES = IN_WR_CUM_OUTS / IN_WR_REQ
+ IN_WR_AVG_LATENCY_IN_NS = IN_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
+
+ OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
+ OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
+
+ OUT_WR_AVG_LATENCY_IN_CYCLES = OUT_WR_CUM_OUTS / OUT_WR_REQ
+ OUT_WR_AVG_LATENCY_IN_NS = OUT_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
+
+Example usage:
+
+ * Count incoming traffic from all GPUs connected via NVLink-C2C::
+
+ perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_req/
+
+ * Count incoming traffic from GPU 0 connected via NVLink-C2C::
+
+ perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x1/
+
+ * Count incoming traffic from GPU 1 connected via NVLink-C2C::
+
+ perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x2/
+
+ * Count outgoing traffic to all GPUs connected via NVLink-C2C::
+
+ perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_req/
+
+ * Count outgoing traffic to GPU 0 connected via NVLink-C2C::
+
+ perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x1/
+
+ * Count outgoing traffic to GPU 1 connected via NVLink-C2C::
+
+ perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x2/
+
+NV-CLink PMU
+------------
+
+This PMU monitors latency events of memory read requests that pass through
+the NV-CLINK interface. Bandwidth events are not available in this PMU.
+In the Tegra410 SoC, the NV-CLink interface is used to connect to another
+Tegra410 SoC, and this PMU only counts read traffic.
+
+The events and configuration options of this PMU device are available in sysfs,
+see /sys/bus/event_source/devices/nvidia_nvclink_pmu_<socket-id>.
+
+The list of events:
+
+ * IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
+ * IN_RD_REQ: the number of incoming read requests.
+ * OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
+ * OUT_RD_REQ: the number of outgoing read requests.
+ * CYCLES: NV-CLINK interface cycle counts.
+
+The incoming events count the reads from a remote device to the SoC.
+The outgoing events count the reads from the SoC to a remote device.
+
+The events can be used to calculate the average latency of the read requests::
+
+ CLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+
+ IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
+ IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ
+
+ OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
+ OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ
+
+Example usage:
+
+ * Count incoming read traffic from remote SoC connected via NV-CLINK::
+
+ perf stat -a -e nvidia_nvclink_pmu_0/in_rd_req/
+
+ * Count outgoing read traffic to remote SoC connected via NV-CLINK::
+
+ perf stat -a -e nvidia_nvclink_pmu_0/out_rd_req/
+
+NV-DLink PMU
+------------
+
+This PMU monitors latency events of memory read requests that pass through
+the NV-DLINK interface. Bandwidth events are not available in this PMU.
+In the Tegra410 SoC, this PMU only counts CXL memory read traffic.
+
+The events and configuration options of this PMU device are available in sysfs,
+see /sys/bus/event_source/devices/nvidia_nvdlink_pmu_<socket-id>.
+
+The list of events:
+
+ * IN_RD_CUM_OUTS: accumulated outstanding read requests (in cycles) to CXL memory.
+ * IN_RD_REQ: the number of read requests to CXL memory.
+ * CYCLES: NV-DLINK interface cycle counts.
+
+The events can be used to calculate the average latency of the read requests::
+
+ DLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+
+ IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
+ IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / DLINK_FREQ_IN_GHZ
+
+Example usage:
+
+ * Count read events to CXL memory::
+
+ perf stat -a -e '{nvidia_nvdlink_pmu_0/in_rd_req/,nvidia_nvdlink_pmu_0/in_rd_cum_outs/}'