nvprof --query-events

Summary: nvprof --query-events

Available Events:

Name Description

Device 0 (GeForce GTX 970M):

 

Domain domain_a:

tex0_cache_sector_queries: Number of texture cache 0 requests. This increments by 1 for each 32-byte access.
tex1_cache_sector_queries: Number of texture cache 1 requests. This increments by 1 for each 32-byte access.
tex0_cache_sector_misses: Number of texture cache 0 misses. This increments by 1 for each 32-byte access.
tex1_cache_sector_misses: Number of texture cache 1 misses. This increments by 1 for each 32-byte access.
elapsed_cycles_sm: Elapsed clocks
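
The query and miss counters above can be combined into a texture cache hit rate. The following is a minimal sketch, assuming the miss counts are a subset of the query counts and that the raw values were collected with something like `nvprof --events <names> ./app`; the numbers in `events` are hypothetical, not measured data.

```python
def tex_cache_hit_rate(queries, misses):
    """Return the fraction of 32-byte sector queries that hit.

    `queries` and `misses` are per-cache counter values (tex0, tex1).
    Returns None when no queries were recorded.
    """
    total_queries = sum(queries)
    total_misses = sum(misses)
    if total_queries == 0:
        return None
    return 1.0 - total_misses / total_queries


# Hypothetical counter values for one kernel (illustration only).
events = {
    "tex0_cache_sector_queries": 1000,
    "tex1_cache_sector_queries": 1000,
    "tex0_cache_sector_misses": 100,
    "tex1_cache_sector_misses": 400,
}

rate = tex_cache_hit_rate(
    [events["tex0_cache_sector_queries"], events["tex1_cache_sector_queries"]],
    [events["tex0_cache_sector_misses"], events["tex1_cache_sector_misses"]],
)
print(f"texture cache hit rate: {rate:.2%}")
```

Summing both caches before dividing weights each cache by its traffic, which is usually more meaningful than averaging two separate ratios.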

Domain domain_b:

fb_subp0_read_sectors: Number of DRAM read requests to sub partition 0. This increments by 1 for each 32-byte access.
fb_subp1_read_sectors: Number of DRAM read requests to sub partition 1. This increments by 1 for each 32-byte access.
fb_subp0_write_sectors: Number of DRAM write requests to sub partition 0. This increments by 1 for each 32-byte access.
fb_subp1_write_sectors: Number of DRAM write requests to sub partition 1. This increments by 1 for each 32-byte access.
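
Because each of these events counts 32-byte accesses, DRAM traffic in bytes follows directly from the counts. The sketch below assumes the counts from both sub partitions are simply added and that the kernel duration is known from a separate timing run; the counter values and the 1 ms duration are hypothetical.

```python
SECTOR_BYTES = 32  # each event increment corresponds to one 32-byte access


def dram_traffic_bytes(events):
    """Return (bytes_read, bytes_written) summed over both sub partitions."""
    read_sectors = events["fb_subp0_read_sectors"] + events["fb_subp1_read_sectors"]
    write_sectors = events["fb_subp0_write_sectors"] + events["fb_subp1_write_sectors"]
    return read_sectors * SECTOR_BYTES, write_sectors * SECTOR_BYTES


def throughput_gb_per_s(num_bytes, seconds):
    """Convert a byte count and a duration into GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / seconds / 1e9


# Hypothetical counter values and kernel duration (illustration only).
dram_events = {
    "fb_subp0_read_sectors": 1_000_000,
    "fb_subp1_read_sectors": 1_000_000,
    "fb_subp0_write_sectors": 500_000,
    "fb_subp1_write_sectors": 500_000,
}
kernel_seconds = 0.001

bytes_read, bytes_written = dram_traffic_bytes(dram_events)
print("DRAM read  GB/s:", throughput_gb_per_s(bytes_read, kernel_seconds))
print("DRAM write GB/s:", throughput_gb_per_s(bytes_written, kernel_seconds))
```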

Domain domain_c:

gld_inst_8bit: Total number of 8-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_16bit: Total number of 16-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_32bit: Total number of 32-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_64bit: Total number of 64-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_128bit: Total number of 128-bit global load instructions that are executed by all the threads across all thread blocks.
gst_inst_8bit: Total number of 8-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_16bit: Total number of 16-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_32bit: Total number of 32-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_64bit: Total number of 64-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_128bit: Total number of 128-bit global store instructions that are executed by all the threads across all thread blocks.
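
The per-width instruction counts above can be turned into the number of bytes requested by global loads or stores. This is a rough sketch that assumes each counted instruction moves exactly its nominal width per thread; it says nothing about how many memory transactions the hardware actually issued. The counter values are hypothetical.

```python
WIDTHS_BITS = (8, 16, 32, 64, 128)


def requested_bytes(events, prefix):
    """Sum count * width-in-bytes over all widths for 'gld' or 'gst' events.

    Events missing from the dictionary are treated as zero.
    """
    return sum(
        events.get(f"{prefix}_inst_{width}bit", 0) * (width // 8)
        for width in WIDTHS_BITS
    )


# Hypothetical counter values (illustration only).
global_events = {
    "gld_inst_32bit": 1024,
    "gld_inst_128bit": 256,
    "gst_inst_64bit": 512,
}

loaded = requested_bytes(global_events, "gld")
stored = requested_bytes(global_events, "gst")
print("bytes requested by global loads: ", loaded)
print("bytes requested by global stores:", stored)
```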

Domain domain_d:

prof_trigger_00: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_01: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_02: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_03: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_04: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_05: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_06: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_07: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
warps_launched: Number of warps launched.
inst_issued0: Number of cycles that did not issue any instruction, increments per warp.
inst_issued1: Number of cycles that issued single instruction, increments per warp.
inst_issued2: Number of cycles that issued dual instructions, increments per warp.
inst_executed: Number of instructions executed per warp.
thread_inst_executed: Number of instructions executed by the active threads. For each instruction it increments by number of threads, including predicated-off threads, that execute the instruction. It does not include replays.
not_predicated_off_thread_inst_executed: Number of instructions executed by active and not predicated off threads, does not include replays. For each instruction it increments by the number of not predicated off threads that execute this instruction.
local_store: Number of executed store instructions where state space is specified as local, increments per warp on a multiprocessor.
local_load: Number of executed load instructions where state space is specified as local, increments per warp on a multiprocessor.
shared_load: Number of executed load instructions where state space is specified as shared, increments per warp on a multiprocessor.
shared_store: Number of executed store instructions where state space is specified as shared, increments per warp on a multiprocessor.
shared_atom_cas: Number of ATOMS.CAS instructions executed per warp.
shared_atom: Number of ATOMS instructions executed per warp.
global_atom_cas: Number of ATOM.CAS instructions executed per warp.
atom_count: Number of ATOM instructions executed per warp.
global_load: Number of executed load instructions where state space is specified as global, increments per warp on a multiprocessor.
global_store: Number of executed store instructions where state space is specified as global, increments per warp on a multiprocessor.
gred_count: Number of reduction operations performed per warp.
divergent_branch: Number of divergent branches within a warp. This counter will be incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a conditional branch.
branch: Number of branch instructions executed per warp on a multiprocessor.
active_cycles: Number of cycles a multiprocessor has at least one active warp.
active_warps: Accumulated number of active warps per cycle. For every cycle it increments by the number of active warps in the cycle which can be in the range 0 to 64.
active_ctas: Accumulated number of active blocks per cycle. For every cycle it increments by the number of active blocks in the cycle which can be in the range 0 to 32.
sm_cta_launched: Number of blocks launched
shared_ld_bank_conflict: Number of shared load bank conflicts generated when the addresses for two or more shared memory load requests fall in the same memory bank.
shared_st_bank_conflict: Number of shared store bank conflicts generated when the addresses for two or more shared memory store requests fall in the same memory bank.
shared_ld_transactions: Number of transactions for shared load accesses. Maximum transaction size in Maxwell is 128 bytes; any warp accessing more than 128 bytes will cause multiple transactions for a shared load instruction. This also includes extra transactions caused by shared bank conflicts.
shared_st_transactions: Number of transactions for shared store accesses. Maximum transaction size in Maxwell is 128 bytes; any warp accessing more than 128 bytes will cause multiple transactions for a shared store instruction. This also includes extra transactions caused by shared bank conflicts.
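
Several of the domain_d counters are mostly useful as ratios. The sketch below derives a few of them, assuming a warp size of 32 threads and the limit of 64 active warps quoted in the active_warps description. The metric names and formulas are illustrative choices rather than nvprof's own metric definitions, and the counter values are hypothetical.

```python
WARP_SIZE = 32         # threads per warp
MAX_WARPS_PER_SM = 64  # upper bound quoted in the active_warps description


def ratio(numerator, denominator):
    """Divide two counters, returning None when the denominator is zero."""
    return numerator / denominator if denominator else None


def derived_metrics(ev):
    """Compute a few ratios from raw domain_d event counts."""
    avg_active_warps = ratio(ev["active_warps"], ev["active_cycles"])
    return {
        # Average active warps per active cycle, relative to the maximum.
        "achieved_occupancy": (
            None if avg_active_warps is None else avg_active_warps / MAX_WARPS_PER_SM
        ),
        # Warp-level instructions executed per active cycle.
        "ipc": ratio(ev["inst_executed"], ev["active_cycles"]),
        # Average fraction of the 32 lanes doing work per instruction.
        "warp_execution_efficiency": ratio(
            ev["thread_inst_executed"], ev["inst_executed"] * WARP_SIZE
        ),
        # Values above 1.0 hint at bank conflicts or wide accesses.
        "shared_ld_transactions_per_request": ratio(
            ev["shared_ld_transactions"], ev["shared_load"]
        ),
        # Fraction of executed branches that diverged.
        "divergent_branch_fraction": ratio(ev["divergent_branch"], ev["branch"]),
    }


# Hypothetical counter values (illustration only).
sm_events = {
    "active_cycles": 10_000,
    "active_warps": 320_000,
    "inst_executed": 20_000,
    "thread_inst_executed": 480_000,
    "shared_load": 1_000,
    "shared_ld_transactions": 2_000,
    "branch": 400,
    "divergent_branch": 100,
}

metrics = derived_metrics(sm_events)
for name, value in metrics.items():
    print(f"{name}: {value}")
```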

Domain domain_e:

l2_subp0_write_sector_misses: Number of write requests sent to DRAM from slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_read_sector_misses: Number of read requests sent to DRAM from slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_write_sector_misses: Number of write requests sent to DRAM from slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_read_sector_misses: Number of read requests sent to DRAM from slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_read_tex_sector_queries: Number of read requests from Texture cache to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_write_tex_sector_queries: Number of write requests from Texture cache to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_read_tex_hit_sectors: Number of read requests from Texture cache that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_write_tex_hit_sectors: Number of write requests from Texture cache that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_read_sysmem_sector_queries: Number of system memory read requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_write_sysmem_sector_queries: Number of system memory write requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_total_read_sector_queries: Total read requests to slice 0 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp0_total_write_sector_queries: Total write requests to slice 0 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp1_read_tex_sector_queries: Number of read requests from Texture cache to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_write_tex_sector_queries: Number of write requests from Texture cache to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_read_tex_hit_sectors: Number of read requests from Texture cache that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_write_tex_hit_sectors: Number of write requests from Texture cache that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_read_sysmem_sector_queries: Number of system memory read requests to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_write_sysmem_sector_queries: Number of system memory write requests to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_total_read_sector_queries: Total read requests to slice 1 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp1_total_write_sector_queries: Total write requests to slice 1 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
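
The per-slice L2 counters can be summed to estimate how well texture reads hit in L2 and how much read traffic falls through to DRAM. This is a minimal sketch, assuming two L2 slices as listed above and 32 bytes per counted access; the counter values are hypothetical.

```python
SECTOR_BYTES = 32  # each event increment corresponds to one 32-byte access
L2_SLICES = (0, 1)


def l2_tex_read_hit_rate(events):
    """Fraction of texture-cache read sectors that hit in L2, over all slices."""
    queries = sum(events[f"l2_subp{s}_read_tex_sector_queries"] for s in L2_SLICES)
    hits = sum(events[f"l2_subp{s}_read_tex_hit_sectors"] for s in L2_SLICES)
    return hits / queries if queries else None


def l2_read_miss_bytes(events):
    """Bytes of read traffic sent from L2 to DRAM, over all slices."""
    misses = sum(events[f"l2_subp{s}_read_sector_misses"] for s in L2_SLICES)
    return misses * SECTOR_BYTES


# Hypothetical counter values (illustration only).
l2_events = {
    "l2_subp0_read_tex_sector_queries": 600,
    "l2_subp1_read_tex_sector_queries": 400,
    "l2_subp0_read_tex_hit_sectors": 450,
    "l2_subp1_read_tex_hit_sectors": 350,
    "l2_subp0_read_sector_misses": 100,
    "l2_subp1_read_sector_misses": 100,
}

hit_rate = l2_tex_read_hit_rate(l2_events)
miss_bytes = l2_read_miss_bytes(l2_events)
print(f"L2 texture read hit rate: {hit_rate:.2%}")
print("L2 read miss traffic in bytes:", miss_bytes)
```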

