1BP: 1-bit branch predictor
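A minimal sketch in C of how a 1-bit predictor works: one bit of history per branch, predict the last outcome, flip the bit when the branch resolves the other way. The table size and indexing scheme are assumptions for illustration, not from any particular machine.

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 1024                  /* assumed power-of-two table size */

    static bool bht[BHT_ENTRIES];             /* one bit per entry: taken last time? */

    static unsigned bht_index(uint32_t pc) {
        return (pc >> 2) % BHT_ENTRIES;       /* drop byte offset, fold PC into table */
    }

    /* Predict: repeat whatever this branch did last time. */
    bool predict_taken(uint32_t pc) {
        return bht[bht_index(pc)];
    }

    /* Update after the branch resolves: remember the actual outcome. */
    void train(uint32_t pc, bool taken) {
        bht[bht_index(pc)] = taken;
    }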
4 C's - compulsory misses: the first time a block is accessed, it cannot yet be in the cache, so the access misses.
4 C's - capacity misses: blocks must be evicted because the cache is too small to hold the working set.
4 C's - coherence misses: processors are accessing the same block. Processor A writes to the block; even though Processor B has the block in its cache, B's next access misses because its copy is no longer up-to-date.
4 C's - conflict misses: associated with set-associative and direct-mapped caches; another address maps to the same cache block and replaces the data currently in the cache.
ALAT: advanced load address table - tracks advanced (speculatively hoisted) load operations
aliasing: in the BTB, when two branch addresses map to the same BTB entry, this is called aliasing. Aliasing should be kept to <1%.
ALU: arithmetic logic unit
AMAT: average memory access time = hit time + miss rate * miss penalty
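A worked instance of the AMAT formula above (the numbers are assumed for illustration): with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty,

\[
\text{AMAT} = 1 + 0.05 \times 100 = 6 \text{ cycles}
\]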
Amdahl's Law: an equation to determine the improvement of a system when only a portion of the system is improved.
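Amdahl's Law in equation form, with a worked instance (the fraction f and factor s are assumed for illustration): speeding up half of the execution time (f = 0.5) by a factor of s = 2 yields only

\[
\text{Speedup}_{\text{overall}} = \frac{1}{(1-f) + f/s} = \frac{1}{0.5 + 0.25} \approx 1.33
\]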
architectural registers: registers (Floating point and General Purpose) that are visible to the programmer.
ARF: architectural register file or retirement register file
Asynchronous Message Passing: a processor requests data, then continues processing instructions while the message is retrieved.
BHT: branch history table - records if branch was taken or not taken.
blocking cache: the cache services one request at a time; while a miss is outstanding, all other requests are blocked
BTB: branch target buffer - keeps track of what address was taken last time the processor encountered this instruction.
cache coherence definition #1: Definition #1 - A read R from address X on processor P1 returns the value written by the most recent write W to X on P1 if no other processor has written to X between W and R.
cache coherence definition #2: Definition #2 - If P1 writes to X and P2 reads X after a sufficient time, and there are no other writes to X in between, P2’s read returns the value written by P1’s write.
cache coherence definition #3: Definition #3 - Writes to the same location are serialized: two writes to location X are seen in the same order by all processors.
cache hit: desired data is in the cache and is up-to-date
cache miss: desired data is not in the cache, or the cached copy is stale (no longer up-to-date)
cache thrashing: when two or more addresses compete for the same cache block. The processor requests both addresses, so each access evicts the previous one.
CDB: common data bus
checkpointing: store the state of the CPU before a branch is taken. If the branch is mispredicted, restore the CPU to the correct state. Don't store to memory until it is determined that this is the correct branch.
CISC Processor: complex instruction set computer
CMP: chip multiprocessor
coarse multi-threading: the thread being processed changes every few clock cycles
consistency: order of access to different addresses
control hazard: branches and jumps cannot be executed until the destination address is known
CPI: cycles per instruction
CPU: central processing unit
Dark Silicon: the gap between how many transistors are on a chip and how many can be used simultaneously. The simultaneous usage is determined by the power consumption of the chip.
data hazard: reordering or overlapping instructions changes when data is read and written; if the instructions involved are dependent, there is a data hazard.
DDR SDRAM: double data rate synchronous dynamic RAM
dependency chain: long series of dependent instructions in code
directory protocols: information about each block state in the caches is stored in a common directory.
DRAM: dynamic random access memory
DSM: distributed shared memory - all processors can access all memory locations
Enterprise class: used for large scale systems that service enterprises
error: incorrect system state caused by an activated fault; may lead to a failure
error forecasting: estimate presence, creation, and consequences of errors
error removal: removing latent errors by verification
exclusion property: each cache level will not contain any data held by a lower level cache
explicit ILP: the compiler decides which instructions to execute in parallel
failure: the delivered service deviates from the specified service; the externally visible result of an error
fault avoidance: prevent the occurrence of faults by construction
fault tolerance: prevent faults from becoming failures through redundancy
fault: the actual behavior inside the system deviates from the specified behavior (e.g., a bit flips in memory)
FIFO: first in first out
fine multi-threading: the thread being processed changes every cycle
FLOPS: floating point operations per second
Flynn's Taxonomy: classifications of parallel computer architecture, SISD, SIMD, MISD, MIMD
FPR: floating point register
FSB: front side bus
Geometric Mean: the nth root of the product of the n numbers
global miss rate: (# of L2 misses) / (# of all memory accesses)
GPR: general purpose register
hit latency: time it takes to get data from cache. Includes the time to find the address in the cache and load it on the data lines
ILP: instruction-level parallelism
inclusion property: each level of cache will include all data from the lower level caches
IPC: instructions per cycle
Iron Law: execution time = (# of executed instructions) x CPI x clock cycle time. For example, a single-cycle machine running N instructions at CPI = 1 with a 2 ns cycle takes N x 1 x 2 ns = 2N ns.
Iron Law: instructions per program depend on the source code, compiler technology, and the ISA. CPI depends on the ISA and the microarchitecture. Time per cycle depends on the microarchitecture and the underlying technology.
iron law of computer performance: relates cycles per instruction, frequency, and number of instructions to computer performance
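The Iron Law in equation form; since cycle time is the reciprocal of the clock frequency f, the product can be written either way:

\[
T_{\text{exec}} = N_{\text{inst}} \times \text{CPI} \times t_{\text{cycle}} = \frac{N_{\text{inst}} \times \text{CPI}}{f}
\]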
ISA: instruction set architecture
Itanium architecture: an explicit ILP architecture, six instructions can be executed per clock cycle
Itanium Processor: Intel family of 64-bit processors that uses the Itanium architecture LFU: least frequently used
ll and sc: load link and store conditional, a method using two instructions ll and sc for ensuring synchronization.
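A minimal sketch of the ll/sc pattern for acquiring a spin lock, written in C against hypothetical load_link()/store_conditional() intrinsics; real code would use inline assembly (e.g., ARM ldrex/strex, RISC-V lr.w/sc.w) or compiler builtins, so the names here are illustrative only.

    /* Hypothetical intrinsics standing in for the hardware's
     * load-link / store-conditional instructions. store_conditional
     * returns nonzero on success, i.e., when no other processor
     * wrote the location between the ll and the sc. */
    extern int load_link(volatile int *addr);
    extern int store_conditional(volatile int *addr, int value);

    void lock_acquire(volatile int *lock) {
        for (;;) {
            if (load_link(lock) != 0)
                continue;                    /* lock held: retry */
            if (store_conditional(lock, 1))
                return;                      /* sc succeeded: we own the lock */
            /* sc failed: another processor intervened, start over */
        }
    }

    void lock_release(volatile int *lock) {
        *lock = 0;                           /* ordinary store frees the lock */
    }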
local miss rate: (# of L2 misses) / (# of L1 misses), i.e., misses divided by the accesses that actually reach L2. For example, out of 1000 memory accesses, if 100 miss in L1 and 20 of those miss in L2, the L2 local miss rate is 20% while its global miss rate is 2%.
locality principle: things that will happen soon are likely to be similar to things that just happened.
loop interchange: used for nested loops. Interchange the order of the loop iterations so that the index accesses match the actual layout of the data in memory, as in the sketch below.
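A sketch in C, with the array size assumed for illustration. C arrays are row-major, so making j the inner index walks memory sequentially instead of striding across rows:

    #define N 1024
    double a[N][N];

    /* Before: column-order traversal of a row-major array.
     * Consecutive inner iterations touch addresses N*8 bytes apart. */
    void sum_slow(double *total) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                *total += a[i][j];
    }

    /* After interchange: the inner loop walks one row sequentially,
     * matching the memory layout and hitting in the cache. */
    void sum_fast(double *total) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                *total += a[i][j];
    }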
LRU: least recently used
LSQ: load store queue
MCB: memory conflict buffer - "Dynamic Memory Disambiguation Using the Memory Conflict Buffer", see also "Memory Disambiguation"
MOESI Protocol: modified-owned-exclusive-shared-invalid protocol, the states of any cached block.
MESI Protocol: modified-exclusive-shared-invalid protocol, the states of any cached block.
Message Passing: a processor can only access its local memory. To access other memory locations it must send request/receive messages for data at those locations.
meta-predictor: a predictor that chooses the best branch predictor for each branch.
MIMD: multiple instruction streams, multiple data streams
MISD: multiple instruction streams, single data stream
miss latency: time it takes to get data from main memory, including the time to determine that the data is not in the cache, determine who owns the data, and send it to the CPU.
mobo: motherboard
Moore's Law: Gordon E. Moore observed that the number of transistors on an integrated circuit doubles roughly every two years.
MP: multiprocessing
MPKI: misses per kilo-instruction = (# of misses) / (# of instructions / 1000)
MSI Protocol: modified-shared-invalid protocol, the states of any cached block.
MTPI: message transfer part interface
MTTF: mean time to failure
MTTR: mean time to repair
multi-level caches: caches with two or more levels, each level larger and slower than the previous level
mutex variable: mutually exclusive (mutex), a low level synchronization mechanism. A thread acquires the variable, then releases it upon completion of the task. During this period no other thread can acquire the mutex.
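A minimal pthread sketch of the acquire/release pattern just described; the shared counter is an assumed example, not from the text:

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;             /* shared data guarded by m */

    void *worker(void *arg) {
        (void)arg;
        pthread_mutex_lock(&m);          /* acquire: no other thread may enter */
        counter++;                       /* critical section */
        pthread_mutex_unlock(&m);        /* release: another thread may acquire */
        return NULL;
    }

While one thread holds the mutex, every other thread calling pthread_mutex_lock blocks until the holder unlocks.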
NMRU: not most recently used
non-blocking caches: if there is a miss, the cache services the next request while waiting for memory
NUMA: non-uniform memory access, also called distributed shared memory
OOO: out of order
OS: operating system
PAPT: physically addressed, physically tagged cache - the cache stores the data based on its physical address
PC: program counter
PCI: peripheral component interconnect
Pentium Processor: x86 super scalar processor from Intel
physical registers: registers (FP and GP) that are not visible to the programmer
pipeline burst cache: see pipelined cache
pipelined cache: a pipelined burst cache uses 3 clock cycles to transfer the first data set from a cache block, then 1 clock cycle for each of the rest: the 'pipeline' plus the 'burst' (3-1-1-1)
PIPT: physically indexed, physically tagged cache
Power: dynamic power = 1/2 * alpha * C * V^2 * f, where alpha is the activity factor, C the switched capacitance, V the supply voltage, and f the clock frequency
Power Architecture: performance optimization with enhanced RISC
Power vs Performance Equation: with combined voltage and frequency scaling, V scales roughly with f, so dynamic power grows approximately with the cube of performance (P is proportional to f * V^2, i.e., roughly f^3).
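A short derivation of that cubic relationship, assuming performance tracks frequency and supply voltage scales roughly linearly with frequency:

\[
P_{\text{dyn}} = \tfrac{1}{2}\,\alpha C V^{2} f, \qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^{3}
\]

So halving the clock (and the voltage with it) cuts dynamic power to roughly one eighth.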
pre-fetch buffer: when getting data from memory, the entire row is read and stored in a buffer.
pre-fetching cache: instructions are fetched from memory before they are needed by the CPU
Prescott Processor: based on the NetBurst architecture, with a 31-stage pipeline in the core. The high penalty paid for mispredictions is supposedly offset by a Rapid Execution Engine. It also has a trace execution cache, which stores decoded instructions and reuses them instead of fetching and decoding again.
PRF: physical register file
pseudo associative cache: an address is first searched in 1/2 of the cache. If it is not there, then it is searched in the other half of the cache.
RAID: redundant array of independent disks
RAID 0: data is striped across the disks, alternating between them. Each disk supplies a portion of the data, which usually improves performance.
RAID 1: the data is replicated on another disk; each disk contains a full copy of the data. Whichever disk is free responds to a read request. A write request is written to one disk and then mirrored to the other disk(s).
RAID 2 and RAID 3: the data is striped on disks and Hamming codes or parity bits are used for error detection. RAID 2 and RAID 3 are not used in any current application
RAID 4: Data is striped in large blocks onto disks with a dedicated parity disk. It is used by the NetApp company.
RAID 5: data is striped in large blocks onto disks, but there is no dedicated parity disk; the parity for each stripe is stored on one of the disks, rotating among them.
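A sketch of the XOR parity used by RAID 4/5 (the block size is assumed for illustration): the parity block is the XOR of the data blocks, so any single lost block can be rebuilt by XOR-ing the survivors.

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK 512                        /* assumed block size in bytes */

    /* parity[i] = data[0][i] ^ data[1][i] ^ ... for every byte of the stripe */
    void compute_parity(uint8_t *parity, const uint8_t *const *data,
                        size_t ndisks) {
        for (size_t i = 0; i < BLOCK; i++) {
            uint8_t p = 0;
            for (size_t d = 0; d < ndisks; d++)
                p ^= data[d][i];
            parity[i] = p;
        }
    }
    /* Rebuilding a lost block uses the same routine: XOR the parity
     * block together with the remaining data blocks. */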
RAR: read after read
RAS: return address stack
RAT: register alias table
RAT: register allocation table (a second expansion of RAT, used in multiprocessing contexts)
RAW: read after write
RDRAM: Rambus dynamic random access memory (Direct Rambus DRAM)
relaxed consistency: some instructions can be performed out of order and still maintain consistency
reliability: measure of continuous service accomplishment
reservation stations: functional unit buffers
RETI: return from interrupt
RF: register file
RISC Processor: reduced instruction set computer - simple instructions of the same size, typically executed in one clock cycle
ROB: re-order buffer
RS: reservation station
RWX: read - write- execute permissions on files
SHARC processor: floating point processors designed for DSP applications
SIMD: single instruction stream, multiple data streams
simultaneous multi-threading: instructions from different threads are processed, even in the same cycle
SISD: single instruction stream, single data stream
SMP: symmetric multiprocessing
SMT: simultaneous multi threading
snooping protocols: on a broadcast network (bus), each processor's cache watches (snoops) the bus for addresses it holds.
SPARC processor: Scalable Processor Architecture -a RISC instruction set processor
spatial locality: if we access a memory location, nearby memory locations have a tendency to be accessed soon.
Speedup: how much faster a modified system is compared to the unmodified system: speedup = old execution time / new execution time.
SPR: special purpose registers - such as the program counter or status register
SRAM: static random access memory
structural hazard: the pipeline contains two instructions attempting to access the same resource.
superscalar architecture: the processor manages instruction dependences at run time and executes more than one instruction per clock cycle using multiple pipelines.
sequential consistency: "a system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations for each individual processor appear in the order specified by the program." Quote by Leslie Lamport
Synchronous Message Passing: a processor requests data then waits until the data is received before continuing.
tag: the part of the data address that is used to find the data in the cache. This portion of the address is unique so that it can be distinguished from other lines in the cache.
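A sketch of how a 32-bit address splits into tag, set index, and block offset; the cache geometry (64-byte blocks, 128 sets) is assumed for illustration:

    #include <stdint.h>

    #define OFFSET_BITS 6                 /* assumed: 64-byte blocks */
    #define INDEX_BITS  7                 /* assumed: 128 sets       */

    uint32_t block_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
    uint32_t set_index(uint32_t addr)    { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
    uint32_t tag_bits(uint32_t addr)     { return addr >> (OFFSET_BITS + INDEX_BITS); }
    /* The cache compares the stored tag against tag_bits(addr); only a
     * full match within the selected set is a hit. */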
temporal locality: if a program accesses a memory location, it tends to access the same location again very soon.
TLB: translation lookaside buffer - a cache of virtual-to-physical memory address translations. TLB misses are very time consuming
Tomasulo's Algorithm: achieve high performance without using special compilers by using dynamic scheduling
tournament predictor: a meta-predictor
trace caches: sets of instructions are stored in a separate cache. These are instructions that have been decoded and executed. If there is a branch in the set, only the taken branch instructions are kept. If there is a misprediction the trace stops.
trace scheduling: rearranging instructions for faster execution; the common-case path is scheduled as if it were straight-line code
tree, tournament, dissemination barriers: types of structures for barriers
UMA: uniform memory access - all memory locations have similar latencies.