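Originally posted by hopetoknow2 on 2006-6-13 21:12
Latest optimization manuals: Yonah (Core Duo) also seems to have started appearing in them, though I can't remember whether there is a section devoted to it. You can download the latest PDFs and check.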
http://download.intel.com/design/Pentium4/manuals/24896613.pdf
http://download.intel.com/design/Pentium4/manuals/25366519.pdf
In any case, the coverage won't be very extensive or detailed.
Intel Core Solo and Intel Core Duo processors incorporate a
microarchitecture that is similar to the Pentium M processor
microarchitecture, but provides additional enhancements for
performance and power efficiency. Enhancements include:
This second level cache is shared between two cores in an Intel Core
Duo processor to minimize bus traffic between two cores accessing
a single-copy of cached data. It allows an Intel Core Solo processor
(or when one of the two cores in an Intel Core Duo processor is idle)
• Streaming SIMD Extensions 3
These extensions are supported in Intel Core Solo and Intel Core
Duo processors.
Improvement in decoder and micro-op fusion allows the front end to
see most instructions as single μop instructions. This increases the
throughput of the three decoders in the front end.
Throughput of SIMD instructions is improved and the out-of-order
engine is more robust in handling sequences of frequently-used
instructions. Enhanced internal buffering and prefetch mechanisms
also improve data bandwidth for execution.
Execution of SIMD instructions on Intel Core Solo and Intel Core Duo
processors is improved over Pentium M processors by the following
enhancements:
• Micro-op fusion
Scalar SIMD operations on register and memory have single
micro-op flows comparable to X87 flows. Many packed instructions
are fused to reduce their micro-op flow from four to two micro-ops.
• Eliminating decoder restrictions
Intel Core Solo and Intel Core Duo processors improve decoder
throughput with micro-fusion and macro-fusion, so that many more
SSE and SSE2 instructions can be decoded without restriction. On
Pentium M processors, many single micro-op SSE and SSE2
instructions must be decoded by the main decoder.
• Improved packed SIMD instruction decoding
On Intel Core Solo and Intel Core Duo processors, decoding of most
packed SSE instructions is done by all three decoders. As a result
the front end can process up to three packed SSE instructions every
cycle. There are some exceptions to the above; some
shuffle/unpack/shift operations are not fused and require the main
decoder.
Data Prefetching
Intel Core Solo and Intel Core Duo processors provide hardware
mechanisms to prefetch data from memory to the second-level cache.
There are two techniques: one mechanism activates after the data access
pattern experiences two cache-reference misses within a trigger-distance
threshold (see Table 1-2). This mechanism is similar to that of the
Pentium M processor, but can track 16 forward data streams and 4
backward streams. The second mechanism fetches an adjacent cache
line of data after experiencing a cache miss. This effectively simulates
the prefetching capabilities of 128-byte sectors (similar to the sectoring
of two adjacent 64-byte cache lines available in Pentium 4 processors).
Hardware prefetch requests are queued up in the bus system at lower
priority than normal cache-miss requests. If the bus queue is in high
demand, hardware prefetch requests may be ignored or cancelled to
service bus traffic required by demand cache-misses and other bus
transactions.
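
To make the adjacent-line mechanism concrete, here is a minimal C sketch of the address arithmetic involved. It assumes the "adjacent" line is the other half of the same aligned 128-byte pair, which the excerpt suggests by its comparison to Pentium 4 sectoring but does not state outright. The function names are hypothetical and only illustrate the idea; they do not describe how the hardware is implemented.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u                      /* line size quoted in the text */
#define SECTOR_BYTES (2u * CACHE_LINE_BYTES)      /* 128-byte "sector" of two lines */

/* Start address of the 64-byte cache line that contains addr. */
static uintptr_t line_base(uintptr_t addr) {
    /* The mask trick below only works for power-of-two line sizes. */
    assert((CACHE_LINE_BYTES & (CACHE_LINE_BYTES - 1u)) == 0u);
    return addr & ~(uintptr_t)(CACHE_LINE_BYTES - 1u);
}

/* Start address of the aligned 128-byte pair that contains addr. */
static uintptr_t sector_base(uintptr_t addr) {
    return addr & ~(uintptr_t)(SECTOR_BYTES - 1u);
}

/* ASSUMPTION: the line fetched alongside a miss is the other line of the
 * same aligned 128-byte pair, so flipping address bit 6 selects it. */
static uintptr_t adjacent_line(uintptr_t addr) {
    return line_base(addr) ^ (uintptr_t)CACHE_LINE_BYTES;
}
```

Under that assumption, a miss at 0x1234 (line 0x1200) would bring in line 0x1240 as well, and a miss at 0x1250 (line 0x1240) would bring in line 0x1200, so both cases end up covering the same 128 bytes starting at 0x1200.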
Hardware prefetch mechanisms are enhanced over those of the Pentium M
processor by:
• Data stores that are not in the second-level cache generate read for
ownership requests. These requests are treated as loads and can
trigger a prefetch stream.
• Software prefetch instructions are treated as loads; they can also
trigger a prefetch stream.
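
For anyone who wants to try the software-prefetch point, a rough C sketch follows. It uses the GCC/Clang builtin `__builtin_prefetch`, which on x86 normally compiles to one of the PREFETCHh instructions. The function name `sum_with_prefetch` and the `PREFETCH_AHEAD` distance are made up for illustration; the right distance has to be found by measurement, and on a simple sequential scan like this one the hardware prefetcher may already do the job.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical look-ahead distance in elements; tune by measurement. */
#define PREFETCH_AHEAD 64

/* Sums an int array while hinting at data a fixed distance ahead.
 * The prefetch is only a hint: the result is the same with or without it. */
long sum_with_prefetch(const int *data, size_t n) {
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Arguments: address, 0 = prefetch for read, 3 = keep in all cache levels. */
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

According to the excerpt, such prefetches are treated as loads on Core Solo / Core Duo, so besides fetching the named line they can also start a hardware prefetch stream.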
......