热点科技 (ITheat)

Title: P大, I've got hold of Intel's IA architecture manuals, but still have some questions

Author: zhoujianpo    Time: 2006-6-13 19:54
Title: P大, I've got hold of Intel's IA architecture manuals, but still have some questions
P大, I've got hold of Intel's IA architecture manuals, but still have some questions.
Does anyone know when Volume 4B will be out? Will 4B cover Yonah and Conroe?
Author: hctgy    Time: 2006-6-13 20:11
What 4B? And what is 4A?
The Conroe optimization manual is still Confidential.
Author: ljlwxb    Time: 2006-6-13 21:00
Hi 牛奶. What I want to say is that Intel's manuals contain an enormous amount of material; reading it all would wear anyone out. The optimization manual, though, is very useful for people writing assembly, and the Core optimization manual is well worth reading and analyzing.

Based on my past reading, these manuals generally don't cover the "especially" interesting low-level details, and aren't enough to understand the finer points of the CPU's internal structure. AMD's manuals have much the same flavor.
Author: shoujiyingjian    Time: 2006-6-13 21:09
Also, Yonah (Core Duo) already seems to appear in the manuals; I can't remember whether there's a section dedicated to it. You can download the latest PDF and check.
Author: tq064yw    Time: 2006-6-13 21:12
_           _
Author: haohao766124    Time: 2006-6-13 21:21
Originally posted by hopetoknow2 at 2006-6-13 21:12
_           _
Latest optimization manuals: also, Yonah already seems to appear in the manuals; I can't remember whether there's a section dedicated to it. You can download the latest PDF and check. (Core Duo)
http://download.intel.com/design/Pentium4/manuals/24896613.pdf

http://download.intel.com/design/Pentium4/manuals/25366519.pdf

Either way, it won't be too long or too detailed.

Intel Core Solo and Intel Core Duo processors incorporate a
microarchitecture that is similar to the Pentium M processor
microarchitecture, but provide additional enhancements for
performance and power efficiency. Enhancements include:
This second-level cache is shared between the two cores in an Intel Core
Duo processor to minimize bus traffic between two cores accessing
a single copy of cached data. It allows an Intel Core Solo processor
(or an Intel Core Duo processor when one of its two cores is idle)
to access the full capacity of the cache.
• Streaming SIMD Extensions 3
These extensions are supported in Intel Core Solo and Intel Core
Duo processors.
Improvements in the decoder and micro-op fusion allow the front end to
see most instructions as single-μop instructions. This increases the
throughput of the three decoders in the front end.
Throughput of SIMD instructions is improved and the out-of-order
engine is more robust in handling sequences of frequently-used
instructions. Enhanced internal buffering and prefetch mechanisms
also improve data bandwidth for execution.

Execution of SIMD instructions on Intel Core Solo and Intel Core Duo
processors is improved over Pentium M processors by the following
enhancements:
• Micro-op fusion
Scalar SIMD operations on register and memory have single
micro-op flows comparable to x87 flows. Many packed instructions
are fused to reduce their micro-op flow from four to two micro-ops.
• Eliminating decoder restrictions
Intel Core Solo and Intel Core Duo processors improve decoder
throughput with micro-fusion and macro-fusion, so that many more
SSE and SSE2 instructions can be decoded without restriction. On
Pentium M processors, many single micro-op SSE and SSE2
instructions must be decoded by the main decoder.
• Improved packed SIMD instruction decoding
On Intel Core Solo and Intel Core Duo processors, decoding of most
packed SSE instructions is done by all three decoders. As a result,
the front end can process up to three packed SSE instructions every
cycle. There are some exceptions to the above: some
shuffle/unpack/shift operations are not fused and require the main
decoder.
Data Prefetching
Intel Core Solo and Intel Core Duo processors provide hardware
mechanisms to prefetch data from memory to the second-level cache.
There are two techniques: one mechanism activates after the data access
pattern experiences two cache-reference misses within a trigger-distance
threshold (see Table 1-2). This mechanism is similar to that of the
Pentium M processor, but can track 16 forward data streams and 4
backward streams. The second mechanism fetches an adjacent cache
line of data after experiencing a cache miss. This effectively simulates
the prefetching capabilities of 128-byte sectors (similar to the sectoring
of two adjacent 64-byte cache lines available in Pentium 4 processors).
Hardware prefetch requests are queued up in the bus system at lower
priority than normal cache-miss requests. If the bus queue is in high
demand, hardware prefetch requests may be ignored or cancelled to
service the bus traffic required by demand cache misses and other bus
transactions.
Hardware prefetch mechanisms are enhanced over those of the Pentium M
processor by:
• Data stores that are not in the second-level cache generate
read-for-ownership requests. These requests are treated as loads and
can trigger a prefetch stream.
• Software prefetch instructions are treated as loads; they can also
trigger a prefetch stream.
......
Author: shaneshane    Time: 2006-6-13 21:27
That's all old material anyway.
Author: fikiaqn    Time: 2006-6-13 21:33
I have Intel's IA-32 architecture manuals, but those are written for programmers. I've used them to look up how some instructions work, but I didn't see anything in there that "CPU fans" would care about.
Author: zcz1234    Time: 2006-6-13 21:42
Originally posted by Edison at 2006-6-13 21:27
That's all old material anyway.
Yeah, not terribly meaningful. Still, if you read and analyze it carefully, some of Yonah's characteristics are quite interesting.

For example, Yonah's L1-to-L1 communication, from which you can extrapolate to Core.
The L2 latency is 14 cycles.

When a Yonah dual-core loads data, the search order is: write buffer, then its own L1; if not found, then the L2 and the other core's L1; and finally memory.
The typical latency of loading data from the other core's L1 is 14 cycles + 5.5 * bus cycles. The manual says the bus this L1 connects over runs at core frequency, so I work that out to about 20 cycles.
Author: wkp5883135    Time: 2006-6-13 21:45
Is the IA-32 manual short on material? That depends on how you read it.

If you want a quick grasp of the microarchitecture, though, you should read the optimization manual. It's just that after the Pentium M, the material Intel provides is very vague; then again, some details, such as the ROB and RS, don't really matter to programmers anyway.
Author: zf1666    Time: 2006-6-13 21:45
Originally posted by hopetoknow2 at 2006-6-13 21:42

Yeah, not terribly meaningful. Still, if you read and analyze it carefully, some of Yonah's characteristics are quite interesting.

For example, Yonah's L1-to-L1 communication, from which you can extrapolate to Core.
The L2 latency is 14 cycles.

When a Yonah dual-core loads data, the search order is: write buffer, then its own L1; if ...
On this point, Core and Yonah are different.

5.5 bus cycles is a very long time.
Author: wuhuataocn    Time: 2006-6-13 21:51
Originally posted by Prescott at 2006-6-13 21:45


On this point, Core and Yonah are different.

5.5 bus cycles is a very long time.
I got it wrong; that is the formula for Yonah's memory access latency. Only about 86 cycles. That's really low!
Author: sdlfll2    Time: 2006-6-13 21:53
So 4B is just the volume numbering? Right now there are only Volumes 1, 2, 3A, 3B, and 4A.
Author: mujingling3    Time: 2006-6-13 21:53
In my test on an AOpen 975X, Yonah @ 2.60GHz has a cache exchange time of 13x ns per ping-pong.
In my test of a Conroe at 2.67GHz, the cache exchange time is 77 ns per ping-pong.
Author: liang19821127    Time: 2006-6-13 21:57
Originally posted by Edison at 2006-6-13 21:53
In my test on an AOpen 975X, Yonah @ 2.60GHz has a cache exchange time of 13x ns.
In my test of a Conroe at 2.67GHz, the cache exchange time is 77 ns.
77 ns is far above the real value. How did you test it?
Author: 80881    Time: 2006-6-13 22:01
It may be a problem with the test program; after modification it can shrink to 1/4. But the Conroe has been returned by now.
Author: lishaowei    Time: 2006-6-13 22:47
Originally posted by Edison at 2006-6-13 22:01
It may be a problem with the test program; after modification it can shrink to 1/4. But the Conroe has been returned by now.
Can Yonah also shrink to 1/4?
Author: honets    Time: 2006-6-13 23:07
Originally posted by Edison at 2006-6-13 22:01
It may be a problem with the test program; after modification it can shrink to 1/4. But the Conroe has been returned by now.
Cho, I think the Yonah architecture diagram in your test is drawn wrong. DP FMUL and DP FADD shouldn't be drawn in the same unit; they shouldn't both be on Port 0.


On Intel Core Solo and Intel Core Duo processors, the combination of
improved decoding and micro-op fusion allows instructions which were
formerly two, three, and four micro-ops to go through all decoders. As a
result, scalar SSE/SSE2 code can match the performance of x87 code
executing through two floating-point units. On Pentium M processors,
scalar SSE/SSE2 code can experience approximately 30% performance
degradation relative to x87 code executing through two floating-point
units.
In code sequences that have conversions from floating-point to integer,
single-precision divide instructions, or any precision change, x87 code
generation from a compiler typically writes data to memory in
single precision and reads it again in order to reduce precision. Using
SSE/SSE2 scalar code instead of x87 code can yield a large
performance benefit on the Intel NetBurst microarchitecture and a
modest benefit on Intel Core Solo and Intel Core Duo processors.
Author: av6421165    Time: 2006-6-13 23:19
Also, I believe the scalar SSE2 multiply instruction MULSD and the x87 DP fmul instruction share the same DP floating-point multiplier, while the scalar SSE2 add instruction ADDSD and the x87 DP fadd instruction share the same DP floating-point adder.

Of course, this also means the packed SSE2 multiply MULPD has to use that one DP floating-point multiplier twice, and the packed SSE2 add ADDPD has to use the DP floating-point adder twice.
Author: cici0325    Time: 2006-6-14 00:40
FMUL/FADD here refer to x87, which is not DP but Long Double.

The diagram already puts SIMD FP ADD and SIMD FP MUL on different ports. Because it was adapted directly from the PIII architecture diagram, the xxxPD instructions weren't written in; they sit in the same positions as the corresponding xxxPS units.
Author: xiayangfkd    Time: 2006-6-14 08:24
Originally posted by Edison at 2006-6-14 00:40
FMUL/FADD here refer to x87, which is not DP but Long Double.

The diagram already puts SIMD FP ADD and SIMD FP MUL on different ports. Because it was adapted directly from the PIII architecture diagram, the xxxPD instructions weren't written in; they sit in the same positions as ...
SIMD-FP ADDPS and SIMD-FP MULPS are SSE instructions.




Welcome to 热点科技 (http://www.itheat.com/activity/) Powered by Discuz! X3.2