Assessing IBM's POWER8, Part 1: A Low Level Look at Little Endian

by Johan De Gelas on July 21, 2016 8:45 AM EST

Comparing with Intel's Best

Comparing CPUs in tables is always a risky game: those simple numbers hide a lot of nuances and trade-offs. But if we approach them with caution, we can still extract quite a bit of information from them.

Feature	IBM POWER8	Intel Broadwell (Xeon E5 v4)	Intel Skylake
L1-I cache (size, associativity)	32 KB, 8-way	32 KB, 8-way	32 KB, 8-way
L1-D cache (size, associativity)	64 KB, 8-way	32 KB, 8-way	32 KB, 8-way
Outstanding L1-cache misses	16	10	10
Fetch Width	8 instructions	16 bytes (+/- 4-5 x86)	16 bytes (+/- 4-5 x86)
Decode Width	8	4 µops	5-6* µops (*µop cache hit)
Issue Queue	64+15 branch+8 CR = 87	60 unified	97 unified
Issue Width/Cycle	10	8	8
Instructions in Flight	224 (GCT, SMT-8 mode)	192 (ROB)	224 (ROB)
Architectural regs	32 (ST), 2x32 (SMT-2)	16	16
Rename regs	92 (ST), 2x92 (SMT-2)	168	180
Loads per Cycle	4	2	2
Load Bandwidth (per unit)	16B/cycle	32B/cycle	32B/cycle
Load Queue Size	44 entries	72 entries	72 entries
Stores per Cycle	2	1	1
Store Bandwidth	16B/cycle	32B/cycle	32B/cycle
Store Queue Size	40 entries	42 entries	56 entries
Int. Pipeline Length	18 stages	19 stages (14 from µop cache)	19 stages (14 from µop cache)
TLB	2048 entries, 4-way	L1: 128I + 64D; L2: 1024 entries, 8-way	L1: 128I + 64D; L2: 1536 entries, 8-way
Page Support	4 KB, 64 KB, 16 MB, 16 GB	4 KB, 2/4 MB, 1 GB	4 KB, 2/4 MB, 1 GB

Both CPUs are very wide, brawny out-of-order (OoO) designs, especially compared to the ARM server SoCs.
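To make the load and store rows of the table easier to compare, here is a minimal sketch that multiplies operations per cycle by bytes per operation. It assumes every load or store unit can sustain its full width each cycle, which is a theoretical peak rather than a measured figure; the class and variable names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemPorts:
    """Load/store figures copied from the comparison table."""
    loads_per_cycle: int
    load_bytes: int       # bytes per load unit per cycle
    stores_per_cycle: int
    store_bytes: int      # bytes per store unit per cycle

    def peak_load_bytes(self) -> int:
        # Theoretical peak: assumes every load unit is busy every cycle.
        return self.loads_per_cycle * self.load_bytes

    def peak_store_bytes(self) -> int:
        # Theoretical peak: assumes every store unit is busy every cycle.
        return self.stores_per_cycle * self.store_bytes


CORES = {
    "POWER8": MemPorts(4, 16, 2, 16),
    "Broadwell": MemPorts(2, 32, 1, 32),
    "Skylake": MemPorts(2, 32, 1, 32),
}

for name, ports in CORES.items():
    print(f"{name}: {ports.peak_load_bytes()} B/cycle load, "
          f"{ports.peak_store_bytes()} B/cycle store")
```

Under these assumptions the three cores end up with the same peak per-cycle figures: IBM reaches them with more, narrower units, Intel with fewer, wider ones.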

Despite the lower decode and issue width, Intel has gone a little further than IBM in optimizing single-threaded performance. Notice that the POWER8 has neither a loop stream detector nor a µop cache to reduce the branch misprediction penalty. Furthermore, the load buffers of the Intel microarchitecture are deeper, and the total number of instructions in flight for one thread is higher. The TLB of the IBM POWER8 has more entries, while Intel favors speedy address translation by pairing a small level-one TLB with an L2 TLB. Such a small L1 TLB is less effective if many threads are working on huge amounts of data, but it favors a single thread that needs fast virtual-to-physical address translation.
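The TLB trade-off can be put in numbers with the usual "TLB reach" estimate: entries multiplied by page size. The sketch below uses the entry counts and page sizes from the table. Which page size is actually in use depends on the operating system configuration, so the 64 KB (POWER8) and 4 KB (x86) choices here are assumptions for illustration, and the function name is made up for this example.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of address space covered when every entry maps one page."""
    return entries * page_bytes


# Entry counts from the table; page sizes are assumed base page sizes.
reach = {
    "POWER8 TLB, 64 KB pages": tlb_reach(2048, 64 * KIB),
    "POWER8 TLB, 4 KB pages": tlb_reach(2048, 4 * KIB),
    "Intel L1 data TLB, 4 KB pages": tlb_reach(64, 4 * KIB),
    "Broadwell L2 TLB, 4 KB pages": tlb_reach(1024, 4 * KIB),
    "Skylake L2 TLB, 4 KB pages": tlb_reach(1536, 4 * KIB),
}

for label, size in reach.items():
    print(f"{label}: {size / MIB:g} MiB")
```

The estimate ignores associativity conflicts and large pages on the Intel side, so it only illustrates the order of magnitude: a big single-level TLB with 64 KB pages covers far more memory than a small L1 TLB, at the cost of a slower lookup.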

On the flip side, IBM has done its homework to make sure that 2-4 threads can really boost the performance of the chip, while Intel's choices may still lead to relatively small SMT-related performance gains in quite a few applications. For example, on the Intel cores the instruction TLB, µop cache (Decode Stream Buffer) and instruction issue queues are divided in two when two threads are active. This reduces the hit rate in the µop cache, and the 16-byte fetch looks a little on the small side. Let us see what IBM did to make sure a second thread can result in a more significant performance boost.
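Taking the statement about halved resources at face value, the per-thread share is simple arithmetic on the table's entry counts. This is a rough sketch, not a model of the real hardware: the even static split and the rounding down of odd counts are assumptions, and the labels and function names are illustrative.

```python
# Intel structures the text describes as split in two under 2-way SMT.
# Entry counts come from the comparison table.
INTEL_SHARED = {
    "Broadwell issue queue": 60,
    "Skylake issue queue": 97,
    "L1 instruction TLB": 128,
}


def per_thread_share(total_entries: int, active_threads: int) -> int:
    """Entries per thread under an even static split (floor division assumed)."""
    if active_threads < 1:
        raise ValueError("active_threads must be at least 1")
    return total_entries // active_threads


def smt2_shares(resources: dict) -> dict:
    """Per-thread entries for each resource when two threads are active."""
    return {name: per_thread_share(n, 2) for name, n in resources.items()}


for name, share in smt2_shares(INTEL_SHARED).items():
    print(f"{name}: {INTEL_SHARED[name]} total, {share} per thread with SMT-2")
```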

124 Comments

DomOfSF - Thursday, July 21, 2016
Johan de Gelas: blowing minds and educating "the rest of us" since...I dunno, a really long time ago (especially in internet years). Great job on the data, but the real good stuff is in your thoughts and analysis. Thank you!
close - Saturday, July 23, 2016
Over a decade...
JohanAnandtech - Thursday, July 28, 2016
13 years in the server business, 18 years now of reviewing hardware :-). Thx !!
jamyryals - Thursday, July 21, 2016
It seems to me, Intel's focus on bringing their CPU architecture design all the way down to 5W is the reason IBM is able to stand out against them. Intel is focused on creating a scalable architecture while IBM can throw the whole kitchen sink at the server market.

Fascinating article, I really enjoyed it.
smilingcrow - Thursday, July 21, 2016
Intel has plenty of unique features in their server platforms which aren't in the consumer platforms so I don't think that is the issue.
jospoortvliet - Tuesday, July 26, 2016
The basic design of the core still is the same so there is probably at least some truth in the statement of Jamy.
Kevin G - Wednesday, July 27, 2016
Up until this point. Consumer SkyLake and server SkyLake are going to be two different designs. They're certainly related but server SkyLake will have 512 KB of L2 cache per core and support AVX-512 instructions.

Server SkyLake is also going to support 3D Xpoint DIMMs, though that difference is more with the platform/chipset than the actual CPU core.
floobit - Thursday, July 21, 2016
Very interesting. It seems odd to me that they chose to configure it in a 2U - except for big data clusters, most of the market space I see this playing in is dominated by FC to a SAN. Is this a play in the big data cluster space, or the more traditional AIX/DB2/big iron that IBM has owned for so long?
Some questions I'd have:
What virtualization is possible with this architecture? Presumably just the standard PowerVM? How well does that work?
What is the impact of IO latency? Could you throw a P3700 or two in here?
JohanAnandtech - Thursday, July 21, 2016
2U: Besides big data storage needs, I suspect 2U is necessary for adequate cooling for the POWER8 chip.

Virtualization: Linux KVM works well as far as I know.

We actually tried out a P3700 in there (see: http://www.anandtech.com/show/9567/the-power-8-rev... ) and it worked very well. I asked IBM what a customer should expect when using third-party storage (probably no support, but how about warranty?) but no answer yet.
mystic-pokemon - Friday, July 22, 2016
Hi Johan
2U is not necessary for cooling a POWER8 chip. We do that better with our Barreleye (1.25 OU design). Storage-wise, Barreleye even has a 15-disk storage bay, as can be seen in the link below.

http://www.v3.co.uk/v3-uk/news/2453992/google-and-...

Let me know if you ever wanna benchmark a Barreleye. What specific POWER8 proc are you benchmarking with? (Turismo?) I believe it does slightly better than S812LC on many benchmarks, depending on the variant of POWER8 proc the S812LC runs.
