From the Xbox 360 slides (the presentation hasn't occurred yet; I figured I'd get these up so I don't have to write during the session)
Hardware Specs
Triple-core 3.2 GHz custom CPU
- shared 1MB L2 Cache
- customized vector floating point unit per core
- 5.4 Gbps FSB: 10.8 GB/s read and 10.8 GB/s write
** GPU can read from L2 (!!! I didn't know this !!!)
500 MHz custom GPU
- 48 parallel unified shaders
- 10 MB embedded DRAM for frame buffer: 256 GB/s
512 MB unified memory (700 MHz GDDR3: 22.4 GB/s)
12x dual-layer DVD
20 GB Hard drive
High Def video out.
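A quick sanity check on the bandwidth figures above, as a rough sketch. It works backwards from the quoted numbers to the bus widths they imply. The function names are mine, and treating the 5.4 Gbps figure as a per-lane signalling rate is an assumption, not something the slides state.

```python
# Figures quoted in the notes, kept as integers to avoid rounding noise.
MEM_BANDWIDTH_BYTES = 22_400_000_000   # 22.4 GB/s unified memory
MEM_CLOCK_HZ = 700_000_000             # 700 MHz GDDR3
TRANSFERS_PER_CLOCK = 2                # GDDR3 is double data rate

FSB_BANDWIDTH_BYTES = 10_800_000_000   # 10.8 GB/s in each direction
FSB_LANE_BITS_PER_SEC = 5_400_000_000  # ASSUMPTION: 5.4 Gbit/s is a per-lane rate


def implied_memory_bus_bits() -> int:
    """Bus width (in bits) needed to reach the quoted memory bandwidth."""
    bytes_per_transfer = MEM_BANDWIDTH_BYTES // (MEM_CLOCK_HZ * TRANSFERS_PER_CLOCK)
    return bytes_per_transfer * 8


def implied_fsb_lanes() -> int:
    """Number of 5.4 Gbit/s lanes needed per direction for 10.8 GB/s."""
    return FSB_BANDWIDTH_BYTES * 8 // FSB_LANE_BITS_PER_SEC


if __name__ == "__main__":
    print("implied memory bus width (bits):", implied_memory_bus_bits())
    print("implied FSB lanes per direction:", implied_fsb_lanes())
```

Under those assumptions the quoted numbers are consistent with a 128-bit memory interface and 16 lanes in each FSB direction.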
System Block Diagram (I'm just going to list the stuff hanging off the IO chip)
DVD (SATA)
HDD (SATA)
Front Controllers (2 USB)
Wireless Controllers
MU ports (2 USB)
Rear Panel USB
Ethernet
IR
Audio Out
FLASH
System Control
Video Out (hanging off of a separate Analog chip)
CPU: PPC Core Specs
* Three 3.2 GHz PowerPC cores
* Shared 1 MB L2 cache, 8-way associative
* Per-Core features
- 2 issue per cycle, in-order, decoupled vector/scalar issue queue
- 2 symmetric fine grain hardware threads
- L1 Caches: 32K 2-way I$ / 32K 4-way D$
- Execution Pipelines
-- Branch Unit, Integer Unit, Load/Store Unit
-- VMX128 units: Floating Point Unit, Permute Unit, Simple Unit
-- Scalar FPU
* VMX128 enhanced for game and graphics workloads
-- all execution units 4-way SIMD
-- 128 128-bit vector registers per thread
-- custom dot-product instruction
-- native D3D compressed data formats
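To put the core and VMX128 numbers above in perspective, here is a small sketch of the arithmetic, plus a plain-Python stand-in for what a 4-lane dot product computes. The 32-bit lane width is my assumption (it is what "4-way SIMD" over 128-bit registers implies), and `dot4` is a hypothetical helper, not the actual instruction.

```python
CORES = 3
THREADS_PER_CORE = 2
VECTOR_REGISTERS_PER_THREAD = 128
REGISTER_BITS = 128
LANE_BITS = 32  # ASSUMPTION: 4-way SIMD means four 32-bit lanes per register


def hardware_threads() -> int:
    """Total hardware threads across the chip."""
    return CORES * THREADS_PER_CORE


def vector_register_file_bytes_per_thread() -> int:
    """Size of one thread's vector register file in bytes."""
    return VECTOR_REGISTERS_PER_THREAD * REGISTER_BITS // 8


def lanes_per_register() -> int:
    """Number of SIMD lanes in one vector register."""
    return REGISTER_BITS // LANE_BITS


def dot4(a, b) -> float:
    """Hypothetical stand-in: dot product of two 4-lane vectors."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("dot4 expects two 4-element vectors")
    return float(sum(x * y for x, y in zip(a, b)))


if __name__ == "__main__":
    print("hardware threads:", hardware_threads())
    print("vector register file per thread (bytes):",
          vector_register_file_bytes_per_thread())
    print("dot4:", dot4((1, 2, 3, 4), (5, 6, 7, 8)))
```

So each thread gets 2 KB of vector registers, and a single dot-product instruction would replace the multiply-then-sum sequence that `dot4` spells out.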
CPU Data Streaming Specs
* High bandwidth data streaming support with minimal cache thrashing
- 128B cache line size (all cache)
- Flexible set locking in L2
- Write streaming:
* L1s are write through, writes do not allocate in L1
* 4 uncacheable write gathering buffers per core
* 8 cacheable, non-sequential write gathering buffers per core
- Read Streaming:
* xDCBT data prefetch around L2, directly into L1
* 8 outstanding load/prefetches per core
- Tight GPU data streaming integration (XPS)
* XPS -- "Xbox Procedural Synthesis"
* GPU 128B read from L2
* GPU low latency cacheable writebacks to CPU
* GPU shares D3D compressed data formats with CPU => at least 2x effective bus bandwidth for typical graphics data.
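The cache sizes, associativities and the 128B line size listed so far pin down the cache geometry. A minimal sketch of that arithmetic follows; `num_sets` is just a helper name I made up.

```python
LINE_BYTES = 128  # cache line size, same for all caches per the notes


def num_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    return size_bytes // (ways * line_bytes)


L1_I_SETS = num_sets(32 * 1024, ways=2)    # 32K 2-way instruction cache
L1_D_SETS = num_sets(32 * 1024, ways=4)    # 32K 4-way data cache
L2_SETS = num_sets(1024 * 1024, ways=8)    # shared 1 MB 8-way L2
L2_LINES = (1024 * 1024) // LINE_BYTES     # total lines in L2

if __name__ == "__main__":
    print("L1 I$ sets:", L1_I_SETS)
    print("L1 D$ sets:", L1_D_SETS)
    print("L2 sets:", L2_SETS, "lines:", L2_LINES)
```

With lines this large the L1 data cache only holds 256 lines in total, which helps explain why the design leans on write gathering and prefetching around L2 to avoid thrashing.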
GPU Specs
* 500 MHz graphics processor
- 48 parallel shader cores (ALUs); dynamically scheduled, 32-bit IEEE floating point
- 24 billion shader instructions per second
* (superscalar design; scalar and texture ops per instruction)
- Pixel fillrate: 4 billion pixels/sec (8 per cycle); 2x for depth / stencil only
* AA: 16 billion samples/sec; 2x for depth / stencil only
- Geometry rate: 500 million triangles/sec
- Texture rate: 8 billion bilinear samples / sec
* 10 MB EDRAM -> 256 GB/s fill
* Direct3D 9.0 compatible
- High Level Shader Language (HLSL) 3.0+ support
* Custom features
- Memory export: particle physics, subdivision surfaces
- Tiling acceleration: full resolution Hi-Z, Predicated Primitives
- XPS:
* CPU cores can be slaved to GPU processing
* GPU reads geometry data directly from L2
- Hardware scaling for display resolution matching
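Most of the per-second GPU rates above are just per-clock rates multiplied by the 500 MHz clock. A short sketch that divides them back out (the dictionary keys are my own labels):

```python
GPU_CLOCK_HZ = 500_000_000  # 500 MHz

# Per-second rates as quoted in the notes.
RATES_PER_SEC = {
    "shader_instructions": 24_000_000_000,
    "pixels": 4_000_000_000,
    "aa_samples": 16_000_000_000,
    "triangles": 500_000_000,
    "bilinear_samples": 8_000_000_000,
}


def per_clock(rate_per_sec: int, clock_hz: int = GPU_CLOCK_HZ) -> int:
    """Convert a per-second throughput into a per-clock throughput."""
    return rate_per_sec // clock_hz


PER_CLOCK = {name: per_clock(rate) for name, rate in RATES_PER_SEC.items()}

if __name__ == "__main__":
    for name, value in PER_CLOCK.items():
        print(f"{name}: {value} per clock")
```

The results line up with the slide: 8 pixels per cycle as stated, 48 shader instructions per clock (one per ALU), and 32 AA samples per clock, i.e. 4 samples for each of the 8 pixels.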
Architectural Choices
* FSAA, alpha and z place heavy load on memory BW
* Post-process effects require large depth complexity
* Enable flexible UMA solution
* Main Memory FB/ZB => unpredictable performance
* Solution: take FB/ZB fill-rate out of the equation
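A back-of-the-envelope sketch of why frame buffer and Z traffic is moved off main memory. The byte sizes and the read-modify-write pattern are my assumptions (32-bit color, 32-bit depth/stencil, every sample both read and written for blending and Z test); only the sample rate and the main memory bandwidth come from the slides.

```python
AA_SAMPLES_PER_SEC = 16_000_000_000    # AA sample rate from the GPU specs
MAIN_MEMORY_BYTES_PER_SEC = 22_400_000_000  # 22.4 GB/s unified memory

COLOR_BYTES = 4        # ASSUMPTION: 32-bit color per sample
DEPTH_BYTES = 4        # ASSUMPTION: 32-bit depth/stencil per sample
ACCESSES_PER_VALUE = 2  # ASSUMPTION: worst case, each value is read and written


def worst_case_framebuffer_bandwidth() -> int:
    """Bytes/sec of FB + ZB traffic if every sample blends and Z-tests."""
    bytes_per_sample = (COLOR_BYTES + DEPTH_BYTES) * ACCESSES_PER_VALUE
    return AA_SAMPLES_PER_SEC * bytes_per_sample


def ratio_to_main_memory() -> float:
    """How many times the main memory bandwidth that traffic would need."""
    return worst_case_framebuffer_bandwidth() / MAIN_MEMORY_BYTES_PER_SEC


if __name__ == "__main__":
    print("worst-case FB/ZB traffic (GB/s):",
          worst_case_framebuffer_bandwidth() / 1e9)
    print("multiple of main memory bandwidth:", round(ratio_to_main_memory(), 1))
```

Under these assumptions the worst case works out to 256 GB/s, the same figure quoted for the EDRAM and more than ten times what main memory can deliver.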
Software
* SMP/SMT
- Mainstream techniques
- Everything is simplified by being symmetric
* UMA
- No partitioning headaches
* OS
- All 3 cores available for game developers
* Standard APIs
- Win32, OpenMP
- Direct3D, HLSL
- Assembly (CPU & Shader) supported - direct hardware access
* Standard tools
- XNA: PIX, XACT
- Visual C++, works with multiple threads
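To illustrate the "everything is simplified by being symmetric" point, here is a rough sketch of splitting a loop evenly across six identical workers, roughly what an OpenMP parallel-for would do. Python's `ThreadPoolExecutor` is only a stand-in here (the slides mention Win32 and OpenMP, not Python), the function names are hypothetical, and CPython threads will not give a real speedup on CPU-bound work. The point is the partitioning, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor

HARDWARE_THREADS = 6  # 3 cores x 2 hardware threads, from the CPU specs


def split_evenly(n_items: int, n_workers: int):
    """Return (start, stop) ranges covering n_items, sizes differing by at most 1."""
    base, extra = divmod(n_items, n_workers)
    ranges = []
    start = 0
    for i in range(n_workers):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def parallel_sum_of_squares(values, n_workers: int = HARDWARE_THREADS) -> int:
    """Sum v*v over values, one contiguous chunk per worker."""
    def work(bounds):
        start, stop = bounds
        return sum(v * v for v in values[start:stop])

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(work, split_evenly(len(values), n_workers)))


if __name__ == "__main__":
    print(split_evenly(20, HARDWARE_THREADS))
    print(parallel_sum_of_squares(list(range(10))))
```

Because all six hardware threads are interchangeable, the split needs no knowledge of which worker lands on which core.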