# Intel® Xeon 5500 Platforms, Integrated Memory Controllers and NUMA David Levinthal, Julia Fedorova, Dmitry Ryabtsev SSG/DPD/PAT \* Intel, the Intel logo, Intel Core and Core Inside are trademarks of Intel Corporation in the U.S. and other countries. # **Agenda** **NUMA and Enabling: Overview** **Topology Overview** **BIOS Options** **OS dependent NUMA concerns** Identifying memory locality (and lack thereof) on Intel® Xeon 5500 processors **Summary** Intel and core are a trademark or registered trademark of Intel Corporation or its subsidiaries in the United 2 4/1/2010 # NUMA Modes on DP Systems Controlled in BIOS ### **Non Numa** - Even/Odd lines assigned to sockets 0/1 - Line interleaving ## **NUMA** mode - First Half of memory space on socket 0 - Second half on socket 1 - Default on Intel® Xeon™ 5500 Processors 4/1/201 # Non Uniform Memory Access and Parallel Execution Process parallel is intrinsically NUMA friendly - Affinity pinning maximizes local memory access - MP - Parallel submission to batch queues - Standard for HPC Shared memory threading is more problematic - Explicit threading, TBB, openMP\* - NUMA friendly data decomposition (page based) has not been required - OS scheduled thread migration can aggravate situation \* Other names and brands may be claimed as the property of others. 7 4/1/2010 **Software and Solutions Group**2008 Software Technology Open Forum HPC Applications will see Large Performance Gains due to Bandwidth Improvements A remaining performance bottleneck may be due to non uniform memory access latency Intel® PTU data access profiling feature was designed to address NUMA Intel® Xeon™ 5500 processors events were designed to provide the required data 4/1/2010 # Data Access Events on Intel® Xeon™ 5500 processors Reveal NUMA Access Pattern "miss" events are inclusive Sum over all data sources and their individual latencies Intel® Xeon™ 5500 processor Precise events are exclusive Per data source 4/1/2010 # Controlling NUMA Data Locality on Linux\* and Windows\* ### Linux\* assigns physical pages on "first touch" - ie buffer initialization not malloc - If each thread initializes its data, things are good - Can also use numactl or numalib # Windows assigns physical pages with "allocation" - VirtualAlloc works like malloc on Linux\* - · Physical pages assigned at first use - malloc & VirtualAllocExNuma allocation must be parallelized - Buffers are no longer contiguous linear address ranges - Much MUCH harder Other names and brands may be claimed as the property of others. Software and Solutions Group 2008 Software Technology Open Forum # Data Locality, Threaded Applications and Bandwidth ``` Consider a threaded triad int triad(int len, double *a, double *b, double *c, double *x); int i,bytes = 24; #pragma omp parallel { #pragma omp for private (i) #pragma vector nontemporal for(i=0;i<len;i++)a[i]=b[i]+x*c[i]; } return bytes ``` Parallelizes the work function called 1000 times, len=8192000 ~ 1B cachelines written NT, 2B read 4/1/201 Data Locality, Threaded Applications and Bandwidth Run an OpenMP\* triad under my usual mini\_app driver, the resulting BW is only ~ 5bytes/cycle for 8 threads Running in Non Numa Mode results in ~8.5 Bytes/cycle Why? Default Version Allocates Buffers on Thread 0 Using only one Memory Controller \* Other names and brands may be claimed as the property of others. 4/1/2010 **Software and Solutions Group** 2008 Software Technology Open Forum ### **Performance Events and NUMA Sources** - Offcore\_Response\_0 8 flavors of Request Type X 8 flavors of \$line Source - + all combinations..(~65K possible programmings) - One "gotcha"... NT stores to local Dram appear to go to another core's cache (data source = 2 instead of 0x40) 4/1/2010 ``` Parallel "Allocation" for Linux* Requires Parallel Initialization Parallel allocation buf1 = (char *) malloc(DIM*(sizeof (double))+1024); buf2 = (char *) malloc(DIM*(sizeof (double))+1024); buf3 = (char *) malloc(DIM*(sizeof (double))+1024); a = (double *) buf1; b = (double *) buf2; c = (double *) buf3; #pragma omp parallel #pragma omp for private(num) for(num=0;num<len;num++)</pre> a[num]=10.; b[num]=10.; c[num]=10.; } } * Other names and brands may be claimed as the property of others. (intel Software and Solutions Group 4/1/2010 2008 Software Technology Open Forum ``` | Event | Triad_omp | | |--------------------------------------------------------------|-----------|--| | CPU_CLK_UNHALTED.THREAD | 2.23E+11 | | | CPU_CLK_UNHALTED.THREAD;Socket 0 | 7.51E+10 | | | CPU_CLK_UNHALTED.THREAD;Socket 1 | 1.48E+11 | | | OFFCORE_RESPONSE_0.ANY_REQUEST.ANY_LOCATION | 3.13E+09 | | | OFFCORE_RESPONSE_0.ANY_REQUEST.ANY_LOCATION;Socket 0 | 1.56E+09 | | | OFFCORE_RESPONSE_0.ANY_REQUEST.ANY_LOCATION;Socket 1 | 1.56E+09 | | | OFFCORE_RESPONSE_0.ANY_REQUEST.LOCAL_CACHE_DRAM | 1.56E+09 | | | OFFCORE_RESPONSE_0.ANY_REQUEST.LOCAL_CACHE_DRAM;<br>Socket 0 | 1.55E+09 | | | OFFCORE_RESPONSE_0.ANY_REQUEST.LOCAL_CACHE_DRAM; Socket 1 | 8000000 | | | OFFCORE RESPONSE 0.ANY REQUEST.REMOTE DRAM | 1.55E+09 | | | OFFCORE RESPONSE 0.ANY REQUEST.REMOTE DRAM;Socket 0 | 1.55E+09 | | | OFFCORE_RESPONSE_0.ANY_REQUEST.REMOTE_DRAM;Socket 1 | 100000 | | Note socket 0/1 switch between PTU runs **Software and Solutions Group** 2008 Software Technology Open Forum (intel | Event | Triad_omp | Triad_NUMA | |--------------------------------------------------------------|-----------|------------| | CPU_CLK_UNHALTED.THREAD | 2.23E+11 | 1.17E+11 | | CPU_CLK_UNHALTED.THREAD;Socket 0 | 7.51E+10 | 5.84E+10 | | CPU_CLK_UNHALTED.THREAD;Socket 1 | 1.48E+11 | 5.83E+10 | | OFFCORE_RESPONSE_0.ANY_REQUEST.ANY_LOCATION | 3.13E+09 | 3.11E+09 | | OFFCORE_RESPONSE_0.ANY_REQUEST.ANY_LOCATION;Socket 0 | 1.56E+09 | 1.56E+09 | | OFFCORE_RESPONSE_0.ANY_REQUEST.ANY_LOCATION;Socket 1 | 1.56E+09 | 1.55E+09 | | OFFCORE_RESPONSE_0.ANY_REQUEST.LOCAL_CACHE_DRAM | 1.56E+09 | 3.11E+09 | | OFFCORE_RESPONSE_0.ANY_REQUEST.LOCAL_CACHE_DRAM;<br>Socket 0 | 1.55E+09 | 1.55E+09 | | OFFCORE_RESPONSE_0.ANY_REQUEST.LOCAL_CACHE_DRAM;<br>Socket 1 | 8000000 | 1.55E+09 | | OFFCORE_RESPONSE_0.ANY_REQUEST.REMOTE_DRAM | 1.55E+09 | 400000 | | OFFCORE_RESPONSE_0.ANY_REQUEST.REMOTE_DRAM;Socket 0 | 1.55E+09 | 300000 | | OFFCORE_RESPONSE_0.ANY_REQUEST.REMOTE_DRAM;Socket 1 | 100000 | 10000 | 5.1 B/cyc vs 8.5 B/cyc vs 12.5 B/cyc on a poorly tuned machine 2008 Software Technology Open Forum # **OpenMP and Core Affinity Pinning** Export KMP\_AFFINITY=compact,0,verbose will pin affinity of threads Just not reproducibly (per socket) on Red Hat 5.1 from run to run Causing problems in multi run PTU collections Problem is that an app does not use OMP runtime libs to pin affinity until there is a #pragma parallel {} You must add this around first instruction to pin affinity of Main thread 0 4/1/20 ## Multi-thread Scaling and NUMA When measuring scaling between 4 and 8 threads (assuming no SMT) the affinity of the 4 threads matters 4 threads all on one socket has the same LLC cache size/core as 8 threads ### **BUT** 2 threads/socket has closer to the same memory BW as the 8 thread run Thus 4->8 scaling will always have a non scaling contribution due to one of these 2 effects 21 4/1/201 # **Change Initialization to Follow Work Access Pattern** Thread initialization with same access sequence as work Expect ~33% improvement 1/2 of accesses get lower latency by 2 Simple OMP ran in 14.3 cycles/cell NUMA initialized version ran in 11.2 cycles/cell Every access has serious DTLB issues, which don't change with the improved NUMA layout 4/1/201 # NUMA will add complexity to software performance analysis and optimization We have the infrastructure to manage this Software and Solutions Group 2008 Software Technology Open Forum