

# Emerging Technology Conference



### Programming for Intel® Xeon Phi™

#### Stephen Blair-Chappell, Intel

### Essential Requirements for Programming on the Intel® Xeon Phi<sup>™</sup> Coprocessor.

In this session, the ingredients for successful Xeon Phi<sup>™</sup> programming are discussed. We look at the experience and results of porting an application to run on the Intel® Xeon Phi<sup>™</sup> coprocessor, covering: what went well in the project; how difficult the port was; useful tips and tricks; and a comparison of performance on Xeon Phi<sup>™</sup> and non-Xeon Phi<sup>™</sup> platforms.



#### Intel<sup>®</sup> Parallel Studio XE



- Intel<sup>®</sup> Parallel Advisor: use to model parallelism in your existing applications
- Intel<sup>®</sup> Composer XE: use to generate fast, safe, parallel code (C/C++, Fortran)
- Intel<sup>®</sup> VTune<sup>™</sup> Amplifier XE: use to find hotspots and bottlenecks in your code
- Intel<sup>®</sup> Inspector XE: use to find memory and threading errors

### **Four Components**

### Intel<sup>®</sup> Cluster Studio XE



### Six Components

- Intel<sup>®</sup> Parallel Advisor: use to model parallelism in your existing applications
- Intel<sup>®</sup> Composer XE: use to generate fast, safe, parallel code (C/C++, Fortran)
- Intel<sup>®</sup> VTune<sup>™</sup> Amplifier XE: use to find hotspots and bottlenecks in your code
- Intel<sup>®</sup> Inspector XE: use to find memory and threading errors
- Intel<sup>®</sup> MPI: industry-standard message passing interface library for parallelism across clusters
- Intel<sup>®</sup> Trace Analyzer and Collector (ITAC): use to examine the runtime behaviour of programs running on clusters

### Code must be

- highly parallel
- effectively vectorised

#### Application Performance: Intel® Xeon Phi<sup>™</sup> Coprocessor



http://www.intel.com/performance



Optimization Notice



Copyright © 2014, Intel Corporation. All rights reserved. \*Other names and brands may be claimed as the property of others.




### Vectorisation








#### 4<sup>th</sup> Generation Intel® Core<sup>™</sup> processor family

**Execution Units** 






### **Overview of Writing Vector Code**

#### Array Notation

```
A[:] = B[:] + C[:];
```

#### SIMD Directive

```
#pragma simd
for (int i = 0; i < N; ++i) {
    A[i] = B[i] + C[i];
}
```

#### Elemental Function

```
__declspec(vector)
float ef(float a, float b) {
    return a + b;
}
A[:] = ef(B[:], C[:]);
```

#### Auto-Vectorization

```
for (int i = 0; i < N; ++i) {
    A[i] = B[i] + C[i];
}
```



# Explicit Vector Programming with SIMD Pragma/Directive

Programmer asserts:

- `*p` is loop invariant
- `A[]` does not overlap with `B[]` or `C[]`
- `sum` is not aliased with `B[]` or `C[]`
- `sum` should be treated as a reduction
- The compiler is allowed to reorder operations for better vectorization
- Vector code should be generated even if the efficiency heuristic does not indicate a gain in performance

Explicit vector programming lets you express what you mean!

```
#pragma omp simd reduction(+:sum)
for(i = 0; i < *p; i++) {
    A[i] = B[i] * C[i];
    sum = sum + A[i];
}
```
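The loop above can be wrapped into a complete function. The following is a minimal sketch, not the presenter's code: the function name `multiply_and_sum` is hypothetical, and the `restrict` qualifiers are an added assumption that expresses the "no overlap" assertion in standard C99. A compiler without OpenMP SIMD support simply ignores the pragma and runs the loop as scalar code.

```c
/* Multiplies B and C element-wise into A and returns the sum of the products.
 * `restrict` (an assumption added here) tells the compiler that A does not
 * overlap B or C. */
float multiply_and_sum(float *restrict A, const float *restrict B,
                       const float *restrict C, const int *p)
{
    const int n = *p;   /* read the trip count once, so it is loop invariant */
    float sum = 0.0f;

    /* Ask for vector code and declare sum as a reduction variable. */
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++) {
        A[i] = B[i] * C[i];
        sum += A[i];
    }
    return sum;
}
```

Copying `*p` into a local and using `restrict` means the assertions hold by construction rather than only by the programmer's promise.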

#### How do I know if a loop is vectorised?

-vec-report

```
> icl /Qvec-report MultArray.c
MultArray.c(92): (col. 5) remark: LOOP WAS VECTORIZED.
```

Report levels: `/Qvec-report1` (default), `/Qvec-report2`, `/Qvec-report3`, `/Qvec-report4`, `/Qvec-report5`, `/Qvec-report6`



### Vec-report7

[Screenshot: terminal session running `vecanalysis.py --annotate vec7.txt`. The script writes an annotated `pi_vr.c` and prints statistics for all files: a count and percentage for each report message, including "LOOP WAS VECTORIZED", "REMAINDER LOOP WAS VECTORIZED", "remainder loop was not vectorized", vector and scalar loop costs, and estimated potential speedups.]

[Screenshot: Intel VTune Amplifier XE, Lightweight Hotspots analysis with vectorization info, Bottom-up view grouped by Function / Call Stack. For each loop (for example, the loops in `grid_intersect` in grid.cpp) the grid lists hardware event counts, the source file, the Vector Instruction Set (SSE2 in the examples shown) and the Vector Instruction Class (for example `MOVSD_XMM`, `ADDSD`, `COMISD`). The timeline can be filtered by vector instruction set. A callout on the slide, partly cut off in the source, refers to an `ERIMENTAL=vectinfo` setting being enabled for the lightweight-hotspots collection.]

### **Scalar and Packed Instructions**




#### **Examples of Code Generation**

| Variant | Code |
|---------|------|
| C source | <pre>static double A[1000], B[1000], C[1000];<br/>void add() {<br/>    int i;<br/>    for (i=0; i&lt;1000; i++)<br/>        if (A[i]&gt;0)<br/>            A[i] += B[i];<br/>        else<br/>            A[i] += C[i];<br/>}</pre> |
| SSE2 | .B1.2::<br/>movaps xmm2, A[rdx*8]<br/>xorps xmm0, xmm0<br/>cmpltpd xmm0, xmm2<br/>movaps xmm1, B[rdx*8]<br/>andps xmm1, xmm0<br/>andnps xmm0, C[rdx*8]<br/>orps xmm1, xmm0<br/>addpd xmm2, xmm1<br/>movaps A[rdx*8], xmm2<br/>add rdx, 2<br/>cmp rdx, 1000<br/>jl .B1.2 |
| SSE4.1 | .B1.2::<br/>movaps xmm2, A[rdx*8]<br/>xorps xmm0, xmm0<br/>cmpltpd xmm0, xmm2<br/>movaps xmm1, C[rdx*8]<br/>blendvpd xmm1, B[rdx*8], xmm0<br/>addpd xmm2, xmm1<br/>movaps A[rdx*8], xmm2<br/>add rdx, 2<br/>cmp rdx, 1000<br/>jl .B1.2 |
| AVX | .B1.2::<br/>vmovaps ymm3, A[rdx*8]<br/>vmovaps ymm1, C[rdx*8]<br/>vcmpgtpd ymm2, ymm3, ymm0<br/>vblendvpd ymm4, ymm1, B[rdx*8], ymm2<br/>vaddpd ymm5, ymm3, ymm4<br/>vmovaps A[rdx*8], ymm5<br/>add rdx, 4<br/>cmp rdx, 1000<br/>jl .B1.2 |



#### **Vectorization Report**

#### "Loop was not vectorized" because:

- "Existence of vector dependence"
- "Non-unit stride used"
- "Mixed Data Types"
- "Condition too Complex"
- "Condition may protect exception"
- "Low trip count"
- "Subscript too complex"
- "Unsupported Loop Structure"
- "Contains unvectorizable statement at line XX" (e.g. function calls)
- "Not Inner Loop"
- "vectorization possible but seems inefficient"
- "Operator unsuited for vectorization"
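To make the first message in the list concrete, here is a small sketch contrasting a loop that carries a dependence between iterations with one that does not. Both function names are hypothetical and the code is illustrative only; which report message a given compiler version prints for each loop is not claimed here.

```c
/* Loop-carried dependence: iteration i reads the value written by
 * iteration i-1, so the iterations cannot simply run side by side in
 * SIMD lanes. This is the kind of loop a vectorization report flags. */
void running_sum(float *a, const float *b, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

/* Independent iterations, unit stride, a single data type and no
 * overlap between a and b (stated with restrict): a good candidate
 * for vectorization. */
void scaled_add(float *restrict a, const float *restrict b, float k, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] + k * b[i];
}
```

The two loops look similar in source form; the difference is only whether an iteration needs a result produced by an earlier one.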



### Parallelism




### Language to help parallelism

- Intel<sup>®</sup> Cilk<sup>™</sup> Plus
- OpenMP
- Intel<sup>®</sup> Threading Building Blocks
- Intel<sup>®</sup> MPI
- Fortran Coarrays
- OpenCL
- Native Threads

Parallel code (OpenMP):

```
#pragma omp parallel for
for(i=1;i<=4;i++) {
    printf("Iter: %d", i);
}
```

How many threads?

"An application must scale well past one hundred threads to qualify as highly parallel"



Jim Jeffers and James Reinders, *Intel Xeon Phi Coprocessor High Performance Programming*. ISBN: 978-0124104143



### **Parallel Performance Potential**



If your performance needs are met by an Intel® Xeon® processor, they will be achieved with fewer threads than on a coprocessor.

#### On a coprocessor:


- Need more threads to achieve same performance
- Same thread count can yield less performance

#### Intel Xeon Phi excels on highly parallel applications





### Intel® Xeon Phi™

## What is it?

#### Co-processor

- PCI Express card
- Stripped down Linux operating system (busybox/dash)
- Dense, simplified processor
  - Simplifications for power savings, such as in-order execution
  - Wider vector unit
  - Wider hardware thread count
- Lots of names
  - Many Integrated Core architecture, aka MIC
  - Knights Corner (code name)
  - Intel Xeon Phi Co-processor SE10P (product name)



THE UNIVERSITY OF TEXAS AT AUSTIN

### Intel® Xeon Phi<sup>™</sup> Architecture Overview






### **Core Architecture Overview**



60+ in-order, low power IA cores in a ring interconnect

- Two pipelines
  - Scalar unit based on Pentium® processors
  - Dual issue with scalar instructions
  - Pipelined one-per-clock scalar throughput
- SIMD Vector Processing Engine
- 4 hardware threads per core
  - 4 clock latency, hidden by round-robin scheduling of threads
  - Cannot issue back-to-back instructions in the same thread
- Coherent 512KB L2 cache per core

### Key Differentiators Xeon Phi vs Workstation

- More cores
- Slower clock speed
- Wider SIMD registers
- Higher memory bandwidth
- In-order pipeline

#### A Tale of Two Architectures

|                  | Intel® Xeon® processor | Intel® Xeon Phi™ Coprocessor |
|------------------|------------------------|------------------------------|
| Sockets          | 2                      | 1                            |
| Clock Speed      | 2.6 GHz                | 1.1 GHz                      |
| Execution Style  | Out-of-order           | In-order                     |
| Cores/socket     | 8                      | Up to 61                     |
| HW Threads/Core  | 2                      | 4                            |
| Thread switching | HyperThreading         | Round Robin                  |
| SIMD widths      | 8 SP, 4 DP             | 16 SP, 8 DP                  |
| Peak Gflops      | 692 SP, 346 DP         | 2020 SP, 1010 DP             |
| Memory Bandwidth | 102 GB/s               | 320 GB/s                     |
| L1 DCache/Core   | 32 kB                  | 32 kB                        |
| L2 Cache/Core    | 256 kB                 | 512 kB                       |
| L3 Cache/Socket  | 30 MB                  | none                         |



#### Theoretical Peak Flops Performance

Peak GFlops = Frequency \* Num Sockets \* Num Cores \* Vector Width \* FP Ops

#### Example: two-socket Intel® Xeon® E5-2670 processor

| Freq (GHz) | Sockets | Num Cores | Vector Width | FP Ops | GFlops |
|------------|---------|-----------|--------------|--------|--------|
| 2.6        | 2       | 8         | 8            | 2      | 666    |

#### Single card Xeon Phi Coprocessor (B0)

| Freq (GHz) | Sockets | Num Cores | Vector Width | FP Ops        | GFlops |
|------------|---------|-----------|--------------|---------------|--------|
| 1.091      | 1       | 61        | 16           | 2 (using FMA) | 2,128  |

The coprocessor figure is x3.20 the two-socket Xeon figure.
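The peak-flops formula can be checked with a few lines of C. This is a sketch for illustration: the function name `peak_gflops` is hypothetical, and the Xeon vector width is taken as 8 single-precision lanes, which is the value consistent with the 666 GFlops figure quoted for the two-socket system.

```c
/* Peak GFlops = frequency (GHz) * sockets * cores per socket
 *               * vector width (lanes) * floating-point ops per lane per cycle */
double peak_gflops(double freq_ghz, int sockets, int cores,
                   int vector_width, int fp_ops)
{
    return freq_ghz * sockets * cores * vector_width * fp_ops;
}
```

With these inputs, `peak_gflops(2.6, 2, 8, 8, 2)` gives 665.6 and `peak_gflops(1.091, 1, 61, 16, 2)` gives roughly 2,130, close to the 2,128 shown on the slide, for a ratio of about 3.2.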


#### Synthetic Benchmark Summary (Intel® MKL)



Coprocessor results: Benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native)

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Intel Measured results as of October 26, 2012. Configuration Details: Please reference slide speaker notes. For more information go to http://www.intel.com/performance



#### Intel® Xeon Phi<sup>™</sup> Coprocessor: Increases Application Performance up to 10x

| Segment                        | Customer                                | Application                                        | Performance Increase<sup>1</sup> vs. 2S Xeon\*      |
|--------------------------------|-----------------------------------------|----------------------------------------------------|-----------------------------------------------------|
| Energy                         | Acceleware                              | 8<sup>th</sup> order isotropic variable velocity   | Up to 2.23x                                         |
| Energy                         | Sinopec                                 | Seismic Imaging                                    | Up to 2.53x<sup>2</sup>                             |
| Energy                         | CNPC (China Oil & Gas)                  | GeoEast Pre-Stack Time Migration (Seismic)         | Up to 3.54x<sup>2</sup>                             |
| Financial Services             | Financial Services                      | BlackScholes SP<br>Monte Carlo SP                  | Up to 7.5x<br>Up to 10.75x                          |
| Physics                        | Jefferson Labs                          | Lattice QCD                                        | Up to 2.79x                                         |
| Finite Element                 | Sandia Labs                             | miniFE (Finite Element Solver)                     | Up to 2x<sup>3</sup><br>Up to 1.3x<sup>5</sup>      |
| Solid State Physics            | ZIB (Zuse-Institut Berlin)              | Ising 3D (Solid State Physics)                     | Up to 3.46x                                         |
| Digital Content Creation/Video | Intel Labs                              | Ray Tracing (incoherent rays)                      | Up to 1.88x<sup>4</sup>                             |
| Digital Content Creation/Video | NEC                                     | Video Transcoding                                  | Up to 3.0x<sup>2</sup>                              |
| Astronomy                      | CSIRO/ASKAP (Australia Astronomy)       | tHogbom Clean (Astronomy image smear removal)      | Up to 2.27x                                         |
| Astronomy                      | TUM (Technische Universität München)    | SG++ (Astronomy Adaptive Sparse Grids/Data Mining) | Up to 1.7x                                          |
| Fluid Dynamics                 | AWE (Atomic Weapons Establishment - UK) | Cloverleaf (2D Structured Hydrodynamics)           | 1.77x                                               |

#### Notes:

- 1. 2S Xeon\* vs. 1 Xeon Phi\* (preproduction HW/SW & application running 100% on coprocessor unless otherwise noted)
- 2. 2S Xeon\* vs. 2S Xeon\* + 2 Xeon Phi\* (offload)
- 3. 8 node cluster, each node with 2S Xeon\* (comparison is cluster performance with and without 1 Xeon Phi\* per node) (Hetero)
- 4. Intel Measured Oct, 2012
- 5. 8 node cluster, each node with 2S Xeon\* (comparison is cluster performance with Xeon only vs. Xeon Phi\* only, 1 Xeon Phi\* per node) (Native)

Source: Customer Measured results as of October 22, 2012. Configuration Details: Please reference slide speaker notes.





#### Updated

### **Programming Models and Mindsets**



#### Range of models to meet application needs




#### Examples of Offloading



#### C/C++ Offload Pragma

```
#pragma offload target (mic)
#pragma omp parallel for reduction(+:pi)
for (i=0; i<count; i++) {
    float t = (float)((i+0.5)/count);
    pi += 4.0/(1.0+t*t);
}
pi /= count;
```
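For readers who want to try the offload fragment, the following is a minimal sketch that wraps it in a function. The name `compute_pi` is hypothetical. The `__INTEL_OFFLOAD` guard is an assumption about the build environment: it is the macro the Intel compiler defines when offload compilation is enabled, so other compilers skip the offload pragma and run the loop on the host.

```c
/* Midpoint-rule estimate of pi as the integral of 4/(1+t^2) over [0,1].
 * Single precision is used for pi and t, as in the slide's fragment. */
float compute_pi(int count)
{
    float pi = 0.0f;
    int i;

#ifdef __INTEL_OFFLOAD
    /* Only seen by an Intel compiler with offload enabled:
     * run the following parallel loop on the coprocessor. */
    #pragma offload target(mic)
#endif
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < count; i++) {
        float t = (float)((i + 0.5) / count);
        pi += 4.0 / (1.0 + t * t);
    }
    pi /= count;
    return pi;
}
```

Because the same source builds with or without offload support, the host-only result can be used as a reference when checking the coprocessor run.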

#### MKL Implicit Offload

//MKL implicit offload requires no source code changes, simply link with the offload MKL Library.

#### Fortran Offload Directive

```
!dir$ omp offload target(mic)
!$omp parallel do
    do i=1,10
        A(i) = B(i) * C(i)
    enddo
!$omp end parallel do
```

#### C/C++ Language Extensions


```
class _Shared common {
    int data1;
    char *data2;
    class common *next;
    void process();
};
_Shared class common obj1, obj2;
...
_Cilk_spawn _Offload obj1.process();
_Cilk_spawn obj2.process();
```

## Back of the Envelope Calculation

- Measure how parallel your code is.
- Measure the percentage of your code that is vectorised.
- Detect any memory-intense parts of your code.
- Scale up (or down!) the values to take into account Xeon Phi.

Your code will benefit from running on Xeon Phi if:

- It is highly parallel
- It is effectively vectorised *or* bandwidth constrained



### Three things to consider










## The Serial Factor

Serial Factor = Clock Factor \* ILP Factor \* Issue Factor

Where:

- Clock Factor = 2.6 / 1.09
- ILP Factor = 2/2 = 1 for FMA\*\* type calculations
- ILP Factor = 2/1 for non-FMA type calculations
- Issue Factor = Num cycles to issue instruction on Phi / Num cycles to issue instruction on Xeon = 2/1

Note: in single-threaded code Xeon Phi uses two cycles to issue an instruction (in threaded mode it takes just one cycle).

\*\* FMA: source code is capable of using Fused Multiply-Add when built for Xeon Phi

Result: x4.77 slower for FMA type calculations, x9.54 slower for non-FMA type calculations.
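The serial-factor arithmetic can be reproduced directly. This is a sketch under the slide's stated assumptions (2.6 GHz Xeon, 1.09 GHz Xeon Phi, single-threaded issue rate); the function name `serial_factor` and its parameters are hypothetical.

```c
/* Serial Factor = Clock Factor * ILP Factor * Issue Factor.
 * uses_fma is non-zero when the code can use fused multiply-add on Xeon Phi. */
double serial_factor(double xeon_ghz, double phi_ghz, int uses_fma)
{
    const double clock_factor = xeon_ghz / phi_ghz;
    const double ilp_factor   = uses_fma ? 1.0 : 2.0; /* 2/2 with FMA, 2/1 without */
    const double issue_factor = 2.0; /* 2 cycles per issue on Phi vs 1 on Xeon */
    return clock_factor * ilp_factor * issue_factor;
}
```

`serial_factor(2.6, 1.09, 1)` evaluates to about 4.77 and `serial_factor(2.6, 1.09, 0)` to about 9.54, the two slowdown figures quoted for serial code.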

### Factors (2.6 GHz Clock)

| Host                            | SIMD | Serial | Vector | Parallel | Clock |
|---------------------------------|------|--------|--------|----------|-------|
| Single socket, 2.6 GHz, FMA\*\* | AVX  | 4.772  | 0.5    | 0.1333   | 2.386 |
| Single socket, 2.6 GHz, FMA\*\* | SSE2 | 4.772  | 0.25   | 0.1333   | 2.386 |
| Single socket, 2.6 GHz, no FMA  | AVX  | 9.544  | 1      | 0.1333   | 2.386 |
| Single socket, 2.6 GHz, no FMA  | SSE2 | 9.544  | 0.5    | 0.1333   | 2.386 |
| Twin socket, 2.6 GHz, FMA\*\*   | AVX  | 4.772  | 0.5    | 0.2666   | 2.386 |
| Twin socket, 2.6 GHz, FMA\*\*   | SSE2 | 4.772  | 0.25   | 0.2666   | 2.386 |
| Twin socket, 2.6 GHz, no FMA    | AVX  | 9.544  | 1      | 0.2666   | 2.386 |
| Twin socket, 2.6 GHz, no FMA    | SSE2 | 9.544  | 0.5    | 0.2666   | 2.386 |

- Xeon: 8 cores per socket
- Phi: using 60 of 61 cores

\*\* FMA: source code is capable of using FMA when built for Xeon Phi.

NOTE: Serial Factor already includes the Clock factor.
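The Parallel and Vector columns appear to be simple ratios of host resources to coprocessor resources. The sketch below is one reading of the table, not a definition taken from the slides: the function names are hypothetical, and the lane counts (8 for AVX, 4 for SSE2, 16 for Xeon Phi, all single precision) are assumptions that happen to reproduce the FMA rows.

```c
/* Parallel factor: host cores available relative to coprocessor cores used. */
double parallel_factor(int host_sockets, int host_cores_per_socket,
                       int phi_cores_used)
{
    return (double)(host_sockets * host_cores_per_socket) / phi_cores_used;
}

/* Vector factor: host SIMD lanes relative to coprocessor SIMD lanes. */
double vector_factor(int host_lanes, int phi_lanes)
{
    return (double)host_lanes / phi_lanes;
}
```

Under these assumptions, `parallel_factor(1, 8, 60)` is about 0.133 and `parallel_factor(2, 8, 60)` about 0.267, while `vector_factor(8, 16)` is 0.5 and `vector_factor(4, 16)` is 0.25. The no-FMA rows in the table are twice the FMA values, which this sketch does not model.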

## 'Finger in the air' speedups (from 2 socket 2.6 GHz SSE2)

- An application that is highly parallel and effectively vectorised: speed up by x2.5
- An application that is highly parallel but not vectorised: speed up by x1.3
- An application that is not parallel but is vectorised: slow down by x1.5
- A serial application: slow down by x12.0
- A bandwidth constrained application: speed up by x2.4

What you experience in practice may be different from these figures. These are only 'back of the envelope' figures.



Xeon Phi optimisation work usually also benefits 'regular' Xeon CPU codes.

See configuration information at end of slide deck



## KNL Public Knowledge

- Knights Landing is the code name for the 2<sup>nd</sup> generation product in the Intel<sup>®</sup> Many Integrated Core Architecture
- Knights Landing targets Intel's **14 nanometer** manufacturing process
- Knights Landing will be productized as a **processor** (running the host OS) and a **coprocessor** (a PCIe endpoint device)
- Knights Landing will feature on-package, high-bandwidth memory
- Flexible memory modes for the on-package memory include: flat, cache, and hybrid modes
- Knights Landing will support Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

# Typical Hands-on Xeon Phi training agenda

- Day 1: Getting Ready
  - 10.00 Welcome
  - 10.30 Two Essential Requirements
  - 11.00 Parallelism (L)
  - 12.30 Lunch
  - 1.30 Vectorisation (L)
  - 4.00 Advanced Profiling (Walkthrough)
  - 5.00 End
- Day 2: Xeon Phi Programming
  - 09.00 Start
  - 09.15 Native & Offload Programming for Xeon Phi (L)
  - 11.30 A Case Study
  - 12.00 Lunch
  - 1.00 Vectorisation on Xeon Phi (L)
  - 1.50 Parallelism on Xeon Phi (L)
  - 3.40 Wrap-up
  - 4.00 End



25<sup>th</sup> & 26<sup>th</sup> June 2014 Manchester



#### **Legal Disclaimer & Optimization Notice**

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright ©, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804





### Backup



