### The Path To Exascale – Challenges and Opportunities

EMIT Workshop, Manchester, 30th July 2015

**Gaurav Kaul - Solutions Architect**

John Swinburne – HPC Architect

### Legal Notices and Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT.

Intel products are not intended for use in medical, life-saving, life-sustaining, critical control or safety systems, or in nuclear facility applications.

Intel products may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel may make changes to dates, specifications, product descriptions, and plans referenced in this document at any time, without notice.

This document may contain information on products in the design phase of development. The information herein is subject to change without notice. Do not finalize a design with this information.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

Intel Corporation or its subsidiaries in the United States and other countries may have patents or pending patent applications, trademarks, copyrights, or other intellectual property rights that relate to the presented subject matter.
The furnishing of documents and other materials and information does not provide any license, express or implied, by estoppel or otherwise, to any such patents, trademarks, copyrights, or other intellectual property rights.

Wireless connectivity and some features may require you to purchase additional software, services or external hardware.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations

Intel, the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other names and brands may be claimed as the property of others.

Copyright © 2014 Intel Corporation. All rights reserved.

### Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions.
Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Atom, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

## A bit of History ....

## The Top 500 Waterfall

% of sockets sold

Source: Top500.org and Intel Estimate of Top500 sockets as % of sum of analysts reports of HPC and branded Workstations sockets. Performance waterfall timelines based on TOP500.org statistics (#1-#500) and Intel estimate (#500 to projected Intel Knights Landing). Other brands and names are the property of their respective owners.

Performance Waterfall\*: #1 Top500 System to Single Socket

- **6-8** years: #1 to #500
- ~9 years: #500 to Single Socket

\*plus similar waterfalls for other capabilities in areas like fabrics, storage, software, ...
## 50 years of Moore's Law

## **Moore and Dennard Scaling**

#### **Current Processor Performance Trends**

After ~2004 only the number of transistors continues to increase exponentially

#### We have hit limits in

- Power
- Instruction level parallelism
- Clock speed

Single core scalar performance is now only growing slowly

## **Technology Scaling Outlook**

## The Power & Energy Challenge

| | **TFLOP Machine today** | TFLOP Machine then (With Exa Technology) |
|---|---|---|
| **Compute** | 200W | 5W |
| **Memory** | 150W | 2W |
| Com | 100W | ~5W |
| Disk | 100W | ~3W |
| Other | 4550W | **5W** |
| Total | 5KW | ~20W |

## **Promising Technologies**

## Rethink System Level Architecture

### **Revise DRAM Architecture**

- Need exponentially increasing BW (GB/sec)
- Need exponentially decreasing energy (pJ/bit)

#### Traditional DRAM

- Activates many pages
- Lots of reads and writes (refresh)
- Small amount of read data is used
- Requires small number of pins

#### New DRAM architecture

- Activates few pages
- Read and write (refresh) what is needed
- All read data is used
- Requires large number of IOs (3D)

### 3D-Integration of DRAM and Logic

#### **Logic Buffer Chip**

Technology optimized for:

- High speed signaling
- Energy efficient logic circuits
- Implement intelligence

#### **DRAM Stack**

Technology optimized for:

- Memory density
- Lower cost

3D Integration provides best of both worlds

## DRAM Scaling Using 3D Memory

## Needs a Paradigm Shift

#### Past and present priorities

| Single thread performance | Frequency |
|---------------------------|-----------|
| Programming productivity | Legacy, compatibility; Architecture features for productivity |
| Constraints | (1) Cost<br>(2) Reasonable Power/Energy |

#### Future priorities

| Throughput performance | Parallelism |
|------------------------|-------------|
| Power/Energy | Architecture features for energy; Simplicity |
| Constraints | (1) Programming productivity<br>(2) Cost |

Evaluate each (old) architecture feature with new
priorities

### Intel: Investing to Remove 6 Bottlenecks

## Impact on Applications

## The Many Ways to Parallelism

Levels of parallelism: Instruction Parallelism, Thread Parallelism, Data Parallelism, Cluster / Process Parallelism

- **Serial Code, Node Level:** Fast Scalar performance, Optimized C/C++, FORTRAN, Threading and Performance Libraries, Debug / Analysis Tools
- **Parallel, Node Level:** Multi-core, Multi-Socket, SSE and AVX instructions, OpenMP, Threading Building Blocks, Performance Libraries, Thread Checker, Ct, Cilk
- **Multi-Node / Cluster Level:** Cluster Tools, MPI Checker

## And New Workloads will Emerge

### Code Modernization: The 4D Approach

### Intel® Xeon Phi™ Product Family

Based on Intel® Many Integrated Core (MIC) Architecture

"Meet Knight's Landing: Intel's most powerful chip ever is overflowing with cutting-edge technologies" PCWorld - Jun 23, 2014

#### Future Knights Hill

Next generation of the Intel® MIC Architecture Product Line

In planning

<sup>\*</sup>Per Intel's announced products or planning process for future products

### No Positioning Change

Knights Landing Targeted for Highly-Vectorizable, Parallel Apps

## Most Commonly Used Parallel Processor\*

- Parallel, Fast Serial
- Multicore + Vector
- Leadership Today and Tomorrow

#### Optimized for Highly-Vectorizable Parallel Apps

- Many Core
- Support for 512 bit vectors
- Higher memory bandwidth
- Common SW programming

<sup>\*</sup>Based on highest volume CPU in the IDC HPC Qview Q1'13

#### Is Xeon Phi compelling vs Xeon?

"Rifle shot" approach targeted with customers in FSI, Oil & Gas, and Life Sciences based on affinity/readiness

Xeon Phi = Intel® Xeon Phi™ coprocessor

### Three Knights Landing Products

#### **Knights Landing Processor**

"Self-boot" Intel® Xeon Phi™ processor platform

#### **KNL and KNL-F Processors:**

- Knights Landing <u>IS</u> the host processor
- Boots standard off-the-shelf OS's

#### **Benefits:**

- Higher performance density for highly parallel applications<sup>2</sup>
- Reduced system power consumption<sup>2</sup>
- Higher perf/Watt & perf/\$\$<sup>3</sup>

#### Knights Landing Coprocessor

Requires Intel® Xeon® processor host

#### **Knights Landing Coprocessor:**

Solution for general purpose servers and workstations

#### **Benefits:**

- Targeted for applications with larger sections of serial work<sup>1</sup>
- Upgrade path from Knights Corner as PCIe card

<sup>1</sup> Projections based on early product definition and as compared to prior generation Intel® Xeon Phi™ Coprocessors

<sup>2</sup> Based on Intel internal analysis. Lower power based on power consumption estimates between (2) HCAs compared to 15W additional power for KNL-F. Higher density based on removal of PCIe slots and associated HCAs populated in those slots.

<sup>3</sup> Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

<sup>2</sup> Results based on internal Intel analysis using estimated theoretical Flops/s for KNL processors, along with estimated system power consumption and component pricing in the 2015 timeframe. See backup for complete system configurations.
### A Paradigm Shift for Highly-Parallel

**Server Processor** and **Integration** are Keys to Future

<sup>\*</sup>Comparison to 1st Generation Intel® Xeon Phi™ 7120P Coprocessor (formerly codenamed Knights Corner)

<sup>1</sup>Results based on internal Intel analysis using estimated power consumption and projected component pricing in the 2015 timeframe. This analysis is provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

<sup>2</sup>Comparison to a discrete Knights Landing processor and discrete fabric component.

<sup>3</sup>Theoretical density for air-cooled system; other cooling solutions and configurations will enable lower or higher density.

### Knights Landing Architectural Diagram

### Today's Parallel Investment Carries Forward

- Sustained threading, vectorization, cache-blocking and more
- **MOST** optimizations carry forward with a recompile
- Incremental tuning gains
- Native or Symmetric or Offload

#### Parallel is the Path Forward

Intel® Xeon® and Intel® Xeon Phi™ Product Families are both going parallel

| | Intel® Xeon® processor **64-bit** | Intel® Xeon® processor 5100 series | Intel® Xeon® processor 5500 series | Intel® Xeon® processor 5600 series | Intel® Xeon® processor code-named Sandy Bridge EP | Intel® Xeon® processor code-named Ivy Bridge EP 4S1 | Future Intel® Xeon® processor | Intel® Xeon Phi™ coprocessor Knights Corner | Intel® Xeon Phi™ processor & coprocessor **Knights Landing**<sup>1</sup> |
|------------|-----|-----|-----|-----|-----|-----|-----|---------|-----------|
| Core(s) | 1 | 2 | 4 | 6 | 8 | 12 | tbd | 57-61 | Up to 72 |
| Threads | 2 | 2 | 8 | 12 | 16 | 24 | tbd | 228-244 | Up to 288 |
| SIMD Width | 128 | 128 | 128 | 128 | 256 | 256 | 512 | 512 | 512 |

#### More cores → More Threads → Wider vectors

<sup>\*</sup>Product specification for launched and shipped products available on ark.intel.com.

<sup>1</sup>Not launched or in planning.

### Intel® Omni-Path Fabric - CPU Integration

**Time**

# Intel's Technical Computing Portfolio

Technologies & Products

- **Compute Processing**
- Systems & Boards
- Network & Fabric
- I/O & Storage
- Software & Services

### **HW-SW Co-design**

Applications and SW stack provide guidance for efficient system design

## What will matter in 10 years

| | Now | 2025 |
|-------------|-----|------|
| Perf/\$ | Linpack, Real Applications | Real Applications |
| Perf/Watt | Limited by worst case application | All applications will be able to run at chosen power level. Dynamic, optimal energy management. |
| Reliability | Use of file system checkpoint restart (spinning disks) | Transparent hardware and system software recovery. Checkpoints in non-mechanical media. |
| Big Data | Parallel IO | New storage paradigm |

### **SW Challenges**

1. Extreme parallelism (1000X due to Exa, additional 4X due to NTV)
2. Data locality: reduce data movement
3. Intelligent scheduling: move thread to data if necessary
4. Fine grain resource management (introspective)
5. Applications and algorithms incorporate paradigm change

### Summary

- Exascale will be there by 2022 or so
- "Business as usual" (riding on Moore's Law and commodity technology) is becoming increasingly hard
- Supercomputers are becoming more "special purpose"
- Expect most/all supercomputers to use floating point accelerators in a few years; more specialized accelerators to follow
- Can continue to push performance to zetascale
- Will need to think of supercomputers as unique facilities, such as particle accelerators – not clusters of PCs
- Supercomputing will become much more interesting
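
A rough back-of-envelope check of two figures quoted earlier in the deck may help make the scale concrete. The sketch below assumes the "~5KW" and "~20W" values on the Power & Energy Challenge slide are whole-system power per sustained TFLOP/s, and that the "1000X" and "4X" factors on the SW Challenges slide simply multiply. Both readings are assumptions, and the variable and function names are illustrative only.

```python
# Back-of-envelope arithmetic from figures quoted in the slides.
# Assumption: ~5 kW and ~20 W are system power per TFLOP/s
# ("TFLOP Machine today" vs "With Exa Technology").

TFLOPS_PER_EXAFLOP = 1_000_000  # 1 EFLOP/s = 10^6 TFLOP/s

watts_per_tflop_today = 5_000   # ~5 kW per TFLOP/s
watts_per_tflop_exa = 20        # ~20 W per TFLOP/s


def exaflop_power_megawatts(watts_per_tflop):
    """Power in MW for a 1 EFLOP/s machine at the given W per TFLOP/s."""
    return watts_per_tflop * TFLOPS_PER_EXAFLOP / 1e6


today_mw = exaflop_power_megawatts(watts_per_tflop_today)
exa_mw = exaflop_power_megawatts(watts_per_tflop_exa)

# SW Challenges slide: 1000X from exascale, additional 4X from NTV.
# Assumption: the two factors compound multiplicatively.
parallelism_factor = 1000 * 4

print(f"Exaflop at today's efficiency: {today_mw:.0f} MW")
print(f"Exaflop with exa technology:   {exa_mw:.0f} MW")
print(f"Extra parallelism software must expose: {parallelism_factor}X")
```

Under these assumptions, an exaflop machine built at today's efficiency would draw gigawatts, while the projected technology brings it to tens of megawatts. The same shift asks software to expose several thousand times more parallelism.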