#### Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit <a href="https://www.intel.com/benchmarks">www.intel.com/benchmarks</a>.

Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

# Performance optimization: VTune & Advisor

Paulius Velesko
paulius.velesko@intel.com
**Application Engineer**

## Sample Code

git clone https://github.com/pvelesko/nbody-demo.git

## Intel® Software Development Tools for Tuning

- Compiler Optimization Reports: key to identifying issues that prevent automated optimization
- Intel® VTune™ Application Performance Snapshot: overall performance
- Intel® Advisor: core and socket performance (vectorization and threading)
- Intel® VTune™ Amplifier: node-level performance (memory and more)
- Intel® Trace Analyzer and Collector: cluster-level performance (network)

#### Get the tools

Intel profiling tools are now FREE:

- https://software.intel.com/en-us/vtune/choose-download
- https://software.intel.com/en-us/advisor/choose-download

## Agenda

- Optimize
  - Make it go fast
    - Vectorization
    - Memory
  - Make it scale
    - MPI
- Profiling AI/ML
- Get the example code:
  - git clone https://github.com/pvelesko/nbody-demo.git

# Nbody demonstration

The naïve code that could

#### Nbody gravity simulation

Forked from https://github.com/fbaru-dev/nbody-demo (Dr. Fabio Baruffa).

Let's consider a distribution of point masses m\_1,...,m\_n located at r\_1,...,r\_n. We want to calculate the positions of the particles after a certain time interval using Newton's law of gravity.
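For reference, the acceleration the simulation has to evaluate for each particle can be written as below. This is the standard Newtonian form; the softening term ε² is an assumption added here so that the expression mirrors the `softeningSquared` variable used in the kernel shown later in this deck, and G is the gravitational constant.

$$\vec{a}_i = G \sum_{j=1}^{n} m_j \, \frac{\vec{r}_j - \vec{r}_i}{\left( \lVert \vec{r}_j - \vec{r}_i \rVert^{2} + \varepsilon^{2} \right)^{3/2}}$$

Evaluating this sum for all n particles takes on the order of n² pair interactions per time step, which is why the inner loop over j is the place to look for optimization opportunities.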
```
struct Particle
{
  public:
    Particle() { init(); }
    void init()
    {
      pos[0] = 0.; pos[1] = 0.; pos[2] = 0.;
      vel[0] = 0.; vel[1] = 0.; vel[2] = 0.;
      acc[0] = 0.; acc[1] = 0.; acc[2] = 0.;
      mass   = 0.;
    }
    real_type pos[3];
    real_type vel[3];
    real_type acc[3];
    real_type mass;
};
```

# Intel® Compiler Reports

#### Generating the compiler report

1. `cd ./nbody-demo/ver0`
2. `vim ./GSimulation.cpp` (find the compute loop)
3. `vim ./Makefile` (add the `-qopt-report=5` flag)
4. `make`
5. `vim ./GSimulation.optrpt` (search for the line number)

### Looking at the compiler report

```
LOOP BEGIN at GSimulation.cpp(127,20)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at GSimulation.cpp(130,5)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at GSimulation.cpp(132,7)
         remark #25085: Preprocess Loopnests: Moving Out Load and Store [GSimulation.cpp(145,4)]
         remark #25085: Preprocess Loopnests: Moving Out Load and Store [GSimulation.cpp(146,4)]
         remark #25085: Preprocess Loopnests: Moving Out Load and Store [GSimulation.cpp(147,4)]
         remark #15415: vectorization support: non-unit strided load was generated for the variable <this->particles->pos[j][0]>, stride is 10 [GSimulation.cpp(138,9)]
         remark #15415: vectorization support: non-unit strided load was generated for the variable <this->particles->pos[j][1]>, stride is 10 [GSimulation.cpp(139,9)]
         remark #15415: vectorization support: non-unit strided load was generated for the variable <this->particles->pos[j][2]>, stride is 10 [GSimulation.cpp(140,9)]
         remark #15415: vectorization support: non-unit strided load was generated for the variable <this->particles->mass[j]>, stride is 10 [GSimulation.cpp(145,36)]
         remark #15415: vectorization support: non-unit strided load was generated for the variable <this->particles->mass[j]>, stride is 10 [GSimulation.cpp(146,36)]
         remark #15415: vectorization support: non-unit strided load was generated for the variable <this->particles->mass[j]>, stride is 10 [GSimulation.cpp(147,36)]
         remark #15305: vectorization support: vector length 16
         remark #15309: vectorization support: normalized vectorization overhead 0.356
         remark #15417: vectorization support: number of FP up converts: single precision to double precision 1 [GSimulation.cpp(143,4)]
         remark #15418: vectorization support: number of FP down converts: double precision to single precision 1 [GSimulation.cpp(143,4)]
         remark #15417: vectorization support: number of FP up converts: single precision to double precision 6 [GSimulation.cpp(145,4)]
         remark #15418: vectorization support: number of FP down converts: double precision to single precision 1 [GSimulation.cpp(145,4)]
         remark #15417: vectorization support: number of FP up converts: single precision to double precision 6 [GSimulation.cpp(146,4)]
         remark #15418: vectorization support: number of FP down converts: double precision to single precision 1 [GSimulation.cpp(146,4)]
         remark #15417: vectorization support: number of FP up converts: single precision to double precision 6 [GSimulation.cpp(147,4)]
         remark #15418: vectorization support: number of FP down converts: double precision to single precision 1 [GSimulation.cpp(147,4)]
         remark #15300: LOOP WAS VECTORIZED
         remark #15452: unmasked strided loads: 6
         remark #15475: --- begin vector cost summary ---
         remark #15476: scalar cost: 137
         remark #15477: vector cost: 20.000
         remark #15478: estimated potential speedup: 6.300
         remark #15487: type converts: 23
         remark #15488: --- end vector cost summary ---
```

#### The Basic Tuning Cycle

- An infinite cycle, broken only by external constraints (time, papers, releases, ...)
- Procedures for measuring performance and validating results are critical
- **Automation** and **environment** control are key for **consistency**

Where do I start?
- /soft/perftools/intel/advisor/advixe.qsub
- /soft/perftools/intel/vtune/amplxe.qsub

#### amplxe.qsub Script

- Copy and customize the script from /soft/perftools/intel/vtune/amplxe.qsub
- All-in-one script for profiling
  - Job size: ranks, threads, hyperthreads, affinity
  - Attach to a single rank, multiple ranks or all ranks
  - Binary as arg#1, input as arg#2
    - `qsub amplxe.qsub ./your_exe ./inputs/inp`
  - Binary and source search directory locations
  - Timestamp + binary name + input name as result directory
  - Save cobalt job files to result directory

# Intel® Advisor

### Intel® Advisor – Vectorization Optimization

#### Faster Vectorization Optimization:

- Vectorize where it will pay off most
- Quickly ID what is blocking vectorization
- Tips for effective vectorization
- Safely force compiler vectorization
- Optimize memory stride

#### Roofline model analysis:

- Automatically generate roofline model
- Evaluate current performance
- Identify boundedness

http://intel.ly/advisor-xe

Add Parallelism with Less Effort, Less Risk and More Impact

### Typical Vectorization Optimization Workflow

There is no need to recompile or relink the application, but the use of **-g** is recommended.

Note: if you're using Theta, run out of /projects rather than /home.

1. Collect survey (overhead ~5%): `advixe-cl -c survey`
   - Basic info (static analysis): ISA, time spent, etc.
2. Collect tripcounts and FLOPs (overhead 1-10x): `advixe-cl -c tripcounts -flop`
   - Investigate the application's place within the roofline model
   - Determine vectorization efficiency and opportunities for improvement
3. Collect dependencies (overhead 5-1000x): `advixe-cl -c dependencies`
   - Differentiate between real and assumed issues blocking vectorization
4. Collect Memory Access Patterns: `advixe-cl -c map`

#### Collect survey and tripcounts

```
cd /projects/intel/pvelesko/nody-demo/ver0
make
cp /soft/perftools/intel/advisor/advixe.qsub ./
qsub ./advixe.qsub ./nbody.x 2000 500
```

Then scp the result back to your local machine.

A text report can also be useful: `advixe-cl -R survey`

#### View Result

X-forwarding is not recommended. Instead, either tar the result along with the sources (if you want to be able to view them), or generate a snapshot and then scp it to your local machine:

\$ advixe-cl --snapshot --pack --cache-sources --cache-binaries

## Analyze Result - advixe\_ver0

- Summary: ISA
- CPU Time: Total vs Self
- Loops and Functions / Loops Only / Functions Only
- Top Down: helpful when the same function is called in multiple places
- Compute Perf: FLOPs
- Roofline

#### **Summary Report**

- Summary provides overall performance characteristics
- Top time-consuming loops are listed individually
- Vectorization efficiency is based on the ISA in use (in this case SSE2/SSE)
- Note the warning regarding a higher ISA (in this case -xMIC-AVX512)

## Survey Report (Code Analytics Tab)

#### Analytics tab contains a wealth of information

- Instruction set
- Instruction mix
- Traits (sqrt, type conversions, unpacks)
- Vector efficiency
- Floating point statistics

Explanations of how these are measured or calculated are also available: expand the box or hover over the question marks.
## Survey Report (Source Tab)

#### Notice the following:

- Higher ISA available
- Type conversion
- Use of square root

All of these elements may affect performance.

## Cache-Aware Roofline Model (CARM) Analysis

#### Follow recommendations and re-test

In this new version (ver2 in the GitHub sample) we introduce the following changes:

- Consistently use float types to avoid type conversions in GSimulation.cpp
- Recompile to target Intel® Xeon Phi 7230 with -xMIC-AVX512

#### Note changes in survey report:

- Reduced vectorization efficiency (harder with 512 bits)
- Type conversions gone
- Gathers/Blends point to memory issues and vector inefficiencies

## Analyze Result - advixe\_ver2

- Roofline: change in OI (due to FP converts)
- Jump in FLOPs

**Memory Access**

### Vectorization: gather/scatter operation

For automatically vectorized loops that access non-contiguous memory locations, the compiler might generate gather/scatter instructions.

```
struct Particle
{
  public:
    ...
    real_type pos[3];
    real_type vel[3];
    real_type acc[3];
    real_type mass;
};
```

#### Memory access pattern analysis

How should I access data?
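One way to see where the "stride is 10" remarks in the compiler report come from is to compare the array-of-structures (AoS) layout above with a structure-of-arrays (SoA) layout. The sketch below is illustrative only: it assumes `real_type` is `float`, uses `std::vector` instead of the aligned raw arrays in the sample code, and the names `ParticleAoS`, `ParticlesSoA`, `sum_x_aos` and `sum_x_soa` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using real_type = float;  // assumption: single precision, as in ver2 of the sample

// AoS: one struct per particle, same fields as the Particle struct above.
struct ParticleAoS {
    real_type pos[3];
    real_type vel[3];
    real_type acc[3];
    real_type mass;
};

// Distance, in elements, between the same field of two consecutive particles.
constexpr std::size_t aos_stride = sizeof(ParticleAoS) / sizeof(real_type);
static_assert(aos_stride == 10, "3 + 3 + 3 + 1 fields per particle");

// SoA: one contiguous array per field (hypothetical layout for illustration).
struct ParticlesSoA {
    std::vector<real_type> pos_x, pos_y, pos_z;
    std::vector<real_type> vel_x, vel_y, vel_z;
    std::vector<real_type> acc_x, acc_y, acc_z;
    std::vector<real_type> mass;

    explicit ParticlesSoA(std::size_t n)
        : pos_x(n), pos_y(n), pos_z(n),
          vel_x(n), vel_y(n), vel_z(n),
          acc_x(n), acc_y(n), acc_z(n),
          mass(n) {}
};

// AoS traversal: consecutive x positions are 10 elements apart in memory.
real_type sum_x_aos(const std::vector<ParticleAoS>& p) {
    real_type sum = 0.0f;
    for (std::size_t j = 0; j < p.size(); ++j) sum += p[j].pos[0];
    return sum;
}

// SoA traversal: consecutive x positions are adjacent (unit stride).
real_type sum_x_soa(const ParticlesSoA& p) {
    real_type sum = 0.0f;
    for (std::size_t j = 0; j < p.pos_x.size(); ++j) sum += p.pos_x[j];
    return sum;
}
```

Both functions compute the same value; the difference is that the SoA loop reads adjacent elements, which a vectorizing compiler can typically load with plain vector loads rather than strided loads or gathers.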
- Unit-stride accesses are faster: for B, 1 cache line load computes 4 DP
- Constant-stride accesses are more complex: for B, 2 cache line loads compute 4 DP with reconstructions
- Non-predictable accesses are usually bad: for B, 4 cache line loads compute 4 DP with reconstructions, and prefetching might not work

#### Follow recommendations and re-test

In this new version (ver3 in the GitHub sample) we introduce the following change:

- Change particle data structures from AOS to SOA

#### Note changes in report:

- Performance is lower
- Main loop is no longer vectorized
- Assumed vector dependence prevents automatic vectorization

The next step is clear: perform a **Dependencies** analysis.

#### Suggested solutions

#### ISSUE: PROVEN (REAL) DEPENDENCY **PRESENT**

The compiler assumed there is an anti-dependency (write after read, WAR) or a true dependency (read after write, RAW) in the loop. Improve performance by investigating the assumption and handling it accordingly.

#### Recommendation: Resolve dependency

The Dependencies analysis shows there is a real (proven) dependency in the loop. To fix, do one of the following:

If there is an anti-dependency, enable vectorization using the directive #pragma omp simd safelen(length), where length is smaller than the distance between dependent iterations in the anti-dependency. For example:

```
#pragma omp simd safelen(4)
for (i = 0; i < n - 4; i += 4)
    a[i + 4] = a[i] * c;
```

If there is a reduction pattern dependency in the loop, enable vectorization using the directive #pragma omp simd reduction(operator:list). For example:

```
#pragma omp simd reduction(+:sumx)
for (k = 0; k < size2; k++)
    sumx += x[k]*b[k];
```

## Analyze Result - advixe\_ver4

- Vectorization time back to normal
- Reduced execution time

### Advisor Roofline – How much further can we go?
```
// (the opening #if/#ifdef of this guard is cut off on the slide)
__assume_aligned(particles->pos_x, alignment);
__assume_aligned(particles->pos_y, alignment);
__assume_aligned(particles->pos_z, alignment);
__assume_aligned(particles->acc_x, alignment);
__assume_aligned(particles->acc_y, alignment);
__assume_aligned(particles->acc_z, alignment);
__assume_aligned(particles->mass, alignment);
#endif
real_type ax_i = particles->acc_x[i];
real_type ay_i = particles->acc_y[i];
real_type az_i = particles->acc_z[i];
for (j = 0; j < n; j++)
{
  real_type dx, dy, dz;
  real_type distanceSqr = 0.0f;
  real_type distanceInv = 0.0f;

  dx = particles->pos_x[j] - particles->pos_x[i];
  dy = particles->pos_y[j] - particles->pos_y[i];
  dz = particles->pos_z[j] - particles->pos_z[i];

  distanceSqr = dx*dx + dy*dy + dz*dz + softeningSquared;
  distanceInv = 1.0f / sqrtf(distanceSqr); //1div

  // accumulation statements as shown on the Memory Performance slide
  ax_i += dx * G * particles->mass[j] * distanceInv * distanceInv * distanceInv; //6flops
  ay_i += dy * G * particles->mass[j] * distanceInv * distanceInv * distanceInv; //6flops
  az_i += dz * G * particles->mass[j] * distanceInv * distanceInv * distanceInv; //6flops
}
particles->acc_x[i] = ax_i;
particles->acc_y[i] = ay_i;
particles->acc_z[i] = az_i;
```

$$\text{FMA Ratio} = \frac{3}{29} \approx 10\%$$

$$\text{Peak} = \text{SP Vector ADD} \times (1 + \text{FMA Ratio}) = 40 \times (1 + 0.1) = 44\ \text{GFLOPS}$$
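The peak estimate above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of the formula on the slide; reading 3/29 as "3 FMA instructions out of 29 floating-point instructions" is an assumption, and the function names `fma_ratio` and `fma_adjusted_peak` are hypothetical.

```cpp
#include <cassert>

// Fraction of floating-point instructions that are FMAs
// (assumed meaning of the 3/29 on the slide).
double fma_ratio(int fma_instructions, int fp_instructions) {
    assert(fp_instructions > 0);
    return static_cast<double>(fma_instructions) / fp_instructions;
}

// Roof for this kernel: the SP vector add roof scaled by (1 + FMA ratio),
// since each FMA contributes two floating-point operations instead of one.
double fma_adjusted_peak(double vector_add_peak_gflops, double ratio) {
    return vector_add_peak_gflops * (1.0 + ratio);
}
```

With the values from the slide, `fma_adjusted_peak(40.0, fma_ratio(3, 29))` evaluates to roughly 44 GFLOPS, which matches the rounded figure above.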
\*Other names and brands may be claimed as the property of others.

## **Vectorization Efficiency?**

#### **Complex Operations?**

### **Memory Performance**

```
for (i = 0; i < n; i++) // update acceleration
{
  // (the opening #if/#ifdef of this guard is cut off on the slide)
  __assume_aligned(particles->pos_x, alignment);
  __assume_aligned(particles->pos_y, alignment);
  __assume_aligned(particles->pos_z, alignment);
  __assume_aligned(particles->acc_x, alignment);
  __assume_aligned(particles->acc_y, alignment);
  __assume_aligned(particles->acc_z, alignment);
  __assume_aligned(particles->mass, alignment);
#endif
  real_type ax_i = particles->acc_x[i];
  real_type ay_i = particles->acc_y[i];
  real_type az_i = particles->acc_z[i];
#pragma omp simd simdlen(16)
  for (j = 0; j < n; j++)
  {
    real_type dx, dy, dz;
    real_type distanceSqr = 0.0f;
    real_type distanceInv = 0.0f;

    dx = particles->pos_x[j] - particles->pos_x[i];
    dy = particles->pos_y[j] - particles->pos_y[i];
    dz = particles->pos_z[j] - particles->pos_z[i];

    distanceSqr = dx*dx + dy*dy + dz*dz + softeningSquared;
    distanceInv = 1.0f / sqrtf(distanceSqr);

    ax_i += dx * G * particles->mass[j] * distanceInv * distanceInv * distanceInv; //6flops
    ay_i += dy * G * particles->mass[j] * distanceInv * distanceInv * distanceInv; //6flops
    az_i += dz * G * particles->mass[j] * distanceInv * distanceInv * distanceInv; //6flops
  }
  particles->acc_x[i] = ax_i;
  particles->acc_y[i] = ay_i;
  particles->acc_z[i] = az_i;
}
```

Maximum N before we lose caching? On KNL, L1 is 32 kB and L2 is 1 MB (per tile of 2 cores):

- L1: 32k / (4\*4) = 2k, counting 4 bytes for each of the 4 arrays read in the inner loop (pos\_x, pos\_y, pos\_z, mass)
- L2: 1MB / (7\*4) = 35.7k, counting 4 bytes for each of the 7 arrays

#### **GFLOPs vs N**

# Intel® VTune™ Amplifier

#### VTune Amplifier is a full system profiler

- Accurate
- Low overhead
- Comprehensive (microarchitecture, memory, IO, threading, ...)
- Highly customizable interface
- Direct access to source code and assembly
- User-mode driverless sampling
- Event-based sampling

Analyzing code access to shared resources is critical to achieve good performance on multicore and manycore systems.

### **Predefined Collections**

Many available analysis types:

| Analysis type | Description |
|--------------------|---------------------------------------|
| uarch-exploration | General microarchitecture exploration |
| hpc-performance | HPC Performance Characterization |
| memory-access | Memory Access |
| disk-io | Disk Input and Output |
| concurrency | Concurrency |
| gpu-hotspots | GPU Hotspots |
| gpu-profiling | GPU In-kernel Profiling |
| hotspots | Basic Hotspots |
| locksandwaits | Locks and Waits |
| memory-consumption | Memory Consumption |
| system-overview | System Overview |

**Python Support**

# Collect uarch-exploration

```
cd /projects/intel/pvelesko/nody-demo/ver7
vim Makefile       # edit to add -dynamic
cp /soft/perftools/intel/vtune/amplxe.qsub ./
vim amplxe.qsub    # edit collection to "uarch-exploration"
qsub ./amplxe.qsub ./nbody.x 2000 500
```

Then scp the result back to your local machine.

# Hotspots analysis for nbody demo (ver7: threaded)

qsub amplxe.qsub ./your\_exe ./inputs/inp

- Lots of spin time indicates issues with load balance and synchronization
- Given the short OpenMP region duration, it is likely that we do not have sufficient work per thread
- Let's look at the timeline for each thread to understand things better...

## Bottom-up Hotspots view

- There is not enough work per thread in this particular example.
- Double-click on a line to access source and assembly.
- Notice the filtering options at the bottom, which allow customization of this view.
- Next steps would include additional analysis to continue the optimization process.

# Viewing the result

- Text file reports:
  - `amplxe-cl -help report`: How do I create a text report?
  - `amplxe-cl -help report hotspots`: What can I change?
  - `amplxe-cl -R hotspots -r ./res_dir -column=?`: Which columns are available?
  - Ex: report the top 5 loops, with total time and L2 cache hit rates:
    - `amplxe-cl -R hotspots -loops-only -limit=5 -column="L2_CACHE_HIT, Time Self (%)"`
- VTune GUI
  - `unset LD_PRELOAD; amplxe-gui`

# Using VTune to ch

(Screenshot: VTune General Exploration / Microarchitecture Analysis summary, grouped by Function / Call Stack; the top entry is GSimulation::start.)

Text report for the same result:

- amplxe: Using result path '/gpfs/jlse-fs0/users/pvelesko/nbody-demo/ver5/amplxe\_knl\_nodiv\_60k'
- amplxe: Executing actions 75 % Generating a report
- Clockticks: 405,003,000,000
- Instructions Retired: 342,199,000,000
- CPI Rate: 1.184
- MUX Reliability: 0.992
- Front-End Bound: 1.5% of Pipeline Slots
- ITLB Overhead: 0.0% of Clockticks
- BACLEARS: 0.1% of Clockticks
- MS Entry: 0.0% of Clockticks
- ICache Line Fetch: 1.0% of Clockticks
- Bad Speculation: 0.2% of Pipeline Slots
- Branch Mispredict: 0.2% of Clockticks
- SMC Machine Clear: 0.0% of Clockticks
- MO Machine Clear Overhead: 0.0% of Clockticks
- Back-End Bound: 56.2% of Pipeline Slots
  - A significant proportion of pipeline slots are remaining empty. When operations take too long in the back-end, they introduce bubbles in the pipeline that ultimately cause fewer pipeline slots containing useful work to be retired per cycle than the machine is capable of supporting. This opportunity cost results in slower execution. Long-latency operations like divides and memory operations can cause this, as can too many operations being directed to a single execution port (for example, more multiply operations arriving in the back-end per cycle than the execution unit can support).
- Memory Latency
- L1 Hit Rate: 60.2%
  - The L1 cache is the first, and shortest-latency, level in the memory hierarchy. This metric provides the ratio of demand load requests that hit the L1 cache to the total number of demand loads.
- L2 Hit Rate: 98.0%
- L2 Hit Bound: 100.0% of Clockticks
  - Issue: A significant portion of cycles is being spent on data fetches that miss the L1 but hit the L2. This metric includes coherence penalties for shared data. If contested accesses or data sharing are indicated as likely issues, address them first. Otherwise, consider the performance tuning applicable to an L2-missing workload: reduce the data working set size, improve data access locality, consider blocking or partitioning your working set so that it fits into the L1, or better exploit hardware prefetchers. Consider using software prefetchers, but note that they can interfere with normal loads, potentially increasing latency, as well as increase pressure on the memory system.
- L2 Miss Bound: 36.2% of Clockticks
  - Issue: A high number of CPU cycles is being spent waiting for L2 load misses to be serviced. Reduce the data working set size, improve data access locality, block and consume data in chunks that fit into the L2, or better exploit hardware prefetchers. Consider using software prefetchers, but note that they can increase latency by interfering with normal loads, as well as increase pressure on the memory system.
- UTLB Overhead: 4.0% of Clockticks
- SIMD Compute-to-L1 Access Ratio: 1.490
- SIMD Compute-to-L2 Access Ratio: 4.003
  - This metric provides the ratio of SIMD compute instructions to the total number of memory loads that hit the L2 cache. On this platform, it is important that this ratio is large to ensure efficient usage of compute resources.
- …Intra-Tile): 0.0%
- Page Walk: 4.9% of Clockticks
- Memory Reissues
- Split Loads: 0.0%
- Split Stores: 0.0%
- Loads Blocked by Store Forwarding: 0.0%
- Retiring: 42.1% of Pipeline Slots
- VPU Utilization: 99.9% of Clockticks
- Divider: 0.0% of Clockticks
- MS Assists: 0.1% of Clockticks
- FP Assists: 0.0% of Clockticks
- Total Thread Count: 1

(Screenshot: per-function grid with Bad Speculation, Back-End Bound and Retiring columns, and Memory Latency columns for L2 Hit Bound, L2 Miss Bound and UTLB Overhead.)

# Microarchitecture Exploration - Caches

| Size (N) | 2k | 2.5k | 30k | 35k | 50k | 60k |
|-----------------|------|-------|-------|-------|-------|-------|
| L1 Hit % | 100% | 63.9% | 62.4% | 48.5% | 57.5% | 60.2% |
| L2 Hit % | 0% | 100% | 100% | 100% | 99.2% | 98.8% |
| L2 Hit Bound % | 0% | 100% | 100% | 100% | 100% | 100% |
| L2 Miss Bound % | 0% | 0% | 0% | 0% | 28.6% | 36.2% |

# Profiling Python & ML applications

## Python

Profiling Python is straightforward in VTune™ Amplifier, as long as one does the following:

- The "application" should be the full path to the Python interpreter used
- The Python code should be passed as "arguments" to the "application"

On Theta this would look like this:

# Simple Python Example on Theta

```
aprun -n 1 -N 1 amplxe-cl -c hotspots -r vt_pytest \
      -- /usr/bin/python ./cov.py naive 100 1000
```

Naïve implementation of the calculation of a covariance matrix.

### Summary shows:

- Single thread execution
- Top function is "naive"

Click on the top function to go to the Bottom-up view.

# Bottom-up View and Source Code

Note that for mixed Python/C code a Top-Down view can often be helpful to drill down into the C kernels.

# Intel® VTune™ Application Performance Snapshot

Performance overview at your fingertips

# VTune™ Amplifier's Application Performance Snapshot

### High-level overview of application performance

- Identify primary optimization areas
- Recommend next steps in analysis
- Extremely easy to use
- Informative, actionable data in clean HTML report
- Detailed reports available via command line
- Low overhead, high scalability

# Usage on Theta

Launch all profiling jobs from /projects rather than /home.

No module available, so set up the environment manually:

- \$ module load vtune
- \$ export PMI\_NO\_FORK=1

Launch your job in interactive or batch mode:

\$ aprun -N <ppn> -n <totRanks> [affinity opts] aps ./exe

Produce text and html reports:

\$ aps -report ./aps\_result\_ ....

## **APS HTML Report**

**Application Performance Snapshot**

- Application: heart\_demo
- Report creation date: 2017-08-01 12:08:48
- Number of ranks: 144
- Ranks per node: 18
- OpenMP threads per rank: 2
- HW Platform: Intel(R) Xeon(R) Processor code named Broadwell-EP
- Logical Core Count per node: 72

Your application is MPI bound. This may be caused by high busy wait time inside the library (imbalance), non-optimal communication schema or MPI library settings. Use MPI profiling tools like Intel® Trace Analyzer and Collector to explore performance bottlenecks.
Elapsed Time: 121.39s

| Metric | Current run | Target |
|------------------|-------------|--------|
| MPI Time | 53.74% | <10% |
| OpenMP Imbalance | 0.43% | <10% |
| Memory Stalls | 14.70% | |
| FPU Utilization | 0.30% | >50% |
| I/O Bound | 0.00% | <10% |

Detail panels in the report:

- MPI Time: 53.74% of Elapsed Time (65.23s); MPI Imbalance: 11.03% of Elapsed Time (13.39s)
- TOP 5 MPI Functions: Waitall (37.35%), Isend, Barrier, Irecv, Scatterv
- OpenMP Imbalance: 0.43% of Elapsed Time (0.52s)
- Memory Stalls: 14.70% of pipeline slots; Cache Stalls: 12.84% of cycles; DRAM Stalls: 0.18% of cycles; NUMA: 31.79% of remote accesses
- FPU Utilization: 0.30%; SP FLOPs per Cycle: 0.08 out of 32.00; Vector Capacity Usage: 25.84%
- FP Instruction Mix: 3.54% packed FP instructions (128-bit: 3.54%, 256-bit: 0.00%), 96.46% scalar FP instructions
- Memory Footprint, resident: per node peak 786.96 MB, average 687.49 MB; per rank peak 127.62 MB, average 38.19 MB
- Memory Footprint, virtual: per node peak 9173.34 MB, average 9064.92 MB; per rank peak 566.52 MB
- I/O Bound: 0.00% (AVG 0.00, PEAK 0.00)

# Common issues

### **Fixes**

- No call stack information: check finalization
- Incompatible database scheme: use the same version of VTune
- VTune sampling driver.. using perf: use the latest VTune; the driver needs a rebuild

# Tips and tricks

# Speeding up finalization

| Advisor | VTune |
|---------|-------|
| Add `--no-auto-finalize` to the aprun command. | Add `--finalization-mode=none` to the aprun command. |
| Following this with `advixe-cl -R survey ...` <u>without aprun</u> causes finalization to run on the mom node rather than on KNL. | Following this with `amplxe-cl -R hotspots ...` <u>without aprun</u> causes finalization to run on the mom node rather than on KNL. |
You can also finalize on thetalogin:

- Advisor:
  - `cd your_src_dir;`
  - ``export SRCDIR=`pwd | xargs realpath` ``
  - `advixe-cl -R survey --search-dir src:=${SRCDIR}`
- VTune:
  - `cd your_src_dir;`
  - ``export SRCDIR=`pwd | xargs realpath` ``
  - `amplxe-cl -R hotspots --search-dir src:=${SRCDIR}`

# Managing overheads

- Advisor Dependencies and MAP analyses can have huge overheads.
- If you can, run on a reduced problem size. Advisor just needs to figure out the execution flow.
- Only analyze loops/functions of interest: https://software.intel.com/en-us/advisor-user-guide-mark-up-loops

# backup

### When do I use VTune vs Advisor?

#### VTune

- What's my cache hit ratio?
- Which loop/function is consuming most time overall? (bottom-up)
- Am I stalling often? IPC?
- Am I keeping all the threads busy?
- Am I hitting remote NUMA?
- When do I maximize my BW?

#### **Advisor**

- Which vector ISA am I using?
- Flow of execution (callstacks)
- What is my vectorization efficiency?
- Can I safely force vectorization?
- Inlining? Data type conversions?
- Roofline

### **VTune Cheat Sheet**

Compile with -g -dynamic

```
amplxe-cl -c hpc-performance -flags -- ./executable
```

- --result-dir=./vtune\_output\_dir
- --search-dir src:=../src --search-dir bin:=./
- -knob enable-stack-collection=true -knob collect-memory-bandwidth=false
- -knob analyze-openmp=true
- -finalization-mode=deferred if finalization is taking too long on KNL
- -data-limit=125 ← in MB
- -trace-mpi for MPI metrics on Theta
- amplxe-cl -help collect survey

### **Advisor Cheat Sheet**

Compile with -g -dynamic

```
advixe-cl -c roofline/dependencies/map -flags -- ./executable
```

- --project-dir=./advixe\_output\_dir
- --search-dir src:=../src --search-dir bin:=./
- -no-auto-finalize if finalization is taking too long on KNL
- --interval 1 (sample at 1 ms interval, helps for profiling short runs)
- -data-limit=125 ← in MB
- advixe-cl -help
# How much further can we go?

Introducing the Cache-Aware Roofline Model

# Platform peak FLOPs

How many floating point operations per second

$$\text{Gflop/s} = \min \begin{cases} \text{Platform PEAK} \\ \text{Platform BW} \times \text{AI} \end{cases}$$

### Theoretical value can be computed by specification

### More realistic value can be obtained by running Linpack

=~ 930 Gflop/s on a 2-socket Intel® Xeon® Processor E5-2697 v2

### Platform PEAK bandwidth

How many bytes can be transferred per second

$$\text{Gflop/s} = \min \begin{cases} \text{Platform PEAK} \\ \text{Platform BW} \times \text{AI} \end{cases}$$

### Theoretical value can be computed by specification

Example with a 2-socket Intel® Xeon® Processor E5-2697 v2:

PEAK BW = 2 x 1.866 x 8 x 4 = 119 GB/s

- 2: number of sockets
- 1.866: memory frequency
- 8: bytes per channel
- 4: number of memory channels

### More realistic value can be obtained by running **Stream**

=~ 100 GB/s on a 2-socket Intel® Xeon® Processor E5-2697 v2

# Drawing the Roofline

(Figure: roofline plot over AI [Flop/Byte]; annotated value: 8.7.)

### Cache-Aware Roofline

**Next Steps**

#### If under or near a memory roof...

- Try a MAP analysis. Make any appropriate cache optimizations.
- If cache optimization is impossible, try reworking the algorithm to have a higher AI.

#### If under the Vector Add Peak...

Check "Traits" in the Survey to see if FMAs are used. If not, try altering your code or compiler flags to **induce FMA usage.**

#### If just above the Scalar Add Peak...

Check **vectorization efficiency** in the Survey. Follow the recommendations to improve it if it's low.

#### If under the Scalar Add Peak...

Check the Survey Report to see if the loop vectorized. If not, try to **get it to vectorize** if possible. This may involve running Dependencies to see if it's safe to force it.
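The roofline bound and the theoretical bandwidth calculation above can be sketched in a few lines of C++. This is a minimal illustration of the two formulas, not part of Advisor; the function names `peak_bandwidth_gbs`, `roofline_gflops` and `ridge_point` are hypothetical, and the ridge point (the AI at which the memory roof meets the compute roof) is an addition that follows directly from the min() formula.

```cpp
#include <algorithm>
#include <cassert>

// Theoretical peak memory bandwidth in GB/s:
// sockets x memory frequency (GT/s) x bytes per channel x channels per socket.
double peak_bandwidth_gbs(int sockets, double mem_freq_gts,
                          int bytes_per_channel, int channels_per_socket) {
    return sockets * mem_freq_gts * bytes_per_channel * channels_per_socket;
}

// Roofline bound in Gflop/s for a kernel with arithmetic intensity `ai`
// (Flop/Byte): the lower of the compute roof and the memory roof.
double roofline_gflops(double peak_gflops, double bw_gbs, double ai) {
    assert(ai >= 0.0);
    return std::min(peak_gflops, bw_gbs * ai);
}

// Arithmetic intensity at which the memory roof meets the compute roof.
double ridge_point(double peak_gflops, double bw_gbs) {
    assert(bw_gbs > 0.0);
    return peak_gflops / bw_gbs;
}
```

For example, `peak_bandwidth_gbs(2, 1.866, 8, 4)` reproduces the 119 GB/s figure, and with the measured values quoted above (930 Gflop/s from Linpack, 100 GB/s from Stream) a kernel with AI = 1 Flop/Byte is bounded at 100 Gflop/s, while the ridge point lies at about 9.3 Flop/Byte.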