# AA 372: Numerical & Statistical Techniques

Prateek Sharma (<u>prateek@physics.iisc.ernet.in</u>)

Office: D2-08

Office Hours: Fri. 2-3 pm (I expect to see you!)

### Two parts

• Numerical Analysis by me at IISc

• Statistical Methods by Desh at RRI

# Organization

I've created a wikipage for this course: <u>http://ps-teaching.wikispaces.com/</u>

I'll upload slides/problem-sets here

Grading: weekly homeworks, project

Syllabus: <u>http://ps-teaching.wikispaces.com/AA+372+Syllabus</u>

# some basics about how modern computers work

### Computer Architecture



both instructions & data are sent by input devices to memory, then loaded from memory to CPU registers

Instruction Set Architecture (ISA): machine language instruction set, word size, registers

#### ALU



bitwise logic ops. AND, OR, NOT, XOR

integer arithmetic ops. add, subtract, multiply, divide

bit shifting (\* or / by 2<sup>n</sup>)
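The ALU operations listed above can be tried directly on Python integers. This is only a sketch: the 4-bit patterns and the helper names `mul_pow2` and `div_pow2` are made up for illustration, and Python integers are arbitrary precision, so NOT has to be masked to a fixed width by hand.

```python
# Two 4-bit patterns, chosen only for illustration
a = 0b1100
b = 0b1010

and_ab = a & b        # 0b1000
or_ab = a | b         # 0b1110
xor_ab = a ^ b        # 0b0110
not_a = ~a & 0b1111   # NOT, masked to 4 bits: 0b0011


def mul_pow2(x: int, n: int) -> int:
    """Multiply a non-negative integer by 2**n with a left shift."""
    return x << n


def div_pow2(x: int, n: int) -> int:
    """Floor-divide a non-negative integer by 2**n with a right shift."""
    return x >> n
```

For example, `mul_pow2(5, 3)` gives 40 and `div_pow2(41, 3)` gives 5 (the remainder is dropped).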

### FPU: floating point unit

+, -, \* are fast; / is slow, and so are exp, cos, & other transcendental fns.

commonly used functions are coded in machine language

## Hierarchical Memory

#### **Computer Memory Hierarchy**



## Cache Utilization

### data stored in memory as a 1-D array



sometimes compilers do these optimizations (-O3)
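A minimal sketch of why loop order matters, assuming row-major (C-style) storage; Fortran stores arrays column-major, so the roles of the two loops swap there. The function names are hypothetical. Each 2-D element `(i, j)` lives at one offset in the 1-D memory array, and the loop order decides whether successive accesses are contiguous.

```python
def flat_index(i, j, ncols):
    """Offset of element (i, j) in a row-major 1-D layout."""
    return i * ncols + j


def row_major_order(nrows, ncols):
    """Offsets visited with i as the outer loop and j as the inner loop."""
    return [flat_index(i, j, ncols) for i in range(nrows) for j in range(ncols)]


def column_major_order(nrows, ncols):
    """Offsets visited with j as the outer loop and i as the inner loop."""
    return [flat_index(i, j, ncols) for j in range(ncols) for i in range(nrows)]


def strides(offsets):
    """Jump in memory between successive accesses."""
    return [b - a for a, b in zip(offsets, offsets[1:])]
```

For a 3 x 4 array, `row_major_order(3, 4)` walks offsets 0, 1, 2, ... with stride 1 (cache friendly), while `column_major_order(3, 4)` starts 0, 4, 8, 1, ... and jumps by a full row each step.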

# Latency & Bandwidth

latency: minimum time to do an action (access time)

bandwidth: rate of action once the action is initiated
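A common first-order way to combine the two numbers is time = latency + size / bandwidth. The sketch below uses that model; the function names and the sample latency (1 microsecond) and bandwidth (1 GB/s) are assumptions for illustration, not measurements of any particular machine.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order cost model: fixed start-up latency plus streaming time."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def effective_bandwidth(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Bytes per second actually achieved for a transfer of n_bytes."""
    return n_bytes / transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s)


LATENCY = 1e-6     # assumed: 1 microsecond
BANDWIDTH = 1e9    # assumed: 1 GB/s

small = transfer_time(8, LATENCY, BANDWIDTH)          # dominated by latency
large = transfer_time(8_000_000, LATENCY, BANDWIDTH)  # dominated by bandwidth
```

Under these assumed numbers a single 8-byte transfer reaches well under 1% of the nominal bandwidth, which is why many small accesses cost more than one large contiguous one.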





#### ScienceMark L1 Cache Latency: Time





Nehalem processor: L1 ~ 64 kB, L2 ~ 2 MB, L3 ~ 30 MB

L1 cache is ~5 times faster than L3 cache, which is ~5 times faster than RAM!

# Clock Rate

clock coordinates different actions

modern CPUs: up to 4 FLOPs per cycle; 2.4 GHz => 4 x 2.4 x 10<sup>9</sup> ~ 10<sup>10</sup> FLOPs/s/core (10 GF)

if the cluster has 80 cores => 800 GF machine

this is not the only parameter! data access is more time-consuming (40 ns) than a FLOP (0.1 ns), so larger RAM/cache and better interconnects are more important than just clock speed
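The peak-performance estimate above can be reproduced in a few lines. This is a sketch using only the numbers quoted on this slide; the function name `peak_flops` is hypothetical, and the unrounded results (9.6 GF per core, 768 GF for 80 cores) are what the slide rounds to 10 GF and 800 GF.

```python
def peak_flops(flops_per_cycle, clock_hz, n_cores=1):
    """Theoretical peak floating point operations per second."""
    return flops_per_cycle * clock_hz * n_cores


per_core = peak_flops(4, 2.4e9)      # 9.6e9 FLOPs/s, i.e. ~10 GF
cluster = peak_flops(4, 2.4e9, 80)   # 7.68e11 FLOPs/s, i.e. ~800 GF

# How many FLOPs fit in the time of one data access (40 ns vs 0.1 ns)
flops_per_access = 40e-9 / 0.1e-9    # ~400
```

The last number is the reason peak FLOPs alone is misleading: a code that fetches one operand from memory per operation runs hundreds of times below peak.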

### Architecture level Parallelism

**bit level parallelism:** 4 bit ... 32 to 64 bit word-size (=register size); more bits processed/cycle



five-stage pipeline in a RISC (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back)
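A rough count of what pipelining buys, assuming an idealized pipeline in which every stage takes one cycle and there are no stalls or hazards (real pipelines do worse). The function names are hypothetical.

```python
def cycles_unpipelined(n_instructions, n_stages=5):
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages=5):
    """Ideal pipeline: fill it once, then retire one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1
```

For 100 instructions on the five-stage pipeline this gives 500 cycles without pipelining and 104 with it, so the throughput approaches one instruction per cycle.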

### Moore's Law



### improvements governed by technology

#### architecture, compiler, programs reflect this



power issues! chips becoming smaller and smaller

$P = \frac{1}{2} C V^2 f$

higher frequency => more power consumption & heating; can't be air cooled!

options: reduce operating voltage (transistor errors) or frequency (speed reduction)
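The trade-off follows directly from the power formula: power is linear in frequency but quadratic in voltage. A small sketch, where the capacitance, voltage, and frequency values are placeholders chosen only to show the scaling:

```python
def dynamic_power(capacitance, voltage, frequency):
    """P = 1/2 C V^2 f (watts, for SI inputs)."""
    return 0.5 * capacitance * voltage**2 * frequency


# Placeholder values, not data for any real chip
base = dynamic_power(1e-9, 1.2, 2.4e9)
half_frequency = dynamic_power(1e-9, 1.2, 1.2e9)     # half the clock
lower_voltage = dynamic_power(1e-9, 1.08, 2.4e9)     # 10% lower voltage
```

Halving the frequency halves the power, while a 10% voltage reduction already cuts it to 81% of the original.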

# software closely tied to hardware esp. with parallel systems

source code: high level language (fortran, c, c++)

↓ compiler (also optimizes the code, e.g., -O2, -O3 flags)

object code & executable (lower level assembly/machine code)

interpreted languages (e.g., python, perl, MATLAB, Mathematica, IDL; scripting languages): slower but handy/easier

important to remember architecture to attain maximum performance

# Parallel Computing

multicore: multiple processors on the same chip

shared memory (SMP): all processors have common main memory

Distributed memory: beowulf cluster, parallel clusters w. specialized interconnects

Grid computing: computers communicating over the internet; e.g., SETI@home

GPUs (graphics processing units): driven by games/graphics industry, fast FP operations

Software: MPI (message passing interface; distributed systems), OpenMP (shared memory)

# distributed memory programming model





Halo Update Communications Pattern
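The halo (ghost cell) update can be mimicked in a single process to see what is exchanged. This is only a sketch of the pattern, not MPI code: the chunks stand in for the memory of separate processes, the list copies stand in for messages, and the periodic boundaries and function names are assumptions made for the example.

```python
def split_with_halos(data, n_parts):
    """Split data into n_parts equal chunks, each padded with one ghost cell
    per side. Assumes len(data) is divisible by n_parts."""
    size = len(data) // n_parts
    return [[None] + data[p * size:(p + 1) * size] + [None]
            for p in range(n_parts)]


def halo_update(chunks):
    """Fill every ghost cell from the neighbouring chunk's edge cell
    (periodic boundaries assumed). Only interior cells are read and only
    ghost cells are written, so the update order does not matter."""
    n = len(chunks)
    for p, chunk in enumerate(chunks):
        left = chunks[(p - 1) % n]
        right = chunks[(p + 1) % n]
        chunk[0] = left[-2]    # last interior cell of the left neighbour
        chunk[-1] = right[1]   # first interior cell of the right neighbour
    return chunks


chunks = halo_update(split_with_halos(list(range(8)), 2))
```

After the update each chunk can apply a nearest-neighbour stencil to all of its interior cells without touching another chunk's data; in a real distributed code the two copies per chunk become send/receive pairs.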



# shared memory programming model



### Trends in Supercomputing

http://top500.org/:

list of 500 fastest (based on LINPACK benchmark) computers in the world

shows trends in architecture, interconnects, vendors, etc.



### the list

| Rank | Site                                                                  | Computer                                                                                                 |
|------|-----------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
| 1    | RIKEN Advanced Institute for<br>Computational Science (AICS)<br>Japan | K computer, SPARC64 VIIIfx 2.0GHz, Tofu<br>interconnect<br>Fujitsu                                       |
| 2    | National Supercomputing Center in<br>Tianjin<br>China                 | NUDT YH MPP, Xeon X5670 6C 2.93 GHz,<br>NVIDIA 2050<br>NUDT                                              |
| 3    | DOE/SC/Oak Ridge National<br>Laboratory<br>United States              | Cray XT5-HE Opteron 6-core 2.6 GHz<br>Cray Inc.                                                          |
| 4    | National Supercomputing Centre in<br>Shenzhen (NSCS)<br>China         | Dawning TC3600 Blade System, Xeon X5650 6C<br>2.66GHz, Infiniband QDR, NVIDIA 2050<br>Dawning            |
| 5    | GSIC Center, Tokyo Institute of<br>Technology<br>Japan                | HP ProLiant SL390s G7 Xeon 6C X5670, Nvidia<br>GPU, Linux/Windows<br>NEC/HP                              |
| 6    | DOE/NNSA/LANL/SNL<br>United States                                    | Cray XE6, Opteron 6136 8C 2.40GHz, Custom<br>Cray Inc.                                                   |
| 7    | NASA/Ames Research<br>Center/NAS<br>United States                     | SGI Altix ICE 8200EX/8400EX, Xeon HT QC<br>3.0/Xeon 5570/5670 2.93 Ghz, Infiniband<br>SGI                |
| 8    | DOE/SC/LBNL/NERSC<br>United States                                    | Cray XE6, Opteron 6172 12C 2.10GHz, Custom Cray Inc.                                                     |
| 9    | Commissariat a l'Energie Atomique<br>(CEA)<br>France                  | Bull bullx super-node S6010/S6030<br>Bull                                                                |
| 10   | DOE/NNSA/LANL<br>United States                                        | BladeCenter QS22/LS21 Cluster, PowerXCell 8i<br>3.2 Ghz / Opteron DC 1.8 GHz, Voltaire Infiniband<br>IBM |

### Some Statistics



#### Interconnect Family System Share



#### Processor Generation System Share



#### Operating System Family System Share











# testing performance

[Figure: walltime (s) vs. number of cores, log-log axes, with an IDEAL scaling line. Plot labels: RPI/BG, Recomputed Hamiltonian, 500 Lanczos iterations, Projection, Atoms/Core.]

strong scaling: increase the no. of procs. on a fixed problem-size

poor scaling for small problem size: communication time >> computation time (remember latency?)

computation ~ $N^3$, communication ~ $N^2$

going to bigger problem size helps with communication overhead

weak scaling: keep the problem size per processor the same and increase the total problem-size (& processor count)
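A toy model of the scaling argument above, assuming unit prefactors (computation costs exactly N^3 operations, communication exactly N^2 data accesses) and reusing the 0.1 ns per FLOP and 40 ns per data access quoted earlier as stand-ins for the two unit costs. All names are hypothetical and the numbers are only meant to show the trend.

```python
T_FLOP = 0.1e-9    # seconds per floating point operation
T_ACCESS = 40e-9   # seconds per communicated/accessed datum (assumed)


def compute_time(n):
    """Toy model: computation ~ N^3."""
    return n**3 * T_FLOP


def communication_time(n):
    """Toy model: communication ~ N^2."""
    return n**2 * T_ACCESS


def comm_to_comp_ratio(n):
    """Falls off as 1/N: bigger problems hide the communication cost."""
    return communication_time(n) / compute_time(n)


def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial one."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Fraction of ideal (linear) speedup achieved on n_procs processors."""
    return speedup(t_serial, t_parallel) / n_procs
```

In this model communication and computation cost the same at N = 400, and communication drops to 10% of computation at N = 4000. As a strong-scaling example, a run that takes 100 s on one processor and 2 s on 64 has a speedup of 50 and an efficiency of about 78%.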

# Summary

- just touched the tip of the iceberg
- lot of info online
- modern computer architecture: communication >> computation => cache contiguous data, minimize data access
- parallel systems: programming models
- free tools (e.g., LAPACK) online; no need to reinvent the wheel; good to know basics