presentation

---

# Profiling and code optimization

## Radovan Bast

Licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
Code examples: [OSI](http://opensource.org)-approved [MIT license](http://opensource.org/licenses/mit-license.html).

---

## When do we want to optimize the code?

- Code is too slow
- Memory demand is too high

---

## When do we want to optimize the code?

- Code is too slow
- Memory demand is too high

### Things we can do about it

- Buy faster computer
- Buy more memory
- Help the compiler
- Change algorithm
- Change data structures (optimize data access)
- Port to a "faster language"
- Parallelize
- Port to another platform (GPU, accelerators)

---

## Profile first, optimize later

- Before you invest money or time, identify the problem
- Here insert famous Donald Knuth quote
- Before solving the problem, verify whether it is worth it

---

## Profiling

- Benchmark and profile before optimizing

---

## Profiling

- Where does the program spend most time?
- Dynamic program analysis (runtime)
- Statistical sampling or event-based
- Profiling perturbs the calculation
- Profiled calculation should be representable

---

## Profilers

## Statistical (sampling)

- Collect data at regular intervals
- Typically small overhead
- Not all code is seen
- May differ between execution
- Show bottlenecks in production code
- Can be used for monitoring over long execution runs
- Reports on external library functions

## Event-based (tracing)

- Collect data (traces) for pre-defined events
  (entering or leaving functions, object allocation, communication, disk I/O)
- Modify the code
- Often large overhead
- Selective

---

## Profilers

- "Home use", free
    - Gprof
    - Callgrind (Valgrind)
    - Python: cProfile, profile
    - OProfile

- HPC, free
    - HPCToolkit
    - DynInst
    - Extrae/Paraver
    - Scalasca
    - TAU

- Commercial
    - Vampir
    - Allinea Performance Reports
    - Allinea MAP
    - CrayPat
    - VTune Amplifier (Intel)

---

## How would you profile stranded on an island?

---

## How would you profile stranded on an island?

- Insert timers!
- Everybody can do that
- In any language
- It is good to have timings anyhow in the output

---

## Before solving the problem, verify whether it is worth it

- Human time often more valuable than CPU time
- We often spend more time reading and writing code than running it
- Compare your programming time multiplied with your salary to the cost of CPU
  time that you will save
- Is a 20% faster code worth 2 months of work?
- Does it really matter whether the run takes 1 minute or 2 minutes
- It probably matters if the computation takes 3 months
- If your code is used by 5000 users then 20\% speedup may be worth it

---

## Performance vs. code obfuscation

- When you optimize for speed/memory, typically you increase complexity
- Imagine your code runs 40% faster but nobody understands it
- If you have to complicate the code, keep the complexity localized
- Define your priorities
- Performance is typically not portable

---

## Complexity complexity complexity

- We do not know the languages of tomorrow
- We do not know the hardware of tomorrow
- Code complexity may prevent us from porting our codes to soft- and hardware of tomorrow
- Keep it simple as long as you can

---

## Buy faster computer

- Moore's law
- Frequency race
- "The party isn't exactly over, but the police has arrived, and the music has been turned way down" (P. Kogge)
- Today we need to think about parallelization

---

## Buy more memory

- Number of cores/threads scales faster than memory
- Today we need to think about parallelization
- Linear scaling requires localization of memory access
- Use tools to identify memory bottlenecks

---

## Help the compiler

- Restrict types (Cython)
- Use pragmas
- Tell the compiler where the bottlenecks are
- Tell the compiler where vectorization is possible
- Compiler flags
- Instruction set

---

## Change algorithm

- Consider increase in complexity
- Pre-factor vs. scaling
- "Slow" algorithm may catch the worm
- Easy algorithm first
- YAGNI

---

## Change data structures (optimize data access)

- Latency: https://gist.github.com/jboner/2841832
- Cache hierarchy
- Cache line
- Gorilla
- MPI and scaling (disk access serializing)
- Swapping

---

## Refactoring

- Natural process
- "In the long run every program becomes rococo - then rubble."
- http://pu.inf.uni-tuebingen.de/users/klaeren/epigrams.html
- Needs tests

---

## Recomputing vs. remembering

- Sometimes recompute faster than reading from disk
- Memoization

---

## Port to a "faster language"

- Every language allows to write slow code
- Pre-factor vs. scaling
- "Slow" language may catch the worm
- Combine languages (FFI)
- Functional programming vs. OOP

---

## Prototype first with "easy" language

- Matlab
- Python
- Julia

---

## Parallelize

- Amdahl's law

- Serial and parallel part scale differently w.r.t. system size

---

## Typical problems with calculations that do not scale

- Parallel region too small
- Serialization of parallel tasks
- I/O heavy code
- Communication overhead
- Work load imbalance
- Resource imbalance
- Memory-bound code
- Sub-optimal use of libraries
- System size too small
- Wrong code

---

## Port to another platform (GPU, accelerators)

- Moving target
- Trend goes towards multi-processor and multi-thread
- If you rewrite your data structures for GPU/accelerators, the effort
  is most probably not wasted

### Before porting

- Restructure mathematical formulation
- Innovate at algorithm level
- Then maybe port

---

## Do not reinvent the wheel

- Use libraries if you can
- Libraries are dependencies

---

## General

- Use elemental functions
- Use pure functions/subroutines (Fortran)
- Use intrinsics
- Beware of multiple loops
- Do not allocating/deallocate inside multiple loops

---

## Example: profiling performance and memory in Python

- https://gist.github.com/bast/e37ee29036d4f43fee4d
- https://gist.github.com/bast/1bd41989a975a999d653

---

## Conclusions

- Profile first, optimize later
- When you scale the system, new bottlenecks will appear