## Numerical applications with object-oriented design

### Doing numerics with C++

We developed a templated library of numerical base classes implementing basic data structures such as complex numbers, dynamic vectors, static vectors, and various matrix types (full, band, sparse, etc.). It also includes a representation for tensors and their typical operations: contraction, direct product, and multiplication with contraction. On top of that come standard matrix solvers such as Gauss-Jordan, LU decomposition, and Singular Value Decomposition.
We also provide a set of powerful iterative solvers (Krylov subspace methods) along with preconditioners for numerically more challenging problems, as well as interfaces to netlib libraries such as CLAPACK and SuperLU. These packages are also mirrored here.

C++ has a bad reputation for numerical computing, mainly because of its object-oriented design: straightforward implementations of numerical classes such as vectors and matrices cause a lot of unnecessary copying and therefore decrease performance. This can, however, be avoided by less obvious implementations of these numerical base classes. We chose an implementation based on the "Temporary Base Class Idiom" (TBCI) to avoid the unnecessary copying of objects. It can be thought of as a kind of reference counting done by the compiler at compile time.
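
The core idea can be sketched as follows. This is a minimal illustration, not TBCI's real API: the class names are invented, and C++11 rvalue references stand in for the pre-C++11 trickery the idiom originally used. Arithmetic operators return a distinct "temporary" type, and assignment from that type steals its buffer instead of copying element by element.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Hypothetical sketch of the Temporary Base Class Idiom (names are
// illustrative, not TBCI's real classes).
struct TmpVec;

struct Vec {
    double* data;
    std::size_t n;
    explicit Vec(std::size_t n_) : data(new double[n_]), n(n_) {}
    Vec(const Vec& o) : data(new double[o.n]), n(o.n) {
        std::copy(o.data, o.data + n, data);
    }
    ~Vec() { delete[] data; }
    Vec& operator=(const Vec&) = delete;  // only assignment from temporaries here
    Vec& operator=(TmpVec&& t);           // steals t's buffer, defined below
};

struct TmpVec : Vec {  // same layout, but tagged "this object may be plundered"
    explicit TmpVec(std::size_t n_) : Vec(n_) {}
};

inline Vec& Vec::operator=(TmpVec&& t) {
    std::swap(data, t.data);  // take the temporary's buffer: no element copy
    std::swap(n, t.n);
    return *this;             // t's destructor then frees our old buffer
}

inline TmpVec operator+(const Vec& a, const Vec& b) {
    TmpVec r(a.n);
    for (std::size_t i = 0; i < a.n; ++i) r.data[i] = a.data[i] + b.data[i];
    return r;  // returned as a temporary, ready to be plundered on assignment
}
```

With this, `v = x + y` performs one addition loop plus a pointer swap, rather than an addition loop followed by a copy loop.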

The performance is impressive, and we don't have to hide behind implementations in FORTRAN or C. C++ also allows us to do more optimizations behind the scenes; for example, we defer scalar multiplications so that the operation v = a*x+y (SAXPY: v, x, y vectors, a a scalar) can be done in one loop.
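
The deferred scalar multiplication can be sketched like this (a simplified illustration with invented names, not TBCI's real implementation): `a*x` produces a lightweight proxy instead of a scaled copy, so the whole expression collapses into a single multiply-add loop.

```cpp
#include <cstddef>
#include <vector>

// Proxy recording "scalar a times vector x" without computing anything yet.
struct Scaled {
    double a;
    const std::vector<double>* x;
};

inline Scaled operator*(double a, const std::vector<double>& x) {
    return Scaled{a, &x};  // defer the multiplication
}

// Evaluate a*x + y element-wise in one pass (the SAXPY pattern).
inline std::vector<double> operator+(const Scaled& s, const std::vector<double>& y) {
    std::vector<double> v(y.size());
    for (std::size_t i = 0; i < y.size(); ++i)
        v[i] = s.a * (*s.x)[i] + y[i];  // one multiply-add per element, one loop
    return v;
}
```

Without the proxy, `a*x` would first produce a scaled temporary vector in its own loop, and the addition would need a second loop over memory.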

Since version 2.0, we support multithreading: several CPUs work on parts of the vectors or matrices, realized using POSIX threads. This scales well for very large vectors or matrices; if your data fits into the CPU's cache, you are better off not using multithreading, since many simple arithmetic operations become memory-bandwidth limited once the data no longer fits into the caches.
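
The partitioning idea can be sketched with POSIX threads as follows; this is a hypothetical illustration (TBCI's actual work distribution and thread management differ), where each thread handles one contiguous slice of the vectors.

```cpp
#include <pthread.h>
#include <cstddef>

// Work description for one thread: add slice [lo, hi) of x and y into v.
struct AddJob {
    const double *x, *y;
    double* v;
    std::size_t lo, hi;
};

static void* add_slice(void* p) {
    AddJob* j = static_cast<AddJob*>(p);
    for (std::size_t i = j->lo; i < j->hi; ++i) j->v[i] = j->x[i] + j->y[i];
    return nullptr;
}

// Split v = x + y across nthreads POSIX threads (sketch, capped at 16 threads).
void par_add(const double* x, const double* y, double* v, std::size_t n, int nthreads) {
    if (nthreads < 1) nthreads = 1;
    if (nthreads > 16) nthreads = 16;
    pthread_t tid[16];
    AddJob job[16];
    std::size_t chunk = n / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        // Last thread takes any remainder elements.
        job[t] = AddJob{x, y, v, t * chunk, (t == nthreads - 1) ? n : (t + 1) * chunk};
        pthread_create(&tid[t], nullptr, add_slice, &job[t]);
    }
    for (int t = 0; t < nthreads; ++t) pthread_join(tid[t], nullptr);
}
```

The thread creation and join overhead here is exactly why this only pays off for large data: for cache-resident vectors, the synchronization costs more than the arithmetic saves.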

Since version 2.4, we have optimized vector kernels (unrolled, optionally with prefetch instructions). They are coded in C, so they are portable, and they are better than what most compilers produce from a plain loop.
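
The unrolling idea can be sketched as follows (a simplified 4x-unrolled add loop in plain C style; TBCI's actual kernels are more elaborate and can additionally issue prefetch instructions):

```cpp
#include <cstddef>

// Sketch of a 4x-unrolled vector add kernel: fewer loop-control instructions
// per element, and more independent operations for the CPU to schedule.
void vadd_unrolled(const double* x, const double* y, double* v, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // main loop: four elements per iteration
        v[i]     = x[i]     + y[i];
        v[i + 1] = x[i + 1] + y[i + 1];
        v[i + 2] = x[i + 2] + y[i + 2];
        v[i + 3] = x[i + 3] + y[i + 3];
    }
    for (; i < n; ++i)            // remainder loop for the last 0..3 elements
        v[i] = x[i] + y[i];
}
```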

If you do a vector calculation, e.g. v = x+y, TBCI avoids a memory copy by handing the memory block of the temporary that holds the sum over to v. Compared to a hand-tuned loop, we thus have the overhead of one additional memory allocation and deallocation, but no extra copy. This pays off for large vectors, but for small ones the overhead of calling malloc/free (or rather operator new[] and delete[]) is too high. Small vectors are therefore the worst case for the plain TBCI design, and one may be better off using the fixed-size vector FS_Vector if the size is known at compile time. Otherwise, starting with 2.4, a custom memory allocator that holds a small cache of memory blocks per size has been added to TBCI to keep the overhead of repeated memory (de)allocation low. Let's benchmark it ...
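
The allocator idea can be sketched as a tiny per-size free list; this is a hypothetical illustration (TBCI's real allocator differs in detail and sits behind operator new[]/delete[]): freed blocks of a given size are kept around for reuse, so repeated allocation of small temporaries avoids hitting the general-purpose allocator each time.

```cpp
#include <cstddef>
#include <cstdlib>

// Sketch of a small free-list cache for fixed-size blocks (illustrative only).
struct BlockCache {
    static const int kMax = 8;  // how many freed blocks we keep around
    void* slots[kMax];
    int count;
    std::size_t block_size;

    explicit BlockCache(std::size_t sz) : count(0), block_size(sz) {}

    void* allocate() {
        if (count > 0) return slots[--count];  // reuse a cached block: cheap
        return std::malloc(block_size);        // cache empty: fall back to malloc
    }

    void deallocate(void* p) {
        if (count < kMax) slots[count++] = p;  // keep the block for later reuse
        else std::free(p);                     // cache full: really free it
    }
};
```

For the benchmark below, this is exactly the mechanism that narrows the gap to Meschach for tiny matrices, where allocation cost dominates the arithmetic.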

IBM developerWorks has an article in which matrix libraries are benchmarked. We compare against the winner, Meschach (version 1.2b). Results for TBCI-2.4.0, (CPU) times in s, on a Pentium III at 1000 MHz, Linux-2.4.18, glibc-2.2.5, 512 MB SDRAM (133 MHz), gcc-3.1.1 snapshot; both benchmarks were compiled with -O3 -funroll-loops -march=pentiumpro -fschedule-insns2 -fno-inline-functions -DOPT_ARCH_PENTIUM3 -felide-constructors. The best of three runs is reported here.
**Meschach 1.2b vs. TBCI-2.4: runtime vs. matrix size (PIII, 1 GHz)**

Test1 (1 million C = A+B; matrices filled with 1.0):

| Lib      | 2x2  | 3x3  | 4x4  | 5x5  | 6x6  | 7x7  | 8x8  |
|----------|------|------|------|------|------|------|------|
| Meschach | 0.19 | 0.27 | 0.35 | 0.53 | 0.77 | 0.93 | 1.07 |
| TBCI-2.4 | 0.24 | 0.30 | 0.38 | 0.52 | 0.69 | 0.93 | 1.12 |

Test3 (as Test1, but filled with random numbers):

| Lib      | 2x2  | 3x3  | 4x4  | 5x5  | 6x6  | 7x7  | 8x8   |
|----------|------|------|------|------|------|------|-------|
| Meschach | 0.76 | 1.57 | 2.69 | 4.16 | 6.15 | 8.25 | 10.68 |
| TBCI-2.4 | 0.83 | 1.60 | 2.72 | 4.17 | 6.13 | 8.22 | 10.61 |
The results show that the custom memory allocator brings TBCI close to Meschach's performance for very small data objects, though it is still slightly behind. For larger sizes, the allocator no longer plays a role and the well-optimized vector kernels determine the performance.
The test programs are contained in the lina/test/ subdirectory of the tarball.

Since 2.5.0, the vector kernels use intrinsics to exploit the SIMD (Single Instruction, Multiple Data) instruction sets of modern CPUs; currently, SSE2 is supported on x86 and x86-64. So let's repeat IBM's benchmark on a fast machine supporting SSE2, an Intel Pentium 4 560J (3.6 GHz) in 64-bit mode, single-threaded. All applications were compiled with gcc-4.3, -O3 -funroll-loops -fschedule-insns2 -fno-inline-functions -finline-limit=2400 -ftree-vectorize -march=nocona (and -DOPT_ARCH_PENTIUM4 -DOPT_PENTIUM4 -felide-constructors -fstrict-aliasing for the C++ benchmarks).
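
An SSE2 kernel of this kind can be sketched as follows (simplified; TBCI's real kernels additionally deal with alignment and unrolling). Each 128-bit SSE2 register holds two doubles, so the loop processes two elements per iteration.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (x86/x86-64 only)
#include <cstddef>

// Sketch of an SSE2 vector add: two doubles per 128-bit operation.
void vadd_sse2(const double* x, const double* y, double* v, std::size_t n) {
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d a = _mm_loadu_pd(x + i);        // load two doubles (unaligned-safe)
        __m128d b = _mm_loadu_pd(y + i);
        _mm_storeu_pd(v + i, _mm_add_pd(a, b)); // add both lanes, store result
    }
    for (; i < n; ++i)                          // scalar remainder (odd n)
        v[i] = x[i] + y[i];
}
```
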
**Meschach 1.2b / TBCI-2.5.5 / Blitz-0.9 / FreePooma-2.4.1: runtime vs. matrix size (P4, 3.6 GHz)**

Test1 (1 million C = A+B; matrices filled with 1.0):

| Lib                  | 2x2  | 3x3  | 4x4  | 5x5  | 6x6  | 7x7  | 8x8  | 24x24 | 64x64 |
|----------------------|------|------|------|------|------|------|------|-------|-------|
| Meschach             | 0.04 | 0.07 | 0.09 | 0.18 | 0.19 | 0.26 | 0.30 | 1.63  | 13.32 |
| TBCI-2.5.5 (SSE2)    | 0.06 | 0.08 | 0.08 | 0.11 | 0.14 | 0.16 | 0.19 | 1.19  | 9.56  |
| TBCI-2.5.5 (no SSE2) | 0.07 | 0.10 | 0.10 | 0.13 | 0.15 | 0.18 | 0.20 | 1.33  | 11.96 |
| Blitz-0.9            | 0.02 | 0.05 | 0.08 | 0.14 | 0.20 | 0.25 | 0.31 | 2.62  | 22.44 |
| FreePooma-2.4.1      | 0.50 | 0.59 | 0.71 | 0.80 | 0.96 | 1.13 | 1.35 | 8.19  | 54.12 |

Test3 (as Test1, but filled with random numbers):

| Lib                  | 2x2  | 3x3  | 4x4  | 5x5  | 6x6  | 7x7  | 8x8  |
|----------------------|------|------|------|------|------|------|------|
| Meschach             | 0.20 | 0.47 | 0.71 | 1.07 | 1.50 | 2.01 | 2.55 |
| TBCI-2.5.5 (SSE2)    | 0.23 | 0.49 | 0.75 | 1.09 | 1.52 | 2.03 | 2.59 |
| TBCI-2.5.5 (no SSE2) | 0.27 | 0.51 | 0.78 | 1.14 | 1.56 | 2.07 | 2.64 |
| Blitz-0.9            | 0.20 | 0.46 | 0.74 | 1.11 | 1.56 | 2.10 | 2.68 |
| FreePooma-2.4.1      | 0.69 | 1.00 | 1.25 | 1.62 | 2.11 | 2.66 | 3.27 |
Note that test3 is limited by the random number generator (which is the same for all libraries), whereas test1 is dominated by the time it takes to fill and add the matrices; using SIMD instructions is a clear win there.

For comparison, I also benchmarked Blitz and (Free)POOMA against Meschach and TBCI.
Blitz does very well for very small matrices but does not scale as well as TBCI or Meschach.
For POOMA, I used the Array<2,double> class; this may not be optimal, as I am not experienced with POOMA, and advice on how to make it faster is welcome. So far, its results look poor overall.

Please also have a look at the online documentation (generated by doxygen).

### Downloads

**TBCI NumLib 2.6.3**, released 2011-03-21
- Announcement / Release Notes
- Sources (tar bzip2 archive)
- Autogenerated files (not normally needed!)
- RPMs can be found in the home:garloff:HPC project in the openSUSE Build Service.

**TBCI NumLib 2.6.2**, released 2010-09-09
- Announcement / Release Notes
- Sources (tar bzip2 archive)
- Autogenerated files (not normally needed!)
- RPMs can be found in the home:garloff:HPC project in the openSUSE Build Service.

**TBCI NumLib 2.6.1**, released 2009-08-10
- Announcement / Release Notes
- Sources (tar bzip2 archive)
- Autogenerated files (not normally needed!)
- RPMs can be found in the home:garloff:HPC project in the openSUSE Build Service.

**TBCI NumLib 2.6.0**, released 2009-05-28
- Announcement / Release Notes
- Sources (tar bzip2 archive)
- Autogenerated files (not normally needed!)
- Post-2.6.0 fix (a bug prevented instantiation of solver classes in the library -- KG, 2009-07-06)
- RPMs can be found in the home:garloff:HPC project in the openSUSE Build Service.

**TBCI NumLib 2.5.5**, released 2008-07-22
- Announcement / Release Notes
- Sources (tar bzip2 archive)
- Autogenerated files (not normally needed!)
- RPMs can be found in the home:garloff:HPC project in the openSUSE Build Service.

**TBCI NumLib 2.5.4**, released 2007-12-23
- Announcement / Release Notes
- Sources (tar.gz archive)
- Autogenerated files (not normally needed!)
- RPMs can be found in the openSUSE Build Service.

**TBCI NumLib 2.5.3**, released 2006-11-24
- Announcement / Release Notes
- Sources (tar.gz archive)
- Autogenerated files (not normally needed!)
- Source RPM (for SUSE 10)
- i586 binary RPM (SUSE Linux 10.1/SLES10: gcc-4.1.0, glibc-2.4.0, SMP enabled)
- x86-64 binary RPM (SUSE Linux 10.1/SLES10: gcc-4.1.0, glibc-2.4.0, SMP enabled)
- Source RPM (for SUSE 9 or newer)
- i586 binary RPM (SUSE Linux 9.1/SLES9: gcc-3.3.3, glibc-2.3.3, SMP enabled)
- x86-64 binary RPM (SUSE Linux 9.1/SLES9: gcc-3.3.3, glibc-2.3.3, SMP enabled)
- Other RPMs on request.

**TBCI NumLib 2.5.2**, released 2006-03-07
- Announcement / Release Notes
- Sources (tar.gz archive)
- Autogenerated files (not normally needed!)
- Source RPM (for SUSE 9 or newer)
- i586 binary RPM (SUSE Linux 9.1/SLES9: gcc-3.3.3, glibc-2.3.3, SMP enabled)
- x86-64 binary RPM (SUSE Linux 9.1/SLES9: gcc-3.3.3, glibc-2.3.3, SMP enabled)
- Source RPM (for SUSE 10)
- i586 binary RPM (SUSE Linux 10.1/SLES10: gcc-4.1.0, glibc-2.4.0, SMP enabled)
- x86-64 binary RPM (SUSE Linux 10.1/SLES10: gcc-4.1.0, glibc-2.4.0, SMP enabled)
- i586 binary RPM (SUSE Linux 10.1/SLES10: gcc-4.1.0, glibc-2.4.0, SMP enabled, SSE2 enabled)
- Other RPMs on request.

**TBCI NumLib 2.5.1**, released 2005-06-27
- Announcement / Release Notes
- Sources (tar.gz archive)
- Source RPM (for SUSE 9 or newer)
- i586 binary RPM (SUSE Linux 9.1/SLES9: gcc-3.3.3, glibc-2.3.3, SMP enabled)
- x86-64 binary RPM (SUSE Linux 9.1/SLES9: gcc-3.3.3, glibc-2.3.3, SMP enabled)
- Source RPM (for SUSE 8)
- Other RPMs on request.

**TBCI NumLib 2.5.0**, released 2005-04-26
- Announcement / Release Notes
- Sources (tar.gz archive)
- Source RPM
- i586 binary RPM (SUSE Linux 9.1/SLES9: gcc-3.3.3, glibc-2.3.3, SMP enabled)
- x86-64 binary RPM (SUSE Linux 9.1/SLES9: gcc-3.3.3, glibc-2.3.3, SMP enabled)
- Other RPMs on request.

**TBCI NumLib 2.4.4**, released 2004-05-08
- Announcement
- Sources (tar.gz archive)
- Source RPM
- i586 binary RPM (SuSE Linux 8.1: gcc-3.2.2, glibc-2.2.5, SMP enabled)
- x86-64 binary RPM (SLES8: gcc-3.2.2, glibc-2.2.5, SMP enabled)
- Other RPMs (e.g. alpha RPMs) on request.

**TBCI NumLib 2.4.3**, released 2003-08-09
- Announcement
- Sources (tar.gz archive)
- Source RPM
- i586 binary RPM (SuSE Linux 8.1: gcc-3.2.2, glibc-2.2.5)
- i586 binary RPM with SMP support enabled
- Other RPMs (e.g. Linux x86_64 or alpha RPMs) on request.

**TBCI NumLib 2.4.2**, released 2003-03-14
- Announcement
- Sources (tar.gz archive)
- Source RPM
- i586 binary RPM (SuSE Linux 8.1: gcc-3.2, glibc-2.2.5)
- Other RPMs (e.g. Linux CXX-6.3.9.10 alpha RPM) on request.

**TBCI NumLib 2.4.1**, released 2002-10-11
- Announcement
- Sources (tar.gz archive)
- Source RPM
- i386 binary RPM (SuSE Linux 8.0: gcc-2.95.3, glibc-2.2.5)
- i586 binary RPM (SuSE Linux 8.1: gcc-3.2, glibc-2.2.5)
- Other RPMs (e.g. Linux CXX-6.3.9.10 alpha RPM) on request.

**TBCI NumLib 2.4.0**, released 2002-08-01
- Announcement
- Sources (tar.gz archive)

**TBCI NumLib 2.3.2**, released 2002-05-10
- Announcement
- Sources (tar.gz archive)
- Source RPM
- i386 binary RPM (SuSE Linux 7.3: gcc-2.95.3, glibc-2.2.4)

**TBCI NumLib 2.3.1**, released 2002-04-27
- Announcement
- Sources (tar.gz archive)

**TBCI NumLib 2.3.0**, released 2001-12-04
- Announcement
- Sources (tar.gz archive)

**TBCI NumLib 2.2.1**, released 2001-07-03
- Announcement
- Sources (tar.gz archive)

**TBCI NumLib 2.2.0**, released 2001-07-02
- Announcement

**TBCI NumLib 2.1.2**, released 2001-06-20
- Announcement
- Sources (tar.gz archive)

**TBCI NumLib 2.1.1**, released 2001-06-15
- Announcement

**TBCI NumLib 2.1.0**, released 2001-03-15
- Announcement

**TBCI NumLib 2.0.2**, released 2000-11-04
- Announcement
- Sources (tar.gz archive)

**TBCI NumLib 2.0.1**, released 2000-09-04
- Announcement
- Sources (tar.gz archive)

**TBCI NumLib 2.0.0**, released 2000-08-22
- Announcement

The packages contain compiled libraries with the instantiated code for all TBCI classes. Normally, you don't need to link against these libraries, as your compiler automatically instantiates the templates it uses through some mechanism (repository, recompilation, instantiating everywhere and discarding duplicates at link time, ...). However, you can save compilation time with the GNU compiler by using -fno-implicit-templates (in case you explicitly instantiate your own templates as well) or -fexternal-templates, and then linking against libtbcidouble. Read the "Template Instantiation" section in the GNU compiler manual or the corresponding section in your favourite C++ compiler's manual.
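
Explicit instantiation looks like this in general (the `Vector<>` class here is illustrative, not necessarily TBCI's): with g++ and -fno-implicit-templates, a definition like the one at the bottom must appear in exactly one translation unit, and every other translation unit just links against the emitted code.

```cpp
// A templated numerical class (illustrative stand-in for a library class).
template <class T>
struct Vector {
    T sum(const T* p, int n) const {
        T s = T();
        for (int i = 0; i < n; ++i) s += p[i];
        return s;
    }
};

// Explicit instantiation: force the compiler to emit Vector<double>'s code
// in this translation unit. Files compiled with -fno-implicit-templates can
// then use Vector<double> and resolve it at link time.
template struct Vector<double>;
```

This is exactly what a prebuilt library like libtbcidouble provides: the explicit instantiations for the common element types, so user code need not re-instantiate them.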
Some code is not templated, namely the SMP support code, the special functions, and the LAPACK and SuperLU interfaces. For these you always need to link the libraries or objects.

Documentation explaining the concepts of the design is included in the package: a README file in the main directory and a LaTeX file in the doc/ directory. For those lacking a LaTeX installation, a PDF file can be found here. Install Adobe's Acrobat Reader(R) (also available as a Netscape plugin) or xpdf to display it.

The doxygen-generated online documentation is also available.