We developed a templated library of numerical base
classes which implement basic data structures like complex numbers,
dynamic vectors, static vectors, different types of matrices like
full matrices, band matrices, sparse matrices, etc. and also included
a representation for Tensors and its typical operations like
contraction, direct product and multiplication with contraction.
Further-on, some standard matrix solvers like Gauss-Jordan,
LU-decomposition and Singular Value Decomposition.
We also have a set of powerful iterative solvers (Krylov subspace
methods> along with preconditioners to solve numerically more
challenging problems. There's also interfaces to netlib libraries
such as CLAPACK
or SuperLU.
These packages are also mirrored here.
C++ has a bad reputation for numeric calculations which is mainly caused by its object oriented design. Using straightforward implementations of numerical classes such as Vectors, Matrices etc., will cause a lot of unnecessary copying and therefore decrease performance. This however can be avoided through not as obvious implementations of these numerical base classes as is usually the case. We chose an implementation based on the «Temporary Base Class Idiom» TBCI to avoid the unnecessary copying of objects. It can be thought of as some sort of reference counting done by the compiler at compile time.
The performance is impressive and we don't have to hide behind implementations in FORTRAN or C. C++ also allows us to do more optimizations behind the back; we e.g. defer scalar multiplications to be able to do the v = a*x+y (SAXPY, v, x, y vectors, a is a scalar) operation in one loop.
Since version 2.0, we support multithreading. Here, several CPUs work on part of the vectors or matrices; this has been realized using POSIX threads. It scales well for very large vectors or matrices; if your data fits into the cache that's on the CPU, you better don't use multithreading. Many simple arithmetic operations are memory bandwith limited once you don't fit into the caches any more.
Since version 2.4, we have optimized vector kernels (unrolled and optionally with prefetch instructions). They are coded in C, so they are portable. These are better than what most compilers produce out of a plain loop.
If you do a vector calculation, e.g. v = x+y, the way TBCI avoids a copy of memory is by assigning the memory block from the temporary that holds the sum to v; so compared to a hand-tuned loop, we have the overhead of an additional memory allocation and deallocation, but no extra copy. This pays off for large vectors, but for small ones the overhead of calling malloc/free (or rather operator new[] and delete[]) is too high. So small vectors are the worst case for the plain TBCI design, and one might be better off using the fixed sized vector FS_Vector is the size is known at compile time. Otherwise, starting with 2.4, a custom memory allocator that holds a small cache of memory blocks of a size has been added to TBCI to make the overhead of repeated memory (de)allocation low. Let's benchmark it ...IBM Developerworks has an article, where Matrix libraries are benchmarked. We compare against the winner, Meschach (version 1.2b): Results for TBCI-2.4.0, (CPU) times in s, iPIII-1000, Linux-2.4.18, glibc-2.2.5, 512 SDRam (133), gcc-3.1.1 snapshot, both benchmarks compiled with -O3 -funroll-loops -march=pentiumpro -fschedule-insns2 -fno-inline-functions -DOPT_ARCH_PENTIUM3 -felide-constructors. The best of three runs is reported here.
| Lib | Test1 (1Mio C = A+B; filled w/ 1.0) | Test3 (as Test1, but filled w/ random no) | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2x2 | 3x3 | 4x4 | 5x5 | 6x6 | 7x7 | 8x8 | 2x2 | 3x3 | 4x4 | 5x5 | 6x6 | 7x7 | 8x8 | |
| Meschach | 0.19 | 0.27 | 0.35 | 0.53 | 0.77 | 0.93 | 1.07 | 0.76 | 1.57 | 2.69 | 4.16 | 6.15 | 8.25 | 10.68 |
| TBCI-2.4 | 0.24 | 0.30 | 0.38 | 0.52 | 0.69 | 0.93 | 1.12 | 0.83 | 1.60 | 2.72 | 4.17 | 6.13 | 8.22 | 10.61 |
Since 2.5.0, we have intrinsics in the vector kernels to make use of the SIMD (Single Instruction, Multiple Data) instructions sets of modern CPUs. Currently, SSE2 is supported on x86 and x86-64. So let's repeat IBM's benchmark on a fast machine supporting SSE2, an intel Pentium4 560J (3.6GHz) in 64bit mode, single-threaded. All applications compiled with gcc-4.3, -O3 -funroll-loops -fschedule-insns2 -fno-inline-functions -finline-limit-2400 -ftree-vectorize -march=nocona (and -DOPT_ARCH_PENTIUM4 -DOPT_PENTIUM4 -felide-constructors -fstrict-aliasing for the C++ benchmarks)
| Lib | Test1 (1Mio C = A+B; filled w/ 1.0) | Test3 (as Test1, but filled w/ random no) | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2x2 | 3x3 | 4x4 | 5x5 | 6x6 | 7x7 | 8x8 | 24x24 | 64x64 | 2x2 | 3x3 | 4x4 | 5x5 | 6x6 | 7x7 | 8x8 | |
| Meschach | 0.04 | 0.07 | 0.09 | 0.18 | 0.19 | 0.26 | 0.30 | 1.63 | 13.32 | 0.20 | 0.47 | 0.71 | 1.07 | 1.50 | 2.01 | 2.55 |
| TBCI-2.5.5 (SSE2) | 0.06 | 0.08 | 0.08 | 0.11 | 0.14 | 0.16 | 0.19 | 1.19 | 9.56 | 0.23 | 0.49 | 0.75 | 1.09 | 1.52 | 2.03 | 2.59 |
| TBCI-2.5.5 (no SSE2) | 0.07 | 0.10 | 0.10 | 0.13 | 0.15 | 0.18 | 0.20 | 1.33 | 11.96 | 0.27 | 0.51 | 0.78 | 1.14 | 1.56 | 2.07 | 2.64 |
| Blitz-0.9 | 0.02 | 0.05 | 0.08 | 0.14 | 0.20 | 0.25 | 0.31 | 2.62 | 22.44 | 0.20 | 0.46 | 0.74 | 1.11 | 1.56 | 2.10 | 2.68 |
| FreePooma-2.4.1 | 0.50 | 0.59 | 0.71 | 0.80 | 0.96 | 1.13 | 1.35 | 8.19 | 54.12 | 0.69 | 1.00 | 1.25 | 1.62 | 2.11 | 2.66 | 3.27 |
For comparison, I also benchmarked
Blitz
and (Free)POOMA against Meschach and TBCI.
Blitz does very well for very small matrices but does not scale
as well as TBCI or Meschach.
For POOMA, I used the Array<2,double> class; this
may not be optimal, I'm not experienced with POOMA. Advice on how
to make it go faster is welcome. The results look poor so far overall.
Please also have a look at the online documentation (generated by doxygen).
A note about the RPMs:
They contain compiled libraries, which contain the instantiated code
for all TBCI classes. Normally, you don't need to link against these
libraries though, as your compiler automatically instantiates the templates
that it uses through some mechanism (repository, recompilation, always
instantiating and discarding at runtime, ...). However, you may want to
save compilation time by using -fno-implicit-templates (in case
you explicitly instatiate your own templates as well) or with
-fexternal-templates using the GNU compiler to save compilation
time and then link against libtbcidouble. Read the "Template Instantiation"
section in the GNU compiler manual or similar sections in your favourite
C++ compiler's manual.
Some code is not templated, i.e. the SMP support code, the special
functions, the Lapack and the SuperLU interface. There you always need to
link the libraries or objects.
The doxygen generated online docu is also available.
For further information please contact
K. Garloff.
Please send suggestions, comments and bug reports to
the TBCI List <numerix at plasimo dot phys dot tue dot nl>
Please send requests to subscribe to this list to
<numerix dash request at plasimo dot phys dot tue dot nl>.
We strongly encourage you to subscribe to this list, if you are
interested in this library.
Legal stuff: The TBCI numerical library is licensed under the GNU LGPL license.
![]()
10th of October 2002