Speed test using spinpack (under construction 2003/01/17)

Here you can find information on how to measure the speed of your machine running spinpack.

First download spinpack-2.15.tgz (or a newer version). Uncompress and untar the file, configure the Makefile, compile the sources, and run the executable. Here is an example:

       gunzip -c spinpack-2.15.tgz | tar -xf -
       cd spinpack

       # --- small speed test --- (1CPU, MEM=113MB, DISK=840MB, nud=30,10)
       ./configure --nozlib
       make speed_test 
       sh -c "( cd ./exe; time ./spin ) 2>&1 | tee speed_test_small" 

       # --- big speed test --- (16CPUs, MEM=735MB, DISK=6GB, nud=28,12)
       ./configure --mpt --nozlib
       make speed_test; grep -v small exe/daten.i1 >exe/daten.i
       sh -c "( cd ./exe; time ./spin ) 2>&1 | tee speed_test_big"
   

Send me the output files together with the characteristic data of your computer for comparison. Please also add the output of grep FLAGS= Makefile and the file cpu.log, if available.
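
To make this easier, the relevant files can be packed into one archive before mailing. This is only a sketch; the output file names come from the commands above, and the location of cpu.log may differ on your system:

    # pack the benchmark outputs and the compiler flags for mailing;
    # only files that really exist are added to the archive
    grep FLAGS= Makefile > flags.txt
    tar -cf speed_results.tar flags.txt
    for f in speed_test_small speed_test_big cpu.log; do
      test -f $f && tar -rf speed_results.tar $f
    done
    gzip speed_results.tar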

Computation time

The next table gives an overview of the computation time for an N=40 site system (used for the speed test). The first column lists the numbers of up- and down-spins (nud) given in daten.i. The other columns list the time needed for writing the matrix (SH) and for the first 40 iterations (i=40), as shown in the output. The star (*) marks the default configuration used by make speed_test (see above). The double star (**) marks an example of the big speed test (see above).

    nud    SH-time  i=40-time CPUs machine          time=[hh:]mm:ss(+-ss) v2.15
   ------+---------+---------+--+---------------------------------------
    32,8     1:42(2)  6:07(3) 1  Celeron-1.1GHz-gcc (zlib) disk=22MB/s .
    32,8     1:43     5:51    1  Pentium-1.7GHz-gcc (zlib)
    32,8     1:27     4:46    1  Celeron-1.1GHz-gcc .
    32,8     1:34     4:01    1  Pentium-1.7GHz-gcc
    32,8     1:46     4:21    1  Alpha-731MHz-cxx GS160
    32,8     0:18     3:12    16 Alpha-731MHz-cxx GS160 (2 users)  .
    32,8     0:18     1:33    16 Alpha-731MHz-cxx GS160 (64 threads)
    32,8     1:30  1:01:19    1  Alpha-731MHz-cxx GS160 (maxfile=0)
    32,8     0:26    14:59    8  Alpha-731MHz-cxx GS160 (maxfile=0)
    32,8     0:19     9:08    16 Alpha-731MHz-cxx GS160 (maxfile=0)
    32,8     0:18     7:45    16 Alpha-731MHz-cxx GS160 (maxfile=0, 32 threads)
    32,8     0:15     7:33    16 Alpha-731MHz-cxx GS160 (maxfile=0, 64 threads)
    32,8     4:39    20:09    1  MIPS-250MHz-CC O2100 (zlib)
    32,8     2:45    18:02    2  MIPS-250MHz-CC O2100 (zlib)
    32,8     2:51    16:07    4  MIPS-250MHz-CC O2100 (zlib, 4 user, zip=38MB/10s cat=0.4s)
    32,8     1:19    30:48    8  MIPS-250MHz-CC O2100 (zlib)
    32,8     1:08    18:25    8  MIPS-250MHz-CC O2100 (zlib,  32 threads)
    32,8     1:18    14:59    8  MIPS-250MHz-CC O2100 (zlib, 128 threads)
    32,8     2:13 01:25:41    4  MIPS-250MHz-CC O2100 (maxfile=0, 4 user)
    32,8     1:13    37:39    8  MIPS-250MHz-CC O2100 (maxfile=0,  8 threads)
    32,8     1:01    28:58    8  MIPS-250MHz-CC O2100 (maxfile=0, 16 threads)
    32,8     1:08    24:34    8  MIPS-250MHz-CC O2100 (maxfile=0, 32 threads)
    32,8     5:13    18:21    1  MIPS-194MHz-CC -64 .
    32,8     3:40    18:45    2  MIPS-194MHz-CC -64 (zlib) .
    32,8     2:19    17:55    4  MIPS-194MHz-CC -64 (zlib)
    32,8     2:27    23:12    8  MIPS-194MHz-CC -64 (6 users) .
    32,8     2:16    20:51    8  MIPS-194MHz-CC -64 (zlib, 6 users) .
    30,10     23m      76m    1  Pentium-1.7GHz-gcc (zlib)
    30,10     21m      50m    1  Pentium-1.7GHz-gcc
  * 30,10     24m      64m    1  Alpha-731MHz-cxx GS160
    30,10    7:48    53:00    10 Alpha-731MHz-cxx GS160
    30,10    3:50    18:23    16 Alpha-731MHz-cxx GS160 ( 64 threads)
    30,10    3:33    15:19    16 Alpha-731MHz-cxx GS160 (128 threads)
    30,10    3:37    16:41    16 Alpha-731MHz-cxx GS160 (128 threads, -O3)
    30,10    4:24    19:51    16 Alpha-731MHz-cxx GS160 (128 threads, zlib)
    30,10 1:01:10  4:25:28    1  MIPS-250MHz-CC -O3 O2100 (zlib)
    28,12     24h      40h    1  MIPS-250MHz-CC     O2100 (v1.4)
    28,12    171m       7h    1  Pentium-1.7GHz-gcc
    28,12      5h      10h    1  Alpha-731MHz-cxx GS160
    28,12     81m     5.7h    9  Alpha-731MHz-cxx GS160
 ** 28,12   57:39  5:29:57    16 Alpha-731MHz-cxx GS160 (16  threads)
    28,12   59:22  2:51:54    16 Alpha-731MHz-cxx GS160 (128 threads) .
    27,13      9h        -    1  Alpha-731MHz-cxx GS160
    27,13     53h      96h    1  MIPS-250MHz-CC O2100 (v1.4)
    26,14    107h     212h    1  MIPS-250MHz-CC O2100 (v1.4)
    20,20     21h     515h    1  MIPS-250MHz-CC O2100 (v1.4)

The next figure shows the computing time for different older program versions and computers (I update it as soon as I can). The computing time depends nearly linearly on the matrix size n1 (time is proportional to n1^1.07; n1 is labeled n in the figure). A small extrapolation example follows the figure.

[4kB png image: computing time vs. matrix size n]
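
As a rough check of this law, a measured time can be extrapolated to another matrix dimension. The sketch below takes the nud=30,10 point of the table above (about 64 minutes for i=40 on 1 Alpha CPU, n1=5.3e6) and estimates the nud=28,12 case (n1=35e6); the measured value of about 10h on the same machine shows that the law is only a rough guide:

    # extrapolate the iteration time via time ~ n1^1.07 (rough estimate only)
    awk 'BEGIN { t1=64; n1a=5.3e6; n1b=35e6;
                 printf "estimated time: %.0f min\n", t1*(n1b/n1a)^1.07 }'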

Memory and disk usage

Memory usage depends on the matrix dimension n1. For the N=40 sample, two double vectors and one 5-byte vector are stored in memory, so we need n1*21 bytes, where n1 is approximately (N!/(nu!*nd!))/(4N). Disk usage is mainly the number of nonzero matrix elements hnz times 5 bytes (the disk size of tmp_l1.dat is 5*n1 and is not included here). The number of nonzero matrix elements hnz depends on n1 as hnz=10.4*n1^1.07, which was found empirically. Here are some examples (a small estimation script follows the table):

  nu,nd     n1   memory     hnz    disk  (zip)  (n1*21=memory, hnz*5=disk)
  -----+---------------+----------------------
  34,6    24e3    432kB   526e3   2.6MB 1.3MB 
  32,8   482e3     11MB    13e6    66MB  34MB
  30,10  5.3e6    113MB   168e6   840MB 444MB   small speed test
  28,12   35e6    735MB   1.2e9     6GB         big speed test
  27,13   75e6    1.4GB   2.8e9    14GB 
  26,14  145e6    2.6GB   5.5e9    28GB 
  20,20  431e6    7.8GB    18e9    90GB 
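
The following small script reproduces these estimates from the formulas given above. It is only a sketch; the factor 4N and the exponent 1.07 are approximations, so the results deviate slightly from the measured values in the table:

    #!/bin/sh
    # estimate n1, memory and disk usage for N sites with nu up-spins
    # (nd = N - nu), using the approximations given above:
    #   n1  ~ binomial(N,nu)/(4N),  memory ~ n1*21 bytes,
    #   hnz ~ 10.4*n1^1.07,         disk   ~ hnz*5 bytes
    N=40; nu=30
    awk -v N=$N -v nu=$nu 'BEGIN {
      n1 = 1;
      for (i = 1; i <= nu; i++) n1 = n1 * (N - nu + i) / i;  # binomial(N,nu)
      n1 = n1 / (4 * N);                                     # symmetry reduction
      hnz = 10.4 * n1^1.07;                                  # empirical fit
      printf "n1=%.2g  memory=%.0fMB  hnz=%.2g  disk=%.0fMB\n",
             n1, n1 * 21 / 1e6, hnz, hnz * 5 / 1e6 }'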

CPU load

A typical CPU load for an N=40 site system looks like this:

[4kB png image: CPU load over time]

The data are generated using the following tiny script:

   #!/bin/sh
   # print the ps line for one process every 30 seconds until it terminates;
   # grep -v CPU removes the repeated ps header lines from the output
   while ps -o pid,pcpu,time,etime,cpu,user,args -p 115877;\
     do sleep 30; done | grep -v CPU
   

Here 115877 is the PID of the process; replace it with the PID of your own run. Alternatively, you can activate a script from daten.i (edit it). The machine was used by 5 users, so the peak load is only about 12 CPUs. 735MB of memory and 6GB of disk space were used. You can see the initialization (20min), the matrix generation (57min) and the first 4 iterations (4x8min). The matrix generation depends mostly on CPU power. The iteration time depends mainly on the disk speed (try: time cat exe/tmp/ht* >/dev/null, see below) and on the speed of random memory access. You can improve the disk speed using striped disks or files (AdvFS). The maximum number of threads was limited to 16, but this can be changed (see src/config.h).
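
The disk check mentioned above can be run like this (from the spinpack directory; exe/tmp is the path used by the speed test):

    # read the stored matrix files sequentially and discard the data;
    # the total size of exe/tmp/ht* divided by the elapsed time gives the disk read rate
    time cat exe/tmp/ht* >/dev/null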

Why is the multi-processor scaling so bad?

During the iterations the multi-processor scaling is poor on most machines -- why? I guess this is caused by the random read access to the vector a (see the picture below). I thought a shared-memory computer should not have such scaling problems here, but probably I am wrong. In the future I will try to solve this problem.

[6kB png image: dataflow during the iterations for 2 CPUs]