Sorry for the language mix (German, English).
 ToDo1503: test-suite NN=40,64,128 * JJ,tJ,tU * dietlibc,mpi,pthread, ca.50% MemSH
     test on knoppix64 as 32bit=(-m32 -static) 64bit=(-m64 -L/usr/lib64) g++ gcc
  test NN=512
 ToDo: check src/*.[ch] ToDo's, maybe some are easy to fix
 ToDo17: add NZX_TYPE nzx[] int,long(NZXMAX) like l1, remove XY_FLAG
         for simpler code + better optimization
 ToDo2017: ansatz-auto_check, last check v1.4.7 2001 (for bigsys-speed_test?)
 ToDo: add opt_performance_estimate_only store AH as n1 subset only 
 ToDo: see y2016/lrz* use both maxhfile for disk, maxhmem for mem 
  test mpt icpc (karman=AMD) -ftree-vectorize ignored, 64threads=Ok
    module load torque;module load intel/cc/10.1.008 # dflt=32bit
    module load torque;module load intel/cce/10.1.008 # dflt=64bit
      -march=core2 -mcpu=core2 -static  # produces 32bit +SSE3 (dflt=pentium4)
         PC10.bn8=29ns (but PC10.gcc493.bn8=22ns)
     -xO (core2-duo+sse3) but PC10.is_dual-core
     -xP (core-duo+sse3) PC10.bn8=29ns
  vm8: CC1=gcc CC2=dietc gcc CC3=gcc -std=c99 CC4=clang CC5=tinycc CC6=gcc41
       noSym(div0?) LMsym(sameE) n1=0 4parametersets(memleak?)
     test partly stored SH! using hfile! + 2 threads + mpi(if defined?)
     test 2*a0, test sizeof(t1.xy[0]) = thxy = int128
      e.g. testJJ64cplx32SH0
       sym_k=4 2; n1=0 (sym_k=2, 0+40), with/without symmetry (k=-1000) 
       ulimit<SHsize<maxfile:   nonmpi=old=FAIL,new=OK mpi=hang
       maxfile<min(ulimit,SHsize): hfmax=4e6(94%) nonmpi=OK mpi=?
      int32 (-m32) and n1>2e9 (1 bit for next line (remove!?)) ???
 ToDo2016-01: set MPICC + CC (make test)
 ToDo2016-01: better mix-mode (bigger mpi-packets)
      (do noNZblks + noXY_FLAG before?)
 ToDo: B_NUM = CONFIG_PTHREAD + error if CONFIG_PTHREAD==1
       use OMP_NUM_THREADS if -tN and openmp not used
       use GOMP_CPU_AFFINITY='0,2,4,6' (gcc+intel)
 ToDo16: even with NoS1SYM, do not include s1 products in the sym search (speed-up)
   N=35*(many s=1) 16Ksyms++ meggie.ri 2016-11-18 1d octahedral chain a la Strechka
 ToDo1601:
 - for eigenvectors + LM show SS and JJ for each LM-cluster (max?)
 -set type_lm to useful values 
   - add tU/tJ/JJ lm-tests to make test chain=5 s=? see sym_tU.txt
    largest doable tU 2s=4 (can contain . u d 3) chain without S1sym in 10min?
      statH 4*4 submatrix num_elements
    check biggest lc 2s=4 maxN or N=8 maxS Tnorm2 needed!
    test max lm_factor using lm_factor_double, speed __int128 vs. double?     
 - check OVL+speed chain N=12 2S=7, cubocta N=28 2S=2
 ToDo 2015-12:
 - stability_Status/OpenProblems/ToDo release v2.50 as stable base!!!
   - simplify S1SYM-code (+check lowest nzx) in hamilton_nhv +tU 2.51?
    store the problems found so far as tests!
    - test BFLY code speed (lc_s1 + s5/2 ico) ok, but ns.r too slow vs. SH!
      short look = missing sign code for tU, improvement not clear + 2.51 else
       analyze on small model 
       check if we have !ismin_lm on test !!!
  - parallel ns.recursive!
  - on-same-node-mpi is slower (smaller pckts) and consumes more energy
        ToDo: pthread or OMP within node (bigger packets, only one core polling)
   - mpi+bigmalloc (90%) + access often causes SEGFAULT instead of failed malloc
     ToDo: big malloc (90%) before, not after initialization  (difficult)
              2.51
        check with t100!
   - use set/getrlimit() for SH-storage (remove HBLen, use (HBLen=)AH*NZX)
    getrlimit? estimate via scaling higher Sz cfgs/symcfgs or outputMC-quote
      test in configure
   - l1 write node i (+node caching), read node j often "read-error" + abort
     ToDo: write node i, read node i + send to j via MPI (avoid w/r via share)
     check! 2.50 tbase.c:load_l1(*l1,len,ofs) mymap(FILE_l1,ofs) = bad for MPI?
     but better than 2000 files? (not robust!) 2.51? 2.50?
    - give a hint that parallel ns is too slow above ncfgs>1e15 (30h*1000cores)
       see S=1-LC28 ncfgs=7.65e+15 144*16t100 > 40h
        on serial ns let the other tasks idle? or better use OMP right away?
        for ^PRF: cut all time-dependent outputs in the middle of the text!
         for diffs! or simply cut off t/m=? PRF for grep performance?
   - benes for tU, tJ (no speedy sign code) correctness tested for N=4
       N=16 8+8 t=-1 U=8 gcc492-O1 cxx0_1+ n1=2.58e6 ns=0.17m SH=1.18m E=-4.17493
           testcxx8,cxx0(2m)  + LM?
   - remove HBLen, use AH*NZX-packets+size (ifundef HB=AH*NZX ??)
    - change search, store and use of LM (more regularly) ??? 
      e.g.: as shift_to_next_site_in_lm-cluster=N + shift_to_next_lm_cluster=1 (defs1)
      loop of symsearch never between higher sites in lm-cluster (faster without S1SYM)
     what about s1-s3-s1-s3 chains? lm1.i0 lm1.i1 lm2.i0 lm2.i1 lm2.i2 ...
                                    lm3.i0 lm3.i1 lm4.i0 lm4.i1 lm4.i2 (-a format)
     or list of different lm-clusters.i0-idx + lm.len + next.lm.i0
   - measure mpi_sendrecv-time but stop if max .lt. 1s or 1it (less syscalls)
   - ibtraffic available at linux? like ifconfig Bytes? 
      /usr/sbin/perfquery -x
      n114: 79MB/s TX + 79MB/s RX  811kpcks/s ca. 100B/pkt = full load 72nodes
      mpi_stress: 128B / 76MB/s, 1k/415MB/s, 16K/1.5GB/s, 1M/2.2GB/s 36nodes
      - visual output (matrix) of traffic_sum node i to j, or min/max
    - 2.48 NN=84 fails to compile (s=7/2)
    - mix mpi+pthread fails, speed_test.nhv sometimes hanging or segfault
        (slower, but more powersaving)
    - fix tJ + S1SYM + recursive
   - fix serial a64 + parallel --chkpt_load=2 divergence (+ l1_0000.dat ronly)
     see lc_s1.gpl N=28 n1=18735341583 \* 7 l1=123GB h12-h21!=0
       only if n1 mod mpi_n is not zero!? try 159*7t100 mod=0
 ToDo: improve speed Cuboctahedron N=12 s=7/2 reimar2004 v2.22 48sym n=84=42+42
        n1=35.6e6 norm2 up to 3e20 4marvel n2=39h xnz=39.5 SH=13.5h i100=6h
        better estimated n1 + nzx by MC max 1s
        models/cubocta.def
        quanta2 ns=10h SH=6000m=100h too slow
       PRF: minsymcfg_bnV8  t=16.4s loops/s= 2.7e+03 t[ns]/cfs= 45524.28 2 99i
       multithread mpi+mpt+openblas-pinning maxscfg=12m,
           new: mpt+pinning bnV8=1600ns but symcfg_bnv8=133ns maxscfg=9m SH=230m
          try:  defspin1.sh -s {2...n} [daten.def] also -a !
     - recursive parallelization ln2(tasks)=bitdepth_start 2tasks=-1bit (S=1)
     - recursive progress, higher z count (below 1s) + modulo to tasks
      add to todo/hist or c-code as sample
 ToDo: bad nzx from random, do some nzx-max iterations?
       get better n1 estim. by MC?
       speed-estim: t9_42.42sym.sz4.n1=6e9 18*10 SH=148e6nz/s i100=2e9nz/s
                    = 16GB/s / 16nodes = 1GB/s/node (measure on Gb, compare)
 ToDo: spins L490 see bugs 12.05.15
 ToDo1503: see speed_mpi.gpl L900
     + check ToDo's in speed_mpi.gpl (slow pt on kautz, needs memlocal sym)
     + speedup vlint (using bit[pair]set + bitget as shift + and, function?)
       /usr/include/c++/4.7.2/x86_64-linux-gnu/bits/c++config.h
     + add pragma no-unroll to lm (code size) and unroll to VS-loop (vectorize)
       check asm, where vectorcode was produced and its speed 
     + add WARN if NOS1SYM
 XXZ-model + field - howto ?   
 ToDo150409: replace ERR(630)(all nodes) by one node warning
       add estmate_n2lm=nzLMx + n2sym from n2speed_test_MC1000rnd_configs 
         needed for s=3/2 n=3*18 better estimation,
      also rr-bits-(n2sym,params,factor) for better parallel storeH-ckpt
            or b16-minifloat=10m+5e+1s=16bit ? 1/8=.125 + b16 accuracy test
 ToDo1504: spins.c L610 hfmax changed (bad idea?), update min b_*[blk] via MPI!
 ToDo1503:
    openmpi behav. under memory pressure? (out of memory?)
    HyperThreads effectively reduce Cachesize/task (slower)
     need better cache-hit-rate (contradict)
     use __thread for sym on SC_MIPS (3-4*slower on mkautz! own copy is better!)
     (L1 overflow?)
     http://www.akkadia.org/drepper/cpumemory.pdf 
     robustness: (what happens, when I/O is away for a while, test qemu)
        nfs no read daten.i
 CHECK! 5000 cores 20MB memory reserve not enough, ompi hangs up (no error)?
       ok for 28*400MB meggie 10+30 24% in mem (372MB)
       if needed memcheck-pg per exec + kill after timeout before malloc?
     - with chkpt6 handling: duplicate tri.txt lines, correct It? +e0??
      + tU coded as uuu,ddd (recycle sym,bfly etc. as u*d)?
    - buggy tU N=150 e3=...
    output GB/s after the first GB transferred!?
 ToDo: fault detection/tolerance
       Immunity-aware programming, compare Matrix-xor-checksum per iteration?
 ToDo: replace tphase as int in inner loops (multiply cmplx sign outside)
 ToDo: 1505 replace tbase by tbc bascfg/bitcfg or vlint? replace vlint.h
    by C-inline-functions/macros bc_[sg]etbit0,getbit1,getbitN,lshiftN
      bc_and|or|xor (minimum needed for highperf lbrkN + benes)
     for speedup on big NN + elimination of c++-class
    try use __m128i // __SSE2__
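     a minimal C sketch of such a bc_ interface (BC_WORDS, bc_t and the helper
     names follow the wording above; details are assumptions, not the real
     vlint replacement):
       #include <stdint.h>
       #define BC_WORDS 2                      /* enough for NN <= 128 */
       typedef struct { uint64_t w[BC_WORDS]; } bc_t;
       static inline int  bc_getbit(const bc_t *a, int i)
       { return (int)((a->w[i >> 6] >> (i & 63)) & 1); }
       static inline void bc_setbit1(bc_t *a, int i)
       { a->w[i >> 6] |=  (uint64_t)1 << (i & 63); }
       static inline void bc_setbit0(bc_t *a, int i)
       { a->w[i >> 6] &= ~((uint64_t)1 << (i & 63)); }
       static inline void bc_xor(bc_t *r, const bc_t *a, const bc_t *b)
       { int k; for (k = 0; k < BC_WORDS; k++) r->w[k] = a->w[k] ^ b->w[k]; }
       static inline void bc_lshift1(bc_t *a)  /* shift left by one bit */
       {
         int k;
         for (k = BC_WORDS - 1; k > 0; k--)
           a->w[k] = (a->w[k] << 1) | (a->w[k - 1] >> 63);
         a->w[0] <<= 1;
       }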
 1503 cache-problem? cachegrind 1% hamilton_geth_block
    # meggie.speed:  SH=2480ns/core vs. 5.3ns*160=848ns factor 3!!!
    #  hint: also too strong AH dependence! (cachesize?)
   -t1
     54.47%  src/hilbert.c 1314-1360
     33.91%  src/hilbert.c 1339   
     23.66%  src/spins.c 1369-1544
   -t3
     88.18%  src/hilbert.c 1314-1360 slower on mkautz!
     54.89%  src/hilbert.c 1339     # sym[i][j]-loop
      5.47%  src/spins.c 1369-1544
       nhm_line.(ii=0...hbuf->n).minsymcfg_dflt(hbuf->el[ii].bj, bj); L700
       struct hbuf.struct shelem{bj,blkj,jj,rr}el[NZXMAX*NUM_AHEAD_LINES] change?
         unfavourable for cache + vectorization
        better: hbuf.H_bj[NZXMAX*AH],H_blkj[..]...
       symcfg,sign,norm2 []
 ToDo:
   minsymcfg_lNbrk2(...,struct tsym *threadlocal_sym_copy)  # ToDo! 2015-03
    don't use global (thread shared) data (nn,nu,nd,sym)
     kautz/MIPS: 6threads: 380ns(global)/48ns(local copy) = factor 8 faster!
     1thread 40ns!
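    a minimal sketch of the local-copy calling convention (struct fields and
    the stub body are placeholders; only the pattern of passing a per-thread
    copy into the hot function is the point):
      struct tsym { int nn, nu, nd; /* ... symmetry tables ... */ };
      struct tsym sym;                         /* global, shared by all threads */
      static unsigned long minsymcfg_lNbrk2(unsigned long cfg, const struct tsym *s)
      { (void)s; return cfg; }                 /* stub: real code reads only *s */
      static void *worker(void *arg)           /* pthread entry */
      {
        struct tsym local = sym;               /* one private copy per thread */
        /* ... loop over configurations, always passing &local ... */
        (void)minsymcfg_lNbrk2(0, &local);
        return arg;
      }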
 ToDo: remove HBLen, by table.nz[n1_blk/AH?] (l1,v0,v1,ev,nz) ???
        simple loops, disk-store partly or complete unfilled blocks?
         e.g. sq40j1.8+32 nzx: mean=27.3  min=13 max=33(=nu*4+1) +21%
         e.g. sq40j2.8+32 nzx: mean=53.5  min=31 max=65(=nu*8+1) +21.4%
         e.g. sq40j2.6+34 nzx: mean=42.85 min=27 max=49(=nu*8+1) +14.3%
        + mpi-transfer complete AH-blocks? 
 ToDo: ns: preset tu according to nu for recursive (ca. n choose tu)/sym ca. 2*mpi_n
    u8=482e3 u10=5.3e6 u12=35e6 + store chunks
 ToDo: remove mmap, load l1_0000.dat via node01 to other nodes (local disk)
       would work without tmp_shared too
       dmtcp_launch 
       /opt/ompi-1.8.4/bin/mpirun --preload-files spin,daten.def,daten.i\
         --bind-to core -H node01,node02 -np 4 ./spin
 ToDo: sighand.c: better sigusr1 only (let usr2 for dmtcp)? 
      + use timing = 2*SIGUSR1/10s = start MPI-traffic-free checkpoint window?
     ckpt-window MPI_Bcast(sig1) if(sig1){print_ready_for_ckpt;sleep30;p_cnt}
         before MPI_SendRecv or next_ahead
     per CKPT_HELP (overhead? no)
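      hedged sketch of this ckpt-window (sig1, the 30s pause and the message
      text are taken from the note above, not from the actual sighand.c):
        #include <mpi.h>
        #include <signal.h>
        #include <stdio.h>
        #include <unistd.h>
        volatile sig_atomic_t sig1;            /* set by the SIGUSR1 handler */
        void ckpt_window(void)                 /* call before MPI_Sendrecv / next_ahead */
        {
          int flag = (int)sig1;
          MPI_Bcast(&flag, 1, MPI_INT, 0, MPI_COMM_WORLD); /* rank 0 decides for all */
          if (flag) {
            printf("ready_for_ckpt\n"); fflush(stdout);
            sleep(30);                         /* MPI-traffic-free window */
            sig1 = 0;
          }
        }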
 ToDo: auto-define CONFIG_PTHREAD for B_NUM > 1??
 ToDo: make test diet gcc -O2 testsqrt.c /usr/lib/libm.a # -lm does not work!
       or just test __dietlibc__  compare gcc -Os vs. diet free/usable mem
        maxmalloc mpi_i=0 then mpi_i=1 ... (node structure?)
 ToDo: remove mmap() l1_0000.dat (advantage: swapping, disadvantage: C/R open file)
       SH stored 100%: l1 needed until 100% reached, release gives advantage 
         for checkpointing less mem (but remove file is better)
         also needed for eigenvalues later
       SH not or partly stored = no advantage of mmap (except swapping = bad)
       before a0 ftell+fclose+a0+fopen+fseek daten.i (mod + closed file)
 ToDo: vectorization + inhomogeneous MPI (memory Nodes + CoPros a la XeonPhi?)
        minsymcfg_loopN(cfg) via 4sym-vector,
  better do 4 vector-parallel-syms on same cfg and min at end = minsymcfg 
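   small sketch of the 4-lane idea (apply_sym() is only a dummy placeholder for
   the real permutation code; the nsym%4 remainder is omitted):
     #include <stdint.h>
     static uint64_t apply_sym(uint64_t cfg, int s) { (void)s; return cfg; } /* dummy */
     uint64_t minsymcfg_loop4(uint64_t cfg, int nsym)
     {
       uint64_t m[4] = { cfg, cfg, cfg, cfg };
       int s, k;
       for (s = 0; s + 3 < nsym; s += 4)       /* 4 independent lanes */
         for (k = 0; k < 4; k++) {
           uint64_t c = apply_sym(cfg, s + k);
           if (c < m[k]) m[k] = c;
         }
       for (k = 1; k < 4; k++)                 /* reduce to the minimum at the end */
         if (m[k] < m[0]) m[0] = m[k];
       return m[0];
     }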
 ToDo: Overlapping Communication with Computation (1Gbit for 3GHz-core) 2015
       replace MPI_SendRecv by 
       concurrent computation  MPI_Isend + MPI_Irecv + CPU + MPI_Wait
         + output if not enough bandwidth
       new module needs 3 (pipe-)buffer for each H,Hv:
         - nodes*MPI_Isend+MPI_Irecv H(i-1), (H*v)(i-2);   i is (AHEAD) block idx
           old was a loop, but new is I_All_to_all 
         - optional async read H(i), store H(i-1) + system-ahead i+1, wcache
         - compute H(i) or readH(i), (H*v)(i-1), (vHv,v+=Hv)(i-2)
           optional RAID-update v(i-2)
         - MPI_Wait and opt. I/O_wait
         - Buffers: Hi_send[ahead*nzx]
                    Hi_recv[ahead*nzxmax]
                    Hi_comp[ahead*nzxmax]
                   3*Hvi[ahead]
        + 2 pipe-fill-steps  (loop over 2+n1/ahead)
         speed = minspeed( cpu(ahead*nzx/s), mpi(ahead*(nzx*8B+4B)/s) )
        ToDo: benchmark both in spinpack! PRF
            MPI_All_to_all benchmark !!! vs. balanced pktsize for target nodes
              only makes sense with at least 4 test nodes, better 8
               stat: count up to 4k pckts, up to 16, up to 64, above that ToDo
           max ahead = avail_buffer_space/task = ca. 10% of avail. mem / task
           max pckt size = max ahead * nzx / num_threads
         put this descr. to speed.html and leave a link to it
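         hedged C/MPI sketch of this 3-stage pipeline (block size, ring
         neighbours and the two work functions are placeholders, not spinpack
         code; the note above suggests MPI_Ialltoall over all nodes instead of
         the simple ring used here):
           #include <mpi.h>
           #include <string.h>
           #define BLK 4096                    /* ahead*nzx elements, example value */
           static double h_comp[BLK], h_send[BLK], h_recv[BLK], h_use[BLK];
           static void compute_block(int i) { (void)i; /* fill h_comp = H(i) or readH(i) */ }
           static void apply_block(int i)   { (void)i; /* (H*v)(i-2), vHv from h_use */ }
           void pipelined_hv(int nblocks, int left, int right)
           {
             MPI_Request rq[2];
             int i;
             for (i = 0; i < nblocks + 2; i++) {      /* 2 pipe-fill steps */
               int inflight = (i >= 1 && i <= nblocks);
               if (inflight) {                        /* exchange block i-1 */
                 memcpy(h_send, h_comp, sizeof h_send);
                 MPI_Isend(h_send, BLK, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &rq[0]);
                 MPI_Irecv(h_recv, BLK, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &rq[1]);
               }
               if (i < nblocks) compute_block(i);     /* overlaps with the transfer */
               if (i >= 2)      apply_block(i - 2);   /* consume data received last round */
               if (inflight) {
                 MPI_Waitall(2, rq, MPI_STATUSES_IGNORE);
                 memcpy(h_use, h_recv, sizeof h_use); /* input of the next apply_block */
               }
             }
           }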
 - remove threads + function calls ?? (not allowed in OpenCL), replace
   by parallel-pragmas?!
 - I/O store 50% not 0..50% but interleaved block on disk vs. on RAM
    for better background load from disk (maybe interleaved recompute/diskload?) 
    works only if memory is allocated at the beginning (known size ram + matrix)
 ToDo: power reduction using MPI_Isend,Irecv + MPI_Test idle usleep(1)-loop
       usleep may be lengthened by the granularity of system timers
       100*usleep(2) = 0.1s   comp2=+1ms/usleep        GbE=100kB/1ms
       100*usleep(1000) = 0.2s == 100*nanosleep(1ms)
       100*usleep(2000) = 0.3s                      == ! ! ! Problem ! ! !
          depends on the Linux scheduler! 10-50ms or busyloop
         in/out port 80 takes about 1us on x86 inb_p() or inb() asm/io.h
         sched_setscheduler() SCHED_FIFO or SCHED_RR
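        small sketch of such an idle wait (MPI_Test + usleep instead of a
        busy MPI_Wait; the 1ms period is illustrative):
          #include <mpi.h>
          #include <unistd.h>
          void wait_low_power(MPI_Request *req)
          {
            int done = 0;
            while (!done) {
              MPI_Test(req, &done, MPI_STATUS_IGNORE);
              if (!done) usleep(1000);      /* real delay depends on the scheduler */
            }
          }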
 ToDo: power footprint on kautz
    32quanta1.idle=280W  ns=500W SH=520W i100=570W 2.2GHz 2015-02 t/nz[ns]=112.8*16
    32quanta1.idle=280W          SH=390W           1.2GHz
 ToDo: failure tolerance ?
       checkpointing or redundancy (assuming 99% of memory changes its data)
       - 8+1RAID5 redundancy needs lot of network bandwidth!
          + blocks of old and new data between check + update points
             block size max. 10% of memory for efficiency = network transfersize
            additional block for overlapping transfer + compute
            # block operations are good in general! speed
            # using double memory  or  fast storage
         networktransfers/iteration = log2(nodes/raidnodes)*memory
         log2(80/10)*256GB=3*256GB/(IB=3GB/s)=256s=4min for save/restore
            compute time should be 4min++/Iteration to have no cpu losses
         does not help on total failure or job switching (or nonvolatile mem)
       - checkpointing
         checkpoint-interval must be smaller (1:10++) than failure interval
         need storage = 2*memory, depends on storage speed
          min storage: 2TB / 70MB/s == 256GB / 12.5MB/s = 5.7h
                 (chkpt-interval=60h, failure-interval=600h=25d)
          one drv/node: 256GB / 70MB/s = 61min
                 (chkpt-interval=10h, failure-interval=100h=4d)
          libs: not many implementations, no change of node numbers
         own:  restart with different node number is possible + crc
       virtual SMP could help to use SMP checkpointing on clusters!
 - all mallocs + errhandlings synchronized to avoid different thread-mem
     consumptions?
     problem with partly stored SH! background MPI also solves that, see above?
       use multiple of AH + zerofill, 
 ToDo: see speed_estim.html (compute time for full cluster memory usage)
       can be used to buy spinpack-optimized compute cluster
 ToDo: reduce L1-cache needs, using generators and list of gen_idx 
       through all syms, see also butterfly.txt for symconfig speedup
       160sym 8+32 lNbrk=4.9ns SH=160*10.0ns=1600ns
        40sym 8+32 lNbrk=6.7ns SH= 40*17.5ns=700ns  sym_k=-13 0 -2 0
        20sym 8+32 lNbrk=8.2ns SH= 20*28.4ns=568ns  fit tSH=400+7.5*sym
        20sym 8+32 lNbrk=7.5ns SH= 20*26.9ns=538ns NOS1SYM-6%
                                            =520ns -Hstat-3.5%
        20sym 6+34 lNbrk=7.3ns SH= 20*22.3ns=447ns  = 285+20*8ns
         4sym 6+34 lNbrk=11ns  SH=  4*  83ns=333ns  = 285+ 4*12ns
         1sym 6+34 lNbrk=9.1ns SH=  1* 285ns=285ns  
                                            =266ns -Hstst-7%
             bondloop=80 + (nzx=33)*(nsym=160 + ln2tasks)
 ToDo: hamilton_nhv like fast_hamilton_nhv (optional store H additionally)
      nhv(cfg[AH],r[AH]?) = scfg[AH*nzxmax],sgn[AH*nzxmax],n2[AH*nzxmax]
                            mem+20%(better vect+omp) or nzx[AH](mem+1./nzx)
                     with nzx, vectorization+omp simpler? fill zeros?
       norm2 computation? effort + 1/nzx or (byte-) vector? const nzxmax
       replace XY_FLAG by separate x entry (mem+...%) but sort HBSize 
         for tasks (const. package size), see struct sketch after this ToDo block
         old.xy_flag(mixed blocks): xy+by+rr=4B+2B+2B=8B  + buffers
         new.xx_idx(sorted blocks): xx+yy+rr=4B+4B+2B=10B + const.pktsize
      -d$TMP for batchjobs? (better remove tmp dependency of code)
       bug: if ulimit -v is smaller than maxfile, i100 fails ??? fixed 2011?
      bug: hubbard model nu=nd=N n1=0 (should be n1=1)
      - adapt to clusters without shared FS
        (how to distribute sequence of data of unknown number? blocks?
         or dry run for counting only)
         simplest: switch to count only after OOM, store the last
            stored and counted scfg and maybe every 1024th?
            2*nodes counted_ranges: 1 1 1 1  to 2 2 ... +2+2 to 4 4 ... +4+4
            store start scfg + foundnum scfg + end_cfg + time_needed
           so we have stored ranges and counted ranges
         2nd round: recompute the counted only on the right nodes
         optional "stop" per nonblocking MPI from OOM node?
        OR roundrobin 10e6cfg-chunks(doubling if under 1min) + list of reallocated
         scfg-blocks (sort blocks in 2nd run, testversion: do it parallel to
         disk, if it works, remove disk code)
         chunks of size of max. free space
         {chunkidx, startcfgORidx, stopcfgORidx, numscfg, time, *scfgs...}
     tree-algo usable for start and stop tree partitioning?
        stop at 1st depth-nu-cfg of depth-(nu-4) ??? problems?
       - make an example.html page for different physical models 
        and put the link to the README
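       struct sketch for the old xy_flag vs. new xx_idx element layouts
       mentioned above (field names and widths follow the 8B/10B note, not the
       actual spinpack structs; the 10B variant would need explicit packing):
         #include <stdint.h>
         struct shel_old {          /* mixed blocks, 8 B/element            */
           uint32_t xy;             /* index, top bit (ab)used as XY_FLAG   */
           uint16_t by;             /* target block                         */
           uint16_t rr;             /* packed matrix element                */
         };
         struct shel_new {          /* sorted blocks, 10 B/element, const. pktsize */
           uint32_t xx;             /* row index, explicit (no flag needed) */
           uint32_t yy;             /* column index                         */
           uint16_t rr;             /* packed matrix element                */
         };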
   - parallel sort for fulldiag by E_Ising=Ez using a bitonic sorter
    http://www-i1.informatik.rwth-aachen.de/~algorithmus/algo12.php
  - scalapack for fulldiag: PDSYEVD
    http://stackoverflow.com/questions/20706523/scalapack-matrix-diagonalization-pdsyevd
   - remove maxmem from daten.i (maxmem+usemem? from *.c) jobsystem!
    set 0 as default (2011-12-09)
   - generate matrix.pgm earlier in fulldiag scaled for bigger matrices
    up to n1=200e6 (fits to 2GB)
    to 512..1023 pixel from getH without storing full sparse matrix 2011-11
  - http://graphics.stanford.edu/~seander/bithacks.html
    Swapping individual bits with XOR 
    +: using v & -v last bit counting?
      0010100 | 0010011 = 0010111    (v|(v-1))+1  01... to 10...
      0010100 & 1101100     = 0000100
                       ...  -1  = 0000011
   
   x = ((b >> i) ^ (b >> j)) & ((1U << n) - 1);  // XOR temporary
   r = b ^ ((x << i) | (x << j));
   etc.
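   self-contained version of the snippet above (swaps the n consecutive bits
   starting at positions i and j of b; the uint64_t width is an assumption):
     #include <stdint.h>
     static uint64_t swap_bits(uint64_t b, int i, int j, int n)
     {
       uint64_t x = ((b >> i) ^ (b >> j)) & (((uint64_t)1 << n) - 1); /* XOR temp */
       return b ^ ((x << i) | (x << j));
     }
     /* example: swap_bits(0x2, 0, 1, 1) == 0x1 */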
  - replace vdate.h(#define) by vdate.c(const char) to avoid recompiling spins.c
    or split spins.c (add main.c?)
   - ns write sns to local files and concatenate explicitly (remove bottleneck NetFS)
    filter n1/threads ... 1M-chunks for better distribution
   - use generating syms^ni instead of storing all syms (save cache! more speed)
     generate products in smallest() + optionally permute via tables (sign?)
     + fast permutation by tables
   http://microcontrollers.wordpress.com/2011/03/11/how-to-do-really-fast-bit-permutations-with-few-operations/
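     hedged sketch of permutation by byte tables (perm[i] = target position of
     source bit i; 64-bit configs and 8 tables of 256 entries are assumptions):
       #include <stdint.h>
       static uint64_t ptab[8][256];           /* image of byte b with value v */
       void build_ptab(const int perm[64])
       {
         int b, v, i;
         for (b = 0; b < 8; b++)
           for (v = 0; v < 256; v++) {
             uint64_t out = 0;
             for (i = 0; i < 8; i++)
               if (v & (1 << i)) out |= (uint64_t)1 << perm[8 * b + i];
             ptab[b][v] = out;
           }
       }
       uint64_t permute(uint64_t cfg)          /* 8 lookups + ORs per permutation */
       {
         uint64_t out = 0;
         int b;
         for (b = 0; b < 8; b++)
           out |= ptab[b][(cfg >> (8 * b)) & 0xff];
         return out;
       }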
   - use recursive n1 and split search paths to threads
    write nu=0 to file0
    read file0 and write nu=1 to file1, rename file1 to file0
    read file0 and write nu=2 to file1, ... 
    stop if nu reached
  - suspend/resume in parallel mode (no storedH, v0/v1 only)
     e.g. suspend after the next 20 iterations,
     mips-cluster.kautz dump.mem2TB/(200MB/s=2Gb/s)=167min=2.8h wtime=28h++
     sq48 n1=168e9 *(4+4+6)=2.4TB
     s2tri27 ... (ns try both serial +parallel and break slower method?)
    recompute l1 if file l1_0000.dat is removed or bad (check last!)
  - check all (quasi-parallel) disk operations (ns.l1=OK, rw_v=OK 2012-05)
   - hr_restore: set to 0 + recompute missing indices instead of thousands of files?
    use maxnzx as static size for simpler code? but 10-20% more SH-memory
  - check: libckpt (user-directed checkpointing)
   - code memory-frugally (cache); code recursive numscfg with arbitrary start
     and stop point (e.g. return early and continue later, or continue on
      another thread and interrupt after a finite time)
      example config space (N=5,S=1) ...
      chkpt.resume nsymconf() from l1(?) + n1 (save last testcfg every 2h? chkpt0?)
      example: N=40-chain syms: 2 non-commuting l=2-syms
        40syms generated by s0,s1,s0*s1,s1*s0,s0*s1*s0,... compact code?
   - maxscfg by excluding higher empty subtrees
   - skip complete a0-run on bad k_sym !? (avoid long runs) ??? if it's easy to implement
  - try http://dmtcp.sourceforge.net/ distributed-mt-checkpoint-userlib
  - test triangle48 sym=192=4N n1=ca168e9(l1=7n1=1.1TB=100MB/s*11000s(3h))
    split l1 on failure? or per node or 256 threads ...? l1=216GB/4h
    md-raid0 for 2TB (fuse?)  
      fixed partitioning (equal size or equal nodes) 
      or maxsize l1_0..63 (1TB/8=128GB)
      l1_%4d.dat in 200GB/Bsize chunks (links to different FSs, striped?)
  - test triangle-s2 N=2*27=54 e= 3 3 0 9 (tU=108bit) nud=54,53
    3+51 NoS1SYM n1=15540 SH=0.3m  54.70017464 30m/100It
           S1SYM n1=3627  k=-1000  54.70017464
           S1SYM n1=25    k=0      54.76837837 0min
            see s1_triangle.gpl
  - warn on l1 writing on long=32bit systems (split files?)
  - ToDo: fermionic sign for b_smallest_lm() (LM/S1SYM) ???
  - ToDo: test resume after break during checkpointing (incl. ev)
       + robustness against data errors?
      - checkpoint resume after 2nd++ data-set (a0...a0)
  ToDo: problem bad get_maxscfg for parallel speedup (47+1 8sym 000- n1=6)
  - parastation-mpich send_16MB_from_all_to_task0 causes SEGFAULT on task0
    test ulimit=4GB 4tasks lowsym n1=32e6(+16MB=OK),n1=225e6(730M+16MB) 8m/It
    test: q.mpiexec -l -m ... ./wrapper.sh: ulimit -v lowmem + nice -19 spin
           behaviour under memory shortage with mpich, 
             limit=70MB (4*64M+16M fail L229) ToDo: abort cleanly!?
            limit=84MB (4*64M+16M) OK
  ToDo: a2 2x2 mpi-version! scaling?!
     - computes only one data set a0? if multithreaded
  ToDo: ccNUMA tips, clear diskcache to allow local malloc !!!
   ToDo: output i100.t during iteration first on the 10th, then only if changed by 10%
        for smaller diffs
  ToDo: oprofile (2013-04)
  - +dietlib -printf
   - struct s_float + cast op() test accuracy
  - aio +
    interleaved io (m threads writing/reading to/from one hnz-stream)
     pipe_read (+transparent mpi, const. block count/length?)
    h_file.c 
  - replace itime by ftime=gettimeofday.s+us*1e-6 or MPI_T, for better shortruns 2013-04
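     minimal sketch (the name ftime_sec is hypothetical, chosen to avoid the
     existing ftime() in <sys/timeb.h>; MPI builds could use MPI_Wtime()):
       #include <sys/time.h>
       static double ftime_sec(void)
       {
         struct timeval tv;
         gettimeofday(&tv, 0);
         return (double)tv.tv_sec + 1e-6 * (double)tv.tv_usec;
       }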
  - add CPUSET for 64...1024 (see memspeed.c)
 ------------------------------------------------------------------- 2013-04
- Strategy:
  recompute matrix H (not storing, because computing is and will be (?) 
  always cheaper than fast storage, see GPU, multicore, L1-caches)
  - checksum H-block-elements to avoid errors (2012) 
- SU(2) use ???
  see R. Schnalle + J. Schnack, Calculating the energy spectra of magnetic
  molecules: application of real- and spin-space symmetries, Apr2010
  International Reviews in Physical Chemistry Vol.29, No 2, 403-452

- speedup ns at MIPS NN>64 (remove y%() from vlint.h)
- s=1 CONFIG_S1SYM + PARALLEL buggy, n1 too big, for ud=-1 ???
- virt. Test-Cluster (rid-replacement)
- Zi != 0 for square Sz=0 k=-9999
- after release v2.40, make FPGA/GPU ready (b_smallest blockwise) + FPGA
- ising matrix to ising+1storder-excitations?
- storeh2 for mpi (for ns() too)
  MC like method 10000 random configs divided into 100 blocks
   defining start-cfg of each block,
  isingerg as mean value of neighbouring ising ergs?
    like e0+meanexcitations
- remove writing l1 (allows starting 2 jobs within same path)  
  save as one file on node0?
- rename zahl to lfloat (long float), mzahl to sfloat (short float)
- complete new strategy? build n1 via storeh2 (sorted by isingerg)
  and balance number of nonzero elements per rank/thread
   b2i probably has to ask more than one other rank?
- remove n2, parallel ns()
- err9200: set nzxmax-overflow-flag and reset hbuf->n
           stop later! ??? or better dyn.malloc
      or hbuf as compressed part of hbuf[NUM_
- mpi ns() without nfs-transfer
- fix compile-errors on SunOS(isut1), check MRule for kago39
- ns() replace fwrite(l1_xxx.tmp) ??? (tina has NFS problems)
  by mpi_sendrecv
   1st: send buffer to thread_i (i+1 if mem_i=full)
        or simpler send buf_i to thread i%mpi_n (buf_mpi_i)
   2nd: balance l1[] send l1[]begin(i+1)..end to
        or simpler send buf_mpi_i to thread i (l1_i)
- speedup by local malloc? if not change back to v[i+b_ofs[blk]]
  and node_ofs[..], b_ofs[]=0..node_len[node]-1;
- autobalancer, redistribute lines among threads after 1st iteration?
  !ps -m -o time -p $PPID at end to calculate unbalance
- first mpi_n*MPI_calls per hbuf-line, later block a fixed number of lines
  split nhm()
  - hamilton_nhv(i) -> nhm(i,jnz) -> store hbuf[].cfg,rr
  - smallest(hbuf) -> jnz*b_smallest -> store hbuf[].scfg,rr*=(sign,norm2)
  - scfg2blk(hbuf) -> [blk].hbuf...
  - mpi_n*MPI_Sendrecv([i+j].hbuf to i+j, [i].hbuf from i-j) cfg
          scfg2idx -> hbuf[].idx
          MPI_Sendrecv([i+j].hbuf to i-j, [i].hbuf from i+j) idx+v0?
    for all b_len[0] (send 0 if b_len[i]<b_len[0])
  - hlines[HLmax]
 - problem: H blocks sparse - xy+flag pointless? x+y but more mem, mpi-overhead?
   - store XandY ??? no!
     - X steps greater than 1 between 2 nz-elements
     - 80% more memory needed, slower!?
  - Y+(nextX-flag) !!!
    - only stripes possible (sort mpi-blocks during read)
    - creation of H needs more time and a lot more MPI-traffic
    - store Y+Y.blk(byte) +20%memory
  - [0].xy=num_elements_line_x [0].r=diag [0].by=block
    [1..n-1]=nondiagonal elements (xy,by,r)
 - by blocks for nodes only? C++ array access via MPI?
- highest to lowest vector_coeff? can int/llong be used for speedup on T1? 
 - ca. n1 without hash collisions?
 - first 10 down-spins via table/recursive search * (last down-spins over the
                    remaining places)
 -  next symcfg -> remove the last up-spin and search for a new place until smallest
- alloc h_arrays within threads (h_xxxx[B_NUM] not needed),
  open/close within threads?
 - better speed measurement to find the bottleneck?
   - MP-scaling, size-scaling hnz/s (if sensible)
  - for FPGAs prepared
- smaller code for less bugs
  - remove noSBase (k=-1 can be used, add kud=0 or -1, test speed)
- pictures of most probable 40 states! mark flips (@ critical J2=0.6)
  show first perturbation terms
 - long-term goal: FPGAs + MPI (block b_smallest calls)
- MPI async: MPI_PROBE, MPI_GET_COUNT, MPI_RECV, MPI_ISend,Ireceive,waitall/any
 - async read n-to-m threads (m<=n), operations in blocks (speedup)
   too costly? better only in memory, on-the-fly computation or mmap to disk
 - tJ has no .3 symmetry like tU (Ham2), makes sense?
 - store j1-j2-t-U as parameter index or via index (long 40-site j1-j2 computations)
   (no separate parameter files because of a2), better index?
   store index to [factor{0,+1/2,-1/2,-1,+1}, parameter index]
 - n-site terms in H, n>2
 - SiSj with sym much too slow, why?
 - a4 parallel (world leader? <-> AHonecker)
 - test / improve a2 parallel scaling?
 - data multistream concept (pipeline concept of vector machines) for new HPC?
   serially, dynamically coupled programmable units (e.g. CPUs, FPGAs)
   mapped onto seq. processors as threads + stream buffers + stop/
   start mechanism if a stream buffer is full/empty (wait for data)
   not sufficient for v2=H*v1+v2, random access needed as well (stream+RAM)
  MIPS per Watt? cost-performance-per-watt  (ARM SA-1100 1997 133MHz max250mW) 
 - nice graphic?: xy array for colored Ising-energy matrix weights
    minIsing=0(neel) maxIsing=maxNumBonds=N*Dim
    Ising=num-uu-Bonds+dd-Bonds, H1diff=0..2maxNN-2 H2diff=0..2maxNN
  + state Overlap <PisingstatesE1|H|PisingstatesE2>  => x_out enhanced
   highest/lowest ising recursive? min..maxE1.E2.E3....
 - check also: grep ToDo src/*.[ch]
  - simplify/generalize LM bonds in H? 
  (use a more general (simple) method for local symmetries)
   example:
  ..O   O---O   O---O   O...   this is a sample-chain, with 5*N sites
     \ /     \ /     \ /       and N vertical symmetries, which are
      O       O       O        completely decoupled from each other
      / \     / \     / \       and to the other symmetries (very similar to LM),
  ..O   O---O   O---O   O...   a future version should care about this
 speedup for local singlets (see 4site_exchange_diamond36.def)
  commuting symmetries (German: vertauschende Symmetrien)
  - also treat LM as a symmetry subgroup, e.g. N=5 S=3/2 (15 sites) 2013-04
   0 1 2   3 4 5   6 7 8  9 10 11  12 13 14
   generators: 
     subgroup0 0 1 2 = sym0=0-1 2, sym1=0 1-2 (sym2=0-2 1=sym0*sym1) l=3
     subgroup1 3 4 5 = sym3=3-4 5, sym4=3 4-5 (sym5=3-5 4=sym0*sym1) l=3
     ...
     subgroup4 12 13 14 ...
     generators: 5*2(10 to store), subgroupsyms=5*3(15 to store)
     oldNoS1sym: 3^5=243 (growing fast, slowdown + may hit CPU-cache size!)
 
 - find solution to avoid 64bit overflow for spinchain N=6 s=7 and bigger
 - instead of pthread_create/join doing on every iteration do it only once
    and use with mutex in mpi compatible way
     - improves sun_top_pcpu (pcpu is reset to 0 after pwd_create)
    - improves linux_top logging (new pid creation on p_create)
    - could be made mpi compatible (MY_MPI)
 - calculate SiSj for twisted boundary conditions TBC
  (using posx,posy, ww[NN*NN] ?)
 - translate docs to english (partly done)
 - new design via pipes (dataflow) and threads (ex: generate-H-thread
    writes elements sorted to 4 pipes for blocks, system does caching)
  - encoding/indexing by Ising energy with all possible bonds at
     equal symmetry (advantage: S=1, LM automatically integrated, faster?)

  - optionally introduce a C++ type ULLLong with >64 bits for N>64
  - switching to C++ would make the program clearer!
    minimally for the data types
  - compute expectation values H_t H_J H_U etc. (their sum is <H>)
    allows better interpretation? possibly H_J1, H_J2
 - H_J1={ny,array of pointer of {iy,nx,array of {ix,wert[y,x]}}}
  - store H_J1,H_J2 separately and calculate H = J1*H_J1 + J2*H_J2
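    hedged C sketch of that nested layout and of applying H = J1*H_J1 + J2*H_J2
    row by row (type and field names follow the note, not existing spinpack code):
      typedef struct { long ix; double wert; } h_entry;       /* column + value  */
      typedef struct { long iy; long nx; h_entry *e; } h_row; /* one sparse row  */
      typedef struct { long ny; h_row *row; } h_part;         /* e.g. H_J1, H_J2 */
      /* y += j * H_part * v  -- call once with (H_J1,J1), once with (H_J2,J2) */
      static void h_apply(const h_part *h, double j, const double *v, double *y)
      {
        long r, k;
        for (r = 0; r < h->ny; r++)
          for (k = 0; k < h->row[r].nx; k++)
            y[h->row[r].iy] += j * h->row[r].e[k].wert * v[h->row[r].e[k].ix];
      }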
 - try starting from Neel and count nonzero elements per Lanczos step (a8?)
   also check overlap to preceding Lanczos vectors
- repeat with debug++ if fatal error, reduce debug output 
- speed 600MHz      lt=2:05 SH=2:15,    --- 1s/It ---  nu+nd=32+8 n1=482e3
  call   b_smallest lt=2:07 SH=2:13=133s 
  call 2*b_smallest lt=2:05 SH=3:50=230s => b_smallest=1:37=97s=72%
  call   b_getbase  lt=2:07 SH=13:45
  call 2*b2i        lt=2:06 SH=2:29 => +12..14s=10%
       nhmline()                       +1..3s=1%
       1*nhm                SH=2:21
       2*nhm                SH=4:32 => +131s=98%
       nhm=return           SH=0:07s
   build lt together with H? sorted by E_Ising? 3bonds=3dimIsing
   + use cos(k!=0,pi) possible for real numbers
   H*000111=001011+100110 (6)->(6)  better uu=dd=0 ud=du=1 (for +-Operators)
   H*4.2.0 =2.2.2+2.2.2
   H*001011=010011+000111b+001101+101010 (6)->(6,2)
   H*2.2.2 =2.2.2+4.2.0b+2.2.2+0.6.0
  hash(ising-string)? trees of Isingergs (level=bondtype)
    store the config with min. bit distance to older representatives as the representative? 
    (kfg1^kfg2 yields <ij>)
   n1=555e6 hnz=23e9=41.4*n1
   idx  -> (IsingRep -> kfg) -> H*kfg -> kfgs -> IsingRep -> idxs
         H*kfg, kfgs->IsingRep per FPGAs?
         idx <-> IsingRep per (hash)Table?
    v completely dynamic, starting with lowestIsing?
    1st iteration superfast, 2nd iteration 41* slower? etc.
    but the problem of finding the index remains?
   numBonds? (topology-index 1dist.2dist.3dist...N/2dist for chain)
   0=01010101 2(8.0.8.0) -> 10010101 + 01100101 + 00110101 + ... 8(6.4.4.)
   1=10010101 8(6.4.4. ) -> 01010101=0
                             + 10100101=1
                             + 10001101 + 10010011 + ... 16(4.6.)
   2=10001101 16(4.6.)  ->  10010101=1
                             + 10001011 (4.4.)
   3=10001011 16(4.4.)  ->  10001101=2
                             + 01001011 (6.4.2.)
                             + 10000111 (2.4.)
   4a=01001011 8(6.4.2)
   4b=10000111 8(2.4)
 - choose code2
 - kill SIGUSR sh parallel => no meaningful value 
 - LAPACK without EV (as an option), implement zheev for sparse + parallel!?
   License? http://www.netlib.org/lapack/faq.html#1.2
  - change the name of routines if modified, 
  - We only ask that proper credit be given to the authors.
  complex: zheev (JOBZ='N'|'V', UPLO='U', N, A[LDA,N], LDA>=max[1,N], W[N],..)
    wantz = LSAME( jobz,'v'); // test option
   lower =
 - translate/redesign symmetry.tex
 - check and document the Oles term + if applicable only pure Coulomb U(i,j)
  6-site U/|i-j|
 - Reimar's patch = ok
- check OP/sec, theoretical limits MBps MOps etc. no disk/IO?


pid=...
while ps -p $pid; do
 echo -n "$(date +"%j %H:%M:%S") "
 # only for OSF -g $pid (for subprocesses gzip)
 ps -p $pid -o "pid,pgid,ppid,time,etime,usertime,systime,pcpu,pagein,vsz,rss,inblock,oublock" | tail -1
 sleep 30
done
 # compare process + system pagein/inblock/oublock if possible

#
plot [700:840] "aab.log" u 0:3 t "cpu/%" w lp,\
 "<awk '{print  ($9-x)/3.e3; x=$9}' aab.log" u 0:1 t "read/3e3" w lp,\
 "<awk '{print  ($10-x)/1.e2; x=$10}' aab.log" u 0:1 t "write/1e2" w lp

marvel: full load
 gzip  -fc1 tmp/htmp001.001 >tmp/htmp001.1.gz   1m32.381s v1.2.4 8MB/s
 gzip  -fc6 tmp/htmp001.001 >tmp/htmp001.6.gz   3m07.496s 
 gzip  -fc9 tmp/htmp001.001 >tmp/htmp001.9.gz   4m34.404s        4MB/s
 bzip2 -fc1 tmp/htmp001.001 >tmp/htmp001.1.bz2  7m43.521s v1.0.1
 bzip2 -fc9 tmp/htmp001.001 >tmp/htmp001.9.bz2 12m14.325s
 ls -l tmp/htmp001.*
  778485760 Jan 17  tmp/htmp001.001
  303230293 Jan 19  tmp/htmp001.1.gz  39%
  297352257 Jan 19  tmp/htmp001.6.gz  38%
  296760506 Jan 19  tmp/htmp001.9.gz  38%
  296796144 Jan 19  tmp/htmp001.1.bz2 38%
  332110892 Jan 19  tmp/htmp001.9.bz2 42% ?
 # decompress to /dev/null
 cat tmp/htmp001.001           0m11.624s # 67MB/s
 gunzip  -c tmp/htmp001.1.gz   0m24.802s Todo: +mem? +ru=100%?
 gunzip  -c tmp/htmp001.9.gz   0m24.043s # 32MB/s async?
 bunzip2 -c tmp/htmp001.1.bz2  1m55.676s

sh prog1 | buffer | sh prog2   # buffered async read?

Performance and efficiency

The efficiency of spinpack-2.19 on a Pentium-M-1.4GHz was estimated using valgrind-20030725 for the 40-site square lattice s=1/2 model: 37461525713 Instr./49s = 764M I/s (600MHz), 12647793092 Drefs/49s = 258M rw/s.