History/ChangeLog

You can find the latest (important) changes here.

 see also ToDo !!!
 --- please report compile problems and/or send your improvements ---
 ToDo: auto-use higher Tnorm2 for overflows + recommend it? (high S1SYM)
 ToDo: test speed_test for some slow nodes! (other processes, defective CPUs)
 ToDo: think about AH vs. B_NUM, best AH/core? so AH is cpu-specific
       or node specific (slowest I/O path?), nzx-stats per AH-block?
 ToDo: remove XY_FLAG! needed for simplifications
       makes SH_storage bigger (nzx*8B+4B)!? thbuf.ofs[]; simpler code
       first add scfg2idx_blk incl. XY_FLAG () better gprof?
 ToDo: mpi_l1 + scfg2node  last cfg per node? comments missing! 
 ToDo: MPI_IAlltoallv + store Alltoall + v2.54  STOREH=5(MPI_File_Iwrite)
 ToDo: unique mpi-buffer-names, dynamic alloc?! scplx nsend_rr to mpibuf_vr
       replace HBLen by dyn.nzAHmax to get same MPI-pckt-lengths SH.nhm_l vs. i100.geth
          see spins.c ah_nz[ah_blks]
       do not fill HBLen!? use it as maxNZX*AH*B_NUM (first compute diff)
       -prio- it will fix the partly-stored-matrix problem too!
        also we have per node (not per thread) SH-blocks = bigger = I/O-speed
       set nzx by hand, if detected too small, autoset to max (reorder mem if bad)
 ToDo17: check speed without MPI_local_node_self_transfer
 ToDo17: split nhm_line to cfg2scfg_nz_blk + MPIscfg2i2vi_nz_blk?  
              +nhn[](should be done by op_H_already!)=compact ?
              XY_FLAG to ofs[] (zero-nzblks!) + XY_FLAG ggf. in storedH only
            see asyncMPI, rename hamilton_puth_line to store_H_nz_AHblock
 ToDo17: testcase hybrid 2x2 k!=0,pi (with cases: H|i>=+|j>-|j>) or fulldiag
         with partly stored H (output max BFACTOR see Ri17 S1+NOS1SYM)
 ToDo17: fix v2.56 checkpointing, store min_nzx (modAH) not max NZX
         fix slow cplx class (supermuc)
v2.60pre  stable 2-loop-nonblocking_MPI_thread0_only
v2.58pre  stable partly-nonblocking_MPI_thread0_only
v2.57pre  stable partly-nonblocking_MPI_thread0_only, taskfile(~threadfile)
v2.56  stable(?) blocking_MPI_thread0_only, scales better than v2.55
 2017-02-22 fix problem with last block on last thread in case of zero length
 2017-02-20 make test fixed (--mpt --mpi @quantum,JJ64,cplx,sisj)
         storing nzx instead of NZXMAX again, so memory consumption is like 2.55
 2017-02-18 asynchronous fully filled const_size_nz_blks (HBLen) replaced by
         _synchronous_ partly filled const_size_nz_blks (NZX*AH), more KISS
         HBLen obsolete now and replaced by NZX*AH, NZX must be minimized
   synchronous H-blocks needed to fix complicated MPI-error (since HBLen exists):
        MPI and partly written matrix may cause problems (BFACTOR-err+bad_e0)
   but eats more memory! NZXMAX*n1 instead of old nzx*n1 (subject to change)
    e.g. kago42k1=15+27 +49% (this is a big change in the code pipe)
 2017-01-28 use mem + disk for matrix, set maxmem + maxfile (per CPU-core)
            replace ./tmp_shared by ./tmp (shared or local scratch disk)
            STOREH=0 replaced by runtime options maxmem + maxfile
 2017-01-27 rename short names (better to read + grep), split functions
            to prepare for asynchronous communication (in progress)
 2017-01-23 fix speed_test race condition for mpi/pt/hybrid code (get nzx)
            reorder h-file-content
 2017-01-22 --perf_test=100 reduces n1 by 100, old was stopping at 100
            this gives much better statistics (nzx, performance per nz)
            - more output about mem-usage, better large scale debugging
 2017-01-21 only thread0 does MPI in hybrid mode (a bit slower than v2.54!)
            bigger MPI packets, less CPU-power for MPI, more readable code
            will overcome v2.54 when using MPI_IAlltoallv=overlapping Net-I/O
v2.55  - stable seq_MPI_all_threads Alltoallv
 error: MPI and partly written matrix may cause problems (BFACTOR-err+bad_e0)
 2017-02-01 fix compile error, if MPI is not used, workaround bad g++412+cplx
 2017-01-30 CFG_CPUSET=1 to show CPU-affinity, CFG_CPUSET=2 to set affinity
 2017-01-28 speedup for expectation value Zi (diagonal operator)
 2017-01-26 add warnings, bigger buffers enlarge OOM risk by libmpi
 2017-01-25 disable speed_test.nhv for mpi or pthread, it is buggy
            gcc-6 + complex compile error fixed (defs.h+__STDC_LIMIT_MACROS)
v2.54 please use v2.55
 2017-01-24 fix mpi + ev=1 hanging triggered by x_sort in error.c (backport)
 2017-01-21 fix complex + CPUSET for icc (need to add to v2.60 too)
            fix compile problems g++ + complex
 2017-01-19 backport iy y inverse indexing
            old: rnd_write+seq_read, new: rnd_read+seq_write (geth_blk)
 2017-01-19 backports v2.60 output cpuset, use faster blk_scfg2idx
            bigger MPI_MAX_NUM (better defaults for supermuc)
v2.53  - much better MPI scaling using Alltoallv - use v2.55
 2017-01-16 fix hybrid mode code, buggy since 2017-01-15 
 2017-01-15 replace loop over MPI_Sendrecv (mpi_n*sync) by MPI_Alltoallv
    faster on comp2.09.ompi14 + supermuc12.mpich3,
    but much slower (factor 15 for 5700 tasks) on SiCortex09.sc_mpich2
    A2av-emulation for 5700*SiCortex gives speedup of 100% SH and 30% i100
    Problem: My_MPI_Alltoallv was loop j over -blocked- send i to i+j
      but datasize of blocks within each loop is very different
      only sum is distributed, so alltoallv is a much better way
    but output of mpi_stats is not correct anymore (use meansize per node now)
 2017-01-13 option: --perf_test for reduced matrix-size performance test
 2017-01-13 temporary fix for bad alloc_mem on disk usage STOREH=2 or 4
            set bigger dflt AHEAD 64 to 1024 (see 2016/lrz + doc/speed_mpi)
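
 Note on the 2017-01-15 entry: MPI_Alltoallv takes per-peer sendcounts and
 sdispls arrays, so blocks of very different size can be exchanged in one
 collective call. A minimal sketch of how such displacements can be derived
 from the counts is given below. The helper name fill_displs is hypothetical
 and is not taken from the spinpack sources.

```c
#include <assert.h>

/* Hypothetical helper (not spinpack code): given the number of elements
 * destined for each of n peers, fill displs[i] with the start offset of
 * block i inside one packed buffer and return the total element count.
 * The counts/displs pair has the shape MPI_Alltoallv expects. */
int fill_displs(const int *counts, int *displs, int n)
{
    int total = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        assert(counts[i] >= 0);
        displs[i] = total;   /* block i begins where block i-1 ended */
        total += counts[i];
    }
    return total;
}
```

 Empty blocks are allowed: they get the same offset as the following block.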
v2.52
 2017-01-11 fix icpc compile error
 2017-01-10 fix hybrid code (MPI+OMP), fix (make test) include DFLAGS
 2017-01-09 rewrite to use OpenMP-2.5 only (but i100 is about 6x slower)
 2017-01-08 add OpenMP-3.0 workaround for PTHREAD (but icc10 has OMP-2.5)
            this is for systems, where we have OMP but no or buggy libpthread
 2017-01-06 found some problems using icc10 -pthread = hangs + bugs
 2016-12-27 reduce stack memory needed for speed_test by ca. 16kB
 2016-12-22 replace use of hard coded mask CFG_CPUSET by getaffinity()
            CFG_CPUSET must be set to some nonzero value only (e.g. 1)
            example use cores 4-11: taskset 0x0FF0 ./spin -t8
 2016-12-12 fix unnecessary abort for maximum iteration (i=MAX_ITR)
 2016-12-01 sym_k= -999999 and below stops sym-search, similar to SIGUSR2
    should be replaced by faster sym search (exclude full commuting syms)
    example: chain of repeating (N=2,3,..)-rings separated by single spins
     for N=2 dimer-plaquette-chain, changes in symmetry.c get_recursive_perm
 2016-11-10 avoid re-use of old eigen values on failed malloc (fulldiag)
 2016-11-10 malign alloc big Hbuf for fulldiag, since about 2015-06
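
 Note on the 2016-12-22 taskset example: the mask 0x0FF0 has bits 4 through 11
 set, which is eight cores and matches -t8. The sketch below shows one way to
 build and check such a mask. The helper names core_range_mask and
 mask_core_count are hypothetical and not part of spinpack.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not spinpack code): build an affinity bitmask
 * with bits first..last (inclusive) set, for up to 64 cores. */
uint64_t core_range_mask(unsigned first, unsigned last)
{
    assert(first <= last && last < 64);
    uint64_t width = (uint64_t)last - first + 1;
    /* a shift by 64 would be undefined, so handle the full mask separately */
    uint64_t ones = (width == 64) ? ~UINT64_C(0)
                                  : ((UINT64_C(1) << width) - 1);
    return ones << first;
}

/* Count the cores selected by a mask (clears the lowest set bit per step). */
unsigned mask_core_count(uint64_t mask)
{
    unsigned n = 0;
    while (mask) {
        mask &= mask - 1;
        n++;
    }
    return n;
}
```

 core_range_mask(4, 11) yields 0x0FF0, the value passed to taskset above.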
v2.51
 - 2016-10-05 gcc-6.2.1 fix stronger spacing errors in models/*.c 
              fix bad signal behavior of my_handler() for mpi-jobs (thanks gcc6)
              fix indentation according to gcc6.2 warnings
 - 2016-08-17 g++-6.1.1 fix stronger spacing errors in error.h "..."var
 - 2016-08-09 fix lot of sym output (LM+NOS1 only?) for mpi since 2016-01
 - 2016-05-18 add extra workspace for lapack routines, 2-3x more speed
   using multithread for a4-eigenvectors, use OMP_NUM_THREADS if no option -t
   fix bad "ERROR fclose_l1"-msg (no impact, clean output only)
   add autodetection of core number for configure --mpt, +option --ftlm
   use CFG_PTHREAD>1 for B_NUM, link example*lc2*html in doc/spins.tex
 - 2016-04-29 fix buggy transposed eigenvector-matrix of cjacobi()
   buggy since v2.42 2012-01 test: nev=1 verbose=7 a4
 - 2016-04-28 improved utils/lapack_test.c +options +benchmarks +XXXevd()
   autoswitch to faster LAPACK_dsyevd/zheevd for nev!=0
 - 2016-04-21 fix autoset maxHmem=maxfile, if more than 16GB/core needed
   since 2015-02--, improve output (normvec/startvec)
   fix bad srand(0) (==srand(1)) from 2016-04-15, 2nd = 1st random startvector
 - 2016-04-15 for JSchnack2016 Finite-Temperature Lanczos method = FTLM
   see spinpack_rel_pap/FiniteTemperatureLanczos_FTLM_JSchnack2012.pdf
   DFLAGS += -DCFG_FTLM=1 + daten.i: startvec=13 + repeat a0's (unverified)
   update: better use max_ea=16 and one a0 only, startvec=5
 - 2016-04-15 fix bad v=0 for startvec=8++ (rng_ini+=startvec&~15)
   change randvec, startvec=0,1,4,8++ use, random startvectors changed now
   see randvec()-comments: grep startvec src/vector.c
v2.50d Apr2016
 - add min/max CPU/node speed to detect bad cpus or nodes    2016-04-07
 - fix segfault speed_tests.nhv_no_l1 (high node numbers ca.58*AH) 04-07
   since about 2015-03, verbose=1 is a workaround
 - reordered debug outputs
v2.50c Mar2016
 - fix multiple a0 memleak (l1) introduced 2015-02-24 for BLCR     2016-03-22
   using: valgrind --tool=memcheck --leak-check=full ./spin
 - fix out of array read access for NUM_AHEAD gt n1 since 2015-10? 2016-03-22
 - fix bad "symtable full" case (MaxSym too small) since 2016-01   2016-03-22
 - add usleep(100ms++) for nonzero tasks to relieve NFS overload   2016-03-16
   ToDo: untested, please try with/without spins.c.L4302 and report
v2.50b Mar2016
 - fix zero block crashes (SEGFAULT+badscfgs for mpi_n big, n1 small)
   introduced 2015-10-08 (benchmark adaptions)                     2016-03-10
 - allow massive oversubscription (64++) removing unneeded MPI_Barrier
   +using --mca mpi_yield_when_idle 1 --mca mpi_preconnect_mpi 1
   useful for debugging problems on PCs which appear for high mpi_n only
 - fix buggy autocorrection for nu+nd!=nn (since 2016-02 +old)     2016-03-09
 - fix multiple (per-mpi-task) warning "paini overflow" (N>=68) 2016-03-07
   introduced in v2.50 2016-01
v2.50 Apr2015-Jan2016 SIMD-CPU version
 - fix bad delay tasks*seconds, result of fixed mpi-token-ring   2016-02-03
   introduced 2016-01
 - allow more iterations for small systems (better for JJ N=2 s=33/2) -old-
 - fix bad N=2 LM detection, tested: N=2 2s=1...34 using int64   2016-01-23
 - fix overflow problems using -DTnorm2=double (N=10 2s=8)       2016-01-22
 - fix mpi-token-ring for shared mymap_l1                        2016-01-20
 - replace O(n^3) by O(n^2) paini-storage, simplify paini, check overflows
 - add sym_lm (0=max(S(S+1)), 1=max(J(J+1)))  see doc/sym_tU.txt 2016-01-16
 - 10-20% speedup recursive ns for high-spin-systems (see cubocta.def 2s=7)
 - fix highSpin-problems with vectorized code or wud=0 and nu=nd 2016-01-11
 - fix speed_tests() for multithread + static                    2016-01-11
 - symmetry.c partly rewritten for better handling of s1-systems + NOS1SYM
   N=11-s=1-lc-chains: sym_k= -21 0 ...  # skip 2N syms, set LM-syms (fast)
   N=11-s=1-lc-chains: sym_k= 0 0 -60000 # skip LM-syms, set 2N k's only (fast)
 - make auto-VS depend on tbase (VS*S_tbase=const.) faster on SSE2, AVX2
 - enable benes up to 512 bit-cfgs (JJ) tested                     2016-01-08
 - fix vlint - int for NN>128, improved error check + output       2016-01-08
 - speed_test bnv4+bnv8 removed, code reduction (bnVS left), +const-args
 - fix buggy warning "WARNING: NOS1SYM ..." if S1SYM set           2016-01-08
 - add L1_PACKED=0 to allow fast non packed or slow packed tbase_vector l1
 - fix tJ- and tU-model for NN=32 (%4==0) and nn=3 (%4!=0)
   since 2014-05-14
 - fix speed_test.check_minsymcfg for tJ- and tU-model              2016-01-05
 - fix bad spins.c.L1258 special S1-Sym speedup (works for JJ only) 2016-01-04
 - fix bad gcc.phase_eq.fabs(cplx)<1e-6 by [sqrt](Norm(cplx))<1e-12
   tJ_lc_s1 k=2/8 nu,nd=0,2 failed before (see tJ_lc_s1.gpl)        2016-01-04
 - fix sqrt-format for negative non-sqrt numbers in myprint.c (vvv&64)
 - fix bad phase output (fabs() != sqrt(Norm()))                    2015-12-31
 - add new test data to doc/lc.gpl N=46 n1=45e9, lc_s1.gpl N=26, tJ_lc_s1.gpl
 - fix S1SYM + isminsymcfg_lm() return -1 for tJ+tU model (recursion)
 - Vlint as 2^n-bits to fit benes needs (hangs on old code and non log2)
 - algo=64 stop after l1 (prepare l1 serial, --ckpt_load=2 + parallel) 2015-12
 - earlier output of i100.speed at i001                             2015-12-24
 - reduce error output for mpi (symmetry search, nsym>MaxSym)       2015-10-15
 - fix speed_test factor 2 too high speed outputs (since v2.45?)    2015-10-08
 - fix array overflow (NN>MaxSym) in symmetry.c sym.w initializ.    2015-09-23
 - fix benes-algo for NN>64 (JJ, but slower than lNbrk)             2015-09-23
   NN=512 JJ.nn=4+36 + nn=2+498 tested, some very slow ini parts
 - add term e_i to tJ-model (tJ as tU where U=infty)                2015-09-23
 - fix div 0 for numsym=0, FP exception (or endless loop?)          2015-09-23
 - fix missing initialization of benes-network bnVS after 1st round 2015-09-22
 - fix buggy symcfg_bnv4 (only little speed-test relevant)          2015-09-08
   fix valgrind warnings
 - replace local hbuf (size*AH) by alloc_thbuf, stackovl speed_test 2015-07-08
 - switch back from benes to old slow lNbrk for NN gt 64 (buggy)    2015-07-07
 - add BENES-performance-decision-table in symmetry.h for different CPU-types
   also lot of detailed vector-performance data added in speed_mpi.gpl
 - improve Hbuffer allocation for multi-thread version (B_NUM>1)    2015-06-11
 - fix segfault in 2x2-diag-algo (a2)                               2015-06-08
 - expand (make test) for tJ and tU-model, fix some bugs+warnings   2015-06-02
 - add string.h to spins.c + vector.c to fix linker problem gcc492 -std=c99
 - fix abort-on-non-hermitian-matrix problem for vvv=2 if terms sum to zero
 - add macro for __int128 if it is available (for base configs)     2015-05-17
   speedup 3.5 for N>32 tU on Atom-N455 64bit SSE2 
 - benes network (bnVS) for bit-permute O(2Log2(n)-1) implemented   2015-05-11
   may be faster for SIMD, AVX2-advantage expected, but not seen fully
   set CFG_USE_BENES in hilbert.h (see also minsymcfg_blk_bnVS[_no_ud])
   tested on gcc41-gcc49 -m32/m64 SSE2/AVX2 NN=40
 - hbuf->el[].{bj,blkj,jj,rr} replaced by hbuf->{acj,blkj,jj,rr}[]  2015-04-15
   for better SIMD vectorization, also smaller CPU cache footprint
 - VChunkSize (load/store/checkpoint vectors) reduced to 4MB (OOM--)2015-04-13 
 - replace piecewise malloc by one pre-malloc for storeH (see bugs) 2015-04-08
   reduce problems (deadlocks + aborts) by mpi in OOM conditions (STOREH=0)
   malloc has some intern "optimization" which causes trouble near Out-Of-Mem
v2.49 Mar2015 = new stable (has 32bit-compile-problem, patch available)
 - fix n1=0; speed_test for vvv .and. 3 (old 2)       2015-03-11
 - minimize sbase size  (like old MPI+JJ)             2015-03-10
v2.49 Mar2015 (+fix above)
 - maxfile(=hfmax) should be set for STOREH=0 too, libc/ompi bad on OOM
   see bugs.txt
 - fix speed output (0-sym added) for numsym below 100             2015-03-09
 - fix missing if-condition before error for multithreads          2015-03-06
 - -DCKPT_HELP for SIGUSR2 triggered safe checkpoint window
    without MPI-messages                                           2015-02-26
   speed-loss? no
 - fix speed_test for tJ, tU; rename b_smallest to minsymcfg       2015-02-24
   PRF output per config*num_symmetry                               
 - fix segfault for speed_test for sym.numall=1                    2015-02-23
 - add ReserveMB to avoid Out-of-Memory in case of partly stored H 2015-02-23
   before that checkpointing could not alloc 16MB because of OOM
 - fix multiple output of checkpoint time for MPI                  2015-02-21
v2.48 2015-02-05 
 - fix uninitialized use of thbuf since 2014-07-14      2015-01-31
 - fix buggy hfmax use                                  2015-01-29
 - fix gcc-4.8.2 compile warnings (JJ,tJ,tU) using vlint
 - remove macro Ne dependency of Hubbard-e-term since 2.16 and earlier
 - replace all int= tbase(=vlint) .and. int, may cause bugs NN>64 2014-12-17
 - fix bad checkpointing "v0[eo].dat" if chkpt6.i-last is even    2014-12-03
 - fix segfaults nev!=0 introduced in 2.48                        2014-10-14 
 - fix compile errors for tU on wop()                             2014-10-01
 - reduce messages on loadvec error + abort                       2014-09-19
 - HW fault detection, memory bit flips (?) by checking tridiag range 2014-07
 - partly stored SH-matrix reenabled for maximum iteration speed
   this may cause problems with MPI calling mmap/malloc?
 - thbuf-allocs moved outside loops for better lowmem behav.
 - reduce lowmem outputs (see pipe.c L128)                        2014-07-26
 - coordinated l1-file access to reduce file-system-pressure 
 - replace big dyn. mem allocs within loops (fix lowmem probs)    2014-07-17
   buggy 2nd run (see 2015-01-31)
 - add CFG_SIMPLE_CODE for auto-micro-parallelization tests       2014-07-04
 - fix error propagation to mpi-threads for loadvec()             2014-06-19
 - improve mpi_stress.c + memspeed.c benchmarks                   2014-06-19
 - fix segfault by read unini mem in b_smallest (kago36,comp2)    2014-05-23
 - fix buggy faster b_smallest v2.44 for tU  (tJ untested)        2014-05-23
 - configure is detecting openblas-devel for parallel fulldiag a4 2014-05-15
   big matrices segfault sometimes for unknown reason (race cond?)
 - some more pre-benchmark outputs (eats 1-2 seconds per test)    2014-05-10
v2.47 2014-02-14 + 2014-05-25_hilbert.c
  - fix bad results for maxfile=0 a0 from a2-improvements(2.45)    2014-02-14
  - fix bad precision for lapack-3.1+.zheev.lwork=3*n1 sawt20Z6    2013-05-24
v2.46 2013-05-02 + 2014-05-25_hilbert.c
  - fix bug SiSj and wop (bad values, caused by .nhv vs. .rr)      2013-05-02
v2.45 2013-04-29
  - benchmarks and speedtests for verbose=3 added, oprofile tests  2013-04-25
  - auto load checkpoint removed to avoid problems with old runs   2013-04-25
  - better i100.t estimation (old was i/(i-1) too big)             2013-04-24
  - change ns-mode switching, a0=auto, a16=recursive, a32=old_seq  2013-04-24
  - add fstime() for seconds(+ms,us) as double, itime is obsolete  2013-04-23
  - output timings using prefix PRF for performance, clean diffs   2013-04-22
  - option maxmem removed (use ulimit or job limits vmem)          2013-04-22
  - fix ./configure (bad obsolete --mpp option, check for icc+c99) 2013-04-05
  - fixpoint16 removed, type cast revised, better float accuracy   2013-04-04
  - VecType-tests 1D-JJ-N=40 8+32 n1= 963793 gcc-4.1.2:            2013-04-04 
      1  1*8B: -5.37411616  -5.14810857  g99 -std=c99   i100=0.25m i75 0.18SH -O3
      0  1*4B: -5.37411616  -5.14810857  g99 -std=c99   i100=0.20m i75 0.20SH -O3
      4  2*4B: -5.37411616  -5.14810857  g89__complex__ i100=0.34m i75 0.20SH -O3
    ! 8 fix4B: -5.37389582  -5.37385328  gcc -std=c99   i1000, after bug fixes
    ! 8 fix2B: -5.26368707  -5.26176567  gcc -std=c99   i1000,
      == bad fixpoint float results + lot of complexity, removed for simplicity
      may be C99 half float can be used much simpler (not in gcc-4.1.2!?)
  - VecType-tests 1D-JJ-N=40 6+34 nosym n1=3.8e6 gcc-4.1.2: 
      4  2*4B: -1.75056342  -1.67952075  g++myclass     i100=0.82m i125
      4  2*4B: -1.75056342  -1.67952075  g89__complex__ i100=0.77m i125 0.28SH (=O3)
      4  2*4B: -1.75056342  -1.67952075  g99_Complex    i100=1.09m i125
      5  2*8B: -1.75056342  -1.67952075  g99_Complex    i100=1.29m i125 0.38SH
      5  2*8B: -1.75056342  -1.67952075  g99_Complex    i100=0.91m i125 0.33SH -msse2 -mssse3 -ffast-math -O3
      5  2*8B: -1.75056342  -1.67952075  g89__complex__ i100=0.94m i125 0.28SH -msse2 -mssse3 -ffast-math -O3
      5  2*8B: -1.75056342  -1.67952075  g++myclass     i100=0.91m i125 0.30SH -msse2 -mssse3 -ffast-math -O3
      4  2*4B: -1.75056342  -1.67952075  g++myclass     i100=0.83m i125 0.30SH -msse2 -mssse3 -ffast-math -O3
      1  1*8B: -1.75056342  -1.67952075  g++            i100=0.50m i125 0.23SH -msse2 -mssse3 -ffast-math -O3
      0  1*4B: -1.75056342  -1.67952075  g++            i100=0.48m i125 0.23SH -msse2 -mssse3 -ffast-math -O3
  - zahl,mzahl,mcplx replaced by double,sdouble,scplx (s=short,l=long)
  - CC=gcc -std=c99  adaptions, shbuf.shelem.rr(mcplx-to-cplx)      2013-04-03
  - typecast fix + _Complex support by A.Honecker for icc           2013-04-02
  - speedup next() for S=1++ removed for simplicity and fix S=4 bug 2013-04-02
    ToDo: add parallel code for recursive algo for compensation
    speed: N=5  S=4 nud=8+32   28s/threads vs. 0s, 12+28 30min vs. 0s(recursive)
           N=10 S=2 nud=12+28  31m/threads vs. 2s (factor 1000)
           N=14 S=3/2 nud=8+34 48s/threads vs. 1s (factor 48)
           N=20 S=1 nud=8+32   34s/threads vs. 8s (factor 4)
           N=40 noSym   8+32  120s/threads vs. 127s (factor 0.94 !!)
  - mv doc/example2.html doc/example_tU8.html                       2013-03-13
  - ignore long lines (above 1022 chars) in daten.def (m_bcc N=250) 2013-03-04
  - struct shbuf thbuf changed, nhn,nhv[ahead] computation for _nhv 2013-02-07
  - add MPI code to a2 (2x2 method) (works for AH=1 only, ToDo)     2013-02-06
  - fix OOM problem for big MaxSyms (reducing static array)         2013-01-26
  - fix bug even/odd checkpoint badly set back for resume           2013-01-25
v2.44 2013-01-17 + 2013-05-02_diag.c + 2014-05-25_hilbert.c
  - fix bug in algo2 (2x2diag) faster for SH not stored + mem/2     2013-01-17
  - add start-token for send_from_all_nodes-to-node0 on wv (mpich OOM) 2012-11-14
  - fix bad MPI recv bufsize on parallel ns() causing signal 15     2012-11-14
  - get_maxscfg for parallel ns speedup disabled, bad algorithm
  - fix "access last element"-error for n1==0, improve outputs, 2012-11-05
  - fix wrong error in clrvec for n1 smaller than number of nodes 2012-10
  - infile: include + ":" + "pout" removed for simplicity 2012-10,
    infile: x[0-9]* replaced by xout=*, l* replaced by loadvec=* 2012-10
  - break on full symtables to avoid mass output (NoS1Sym + S=1-Lattices)
  - fix mistake in b_smallest ib2=ib1; s2=1; (no nzx for s=2) 
  - fix problem on critical abort during checkpointing,
    use even/odd(n) instead of critical rename of chkpt n-1 to n, 2012-09-27
  - fix wrong complex phase output (1+(r-1))*phase (old: 1+(1-r))*p) 2012-08-03
  - add utils/io_latency.c to analyze storage speed (RAIDs, SSDs) 2012-07-05
  - simplify code + fix some tU lm-bugs (s1-triangle n=27*2 108bit) 2012-07-05
  - remove MaxMem (600MB), default is infty (limited by system) 2012-07-05
  - output cfgs as hex number (more compact), see N=54 sample 2012-07
  - change behavior on bad set nu,nd (try to keep smaller number) 2012-07 
  - fix problem with kill after chkpt2 and disk caching l1 (missing sync) 2012-06-26
  - speedup for parallel numsymconf ca 1..20 (s=1, small nu) 2012-06-24
    fix maxscfg() is now the maximum sym config
  - handle incomplete written chkpt6 (restore old ones) 2012-06-21
  - handle incomplete l1-writes (for odd Bsize) on chkpt.resume 2012-06-20 
  - check for possible changes of NN+Bsize after chkpt.resume 2012-06-20
  - fix problem with scmpich-lib, eating all the memory (and slowdown)
    on parallel_send_to_0/sequential_recv_from_all-sequence on numsymconf() 2012-06-18
  - change checkpoint numbering, 2 substeps (change chkpt for resumed jobs!)
    old=0ns1nc2sh3ew45ev6 new=0ns23sh45ew67ev89  2012-06-11
  - check plausibility of chkpt1 before using it  2012-06-07
  - replace savevec.MPI_FILE_* by MPI_Send/Recv + task0-file-Ops 2012-06-06
    to work around older network file system (mpich+NFS?) locking problems
v2.43 2012-05-23 add checkpointing functionality for MPI code
  - save_mode renamed to chkpt_mode (unused) 2012-05-23
  - writing more checkpoint files tmp*/chkpt[4] (status renamed to chkpt1)
  - fix problems on parallel func. ns+next() for s=2/2++ systems 2012-05-16
  - using version.c for version_date (avoid unnecessary slow recompilation)
  - fix parallel vector save/load
  - improved checkpointing for mpi-jobs (2*USR2+USR1 + chkpt_time) 2012-05-15
    the problem: sometimes jobtime is limited to a maximum (max. walltime)
    checkpointing is needed to stop (ordered) the job and resume it later
    using options --chkpt_load=4 --chkpt_time=60 or similar (SH not stored only)
v2.42 Oct11-Apr12 2012-05-07 (buggy cjacobi transposed-EV until 2016-04)
  - fix lapack related code 2012-05-05
  - parallel ns() writes one file via task0, 2012-05-05 (see speed_mpi:asgard)
  - for parallel ns(), write single file instead of 8*mpi*pt files (2012-05-02)
  - ">= defined in vlint.h (for 65bit++) 2012-04
  - fix 64bit n1 parallel computation problem for S1SYM JJ NN=64 (2012-03)
  - fix and enable multi threaded numsymcfg-code for s=1 (2012-01)
  - add HEigensystem as replacement of real [[A,-B][B,A]] for cplx matrix
    speedup about 5 (for complex matrix and fulldiag)
  - more info on matrix image (matrix.pgm, verbose+=256[+512], a4)
  - auto choose the faster method for scfg generation (a16 sets the opposite)
    that means the fast serial code is only chosen for pt_n*mpi_n=1
    this decision may be bad for a small number of nodes or threads
  - fix bad abort for mpi code and n1 above 2^32 (LC-41 n1=6.6e9 256nodes)
v2.41 Nov09-Oct11 (2.41b 2015-09-23)
  backport-fix bad MPI-test 32bit n1 of v2.42 2011-10-25 on 2015-09-23
  add doc/lc.gpl example data for spin-1/2-afm-Heisenberg-chain N=40
  add w_diag_op() replacing wop() for diagonal operators for speedup (JS-Oct11)
    this gives nearly the old speed for zizj without MPI ballast, see lc
  better error handling if ./tmp/ is missing (JS-Oct11)
  fulldiag: better error handling if malloc failed (JS-Aug11)
  improved error handling for mpi code of loadvec() (JS-Mar11)
  add mpi-code for loadvec() and minimalistic code for x_out() (JS-Mar11)
  add CONFIG_Fidelity switch to compute overlap to last EV (JS-Mar11)
  fix creation of ./tmp_shared on installation (JS-Sep10)
  fix utils/defspin1.sh (output of positions as float) (JS-Nov09)
v2.40 2009-11-26
      fix n1=0 problem for trivial case nu=0,nd=N where n1=1 is correct 
      add Zi output to fulldiag (a4), fix Zi in case of used ud-symmetry
      add site-rotation-term samples to spins.c (disabled by if_0_endif)
      fix m_lattice.c + m_tilings.c output format (example3 did not work)
      fix bug randvec=1 n1=1 (example: all spins up or down)
      fix bug in op_sxsx, op_sisjsksl, op_jxjx, op_ninj, op_nisnjs, op_nis
      add 4-spin operator op_mult_sisj_sksl(), rename op_sisjsk to op_diff_s*
      fix bug for MPI where some nodes can not store full matrix to memory
      remove 11 char limit for input file name
v2.39 2009-04-20
      fix wrong n1=0 for serial code a0, skip trivial case: nu=0, nd=nn
      new option -m (default: daten.def)
      new option -z  for utils/def2fig.sh
      models/lattice.c renamed to m_lattice.c, also new options added
        all coordinates based on base coordinates (ex: 60degree for kagome)
        add stretched kagome lattice
      add CONFIG_SymSearch option to config.h (0: disable slow sym search)
      reduce default static size (probably lower speed)
      partly replace "\n ..." by "...\n" for better MPI output
      Tru64 defines LONG_BIT instead of __WORDSIZE
v2.38 2009-02-11
      fix possible deadlock for wop() (expectation values)
      fix deadlock for x_sort() if last task has block length node_n1=0
      fix premature convergence abort for N=40 square j1=-1 j2=0.42
      exe/tmp can point to local scratch space now (disk cluster)
      exe/tmp_shared for shared scratch space (removed in later versions)
      fix check of XY_TYPE bits for node_n1 (not n1)
      fix error in mrule for mpi_n>1 (wrong results)
      remove execution of spinsdef in Makefile, better for cluster
      remove llong = "long long" for better MPI and ansi compatibility
        will be a problem for gcc -m32 (no C++, 32bit, NN -gt CM(32,16,16))
      fix compile problems for CM=tU,tJ (hubbard model, t-J-model)
v2.37 2008-09-17 (benchmark version)
      add CFG_CPUSET option (set to 0x0005 on dual HT-Xeon boards)
      fix free(static) bug in mpi version of ns()
      output of SH_speed and i_speed in hnz/s
      better configuration script for MPI
      SH.t measures max. realtasktime which can be wrong for overload, fixed 
      better balance measuring from hnz[block] (max. efficiency = t1/(n*tn))
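
      Note on the balance entry above: with t1 the one-task time and tn the
      time on n tasks, the efficiency is t1/(n*tn). A minimal sketch follows,
      assuming the runtime is proportional to the nonzero count per block, so
      that t1 ~ sum(hnz) and tn ~ max(hnz). That proportionality and the name
      balance_efficiency are assumptions, not taken from the spinpack sources.

```c
#include <assert.h>

/* Hypothetical helper (not spinpack code): upper bound on parallel
 * efficiency from per-block work counts hnz[0..n-1].
 * Assumption: time is proportional to work, so t1 ~ sum(hnz) and
 * tn ~ max(hnz), giving efficiency = sum / (n * max). */
double balance_efficiency(const double *hnz, int n)
{
    double sum = 0.0, max = 0.0;
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        sum += hnz[i];
        if (hnz[i] > max)
            max = hnz[i];
    }
    /* no work at all counts as perfectly balanced */
    return (max > 0.0) ? sum / (n * max) : 1.0;
}
```

      Equal blocks give 1.0; one block three times heavier than three others
      gives 0.5.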
v2.36 Jun08 2008-08-04
      better default settings for mpi and pthread buffers
      expectation values computed now using mpi (wop(op))
      replace v0[thread] by v0 + b_ofs[thread] (better mpi code, less /%-ops)
         (one malloc, first touched by threads)   
      sort mpi data in SH once (like i100)
v2.35 2008-07-22
      replace integer modulo operation (slow on IA64)
      add anisotropy Jz(i,j) (as z$parameter_index to daten.def, default=0)
       for H = Sum(i,j) J(i,j)(Sx(i)Sx(j)+Sy(i)Sy(j))
                       + (J(i,j)+Jz(i,j))(Sz(i)Sz(j))
      fix bug for moved b_smallest() b2i() (part of FPGA rewrite)
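
      Note on the Jz(i,j) entry above: the same Hamiltonian in LaTeX notation,
      assuming the sum over bonds (i,j) covers both terms:

```latex
H = \sum_{(i,j)} \Big[ J_{ij}\,\big(S^x_i S^x_j + S^y_i S^y_j\big)
      + \big(J_{ij} + J^z_{ij}\big)\, S^z_i S^z_j \Big]
```

      With the default Jz(i,j)=0 this reduces to the isotropic Heisenberg
      form J(i,j) S(i)S(j).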
v2.34 2008-04-23
      fix deadlock and errors for partly stored matrix using MPI
      fix AddSS bug (was a factor of nw too big and too slow)
      new option -o<outfile> for better job management (torque/PBS buffers stdout locally)
      Fix: log2(n1=0) FP EXCEPTION for Tru64@alpha (-nanf on linux)
      Fix: mymap.read(2GB+) failed on 64bit-systems, buggy since v2.33?
      Fix: ini_thxy() hanging in endless loop if (int)2*n1<0 (square42)
v2.33 2008-03-16
      mymap(ev) replaced by mymalloc(ev), else slow or hangs on NFSv3+MPI
      Bug for mapped eigenvects on MPI systems fixed (nev>0)
      Bug for 32bit systems using mpi and mymap > 2GB fixed (for LFS)
      norm2 vector removed, code simplified
      Bug fixed, for maxfile = 0, was 10* slower (wrong number stored blocks)
      also output excitations for small n1 at "conv=" line
      Bug fixed, maxfile too small or 0: wrong results (faster mpi)
      Bug fixed, n1 < num_threads
      less output for nodes*ppn > 8
      algo:a0 replaced by a16, a0 is new fast ns(), a16 is old slow ns()
      good scaling up to 128 CPUs tested
v2.32 2008-02-19
      single thread workaround for S=1 systems, multithread computes n1
        sometimes too big (ToDo: check for reason and fix it)
      fix bug in fulldiag (a4) matrix generation
      don't print pointer for verbose malloc for easier diff
      bug fixed for maxfile=0 (incomplete stored H)
      IA64 is very slow for div operations, MPI speed up 50%
      MPI Data reduced, 30% speedup for 100Mbit
v2.31 2007-12-14
      fix problems with model=tU for v2.27 or later
      fix problem of uninitialized values on tiny systems, where some
         threads have an empty cfg-table (since v2.27 and B_NUM>1)
      multiple sublattice detection, better MPI scaling
v2.30 Dec07
      STOREH=0: bug for 2nd run of storeh fixed, 0 is default now in config0.h
      nzxmax fixed for multithread, STOREH=2 missing creation mode fixed
v2.29 2007-12-04
      STOREH=0: realloc of more than half of main memory may fail, recoded
      new sublattice generation, using defined bonds only (not correlations)
      weight matrix of Ising energies for j1 and j2 bonds extended (3D plot)
v2.28 Nov07
      2nd successful mpi-run (np<4 only for dsk>0%), hybrid MPI+PTHREAD
      define CONFIG_DIMER_CHECK for artificial symmetry breaking (finite systems)
        only symmetries which are not explicitly set in daten.sym can be broken
      first successful mpi-run (np=2..3 pt=1 only), but not useful (slow)
      output of converged energies for better awk handling
      -D STOREH=0 to store/read H into/from memory for max. performance
      reduce cache (line) coherence overhead by replace of work.bi[thread]
      for configurations output .oxQ for .ud3, which looks better
      new sublattice (SL) generation (bi-/tripartite only)
      macro Sud removed, switch off ud-symmetry by setting sym.wud=sym_ud=0
v2.27 2007-11-22
      new performance data, good pthread scaling up to 32 (no disk used)
      mrule speedup (5-100)x using symmetry, more comments
      history.tex converted to history.html
      set number of threads by option -tn (n={1..B_NUM})
      noSBase is not supported anymore (use sym_k= -9999) for simplicity
      update models/*.c according to modeldef.c (partly untested)
      modeldef.c: only the new flexible format of daten.def is supported now
      mrule parallelized, renaming korr to corr (en:correlation)
      remove chk2 function for degenerate states (simplification)
      remove wop_k, wop_t, wop_ud (simpler, can be taken from biggest coeff.)
      renaming ckfg to ccfg (proper English), May07
      output time in minutes (better to parse) + human abbreviation, Apr07
      fix segfault for small LM-systems in wop_block function
      status script removed, write PID-file instead (more flexible)
      stack-output for SIGUSR1 removed for simplicity (better use gdb)
      fix noncritical error in err_fulltable
v2.26 2007-02-27
      h_get inlined (not done by the compiler), (2-5)x speedup for i100
      better scaling of iteration (was bad before)
      thread code simplified
      B_NL2 removed, B_NUM used now to define #threads, code simplified
      configure checks signal.h, fails on g++-4.1 on SLES9 at IA-64
      four-site exchange added partly (must fulfill symmetry)
      fast_a2 removed due to future adaptions (#CPU>1024, FPGA)
      results may be wrong for a2 using syms (LC6 was NaN, check it!)
      remove HALF_HXY (store only upper half triangular H,
        problems with blocking, simplify code)
      reduce files from B_NUM^2 to B_NUM
       (non diagonal blocks carry only about 10% of diagonal blocks, wasted)
       problems with high file number if going to massive parallel
      change output of ZiZj, SiSj and S^2 for a4 via sisj=... (daten.i)
v2.25 2006-10-23
      fix gcc-3.3.4 compiler error and warnings
      bug fix for <SiSi>!=0.75 and NOS1SYM if sym_lm!=nn
      a8 bug fixed (this was introduced by new parallel method)
      better error report, if HRMAX is too small
      symmetry search can be aborted by SIGUSR2+SIGUSR1 (useful for pyrochlore)
      output a warning if maxfile limit was reached (ERR(630))
v2.24 2006-04-12
      Output trace of H, sum of upper left nondiagonal elements of H
       and sum E as a check for correctness of matrix elements (see Bug below)
      Bug a4-wrong results (!=a0) squago30 29+1
       a0: wrong h12 for B_NL2>0 + CONFIG_noB_MASK + nommap, but H is ok
       a4: wrong results for B_NL2>0 (fixed)
      error message "unused bits are set" fixed for N=32 on int32-systems (RS)
      tmp/tri.txt closed after usage (stated as memleak by valgrind)
      performance data for 2-processor Dual-Core Opteron running Linux added
v2.23 2005-07-11
      fix segfault for lapack + complex
      add input format " p%d= %lf" for daten.i
      hole-hole repulsion added for tJ+tU-model (h#)
      bugfix pic2.cc (if edgevectors are negative only)
      bugfix o2tower.sh (convergence warning led to duplicated values)
      nev=0 for fulldiag using LAPACK (faster), less memory (cplx)
      overflow dbgD for SSANISO fixed
      make N=6 s=14 possible by defining Tnorm2 as double (default=long)
      sisj now is a bitpattern (bit0=<sisj>,bit1=<ss>)
      store_sym_tupel overflow for NN>nn fixed 
v2.22 2005-05-06
      utils/o2tower.sh adapted to new output format
      memleak for fulldiag+pthread (fixed using valgrind),
      vector.c iortho(): wrong degeneracy for complex vectors (fixed),
      daten.def: change format of =pbcf= from A-B-C-D-A to A-B-C-D
      script utils/def2fig.sh for xfig improved, new options
      "code2" code removed, models/m_tU.c added for Hubbard chains
      bug for tJ,tU+SBase removed (no nondiagonal elements since v2.20)
      op_ninj now is (nu+nd)(i)*(nu+nd)(j) instead of nui*nuj
      new expectation values (op_nisnjs) for tJ/tU-model
      1st step to replace wopij() by wop() for any number of sites
      definition of operators depending on more than 2 sites possible,
        see xval.c wop(), op_sisj2(), op_sisjsksl() (work in progress),
        for s>1/2 all (2s*2s) intrabonds i-j have same correlation
        old: s=5/2 ss=s*(s+1)=8.75 SiSi=0.75 SiSj=0.25 ss=5*SiSi+20*SiSj i!=j
        new: s=5/2 ss=s*(s+1)=8.75 SiSi=SiSj=ss/25=0.35 
      NE0 replaced by ne0 in daten.i (more comfortable)
v2.21 2005-09-08
      rename macro Zahl (german) to VecType, default to complex for c++,
      convergence check improved (sometimes it stopped too early, pew=NEW=60),
      max_NN enlarged from 127 to 32767,
      speed up for s>1/2 emulation (about factor 2s)
      include infile (new command in daten.i, max level 1)
v2.20 2004-03-26
      error.tex translated to english (please correct bad english for me)
      lintab renamed to hilbert (it's a better name), buggy wopij fixed
      makesym() removed, big vectors (.vec) moved to ./tmp (quota)
      bug040217: nu==nd && Sud==0 && (!norm2)==ERR(600)
      - adding local single site anisotropy (SSANISO, by Reimar Schmidt)
      h_file: testing STORE_XandY for future use (more simple code)
      - N=32+8 XR: raw=hnz*5=66MB zip=34MB, XYR: raw=hnz*9=119MB zip=47MB
      - try to change size of htmp by: export GZIP="-6" before starting spin
      - add octahedron def-files to models
      - add 16MB I/O-buffer (reducing file fragmentation a bit)
      bug030620: complex + LAPACK + fulldiag uninitialized values
        leading sometimes to random results, fixed using valgrind-1.9.6
      memleak fixed, parallel SiSj, using valgrind-1.9.6
v2.19 2004-02-27
      Warning: there could be new bugs, only speed_test is checked by me
      if norm2[] is stored, don't collect states with different orbit length (vec output),
      H-blocks instead of H-stripes for better speed (local data, CPU cache)
      no binary compatibility to tmp-files of older versions!
      numsymconf uses max. B_NUM threads (old: 16 threads)
      bug: PTHREADS: storeh_block starts with uninitialized values (fixed)
      bug: TBC + sym + nu==nd => -9.95702223 nonhermitian H also wx=0 (fixed)
v2.17 2003-04-24
      pthread_attr explicit set to PTHREAD_SCOPE_SYSTEM, because SunOS
      uses PTHREAD_SCOPE_PROCESS as default (all threads running on one CPU)
      h_file.c completely rewritten for better performance (40%(x86)-200%(MIPS) faster)
      better use of CPU cache, much better speed! has much more potential!
      bug: use of only one H-block (B_NL2=0) negative shift fixed
      output sym-factor(norm2) (=orbital-length?) after configurations,
      mmap offset must be multiple of _SC_PAGESIZE on some systems (fix),
      bug: startvec=4 and if max_ea>1 and nev>0 and more than one run (fixed)
      new: startvec=<4+rnd*8> (e.g. 12, 20, 28) for different startvectors
      a0: max. n1 iterations (small n1)
v2.16 lapack usable for full diag (a4), try ./configure --lapack  2003-04-11
v2.15 speed_test added,
      Warnings fixed: suggest explicit braces to avoid ambiguous "else",
      bug fixed: r^2 was wrong (pos was set after daten.def was read),
      bug fixed: (for EA>1 checkpoint was not reset, wrong results for EA>1),
      compatibility to cc, c89: Compaq C V6.4-014 (no //, see c89 -V)
      sleeping time on error reduced, repeating errors not printed,
      comments added in put_h()
v2.14 bug: forgotten close after reading tmp/tmph* (resulting in linux-2.4.10
      crash, if ev=1 and a0 is repeated 1000 times) fixed
      model-files with nn<NN accepted (no recompilation needed)
      spins.tex further translated to english
v2.13 compiler error if float fixed
      old Bug fixed, using valgrind-20020601 by Julian Seward, GNU
       (Bug: fulldiag a4 + float, since v1.8)
v2.12 mrule() in xval.c revised, set sublattice SL=... for MRule
      use SL=...;a0;SL=...,a3 for different sublattices
v2.11 version is now real number (2nd dot removed, 2.101<2.11)
      fulldiag: output E and S^2 as table (for TD, susceptibility)
      precision depends on float
v2.1.0  vlint.h (C++class: very long int) added, allows any number of sites
        68 sites tested, but tJ and tU not checked for correctness
        ! matrix dimension is still limited to 32 or 64bit value !
        simplifications using C++ operations (code2 under construction)
v2.0.3  bug fixed (wrong matrix dimension on very small systems),
        quick-start (2 spins) added in documentation
v2.0.2  use ud+=+1,-1 and/or param+=0.1,0,0,0 to change parameters by constant
         steps more easily (e.g. in loops) to generate lots of data,
        "+="-form works for anisotropy too,
        name of package is changed from spin to spinpack (more unique)
        bugs fixed (noSBase,tJ(but slow),s=1,ferrimagnets)
v2.0.1  startvec changed according to v[i]<1.0 (16bit-real)
        Warning: random startvec before v2.0.0 is not reproduced!
        do not load/save l1 (it's mmapped to a file already)
        save thxy_r-table all the time (it's only a small amount of data)
        save_mode=1: bug save hxy fixed; write tmp/status n1 fixed
v2.0.0  remalloc uses fallback to mmap(tmpfile) => no memory limits (a0,a2)
        1GB-File-split removed (use linux-LFS or 64bit or more blocks)
        quash H-file if a reproducible error occurs during write (disk full?)
v1.9.3  mmap used to bypass memory+swap limits (use algo=2 to spare memory)
        bzlib for h-zip (ca. 1/3 of size! (gz=1/2) but slower unzip 2*gz)
        try lin() before get b_smallest, should save a lot of time
v1.9.2  use (unsigned Tnorm2)norm2, b_smallest, b_ifsmallest3 10% faster
        algo=2 (2x2-diag) parallelized (around a0 speed!?)
v1.9.1  better error handling, lintab/numsymkonf changed (faster, but save_mode may not work)
v1.9.0  storeh(),read_h(),wzizj() parallelized using pthreads (configure -mpt),
        recommended for shared memory multi processing machines,
        using 3 to 4 threads is default (best number of threads depends
        on disk speed and size, it depends also on the value of the variable
        maxfile and of the CPU usage by other processes of course)
        and can be changed by defining B_NL2 in config.h (3 for 2^3 blocks),
        only lanczos-inc1 is parallelized yet (getting energy),
        speed-square36: numsymkonf=10x, storeh=2x, Hv_from_disk=1.1x
v1.8.3  pthread library used for multiprocessor machines, no MP_PRAGMAs
        MIPSpro C++ v7.2.1 warnings fixed, configure --mpt added
v1.8.2  more output if sisj=1 (sisj-3zizj, num_of_same_bonds)
        missing commas in param= for default daten.i fixed
v1.8.1  CONFIG_ABEL compiler error fixed, bug: read_sym if s1-sym fixed
        div0 fixed
v1.8.0  missing close(tmp/tmph*.tmp) fixed, (fd-buf overflow after a lot of iterations)
        fulldiag => ev=0 possible, no fatal error if jacobi does not converge
        fulldiag output changed, some fixes
        use CONFIG_TBC instead of IFww, TBC+noSym+noSud(!) fixed
        daten.i: nud= and param=  instead of :nu,nd,p1,p2,p3...
v1.7.20 h-field added (patch by Reimar Schmidt), add some corrections
        bug in fulldiag (ev, since v1.7.3) fixed
v1.7.19 daten.def pos[x,y,z] is changed to double, m_1d.c added
v1.7.18 configure v0.4.1, configure --debug => stack-checking
v1.7.17 configure v0.4.0, bug in symmetry.c fixed (daten.sym + sym_k=-1)
v1.7.16 sbase->norm2 changed to llong allowing  N=10 S=5/2 on 32bit machines
        b_ifsmallest_lm() etc. added
        bcc40s2 30u10d 31m/10s/18s/483s/509s => 0s/6s/12s/73s/90s i586-133MHz
        S=1 and ud-sym or tJ,tU not tested Apr2001
v1.7.15 maxfile is per block now to live with the 32bit limit
v1.7.14 symcreate now recursive, shorter and over 40x faster,
        ATTENTION: other generators!!!
        no-degeneration bug fixed, documentation extended
v1.7.13 daten.sym: cyclic form possible, trying all permutations
v1.7.12 tJ,tU cc-errors, asm removed, n1==0 => no error
v1.7.11 new packaging
v1.7.10 h-buffer for writing, parallel (no speedup), models updated
v1.7.9  check_point=4 implemented (gzip faster (buffer))
        get_h is now not parallel! change it!
Feb 2001
v1.7.8d bug removed for Sz=0, Sud not used; <Szi>=0 always!
v1.7.8c complex.h error if Zahl==4 xor 5 fixed?
v1.7.8,8b bug fixed, HIDX replaced by HRMAX>0 sizeof(thxy)=5 (8)
v1.7.7 HRMAX=0 replaces noHIDX, save_mode+checkpoint against crashes
v1.7.6 mp-bugs: hamilton(), problems with I/O + stack ???
v1.7.5 mp-bug removed: hamilton() v0x,v0y,.. must be local()
v1.7.3 vector divided in blocks, preparation for block version
v1.7.2 bug fixed in fast_nhv(), blocksize rounded to 2^n
v1.7.1 bugs removed: wrong sym if bond twice, h_stored<100%
v1.7.0 H is written/read on disk only now (parallelizable)
v1.6.4 zlib used instead of pipe (no memory problems on thor???)
v1.6.3d bug removed, if v0,v1=NULL, small changes in inc1()
v1.6.3c HRMAX set in config.h, sisj=0 (daten.i)
v1.6.3b Tnorm2 can be set to char (less memory, see spins.h)
v1.6.3 bug in symmetry.c removed (delay, if daten.sym given)
v1.6.2 cache in pipe.c for better performance
v1.6.1 new pipe.c (IRIX64 needs twice of parent-mem for fork or popen)
v1.6.0c bug removed thor: -mp sym_k=-1000 Sz>0 numsymkonf is wrong
v1.6.0 TBC now also with symmetries!
v1.5.8 Twisted Boundary Conditions or Field (J^-(i+L)=e^(iw)J^-(i))
v1.5.7 fork sometimes fails, HRMAX set to 1024
v1.5.6 better error handling (daten.def, more output)
v1.5.5b bug for S=1 and sym_k!=0 removed (thanks J.R.)
v1.5.5 S1SYM built into numsymkonf() and next(), S=1 faster!
v1.5.4 no zombies, finally S1SYM faster and correct!?
v1.5.3 base2[],lintab removed, block_no(), nw<=Nw possible
       automatic SUBLATTICE generation, function=symkorrelationen()
v1.5.2 bug: pipe.c killed gzip before gzip finished writing
v1.5.1 Hamiltonian matrix compressed with gzip (see pipe.c,h_file.c, approx. 50%)
v1.5.0 model.h no longer exists! use daten.def only
v1.4.8 alternative creation of H, algo=8 a8 [n1max] (10x slower)
v1.4.7 doc/error.tex as a more detailed list of error descriptions
v1.4.6 ansatz.h/ansatz.c for variational ansatzes
v1.4.4 bug removed (writing hamilton twice caused errors)
defspin1.sh for generating lattices with Spin-1,Spin-3/2,etc.
v1.4.3 vectors can be put on disk if RAM is low & C++ is used (array_m.h)
v1.4.2 now use the disk to store l1  (N=40 possible using 4GB mem)
Jan 2000
06.12.99 definition of HIDX in spins.h for storing indices to matrix values
         less memory and disk, therefore faster, (+ 2*8*64k memory for table)
09.11.99 Bug in gcc on alphaPC (large struct object as function argument)
31.10.99 AddSS renamed to Add34, AddSS added as $S^2$ operator
07.06.99 anisotropy parameter for xy-component of Heisenberg-Hamiltonian (XXZ)
v1.4
01.02.99 anisotropy parameter for z-component of Heisenberg-Hamiltonian (XXZ)
         sym_k= -n   for omitting next n generated symmetries, v1.4
23.01.99 no extra output file, better to use stdout and filter via awk
Jan 1999
07.12.98 start vector for algo=1 can now be selected correctly
30.11.98 Hamiltonian operator is stored in GB-sized chunks (h_file.c),
         this works around the 2GB file limit on 32bit UNIX
v1.3     algorithm a4 = full diag (works, bad implementation)
1998-Nov spins-lintab2 faulty for non-commuting incompatible symmetries
         e.g.: N=4 k=2/4 0/2  (sgn[udud]=0)
1998-295 spins-chk3: compute only distinct pair correlations
         now correct Sz,S,ZMag for spin mixtures (S=1/2,S=1,...)
1998-293 sighand-ini: shell script status is generated, use "sh status"
1998-270 m_bcc: now always R(i=0)=( 0 0 0 ), better for distance determination
1998-190 NE0=0 eigenvector=ground state, NE0=1 1st EV = 1st excitation etc.
         determination of the equivalence classes (max. number of selectable k's)
1998-184 cplx+SBase: storeh >10 times faster, because local arrays are now static!
1998-183 number of eigenvectors can now be changed via nev in daten.i
1998-170 complex version tested for JJ-N=5-chain, k=1/5*2Pi, g++,Zahl=4
1998-154 tU,tJ with up/down separated (code2), now 64 sites (llong) possible!
         bug removed, tU,tJ with SBase and k!=0 now correct
         bug removed, number of syms: if cyclen(P0)!=2, P0^i etc. were missing
         bug removed, wrong cycle length of the permutation (not Prod!)
1998-139 notation year-day_of_year, daten.i now with Sym_ud=<+1|-1>
         date '+%Y-%j'
18.05.98 bug in symkorrelation() removed, it generated asymmetric correlations
08.04.98 version 1.2 up to N=64 JJ-sites (long long), NoAbel, k!=0 works ?
15.01.98 lintab2(),wzizj() sped up for symmetry + correlation-symmetry ansatz
Jan 1998
15.12.97 finally the breakthrough in symmetry search, now 0:01 instead of 8:01
         for BCC30_30 (486DX4-100)! (tries the neighbor next)
12.12.97 use of non-commuting permutations works in the k=0 space
         (many 32-site systems can now be computed, approx. 20h,200MB)
OO.OO.97 parameters (J's,t's,U's etc.) are assigned via an index list,
         easier to handle, bonds classified
Jan 1997
08.01.96 ud-symmetry can be used with Sud=1,
         H-matrix is stored in hxy.tmp (tmp_hxy) ONLY if memory is insufficient
10.07.95 rq: (H*v-<v|H*v>*v)^2 => |(H*v/<v|H*v>-v)| 
19.04.95 _nhn=_nhv(m1=1) -> saves _nhn
11.04.95 symmetrize ->  v'=v +/- SymOp(v)    (does not change the eigenvalue!!!)
11.04.95 simultaneous diag. 2A(-1,+1)+B(-1,+1)=(-2-1,-2+1,2-1,2+1), [A,B]=0
1995-Mar storing H, tJ-16 factor 3.5 faster          tJ-16 131s/70It
14.02.95 introduction of symmetry quantum numbers (commuting permutations)
12.10.94 16-site tJ lanz=13m24s lanz2=13m42s
10.10.94 hamilton_nhv (new procedure)  <n|H|v> l.615 tJ-8  1s
10.10.94 hamilton_nhn (new procedure)  <n|H|n> l.695 tJ-8 24s
27.09.94 lanz2 added (new method)              l.866 tJ-8 98s
29.06.94 diploma version done (without symmetries)   tJ-8  3s/6s