Sorry for the language mix (German, English).
ToDo: hamilton_nhv like fast_hamilton_nhv (optionally store H in addition)
replace XY_FLAG by separate x entry (mem+...%) but sort HBSize
for tasks (const. package size)
old.xy_flag(mixed blocks): xy+by+rr=4B+2B+2B=8B + buffers
new.xx_idx(sorted blocks): xx+yy+rr=4B+4B+2B=10B + const.pktsize
-d$TMP for batch jobs? (better: remove the tmp dependency from the code)
bug: if ulimit -v is smaller than maxfile, i100 fails ??? fixed 2011?
bug: Hubbard model nu=nd=N gives n1=0 (should be n1=1)
- adapt to clusters without shared FS
(how to distribute a sequence of data of unknown length? blocks?
or a dry run for counting only)
simplest: switch to counting only after OOM, store the last
stored and the last counted scfg, and maybe every 1024th?
2*nodes counted_ranges: 1 1 1 1 to 2 2 ... +2+2 to 4 4 ... +4+4
store start scfg + foundnum scfg + end_cfg + time_needed
so we have stored ranges and counted ranges
2nd round: recompute the counted ranges only, on the right nodes
optional "stop" via nonblocking MPI from the OOM node?
OR round-robin 10e6-cfg chunks (doubling if under 1 min) + list of reallocated
scfg-blocks (sort blocks in the 2nd run, test version: do it in parallel to
disk, if it works, remove the disk code)
chunks of the size of the max. free space
{chunkidx, startcfgORidx, stopcfgORidx, numscfg, time, *scfgs...} (see sketch below)
tree-algo usable for start and stop tree partitioning?
stop at the 1st depth-nu cfg of depth-(nu-4) ??? problems?
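A possible C sketch of the chunk record above (t_cfg and the field layout are assumptions, nothing of this exists in the code yet):
    /* sketch only: bookkeeping for stored vs. counted-only chunks */
    typedef unsigned long long t_cfg;           /* assumed 64-bit config type */
    typedef struct {
      unsigned int  chunkidx;   /* position in the global round-robin order  */
      t_cfg         startcfg;   /* first cfg (or idx) of the chunk           */
      t_cfg         stopcfg;    /* last cfg (or idx) of the chunk            */
      unsigned long numscfg;    /* number of symmetric cfgs found in range   */
      double        time;       /* seconds needed, for balancing the 2nd run */
      t_cfg        *scfgs;      /* NULL if only counted (OOM), else stored   */
    } t_chunk;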
- make an example.html page for different physical models
and put the link to the README
- parallel sort for fulldiag by E_Ising=Ez using a bitonic sorter (see sketch below)
http://www-i1.informatik.rwth-aachen.de/~algorithmus/algo12.php
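Minimal serial sketch of the bitonic network (power-of-two n; the inner i-loop consists of independent compare-exchanges, which is what makes it attractive for a parallel sort by E_Ising; function name is an assumption):
    /* serial bitonic sorting network, n must be a power of two, ascending order */
    static void bitonic_sort(double *a, int n) {
      int i, j, k;
      for (k = 2; k <= n; k <<= 1)            /* size of the sorted subsequences  */
        for (j = k >> 1; j > 0; j >>= 1)      /* compare-exchange distance        */
          for (i = 0; i < n; i++) {           /* independent -> threads/MPI ranks */
            int l = i ^ j;
            if (l > i) {
              int up = ((i & k) == 0);        /* direction of this subsequence    */
              if ((up && a[i] > a[l]) || (!up && a[i] < a[l])) {
                double t = a[i]; a[i] = a[l]; a[l] = t;
              }
            }
          }
    }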
- remove maxmem from daten.i (maxmem+usemem? from *.c) jobsystem!
set 0 as default (2011-12-09)
- generate matrix.pgm earlier in fulldiag, scaled for bigger matrices
up to n1=200e6 (fits to 2GB)
to 512..1023 pixels from getH without storing the full sparse matrix 2011-11 (sketch below)
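A possible sketch (not existing code): count the nonzeros reported by getH into a W x W raster and write an ASCII PGM at the end, so no full sparse matrix is stored:
    #include <stdio.h>
    #define W 512                               /* 512..1023 pixels as noted above */
    static unsigned long cnt[W][W];
    /* call once per nonzero element (row,col) while getH runs, n1 = matrix dimension */
    void count_nz(unsigned long long row, unsigned long long col, unsigned long long n1) {
      cnt[row * W / n1][col * W / n1]++;
    }
    /* write the raster as a P2 PGM, dark pixel = many nonzero elements */
    void write_pgm(const char *name) {
      unsigned long max = 1; int x, y;
      FILE *f = fopen(name, "w");
      if (!f) return;
      for (y = 0; y < W; y++) for (x = 0; x < W; x++) if (cnt[y][x] > max) max = cnt[y][x];
      fprintf(f, "P2\n%d %d\n255\n", W, W);
      for (y = 0; y < W; y++) {
        for (x = 0; x < W; x++) fprintf(f, "%d ", (int)(255 - 255ULL * cnt[y][x] / max));
        fprintf(f, "\n");
      }
      fclose(f);
    }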
- http://graphics.stanford.edu/~seander/bithacks.html
Swapping individual bits with XOR
+: using v & -v last bit counting?
0010100 | 0010011 = 0010111 (v|(v-1))+1 01... to 10...
0010100 & 1101100 = 0000100
... -1 = 0000011
x = ((b >> i) ^ (b >> j)) & ((1U << n) - 1); // XOR temporary
r = b ^ ((x << i) | (x << j));
etc.
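The XOR bit-field swap above as a self-contained sketch:
    #include <stdint.h>
    /* swap the two n-bit fields starting at bit i and bit j of b (fields must not overlap) */
    static inline uint64_t bit_swap(uint64_t b, unsigned i, unsigned j, unsigned n) {
      uint64_t x = ((b >> i) ^ (b >> j)) & ((1ULL << n) - 1);  /* XOR of the two fields     */
      return b ^ ((x << i) | (x << j));                        /* XOR-ing twice swaps them  */
    }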
- replace vdate.h(#define) by vdate.c(const char) to avoid recompiling spins.c
or split spins.c (add main.c?)
- ns writes sns to local files and concatenates explicitly (removes the NetFS bottleneck)
filter n1/threads ... 1M-chunks for better distribution
- use ErzeugeneSym^ni instead of storing all Syms (saves cache! more speed)
generate products in smallest() + optionally permute via tables (sign?)
+ fast permutation by tables
http://microcontrollers.wordpress.com/2011/03/11/how-to-do-really-fast-bit-permutations-with-few-operations/
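Minimal sketch of the table idea (assumed names; for a fixed permutation perm[] the byte tables are filled once, each apply is then two lookups and an OR):
    #include <stdint.h>
    #define NBITS 16
    static uint16_t ptab[2][256];          /* one lookup table per input byte */
    /* build tables once, perm[b] = target position of source bit b */
    void perm_init(const int perm[NBITS]) {
      int byte, v, b;
      for (byte = 0; byte < 2; byte++)
        for (v = 0; v < 256; v++) {
          uint16_t r = 0;
          for (b = 0; b < 8; b++)
            if (v & (1 << b)) r |= (uint16_t)(1u << perm[8*byte + b]);
          ptab[byte][v] = r;
        }
    }
    /* apply the permutation with two table lookups instead of NBITS single-bit moves */
    uint16_t perm_apply(uint16_t cfg) {
      return ptab[0][cfg & 0xff] | ptab[1][cfg >> 8];
    }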
- use recursive n1 and split search paths across threads (sketch below)
write nu=0 to file0
read file0 and write nu=1 to file1, rename file1 to file0
read file0 and write nu=2 to file1, ...
stop if nu reached
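The file round trip above as a compile-only sketch (write_level0() and do_level() are assumed helpers, not existing code):
    #include <stdio.h>
    extern void write_level0(const char *out);                     /* assumed: writes the nu=0 cfgs      */
    extern void do_level(const char *in, const char *out, int nu); /* assumed: reads nu-1, writes nu     */
    /* sketch: build the basis level by level via two scratch files */
    void build_levels(int nu_max) {
      int nu;
      write_level0("file0");
      for (nu = 1; nu <= nu_max; nu++) {
        do_level("file0", "file1", nu);
        rename("file1", "file0");          /* the new level becomes the next input */
      }
    }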
- suspend/resume in parallel mode (no storedH, v0/v1 only)
e.g. suspend after the next 20 iterations,
mips-cluster.kautz dump.mem2TB/(200MB/s=2Gb/s)=167min=2.8h wtime=28h++
sq48 n1=168e9 *(4+4+6)=2.4TB
s2tri27 ... (ns try both serial +parallel and break slower method?)
recompute l1 if file l1_0000.dat is removed or bad (check last!)
- check all (quasi-parallel) disk operations (ns.l1=OK, rw_v=OK 2012-05)
- hr_restore: set to 0 + recompute missing indices instead of thousands of files?
use maxnzx as static size for simpler code? but 10-20% more SH-memory
- check: libckpt (user-directed checkpointing)
- program memory-efficiently (cache), implement a recursive numscfg with an arbitrary start
and stop point (e.g. early return and continue later or
on another thread, and interrupt after a finite time; sketch below)
example config space (N=5,S=1) ...
chkpt.resume nsymconf() from l1(?) + n1 (save the last testcfg every 2h? chkpt0?)
example: N=40-chain syms: 2 non-commuting l=2-syms
40syms generated by s0,s1,s0*s1,s1*s0,s0*s1*s0,... compact code?
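One way to make the counting restartable, sketched with Gosper's hack (the same (v|(v-1))+1 trick as in the bithacks item above); names and the budget handling are assumptions, is_smallest() stands for the existing representative test (b_smallest):
    #include <stdint.h>
    extern int is_smallest(uint64_t cfg);   /* assumed: existing representative test */
    /* next bit pattern with the same number of set bits (Gosper's hack), v != 0 */
    static uint64_t next_cfg(uint64_t v) {
      uint64_t t = v | (v - 1), nt = ~t;
      return (t + 1) | (((nt & -nt) - 1) >> (__builtin_ctzll(v) + 1));
    }
    /* count representatives in [start,stop), stop early after 'budget' configs;
       returns the cfg to resume from (== stop when the range is finished) */
    uint64_t count_range(uint64_t start, uint64_t stop, uint64_t budget, uint64_t *numscfg) {
      uint64_t cfg = start;
      while (cfg < stop && budget--) {
        if (is_smallest(cfg)) (*numscfg)++;
        cfg = next_cfg(cfg);
      }
      return cfg;   /* checkpoint: store cfg + *numscfg, continue later or on another thread */
    }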
- maxscfg by excluding higher empty subtrees
- skip complete a0-run on bad k_sym !? (avoid long runs) ??? if it's easy to implement
- try http://dmtcp.sourceforge.net/ distributed-mt-checkpoint-userlib
- test triangle48 sym=192=4N n1=ca168e9(l1=7n1=1.1TB=100MB/s*11000s(3h))
split l1 on failure? or per node or 256 threads ...? l1=216GB/4h
md-raid0 for 2TB (fuse?)
fixed partitioning (equal size or equal nodes)
or maxsize l1_0..63 (1TB/8=128GB)
l1_%4d.dat in 200GB/Bsize chunks (links to different FSs, striped?)
- test triangle-s2 N=2*27=54 e= 3 3 0 9 (tU=108bit) nud=54,53
3+51 NoS1SYM n1=15540 SH=0.3m 54.70017464 30m/100It
S1SYM n1=3627 k=-1000 54.70017464
S1SYM n1=25 k=0 54.76837837 0min
see s1_triangle.gpl
- warn on l1 writing on long=32bit systems (split files?)
- ToDo: fermionic sign for b_smallest_lm() (LM/S1SYM) ???
- ToDo: test resume after break during checkpointing (incl. ev)
+ robustness against data errors?
- checkpoint resume after 2nd++ data-set (a0...a0)
ToDo: problem bad get_maxscfg for parallel speedup (47+1 8sym 000- n1=6)
- parastation-mpich send_16MB_from_all_to_task0 causes SEGFAULT on task0
test ulimit=4GB 4tasks lowsym n1=32e6(+16MB=OK),n1=225e6(730M+16MB) 8m/It
test: q.mpiexec -l -m ... ./wrapper.sh: ulimit -v lowmem + nice -19 spin
behavior on memory shortage with mpich,
limit=70MB (4*64M+16M fail L229) ToDo: abort cleanly!?
limit=84MB (4*64M+16M) OK
ToDo: a2 2x2 mpi-version! scaling?!
- computes only one data set a0? if multithreaded
ToDo: ccNUMA tips, clear diskcache to allow local malloc !!!
ToDo: output i100.t during iteration, first at the 10th, then only if changed by 10%
for smaller diffs
ToDo: oprofile (2013-04)
- +dietlib -printf
- struct s_float + cast op() to test accuracy
- aio +
interleaved io (m threads writing/reading to/from one hnz-stream)
pipe_read (+transparent mpi, const. block count/length?)
h_file.c
- replace itime by ftime=gettimeofday.s+us*1e-6 or MPI_T, for better short runs 2013-04 (sketch below)
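Sketch of the proposed ftime() (assumed name):
    #include <sys/time.h>
    /* wall time in seconds as double, microsecond resolution */
    static double ftime(void) {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + tv.tv_usec * 1e-6;
    }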
- add CPUSET for 64...1024 (see memspeed.c)
------------------------------------------------------------------- 2013-04
- Strategy:
recompute matrix H (not storing, because computing is and will be (?)
always cheaper than fast storage, see GPU, multicore, L1-caches)
- checksum H-block-elements to avoid errors (2012)
- SU(2) use ???
see R. Schnalle + J. Schnack, Calculating the energy spectra of magnetic
molecules: application of real- and spin-space symmetries, Apr2010
International Reviews in Physical Chemistry Vol.29, No 2, 403-452
- speedup ns at MIPS NN>64 (remove y%() from vlint.h)
- s=1 CONFIG_S1SYM + PARALLEL buggy, n1 too big, for ud=-1 ???
- virtual test cluster (rid replacement)
- Zi != 0 for square Sz=0 k=-9999
- after release v2.40, make FPGA/GPU ready (b_smallest blockwise) + FPGA
- Ising matrix to Ising + 1st-order excitations?
- storeh2 for mpi (for ns() too)
MC-like method: 10000 random configs divided into 100 blocks
defining the start cfg of each block,
isingerg as the mean value of the neighbouring Ising energies?
like e0+meanexcitations
- remove writing l1 (allows starting 2 jobs within same path)
save as one file on node0?
- rename zahl to lfloat (long float), mzahl to sfloat (short float)
- complete new strategy? build n1 via storeh2 (sorted by isingerg)
and balance number of nonzero elements per rank/thread
b2i probably has to ask more than one other rank?
- remove n2, parallel ns()
- err9200: set nzxmax-overflow-flag and reset hbuf->n
stop later! ??? or better dyn.malloc
or hbuf as compressed part of hbuf[NUM_
- mpi ns() without nfs-transfer
- fix compile-errors on SunOS(isut1), check MRule for kago39
- ns() replace fwrite(l1_xxx.tmp) ??? (tina has NFS problems)
by mpi_sendrecv
1st: send buffer to thread_i (i+1 if mem_i=full)
or simpler send buf_i to thread i%mpi_n (buf_mpi_i)
2nd: balance l1[] send l1[]begin(i+1)..end to
or simpler send buf_mpi_i to thread i (l1_i)
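A deadlock-free exchange for these send steps could look like this sketch (buffer layout, counts, tag and t_cfg are assumptions):
    #include <mpi.h>
    typedef unsigned long long t_cfg;     /* assumed config type, as in the chunk sketch above */
    /* every rank sends buf_for[dst] to dst and receives from src in the same call;
       stepping through all offsets j avoids any send/recv ordering deadlock */
    void exchange_l1(t_cfg **buf_for, int *sendcnt, t_cfg *recvbuf, int maxcnt,
                     int rank, int mpi_n) {
      int j;
      for (j = 1; j < mpi_n; j++) {
        int dst = (rank + j) % mpi_n;
        int src = (rank - j + mpi_n) % mpi_n;
        MPI_Status st;
        MPI_Sendrecv(buf_for[dst], sendcnt[dst], MPI_UNSIGNED_LONG_LONG, dst, 0,
                     recvbuf, maxcnt, MPI_UNSIGNED_LONG_LONG, src, 0,
                     MPI_COMM_WORLD, &st);
        /* append the received scfgs to the local l1[] here */
      }
    }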
- speedup by local malloc? if not change back to v[i+b_ofs[blk]]
and node_ofs[..], b_ofs[]=0..node_len[node]-1;
- autobalancer, redistribute lines among threads after 1st iteration?
!ps -m -o time -p $PPID at the end to calculate the imbalance
- first mpi_n*MPI_calls per hbuf-line, later block a fixed number of lines
split nhm()
- hamilton_nhv(i) -> nhm(i,jnz) -> store hbuf[].cfg,rr
- smallest(hbuf) -> jnz*b_smallest -> store hbuf[].scfg,rr*=(sign,norm2)
- scfg2blk(hbuf) -> [blk].hbuf...
- mpi_n*MPI_Sendrecv([i+j].hbuf to i+j, [i].hbuf from i-j) cfg
scfg2idx -> hbuf[].idx
MPI_Sendrecv([i+j].hbuf to i-j, [i].hbuf from i+j) idx+v0?
for all b_len[0] (send 0 if b_len[i]<b_len[0])
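The buffer element implied by the steps above, as an assumed sketch (field names follow the hbuf[].cfg/.scfg/.rr/.idx notation; no such struct exists in this form yet):
    /* one buffered matrix element on its way through the pipeline */
    typedef struct {
      unsigned long long cfg;   /* nhm():        generated configuration       */
      unsigned long long scfg;  /* b_smallest(): its symmetric representative  */
      unsigned long long idx;   /* scfg2idx():   global index of scfg          */
      double rr;                /* matrix element, rr *= (sign, norm2)         */
      int blk;                  /* scfg2blk():   target block / mpi rank       */
    } t_hbuf;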
- hlines[HLmax]
- problem: H blocks are sparse - xy+flag pointless? x+y but more mem, mpi overhead?
- store XandY ??? no!
- X steps greater than 1 between 2 nz elements
- 80% more memory needed, slower!?
- Y+(nextX-flag) !!!
- only stripes possible (sort mpi-blocks during read)
- creation of H needs more time and a lot more MPI-traffic
- store Y+Y.blk(byte) +20%memory
- [0].xy=num_elements_line_x [0].r=diag [0].by=block
[1..n-1]=nondiagonal elements (xy,by,r)
- by blocks for nodes only? C++ array access via MPI?
- highest to lowest vector_coeff? can int/llong be used for speedup on T1?
- ca. n1 without hash collisions?
- first 10 down-spins via table/recursive search * (last down-spins over the
remaining places)
- next symcfg -> remove the last up-spin and search a new place until smallest
- alloc h_arrays within threads (h_xxxx[B_NUM] not needed),
open/close within threads?
- better speed measurement to find the bottleneck?
- MP scaling, size scaling hnz/s (where meaningful)
- for FPGAs prepared
- smaller code for less bugs
- remove noSBase (k=-1 can be used, add kud=0 or -1, test speed)
- pictures of most probable 40 states! mark flips (@ critical J2=0.6)
show first perturbation terms
- long-term goal: FPGAs + MPI (block b_smallest calls)
- MPI async: MPI_Probe, MPI_Get_count, MPI_Recv, MPI_Isend, MPI_Irecv, MPI_Waitall/any (sketch below)
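Sketch of the probe-then-receive pattern for a message of unknown length (payload type double and the wrapper name are assumptions):
    #include <stdlib.h>
    #include <mpi.h>
    /* receive one message of unknown length from any source */
    void recv_any(double **buf, int *count, int *source) {
      MPI_Status st;
      MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);  /* wait for a message */
      MPI_Get_count(&st, MPI_DOUBLE, count);                        /* its length         */
      *buf = malloc((size_t)*count * sizeof(double));
      *source = st.MPI_SOURCE;
      MPI_Recv(*buf, *count, MPI_DOUBLE, st.MPI_SOURCE, st.MPI_TAG,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }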
- async read n-to-m threads (m<=n), operations in blocks (speedup)
too elaborate? better only in memory, on-the-fly computation, or mmap to disk
- tJ has no .3 symmetry like tU (Ham2), makes sense?
- store j1-j2-t-U as a parameter index or via index (long 40-site j1-j2 computations)
(because of a2 no separate parameter files), better an index?
store index to [factor{0,+1/2,-1/2,-1,+1}, parameterindex]
- n-site terms in H, n>2
- SiSj with sym much too slow, why?
- a4 parallel (world leader? <-> AHonecker)
- test / improve a2 parallel scaling?
- data multi-stream concept (pipeline concept of vector machines) for new HPC?
serially and dynamically coupled programmable units (e.g. CPUs, FPGAs)
mapped onto sequential processors as threads + stream buffers + stop/
start mechanism if a stream buffer is full/empty (wait for data)
not sufficient for v2=H*v1+v2, random access is also needed (stream+RAM)
MIPS per Watt? cost-performance-per-watt (ARM SA-1100 1997 133MHz max250mW)
- nice graphic?: xy array for colored Ising-energy matrix weights
minIsing=0(neel) maxIsing=maxNumBonds=N*Dim
Ising=num-uu-Bonds+dd-Bonds, H1diff=0..2maxNN-2 H2diff=0..2maxNN
+ state Overlap <PisingstatesE1|H|PisingstatesE2> => x_out enhanced
highest/lowest Ising recursively? min..maxE1.E2.E3....
- check also: grep ToDo src/*.[ch]
- simplify/generalize LM bonds in H?
(use a more general (simple) method for local symmetries)
example:
..O O---O O---O O... this is a sample-chain, with 5*N sites
\ / \ / \ / and N vertical symmetries, which are
O O O completely decoupled from each other
/ \ / \ / \ and to the other symetries (very similar to LM),
..O O---O O---O O... a future version should care about this
speedup for local singlets (see 4site_exchange_diamond36.def)
commuting symmetries (German: vertauschende Symmetrien)
- LM also as a symmetry subgroup, e.g. N=5 S=3/2 (15 sites) 2013-04
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
generators:
subgroup0 0 1 2 = sym0=0-1 2, sym1=0 1-2 (sym2=0-2 1=sym0*sym1) l=3
subgroup1 3 4 5 = sym3=3-4 5, sym4=3 4-5 (sym5=3-5 4=sym3*sym4) l=3
...
subgroup4 12 13 14 ...
generators: 5*2(10 to store), subgroupsyms=5*3(15 to store)
oldNoS1sym: 3^5=243 (growing fast, slowdown + may hit CPU-cache size!)
- find solution to avoid 64bit overflow for spinchain N=6 s=7 and bigger
- instead of doing pthread_create/join on every iteration, do it only once
and reuse the threads with a mutex in an MPI-compatible way (sketch below)
- improves sun_top_pcpu (pcpu is reset to 0 after pwd_create)
- improves linux_top logging (new pid creation on p_create)
- could be made MPI-compatible (MY_MPI)
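A sketch of the create-once idea using pthread barriers (thread count and names are assumptions; a mutex/condvar pair would do the same job):
    #include <pthread.h>
    #define NTHREADS 4
    static pthread_barrier_t bstart, bdone;
    static volatile int done = 0;
    static void *worker(void *arg) {
      long id = (long)arg;
      for (;;) {
        pthread_barrier_wait(&bstart);       /* wait until an iteration is released    */
        if (done) break;
        /* do this thread's share of one iteration here (id = thread number) */
        (void)id;
        pthread_barrier_wait(&bdone);        /* signal completion of the iteration     */
      }
      return NULL;
    }
    /* main thread: create workers once, then per iteration just hit the two barriers */
    void run_iterations(int niter) {
      pthread_t th[NTHREADS]; long i; int it;
      pthread_barrier_init(&bstart, NULL, NTHREADS + 1);
      pthread_barrier_init(&bdone, NULL, NTHREADS + 1);
      for (i = 0; i < NTHREADS; i++) pthread_create(&th[i], NULL, worker, (void*)i);
      for (it = 0; it < niter; it++) {
        pthread_barrier_wait(&bstart);       /* release the workers       */
        pthread_barrier_wait(&bdone);        /* wait until all finished   */
      }
      done = 1;
      pthread_barrier_wait(&bstart);         /* let workers see 'done' and exit */
      for (i = 0; i < NTHREADS; i++) pthread_join(th[i], NULL);
    }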
- calculate SiSj for twisted boundary conditions TBC
(using posx,posy, ww[NN*NN] ?)
- translate docs to english (partly done)
- new design via pipes (dataflow) and threads (ex: generate-H-thread
writes elements sorted to 4 pipes for blocks, the system does the caching)
- coding/indexing by Ising energy with all possible bonds at
equal symmetry (advantage: S=1, LM automatically integrated; faster?)
- optionally introduce a C++ type ULLLong with >64 bit for N>64
- switching to C++ would make the program clearer!
at minimum for the data types
- compute expectation values H_t H_J H_U etc. (their sum is <H>),
allows a better interpretation? possibly H_J1, H_J2
- H_J1={ny, array of pointers to {iy,nx, array of {ix, value[y,x]}}}
- store H_J1,H_J2 separately and calculate H = J1*H_J1 + J2*H_J2
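The nested structure above, written out as an assumed C sketch (type names are not existing code):
    /* sketch: one partial Hamiltonian as sparse rows, so H = J1*H_J1 + J2*H_J2
       can be formed on the fly */
    typedef struct { long ix; double val; } t_helem;         /* column index + element value   */
    typedef struct { long iy; long nx; t_helem *x; } t_hrow; /* row index + its nx nonzeros    */
    typedef struct { long ny; t_hrow **y; } t_hpart;         /* ny row pointers of H_J1, H_J2, ... */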
- try starting from the Neel state and count nonzero elements per Lanczos step (a8?)
also check the overlap with preceding Lanczos vectors
- repeat with debug++ if fatal error, reduce debug output
- speed 600MHz lt=2:05 SH=2:15, --- 1s/It --- nu+nd=32+8 n1=482e3
call b_smallest lt=2:07 SH=2:13=133s
call 2*b_smallest lt=2:05 SH=3:50=230s => b_smallest=1:37=97s=72%
call b_getbase lt=2:07 SH=13:45
call 2*b2i lt=2:06 SH=2:29 => +12..14s=10%
nhmline() +1..3s=1%
1*nhm SH=2:21
2*nhm SH=4:32 => +131s=98%
nhm=return SH=0:07s
build lt together with H? sorted by E_Ising? 3bonds=3dimIsing
H*000111=001011+100110 (6)->(6) better uu=dd=0 ud=du=1 (for +-Operators)
H*4.2.0 =2.2.2+2.2.2
H*001011=010011+000111b+001101+101010 (6)->(6,2)
H*2.2.2 =2.2.2+4.2.0b+2.2.2+0.6.0
hash(ising-string)? trees of Isingergs (level=bondtype)
store the config with minimal bit distance to older representatives as the representative?
(kfg1^kfg2 yields <ij>)
n1=555e6 hnz=23e9=41.4*n1
idx -> (IsingRep -> kfg) -> H*kfg -> kfgs -> IsingRep -> idxs
H*kfg, kfgs->IsingRep via FPGAs?
idx <-> IsingRep via a (hash) table?
v fully dynamic, starting with lowestIsing?
1st iteration superfast, 2nd iteration 41x slower? etc.
but the problem of finding the index remains?
numBonds? (topology-index 1dist.2dist.3dist...N/2dist for chain)
0=01010101 2(8.0.8.0) -> 10010101 + 01100101 + 00110101 + ... 8(6.4.4.)
1=10010101 8(6.4.4. ) -> 01010101=0
+ 10100101=1
+ 10001101 + 10010011 + ... 16(4.6.)
2=10001101 16(4.6.) -> 10010101=1
+ 10001011 (4.4.)
3=10001011 16(4.4.) -> 10001101=2
+ 01001011 (6.4.2.)
+ 10000111 (2.4.)
4a=01001011 8(6.4.2)
4b=10000111 8(2.4)
- choose code2
- kill SIGUSR sh parallel => no meaningful value
- LAPACK without EV (as an option), implement zheev for sparse + parallel!?
License? http://www.netlib.org/lapack/faq.html#1.2
- change the name of routines if modified,
- We only ask that proper credit be given to the authors.
complex: zheev (JOBZ='N'|'V', UPLO='U', N, A[LDA,N], LDA>=max[1,N], W[N],..)
wantz = LSAME( jobz,'v'); // test option
lower =
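For comparison, the call through the LAPACKE C interface looks like this (sketch; assumes LAPACKE is available, this is not the sparse/parallel reimplementation asked for above):
    #include <lapacke.h>
    /* dense hermitian diagonalization: eigenvalues to w[n],
       eigenvectors overwrite a (row-major n*n) if jobz=='V' */
    int diag_zheev(lapack_int n, lapack_complex_double *a, double *w, char jobz) {
      return LAPACKE_zheev(LAPACK_ROW_MAJOR, jobz, 'U', n, a, n, w);
    }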
- translate/redesign symmetry.tex
- check and document the Oles term + if need be use only pure Coulomb U(i,j)
6-site U/|i-j|
- Reimar's patch = ok
- check OP/sec, theoretical limits MBps MOps etc. no disk/IO?
pid=...
while ps -p $pid; do
echo -n "$(date +"%j %H:%M:%S") "
# only for OSF -g $pid (for subprocesses gzip)
ps -p $pid -o "pid,pgid,ppid,time,etime,usertime,systime,pcpu,pagein,vsz,rss,inblock,oublock" | tail -1
sleep 30
done
# compare process + system pagein/inblock/oublock if possible
#
plot [700:840] "aab.log" u 0:3 t "cpu/%" w lp,\
"<awk '{print ($9-x)/3.e3; x=$9}' aab.log" u 0:1 t "read/3e3" w lp,\
"<awk '{print ($10-x)/1.e2; x=$10}' aab.log" u 0:1 t "write/1e2" w lp
marvel: full load
gzip -fc1 tmp/htmp001.001 >tmp/htmp001.1.gz 1m32.381s v1.2.4 8MB/s
gzip -fc6 tmp/htmp001.001 >tmp/htmp001.6.gz 3m07.496s
gzip -fc9 tmp/htmp001.001 >tmp/htmp001.9.gz 4m34.404s 4MB/s
bzip2 -fc1 tmp/htmp001.001 >tmp/htmp001.1.bz2 7m43.521s v1.0.1
bzip2 -fc9 tmp/htmp001.001 >tmp/htmp001.9.bz2 12m14.325s
ls -l tmp/htmp001.*
778485760 Jan 17 tmp/htmp001.001
303230293 Jan 19 tmp/htmp001.1.gz 39%
297352257 Jan 19 tmp/htmp001.6.gz 38%
296760506 Jan 19 tmp/htmp001.9.gz 38%
296796144 Jan 19 tmp/htmp001.1.bz2 38%
332110892 Jan 19 tmp/htmp001.9.bz2 42% ?
# decompress to /dev/null
cat tmp/htmp001.001 0m11.624s # 67MB/s
gunzip -c tmp/htmp001.1.gz 0m24.802s Todo: +mem? +ru=100%?
gunzip -c tmp/htmp001.9.gz 0m24.043s # 32MB/s async?
bunzip2 -c tmp/htmp001.1.bz2 1m55.676s
sh prog1 | buffer | sh prog2 # buffered async read?
Performance and efficiency
The efficiency of spinpack-2.19 on a Pentium-M 1.4GHz was estimated using
valgrind-20030725 for the 40-site square-lattice s=1/2 model.
37461525713 Instr./49s = 764M I/s (600MHz)
12647793092 Drefs/49s = 258M rw/s