c Parallel Molecular Dynamics Algorithms
c
c Authored by Steve Plimpton
c   (505) 845-7873, sjplimp@cs.sandia.gov
c   Dept 1421, MS 1111, Sandia National Labs, Albuquerque, NM 87185-1111
c
c Last modified: 30 April 1996
c

This directory contains 3 Lennard-Jones benchmark codes which all
perform the same physics but use different parallelization strategies.
The algorithms used by the 3 codes, the benchmark specification, and
the performance results on various machines are discussed in the paper

  S. J. Plimpton, "Fast Parallel Algorithms for Short-Range Molecular
  Dynamics", J. Computational Physics, vol 117, (March 1995) p 1-19.

The codes and a postscript copy of the paper are available from me via
e-mail or on the Web at

  http://www.cs.sandia.gov/~sjplimp/main.html

Please send me e-mail if you want to be notified of any upgrades or
significant bug-fixes to this software.  Also, feel free to contact me
regarding the codes if you (a) find any bugs, (b) discover some way to
speed them up, (c) port them to a new machine, (d) have specific
questions about how they work, or (e) encounter problems when running
them.  Finally, I'd appreciate knowing if you find them useful or
succeed in modifying them to solve some interesting problem ...

Steve

----------------------------------------------------------------------------

Files in this directory:

(If you didn't download all 3 codes, you won't have all these files.)

lja.f                  F77 source for the atom-decomposition code
ljf.f                  F77 source for the force-decomposition code
ljs.f                  F77 source for the spatial-decomposition code
lj[afs].h              include files for each of the 3 codes
parlib[afs]_machine.f  F77 source for the parallel routines used by each
                       of the 3 codes on each machine
                         unix  = UNIX workstation
                         ncube = nCUBE 2
                         gamma = Intel iPSC/860
                         osf   = Intel Paragon running OSF
                         mos   = Intel Paragon running SUNMOS
                         t3d   = Cray T3D
                         mpi   = portable MPI implementation
Crib_lj[asf]           variable documentation for each of the 3 codes
Crib_parlib            subroutine documentation for the parallel routines
                       used by all the codes
lj.out.example         sample output file
Makefile               Makefile for all 3 codes on all machines
Makefile_lower         low-level generic Makefile used for some targets
README                 this file

----------------------------------------------------------------------------

Making the codes:

(IMPORTANT NOTE for Cray T3D users: Because the T3D only supports
8-byte reals and integers, you MUST make some minor changes to the
source files to turn them into double precision before making a T3D
version.  See the section on double precision below.)

The Makefile is set up to make versions of lja, ljf, and ljs for
various machines.  Type "make" to see a list of supported targets.
Typing "make prefix.suffix", where prefix = a, f, or s (for the
different decomposition types) and suffix is a machine type, will
produce an executable named "ljprefix_suffix", such as ljf_unix or
ljs_osf.

The Makefile works on many Unix platforms, but may be incompatible with
some, since the "make" command is not very standard.  If it is
available on your system, an alternative is to use the GNU make command
"gmake", which is standard.  Failing that, talk to a local make expert,
or contact me and I may be able to help you modify the Makefile for
your machine.

When make is run for a particular target, it will create an Obj_suffix
sub-directory to store machine-specific *.o files.  This is to prevent
*.o files generated for different machines from getting mixed up.

You will need to edit the "Machine-specific Definitions" section of the
Makefile for your environment.
Specifically, the path locations of compilers and the "size" command
will likely need to be specified.  The optimization levels are the ones
I have found to be generally fastest on the various machines.  You may
also need to add libraries specific to your machine to the SYSLIB
variable.  For example, the MPI version will require a native MPI
library to be linked with the code.

I compile my Unix versions for a Sun SPARC machine.  It's possible that
on other Unix workstations (IBM, SGI, HP) you'll get some compiler
glitches.  For example, the timer routine in parlib[afs]_unix.f calls
"etime", which not all Unix system libraries seem to support.

The Unix versions of these 3 codes are single-processor codes that can
be compiled for any processor that runs UNIX - e.g. workstations, Cray
Y-MP, Cray C90, etc.  However, these codes are not optimized for the
vector and RISC processing those machines support.  These are not the
codes that produced the Cray Y-MP and C90 timings discussed in the JCP
paper.  That is a single-processor code that implements the same
physics as these 3 codes but uses different data and loop structures to
take advantage of vector processing.

When making Cray T3D versions, note that the parlib*_t3d.f files have
routines which make PVM calls.  They "include" a specific PVM header
file which may be different on your system.

The Makefile is currently set up to create an MPI version for the Intel
Paragon running SUNMOS.  To make an MPI version for some other
platform, you will need to edit the mpi_DEFS settings in the Makefile.
You may also need to edit the parlib*_mpi.f files to use the correct
include files and timer routines for your MPI machine.

----------------------------------------------------------------------------

Running the codes:

All 3 of the codes read the file lj.in, which has several options like
atom density, cutoff distances, and neighbor-list frequency.  You can
modify these as you wish; the current settings are for the benchmark
problem described in the paper mentioned above - a liquid LJ state
point.  Note that the temperature in the file is really the initial
kinetic energy; since the system starts from an fcc lattice (a
potential-energy minimum), roughly half of that kinetic energy is
converted to potential energy as the crystal disorders, so the system
equilibrates to about 1/2 that temperature within a hundred or so
steps.

The codes also prompt for several inputs which, for benchmarking
purposes, I find more convenient to set on the fly.  Depending on the
code and your choices, you won't see all of the prompts in any one run.

(1) Atoms: nx,ny,nz

    Number of cubic unit cells in each dimension.  There are 4
    atoms/unit cell in the initial fcc lattice, so specifying "5 5 5"
    is a run with 500 atoms.

(2) # of Timesteps

(3) Neighboring: (0) N^2 (1) Binned

    Create neighbor lists via an all-pairs search (N^2 option) or by
    binning the atoms first.  Which is faster depends on the problem
    size, number of processors, and choice of parallel algorithm.

(4) Specify binning: (0) No (1) Yes

    The program will choose the bin size for you (option 0) according
    to the neighbor-list cutoff, or you can override it (option 1).

(5) Bins in each direction: x,y,z

    You choose how many neighbor bins to have in each dimension of the
    global domain.  The reason I put this option in is that for the
    benchmark problem, if the unit cells are specified as a multiple of
    5, the number of neighbor bins (with a cutoff of 2.8 sigma) comes
    out just under a multiple of 3.
    For example, if you run a 10x10x10 problem of 4000 atoms, the
    program wants to put 5.999 neighbor bins in each dimension, which
    gets rounded down to 5 and runs a little slower.  So for the
    benchmarks I specify 6x6x6 bins in this case.  (The arithmetic
    behind the 5.999 is sketched at the end of this section.)

(6) Newton's 3rd law: (0) No (1) Yes (2) Hybrid

    The lja and ljf codes can save computation by taking advantage of
    Newton's 3rd law, at the expense of extra communication.  Which is
    faster depends on the problem size, number of processors, etc.  The
    lja code also has a hybrid option which can sometimes be faster -
    see the documentation at the beginning of the neighbor4 routine in
    lja.f.

(7) Transpose: (0) No (1) Yes

    The force-decomposition code ljf has an older option in it which
    requires transposing a portion of the force matrix (see the paper).
    It is always faster not to do this, but the option is there for
    historical purposes.

(8) Specify processor grid: (0) No (1) Yes

    The ljf and ljs codes allow you to specify how to map processors to
    the force matrix (ljf) or the 3-d physical domain (ljs).  This is
    required if you are running on a non-power-of-2 number of
    processors, and can be useful with ljs if your global physical
    domain is not cubical.

(9) Processor grid: numrow,numcol

    If you choose to specify the processor grid in the ljf code, you
    tell the code the "row by column" array of processors to map to the
    force matrix (see the paper).  A nearly square array is faster.
    You can also use this to mask out processors.  For example, if you
    run on 105 processors, you can specify a 10x10 grid and 5
    processors won't be used.

(10) Processor grid: x,y,z

    If you choose to specify the processor grid in the ljs code, you
    tell the code the "x by y by z" array of processors to map to the
    physical domain (see the paper).  The code runs fastest when the
    processor sub-domains are as cubical as possible.  So if you are
    running a 20x4x20 problem (6400 atoms) on 32 processors, you might
    want to choose a 4x2x4 processor grid instead of the 2x4x4 grid the
    code will give you by default.

After the prompts, the code will do some setup, then start the timestep
dynamics.  You won't get any other output until the timestep loop
finishes.

If you get an error message about "boosting" something, it means your
arrays are not allocated large enough.  You need to modify a parameter
statement in the lj*.h file and recompile.  Some of these errors are
detected at setup; others (like neighbor-list overflow) may not be
detected until the middle of a run.  When the latter happens, you
should get an error message to your screen, and the program will either
gracefully stop (if all processors incurred the same error) or hang.

I've tried to be pretty careful about detecting memory-overflow kinds
of errors.  If the codes ever crash or hang without spitting out an
error message first due to algorithmic/parallelism problems (as opposed
to physics problems, like too big a timestep or putting 2 atoms on top
of each other), it's probably a bug, so let me know about it.

When the run finishes, you get some timing information on the screen
and fuller output to the file lj.out, which includes thermodynamic info
and memory-usage statistics.  The last column of thermodynamic output
is an energy-conservation monitor which should be close to 1.0 (the
ratio of current total energy to initial total energy).  Actually,
because of the leapfrog convention of velocities being on the
half-step, the initial energy is an estimate, so energy conservation is
generally better between 2 intermediate timesteps than versus the
initial time=0 estimate.
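The leapfrog convention mentioned above can be summarized with a
minimal sketch.  This is my own illustration, not code lifted from the
lj*.f files; the routine name, array layout, and unit mass are
assumptions.

      subroutine leapfr(n,dt,x,v,f)
c Minimal leapfrog sketch: velocities live on the half-steps
c (t - dt/2 and t + dt/2) while positions live on the full steps,
c so the kinetic energy at time 0 can only be estimated from a
c half-step velocity.  That is why the energy-conservation column
c is best judged between two intermediate timesteps.
c Assumes unit mass, as in reduced LJ units.
      integer n, i, k
      real dt
      real x(3,n), v(3,n), f(3,n)
      do 20 i = 1,n
        do 10 k = 1,3
          v(k,i) = v(k,i) + dt*f(k,i)
          x(k,i) = x(k,i) + dt*v(k,i)
 10     continue
 20   continue
      return
      end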
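Going back to prompt (5) above, here is a small sketch of the
bin-count arithmetic behind the 5.999 figure.  It assumes the
liquid-state reduced density of 0.8442 used for the benchmark in the
paper; the variable names are illustrative and the exact rounding rule
in the codes may differ in detail.

      program bins
c Why a 10x10x10-cell benchmark wants 5.999 bins per dimension:
c with 4 atoms per fcc cell at reduced density rho, the lattice
c constant is (4/rho)**(1/3) ~ 1.68 sigma, so the box is ~16.8
c sigma across, and 16.8/2.8 = 5.999, which truncates to 5.
      real rho, cut, alat, boxlen
      integer ncell, nbin
      rho = 0.8442
      cut = 2.8
      ncell = 10
      alat = (4.0/rho)**(1.0/3.0)
      boxlen = ncell*alat
      nbin = int(boxlen/cut)
      write (6,*) 'bins per dimension:', boxlen/cut, ' -> ', nbin
      end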
----------------------------------------------------------------------------

Understanding the codes:

Unfortunately, I haven't had the time or inclination to comment the
codes in much detail.  And the codes are pretty streamlined, as
benchmarks tend to be.  Hopefully, you will find the Crib files useful
and my coding style not too opaque.  The J Comp Phys paper is the best
high-level overview of what each of the 3 codes is doing as far as
parallelism goes.

On the bright side, if you can't understand something or are trying to
modify some section, give me a call or send e-mail about it and I'll be
happy to explain (or obfuscate) things more fully.

----------------------------------------------------------------------------

Double precision:

All of the codes are written in single precision (32-bit reals).
These are the changes to the *.h and *.f files you need to make to
convert them to double precision (64-bit).

*** changes in lj[afs].h

  (1) implicit real*4 -> implicit real*8
  (2) real*4 -> real*8 in all variable declarations
  (3) nbyte=4 -> nbyte=8 in parameter statements
  (4) ibyte=4 -> ibyte=8 (make this change ONLY for the T3D version)

*** changes in lj[afs].f

  (1) real*4 function -> real*8 function

*** changes in parlib[afs]_*.f (already done in parlib[afs]_t3d.f)

  (1) real*4 -> real*8 in all local variable declarations
  (2) nbyte=4 -> nbyte=8 in parameter statements

Additional optimization note for the Cray T3D version:

Last time I checked, you get about a 12% speed boost if you replace the
lines

      sr2 = sigsq/rsq
      sr6 = sr2*sr2*sr2
      tmp = sr6*(sr6-0.5)/rsq

with

      recip = 1.0/rsq
      sr2 = sigsq*recip
      sr6 = sr2*sr2*sr2
      tmp = sr6*(sr6-0.5)*recip

in the force routines in the lj[afs].f files.  This is because, in the
former case, the Cray compiler doesn't store a temporary and performs
2 divides.
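For readers trying to place those lines, here is a hedged sketch of a
complete Lennard-Jones pair-force update written in the same style.
The subroutine, its argument names, and the explicit 48*eps factor are
my own illustration; the benchmark codes may fold the constants in
differently.

      subroutine ljpair(delx,dely,delz,rsq,sigsq,eps,fx,fy,fz)
c Illustrative LJ pair force (not lifted from lj[afs].f).  For the
c potential 4*eps*((sig/r)**12 - (sig/r)**6), the force along x is
c   delx * 48*eps*sr6*(sr6-0.5)/rsq   with sr6 = (sigsq/rsq)**3,
c which is where the sr2/sr6/tmp lines quoted above come from.
c Written in the single-reciprocal form recommended for the T3D.
      real delx,dely,delz,rsq,sigsq,eps,fx,fy,fz
      real recip,sr2,sr6,tmp
      recip = 1.0/rsq
      sr2 = sigsq*recip
      sr6 = sr2*sr2*sr2
      tmp = 48.0*eps*sr6*(sr6-0.5)*recip
      fx = fx + delx*tmp
      fy = fy + dely*tmp
      fz = fz + delz*tmp
      return
      end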