This post is an update to my previous post on building Abinit with OpenMPI in Ubuntu. It provides a workaround (solution?) to an issue that is benign to the run itself but thoroughly aggravating when starting calculations in the abinip parallel build.
The description of the procedure, and the problem in the OpenMPI 1.3.x build, is as taken from the previous page (repeated so that the error makes its way and embeds itself a little deeper into the search engines).
To run parallel Abinit on a multi-processor box (that is, SMP; the actual multi-node cluster setup is in progress), the command is SUPPOSED to be as follows:
mpirun -np N /opt/etsf/abinit/5.6/bin/abinip < input.file >& output
Where N is the number of processors. For mpirun, you need to specify the full path to the executable (for the build above, this is where Abinit installs abinip when the build occurs in the /opt directory). The input.file specification is as per the Abinit users manual, so I won't go into it here. You will also be asked to supply your password because I've done nothing to the setup of ssh (you are, in effect, logging into your machine to run the MPI calculation).
Now, when the above is run, this is the error that I get:
abinit : nproc,me= 4 0
Give name for formatted input file:
At line 127 of file iofn1.F90 (unit = 5, file = 'stdin')
Fortran runtime error: End of file
abinit : nproc,me= 4 1
abinit : nproc,me= 4 2
abinit : nproc,me= 4 3
mpirun has exited due to process rank 0 with PID 7131 on
node terahertz-desktop exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
What is supposed to happen is that the input.file file lists the files that Abinit requires to perform the run and provides these files by name one-at-a-time as Abinit requests them upon start-up. For some reason, the input.file file is not being read properly or is not being read at all before the job crashes. Oddities noted in the above order of the output include
(1) the abinit : nproc,me= values are not grouped above the "Give name for formatted input file:" <- Abinit does not appear to be trying to read the text from the nproc,me lines as actual input data, as you have to provide all of the files before Abinit will crash with a wrong file name.
(2) At line 127 of file iofn1.F90 <- this is an Abinit file that is responsible for reading the contents of input.files. So, is the problem with this file in Abinit? Well…
(3) The serial build of Abinit (abinis) runs just fine with input.file <- which leads me to conclude that the problem is mpirun-related. I hope to resolve this (I'm sure it will be trivial) and post my error accordingly.
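The start-up hand-off described above (one prompt per file, one line of stdin per answer) can be mimicked with a small stand-in. This is only a sketch of the mechanism, not Abinit code: the function name ask_for_files and its prompt text are made up for illustration. It shows why an empty or unforwarded stdin produces an immediate end-of-file at the first prompt.

```shell
#!/bin/sh
# Hypothetical stand-in for Abinit's start-up prompts (not Abinit code):
# ask for N file names, reading one line from stdin per prompt.
ask_for_files() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        echo "Give name for file $i:"
        if ! IFS= read -r name; then
            # stdin ran dry before all names were supplied
            echo "End of file on stdin at prompt $i" >&2
            return 1
        fi
        echo "  got: $name"
        i=$((i + 1))
    done
}
```

Running ask_for_files 2 < input.file answers both prompts from the file, while ask_for_files 2 < /dev/null fails at the first prompt, which resembles what rank 0 reports above if mpirun never passes the redirected file along.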
What's the work-around? Simple. Copy the contents of the input.file file (literally Ctrl+C with the text selected) and paste it after running this command:
mpirun -np N /opt/etsf/abinit/5.6/bin/abinip
Abinit will ask for the files in order AND your Ctrl+C includes the carriage returns at the end of each line, so you are effectively feeding Abinit the same content it would read from the input.file file if, in fact, it were capable of reading the input.file file.
After considerable searching for NOT the error I was having with Abinit, I discovered the following thread [Post 1, Post 2, Post 3, Post 4, Post 5, Post 6] in the OpenMPI Users Mailing List (see? Once these lists get populated with enough content, you're bound to find just about everything). This doesn't directly address the problem (the problem is related but different, the origin of the problem is the same, and googling "OpenMPI" and "stdin" was what brought it to my attention).
The solution to the problem above is to build Open-MPI 1.2.x instead of Open-MPI 1.3.x.
NOTE: If, after building Open-MPI 1.2.x, you receive the following error the first time you run mpirun:
mpirun: error while loading shared libraries: libopen-rte.so.0: cannot open shared object file: No such file or directory
Simply run ldconfig (as root, e.g. with sudo) to make the proper links to libopen-rte and associated libraries. From the man page…
ldconfig creates the necessary links and cache to the most recent shared libraries found in the directories specified on the command line, in the file /etc/ld.so.conf, and in the trusted directories (/lib and /usr/lib). The cache is used by the run-time linker, ld.so or ld-linux.so. ldconfig checks the header and filenames of the libraries it encounters when determining which versions should have their links updated.
The stdin problem, I think, may remain throughout the 1.3.x build series. It's not a great piece of deduction: the fix is reported in the trunk for 1.4.1, although I have no idea how quickly things happen in OpenMPI development. Not being big on alpha/beta testing, and wanting to get Abinit running more than wanting a final answer to this problem, I did not try installing the trunk build and testing it.
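Since the behaviour seems tied to the 1.3.x series, a quick guard in a job script can flag the version before a run is attempted. The sketch below is an assumption-laden helper, not anything shipped with Open MPI: the function names are hypothetical, and version_from_banner assumes the first line of the version banner ends with the version number (something like "mpirun (Open MPI) 1.2.8"), which you should confirm on your own install.

```shell
#!/bin/sh
# Return 0 if the given Open MPI version string is in the 1.3.x series,
# which this post found affected by the stdin problem; 1 otherwise.
is_affected_openmpi() {
    case "$1" in
        1.3|1.3.*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Print the last whitespace-separated field of the first line of a banner.
# Assumes a banner shaped like "mpirun (Open MPI) 1.2.8" (an assumption).
version_from_banner() {
    printf '%s\n' "$1" | awk 'NR == 1 { print $NF }'
}
```

One possible use: v=$(version_from_banner "$(mpirun --version 2>&1)") followed by is_affected_openmpi "$v" && echo "stdin redirection may fail with this build".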
Long story short, Abinit 5.6.5 and OpenMPI 1.2.x work just fine at the READ *.files step and, most importantly, Abinit runs can now be properly scripted to run without my having to be by the machine to copy+paste the contents of the *.files file.
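With stdin redirection working, unattended runs come down to a loop. What follows is a minimal sketch rather than a tested production script: run_job and run_all are hypothetical names, the default of 4 processes and the job.files -> job.log naming are my assumptions, and the abinip path is the one from the build above.

```shell
#!/bin/sh
# Assumed defaults; override via the environment if your setup differs.
MPIRUN=${MPIRUN:-mpirun}
ABINIP=${ABINIP:-/opt/etsf/abinit/5.6/bin/abinip}
NP=${NP:-4}

# Run one job: feed the files file on stdin and capture stdout+stderr in a
# log named after it (job.files -> job.log). Returns mpirun's exit status.
run_job() {
    files=$1
    log=${files%.files}.log
    "$MPIRUN" -np "$NP" "$ABINIP" < "$files" > "$log" 2>&1
}

# Run every *.files job in a directory, one after another.
run_all() {
    for f in "$1"/*.files; do
        [ -e "$f" ] || continue   # no matches: the glob stays literal
        if run_job "$f"; then
            echo "ok: $f"
        else
            echo "FAILED: $f"
        fi
    done
}
```

Calling run_all /path/to/jobs works through each *.files file in turn. The jobs run sequentially on purpose, since each mpirun call already uses NP processors on the one SMP box.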