Note: 64-bit mode under Linux adds further issues.
Note added 2/24/2009: Aparna Vemuri of EPRI reports troubles with recent gcc compiler systems and Portland Group pgf90: the gcc Fortran name-mangling system has changed, requiring a change in compile flags. For mixed pgf90/gcc builds, one can either remove the -Msecond_underscore flag from FOPTFLAGS in the Makeinclude.Linux2_x86_64pg_gcc* files, or else change the line CC = pgcc to CC = gcc in the Makeinclude.Linux2_x86_64pg_pgcc* files. These modifications have been made to the 2/24/2009 release of the Makeinclude.Linux2_x86_64pg_gcc* files, with the older flags commented out, for use by those who need them.
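The edits amount to something like the following sketch (the surrounding flag values here are illustrative only, not the actual distributed Makeinclude contents):

    #  In Makeinclude.Linux2_x86_64pg_gcc*:  drop -Msecond_underscore:
    FOPTFLAGS = -O2 -Mextend    #  was:  -O2 -Mextend -Msecond_underscore

    #  ...or else, in Makeinclude.Linux2_x86_64pg_pgcc*:
    CC = gcc                    #  was:  CC = pgcc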
In particular, the Gnu Fortrans (g77 and g95) have different name-mangling behavior than the Portland Group pgf90 default. Vendor-supplied netCDF libraries libnetcdf.a always use the Gnu Fortran conventions, and as such are incompatible with the default compilation flags for SMOKE or CMAQ.
For the Linux/Portland Group/SMOKE or CMAQ combination, you have two choices: use the vendor-supplied libnetcdf.a and the default I/O API build, but fix the SMOKE or CMAQ compile flags, using ioapi/Makeinclude.Linux2_x86pg_gcc* as your guide; or build libnetcdf.a from scratch for yourself, using compile flags compatible with your SMOKE or CMAQ build; build the I/O API using ioapi/Makeinclude.Linux2_x86pg_pgcc*; and use these libraries.
This Portland Group inconsistency is exactly why the I/O API is supplied with multiple Makeinclude.Linux2_x86pg* files in the first place... Note that the I/O API supplies a script nm_test.csh and a make target make nametest to help you identify such problems.
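For example (hypothetical object-file, library, and symbol names):

    cd ioapi                    #  the I/O API source directory
    make nametest
    nm_test.csh init3.o libnetcdf.a ncopn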
Internal compiler errors have shown up with gcc/g77 on some Linux distributions for x86_64, particularly with Fedora Core 3 and Red Hat Enterprise Linux Version 3 for x86_64; the symptom is a sequence of messages such as the following:

    /work/IOAPI/ioapi/currec.f:93: error: unable to find a register to spill in class `AREG'
    /work/IOAPI/ioapi/currec.f:93: error: this is the insn:
    (insn:HI 145 171 170 8 (parallel [
            (set (reg:SI 3 bx [95])
                (div:SI (reg/v:SI 43 r14 [orig:67 secs ] [67])
                    (reg/v:SI 2 cx [orig:68 step ] [68])))
            (set (reg:SI 1 dx [96])
                (mod:SI (reg/v:SI 43 r14 [orig:67 secs ] [67])
                    (reg/v:SI 2 cx [orig:68 step ] [68])))
            (clobber (reg:CC 17 flags))
        ]) 264 {*divmodsi4_cltd}
        (insn_list:REG_DEP_ANTI 92 (insn_list:REG_DEP_OUTPUT 91 (insn_list 140 (insn_list 84 (insn_list:REG_DEP_ANTI 139 (nil))))))
        (expr_list:REG_DEAD (reg/v:SI 43 r14 [orig:67 secs ] [67])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (expr_list:REG_UNUSED (reg:SI 1 dx [96])
                    (nil)))))
    ...confused by earlier errors, bailing out

A workaround is to weaken the architecture/optimization flags for binary type Linux2_x86_64 to get around this compiler bug -- eliminating the -fschedule-insns and -march=opteron optimization flags from Makeinclude.Linux2_x86_64 will tend to get rid of the problem. Note that this same compiler bug will bite you when trying to build lots of other stuff (TCL/TK, plplot, NCAR graphics) on FC3/gcc/g77 systems, and the same fix seems to work for many other problems as well.
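The change amounts to something like this sketch (the actual flag set in the distributed Makeinclude.Linux2_x86_64 may differ):

    #  in Makeinclude.Linux2_x86_64 -- was (illustrative):
    #  OPTFLAGS = -O3 -march=opteron -fschedule-insns
    OPTFLAGS = -O3              #  weakened:  bug-triggering flags removed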
"relocation error" link issues for x86_64 Linux
If COMMON blocks exceed 2GB on the x86_64 platforms, Intel ifort and icc will give you failures, with messages about relocation errors at link-time. The problem is that the default "memory model" doesn't support huge arrays and huge code-sets properly. The "medium" memory model supports huge arrays, and the "large" memory model supports both huge arrays and huge code-sets. To get around this, you will need to add

    -mcmodel=medium -shared-intel

to your compile and link flags (for the medium model), and then recompile everything including libioapi.a and libnetcdf.a using these flags. Note that this generates a new binary type that should not be mixed with the default-model binaries. There is a new binary type BIN=Linux2_x86_64ifort_medium for these builds, and a sample Makeinclude file to demonstrate these flags: Makeinclude.Linux2_x86_64ifort_medium
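A sketch of the relevant fragment (variable layout illustrative; consult the distributed Makeinclude.Linux2_x86_64ifort_medium for the authoritative version):

    MFLAGS    = -mcmodel=medium -shared-intel   #  medium memory model
    FOPTFLAGS = -O3 $(MFLAGS)                   #  Fortran compiles
    COPTFLAGS = -O2 $(MFLAGS)                   #  C compiles
    LDFLAGS   = $(MFLAGS)                       #  ...and the link step, too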
Other compilers and other non-Linux x86_64 platforms will have similar problems, but the solutions are compiler specific.
SGI F90 compiler-flag problems: It seems that
SGI version 7.4 and later Fortran compilers demand a different set
of TARG
flags than do 7.3.x and before. For example,
for an Origin 3800, where hinv reports

    24 400 MHZ IP35 Processors
    CPU: MIPS R12000 Processor Chip Revision: 3.5
    ...
one would use the following sets of ARCHFLAGS
compiler
flags in Makeinclude.${BIN}
with the different
Fortran-90 compiler versions:
    -TARG:platform=ip35,processor=r12000          for 7.3.x and before
    -TARG:platform=ip35 -TARG:processor=r12000    for 7.4 and later
There are a number of problems with both the I/O API and netCDF with the newer (version 7.4) SGI compilers. The version 7.4 compilers do not build netCDF correctly: the libnetcdf.a that was built will also lead to program crashes. They also fail to link BLOCK DATA subprograms from libraries. Added 12/18/2003: SGI claims to have fixed this in the latest patch for F90 version 7.4.1 (bug # 895393); I haven't had time to test it yet, though. -- CJC
For the upcoming I/O API Version 3, we have put into place a workaround-hack that puts a conditionally-compiled non-Fortran-conforming SGI-only CALL INITBLK3 at the start of subroutine INIT3.
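The shape of that hack is roughly as follows (an illustrative sketch only, not the actual ioapi/init3.F source; the preprocessor guard shown is an assumption):

C       Sketch:  "CALL"ing a BLOCK DATA subprogram violates the Fortran
C       standard, but forces the SGI 7.4 linker to load INITBLK3, so
C       that the STATE3 data structures get initialized.  Elsewhere,
C       the standard EXTERNAL reference suffices.

      LOGICAL FUNCTION INIT3()
      IMPLICIT NONE
      EXTERNAL INITBLK3         !  standard force-load request
#ifdef __sgi
      CALL INITBLK3             !  non-conforming SGI-only hack
#endif
C     ...the normal INIT3 initialization goes here...
      INIT3 = .TRUE.
      END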
The IRIX 7.4 f90 compiler also
thoroughly mangles the buffering of log-output in ways
that we have not yet managed to decipher completely,
much less repair. The outcome is that log output will
show up in scrambled order. (Note that
industry-standard mapping of WRITE(*,...)
onto unbuffered UNIX standard output still happens with
version 7.3 and must be preserved, but fails with
version 7.4.)
All the netCDF "magic numbers" are defined in the I/O API NETCDF.EXT file (which is the I/O API name for the file netCDF calls src/fortran/netcdf.inc). Errors defined
in netCDF 2.x have positive values in the range 1...32
(except for NCSYSERR
which is -1); errors
newly defined for netCDF 3.x are in the range -60...-1.
General methodology: find the error-number and then try
to figure out what's wrong from the name of the
corresponding PARAMETER
.
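For example, if a program reports netCDF error -51, one might look it up this way (the paths here are illustrative; adjust to wherever your copies live):

    #  find the PARAMETER whose value is -51:
    grep -- '-51' /path/to/ioapi/NETCDF.EXT
    grep -- '-51' /path/to/netcdf/src/fortran/netcdf.inc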
Note that UCAR re-defined some of these errors between
versions 3.3.1 and 3.4 of netCDF (while leaving the
various library versions link-compatible), so you may
have to look at the src/fortran/netcdf.inc
for the version of the netCDF libnetcdf.a
you are linking with, if this is different from the
version used to build your libioapi.a.
Martin Otte, US EPA, reports that there are similar errors encountered with netCDF Version 4, due to more stringent interpretation of flags for opening or creating files. This is fixed in the Oct. 28 I/O API distribution.
Error -1: This is NCSYSERR, meaning the system wouldn't give you permission for what you wanted to do. Most probably it means you need to check permissions on either the file you're trying to create or access, or on the directories in its directory path.
"Not a netcdf id", which can happen both if the
file honestly isn't a netCDF file, and also if it
is a netCDF file, but wasn't shut correctly. (unless you've
declared a file "volatile" by
setenv <file> <path> -v
,
netCDF doesn't update the file header until you call
SHUT3()
or M3EXIT()
.)
"Invalid Argument", but almost certainly this means you're using netCDF library 2.x with an I/O API library built for netCDF version 3.x (NCAR accidentally changed one of the "magic numbers" used in opening files when they upgraded netCDF from 2.x to 3.x).
Error -31: This is a variant of the system permission problem. A directory spec with an extra nonexistent component, e.g., /foo/bar/qux/zorp when you really mean /foo/bar/zorp and the /foo/bar/qux doesn't exist, seems to cause Error -31. It can also happen when you try to open too many netCDF files simultaneously (although the I/O API has additional traps around this). Or on a Cray vector machine, this may mean you're running up against your memory limit. (On Crays, netCDF v3.x dynamically allocates a fairly large buffer to optimize I/O for each file; this allocation may well push you over your (interactive or queue) memory limit. For netCDF v3.4, there are tricks you can play with environment variables to manipulate these buffer sizes.) This error also has turned up with some of the more obscure file-permission problems.
This probably means you tried to read data past the last date-and-time on the file (the I/O API runs netCDF in "verbose mode", so that netCDF will always print all error messages, including this one). It can also happen when the calling program is running in parallel, but a non-MP-enabled version of the I/O API library was linked in.
Among the other causes that have turned up for netCDF errors here:

  * the file already exists when you attempt to create it (OPEN3() with status argument FSNEW3);
  * bad values for FDESC3 header-components;
  * netCDF's maxncattrs exceeded (would indicate a bug in I/O API internals);
  * VTYPE3D(<variable>) in FDESC3 is not one of M3INT, M3REAL, or M3DBLE;
  * bad values for NCOLS3D, NROWS3D, NLAYS3D, or NTHIK3D (else would indicate a bug in I/O API internals, or that you have modified PARMS3.EXT inappropriately for the target machine);
  * the file wasn't shut correctly (by SHUT3() or M3EXIT());
  * system errors, as described in /usr/include/sys/errno.h.
This is only relevant for users at NCEP, where local politics forbids any copy of libnetcdf.a on their systems. libnotcdf.a is a library that satisfies linker references to libnetcdf.a with "stub" routines that merely report that the user is trying to use NCEP-forbidden netCDF file mode instead of NCEP-required native-binary file mode.
Analysis due to Robert Elleman, Dept of Atmospheric
Sciences, University of Washington: When programs
are compiled with the Portland compilers, without
the -mp
flag (as is the default
for mcip) but the I/O API is compiled
with this flag (as is the I/O API default),
the program will hang (i.e., appear to freeze, consuming
all available computational resources but making no
evident progress).
Solution: either use the -mp compile flag for all compiles -- both program and library -- or use it for neither.
General principle: Make sure the program
compile-flags and the I/O API compile-flags (and
the netCDF compile-flags!) are consistent!
"Why can't I open and use a LOGFILE on my SGI?"
There is a problem with SGI f90
Version 7.4 and initialization of COMMON
blocks. The Fortran language standard specifies that
COMMON
blocks must be initialized by
BLOCK DATA
subprograms, but (since the
actual operations of compiling and linking are not
covered by the language standard, which considers them
"implementation details") does not specify
just how to ensure that the BLOCK DATA
subprogram is linked in with the rest of the executable.
Usual and customary industry practice is that the use
of a statement
EXTERNAL FOOBLOCK
in either the main program or in other subroutines that
are called should ensure that BLOCK DATA FOOBLOCK
is linked into the final executable. This does not
happen with SGI f90 Version 7.4,
even in very simple test cases.
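A minimal test case of the sort that demonstrates the failure (hypothetical names FOOBLOCK and /FOOCOM/; compile fooblock.f into a library, then link testbd.f against that library):

C   File testbd.f:  the EXTERNAL statement is the usual and customary
C   request that BLOCK DATA FOOBLOCK be linked in.  Where the linker
C   honors it, the program prints 42;  under SGI f90 7.4, with
C   fooblock.o in a library, FOO comes out uninitialized instead.

      PROGRAM TESTBD
      IMPLICIT NONE
      INTEGER          FOO
      COMMON /FOOCOM/  FOO
      EXTERNAL         FOOBLOCK
      WRITE( *,* ) 'FOO (should be 42):', FOO
      END

C   File fooblock.f:

      BLOCK DATA FOOBLOCK
      INTEGER          FOO
      COMMON /FOOCOM/  FOO
      DATA             FOO / 42 /
      END

Build and run, e.g., with

    f90 -c fooblock.f
    ar rv libfoo.a fooblock.o
    f90 -o testbd testbd.f -L. -lfoo
    ./testbd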
Note that BLOCK DATA INITBLK3
is needed to
initialize I/O API internal data structures,
including the unit number for LOGFILE
and
the number of I/O API files currently open;
fortuitously, the latter seems to be initialized to zero
(which is correct); the former is not initialized
correctly, leading to failures to open and use a
LOGFILE
when you try to specify one.
Note that this error does not seem to happen with
SGI f90 Version 7.3 or earlier.
I have submitted this problem to SGI in an error report. Their reply is to suggest the use of the non-standard CALL INITBLK3, which would need to be done by every internal I/O API routine that references the STATE3 internal data structures.
--CJC
"Why does the linker complain about __mp_getlock, __mp_unlock, or something else with _mp in it?"
This probably means that you are using a version of the
libioapi.a
that is enabled for
OpenMP parallel
usage, but have not activated the system parallel
libraries. On SGI, this means that you need to add a
directive -lmp
at the end of the library
list in the final compile command that links the
executable (and similarly for other compilers).
See the variable OMPLIBS
defined in
your machine/compiler's Makeinclude.${BIN}.
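For example, on SGI the final link command would look something like this (illustrative program and object names):

    f90 -o mymodel mymodel.o ... -lioapi -lnetcdf -lmp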
fort.<nn>
files come from?"
On some systems (notably Sun and SGI), there are
incompatibilities in run-time libraries between
f77
and f90
.
The upshot is that on these systems, you can link
together Fortran-77 and C using f77
,
or Fortran-90 and C using f90
, but
you can't link together Fortran-77 and Fortran-90.
The default I/O API distribution is built using
f77
and runs into this problem when your
model code is built using f90
. The
solution is to rebuild the I/O API using
f90
.
RH7 uses quite-nonstandard gcc v2.96 and glibc versions; there are patches available at URL http://www.redhat.com/support/errata/rh7-errata-bugfixes.html
RH7's gcc v2.96 does not work with the standard edition of the Portland Group F90 compiler; there is a version which does work; see URL http://www.pgroup.com/faq.htm: (UPDATE on: RED HAT 7.0 and 3.2 RELEASE COMPILERS!)
"My program gets a segmentation fault on the OPEN3 call when I attempt to create a new file!"
Probably the file description was not completely filled
in. This has been observed, for example, when one of
the variable names VNAME3D(I)
in
FDESC3.EXT
was not set correctly. (What
actually happens is that the FDESC3.EXT
data
structures are initialized to zero by the linker; then
the netCDF internals don't handle strings that contain
just ASCII zeros correctly).
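Here is a minimal sketch of the sort of setup required before creating a file (illustrative grid parameters and names, not a drop-in program; the key point is that every FDESC3 field OPEN3() uses, including all NVARS3D entries of VNAME3D, UNITS3D, VDESC3D, and VTYPE3D, must be filled in):

C   Sketch only:  fill the FDESC3 COMMONs completely, then call
C   OPEN3() to create the file.  The logical name MYFILE must be
C   setenv'ed to the path of the file to be created.

      PROGRAM MKFILE
      IMPLICIT NONE
      INCLUDE 'PARMS3.EXT'      !  I/O API parameters
      INCLUDE 'FDESC3.EXT'      !  file-description COMMONs
      INCLUDE 'IODECL3.EXT'     !  I/O API function declarations
      INTEGER  I

      FTYPE3D = GRDDED3         !  gridded file
      GDNAM3D = 'DEMO_GRID'
      GDTYP3D = LATGRD3         !  lat-lon coordinates
      P_ALP3D = 0.0D0
      P_BET3D = 0.0D0
      P_GAM3D = 0.0D0
      XCENT3D = 0.0D0
      YCENT3D = 0.0D0
      XORIG3D = -80.0D0
      YORIG3D =  30.0D0
      XCELL3D = 0.1D0
      YCELL3D = 0.1D0
      NCOLS3D = 10
      NROWS3D = 10
      NLAYS3D = 1
      NTHIK3D = 1
      VGTYP3D = IMISS3          !  single layer:  vertical grid unused
      VGTOP3D = BADVAL3

      SDATE3D = 2009001         !  starting date, YYYYDDD
      STIME3D = 0               !  starting time,  HHMMSS
      TSTEP3D = 10000           !  hourly;  0 for time-independent data

      NVARS3D = 1
      VNAME3D( 1 ) = 'DEMOVAR'  !  unset variable-names here are
      UNITS3D( 1 ) = 'none'     !  exactly what causes the segfault
      VDESC3D( 1 ) = 'demonstration variable'
      VTYPE3D( 1 ) = M3REAL

      DO  I = 1, MXDESC3        !  blank out the file description,
          FDESC3D( I ) = ' '    !  rather than leaving ASCII NULs
      END DO
      FDESC3D( 1 ) = 'Demonstration file'

      IF ( .NOT. OPEN3( 'MYFILE', FSNEW3, 'MKFILE' ) ) THEN
          CALL M3EXIT( 'MKFILE', 0, 0,
     &                 'Could not create MYFILE', 2 )
      END IF
      CALL M3EXIT( 'MKFILE', 0, 0, 'Created file MYFILE', 0 )
      END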
"I just wrote this file -- why can't I ncdump/PAVE it now!"
Probably the file wasn't shut correctly. (Unless
you've declared a file "volatile" by
setenv <file> <path> -v
,
netCDF doesn't update the file header until you call
SHUT3()
or M3EXIT()
.)
This usually means that the script which ran your program failed to execute correctly the setenv that defines the logical name for the file. Try using the env command in the script before you run the program, in order to get started debugging your script.
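For example (illustrative logical and path names):

    setenv MYFILE /data/mydir/myfile.ncf    #  define the logical name
    env | grep MYFILE                       #  ...and verify that it took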
"Why does the linker complain that ncabor_ or open3_ (etc.) is an undefined symbol?"
There are three probable causes we've been observing:
Probably, the command line that links your program has -lnetcdf before -lioapi instead of after. (Most UNIX linkers only try to resolve things in terms of libraries yet to be scanned, and don't go backwards.) E.g., if you have

    !!! INCORRECT !!!
    f77 -o foo foo.o ... -lnetcdf -lioapi

the linker won't know where to go to find netCDF functions that are called in the I/O API; instead, if you use

    !! CORRECT:
    f77 -o foo foo.o ... -lioapi -lnetcdf

then the linker will scan -lnetcdf to find the functions called in -lioapi.
Another possibility is that you are doing multilingual programming, and using maybe "cc" or "g++" or something else to do the link step. If so, you need to explicitly list the libraries that f77 would include. The list of these is vendor-dependent but frequently looks something like

    ... -lf77 -lU77 -lm
One way to find out is to try to use
f77 in verbose mode
(f77 -v ...
on most UNIX
systems) to do the linking: it may not find
the needed C++ libraries, but it will tell
you what libraries it needed for the Fortran
part of the I/O API and you can then
modify your original link command to use them.
Compilers "mangle" the names of
Fortran COMMON
blocks, functions,
and subroutines in various ways (usually turn
them into lower case, and then prefix or
postfix them by one (or, for
gcc/g77, sometimes two)
underscores. This will be a problem when
you use the Intel or Portland Group compilers
on Linux systems that come with a
system-installed libnetcdf.a
(which will have been built with
gcc/g77
).
The precise mangling behavior depends upon the
compiler, your system defaults file for the
compiler, and the compile/link command lines
themselves. (It can also happen that netCDF was built without the Fortran or C++ support that your model was expecting.) A useful UNIX utility for diagnosing these problems is nm, which reports what linker-visible symbols are present in binary executable, object (.o), or library (.so and .a) files.
So if you see a linker error message like
symbol foo_ not found (referenced in bar.o)
then do the following sorts of things:
nm foo.o | grep -i foo
nm libnetcdf.a | grep -i foo
nm libioapi.a | grep -i foo
etc, and maybe
man -k foo
to try to find which program-component has the
differently-mangled symbol that the linker
needs. Then go back and review the compiler
flags used in the build-process for that
component. The I/O API supplies the script nm_test.csh to help you with this: run

    nm_test.csh <obj-file> <lib-file> <symbol>
Sometimes you'll find that the missing symbol was in a system routine that the compiler should have known about but somehow (maybe bad compiler-installation) didn't. That one happened to me earlier this week (as I write this May 3, 2002) on an HP system.
This is a PAVE bug, not an I/O API bug: the original person who wrote the file-reader for PAVE couldn't be bothered to use the I/O API, but instead used raw netCDF reads without proper data-structure and error checking. NetCDF fills in "holes" in its files with a particular fill-value that you are seeing, and this is an indication that the data for that variable and time step was never written to the file. This happens, for example, at the starting time for an MM5/MCPL run, for some of the variables which aren't calculated until after the run is in progress.
This is fixed in Pave Version 2 and later.
>>> WARNING in subroutine CHKFIL3 <<< Inconsistent file attribute NVARS for file FOO Value from file: 6 Value from caller: 9
This means that the file FOO already exists, you are opening it for update (FSTATUS=FSUNKN3) in the call to OPEN3, and FOO's header does not match the file description you have supplied in the FDESC3 COMMONs.
For the I/O API, you can't change a file's definition once it has already been created. What you probably want to do is to delete the existing file (or move it somewhere else), and re-run your program--this time creating a new file according to the description you supply.
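For example (illustrative names):

    mv /data/foo.ncf /data/foo.ncf.old    #  move the old file aside
    ./myprogram                           #  re-run, creating a new file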
    PGF90-W-0006-Input file empty (<somewhere>/ioapi/ddtvar3v.F)
    PGF90/any Linux/x86 5.2-4: compilation completed with warnings
There are three worker routines that are empty after preprocessing for the non-coupling-mode compiles. Some compilers treat the attempt to compile an empty file as a problem situation... It isn't.
Send comments to
Carlie J. Coats, Jr.
cjcoats@email.unc.edu