Posix wc
Mon May 18 17:57:08 2015
The Linux coreutils implementation of wc appears to deviate from Posix in two ways.
The first, and more trivial, way is that its output seems to have extra spaces in it. Posix says that the output format will default to
"%d %d %d %s\n"
But here you can see some sample output with leading space and extra space between the fields.
granicus% wc ../mkdir/*
10 138 8872 ../mkdir/mkdir
69 183 1275 ../mkdir/mkdir.c
79 321 10147 total
granicus%
The Posix rationale section notes that the System V format was
"%7d%7d%7d %s\n"
which may be the source of the additional whitespace.
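To make the difference between the two formats concrete, here is a minimal sketch that formats the same counts both ways. The helper names format_posix and format_sysv are my own, purely for illustration; they are not taken from coreutils or from Posix.

```c
#include <stdio.h>

/* Hypothetical helpers (names are illustrative only).
 * Each writes one wc result line into buf and returns snprintf's result. */

/* Posix default format: single spaces, no padding. */
int format_posix(char *buf, size_t size, int lines, int words, int bytes,
                 const char *name)
{
    return snprintf(buf, size, "%d %d %d %s\n", lines, words, bytes, name);
}

/* System V format from the Posix rationale: each count is
 * right-aligned in a seven-character field. */
int format_sysv(char *buf, size_t size, int lines, int words, int bytes,
                const char *name)
{
    return snprintf(buf, size, "%7d%7d%7d %s\n", lines, words, bytes, name);
}
```

Calling both with the counts 10, 138, 8872 and the name ../mkdir/mkdir, the Posix format yields a line with no leading space, while the System V format pads each count on the left, which is the kind of extra whitespace described above.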
The second issue is potentially more serious, as it produces different counts. It depends on what a "word" is. Posix specifies that a word is "a non-zero-length string of characters delimited by white space".
I take this to mean (in pseudocode):
if we aren't in a word and the current character is not a whitespace character, then start a word
if we are in a word and the current character is a whitespace character, then end or stop being in a word.
in C:
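wasspace = 1; /* beginning of file is effectively whitespace */
while ((ch = fgetc(f)) != EOF) {
    fb++;                           /* every character counts as a byte */
    if (ch == '\n') {
        fl++;                       /* newline ends a line */
    }
    if (wasspace && !isspace(ch)) {
        fw++;                       /* first non-space after whitespace starts a word */
        wasspace = 0;
    }
    if (isspace(ch)) {
        wasspace = 1;               /* whitespace ends the current word, if any */
    }
}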
This gives
granicus% ./wc ../mkdir/*
10 166 8872 ../mkdir/mkdir
69 183 1275 ../mkdir/mkdir.c
79 349 10147 total
in contrast to the coreutils output above.
Changing
if (wasspace && !isspace(ch)) {
to
if (wasspace && !isspace(ch) && isprint(ch)) {
makes my implementation match the Linux coreutils one. Thus the coreutils wc seems to be using a definition of word equivalent to "a non-zero-length string of printable characters delimited by white space". Note that with this change a non-printable character never starts a word, but it doesn't end one either. This is only likely to matter for binary files, and the number of "words" in a binary file is pretty much useless anyway, but I believe this deviates from Posix. It's possible that coreutils wc can be invoked in such a way as to match, but I haven't found one. Also, the coreutils definition is probably better than the Posix one, but that's a matter for a defect report or change request for Posix.
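The two definitions can be compared side by side on an in-memory buffer. This is only a sketch: count_words and its printable_only flag are names I made up for illustration, and the printable-only branch is my guess at behavior equivalent to coreutils, not its actual code. The results also depend on the locale, since isspace and isprint do; the default "C" locale is assumed here.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count words in buf[0..len).
 * printable_only == 0: Posix reading, any non-whitespace character
 *                      following whitespace starts a word.
 * printable_only != 0: only a printable non-whitespace character starts
 *                      a word (assumed equivalent to coreutils). */
long count_words(const unsigned char *buf, size_t len, int printable_only)
{
    long words = 0;
    int wasspace = 1;   /* beginning of input is effectively whitespace */
    size_t i;

    for (i = 0; i < len; i++) {
        int ch = buf[i];    /* unsigned char value, safe for <ctype.h> */

        if (wasspace && !isspace(ch) && (!printable_only || isprint(ch))) {
            words++;
            wasspace = 0;
        }
        if (isspace(ch)) {
            wasspace = 1;
        }
    }
    return words;
}
```

For the nine bytes "\001\002 abc \003", the Posix reading counts three words (the two control-character runs and "abc"), while the printable-only reading counts just one. That is the same kind of gap as the 166 versus 138 words reported for the mkdir binary above.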