Bad Data

Posix wc

Mon May 18 17:57:08 2015

The Linux/coreutils implementation of wc seems to deviate from Posix in two ways.

The first, and more trivial, is that its output seems to have extra spaces in it. Posix says that the output format will default to

"%d %d %d %s\n"

But here you can see some sample output with leading spaces and extra spaces between the fields.

granicus% wc ../mkdir/*    
   10   138  8872 ../mkdir/mkdir
   69   183  1275 ../mkdir/mkdir.c
   79   321 10147 total
granicus%

The Posix rationale section notes that the System V format was

"%7d%7d%7d %s\n"

which may be the source of the additional whitespace.
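To make the difference between the two format strings concrete, here is a minimal sketch in C. The function names format_posix and format_sysv are mine, purely for illustration, and I use %ld rather than %d so the counts can be long. Note that the coreutils sample above uses narrower columns than seven characters, so it is not literally the System V format either; this only shows where padding of that kind comes from.

```c
#include <stdio.h>

/* Posix default format: fields separated by single spaces, no padding.
   (format_posix is a hypothetical helper name, not part of any wc.) */
int format_posix(char *buf, size_t n, long lines, long words, long bytes,
                 const char *name)
{
    return snprintf(buf, n, "%ld %ld %ld %s\n", lines, words, bytes, name);
}

/* System V format quoted in the Posix rationale: each count is
   right-aligned in a 7-character column. (Also a hypothetical helper.) */
int format_sysv(char *buf, size_t n, long lines, long words, long bytes,
                const char *name)
{
    return snprintf(buf, n, "%7ld%7ld%7ld %s\n", lines, words, bytes, name);
}
```

Calling both with the mkdir.c counts (69, 183, 1275) gives "69 183 1275 mkdir.c" for the Posix format and "     69    183   1275 mkdir.c" for the System V one.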

The second issue is potentially more serious, as it actually produces different counts. It comes down to what counts as a "word". Posix specifies that a word is "a non-zero-length string of characters delimited by white space".

I take this to mean (in pseudocode):

if we aren't in a word and the current character is not a whitespace character, then start a word

if we are in a word and the current character is a whitespace character, then end or stop being in a word.

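In C:

wasspace = 1; /* beginning of file is effectively whitespace */
while ((ch = fgetc(f)) != EOF) {
    fb++; /* every character read counts as a byte */
    if (ch == '\n') {
        fl++; /* newline ends a line */
    }
    if (wasspace && !isspace(ch)) {
        fw++; /* first non-space after whitespace starts a word */
        wasspace = 0;
    }
    if (isspace(ch)) {
        wasspace = 1; /* whitespace ends any current word */
    }
}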

This gives

granicus% ./wc ../mkdir/*
10 166 8872 ../mkdir/mkdir
69 183 1275 ../mkdir/mkdir.c
79 349 10147 total

in contrast to the coreutils output above.

Changing

    if (wasspace && !isspace(ch)) {

to

    if (wasspace && !isspace(ch) && isprint(ch)) {

makes my implementation match the Linux coreutils one. So coreutils wc seems to be using a definition of word equivalent to "a non-zero-length string of printable characters delimited by white space". This is only likely to matter for binary files, and the number of "words" in a binary file is pretty much useless anyway, but I believe it deviates from Posix. It's possible that coreutils wc can be invoked in a way that matches Posix, but I haven't found one. The coreutils definition is also probably better than the Posix one, but that's a matter for a defect report or change request against Posix.
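The two definitions are easy to compare side by side on a buffer in memory. The following is a sketch only: count_words and its printable_only flag are names I made up for illustration, and the "printable" variant is my reconstruction of the behavior that matched coreutils, not the actual coreutils code.

```c
#include <ctype.h>
#include <stddef.h>

/* Count words in buf[0..len).
   printable_only == 0: any non-whitespace character may start a word
                        (my reading of the Posix definition).
   printable_only != 0: only a printable character may start a word
                        (the variant that matched the coreutils counts).
   count_words is a hypothetical helper, not taken from any real wc. */
size_t count_words(const unsigned char *buf, size_t len, int printable_only)
{
    size_t words = 0;
    int wasspace = 1; /* beginning of input is effectively whitespace */

    for (size_t i = 0; i < len; i++) {
        int ch = buf[i];
        if (wasspace && !isspace(ch) && (!printable_only || isprint(ch))) {
            words++;
            wasspace = 0;
        }
        if (isspace(ch)) {
            wasspace = 1;
        }
    }
    return words;
}
```

In the C locale, the seven bytes 0x01 0x02 ' ' 'a' 'b' ' ' 0x03 count as 3 words under the first definition and 1 under the second, while ordinary text such as "hello world\n" counts as 2 either way.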
