common: manage directory search list as views instead of allocated strings
This list is derived from the `Gvfilepath` global, something that is live for
the entire run of a program. So we can avoid `strdup`-ing components being
extracted out of it and maintain these as read-only references instead. This is
both faster and decreases memory usage.
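As a rough illustration of the idea, the sketch below splits a long-lived string into non-owning views instead of `strdup`-ed copies. The `view_t` type, the `split_views` name, and the `:` separator are assumptions made for this example, not the names or separator used in the actual change.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning reference into a longer-lived string. */
typedef struct {
  const char *data; /* start of the component, not NUL-terminated */
  size_t size;      /* number of bytes in the component */
} view_t;

/* Split `path` on `sep`, storing at most `max` views in `out`.
 * Returns the number of views stored. Nothing is allocated or copied;
 * the views are only valid while `path` itself is alive. */
static size_t split_views(const char *path, char sep, view_t *out,
                          size_t max) {
  assert(path != NULL && sep != '\0');
  assert(out != NULL || max == 0);
  size_t n = 0;
  const char *start = path;
  while (n < max) {
    const char *end = strchr(start, sep);
    size_t size = end == NULL ? strlen(start) : (size_t)(end - start);
    out[n++] = (view_t){.data = start, .size = size};
    if (end == NULL)
      break; /* that was the last component */
    start = end + 1;
  }
  return n;
}
```

Because each view borrows from the source string, this only works when the source outlives every view, which is the property the message relies on for `Gvfilepath`.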
gvc auto_output_filename: avoid 'strdup' when constructing file name
This block of code is scanning a string of `:`-separated entries and writing
them into `buf` in reverse order `.`-separated. We can rewrite it to avoid
dynamic allocation, thus avoiding certain failure cases and locking overhead.
Unfortunately there seems to be no variant of `strrchr` that takes a length. So
we need to write our own loop for locating the last `:` with a limit.
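A minimal sketch of such a length-limited reverse search follows. The name `last_colon` is hypothetical and the real loop in `auto_output_filename` may be shaped differently. (`memrchr` does something similar but is a GNU extension, so it is not portable.)

```c
#include <assert.h>
#include <stddef.h>

/* Find the last ':' within the first `len` bytes of `s`.
 * Returns a pointer to it, or NULL if there is none in that range. */
static const char *last_colon(const char *s, size_t len) {
  assert(s != NULL || len == 0);
  for (size_t i = len; i > 0; --i) {
    if (s[i - 1] == ':')
      return &s[i - 1];
  }
  return NULL;
}
```

Calling this repeatedly with a shrinking `len` visits the entries from last to first without modifying or copying the source string.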
cgraph tok: add extra check for duplicate separator characters
When a user supplies the same character more than once when constructing a
tokenizer, nothing is functionally incorrect but this is a strong indicator of a
bug or misunderstanding by the caller. For example, a bug like this is
documented in #2259.
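One way such a check could look is sketched below. The function name is an assumption for illustration and is not taken from the tokenizer's source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if any character occurs more than once in `separators`. */
static bool has_duplicate_separator(const char *separators) {
  assert(separators != NULL);
  for (const char *p = separators; *p != '\0'; ++p) {
    /* look for the same character anywhere later in the string */
    if (strchr(p + 1, *p) != NULL)
      return true;
  }
  return false;
}
```

A caller would typically wrap this in an assertion, e.g. `assert(!has_duplicate_separator(separators))`, so the check costs nothing in release builds.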
When using `shape=record`, certain characters within labels have alternate
semantics. The switch in `parse_reclbl` handles these, with most characters
having no special semantics and branching to the default case. The trailing loop
in this case was attempting to accrue UTF-8 continuation bytes. But let's take a
look at the Wikipedia table describing UTF-8:¹
Notice that the continuation bytes are distinguished by upper 0b10 bits. Now
consider that the trailing loop was using a mask with 128 (0b10000000) to
identify such bytes. Such a value masks _out_ bit 6. That is, this loop
condition expression was true for all values with upper 0b10 bits _or_ upper
0b11 bits.
The first consequence of this is that this loop thought multiple consecutive
non-ASCII characters were a single character. It treated the 0b11xxxxxx byte 1
of a new non-ASCII character as if it were another continuation byte of the
preceding non-ASCII character.
The second consequence of this is that an ASCII character followed by non-ASCII
characters would be treated as a single character. That is, 0b0xxxxxxx bytes led
to adjacent following 0b1xxxxxxx bytes being consumed along with them.
These factors combine in the #925 example to confuse the state machine of the
containing loop and result in a malformed label being produced.
This fix adjusts the mask such that it takes _both_ bit 7 and bit 6 and compares
against 0b10xxxxxx.
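The difference between the two masks can be sketched as follows. The function names are hypothetical and the code is a standalone illustration, not an excerpt from `parse_reclbl`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Old test, as described above: true for any byte with bit 7 set, so it
 * accepts both 0b10xxxxxx and 0b11xxxxxx. */
static bool is_continuation_old(unsigned char c) { return (c & 128) != 0; }

/* Fixed test: true only for bytes of the form 0b10xxxxxx. */
static bool is_continuation(unsigned char c) { return (c & 0xC0) == 0x80; }

/* Count the continuation bytes that immediately follow the byte at s[0]. */
static size_t trailing_continuations(const char *s) {
  assert(s != NULL && s[0] != '\0');
  size_t n = 0;
  while (is_continuation((unsigned char)s[1 + n]))
    ++n;
  return n;
}
```

With the fixed test, the leading byte of a following non-ASCII character ends the run, so adjacent characters are no longer merged.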
There are a number of other odd things going on with this code that I did not
attempt to change:
1. ASCII characters are allowed to have following UTF-8 continuation bytes.
This would be considered malformed, but this code treats it as legal.
2. UTF-8 permits a maximum of 3 continuation bytes (see table above) but this
code allows an arbitrary number of continuation bytes. It also does no
validation that the leading byte’s upper bit values and the number of
continuation bytes correspond.
3. There is something called “hard space” mode that this code toggles to keep
track of when a space that would otherwise be omitted needs to be
preserved in the output. Once the hard space flag is toggled on, the code
appears to never toggle it off. This looks like it has the (presumably
unintended) effect of something like `"\\ "` causing all later spaces in
the label to become hard spaces.
This function was using the current system locale to encode and decode data sent
to Graphviz and received from Graphviz when using a textual output format. As a
result, encoding exceptions would occur if either the input or the output
contained non-ASCII characters and the system locale was not a UTF-8 one.
Apparently none of the current test suite hits this scenario. However, an
upcoming commit adds a test case that does.
This change forces the encoding and decoding to be done as UTF-8, which is also
what Graphviz unconditionally uses.
This code was using `strtok` as if it splits based on the single separator
passed to it. But `strtok` actually treats the second parameter as a list of
character separators. In this change, we rephrase this code to do what its
original author appears to have intended.
This slightly changes the semantics of this code. We do not know the exact
intent of the original, so the new behavior is a best guess at what the author
meant.
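To illustrate the distinction, the sketch below measures a token that ends at the whole separator string, in contrast to `strcspn`, which (like `strtok`) treats its second argument as a set of characters. The helper name `token_len` and the `"->"` separator are assumptions for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the token at the start of `s`, ending at the first occurrence
 * of the whole string `sep`, or at the end of `s` if `sep` does not occur. */
static size_t token_len(const char *s, const char *sep) {
  assert(s != NULL && sep != NULL && sep[0] != '\0');
  const char *end = strstr(s, sep);
  return end == NULL ? strlen(s) : (size_t)(end - s);
}
```

For the input `"a-b->c"` and separator `"->"`, `token_len` yields 3 (the token `a-b`), whereas `strcspn` with the same arguments stops at the first `-` and yields 1.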
Clang seems to consider `{NULL}` different from `{0}`, with the latter being an
intent of zero initialization and the former a possible accidental omission of
other fields.
Vincent Fu [Fri, 15 Jul 2022 00:19:20 +0000 (20:19 -0400)]
dot.demo: replace LDFLAGS with LDLIBS in Makefile
With LDFLAGS I am unable to build the demo programs using the Makefile
but the Makefile works with LDFLAGS changed to LDLIBS. We are using
pkg-config to obtain the appropriate libraries. So LDLIBS is the
appropriate variable to use.
The previous use of `oldof` was a verbose way of allocating a single element, so
we replace it with the central allocation helper, also avoiding crashes if
allocation fails.
xdot sprintXDot: steal agxbuf’s buffer instead of double copying
8064f6e902cc4c3062cffa2d1d307ee9cf1893bb replaced lib/xdot’s inline copy of a
subset of the agxbuf.h API with an include of the header containing the full
API. This gives us access to `agxbdisown`. This function effectively does the
work of `agxbuse;strdup;agxbfree` by taking the existing dynamically allocated
buffer within the `agxbuf` object, rather than making yet another copy of this
data only to discard the original.
edgepaint: remove unnecessary 'strdup' of 'lightness'
Pointers `getopt` returns in `optarg` point into the original `argv` which lives
in immortal storage. There is no need to duplicate such a pointer to prolong its
lifetime.
This commit looks like it is changing the source string, but `arg` and `optarg`
point at the same thing at this point. But `optarg` is not `const` qualified, so
we can do this assignment without a compiler warning.
smyrna: remove unnecessary 'strdup' calls in 'mTestgvpr'
The strings being duplicated are passed through to `gvpr` which does not modify
its arguments. So by rearranging when we release `bf2`, we can remove the need
to dynamically allocate the members of `argv`.
smyrna load_attributes: use a string view for 'ss'
This code contained multiple memory leaks and unchecked allocations:¹
1. `pch` was `strdup`-ed into `ss` on line 58. But `strdup`-ed again when
being saved to an `attr` field. This lost the memory originating from the
first `strdup`.
2. Cases 0, 3, and 4 of the switch do not save the full contents of `ss` at
all. This means naively removing the `strdup` calls in cases 1, 2, and
default would not have solved the memory leak in (1) because cases 0, 3,
and 4 would still leak memory.
3. None of the `strdup` calls in this function were checked for failure.
This commit attempts to solve all the above. We now take a read-only reference
to the string data on line 58 and only `strdup` it when needed.
¹ It also assumes all lines of the input file are fewer characters than
`BUFSIZ`, a platform-dependent constant. I do not know why this would be
guaranteed. However, this problem seems orthogonal to the above.
This loop contains no `continue` statements, its counter is incremented in a
regular way, and the counter is unused outside the loop. So we can write the
loop more concisely and scope `attrcount` more tightly by using a `for` loop
instead of a `while` loop.
This loop contains no `continue` statements, its counter is incremented in a
regular way, and the counter is unused outside the loop. So we can write the
loop more concisely and scope `ind` more tightly by using a `for` loop instead
of a `while` loop.
Pointers `getopt` returns in `optarg` point into the original `argv` which lives
in immortal storage. There is no need to duplicate such a pointer to prolong its
lifetime.
Sequence IDs are calculated using 64-bit counters in `Agclos_s`. But then the
field used to store sequence IDs, `Agtag_s.seq`, is `sizeof(unsigned) * 8 - 4`
bits wide, 28-bit on x86 and x86-64. As a result, the compiler believes IDs that
exceed 2²⁸ - 1 can occur and overflow `Agtag_s.seq`:
edge.c:213:30: warning: conversion from 'int' to 'unsigned int:28' may change
value [-Wconversion]
213 | AGSEQ(in) = AGSEQ(out) = seq;
| ^~~
...
graph.c: In function 'agopen1':
graph.c:77:20: warning: conversion from 'uint64_t' {aka 'long unsigned int'}
to 'unsigned int:28' may change value [-Wconversion]
77 | AGSEQ(g) = agnextseq(par, AGRAPH);
| ^~~~~~~~~
...
node.c: In function 'newnode':
node.c:76:16: warning: conversion from 'uint64_t' {aka 'long unsigned int'} to
'unsigned int:28' may change value [-Wconversion]
76 | AGSEQ(n) = seq;
| ^~~
...
node.c: In function 'agnodebefore':
node.c:359:22: warning: conversion from 'uint64_t' {aka 'long unsigned int'}
to 'unsigned int:28' may change value [-Wconversion]
359 | AGSEQ(snd) = (g->clos->seq[AGNODE] + 2);
| ^
In practice, ingesting a graph of this size is not achievable, so these
overflows cannot occur.
This change introduces assertions and casts in these cases to explain the
assumptions to the compiler. It squashes the above warnings. In future, perhaps
these fields should all be made to consistently use the same type.
gv_trim_zeros: identify string extent instead of writing a '\0'
The buffer that this function was truncating is destined for `gvwrite`. So we
can make the whole thing read-only by identifying a string extent instead of
modifying the buffer in place. The compiler may have been able to identify the
intent of this code anyway¹ but if not these changes make it clearer how this
code can be optimized.
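A sketch of the extent-based approach is below, assuming the input is a decimal number rendered as text. The name `trimmed_len` is hypothetical and the exact trimming rules of `gv_trim_zeros` may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of `s` once trailing zeros after a decimal point, and a decimal
 * point left bare by that, are ignored. `s` itself is not modified. */
static size_t trimmed_len(const char *s) {
  assert(s != NULL);
  size_t len = strlen(s);
  if (strchr(s, '.') == NULL)
    return len; /* no fractional part, nothing to trim */
  /* the '.' is not a '0', so this loop cannot run past it */
  while (len > 0 && s[len - 1] == '0')
    --len;
  if (len > 0 && s[len - 1] == '.')
    --len; /* "2." becomes "2" */
  return len;
}
```

The caller can then pass the original pointer together with this length to a length-taking writer, leaving the buffer untouched.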
This looks like a bit of a strange change, given that we now wrap the entire file in
`extern "C"`. However this has two key benefits:
1. `dot_builtins` and `dot_static` that include this source needed an
Autotools hack¹ to force compilation to use the C++ front end (`c++`)
instead of the C front end (`cc`) in order to link against the C++ standard
library. By moving this source into C++ we can remove this hack.
2. When trying to integrate `dot_builtins` into the CMake build system, MSVC
complains (correctly) that the initializers to the array in this file are
not compile-time constants. GCC and Clang apparently allow this by a
non-standard extension. By moving this into C++, we get more relaxed
initialization semantics that allow this on all compilers.
This code is not currently compiled and, in fact, will not compile if you try to
re-enable it. As an example issue, it uses `GD_inleaf`, a macro intended for
accessing `Agraphinfo_t` fields, on a `node_t`. This is sort of a double mistake
as `Agraphinfo_t` also has no `inleaf` field. This problem seems to have been
present in the very first Graphviz revision, 256ef66663ca0c072554ee3f5e7971911031b3c7. Fortunately the mistakes sort of
cancelled each other out because the `GD_*` macros did no casting and
`Agnodeinfo_t` _does_ have an `inleaf` field. The outcome seems to be what the
author intended, even if the route by which they got there was not intended.
The above is only one of several issues with this code. Resurrecting it has
unknown cost and unknown benefit, so we remove it here to avoid the implication
that it can be easily switched back on.
Contrary to the X11 documentation,¹ it seems button values other than 1-5 can be
returned as button press events. The assertions altered in this commit were
introduced to guarantee the value does not exceed the limits of the type of the
parameter in the user’s callback (`int`). So we can safely relax this to just
the limit itself.
This was validated by exhaustively comparing the output of the function before
and after the change for all inputs with strlen ≤ 2. Not bulletproof, but it is a strong signal
that the new version is functionally identical.