Set and bag operations in shell
Tue Nov 17 14:41:29 2020
Peter Krumins has a blog post about set operations in the unix shell which has some issues.
- Many of his operations are bash specific, and so I wouldn't call them for the "unix" shell, but rather the bash shell.
- For the most part, they aren't set operations at all, but rather bag operations.
- There is a join(1) command which can simplify some of these.
But the items are good for what they are. Following is my take on these.
First some definitions and specifications.
- A set is a unique list of elements, one per line.
- A bag is a non-necessarily unique list of elements.
- Sets and bags will be in files names with capital letters.
- Specific elements will be given in single quotes.
- A plain unquoted 'e' will be used to denote an arbitrary element
- Tests output nothing to stdout and exit 0 if the test is true and 1 if false
Set Membership
This is a simple grep against a fixed string.
grep -Fxq e A
Set Equality
This is more difficult, because the set files may not be sorted, and we want to avoid intermediate files. We can use join here to output non-matching lines.
expr "$(join -v1 -v2 A B | head -n1 | wc -l)" = 1 > /dev/null
Ok, that's hideous. I am not aware of any unix command that will return by exit value if it's input is empty.
Merely outputing the wc -l won't give an exit value, so by itself that's not sufficient to meet the above specs.
So, here's a better version:
join -v1 -v2 A B | head -n1 | (read x && test -n "$x")
At least a little better anyway, as it's a pure pipeline, albeit with a subshell. The subshell isn't strictly needed, but it avoids polluting the calling shells namespace.
From here on out the snippets will assume that you have the read and test in a shell function or program named 'nonempty', which returns true if its stdin consists of at least one line.
Similarly the snippets will assume that you have an 'empty' program which is the opposite.
And both of those will only need to read at most one line, so a preceeding head -n1 isn't needed.
Set Cardinality
The cardinality of a (finite) set is just the number of elements. Unlike the above, this produces a result, so we don't have to do stupid tricks with expr to get an exit status.
wc -l A | cut -d ' ' -f1
Ok, that's still dumb, because there isn't any way to tell wc to not say the name of the file, if it reads from a file rather than stdin. So you can do
wc -l < A
But the redirect has a certain goofiness of it's own.
We can let awk do the work.
awk 'END { print NR }' A
I think that's posix. It's going to be generally true that "let awk handle it" will solve these issues. I'm not sure that's a shell solution though, and I'd have to retitle this post as set operations in awk, which isn't nearly as cool.
Subset
Here we're testing if A is a subset of B
join -v1 A B | empty
Here the join will output the elements in A that didn't have a match. If that's empty, then it's a subset. Note that this is not necessarily a proper subset, as the sets could be equal.
Union
Since sets can't have any duplicate elements, we just sort -u
sort -u A B
Intersection
The join command makes this trivial.
join A B
Complement
The elements of A that are not in B
join -v1 A B
Symmetric Difference
Elements that are in A or B but not both
join -v1 -v2 A B
Power Set
A power set is a 'set of sets', specifically the set of all subsets of a set. We don't really have any way to represent arbitrarily nested sets. Since we don't need to parse them, just generate them, I'll put them as the elements separated by spaces on a single line.
If we assign a bit to each element of the starting set, then the power set is the set of all N bit numbers, with the subset elements corresponding to the set bits.
a b c = three elements.
so it has subsets:
000 = {}
001 = {c}
010 = {b}
011 = {b,c}
etc.
and then the power set is just the set of all of those sets.
{ {}, {a}. {b}. {c}, {a, b} ... }
So, I can output the power set by getting the count, computing 2^count, then doing some ridiculous bit bashing.
And for even as little as 32 elements, the powerset will have 4 billion elements.
The obvious conclusion is "don't do this in shell".
Posted in / software