## Set and bag operations in shell

Tue Nov 17 14:41:29 2020

Peter Krumins has a blog post about set operations in the unix shell which has some issues.

- Many of his operations are bash specific, and so I wouldn't call
them for the "unix" shell, but rather the
*bash*shell. - For the most part, they aren't set operations at all, but rather bag operations.
- There is a join(1) command which can simplify some of these.

But the items are good for what they are. Following is my take on these.

First some definitions and specifications.

- A
*set*is a unique list of elements, one per line. - A
*bag*is a non-necessarily unique list of elements. - Sets and bags will be in files names with capital letters.
- Specific elements will be given in single quotes.
- A plain unquoted 'e' will be used to denote an arbitrary element
- Tests output nothing to stdout and exit 0 if the test is true and 1 if false

## Set Membership

This is a simple grep against a fixed string.

```
grep -Fxq e A
```

## Set Equality

This is more difficult, because the set files may not be sorted, and we want to avoid intermediate files. We can use join here to output non-matching lines.

```
expr "$(join -v1 -v2 A B | head -n1 | wc -l)" = 1 > /dev/null
```

Ok, that's hideous. I am not aware of any unix command that will return by exit value if it's input is empty.

Merely outputing the wc -l won't give an exit value, so by itself that's not sufficient to meet the above specs.

So, here's a better version:

```
join -v1 -v2 A B | head -n1 | (read x && test -n "$x")
```

At least a little better anyway, as it's a pure pipeline, albeit with a subshell. The subshell isn't strictly needed, but it avoids polluting the calling shells namespace.

From here on out the snippets will assume that you have the read and test in a shell function or program named 'nonempty', which returns true if its stdin consists of at least one line.

Similarly the snippets will assume that you have an 'empty' program which is the opposite.

And both of those will only need to read at most one line, so a preceeding head -n1 isn't needed.

## Set Cardinality

The cardinality of a (finite) set is just the number of elements. Unlike the above, this produces a result, so we don't have to do stupid tricks with expr to get an exit status.

```
wc -l A | cut -d ' ' -f1
```

Ok, that's still dumb, because there isn't any way to tell wc to not say the name of the file, if it reads from a file rather than stdin. So you can do

```
wc -l < A
```

But the redirect has a certain goofiness of it's own.

We can let awk do the work.

```
awk 'END { print NR }' A
```

I think that's posix. It's going to be generally true that "let awk
handle it" will solve these issues. I'm not sure that's a *shell* solution
though, and I'd have to retitle this post as set operations in awk, which
isn't nearly as cool.

## Subset

Here we're testing if A is a subset of B

```
join -v1 A B | empty
```

Here the join will output the elements in A that didn't have a match.
If that's empty, then it's a subset. Note that this is not
necessarily a *proper* subset, as the sets could be equal.

## Union

Since sets can't have any duplicate elements, we just sort -u

```
sort -u A B
```

## Intersection

The join command makes this trivial.

```
join A B
```

## Complement

The elements of A that are not in B

```
join -v1 A B
```

## Symmetric Difference

Elements that are in A or B but not both

```
join -v1 -v2 A B
```

## Power Set

A power set is a 'set of sets', specifically the set of all subsets of a set. We don't really have any way to represent arbitrarily nested sets. Since we don't need to parse them, just generate them, I'll put them as the elements separated by spaces on a single line.

If we assign a bit to each element of the starting set, then the power set is the set of all N bit numbers, with the subset elements corresponding to the set bits.

a b c = three elements.

so it has subsets:

```
000 = {}
001 = {c}
010 = {b}
011 = {b,c}
```

etc.

and then the power set is just the set of all of those sets.

```
{ {}, {a}. {b}. {c}, {a, b} ... }
```

So, I can output the power set by getting the count, computing 2^count, then doing some ridiculous bit bashing.

And for even as little as 32 elements, the powerset will have 4 billion elements.

The obvious conclusion is "don't do this in shell".

Posted in / software

326 comments.