Tue Nov 17 14:41:29 2020
Peter Krumins has a blog post about set operations in the unix shell
which has some issues.
- Many of his operations are bash specific, so I wouldn't call
them operations for the "unix" shell, but rather for the bash shell.
- For the most part, they aren't set operations at all, but rather
bag operations.
- There is a join(1) command which can simplify some of these.
But the items are good for what they are. Following is my take on these.
First some definitions and specifications.
- A set is a unique list of elements, one per line.
- A bag is a not-necessarily-unique list of elements, one per line.
- Sets and bags will be in files named with capital letters.
- Specific elements will be given in single quotes.
- A plain unquoted 'e' will be used to denote an arbitrary element.
- Tests output nothing to stdout, and exit 0 if the test is true and 1 if false.
Set Membership
This is a simple grep against a fixed string.
grep -Fxq e A
Set Equality
This is more difficult, because the set files may not be sorted, and we
want to avoid intermediate files. We can use join here to output
non-matching lines (note that join(1) itself requires sorted input, so
strictly speaking these join-based recipes assume the set files are kept
sorted).
expr "$(join -v1 -v2 A B | head -n1 | wc -l)" = 0 > /dev/null
Ok, that's hideous. I am not aware of any unix command that will
indicate via its exit status whether its input is empty.
Merely outputting the wc -l count won't give an exit value, so by itself that's
not sufficient to meet the above specs.
So, here's a better version (the leading ! is needed because the pipeline
itself succeeds when there is a leftover line, i.e. when the sets differ):
! join -v1 -v2 A B | head -n1 | (read x && test -n "$x")
At least a little better anyway, as it's a pure pipeline, albeit
with a subshell. The subshell isn't strictly needed, but it
avoids polluting the calling shell's namespace.
From here on out the snippets will assume that you have the read and test
in a shell function or program named 'nonempty', which returns true
if its stdin consists of at least one line.
Similarly the snippets will assume that you have an 'empty'
program which is the opposite.
And both of those will only need to read at most one line, so
a preceding head -n1 isn't needed.
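A minimal sketch of those two helpers as shell functions (the names are just
the ones assumed above, and 'nonempty' uses the read-and-test pipeline from
the equality example):

nonempty() {
    # succeed if stdin has at least one non-blank line
    read x && test -n "$x"
}

empty() {
    # succeed if stdin has no lines at all
    ! read x
}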
Set Cardinality
The cardinality of a (finite) set is just the number of elements.
Unlike the above, this produces a result, so we don't have
to do stupid tricks with expr to get an exit status.
wc -l A | cut -d ' ' -f1
Ok, that's still dumb, because there isn't any way to tell wc
to not say the name of the file, if it reads from a file rather
than stdin. So you can do
wc -l < A
But the redirect has a certain goofiness of its own.
We can let awk do the work.
awk 'END { print NR }' A
I think that's posix. It's going to be generally true that "let awk
handle it" will solve these issues. I'm not sure that's a shell solution
though, and I'd have to retitle this post as set operations in awk, which
isn't nearly as cool.
Subset
Here we're testing if A is a subset of B
join -v1 A B | empty
Here the join will output the elements in A that didn't have a match.
If that's empty, then it's a subset. Note that this is not
necessarily a proper subset, as the sets could be equal.
Union
Since sets can't have any duplicate elements, we just sort -u
sort -u A B
Intersection
The join command makes this trivial.
join A B
Complement
The elements of A that are not in B
join -v1 A B
Symmetric Difference
Elements that are in A or B but not both
join -v1 -v2 A B
Power Set
A power set is a 'set of sets', specifically the set of all subsets of
a set. We don't really have any way to represent arbitrarily
nested sets. Since we don't need to parse them, just generate
them, I'll represent each subset as its elements separated by spaces on a single line.
If we assign a bit to each of the N elements of the starting set, then each
subset corresponds to an N bit number, with the subset's elements being the
ones whose bits are set.
a b c = three elements.
so it has subsets:
000 = {}
001 = {c}
010 = {b}
011 = {b,c}
etc.
and then the power set is just the set of all of those sets.
{ {}, {a}, {b}, {c}, {a, b}, ... }
So, I can output the power set by getting the count, computing 2^count, then doing some ridiculous bit bashing.
And for even as few as 32 elements, the power set will have over 4 billion
elements.
The obvious conclusion is "don't do this in shell".
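For what it's worth, here is a sketch of that bit bashing in awk rather than
pure shell, for a set in file A. It enumerates the numbers 0 through 2^n - 1
and prints the elements whose bits are set, one subset per line (the empty
subset comes out as an empty line). Only sane for small sets.

awk '
{ e[NR] = $0 }
END {
    n = NR
    for (i = 0; i < 2 ^ n; i++) {
        out = ""
        for (j = 1; j <= n; j++)
            if (int(i / 2 ^ (j - 1)) % 2)
                out = out (out == "" ? "" : " ") e[j]
        print out
    }
}' A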
Posted in / software
Wed Apr 25 15:15:50 2018
Here is a script that will output commands to break up the head git commit into
a separate commit for each file. Public domain.
#!/bin/sh
#break up a given commit into separate commits, one for each file
#commit=$(git rev-parse "${1:-HEAD}")
commit=HEAD
#git rebase -i
# grab the original commit's metadata (not used by the echoed commands
# below, but handy if you want to re-apply the original author and date)
author=$(git log -n1 $commit --format='%an <%ae>')
authdate=$(git log -n1 $commit --format='%ad')
msg=$(git log -n1 $commit --format='%B')
# while read?
# undo the commit, leaving its changes in the working tree
echo git reset HEAD^
# emit one "git add && git commit" per file, reusing the original message via -C
git diff-tree --no-commit-id --name-only -r $commit -z |
xargs -0 -n1 -IFILE echo git add FILE '&&' git commit -C $commit -uno
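Since the script only echoes the commands, the intended use is to eyeball the
output and then feed it back to the shell, something like this (splitcommit.sh
is just a hypothetical name for the script above):

% sh splitcommit.sh | sh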
Posted in / software
Thu Jun 22 15:11:38 2017
Here's a small function to create an assertion, which throws an error
if the first argument is not true. Public domain.
create or replace function assert(pass boolean, msg text default null)
returns void
language plpgsql
as $$
begin
    if pass then
        return;
    end if;
    if msg is null then
        get diagnostics msg = PG_CONTEXT;
    end if;
    raise exception '%', msg;
end;
$$;
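A couple of illustrative calls (nothing special about the arithmetic):

select assert(2 + 2 = 4);                          -- returns quietly
select assert(2 + 2 = 5, 'arithmetic is broken');  -- raises 'arithmetic is broken'

With no message, the error raised is the call stack from PG_CONTEXT, which
tells you where the assertion failed.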
Posted in / software / postgres
Wed Mar 15 17:09:03 2017
I frequently need to get the definition of a function out of
postgres, so here's a small shell script I wrote to do that.
#!/bin/sh
{
cat <<EOS
select pg_get_functiondef(oid)
from pg_proc where proname = '$1'
;
EOS
} | psql -Aqt
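The script takes the function name as its only argument, with psql picking up
the connection from the usual PG* environment variables. Assuming it were
saved as getdef (just a name for this example):

% ./getdef myfunc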
Posted in / software / postgres
Tue Mar 15 08:16:06 2016
We're doing it wrong. Unix is powerful because of simple abstractions and
interfaces. All of our web apps are way too complex. We should get back to a
simple notion. In unix "everything is a file". We don't need that as such,
but what people care about is generally bags of bytes, which can be called
files. We care about who can access our information, and how.
We should be thinking about things in terms of "bags of bytes",
"people/groups/roles", and "permission capabilities". Everything else is some
abstraction over these. I posit that if those three were the only
"real" things, and all applications were built up from abstraction layers reducible
to those three, it would be easier and faster to build new applications, and
they would be more secure. If "can this person see these bytes" were baked in
at the lowest level, it would be a lot easier to verify the security of a
person's information, and a lot harder to screw it up. Similarly with
any other particular capability. "can modify", "can delete", etc.
Some notation, a comma separated list in square brackets is a tuple. E.g.
"[file, role]" would be a data item associating a file and a role. The meaning
of this is dependent on the particular application using the tuple.
Objects
files: a bag of bytes with a mime type, metadata, notes, etc. Whether something
is a backup, an image, music, or video is all just a view onto different types of files.
message: text with a source and destination, the text can be viewed as a file.
That is a "message" is a "file" with mandatory source and destination metadata.
role: a user, group, etc.
capability: a label on an object allowing a role to take a particular action.
required capabilities and their semantics are particular to a given
application, but a set of standard names and intended interpretations should be
developed.
acl entry: [role, file, capability]
"ownership" of a file is probably an acl entry. It's entirely possible for
more than one person to "own" a particular bag of bytes. different people
could have different sets of metadata for a file.
file list: an ordered set of files
comment: a [file, file, role]
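Just to make the tuples concrete, here's one way the base objects might look
as sql tables. Every name here is purely illustrative, not a proposal for an
actual schema; an acl entry [role, file, capability] is then just a row in the
acl table.

create table file (
    id       bigserial primary key,
    mimetype text not null,
    metadata jsonb,
    bytes    bytea not null
);

create table role (
    id   bigserial primary key,
    name text not null unique
);

create table acl (
    role_id    bigint not null references role,
    file_id    bigint not null references file,
    capability text not null,  -- e.g. 'can read', 'can modify', 'can delete'
    primary key (role_id, file_id, capability)
);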
Applications
under this ontology, for example:
email: a message wherein the source and dest are email addresses, and the
system attempts to hand off the message. additionally the default notion is
that the files associated are readable only by the source and destination
roles.
blog: a collection of files (messages?) timestamped, maybe tagged, etc, made
viewable in (descending) timestamp order. i.e. a set of [file, timestamp],
possibly [file,role,timestamp] for a group blog, and [file,role,file] for
comments.
flickr: a view of image files made viewable to the public.
twitter: a blog with a very short maximum file size for blog posts.
youtube: a video blog supporting file lists and comments
calendar: set of files, each containing a particular calendar item. could be
ics files, could be something else.
usenet: a public set of messages where the source is a role, and the
destination is a set of newsgroups. The message format is constrained by
relevant rfcs.
Implementation Notes
The primary interface needed for functionality is the ability to manipulate the
base objects. Thus the ability to create roles, files, acls, file lists, etc
is paramount. In theory, once that is done, if it has a public API, other
things can be built on top of those, possibly creating new tuple types to
support their particular application.
"metadata" should probably just be json at this point, and an application with
full permissions should probably be allowed to create new named tuples. For
example, a "blog" application might connect to the backend API at install time
and ask for a new tuple of [file,role, timestamp] to be created and named
"posts", and a second one of [file,role,file,metadata] named "comments" the
metadata on the comments intended for an approval system, but not otherwise
interpreted by the system. Some method of indexing json/metadata might be
needed.
Note that a system wouldn't need to use json for metadata, it could roll its
own, but if it's json, the system can probably be taught to index it easily.
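For instance, in postgres, and reusing the illustrative file table sketched
above, indexing jsonb metadata is a one-liner:

create index file_metadata_idx on file using gin (metadata);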
Thoughts on modification of data: I think it's best to think of data as
immutable, like a number. You can't "modify" 37, it just is what it is. So,
if you are making (say) an image editing program, a particular edit doesn't
modify the original as such, but rather creates a new file (i.e. bag of bytes),
and possibly associates it with the same tags/name/revision chain, or whatever
the image application wants. The application could, if it wanted, do some sort
of garbage collection and delete the original, though it would make more sense
for it to disassociate it from the name instead, since the original could,
conceivably, be in use by some other application. This is analogous to hard
links in a unix system. You can't really delete a file as such, but it will be
removed after the last link is removed. Some system to allow different
applications to use the same link sets would be needed. If you have a picture
gallery application, and an image editing one, you want them to both work on
the same set of pictures by name. It might be worth having an explicit notion
of one file being a revision of another at a low level in the system. I'm not
really sure yet how to solve this problem.
Posted in / software
Fri Dec 4 19:15:17 2015
The ancients developed the notion of "Analysis", which is breaking things apart
to understand them better, and "Synthesis" which is putting things together to
make something new. Several pages on ProjectRho
make note of this, "[Within] the realm of thought there are only two operations
possible: Analysis and Synthesis. That is, cutting something into pieces and
merging two pieces into one." (http://www.projectrho.com/elemental.html)
This isn't a bad start, and is even self-documenting, in that it analyzes thinking
by breaking it into two parts.
Programming is essentially a process of analysis of a problem. Before you can write
a program, you have to know how to solve the problem, and you have to have decomposed
the problem into simple enough pieces that it is easy to write computer code to
solve each piece. If you can't do this, either because you can't solve the original
problem, or because you can't break the problem down far enough for the parts to be
expressible in a way the computer can understand, you will not be successful at
programming.
Where does synthesis come in? Synthesis is related to the notion of code-reuse.
The unix philosophy makes it easier, much easier in fact, to build and rebuild
systems, in two ways that complement each other.
First, unix encourages making each analytic chunk a separable piece of code,
and in particular, a separate executable that can be run all by itself. So
if you've analyzed a problem into three components A, B, and C, the unix way
of doing things is to actually write three programs A, B, and C that solve
the components independently.
However, this leaves the problem of how do you combine A, B, and C into
something that solves the original problem, or to use the terminology above,
how do you synthesize them into a whole? What unix does to make this easy is
the clever and simple solution of having (essentially) all programs communicate
in the same way. A program accepts input via a mechanism called, logically
enough, "standard input", and writes its results to, as you might guess,
"standard output". Standard input and output are then usually just simple
streams of text. By using this mechanism, and by having the interface between
components be simple streams of text, it becomes possible to use a given piece
of code in new ways very simply by inserting it into the middle of the stream.
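A tiny illustration (notes.txt is just a made-up file name): roughly counting
the distinct words in a file by chaining three standard tools, none of which
knows anything about the others:

tr -s ' ' '\n' < notes.txt | sort -u | wc -l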
So, suppose you are solving some other problem which happens to break
down into components X, Y, and Z. If component B happens to match component
Y (or X, or Z, it doesn't matter), you don't have to then spend any time
on solving Y, because you can just re-use B. But you couldn't do that directly
if you hadn't written B as a completely separate component; at best you
might look at the first problem and try to extract the code for B out to
use. A lot of times this is harder than just re-writing from scratch.
Even extracting the code for your component Y assumes that you recognize
that some part of the original program matches your Y and can be used as such,
which is less likely if B isn't a separate component to begin with.
It's not just the synthesis part that doing things the unix way allows, it's
the analysis part too. Normally a problem doesn't break down (i.e. analyze)
into just one level of sub-problems. Instead, you have some problem, that you
analyze into parts A, B, and C, but then when you go to solve B, you discover
that you need to break down B into B1, B2, and B3, and tackling B3 you may find
you need to solve B3a, B3b, and B3c. B3b may then break down further, and so
on. (I don't mean to imply that all problems decompose into three parts, these
are just examples.)
Unix makes even the analysis part easier, because you may, and frequently do,
find that problem B3 has already been solved and there's a standard or pre-existing
tool for solving B3. By using that tool for B3, you don't need to go break
B3 down any further, no matter how complex B3 actually is. Indeed, the history and
consistency of unix is such that a lot of the truly hard problems are already solved
and a lot of time indeed is saved by just re-using standard tools.
Posted in / software
Thu Jul 23 19:43:07 2015
Apparently the clowns over at firefox don't know how the domain
name system works. I'm writing an image gallery site, and firefox
keeps wanting to change my saved password for "if.org". Ok, look,
first of all, the domain name is "granicus.if.org", not "if.org".
Second, the site at https://granicus.if.org/gallery/ has nothing
to do with anything else hosted on granicus. It's a completely
different site. But the notion that maybe there's more than
one level to the DNS and that just perhaps a single domain name
could host more than one service seems to escape the geniuses
at firefox.
Posted in / software
Tue Jul 7 05:19:50 2015
I am working on a basic ticketing system, and in the ticket table
I have an "Impact" column, which should only allow a particular
set of pre-defined impacts. The problem with using a numeric
value for "severity" or similar is that people don't know what
the numbers are supposed to mean and they tend to overstate the
severity of their particular ticket.
So I have, rather than "severity", "impact". The values I have
tentatively decided on are
- Production Outage
- Unusable
- Inconvenient
- Informational
These aren't necessarily the best names, "Inconvenient" should perhaps
be separated into "Has Work Around", and "Partial Functionality" or something
like that.
We would however like to sort by the impact. There are a couple
of ways to do this naively.
You could cleverly choose the names of the values themselves to be in alphabetical
order. I mention this only because it occurred to me. There are probably
enough synonyms that you could always make this work, but the names would
more likely than not become quite tortured.
The other way is to have a separate "impact type" table with the
name and a sort order or numerical level. But you'd have to join
this table every time you wanted to sort, or every time you wanted
to get the name for the level, depending on whether you chose the name
or the number to be the primary key.
Neither of these solutions is very elegant. So, enter the enum.
We can create an enum thus:
create type impact as enum (
'Informational',
'Inconvenient',
'Unusable',
'Production Outage'
);
Which allows us to sort, compare, and generally use the labels instead
of numbers. Internally an enum value is stored in four bytes (the oid of its
pg_enum row) and postgres handles the conversions for you; the ordering comes
from the enumsortorder column of pg_enum, which is a 32 bit float.
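For example, with a hypothetical ticket table whose impact column is of this
type, sorting worst-first is just an order by on that column:

select id, impact from ticket order by impact desc;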
So far so good, but I'm writing an external application which needs to
know what order the enum labels are in. Fortunately, it is simple
to query the catalogs and get this information. I wrote a helper
function:
create function enumlist(enumname text)
returns setof pg_enum
language 'sql' as $$
select * from pg_enum where enumtypid = enumname::regtype::oid
$$ stable strict;
Which can be used to get the labels and sort order for them for use by
an external application.
% psql -c "select * from enumlist('impact')"
enumtypid | enumsortorder | enumlabel
-----------+---------------+-------------------
19525184 | 1 | Informational
19525184 | 2 | Inconvenient
19525184 | 3 | Unusable
19525184 | 4 | Production Outage
(4 rows)
%
The advantage of this is that you don't have to hard-code the values anywhere,
and the database is the master repository of your label definitions. If
your applications query this as needed, any change in the database will
be automatically reflected in your application logic and displays. Any time
you can avoid a double edit, that's a win for maintainability.
Posted in / software / postgres
Mon May 18 17:57:08 2015
The linux/coreutils implementation of wc doesn't seem
to be Posix compliant in two ways.
The first, and more trivial, way is that its output
seems to have extra spaces in it. Posix says that the
output format will default to
"%d %d %d %s\n"
But here you can see some sample output with leading space and
extra space between the characters.
granicus% wc ../mkdir/*
10 138 8872 ../mkdir/mkdir
69 183 1275 ../mkdir/mkdir.c
79 321 10147 total
granicus%
The Posix rationale section notes that the System V format was
"%7d%7d%7d %s\n"
which may be the source of the additional whitespace.
The second issue is potentially more serious, as it is actually different
output. This depends on what is a "word". Posix specifies that a word
is "a non-zero-length string of characters delimited by white space".
I take this to mean (in pseudocode):
if we aren't in a word and the current character is not a whitespace
character, then start a word
if we are in a word and the current character is a whitespace character,
then stop being in a word.
in C:
wasspace = 1; /* beginning of file is effectively whitespace */
while ((ch = fgetc(f)) != EOF) {
    fb++;          /* byte count */
    if (ch == '\n') {
        fl++;      /* line count */
    }
    if (wasspace && !isspace(ch)) {
        fw++;      /* word count: a non-space after space starts a word */
        wasspace = 0;
    }
    if (isspace(ch)) {
        wasspace = 1;
    }
}
This gives
granicus% ./wc ../mkdir/*
10 166 8872 ../mkdir/mkdir
69 183 1275 ../mkdir/mkdir.c
79 349 10147 total
in contrast to the coreutils output above.
changing
if (wasspace && !isspace(ch)) {
to
if (wasspace && !isspace(ch) && isprint(ch)) {
makes my implementation match the linux coreutils one. Thus the coreutils wc
seems to be using a definition of word equivalent to "a non-zero-length string
of printable characters delimited by white space". This is only likely to
matter for binary files, and the number of "words" in a binary file is pretty
much useless anyway, but I believe this deviates from posix. It's possible
that coreutils wc can be invoked in such a way as to match, but I haven't found
one. Also, the coreutils definition is probably better than the posix one, but
that's a matter for a defect report or change request for posix.
Posted in / software / posix
Tue May 5 14:19:25 2015
What the hell? The new top output is awful. It's ugly, the colors
are low contrast, and why is it using colors anyway? A few
would be fine if they meant something, but color just for the sake
of color is terrible.
Also, this is in 3.3.10, which is a point release. A point release is for
bug fixes or trivial changes. Not "we thought you might like to not know how
to use our software anymore". The change should have been at least a 3.4, and
probably a 4.0. That would at least let everyone know something was up.
For the record, 'zV1W' seems to get the old settings back and write
them to a .toprc.
That .toprc is in some undocumented format. It's effectively a binary
configuration format. It should probably be called '.topblob' instead
of '.toprc'. I expect to be able to edit a .rc file.
Posted in / software