Bad Data

Set and bag operations in shell

Tue Nov 17 14:41:29 2020

Peter Krumins has a blog post about set operations in the unix shell; the post has some issues.

  1. Many of his operations are bash-specific, so I wouldn't call them operations for the "unix" shell, but rather for the bash shell.
  2. For the most part, they aren't set operations at all, but rather bag operations.
  3. There is a join(1) command which can simplify some of these.

But the items are good for what they are. What follows is my take on them.

First some definitions and specifications.

Set Membership

This is a simple grep for the element as a fixed string (-F), matching the whole line (-x), with only the exit status as the answer (-q).

grep -Fxq e A
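
If you use this a lot, it can be wrapped up in a function; a minimal sketch (the name 'member' is mine, not anything standard):

# usage: member ELEMENT FILE
# exits 0 (true) if ELEMENT appears as a complete line in FILE
member() {
    grep -Fxq -e "$1" -- "$2"
}

Then member e A succeeds exactly when e is an element of A.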

Set Equality

This is more difficult, because the set files may not be in the same order (so a plain cmp or diff won't do), and we want to avoid intermediate files. We can use join, which expects sorted inputs, to output the non-matching lines.

expr "$(join -v1 -v2 A B | head -n1 | wc -l)" = 1 > /dev/null

Ok, that's hideous. I am not aware of any unix command that will report via its exit status whether its input is empty.

Merely printing the wc -l count doesn't encode the answer in an exit status, so by itself that's not sufficient to meet the above specs.

So, here's a better building block:

join -v1 -v2 A B | head -n1 | (read x && test -n "$x")

At least a little better anyway, as it's a pure pipeline, albeit with a subshell. The subshell isn't strictly needed, but it avoids polluting the calling shell's namespace. Note that this pipeline exits true when its input is nonempty; for set equality you want the opposite sense, which the helpers below provide.

From here on out the snippets will assume that you have the read and test in a shell function or program named 'nonempty', which returns true if its stdin consists of at least one line.

Similarly the snippets will assume that you have an 'empty' program which is the opposite.

And both of those will only need to read at most one line, so a preceding head -n1 isn't needed.
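
As a sketch, those two helpers could be shell functions along these lines (the names come from above; the bodies are just one plausible implementation, with subshell bodies so the variable doesn't pollute the caller's namespace):

# true if stdin contains at least one (non-blank) line
nonempty() (
    read x && test -n "$x"
)

# the opposite
empty() (
    ! { read x && test -n "$x"; }
)

With those in hand, the set equality test above becomes simply:

join -v1 -v2 A B | empty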

Set Cardinality

The cardinality of a (finite) set is just the number of elements. Unlike the above, this produces a result, so we don't have to do stupid tricks with expr to get an exit status.

wc -l A | cut -d ' ' -f1

Ok, that's still dumb, because there isn't any way to tell wc not to print the name of the file when it reads from a file rather than stdin (and with a wc that pads the count with leading spaces, the cut doesn't even work). So you can do

wc -l < A

But the redirect has a certain goofiness of its own.

We can let awk do the work.

awk 'END { print NR }' A

I think that's posix. It's going to be generally true that "let awk handle it" will solve these issues. I'm not sure that's a shell solution though, and I'd have to retitle this post as set operations in awk, which isn't nearly as cool.

Subset

Here we're testing if A is a subset of B

join -v1 A B | empty

Here the join will output the elements in A that didn't have a match. If that's empty, then it's a subset. Note that this is not necessarily a proper subset, as the sets could be equal.
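
If you do want a proper subset test, a sketch using the same empty/nonempty helpers is to additionally require that B have at least one element that A lacks:

join -v1 A B | empty && join -v2 A B | nonempty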

Union

Since sets can't have any duplicate elements, we just sort -u

sort -u A B

Intersection

The join command makes this trivial.

join A B

Complement

The elements of A that are not in B

join -v1 A B

Symmetric Difference

Elements that are in A or B but not both

join -v1 -v2 A B

Power Set

A power set is a 'set of sets', specifically the set of all subsets of a set. We don't really have any way to represent arbitrarily nested sets. Since we don't need to parse them, just generate them, I'll put them as the elements separated by spaces on a single line.

If we assign a bit to each element of the starting set, then the power set is the set of all N bit numbers, with the subset elements corresponding to the set bits.

a b c = three elements.

so it has subsets:

000 = {}
001 = {c}
010 = {b}
011 = {b,c}

etc.

and then the power set is just the set of all of those sets.

{ {}, {a}, {b}, {c}, {a, b}, ... }

So, I can output the power set by getting the count, computing 2^count, then doing some ridiculous bit bashing.
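
Here's a sketch of that bit bashing, assuming the elements of A are single words (no whitespace or glob characters) and the set is small; the arithmetic is plain POSIX shell, nothing bash-specific:

# power set of the (single-word) elements in A, one subset per line
set -- $(cat A)
n=$#
total=$(( 1 << n ))
i=0
while [ "$i" -lt "$total" ]; do
    subset=
    j=0
    for e in "$@"; do
        if [ $(( (i >> j) & 1 )) -ne 0 ]; then
            subset="$subset $e"
        fi
        j=$(( j + 1 ))
    done
    printf '%s\n' "${subset# }"
    i=$(( i + 1 ))
done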

And for as few as 32 elements, the power set will have over 4 billion elements.

The obvious conclusion is "don't do this in shell".

Posted in / software


git-shatter

Wed Apr 25 15:15:50 2018

Here is a script that will output commands to break up the head git commit into a separate commit for each file. Public domain.

#!/bin/sh

#break up a given commit into separate commits, one for each file

#commit=$(git rev-parse "${1:-HEAD}")
commit=HEAD

#git rebase -i 

# these aren't actually used below; git commit -C reuses the author,
# date, and message from the original commit
author=$(git log -n1 $commit --format='%an <%ae>')
authdate=$(git log -n1 $commit --format='%ad')
msg=$(git log -n1 $commit --format='%B')

# while read?
echo git reset HEAD^

# list the files touched by the commit and emit one add-and-commit
# command per file, reusing the original commit's metadata via -C
git diff-tree --no-commit-id --name-only -r $commit -z |
xargs -0 -n1 -IFILE echo git add FILE '&&' git commit -C $commit -uno
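
Since the script only prints the commands, a cautious way to use it (assuming it's saved on your PATH as git-shatter) is to inspect the output first, then feed it to the shell:

% git-shatter
% git-shatter | sh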

Posted in / software


Assert for Postgres

Thu Jun 22 15:11:38 2017

Here's a small function to create an assertion, which throws an error if the first argument is not true. Public domain.

create or replace function assert(pass boolean, msg text default null)
returns void
language plpgsql
as $$
begin
    if pass then
        return;
    end if;

    if msg is null then
        get diagnostics msg = PG_CONTEXT;
    end if;

    raise exception '%', msg;
end;
$$;
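
A quick smoke test from the shell; the first call does nothing, and the second should raise an error with the given message:

% psql -c "select assert(2 + 2 = 4, 'arithmetic is broken')"
% psql -c "select assert(false, 'this should blow up')"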

Posted in / software / postgres


Dumping a function definition

Wed Mar 15 17:09:03 2017

I frequently need to get the definition of a function out of postgres, so here's a small shell script I wrote to do that.

#!/bin/sh

{
cat <<EOS
select pg_get_functiondef(oid)
from pg_proc where proname = '$1'
;
EOS
} | psql -Aqt
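
Assuming the script is saved as (say) fndef and made executable, usage is just (using the assert function from another post here as the example):

% ./fndef assert

If the name is overloaded you'll get one definition per matching row in pg_proc.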

Posted in / software / postgres


Web Application Abstraction Layers

Tue Mar 15 08:16:06 2016

We're doing it wrong. Unix is powerful because of simple abstractions and interfaces. All of our web apps are way too complex. We should get back to a simple notion. In unix "everything is a file". We don't need that as such, but what people care about is generally bags of bytes, which can be called files. We care about who can access our information, and how.

We should be thinking about things in terms of "bags of bytes", "people/groups/roles", and "permission capabilities". Everything else is some abstraction over these. I posit that if those three were the only "real" things, and all applications were built up from abstraction layers reducible to those three, it would be easier and faster to build new applications, and they would be more secure. If "can this person see these bytes" were baked in at the lowest level, it would be a lot easier to verify the security of a person's information, and a lot harder to screw it up. Similarly with any other particular capability: "can modify", "can delete", etc.

Some notation, a comma separated list in square brackets is a tuple. E.g. "[file, role]" would be a data item associating a file and a role. The meaning of this is dependent on the particular application using the tuple.

Objects

files: a bag of bytes with a mime type, metadata, notes, etc. whether a backup, image, music, video, etc is all a view into different types of files.

message: text with a source and destination, the text can be viewed as a file. That is a "message" is a "file" with mandatory source and destination metadata.

role: a user, group, etc.

capability: a label on an object allowing a role to take a particular action. required capabilities and their semantics are particular to a given application, but a set of standard names and intended interpretations should be developed.

acl entry: [role, file, capability]

"ownership" of a file is probably an acl entry. It's entirely possible for more than one person to "own" a particular bag of bytes. different people could have different sets of metadata for a file.

file list: an ordered set of files

comment: a [file, file, role]

Applications

under this ontology, for example:

email: a message wherein the source and dest are email addresses, and the system attempts to hand off the message. additionally the default notion is that the files associated are readable only by the source and destination roles.

blog: a collection of files (messages?) timestamped, maybe tagged, etc, made viewable in (descending) timestamp order. i.e. a set of [file, timestamp], possibly [file,role,timestamp] for a group blog, and [file,role,file] for comments.

flickr: a view of image files made viewable to the public.

twitter: a blog with a very short maximum file size for blog posts.

youtube: a video blog supporting file lists and comments.

calendar: set of files, each containing a particular calendar item. could be ics files, could be something else.

usenet: a public set of messages where the source is a role, and the destination is a set of newsgroups. The message format is constrained by relevant rfcs.

Implementation Notes

The primary interface needed for functionality is the ability to manipulate the base objects. Thus the ability to create roles, files, acls, file lists, etc is paramount. In theory, once that is done, if it has a public API, other things can be built on top of those, possibly creating new tuple types to support their particular application.

"metadata" should probably just be json at this point, and an application with full permissions should probably be allowed to create new named tuples. For example, a "blog" application might connect to the backend API at install time and ask for a new tuple of [file,role, timestamp] to be created and named "posts", and a second one of [file,role,file,metadata] named "comments" the metadata on the comments intended for an approval system, but not otherwise interpreted by the system. Some method of indexing json/metadata might be needed.

Note that a system wouldn't need to use json for metadata, it could roll its own, but if it's json, the system can probably be taught to index it easily.

Thoughts on modification of data: I think it's best to think of data as immutable, like a number. You can't "modify" 37, it just is what it is. So, if you are making (say) an image editing program, a particular edit doesn't modify the original as such, but rather creates a new file (i.e. bag of bytes), and possibly associates it with the same tags/name/revision chain, or whatever the image application wants. The application could, if it wanted, do some sort of garbage collection and delete the original, though it would make more sense for it to disassociate it from the name instead, since the original could, conceivably, be in use by some other application.

This is analogous to hard links in a unix system. You can't really delete a file as such, but it will be removed after the last link is removed. Some system to allow different applications to use the same link sets would be needed. If you have a picture gallery application and an image editing one, you want them both to work on the same set of pictures by name. It might be worth having an explicit notion, at a low level in the system, of one file being a revision of another. I'm not really sure yet how to solve this problem.

Posted in / software


Analysis and Synthesis in programming

Fri Dec 4 19:15:17 2015

The ancients developed the notion of "Analysis", which is breaking things apart to understand them better, and "Synthesis", which is putting things together to make something new. Several pages on ProjectRho make note of this: "[Within] the realm of thought there are only two operations possible: Analysis and Synthesis. That is, cutting something into pieces and merging two pieces into one." (http://www.projectrho.com/elemental.html)

This isn't a bad start, and is even self-documenting, in that it analyzes thinking by breaking it into two parts.

Programming is essentially a process of analysis of a problem. Before you can write a program, you have to know how to solve the problem, and you have to have decomposed the problem into simple enough pieces that it is easy to write computer code to solve each piece. If you can't do this, either because you can't solve the original problem, or because you can't break the problem down far enough for the parts to be expressible in a way the computer can understand, you will not be successful at programming.

Where does synthesis come in? Synthesis is related to the notion of code-reuse.

The unix philosophy makes it easier, much easier in fact, to build and rebuild systems, in two ways that complement each other.

First unix encourages making each analytic chunk a separable piece of code, and in particular, a separate executable that can be run all by itself. So if you've analyzed a problem into three components A, B, and C, the unix way of doing things is to actually write three programs A, B, and C that solve the components independently.

However, this leaves the problem of how do you combine A, B, and C into something that solves the original problem, or to use the terminology above, how do you synthesize them into a whole? What unix does to make this easy is the clever and simple solution of having (essentially) all programs communicate in the same way. A program accepts input via a mechanism called, logically enough, "standard input", and writes its results to, as you might guess, "standard output". Standard input and output are then usually just simple streams of text. By using this mechanism, and by having the interface between components be simple streams of text, it becomes possible to use a given piece of code in new ways very simply by inserting it into the middle of the stream.
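
As a generic illustration (not tied to any particular problem), each stage below is an independent program that knows nothing about the others, yet connecting standard output to standard input composes them into a crude "most common words" report; somefile.txt is just a placeholder:

tr -cs '[:alpha:]' '\n' < somefile.txt |
tr '[:upper:]' '[:lower:]' |
sort |
uniq -c |
sort -rn |
head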

So, suppose you are solving some other problem which happens to break down into components X, Y, and Z. If component B happens to match component Y (or X, or Z, it doesn't matter), you don't have to then spend any time on solving Y, because you can just re-use B. But you couldn't do that directly if you hadn't written B as a completely separate component; at best you might look at the first problem and try to extract the code for B out to use. A lot of times this is harder than just re-writing from scratch. Even extracting the code for your component Y assumes that you recognize that some part of the original program matches your Y and can be used as such, which is less likely if B isn't a separate component to begin with.

It's not just the synthesis part that doing things the unix way allows, it's the analysis part too. Normally a problem doesn't break down (i.e. analyze) into just one level of sub-problems. Instead, you have some problem, that you analyze into parts A, B, and C, but then when you go to solve B, you discover that you need to break down B into B1, B2, and B3, and tackling B3 you may find you need to solve B3a, B3b, and B3c. B3b may then break down further, and so on. (I don't mean to imply that all problems decompose into three parts, these are just examples.)

Unix makes even the analysis part easier, because you may, and frequently do, find that problem B3 has already been solved and there's a standard or pre-existing tool for solving B3. By using that tool for B3, you don't need to go break B3 down any further, no matter how complex B3 actually is. Indeed, the history and consistency of unix is such that a lot of the truly hard problems are already solved and a lot of time indeed is saved by just re-using standard tools.

Posted in / software


Firefox Password Manager Stupid

Thu Jul 23 19:43:07 2015

Apparently the clowns over at firefox don't know how the domain name system works. I'm writing an image gallery site, and firefox keeps wanting to change my saved password for "if.org". Ok, look, first of all, the domain name is "granicus.if.org", not "if.org". Second, the site at https://granicus.if.org/gallery/ has nothing to do with anything else hosted on granicus. It's a completely different site. But the notion that maybe there's more than one level to the DNS and that just perhaps a single domain name could host more than one service seems to escape the geniuses at firefox.

Posted in / software


Getting Enums out of Postgres

Tue Jul 7 05:19:50 2015

I am working on a basic ticketing system, and in the ticket table I have an "Impact" column, which should only allow a particular set of pre-defined impacts. The problem with using a numeric value for "severity" or similar is that people don't know what the numbers are supposed to mean and they tend to overstate the severity of their particular ticket.

So I have, rather than "severity", "impact". The values I have tentatively decided on are "Informational", "Inconvenient", "Unusable", and "Production Outage".

These aren't necessarily the best names; "Inconvenient" should perhaps be separated into "Has Work Around" and "Partial Functionality" or something like that.

We would however like to sort by the impact. There are a couple of ways to do this naively.

You could cleverly choose the names of the values to be in alphabetical order. I mention this only because it occurred to me. There are probably enough synonyms that you could always make this work, but the names would more likely than not become quite tortured.

The other way is to have a separate "impact type" table with the name and a sort order or numerical level. But you'd have to join this table every time you wanted to sort, or every time you wanted to get the name for the level, depending on whether you chose the name or the number to be the primary key.

Neither of these solutions is very elegant. So, enter the enum.

We can create an enum thus:

create type impact as enum (
    'Informational',
    'Inconvenient',
    'Unusable',
    'Production Outage'
);

Which allows us to sort, compare, and generally use the labels instead of numbers. Internally each value is stored in four bytes (the oid of its row in pg_enum) and postgres handles the conversions for you; the label ordering comes from a float4 sort-order column in pg_enum.
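
As a sketch of how this might look in the ticketing system (the table definition here is hypothetical, not my actual schema):

psql <<'EOS'
create table ticket (
    id serial primary key,
    summary text not null,
    impact impact not null default 'Informational'
);

-- enum comparisons follow the declared label order,
-- so this lists the most severe tickets first
select id, summary, impact from ticket order by impact desc;
EOS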

So far so good, but I'm writing an external application which needs to know what order the enum labels are in. Fortunately, it is simple to query the catalogs and get this information. I wrote a helper function:

create function enumlist(enumname text) returns
setof pg_enum language 'sql' as $$
    select * from pg_enum where enumtypid = enumname::regtype::oid
$$ stable strict;

Which can be used to get the labels and their sort order for use by an external application.

% psql -c "select * from enumlist('impact')"
 enumtypid | enumsortorder |     enumlabel     
-----------+---------------+-------------------
  19525184 |             1 | Informational
  19525184 |             2 | Inconvenient
  19525184 |             3 | Unusable
  19525184 |             4 | Production Outage
(4 rows)

% 

The advantage of this is that you don't have to hard-code the values anywhere, and the database is the master repository of your label definitions. If your applications query this as needed, any change in the database will be automatically reflected in your application logic and displays. Any time you can avoid a double edit, that's a win for maintainability.

Posted in / software / postgres


Posix wc

Mon May 18 17:57:08 2015

The linux/coreutils implementation of wc doesn't seem to be Posix compliant in two ways.

The first, and more trivial, way is that its output seems to have extra spaces in it. Posix says that the output format will default to

"%d %d %d %s\n"

But here you can see some sample output with leading spaces and extra space between the fields.

granicus% wc ../mkdir/*    
   10   138  8872 ../mkdir/mkdir
   69   183  1275 ../mkdir/mkdir.c
   79   321 10147 total
granicus%

The Posix rationale section notes that the System V format was

"%7d%7d%7d %s\n"

which may be the source of the additional whitespace.

The second issue is potentially more serious, as it is actually different output. This depends on what is a "word". Posix specifies that a word is "a non-zero-length string of characters delimited by white space".

I take this to mean (in pseudocode):

if we aren't in a word and the current character is not a whitespace character, then start a word

if we are in a word and the current character is a whitespace character, then end or stop being in a word.

in C:

wasspace = 1; /* beginning of file is effectively whitespace */
while ((ch = fgetc(f)) != EOF) {
    fb++;                      /* byte count */
    if (ch == '\n') {
        fl++;                  /* line count */
    }
    if (wasspace && !isspace(ch)) {
        fw++;                  /* start of a new word */
        wasspace = 0;
    }
    if (isspace(ch)) {
        wasspace = 1;
    }
}

This gives

granicus% ./wc ../mkdir/*
10 166 8872 ../mkdir/mkdir
69 183 1275 ../mkdir/mkdir.c
79 349 10147 total

in contrast to the coreutils output above.

changing

    if (wasspace && !isspace(ch)) {

to

    if (wasspace && !isspace(ch) && isprint(ch)) {

makes my implementation match the linux coreutils one. Thus the coreutils wc seems to be using a definition of word equivalent to "a non-zero-length string of printable characters delimited by white space". This is only likely to matter for binary files, and the number of "words" in a binary file is pretty much useless anyway, but I believe this deviates from posix. It's possible that coreutils wc can be invoked in such a way as to match, but I haven't found one. Also, the coreutils definition is probably better than the posix one, but that's a matter for a defect report or change request for posix.

Posted in / software / posix


Latest linux procps-ng 'top' format terrible

Tue May 5 14:19:25 2015

What the hell? The new top output is awful. It's ugly, the colors are low contrast, and why is it using colors anyway? A few would be fine if they meant something, but color just for the sake of color is terrible.

Also, this is in 3.3.10, which is a point release. A point release is for bug fixes or trivial changes. Not "we thought you might like to not know how to use our software anymore". The change should have been at least a 3.4, and probably a 4.0. That would at least let everyone know something was up.

For the record, 'zV1W' seems to get the old settings back and write them to a .toprc.

That .toprc is in some undocumented format. It's effectively a binary configuration format. It should probably be called '.topblob' instead of '.toprc'. I expect to be able to edit a .rc file.

Posted in / software