"Visitors' Guide" to Push genome schemes

I recall a (loud, yelling over crowd noise) conversation at Arbor Brewing Company with @lspector and @thelmuth (and maybe some other folks) after a GPTP session, in which we chatted about genomes. I think this was back in the first few days you were using Plush genomes, because I recall we had a long rambling thread about handling parentheses.

As I move from the new Push interpreter (which is pretty much done, by the way) to the flexible representation-agnostic search infrastructure that motivated my extraction of the interpreter in the first place, I think it would be good to revisit all the many ways we’ve come up with for representing Push programs as genomes (and why). First, because I’d like to be able to support all or any of them in the search system I’m developing, but also because there seems to be a lot of interest in one or another being “better” for certain tasks, and there’s interesting research to be done on that.

So please help me remember what I might have missed, as I start a list of Push (and Nudge, which may as well be like Push) genomes we’ve actually implemented, with some motivations:

just trees dammit

Almost certainly everybody’s at least experimented with Push “genomes” that are simply Push programs stored and manipulated as Lisp-like nested trees. This enables simple comparisons with Koza-style expressions, but has some intellectual and algorithmic baggage when it comes to implementing search operators.

It also suffers somewhat from the biases in “uniform random trees” Jason Daida pointed out in his GPTP work.

Plush genomes

As I understand it, this representation is essentially a sequence of tuples, in which branching structure is associated with program tokens (instructions and literals), and in which the Push program is expressed by maintaining a rolling “nesting level” counter. Is that about right?

This enables treatment of the genome as a linear sequence by search operators, while managing the syntactic consistency of parentheses, and also introduces important Push-specific biases into “random code” which other schemes seem to miss.

I don’t know how the structural biases unfold in “uniform random code”, but would be interested in hearing. But regardless of structural biases, it does seem this one permits construction of random trees of exactly the specified number of program points.

Nudge or “footnote-based” genomes

These tackle the “tree structure” problem using an algorithm inspired by the Messy GA literature, and consist of an abstract (linear) “type framework” and a separate list of arbitrary literals, including ":code literals". In the process of translating from genome to program, the framework is “filled in” by consuming matching items from the literals, without replacement, in a depth-first order, until no unassigned values remain or the literals are exhausted.

The motivations for this one were twofold: First, a cleaner separation between algorithmic structure and constants when applying search operators; one can more easily “change all the integers” if they’re in a big pile in the “footnotes section”. But the presence of an arbitrary number of genes for literals permitted untranslated material to be carried along with partially-working programs.

The tricky part here is that one never knows ahead of time how large a Push program will be, given its genome, without actually doing the translation. But linear search operators can be used, though in the form I describe here they need to be applied to the “body” and “footnotes” sections separately.

Zipper genomes

This is what I’ve been using recently in my work with Push. The genome is a sequence of tuples, each consisting of a simple Push form and a positioning instruction. As each gene is translated, a notional “write head” in a Clojure zipper structure is moved, and the indicated Push form is added after the new position. The positioning instructions are a bit more expressive than those built into clojure.zip off the shelf, and include “wrapping” when the bounds of a tree or subtree are encountered, and also absolute indexing in the Push style—that is, an arbitrary integer is converted into a valid index automatically.

The “simple Push forms” are what you might expect: instructions, inputs, literals, and empty lists. I think I may add a nil one, as well, just for completeness; that would move the “write head” without adding a new token.

This approach definitely supports linear search operators, and the development of semantic blocks. It also permits the construction of code of an exact size (specified in program points).

“break-and-repair”

(I’ve seen this in old papers, and can’t recall the proper name of the scheme)

This approach uses a linear genome of simple Push forms, plus arbitrary ( and ) tokens. If the genome happens to form a valid tree, with balanced parentheses, the obvious tree results. If not, then deterministic rules are applied to balance the parentheses and form a valid tree. Typically these are something like: Alternately add a ( at the head and ) at the tail until the resulting string is a valid tree.

As you can probably already see, the resulting trees (for almost any deterministic scheme for adding new parentheses) is kind of weird, with more root branches and more nesting on the final branch than you would expect from a “balanced” approach. Alternative schemes with different biases might include deleting unmatched parentheses, or shifting the parentheses already present left and right until they work out. Needless to say, most of these feel like kludges.

Grammatical representations

Years back I spent some time poking at simple rewriting grammars for Push programs, but made no major headway. The idea is sound, and certainly the Push grammar is simple enough that this would be a reasonably simple approach to explore.

Others?

I’m sure I’ve missed some.

I’ve intentionally avoided talking about tree-generating algorithms, though I think many of the motivations for the genomes we’ve explored arise from concerns about that process.

2 Likes

Close enough.

One thing I’d emphasize is that Plush genomes try to clarify the role of parentheses in Push programs as groupers of code for semantic purposes instead of evolutionary purposes.

Plush also allows for easy extensions of instruction “tuples” through something like metadata, such as the addition of epigenetic markers that @williamlacava has focused on. We have considered other extensions as well. These extensions could influence either the translation of Plush into Push or the variation operators acting on the genomes. For example, we have discussed that addition of alternation hotspots, similar to recombination hotspots in biology. These instruction tags would make it more (or less?) likely to alternate at that particular instruction during crossover.

I’m not sure how “old” you’re talking, but this sounds a lot like ULTRA. ULTRA is a genetic operator that works on normal “just trees” Push programs, and not a separate representation. As such, it doesn’t quite match your description – its repair steps are stochastic, not deterministic.

But, maybe you’re referencing some other, older scheme of which I am not aware.


P.S. This seems like a great thing to put in the Orientation category, since it will help new people understand all our wacky representations.

Done! (for some definition of “understand”)

I think? Like from the 90s?

It sounds like ULTRA to me too, but I think the corrections are that:

  • In our ULTRA work the genomes were “just trees” (Push programs) but a linear intermediate representation was used for variation (search). That is, you convert trees to linear programs (by substituting :open for “(” and :close for “)”), do alternation (flexible crossover) and mutation on the linearized representations, and then translate back to “just trees” (Push programs) which become the children.

  • Because alternation and mutation could unbalance the parentheses, repair might be necessary. (ULTRA stands for Uniform Linear Transformation with Repair and Alternation.) We repaired stochastically, with both additions and subtractions being possible, and all locations being possible (not just the beginnings or the ends). Nonetheless, we saw weird patterns emerging from repairs over generations, which is one of the things that led us to Plush.

Yes, I remember the ULTRA talk and it should be added to the list(s), but I am recalling an older scheme I saw at some GECCO, where parentheses were present as unmatched tokens in the program. I tried that myself at one point in the early Nudge project days, back 2008 ish.

Basically what I ended up doing (and I swear there was a paper, or maybe it was a poster that I ran into) was: scatter parentheses tokens liberally, then apply deterministic rules to make any linear genome into a particular translated tree as needed. But it turns out to be hard to search for the keywords, and I can’t remember if they had a handy acronym.

Later Hmm… it might be a thing Rick Riolo, Bill Worzel and I used on a consulting gig. I will look at old notes.

I am finishing up the “zipper genomes” (aka BB8) files today, hopefully. Should I post a description of the here, or just make sure the documentation on github is clear and up to date? I doubt you folks will want to use them; they’re qualitatively different from most of the others.

Will let you know when they’re uploaded to github.

I’m curious and would be happy to see a brief description here.

(Brief) Sketch of “BB8” Genomes

These are a Push genome representation that’s based on one of the several things we did in the Nudge & AnswerFactory projects back in 2009 or so. The original idea was to facilitate the discovery of building blocks in Push-like languages. The most recent Clojure-based Push version relies heavily on clojure.zip, so I’ve been calling them “zipper representations”, but I realized I’ve written seven or eight versions of them since the first, so… BB8.

The genome is a linear sequence of tuples, each of which contains :from, :put and :item fields. The process for translating is relatively easy to describe: The “seed” program is an empty Clojure seq, held in a zipper construct. Each gene (tuple from the genome) is interpreted in turn, in an identical way:

  1. the zipper cursor is moved according to the instruction in the :from field
  2. the specified :item is added to the zipper either on the left or right of the current cursor position, depending on the value of the :put; if the :item is nil, then the cursor is moved but nothing is added to the zipper; the first item added to an empty list (either the root, or any sublist) is selected as soon as it’s added
  3. goto 1

The values for :from I’ve been working with are

  • :head cursor selects the head of the tree (the first item inside the root)
  • :tail select the last item in a depth-first traversal of the tree (aka, just before the end of the zipper)
  • :subhead selects the head of the sublist the cursor is currently in, or the global head if it was at the root or end
  • :subtail select the last item in a depth-first traversal of the sublist the cursor is in
  • :left move left one step on this level; wrap around to the rightmost item, if it starts at the subhead position
  • :right move one step right on this level; wrap around if at rightmost
  • :prev move to previous item in global tree; wrap to tail if currently at head
  • :next move to next item in tree; wrap to head if at tail
  • :up move up from current position; stay at head if already in top level
  • :down move down from current position, if a branch is selected
  • :stay don’t move the cursor
  • [integer] jump to absolute index (using Push modulo index style)

The values of :put are :L and :R right now.

The values of :item are currently:

  • some :instruction
  • an empty list ’()
  • some :input
  • some :literal
  • nil (nothing)

That’s it. No muss, no fuss, no bother. Every list of genes translates into a valid tree. Chunks of lists can be rearranged, and have a mixed effect on the change in the resulting trees—by which I mean, some small changes for small moves, some large changes for large moves. If no nil items are present, then every gene produces exactly one program point in the resulting Push program.

I’m finishing up the Clojure implementation today, but the concepts are more or less identical to Ruby stuff we used to good effect years ago. Actually better, since Clojure has much better support for tree twiddling.

1 Like

Here’s an example from my notes. This is worked out by hand, but I’ll be able to check it shortly with the codebase I’m finishing now:

genome:

:next :L ()
:down :L ()
:head :R 1
:next :L ()
:subhead :R 2
:left :R 3
:subhead :R 4
62 :R ()
:right :L ()
:right :L 5
:stay :L 6
:stay :L 7
:next :R nil
:down :R 8
62 :R 9
62 :R 10
62 :R 11

becomes (cursor is indicated with square braces)


([*]) ;; (starting state)
( [ ( ) ] ) ;; insert an empty list, which remains selected
( ( [ ( ) ] ) ) ;; down into selected list, add a new list in that
( [ ( ( ) ) ] 1 ) ;; select head, add 1 to its right
( ( ( ) [( )] ) 1 ) ;; select next item, insert empty list to its left
( ( [( )] 2 ( ) ) 1 ) ;; jump to subhead, add 2 to right
( ( ( ) 2 [( )] 3 ) 1 ) ;; move left (wraps around), add 3 to right
( ( [( )] 4 2 ( ) 3 ) 1 ) ;; jump to subhead, add 4 to right
( ( ( ) 4 2 ( ) 3 ) [1] ( ) ) ;; there are 9 points; jump to #8 (62%9), 0-based; insert list to right
( ( ( ) 4 2 ( ) 3 ) 1 ( ) [( )] ) ;; move right, add list to left
( 5 [( ( ) 4 2 ( ) 3 )] 1 ( ) ( ) ) ;; move right (wraps to head), add 5 to left
( 5 6 [( ( ) 4 2 ( ) 3 )] 1 ( ) ( ) ) ;; stay, add 6 to left
( 5 6 7 [( ( ) 4 2 ( ) 3 )] 1 ( ) ( ) ) ;; stay, add 7 to left
( 5 6 7 ( [( )] 4 2 ( ) 3 ) 1 ( ) ( ) ) ;; move to next, do nothing
( 5 6 7 ( ([8]) 4 2 ( ) 3 ) 1 ( ) ( ) ) ;; move down (into selection), add 8 (stays selected because it was empty
( 5 6 7 ( (8) 4 2 ( ) 3 ) 1 ( ) [( )] 9 ) ;; jump to (62%16)=14, add 9 to right
( 5 6 7 ( (8) 4 2 ( ) 3 ) [1] 10 ( ) ( ) 9 ) ;; jump to (62%17)=11, add 10 to right
( 5 6 7 ( (8) 4 2 [( )] 11 3 ) 1 10 ( ) ( ) 9 ) ;; jump to (62%18)=8, add 11 to right
1 Like

Found some old notes from the Answer Factory days, and back then I though I’d heard of a similar technique for generating “fair” trees back in the early 2000s. No citations forthcoming; if anybody has one, please let me know.

The other thing I found in the notes and early experiments was that we only used the indexed form. That is, what I’ve called the :from field here was only ever an integer index in the earlier ones we did. That led to some unusual biases (as all schemes do), which is probably why we abandoned it. This “zipper/walker” version I’m using now seems to afford a much more robust and flexible approach to twiddling trees by rearranging linear genomes. At some point I’ll do some experiments in which the items being placed are the ordinal numbers (as in my example), and see how the distribution of resulting strings and/or trees compares to other schemes.

1 Like

Really really interesting. I’m not sure how to begin thinking about how this compares to Plush, but I want to.

I am not yet convinced it needs this complexity. I think that the search space is basically randomized with this many cursor-moving commands. I do recall than when we used just integers (mod tree-size), the resulting genomes were basically randomly shuffled by minor changes like a deletion or insertion. I might take a look at limiting the repertoire to :next, :prev and :stay, since in that case there will be more structural correlation between “close” genomic variants.

In case it wasn’t obvious, the point of this scheme (and its ancestors) was always to give these commands to the code itself, of course.

This reminds me of how grammatical evolution can get totally different results from minor changes to the genome. In that case, the instructions themselves can change wildly, where here (if I’m understanding things correctly) only the structure of the program will change wildly.

1 Like

Yes. If you change a movement command (the :from field) in a single gene tuple, it will not only move the cursor for one insertion, but for all subsequent insertions until one with a global position is encountered. But if at the same time you don’t change the :item fields themselves, you’ll just get a wide-ranging sampling of the same items in (probably biased) random tree structures. So in the example I gave above, if you change the :from genes but leave the :item genes intact, there will always be 16 program points in the resulting program, with the same integer constants and the same total number of subtrees appearing in every one of them. It remains to be seen what kinds of bias there might be in the samples.

@mcphee and I spoke a bit about this in our afternoon chat today. It will be interesting to compare the intrinsic biases and “tendencies” of various genomes and representations under search operators. For instance, the BB8 representation is a bit more “volatile” under point mutation, as @thelmuth suggested. As I think I said above, we first developed its simpler ancestor because we wanted to see building blocks developing under hillclimbing.

I did a quick one-mutant walk of a random genome just now, where the mutations only changed a single :from gene randomly, in one of the genes, without changing their :put or :item values. Thus, every program translated from one of these mutant genomes will still contain exactly 8 code blocks and the integers 1–18.

The results (translating the mutated genomes) are here.

It strikes me that we should consider a suite of explorations like this: autocorrelation analyses under mutation and crossover, that sort of thing, for each genome representation. That might help inform us about which one(s) to use in which circumstances.

Has anybody written an inverse algorithm for taking Push programs to (possibly over-simple) Plush genomes? Several search algorithms I’m building out would work better if we could swap between representations.

Not that I know of.