Sunday, February 24, 2019

Overview of trees in graph theory

The most basic way in which graphs are related to trees is through the block-cut tree of the graph, which can be formed from any connected component. The most basic tree-related property of a graph is the nature of its biconnected components. A tree is simply a graph whose biconnected components are all trivial. A cactus tree has the simplest biconnected components (cycles), and a clique tree or block graph has the largest (complete graphs). A pseudoforest is a cactus tree with at most one non-trivial biconnected component in each connected component. The other aspect of a graph is its modular decomposition. Given any cograph we can form a cotree which roughly describes the structure of the cograph, since the cograph can be constructed by unions and complements. The modular decomposition generalizes this process to arbitrary graphs, and produces perhaps the most interesting information about the graph besides its block-cut tree. While these two structures describe the vertices of the graph, the spanning trees describe sets of edges of the graph. There are two types of trees: directed and undirected trees. While the undirected trees are already graphs, the directed trees have a comparability relation which produces its own class of graphs.
  • Tree graphs
  • Comparability graphs of trees
The comparability graphs of trees are precisely the trivially perfect graphs. The trivially perfect graphs are the unique class of graphs that are isomorphic to preorders in a manner that preserves adjacency. In this way, a trivially perfect graph can be recovered from the forest preorder by its order comparability. This demonstrates that tree graphs and structures seem to be the most directly related to order theory. Structures that are related to order theory tend to forbid certain cycles; the comparability graphs and co-comparability graphs, for example, are both perfect graphs, so they forbid odd holes (induced odd cycles of length five or more). Another way to take a graph and produce an order is to select a root and form an ordered tree from it. In this way, both tree graphs and comparability graphs of trees can be converted to preorders of different sorts.
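As an illustration of this correspondence, here is a small Clojure sketch (the function names are my own, not from any library) that takes a rooted forest described as a child-to-parent map and produces the edge set of its comparability graph, in which two vertices are adjacent whenever one is an ancestor of the other. Under the claim above, the resulting graphs are trivially perfect.
(defn ancestors-of
  "All proper ancestors of the vertex v in a forest given as a child->parent map."
  [parent v]
  (take-while some? (rest (iterate parent v))))

(defn comparability-edges
  "Edge set of the comparability graph of a rooted forest:
   each vertex is joined to each of its proper ancestors."
  [parent]
  (set
    (for [v (keys parent)
          a (ancestors-of parent v)]
      #{v a})))

;; Example: the path a -> b -> c rooted at a, plus an isolated root d,
;; yields the triangle on a, b, c.
(comparability-edges {:a nil :b :a :c :b :d nil})
;; => #{#{:a :b} #{:b :c} #{:a :c}}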

Modular decomposition

One of the basic concepts of graph theory is that of a module of a graph, which is a subset of the vertices such that every vertex outside the subset is either adjacent to every member of the module or non-adjacent to every member of the module. This generalizes the cotree construction from connected components and their complements, as demonstrated by the graph below, which is connected and whose complement is also connected.



When a graph is a cograph, it can be decomposed into atomic components by the modular decomposition. Otherwise, modules can occur that are unrelated to the connected components or their complements; in this case, as Gallai showed, the modular decomposition can be produced from the maximal proper modular subsets, which form a set partition of the component. This mechanism is used to produce the modular decomposition shown below for the graph.



The first thing that one notices about this modular decomposition is that it is a laminar family. The direct children of each module in the tree partition their parent. In fact, it is a nullfree laminar upper tree family. In this sense, the modular decomposition tree is like the parse tree of a graph.
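To make the underlying notion of a module concrete, here is a small Clojure predicate (the name is my own) that checks whether a set of vertices is a module of a graph given as an adjacency map from each vertex to the set of its neighbours.
(defn module?
  "True if the vertex set m is a module of the graph g, where g maps each
   vertex to the set of its neighbours: every vertex outside m must be
   adjacent to all of m or to none of m."
  [g m]
  (every?
    (fn [v]
      (let [inside (count (filter (g v) m))]
        (or (zero? inside) (= inside (count m)))))
    (remove m (keys g))))

;; Example: c and d are twins attached to b, so #{c d} is a module but #{b c} is not.
(def g {:a #{:b} :b #{:a :c :d} :c #{:b} :d #{:b}})
(module? g #{:c :d})   ;; => true
(module? g #{:b :c})   ;; => false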

Monday, February 18, 2019

Parse trees as set systems

The task of parsing a data structure into its constituent components usually involves an ordered sequence of characters or bytes. When classifying subsets of an ordered sequence like a byte stream or a string, the most important thing about the sequence is its betweenness relation, as we can limit ourselves to the intervals of the sequence. This is what instaparse does, as it automatically gives you the intervals of characters in the parse tree of some string subject to some context-free grammar. To demonstrate this I will consider the simplest possible program.
(defn hi [args]
    (prn "hello world"))
A set system derived from this program is shown below. The set system contains all the sets of characters that locate the different well formed parts of the tree structure as represented in the string. The parse tree can be broken down into two components: minimal members and containers for minimal members. In Lisp the minimal members are atoms, like the symbols and strings, and the collections are lists. In this way, Lisp directly represents the parse tree structure in the language itself.



As a set system the parse tree is a laminar family. Since laminar families emerge in the study of parsing, they are a special class of families that should be studied further. This family is also nullfree because it doesn't contain the empty set as a member.
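Instaparse can report these intervals directly: each parse tree node carries start and end index metadata that can be read with instaparse.core/span. The toy grammar below is only a rough sketch for illustration, nowhere near a full Clojure reader, but it is enough to recover an interval for each well formed part of a small form.
(require '[instaparse.core :as insta])

(def toy-parser
  ;; A toy grammar for nested lists of symbols and string literals.
  (insta/parser
    "form   = list | string | symbol
     list   = <'('> (form | <#'\\s+'>)* <')'>
     string = #'\"[^\"]*\"'
     symbol = #'[^()\\s\"]+'"))

(defn spans
  "Collect the [start end) character interval of every node in a parse tree."
  [tree]
  (when (vector? tree)
    (cons (insta/span tree)
          (mapcat spans (rest tree)))))

(spans (toy-parser "(prn \"hello world\")"))
;; => a sequence of intervals, one per parse tree node, forming a laminar family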

Definition. A family of sets $F$ is called a laminar family if for every two elements $S_1, S_2$ either $S_1 \subseteq S_2$, $S_2 \subseteq S_1$, or $S_1 \cap S_2 = \emptyset$. In other words, every pair of elements is either order comparable or disjoint.

The laminarity condition only constrains pairs of sets, which makes the class of laminar families a clique complex whose relation is the union of order comparability and disjointness. Every nullfree laminar family forms an upper forest, and connected nullfree laminar families form upper trees. So laminar families provide a general means of describing tree decompositions of structures.
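As a small check of the definition, the following Clojure sketch (my own helper, not a library function) tests whether a family of sets, represented as a set of Clojure sets, is laminar.
(require '[clojure.set :as set])

(defn laminar?
  "True if every pair of members is comparable under inclusion or disjoint."
  [family]
  (every?
    (fn [[s1 s2]]
      (or (set/subset? s1 s2)
          (set/subset? s2 s1)
          (empty? (set/intersection s1 s2))))
    (for [s1 family, s2 family] [s1 s2])))

(laminar? #{#{1 2 3 4} #{1 2} #{3 4} #{3}})   ;; => true
(laminar? #{#{1 2} #{2 3}})                   ;; => false, the sets overlap without nesting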

Saturday, February 16, 2019

Commutative aperiodic semigroups

Certain partial orders have semilattices associated with them. Semilattices are commutative idempotent semigroups that arise entirely from order theory. Commutative aperiodic semigroups are a generalization of semilattices which can be used in a wider variety of cases. It is known that every commutative semigroup has an algebraic preorder associated with it, as described below. It can easily be seen that this is in general only a preorder and not a partial order, unless the semigroup is aperiodic.

\[ x \le y \iff \exists z : x + z = y \]
Set theory is based upon set union, which forms an idempotent commutative monoid. While idempotence works well for set theory, it won't do for arithmetic. The natural numbers under addition $(\mathbb{N},+)$ can instead be modeled as an aperiodic commutative semigroup. This automatically generates the partial ordering on the natural numbers $(\mathbb{N},\le)$, because the addition operation only ever builds a number up by making it bigger, which is obviously not the case in a non-aperiodic semigroup like the integers. So set theory can be founded on an idempotent commutative monoid and arithmetic can be founded instead on an aperiodic commutative monoid.
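As a small illustration of the definition above, the following Clojure sketch (the function name is my own) computes the algebraic preorder of a commutative operation over a finite carrier by searching for a witness z.
(require '[clojure.set :as set])

(defn alg<=?
  "Algebraic preorder of a binary operation op over a finite carrier:
   x <= y precisely when some z in the carrier satisfies (op x z) = y."
  [op carrier x y]
  (boolean (some #(= y (op x %)) carrier)))

;; On a finite fragment of the naturals it recovers the usual order:
(alg<=? + (range 10) 2 5)   ;; => true, since 2 + 3 = 5
(alg<=? + (range 10) 5 2)   ;; => false

;; On sets under union it recovers the subset order:
(alg<=? set/union [#{} #{1} #{2} #{1 2}] #{1} #{1 2})   ;; => true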

It is interesting to consider the atoms in a commutative aperiodic semigroup like this. Atoms are defined as in order theory: the elements that lie directly above the identity. The atoms in set theory are the individual members, or singleton sets, which can be used to build up every element through set union. The atom in arithmetic is the unit, one, which can be used to build up every natural number through addition. These are atomically generated semigroups. Atomically generated semigroups don't need to be atomistic lattices, as seen by the fact that the natural numbers are atomically generated. It is in this context that multisets can arise as values built up from a commutative aperiodic monoid, just as sets are built up, except that the combining operation is aperiodic rather than idempotent.
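As a sketch of that last remark, frequency maps under merge-with + form a commutative aperiodic monoid whose atoms are the single-occurrence multisets; the helper names below are my own.
(defn multiset-sum
  "Combine two multisets represented as maps from element to multiplicity."
  [a b]
  (merge-with + a b))

(def atom-x {:x 1})   ;; an atom: a single occurrence of :x

;; Every multiset is built up from atoms; repeating an atom keeps growing the
;; multiplicity, so the operation is aperiodic rather than idempotent.
(reduce multiset-sum {} [atom-x atom-x {:y 1}])
;; => {:x 2, :y 1}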

Tuesday, February 12, 2019

CLR instruction classes

The classification of instructions provided by the CIL instruction set specification describes base instructions as well as object model instructions, but a much better classification of the different types of CIL instructions was provided by Serge Lidin in his publications on the IL assembler. I modified the classification of CIL instructions a little bit to note some regularities of its opcodes. It can be noticed that the upper ontology of the CLR is relatively similar to that of the JVM, because both have similar instruction classes at the highest level, as they are both high level stack machines; the differences between them come from the particular instructions belonging to each instruction class. Some of the differences among the instructions belonging to the six different instruction classes will be examined below, which can be used to make a full ontology of the opcodes of the instruction set.

Stack operations:
The stack manipulation operations are relatively similar to their JVM counterparts. There are two types of stack operations: stack manipulation and constant pushing. The stack manipulation operations are nop, pop, and dup. There are fewer stack manipulation operations in the CLR than in the JVM; there doesn't appear to be any swap operation and there are fewer forms of dup, but that is okay. The constant pushing operations are also very similar to the JVM: they mainly deal with pushing i4, i8, r4, and r8 values onto the stack, as well as pushing strings and null.
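In the same opcode set style used for the JVM later in this post, these can be collected into Clojure sets; the constant pushing list is partial, since the ldc.i4 family also has a number of short forms.
(def cil-stack-manipulation
  '#{nop pop dup})

(def cil-constant-push
  '#{ldc.i4 ldc.i4.s ldc.i8 ldc.r4 ldc.r8 ldstr ldnull})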

Generalized local variables:
I describe these operations as dealing with generalized local variables because, unlike the JVM, the CLR has different types for arguments and local variables, which are both stored together in the JVM's local variable array. In order to convert JVM instructions to the CLR, you need to change the local variables that hold arguments into argument variables and then store the remaining variables as local variables. The only other difference worth mentioning is that unlike the JVM, which is a 32-bit machine and requires you to store 64-bit values in two local variable slots, the CLR seems to be more generic. You can also get the addresses of variables with certain instructions. Along with static fields, these make up the atomic variables, which along with constants can be pushed onto the stack directly with push instructions.

Transformations
The CLR has all the operations you need to deal with arithmetic, logic, conversion, and related operations. One thing to note is that the operations in the CLR are generic, so you don't need to prefix each instruction with its type. In place of comparison operations, the CLR has logical condition check operations, which allow you to check whether two values are equal, greater, or less than one another without having to use control flow operations to do it. The only other thing worth noting is that the CLR has operations dealing with pointers, like direct and indirect access operations, as well as operations dealing with memory blocks, which do not appear in the JVM.
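The condition check and untyped arithmetic operations can likewise be collected into sets; the listing is partial, as the overflow-checking and unsigned arithmetic variants are omitted.
(def cil-condition-check
  '#{ceq cgt cgt.un clt clt.un})

(def cil-arithmetic
  '#{add sub mul div rem neg and or xor not shl shr shr.un})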

Vector operations
The CLR has all the vector operations you need: creating vectors, checking their length, and loading and storing their elements. As the values of vectors can be modified at any index, a system of generalized variable operations should include the vector operations along with static fields, instance fields, local variables, and arguments, as types of variables with their own operations.

Classes and value types
All of these operations are object model instructions in the specification of the CLR instruction set. There are instructions for allocation, type checking, casting, throwing exceptions, and so on. One significant difference is that the CLR has a system of value types as well as reference types, rather than simply having only reference types and primitives. As a result, there are box and unbox operations built in at the instruction level, as well as other operations like sizeof, which gets the size of some value type, and cpobj, which copies value types. This is a significant difference, because the CLR has its own unique object model.
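The value type related instructions mentioned here form their own small set; this listing is partial as well.
(def cil-value-type-operations
  '#{box unbox unbox.any initobj cpobj ldobj stobj sizeof})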

Control flow operations
The control flow operations are actually surprisingly similar to the JVM: you have unconditional branching instructions, unary conditional branching instructions, comparative branching instructions, and switch instructions as jump operations, all of which appear in the JVM instruction set ontology. There is also the ret instruction, which doesn't need to be prefixed by any type, because the operations in the CLR tend to be generic. There are also special operations for dealing with exception handling. The greatest difference appears to be in the operations that deal with method calls, like the tail prefix, which allows for tail call optimization at the instruction level.

Saturday, February 9, 2019

Ontology of JVM instructions continued

In the previous post we classified JVM instructions, largely by considering their nature as operations in a stack machine. Two classes of operations stood out in this process: value push operations and generic classes that generalize the typed operations of the JVM. The nullary push instructions (constant push instructions, local access, getstatic) correspond to the operands of a combiner. The generic instruction classes define classes of isomorphic operators dealing with different types, whose particular member can be determined by the compiler from the operand types. These two things can be used to help build a language from the JVM instruction set.
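Following the opcode set style used below, the nullary value push instructions can be gathered into a single set; the listing abbreviates the typed local load short forms, of which only the int forms are shown.
(def value-push
  '#{aconst_null iconst_m1 iconst_0 iconst_1 iconst_2 iconst_3 iconst_4 iconst_5
     lconst_0 lconst_1 fconst_0 fconst_1 fconst_2 dconst_0 dconst_1
     bipush sipush ldc ldc_w ldc2_w
     iload lload fload dload aload
     iload_0 iload_1 iload_2 iload_3
     getstatic})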

One other detail worth considering is instruction parameters. A large percentage of the operations of the instruction set do not require instruction parameters, like the aforementioned value push instructions, the primitive transformations, the return instruction, the instructions which take an array as a stack argument, and various procedures among others. Others can have their instruction parameters eliminated through reflection. There are a few instructions, belonging to three different classes, that seem to be harder to eliminate. These instructions necessitate the use of extended prefix notation in any thin wrapper over the JVM.

Variable operations:
The variable operations require some variable information passed to them as an instruction parameter. The atomic variable modification operations require that the atomic variable be passed as an instruction parameter. The putfield instruction requires the field name as an instruction parameter. Array store is distinguished from the others by the fact that it doesn't require any instruction parameters. I ultimately decided to solve this problem by introducing something like setf, which takes as an instruction parameter all the information about the generalized variable being modified.
(def generalized-variable-modification 
  '#{fstore_1 lstore_3 sastore bastore dastore 
    dstore_3 dstore_1 istore istore_0 astore_2 
    fastore astore istore_2 istore_1 iinc castore 
    lstore_2 istore_3 dstore_2 lstore dstore_0 
    putstatic lstore_1 fstore_2 astore_0 dstore 
    astore_1 putfield fstore_0 lastore iastore 
    fstore lstore_0 fstore_3 aastore astore_3})
Type operations:
The type operations require a type passed to them as an instruction parameter. They come in two forms: the reference allocations and the reference type check operations. It is easy to see why the reference allocations must have some type passed to them as an instruction parameter: there isn't a reference created yet to do reflection on. With the reference type check operations, the entire operation is defined by the type passed as an instruction parameter, so they also belong to this category.
(def reference-allocation 
  '#{multianewarray new anewarray newarray})

(def reference-type-check
  '#{instanceof checkcast})
Jump operations:
The jump operations require some label as an instruction parameter. This includes all process control instructions except for those that deal with method call and return. As the instructions are defined by jumping to a label, the label must be passed as an instruction parameter.
(def jump
 '#{ifeq iflt ifne ifgt ifnull ifle ifge ifnonnull
    if_icmpge if_icmple if_icmplt if_icmpne
    if_acmpeq if_icmpeq if_icmpgt if_acmpne
    jsr ret goto lookupswitch tableswitch})

Thursday, February 7, 2019

Ontology of JVM instructions

The listing of JVM opcodes by function helps users to better understand the opcodes of the Java virtual machine (JVM) by category. But this classification has its limitations: it is a tree, and therefore it doesn't include all the instruction types that may be useful to the compiler developer. Additionally, there is a category of miscellaneous operations, and arraylength is categorized as an object function rather than an array function. At the same time, instanceof and checkcast are object operations even though they can be applied to arrays. It is clear that there is a general type of references, and certain instructions are applicable to both kinds of reference. The only functions that are truly reserved for object references are the instance field functions. The static field functions are not really related to references, and should therefore be grouped under atomic variables with the local variables.



In order to enable stack compatibility I decided to separately classify multi-valued instructions and uniquely-valued instructions. The only multi-valued instructions are the duplication and swap instructions. The uniquely-valued instructions have the same calling convention as methods, so they can be grouped together with them. In this sense, this classification of instructions is ultimately stack based, and it deals with the fact that the JVM is a stack machine. Here is an alternate view of a hierarchical part of this ontology.
Uniquely valued stack instructions
    Constant push instructions
    Trivial procedures
Atomic variable instructions 
Reference allocation instructions (these return references)
    New, newarray, anewarray, multianewarray
Reference operations (these take references as arguments)
    Reference procedures (zero valued)
        Reference variable modification
        Unary reference procedures 
            athrow, monitorenter, monitorexit
    Reference transformations (single valued)
        Reference variable access
        Unary reference operations 
            getfield, arraylength, reference type checkers
Primitive transformations (these take and return primitives)
    Unary primitive transformations
        Cast instructions
        Neg
    Binary primitive transformations
        Binary arithmetic instructions
        Logical instructions
        Comparison instructions
The atomic variables include both the local variables and the static variables; they are characterized by the fact that they do not require a reference as an argument to access them. The class of value push instructions, which are nullary single valued instructions, includes all the constant push operations, the local variable access operations, and the class variable access operations. The class of value push instructions can be used to construct the atomic expressions in a Lisp dialect, like Clojure or Lisp flavoured Java. In this way, when an atomic expression like 2, 2.2, x, or class/name appears in the code it is automatically converted to a value push instruction. This is part of the correspondence between Lisp and a stack machine. The atomic expressions in Lisp correspond to their own instruction class (the value push instructions) and the combiners correspond to their own instruction classes separately.

In order to make Lisp correspond to the stack machine you only need to make atomic expressions correspond to certain value push instructions, and combiner forms correspond to certain uniquely valued instructions. All Lisp programs consist of these two types of components, just as a stack machine's instruction set consists of these types of instruction classes, which makes it so effective to construct a correspondence between them. The only other nullary single valued instruction is a static method call that takes no arguments and returns something, basically a constant function or an object reference allocation. By convention, such a constant function should be written as a call like (Class/function) rather than appearing as an atomic term, so that effectively deals with the problem of atomic expressions.
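To make the correspondence concrete, here is a deliberately tiny sketch of such a mapping for integer expressions; the function name and the instruction representation are hypothetical, and a real Lisp flavoured Java would of course need type inference, method resolution, and local variable allocation.
(defn compile-expr
  "Compile an integer expression into a flat vector of JVM style instructions:
   atoms become value push instructions, and a combiner form becomes the
   pushes of its arguments followed by the combining instruction."
  [expr]
  (cond
    (integer? expr) [['sipush expr]]
    (symbol? expr)  [['iload expr]]
    (seq? expr)     (let [[op & args] expr
                          instruction ('{+ iadd, - isub, * imul} op)]
                      (conj (vec (mapcat compile-expr args))
                            [instruction]))))

(compile-expr '(+ 2 (* x 3)))
;; => [[sipush 2] [iload x] [sipush 3] [imul] [iadd]]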

Generic instructions:
When it comes to combiners, on the other hand, in the JVM there are different combiners for the different data types of the arguments presented to the instruction on the stack. In order to make generic instructions available, it is useful to have instruction classes corresponding to the different versions of an instruction that take different types. For example, the add class could include the iadd, ladd, fadd, and dadd instructions. Then when a generic combiner is presented to the Lisp flavoured Java developer, it is immediately known that it will produce some member of the generic instruction class dependent upon the types of the arguments given to it. This need for generic instruction classes is especially important because of the typed nature of the JVM.
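For example, the arithmetic combiners group into generic classes like the following, read directly off the JVM instruction set.
(def generic-arithmetic
  '{add #{iadd ladd fadd dadd}
    sub #{isub lsub fsub dsub}
    mul #{imul lmul fmul dmul}
    div #{idiv ldiv fdiv ddiv}
    rem #{irem lrem frem drem}
    neg #{ineg lneg fneg dneg}})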

Generalized variable instructions:
The JVM actually has two types of variables available to it: atomic variables and reference variables. The atomic variables do not take any arguments on the stack to get their value, or any extra ones to set their value. The atomic variables can therefore be accessed as atomic expressions like in Lisp, hence their name. The atomic variables are the local variables and the class variables. The reference variables are the array variables, as described by the array store and array load operations, and the instance variables, which are the fields of some object reference. Each of these variable types has both getters and setters, so it can be both accessed and modified. Generalized variables correspond to the l-values of the Java programming language. By using the generalized variable instruction class, we can better understand how a setf operation can be implemented by the compiler.
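The four kinds of generalized variables and their access and modification instructions can be tabulated as a map; the array row shows only the int forms, since each primitive and reference type has its own load and store pair.
(def generalized-variables
  '{local-variable {:get iload     :set istore}
    static-field   {:get getstatic :set putstatic}
    instance-field {:get getfield  :set putfield}
    array-element  {:get iaload    :set iastore}})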

Sunday, February 3, 2019

Books on virtual machines

The Java Virtual Machine book by Jon Meyer and Troy Downing is ideal for anyone wanting to learn about the Java virtual machine. The authors of this book also created Jasmin, which is the standard assembly language for the Java virtual machine. So it is an ideal resource for learning about the assembly language of the JVM. In order to understand the JVM I read this book along with the Java virtual machine specification, which can be found online.

When learning about the common language runtime (CLR) it is best to get a book by Serge Lidin, like .NET IL Assembler. Serge Lidin appears to be the only author dealing extensively with the CLR virtual machine and its assembly language. His book has helped me to understand the CLR and its differences from the JVM. It has everything you need to know about the CLR and its instruction set. In particular, it has a classification of the instructions used by the CLR. It uses the assembler ilasm, which comes with the CLR itself.