Monday, December 31, 2018

The need for high-level architectures

Most instruction sets available in stock hardware are register-based, which means that programmers dealing with them need to perform register allocation manually. Manual register allocation is generally not as good as what an optimizing compiler produces, since the compiler performs liveness analysis, graph coloring, and so on to allocate variables to registers appropriately. Stack machines, on the other hand, avoid this issue and provide programmers with an easy-to-use, high-level instruction set.

Stacks are a totally ordered data structure. The Lisp programming language, whose name comes from list processing, focuses on lists, which are also a totally ordered data structure. Lists and stacks are two representations of the same kind of totally ordered structure. As a consequence, stack machines belong together with a Lisp dialect corresponding to their instruction set, and a Lisp compiler for a stack machine can be constructed with ease.

While the majority of instruction sets available in stock hardware are register-based, we can at least get high-level stack-based architectures with virtual machines. In particular, there are the JVM and the CLR, which are both stack-based. Towards this end, I created Lisp flavoured Java, which is a Lisp dialect that corresponds directly to Java virtual machine bytecode. Atomic forms in the language correspond to push instructions in the virtual machine. Combiners correspond to instruction classes or to macros, which generate simpler forms. This demonstrates that even a stack-based machine like the JVM can be made isomorphic to a Lisp dialect.
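As a small illustration, using the syntax described in the posts below:
(+ 2 3)
The atoms compile to push instructions, and the + combiner selects the typed add instruction:
iconst_2
iconst_3
iadd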

The Lisp programming language championed the notion of homoiconicity, which means that code is written as data. This considerably simplifies the parsing phase of the compiler. High-level stack-based architectures, like the JVM or the CLR, are perhaps equally important, as they simplify code generation as well. A Lisp dialect designed for a high-level stack-based architecture can be compiled seamlessly.

Friday, December 21, 2018

Lisp flavoured Java implementation

The four previous posts described the Lisp flavoured Java programming language, which is a thin wrapper over the Java virtual machine. The language was designed to hide nothing of the virtual machine's internals from the programmer, so that Lisp forms translate directly to bytecode. The syntax of the language is Extensible Data Notation, as read by the Clojure reader. Semantic analysis is performed by the resolver, and compilation is done with the help of the ASM library. This is an early release, so some features of the language could be improved in the future.

https://gitlab.com/jhuni/lisp-flavoured-java

Tuesday, December 18, 2018

Lisp flavoured Java defining classes

Classes can be defined in Lisp flavoured Java using the full extent of the features of the Java virtual machine, like access flags, types, and so on. Although Clojure does provide a gen-class interface, it doesn't deal with access flags and all the functionality of the Java virtual machine, because that language wasn't designed to be Java with parentheses like this one. Here is an example of a Hello World class, which is one of the simplest things you will see programmed in Java. One notable aspect of this program is that it uses . in order to invoke an initialization method, which is something that probably hasn't been seen before quite like this. This can be used to initialize this or super with the appropriate arguments (as sketched after the example below), and it will produce the appropriate invokespecial instruction.
(class public HelloWorld
  (method public V <init> []
    (.<init> super))

  (method public static V main [(arr java.lang.String) args] 
    (.println java.lang.System/out "Hello world")))
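As a hedged sketch of that point, one constructor can delegate to another by invoking .<init> on this, which likewise produces an invokespecial instruction (the Pair class here is hypothetical):
(class public Pair
  (field public I a)
  (field public I b)

  (method public V <init> [I a I b]
    (.<init> super)
    (setf (.-a this) a)
    (setf (.-b this) b))

  (method public V <init> [I a]
    (.<init> this a 0)))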
While <init> is the one special kind of instance method whose invocation with . may not be familiar, the class initializer <clinit>, on the other hand, is a static method used to initialize the class, so it can be used to deal with certain aspects of programming in the Java virtual machine. One of the reasons that this syntax is possible is that Lisps in general, including Clojure, allow the < and > characters to be used in symbol names. Note that the message field doesn't need to be qualified, because static variables and static methods are imported throughout the class by default, just like in Java.
(class public Message
  (field public static java.lang.String message)
  (method public static V <clinit> []
    (setf message "Hello World"))
  (method public V <init> []
    (.<init> super)))
The importation of static methods allows you to implement static recursive functions that can call themselves. The most obvious example that comes to mind is the factorial function, so that will be demonstrated here. It's also worth mentioning that type inference is supported in the same manner as in Java, which is to say it is supported for local variables. Type inference cannot be supported for methods and their return values, because it is necessary to know at compile time what the signatures of the methods of the class are in order to do self-reflection.
(class public FactorialHelper
  (method public V <init> []
    (.<init> super))
  (method public static I factorialIterative [I n]
    (var rval 1)
    (var i 1)
    (while (<= i n)
      (setf rval (* rval i)) 
      (incf i 1))
    rval)
  (method public static I factorialRecursive [I n] 
    (if (= n 0)
      1
      (* n (factorialRecursive (- n 1))))))
Here is one more example that better demonstrates some details, like dealing with constructors and using this to access fields and methods.
(class public Point
  (field public I x)
  (field public I y)

  (method public V <init> [I x I y] 
    (.<init> super)
    (setf (.-x this) x)
    (setf (.-y this) y))

  (method public I getX []
    (.-x this))

  (method public I getY []
    (.-y this))

  (method public V setX [I x]
    (setf (.-x this) x))

  (method public V setY [I y]
    (setf (.-y this) y)))
Hopefully these few examples demonstrate some of the functionality of the Lisp flavoured Java language, which is directly translatable to Java bytecode with the appropriate instructions, access flags, and so on. Together with the previous posts, this is now a fairly full description of the capabilities of the language: its data operations, modification procedures, control flow, and class syntax. The language can be considered to be essentially Java with parentheses, which makes it ideal for programming on the Java virtual machine.

Saturday, December 15, 2018

Lisp flavoured Java control flow

Unconditional branches:
In order to implement a programming language isomorphic to the JVM instruction set, it is necessary to support goto statements, because that is how control flow is dealt with at a low level. Most high-level languages on the JVM don't support this. To implement any language with branches, it's good to be able to use symbols to define the branch targets.
(tag label)
(go label)
This is a trivial example, so it is not hard to see how this corresponds to bytecode.
label:
goto label
This syntax is mostly reminiscent of Common Lisp, much like the syntax for modification procedures, except that you can declare tags anywhere rather than just inside a tagbody. Common Lisp was influenced by Lisp machine Lisp, so it is clearly well suited for low-level programming.

Generalized conditionals:
The Java virtual machine instruction set has several operations that deal with conditional branches. These branch on certain conditions placed on operands on the stack, so they can be generalized by using appropriate logical operators. There are two types of logical operators that deal with a single argument: operators that classify the sign of an argument, and operators that determine whether or not a value is null. Then there are a variety of comparison operators. With the operations that determine sign, it may not be obvious to the Java programmer that this is their actual purpose.
(isnonnegative num)
An operation like ispositive, isnegative, isnonnegative, isnonpositive, iszero, or isnotzero then produces the corresponding conditional branch instruction, as you would expect.
iload_0
ifge then
iconst_0
goto end_conditional
then:
iconst_1
end_conditional:
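The null tests work the same way. Here is a sketch, assuming an isnull operator as described above:
(type java.lang.Object obj)
(isnull obj)
This produces an ifnull conditional branch:
aload_0
ifnull then
iconst_0
goto end_conditional
then:
iconst_1
end_conditional: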
The built-in comparison branch instructions of the virtual machine are limited to integers, but the comparison operators generalize them, emitting the appropriate comparison instructions for you.
(<= num1 num2)
The appropriate bytecode is then produced from this logical operation.
iload_0
iload_1
if_icmple then
iconst_0
goto end_conditional
then:
iconst_1
end_conditional:
If the values being compared are longs, then we can get the appropriate value with an lcmp comparison followed by a test on its result (see the sketch at the end of this section). To do a conditional branch, you can simply use if with an appropriate conditional and it will generate the proper bytecode for you.
(if (<= num1 num2) (go label))
The presence of the if statement around the branch statement tells the compiler that you want a conditional branch, so it can generate the proper bytecode for you without increasing the complexity of the bytecode. Perhaps a different syntax could be created for this, but this works.
iload_0
iload_1
if_icmple label
This makes the instruction a bit more functional, which is a positive development, but it doesn't hide any functionality from the user or abstract anything away that you could use yourself in your programs.
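To round out this section, here is the long case mentioned earlier, sketched under the assumption that J declares a long as in the atomic expressions post:
(type J num1)
(type J num2)
(<= num1 num2)
The comparison emits lcmp and then branches on its result (note that longs occupy two local variable slots):
lload_0
lload_2
lcmp
ifle then
iconst_0
goto end_conditional
then:
iconst_1
end_conditional:
Branches and conditionals also combine into loops. A sketch, assuming a < operator analogous to <= and the incf procedure from the modification procedures post:
(type I i)
(type I n)
(tag loop)
(incf i 1)
(if (< i n) (go loop))
This corresponds to the bytecode:
loop:
iinc 0 1
iload_0
iload_1
if_icmplt loop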

Switches:
Lookupswitches can be dealt with by using the extended prefix notation, as expected. The switch table is passed as the first argument to the combiner, followed by the operand to be pushed onto the stack.
(lookupswitch 
    (10 label1
     20 label2
     30 label3
    :else dftl)
    num)        
This operation could instead be dealt with by a case statement, but the purpose of this language is to hide nothing of the instruction set from the programmer. This produces a complex instruction which looks something like this.
iload_0
lookupswitch 
 10 : label1
 20 : label2
 30 : label3
 default : dftl
As can be clearly seen here, the switch instruction still corresponds directly to the bytecode generated. Table switches can be automatically recognized and generated for you if you use a contiguous interval of keys, or they can be programmed directly.
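For instance, a contiguous range of keys could compile to a tableswitch. A sketch, assuming the automatic recognition works as just described:
(lookupswitch
    (1 label1
     2 label2
     3 label3
    :else dflt)
    num)
The generated instruction would then look something like this:
iload_0
tableswitch 1 to 3
 1 : label1
 2 : label2
 3 : label3
 default : dflt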

Method calls:
The syntax to support method calls is not that interesting: Clojure already has a syntax that works well enough, so there is no need to change it. The only issue is that the implementation needs to do a bit of reflection to determine the right kind of invoke instruction to use, but once that is done the method call is translated directly to a sequence of instructions pushing values onto the stack, followed by the correct invoke instruction.
(java.lang.Math/max 0 10)
(.charAt "string" 0)
The assembly code produced below demonstrates how these function calls correspond directly to invoke instructions on the Java virtual machine.
iconst_0
bipush 10
invokestatic java.lang.Math/max(II)I
ldc "string"
iconst_0
invokevirtual java.lang.String/charAt(I)C
The benefit of using this notation is that it follows the same pattern as all the combiners defined so far: values are pushed onto the stack one by one, and then the instruction is determined by the prefix. As mentioned previously, high-level programmers might not realize that when programming Java, the instance a method is invoked on is actually an operand pushed onto the stack.

Friday, December 7, 2018

Lisp flavoured Java modification procedures

The first part of this process of abstracting the Java virtual machine dealt mainly with the construction of expressions from atoms and functions, and less with the aspect of modifying storage locations. This led to a minimal abstraction of Java bytecode; however, I want to create not only a minimal abstraction but an elegant one as well. Towards that end, it is useful to introduce generalized variables. In the Java virtual machine there are four types of things that can be considered generalized variables: local variables, class variables, instance variables, and array index variables. These are displayed below. By abstracting these different types of variables away, we can modify any of them using a single function, and the compiler can produce the appropriate instruction associated with each of them. The Lisp way of doing this is to have a function like setf which modifies variables.
(type I x)
(setf x 10)
(type D y)
(setf y 1.0)
The appropriate local variable modification instructions are introduced by the compiler, corresponding to the type of the variable being modified.
bipush 10
istore_0
dconst_1
dstore_1
This is how you might deal with modifying a static variable of a class:
(setf MainClass/name 0)
This will be compiled to a putstatic instruction with the appropriate type:
iconst_0
putstatic MainClass.name I
When dealing with compound variables, a special form can be placed as the first argument to setf rather than an atom. The form appears exactly as an access to the variable would appear.
(type (arr I) coll)
(setf (aload coll 0) 10)
(setf (aload coll 1) 20)
These two function calls are expanded into two iastore instruction sequences. The appropriate store instruction in this case is determined by the element type component of the array type of the variable being referenced. The ordering of the operations on the stack produced by the modification procedure is preserved, because the lvalues in the Java virtual machine instruction set always come first. This means this is still a simple abstraction of the Java virtual machine instruction set which corresponds directly to its bytecode.
aload_0
iconst_0
bipush 10
iastore
aload_0
iconst_1
bipush 20
iastore
The other type of compound variable, the instance variable, can be dealt with as you would expect.
(type java.awt.Point p)
(setf (.-x p) 10)
(setf (.-y p) 20)
The type of the field being modified is determined at compile time by reflection.
aload_0
bipush 10
putfield java/awt/Point.x I
aload_0
bipush 20
putfield java/awt/Point.y I
While all of these different uses of the assignment procedure have been interesting, the Java virtual machine does support one other type of modification procedure that deals with variables, and that is the iinc instruction. The iinc instruction only deals with integer-valued local variables, though, so I believe it is worthwhile to abstract it away with a procedure like incf. This incf function can then be applied to any of the different types of variables.
(type I x)
(incf x 1)
(type (arr I) coll) 
(incf (aload coll 0) 1)
When the variable being dealt with by incf is an integer local variable, it will produce the appropriate iinc instruction in order to provide you with more compact bytecode. Otherwise, it is necessary to get the value of the variable being addressed, push the amount to be added onto the stack, and then add them in order to get the value it is going to be set to.
iinc 0 1
aload_1
iconst_0
aload_1
iconst_0
iaload
iconst_1
iadd
iastore
These two modification procedures allow you to produce any of the modification instructions available in the Java virtual machine in an effective and generalized manner. We can now see that there are two different types of data operations in the Java virtual machine: functions that transform values on the stack, and modification procedures that assign values to variables. Together, these two types of instructions define all the data operations of the Java virtual machine. The only instructions that have not been dealt with yet are the control flow instructions, which will be dealt with in a later post.

Wednesday, December 5, 2018

Lisp flavoured Java

The problem of how to make a thin layer over the Java virtual machine instruction set is addressed here. Lisp programs are built up from two components: atomic expressions and compound forms. As the Java virtual machine is a stack machine, the thin layer over it will be constructed by making atomic expressions correspond to push instructions and compound forms correspond to everything else. In this way, the Java virtual machine can be made isomorphic to a Lisp dialect. The JVM specification describes the instruction set largely in terms of type-prefixed instruction families, like Tload and Tstore, which suggests how instructions can be provided as combiners.

Atomic expressions:
The atomic expressions in the Java virtual machine come in two forms: constants and variables. The atomic expressions in Lisp correspond directly to the atomic instructions in the Java virtual machine. In this way, Lisp flavoured Java is isomorphic to the Java virtual machine's high-level instruction set. Here are some examples of constant expressions:
10
2.2
"foo"
Integer-like expressions are read as ints rather than as longs by default, just like in the Java programming language, though unlike Clojure, which assumes you are using a long by default. The use of ints by default is necessary for any kind of low-level programming on the Java virtual machine: ints are used for array access, array length, and so on, and they have the most specialized instructions associated with them, so especially for performance it is necessary to use ints by default. Float-like expressions are read as doubles, again just like Java, as there is no need to change that standard. So literals are what you would expect from Java. The atomic expressions are converted one to one to push instructions, and the push instructions that save the most space are determined for you.
bipush 10
ldc2_w 2.2
ldc "foo"
Variables are different from constants in that they can have a type associated with them. Consequently, type declarations can be associated with variables. One of the first things one notices when writing JVM bytecode by hand is that it is hard to keep track of local variables without the use of names, so it certainly is nice to be able to use symbols that refer to local variables instead.
(type I a)
(type J b)
a
b
The appropriate push instruction is determined for you based upon the type of the local variable referred to by the symbolic expression. The atomic expression then corresponds one to one with a push instruction in the machine code.
iload_0
lload_1
The other type of variable besides local variables is the class variable. Class variables can be designated using a slash in a symbol.
java.lang.Math/PI
java.lang.System/out
Class variables are then converted to getstatic instructions. Since variables can have types, class variables have types as well; the types of class variables are determined by reflection rather than by type declarations.
getstatic java/lang/Math.PI D
getstatic java/lang/System.out Ljava/io/PrintStream;
This demonstrates how different atomic expressions like constants, local variables, and class variables in the language correspond directly to push instructions in the Java virtual machine, making the language a thin layer over the Java instruction set.

Compound forms:
Compound forms are constructed by combining atomic expressions with a combiner. The definition of the combiners is determined by the typed instruction families described in the Java virtual machine specification, and the appropriate typed version of an instruction is determined at compile time. Consider the operation Tneg, which can come in forms such as ineg and lneg. The Tneg combiner is compiled to the appropriate instruction based upon the type of the argument passed to it. The standard means of naming these typed combiners is to simply remove the T prefix, so Tneg is simply neg.
(neg 1)
The neg operation is then compiled to its corresponding instruction, while the atomic expression 1 is compiled to its appropriate push instruction, and these are then added to the machine code.
iconst_1
ineg
The same is true for the Trem operation, which is simply reduced to rem when it appears in a compound form.
(rem 10 2)
When an operation like rem is applied to multiple arguments, the order in which the arguments appear is preserved by the stack machine. In many ways the nature of Lisp as a means of constructing programs from ordered lists is especially suited to programming a stack machine.
bipush 10
iconst_2
irem
To make programming easier, for the other arithmetic operations + corresponds to Tadd, * corresponds to Tmul, - corresponds to Tsub, and / corresponds to Tdiv. These are merely aliases for the fundamentally typed operations.
(type I x)
(/ (* x (+ x 1)) 2)
These expressions can be nested; their arguments are pushed onto the stack in order and evaluated accordingly. In this way, we can see how this Lisp dialect directly corresponds to the JVM instruction set.
iload_0
iload_0
iconst_1
iadd
imul
iconst_2
idiv
Logical operations like shl, shr, ushr, and, or, and xor are similarly provided directly to the programmer, and they are compiled to their typed versions. Casts can be expressed as T2i, T2l, T2f, T2d, T2s, T2c, and T2b; these convert an operand of one type to another. The ordinary standard of naming these combiners by removing the type prefix would leave a combiner with a name starting with a digit, so I decided that these are aliased by their cast names: int is T2i, long is T2l, float is T2f, double is T2d, short is T2s, char is T2c, and byte is T2b. This basically describes how the data transformations in the JVM can be converted directly to a Lisp-like syntax in an effective manner.
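For instance, here is a hedged sketch combining a shift with a cast; shl compiles to ishl based on its integer arguments, and the long alias for T2l then widens the result:
(type I x)
(long (shl x 2))
This corresponds directly to the following bytecode:
iload_0
iconst_2
ishl
i2l
The next detail is to determine how to deal with arrays. Elements of arrays can be loaded with aload as you would expect.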
(type (arr I) coll)
(type I i)
(+ (aload coll i) (aload coll (+ i 1)))
This expression and all of its components are then converted directly to the following bytecode:
aload_0
iload_1
iaload
aload_0
iload_1
iconst_1
iadd
iaload
iadd
Arraylength is an interesting case, unlike the others described so far, in that it doesn't have any special type information associated with it, because it always applies to arrays.
(type (arr I) coll)
(arraylength coll)
This one is so simple that you hardly even need a compiler to deal with it; it can be converted directly to its associated bytecode.
aload_0
arraylength
You can do a getfield operation much like you would in Clojure by specifying the field name in the combiner.
(type java.awt.Point point)
(.-x point)
The owner of the field is determined by the argument passed to the field accessor, and the type of the field is determined by reflection, which is then used to output the appropriate getfield instruction.
aload_0
getfield java/awt/Point.x I
So far, all of the atomic expressions and combiners have corresponded directly to instructions in the Java virtual machine instruction set, with their operands pushed onto the stack. You can produce instructions that have instruction parameters by using extended prefix notation. Extended prefix notation means that the instruction parameters are passed in the front, next to the combiner, before the operands to be pushed onto the stack are specified.
(type java.lang.Object obj)
(newarray I 9)
(multianewarray I 9 9)
(instanceof java.lang.String obj)
Operations like new, newarray, multianewarray, instanceof, and checkcast correspond directly to their instructions, except that they must have an instruction parameter passed to them in the front.
bipush 9
newarray int
bipush 9
bipush 9
multianewarray [[I 2
aload_0
instanceof java/lang/String
This describes how all the different combiners and atomic expressions can be made to correspond to machine instructions, except in the case where there is an extended prefix notation, which means that the programmer has to pass an argument in the front to define the instruction before the following stack operands. This deals with most of the data operations of the Java virtual machine. Miscellaneous other no-operand instructions like nop, pop, athrow, monitorenter, and monitorexit simply correspond directly to their machine instructions. Pop is simply a function which takes its argument, pushes it onto the stack, and then returns nothing. Variable modification and control flow will be dealt with later.
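As a small sketch of that last point, wrapping any expression in pop evaluates its operand and then discards the result:
(pop (neg 1))
This corresponds directly to the bytecode:
iconst_1
ineg
pop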