Saturday, November 24, 2018

Understanding the java virtual machine instruction set

It is well known that Lisp programs consist of two components: atoms and compound forms. Compound forms start with a combiner and they are followed by a collection of other forms which can be atoms or other compound forms. Using combiners and starting with atoms we can form any Lisp program. This concept of program construction can be related to programming in a stack machine. Atoms are things that are pushed directly on to the stack, and combiners are the instructions that manipulate stack operands.

The atomic expressions in the JVM can only mean the constants and variables that can be loaded onto the stack directly using certain operations. Constants can be pushed onto the stack using either special operations designed to save space or by a reference to the constant pool assigned to the ldc instruction. Variables come in two forms: local variables and class variables. Local variables are accessed by some integer as well as a type, well class variables are accessed by getstatic. As any programmer would want to assign names to variables, they can be represented in practice using symbols. In order to distinguish between local variables and class variables there can be a delimiter like '/' in Clojure or ':' in Kawa Scheme. In any case in the end we have two types of atoms: constants and symbolic variables.

All other things can be done using compound forms with appropriate combiners which correspond to instructions. All the arguments to a compound form are pushed onto the stack followed by the combiner. A uniquely valued combiner consumes its arguments and then either puts something onto the stack or it returns nothing. There are arithmetic, logic, cast, and comparison functions for dealing with primitive values, the load and access functions for dealing with arrays, allocation instructions, type checkers, and field access instructions. All of these are essentially functions that directly operate on values on the stack. The nop and pop instructions consume their instructions but don't do anything. So in terms of data operations, the main thing is modification procedures like store, astore, putfield, and putstatic which operate on place forms of various sorts.

The control flow instructions come in two basic forms : conditional branches and switches. There are a wide variety of different control flow instructions that are variants of these two forms. Switches are different because they involve branching to a wide variety of different points. Other procedures include return, throw, monitorenter, and monitorexit. All of the instructions mentioned so far take the basic form of an operation that takes its arguments on the stack and then returns some value or nothing at all. Methods, which are user defined combiners, take the same form as this. As a result, many of these instructions can in theory be defined as methods. For example, one can define most arithmetic, logic, comparison, and cast instructions as methods instead of as instructions and it will have the same effect on the stack. The only exception to this is local variables and control flow instructions which deal with properties inaccessible to invoked methods.

Methods come in different forms including instance methods and static methods. The difference between the two doesn't always matter that much because the JVM can use devirtualization to optimize instance methods. All Clojure functions are defined as instance methods on function objects, but this is okay because the JVM is very good at devirtualization especially as clojure functions are final. Invokespecial deals with constructors and other special cases. The main point then is that there are methods that can be accessed by pushing their operands onto the stack and then returning either some value or nothing in the case of a void method. So since the virtual machine directly supports uniquely valued functions like these they should be the basis of all compound forms.

Then finally there are instructions that return multiple values. These operations like dup, dup2, dup_x1, dup2_x1, dup_x2, and dup2_x2 don't do anything special, but rather they provide a more efficient alternative to pushing things onto the stack more then once. As these multi valued instructions do not follow the calling convention defined for methods, which is that they either push a value onto the stack or nothing at all, they should be handled entirely by the compiler. The main function of the compiler then, is to find ways to insert instructions like dup into compiled forms to make them more efficient, as these operations like dup should not be handled by the user. Instead, the stack should certainly be abstracted away by an expression based system like Lisp.

So in conclusion there are three types of operations in the JVM instruction set: atomic values which are push instructions, uniquely valued instructions like method calls, and multi valued instructions. The later the multi valued instructions can be handled entirely be the compiler. This demonstrates how a Lisp like expression tree can easily be mapped onto the JVM instruction set, and in general a stack machine is an ideal platform for implementing a Lisp dialect using the outline described here.

No comments:

Post a Comment