JavaTime Program Transformation Toolkit

[Overview] [Namespace][Bytecode] [Source] [Other Projects] [Installation] [Javadoc]

1. Overview

This document describes a research prototype of a system for manipulating and examining Java classes in both source and bytecode forms. The byetcode related portion of the interface is both more developed and more mature, having been used for a successful class project, while the source-level interface is less developed, having only been used for an unsuccessful class project. There is a single data structure used to represent the namespace occupied by Java entities, whether they originate in a .class or .java file. Thus, the class hierarchy is divided into roughly three sections: a container/namespace hierarchy for translating names into entities, an abstract syntax tree data structure for Java source files, and a structural encoding of .class files.

2. Namespace

2.1 Containers

All entities that may be named descend from the class astagjava.NamedObject. There are three major categories of named object: containers (including scopes, bytecode/source files, packages, and reference type declarations), types (primitive, class, interface, etc), and typed objects (constructor, method, field, etc.).

All NamedObjects contain 4 basic fields:
 
name A NameNode that gives the qualified name of this entity.
container Refers to the principle container that the entity belongs to, the container named by NAME's qualifier.
cunit A system variable used for access to global variables such as the root package, and the primitive types.
source The object which produced this name, which may be either from the source or bytecode representations.
There are a number of methods for looking-up certain data in the namespace. There is not, in general, a one-to-one mapping between names and objects in each container, so there are lookup methods that allow you to look for objects of various types. Some lookup functions return multiple values, while others that search for specific types may return just one object or null. The root package is used at the top level for locating qualified names. Packages have four specific lookup routines intended for general use.

These routines are simplified interfaces to the more general lookup routines described below. Of these, the first three simply return elements of the class hierarchy: packages, compunits, and classfiles. For example, suppose that R is the root package, then:

  R.lookupClassfile (java.lang.String)

will return the object representing the bytecode found in String.class. The fourth is a (perhaps misleading) abbreviation that attempts to find the public reference type declaration for a named class. Rather than return the ClassFile representing java.lang.String, for example, lookupRefdecl returns an object of type ReferenceDecl (either ClassDecl or InterfaceDecl, except in the case of an error, in which case it returns ErrorDecl)--instead of returning the class file, it returns the class itself. There is little distinction in the case of bytecode, which contain only one class, but the difference is pronounced in source files where multiple classes are present. LookupRefdecl is equivalent to:

  ClassFile cf = R.lookupClassfile (java.lang.String);
  ReferenceDecl rd = cf.lookupMain ();

Note that it chooses to search the bytecode, not source. This is arbitrary (and distasteful). The lookupMain routine, in this case, searches for a reference decl named "String" in the container java.lang.String.

Packages contain other packages, classfiles, and compunits. The next level of hierarchy begins inside individual classfiles and compunits. These contain one or more reference decls corresponding to the reference types (class or interface types) declared within. In addition, compunits also contain unqualified type names that were imported into the scope. For example, the file Foo.java containing:

  package P;
  import java.util.Hashtable;

  class Foo { }
  interface Bar { }

produces a CompUnit that contains the bindings:

In addition, the implicit "import P.*" and "import java.lang.*" produce: There are a variety of lookup methods defined on ReferenceDecls for locating the members of the class or interface. (In addition to these lookup methods, there are methods for retrieving various properties of the declaration, such as interfaces, modifiers.)
 
NamedObjectLookup lookupConstructor() Returns all constructors.
NamedObjectLookup lookupField(String name) Returns all fields matching name.
NamedObjectLookup lookupMember(String name) Returns all members matching name.
ConstructorDecl   lookupConsdecl(String name, String sig) Search for a constructor decl: a null sig means search for a unique constructor decl.
MethodDecl        lookupMethdecl(String name, String sig) Search for a method decl: a null sig means search for a unique method decl.
The final container type is BlockScope, used for storing variable and parameter bindings. It has no specific lookup operations. The rest of the lookup operations are defined on all containers. These fall into several categories and are implemented in terms of one another.

A lookup operation may search all the enclosing scopes of a container, or it may only search the first scope--just the container itself and not its parent containers. When deciding what the "next" enclosing scope to use it, the following rules is applied:

  1. If the container is a CompUnit or ClassFile, search the root package next.
  2. If the container is a Package, the search is concluded.
  3. Otherwise, search the parent container.
In addition to these rules, reference declarations (class & interface) have special scoping rules that cause lookups to also search through members of the superclass and interfaces.

A Validator is a predicate that restricts the result of a lookup to a subset of the class hierarchy. For example, the validator AstObject.vldtr() is unrestrictive, while Package.vldtr() restricts the result to Package objects.
 
NamedObjectLookup lookup(String name) Lookup named object in enclosing scopes.
NamedObjectLookup lookupValid(String, Validator) Lookup named object in enclosing scopes.
NamedObjectLookup lookupFirst(String, Validator) Lookup named, valid object in first scope.
NamedObjectLookup lookupAll(Validator) Lookup all valid objects in first scope.
There are also qualified lookup routines that recursively search for a fully qualified name in the enclosing scopes.
 
NamedObjectLookup qlookup(NameNode) Qualified lookup in a named container object.
NamedObjectLookup qlookupValid(NameNode, Validator) Qualified, validated lookup in a named container object.
These routines take into account the various types of qualifier to determine the appropriate lookup strategy. For example, the qualified name System.out.println() is matched as follows. First, lookupValid is used to find the first binding of "System" in the enclsoing scopes. This will be found in the enclosing CompUnit, where "import java.lang.*" gave produced the binding "System -> type java.lang.System". Having found a type for the first qualifier, it proceeds to obtain a corresponding type declaration (see the following section) and then lookup "out" in its class declaration. This returns a TypedObject (field declaration). Using the type of this object (java.io.PrintStream) to obtaining the corresponding class declaration, it looks up "println". The lookup will return all the methods named println, and you still have to select the one you want using overloading rules, etc. You won't need to use this type of lookup when examining Java source because the ss3() routine handles overloading and replaces the object reference "System.out.println" with two access nodes that make the derivation explicit: a TypeFieldAccessNode (for the static "out" field of java.lang.System) and ObjectFieldAccessNode (for the chosen "println" field of java.io.PrintStream).

There are four more specialized lookup routines that are extensions of the above. Already mentioned, lookupMain() searches for the public class or interface declaration in a class or source file. lookupSuper() returns a class or interface declaration corresponding to the superclass or super-interface. The two lookup routines dealing with types produce type objects instead of objects in the class hierarchy (class and source files)--this will become clear when types are discussed next. lookupAllTypes() returns a type object for each distinct entry in a package, and lookupType() returns a specific type by name.
 
ReferenceDecl     lookupMain() Returns the main class or interface.
ReferenceDecl     lookupSuper() Search for the first enclosing superclass.
NamedObjectLookup lookupAllTypes() Returns all Types.
Type              lookupType(NameNode) Returns a Type.

2.2 Types

Type objects include reference types, primitive types, method types, and the special void type.

Reference types include arrays, classes, interfaces, and the special null type. Class and interface types are represented using lazy bindings--holding one of these objects does not actually load the corresponding class or source file. For this reason, equality on types must be tested using the equals() method, not the equality operator. Since just the name of a type does not indicate whether it is a class or interface, both are initially represented as the superclass ClassOrInterfaceType. Within ClassOrInterfaceType, there are three classes: ErrorType, LazyType, and ResolvedType. These three represent the cases in which a type is unknown due to an earlier error in processing, a type that did not immediately have its declaration loaded, and a type that has already resolved to some declaration. A ClassOrInterface object can be queried for its corresponding declaration using the asDecl() method, which forces the underlying object to be loaded and returns a ReferenceDecl. This returns a ClassDecl, InterfaceDecl, or in case of an error, an ErrorDecl. Now it is clear why the lookupAllTypes() and lookupType() methods are special, they translate class and source files into LazyTypes before returning their result.

The method type is used for organizing the parameter and return types of a method. It does not represent an actual Java language type, however it can be treated as such for most purposes, including signature generation.

The primitive types, as well as the null and void types, are accessible through the ever-present cunit() object. For example, cunit().system().booleanType() refers to the boolean type.

There are various routines for learning information about types, including:
 
asCpReference() Class file specific: returns a constant pool entry refering to this object.
asReferenceType() Return a reference type.  For primitive types it returns the corresponding type from java.lang, such as java.lang.Integer.
equals(AstObject) True if two type objects are equal.  You cannot use the equality operator because type objects are not unique.
isAssignableFrom(Type) True IFF this type may be assigned from the argument.
isCallableWith(Type) True IFF the argument type type may be method invocation converted to this type.
isCastableFrom(Type) True IFF this type may be casted to from the argument.
isIntegral() True IFF this type is integral (byte, char, short, int, long).
isNumeric() True IFF this type is numeric. (integral, float, double).
primType() Return the primitive's type, an enumeration (PrimType) found in the C header hc.h.
promoteTo(Type) Promotion of this type another.  For promoting this type to the least upper bound of it and the argument, except it doesn't really work right because it does not handle reference types.
signatureToType(String, ClassHierarchy) (Static method) Convert a type signature to a Type.  The first argument is the signature, and the second is used for looking up types.
toString() Converts to string representation, for printing.
typeSignature() Returns the type signature of this type, the encoding used in bytecode files.
width() 2 if long or double, otherwise 1.  This is used in bytecode verification and classfile generation.

2.3 Typed Objects

The remainder of the named objects are those that have associated types. Unlike type declarations, which are containers that define types, and type objects, which represent all types, these objects simply have a name and an associated type, including constructors, methods, fields, parameters, and local variables. These are used to fill in the def() field of various nodes in the AST. Constructors are named "<init>".

3. Bytecode

Bytecode manipulation begins with a ClassFile object, obtained using lookupClassfile. The data is represented in a structure that closely resembles the specified file format. Class files contain: The constant pool determines much of the organization of this interface. There are two types of constant pool entry, according to whether the entry has been resolved or not. An unresolved constant pool entry is one that has not had its references to other constant pool entries resolved. The constant pool is read in its entirity and unresolved constant pool entries are created. Entries are resolved on demand, and new resolved constant pool entries can be created from scratch, but unresolved constant pool entries are not used outside of the class file reading routine. Resolved constant pool entries have a class that is determined by their data type (class, string, double, long, name & type, etc), whereas unresolved entries have a class that is determined only by their format (single CPI ref, double CPI ref, single word, double word, or UTF8). Later on, when class files are being written, the constant pool is re-generated. Thus, modifying a resolved constant pool entry will only result in modification to a single constant reference and not all references.

For example, the type ConstantPoolClass has one field, cpClassname that returns a slashified class name, e.g. "java/lang/Object". The special method cpAsType() returns a ResolvedType object for the class. The ConstantPoolString object represents a Java-language string constant. Numeric constants are represented using ConstantPoolDouble, ConstantPoolFloat, ConstantPoolLong, and ConstantPoolInteger. The ConstantPoolFieldRef, ConstantPoolMethodRef, and ConstantPoolInterfaceMethodRef types each have 3 associated strings: the class name being referenced, the field or method name being referenced, and the type signature of that object.

Fields and methods are accessible through the ClassFile object's cfFields() and CfMethods() methods. Alternatively, the source() field of a FieldDecl or MethodDecl (see the 4 basic NamedObject fields) will return the corresponding CfFieldInfo or CfMethodInfo object. Both of these objects inherit from CfMemberInfo, which has the following two string fields:
 
cfName The name of this member, a string
cfDescriptor Misnamed: this is the type signature
In addition, fields have an initial value (optional):
 
cfInit A constant pool entry
and methods have a code segment (optional) and a list of exceptions thrown:
 
cfCode The actual bytecodes for this method, of type Code
cfExceptionThrows The exception types that can be thrown
Most of the interesting things you can do with a class have to do with its code. The Code object can be used for generating, examining, and modifying the bytecode instructions of a method. The abstraction that is used resembles a text editor. There are code markers that represent fixed points in the sequence of instructions. Insertion before or after a marker does not change its relative position in the sequence. A marker is used like an insertion point, and has methods for the insertion of new instructions preceding its current position. A code position object is used as a placeholder for branch targets: the CodePositionFloating type is used when generating new positions, while the CodePositionAbsolute is used internally for translating relative program counter offsets into floating code positions.

Instructions are represented by a doubly linked list. Markers and positions are represented using special, non-existent instruction types. The Code object has methods for retrieving the first and last instruction, as well as for obtaining markers at the beginning, end, or middle of the sequence.

There are 13 different instruction formats; each is represented by its own type.
 
BipushInstruction Only the bipush (push byte) instruction uses this format
SipushInstruction Only the sipush (push short) instruction uses this format
IincInstruction Only the iinc (increment) instruction uses this format
InvokeIInstruction Only the invokeinterface instruction uses this format
MultiNewInstruction Only the multianewarray instruction uses this format
NewArrayInstruction Only the newarray instruction uses this format
JumpInstruction All the branch instructions use this format
LoadCPInstruction Instructions that load from the constant pool
LoadLVInstruction Instructions that load from a local variable slot
NoargInstruction Instructions with no operands
SwitchInstruction A tableswitch or lookupswitch instruction
LabelInstruction Internal use
MarkerInstruction Internal use
Those are the basics of a class file. There is more to know about this stuff, but there isn't much point to describing: how to compute stack size & local variable counts, exception throw and catch information, and using arbitrary attributes (named key-value pairs embedded in the class file).

There is an example of how to use this interface to re-write a Java class file contained in the Java classes:

These files are heavily commented; visit them in the above order.

4. Source

Source files are accessed through the CompUnit object. The container hierarchy that represents a source file has already been discussed. The CompUnit contains a number of reference type declarations, each of which contains a number of members, and so on. There is no good mechanism in place for modifying source code in the same way that there is for byte code. I did experiment with this capability, but never reached a point where it was convenient or easy enough to explain. That experiment was done using the C++ interface, and can be found in the file libast/aug.cpp.  Warning: this code is very, very ugly.

The interface does allow for the abstract syntax tree to be traversed and examined. To do this requires an understanding of the class hierarchy, which is quite large. There is one class per node type in the abstract syntax tree, approximately 130 node types in all. The root of all nodes in the AST is a TreeNode. The easiest of these nodes to understand are the statement and expression nodes--they map directly onto the language constructs that you are already familiar with. The remainder of the nodes represent various structural and syntactic components of the compilation unit, such as import statements, class and interface declarations, formal parameters, field and method declarations, names, and types.

The Javadoc produced class hierarchy is especially useful in understanding these nodes.  Each TreeNode has at least two fields:
 
position location of the construct within the source file
cunit the containing compilation unit
Each node also has zero or more abstract syntax tree children. For example, the MethodCallNode has two children:
 
method an expression (ExprNode) yielding the method to call
args a list of expressions (ExprNodes) provided as arguments
For each field there exists a method to set the value of that field named setField(). I do not encourage the use of these methods.

Lists are represented by a special tree node type called a TreeListNode. It has variable arity and its fields are not named, they are only numbered. There are generic methods for accessing the children of lists and any other tree node by their index. These are:
 
arity() the number of children of the node
child(N) the Nth child
children() an array of children of length arity()
These methods allow tree traversal without specific attention to the meaning of a particular node.

Some tree nodes may absent. When the grammar allows for an optional construct, nodes are left absent to indicate that fact. Two methods can be used to test for absence:
 
absent() true if the node is absent
present() true if the node is not absent
Absent nodes are created using the static omitted constructor in each non-abstract class. For example, the method NameNode.omittedNameNode() creates an absent NameNode. Some of the places you may expect an absent node are:

Some of the absenses are filled in by static semantic analysis, but some will remain. Abstract classes may not be absent, since they cannot be instanciated, so there is a special class for creating omitted expression nodes: OmittedExprNode. This is used to represent an absent, optional expression, such as in "for (;;)" statements, or in default case statements.

An easy way to get a feel for the node structure is to print it. There is a simple Java program to prints the node hierarchy out in the Java class weld.PrintTree.  See the documentation there. More elaborate implementations could surely be made.

5. Other Projects

Here is a list of other things that I did related to this project that are not written up here:

6. Installation Etc.

Earlier versions of the software used the JTCLASSPATH environment variable instead CLASSPATH because it was unable to extract class files from zip archives. Since the JDK distributes its classes in that format, I had to use a separate variable that included an unpacked version of classes.zip. This caused lots of trouble, so I decided to fix the problem.

It now uses the CLASSPATH environment variable and can read zip files. The only catch is that you still have to put classes.zip into your CLASSPATH--the JDK automatically provides this variable in the its startup scripts.

To install the software on Windows, first you must have JDK 1.1 installed. Download the distribution (zip, tar.gz).  There are two DLLs. The first is AST.DLL, which contains all of my code, and the second is ZLIB.DLL, which contains the un-compression code for reading zip files.

These two DLLs must be installed in your path, the PATH variable. Suppose I unpack the distribution in the location G:\DIR, creating the file G:\DIR\JAVATIME. Set PATH like so:

Now you should be able to run the sample applications. For example, to run the weld.PrintTree example, execute It will output the AST nodes for each class as described above.