[maemo-developers] Java acceleration/Jazelle

Wed Jul 18 00:40:47 EEST 2007

Folks,

This is a summary of a conversation Simon and I had off line.  We
decided it would be a good idea to post it here to the list so others
could see the discussion and comment.   A couple of caveats to keep in
mind.  I haven't had a chance to compile and try the code yet; I've been
reading the patent, and I'm not through the entire patent yet either.
This means the following could require revision.

Simon and I also agreed it would be worthwhile (and make Quim happy) if
we started a Wiki page to condense our knowledge.  Following email
threads and pulling out the useful nuggets gets tedious when the thread
gets long.

From what I've seen in the patent, the Jazelle hardware treats Java
opcodes much like Thumb instructions.   It switches to Jazelle mode and
processes the Java opcodes directly in the CPU pipeline in sequence with
other Thumb and ARM opcodes.

According to the patent, a program could start out executing 32 bit
opcodes, switch to Thumb instructions to load a sequence of Java byte
codes,  switch to Jazelle mode and execute them, return to Thumb mode,
then return to ARM mode and exit.  The program is then really a sequence
of three different types of opcodes (ARM 32, Thumb, Java).

The above is only meant to illustrate that Jazelle is not a coprocessor
implementation like the old FPA11 FPU supported by the NWFPE in the
kernel, and that execution of Java is interleaved with ARM and Thumb
code.  The processor is basically executing Java bytecodes once started
until it is told to stop.

Simon pointed out that the actual transition has to be ARM->Java->ARM
according to the processor manual
(http://www.arm.com/pdfs/DDI0211I_arm1136_r1p3_trm.pdf).  This is not
what the patent suggests, but the manual will better describe the actual
implementation.

My take on a sequence for a JVM bytecode processing loop is:

load a stream of bytecodes into a buffer.
load r14 with address of first bytecode to execute in the buffer
load r12 with the address of the code to handle the bytecode
bxj r12
....

The program then proceeds to run executing the byte codes in the
buffer.  For this to happen, each handler for a particular byte code must:

a) load the address of the next byte code to execute
b) load the address of the software code to handle the next byte code.
c) process the current opcode
d) call bxj r12 to loop and execute the next bytecode.

An opcode handler thus looks like this:

load r14 with address of next byte code to execute
load r12 with the address of the software code to handle the byte code
process the current opcode
bxj r12

This type of architecture makes sense as each opcode knows what data
follows it in the byte code stream and can adjust the byte code pointer
in r14 to point to the next opcode correctly.   Basically as long as r14
and r12 are filled before the bxj opcode is called things should be
fine.  The patent author is a little long winded about interleaving the
fills of these registers with the processing to avoid pipeline stalls.
Fine, but this is an optimization for performance that could be done after.
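
As a rough illustration of the handler pattern above, here is a sketch
in portable C.  It only simulates the idea and does not touch Jazelle
hardware: `pc` stands in for r14, `next` stands in for r12, and the
`while` loop in `run` stands in for `bxj r12`.  The names `vm_t`,
`handler_fn`, `lookup`, `advance` and `run` are made up for this sketch,
and OP_HALT (0xfe) is a hypothetical stop marker, not something the
patent or manual confirms.  For simplicity each handler processes its
opcode first and fills `pc` and `next` afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* bipush and iadd are real JVM opcodes; OP_HALT is an assumption. */
enum { OP_BIPUSH = 0x10, OP_IADD = 0x60, OP_HALT = 0xfe };

typedef struct vm vm_t;
typedef void (*handler_fn)(vm_t *);

struct vm {
    const uint8_t *pc;   /* role of r14: bytecode to execute next */
    handler_fn next;     /* role of r12: software handler for it */
    int32_t stack[16];
    size_t sp;
};

handler_fn lookup(uint8_t opcode);

/* Skip this opcode and its operand bytes, then pick the next handler. */
void advance(vm_t *vm, size_t operand_bytes)
{
    vm->pc += 1 + operand_bytes;
    vm->next = lookup(*vm->pc);
}

void do_bipush(vm_t *vm)
{
    assert(vm->sp < 16);
    vm->stack[vm->sp++] = (int8_t)vm->pc[1];  /* one operand byte */
    advance(vm, 1);
}

void do_iadd(vm_t *vm)
{
    assert(vm->sp >= 2);
    uint32_t b = (uint32_t)vm->stack[--vm->sp];
    uint32_t a = (uint32_t)vm->stack[--vm->sp];
    vm->stack[vm->sp++] = (int32_t)(a + b);   /* unsigned add avoids signed overflow */
    advance(vm, 0);
}

/* Clearing next is how this sketch leaves the loop. */
void do_halt(vm_t *vm) { vm->next = NULL; }

handler_fn lookup(uint8_t opcode)
{
    switch (opcode) {
    case OP_BIPUSH: return do_bipush;
    case OP_IADD:   return do_iadd;
    default:        return do_halt;  /* anything else stops the sketch */
    }
}

/* Runs until a handler clears next; returns the top of stack (0 if empty). */
int32_t run(const uint8_t *code)
{
    vm_t vm = { .pc = code, .next = lookup(code[0]), .sp = 0 };
    while (vm.next != NULL)
        vm.next(&vm);                /* stands in for "bxj r12" */
    return vm.sp ? vm.stack[vm.sp - 1] : 0;
}
```

Calling `run` on the bytes 0x10 2 0x10 3 0x60 0xfe pushes 2 and 3, adds
them, and stops at the marker.  Each handler knows how many operand
bytes follow its opcode, which is why the pointer update lives in the
handler rather than in the loop.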

ARM expects a Jazelle enabled JVM to have a software handler for all
byte codes.  The reason for this is that Jazelle can be enabled/disabled
by software via a bit in CPSR.  You can check whether it is
enabled/disabled by looking at a bit in CP14.  If Jazelle is disabled,
bxj r12 calls the software routine in r12.  As long as Jazelle is
enabled you should be able to execute any of the first 203 opcodes.  One
caveat is the floating point opcodes: they may require special handling
if no VFP is present.

It is implied that register r12 should always point into the JVM, either
to a software handler for an opcode or to an unhandled byte code
handler.  A simple implementation is to always load the address of the
same routine in r12, and use it for a jump table to execute any byte
code that hits it.  This however incurs the overhead of a comparison
and a couple of indirect jumps to process every opcode not handled by
hardware.
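
The single-routine approach could look roughly like the following C
sketch.  It is only a software model: `common_entry` plays the routine
whose address would always sit in r12, and `jump_table`,
`install_handler`, `handle_nop` and `handle_unhandled` are names
invented here.  The comparison and the indirect call are the overhead
mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { HANDLED = 0, UNHANDLED = -1 };

typedef int (*op_handler)(uint8_t opcode);

/* Example handlers; a real JVM would have one per software opcode. */
int handle_nop(uint8_t opcode)       { (void)opcode; return HANDLED; }
int handle_unhandled(uint8_t opcode) { (void)opcode; return UNHANDLED; }

/* One slot per possible byte value; empty slots stay NULL. */
static op_handler jump_table[256];

void install_handler(uint8_t opcode, op_handler fn)
{
    assert(fn != NULL);
    jump_table[opcode] = fn;
}

/* The one routine every non-hardware byte code lands in. */
int common_entry(uint8_t opcode)
{
    op_handler fn = jump_table[opcode];
    if (fn == NULL)              /* the comparison */
        fn = handle_unhandled;
    return fn(opcode);           /* the indirect jump */
}
```

After `install_handler(0x00, handle_nop)`, `common_entry(0x00)` reaches
the nop handler and any opcode without an entry falls through to
`handle_unhandled`.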

To alleviate this overhead, the patent also talks about a program
translation table, and the JVM's ability to program the table.  It is
implied the Jazelle hardware is able to look up the address of the
handler for a byte code in the Jazelle translation table more efficiently.

The patent isn't clear about the form this table takes, how to program
it, or if one is actually provided with the CPU core.   From the way the
patent is written, it seems possible to program the translation table
with a mapping between each byte code and the address of its handler
for the opcodes (in the range 203-253) supported by the JVM, and to
always load r12 with the address of an unhandled opcode exception handler.

The question is how does the Jazelle hardware know where to find the
translation table?  One thought is that the translation table base
address is provided in a register (RExec in the patent), then the
Jazelle hardware simply adds the bytecode value to this address and
jumps to the ARM code there. This would require that the translation
table is always 256 pointers long, but not that each of the pointers has
to point to a different piece of code. I.e. some could point to
emulation code and others to a single unhandled opcode handler.

However, the patent is specific: the translation table need not have
256 entries.  It could be a table with two fields per entry (the
opcode, and the pointer to the handling code).  It also says there is no
need to know how large this table is, as the hardware could just discard
the entries that you attempt to program after the table is full.  Those
opcodes not in the table will then just be handled by the software
routine pointed at by r12.
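
To make the pair-table idea concrete, here is a small C model of it.
This is a guess at the behaviour the patent describes, not a
description of real hardware: TABLE_CAPACITY, `translation_table`,
`table_program` and `table_lookup` are all invented for the sketch, and
the fallback argument plays the part of the routine in r12.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_CAPACITY 4   /* hypothetical size; software need not know it */

typedef void (*op_handler)(void);

typedef struct {
    uint8_t opcode;        /* byte code value */
    op_handler handler;    /* code that emulates it */
} table_entry;

typedef struct {
    table_entry entries[TABLE_CAPACITY];
    size_t used;
} translation_table;

/* Placeholder handlers so the sketch has something to point at. */
void sample_handler(void)   {}
void fallback_handler(void) {}

/* Programs one mapping; silently discarded once the table is full. */
void table_program(translation_table *t, uint8_t opcode, op_handler handler)
{
    assert(t != NULL && handler != NULL);
    if (t->used < TABLE_CAPACITY) {
        t->entries[t->used].opcode = opcode;
        t->entries[t->used].handler = handler;
        t->used++;
    }
}

/* Returns the programmed handler, or the fallback if there is no entry. */
op_handler table_lookup(const translation_table *t, uint8_t opcode,
                        op_handler fallback)
{
    for (size_t i = 0; i < t->used; i++)
        if (t->entries[i].opcode == opcode)
            return t->entries[i].handler;
    return fallback;
}
```

Programming six opcodes into this four-entry table keeps the first four
and drops the rest; lookups for the dropped ones simply return the
fallback, which matches the "handled by the routine in r12" behaviour
described above.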

I'm fairly certain that most of this will not require kernel support,
except for the access to the registers CPSR and CP14 which Simon pointed
out.  The patent seems to indicate that it is possible to run one or
more different Jazelle enabled JVMs on the same CPU.

The one thing that isn't obvious is how to get out of the Java
processing loop since each byte code handler loads the address of the
next byte code to process and immediately forces its execution.  The
patent specifies the Jazelle implementation reserves byte codes 0xfe and
0xff for its own use as the JVM specification allows.  Perhaps one of
these is used to signal the end of the opcode stream and allow a
controlled return to JVM control.  Don't know yet.

This is the problem the Jalimo presentation found (their code always
crashed after the last byte code).  I suspect the Jazelle implementation
just kept running, processing random junk as byte codes until it did
something to cause an exception.

To further the investigation I suggest the following:

a) Learn to enable/disable Jazelle.

b) Find out if a program translation table exists, and how to program
it.  I don't think this is strictly necessary to use the Jazelle
hardware, but would be a nice optimization if it can be made to work.

c) Try a longer byte stream.  For the moment you could use a bytecode
in the range 204-253 as a barrier indicating the end of the bytecode
stream so you can regain control.  I don't believe you can rely on the
unhandled opcode handler, as the byte immediately following the bytecode
buffer might be a valid opcode, which Jazelle will attempt to handle.

d) Then try two byte streams, and switch to the second when control
returns from the first to your processing loop.   This would be a more
real world example of a JVM reading a multi-megabyte Java program off
disk into a set of buffers for execution.

e) After that, play with the buffers so that the data for one opcode is
split between two buffers.  This should result in a prefetch abort which
will need to be handled (not sure how, but the patent specifically
mentioned it; my eyes were glazing over by that point).  Similar
problems will occur when floating point ops throw an exception (divide
by zero, NaN, etc).

The following describes an experiment I suggested:
Create an array with opcodes 204 to 255 in it.  Create one handler for
all opcodes.
Set up R14 to point to opcode 204.
Set up R12 to your handler.
Push the address you want to return to onto the stack.
Write your handler in C and printf to the console what opcode you are
handling as long as the opcode is <= 253.  Set up R14 to point to the
next opcode, and R12 to point to your handler.  For opcodes 254, 255 pop
the return address off the stack and continue.

I believe this will chew through all the opcodes in the array, dumping
output to the console until opcode 254 is encountered.  At that point
execution of Java bytecodes will stop.  This should occur whether
Jazelle is enabled or not.

Next put an iadd opcode in the middle of the array.  Create a special
case for this opcode and its data in your handler, and run the program
again.  If Jazelle is not enabled, you should see a printf for the iadd
opcode; if it is enabled, you won't.
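
Before trying this on the device, the expected outcome of both runs can
be modelled in plain C.  The sketch below does not use Jazelle at all:
the `jazelle_enabled` flag just pretends that opcodes below 203 are
swallowed by hardware, which is my reading of the above and not
verified behaviour.  `run_experiment` and `build_stream` are
hypothetical names, and placing the iadd after opcode 228 is an
arbitrary choice of "middle".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum { OP_IADD = 0x60, FIRST_SW_OPCODE = 203, OP_STOP = 254 };

/* Walks the stream the way the single handler would see it.
 * Returns how many opcodes reached the software handler. */
size_t run_experiment(const uint8_t *code, size_t len,
                      int jazelle_enabled, int *saw_iadd)
{
    size_t handled = 0;
    *saw_iadd = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t op = code[i];
        if (op >= OP_STOP)
            break;       /* 254/255: return to the caller */
        if (jazelle_enabled && op < FIRST_SW_OPCODE)
            continue;    /* assumption: hardware runs it, handler never sees it */
        if (op == OP_IADD)
            *saw_iadd = 1;   /* the special case for iadd */
        printf("handling opcode %u\n", (unsigned)op);
        handled++;
    }
    return handled;
}

/* Fills buf with 204..255 plus one iadd after 228; returns the length. */
size_t build_stream(uint8_t *buf, size_t cap)
{
    size_t n = 0;
    for (unsigned op = 204; op <= 255; op++) {
        assert(n + 2 <= cap);
        buf[n++] = (uint8_t)op;
        if (op == 228)
            buf[n++] = OP_IADD;
    }
    return n;
}
```

With a 64 byte buffer, `build_stream` produces 53 bytes.  In this model
the handler sees 51 opcodes (204-253 plus iadd) when the flag is off
and 50 when it is on, which is the difference the real experiment is
meant to reveal.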

Simon, you have done a great deal of work, and gotten quite far.   Good
work!

Scott