[maemo-developers] Java acceleration/Jazelle
From: Scott Bambrough sbambrough at storm.ca
Date: Wed Jul 18 00:40:47 EEST 2007
Folks,

This is a summary of a conversation Simon and I had off line. We decided it would be a good idea to post it here to the list so others could see the discussion and comment.

A couple of caveats to keep in mind. I haven't had a chance to compile and try the code yet; I've been reading the patent. I'm also not through the entire patent yet. This means the following could require revision.

Simon and I also agreed it would be worthwhile (and make Quim happy) if we started a Wiki page to condense our knowledge. Following email threads and pulling out the useful nuggets gets tedious when the thread gets long.

From what I've seen in the patent, the Jazelle hardware treats Java opcodes much like Thumb instructions. It switches to Jazelle mode and processes the Java opcodes directly in the CPU pipeline, in sequence with other Thumb and ARM opcodes. According to the patent, a program could start out executing 32 bit opcodes, switch to Thumb instructions to load a sequence of Java byte codes, switch to Jazelle mode and execute them, return to Thumb mode, then return to ARM mode and exit. The program is then really a sequence of three different types of opcodes (ARM 32, Thumb, Java).

The above is only meant to illustrate that Jazelle is not a coprocessor implementation like the old FPA11 FPU supported by the NWFPE in the kernel, and that execution of Java is interleaved with ARM and Thumb code. Once started, the processor basically keeps executing Java bytecodes until it is told to stop.

Simon pointed out that the actual transition has to be ARM->Java->ARM according to the processor manual (http://www.arm.com/pdfs/DDI0211I_arm1136_r1p3_trm.pdf). This is not what the patent suggests, but the manual will better describe the actual implementation.

My take on a sequence for a JVM bytecode processing loop is:

    load a stream of bytecodes into a buffer.
    load r14 with address of first bytecode to execute in the buffer
    load r12 with the address of the code to handle the bytecode
    bxj r12
    ....

The program then proceeds to run, executing the byte codes in the buffer. For this to happen, each handler for a particular byte code must:

a) load the address of the next byte code to execute
b) load the address of the software code to handle the next byte code
c) process the current opcode
d) call bxj r12 to loop and execute the next bytecode

An opcode handler thus looks like this:

    load r14 with address of next byte code to execute
    load r12 with the address of the software code to handle the byte code
    process the current opcode
    bxj r12

This type of architecture makes sense, as each opcode knows what data follows it in the byte code stream and can adjust the byte code pointer in r14 to point to the next opcode correctly. Basically, as long as r14 and r12 are filled before the bxj opcode is called, things should be fine. The patent author is a little long winded about interleaving the fills of these registers with the processing to avoid pipeline stalls. Fine, but this is a performance optimization that could be done afterwards.

ARM expects a Jazelle enabled JVM to have a software handler for all byte codes. The reason for this is that Jazelle can be enabled/disabled by software via a bit in CPSR. You can check whether it is enabled/disabled by looking at a bit in CP14. If Jazelle is disabled, bxj r12 calls the software routine in r12. As long as Jazelle is enabled you should be able to execute any of the first 203 opcodes. One caveat is the floating point opcodes; they may require special handling if no VFP is present.

It is implied that register r12 should always point into the JVM, either to a software handler for an opcode or to an unhandled byte code handler. A simple implementation is to always load the address of the same routine in r12, and use it as a jump table to execute any byte code that hits it.
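
As a rough, software-only model of the loop and the single-routine dispatch described above, here is a sketch in C. It does not use bxj or any Jazelle hardware. The names vm_t, common_handler and run_stream are hypothetical, and treating 0xfe as a stop marker is an assumption made purely so the loop has a way to end. The pc field stands in for r14, and common_handler stands in for the one routine whose address would always be loaded into r12.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_MAX 16

/* Hypothetical interpreter state; none of these names come from ARM. */
typedef struct {
    const uint8_t *pc;         /* stands in for r14: next bytecode */
    int32_t stack[STACK_MAX];  /* operand stack */
    int sp;                    /* number of values on the stack */
    int running;               /* cleared to leave the loop */
    int error;                 /* set on unhandled opcode or stack misuse */
} vm_t;

typedef void (*handler_fn)(vm_t *vm);

/* Standard JVM opcode values; using 0xfe as a stop marker is an assumption. */
enum { OP_NOP = 0x00, OP_ICONST_1 = 0x04, OP_BIPUSH = 0x10,
       OP_IADD = 0x60, OP_STOP = 0xfe };

static void fail(vm_t *vm) { vm->error = 1; vm->running = 0; }

static void push(vm_t *vm, int32_t v) {
    if (vm->sp < STACK_MAX) vm->stack[vm->sp++] = v; else fail(vm);
}

static int32_t pop(vm_t *vm) {
    if (vm->sp > 0) return vm->stack[--vm->sp];
    fail(vm);
    return 0;
}

/* Each handler knows how many operand bytes follow its opcode and
 * advances pc (the r14 stand-in) past them. */
static void op_nop(vm_t *vm)      { vm->pc += 1; }
static void op_iconst_1(vm_t *vm) { push(vm, 1); vm->pc += 1; }
static void op_bipush(vm_t *vm)   { push(vm, (int8_t)vm->pc[1]); vm->pc += 2; }
static void op_iadd(vm_t *vm) {
    int32_t b = pop(vm);
    int32_t a = pop(vm);
    /* Add as unsigned to get Java-style wraparound without overflow UB. */
    push(vm, (int32_t)((uint32_t)a + (uint32_t)b));
    vm->pc += 1;
}
static void op_stop(vm_t *vm)     { vm->running = 0; }

/* 256-entry jump table; entries left NULL mean "no handler". */
static const handler_fn jump_table[256] = {
    [OP_NOP] = op_nop, [OP_ICONST_1] = op_iconst_1,
    [OP_BIPUSH] = op_bipush, [OP_IADD] = op_iadd, [OP_STOP] = op_stop,
};

/* The single routine every bytecode lands in: one comparison, then
 * one indirect call through the table. */
static void common_handler(vm_t *vm) {
    handler_fn h = jump_table[*vm->pc];
    if (h == NULL) { fail(vm); return; }
    h(vm);
}

/* Runs until the stop marker or an error. Returns 0 on success and
 * stores the top of stack in *result if the stack is not empty;
 * returns -1 on an unhandled opcode or stack misuse. */
int run_stream(const uint8_t *code, int32_t *result) {
    vm_t vm = { .pc = code, .sp = 0, .running = 1, .error = 0 };
    while (vm.running)
        common_handler(&vm);
    if (vm.error)
        return -1;
    if (vm.sp > 0)
        *result = vm.stack[vm.sp - 1];
    return 0;
}
```

Calling run_stream on the bytes 0x10 0x05 0x04 0x60 0xfe (bipush 5, iconst_1, iadd, stop marker) leaves 6 in the result. In this model every bytecode goes through the one common routine and its table lookup.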
The single-routine approach, however, incurs the overhead of a comparison and a couple of indirect jumps for every opcode not handled by hardware.

To alleviate this overhead, the patent also talks about a program translation table, and the JVM's ability to program the table. It is implied that the Jazelle hardware can look up the address of the handler for a byte code in the Jazelle translation table more efficiently. The patent isn't clear about the form this table takes, how to program it, or whether one is actually provided with the CPU core.

From the way the patent is written, it is possible to program the translation table with a mapping between a byte code and the address of its handler for the opcodes (in the range 203-253) supported by the JVM, and to always load r12 with the address of an unhandled opcode exception handler. The question is: how does the Jazelle hardware know where to find the translation table?

One thought is that the translation table base address is provided in a register (RExec in the patent); the Jazelle hardware then simply adds the bytecode value to this address and jumps to the ARM code there. This would require that the translation table is always 256 pointers long, but not that each of the pointers has to point to a different piece of code. I.e. some could point to emulation code and others to a single unhandled opcode handler.

However, the patent is specific that the translation table need not have 256 entries. It could be a table with two entries per opcode (the opcode, and the pointer to the handling code). It also says there is no need to know how large this table is, as the hardware could just discard the entries that you attempt to program after the table is full. Those opcodes not in the table will then just be handled by the software routine pointed at by r12.

I'm fairly certain that most of this will not require kernel support, except for the access to the registers CPSR and CP14 which Simon pointed out.
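
The two-entries-per-opcode form of the translation table described above can be modelled in software as follows. This is only a sketch of the behaviour the patent text implies, not of any real Jazelle register interface: the names xlat_table_t, xlat_program and xlat_lookup, the capacity of 4, and the two example handlers are all hypothetical. Programming an entry into a full table is silently dropped, and any opcode without an entry falls back to the routine that r12 would point at.

```c
#include <stddef.h>
#include <stdint.h>

#define XLAT_CAPACITY 4  /* hypothetical; software is not meant to know it */

typedef int (*handler_fn)(uint8_t opcode);

/* One table slot: the opcode and the pointer to its handling code. */
typedef struct {
    uint8_t opcode;
    handler_fn handler;
} xlat_entry_t;

typedef struct {
    xlat_entry_t entries[XLAT_CAPACITY];
    size_t used;
    handler_fn fallback;  /* stands in for the routine in r12 */
} xlat_table_t;

void xlat_init(xlat_table_t *t, handler_fn fallback) {
    t->used = 0;
    t->fallback = fallback;
}

/* Program a mapping. An existing entry for the opcode is updated;
 * once the table is full, new entries are silently discarded. */
void xlat_program(xlat_table_t *t, uint8_t opcode, handler_fn handler) {
    for (size_t i = 0; i < t->used; i++) {
        if (t->entries[i].opcode == opcode) {
            t->entries[i].handler = handler;
            return;
        }
    }
    if (t->used < XLAT_CAPACITY) {
        t->entries[t->used].opcode = opcode;
        t->entries[t->used].handler = handler;
        t->used++;
    }
}

/* Look up the handler for an opcode; opcodes that are not in the
 * table resolve to the fallback routine. */
handler_fn xlat_lookup(const xlat_table_t *t, uint8_t opcode) {
    for (size_t i = 0; i < t->used; i++) {
        if (t->entries[i].opcode == opcode)
            return t->entries[i].handler;
    }
    return t->fallback;
}

/* Example handlers, purely illustrative. */
int handle_emulated(uint8_t opcode)  { return opcode; }
int handle_unhandled(uint8_t opcode) { (void)opcode; return -1; }
```

With a capacity of 4, programming opcodes 203 through 207 stores the first four; a lookup of 207, or of any opcode never programmed, returns the fallback handler.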
The patent seems to indicate that it is possible to run one or more different Jazelle enabled JVMs on the same CPU.

The one thing that isn't obvious is how to get out of the Java processing loop, since each byte code handler loads the address of the next byte code to process and immediately forces its execution. The patent specifies that the Jazelle implementation reserves byte codes 0xfe and 0xff for its own use, as the JVM specification allows. Perhaps one of these is used to signal the end of the opcode stream and allow a controlled return to JVM control. Don't know yet. This is the problem the Jalimo presentation found (their code always crashed after the last byte code). I suspect the Jazelle implementation just kept running, processing random junk as byte codes until it did something to cause an exception.

To further the investigation I suggest the following:

a) Learning to enable/disable Jazelle.

b) Find out if a program translation table exists, and how to program it. I don't think this is strictly necessary to use the Jazelle hardware, but it would be a nice optimization if it can be made to work.

c) Trying a longer byte stream. For the moment you could use a bytecode in the range 204-253 as a barrier indicating the end of the bytecode stream so you can regain control. I don't believe you can rely on the unhandled opcode handler, as the byte immediately following the bytecode buffer might be a valid opcode, which Jazelle will attempt to handle.

d) Then try two byte streams, and switch to the second when control returns from the first to your processing loop. This would be a more real world example of a JVM reading a multi-megabyte Java program off disk into a set of buffers for execution.

e) After that, play with the buffers so that the data for one opcode is split between two buffers. This should result in a prefetch abort which will need to be handled (not sure how, but the patent specifically mentioned it; my eyes were glazing over by that point).
Similar problems will occur when floating point ops throw an exception (divide by zero, NaN, etc).

The following describes an experiment I suggested:

Create an array with opcodes 204 to 255 in it.
Create one handler for all opcodes.
Set up R14 to point to opcode 204.
Set up R12 to point to your handler.
Push the address you want to return to onto the stack.
Write your handler in C and printf to the console what opcode you are handling, as long as the opcode is <= 253. Then set up R14 to point to the next opcode, and R12 to point to your handler.
For opcodes 254 and 255, pop the return address off the stack and continue.

I believe this will chew through all the opcodes in the array, dumping output to the console until opcode 254 is encountered. At that point execution of Java bytecodes will stop. This should occur whether Jazelle is enabled or not.

Next put an iadd opcode in the middle of the array. Create a special case for this opcode and its data in your handler, and run the program again. If Jazelle is not enabled, you should see a printf for the iadd opcode; if it is enabled, you won't.

Simon, you have done a great deal of work, and gotten quite far. Good work!

Scott