Tuesday, February 24, 2009

computers

my current computer at work is a lenovo laptop with a t2600, which is a 32bit dual core intel with 2,16ghz, 2mb l2 and 667mhz fsb. 3 gigs of ram, and one of those laptop harddrives. I've been running ubuntu on it. it's approaching 3 years of age so i'm due for an upgrade soon. yay.

people are either using macbook pro's with 2,6ghz c2ds or lenovos with t7700s with 4 gigs of ram. so there's not much of a difference performance wise.

my job doesn't really require me to have a laptop. if I change customer sites, I have a 24" screen I need to carry with me, I have a bunch of other laptops anyway and apple's crap job at releasing java6 is almost as bad as hpux. so I've been considering a desktop computer.

after all my crap about measuring performance I'd like to figure out if a nehalem, triple channel memory, 4 disk raid 10 on 3,5" sata drives would be significantly better for work stuff than a laptop. that is, running buildr, running maven, working in eclipse, running windows in virtualbox, having a jetty or two running, and a couple of other programs open.

we have about 5000 classes now. I wonder if I should work on breaking our projects down to smaller separate builds instead to cut down on the time spent building.

I have a feeling I am currently more compute bound and ram constrained than io bound, since our project's physical size is way smaller than the java memory footprint it occupies. but I haven't measured it.

Sunday, February 22, 2009

the purpose of all this low level mumbo jumbo is not just to enhance my self image as an eccentric dork. I love the way david moon keeps saying "but I haven't measured it".

the execution of java code is separated from it's execution in the execution units of the cpu by several layers of indirection: javac, the jvm jit compiler, the operating system, the instruction set architecture, the microarchitecture, on-die and off-die cpu caches. each of these utilize their own non-trivial optimizations.

so, while you can reason about the activities of any of these, if you want to know how they affect you, you really have to measure what happens. if results from measurements are unexpected, then something you know about the world between your .java file and the lowest level of hardware might help explain it.

as we are after real world performance, our measurements should happen at the appropriate level. if we are waiting for a web service to fetch our account data in XML format for 2 seconds over the network, it doesn't really matter what our registers do. most likely any optimizations we will have to make in response to measured and profiled performance problems will be at the higher levels of our code. without measurements, otherworldly pursuits of speculative gains are likely to hurt actual performance.

making code clean, readable and simple is virtually every time the most worthwhile optimization.

ctrl-d ctrl-d ctrl-d

Saturday, February 21, 2009

unrolling.

in the olden days I posted about unrolling loops. now I learned that besides the fact that we want to conserve instruction cache, the cpu itself can unfold the loop.

when the cpu breaks instructions into IOPs, µops or something similar, and when it does branch prediction and such, if it can determine the outcome of a decoded branch, it can either discard the branch instruction entirely from the code stream, or it can inline the probable branch into the code stream for speculative execution immediately after the branch instruction. even if the branch instruction stays in the code stream, if the processor has a separate branch unit, then the presence of the branch instruction will not affect the latency of the actual instructions inside the loop, it will "only" waste bandwidth in the different queues inside the processor.

the powerpc g4 was the first cpu to do branch folding. intel has published some data on the pentium 4, and it works a bit differently. more modern cpus probably have an even more advanced take on this.

--
updated.

the new intels have something called loop detection which will attempt to count the number of iterations a given loop takes. if the loop is repeated with the same number of iterations, even the loop exit is predicted correctly.

further, assembly language loops can (and should be) structured so that the condition and target of a branch can be predicted as early as possible before the branch is actually taken.

that is, while a loop in a high level language could be:
[Them,Others].each do { |target|
target.nuke
}
or
for (i=0; i<100; i++) {
launchNukes();
}
incrementing i and comparing it to <100 can be separated from launching more nukes by more code than is obvious from just the high level representation.

this probably looks more like:
...
i = 0
if not i < 100 jump to after_loop
set next_branch to loop_start
loop_start:
inc i
set status based on (i<100)
load warheads
acquire target
fuel missiles
press red button
if loop should continue jump to next_branch
after_loop:
...
yup.

Monday, February 16, 2009

derive while you derive

it turns out monkey has been cancelled.

and the javadocs have been deleted. but aptana has a copy. resources can't get into folders. only files.

you can access them with eclipse monkey. but it's not at all clear how.

here we go:

--- Came wiffling through the eclipsey wood ---
/*
* Menu: Derived > Make Targets Derived
* Kudos: miau
* License: EPL 1.0
* DOM: http://download.eclipse.org/technology/dash/update/org.eclipse.eclipsemonkey.lang.javascript
*/

function main() {
loadBundle("org.eclipse.core.resources");
var ResourcesPlugin = Packages.org.eclipse.core.resources.ResourcesPlugin;
var projects = ResourcesPlugin.getWorkspace().getRoot().getProjects();

out.println("Setting target directories to derived status.");

for ( var i = 0; i < projects.length; i++) {
var project = projects[i];
if (project.isOpen()) {
out.println("Project: " + project.getName());
var members = project.members();
recurseIntoMembers(members);
}
}

function recurseIntoMembers(folders) {
for ( var j = 0; j < folders.length; j++) {
if (folders[j].getType() == 2) {
recurseIntoMembers(folders[j].members());
if (folders[j].getName() == "target") {
out.println("setting derived status on: "
+ folders[j].getFullPath())
folders[j].setDerived(true);
}
}
}
}
}
--- And burbled as it ran! ---

Sunday, February 15, 2009

yo dawg.

I heard you like dependency injection so I put IOC in your tests so you can inject while you mock.

@RunWith(JDaveRunner.class)
public class MockitoSpec extends Specification<Void> {
@GuicedMock
private List<String> list;

@GuicedMock
@Named("hello")
private List<String> namedList;

private static class TargetClass {
@Inject
private List<String> list;

@Inject
@Named("hello")
private List<String> namedList;
}

private static class ConstructorInjection {
private final List<String> list;

@Inject
public ConstructorInjection(final List<String> list) {
this.list = list;
}
}

public class Injecting {
public void mocks() {
final Injector injector = Guice.createInjector(new Module() {
public void configure(final Binder binder) {
GuicedAnnotationEngine.binder.set(binder);
MockitoAnnotations.initMocks(MockitoSpec.this);
}
});
final TargetClass target = new TargetClass();
injector.injectMembers(target);
specify(target.list, is(list));
specify(target.namedList, is(namedList));
final ConstructorInjection constructorInjection = injector.getInstance(ConstructorInjection.class);
specify(constructorInjection.list, is(list));
}
}
}

and then...

import static java.lang.annotation.ElementType.FIELD;
import static java.lang.annotation.RetentionPolicy.RUNTIME;
import static java.util.Arrays.asList;

import java.lang.annotation.Annotation;
import java.lang.annotation.Retention;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Type;
import java.util.List;

import org.mockito.Mock;
import org.mockito.Mockito;

import com.google.inject.Binder;
import com.google.inject.BindingAnnotation;
import com.google.inject.Key;

public class GuicedAnnotationEngine implements AnnotationEngine {
public static ThreadLocal<Binder> binder = new ThreadLocal<Binder>();

public Object createMockFor(final Annotation annotation, final Field field) {
if (annotation instanceof Mock) {
return Mockito.mock(field.getType(), field.getName());
}
if (annotation instanceof GuicedMock) {
return bindMock(field.getType(), field);
}
return null;
}

private <T> T bindMock(final Class<T> type, final Field field) {
final T mock = Mockito.mock(type, field.getName());
final Annotation bindingAnnotation = getBindingAnnotation(field);
final Key<T> key = getKey(field.getGenericType(), bindingAnnotation);
binder.get().bind(key).toInstance(mock);
return mock;
}

private Annotation getBindingAnnotation(final Field field) {
final List<Annotation> annotations = asList(field.getAnnotations());
for (final Annotation annotation : annotations) {
final Class<? extends Annotation> annotationType = annotation.annotationType();
if (annotationType.isAnnotationPresent(BindingAnnotation.class)) {
return annotation;
}
}
return null;
}

@SuppressWarnings("unchecked")
private <T> Key<T> getKey(final Type genericType, final Annotation bindingAnnotation) {
if (bindingAnnotation == null) {
return (Key<T>) Key.get(genericType);
}
return (Key<T>) Key.get(genericType, bindingAnnotation);
}

@Target(FIELD)
@Retention(RUNTIME)
public static @interface GuicedMock {
}
}

one more thing...

sourcepath="M2_REPO/something/something"

may not be an absolute path but sourcepath="/M2_REPO/something/something" certainly is.

eclipse derived resources

update: a more relevant posting.

ctrl-shift-r can filter out derived resources.

the derived status of a resource can only be set with context-click.. properties.. check derived, click ok.

i.e. it is not stored anywhere in the file system in a human readable format. this can be annoying, for example when the derived status on your targets gets reset and your project has 8 of them.

but wait, you can set them using javascript.

so.

1. install monkey with the eclipse update url:
http://download.eclipse.org/technology/dash/update/

2. copy the script into the clipboard, including the "came wiffling.." and "burbling.." parts.

3. choose scripts.. paste new script.

4. mind you that the dom url has changed.

it is now: http://download.eclipse.org/technology/dash/update/org.eclipse.dash.doms

ctrl-alt-m runs the latest script. also, you need to have the project open which contains the script. but you don't have to have your script in the eclipse monkey scripts project.

luckily we only use maven half of the time, but we have target directories all the time.
so I edited the script a bit. but first, what does it do:
var files = resources.filesMatching(”.*/pom\.xml”);

var targetFolder;
for each( file in files ) {
if (targetFolder = file.eclipseObject.parent.findMember(”target”)) {
targetFolder.setDerived(true);
}
}

as far as i can tell. it looks for files named pom.xml, then it looks through the directory containing pom.xml for files (directories) named target, then it sets them to derived status.
in other words, it doesn't parse the pom.xml in any way. so you don't have to worry about maven.

so now my script looks like:
/*
* Menu: Derived > Make Targets Derived
* Kudos: Donnchadh and others
* License: EPL 1.0
* DOM: http://download.eclipse.org/technology/dash/update/org.eclipse.eclipsemonkey.lang.javascript
*/

function main() {
var files = resources.filesMatching(".*/*");

for each( file in files ) {
if (targetFolder = file.eclipseObject.parent.findMember("target")) {
targetFolder.setDerived(true);
}
}
}

the Menu: bit in the comment tells Eclipse Monkey what to name the menu item and where to place it under Scripts...

why I can't do resources.filesMatching(".*/target"), I don't know. but this takes less than a wink on a small project so even though * seems excessive, no measurements support further optimization requirements.

Saturday, February 14, 2009

eclipse type filter

you only use the Label org.apache.wicket.markup.html.basic.Label?

eclipse has type filters. fire up preferences and type type filters where it says type filters. add all those hierarchies that have generic classes such as Label, Link, Message etc. even the sample they suggest is java.awt.*




miau.biz now with pictures.

Tuesday, February 10, 2009

times 31

reading effective java 2nd ed. published in 2008, josh bloch writes that multiplication by 31 may be optimized by the runtime into an shl 5 and a sub.

certainly no, I thought. when I was a kid there were no pipelines, memory was fast and assembler was executed in order. the vga screen was 320 by 200 pixels with 256 indexed colours. imul took about 10 cycles. shifting, adding and substraction just one, so to find the offset of the pixel in row Y, column X in screen memory we needed Y*320+X.

so Y SHL 6, grab that, the result SHL 2, add the two together, Y*64 + Y*256 = Y*320.

but then came the pentium pro. and all you cared about was keeping the cpu busy, avoiding cache misses, making branches predictable and keeping the pipeline full. and certainly no 5 instruction multiplication.

so let's see who is right. me or josh.

public static void main(String[] args) {
int j=0;
for(int i=0;i<10000000;i++)
{ j+=i*31; }
System.out.printf("j is: %d",j);
}
and
/home/me/jdk7/bin/java -server -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:CompileCommand=option,\*MiauIsAwesome.main,PrintNMethods MiauIsAwesome
on a core2, out comes:
0xb42231d6: shl $0x5,%ebp
0xb42231d9: sub %edi,%ebp
let's change 31 to 77. that would 64 + 8 + 4 + 1.
0xb420d268: imul $0x4d,%esi,%ebx
ok.

so Josh is right. atleast it could happen. whether this would be the case in real life code is different, because here the entire loop is so short that everything fits in l1, the branch prediction always succeeds, and even the named registers suffice for the data:

0xb420d268: imul $0x4d,%esi,%ebx
0xb420d26b: add %ebx,%edi
0xb420d26d: inc %esi
0xb420d26e: cmp $0x989680,%esi
0xb420d274: jl 0xb420d268

or:

0xb4223230: mov %edi,%ebx
0xb4223232: shl $0x5,%ebx
0xb4223235: sub %edi,%ebx
0xb4223237: add %ebx,%esi
0xb4223239: inc %edi
0xb422323a: cmp $0x989680,%edi
0xb4223240: jl 0xb4223230

there are a gazillion ways to do all this in assembler beyond these two. so hotspot might try a bunch of different ways before deciding.

btw imul on p4 was 10 cycles, and still 3 on core2.

Wednesday, February 4, 2009

print assembly

well,

on your ubuntu 8.10.

get your jdk7 binaries:

get binutils-2.17 sources

then remember to export CFLAGS="-fPIC -Wno-error"

make your old binutils.

get your hsdis sources to openjdk/jdk7/hotspot/hotspot/file/tip/src/share/tools/hsdis/

patch them.

export BINUTILS=/home/you/Desktop/binutils-2.17

make hsdis.

export LD_LIBRARY_PATH=/home/you/openjdk/jdk7/hotspot/hotspot/file/tip/src/share/tools/hsdis/bin/linux
Linkki
/home/you/jdk7/bin/java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly YourClass

enjoy!

0xb55e844a: movl $0x0,(%esp)
0xb55e8451: mov %ecx,0x4(%esp)
0xb55e8455: mov %eax,%ecx
0xb55e8457: mov %eax,0x64(%esp)
0xb55e845b: call 0xb559df50 ; OopMap{[60]=Oop [100]=Oop [116]=Oop [124]=Oop off=2352}
;*invokespecial
; - java.lang.StringBuffer::toString@13 (line 561)
; - java.io.BufferedReader::readLine@70 (line 319)
; {optimized virtual_call}
0xb55e8460: lea 0x78(%esp),%eax
0xb55e8464: mov 0x4(%eax),%esi


Tuesday, February 3, 2009

building fast

our build takes roughly 7 minutes to complete. that includes about 2,5 minutes of compiling and copying resources, and 4,5 minutes of running tests.

building is completely serial. I will assume a simplified view of building consisting only two steps: compilation and testing. given a tree of sub-projects where a and b depend on c the following takes place:
  1. compile c
  2. test c
  3. compile a
  4. test a
  5. compile b
  6. test c
that's fine if each of the six steps consumes all the resources available. but if they don't, we could get the build completed faster with the following organization:
  1. compile c
  2. test c, compile a, compile b
  3. test a, test b
if we had unlimited resources, we could also do the following wasteful but simpler solution:
  1. compile c, compile c, compile c
  2. test c, compile a, compile b
  3. test a, test b
further, each unit test should be independent of all others. that means no side-effects: no changing global state, no files, no threadlocals, etc. if unit tests were completely independent of each other and we had unlimited resources we could start them all at the same time.

when running in your IDE it might be fine to set a system property, alter a static, write to a database that's trashed at the end of the test run. when running in a build tool with all tests running consecutively in the same VM you will get nondeterministic results unless you politely return everything to the way it was in the After or destroy section of your tests. You might want to call Application.unset() in destroy.

in case you don't have unlimited computing resources available but you have more than you are currently using, you might want to start several tests in parallel.

when running tests in parallel, changes to global state are visible to others before your destroy gets a chance to run. stuff that works fine in a sequential environment, setting a static, committing a transaction, creating a table, setting a system.property can all cause weird results when running tests in parallel. and this is all on the test side. so the failures are not indicative of threading problems in production code.

i tried this to see how it would actually work. we have about 1000 test classes and 5000 test methods on our project. the results are, well, undeterministic.

we have some statics, and some system properties. the statics can be gotten around with thread locals that are cleaned afterwards. the system properties can be gotten rid of by making their use more explicit, introducing them as parameters which the test hands to the production class, and having the code that reads the system property either trivial enough to require no automated tests, or tested through another mechanism. you could even have all tests that alter system properties acquire a global lock for the duration of the test.

the biggest problem is h2. we have h2 tests for hibernate stuff. we also have tests that require transactions to be committed, so we can't isolate our tests with a simple rollback in the destroy() method everytime. we also have warp on top of hibernate.

I tried a bunch of stuff like appending System.nanotime() to the database url, setting a non default schema. which requires you to manage the creation of the tables yourself. (also called the schema). I didn't get through to the end on this one, but Warp has a singleton in it somewhere, and when I tried to avoid it, the tests would at some point not be able to get a session anymore. and this was given in a Guice enhanced class. so I did something a lot simpler.

In the end, I decided on creating a static lock that tests using H2 acquire in create() and release in destroy(). If you are running a bunch of tests at the same time, it won't matter if some threads are contending on the lock for H2. Of course YMMV, especially depending on how many tests use H2, and how long they take to run, if you have your tests queued alphabetically and if your H2 testclasses all start with the word Hibernate. also note that the static lock is roughly equally visible as the H2 and Warp's static sessionfactory (or whatever it really is). therefore if you have several VM's running in parallel, each of them can have its own lock, and you can run as many H2 tests in parallel as you have VMs.

I used my own buildr "framework". (for example 'junit' is a framework). the framework figured out which tests to run, rsync'ced the compilation target directories and their dependencies to worker machines on the same network, queued the tests, ran them, gathered the results and reported back to buildr. at first I did it like so, with the projects from the beginning of this post:
  1. compile c
  2. run tests for c on grid
  3. compile a
  4. run tests for a on grid
  5. compile b
  6. run tests for b on grid
and then as a second try:
  1. compile c
  2. compile a
  3. compile b
  4. run tests for a, b and c on grid
I never got to the solution I suggested at the top however.

For the 7 minute build, the first way got me through in 14 minutes and the second way got me through in about 8 minutes with 2 worker machines.

I started a master node to queue the tests and gather the results on the computer running buildr and workers on the other side to figure out which tests to run and run them. I used terracotta and blockingqueues for the messaging.

so far, it is my understanding that to speed up a build it would be most interesting to figure out a way to have buildr run every possible thing in parallel. it may be worthwhile to specify the order to run things in, the possible parallelizations, and required serializations explicitly, rather than trying to figure out a generic way. if more than one computer should be used during the build, figure out a lightweight way to handle communication, especially with very little overhead in the startup phase of testing. maybe jinterface.