AOTMadeFaster

From APIDesign

Revision as of 08:38, 30 August 2019 by JaroslavTulach (Talk | contribs)

This is the draft of my first Medium post, kept here in case Medium closes down.

The Native Image tool rightfully attracts a lot of attention, as it offers significant improvements in startup speed and overall memory usage. However, if you write benchmarks (for example with JMH) to evaluate peak performance, you may observe that the native image is slower, sometimes even several times slower.

Honestly, that is the expected behavior! The problem isn't in Native Image, but in misguided expectations. Let's take a quick tour of the pros and cons of Native Image. Then let's use GraalVM EE 19.2.0 profile-guided optimization to mitigate some of the limitations.



Benefits of Native Image

Are you looking for a system with fast startup, low system requirements and multi-threaded communication? Do you want to code in a higher-level language than C? Do you want to enjoy the benefits of memory safety and automatic garbage collection? A few languages are designed specifically for this niche, and you may be considering rewriting your application in one of them. However, if you live in the JVM world, there is an easier option: use GraalVM's Native Image tool!

The Native Image tool takes the bytecode of your application and compiles it ahead of time into a native executable. As a result, one can code in any JVM language (Java, Scala, Kotlin) and get a single self-contained file as the output. A single file has many benefits. It can easily be copied from one system to another by itself, because it contains all the application code as well as the necessary runtime support (think of the garbage collector). A single file gets loaded and is ready to run: there is no need to search for various JARs, properties and other files and wait for them to open, load and initialize. The file generated by Native Image gives us instant startup.

In addition, the Native Image tool is able to capture a snapshot of the application's memory. You can bring your system into a ready-to-run state, and when the generated native executable is started, it continues exactly from where it was. This eliminates repetitive initialization and makes startup even faster.
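The memory snapshot idea can be pictured with a small sketch. The class name Primes and the table it computes are made up for illustration; the point is only that state built in a static initializer is the kind of "ready-to-run state" that could be stored in the executable instead of being recomputed at every start. Whether a given class is really initialized while the image is built depends on the GraalVM version and its options (for example --initialize-at-build-time), so treat this as an assumption to verify against the documentation of your release.

```java
// Hypothetical example class, not part of the Shape demo.
class Primes {
    // Built once in the static initializer. On a regular JVM this work is
    // repeated at class-load time on every start of the application.
    static final int[] FIRST_PRIMES = sieve(1000);

    // Sieve of Eratosthenes: returns all primes below the given limit.
    static int[] sieve(int limit) {
        boolean[] composite = new boolean[limit];
        int count = 0;
        for (int i = 2; i < limit; i++) {
            if (!composite[i]) {
                count++;
                for (long j = (long) i * i; j < limit; j += i) {
                    composite[(int) j] = true;
                }
            }
        }
        int[] primes = new int[count];
        int idx = 0;
        for (int i = 2; i < limit; i++) {
            if (!composite[i]) {
                primes[idx++] = i;
            }
        }
        return primes;
    }

    public static void main(String[] args) {
        // The table is already available when main starts.
        System.out.println("primes below 1000: " + FIRST_PRIMES.length);
    }
}
```

If the initializer runs while the image is being built, the executable starts with the table already in memory; if not, the code still behaves the same, only the startup saving is lost.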

Another benefit of ahead-of-time compilation is lower memory consumption. A classical JVM keeps an enormous amount of metadata in addition to the JIT-generated native code. This metadata is needed to be able to de-optimize at almost any moment. Nothing like that is needed in the case of Native Image: the generated code covers all possible code paths and never de-optimizes. The native code alone is enough, so all this metadata can be dropped when the native executable is generated.

On top of all the above goodies, Native Image preserves the most important aspects of the JVM. One can use a language of one's own choice, be it Java, Scala, Kotlin, etc. One can benefit from all the development tools available for the JVM. One can rely on the strong concurrency guarantees of the JVM, and one doesn't need to manage memory by hand thanks to garbage collection. The rich JVM ecosystem, full of useful libraries, tools and frameworks, is waiting to be compiled ahead of time.


Limitations of Native Image

The previous text might make you believe that Native Image is great and should replace the Java HotSpot VM immediately. That would not be accurate. The benefits brought by Native Image aren't free; they come at a cost. There are aspects where Native Image limits its users more than the classical Java HotSpot VM would.

Obviously, the native executable can only run on a single platform. If you generate the image for 64-bit Linux, it only runs on 64-bit Linux. If for Mac, it runs on Mac. If the executable is generated for Windows, it runs only on Windows. Portability is restricted compared to a classical JAR file.

Another limitation is caused by the missing metadata at run time. The previous section mentioned the missing metadata as a benefit, but it also has its cost. By dropping information about classes and methods, one's ability to perform reflection is limited. Reflection is still possible, but it has to be configured and compiled into the native executable. As many Java frameworks rely on reflective access, getting them to run on Native Image may require additional configuration.

Yet another restriction comes from the fact that the Native Image runtime may not support all features of Java. Running the Swing UI toolkit may not be possible, as it is too dynamic. On the other hand, Native Image has successfully managed to execute Javac, Netty, Micronaut, Helidon and Fn Project, all large and nontrivial applications running on top of the JDK.

The last drawback associated with ahead-of-time compilation is speed. What? I thought Native Image starts faster! Well, it does start significantly faster than a similar JVM application, but in the end, when the application runs for a long time, the just-in-time compiler can actually outperform the AOT one. As the Helidon.io team put it:
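To make the reflection limitation concrete, here is a minimal sketch of the kind of code that causes trouble. The names ReflectiveCall, Greeter and invokeByName are hypothetical. The class and method to call arrive as strings at run time, so an ahead-of-time compiler cannot tell from the code alone which classes must keep their metadata. On a regular JVM this simply works; for a native executable the classes reached this way would have to be registered for reflection when the image is built (how exactly depends on the GraalVM version and is not shown here).

```java
import java.lang.reflect.Method;

// Hypothetical example, not part of the Shape demo.
class ReflectiveCall {
    // A class that is only ever reached through reflection.
    public static class Greeter {
        public String greet(String who) {
            return "Hello, " + who;
        }
    }

    // Looks up a class and a one-argument method purely by name and calls it.
    static Object invokeByName(String className, String methodName, String arg) throws Exception {
        Class<?> type = Class.forName(className);
        Object instance = type.getDeclaredConstructor().newInstance();
        Method method = type.getMethod(methodName, String.class);
        return method.invoke(instance, arg);
    }

    public static void main(String[] args) throws Exception {
        // In a real framework the name would come from a configuration file
        // or an annotation scan; here it may be passed on the command line.
        String className = args.length > 0 ? args[0] : Greeter.class.getName();
        System.out.println(invokeByName(className, "greet", "native image"));
    }
}
```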

"On the other hand, everything is always a tradeoff. Long running applications on traditional JVMs are still demonstrating better performance than GraalVM native executables due to runtime optimization. The key word here is long-running; for short-running applications like serverless functions, native executables have a performance advantage. So, you need to decide yourself between fast startup time and small size (and the additional step of building the native executable) versus better performance for long-running applications."


Now we are getting to the main topic of this post. Let's take a look at why the peak performance of AOT-compiled code is lower, and then let's speed it up!


There is no Free Lunch!

Native Image gives you certain gains at the cost of other losses. By removing most of the metadata typically associated with JVM execution, Native Image gives up on further optimizations based on execution profiles. The ahead-of-time generated code is what one gets. There is no chance to do more inlining, to co-locate code on hot paths, or to optimize aggressively and rely on a trap to signal the need for de-optimization and a less optimal recompilation. These are exactly the optimizations that make the JVM so great at reaching excellent peak performance. During ahead-of-time compilation, Native Image doesn't have enough information to generate such optimal code.

On the other hand, there is no need for initial interpretation of the bytecode. There is no need for deoptimizations, and there is no support for arbitrary reflection poking around your classes. As a result, for a short-lived application the native image starts faster and uses less memory overall. The benefits are huge, but everything comes at some cost. There is no free lunch. Or is there?

Improving Peak Performance of Native Image

A commonly used technique to mitigate the missing just-in-time optimization is to gather execution profiles during one run and then use them to optimize subsequent compilations. With great pleasure we can announce that GraalVM EE 19.2.0 comes with such a Profile-Guided Optimization (PGO) system. Let's demonstrate its functionality on a classical object-oriented demo application that works with geometric shapes:

abstract class Shape {
    public abstract double area();
 
    public static Shape cicle(double radius) {
        return new Circle(radius);
    }
 
    public static Shape square(double side) {
        return new Square(side);
    }
 
    public static Shape rectangle(double a, double b) {
        return new Rectangle(a, b);
    }
 
    public static Shape triangle(double base, double height) {
        return new Triagle(base, height);
    }
 
    static class Circle extends Shape {
        private final double radius;
 
        Circle(double radius) {
            this.radius = radius;
        }
 
        @Override
        public double area() {
            return Math.PI * Math.pow(this.radius, 2);
        }
    }
 
    static class Square extends Shape {
        private final double side;
 
        Square(double side) {
            this.side = side;
        }
 
        @Override
        public double area() {
            return Math.pow(side, 2);
        }
    }
 
    static class Rectangle extends Shape {
        private final double a;
        private final double b;
 
        Rectangle(double a, double b) {
            this.a = a;
            this.b = b;
        }
 
        @Override
        public double area() {
            return a * b;
        }
    }
 
    static class Triagle extends Shape {
 
        private final double base;
        private final double height;
 
        Triagle(double base, double height) {
            this.base = base;
            this.height = height;
        }
 
        @Override
        public double area() {
            return base * height / 2;
        }
    }
}

The above program introduces the abstract Shape class and its four implementations: Circle, Square, Rectangle and Triagle (the triangle). The base class declares the area() method, and each of the geometric classes overrides it with an implementation suitable for its shape. Those who know how object-oriented languages are implemented can already smell the problem. Right: if we create an array of shapes and iterate over it, the code has to be ready for virtual method dispatch. Let's do it:

static double computeArea(Shape[] all) {
    double sum = 0;
    for (Shape shape : all) {
        sum += shape.area();
    }
    return sum;
}

The array can contain instances of any shape, so the call shape.area() has to be able to invoke any of the actual methods. That's usually done with a virtual method table associated with each geometric class: find out that the current shape is a Circle, look up the actual implementation of Circle.area() and call it. Doing this in a generic way certainly requires some extra work. To demonstrate that, let's generate a huge array of random objects and measure how much time invoking the computeArea method takes:

public static void main(String[] args) throws Exception {
    int cnt = Integer.parseInt(args[0]);
    long seed = Long.parseLong(args[1]);
    int repeat = Integer.parseInt(args[2]);
    Shape[] samples = generate(3, args, cnt, seed);
 
    double expected = computeArea(samples);
    long prev = System.currentTimeMillis();
    for (int i = 0; i < repeat * 1000; i++) {
        double sum = computeArea(samples);
        if (sum != expected) {
            throw new IllegalStateException();
        }
        if (i % 1000 == 0) {
            prev = System.currentTimeMillis();
        }
    }
    System.err.println("last round " + (System.currentTimeMillis() - prev) + " ms.");
}
 
static Shape[] generate(int offset, String[] types, int count, long seed) {
    java.util.Random r = new java.util.Random(seed);
    Shape[] arr = new Shape[count];
    for (int i = 0; i < arr.length; i++) {
        String t = types[offset + i % (types.length - offset)];
        Shape s;
        switch (t) {
            case "circle":
                s = Shape.cicle(r.nextDouble());
                break;
            case "rectangle":
                s = Shape.rectangle(r.nextDouble(), r.nextDouble());
                break;
            case "square":
                s = Shape.square(r.nextDouble());
                break;
            case "triangle":
                s = Shape.triangle(r.nextDouble(), r.nextDouble());
                break;
            default:
                throw new IllegalStateException("" + t);
        }
        arr[i] = s;
    }
    return arr;
}

If you put all the above code into a file named Shape.java, with the computeArea, main and generate methods placed inside the Shape class (do it in an empty directory), you can compile it with GraalVM's Native Image tool:

# install native-image via gu tool, if not yet installed
$ /graal-ee-19.2.0/bin/gu install native-image
 
$ /graal-ee-19.2.0/bin/javac Shape.java
$ /graal-ee-19.2.0/bin/native-image Shape
$ ls -1
graal-ee-19.2.0
shape
'Shape$Circle.class'
'Shape$Rectangle.class'
'Shape$Square.class'
'Shape$Triagle.class'
Shape.class
Shape.java

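A shape executable has been generated. When you run it, it is completely standalone, starts fast and requires little memory, but it isn't tuned for peak performance. The arguments are the number of shapes, the random seed, the number of rounds (1000 invocations of computeArea each) and the shape types to generate. Try it: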

$ ./shape 15000 43243223423 30 square rectangle
last round 35 ms.
$ ./shape 15000 43243223423 30 triangle circle
last round 34 ms.

The actual execution time varies with the speed of your computer. The absolute values do not matter much; we just want to make the execution faster. Let's train our program for squares and rectangles. To do so, we need to capture data about the actual program execution. Let's generate the PGO data:

$ /graal-ee-19.2.0/bin/java -Dgraal.PGOInstrument=shape.iprof Shape 15000 43243223423 130 square rectangle

The shape.iprof file is generated once the execution is over. If you inspect its content, you will find a reference to Shape$Square but no reference to Shape$Circle. Of course: we've been training the program on squares and rectangles, not circles! The absence of Shape$Circle from shape.iprof signals that the training captured the intended workload. Let's now use the data and regenerate our native image:

$ /graal-ee-19.2.0/bin/native-image --pgo=shape.iprof Shape
$ ./shape 15000 43243223423 30 square rectangle
last round 25 ms.

Speedup! Instead of 35 ms, the trained program now finishes the last round in 25 ms. Just by training it, recording the compiler decisions and using them to guide the compilation, we have cut the time of a round by almost 30% (10 ms out of 35 ms).

Of course, the speedup is only visible when the real workload resembles the one we've been training for. If the program input diverges and the execution gets onto the non-optimized paths, it can actually be slower than without any profile:

$ ./shape 15000 43243223423 30 triangle circle
last round 49 ms.

Should something like that happen, it is time to re-profile your application, gather new PGO data and recompile.
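One way to picture both results is a hand-written analogue of profile-guided specialization. This is not the code Native Image emits; it is a sketch under the assumption that the profile tells the compiler that Square and Rectangle dominate at the call site, so their bodies are inlined behind cheap type checks and everything else falls back to the generic virtual call. The class GuardedDispatch and its small shape hierarchy are made up so that the example is self-contained.

```java
// Hypothetical illustration of guarded, profile-driven inlining.
class GuardedDispatch {
    static abstract class Shape {
        abstract double area();
    }

    static final class Square extends Shape {
        final double side;
        Square(double side) { this.side = side; }
        @Override double area() { return side * side; }
    }

    static final class Rectangle extends Shape {
        final double a;
        final double b;
        Rectangle(double a, double b) { this.a = a; this.b = b; }
        @Override double area() { return a * b; }
    }

    static final class Circle extends Shape {
        final double radius;
        Circle(double radius) { this.radius = radius; }
        @Override double area() { return Math.PI * radius * radius; }
    }

    // Generic version: every call goes through virtual dispatch.
    static double computeAreaGeneric(Shape[] all) {
        double sum = 0;
        for (Shape shape : all) {
            sum += shape.area();
        }
        return sum;
    }

    // Specialized version: the types seen during training are checked first
    // and their area() bodies are inlined by hand; other shapes pay for the
    // failed checks and then take the generic virtual call.
    static double computeAreaSpecialized(Shape[] all) {
        double sum = 0;
        for (Shape shape : all) {
            if (shape instanceof Square) {
                Square s = (Square) shape;
                sum += s.side * s.side;
            } else if (shape instanceof Rectangle) {
                Rectangle r = (Rectangle) shape;
                sum += r.a * r.b;
            } else {
                sum += shape.area();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(2), new Rectangle(3, 4), new Circle(1) };
        System.out.println(computeAreaGeneric(shapes) == computeAreaSpecialized(shapes));
    }
}
```

Both methods return the same sum. The difference is only in cost: squares and rectangles avoid the virtual call, while triangles and circles first fail two type checks, which is one plausible reason a workload outside the training set can end up slower than the unprofiled build.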


Conclusions

It is well known that Native Image gives you quick startup and low memory requirements. GraalVM EE 19.2.0 brings a simplified way to use PGO: with its help it is possible to train your application for specific workloads and significantly improve its peak performance. Download GraalVM EE 19.2.0 and try it yourself.

GraalVM is great and we are working on making it even better!
