37

If we need a different JVM for each architecture, I can't see the logic behind introducing this concept. In other languages we need a different compiler for each machine; in Java we require a different JVM for each machine instead. So what is the logic behind introducing the concept of a JVM, i.e. this extra step?

  • 1
    Possible duplicate of Compilation to bytecode vs machine code
    – gnat
    Commented Apr 3, 2017 at 14:18
  • 12
    @gnat: Actually, that's not a duplicate. This is "source vs byte code", i.e. just the first transformation. In language terms, this is Javascript versus Java; your link would be C++ versus Java.
    – MSalters
    Commented Apr 3, 2017 at 20:12
  • 2
Would you rather write one simple bytecode interpreter for each of those 50 appliance models you're upgrading with digital code, or 50 compilers for 50 different pieces of hardware? Java was originally developed for appliances and machinery; that was its strong suit. Keep that in mind when reading these answers, as Java has no true advantage nowadays (due to the inefficiency of the interpreting process). It's just a model we continue to use.
    – user64742
    Commented Apr 4, 2017 at 4:06
  • 1
You seem to not understand what a virtual machine is. It's a machine. It could be implemented in hardware with native-code compilers (and in the case of the JVM it has been). The 'virtual' part is what's important here: you're essentially emulating that architecture on top of another one. Say I wrote an 8088 emulator to run on x86. You aren't going to port the old 8088 code to x86; you're just going to run it on the emulated platform. The JVM is a machine you target like any other, the difference being that it runs on top of the other platforms. Commented Apr 4, 2017 at 13:32
  • 7
    @TheGreatDuck Interpreting process? Most JVMs nowadays do just-in-time compilation to machine code. Not to mention that "interpretation" is a pretty broad term nowadays. The CPU itself just "interprets" the x86 code into its own internal microcode, and it's used to improve the efficiency. The latest Intel CPUs are exceedingly well suited to interpreters in general too (though you'll of course find benchmarks to prove whatever you want to prove).
    – Luaan
    Commented Apr 4, 2017 at 16:57

8 Answers

79

The logic is that JVM bytecode is a lot simpler than Java source code.

Compilers can be thought of, at a highly abstract level, as having three basic parts: parsing, semantic analysis, and code generation.

Parsing consists of reading the code and turning it into a tree representation inside the compiler's memory. Semantic analysis is the part where it analyzes this tree, figures out what it means, and simplifies all the high-level constructs down to lower-level ones. And code generation takes the simplified tree and writes it out into a flat output.

With a bytecode file, the parsing phase is greatly simplified, since it's written in the same flat byte stream format that the JIT uses, rather than a recursive (tree-structured) source language. Also, a lot of the heavy lifting of the semantic analysis has already been performed by the Java (or other language) compiler. So all it has to do is stream-read the code, do minimal parsing and minimal semantic analysis, and then perform code generation.

This makes the task the JIT has to perform a lot simpler, and therefore a lot faster to execute, while still preserving the high-level metadata and semantic information that makes it possible to theoretically write single-source, cross-platform code.
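To make this concrete, here is a trivial method (the name is invented for illustration) and, approximately, the bytecode that javap -c prints for its body. Note how the nested source expression becomes a flat, linear instruction stream that a JIT can read front to back:

// Source: a small method with a nested expression
int add(int a, int b) {
    return (a + b) * 2;
}

// Roughly what `javap -c` shows for it (stack-machine code, read linearly):
//   0: iload_1      // push a
//   1: iload_2      // push b
//   2: iadd         // a + b
//   3: iconst_2     // push the constant 2
//   4: imul         // (a + b) * 2
//   5: ireturn      // return the top of the stack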

  • 7
    Some of the other early attempts at applet distribution, such as SafeTCL, did actually distribute source code. Java's use of a simple and tightly-specified bytecode makes the verification of the program much more tractable, and that was the hard problem that was being solved. Bytecodes such as p-code were already known as part of the solution to the portability problem (and ANDF was probably in development at the time). Commented Apr 3, 2017 at 18:10
  • 9
    Precisely. Java start up times are already a bit of an issue because of the bytecode -> machine code step. Run javac on your (non-trivial) project, and then imagine doing that entire Java -> machine code on every start up. Commented Apr 3, 2017 at 19:22
  • 24
    It has one other huge benefit: if someday we all want to switch to a hypothetical new language - let's call it "Scala" - we only need to write one Scala --> bytecode compiler, rather than dozens of Scala --> machine code compilers. As a bonus, we get all the JVM's platform-specific optimizations for free. Commented Apr 3, 2017 at 20:18
  • 8
Some things are still not possible in JVM byte code, such as tail call optimization. I recall this greatly compromising functional languages that compile to the JVM.
    – JDługosz
    Commented Apr 4, 2017 at 0:24
  • 8
@JDługosz right: the JVM unfortunately imposes quite a few restrictions / design idioms that, while they may be perfectly natural if you're coming from an imperative language, can become quite an artificial obstruction if you want to write a compiler for a language that works fundamentally differently. I thus consider LLVM a better target, as far as future-language-work-reuse is concerned – it has limitations too, but they more or less match the limitations that current (and likely future) processors have anyway. Commented Apr 4, 2017 at 13:08
27

Intermediate representations of various sorts are increasingly common in compiler / runtime design, for a few reasons.

In Java's case, the number one reason was probably portability: Java was heavily marketed from the start as "Write Once, Run Anywhere". While you can achieve this by distributing the source code and using different compilers to target different platforms, this has a few downsides (a minimal sketch of the bytecode pipeline follows the list):

  • compilers are complex tools which have to understand all the convenience syntaxes of the language; bytecode can be a simpler language, since it is closer to machine-executable code than human-readable source; this means:
    • compilation may be slow compared to executing bytecode
    • compilers targeting different platforms may end up producing different behaviour, or not keeping up with language changes
    • producing a compiler for a new platform is a lot harder than producing a VM (or bytecode-to-native compiler) for that platform
  • distributing source code is not always desirable; bytecode offers some protection against reverse engineering (although it's still fairly easy to decompile unless deliberately obfuscated)
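As that sketch (file and class names are placeholders): the source is compiled to bytecode once, and the very same .class file then runs on any platform that has a JVM:

// Hello.java -- compiled once, on any machine:
//   javac Hello.java   -> produces Hello.class (bytecode)
// The same Hello.class then runs unmodified wherever a JVM exists:
//   java Hello         -> the local JVM turns the bytecode into native code
public class Hello {
    public static void main(String[] args) {
        System.out.println("Hello from any platform");
    }
}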

Other advantages of an intermediate representation include:

  • optimisation, where patterns can be spotted in the bytecode and compiled down to faster equivalents, or even optimised for special cases as the program runs (using a "JIT", or "Just In Time", compiler; see the sketch after this list)
  • interoperability between multiple languages in the same VM; this has become popular with the JVM (e.g. Scala), and is the explicit aim of the .NET framework
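One way to watch that JIT at work (a sketch; the class name is made up, but -XX:+PrintCompilation is a standard HotSpot flag): run a small hot loop, and the JVM logs each method as it compiles it to machine code:

// Run with:  java -XX:+PrintCompilation HotLoop
// HotSpot prints one line per method it JIT-compiles as the loop becomes hot.
public class HotLoop {
    static long sum(long n) {
        long s = 0;
        for (long i = 0; i < n; i++) {
            s += i;   // hot path: compiled to machine code after enough iterations
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum(100_000_000L));
    }
}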
  • 1
Java was also oriented toward embedded systems. In such systems, the hardware has tight memory and CPU constraints.
    – Laiv
    Commented Apr 3, 2017 at 21:31
  • Can compilers be developed in a way that they first compile Java source code into byte code and then compile the byte code into machine code? Would that eliminate most of the downsides you mentioned?
    – Sher10ck
    Commented May 7, 2018 at 16:56
  • @Sher10ck Yes, it's perfectly possible AFAIK to write a compiler that statically converts JVM bytecode to machine instructions for a particular architecture. But it would only make sense if it improved performance enough to outweigh either the extra effort for the distributor, or the extra time to first use for the user. A low-power embedded system might benefit; a modern PC downloading and running many different programs would probably be better off with a well-tuned JIT. I think Android goes somewhere in this direction, but don't know details.
    – IMSoP
    Commented May 7, 2018 at 18:20
8

It sounds like you're wondering why we don't just distribute source code. Let me turn that question around: why don't we just distribute machine code?

Clearly the answer here is that Java, by design, does not assume it knows what the machine is where your code will run; it could be a desktop, a super-computer, a phone, or anything in between and beyond. Java leaves room for the local JVM compiler to do its thing. In addition to increasing the portability of your code, this has the nice benefit of allowing the compiler to do things like take advantage of machine-specific optimizations, if they exist, or still produce at least working code if they do not. Things like SSE instructions or hardware acceleration can be used only on the machines that support them.

Seen in this light, the reasoning for using byte-code over raw source code is clearer. Getting as close to raw machine language as possible allows us to realize or partially realize some of the benefits of machine code, such as:

  • Faster startup times, since some of the compiling and analysis is already done.
  • Security, since the byte-code format has a built-in mechanism for signing the distribution files (source could do this by convention, but the mechanism to accomplish this isn't built in the way it is with byte code); a sketch follows this list.
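For the signing point, a minimal sketch using the standard java.util.jar API (the jar name is a placeholder): opening a jar with verify=true makes every read check the signatures shipped in META-INF:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class VerifyJar {
    public static void main(String[] args) throws IOException {
        // verify=true: entries are checked against the jar's signature files as they are read
        try (JarFile jar = new JarFile("app.jar", true)) {  // "app.jar" is hypothetical
            for (JarEntry entry : Collections.list(jar.entries())) {
                try (InputStream in = jar.getInputStream(entry)) {
                    in.transferTo(OutputStream.nullOutputStream()); // full read triggers verification (Java 11+)
                }
                // A tampered entry throws SecurityException during the read above.
            }
        }
    }
}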

Note that I don't mention faster execution. Both source code and byte code are or can (in theory) be fully compiled to the same machine code for actual execution.

Additionally, byte code allows for some improvements over machine code. Of course there are the platform independence and hardware-specific optimizations I mentioned earlier, but there are also things like servicing the JVM compiler to produce new execution paths from old code. This can be to patch security issues, or if new optimizations are discovered, or to take advantage of new hardware instructions. In practice it's rare to see big changes this way, because it can expose bugs, but it is possible, and it's something that happens in small ways all the time.

8

There seem to be at least two different possible questions here. One is really about compilers in general, with Java basically just an example of the genre. The other is specific to Java and the particular byte codes it uses.

Compilers in general

Let's first consider the general question: why would a compiler use an intermediate representation in the process of compiling source code to run on some particular processor?

Complexity Reduction

One answer to that is fairly simple: it converts an O(N * M) problem into an O(N + M) problem.

If we're given N source languages, and M targets, and each compiler is completely independent, then we need N * M compilers to translate all those source languages to all those targets (where a "target" is something like a combination of a processor and OS).

If, however, all those compilers agree on a common intermediate representation, then we can have N compiler front ends that translate the source languages to the intermediate representation, and M compiler back ends that translate the intermediate representation to something suitable for a specific target.
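To put numbers on it: with N = 10 source languages and M = 8 targets, fully independent compilers mean 10 × 8 = 80 separate compiler projects, while a shared intermediate representation needs only 10 front ends plus 8 back ends, i.e. 18 components.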

Problem Segmentation

Better still, it separates the problem into two more or less exclusive domains. People who know/care about language design, parsing and things like that can concentrate on compiler front ends, while people who know about instruction sets, processor design, and things like that can concentrate on the back end.

So, for example, given something like LLVM, we have lots of front ends for various different languages. We also have back-ends for lots of different processors. A language guy can write a new front-end for his language, and quickly support lots of targets. A processor guy can write a new back-end for his target without dealing with language design, parsing, etc.

Separating compilers into a front end and back end, with an intermediate representation to communicate between the two isn't original with Java. It's been pretty common practice for a long time (since well before Java came along, anyway).

Distribution Models

To the extent that Java added anything new in this respect, it was in the distribution model. In particular, even though compilers have been separated into front-end and back-end pieces internally for a long time, they were typically distributed as a single product. For example, if you bought a Microsoft C compiler, internally it had a "C1" and a "C2", which were the front-end and back-end respectively--but what you bought was just "Microsoft C" that included both pieces (with a "compiler driver" that coordinated operations between the two). Even though the compiler was built in two pieces, to a normal developer using the compiler it was just a single thing that translated from source code to object code, with nothing visible in between.

Java, instead, distributed the front-end in the Java Development Kit, and the back-end in the Java Virtual Machine. Every Java user had a compiler back-end to target whatever system he was using. Java developers distributed code in the intermediate format, so when a user loaded it, the JVM did whatever was necessary to execute it on their particular machine.

Precedents

Note that this distribution model wasn't entirely new either. Just for example, the UCSD P-system worked similarly: compiler front ends produced P-code, and each copy of the P-system included a virtual machine that did what was necessary to execute the P-code on that particular target[1].

Java byte-code

Java byte code is quite similar to P-code. It's basically instructions for a fairly simple machine. That machine is intended to be an abstraction of existing machines, so it's fairly easy to translate quickly to almost any specific target. Ease of translation was important early on because the original intent was to interpret byte codes, much like the P-system had done (and, yes, that's exactly how the early implementations worked).

Strengths

Java byte code is easy for a compiler front-end to produce. If (for example) you have a fairly typical tree representing an expression it's typically pretty easy to traverse the tree, and generate code fairly directly from what you find at each node.
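As a sketch of how direct that is (toy types invented for illustration, not a real compiler): a post-order walk of an expression tree emits stack-machine code node by node, operands first, operator last, which is exactly the shape of JVM bytecode:

// Illustrative only: a toy expression tree and a post-order code emitter (Java 16+ records).
interface Expr { void emit(StringBuilder out); }

record Num(int value) implements Expr {
    public void emit(StringBuilder out) { out.append("  push ").append(value).append('\n'); }
}

record Add(Expr left, Expr right) implements Expr {
    public void emit(StringBuilder out) {
        left.emit(out);            // operands first (post-order)...
        right.emit(out);
        out.append("  iadd\n");    // ...then the operator
    }
}

class EmitterDemo {
    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        new Add(new Num(1), new Add(new Num(2), new Num(3))).emit(out); // 1 + (2 + 3)
        System.out.print(out);   // push 1 / push 2 / push 3 / iadd / iadd
    }
}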

Java byte codes are quite compact--in most cases, much more compact than either the source code or machine code for most typical processors (and, especially, for most RISC processors, such as the SPARC that Sun sold when they designed Java). This was particularly important at the time, because one major intent of Java was to support applets--code embedded in web pages that would be downloaded before execution--at a time when most people accessed the web via modems over phone lines at around 28.8 kilobits per second (though, of course, there were still quite a few people using older, slower modems).
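For a rough sense of scale (a hedged comparison, not a measurement): most JVM opcodes occupy a single byte plus small operands, while every SPARC instruction is a fixed four bytes, so a short arithmetic sequence that fits in half a dozen bytes of bytecode can easily take several times that as RISC machine code.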

Weaknesses

The major weakness of Java byte codes is that they aren't particularly expressive. Although they can express the concepts present in Java pretty well, they don't work nearly so well for expressing concepts that aren't part of Java. Likewise, while it's easy to execute byte codes on most machines, it's much harder to do that in a way that takes full advantage of any particular machine.

For example, it's pretty routine that if you really want to optimize Java byte codes, you basically do some reverse engineering to translate them backwards from a machine-code-like representation into SSA instructions (or something similar)[2]. You then manipulate the SSA instructions to do your optimization, then translate from there to something that targets the architecture you really care about. Even with this rather complex process, however, some concepts that are foreign to Java are sufficiently difficult to express that it's difficult to translate from some source languages into machine code that runs (even close to) optimally on most typical machines.

Summary

If you're asking about why to use intermediate representations in general, two major factors are:

  1. Reduce an O(N * M) problem to an O(N + M) problem, and
  2. Break the problem up into more manageable pieces.

If you're asking about the specifics of the Java byte codes, and why they chose this particular representation instead of some other one, then I'd say the answer largely comes back to their original intent and the limitations of the web at the time, leading to the following priorities:

  1. Compact representation.
  2. Quick and easy to decode and execute.
  3. Quick and easy to implement on most common machines.

Being able to represent many languages or execute optimally on a wide variety of targets were much lower priorities (if they were considered priorities at all).


  1. So why is P-system mostly forgotten? Mostly a pricing situation. P-system sold pretty decently on Apple II's, Commodore SuperPets, etc. When the IBM PC came out, P-system was a supported OS, but MS-DOS cost less (from most people's viewpoint, was essentially thrown in for free) and quickly had more programs available, since it's what Microsoft and IBM (among others) wrote for.
  2. For example, this is how Soot works.
  • Quite close with the web applets: the original intent was to distribute code to appliances (set top boxes...), in the same way that RPC distributes function calls, and CORBA distributes objects.
    – ninjalj
    Commented Apr 4, 2017 at 18:39
  • 2
    This is a great answer, and a good insight into how different intermediate representations make different trade offs. :)
    – IMSoP
    Commented Apr 4, 2017 at 20:40
  • @ninjalj: That was really Oak. By the time it had morphed into Java, I believe the set top box (and similar) ideas had been shelved (though I'm the first to admit that there's a fair argument to be made that Oak and Java are the same thing). Commented Apr 9, 2017 at 17:43
  • @TobySpeight: Yeah, expression is probably a better fit there. Thanks. Commented Apr 9, 2017 at 17:45
0

In addition to the advantages that other people have pointed out, bytecode's a lot smaller, so it's easier to distribute and update and takes up less space in the target environment. This is especially important in heavily space-constrained environments.

It also makes it easier to protect copyrighted source code.

  • 2
Java (and .NET) bytecode is so easy to turn back into reasonably legible source that there are products to mangle names and sometimes other information to make this harder (something also often done to JavaScript, to make it smaller), since we're just now maybe settling on a bytecode for Web browsers. Commented Apr 5, 2017 at 8:34
0

The point is that compiling byte code to machine code is faster than compiling your original source all the way to machine code just in time. But we need an intermediate representation to keep the application cross-platform, because we want to run the same code on every platform without changes and without any per-platform preparation (compilation). So javac first compiles the source to byte code; that byte code can then run anywhere, and the Java Virtual Machine translates it to machine code much more quickly. The answer: it saves time.
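To feel the difference yourself (an illustrative experiment; the class name is invented, but -Xint is a real HotSpot flag that disables the JIT): time the same bytecode with and without just-in-time compilation:

// Compile once:           javac Spin.java
// JIT enabled (default):  java Spin
// Interpreter only:       java -Xint Spin   (typically far slower on this loop)
public class Spin {
    public static void main(String[] args) {
        long start = System.nanoTime();
        long s = 0;
        for (long i = 0; i < 500_000_000L; i++) {
            s += i;
        }
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(s + " computed in " + ms + " ms");
    }
}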

0

Originally, the JVM was a pure interpreter, and you get the best-performing interpreter if the language you are interpreting is as simple as possible. That was the goal of the byte code: to provide an efficiently interpretable input to the run-time environment. This single decision placed Java closer to a compiled language than to an interpreted language, as judged by its performance.

Only later on, when it became apparent that the performance of the interpreting JVMs still sucked, did people invest the effort to create well-performing just-in-time compilers. This somewhat closed the gap to faster languages like C and C++. (Some speed issues inherent to Java remain, though, so you will probably never get a Java environment that performs as well as well-written C code.)

Of course, with just-in-time compilation techniques at hand, we could go back to actually distributing source code and just-in-time compiling it to machine code. However, this would heavily degrade startup performance until all relevant parts of the code were compiled. The byte code is still a significant help here because it's so much simpler to parse than the equivalent Java code.
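How much simpler is visible even at the file level. A class file begins with a fixed, documented header that a few lines of code can read (the file name here is a placeholder), whereas parsing Java source requires a full grammar:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class ClassHeader {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream("Hello.class"))) {
            int magic = in.readInt();           // always 0xCAFEBABE for a valid class file
            int minor = in.readUnsignedShort();
            int major = in.readUnsignedShort(); // e.g. 52 = Java 8, 61 = Java 17
            System.out.printf("magic=%08X, version=%d.%d%n", magic, major, minor);
        }
    }
}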

-5

Text source code is a structure intended to be easy for a human to read and modify.

Byte code is a structure intended to be easy for a machine to read and execute.

Since all the JVM does with code is read and execute it, byte code is a better fit for consumption by the JVM.

I notice that there haven't been any examples yet. Silly Pseudo Examples:

//Source code
i += 1 + 5 * 2 + x;

// Byte code
i += 11, i += x
____

//Source code
i = sin(1);

// Byte code
i = 0.8414709848
_____

//Source code
i = sin(x)^2+cos(x)^2;

// Byte code (actually that one isn't true)
i = 1

Of course byte code is not just about optimizations. A large part of it is about being able to execute code without having to care about complicated rules, like checking if the class contains a member called "foo" somewhere further down in the file when a method refers to "foo".
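A tiny illustration of that rule (names invented; the javap output is paraphrased): in source, a method may refer to a member declared further down the file, so a compiler needs a whole-file view, but in the compiled class the call is already resolved into the constant pool:

public class Forward {
    int twice() {
        return foo() * 2;   // legal in source: foo() is declared below this line
    }

    int foo() {
        return 21;
    }
    // `javap -c Forward` shows the call already resolved, something like:
    //   invokevirtual #2   // Method foo:()I
    // The JVM never has to scan ahead in a source file to find foo.
}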

  • 2
    Those byte code "examples" are human readable. That's not byte code at all. This is misleading and also doesn't address the question asked.
    – Wildcard
    Commented Apr 5, 2017 at 1:24
@Wildcard You may have missed that this is a forum, read by humans. That's why I put the content in human-readable form. Given that the forum is about software engineering, asking readers to understand the concept of a simple abstraction is not asking much.
    – Peter
    Commented Apr 5, 2017 at 13:03
  • Human readable form is source code, not byte code. You are illustrating source code with expressions pre-computed, NOT byte code. And I didn't miss that this is a human-readable forum: You're the one who criticized other answerers for not including any examples of byte code, not me. So you say, "I notice there haven't been any examples yet," and then proceed to give non-examples that don't illustrate byte code at all. And this still doesn't address the question at all. Reread the question.
    – Wildcard
    Commented Apr 5, 2017 at 21:40
