
Suppose we maintain a massive electronic library of texts, photos, videos, etc., and want to ensure that these files remain readable indefinitely far into the future. [Update] One of the major problems with digital libraries is program rot, caused by bugs in content-creation and/or content-playback software (and by feature removal from playback software):

Many documents can be reproduced only on particular versions of the software, the OS, and the computer hardware.

So we:

  • Keep snapshots of versions of OS/software which are known to read these files without errors.
  • Keep snapshots of VM implementations which are known to run these OS/software versions without errors.

However, this is obviously not enough: for the best result, we also need to preserve the versions of the CPU on which the VM implementation runs!

The only way out of this vicious circle seems to be to have “a virtual CPU”: a “virtual instruction set” which is:

  • Powerful enough so that one can recompile the VM mentioned above to run on this pseudo-CPU.
  • Simple enough so that one can write a very simple interpreter for this instruction set (e.g., in pseudo-code, though it had better be compilable, so that one can periodically check that it works!).

The target is that N (or N²) years in the future, a future librarian should be able to quickly rewrite this “sample interpreter” into whatever programming language is available at that time. After this, the library becomes readable. (In other words, all one should provide is:

  • General human-readable instructions on how to navigate the library;
  • The human-readable (pseudo-)code of the interpreter.
  • A blob keeping the compiled VM, the OS and the reader programs.
  • A blob keeping the library.)

Of course, in the best of all worlds, such a CPU architecture would already be available!

Question: is it available? If not, how close is it to being available?

16
  • 4
    Why not just document the file formats?
    – cwallach
    Commented Jul 16, 2021 at 4:37
  • 1
    Another problem with “just document it” is QA. How would you guarantee that the documentation “is enough” to implement the decoding?! Commented Jul 16, 2021 at 6:10
  • 2
    CPUs can be emulated. The observable behaviour of current CPUs is meticulously documented, making it possible to write emulators even if Intel, AMD, and ARM suddenly disappear. But I believe our best bet for such a cyber-ark is to keep the necessary software written in a memory-safe language that is likely to be ported to new CPU architectures. For software that I want to keep until the end of my life, I currently use Rust.
    – amon
    Commented Jul 16, 2021 at 7:31
  • 2
    "How would you guarantee that the documentation “is enough” to implement the decoding?!" All the popular file formats are implemented in multiple operating systems, running on multiple CPU architectures. It's safe to assume that people have understood the standards well enough to decode them.
    – Simon B
    Commented Jul 16, 2021 at 8:21
  • 1
    @SimonB but half the time they base the decoding on how other programs are doing it. Commented Jul 16, 2021 at 15:44

2 Answers

5

The problem is that you're replacing a relatively simple problem with a harder one.

We already have a number of well documented file formats. These cover documents, still images, sounds and video. They have been implemented on multiple operating systems on multiple hardware platforms. So we know that the standards are well enough written.

You want to replace that with a whole VM running on a virtual CPU. But you still have to document how to implement the thing so well that someone decades in the future can implement it on whatever machine they have. And if they do get the content to display or play, it's now embedded in an entire VM, making it harder to paste it into another document.

Also, when video standards are first drawn up, they require special hardware or the fastest processors to implement them. Once you add the overhead of running a VM, they become impossible to implement on any hardware we have. It is much better to describe the process of decoding the format, and let people implement it in the most efficient way on their platform.

2
  • 1
    Having “a well documented file format” is completely irrelevant when one needs to preserve existing documents. Even if one only cares about “well-formed documents” (and you do not: the library should preserve buggily-implemented documents too, as long as they can be “shown” somehow), one needs to have (1) a linter which proves well-formedness; (2) a proof that the linter is doing what it is supposed to do; (3) a converter working on well-formed documents; (4) a proof that the converter is doing what it is supposed to do. AFAIK, there are no such formats in use (except pixmaps). Commented Sep 15, 2022 at 5:27
  • I updated the question to clarify this. Commented Sep 15, 2022 at 6:13
3

This is absolutely possible. You can just pick literally any general-purpose CPU architecture that's simple enough for your liking - for example, a Z80 with a memory-banking extension - and write an interpreter for other architectures on that one.

Z80 is relatively simple (and you could even remove the less-than-simple parts). x86_64 is not. If you have an x86_64 interpreter written for Z80, you only have to write new Z80 interpreters as technology advances.

It will be slow, of course, but fully functional. Interpreted emulation is always slow, and you are proposing to use two layers of it. One or both can be a JIT compiler, but then you are adding a huge amount of complexity which could break down in unexpected ways. Future generations can always write their own JIT using your interpreter as a reference.
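To give a sense of how small such a reference interpreter can be, here is a minimal sketch in Python. The instruction set is a made-up toy, not Z80: the opcode names (`LOADI`, `LOAD`, `STORE`, `ADD`, `SUBI`, `JNZ`, `HALT`), the tuple encoding, and the `run` function are all assumptions for illustration only. A real Z80 interpreter would be longer, but the fetch-decode-execute loop has the same shape.

```python
# Toy fetch-decode-execute loop. The instruction set is hypothetical:
# each instruction is a tuple (opcode, operand...), not a real Z80 encoding.

def run(program, memory_size=16, max_steps=10_000):
    """Interpret `program` and return the final memory as a list of ints."""
    mem = [0] * memory_size   # data memory
    acc = 0                   # single accumulator register
    pc = 0                    # program counter
    for _ in range(max_steps):
        op, *args = program[pc]   # fetch and decode
        pc += 1
        if op == "LOADI":         # acc = constant
            acc = args[0]
        elif op == "LOAD":        # acc = mem[addr]
            acc = mem[args[0]]
        elif op == "STORE":       # mem[addr] = acc
            mem[args[0]] = acc
        elif op == "ADD":         # acc += mem[addr], wrapping to 8 bits
            acc = (acc + mem[args[0]]) & 0xFF
        elif op == "SUBI":        # acc -= constant, wrapping to 8 bits
            acc = (acc - args[0]) & 0xFF
        elif op == "JNZ":         # jump to target if acc is non-zero
            if acc != 0:
                pc = args[0]
        elif op == "HALT":
            return mem
        else:
            raise ValueError(f"unknown opcode {op!r}")
    raise RuntimeError("step limit exceeded")


# Example program: multiply 6 by 7 via repeated addition; result in mem[2].
MULTIPLY = [
    ("LOADI", 6), ("STORE", 0),               # mem[0] = 6 (addend)
    ("LOADI", 7), ("STORE", 1),               # mem[1] = 7 (loop counter)
    ("LOAD", 2), ("ADD", 0), ("STORE", 2),    # mem[2] += mem[0]
    ("LOAD", 1), ("SUBI", 1), ("STORE", 1),   # mem[1] -= 1
    ("JNZ", 4),                               # repeat while counter != 0
    ("HALT",),
]
```

Calling `run(MULTIPLY)` leaves the product in memory cell 2. The point of the sketch is that the whole machine is one loop and one dispatch on the opcode, which is the part a future reader would have to rewrite in their own language.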


By the way, human languages evolve. The year is 4040 and English has the status that ancient Egyptian hieroglyphs do today. Can they read your instruction set description? You may need to bootstrap your description from scratch - like we tried to do with the Voyager records. Luckily mathematics is universal and timeless, so you can assume they'll have the concepts available, but not the notation. You have to design a record they can look at and think "hey, wait a minute! That's binary addition!" Think something like:

- | |- || |-- |-| ||- ||| |--- |--| |-|- |-|| ||-- ||-| |||- ||||

|- || |-| ||| |-|| ||-| |---| |--|| |-||| |||-| ||||| |--|-|
# optionally add more prime numbers until you run out of carvings
# maybe write e and pi and sqrt(2) in binary to introduce the "decimal" point symbol

-   +   -   =   -
|   +   -   =   |
-   +   |   =   |
|   +   |   =   |-
||--|   +   |--|-   =   |-|-||
(include a bunch more binary addition examples here)

(same for subtraction)
(and it just goes on and on like this)
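The rows above can be generated mechanically, which is handy for producing as many examples as the record has room for. The following is a small sketch in Python, reading `-` as binary 0 and `|` as binary 1 as in the examples; the helper names `to_marks`, `from_marks` and `addition_line` are made up for this illustration.

```python
# Render non-negative integers in the two-symbol notation used above:
# "-" stands for binary 0 and "|" for binary 1.

def to_marks(n):
    """Return n written in binary with '-' for 0 and '|' for 1."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    return format(n, "b").replace("0", "-").replace("1", "|")


def from_marks(marks):
    """Inverse of to_marks: parse a '-'/'|' string back to an integer."""
    return int(marks.replace("-", "0").replace("|", "1"), 2)


def addition_line(a, b):
    """Format one addition example in the same layout as the rows above."""
    return f"{to_marks(a)}   +   {to_marks(b)}   =   {to_marks(a + b)}"


# The counting row (0 through 15) and one of the addition examples.
counting_row = " ".join(to_marks(n) for n in range(16))
print(counting_row)
print(addition_line(25, 18))
```

The same helpers extend directly to the subtraction examples by formatting `a - b` for pairs with `a >= b`.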
6
  • I reread my question, and your “answer” leaves me bewildered. Did you just say that one can compile a contemporary VM (e.g., QEMU) into Z80 code?! Did you say that the compiled program would be able to run the VM with a contemporary OS and a reasonably-sized task? Did you say that there is an emulator for Z80 with a very short code? THESE were my questions… Commented Sep 15, 2022 at 6:33
  • And as far as your no-English aside goes: such questions are well researched and widely known. There is no need to reinvent the wheel here! (Moreover, I address an obviously pressing need here, not some hypothetical imagined situation. Although I agree: this situation is ALSO very interesting! ;―) Commented Sep 15, 2022 at 6:37
  • @IlyaZakharevich Is there any reason you think Z80 could not be the language you asked for? As I said - Z80 is just an example - you could use just about any well-known instruction set that's ancient by today's standards, perhaps ARM1 for a bit more simplicity. Or create your own one which is similar to it. Commented Sep 15, 2022 at 17:00
  • Maybe I’m missing something… What is the practical amount of (virtual) memory supported on the Z80 architecture? To show HTML (and render Blu-ray) one had better have a few GB of heap available to the program (and, as a corollary, to the emulator of the processor). If Z80 can do this, then it should indeed be OK! Commented Oct 23, 2022 at 13:52
  • @IlyaZakharevich not just Z80 but "Z80 with a memory banking extension". You can simply say that e.g. memory addresses $00, $01, $02, $03 are a 32-bit little-endian extension to addresses in the range $8000-$FFFF. Now you have a 47-bit address space which is plenty. Yeah, the emulated CPU has to do a whole lot of work to access addresses in that range - and it's even worse if part of the emulator code has to fit in that range, which it will - but remember that archiving is the concern here, not speed. Commented Oct 24, 2022 at 16:31
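The banking scheme described in the last comment can be sketched as an address translation function. This is only an illustration in Python under the comment's stated assumptions: the four bytes at $00-$03 hold a little-endian 32-bit bank number, and only addresses in $8000-$FFFF (a 15-bit window) are redirected through it, giving 32 + 15 = 47 address bits. The names `physical_address`, `WINDOW_BASE` and `WINDOW_BITS` are hypothetical.

```python
# Sketch of the banking scheme from the comment above. Assumption: the four
# bytes at $00-$03 form a little-endian 32-bit bank number, and only
# addresses $8000-$FFFF (a 15-bit window) are redirected through it.

WINDOW_BASE = 0x8000   # first banked CPU address
WINDOW_BITS = 15       # $8000-$FFFF spans 2**15 bytes


def physical_address(low_memory, addr):
    """Map a 16-bit CPU address to an address in the large backing store.

    `low_memory` is any sequence holding the byte values at $00-$03.
    Returns None for addresses below the window, which are not banked.
    """
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("addr must be a 16-bit address")
    if addr < WINDOW_BASE:
        return None
    # Assemble the 32-bit bank number from its four little-endian bytes.
    bank = int.from_bytes(bytes(low_memory[0:4]), "little")
    # The bank number supplies the high 32 bits, the window offset the low 15.
    return (bank << WINDOW_BITS) | (addr - WINDOW_BASE)
```

With all four bank bytes set to $FF and the CPU address $FFFF, the result is the highest address of the 47-bit space, which matches the figure given in the comment.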
