Inspired by the amazing RISC-V emulator shader, RVC, I set out to create an emulator shader that was as simple as possible, with a minimal VRAM footprint, by emulating an 8-bit CPU: the CDP1802. The final shader is capable of running compiled C programs while taking up less than 1ms of rendering time each frame, and it can even emulate a full-speed CDP1802 at 3.2MHz if allowed to use 100% of my GPU.

The CDP1802

The CDP1802 is an 8-bit microprocessor released by RCA in 1974, and holds the distinction of being the world's first CMOS processor. I won't go into all the technical details about it, rather just try to explain why I chose this CPU.

The primary factor was mostly just that it is the first CPU I learned assembly for, so I am intimately familiar with it, but I also chose it for its incredibly simple instruction set. There are no variable addressing mode instructions. Every CDP1802 instruction does one thing, with one addressing mode.
An emulator for the CPU can easily be implemented using just a massive switch-block, with no complex addressing logic beyond that. That should turn this into the most compact CPU emulator shader ever, at the cost of a lot of performance.
After all, the CDP1802 was already considered poor in performance all the way back when it was released. Its ISA may be simple, but that also means it is slow for doing complex things. Don't expect this shader to be running any complex programs, at least not at its default speed mode.

Basic setup

The shader runs inside a CustomRenderTexture configured to update every frame. The format is R16_G16_B16_A16_UINT, and its size is exactly 256 by 256 pixels, for a total of 65536 pixels, matching the full 16-bit address space of the CDP1802. Each pixel's red channel holds one byte of memory.
The emulator needs to store some information too, mainly the internal CPU register values. The CDP1802 contains a set of 16 16-bit scratchpad registers, plus a few smaller special-purpose registers. The emulator's memory write cache also needs to be stored temporarily. All of this lives in the green color channel, along with some visual debugging information that gets rendered into it. Blue is unused.
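To make that mapping concrete, here is a minimal sketch of a byte read, assuming the CRT is bound as a Texture2D<uint4>. The name _Memory and the exact address-to-pixel split are my own illustration, not necessarily what the shader does:

    Texture2D<uint4> _Memory;   // the 256x256 CRT: one byte of CPU memory in each pixel's red channel

    uint ReadByte(uint addr)
    {
        // Illustrative split of a 16-bit address: low byte selects the column, high byte the row.
        uint2 texel = uint2(addr & 0xFF, addr >> 8);
        return _Memory.Load(int3(texel, 0)).r;
    }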

To load a program, the shader, on the first frame, simply copy-pastes the contents of a regular texture, loaded from a PNG file, into the CRT. This texture contains a compiled program that's been converted from binary to image format using a tool I quickly coded up in Java. From there, any program can be loaded into the emulator.
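Conceptually, that first-frame load boils down to something like the following sketch, where _Rom stands for the PNG-derived program texture. How the real shader detects the first frame is not shown here; the decision is simply passed in:

    Texture2D<uint4> _Rom;   // the compiled program, converted to an image

    uint4 LoadOrKeep(uint2 texel, uint4 previous, bool alreadyLoaded)
    {
        if (!alreadyLoaded)
            return uint4(_Rom.Load(int3(texel, 0)).r, 0, 0, 0);  // copy one ROM byte into CPU memory
        return previous;                                         // afterwards, keep the existing contents
    }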

I created two implementations of the shader. The first is dumb and slow; the second is the final version, with a memory write cache and other optimizations.

Version 1

The main block of the emulator is as simple as I described earlier. The code reads the next CPU opcode from memory, which gets fed into a massive switch-block that jumps to the code for emulating that instruction.
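Roughly, the dispatch looks like this. It is a sketch with my own variable names: P is the CDP1802's register-select value that picks which of the sixteen scratchpad registers acts as the program counter, and ReadByte is the memory helper sketched earlier.

    void Step(inout uint reg[16], inout uint P, inout uint Q)
    {
        uint opcode = ReadByte(reg[P]);         // fetch the next opcode byte
        reg[P] = (reg[P] + 1) & 0xFFFF;         // advance the program counter

        switch (opcode)
        {
            case 0x7A: Q = 0; break;            // REQ - reset the Q output flip-flop
            case 0x7B: Q = 1; break;            // SEQ - set the Q output flip-flop
            // ... one case per opcode, no shared addressing-mode logic required ...
        }
    }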

The CDP1802 has 255 valid opcodes, leaving only one unused (though I do find a use for it later). Among those valid opcodes are some that cannot be emulated, namely the I/O instructions. The interrupt control and return instructions are implemented, but useless, as there is currently no way to trigger an interrupt. Interrupts also don't work with the C compiler I will be using later, so they would be useless either way.

I actually wanted this part of the code to be free of any shader-specific constructs, so that it can be understood more easily. It is surrounded by prologue and epilogue code, which contains the shader-specific parts for reading the emulator state and saving it back to the CRT at the end.

The biggest hurdle was memory access. Memory reads are simple enough: sample from the CRT's red color channel. Register reads work the same, except the source is taken from the emulator memory in the green color channel. This can be easily abstracted through a macro. If a memory write occurs, the value and address of that write are stored in variables, to be handled by the epilogue. Since the prologue reads all current register values into an array and some variables, those can be modified directly, to be written back to the CRT during the epilogue.
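As an illustration of that abstraction layer (the names, macro shapes and green-channel layout here are placeholders, not the actual code):

    #define MEM_READ(addr)    ReadByte(addr)                        // CPU memory: red channel of the CRT
    #define REG_RAW(texel)    (_Memory.Load(int3((texel), 0)).g)    // emulator state: green channel

    // A write is only recorded into these variables; the epilogue decides what actually reaches the CRT.
    static bool didWrite  = false;
    static uint writeAddr = 0;
    static uint writeData = 0;
    #define MEM_WRITE(a, v)   { writeAddr = (a); writeData = (v); didWrite = true; }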

The way memory writes are handled is the primary difference between versions 1 and 2. Doing memory writes is weird in a CRT. There is no function you can call to arbitrarily set a pixel's color. Instead, the fragment shader function is run once for every pixel, and its return value is that pixel's new color. So, the only way to set a pixel’s color is to be inside that pixel’s instance of the fragment shader.
In version 1 of the emulator shader, this translated to running one step of the whole emulator for every pixel in the fragment shader, checking if the current pixel either corresponds to the address of the current memory write (if one happened), or the location of one of the CPU registers in emulator memory.
For all other pixels, the result of the emulation step is discard;ed.
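Stripped down to its skeleton, the version 1 epilogue behaves roughly like this. All names are illustrative, and the register pixels are assumed to live in one fixed row purely for the sake of the example:

    uint4 Epilogue(uint2 texel, uint address, uint4 previous)
    {
        if (didWrite && address == writeAddr)
            return uint4(writeData, previous.g, previous.b, previous.a);   // the one byte the CPU wrote

        if (texel.y == 255)   // placeholder for "this pixel holds part of the emulator state"
            return uint4(previous.r, 0 /* freshly packed register value */, previous.b, previous.a);

        discard;              // every other pixel throws its entire emulation step away
        return previous;      // never reached, but keeps the compiler happy
    }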

This is incredibly wasteful, and incredibly, incredibly slow. Not just from the overhead of the tens of thousands of emulators running in parallel, but also because it means only one instruction can be executed per frame. However, IT WORKED! It did its job as a proof-of-concept! I was already running my first C program inside VRChat at this point. Yes, it took a good 30 seconds to print "Hello, World!", but it was progress!

Additionally, creating a framework for abstracting the shader-specific components of the emulator proved to be a really good idea. I was actually able to copy-paste the main loop of a CDP1802 emulator I’d written in Java into my shader code, and it required almost no modification to work. This definitely reduced the amount of time spent debugging the emulator, as I was able to focus on finding issues in just the prologue and epilogue functions.

Still, a lot of further optimizations were needed.

Version 2

The first problem to solve was how to run more than one instruction per frame. The problem was memory writes. In Version 1, the moment one occurs, program execution has to stop for that frame, so the epilogue code can commit the change in memory to the CRT. Of course, you could use an array to keep track of multiple writes.
Now, execution only needs to stop if the array fills up. But what if a read occurs on a memory cell that was written to in the same frame? The memory read macro would have to first search through the write buffer to see if the value changed, return the changed value if yes, and fall back to sampling the CRT if no. Wait, that's what a cache is!
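A minimal sketch of that cache-aware read, reusing the ReadByte helper from earlier (the buffer size and layout are made up; the real shader stops the frame once its buffer is full):

    #define CACHE_SIZE 16                 // illustrative size

    static uint cacheAddr[CACHE_SIZE];
    static uint cacheData[CACHE_SIZE];
    static uint cacheCount = 0;

    uint CachedRead(uint addr)
    {
        // Walk the buffer backwards so the newest write to an address wins.
        for (int i = (int)cacheCount - 1; i >= 0; i--)
            if (cacheAddr[i] == addr)
                return cacheData[i];
        return ReadByte(addr);            // no hit: fall back to sampling the CRT
    }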

With this implementation of a write cache in hand, I could now run multiple instructions per frame. Of course, increasing the IPF (Instructions Per Frame) also meant increasing the overhead of the fragment shader running 65536 times per frame, meaning my GPU was getting absolutely obliterated! Luckily, thanks to the write cache, I now had everything I needed to fix that too.

The solution is similar to what RVC uses, as it was explained to me by one of RVC’s devs. Instead of updating both the emulator state and CPU memory in a single shader pass, separate the two. Running the emulator and computing the new emulator state and write cache could be done in one pass, and using that information to update the CPU memory in another. The only further modification required for this is that the write cache also needs to be persisted into emulator memory to store it in-between the passes.

The main advantage of this is that the emulator memory requires far fewer CRT pixels than the CPU memory does. So, the first pass only needs to run for a few dozen pixels, as opposed to 65536. And since this pass includes running the emulator itself, this results in a massive speed boost!
The second pass, which reads the contents of the write cache and uses it to update the CPU memory, still needs to run for all pixels, but it is a fairly simple, and speedy, step.
Lastly, CRT update zones can be used to run both passes every frame, while also limiting the number of pixels updated by the first pass. I like to call these passes "Update", for updating the emulator state, and "Commit" for writing the memory changes to the CRT.
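As a sketch, the Commit pass then amounts to each memory pixel scanning the persisted cache (reusing the _Memory binding from the earlier sketch; where exactly the cache lives in the green channel, and how it is packed, is made up here):

    uint4 CommitPass(uint2 texel, uint address, uint4 previous)
    {
        uint count = _Memory.Load(int3(0, 255, 0)).g;                 // hypothetical: number of cached writes
        for (uint i = 0; i < count; i++)
        {
            uint addr = _Memory.Load(int3(1 + i * 2, 255, 0)).g;      // hypothetical cache slot: address
            uint data = _Memory.Load(int3(2 + i * 2, 255, 0)).g;      // hypothetical cache slot: data
            if (addr == address)
                previous.r = data;                                    // apply the newest matching write
        }
        return previous;
    }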

Version 2 is a lot faster than Version 1, finally running both passes combined in under 1ms, as I promised in the beginning. However, there are still a few things missing to turn it into a fully functional computer.

Basic I/O

A usable computer obviously needs I/O, and since this is a VRChat avatar, we can use some existing mechanics to interface with it. First, output. One of the simplest ways of outputting information is through text. Traditionally, you’d use a serial terminal to let the CPU send out one character at a time, which is then displayed on a screen. In this case, I can’t easily do that: I’d need another CRT shader with memory to keep track of the text printed so far. So I went with a simpler solution, one that’ll even allow me to implement some actual graphics later as well.

Keeping the text buffer in CPU memory does take away a few kilobytes of RAM, but not enough to worry about. Each character actually takes two bytes, with the second byte storing that character’s color, to allow printing colored text. This does, however, mean some CPU overhead, as the CPU no longer just sends out characters (i.e. by writing them to a memory-mapped device in a single instruction), but needs to actually manage the text buffer. It has to maintain row and column pointers to write out characters sequentially, as well as handle line breaks.

This does mean printing is a bit slow, but does allow for arbitrary writes into the text buffer. I also made it so that a special "full block" character is available for the CPU to print. Combined with the ability to set text color on a per-character basis, this means the CPU is able to generate graphics as well! At an amazing resolution of 64 by 50 pixels/characters, which is the size of the terminal.

All that’s needed now is a shader to look at the text buffer, and use a monospaced pixel font I created to show the contents as text on a quad, which I could then attach to my avatar as a prop.

However, the location at which the C compiler places the text buffer is essentially arbitrary. So, at program start, the CPU writes a 16-bit pointer to the buffer into memory locations 65535 and 65534. The shader reads these to know where to look for the text buffer.
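In rough terms, the terminal shader can then locate a character cell like this (the byte order of the pointer and the exact two-bytes-per-cell layout are assumptions based on the description above):

    uint ReadCpuByte(Texture2D<uint4> mem, uint addr)
    {
        return mem.Load(int3(addr & 0xFF, addr >> 8, 0)).r;
    }

    uint2 ReadCharCell(Texture2D<uint4> mem, uint cellIndex)
    {
        // The CPU stores the buffer's 16-bit start address at 65534/65535.
        uint base = ReadCpuByte(mem, 65534) | (ReadCpuByte(mem, 65535) << 8);
        uint addr = base + cellIndex * 2;        // two bytes per cell: character code, then color
        return uint2(ReadCpuByte(mem, addr), ReadCpuByte(mem, addr + 1));
    }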

Writing to random memory locations seems a bit dangerous, as the CPU is running a C program that might try to allocate those locations for a variable. However, while technically possible, this is not going to happen here. With just 64K of memory, there is no heap. Every variable is allocated on the stack, which grows upwards. As the special memory locations are located at the end of memory, the only way for the C compiler to try and use them would be if your program is using up all the available RAM. And at that point, you've got a whole other problem entirely!

Input is a bit more complicated. Obviously I was going to use gesture menu toggles as a simple "keypad", but getting avatar parameter values into a CRT is quite difficult. The solution to this problem was complicated enough that I factored it out into its own mini-project, which I've already written about here. The final keypad consists of the numbers 0 - 5 and an "ENTER" key to confirm inputs; at that point, I hit VRC’s limit of 7 controls per sub-menu. Entering larger numbers will be a bit convoluted, but for basic input, this suffices.

I can also have the CPU interface with the emulator itself. Remember that unused opcode? I can use it as an instruction prefix to implement my own instructions and trigger special behaviour in the emulator. I want to stay as compatible with real hardware as possible though, so I tried to use this as little as possible. In fact, it is currently only used once. The YIELD instruction tells the emulator to immediately stop executing instructions for that frame and instead move on to running the commit pass. This effectively pauses emulation for one frame, and can be used for delays as well as for reducing the GPU usage of the shader during idle loops.
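Inside the main switch, the escape hatch amounts to something like the following, reusing names from the earlier sketches. I believe 0x68 is the 1802's one unused opcode, but treat the concrete values and the YIELD sub-encoding here as illustrative:

    switch (opcode)
    {
        // ... all the real CDP1802 instructions ...
        case 0x68:                                // the opcode left unused by the real CPU
        {
            uint sub = MEM_READ(reg[P]);          // the byte after the prefix selects the custom instruction
            reg[P] = (reg[P] + 1) & 0xFFFF;
            if (sub == 0x00)                      // YIELD: stop executing for this frame
                stopThisFrame = true;             // the main loop checks this and skips straight to Commit
            break;
        }
    }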

Of course, there are more interesting and unique ways for the processor to interface with VRC.

Avatar I/O

The next question is: can the CPU control things on my avatar? The answer is... yes, sort of. I already found a way to get data from the FX Controller into the emulator, which could be used for a variety of inputs. Additionally, input can also be taken by sampling a Render Texture from a camera on your avatar. I used this in my demo to make a touch button, but it only really worked for me. Also keep in mind that I did this before Avatar Dynamics, so a lot more should be possible now in terms of emulator inputs.

However, output is quite limited. I can’t extract data from the emulator and turn it back into animator parameters. Instead, I have to keep using the same trick I did for the text terminal, where a shader samples the CRT to take input from the emulator. I managed to create a demo in which the emulator animates my avatar’s emissions by writing a byte into a special memory location, which my avatar shader samples and interprets as a brightness value for the emissions. One could also potentially use geometry shaders to have the emulator actually implement object toggles, but I didn’t experiment with this.
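On the avatar side, reading that byte back out just means sampling the CRT in the material's shader, roughly like this (the address 0xFFF0 is a stand-in for wherever the real "device register" lives):

    Texture2D<uint4> _EmulatorMemory;   // the emulator's CRT, assigned to the avatar material

    float EmissionFromEmulator()
    {
        uint addr = 0xFFF0;                                          // hypothetical memory-mapped location
        uint raw  = _EmulatorMemory.Load(int3(addr & 0xFF, addr >> 8, 0)).r;
        return raw / 255.0;                                          // one byte -> 0..1 brightness multiplier
    }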

It should theoretically be possible to run the emulator in a VRChat world and use Udon to access the CRT and implement various I/O devices, but I didn’t experiment with this much either, and it is outside the scope of my goals for this project anyways.

Getting C to work

So far, I have a working emulator with working I/O, but I am still writing test programs in assembly. And while that's fine for simple programs, I do want to eventually move on to something higher-level, just to make my life easier. Luckily, a C compiler for the CDP1802 exists, and it's the most scuffed thing imaginable. It's based on the Little C Compiler (LCC), which compiles ANSI C, the original standard documented all the way back in 1989. It is different enough from modern C that you could almost call it a different programming language altogether. There's also no standard library except for the few dozen lines of code that come with the compiler. It had some print functions, and assembly subroutines for integer and even floating-point operations, but that was pretty much it.

So, I need a library to interface with my custom I/O devices. That code lives in vrcio.c, and consists of the code required to handle printing characters into the text buffer, as well as reading keypad inputs. These functions combine with the ones that come with the compiler to make something similar-ish to stdio. You can printf("Test %d", i); in your code, and it’ll actually work! Except, it doesn’t. Not without patching the compiler itself first, because of course it doesn’t play along well with a CPU emulator in a Unity shader.

Most of the issues come from the compiler being built for a very specific hardware configuration. For instance, it insists on using one of the CDP1802's I/O instructions, on specifically port 5, for printing characters. There was no other way to fix this than to delete the putc function from the include files and define my own in C, but imagine my surprise when things still didn’t print. It turns out the compiler’s standard library doesn’t even use putc for printing strings, instead also going to the I/O instruction directly. Many hours of debugging were spent discovering this. *sigh!* At least now I can finally printf properly.

Math is another thing I need and have very little of. Basic operations all work. I can add, subtract, multiply and divide, but there is no math.h, unless I make my own. Which is what I did. It only has four functions in it. The first is square root, using the "evil bit hack" algorithm (fast inverse square root); the second is cosine, using an iterative approximation; the third, sine, is derived from cosine. The last function is rand, using xorshift to provide random numbers. None of these are particularly fast. Cosine takes several seconds to complete in the emulator. These few functions should be about all I need for my demos though, so I left it at that.

Oh, also, the software floating-point is slightly glitched. If you multiply specifically 0 by 0, then, through the process of magic, you get 4 as the result. I couldn't figure out how to fix this problem at the source, so you might instead find some instances of if(x == 0 && y == 0) return 0; in my code.

Demos

I created a program containing 4 demos. The first simply uses a touch button to toggle a light. The second is a series of benchmarks: approximating the golden ratio, approximating pi and finding prime numbers. These are very slow and take upwards of 30 minutes to complete. The third is an emissions controller that linearly interpolates between random values. It is barely fast enough to update the emission brightness 15 to 20 times a second. Very stuttery up close, but actually not very noticeable from farther away. The fourth is a pocket calculator. You select an operation, then enter the operands, and get a result.

This is the first time I had to deal with the limited size of the keypad. My solution was to make ENTER a mod key. If you press it, an 'M' appears next to the input, indicating it has been activated. Pressing it again will de-activate it. From there, you can press another key to complete a key combination. Mod + any number from 1 - 4 will put the selected number plus 5 into the input, so Mod + 2 will put 7. Mod + 5 puts a decimal point into the input, and Mod + 0 actually confirms the input.
A video of this demo can be seen here:

1802 Namebadge port

If you’ve explored this site before, you probably already saw my CDP1802 Namebadge project. If not, go check it out here!
And of course, I can port this to run on the emulator as well, while staying compatible with the real hardware. The original software does run with only a few modifications, but it is very slow. So, I re-wrote parts of it to use a different algorithm for scrolling the text. This one is a lot faster, but, unlike the previous code, not usable for generic text rendering: it can only scroll text from right to left.

The code still uses an internal framebuffer for constructing the next frame, before pushing it to the displays. This is technically not needed for the emulator, but I wanted to keep this compatible with the real hardware (that this code indeed runs on, but way too fast to actually be able to read the text, unsurprisingly). It also ended up minimizing screen tearing on the dot-matrix displays.

For rendering the output framebuffer, I recycled a dot-matrix display shader I created once for use in a VRC world. Since, in the emulator, the function to push the frame buffer just copies it to a different part of memory instead of to some external hardware, the shader can read those memory regions and display them.
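In sketch form, testing a single dot of the display then just means reading the right bit back out of that memory region (the base address and the one-byte-per-column packing are assumptions for the example):

    bool DotOn(Texture2D<uint4> mem, uint column, uint row)
    {
        const uint base = 0x7F00;                                // hypothetical framebuffer location
        uint addr = base + column;                               // one byte per display column
        uint bits = mem.Load(int3(addr & 0xFF, addr >> 8, 0)).r;
        return ((bits >> row) & 1) != 0;                         // one bit per row within that column
    }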

Just for fun, I also coded a quick shader to display all the register values. Funnily enough, the new program needs barely any RAM, using only 32 bytes for the framebuffer, and that’s it. The CPU's registers are enough for everything else.
In fact, there were three unused registers by the end: R2, R9 and R10. So, I made a bit pattern run across R9 and R10, just for the looks on the register visualizer.

The final software runs very fast. I get no noticeable increase in GPU usage from it, most likely because the new program was so efficient that I was able to reduce the emulator's instructions per second by a third.
The final product looks like this:

Closing thoughts

While not really useful for anything, I think the outcome of this project is still very cool. At the very least, it gave me something to show off to people. I think this could definitely be turned into something more useful. The problem right now is that I am emulating an 8-bit CPU with a very primitive instruction set that doesn’t really take advantage of a GPU's ability to do floating-point and 32-bit integer operations in hardware.

One partial solution I thought of would be to simulate another piece of hardware alongside the CPU, such as an AM9511, which is an early hardware FPU, except the emulated chip would really just pass the floating-point operations to the GPU core. Simulating a more powerful CPU, such as a 6502 or even a Z80, would also be a way to improve performance while keeping complexity relatively low.

If you want to play with this yourself, or just look at the code, you can do so on GitHub. Just keep in mind that actually setting this up on your avatar is pure hell! And please put it on a separate avatar so it is only active when you need it. It is still a relatively intense shader, and the input layer sometimes causes graphical glitches in surrounding transparent materials, which seems to be unfixable as of now.