Tholin’s Place

This is a continuation of a previous website entry. If you haven’t already, go read it first. I explain a lot of the fundamentals in there.

GFMPW-1 kinda happened out of nowhere. It had been a year since GFMPW-0, and half a year since the disappointing bring-up of the AS2650, and I had already deeply resigned myself to my fate of never being able to design another chip and fix the AS2650.
But it appeared Efabless was going to give us one last hurrah, announcing GFMPW-1, the second free shuttle on the GlobalFoundries 180nm process, in the same breath as officially confirming the cancellation of the OpenMPW program. And even better, I had a 5-week deadline now.

AS2650-2

The AS2650v2, or just AS2650-2, is what I went on to call my complete re-make of the AS2650. Yeah, I basically had no choice but to throw away the Verilog I came up with a year ago and start over if I wanted to make any kind of progress.

And this gave me time to reflect on the issues the development of the AS2650 originally had. Yes, some of it could be chalked up to my own lack of experience, but really, several development mistakes were made. And most of them centered around time.
I had, in short, not split my time correctly between development and verification. I was too obsessed with optimizing every bit of the design and getting the whole thing done first, and ended up writing only the most basic test cases imaginable that covered maybe 10% of all possible instruction encodings and machine states. I’d wager it was actually less than that. In hindsight, no surprise it turned out bad.

So the new battle plan was simple: start from scratch, building up the processor one class of instructions at a time and then, crucially, put a pause on any further development until test cases had been written! Testbenches would not be something created in parallel to the design at some point, but instead be blocking milestones directly in my path.

On top of that, getting a working CPU core was top-priority! As stated in part one, I had ambitions for putting more functionality on the die, but I would regard these simply as optional goals, to make sure maximum time was put into developing the CPU itself.

I also wanted to make the CPU core more advanced, giving it proper interrupt capabilities and a whole new addressing mode to extend addressable memory all the way from 8KiB to 64KiB. The prerequisites in the register set and physical interfaces for these instructions were the only exception I made to the order of my development process.

The result was probably the most successful development process I have ever had. Every instruction, every addressing mode, every bit of functionality has test cases for many combinations of scenarios. I even got paranoid and made tests to check all possible combinations of input states for the add and sub instructions, a total of 262,144 permutations, taking hours to complete.
I also discovered the magic of Verilator and could at last run RTL simulations of real software in a reasonable time.
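To give an idea of what such an exhaustive test looks like, here is a minimal Python sketch. It is not the actual testbench: 262,144 is 2^18, and one way to arrive at that count is two 8-bit operands times two single-bit status inputs. Which two bits were really varied is an assumption here (a carry-in and a "with carry" mode bit), and `ripple_add` is only a hypothetical stand-in for the hardware under test.

```python
from itertools import product


def reference_add(a, b, carry_in, with_carry):
    """Golden model: plain integer arithmetic."""
    total = a + b + (carry_in if with_carry else 0)
    return total & 0xFF, total >> 8


def ripple_add(a, b, carry_in, with_carry):
    """Stand-in for the design under test: a chain of 8 full adders."""
    carry = carry_in if with_carry else 0
    result = 0
    for bit in range(8):
        x = (a >> bit) & 1
        y = (b >> bit) & 1
        result |= (x ^ y ^ carry) << bit
        carry = (x & y) | (carry & (x ^ y))
    return result, carry


def run_exhaustive():
    """Compare both models for every one of the 2**18 input states."""
    checked = 0
    mismatches = []
    for a, b, carry_in, with_carry in product(range(256), range(256), (0, 1), (0, 1)):
        checked += 1
        if ripple_add(a, b, carry_in, with_carry) != reference_add(a, b, carry_in, with_carry):
            mismatches.append((a, b, carry_in, with_carry))
    return checked, mismatches


if __name__ == "__main__":
    checked, mismatches = run_exhaustive()
    print(checked, "states checked,", len(mismatches), "mismatches")
```

In pure Python this finishes quickly; the hours mentioned above come from pushing every state through a full RTL simulation of the CPU.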

When I finished and finally ran the design through the OpenLane flow, it ended up being smaller than the original AS2650, despite not only having the same target implementation, but also containing additional features the original did not. Just goes to show how badly designed the original was.

Another big change was the amount of documentation that I put together, which is to say, any at all. This is the project that made me see the value in documenting your own work, even when you’re not expecting anyone but yourself to ever make use of it. It’s this relaxing period after you have finished all the work on a project, where you sit down and spend a few hours reflecting on everything you have just accomplished. And who knows, maybe you’ll still find some issues just by going over everything again?

It also helps me specifically, because I tend to forget very easily, so I definitely am not going to remember on my own the intricate details of how to use the AS2650-2 by the time bring-up comes around.

AS2650-2 - what’s different?

So, what exactly changed? Well, you can read the full documentation for details, but in short: a lot! I did say I wanted to really make this architecture my own. Improve it and make it better using modern technology. I retained the additional instructions from the first AS2650 (such as multiply) and added some more, at one point even having to use the sole remaining unused opcode as a prefix to get more space. It’s mostly utility instructions, like complement and clear, but also ways to interact with the on-die stack, finally allowing it to spill over into memory under software control. The PSW could now also be pushed and popped and I expanded the stack to be 16 bits wide and 16 entries deep, a large upgrade over the original 2650.

I went on to replicate the original 2650’s paged memory addressing to provide further backwards-compatibility and then doubled the amount of pages it could access to get to 64KiB of memory address space. A whole new addressing mode, "far addressing" for load, store and branch instructions, now exists, to directly access memory anywhere in the address space. The original absolute addressing now only acts as a faster, more versatile way to access local memory.
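As a rough illustration of the paging arithmetic, here is a small Python sketch. It assumes 8KiB pages (a 13-bit in-page offset) as on the original 2650, with the page count going from 4 to 8; the function names are made up for this example and the real address formation is described in the documentation.

```python
OFFSET_BITS = 13            # assumption: 8KiB pages, as on the original 2650
PAGE_BITS_ORIGINAL = 2      # 4 pages
PAGE_BITS_EXTENDED = 3      # 8 pages after doubling


def address_space_bytes(page_bits):
    """Total bytes reachable with the given number of page bits."""
    return (1 << page_bits) << OFFSET_BITS


def effective_address(page, offset, page_bits=PAGE_BITS_EXTENDED):
    """Combine a page number and an in-page offset into a flat address."""
    if not 0 <= page < (1 << page_bits):
        raise ValueError("page out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (page << OFFSET_BITS) | offset


if __name__ == "__main__":
    print(address_space_bytes(PAGE_BITS_ORIGINAL) // 1024, "KiB with 4 pages")
    print(address_space_bytes(PAGE_BITS_EXTENDED) // 1024, "KiB with 8 pages")
    print(hex(effective_address(7, 0x1FFF)))
```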

The physical memory interface would change as well, going to a multiplexed address/data bus to free up pins for built-in GPIO. I determined the memory throughput cost to be minimal compared to the massive gain in functionality. A worthy tradeoff, if you will.

Interrupts were also finally implemented, though slightly differently than the 2650’s vectored interrupts. Interrupts were still numbered, but now looked up addresses to branch to in a software-defined table instead of requiring the interrupt source to define an absolute address. Another change to improve usability as an embedded controller.
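The table lookup can be pictured with a few lines of Python. This is only a sketch of the general idea: the entry size (two bytes), the byte order (high byte first) and the `table_base` parameter are assumptions for illustration, not taken from the AS2650-2 documentation.

```python
def interrupt_target(memory, table_base, irq_number):
    """Fetch the handler address for a numbered interrupt from a table in memory.

    Assumes 16-bit entries stored high byte first (hypothetical layout).
    """
    entry = table_base + 2 * irq_number
    high = memory[entry]
    low = memory[entry + 1]
    return (high << 8) | low


if __name__ == "__main__":
    memory = bytearray(64 * 1024)
    table_base = 0x0100                      # hypothetical table location
    memory[table_base + 2 * 3] = 0x12        # software fills in entry 3
    memory[table_base + 2 * 3 + 1] = 0x34
    print(hex(interrupt_target(memory, table_base, 3)))
```

The point is that software decides where each handler lives, so an interrupt source only needs to supply a small number.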

At this point, very little time had actually passed, less than a week, definitely proving that properly managing my development time was a good idea. So, that meant moving on to adding on-die peripherals.
To implement these, I decided to finally split the design up into multiple macros. Up until now, I had been throwing the whole RTL Verilog at OpenLane, building a single large circuit. However, this is not always optimal. At some point, it becomes very time-consuming to run the flow on a large circuit, and that gets especially annoying if you are only making changes to one particular area of it.

Instead, physically splitting the design up into smaller subcircuits that are wired together is an option. These subcircuits are called macros, as they can be instantiated on the die not just once, but as many times as required. They can be thought of as individual Verilog modules made manifest, with only wires passing between them.

In fact, up until this point, I had simply been building a single large macro to be embedded onto the die. Now, it was time to split things up. I wanted there to be multiple peripheral devices and so I decided to come up with a simple, internal address/data bus to allow the CPU to communicate with them. The 2650’s extended IO instructions (rede, wrte) were re-purposed for sending and receiving data from peripheral macros.

For the actual peripherals, I started by implementing simple, software programmable IO ports. Two ports of 8 bits each. Every other hardware functionality would be built on top of that, with IO pins being re-designatable to various alternate functions. Just as dictated by my development flow, these were implemented, verified and integrated one by one. In order of implementation, we have the basic GPIO, three counters/timers, PWM generators, a UART and a SPI port. The SPI port has its lines on dedicated pins, but the others are all special functions that can be enabled on the GPIO pins.
Oh, I then also put a whole C64 SID on there as well, but more about that later.

Completed layout for the AS2650-2

Lastly, I wanted to increase the usability of the chip. The AS2650 required a TON of external support circuitry to wire up memory and IO. I partially solved this already by moving the most commonly needed periphery on-die, but what about RAM? Well, this year, GlobalFoundries decided to bless us with some of their SRAM IP. That is, layouts for blocks of RAM in various capacities that could be turned into macros and instantiated on die. And, well, everything I did so far used less than half the available die area. So, time to just spam as much RAM as would fit. Which turned out to total 4096 bytes, mapped from the beginning of the CPU’s address space. External memory accesses could still be done by going past this range.

But, how to initialize this memory? A small, fixed program that does memory initialization from a different data source, also known as a bootloader, is usually required for this. I could’ve used the caravel management controller for the job, but I decided a more fun way would be to use the AS2650-2 itself (note: I still consider it one of this design’s few major mistakes to not make initializing RAM with the management controller an option).
It had everything required to copy 4KiB of data from one of those spiflash ICs I love so much. So, another macro on the die is simply a 170 byte large ROM that the CPU can start executing from, which has exactly that job. After the bootloader program has completed, the ROM is permanently disabled, like it was never there, leaving only the RAM space to execute code from.
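What such a bootloader does can be sketched in Python. The real one is 2650 machine code in ROM, so this is purely an illustration: `FlashModel` and `boot_copy` are invented names, and the sketch assumes the flash is read with the common SPI NOR "read data" command (opcode 0x03 followed by a 24-bit address) starting at address 0.

```python
READ_DATA = 0x03  # common SPI NOR "read data" opcode


class FlashModel:
    """Very small stand-in for a SPI flash that only understands READ_DATA."""

    def __init__(self, contents):
        self.contents = bytes(contents)

    def transfer(self, command, read_length):
        opcode, a2, a1, a0 = command
        if opcode != READ_DATA:
            raise ValueError("unsupported opcode in this toy model")
        address = (a2 << 16) | (a1 << 8) | a0
        return self.contents[address:address + read_length]


def boot_copy(flash, ram, source_address=0, length=4096):
    """Copy `length` bytes from flash into the start of RAM, as a bootloader would."""
    command = bytes([
        READ_DATA,
        (source_address >> 16) & 0xFF,
        (source_address >> 8) & 0xFF,
        source_address & 0xFF,
    ])
    data = flash.transfer(command, length)
    ram[:len(data)] = data
    return len(data)


if __name__ == "__main__":
    image = bytes(i & 0xFF for i in range(4096))
    ram = bytearray(4096)
    copied = boot_copy(FlashModel(image), ram)
    print(copied, "bytes copied, RAM matches image:", bytes(ram) == image)
```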

I also decided to make heavier use of the management controller. Did I mention that this thing had powerful debugging interfaces like a wishbone bus and logic analyzer probes? And that they went tragically unused in the first AS2650? Well, those are perfect for on-die debugging, and setting the processor operating mode. Whether the internal RAM is enabled or the boot ROM is used, for instance. Some parameters of code in the boot ROM can even be adjusted over wishbone!

Well, that was certainly a lot. On to tapeout and getting some rest- wait. What? The deadline is only halfway over. Hold o-

Wait, there’s more?

If anyone ever wonders what 100% of my power looks like, this is it! I was debating not starting a second project on the shuttle and to just wait out the remainder of the deadline, but the ideas kept flooding my mind so....why not?

Instead of restricting myself to a single project, I decided to do as many as I could, all concentrated on a single die. Similar to TinyTapeout, they would all share the chip pins through a multiplexer. Just as in the AS2650-2, the management controller could be used to configure things, mainly selecting the active design. I also integrated 64 bytes of RAM in the multiplexer, though this was before the SRAM macros were usable (I’m not even going to go into details on how hellish they were to integrate), so it’s implemented in RTL through a reg [7:0] RAM [63:0]; (tbh I could, and totally should, have fit twice as much memory) and synthesized into flip-flops.
Now, although I invited one other person to contribute designs to the project, only two small macros belong to them. For the most part, it was just me going crazy making one design after the other.

So, what exactly did I make? Well, a lot actually. As before, to get the full details, take a look at the documentation. But, here’s the quick rundown:

SID

This is one I’ve been wanting to do for a really, really long time: a re-creation of the C64 SID. An absolutely iconic sound chip that has been slowly dying out as it’s been out of production for ages. So, why not combat that by making more? GFMPW-1 was digital-only, though, so the whole thing is implemented in purely digital logic, needing an external DAC. Still, it’s not an emulation or FPGA, but dedicated circuitry purpose-built to generate audio. That still makes it a "true" SID in my eyes. There are also two of them. I applied some aggressive optimizations to make each SID as small as possible, so I could fit two within a reasonable area.

I was actually so determined to get this design successfully taped out that I duplicated it onto the AS2650-2 die (though only a single SID), which is the story of how that got there.

Diceroll & TBB1143

These are two designs from my very first tapeout on TinyTapeout 2. I thought it would be kinda poetic to bring them back, so I included the diceroll. The TBB1143 sound chip was supposed to cleverly use a ring oscillator to get around clock speed limitations on TinyTapeout 2, but I somehow messed it up, and it doesn’t work. Yeah, I was salty enough about that to include a fixed version here.

MC14500

This one is not just a re-implementation of the MC14500, but a whole computer around it. There is a project I did where I built a whole computer around the MC14500, capable of running complex arithmetic. I decided to just 1:1 duplicate it onto this integrated circuit, to see how small I could make a functional computer. The only differences are the reduced amount of RAM, using only the 64 bytes from the multiplexer, and the program counter gaining one more bit to expand the ROM space to 128KiB. But it is fully compatible with software written for the bulky discrete implementation of this machine. I even purposefully implemented a hardware bug to achieve this compatibility.

The result is indeed tiny, a mere 185 by 185µm. Compare that to the 1050 by 700µm of the AS2650-2 CPU core. It is so tiny, it’s barely visible on microscope dieshots without zooming in. And it’s almost certainly invisible to the naked eye. Quite impressive to go from an amount of discrete ICs that I struggle to find space for on my workbench to an almost invisible size.

QCPU

Which stands for "Quick CPU", of course. Not because the CPU itself is fast, but because I designed it in a day. I wanted to see how fast I could make a microcontroller, including coming up with my own CPU architecture, from scratch. A speedrun, if you will.

8 hours.

Yes, that includes verification.

This one was actually quite fun to work on. I decided that I would try making it a Harvard architecture processor, and execute code directly from a spiflash in QSPI mode for higher bandwidth. This allowed me to easily use 16-bit wide opcodes, even though the CPU itself is 8-bit. I took a lot of inspiration from RISC-V and came up with something certainly interesting. It has 16 registers, including a zero-reg, and is a load/store-type architecture, with most operations taking place through the Reg/Reg opcode type, which facilitates arithmetic and logic operations between two registers. Conditional branch and unconditional jump instructions make up most of the rest of the ISA.

For IO devices, I restrained myself a lot compared to the AS2650-2. There are two 8-bit wide GPIO ports and serial ports, but only a single timer and a very basic interrupt model. There is also once again only the 64 bytes of RAM in the multiplexer available (though the addressing limit is 256 bytes). Still, a very neat little design.

AS-11

Somehow, I got talked into doing a PDP-11/40 compatible processor, and this....well, it’s certainly an attempt. At this point, my development process was once again falling apart and I was rushing to get implementations done. I was still trying to write as many testbenches as I could, but was falling back to simpler methods of building those tests and skipping testing non-essential components entirely. The deadline was approaching, but nonetheless, I kept going.

Final Multi-project layout and project map

Tholin’s RISC-V

I was actually about to call it quits, as the deadline was only one day away. But I had a very fateful encounter with a certain avali, while I was thinking out loud if 24 hours would be enough for me to create a simple RISC-V core. They dared to answer me, saying there was no way I could.
And at that point, it was a challenge. My hands were tied. I had to do this, to defend my honour.

Luckily, RISC-V is very very easy to implement, and I didn’t even have to invent my own ISA, so after 8 hours, I called it done. Really, the difficult part wasn’t the ISA, but the physical memory interface, something I struggled with on the AS-11 too. In both designs, the pin count limited me to a 16-bit bus multiplexing address and data on the same pins, forming a pretty sizable bottleneck.

Though besides implementing the base integer ISA, I even hit some stretch goals, like implementing the multiply/divide extension using vlsiffra to generate a fast multiplier, getting some built-in peripherals done and verified here too, as well as a robust interrupt model.

I did a decent-ish job with verification here too. Not as in-depth as it should be, but if your RISC-V implementation can run programs compiled by GCC that use complex floating-point functions, without any bugs, I’d say it’s good to go.

This was all, of course, to the absolute bewilderment and mental anguish of the aforementioned avali. It appears I will be able to go another day with my pride intact!

Bring up - Preparations

At this point, I actually finalized my submissions, for real this time. But my hopes for getting both chip designs taped out were very low. GFMPW-1 was still a lottery, after all. And it was a very popular shuttle with a lot of submissions. In fact, at one point, so many people pushed to Efabless’ git servers that they ran out of disk space and crashed, forcing Efabless to extend the deadline by one day while they fixed things.

Somehow, for some reason, both of my projects were listed in the accepted submissions table some weeks later. I was genuinely surprised, but it also meant I had to make preparations quickly for the amount of things I would have to bring up.

My most significant idea in this regard was to make a generic breakout board for GFMPW-1 chips. This PCB breaks out most of the user IO pins on the chip onto a footprint exactly the same as a DIP-40 IC. All the components to keep the caravel management controller happy were right on the board, so it was a completely standalone breakout. All but one of my other PCB designs would be built around this breakout, and other people in the open source silicon community even made use of it!

And then came the PCB designs for the individual projects. Oh man, so many PCB designs.

Bring up - Multi-Project

Multi-project die. Oh yeah, did I mention yet I figured out how to do silicon art?
High res image link

The two batches of chips were shipped separately, and these are actually the first ones I got. Not only did I get packaged ICs, but also raw dies! That means high-quality microscope die photos! But also a lot of work. Bringing up this number of designs is not easy.

Let me go over what worked, first. The SID was brought up first, followed by the other sound chips. There were some brief difficulties due to the setup/hold times on the chip inputs being unknown. Both SIDs work and can rock out to some tunes just fine, with the only issue being in my test rig: lag from piping the register updates over a 115200 baud UART for playback. The other PSGs seem fine too, even the TBB1143! Though I’ve not yet heard it play any more than test tunes. Obviously there is no existing music for my custom sound chip and I’m not a composer myself (maybe should’ve thought of that before I designed it).

The MC14500 computer worked perfectly, which also means the RAM in the multiplexer does too! I used an ESP8266 to capture data from the clocked serial port on the MC14500 and show it to me as text. A bit overkill, but it works. I tested it using the mandelbrot renderer, and managed to get it to run at 27MHz! That is 13.5 times faster than the perfboard build! It’s actually quite fast. I can see this unironically being a functional, low-area, low-power embedded processor. I may use it as such in future projects.

All this is, sadly, required to run QCPU

QCPU was annoying to bring up. It was supposed to be easy to wire up on a breadboard, but now requires even more hardware. An active level shifter to go between the 5V levels of GFMPW and 3.3V levels of the flash was always a requirement, but, of course, the one part of the design I didn’t verify with a testbench, the shifter direction signal, was not being generated correctly for the init sequence. Some 74-series logic helps fix it, but it doesn’t fix the init sequence being bad in general!
See, I decided I wanted maximum throughput by using the quad mode on the flash to transfer 4 bits at a time, which requires setting the Q bit in its status register to enable. So, it’s a short command sequence to set that bit and it’s ready to go, right?

Weeeell it turns out these parts implement most of the status register bits as non-volatile. Meaning they’re in flash. Meaning the flash programming time of 1 - 2ms needs to be met. Not only am I not doing that, but the official Winbond Verilog model of these parts does not simulate this timing. I had no indication this was bad as a result, and I am still salty about that, Winbond!
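The failure mode can be shown with a toy model in Python. This is not Winbond’s model and not the real init sequence: `ToyFlash` and its methods are invented for this example, and the only behavior it assumes is that a freshly written Q bit is not usable until the write time has elapsed. What a real part does during that window is specified in its datasheet.

```python
class ToyFlash:
    """Toy SPI flash with a non-volatile quad-enable (Q) bit.

    Hypothetical model: only the write time and the Q bit are represented.
    """

    def __init__(self, write_time_ms=2.0):
        self.write_time_ms = write_time_ms
        self.quad_enabled = False
        self.busy_until_ms = 0.0
        self.pending_quad_enable = False

    def _update(self, now_ms):
        # Once the internal programming cycle has finished, the bit takes effect.
        if self.pending_quad_enable and now_ms >= self.busy_until_ms:
            self.quad_enabled = True
            self.pending_quad_enable = False

    def set_quad_enable(self, now_ms):
        """Start programming the Q bit; the flash is busy for write_time_ms."""
        self._update(now_ms)
        self.pending_quad_enable = True
        self.busy_until_ms = now_ms + self.write_time_ms

    def quad_read_accepted(self, now_ms):
        """True only if the Q bit is set and no write is still in progress."""
        self._update(now_ms)
        return self.quad_enabled and now_ms >= self.busy_until_ms


if __name__ == "__main__":
    flash = ToyFlash(write_time_ms=2.0)
    flash.set_quad_enable(now_ms=0.0)
    print("read right away:", flash.quad_read_accepted(now_ms=0.001))
    print("read after waiting:", flash.quad_read_accepted(now_ms=2.5))
```

A simulation model without the write time behaves as if `write_time_ms` were zero, which is why a testbench can pass while the real part refuses to cooperate.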

The management controller comes in clutch, though. Just a bit of hand-written RISC-V inline assembly to get the timing right, and I can use a GPIO pin under software control and an OR gate to override the flash chip-select signal at just the right time to make it not see that part of the initialization sequence. Still works as long as the Q bit is programmed beforehand, which is easy.

I haven’t had time yet to experiment much with this processor besides testing the ISA and on-die hardware. I did find an issue where it will drop pending interrupts if they come in right on the clock cycle going from an instruction fetch to execute. This makes timer interrupts difficult to use, but doesn’t affect external interrupts, which can be aligned to a safe cycle with one D-flip-flop. All else seems good, and like a fine little architecture.

The AS-11 was the most difficult to get to work at all, and is the second case study in a row on why simulations should not be implicitly trusted. It just would not want to execute code at all, which was insanely frustrating! I spent a lot of time on its PCB, as you can tell. It has a mockup of a PDP-11 front panel, with address and data displays and control switches, though in this case, they’ll be driven in software, but still used for data entry and readout.
I even used expensive FRAM to emulate the behavior of the magnetic core memory used on PDP-11 minicomputers (this is why they’re on breakouts - I am not risking $30 of memory directly on the PCB).

In the end, the problem was simple: I wanted the address latch line, which tells the system the bits on the 16-bit bus are a memory address needing to be latched, to stretch only half a clock cycle, to meet hold times on the latches.
How did I do this? Well, assign latch_enable = latch_enable_pre && clk;.

Yep, I committed the sin of ANDing with the clock, after years of other logic designers telling me: that’s bad! Don’t do that! But, because I used to do this a LOT in discrete logic with no bad consequence, I thought they were just wrong.

Turns out, they’re not. To the side, you’ll find screenshots of oscilloscope traces. The first shows the intended latch-enable signal followed by a tiny, second pulse that’s almost too short for the scope’s sampling rate. For a few nanoseconds after the clock goes high, latch_enable_pre is still transitioning from high to low, so both inputs of the AND are high at the same time, causing a brief second pulse on the line.
The second trace shows the effect of this on one of the RAM’s data lines (blue). The longer pulse is an address bit from the CPU, but the second spike is from RAM. It briefly outputs the correct data (a logic one) before the second latch-enable pulse comes in, unintentionally changing the address to garbage and thus the output data too.
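The mechanism behind that second pulse can be reproduced with a very simple timing model in Python. The numbers are made up (a clock edge at 10ns and a 3ns delay before latch_enable_pre falls) and `simulate_latch_enable` is a hypothetical helper; it only shows why ANDing a registered signal with the clock that updates it produces a sliver of overlap.

```python
def simulate_latch_enable(clk_to_q_ns, clk_rise_ns=10, end_ns=30):
    """Sample latch_enable = latch_enable_pre & clk once per nanosecond.

    clk goes high at clk_rise_ns. latch_enable_pre was high and is supposed to
    go low on that same edge, but only does so clk_to_q_ns later.
    """
    trace = []
    for t in range(end_ns):
        clk = 1 if t >= clk_rise_ns else 0
        latch_enable_pre = 1 if t < clk_rise_ns + clk_to_q_ns else 0
        trace.append(latch_enable_pre & clk)
    return trace


def glitch_width_ns(trace):
    """Number of 1ns samples during which the unwanted pulse is high."""
    return sum(trace)


if __name__ == "__main__":
    print("ideal, zero delay:", glitch_width_ns(simulate_latch_enable(0)), "ns")
    print("with 3ns delay:   ", glitch_width_ns(simulate_latch_enable(3)), "ns")
```

With zero delay the two signals never overlap and the output stays low. With any real delay, the output is high for exactly as long as latch_enable_pre lags behind the clock edge.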

Ironically, the fix for this is getting out a 74HC08 AND gate and ANDing the latch-enable line with the clock a second time. The 74HC08 is obviously slower than whatever standard cell AND-gate is on the chip, and thus swallows the second pulse. Or at least, I think so.
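That guess matches how slow gates are often modeled: a pulse shorter than the gate can respond to never makes it to the output (sometimes called inertial delay). Here is a minimal Python sketch of just that pulse-rejection part, with made-up sample counts; it ignores the gate’s propagation delay and says nothing about the real 74HC08’s timing.

```python
def inertial_filter(trace, min_pulse_samples):
    """Remove every high pulse shorter than min_pulse_samples from a 0/1 trace."""
    out = list(trace)
    i = 0
    while i < len(trace):
        if trace[i]:
            j = i
            while j < len(trace) and trace[j]:
                j += 1
            if j - i < min_pulse_samples:
                for k in range(i, j):
                    out[k] = 0  # pulse too short for the slow gate to follow
            i = j
        else:
            i += 1
    return out


if __name__ == "__main__":
    # A long intended pulse followed by a short unwanted one.
    trace = [0] * 5 + [1] * 20 + [0] * 10 + [1] * 3 + [0] * 5
    filtered = inertial_filter(trace, min_pulse_samples=6)
    print("high samples before:", sum(trace), "after:", sum(filtered))
```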

But the most interesting part to me is that none of this showed up in simulation in the testing framework by Efabless, which uses iverilog. However, a simulation I set up in Verilator later did, in fact, simulate this problem correctly, breaking just as the real hardware did. Even though it was an RTL simulation without timing. Interesting. I will definitely be using a combination of both simulators in the future, taking either with a grain of salt. Verilator also tried to tell me the MC14500 computer wouldn’t work, even though it clearly does.

Oh, and of course, it’s not actually PDP-11 compatible. A couple instructions behave incorrectly and the whole relative addressing mode does the address computation differently. It’s not wrong, just different than the PDP-11’s spec, which I misread in a few places, evidently. This reminds me a lot of the original AS2650. No wonder, with how rushed this design was too. *sigh* I just don’t learn sometimes. Still, it’s a functional processor and I still intend to have fun with it. It also can still run C programs. It appears GCC never makes use of the broken instructions or addressing modes, which I find interesting.

That only leaves the RISC-V CPU to test. To recap: I created that thing in less than 8 hours, at the challenge of someone, with just barely enough verification to demonstrate basic functionality.
It works perfectly! There is some minor problem with generating an interrupt with a timer and with setting up the interrupt routine address, but both are fixed with a single line of C code, or two assembly instructions. Everything else: perfect! This is a fact that has ensured that that friend of mine will never forget about this (something tells me they’re impressed).

This is actually weird, because I used the same clock ANDing thing as with the AS-11, but with no problem this time. Maybe it’s because the design was blessed by the ASIC gods once they saw I was building it to defend my honor. Or maybe it’s because of differences in synthesis. Or because I’m using slower 74-series chips for the address latches. Who knows.

The little test board for it has 512KiB of SRAM and 4MiB of UV-EPROM (because I thought it looks cool) as well as a slot for a microSD card and an RTC module, and a place for me to later connect a little ethernet module to. I thought giving this thing a network connection would be a fun project, but that’s for the future. I even benchmarked it with some raytracer code from GitHub, getting a score of 1.7, which is actually not that good in comparison to other softcores. It’s probably due to the 16-bit memory bus forming a major bottleneck. It could be worse. I benchmarked the caravel management controller, also a RISC-V CPU, which got a score of 0.04, worse than SERV, which is funny to me.

All in all: a very successful little chip with a lot of functionality! I am especially hoping for some crazy things to happen with the SID replica in the near future!

Bonus pics: close-up of breakout and the MC14500 computer being tested

Bring up - AS2650-2

AS2650-2 die
High res image link

It actually took me a while to bring up the AS2650-2. The original plan was to simply bodge-wire a spiflash to the generic breakout board from Efabless and use the on-die RAM and boot ROM to start using the CPU. This did not work out. No matter what I did, the internal RAM would not respond, leading me to believe it was not wired up correctly by OpenLane, and thus broken.
This was quite frustrating since that was arguably the most important feature of the new design, as it would make it a thousand times easier to use. I had even been looking forward to using this chip on a breadboard as a microcontroller for various projects.

So, on to designing a PCB I went, with 32KiB of external SRAM. This took a bit of support logic due to the write-enable signal not being generated correctly. Its shape is supposed to be configurable, but instead, I am stuck with it stretching a whole clock pulse, which the SRAM would not accept. This ultimately dropped the maximum clock speed of the system to 10MHz.
But I decided to take the opportunity to make something cool, combining an LCD display and some buttons with the CPU to make a handheld gaming thing, using the SID for audio. It’s not comfortable to hold, or to use, but you can play snake on it! Or make it play Bad Apple (this time with sound!). And it only has a few bodge wires from where I messed up a signal polarity.

And then it turned out the caravel management controller has a bug! If it is instructed to write to a user-area range wishbone address, the cache breaks! A previous memory or wishbone write becomes broken, in my case corrupting a variable. This had the effect of writing a broken value to the wishbone register containing the internal RAM enable. It’s possible to work around this issue once aware of it, but man, I did so much work with the PCB just to find out by accident that the internal RAM was working the whole time.

It’s good I did, though, as the internal RAM is very fast to access. So fast, in fact, that when using only it, the CPU can push 55MHz! That is quite good and even better than what I had hoped for. It is absolutely an improvement over the 25MHz of the first version. The chip gets very toasty too, at this speed. Hottest chip I own for sure.

This project was a pretty nice test of all the hardware, too. It required all the GPIOs, timers and the SPI bus. The UART got a workout too with some debug logs. And it’s all performing as expected. I think this chip is the greatest accomplishment of my life so far and definitely shows what difference a good development process can make. Did I mention yet that it has better IPC too?
I think the fact that its bring-up went so smoothly that I have barely anything to say about it should speak volumes about its quality. It just works™. Cannot wait to see what else I’ll be able to use it for in the future.

Conclusion

I had one last chance to prove myself in the realm of integrated circuit development, and I took it! Taking all the lessons learned from my bumpy takeoff and using them to create two good chips, one of which I will forever be particularly proud of. I am incredibly happy with the outcome of this part of the journey. I’ve also managed to supply myself with enough complex custom IC designs to keep me busy for years, and some will be augmenting my personal projects for a long time.
But, I’m not intent on calling it quits. I have grander ideas still that I wish to see come to fruition, and soon, I would have a chance to do so.

TO BE CONTINUED!

Related Repositories

Description	Link
AS2650-2 submission repo	AvalonSemiconductors/AS2650
AS2650-2 bring-up board hardware, firmware and software	AvalonSemiconductors/AS2650-bring-up
Multi-project submission repo	AvalonSemiconductors/gfmpw1-multi
Multi-project bring-up board hardware, firmware and software	AvalonSemiconductors/gfmpw1-multi-bringup
Assembler for all my custom CPU architectures	AvalonSemiconductors/asl-avalonsemi

Custom Silicon Adventure (p.2)