As the lead coder of bsnes, I've been attempting to perfect Super Nintendo emulation for the past 15 years. We are now at a point where that goal is in sight, but we face one last challenge: accurate cycle timing of the SNES video processors. Getting that final bit of emulation accuracy will require a community effort that I hope some of you can help with. But first, let me recap how far we've come.
Where we are
Today, SNES emulation is in a very good place. Barring unusual peripherals that are resistant to emulation (such as a light-sensor based golf club, an exercise bike, or a dial-up modem used to place real-money bets on live horse races in Japan), every officially licensed SNES title is fully playable, and no game is known to have any glaring issues.
SNES emulation has gotten so precise that I've even taken to splitting my emulator into two versions: higan, which focuses on absolute accuracy and hardware documentation; and bsnes, which focuses on performance, features, and ease of use.
Some amazing things have come out of SNES emulation recently, including:
- Low-level emulation of all SNES coprocessors
- HD mode 7 support
- Slowdown removal
- Widescreen support
- MSU1 for CD-audio and FMV
- Run-ahead for latency reduction
- Dynamic rate control for perfect audio-video synchronization
… and much more!
So that's it, right? Kudos on a job well done, thanks for all the fish? Well… not quite.
Today, we enjoy cycle-level accuracy for nearly every component of the SNES. The sole exception is the PPUs (picture processing units), which are used to generate the video frames sent to your screen. We mostly know how the PPUs work, but we have to make guesses for some functionality that result in less than total perfection.
The remaining issues are relatively small ones, in the grand scheme of things. If you're not interested in the pursuit of one hundred percent faithful emulation perfection for its own sake, I am not going to be able to convince you of the need for improving SNES PPU emulation further. As with any goal in life, the closer we get to perfection, the smaller the returns.
I can tell you why this is important to me: it's my life's work, and I don't want to have to say I came this close to finishing without getting the last piece of it right. I'm getting older, and I won't be around forever. I want this final piece solved so that I can feel confident in my retirement that the SNES has been faithfully and completely preserved through emulation. No stone was left unturned, no area left unfinished. I want to say that it's done.
If you're still intrigued, read on for a deep dive into the background of the problem and my proposed solutions.
Modeling the SNES design
Let's start by taking a look at the components that make up the SNES:
The arrows indicate the direction that the various processors in the SNES can communicate with one another, and the dotted lines represent memory chip connections.
The key thing to take away right now is that the video and audio output are sent directly from the PPU and DSP. That means they function like “black boxes” where we don't have any visibility into what happens inside. This will be important later on.
Imagine you are emulating a CPU's "multiply" instruction, which takes two registers (variables), multiplies them together, and produces a result and some flags that represent the status of the result (such as overflow).
We could devise a software program that multiplies every possible value from 0 to 255 as both the multiplier and multiplicand. Then we could output both the numeric and flag results of the multiplication. This would produce two 65,536-entry tables.
By analyzing these tables, we could determine exactly how and when the CPU results were set certain ways. Then we could modify our emulators so, when running the same test, we produce exactly the same tables at the same times.
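A minimal sketch of that table-comparison idea follows. Everything in it is hypothetical: `reference_multiply` stands in for real hardware, the "zero" flag is made up for illustration, and `buggy_multiply` is a deliberately wrong emulator so there is something to detect.

```python
def reference_multiply(a, b):
    """Hypothetical stand-in for real hardware: 8-bit x 8-bit multiply.

    Returns the 16-bit product and a made-up "zero" status flag.
    """
    product = (a * b) & 0xFFFF
    return product, product == 0


def buggy_multiply(a, b):
    """A deliberately wrong emulator: it never sets the flag when a is 0."""
    product = (a * b) & 0xFFFF
    return product, (product == 0 and a != 0)


def build_tables(multiply):
    """Run all 256 x 256 operand pairs; return the result and flag tables."""
    results, flags = [], []
    for a in range(256):
        for b in range(256):
            product, zero = multiply(a, b)
            results.append(product)
            flags.append(zero)
    return results, flags


def first_mismatch(hardware, emulator):
    """Return the first (a, b) pair where the tables disagree, or None."""
    hw_results, hw_flags = build_tables(hardware)
    emu_results, emu_flags = build_tables(emulator)
    for index in range(len(hw_results)):
        if (hw_results[index], hw_flags[index]) != (emu_results[index], emu_flags[index]):
            return divmod(index, 256)
    return None
```

Here `first_mismatch(reference_multiply, buggy_multiply)` returns `(0, 0)`, the first operand pair where the flag tables differ, while comparing the reference against itself returns `None`.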
Now let's say the CPU had 16-bit x 16-bit multiplications. Testing every possible value would generate 4 billion results, which is starting to push what is practical to test in a reasonable amount of time. If the CPU had 32-bit x 32-bit multiplications, it wouldn't be practical to test all combinations of inputs before the heat death of the universe (with current technology, at least).
In cases like this, we would have to get more selective with our tests and try to determine exactly when flags might change, when results might overflow, and so forth. Otherwise we'd have tests that would never complete.
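One way to get more selective, sketched here under assumptions, is to test only operands near the points where carries, sign bits, or overflow are likely to change. The choice of "interesting" values below is illustrative, not a list taken from any real test suite.

```python
def boundary_values(bits):
    """Operands near the places where carries or overflow tend to change."""
    top = (1 << bits) - 1
    half = 1 << (bits - 1)
    candidates = {0, 1, 2, half - 1, half, half + 1, top - 1, top}
    # Single-bit patterns exercise each bit position on its own.
    candidates.update(1 << n for n in range(bits))
    return sorted(candidates)


def selective_pairs(bits):
    """Every combination of boundary values as (multiplier, multiplicand)."""
    values = boundary_values(bits)
    return [(a, b) for a in values for b in values]
```

For 16-bit operands this yields 21 values and 441 pairs, compared with the 4,294,967,296 pairs an exhaustive run would need. The trade-off is that anything outside the chosen values goes untested, so the selection has to be guided by what the earlier results suggest.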
Multiplication is a fairly trivial operation, but this is the general process behind reverse engineering, and it extends to more complex operations such as how the SNES' horizontal blanking DMA (direct memory access) transfers work. We create tests that try to detect what happens on edge cases, then confirm that our emulation behaves identically to a real SNES.
Oscillators and cycles
The SNES contains two oscillators: a crystal clock that runs at ~21MHz, which controls the CPU and PPUs; and a ceramic resonator that runs at ~24MHz, which controls the SMP and DSP. Cartridge coprocessors will sometimes use the ~21MHz CPU oscillator and sometimes include their own oscillators that run at different frequencies.
A clock is the core timing element of any system, and the SNES is designed to perform various tasks at certain frequencies and times.
If you imagine a 100Hz clock, it is a device with a digital pin that transitions to logic high (+5 volts, for instance), and then back to logic low (0 volts, or ground) 100 times per second. So every second, the pin voltage will fluctuate 200 times total: 100 rising clock edges and 100 falling clock edges.
A clock cycle is generally treated as one full transition, so a 100Hz clock would generate 100 clock cycles per second. There are some systems that require distinguishing between rising and falling edges, and for those, we break this further down into half-cycles to denote each phase (high or low) of the clock signal.
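The edge and cycle counts above can be checked with a small sketch of an idealized square-wave clock. The function name and the half-cycle indexing are assumptions made for illustration.

```python
def clock_edges(frequency_hz, seconds=1):
    """Yield (half_cycle_index, edge) pairs for an idealized clock."""
    for half_cycle in range(2 * frequency_hz * seconds):
        yield half_cycle, ("rising" if half_cycle % 2 == 0 else "falling")


edges = list(clock_edges(100))
rising = sum(1 for _, edge in edges if edge == "rising")
falling = sum(1 for _, edge in edges if edge == "falling")
full_cycles = len(edges) // 2  # one full cycle is one rising plus one falling edge
```

For a 100Hz clock over one second, `edges` holds 200 entries, split into 100 rising and 100 falling edges, which make up 100 full cycles.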
The key goal of an authentic emulator is to perform tasks in exactly the same ways and at exactly the same times as the real hardware. It doesn't much matter specifically how the tasks are performed. All that matters is that the emulator, when given the same inputs, generates the same outputs with the same timing as real hardware.
Sometimes, operations happen over time. Take SNES CPU multiplication, for instance. Rather than pausing to wait for multiplication to complete, the SNES CPU calculates the multiplication result one bit at a time in the background over eight CPU opcode cycles. This allows your code to possibly do other things while waiting on the multiplication to complete.
Any commercially released software is likely to wait those eight cycles, because if you try to read the result before it's ready, you will get a partially computed result instead. Yet earlier SNES emulators gave correct results immediately, without waiting these extra cycles.
When hobbyists started creating and testing homebrew software via emulators, this discrepancy started to cause some problems. Some of this software, such as many early Super Mario World ROM hacks, only worked correctly on these earlier emulators, and not on real SNES hardware. That's because they were designed with the emulator's immediate (and inauthentic-to-real-hardware) multiplication results in mind.
As emulators improved, this old software broke, and we have had to subsequently offer compatibility options in our newer emulators in order to not lose this software to time. Yes, as surreal as it is to say, these days our emulators have to emulate other emulators! How meta!
The nice thing about the CPU multiplication delay is that it's very predictable: the eight computation cycles start immediately after requesting a multiplication. By writing code to read the results after every cycle, we were able to confirm that the SNES CPU was using the Booth algorithm for multiplication.
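To illustrate why reading too early gives a partial result, here is a simplified model of a multiplier that consumes one multiplier bit per cycle. It uses plain shift-and-add for clarity; it is not the Booth algorithm mentioned above, and its intermediate values are not claimed to match real hardware. The class and method names are hypothetical.

```python
class BackgroundMultiplier:
    """Toy 8-bit x 8-bit multiplier that needs eight steps to finish."""

    def start(self, a, b):
        self.shift = a        # multiplicand, shifted left once per cycle
        self.remaining = b    # multiplier bits not yet consumed
        self.result = 0
        self.cycles_left = 8

    def step(self):
        """Advance one cycle, consuming the lowest remaining multiplier bit."""
        if self.cycles_left == 0:
            return
        if self.remaining & 1:
            self.result = (self.result + self.shift) & 0xFFFF
        self.remaining >>= 1
        self.shift <<= 1
        self.cycles_left -= 1

    def read(self):
        """Return whatever is in the result register right now."""
        return self.result


def partial_results(a, b):
    """Read the result register after each of the eight cycles."""
    unit = BackgroundMultiplier()
    unit.start(a, b)
    seen = []
    for _ in range(8):
        unit.step()
        seen.append(unit.read())
    return seen
```

In this model, `partial_results(200, 200)` returns `[0, 0, 0, 1600, 1600, 1600, 14400, 40000]`. Only the last read is the true product; software that reads after four cycles would see 1600 instead. Recording a sequence like this on real hardware is the kind of evidence that lets you work out which algorithm the chip uses.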
Other operations are not so simple to model, since they happen asynchronously in the background. The SNES CPU's DRAM refresh is one such case.
During the rendering of every scanline, at a certain point, the entire SNES CPU freezes for a short duration as the contents of the RAM chip are refreshed. This is needed because, as a cost-cutting measure, the SNES used dynamic RAM (rather than static RAM) for its main CPU memory. Dynamic RAM must be periodically refreshed in order to preserve its contents over time.
The key insight to figuring out the precise timing of these operations was to take advantage of the SNES PPUs' horizontal and vertical counters. These counters advance and are reset after each horizontal and vertical blanking period. However, their precision is only a quarter of the SNES' CPU oscillator frequency; that is to say, the horizontal counter increments only once every four clock cycles.
By reading the counters multiple times, I was able to determine which quarter of a clock cycle the counter was aligned with. By combining that insight with a specially crafted function that could step by a precise, user-specified number of clock cycles, it became possible to perfectly align the SNES CPU to any exact clock cycle position I wanted.
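The alignment idea can be sketched with a toy model. `FakeConsole` is hypothetical: it assumes a counter that ticks once every four clocks, a `step` that advances by an exact number of clocks, and reads that cost no time, none of which is guaranteed on real hardware.

```python
class FakeConsole:
    """Toy model: a counter that increments once every four clocks."""

    def __init__(self, hidden_phase):
        self.clock = hidden_phase  # starting position, unknown to the test

    def step(self, clocks):
        """Advance by an exact number of clocks."""
        self.clock += clocks

    def read_counter(self):
        """The only thing the test program is allowed to observe."""
        return self.clock // 4


def find_alignment(console):
    """Step one clock at a time until the counter ticks.

    Returns the number of steps taken (1 to 4). Afterwards the console
    sits exactly on a counter tick boundary.
    """
    start = console.read_counter()
    steps = 0
    while console.read_counter() == start:
        console.step(1)
        steps += 1
    return steps
```

The number of steps reveals where within the four-clock window the test began, and the console ends up on a known boundary. From there, stepping by a chosen number of clocks reaches any exact position.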
By iterating over a range of clock cycles in a loop, I could determine exactly when certain operations (such as DRAM refresh, HDMA transfers, interrupt polling, etc.) would occur, and I was able to reproduce this precisely under emulation.
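A sketch of that sweep, under assumptions: the model below hides a single stall somewhere in a scanline, and the test starts a fixed-length activity at every possible offset to see when it takes longer than expected. The scanline length, stall position, and stall length are made-up values for this toy, not measurements.

```python
LINE_CLOCKS = 1364         # assumed scanline length for this toy model
HIDDEN_STALL_START = 500   # made-up value the sweep tries to discover
HIDDEN_STALL_LENGTH = 32   # made-up value


def elapsed_clocks(start, work):
    """Clocks needed to finish `work` clocks of activity begun at `start`."""
    position, remaining = start, work
    stall_end = HIDDEN_STALL_START + HIDDEN_STALL_LENGTH
    while remaining > 0:
        if not (HIDDEN_STALL_START <= position < stall_end):
            remaining -= 1  # progress is only made outside the stall
        position += 1
    return position - start


def sweep(work=8):
    """Return every start offset at which the activity ran long."""
    last_start = LINE_CLOCKS - work - HIDDEN_STALL_LENGTH
    return [start for start in range(last_start)
            if elapsed_clocks(start, work) > work]


delayed = sweep(8)
inferred_start = delayed[0] + 8 - 1               # first clock that was blocked
inferred_length = elapsed_clocks(delayed[0], 8) - 8
```

With these values the sweep recovers a stall beginning at clock 500 and lasting 32 clocks, using nothing but the elapsed-time measurements.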
The SNES SMP chip has its own timers, and similar reverse engineering was successful against that processor as well. I could spend an entire article talking about the SMP TEST register alone, which allows coders to control the clock divider of the SMP and its timers, among other horrible things. Suffice it to say that, while it was not an easy or fast process, we were ultimately victorious.
There were a whole host of SNES coprocessors used inside various game cartridges that needed to be tamed as well. From dedicated general-purpose CPUs like the SuperFX and SA-1, to digital signal processors like the DSP-1 and Cx4, to decompression accelerators like the S-DD1 and SPC7110, to real-time clocks from Sharp and Epson, and more…
That means an SNES emulator needs to be able to handle the instruction and pixel caches of the SuperFX; the memory bus conflict arbitrator of the SA-1 (which allowed the SNES CPU and SA-1 to share the same ROM and RAM chips simultaneously); the embedded firmware of the DSP-1 and Cx4; the prediction-based arithmetic coders of the S-DD1 and SPC7110; and the odd BCD (binary-coded decimal) edge cases of the real-time clocks. Slowly but surely, by applying the above techniques to determine correctness and timing, we were able to near-perfectly emulate all of these chips.
It actually took a massive effort and thousands of dollars to decap and extract the programming firmware from the digital signal processors used in various games. In one instance, emulation of the NEC uPD772x led to code from higan being used to save the late professor Stephen Hawking's voice!
In another case, we had to reverse-engineer the entire instruction set of the Hitachi HG51B architecture, because this architecture was never publicly documented. In yet another, one game (Hayazashi Nidan Morita Shougi 2) ended up containing a full-blown 32-bit, 21MHz ARM6 CPU to accelerate its Japanese chess engine!
Preserving all of the SNES coprocessors alone was a multi-year journey full of challenges and surprises.
Processing digital signal
Not to be confused with the DSP-1 cartridge coprocessor, the Sony S-DSP (digital signal processor) chip is what generated the distinctive sound from the SNES. This chip combined eight voice channels with 4-bit ADPCM encoding to produce a 16-bit stereo signal.
On the surface, and per the system diagram from earlier, the DSP initially looks like a black box: you configure the voice channels and mixer settings and sit back as it generates sound to be sent to your speakers.
But one key feature allowed a developer by the name of blargg to fully reverse-engineer this chip: the echo buffer. The SNES DSP has a feature that mixes the outputs from previous samples together to produce an echo effect. This happens at the very end of the audio generation process (aside from a final mute flag that can be applied to silence all audio output).
By writing carefully cycle-timed code and monitoring those echo results, it became possible to discover the exact order of operations the SNES DSP would take to generate each sample and to produce cycle-accurate, bit-perfect audio.
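To show why an echo buffer makes the chip observable, here is a generic feedback-delay sketch. It is a toy, not the S-DSP's actual echo path: the feedback scaling by 128, the absence of any filtering, and the assumption that a muted output still writes to the buffer are all choices made for illustration.

```python
def run_echo(samples, delay, feedback, mute=False):
    """Mix each input with the value written `delay` steps earlier.

    Returns (output, written): the audio that would reach the speakers,
    and the values stored in the echo buffer at each step.
    """
    buffer = [0] * delay
    output, written = [], []
    for index, sample in enumerate(samples):
        slot = index % delay
        mixed = sample + (buffer[slot] * feedback) // 128
        buffer[slot] = mixed            # stored where test code can read it
        written.append(mixed)
        output.append(0 if mute else mixed)
    return output, written
```

Calling `run_echo([100, 0, 0, 0, 0, 0], delay=2, feedback=64, mute=True)` gives an all-zero output but a buffer history of `[100, 0, 50, 0, 25, 0]`. The speakers reveal nothing, yet the stored values expose the mixing result as plain numbers a test program can compare.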
Preserving the PPUs
This all leads us to the final piece of the SNES architectural diagram: the PPU-1 and PPU-2 chips. Thanks to John McMaster, we have 20x magnification scans of the S-PPU1 (revision 1) and the S-PPU2 (revision 3) chips.
The die scans above highlight that these are obviously not general-purpose CPUs, nor are they custom architectures executing operation codes from an internal firmware program ROM. They are both dedicated, hard-coded logic circuits that take the inputs of various registers and memory and produce the video output to your monitor, one scanline at a time.
The PPUs remain the final frontier of SNES emulation because, unlike every component discussed until now, they are truly a black box. You can configure them to any state you want, but the SNES CPU has no way of directly observing what they generate.
To use our previous multiplication example as an analogy, imagine if you requested the result of 3 * 7, but instead of receiving a binary answer, you received a fuzzy, analog image showing the numerals '21' on your screen. Anyone running your software could see the 21, but you couldn't write a test program to automatically confirm that they were seeing the right answer. Manual human verification of this kind of result doesn't scale beyond a few thousand tests, and we're going to need several million to really hone down the exact PPU behavior.
Now I know what you're probably thinking: "But byuu, wouldn't it be easy to use a capture card, perform a lot of image processing, roughly match it to the emulator's digital screen image, and pass-fail the test based on that?"
Well, probably! Especially if the test were as simple as two giant numbers spanning the entire size of the screen.
But what if our testing was very nuanced, and we were trying to detect a half-shade color difference of a single pixel? What if we wanted to run a million consecutive tests and didn't necessarily know what they were going to generate just yet, but still wanted to match it to the output of our emulation?
Nothing beats the convenience and certainty of digital data, an exact stream of bits that you either match or don't match. The analog domain of CRT imaging doesn't provide that.
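The difference can be sketched with a toy comparison. The drift pattern in `analog_capture` is a made-up, deterministic stand-in for analog noise, and the pixel values are arbitrary brightness levels.

```python
def exact_match(expected, actual):
    """Digital comparison: every value must be identical."""
    return expected == actual


def fuzzy_match(expected, captured, tolerance):
    """Analog-style comparison: accept anything within `tolerance`."""
    return all(abs(e - c) <= tolerance for e, c in zip(expected, captured))


def analog_capture(frame):
    """Hypothetical capture: each pixel drifts by -1, 0, or +1."""
    return [pixel + (index % 3) - 1 for index, pixel in enumerate(frame)]


correct = [16] * 256       # one scanline of a flat gray
wrong = list(correct)
wrong[100] = 17            # a single pixel that is one shade off
```

A tolerance of 0 rejects even a capture of the correct line, so the fuzzy test needs a tolerance of at least 1. With that tolerance, `fuzzy_match(correct, analog_capture(wrong), 1)` also passes, so the one-shade error slips through. `exact_match(correct, wrong)` fails immediately.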
Why does this matter?
With the exception of one game (Air Strike Patrol), all officially licensed SNES software is (intended to be) scanline based. These games do not attempt to change the PPU rendering state in the middle of an actively rendering scanline (a programming trick known as a “raster effect”). That means the timing to run the vast majority of games doesn't need to be incredibly precise; as long as you're ready in time for the next full scanline, you're OK.
But it does matter for this one game.
Above, you see that the "Good Luck" text in Air Strike Patrol is being rotated from frame to frame. The game does this by modifying background layer 3's (BG3) vertical scrolling position. However, the HUD display on the left (where it displays you have 39 missiles available) is also on the same background layer.
The game manages this split by changing BG3's scroll position after the HUD on the left has rendered but before the "Good Luck" text begins rendering on each scanline. It can get away with this because BG3 is transparent outside of the HUD and the text, so there is nothing to really draw between those two points, regardless of the vertical scroll register value. This behavior tells us that the scroll registers can be changed at any point during rendering.
Above is the infamous plane shadow near the bottom of the screen. This effect is rendered by changing the screen display brightness register for short bursts over the span of five scanlines.
While playing the game, you'll note that the shadow is rather erratic. In the above image it looks a bit like a 'c', but frame to frame the shape is constantly changing in length and start point for each scanline. Air Strike Patrol just aimed in the general ballpark of where they wanted the shadow to appear, and then went for it, guns blazing. It mostly worked.
Emulating that behavior correctly requires cycle-perfect timing that is extremely difficult to get absolutely right in an emulator.
Finally, here is the pause screen. This one toggles BG3 on during the yellow and black border on the left, and off again during the same border on the right, to draw the gray lines on the screen. It also alternates which scanlines display these gray lines every other frame to create a shaking effect for the overlay.
If you zoom in on the emulated image above, you'll notice that there are a few missing pixels on the left-hand edge of these gray lines for a couple of scanlines. That's because my emulation of the PPU is not one hundred percent cycle-perfect. In this case, it is triggering the BG3 toggle effect a bit later than it is supposed to.
I could very easily adjust the timings to render this image correctly. But that adjustment is just as likely to have adverse effects in other titles that modify PPU display registers mid-scanline. While Air Strike Patrol is the only game to do this intentionally, there are at least a dozen games that do so accidentally (maybe they had an IRQ fire a bit too early or too late).
Sometimes this produces some brief, visible corruption that was overlooked in development (as with Full Throttle Racing when transitioning between the shop and game). Sometimes the writes happen during an otherwise transparent part of the screen and thus cause no visual anomalies (such as with the HP status display in Dai Kaijuu Monogatari II). Even these “invisible” edge cases can cause problems with less-precise scanline renderers used in more performant emulators, though.
Even discounting Air Strike Patrol, all of these accidental (but effective) raster effects in SNES software make it functionally impossible to design a PPU renderer that generates an entire scanline with cycle-perfect accuracy.
With bsnes, through years of trial and error, I have created a list of these “raster effect” games. I've also crafted custom render positions that enable a much faster scanline-based renderer to display all of these games properly (save for Air Strike Patrol, of course). These essentially amount to unsatisfying per-game hacks, though.
I also have a cycle-based PPU renderer that doesn't need any of these hacks, but eventually it causes tiny, one-to-four pixel differences with actual hardware, as in the last Air Strike Patrol screenshot pictured above.
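The trade-off between the two renderers can be sketched with a toy model of a single register, such as brightness, that is written mid-scanline. The function names, the 256-pixel width, and the write positions are assumptions for illustration and do not reflect how bsnes is structured.

```python
WIDTH = 256


def register_at(writes, initial, x):
    """Register value at pixel x, given (position, value) writes in order."""
    value = initial
    for position, new_value in writes:
        if position <= x:
            value = new_value
    return value


def render_per_pixel(writes, initial):
    """Cycle-style renderer: look at the register for every pixel."""
    return [register_at(writes, initial, x) for x in range(WIDTH)]


def render_scanline(writes, initial, render_position):
    """Scanline-style renderer: sample once, reuse for the whole line."""
    value = register_at(writes, initial, render_position)
    return [value] * WIDTH


# Toy brightness dip: dim to 4 at pixel 120, back to 15 at pixel 136.
writes = [(120, 4), (136, 15)]
```

The per-pixel renderer produces 16 dimmed pixels in the middle of the line. The scanline renderer produces either a fully bright line (render position 0) or a fully dim one (render position 128). A per-game render position can pick whichever looks right for a given title, but no single position reproduces a change that starts and ends within one line.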
Latching internal registers
The cause of these slight misses comes down to latching behavior timings.
Let's say the SNES is rendering its iconic mode 7, which is an affine texture transformation with per-scanline parameter adjustments. To determine any given pixel on the screen, a computation such as this can be performed:
px = a * clip(hoffset - hcenter) + b * clip(voffset - vcenter) + b * y + (hcenter << 8)
py = c * clip(hoffset - hcenter) + d * clip(voffset - vcenter) + d * y + (vcenter << 8)
A real SNES would struggle to perform these six multiplications fast enough for every single rendered pixel in a frame. But none of these values changes from pixel to pixel (or at least, they aren't supposed to), so we only have to compute px and py once at the start of every scanline. Thus, the PPU caches static results into latches, which are essentially copies of PPU registers that may have been transformed or may be transformed further as time goes by.
The x,y coordinates are then transformed by mode 7 like so:
ox = (px + a * x) >> 8
oy = (py + c * x) >> 8
Although x changes every pixel, we know that it increments by one every time. By keeping internal accumulators, we can simply add constant values a and c to ox and oy once every pixel, instead of having to perform two multiplications for every pixel.
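A short sketch of the accumulator trick follows. It takes px and py as already computed, and it assumes the running sums are kept in their unshifted form with the `>> 8` applied only when a coordinate is read. That is one reading of "add a and c once every pixel"; the text does not say how the hardware stores them.

```python
def transform_direct(px, py, a, c, width=256):
    """Two multiplications per pixel, straight from the formulas."""
    return [((px + a * x) >> 8, (py + c * x) >> 8) for x in range(width)]


def transform_accumulated(px, py, a, c, width=256):
    """No per-pixel multiplications: add a and c to running sums."""
    points = []
    acc_x, acc_y = px, py
    for _ in range(width):
        points.append((acc_x >> 8, acc_y >> 8))
        acc_x += a   # equivalent to moving from a * x to a * (x + 1)
        acc_y += c
    return points
```

Both functions return identical coordinate lists for the same inputs, including negative values of a and c. Note that the accumulated version fixes a and c for the whole line at the moment they are first used, which is why the timing of that read matters.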
The question then becomes, at what exact cycle position does the PPU read the values a and c from the CPU-accessible SNES PPU external registers?
If we guess a time that is too soon, it may break a certain subset of games. If we guess a time that is too late, it may break a different subset of games.
The easy approach is to just keep waiting for bug reports and adjusting these positions to resolve issues in any specific game. But by doing this, we will never find out the exact positions, only approximations.
And any time we change one of these variables, we are not feasibly going to be able to retest the entire ~3,500-game SNES library to spot any regressions that our changes might have caused.
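The "too soon" versus "too late" problem can be sketched as follows. The game names, write timings, and the true latch position are all invented for this example; the point is only how a guessed latch cycle interacts with when each game writes the register.

```python
def latched_value(write_cycle, latch_cycle, old, new):
    """Value captured if the register changes to `new` at write_cycle."""
    return new if write_cycle <= latch_cycle else old


def games_broken(guess, true_latch_cycle, game_writes, old=0, new=1):
    """Hypothetical games whose latched value differs under our guess."""
    return [name for name, write_cycle in game_writes.items()
            if latched_value(write_cycle, guess, old, new)
            != latched_value(write_cycle, true_latch_cycle, old, new)]


# Made-up write timings, in clocks from the start of the scanline.
game_writes = {"game_a": 10, "game_b": 22, "game_c": 30}
TRUE_LATCH = 24  # made-up position the emulator is trying to guess
```

Guessing cycle 20 breaks `game_b`, and guessing cycle 32 breaks `game_c`. Guessing cycle 23 breaks nothing in this set even though it is not the true position, which is why tuning against bug reports only narrows the range. It takes a test that writes at cycle 23 or 24 to tell those two guesses apart.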