published on 2024-07-31 by dzwdz
m8trix by HellMood is one of my favorite demos. It packs a pretty cool Matrix-style effect in only 8 bytes:
The author even provided the source with some comments:
org 100h
S:
les bx,[si] ; sets ES to the screen, assume si = 0x100
; 0x101 is SBB AL,9F and changes the char
; without CR flag, there would be
; no animation ;)
lahf ; gets 0x02 (green) in the first run
; afterwards, it is not called again
; because of alignment ;)
stosw ; print the green char ...
; (is also 0xAB9F and works as segment)
inc di ; and skip one row
inc di ;
jmp short S+1 ; repeat on 0x101
…yeah, I didn’t really get it at first either. Let’s try to actually understand how it works (and learn some stuff about DOS along the way).
Note that I’ll be using hexadecimal numbers “by default” (without 0x) throughout this article to be consistent with DEBUG’s output.
The only tool I’ll be using on DOS’s side will be DEBUG. It’s a delightful little tool that ships with MS-DOS. I’ve personally used the FreeDOS version under DOSBox, as that’s what I had handy.
There’s built-in help if you type in ?. You can also check out this more in-depth guide, or this video of someone using it to assemble new binaries.
There’s a small issue, though. m8trix doesn’t actually work as-is under DEBUG, for reasons I’ll explain later.
If you’re a bit rusty on how real mode segmentation works, then here’s a quick reminder. There are a few 16-bit segment registers (CS, DS, SS, ES). When you reference memory in real mode you always[1] use one of those registers, even if it’s implicit. If you reference ES:BX, the real address this maps to is computed as ES * 0x10 + BX. This means that there are multiple ways to reference one physical memory location (even if that is only slightly relevant here). As another example, B800:1234 points to B9234.
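The segment arithmetic is easy to check by hand, but here is a minimal Python sketch of it anyway. The name physical_address is made up for this illustration, and the sketch ignores what happens when the sum runs past the 1 MiB mark.

```python
def physical_address(segment: int, offset: int) -> int:
    """Real-mode address translation: segment * 0x10 + offset."""
    return (segment << 4) + offset


print(f"{physical_address(0xB800, 0x1234):X}")  # B9234
print(f"{physical_address(0xB000, 0x9234):X}")  # B9234 again, via a different segment
```

The second call shows the "multiple ways to reference one location" point: two different segment:offset pairs land on the same physical byte.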
C:\M8TRIX>debug M8TRIX.COM
-U ; disassemble the beginning of the program
073D:0100 C41C LES BX,[SI]
073D:0102 9F LAHF
073D:0103 AB STOSW
073D:0104 47 INC DI
073D:0105 47 INC DI
073D:0106 EBF9 JMP 0101
-U 101 ; disassemble the loop body
073D:0101 1C9F SBB AL,9F
073D:0103 AB STOSW
073D:0104 47 INC DI
073D:0105 47 INC DI
073D:0106 EBF9 JMP 0101
-R ; look at the registers
AX=FFFF BX=0000 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0000 DI=0000
DS=073D ES=073D SS=073D CS=073D IP=0100 NV UP EI PL ZR NA PE NC
073D:0100 C41C LES BX,[SI] DS:0000=20CD
Let’s step through this.
AX=FFFF BX=0000 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0000 DI=0000
DS=073D ES=073D SS=073D CS=073D IP=0100 NV UP EI PL ZR NA PE NC
073D:0100 C41C LES BX,[SI] DS:0000=20CD
-T ; single step and show register state
AX=FFFF BX=20CD CX=0008 DX=0000 SP=FFFE BP=0000 SI=0000 DI=0000
DS=073D ES=9FFF SS=073D CS=073D IP=0102 NV UP EI PL ZR NA PE NC
LES loads a far pointer from memory. The first two bytes of [SI] will be loaded into BX, and the next two bytes will be loaded into ES.
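To make the byte order concrete, here is a small Python sketch of what LES BX,[SI] does with four bytes of memory. The function name les and the memory-as-bytes model are assumptions made for illustration; the input bytes are the ones sitting at DS:0000 in this DEBUG session.

```python
def les(memory: bytes, si: int) -> tuple[int, int]:
    """Return (BX, ES) as LES BX,[SI] would load them.

    The offset word comes first, the segment word second, both little-endian.
    """
    bx = int.from_bytes(memory[si:si + 2], "little")
    es = int.from_bytes(memory[si + 2:si + 4], "little")
    return bx, es


bytes_at_ds_0000 = bytes([0xCD, 0x20, 0xFF, 0x9F])
bx, es = les(bytes_at_ds_0000, 0)
print(f"BX={bx:04X} ES={es:04X}")  # BX=20CD ES=9FFF
```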
We’re implicitly using the DS segment here, which is where DOS loaded our program into. To be more exact – our program was loaded into DS:0100, whereas DS:0000 (which [SI] points at) contains the Program Segment Prefix. Let’s take a look at it:
-d 0000
073D:0000 CD 20 FF 9F 00 EA FF FF-AD DE BD 1D 94 01 00 00 . ..............
[...]
-u 0000
073D:0000 CD20 INT 20
073D:0002 FF9F00EA CALL FAR [BX+EA00]
[...]
The first two bytes always contain INT 20, the instruction that quits your program. This means that you can quit your program by jumping to CS:0000 (CS = DS = SS). DOS also ensures that the word on top of the stack is 0000, so you can quit with a RET. Nifty. It also means that BX will always be set to 20CD, but we don’t actually really care about that.
The next two bytes point to the segment of the first free byte in memory. So, by loading them into ES, we make it point to the first free area in memory. On most systems that will be 9FFF. This is very convenient, as the mode 13 framebuffer begins at A0000, or 9FFF:0010. This is a well known sizecoding trick.
…except mode 13 is a graphic mode. We’re in mode 3[2], a text mode, and the text buffer is located at B800:0000, completely out of reach of ES. What?
Well, DEBUG fooled us. When you start a program under DOS, SI=0100. Usually. However, for whatever reason, DEBUG zeroes it out instead. You can fix it by running RSI 0100[3] before the first instruction. This is also why the page I’ve linked to uses [BX], as you can count on it actually being zero.
But let’s get back to m8trix. If SI=0100, then [SI] points to the beginning of our program!
-RSI 0100
-R
AX=FFFF BX=0000 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0000
DS=073D ES=073D SS=073D CS=073D IP=0100 NV UP EI PL ZR NA PE NC
073D:0100 C41C LES BX,[SI] DS:0100=1CC4
-T
AX=FFFF BX=1CC4 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0000
DS=073D ES=AB9F SS=073D CS=073D IP=0102 NV UP EI PL ZR NA PE NC
073D:0102 9F LAHF
-U 100
073D:0100 C41C LES BX,[SI]
073D:0102 9F LAHF
073D:0103 AB STOSW
As you can see, this means that BX=1CC4 (the LES instruction itself), and ES=AB9F. This means that ES spans AB9F0-BB9EF, which includes the entire text buffer!
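A quick way to convince yourself of this is to compare the 64 KiB window behind a segment with the extent of the text buffer. The sketch below is only an illustration: segment_window is a made-up helper, and the buffer size assumes the standard 80x25 text mode with two bytes per cell.

```python
TEXT_BUFFER_START = 0xB8000        # physical address of B800:0000
TEXT_BUFFER_SIZE = 80 * 25 * 2     # assumption: 80x25 cells, 2 bytes each


def segment_window(segment: int) -> range:
    """All physical addresses reachable through a segment with a 16-bit offset."""
    base = segment << 4
    return range(base, base + 0x10000)


window = segment_window(0xAB9F)
first = TEXT_BUFFER_START
last = TEXT_BUFFER_START + TEXT_BUFFER_SIZE - 1
print(first in window and last in window)        # True
print(first in segment_window(0x9FFF))           # False
print(f"{TEXT_BUFFER_START - window.start:X}")   # C610
```

The last line is the offset within ES at which the top-left character cell lives, so nothing shows up on screen until DI has climbed that far.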
AX=FFFF BX=1CC4 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0000
DS=073D ES=AB9F SS=073D CS=073D IP=0102 NV UP EI PL ZR NA PE NC
073D:0102 9F LAHF
-t
AX=46FF BX=1CC4 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0000
DS=073D ES=AB9F SS=073D CS=073D IP=0103 NV UP EI PL ZR NA PE NC
LAHF is pretty straightforward, it just loads the low byte of FLAGS into AH. Except, once again, DEBUG doesn’t set the FLAGS register correctly. If we were to run m8trix outside of DEBUG, the low byte of FLAGS would be 02, and thus this instruction would set AH=02. This can be fixed in the debugger by running RAX 02FF.
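For reference, the byte that LAHF copies has a fixed layout: SF, ZF, 0, AF, 0, PF, 1, CF from bit 7 down to bit 0. The Python sketch below packs that byte from individual flags. The function name lahf is just a label for this illustration, and the "all flags clear" case reflects the starting state described above rather than anything the sketch can verify.

```python
def lahf(sf: int = 0, zf: int = 0, af: int = 0, pf: int = 0, cf: int = 0) -> int:
    """Value LAHF puts in AH: SF ZF 0 AF 0 PF 1 CF (bit 1 always reads as 1)."""
    return (sf << 7) | (zf << 6) | (af << 4) | (pf << 2) | 0x02 | cf


print(f"{lahf(zf=1, pf=1):02X}")  # 46, matching DEBUG's ZR PE state
print(f"{lahf():02X}")            # 02, all five status flags clear
```

The always-set bit 1 is where the 02 comes from: with no status flags set, it is the only bit left.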
-rax 02FF
-r
AX=02FF BX=1CC4 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0000
DS=073D ES=AB9F SS=073D CS=073D IP=0103 NV UP EI PL ZR NA PE NC
073D:0103 AB STOSW
-t
AX=02FF BX=1CC4 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0002
DS=073D ES=AB9F SS=073D CS=073D IP=0104 NV UP EI PL ZR NA PE NC
STOSW – “Store (word) string” – is a bit more complex. It writes the word in AX to ES:DI, and increments[4] DI by two – the amount of bytes written.
This instruction will be run over and over again, with DI taking on every multiple of 4 (two bytes from STOSW plus the two INC DI) and overflowing every once in a while, overwriting every other word in ES – including the text buffer – over and over again.
Each character in the text buffer is represented by a word, so each STOSW writes a complete character to the screen. AH=02 sets the color to dark green, and AL (which changes each iteration) chooses the character.
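Here is a rough Python model of a single STOSW, mostly to show where the two halves of AX end up. The bytearray stands in for the 64 KiB reachable through ES, the function name stosw is only a label, and the sketch assumes the direction flag is clear.

```python
def stosw(memory: bytearray, ax: int, di: int) -> int:
    """Store AX at memory[di] (little-endian) and return the new DI, assuming DF=0."""
    memory[di] = ax & 0xFF                 # AL: character code
    memory[(di + 1) & 0xFFFF] = ax >> 8    # AH: color attribute
    return (di + 2) & 0xFFFF               # DI wraps around at 10000


segment = bytearray(0x10000)               # stand-in for everything behind ES
di = stosw(segment, 0x02FF, 0)
print(di, segment[0], segment[1])          # 2 255 2
```

Because x86 is little-endian, the character (AL) lands in the even byte and the attribute (AH) in the odd one, which is exactly the layout the text buffer expects.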
AX=02FF BX=1CC4 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0002
DS=073D ES=AB9F SS=073D CS=073D IP=0104 NV UP EI PL ZR NA PE NC
073D:0104 47 INC DI
-t
AX=02FF BX=1CC4 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0003
DS=073D ES=AB9F SS=073D CS=073D IP=0105 NV UP EI PL NZ NA PE NC
073D:0105 47 INC DI
-t
AX=02FF BX=1CC4 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0004
DS=073D ES=AB9F SS=073D CS=073D IP=0106 NV UP EI PL NZ NA PO NC
073D:0106 EBF9 JMP 0101
-t
AX=02FF BX=1CC4 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0004
DS=073D ES=AB9F SS=073D CS=073D IP=0101 NV UP EI PL NZ NA PO NC
073D:0101 1C9F SBB AL,9F
We don’t want the columns to be packed too tightly together, so we skip every other character by adding two bytes to DI.
We then jump to 0101, uncovering a hidden SBB.
AX=02FF BX=1CC4 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0004
DS=073D ES=AB9F SS=073D CS=073D IP=0101 NV UP EI PL NZ NA PO NC
073D:0101 1C9F SBB AL,9F
-t
AX=0260 BX=1CC4 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0004
DS=073D ES=AB9F SS=073D CS=073D IP=0103 NV UP EI PL NZ NA PE NC
This is the last instruction, and it’s the one that modifies AL to animate the character. It subtracts 9F from AL with borrow, which is pretty much the grade-school approach. That is – if it underflows, it will “borrow” a bit from the next byte by setting the carry flag. The next SBB will see that the carry flag is set, subtract an additional 1, and unset the carry flag (unless it also underflowed).
-rax 028F
-r
AX=028F BX=1CC4 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0008
DS=073D ES=AB9F SS=073D CS=073D IP=0101 NV UP EI PL NZ NA PO NC
073D:0101 1C9F SBB AL,9F
-t
AX=02F0 BX=1CC4 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0008
DS=073D ES=AB9F SS=073D CS=073D IP=0103 NV UP EI NG NZ NA PE CY
-rip 0101 ; i don't care about the rest of the loop, just run the SBB again
-t
AX=0250 BX=1CC4 CX=0008 DX=0000 SP=FFFE BP=0000 SI=0100 DI=0008
DS=073D ES=AB9F SS=073D CS=073D IP=0103 NV UP EI PL NZ AC PE NC
Notice how the second SBB subtracted A0 instead of 9F because of the carry flag.
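The three SBB steps traced above can be reproduced with a few lines of Python. This is a sketch of the arithmetic only: the helper name sbb8 is made up, and it tracks just AL and the carry flag, ignoring every other flag.

```python
def sbb8(al: int, operand: int, carry: int) -> tuple[int, int]:
    """8-bit subtract with borrow: returns (new AL, new carry flag)."""
    result = al - operand - carry
    return result & 0xFF, int(result < 0)


print(sbb8(0xFF, 0x9F, 0))  # (96, 0)  -> AL=60, NC
print(sbb8(0x8F, 0x9F, 0))  # (240, 1) -> AL=F0, CY
print(sbb8(0xF0, 0x9F, 1))  # (80, 0)  -> AL=50, NC
```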
Why does that matter? Let’s imagine this was a regular SUB instead, without a borrow. 9F is odd (coprime to 100), so it would take 100 iterations for AL to loop around (remember, we’re working with hexadecimal here). DI advances by 4 per iteration, so the loop runs for 10000/4=4000 iterations before DI repeats, and 4000 is divisible by 100, so each pass would have the exact same AL values for each character. Instead of an animation we’d get a much less impressive static screen.
Instead, AL repeats every 55 (decimal 85) SBB calls, which is coprime to 100, so the AL values will differ from pass to pass. There’s probably a way to determine the period by hand but I just used Python. Not all operands work for this, but 9F seems to be one of the good ones.
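The period check might look roughly like the sketch below. This is not the original script, just one way to do it: it follows the (AL, carry) state until it repeats and reports the length of the cycle. The starting value AL=FF matches the DEBUG session; al_period and PASS_LENGTH are names invented here.

```python
def al_period(operand: int = 0x9F, use_borrow: bool = True, al: int = 0xFF) -> int:
    """Length of the cycle AL falls into when `operand` is subtracted repeatedly."""
    carry = 0
    seen = {}                       # (AL, carry) -> step at which the state first appeared
    step = 0
    while (al, carry) not in seen:
        seen[(al, carry)] = step
        result = al - operand - carry
        al = result & 0xFF
        carry = int(result < 0) if use_borrow else 0   # SUB ignores the old carry
        step += 1
    return step - seen[(al, carry)]


PASS_LENGTH = 0x10000 // 4          # loop iterations before DI wraps (DI advances by 4)

for name, borrow in (("SUB", False), ("SBB", True)):
    period = al_period(use_borrow=borrow)
    print(name, period, "static" if PASS_LENGTH % period == 0 else "animated")
# SUB 256 static
# SBB 85 animated
```

As for doing it by hand: SBB with the carry fed back behaves like adding 60 with an end-around carry, which is addition modulo FF. The period is then FF / gcd(60, FF) = 255 / 3 = 85 in decimal.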
To quote the author, “without CR flag, there would be no animation ;)”.
I think I’ve explained every aspect of how m8trix works by now. I don’t think I need to tell you how brilliant it is.
Notice how the third byte has three different meanings! At first it’s read as the low byte of the segment, then it’s executed as the LAHF instruction, and then it’s the operand for the SBB.
STOSW is not only the perfect instruction for writing characters in text mode, it also works as the high byte of the segment that you need to write those characters in the first place.
Everything fits together so nicely :)
Soon after m8trix was published, several people tried coming up with ideas to shrink it down even more. What follows is the final version HellMood published:
C:\M8TRIX>debug M7TRIX.COM
-U
073D:0100 C41C LES BX,[SI]
073D:0102 9F LAHF
073D:0103 AB STOSW
073D:0104 91 XCHG AX,CX
073D:0105 EBFA JMP 0101
-U 101
073D:0101 1C9F SBB AL,9F
073D:0103 AB STOSW
073D:0104 91 XCHG AX,CX
073D:0105 EBFA JMP 0101
Not only is this version smaller, it also looks better, as it clears the screen! It’s also simple enough that I won’t bother tracing through it again.
In short – instead of skipping over every other column, we swap AX and CX back and forth. Both are running the same character animation, but, as CH=00, every other column is rendered as black on black, so the characters are invisible. This takes care both of skipping columns AND clearing the screen.
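The alternation is easy to see in a small simulation of the 7-byte loop. Treat this as a sketch under assumptions: the starting values (AX=0200 after LAHF, CX=00FF, carry clear) are what a program typically sees when started from the DOS prompt, not something this article has verified, and m7trix_cells is a made-up name.

```python
def m7trix_cells(count: int, ax: int = 0x0200, cx: int = 0x00FF, carry: int = 0) -> list[int]:
    """Return the first `count` words the m7trix loop would write with STOSW."""
    cells = []
    for _ in range(count):
        cells.append(ax)                    # STOSW: write AX to the screen
        ax, cx = cx, ax                     # XCHG AX,CX
        al = (ax & 0xFF) - 0x9F - carry     # SBB AL,9F (AH is left alone)
        carry = int(al < 0)
        ax = (ax & 0xFF00) | (al & 0xFF)
    return cells


print([cell >> 8 for cell in m7trix_cells(6)])  # [2, 0, 2, 0, 2, 0]
```

The high byte of each written word is the color attribute, so every other cell gets attribute 00, whatever character ends up in it.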
The character cycle is apparently[5] different because the carry flag gets reused between odd and even columns, but the period still works out to be 85 – which I find interesting but I don’t really feel like researching why that is.
This is a slightly modified version that works under DEBUG and doesn’t use misaligned jumps. It’s easy to experiment with as you can just load it into DEBUG, use the assembler to change a single instruction, and see what happens.
073D:0100 BB9FAB MOV BX,AB9F
073D:0103 8EC3 MOV ES,BX
073D:0105 B402 MOV AH,02
073D:0107 AB STOSW
073D:0108 47 INC DI
073D:0109 47 INC DI
073D:010A 1C9F SBB AL,9F
073D:010C EBF9 JMP 0107
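If you want to check where the short jumps in these listings land, the target is the address of the next instruction plus the sign-extended operand byte. A minimal Python sketch follows; short_jump_target is a hypothetical helper, not a DEBUG feature.

```python
def short_jump_target(address: int, rel8: int) -> int:
    """Target of a 2-byte JMP SHORT (opcode EB) located at `address`."""
    displacement = rel8 - 0x100 if rel8 >= 0x80 else rel8   # sign-extend the operand
    return (address + 2 + displacement) & 0xFFFF


print(f"{short_jump_target(0x0106, 0xF9):04X}")  # 0101, m8trix jumping into the middle of LES
print(f"{short_jump_target(0x010C, 0xF9):04X}")  # 0107, the STOSW in the listing above
```

Both jumps use the same operand byte F9. The difference is only in where the loop body starts relative to the jump.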
[1] At least I think so, but I’m not sure.
[2] Run MOV AH, 0F; INT 10, and look at the registers. AL is the current mode.
[3] No, RSI doesn’t stand for the 64-bit register. R is the register command, which accepts SI as the argument.
[4] If the direction flag was set, it would instead decrement it.
[5] The Python script I’m using for testing says so, but I can’t really tell if that’s true by just looking at the output.