LLX > Neil Parker > Apple II > Kaleidoscope

Adventures with Kaleidoscope

(or, Proof That Neil Has Far Too Much Free Time)

(There's also a version in Flash now too. The Java applet that used to be on this page is archived here.)

Those of you who have been in the Apple II world long enough to remember the DOS 3.3 System Master disk may recall the Integer BASIC program on it called COLOR DEMO. One of its menu options is KALEIDOSCOPE, which draws pretty symmetrical pictures on the lo-res screen.

I occasionally like to run COLOR DEMO and watch the pretty kaleidoscope colors. Listing the source code, I noticed that it doesn't run forever--it stops after a fixed number of iterations. But each iteration was rather slow, and I never had the patience to watch the whole thing from start to finish.

Then one day I studied the listing more closely, and it occurred to me that the source code was rather inefficient, and that I could probably speed it up quite a bit with a little rewriting--maybe by enough to watch the whole thing. So this article is about my adventures optimizing KALEIDOSCOPE.

The original Integer BASIC source code, stripped of extraneous fluff, is as follows:

  700 GR : CALL -936: FOR W=3 TO 50: FOR I=1 TO 19:
      FOR J=0 TO 19:K=I+J
  750 COLOR=J*3/(I+3)+I*W/12
  760 PLOT I,K: PLOT K,I: PLOT 40-I,40-K: PLOT 40-K,40-I
  770 PLOT K,40-I: PLOT 40-I,K: PLOT I,40-K: PLOT 40-K,I
  780 NEXT J,I,W

There are a couple of obvious things that can be improved in this code. In line 750, the expression I*W/12 is recomputed on each pass through the J loop, even though it doesn't depend on J. Likewise, there's no need to recompute 40-I and 40-K for each PLOT statement.

So here's an improved routine:

  700 GR : CALL -936: FOR W=3 TO 50: FOR I=1 TO 19: X=I*W/12:
      I40=40-I: FOR J=0 TO 19:K=I+J:K40=40-K
  750 COLOR=J*3/(I+3)+X
  760 PLOT I,K: PLOT K,I: PLOT I40,K40: PLOT K40,I40
  770 PLOT K,I40: PLOT I40,K: PLOT I,K40: PLOT K40,I
  780 NEXT J,I,W

(Moving X=I*W/12 and I40=40-I out of the J loop is called loop invariant removal. Precalculating I40 and K40 instead of recalculating them each time is called common subexpression elimination.)

But that's not the best that can be done. Multiplication on any 6502-based hardware is always a somewhat slow operation, due to the 6502's lack of a hardware multiply instruction. The J*3 operation can clearly be made more efficient by writing it as J+J+J. What's not quite so easy to see is that something similar can be done with I*W...note that each time through the W loop, I*W starts out with the value of W, and each time through the I loop, it increases its value by W. Thus the following code is mathematically the same as the above, but uses only quick additions instead of slow multiplications:

  700 GR : CALL -936: FOR W=3 TO 50:IW=0: FOR I=1 TO 19:IW=IW+W:
      X=IW/12: I40=40-I: FOR J=0 TO 19:K=I+J:K40=40-K
  750 COLOR=(J+J+J)/(I+3)+X
  760 PLOT I,K: PLOT K,I: PLOT I40,K40: PLOT K40,I40
  770 PLOT K,I40: PLOT I40,K: PLOT I,K40: PLOT K40,I
  780 NEXT J,I,W

(The IW optimization is called strength reduction).

This code runs noticeably faster than the original code, but I still found it too slow to wait for. Clearly more drastic measures are called for...

Machine language! That's right...my next step was to hand-compile the BASIC into into 6502 code. Not only does machine code run much faster than interpreted BASIC could ever hope to, there are additional optimizations that wouldn't be of much help to the BASIC but would help machine language.

First the variables, all on page zero for speed. All the variables used in the BASIC code can be stored in single bytes, except for IW, which needs to store values up to 950. The two additional variables A1L and A1H are temporaries used in the division by 12 and by (I+3). Memory locations were chosen to avoid conflict with BASIC's zero-page usage.

                1    W        =     6
                2    I        =     7
                3    J        =     8
                4    K        =     9
                5    A1L      =     $3C
                6    A1H      =     $3D
                7    K40      =     $FC
                8    I40      =     $FD
                9    IWL      =     $FE
                10   IWH      =     $FF

Necessary I/O locations:

                11   KBD      =     $C000
                12   KBDSTRB  =     $C010
                13   LORES    =     $C056
                14   AN3OFF   =     $C05F

Necessary ROM routines:

                15   PLOT     =     $F800
                16   SETCOL   =     $F864
                17   SETGR    =     $FB40
                18   HOME     =     $FC58

Now for the code. First we initialize the lo-res screen and clear the text area at the bottom. The LDA LORES is done because SETGR doesn't, and the LDA AN3OFF is in case somebody was using double-res graphics recently.

                19            ORG   $2000
2000: 20 40 FB  20            JSR   SETGR      ;GR
2003: AD 56 C0  21            LDA   LORES
2006: AD 5F C0  22            LDA   AN3OFF
2009: 20 58 FC  23            JSR   HOME       ;HOME

Now the W loop. This just starts it out--the increment of W and test for the final value are done at the bottom of the loop.

200C: A9 03     24            LDA   #3         ;FOR W=3 TO 50
200E: 85 06     25            STA   W

Initialize IW. I40 is also initialized here so instead of subtracting I from 40 each time, we can just DEC it each time through the I loop. This is another strength reduction, which wasn't worthwhile in BASIC but is in machine language.

2010: A9 00     26   WLP      LDA   #0         ;IW=0
2012: 85 FE     27            STA   IWL
2014: 85 FF     28            STA   IWH
2016: A9 28     29            LDA   #40        ;I40=40
2018: 85 FD     30            STA   I40

Now the I loop begins. It works just like the W loop.

201A: A9 01     31            LDA   #1         ;FOR I=1 TO 19
201C: 85 07     32            STA   I

IW=IW+W is straightforward.

201E: A5 06     33   ILP      LDA   W          ;IW=IW+W
2020: 18        34            CLC
2021: 65 FE     35            ADC   IWL
2023: 85 FE     36            STA   IWL
2025: 90 02     37            BCC   ILP1
2027: E6 FF     38            INC   IWH

Here IW is divided by 12, and the result is stored in the X register (which is safe because nothing else in the code clobbers it). The division is done using a standard 16-bit by 8-bit long division algorithm (with the dividend in A1L and A1H), coded inline because it's used only once and it saves the cycles that JSR/RTS would use.

2029: 85 3C     39   ILP1     STA   A1L        ;X=IW/12
202B: A5 FF     40            LDA   IWH
202D: 85 3D     41            STA   A1H
202F: A0 10     42            LDY   #16        ;(inline 16/8 division)
2031: A9 00     43            LDA   #0
2033: 06 3C     44   DIV1LP   ASL   A1L
2035: 26 3D     45            ROL   A1H
2037: 2A        46            ROL
2038: C9 0C     47            CMP   #12
203A: 90 04     48            BCC   DIV1A
203C: E9 0C     49            SBC   #12
203E: E6 3C     50            INC   A1L
2040: 88        51   DIV1A    DEY
2041: D0 F0     52            BNE   DIV1LP
2043: A6 3C     53            LDX   A1L

Here's the previously-promised decrement of I40. Also, K40 is set up to receive the same treatment.

2045: C6 FD     54            DEC   I40        ;I40=I40-1
2047: A5 FD     55            LDA   I40        ;K40=I40
2049: 85 FC     56            STA   K40

The beginning of the J loop. The end of the loop (below) will guarantee that the accumulator holds the current value of J for the K=I+J computation.

204B: A9 00     57            LDA   #0         ;FOR J=0 TO 19
204D: 85 08     58            STA   J
204F: 18        59   JLP      CLC              ;K=I+J
2050: 65 07     60            ADC   I
2052: 85 09     61            STA   K

Since you can't control-C a machine language program, this gives the user a chance to quit at any time by pressing a key.

2054: AD 00 C0  62            LDA   KBD        ;IF PEEK(KBD)>127 THEN DONE
2057: 10 06     63            BPL   NOKEY
2059: 8D 10 C0  64            STA   KBDSTRB
205C: 4C DD 20  65            JMP   DONE

The color computation. The shift-and-add computation below is quicker that adding J to itself three times. Note that J is never greater than 19, so the carry flag is automatically guaranteed to be clear before the first two ADC instructions. Another inline division is used (A1L by A1H, result in A1L)...this one only needs to divide 8 bits by 8 bits.

205F: A5 08     66   NOKEY    LDA   J          ;COLOR=J*3/(I+3)+X
2061: 0A        67            ASL
2062: 65 08     68            ADC   J
2064: 85 3C     69            STA   A1L
2066: A5 07     70            LDA   I
2068: 69 03     71            ADC   #3
206A: 85 3D     72            STA   A1H
206C: A9 00     73            LDA   #0         ;(inline 8/8 division)
206E: A0 08     74            LDY   #8
2070: 06 3C     75   DIV2LP   ASL   A1L
2072: 2A        76            ROL
2073: C5 3D     77            CMP   A1H
2075: 90 04     78            BCC   DIV2A
2077: E5 3D     79            SBC   A1H
2079: E6 3C     80            INC   A1L
207B: 88        81   DIV2A    DEY
207C: D0 F2     82            BNE   DIV2LP
207E: 8A        83            TXA
207F: 18        84            CLC
2080: 65 3C     85            ADC   A1L
2082: 20 64 F8  86            JSR   SETCOL

The eight plot commands. These could be made faster by rearranging them so that all the plots with the same Y coordinate are done in groups, which minimizes the number of times the row address has to be calculated. But I kept the plot order the same to guarantee that the display would identical to the original BASIC display.

2085: A4 07     87            LDY   I          ;PLOT I,K: etc...
2087: A5 09     88            LDA   K
2089: 20 00 F8  89            JSR   PLOT
208C: A4 09     90            LDY   K
208E: A5 07     91            LDA   I
2090: 20 00 F8  92            JSR   PLOT
2093: A4 FD     93            LDY   I40
2095: A5 FC     94            LDA   K40
2097: 20 00 F8  95            JSR   PLOT
209A: A4 FC     96            LDY   K40
209C: A5 FD     97            LDA   I40
209E: 20 00 F8  98            JSR   PLOT
20A1: A4 09     99            LDY   K
20A3: A5 FD     100           LDA   I40
20A5: 20 00 F8  101           JSR   PLOT
20A8: A4 FD     102           LDY   I40
20AA: A5 09     103           LDA   K
20AC: 20 00 F8  104           JSR   PLOT
20AF: A4 07     105           LDY   I
20B1: A5 FC     106           LDA   K40
20B3: 20 00 F8  107           JSR   PLOT
20B6: A4 FC     108           LDY   K40
20B8: A5 07     109           LDA   I
20BA: 20 00 F8  110           JSR   PLOT

The second half of the K40 strength reduction, and the end of the J loop:

20BD: C6 FC     111           DEC   K40        ;K40=K40-1
20BF: E6 08     112           INC   J          ;NEXT J
20C1: A5 08     113           LDA   J
20C3: C9 14     114           CMP   #20
20C5: D0 88     115           BNE   JLP

The end of the I loop. Alas, the top of the loop is too far away to reach with a BNE.

20C7: E6 07     116           INC   I          ;NEXT I
20C9: A5 07     117           LDA   I
20CB: C9 14     118           CMP   #20
20CD: F0 03     119           BEQ   NOILP
20CF: 4C 1E 20  120           JMP   ILP

And the end of the W loop:

20D2: E6 06     121  NOILP    INC   W          ;NEXT W
20D4: A5 06     122           LDA   W
20D6: C9 33     123           CMP   #51
20D8: F0 03     124           BEQ   DONE
20DA: 4C 10 20  125           JMP   WLP
20DD: 60        126  DONE     RTS

--End assembly, 222 bytes, Errors: 0

And that's it.

The machine-language version runs in about 20 seconds on a standard Apple II, and in about 8 seconds on a IIGS at full speed. That's plenty fast enough to satisfy my curiosity--I had no problem waiting for this version to finish.

So was it worth it? Well...that depends. I was glad to finally have a version that I could watch from start to finish, but the final pattern isn't really any more spectacular than the early stages.

I had entirely too much fun writing and optimizing the machine language version.

Here's the source code and binary for those who want to try the machine language version themselves but don't want to type in the hex code manually. And here's the assembly listing for those who want to see it without the running commentary.

(Just so Applesoft programmers don't feel left out, only a few minor modifications need to be made to convert the Integer program to Applesoft.)

700  GR : HOME : FOR W = 3 TO 50:IW = 0: FOR I = 1 TO 19:IW = IW + W:
     X =  INT (IW / 12): I4 = 40 - I: FOR J = 0 TO 19:K = I + J:K4 = 40 - K
750  COLOR=  INT ((J + J + J) / (I + 3)) + X

(This could be speeded up slightly by putting all the numeric constants in variables, since Applesoft can look up a variable faster than it can parse a constant. It can be speeded up a lot by compiling with the Beagle Compiler, but nowhere near as much as the handwritten assembly code above.)

Update: Since writing the above, I've learned a couple of things: First, the Integer BASIC code apparently didn't originate in COLOR DEMO, but dates back to the famous Red Book, the original reference manual that shipped with early Apple II's. I've never seen an authentic Red Book, but apparently the kaleidoscope program was originally called ROD'S COLOR PATTERN, and was attributed to Randy Wigginton.

Second, I'm not the only person out there in Internet-land who's tried hand-compiling this code. The earliest ones I've seen were published in the Apple Assembly Line newsletter--the January 1984 issue contains a 68000 version by Bob Urschel (intended to run on an Apple II with a 68000 coprocessor card), and the March 1984 issue contains a 6502 version by Charles H. Putney. There's also a version by John B. Matthews.

None of these versions uses quite the same approach as my code. Bob Urschel's version is a straight translation of the original Integer BASIC, with no attempts at optimization at all. Charles Putney's version does the loop invariant removals and the common subexpression eliminations, but uses table lookups for I*W/12, J*3/(I+3), and the PLOTs. John Matthews's version is somewhere between those two extremes, with no optimizations for the color computation, but using table lookups to speed up the PLOTs. My effort probably falls somewhere between the Matthews and Putney versions.

LLX > Neil Parker > Apple II > Kaleidoscope

Original: June 4, 2004
Modified: August 21, 2005--added Update
Modified: February 5, 2007--Apple Assembly Line is at a new URL
Modified: June 1, 2008--John B. Matthews moved to a new URL
Modified: October 20, 2008--added link to Flash version
Modified: March 8, 2017--Replaced Java applet with Javascript, because browsers don't support the Java plugin anymore. Also, Apple Assembly Line and John B. Matthews are at new URLs again.