Regression with the C64

If I had to name the starting point of my career as a programmer, computer scientist, or whatever my profession is called, it would be the 24th of December 1991. I was 12 and got my very own Commodore 64, with a monitor and a floppy drive. On that day, I shouted my first “Hello, world!” on a bluish screen.
It had just 64 KB of RAM and a little Microsoft BASIC ROM on top. But let’s face today.

A current CPU like an Intel i7 has about 2.16 billion transistors. The old CPU of a C64 had exactly 3510 of them. Not more, not less. It was enough to play games, run a whole company, and do a lot of incredible stuff. And what are we doing with all those transistors today? Nothing more than yesterday. We present information and transform it. A few days ago, I came up with the idea that machine learning is possible with roughly 3500 transistors. And well, I was right.

I turned a regression example into a C64 assembly program. I expected that it would need more time, a lot more code, and that it might not be possible at all. But well, it was easy!

The proof of concept

If you own a Commodore, you can try this assembly code for yourself. I know that this piece of code is not well crafted, but at least it does what machine learning does. And somehow it does it well. It uses a fraction of the CPU steps that a modern machine would perform.

Update, because of some mails I got:
The code below is the disassembly of the binary that was built for the MOS 6502, to give an overview of the steps it has to perform. Apart from the NOP opcode, most of the steps take 3 or 4 CPU cycles (a NOP takes at least two). The original code was written in C and cross-compiled. In my humble opinion, C code is the best way to produce well-designed binaries. Thanks for the hint that this wasn’t clear when I released the post.
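
The C source itself is not part of this post. To give an idea of what an integer-only regression can look like in C, here is a minimal sketch. The names fit_line, predict, linear_fit and FP_SCALE are invented for this illustration and are not taken from the original program, and the scale of 256 is an assumption as well.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point scale: slope and intercept are stored times 256. */
#define FP_SCALE 256L

typedef struct {
    int32_t slope_fp;      /* slope * FP_SCALE */
    int32_t intercept_fp;  /* intercept * FP_SCALE */
} linear_fit;

/* Least-squares fit of y = slope * x + intercept using integer math only.
   The sums are 32 bits wide, so the inputs must be small enough that
   they do not overflow. */
linear_fit fit_line(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum_x = 0, sum_y = 0, sum_xx = 0, sum_xy = 0;
    int32_t denom;
    linear_fit fit;
    int i;

    assert(n > 1);
    for (i = 0; i < n; ++i) {
        sum_x  += x[i];
        sum_y  += y[i];
        sum_xx += (int32_t)x[i] * x[i];
        sum_xy += (int32_t)x[i] * y[i];
    }

    denom = n * sum_xx - sum_x * sum_x;
    assert(denom != 0);  /* all x values equal: no unique line exists */

    fit.slope_fp = (int32_t)((n * sum_xy - sum_x * sum_y) * FP_SCALE / denom);
    fit.intercept_fp = (int32_t)((sum_y * FP_SCALE - fit.slope_fp * sum_x) / n);
    return fit;
}

/* Evaluate the fitted line for one x and drop the fixed-point scale. */
int16_t predict(const linear_fit *fit, int16_t x)
{
    return (int16_t)((fit->slope_fp * x + fit->intercept_fp) / FP_SCALE);
}
```

With x = {1, 2, 3, 4} and y = {3, 5, 7, 9}, fit_line returns a slope of 512 and an intercept of 256, which is 2.0 and 1.0 at a scale of 256, and predict gives 11 for x = 5.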

The whole disassembly can be seen here:

* = $0801
   ora ( $08, x)
   anc #$08
   bmi Routine_080a
   shx $3032, y

   rol $31, x
   lda $01
   sta $0c9d
   and #$f8
   ora #$06
   sta $01
   stx $0c9e
   jsr Routine_0c41
   jsr Routine_09f7
   jsr Routine_0a0e
   ldx #$19

   lda $0c9f, x
   sta $02, x
   bpl Routine_082d
   sta $90
   ldx $0c9e
   ldx $0c9d
   stx $01
!byte $20,$1A,$0A,$20,$60,$0B,$20,$60
!byte $0B,$20,$60,$0B,$20,$60,$0B,$A0
!byte $0C,$20,$E5,$0B,$A0,$07,$B9,$64
!byte $0C,$91,$02,$88,$10,$F8,$20,$27
!byte $0A,$A0,$07,$B9,$6C,$0C,$91,$02
!byte $88,$10,$F8,$A2,$00,$A9,$01,$A0
!byte $1C,$20,$C7,$0B,$A0,$1F,$20,$8B
!byte $0B,$A0,$21,$20,$AE,$0A,$20,$4B
!byte $0A,$F0,$05,$30,$03,$4C,$3C,$09
!byte $A0,$1D,$20,$8B,$0B,$A0,$1F,$20
!byte $AE,$0A,$20,$EF,$09,$A0,$0A,$20
!byte $80,$0A,$20,$B7,$0A,$20,$9F,$0A
!byte $20,$B6,$09,$A0,$1A,$20,$C7,$0B
!byte $A0,$1B,$20,$8B,$0B,$A0,$1F,$20
!byte $AE,$0A,$20,$EF,$09,$A0,$0A,$20
!byte $80,$0A,$20,$B7,$0A,$20,$7A,$0B
!byte $A0,$21,$20,$AE,$0A,$20,$EF,$09
!byte $A0,$0C,$20,$80,$0A,$20,$B7,$0A
!byte $20,$9F,$0A,$20,$C1,$0A,$20,$B6
!byte $09,$A0,$18,$20,$C7,$0B,$A0,$19
!byte $20,$8B,$0B,$A0,$1F,$20,$AE,$0A
!byte $20,$EF,$09,$20,$77,$0A,$20,$B7
!byte $0A,$20,$9F,$0A,$20,$B6,$09,$A0
!byte $16,$20,$C7,$0B,$A0,$17,$20,$8B
!byte $0B,$A0,$1F,$20,$AE,$0A,$20,$EF
!byte $09,$A0,$0A,$20,$80,$0A,$20,$B7
!byte $0A,$20,$7A,$0B,$A0,$21,$20,$AE
!byte $0A,$20,$EF,$09,$20,$7E,$0A,$20
!byte $B7,$0A,$20,$9F,$0A,$20,$C1,$0A
!byte $20,$B6,$09,$A0,$14,$20,$C7,$0B
!byte $A0,$1C,$A2,$00,$A9,$01,$20,$D2
!byte $09,$4C,$74,$08,$A0,$21,$20,$8B
!byte $0B,$A0,$17,$20,$AE,$0A,$20,$C1
!byte $0A,$20,$64,$0B,$A0,$1F,$20,$8B
!byte $0B,$A0,$1B,$20,$AE,$0A,$20,$C1
!byte $0A,$20,$D2,$0B,$20,$64,$0B,$A0
!byte $23,$20,$8B,$0B,$A0,$1D,$20,$AE
!byte $0A,$20,$C1,$0A,$20,$64,$0B,$A0
!byte $21,$20,$8B,$0B,$A0,$21,$20,$AE
!byte $0A,$20,$C1,$0A,$20,$D2,$0B,$20
!byte $36,$0A,$A0,$10,$20,$C7,$0B,$A0
!byte $19,$20,$8B,$0B,$A0,$15,$20,$8B
!byte $0B,$A0,$1F,$20,$AE,$0A,$20,$C1
!byte $0A,$20,$D2,$0B,$20,$64,$0B,$A0
!byte $21,$20,$AE,$0A,$20,$36,$0A,$A0
!byte $12,$20,$C7,$0B,$A2,$00,$8A,$A0
!byte $20,$4C,$E2,$09,$A2,$00,$18,$A0
!byte $00,$71,$02,$C8,$85,$12,$8A,$71
!byte $02,$AA,$18,$A5,$02,$69,$02,$85
!byte $02,$90,$02,$E6,$03,$A5,$12,$60
!byte $A0,$00,$18,$71,$02,$91,$02,$48
!byte $C8,$8A,$71,$02,$91,$02,$AA,$68
!byte $60,$C8,$48,$18,$98,$65,$02,$85
!byte $02,$90,$02,$E6,$03,$68,$60,$86
!byte $12,$0A,$26,$12,$A6

!byte $AD,$74,$0C,$AE,$75,$0C,$20,$64
!byte $0B,$AD,$76,$0C,$AE,$77,$0C,$20
!byte $64,$0B,$A0,$04,$4C

!byte $08,$A0,$00,$F0,$07,$A9,$74,$A2
!byte $0C,$4C,$78,$0C,$60,$A5,$02,$38
!byte $E9,$04,$85,$02,$90,$01,$60,$C6
!byte $03,$60,$A5,$02,$38,$E9,$08,$85
!byte $02,$90,$01,$60,$C6,$03,$60,$A2
!byte $00,$20,$A5,$0B,$A6,$0B,$A5,$12
!byte $45,$13,$10,$05,$A5,$0A,$4C,$44
!byte $0B,$A5,$0A,$60,$A2,$00,$85,$04
!byte $86,$05,$A0,$00,$B1,$02,$AA,$E6
!byte $02,$D0,$02,$E6,$03,$B1,$02,$E6
!byte $02,$D0,$02,$E6,$03,$38,$E5,$05
!byte $D0,$09,$E4,$04,$F0,$04,$69,$FF
!byte $09,$01,$60,$50,$FD,$49,$FF,$09
!byte $01,$60,$18,$69,$02,$90,$01,$E8
!byte $60,$A0,$04,$84,$12,$18,$65,$12
!byte $90,$01,$E8,$60,$A0,$01,$B1,$02
!byte $AA,$88,$B1,$02,$E6,$02,$F0,$05
!byte $E6,$02,$F0,$03,$60,$E6,$02,$E6
!byte $03,$60,$A0,$01,$85,$0A,$86,$0B
!byte $B1,$0A,$AA,$88,$B1,$0A,$60,$A0
!byte $01,$B1,$02,$AA,$88,$B1,$02,$60
!byte $A2,$00,$18,$65,$02,$48,$8A,$65
!byte $03,$AA,$68,$60,$85,$10,$8A,$F0
!byte $2E,$86,$11,$20,$52,$0B,$98,$A4
!byte $0B,$F0,$27,$85,$12,$A0,$10,$46
!byte $11,$66,$10,$90,$0B,$18,$65,$0A
!byte $AA,$A5,$0B,$65,$12,$85,$12,$8A
!byte $66,$12,$6A,$66,$11,$66,$10,$88
!byte $D0,$E9,$A5,$10,$A6,$11,$60,$4C
!byte $08,$0B,$86,$0B,$A4,$0A,$A6,$10
!byte $86,$0A,$84,$10,$A0,$08,$4C,$12
!byte $0B,$85,$10,$20,$52,$0B,$98,$A0
!byte $08,$A6,$0B,$F0,$1D,$85,$11,$46
!byte $10,$90,$0B,$18,$65,$0A,$AA,$A5
!byte $0B,$65,$11,$85,$11,$8A,$66,$11
!byte $6A,$66,$10,$88,$D0,$EB,$AA,$A5
!byte $10,$60,$46,$10,$90,$03,$18,$65
!byte $0A,$6A,$66,$10,$88,$D0,$F5,$AA
!byte $A5,$10,$60,$E0,$00,$10,$0D,$18
!byte $49,$FF,$69,$01,$48,$8A,$49,$FF
!byte $69,$00,$AA,$68,$60,$A0,$01,$B1
!byte $02,$85,$0B,$88,$B1,$02,$85,$0A
!byte $4C,$91,$0A,$A9,$00,$A2,$00,$48
!byte $A5,$02,$38,$E9,$02,$85,$02,$B0
!byte $02,$C6,$03,$A0,$01,$8A,$91,$02
!byte $68,$88,$91,$02,$60,$A0,$01,$85
!byte $0A,$86,$0B,$B1,$0A,$AA,$88,$B1
!byte $0A,$4C,$64,$0B,$A0,$03,$A5,$02
!byte $38,$E9,$02,$85,$02,$B0,$02,$C6
!byte $03,$B1,$02,$AA,$88,$B1,$02,$A0
!byte $00,$91,$02,$C8,$8A,$91,$02,$60
!byte $86,$13,$E0,$00,$10,$03,$20,$44
!byte $0B,$85,$10,$86,$11,$20,$89,$0A
!byte $86,$12,$E0,$00,$10,$03,$20,$44
!byte $0B,$85,$0A,$86,$0B,$4C,$03,$0C
!byte $A0,$00,$91,$02,$C8,$48,$8A,$91
!byte $02,$68,$60,$A2,$00,$38,$49,$FF
!byte $A0,$00,$71,$02,$C8,$48,$8A,$49
!byte $FF,$71,$02,$AA,$68,$4C,$E1,$09
!byte $98,$49,$FF,$38,$65,$02,$85,$02
!byte $B0,$02,$C6,$03,$60,$A2,$00,$85
!byte $10,$86,$11,$20,$52,$0B,$20,$03
!byte $0C,$A5,$0A,$A6,$0B,$60,$A9,$00
!byte $85,$05,$A0,$10,$A6,$11,$F0,$1F
!byte $06,$0A,$26,$0B,$2A,$26,$05,$AA
!byte $C5,$10,$A5,$05,$E5,$11,$90,$08
!byte $85,$05,$8A,$E5,$10,$AA,$E6,$0A
!byte $8A,$88,$D0,$E4,$85,$04,$60,$06
!byte $0A,$26,$0B,$2A,$B0,$04,$C5,$10
!byte $90,$04,$E5,$10,$E6,$0A,$88,$D0
!byte $EE,$85

   nop $60
   lda #$b9
   sta $0a
   lda #$0c
   sta $0b
   lda #$00
   ldx #$00
   beq Routine_0c5c

   sta ( $0a ), y
   bne Routine_0c52
   inc $0b
   bne Routine_0c52

   cpy #$00
   beq Routine_0c65
   sta ( $0a ), y
   bne Routine_0c5c

!byte $43,$00,$69,$03,$14,$06,$00,$00
!byte $45,$00,$6E,$03,$F0,$0B,$00,$00
!byte $00,$00

   sta Routine_0c86
   stx $0c87
   sta $0c8d
   stx $0c8e

   lda $ffff, y
   sta $0c97
   lda $ffff, y
   sta $0c96
   sty $0c99
   jsr $ffff
   ldy #$ff
   bne Routine_0c86
!byte $00,$00,$00,$00,$00,$00,$00,$00
!byte $00,$00,$00,$00,$00,$00,$00,$00
!byte $00,$00,$00,$00,$00,$00,$00,$00
!byte $00,$00

   ldy #$00
   beq Routine_0cc6
   lda #$df
   ldx #$0c
   jmp Routine_0c78

!byte $A2
   ora $02b5, y
   sta $0c9f, x
   lda #$00
   ldx #$d0
   sta $02
   stx $03
   lda #$0e
   jsr $ffd2
   jmp Routine_0cb9

I also added a second example that performs a very simple regression in assembly on user input and consumes a fraction of what it would take to show the BIOS of a modern machine. Here, too, the code is not well written, and it could run even faster. I also need to come up with a better representation of floats, which is a bit more difficult. Unfortunately, you have to use a lot of tricks to implement division, which blows up the code.
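
To show why division blows up the code: the 6502 has no divide instruction, so a division has to be built from shifts, compares, and subtractions. The following is a minimal sketch of the classic shift-and-subtract method in C. The name udiv16 and its signature are made up for this illustration and are not taken from my binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned 16-bit division by shift-and-subtract (restoring division).
   It walks over the dividend bit by bit, from the highest to the lowest,
   and needs nothing but shifts, a compare, and a subtraction. */
uint16_t udiv16(uint16_t dividend, uint16_t divisor, uint16_t *remainder)
{
    uint16_t quotient = 0;
    uint32_t rem = 0;  /* wider than 16 bits so the shift cannot overflow */
    int bit;

    assert(divisor != 0);
    for (bit = 15; bit >= 0; --bit) {
        /* Bring down the next bit of the dividend. */
        rem = (rem << 1) | ((uint32_t)(dividend >> bit) & 1u);
        if (rem >= divisor) {
            /* The divisor fits: subtract it and set this quotient bit. */
            rem -= divisor;
            quotient |= (uint16_t)(1u << bit);
        }
    }

    if (remainder != NULL) {
        *remainder = (uint16_t)rem;
    }
    return quotient;
}
```

For example, udiv16(1000, 7, &r) returns 142 and leaves 6 in r. The loop always runs 16 times, and on an 8-bit CPU every one of these 16-bit operations is split into several byte-wide instructions, which is where the extra code comes from.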

You can download the program here and run it in VICE. You can get the VICE emulator here.

The result is a little proof of concept that you can perform complex math on a very tiny system. To run the same code on another system, you need more than a few transistors, 64 kilobytes of RAM, and a small ROM with an even smaller BASIC interpreter. You need at least a small operating system, a much bigger Python interpreter, and a lot of libraries to perform the same actions in roughly the same amount of time.

Thanks to the guys of visual6502 we can now look into the “Brain” while it is working. So this little program is more than a little concept, done on a rainy day, just to have something to post. You can step through the memory of the C64 and follow every step just to see it think. And this is a big leap forward, made by taking a few steps backward.

So Moore’s law looks more like a joke. We are doing the same things in the same time, just with more transistors.

The abuse of Moore’s Law

What we did in the last decades was to make the user interface look nicer and make sure that it looks good on an 8K monitor. That is no improvement. This is a luxury we shouldn’t have when we want to tackle cancer or solve the problem that there are people starving out there. At least we have people who invent new CPUs to create self-driving cars, like Tesla does. There is nothing wrong with it when we want to make our cars safer with AI and ML instead of using them to make a fancy GUI in a car. That is nothing I would spend 140K bucks on. In the end, we throw transistors at a problem like bullets. This is a very American way to solve things. And to be honest: When I say “American way”, it’s mostly meant as “fu$%ing wrong way”.

In the end, we did improve things, and it isn’t fair to claim that nothing got better. But most of the problems we have are solved by adding more power and transistors. On top of the transistors in the CPU we already have, we add more of them in dedicated hardware called a GPU, because the ones we already have aren’t good enough at floating-point math.
After you’ve run this little program, that sounds like a joke.

From a purely mathematical point of view, we could do about 720000 regressions at mostly the same time with just one mediocre Intel i7. It would take only a few easy steps to perform character recognition on a C64, because a sliding window over the information in an image needs just 32 by 32 numbers in an array to recognize characters in it. And these are integers as well, so you don’t have to fiddle around with floats.

The way we are solving problems with machine learning is not a calculation power or speed problem anymore. It is a problem with the software we are using to achieve these goals. Nobody cares about the fact that TensorFlow, PyTorch, etc. are just wrappers that keep you away from what happens under the hood. And most of the TensorFlow methods we use just call other libraries and methods. This complexity kills innovation in a very special way.

Floating-point numbers in particular make me unhappy. The architecture is somehow wrong. Are there really floats out there? From my point of view, there aren’t. I always teach my son that it doesn’t matter whether you treat a number as an integer or as a float. You just have to remember the scale of the number, so that you can show it later by adding a dot somewhere. And most operations in computation could work the same way without using floating point. But I’m losing track, so let’s get back.
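
One short detour to make the scale idea concrete. This is a minimal sketch in plain C, assuming a scale of 100 (two decimal places). The names fixed, SCALE, fixed_mul, fixed_div and fixed_format are chosen for this illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed scale: every value is stored as an integer times 100,
   so 12.50 is kept as 1250. Addition and subtraction work as usual. */
#define SCALE 100L

typedef int32_t fixed;

/* The raw product carries the scale twice, so divide it out once. */
fixed fixed_mul(fixed a, fixed b)
{
    return (fixed)((int64_t)a * b / SCALE);
}

/* Scale the dividend up first so the quotient keeps its two decimals.
   The result is truncated, not rounded. */
fixed fixed_div(fixed a, fixed b)
{
    assert(b != 0);
    return (fixed)((int64_t)a * SCALE / b);
}

/* Only when printing does the dot appear: split off whole and fraction. */
void fixed_format(fixed v, char *buf, size_t len)
{
    long whole = v / SCALE;
    long frac = v % SCALE;

    if (frac < 0) {
        frac = -frac;
    }
    /* Values between -1 and 0 have whole == 0, so add the sign by hand. */
    snprintf(buf, len, "%s%ld.%02ld",
             (v < 0 && whole == 0) ? "-" : "", whole, frac);
}
```

With a = 1250 (12.50) and b = 300 (3.00), a + b is 1550, fixed_mul(a, b) is 3750, and fixed_format turns that into "37.50". fixed_div(a, b) gives 416, that is 4.16, because the result is cut off after two decimals.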

Wasting power

Modern development has a giant problem with wasting power. A modern developer is more of a person who knows predefined methods and classes without ever looking under the hood. There are a lot of “experts” out there who know Node.js, Java, or just Spring Boot, NumPy, and so on. But this is not development. This is playing with bricks made of predefined methods and classes. I work in the finance sector, and a lot of the things I create could perform better on smaller hardware. Most of it could be described as “Receiving a string, taking some information from it, storing it elsewhere, and sometimes sending it to someone else”.

Out there, there are developers who throw things like “Clean Code” at you with no idea how CPUs work. And they should know, because they claim that they are developers. But they aren’t. These people are “API Jockeys”. It is a homage to the term Disc Jockey: those people who know how to use an MP3 player on a Mac and claim they are musicians. You know!

In the end, it would be something you could write in C very easily. But why are we wasting computing power instead of using fewer resources? Why do I have to run Windows 10, waste tons of RAM on an IDE, load Spring Boot? Waste RAM on rubbish libraries like Lombok to save me from writing a getter or setter? Wait, why do I use getters and setters at all?

The Answer?
We are lazy!
It is comfortable! I have a gaming mouse, a 4K resolution GUI, daemons that yell at me when code does not look fine in the sense of the author Robert C. Martin, and a lot of distracting toys like Microsoft Teams, Outlook, two different browsers, and some snake oil running.

If I went back in time and told myself that I would use a bloated programming language that runs on a computer inside my computer, I would not believe it. No developer would do that! But we do. Every day!
It is getting easier to waste computing power than to use it in a good way. Thanks to CUDA everyone has GPUs running to solve “problems”. But we aren’t solving problems that way. We’re just making chip vendors and software companies rich.

When we watch the movie “The Terminator”, we can see in some scenes code for the MOS 6502 CPU scrolling by (a CPU similar to the one in the C64, which was a 6510 with a bank switch for the outstanding SID sound chip and the “GPU”). It appeared every time the audience saw the story from the first-person view of the Terminator. And I think this could be a real thing in another timeline, because it would be perfectly fine to run a killer AI on a lot of those CPUs.
But maybe we are the lucky ones in our current timeline. They had Skynet, we have Intel, and Windows, and IntelliJ, and Docker, and a lot of different things that don’t make us think!
This post was written on Windows 11, which already consumed 50% of the 64 GB of RAM and constantly calculates something in the background at 3% of my CPU, which means there are about 9000 C64s constantly burning to write this text… no machine learning, no games… just writing this!


Shall we now go back to the 80s and only use the C64? No, please not. That is not the point of this post. But we could start learning about what we want to do and how we achieve this goal. We should step back and optimize our code. We should design code for machines rather than for humans.
We want to create brains and intelligence but avoid using our very own. We should step back to assembly or at least C (because the compiler writes assembly perfectly well for us).

The next stop will be real machine learning with a complete binary implementation of OCR running on the MOS 6502 CPU to prove that a complete neural network will run on it fast and smoothly.