In the old days there was software for the IBM PC that enabled "real" sound output on the internal speaker (which is connected to a single IO port bit, just like a speaker connected to an Arduino pin). You can think of it like this:
If you output a high-frequency signal to the speaker, well beyond what it is capable of converting to sound, it will essentially average the signal into a physical position. Output a square wave and the speaker cone will sit in the middle of its range. Vary the duty cycle of the square wave above or below the 50% mark and you move the cone to one side or the other of center. Do that fast enough, and it will start making noise again. There were some impressive demos (N-part polyphonic music, speech), impressive in the sense of "it's not that the bear dances well, it's that it dances at all."
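For concreteness, here is roughly what that trick looks like on an AVR: run a PWM carrier well above audio and treat the duty-cycle register as the "cone position", updating it at the sample rate. This is only a sketch under assumptions I've picked myself (an ATmega328-based Uno, speaker plus series resistor on pin 11/OC2A, a made-up 16-sample table, an 8 kHz update rate), not the original PC-speaker code:

```cpp
#include <avr/interrupt.h>
#include <avr/pgmspace.h>

// Placeholder waveform: one cycle of a crude triangle wave
// (16 samples played at 8 kHz comes out around a 500 Hz tone).
const uint8_t samples[] PROGMEM = {
  128, 160, 192, 224, 255, 224, 192, 160,
  128,  96,  64,  32,   0,  32,  64,  96
};
volatile uint16_t idx = 0;

void setup() {
  pinMode(11, OUTPUT);                 // OC2A, the PWM output

  // Timer2: fast PWM, no prescaler -> 16 MHz / 256 = 62.5 kHz carrier,
  // far above audio, so the speaker just averages the duty cycle.
  TCCR2A = _BV(COM2A1) | _BV(WGM21) | _BV(WGM20);
  TCCR2B = _BV(CS20);
  OCR2A  = 128;                        // start at the 50% mark (cone centered)

  // Timer1: CTC interrupt at 8 kHz to step through the sample table.
  TCCR1A = 0;
  TCCR1B = _BV(WGM12) | _BV(CS10);
  OCR1A  = F_CPU / 8000UL - 1;
  TIMSK1 = _BV(OCIE1A);
}

ISR(TIMER1_COMPA_vect) {
  OCR2A = pgm_read_byte(&samples[idx]);  // new duty cycle = new cone position
  idx = (idx + 1) % sizeof(samples);
}

void loop() {}                           // everything happens in the ISR
```

A proper output stage would filter out the 62.5 kHz carrier, but a small speaker does a fair job of ignoring it all by itself.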
I haven't looked at the 1-bit sound code, but I would imagine that it is essentially pre-computing the bitstream that results. I think the theory says that speeding up the bitstream gets you better audio, probably to pretty much exactly the same degree that adding more bits to the encoding would (because that's what it is really doing). By the time you get better-quality sound, you're probably using as much memory as a more conventional storage scheme.
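As a guess at what that pre-computation could look like (my own sketch, not the actual 1-bit player's code): take ordinary 8-bit samples and expand each into some number of output bits whose density of 1s tracks the sample value, accumulator-style. The function name and the `oversample` parameter are hypothetical:

```cpp
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// in:  n samples, 0..255
// out: n * oversample bits, packed LSB-first; caller provides
//      (n * oversample + 7) / 8 bytes of space.
void encode_1bit(const uint8_t *in, size_t n, unsigned oversample, uint8_t *out) {
  memset(out, 0, (n * oversample + 7) / 8);  // start with all bits clear
  unsigned acc = 0;                          // running "owed" output level
  size_t bit = 0;
  for (size_t i = 0; i < n; i++) {
    for (unsigned k = 0; k < oversample; k++) {
      acc += in[i];                          // accumulate the desired level
      if (acc >= 256) {                      // time to emit a 1
        acc -= 256;
        out[bit >> 3] |= 1u << (bit & 7);
      }
      bit++;
    }
  }
}
```

Which illustrates the trade described above: at 8x oversampling the 1-bit stream is exactly as many bits as the original 8-bit samples, so the memory savings evaporate about as fast as the quality improves.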
There are several audio compression schemes developed for telecommunications that transmit "understandable voice" at relatively low bit rates (8 kbps was "easy", IIRC). They might require special hardware. But even at 8 kbps, all of the internal memory of the ATmega328 put together is going to hold "not many" seconds of audio.
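To put a rough number on "not many" (my arithmetic, using the ATmega328 datasheet figures; the player code itself still has to live in the same flash):

```cpp
// 8 kbps = 1000 bytes per second of audio.
#define BYTES_PER_SEC  1000UL

#define FLASH_BYTES   32768UL   // 32 KB flash (shared with the program)
#define SRAM_BYTES     2048UL   // 2 KB SRAM
#define EEPROM_BYTES   1024UL   // 1 KB EEPROM

// Roughly 32 s if all of flash held audio, ~2 s in SRAM, ~1 s in EEPROM,
// and less in practice once the sketch takes its share of the flash.
enum {
  FLASH_SECONDS  = FLASH_BYTES  / BYTES_PER_SEC,  // 32
  SRAM_SECONDS   = SRAM_BYTES   / BYTES_PER_SEC,  //  2
  EEPROM_SECONDS = EEPROM_BYTES / BYTES_PER_SEC   //  1
};
```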
And in the end it's pretty silly. Memory has followed Moore's law and is extremely cheap these days. Spend $5 on a flash card and get a couple of hours' worth of CD-quality sound...