The usual way to save "a lot" of space using assembly language is to define your own custom register usage scheme, rather than C's "well-structured ABI."
For example, in the C code you posted, write() is relatively large. The C ABI allows a called function to modify the registers used for argument passing, and write() calls other functions, so it first has to copy its four arguments into call-saved registers (r16, r17, r28, r29), which in turn have to be pushed on entry and popped on exit.
void write(uint8_t first, uint8_t second, uint8_t third, uint8_t fourth) {
aa: 0f 93 push r16
ac: 1f 93 push r17
ae: cf 93 push r28
b0: df 93 push r29
b2: 08 2f mov r16, r24
b4: 16 2f mov r17, r22
b6: d4 2f mov r29, r20
b8: c2 2f mov r28, r18
ba: c3 df rcall TM1637_start
bc: 80 e4 ldi r24, 0x40
be: cd df rcall TM1637_write_byte
c0: c6 df rcall TM1637_stop
c2: bf df rcall TM1637_start
c4: 80 ec ldi r24, 0xC0
c6: c9 df rcall TM1637_write_byte
c8: 80 2f mov r24, r16
ca: c7 df rcall TM1637_write_byte
cc: 81 2f mov r24, r17
ce: c5 df rcall TM1637_write_byte
d0: 8d 2f mov r24, r29
d2: c3 df rcall TM1637_write_byte
d4: 8c 2f mov r24, r28
d6: c1 df rcall TM1637_write_byte
d8: df 91 pop r29
da: cf 91 pop r28
dc: 1f 91 pop r17
de: 0f 91 pop r16
e0: b6 cf rjmp TM1637_stop
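If you rewrite those sub-functions (write_byte, start, stop) so that they do NOT modify the argument registers, and take their own argument in a different register, write() could become much shorter (16 instructions instead of 28, a saving of a bit over 40%):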
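; write_byte takes its argument in r25 and leaves r24, r22, r20, r18 untouched.
; (ldi only works on r16..r31, so a low register such as r2 cannot be loaded directly.)
rcall TM1637_start
ldi r25, 0x40
rcall TM1637_write_byte
rcall TM1637_stop
rcall TM1637_start
ldi r25, 0xC0
rcall TM1637_write_byte
mov r25, r24
rcall TM1637_write_byte
mov r25, r22
rcall TM1637_write_byte
mov r25, r20
rcall TM1637_write_byte
mov r25, r18
rcall TM1637_write_byte
rjmp TM1637_stop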
Of course, it may not be easy to write those sub-functions with fewer registers.
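As a rough illustration of what such a sub-function might look like, here is a minimal sketch of a write_byte that touches only r25 (its argument) and r19 (a loop counter), so the caller's r24, r22, r20 and r18 survive the call. The port address, the pin numbers, and the choice of r25/r19 are all assumptions for illustration, not taken from your code. Bit timing delays and the ACK read are left out, and it drives the pins directly, which may not match how your start/stop routines handle DIO.

```
; Hypothetical sketch, GNU as syntax. Assumed wiring: CLK and DIO on one port.
.equ TM_PORT, 0x18          ; assumed I/O address of the port (PORTB on ATtiny85)
.equ TM_CLK,  0             ; assumed CLK pin number
.equ TM_DIO,  1             ; assumed DIO pin number

; In:       r25 = byte to send, LSB first
; Clobbers: r25, r19, SREG
; Keeps:    r24, r22, r20, r18 (the caller's C argument registers)
TM1637_write_byte:
    ldi  r19, 8             ; bit counter
wb_loop:
    cbi  TM_PORT, TM_CLK    ; clock low before changing data
    cbi  TM_PORT, TM_DIO    ; start by assuming the bit is 0
    lsr  r25                ; shift the next bit (LSB) into carry
    brcc wb_clock           ; carry clear: leave DIO low
    sbi  TM_PORT, TM_DIO    ; carry set: drive DIO high
wb_clock:
    sbi  TM_PORT, TM_CLK    ; rising edge clocks the bit into the chip
    dec  r19                ; dec leaves carry alone, only Z matters here
    brne wb_loop
    cbi  TM_PORT, TM_CLK    ; ninth clock pulse for the ACK slot (ACK not read)
    sbi  TM_PORT, TM_CLK
    ret
```

Both r25 and r19 are call-clobbered under the avr-gcc ABI and are not used for uint8_t arguments, so a C caller of write() should not notice the difference.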
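You can also keep "commonly used constants" in dedicated registers. The in and out instructions take only a register operand, so a constant that already sits in a register can be written to a port without an ldi first.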
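A minimal sketch of that idea, assuming r2 and r3 are reserved for the whole program and that CLK and DIO are bits 0 and 1 of a port at I/O address 0x18 (all of these are illustrative assumptions):

```
; Hypothetical sketch, GNU as syntax.
.equ TM_PORT, 0x18          ; assumed I/O address of the port

; Run once at startup. ldi cannot target r0..r15, so go through r16.
init_consts:
    ldi  r16, 0x00
    mov  r2, r16            ; r2 = both pins low
    ldi  r16, 0x03
    mov  r3, r16            ; r3 = CLK and DIO high
    ret

; Anywhere later: one instruction per port write, no ldi needed.
set_both_high:
    out  TM_PORT, r3
    ret
set_both_low:
    out  TM_PORT, r2
    ret
```

Note that out writes the whole port, so this only fits if the other pins on that port are unused or should get the same value. If the program also contains C code, the compiler has to be kept away from those registers, for example with GCC's -ffixed-r2 and -ffixed-r3 options.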