Nice to see gcc optimizes the loop so well and there's no need for inline asm. I didn't know there was an instruction that does both the load and the post-increment.
This is one of the reasons I recommend against using asm unless you absolutely have to (which is practically never).
The compiler generates good code, and unless you are very, very familiar with the underlying hardware (as the compiler-writers happen to be) you may choose sub-optimal ways of solving the problem.
By all means disassemble the output and see what is generated. That can give hints about ways of optimizing (for example) how you store data in arrays. But ultimately you practically never need to out-guess the compiler.
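As a rough illustration of the kind of loop worth inspecting, here is a minimal sketch in plain C. The function name sum_array and the element type are made up for the example and are not taken from the code discussed in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: add up n values stored contiguously in an array.
 * Unsigned arithmetic is used so that overflow wraps around instead of
 * being undefined behaviour. */
uint32_t sum_array(const uint32_t *data, size_t n)
{
    uint32_t total = 0;

    /* A plain indexed loop, with no inline asm and no manual pointer
     * tricks. The optimizer is free to turn the indexing into whatever
     * addressing mode suits the target best. */
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

Compiling this with something like `gcc -O2 -S` writes the generated assembly to a .s file you can read. On a target that has a post-increment addressing mode, the load and the pointer bump may well end up as a single instruction, but the exact output depends on the compiler version, the target and the flags, so check it on your own setup rather than assuming.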