Due SAM3X8E SysTick Problem

I have tried to test [assembly language re-attached with __ASM]...

You're not understanding me correctly. That assembly code was not "suggested" as a faster solution than your C code; it is the code that the compiler actually produces from your C code.

  A[Index++] = PIOD->PIO_PDSR;

Becomes:
   80156:       6bd8            ldr     r0, [r3, #60]   ;  load from PIOD->PIO_PDSR
   80158:       b2c0            uxtb    r0, r0          ; zero-extend low byte to 32 bits
   8015a:       9000            str     r0, [sp, #0]    ; store into local array "A"

That's pretty close to optimal. The compiler is even "unrolling" the "Index++" into increasing static offsets, because that's quicker, smaller, and uses fewer registers than actually keeping and incrementing a counter or pointer.
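For what it's worth, here is roughly what that "unrolled" form looks like if you write it by hand in C. This is only a sketch: capture4 and its reg parameter are names I made up so that it compiles on a PC; on the Due you would presumably pass the address of PIOD->PIO_PDSR.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original code): take four
 * back-to-back samples of a 32-bit register, keeping the low byte
 * of each one. */
void capture4(const volatile uint32_t *reg, uint8_t out[4])
{
    /* Constant indexes: there is no counter to keep or increment,
     * so each line can be a load and a store at a fixed offset. */
    out[0] = (uint8_t)*reg;
    out[1] = (uint8_t)*reg;
    out[2] = (uint8_t)*reg;
    out[3] = (uint8_t)*reg;
}
```

Because reg points to a volatile object, each of the four lines has to perform its own read; the compiler is not allowed to read once and copy the value four times.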
But the exact code isn't as important as the other example, which shows:

   80152:       6bd8            ldr     r0, [r3, #60]   ;  load from PIOD->PIO_PDSR
   80154:       6bd8            ldr     r0, [r3, #60]   ;  load from PIOD->PIO_PDSR
   80156:       6bd8            ldr     r0, [r3, #60]   ;  load from PIOD->PIO_PDSR
   80158:       6bd8            ldr     r0, [r3, #60]   ;  load from PIOD->PIO_PDSR

Just a set of consecutive reads. The compiler has "figured out" that you were storing the values in an array that was never actually used, and so it just avoids storing them at all. It doesn't create the array either. The only reason that it bothers to read the registers is that they're declared as "volatile", which means that the compiler MUST access them when you say so.
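If you want a test where the stores survive, the stored data has to be used for something. A minimal sketch of that idea, assuming a made-up fake_pdsr variable standing in for PIOD->PIO_PDSR (so it builds on a PC) and a hypothetical function name sample_and_sum:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PIOD->PIO_PDSR. It is volatile, so every
 * read written in the source must really be performed. */
static volatile uint32_t fake_pdsr = 0xA5u;

/* Read the "register" n times into buf, then return the sum of the
 * stored bytes. The buffer belongs to the caller and its contents
 * feed the return value, so the stores are no longer dead code as
 * long as the caller does something with the result. */
uint32_t sample_and_sum(uint8_t *buf, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        buf[i] = (uint8_t)fake_pdsr;   /* volatile read + store */
    }
    for (size_t i = 0; i < n; i++) {
        sum += buf[i];                 /* the stored data is used */
    }
    return sum;
}
```

The summing loop runs after the capture loop, so it shouldn't disturb the timing of the reads themselves; it only gives the compiler a reason to keep the array.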
Now, your timing says that each of those "ldr" instructions takes 4 cycles. That's a bit annoying, since supposedly "ARM architecture executes most instructions in a single cycle". But the code is running from flash memory with wait states, every read goes over a slow "peripheral bus", and every instruction is a memory access (which might stall pipelines?), so it's not really that surprising that each one apparently takes 4 cycles.
That also means that the 8 cycles per read/store that you're seeing is as fast as you can expect it to go if you actually save the data. You MIGHT get slightly faster by moving the code to RAM, but it's pretty uncertain (fewer wait states, but more contention.)
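To put those cycle counts into sample rates, here is the arithmetic as a tiny C function. The 84 MHz figure in the note below is my assumption (the Due's usual core clock); it isn't stated in this thread, and samples_per_second is just a name for illustration.

```c
#include <stdint.h>

/* Samples per second for a given core clock and a measured number of
 * cycles per sample. Returns 0 if cycles_per_sample is 0. */
uint32_t samples_per_second(uint32_t cpu_hz, uint32_t cycles_per_sample)
{
    if (cycles_per_sample == 0) {
        return 0;
    }
    return cpu_hz / cycles_per_sample;
}
```

Assuming 84 MHz, samples_per_second(84000000, 4) gives 21000000 for the read-only case, and samples_per_second(84000000, 8) gives 10500000 for the read/store case.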