well, i never thought i'd receive compliments here, so i'm kind of happy.
however, the second part of your explanation left me a bit confused.. you've raised an important issue. the first thing i thought of was to solve the problem in software, like this (or something similar):
 currentMicros=micros();         //Read the current time, checking if 32us have passed
 if(currentMicros-previousMicros>=32){
  previousMicros=currentMicros;
  sample=0;
  for(pins=0;pins<TOTAL_PINS;pins++){  //Search among all the pins
   if(playflags[pins]==1)       //the ones whose flag has been set and
    sample+=Samples[pins][i];     //sum together their respective samples.
  }
  digitalWrite(latchP, LOW);           //Pull the latch low while shifting the data in
  shiftOut(dataP, clockP, MSBFIRST, (sample>>8)); //Send the high byte to the shift registers
  shiftOut(dataP, clockP, MSBFIRST, sample);   //Send the low byte (shiftOut keeps only the lowest 8 bits)
  digitalWrite(latchP, HIGH);           //Latch the 16 bits onto the outputs
  i++;                      //Increase the sample counter
  if(i==LAST_SAMPLE){             //If the last sample has been reached,
   i=0;                   //reset the counter and
   for(pins=0;pins<TOTAL_PINS;pins++)    //all the play flags.
    playflags[pins]=0;
  }
 }
this way, i can send the data once, at twice the bit depth, using only 2 registers, but i'll need a bigger R2R ladder. i'm still wondering whether the arduino is fast enough to execute all those instructions in just 32us (about 512 clock cycles at 16 MHz).
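A rough way to sanity-check that 32us budget is to count the pin writes. This is only a sketch under assumptions: the 16 MHz clock, the helper names (`cyclesPerPeriod`, `pinWritesPerSample`, `estimatedShiftCycles`) and the cost per `digitalWrite()` are all made up for illustration, not measured. As far as i know, the stock `shiftOut()` does three `digitalWrite()` calls per bit (data, clock high, clock low), which is what the count below assumes.

```cpp
#include <cstdint>

// Clock cycles available in one sample period.
// cpuHz is an assumption: 16 MHz is typical for an Uno-class board.
constexpr uint32_t cyclesPerPeriod(uint32_t cpuHz, uint32_t periodUs) {
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(cpuHz) * periodUs) / 1000000ULL);
}

// digitalWrite() calls needed to push `bytes` bytes through shiftOut(),
// assuming 3 calls per bit, plus 2 calls for the latch pin (LOW, then HIGH).
constexpr uint32_t pinWritesPerSample(uint32_t bytes) {
    return bytes * 8u * 3u + 2u;
}

// Rough cycle cost of the output stage for an ASSUMED (not measured)
// number of cycles per digitalWrite() call.
constexpr uint32_t estimatedShiftCycles(uint32_t bytes, uint32_t cyclesPerWrite) {
    return pinWritesPerSample(bytes) * cyclesPerWrite;
}
```

For example, `cyclesPerPeriod(16000000UL, 32)` gives the 512 cycles mentioned above, and two bytes need 50 pin writes. If each `digitalWrite()` cost something like 50 cycles (a guess), the output stage alone would be around 2500 cycles, so the real cost is worth timing with `micros()` before relying on the 32us period.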
Anyway, if i used separate registers and ladders for every sample, how could i mix the sounds together?