You've got several possibilities.
Drive all LEDs directly using fairly large resistors. Should your program turn on all LEDs at full brightness at the same time, 40 x 20 mA = 800 mA would normally flow. That's too much. By using larger resistors you can limit the current to less than 5 mA per LED, so that less than 40 x 5 mA = 200 mA flows when the software turns on all LEDs at once.
You'll probably still see a lot of light, but not as bright as possible.
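To put rough numbers on that, here is a small sketch of the arithmetic in Python. The 5 V supply and the 2 V forward voltage are my assumptions (typical for a red LED on a 5 V board); check the datasheet of your own LEDs. The function names are just made up for this example.

```python
def series_resistor_ohms(supply_v, forward_v, current_ma):
    """Ohm's law for the series resistor: R = (Vsupply - Vf) / I."""
    return (supply_v - forward_v) * 1000.0 / current_ma


def total_current_ma(led_count, current_ma):
    """Worst case: every LED is switched on at the same time."""
    return led_count * current_ma


SUPPLY_V = 5.0    # assumption: LEDs fed from a 5 V pin
FORWARD_V = 2.0   # assumption: forward voltage of a typical red LED
LED_COUNT = 40    # from the question

for per_led_ma in (20, 5):
    r = series_resistor_ohms(SUPPLY_V, FORWARD_V, per_led_ma)
    total = total_current_ma(LED_COUNT, per_led_ma)
    print(f"{per_led_ma} mA per LED: about {r:.0f} ohm, "
          f"{total} mA with all {LED_COUNT} LEDs on")
```

In practice you would pick the next larger standard resistor value, which keeps the current a bit below the target.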
You could... also drive them normally, but you'll have to be very sure your program doesn't turn on too many LEDs at the same time. With 800 mA flowing, the chance of smelling magic smoke is quite high. One mistake while writing or testing the program can be enough to need a new Mega. Internally the Mega groups its pins into ports of up to 8 pins each, so you'll also need to check the datasheet to see how much current a single port can handle.
An option I didn't think of yesterday is using transistors. The right ones can handle more current and they hardly put a strain on the Mega. A chip like the ULN2003 houses 7 transistors, so 6 chips would be enough to drive all LEDs at full capacity at the same time. Its relative, the ULN2803, houses 8, so you'd need 5 of them.
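The chip counts are just a matter of rounding up. A tiny Python sketch of that calculation (the function name is made up for the example):

```python
import math


def chips_needed(led_count, channels_per_chip):
    """Round up: a partly used chip still has to be on the board."""
    return math.ceil(led_count / channels_per_chip)


print("ULN2003 (7 channels):", chips_needed(40, 7))
print("ULN2803 (8 channels):", chips_needed(40, 8))
```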
With 40 LEDs, the options above share one disadvantage. For fading you'll need PWM. The Mega has 14 PWM pins of its own, which can easily be controlled with the analogWrite() function. Other pins can be used as PWM pins too, but you'll need to write software or use a library to make that possible. You'll need to manipulate ports/pins as Aeturnalus wrote.
An example can be found at http://arduino.cc/playground/Main/PWMallPins
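To give an idea of what software PWM boils down to, here is a simplified model in Python. It is not the code from the playground page, and it says nothing about timing on a real board. It only shows the principle: count ticks within one period and keep a pin on while the counter is below that pin's duty value. The names and the resolution of 256 steps are my own choices.

```python
def soft_pwm_period(duties, resolution=256):
    """Yield the on/off state of every pin for each tick of one PWM period."""
    for tick in range(resolution):
        # A pin stays on for the first `duty` ticks of the period.
        yield [tick < duty for duty in duties]


def on_fraction(duties, resolution=256):
    """Fraction of the period each pin spends switched on."""
    counts = [0] * len(duties)
    for states in soft_pwm_period(duties, resolution):
        for pin, is_on in enumerate(states):
            counts[pin] += is_on
    return [count / resolution for count in counts]


# Four hypothetical pins: off, a quarter, half, and almost full brightness.
print(on_fraction([0, 64, 128, 255]))
```

On the Mega the same loop would have to run fast enough that the eye doesn't notice the flicker, which is why a timer interrupt or a library is normally used for it.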
A TLC5940 has 16 output pins and a PWM function for each pin. In the example 5 pins are needed to control the chip.
Four of those make sure the data is transferred correctly, while the remaining one carries the actual data.
Besides that one input pin for data, the chip also has an output pin, which makes it possible to daisy-chain them. That pin can be connected to the input pin of the next TLC5940 (which also has an output pin). Although the comparison isn't entirely correct, the input/output pin combination lets you use the chips like rows of dominoes. Tumbling the first domino of a set of 16 makes all of them fall, and if you place another set of 16 behind it, tumbling the first domino of the first set makes those fall as well. Dominoes are too simple a comparison, though, since all of them fall: you can't command one to stay up and the next to fall, something you (more or less) can do with a shift register.
The other 4 signals of the second (third, fourth, etc.) chip behave exactly the same as those on the first. By connecting each of those 4 pins to the same pin on the second, third, ... chip, every chip can be controlled and each output pin can be set to the fade value you need using just 5 pins of the Arduino.
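The daisy-chain idea can be sketched with a simplified model in Python. This is not a bit-accurate TLC5940 simulation and not how any particular library does it. It only shows that values pushed into the first chip travel down the chain, so the value meant for the last output has to be sent first. All names are hypothetical; the only real facts used are the 16 outputs per chip and the 12-bit brightness range (0 to 4095).

```python
CHANNELS_PER_CHIP = 16  # a TLC5940 has 16 outputs


def shift_in(chain, value):
    """Clock one value into the front; the value at the far end drops off."""
    return [value] + chain[:-1]


def load_chain(values, chip_count):
    """Send one value per output, last output first, through the data pin."""
    if len(values) != chip_count * CHANNELS_PER_CHIP:
        raise ValueError("need exactly one value per output in the chain")
    chain = [0] * len(values)
    for value in reversed(values):
        chain = shift_in(chain, value)
    return chain


def split_per_chip(chain):
    """Group the chain into one list of 16 values per chip."""
    return [chain[i:i + CHANNELS_PER_CHIP]
            for i in range(0, len(chain), CHANNELS_PER_CHIP)]


# 40 LEDs need 3 chips (48 outputs); the last 8 outputs stay unused.
wanted = [led * 100 for led in range(40)] + [0] * 8
chips = split_per_chip(load_chain(wanted, chip_count=3))
for number, outputs in enumerate(chips):
    print("chip", number, outputs)
```

After loading, every output holds exactly the value intended for it, even though everything went in through the single data pin of the first chip.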
There will probably be a limit to the number of chips you can control this way, but it wouldn't surprise me if it's higher than 15-20.
There are a few libraries to drive the chip; it isn't the simplest shift register, though.
If you would like to experiment with shift registers, the 74HC595 is probably one of the easiest to control, but it has no PWM capability of its own.
There's a lot available; it wouldn't surprise me if there are, for example, chips controlled over I2C or other protocols that are also intended to create PWM signals.