The ATmega32, with Vs = 5 V, can source or sink up to 20 mA per pin, with a voltage loss of about 0.7 V to 0.8 V, and it can source or sink 4·(20 mA) = 80 mA in total (per package), and even per port, with no problems.

Assuming you do not want to exceed the 20 mA rating per LED, here is a different way of doing it:

U1 = ADG1636. It has two SPDT switches. Each switch connection can carry 238 mA (max), in either direction, at 25 ºC. That is well above 4·(20 mA) = 80 mA, so U1 acts as a high-current buffer. The IC costs $1.83 in quantities of 1,000.
Rs = (5 V - 2 V - 0.7 V)/(20 mA) = 115 \$\Omega\$, 1/4 W. You only need four of them.
For it to be safe to connect pairs of LEDs in anti-parallel, as shown, the forward voltage of each LED must be lower than the maximum reverse voltage of its partner, \$V_{F}<|V_{Rmax}|\$. That condition is usually satisfied.
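As a quick check of the Rs arithmetic above, here is a minimal Python sketch. It assumes that the 2 V term is the LED forward voltage and that the 0.7 V term is the voltage lost in the MCU pin; the function name `series_resistor` is only illustrative.

```python
def series_resistor(v_supply, v_led, v_drop, i_led):
    """Series resistance in ohms that sets i_led (amps) through one LED."""
    return (v_supply - v_led - v_drop) / i_led

I_LED = 0.020                               # 20 mA per LED
rs = series_resistor(5.0, 2.0, 0.7, I_LED)  # assumed: 2 V LED drop, 0.7 V pin loss
p_rs = I_LED ** 2 * rs                      # power in the resistor while its LED is on

print(f"Rs = {rs:.0f} ohm, P = {p_rs * 1000:.0f} mW")
```

The dissipation comes out at roughly 46 mW while the LED is on, so a 1/4 W resistor has plenty of margin.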
Steps:
1) Set B=0 (as shown in the figure). That will give you access to diodes D9 to D16. Diodes D1 to D8 will all be off.
2) Set A=0 (as shown in the figure). That will give you access to diodes D10, D12, D14 and D16.
3) Set C=A if you want D10 to be off. Set C=!A (! means negated) if you want D10 to be on.
4) Simultaneously with 3), do the same for {D,D12}, {E,D14}, {F,D16}.
5) Set A=1. That will give you access to diodes D9, D11, D13 and D15.
6) Repeat 3) and 4), but for {C,D9}, {D,D11}, {E,D13}, {F,D15}.
7) Set B=1. That will give you access to diodes D1 to D8.
8) Repeat 2) to 6), but for diodes D1 to D8.
9) Repeat 1) to 8), for each new whole cycle.
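The steps above can be sketched as a small Python model that lists, for each of the four phases, the levels to put on lines A to F. This is only a sketch of the sequencing, not firmware: the name `scan_phases` is made up, and the mapping for D1 to D8 (even-numbered LEDs with A=0, odd-numbered with A=1) is assumed by analogy with D9 to D16, so check it against the schematic.

```python
# Scan order from the numbered steps: (B, A, LEDs reached through lines C, D, E, F).
# The two B=1 rows are an assumption made by analogy with the B=0 rows.
SCAN_ORDER = [
    (0, 0, (10, 12, 14, 16)),
    (0, 1, (9, 11, 13, 15)),
    (1, 0, (2, 4, 6, 8)),
    (1, 1, (1, 3, 5, 7)),
]

def scan_phases(leds_on):
    """Return one (A, B, C, D, E, F) tuple per phase for the set of lit LED numbers."""
    states = []
    for b, a, leds in SCAN_ORDER:
        # A line equal to A keeps its LED off; the negated value (!A) turns it on.
        lines = tuple((1 - a) if led in leds_on else a for led in leds)
        states.append((a, b) + lines)
    return states

# Example: light D1, D9 and D10 and show the line levels of one whole cycle.
for state in scan_phases({1, 9, 10}):
    print(dict(zip("ABCDEF", state)))
```

On the real MCU, each tuple would be written to the six GPIO pins and held for an equal time slice before moving on to the next one; the pin assignment is left open here.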
With that, each LED that is switched on will be driven with a duty cycle of 1/4 (which is good, taking into account that you have 16 diodes). And yes, you can mix PWM with this idea, if you want to control the brightness gradually.
As I said, this solution does not exceed the 20 mA rating per LED, so the maximum brightness that you will see is 1/4 of the maximum brightness that each LED can produce. If you want more brightness, use LEDs that produce more mcd/mA. That keeps their long life intact.
Thanks to the high current capability of U1, the amount of light that each LED will produce will not depend on the total number of LEDs that are on.
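The current budget behind these statements can be checked with a few lines of Python. This is a back-of-the-envelope sketch that uses only the figures quoted in this answer; the variable names are illustrative.

```python
I_LED_MA = 20        # peak current per LED, in mA
LEDS_PER_PHASE = 4   # lines C, D, E and F can each light one LED per phase
N_PHASES = 4         # two values of B times two values of A
U1_MAX_MA = 238      # ADG1636 per-switch maximum at 25 ºC, as quoted above

peak_through_u1 = LEDS_PER_PHASE * I_LED_MA   # worst case: all four LEDs of a phase on
avg_per_led = I_LED_MA / N_PHASES             # time-averaged current of a lit LED

print(f"peak through U1: {peak_through_u1} mA (limit {U1_MAX_MA} mA)")
print(f"average per lit LED: {avg_per_led} mA")
```

The worst-case 80 mA through U1 stays well under its 238 mA limit, while each lit LED averages 5 mA, which is where the 1/4 brightness figure comes from.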
And you still need only six GPIO lines from your MCU, with just one external IC instead of a decoder plus buffers or transistors. This is more expensive, but more compact (if that is critical), and the wiring is slightly easier (six lines, instead of eight, go to the LED matrix). This is more of a curious and academic answer, in my opinion.
Added for Federico Russo: what you say was already addressed in my paragraph "As I said, this solution[...]". Forcing 80 mA through a 20 mA LED, even for 1/4 of the time, is not a good idea. Its life will be shortened, not because of excessive dissipation (which is the same), but because of electromigration (which is proportional to the current). See this reference from Cree. Excerpt:
Repetitive pulsing
The second type of over-current condition, high-current repetitive
pulsing, may or may not result in an early catastrophic failure of the
LED. Repetitive high-current pulsing may result in a shortened life
expectancy for the LED compared to the usual expected lifetime, on the
order of tens or hundreds of thousands of hours. A particular device
subjected to repeated transients at an amplitude some percentage above
the data-sheet limits but below the threshold required for
single-pulse failure will still eventually fail. The failure mechanism
will most likely be due to electromigration as enough metal ions are
eventually shifted away from their original lattice positions.
If you want the same life and the same light with less current, use LEDs that produce more mcd at 20 mA.