Performance issues with the W5100 Ethernet Shield

Hello Arduino Community,
I'm not a native English speaker, so please excuse my spelling mistakes.
To control my Arduino over my home network, I wrote a small HTML page with jQuery, about 2 kB in size. Then I shortened the code by hand and stored it in some Strings on the Arduino.
First I found out that the RAM of the UNO is too small for storing my Strings, so I'm now using a Mega2560, but that was not the problem.
When I tried it out, I was surprised that it takes over 600 ms to transfer the data back from the Arduino to me. 2 kB in 600 ms is about 3.3 kB/s, so at 16 MHz roughly 5000 CPU cycles for each transferred byte. There has to be something wrong.

After doing some research on the Internet, I started Wireshark and found that each request to the Arduino produces about 2000 TCP segments, each carrying 1 usable byte. That is a header overhead of 60 bytes for every usable byte, which means 6000% more data to send.
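To make the arithmetic above easy to re-check, here is a small sketch in plain C++ (not Arduino-specific). The function names are made up for this example; the input numbers are the ones from my measurement (2 kB, 600 ms, 16 MHz, 60 header bytes per payload byte).

```cpp
#include <cassert>

// Average CPU cycles available per transferred byte.
constexpr double cyclesPerByte(double cpuHz, double bytes, double seconds) {
    return cpuHz / (bytes / seconds);
}

// Header overhead as a percentage of the payload size.
constexpr double overheadPercent(double headerBytes, double payloadBytes) {
    return 100.0 * headerBytes / payloadBytes;
}

// Re-checks the figures quoted in the post (hypothetical helper).
inline void checkPostFigures() {
    const double cycles = cyclesPerByte(16e6, 2000.0, 0.6);
    assert(cycles > 4799.0 && cycles < 4801.0);    // about 4800, i.e. roughly 5000
    assert(overheadPercent(60.0, 1.0) == 6000.0);  // 60 header bytes per payload byte
}
```

If the same 60 bytes of headers carried a larger payload, the relative overhead would shrink accordingly, which is the whole point of sending bigger segments.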

After looking through the Arduino library, I found the lines causing the problem in "arduino-1.0.5\hardware\arduino\cores\arduino\print.cpp", line 54:

size_t Print::print(const String &s)
{
  size_t n = 0;
  for (uint16_t i = 0; i < s.length(); i++) {
    n += write(s[i]);
  }
  return n;
}

Each character is sent separately via write(s[i]). This might work for serial communication, but here it causes the massive header overhead, because each write() starts a new segment.
After I realized this, I replaced the separate String storage by copying and pasting the text directly into the code. This already gave the communication a giant boost.
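The difference can be illustrated on a PC with a small mock. This is only a sketch: CountingClient, printPerChar and printBulk are hypothetical names, std::string stands in for the Arduino String, and the mock simply assumes one TCP segment per write() call, as the Wireshark trace suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the Ethernet client: it only counts write()
// calls, assuming each call ends up as one TCP segment on the wire.
struct CountingClient {
    std::size_t calls = 0;
    std::size_t bytes = 0;

    std::size_t write(const std::uint8_t* buf, std::size_t size) {
        assert(buf != nullptr || size == 0);
        ++calls;
        bytes += size;
        return size;
    }
    std::size_t write(std::uint8_t b) { return write(&b, 1); }
};

// Mirrors the loop in Print::print(const String&): one write() per character.
std::size_t printPerChar(CountingClient& c, const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.length(); i++) {
        n += c.write(static_cast<std::uint8_t>(s[i]));
    }
    return n;
}

// Sends the whole text with a single write(buf, size) call.
std::size_t printBulk(CountingClient& c, const std::string& s) {
    return c.write(reinterpret_cast<const std::uint8_t*>(s.data()), s.length());
}
```

With a 2000-character page, printPerChar makes 2000 write() calls while printBulk makes one, for the same number of payload bytes.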
But it still wasn't perfect: I found that every println produced 3 TCP segments. This is caused by the "println()" in line 127, which is called by every println(something):

size_t Print::println(void)
{
  size_t n = print('\r');
  n += print('\n');
  return n;
}

On top of that, if you issue several prints in a row, each one again produces its own header.
One temporary solution for me would be to create a helper class which automatically collects the data and splits it into segments, but this would only use RAM and CPU power unnecessarily, because the W5100 has its own memory where the data could be stored.
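Such a helper class could look roughly like the sketch below. Everything in it is an assumption for illustration: BufferedWriter and CountingSink are made-up names, and the sink is any type offering write(const uint8_t*, size_t). It is written in plain C++ so it can be tried on a PC; the buffer of N bytes is exactly the extra RAM cost mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sink used for testing: counts how often it is written to.
struct CountingSink {
    std::size_t calls = 0;
    std::size_t bytes = 0;
    std::size_t write(const std::uint8_t* buf, std::size_t size) {
        (void)buf;
        ++calls;
        bytes += size;
        return size;
    }
};

// Hypothetical helper: collects output in RAM and forwards it in big chunks.
template <typename Sink, std::size_t N = 256>
class BufferedWriter {
    static_assert(N > 0, "buffer size must be positive");
public:
    explicit BufferedWriter(Sink& sink) : sink_(sink), len_(0) {}
    ~BufferedWriter() { flush(); }  // send whatever is left

    // Appends data and flushes each time the buffer becomes full.
    std::size_t write(const char* data, std::size_t size) {
        std::size_t done = 0;
        while (done < size) {
            const std::size_t room = N - len_;
            const std::size_t chunk = (size - done < room) ? (size - done) : room;
            std::memcpy(buf_ + len_, data + done, chunk);
            len_ += chunk;
            done += chunk;
            if (len_ == N) flush();
        }
        return size;
    }

    std::size_t print(const char* text) { return write(text, std::strlen(text)); }

    // The line and its CRLF land in the same buffer instead of three writes.
    std::size_t println(const char* text) { return print(text) + write("\r\n", 2); }

    // Hands everything collected so far to the sink in a single call.
    void flush() {
        assert(len_ <= N);
        if (len_ > 0) {
            sink_.write(buf_, len_);
            len_ = 0;
        }
    }

private:
    Sink& sink_;
    std::uint8_t buf_[N];
    std::size_t len_;
};
```

Three println calls that fit into the buffer then reach the sink as one write when the writer is flushed or goes out of scope. A timeout-based flush is left out here to keep the sketch short.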
A much better way would be to collect the data in the chip and flush it when the segment is full, a waiting time has elapsed, or the user manually triggers the transmission. This should free up more CPU time, because the CPU doesn't have to wait in NOP loops for the chip to send the data. On top of that, this boost would also be available to libraries which already depend on the Ethernet library. For that, though, the author of the library would need to make the changes, because the time I would need to get familiar with the code would be too long, while for him or her the changes should not be difficult.
As I'm currently not at home, I can't try out my solution, and I hope you trust the test results I described here.
I hope that someone in the Arduino community takes care of this problem. I understand that no one would care if it were only my issue, but I think many developers could benefit from some changes in the software.
Greetings,
haniham

Very impressive.

Don't know what can be done,
the Arduino guys are very slow at changing the system.
Look at the list of bugs / features / improvements that have been worked out and are still not used.

I guess this works, so it's used.

You would have to account for when a single byte is written:
do you wait a certain time before putting it into the TCP/IP packet, and how long?

dont know what can be done,

I do. DO NOT USE String!

PaulS:

dont know what can be done,

I do. DO NOT USE String!

Sure, this is one possibility I already described, but println still produces 3 unneeded TCP segments, which result in 3*60 = 180 extra bytes per call.