Does anyone have code to strip/transform HTML entities?

I have a project where I am parsing HTML/XML, and displaying some strings. Sometimes, the HTML has characters encoded as HTML entities, and I'd like to either strip them out, or better yet, transform the easy ones to their ASCII equivalent (where there is one).

This isn't rocket science, and I can certainly write the code to do it, but I dislike reinventing wheels. Can anyone point to a handy-dandy wheel I might use?

Thanks for any pointers!

I recently did a regular expression matching library:

Or directly to here:

http://www.gammon.com.au/forum/?id=11063

You could use that to find … I suggest something like:

(</?%a->)

That would capture “less than” followed by optional “/” followed by letters in a non-greedy way, followed by “greater than”.

Or even:

(<[^>]+>)

That is, “<” followed by anything other than > up to a >.

Or:

(<.->)

The captured text could be looked-up for some sort of replacement, or just use the offsets to work out what parts to omit.
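
For comparison, here is a minimal sketch of the same tag-stripping idea in standard C++ using std::regex. It is not the API of the library linked above: stripTags is a hypothetical helper name, and std::regex may be too heavy or unavailable on small boards, so treat it as something to try on a desktop first.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: remove anything that looks like an HTML/XML tag.
// The pattern is "<", then one or more characters that are not ">", then ">",
// i.e. the std::regex spelling of the (<[^>]+>) pattern discussed above.
std::string stripTags(const std::string& input) {
    static const std::regex tagPattern("<[^>]+>");
    // Replace every match with the empty string.
    return std::regex_replace(input, tagPattern, "");
}
```

Calling stripTags on "<b>bold</b> text" should leave "bold text". A lone "<" with no closing ">" is left alone, because the pattern needs both ends.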

Hi Nick- thanks for the suggestion. I saw your notice about your regexp the other day. Not being much of a regexp jockey (whenever I use it, I have to look up the doc...), it didn't occur to me as a possible solution, but it seems like it should work.

That said, I think the patterns you've posted are for HTML tags. Entities are those things like:

&lt; (displays as <)

-or-

&#367; (displays as ů)

-or-

&#x12A5; (displays as እ)

Which should be simple enough to write a regexp for. Thanks for the idea.
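
A rough sketch of what such a regexp-based pass might look like in standard C++ follows. Everything here is an assumption for illustration: decodeEntities is a made-up name, the named-entity table is deliberately tiny, and the policy (convert entities that map to printable ASCII, strip all others) is just one reading of the original question.

```cpp
#include <cstdlib>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: convert entities with a printable ASCII equivalent
// and strip every other entity.
std::string decodeEntities(const std::string& input) {
    // Small illustrative table; HTML defines many more named entities.
    static const std::map<std::string, char> named = {
        {"lt", '<'}, {"gt", '>'}, {"amp", '&'}, {"quot", '"'}, {"apos", '\''}
    };
    // Group 1: decimal digits, group 2: hex digits, group 3: entity name.
    static const std::regex entity(
        "&(?:#([0-9]+)|#[xX]([0-9a-fA-F]+)|([A-Za-z][A-Za-z0-9]*));");

    std::string output;
    std::size_t last = 0;  // index just past the previous match
    const std::sregex_iterator end;
    for (std::sregex_iterator it(input.begin(), input.end(), entity); it != end; ++it) {
        const std::smatch& m = *it;
        const std::size_t pos = static_cast<std::size_t>(m.position());
        output.append(input, last, pos - last);  // copy the text before the entity
        last = pos + static_cast<std::size_t>(m.length());

        if (m[1].matched || m[2].matched) {
            // Numeric entity: keep it only if it is printable ASCII.
            const unsigned long code = m[1].matched
                ? std::strtoul(m[1].str().c_str(), nullptr, 10)
                : std::strtoul(m[2].str().c_str(), nullptr, 16);
            if (code >= 32 && code < 127) output.push_back(static_cast<char>(code));
        } else {
            // Named entity: keep it only if it is in the table.
            const auto found = named.find(m[3].str());
            if (found != named.end()) output.push_back(found->second);
        }
    }
    output.append(input, last, std::string::npos);  // copy the remaining tail
    return output;
}
```

With this sketch, "1 &lt; 2" should come out as "1 < 2", while an entity such as &#367; is simply dropped. A bare "&" that is not followed by a name and ";" (as in "AT&T") does not match and is left untouched.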

Honestly, I think in this concrete case it's total overkill to use the regexp library when you could simply use the replace function.

An example that restores the " character from its URL-encoded form, %22:

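// Assumption: server is a web server object whose arg() returns a String (it is not declared in this post).
String cleanString = server.arg(0);
cleanString.replace("%22", "\"");  // replaces every "%22" with a double quote, in place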
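
The same replace-only idea can be stretched to cover a handful of common HTML entities. This is only a sketch in standard C++ with std::string so it can be tried off the board; replaceAll and replaceCommonEntities are hypothetical names, and the list of entities is an arbitrary short selection.

```cpp
#include <string>

// Replace every occurrence of `from` in `text` with `to`, in place.
void replaceAll(std::string& text, const std::string& from, const std::string& to) {
    if (from.empty()) return;  // nothing sensible to search for
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.length(), to);
        pos += to.length();  // continue after the replacement so it is not rescanned
    }
}

// Hypothetical helper: convert a few common entities to their ASCII characters.
std::string replaceCommonEntities(std::string text) {
    replaceAll(text, "&lt;", "<");
    replaceAll(text, "&gt;", ">");
    replaceAll(text, "&quot;", "\"");
    replaceAll(text, "&#39;", "'");
    replaceAll(text, "&amp;", "&");  // done last, so "&amp;lt;" becomes "&lt;" rather than "<"
    return text;
}
```

On an Arduino String, replace() already substitutes every occurrence in place, so the equivalent would be one replace call per entity, with the "&amp;" call kept last for the same reason.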