Inconsistent Behavior of String Methods for Special Unicode Characters

August 19th 2013 Unicode .NET Framework

Take a look at the following function:

string StripSpaces(string input)
{
    while (input.IndexOf("  ") >= 0)
    {
        input = input.Replace("  ", " ");
    }
    return input;
}

Can you think of an input causing it to loop infinitely? No? Try calling it like this, then:

var result = StripSpaces(" \ufffd ");

Yes, in this case the method never returns, and the culprit is U+FFFD, the Unicode replacement character. It seems that different String methods handle it in different ways:

  • IndexOf() ignores it when searching for a pattern, so it finds two consecutive spaces in the above input string.
  • Replace() does not ignore it, so it finds no two consecutive spaces to replace. The string stays unchanged, IndexOf() matches again on the next iteration, and the loop never ends.
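A likely explanation, which is my assumption rather than something verified here: IndexOf(string) is documented to perform a culture-sensitive search by default, and such searches can treat some characters as ignorable, while Replace(string, string) is documented to perform an ordinal search. If that is the cause, asking IndexOf() for an ordinal comparison should make the two methods agree. A minimal sketch (SpaceStripper and StripSpacesOrdinal are hypothetical names, not part of the original code):

```csharp
using System;

static class SpaceStripper
{
    // Hypothetical variant of StripSpaces: the search is forced to be ordinal,
    // matching the ordinal search that Replace(string, string) performs,
    // so both calls always agree on whether "  " is present.
    public static string StripSpacesOrdinal(string input)
    {
        while (input.IndexOf("  ", StringComparison.Ordinal) >= 0)
        {
            input = input.Replace("  ", " ");
        }
        return input;
    }
}
```

With this version, SpaceStripper.StripSpacesOrdinal(" \ufffd ") returns the input unchanged instead of looping, because the two spaces are not adjacent in an ordinal sense. Runs of real spaces are still collapsed as before.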

When will you encounter the replacement character? Typically a decoder returns it in place of a byte sequence that is invalid for the given encoding, for example when a file is read with the wrong encoding.
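To see how the character can end up in a string, here is a small sketch that decodes a byte array containing a byte that is never valid in UTF-8. The class and method names are hypothetical, and the byte values are chosen purely for illustration:

```csharp
using System;
using System.Text;

static class ReplacementCharDemo
{
    // "A", then 0xFF (never valid in UTF-8), then "B".
    private static readonly byte[] Bytes = { 0x41, 0xFF, 0x42 };

    // Encoding.UTF8 substitutes U+FFFD for the invalid byte instead of failing.
    public static string DecodeLenient()
    {
        return Encoding.UTF8.GetString(Bytes);
    }

    // A UTF8Encoding created with throwOnInvalidBytes: true throws instead,
    // so the bad input is detected at the point where it is read.
    public static bool StrictDecodingThrows()
    {
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(Bytes);
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }
}
```

DecodeLenient() returns "A\ufffdB", while the strict decoder raises a DecoderFallbackException. Which behavior is preferable depends on whether the application would rather reject bad input or carry on with it.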

What can you do about it? Strip it from the input string like this:

input = input.Replace("\ufffd", "");

Lesson of the day? Never trust your input.
