Rick Strahl's Web Log

Detecting Text Encoding for StreamReader


I keep running into issues with auto-detection of file types when using StreamReader. StreamReader supports byte order mark detection, and in most cases that seems to work OK, but if you deal with a variety of differently encoded input files, the default detection comes up short.

I posted a JavaScript Minifier application yesterday and somebody correctly pointed out that the text encoding was incorrect. It turns out part of the problem lies in the code I snatched from Douglas Crockford's original C# minifier, but there's also an issue with some of the code I added to provide string translations.

StreamReader() has an overload that's supposed to help with this: based on the byte order mark it's supposed to sniff the document's encoding. It actually works, but only if the content is encoded as UTF-8/16/32 - i.e. when it actually has a byte order mark. It doesn't revert to Encoding.Default if it can't find a byte order mark - the default without a byte order mark is UTF-8, which usually results in invalid text parsing. For my converter this translated into problems whenever the source JavaScript files were not encoded as UTF-8; it worked fine with any of the UTF-xx encodings, which is why I missed this.
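To make the failure mode concrete, here's a minimal sketch (the file path is made up):

// BOM detection is enabled here, but if the file has no BOM the reader
// silently falls back to UTF-8 - not to Encoding.Default:
StreamReader reader = new StreamReader(@"c:\temp\input.js", true);
string content = reader.ReadToEnd();
reader.Close();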

There are a few other oddities. For example, Encoding.UTF8 is configured in such a way that when you write out to a StreamWriter it always writes out the byte order mark, unless you explicitly create a new instance with the constructor that disables it (i.e. new UTF8Encoding(false)). This can really bite you if you're writing XML into an XmlWriter through a StreamWriter, since Encoding.UTF8 is the default. HTTP output should never include a BOM - it's used only in files as a content marker.
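A quick sketch of the difference (the file names are hypothetical):

// Encoding.UTF8 emits the BOM at the start of the output file...
using (StreamWriter withBom = new StreamWriter(@"c:\temp\withBom.txt", false, Encoding.UTF8))
    withBom.Write("hello");   // file starts with EF BB BF

// ...while an explicit UTF8Encoding(false) suppresses it:
using (StreamWriter noBom = new StreamWriter(@"c:\temp\noBom.txt", false, new UTF8Encoding(false)))
    noBom.Write("hello");     // file starts directly with the text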

So anyway, every time I run into this I play around for a bit trying different encodings - usually combinations of Encoding.Default, Encoding.UTF8 and Encoding.Unicode - none of which work consistently in all cases. What's really needed is some way to sniff the byte order marks and, depending on which one is present, apply the appropriate Encoding to the StreamReader's constructor.

Since I couldn't find anything in the framework that does this, I ended up creating a short routine that sniffs the file's encoding. It looks like this:

/// <summary>
/// Detects the byte order mark of a file and returns
/// an appropriate encoding for the file.
/// </summary>
/// <param name="srcFile"></param>
/// <returns></returns>
public static Encoding GetFileEncoding(string srcFile)
{
    // *** Use Default of Encoding.Default (Ansi CodePage)
    Encoding enc = Encoding.Default;

    // *** Detect byte order mark if any - otherwise assume default
    byte[] buffer = new byte[5];
    FileStream file = new FileStream(srcFile, FileMode.Open);
    file.Read(buffer, 0, 5);
    file.Close();

    if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
        enc = Encoding.UTF8;
    else if (buffer[0] == 0xfe && buffer[1] == 0xff)
        enc = Encoding.Unicode;
    else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
        enc = Encoding.UTF32;
    else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
        enc = Encoding.UTF7;

    return enc;
}

/// <summary>
/// Opens a stream reader with the appropriate text encoding applied.
/// </summary>
/// <param name="srcFile"></param>
public static StreamReader OpenStreamReaderWithEncoding(string srcFile)
{
    Encoding enc = GetFileEncoding(srcFile);
    return new StreamReader(srcFile, enc);
}
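Opening a file then just means calling the helper instead of new-ing up the StreamReader directly (the path here is hypothetical):

using (StreamReader reader = OpenStreamReaderWithEncoding(@"c:\temp\script.js"))
{
    string text = reader.ReadToEnd();
}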

This seems to do the trick with the various encodings I threw at it. The file-to-file conversion uses a StreamReader for input and a StreamWriter for output, which looks like this:

/// <summary>
/// Minifies a source file into a target file.
/// </summary>
/// <param name="srcFile"></param>
/// <param name="dstFile"></param>
public void Minify(string srcFile, string dstFile)
{
    Encoding enc = StringUtilities.GetFileEncoding(srcFile);

    // *** sr and sw are class-level fields consumed by jsmin()
    using (sr = new StreamReader(srcFile, enc))
    {
        using (sw = new StreamWriter(dstFile, false, enc))
        {
            jsmin();
        }
    }
}

This detects the original encoding, opens the input file with it, and then writes the output back out with the same encoding, which is what you'd expect. The only catch is that if the file happens to be UTF-8 (or 16/32) encoded and there's no BOM, the routine reverts - potentially incorrectly - to the default ANSI encoding. I suppose that's reasonable, since that's the most likely scenario for source code files generated with Microsoft tools anyway.
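If the ANSI fallback isn't acceptable, one possible refinement - my own assumption here, not part of the routine above - is to attempt a strict UTF-8 decode of the file's bytes when no BOM is found and only fall back to ANSI if that fails:

// Sketch: strict UTF-8 validation as a fallback when no BOM was found.
// new UTF8Encoding(false, true) throws on invalid byte sequences.
public static Encoding GuessEncodingNoBom(byte[] bytes)
{
    try
    {
        new UTF8Encoding(false, true).GetString(bytes);
        return Encoding.UTF8;        // decoded cleanly - very likely UTF-8
    }
    catch (DecoderFallbackException)
    {
        return Encoding.Default;     // invalid UTF-8 - assume ANSI code page
    }
}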

When dealing with string values only, it's best to use Unicode encoding. There's another little tweak I had to make to the minifier, related to the string processing, which has a similar issue:

/// <summary>
/// Minifies an input JavaScript string and converts it to a compressed
/// JavaScript string.
/// </summary>
/// <param name="src"></param>
/// <returns></returns>
public string MinifyString(string src)
{
    MemoryStream srcStream = new MemoryStream(Encoding.Unicode.GetBytes(src));
    MemoryStream tgStream = new MemoryStream(8092);

    using (sr = new StreamReader(srcStream, Encoding.Unicode))
    {
        using (sw = new StreamWriter(tgStream, Encoding.Unicode))
        {
            jsmin();
        }
    }

    return Encoding.Unicode.GetString(tgStream.ToArray());
}

Notice that when using strings as input it's best to use Unicode encoding, since .NET strings are always Unicode internally (unless a specific encoding was applied). The original code skipped the Encoding.Unicode on the Reader and Writer, which also caused formatting issues with extended characters.

Encodings are a confusing topic, even once you get your head around how encodings relate to the binary signature (the actual bytes) of your text. This is especially true for streams in .NET, because many of the text-based streams already apply default encodings, and because streams are often passed on to other components that also expose Encodings (like an XmlReader, for example).
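For example, when you hand a StreamWriter to an XmlTextWriter, it's the StreamWriter's encoding - not anything you set on the XML side - that determines what ends up in the file. A minimal sketch (the output path is made up):

StreamWriter sw = new StreamWriter(@"c:\temp\out.xml", false, new UTF8Encoding(false));
XmlTextWriter xw = new XmlTextWriter(sw);
xw.WriteStartDocument();                 // writes encoding="utf-8" based on sw.Encoding
xw.WriteElementString("root", "value");
xw.WriteEndDocument();
xw.Close();                              // closes the underlying StreamWriter too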

Hopefully a routine like the above (and this entry <g>) will jog my memory 'next time'.

Posted in .NET  CSharp  


Feedback for this Post

 
# re: Detecting Text Encoding for StreamReader
by Luke Breuer November 28, 2007 @ 6:46am
Did you try instantiating StreamReader with a constructor that takes an Encoding? I, err, am making a *very* educated guess (seeing as one is not supposed to use Reflector on MS assemblies) that this was your problem. Also note that you have to actually read from the stream/file before encoding is detected.
# re: Detecting Text Encoding for StreamReader
by Rick Strahl November 28, 2007 @ 12:13pm
@Luke - yes using the correct encoding works but the problem is first detecting what the encoding of the file actually is. I can't apply the correct encoding until I know what it actually is <s>...

I didn't get out Reflector to check what the detectEncoding parameter does, but from what I can see it only detects documents with a BOM. No BOM (or omitting the detection parameter) and it'll use the default UTF-8 encoding.
# re: Detecting Text Encoding for StreamReader
by Justin Van Patten November 28, 2007 @ 10:50pm
Any reason why you can't use StringReader/StringWriter instead of StreamReader/StreamWriter in MinifyString? This would avoid the need for the MemoryStreams.

// Change sr to TextReader and sw to TextWriter
public string MinifyString(string src) {
    using (sr = new StringReader(src))
    using (sw = new StringWriter()) { // No need to nest the using statements
        jsmin();
        return sw.ToString();
    }
}
# re: Detecting Text Encoding for StreamReader
by Rick Strahl November 29, 2007 @ 7:43pm
In the code above I'm using existing code from Douglas Crockford that's using StreamReader/Writer to deal with file conversion. When I added the string conversion I needed to use a StreamReader to reuse that code.

Hmmm... actually, taking another look, casting all of those StreamReader/Writer fields to TextReader/Writer does the trick on the class:

public string MinifyString(string src)
{                         
    using (sr = (TextReader) new StringReader(src) )
    {
        using (sw = (TextWriter) new StringWriter() )
        {
            jsmin();
            return sw.ToString();
        }
    }    
}
# re: Detecting Text Encoding for StreamReader
by Will 保哥 December 01, 2007 @ 11:00pm
I think your code has a bug. You have to distinguish between Big-Endian and Little-Endian. In your code, 0xFE 0xFF belongs to Unicode (Big-Endian).

My code is shown below:

if (buffer[0] == 0xFE && buffer[1] == 0xFF)
{
    // 1201 unicodeFFFE Unicode (Big-Endian)
    enc = Encoding.GetEncoding(1201);
}
if (buffer[0] == 0xFF && buffer[1] == 0xFE)
{
    // 1200 utf-16 Unicode
    enc = Encoding.GetEncoding(1200);
}
# re: Detecting Text Encoding for StreamReader
by Ken Prat February 28, 2008 @ 11:33am
Note that the BOM is optional for UTF-8, so your code will incorrectly interpret such a non-BOM UTF-8 file as Encoding.Default...
# re: Detecting Text Encoding for StreamReader
by Rick Strahl February 28, 2008 @ 7:32pm
Ken - correct, but if you have to guess there's really no other way to tell, is there? I suppose if you're reading an XML document (the most likely scenario) then you could check the processing instruction.
# re: Detecting Text Encoding for StreamReader
by Glenn Slayden May 22, 2008 @ 3:24pm
In Will's code (above), the comment "unicodeFFFE" does not indicate the correct byte sequence, although the code itself and the rest of the comment (referring to "Big-Endian") are correct.
# re: Detecting Text Encoding for StreamReader
by ae kiquenet July 31, 2008 @ 2:31pm
Hi mister,

Using the GetPreamble() method of each Encoding type to compare headers would be more generic, I think. What about this?

Thanks
# re: Detecting Text Encoding for StreamReader
by Ritesh Totlani September 03, 2008 @ 10:01pm
Hi,
I tried to find the encoding of a UTF-7 encoded file, but it is not working properly. For other files it is not giving any problems. Kindly get back to me if you find any solution for UTF-7 files.
# re: Detecting Text Encoding for StreamReader
by seminda Rajapaksha September 05, 2008 @ 10:36pm
I tried this, but there is a possibility that the GetFileEncoding method fails, because there can be more combinations of Unicode characters. I think this works for some files, but not for all files.

More Details:
http://en.wikipedia.org/wiki/UTF-8
# re: Detecting Text Encoding for StreamReader
by Reza January 27, 2009 @ 7:27am
I used it for German umlauts and it works perfectly.
Thanks very much.
# re: Detecting Text Encoding for StreamReader
by espinete May 19, 2009 @ 3:01am
Mister, what is jsmin(); ??

Thanks
# re: Detecting Text Encoding for StreamReader
by espinete May 19, 2009 @ 3:10am
More information


Byte order mark   Description
EF BB BF          UTF-8
FF FE             UTF-16, little endian
FE FF             UTF-16, big endian
FF FE 00 00       UTF-32, little endian
00 00 FE FF       UTF-32, big endian

Note: Microsoft uses UTF-16, little endian byte order

How can I detect a UTF-8 file without a BOM?

Thanks
# re: Detecting Text Encoding for StreamReader
by espinete May 19, 2009 @ 8:40am
Source: http://www.mindspring.com/~markus.scherer/unicode/bomsig/tn-bomsig-1-20051026.html
Table 1: Unicode Signature Byte Sequences

Byte Sequence                   Encoding
FE FF                           UTF-16BE
FF FE (not followed by 00 00)   UTF-16LE
00 00 FE FF                     UTF-32BE
FF FE 00 00                     UTF-32LE
EF BB BF                        UTF-8
0E FE FF                        SCSU
FB EE 28                        BOCU-1 (U+FEFF must be removed after conversion)
2B 2F 76 38 2D, 2B 2F 76 38,
2B 2F 76 39, 2B 2F 76 2B, or
2B 2F 76 2F                     UTF-7 (only the first sequence can be removed before conversion; otherwise U+FEFF must be removed after conversion)
DD 73 66 73                     UTF-EBCDIC



Several questions about encoding...

For example:

1. In Visual Studio, in source code:

string s = "text in source file in vs";

By default, what encoding does the string "text in source file in vs" have? UTF-16?

2. If you create a text file in Visual Studio, does it create it with UTF-8 encoding?

If you add an existing text file in Visual Studio, does it retain its existing encoding?

And if you make changes and save those changes, is the encoding maintained?

3. Can I detect the encoding of a file, for example a text file like File1.txt, in 100% of cases?

4. Which encoding is used if I read a file using File.ReadAllBytes(filepath), which returns an array of bytes? This method has no other overloads.

4b. If you get an array of bytes (byte[]) that corresponds to the data of a text file, can I detect its encoding?

5. If Visual Studio adds a file as a resource (Binary type), can I detect the encoding of that file?

internal static byte[] file1 {
    get {
        object obj = ResourceManager.GetObject("file1", resourceCulture);
        return ((byte[])(obj));
    }
}

6. What about the BOM? Do text editors handle it or not (such as Notepad, UltraEdit... and Visual Studio)?

The key question is: given a file path string or a byte[], I need to be able to detect what encoding the file uses.

Thanks in advance, greetings, regards!!!
# re: Detecting Text Encoding for StreamReader
by Orlando June 30, 2009 @ 11:21am
Good article Rick.

I agree with Will, your code has a bug.

This:

else if (buffer[0] == 0xfe && buffer[1] == 0xff)
        enc = Encoding.Unicode;


should read:

else if (buffer[0] == 0xff && buffer[1] == 0xfe)
        enc = Encoding.Unicode; // utf-16le


You are interpreting the BOM for utf-16be as if it were utf-16le.

In addition to the correction, you could make your code more robust by adding this branch to detect utf-16be:

else if (buffer[0] == 0xfe && buffer[1] == 0xff)
        enc = Encoding.BigEndianUnicode; // utf-16be
# re: Detecting Text Encoding for StreamReader
by Farhan July 08, 2009 @ 3:27am
Hi , I just want to know how do you detect encoding for a UTF-8 file without BOM.
# re: Detecting Text Encoding for StreamReader
by Enzo July 20, 2009 @ 1:12pm
Farhan,

There is no such thing. You may be thinking of a code page. For code page files you either need to be told the encoding or you need to use some statistical heuristic to determine it.
# re: Detecting Text Encoding for StreamReader
by Izzy February 01, 2010 @ 1:22pm
// In C... load the whole text into a char array and pass it to the function
BOOL is_utf8(const char *str)
{
    int c = 0, b = 0;
    int i;
    int bits = 0;
    int len = strlen(str);

    for (i = 0; i < len; i++)
    {
        c = (unsigned char)str[i]; // cast: plain char may be signed
        if (c >= 128)
        {
            if (c >= 254) return FALSE;
            else if (c >= 252) bits = 6;
            else if (c >= 248) bits = 5;
            else if (c >= 240) bits = 4;
            else if (c >= 224) bits = 3;
            else if (c >= 192) bits = 2;
            else return FALSE; // 0x80-0xBF can't start a sequence

            if ((i + bits) > len) return FALSE;
            while (bits > 1)
            {
                i++;
                b = (unsigned char)str[i];
                if (b < 128 || b > 191) return FALSE; // not a continuation byte
                bits--;
            }
        }
    }
    return TRUE;
}
# re: Detecting Text Encoding for StreamReader
by Daniel Pepermans May 04, 2010 @ 8:48am
In response to: "Hi , I just want to know how do you detect encoding for a UTF-8 file without BOM."

I am working with SQL script files and I check bytes 1, 3 and 5 for 0x0.

This is probably not perfect but seems to be working okay.

I found this out because SQL Management Studio can script to Unicode (with BOM). If you then edit the script in Textpad the format stays the same (Unicode) but the BOM is removed (this is the default option in Textpad - Configure/Preferences/Document Classes).

Now I use code similar to the above to read a BOM, and if none is found but the first 3 characters have 0x0 in their second byte, I can assume that the character set is using 2 bytes per character.
# re: Detecting Text Encoding for StreamReader
by RRR August 11, 2010 @ 6:48am
I couldn't come up with anything more elegant than this. And this is not exactly elegant...

string line;
bool firstLine = true;
 
using (StreamReader fileOriginal = new StreamReader(originalFileName))
{
  line = fileOriginal.ReadLine(); // force StreamReader to determine the encoding (retrievable through fileOriginal.CurrentEncoding)
 
  using (StreamWriter fileModified = new StreamWriter(modifiedFileName, false, fileOriginal.CurrentEncoding))
  {
    while (firstLine || ((line = fileOriginal.ReadLine()) != null))
    {
      firstLine = false; // I know, it gets set again and again. Maybe you prefer an "if" here.
 
      // do stuff here
    }
  } // StreamWriter
} // StreamReader
# re: Detecting Text Encoding for StreamReader
by SharK August 17, 2010 @ 2:32am
If you prefer a solution "out of the box": http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
# re: Detecting Text Encoding for StreamReader
by MC In the Burgh September 30, 2010 @ 10:48am
hi,
great blog post, it helps me understand this topic more.
I am trying to open an Excel document and then figure out the encoding used for the file. I'm pretty much doing what you explain, but when I put a trace on my program and translate the character numeric values to text using my handy chart, I realize that it just grabs double quotes and then the first 4 characters in the topmost and leftmost cell of the Excel sheet that is visible to the human eye.

In other words, there was no encoding information at the start of the file, just the text that everyone can see when they open the file. Just wondering, in this situation do you (or anyone reading) have any good ideas on how to figure out the encoding of a file programmatically? thanks

MC
# UnauthorizedAccessException
by Timo October 22, 2010 @ 6:15am
Hi,

thanks for the code snippet!

Note that you might want to specify FileAccess.Read when creating the FileStream to prevent an UnauthorizedAccessException on files that are read-only:

FileStream file = new FileStream(srcFile, FileMode.Open, FileAccess.Read);
# re: Detecting Text Encoding for StreamReader
by Scott November 28, 2012 @ 4:09am
A few years on, Rick, but this is absolutely brilliant - thanks a lot.
I had exactly this issue where the StreamReader could not detect the correct encoding and this sorted it for me.
# re: Detecting Text Encoding for StreamReader
by Adam Law February 08, 2014 @ 8:49am
There is no easy answer to this ...

I had to modify https://github.com/dalimian/fconvert to get a reasonable solution. Most other solutions to the encoding issue online (like the one above) don't seem to work well (e.g. at determining the difference between UTF-8 and windows-1252).

https://github.com/dalimian/fconvert applies the best heuristic solution I have seen. If the file is a chimera ... you can play around with fconvert to find the best solution.
 


West Wind  © Rick Strahl, West Wind Technologies, 2005 - 2014