Rick Strahl's Weblog  

Detecting Text Encoding for StreamReader


I keep running into issues with auto-detection of file encodings when using StreamReader. StreamReader supports byte order mark detection, and in most cases that seems to work OK, but if you deal with input files in a variety of encodings, the default detection comes up short.

I posted a JavaScript Minifier application yesterday and somebody correctly pointed out that the text encoding was incorrect. It turns out part of the problem is the code I snatched from Douglas Crockford's original C# minifier code, but there's also an issue with some of the code I added to provide string translations.

StreamReader() specifically has an overload that's supposed to detect byte order marks and, based on that, sniff the document's encoding. It actually works, but only if the content is encoded as UTF-8/16/32 and actually has a byte order mark. If it can't find a byte order mark it doesn't fall back to Encoding.Default; the default without a byte order mark is UTF-8, which usually results in invalid text parsing for files saved in an ANSI code page. For my converter this translated into problems when the source JavaScript files were not encoded as UTF-8, but it worked fine with any of the BOM-marked UTF-xx encodings, which is why I missed this.
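
To make that fallback concrete, here is a minimal sketch rather than anything from the minifier itself. The names `BomFallbackDemo`, `ReadWithDetection` and `Latin1Bytes` are made up for illustration, and ISO-8859-1 is used as a stand-in for an ANSI code page because it is available on every .NET runtime.

```csharp
using System.IO;
using System.Text;

public static class BomFallbackDemo
{
    // Decodes raw bytes the way new StreamReader(stream, true) does:
    // a BOM is honored if present, otherwise the reader assumes UTF-8.
    public static string ReadWithDetection(byte[] bytes)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes), true))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical helper: encodes text as ISO-8859-1 without any BOM,
    // standing in for a file saved in an ANSI code page.
    public static byte[] Latin1Bytes(string text)
    {
        return Encoding.GetEncoding("iso-8859-1").GetBytes(text);
    }
}
```

Passing `Latin1Bytes("café")` through `ReadWithDetection` should come back with the replacement character U+FFFD where the é was, because the single byte 0xE9 is not a valid UTF-8 sequence.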

There are a few other oddities. For example, Encoding.UTF8 is configured so that when you pass it to a StreamWriter it always writes out the byte order mark, unless you explicitly create a new instance with the constructor that disables it (i.e. new UTF8Encoding(false)). This can really bite you if you're writing XML into an XmlWriter through a StreamWriter since Encoding.UTF8 is the default. HTTP output should never include a BOM - it's used only for files as content markers.
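
A small sketch of that difference, assuming an in-memory stream in place of a real file or HTTP response. The class and method names (`BomWriterDemo`, `WriteText`) are hypothetical.

```csharp
using System.IO;
using System.Text;

public static class BomWriterDemo
{
    // Writes text through a StreamWriter and returns the raw bytes produced,
    // including any BOM the encoding emitted.
    public static byte[] WriteText(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            }
            // ToArray() still works after the writer has closed the stream.
            return stream.ToArray();
        }
    }
}
```

`WriteText("a", Encoding.UTF8)` yields four bytes (EF BB BF 61), while `WriteText("a", new UTF8Encoding(false))` yields just the single byte 61.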

So anyway, every time I run into this I play around for a bit trying different encodings, usually combinations of Encoding.Default, Encoding.UTF8 and Encoding.Unicode, none of which really work consistently in all cases. What's really needed is some way to sniff the Byte Order Marks and depending on which one is present apply the appropriate Encoding to the StreamReader's constructor.

Since I couldn't find anything in the framework that does this, I ended up creating a short routine that tries to sniff the file's encoding. It looks like this:

/// <summary>
/// Detects the byte order mark of a file and returns
/// an appropriate encoding for the file.
/// </summary>
/// <param name="srcFile">Path of the file to inspect</param>
/// <returns>The encoding matching the BOM, or Encoding.Default if there is none</returns>
public static Encoding GetFileEncoding(string srcFile)
{
    // *** Use Default of Encoding.Default (Ansi CodePage)
    Encoding enc = Encoding.Default;

    // *** Detect byte order mark if any - otherwise assume default
    byte[] buffer = new byte[5];
    int read;
    using (FileStream file = new FileStream(srcFile, FileMode.Open, FileAccess.Read))
    {
        read = file.Read(buffer, 0, 5);
    }

    // *** The UTF-32 checks come first because FF FE is also the UTF-16 LE mark
    if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
        enc = Encoding.UTF8;
    else if (read >= 4 && buffer[0] == 0xff && buffer[1] == 0xfe && buffer[2] == 0 && buffer[3] == 0)
        enc = Encoding.UTF32;                 // UTF-32 little endian
    else if (read >= 4 && buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
        enc = new UTF32Encoding(true, true);  // UTF-32 big endian
    else if (buffer[0] == 0xff && buffer[1] == 0xfe)
        enc = Encoding.Unicode;               // UTF-16 little endian
    else if (buffer[0] == 0xfe && buffer[1] == 0xff)
        enc = Encoding.BigEndianUnicode;      // UTF-16 big endian
    else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
        enc = Encoding.UTF7;

    return enc;
}

/// <summary>
/// Opens a stream reader with the appropriate text encoding applied.
/// </summary>
/// <param name="srcFile">Path of the file to open</param>
public static StreamReader OpenStreamReaderWithEncoding(string srcFile)
{
    Encoding enc = GetFileEncoding(srcFile);
    return new StreamReader(srcFile, enc);
}

This seems to do the trick with the various encodings I threw at it. The file to file conversion uses a StreamReader for input and a StreamWriter for output, which looks like this:

/// <summary>
/// Minifies a source file into a target file.
/// </summary>
/// <param name="srcFile">Source file to minify</param>
/// <param name="dstFile">Target file to write</param>
public void Minify(string srcFile, string dstFile)
{
    Encoding enc = StringUtilities.GetFileEncoding(srcFile);

    using (sr = new StreamReader(srcFile, enc))
    {
        using (sw = new StreamWriter(dstFile, false, enc))
        {
            jsmin();
        }
    }
}

This detects the original encoding, opens the input file with it and then writes the output back with the same encoding, which is what you'd expect. The one caveat is that if the file is UTF-8 (or 16/32) encoded and there's no BOM, the routine falls back - potentially incorrectly - to the default ANSI encoding. I suppose that's reasonable since that's the most likely scenario for source code files generated with Microsoft tools anyway.
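
One way to narrow that gap for BOM-less UTF-8 is to check whether the bytes decode as strict UTF-8 before falling back. This is only a sketch under assumptions, not part of the minifier: `Utf8Sniffer`, `LooksLikeUtf8` and `GuessEncodingWithoutBom` are hypothetical names, the whole file is read into memory, and pure ASCII content counts as UTF-8 because it is valid UTF-8.

```csharp
using System.IO;
using System.Text;

public static class Utf8Sniffer
{
    // True when the bytes form valid UTF-8. The second constructor argument
    // (throwOnInvalidBytes) makes the decoder throw instead of substituting U+FFFD.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Assumption: call this only after a BOM check has found nothing.
    public static Encoding GuessEncodingWithoutBom(string srcFile)
    {
        byte[] bytes = File.ReadAllBytes(srcFile);
        return LooksLikeUtf8(bytes) ? new UTF8Encoding(false) : Encoding.Default;
    }
}
```

This remains a heuristic: text in an ANSI code page can happen to be valid UTF-8 as well, although that is rare once extended characters are involved.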

When dealing with string values only, it's best to use Unicode encoding. There's another little tweak I had to make to the minifier, which relates to the string processing which has a similar issue:

/// <summary>
/// Minifies an input JavaScript string and converts it to a compressed
/// JavaScript string.
/// </summary>
/// <param name="src"></param>
/// <returns></returns>
public string MinifyString(string src)
{
    MemoryStream srcStream = new MemoryStream(Encoding.Unicode.GetBytes(src));
    MemoryStream tgStream = new MemoryStream(8092);

    using (sr = new StreamReader(srcStream, Encoding.Unicode))
    {
        using (sw = new StreamWriter(tgStream, Encoding.Unicode))
        {
            jsmin();
        }
    }

    return Encoding.Unicode.GetString(tgStream.ToArray());
}

Notice that when using strings as input it's best to use Unicode encoding, since .NET strings are always stored as UTF-16 (which is what Encoding.Unicode is). The original code I used skipped the Encoding.Unicode on the reader and writer, which also caused formatting issues with extended characters.

Encodings are a confusing topic even once you get your head around how encodings relate to the binary signature (the actual bytes) of your text. This is especially true for streams in .NET because many of the text based streams already apply default encodings and because streams are often passed to other components that also expose Encodings (like an XmlReader for example).

Hopefully a routine like the above (and this entry <g>) will jog my memory 'next time'.

Posted in .NET  CSharp  

The Voices of Reason


 

Luke Breuer
November 28, 2007

# re: Detecting Text Encoding for StreamReader

Did you try instantiating StreamReader with a constructor that takes an Encoding? I, err, am making a *very* educated guess (seeing as one is not supposed to use Reflector on MS assemblies) that this was your problem. Also note that you have to actually read from the stream/file before encoding is detected.
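
The point about having to read first can be seen with a small sketch, using an in-memory stream instead of a file. The names `DetectionTimingDemo` and `EncodingBeforeAndAfterRead` are made up for illustration.

```csharp
using System.IO;
using System.Text;

public static class DetectionTimingDemo
{
    // Returns the encoding names StreamReader reports before and after
    // the first read operation on the given bytes.
    public static string[] EncodingBeforeAndAfterRead(byte[] bytes)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes), Encoding.UTF8, true))
        {
            string before = reader.CurrentEncoding.WebName;
            reader.Peek(); // the first read fills the buffer and triggers BOM detection
            string after = reader.CurrentEncoding.WebName;
            return new[] { before, after };
        }
    }
}
```

For the bytes FF FE 61 00 (a UTF-16 little endian BOM followed by "a") this reports "utf-8" before the read and "utf-16" after it.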

Rick Strahl
November 28, 2007

# re: Detecting Text Encoding for StreamReader

@Luke - yes using the correct encoding works but the problem is first detecting what the encoding of the file actually is. I can't apply the correct encoding until I know what it actually is <s>...

I didn't get out Reflector to check what the detectEncoding parameter does, but from what I can see it only detects documents with a BOM. No BOM (or omitting the detection parameter) and it'll use the default UTF-8 encoding.

Justin Van Patten
November 28, 2007

# re: Detecting Text Encoding for StreamReader

Any reason why you can't use StringReader/StringWriter instead of StreamReader/StreamWriter in MinifyString? This would avoid the need for the MemoryStreams.

// Change sr to TextReader and sw to TextWriter
public string MinifyString(string src) {
using (sr = new StringReader(src))
using (sw = new StringWriter()) { // No need to nest the using statements
jsmin();
return sw.ToString();
}
}

Rick Strahl
November 29, 2007

# re: Detecting Text Encoding for StreamReader

In the code above I'm using existing code from Douglas Crockford that's using StreamReader/Writer to deal with file conversion. When I added the string conversion I needed to use a StreamReader to reuse that code.

Hmmm... actually taking another look casting all of those StreamReader/Writer to TextReader/Writer does the trick on the class:

public string MinifyString(string src)
{                         
    using (sr = (TextReader) new StringReader(src) )
    {
        using (sw = (TextWriter) new StringWriter() )
        {
            jsmin();
            return sw.ToString();
        }
    }    
}

Will 保哥
December 01, 2007

# re: Detecting Text Encoding for StreamReader

I think your code has a bug. You have to distinguish between Big-Endian and Little-Endian. The sequence 0xFE 0xFF belongs to Unicode (Big-Endian).

Shown my code below:

if (buffer[0] == 0xFE && buffer[1] == 0xFF)
{
// 1201 unicodeFFFE Unicode (Big-Endian)
enc = Encoding.GetEncoding(1201);
}
if (buffer[0] == 0xFF && buffer[1] == 0xFE)
{
// 1200 utf-16 Unicode
enc = Encoding.GetEncoding(1200);
}

Ken Prat
February 28, 2008

# re: Detecting Text Encoding for StreamReader

Note that the BOM is optional for UTF-8, so your code will incorrectly interpret such a non-BOM UTF-8 file as Encoding.Default...

Rick Strahl
February 28, 2008

# re: Detecting Text Encoding for StreamReader

Ken - correct, but if you have to guess there's really no other way to tell is there? I suppose if you're reading an XML document (most likely scenario) then you could check the processing instruction.

Glenn Slayden
May 22, 2008

# re: Detecting Text Encoding for StreamReader

In Will's code (above), the comment "unicodeFFFE" does not indicate the correct byte sequence, although the code itself and the rest of the comment (referring to "Big-Endian") are correct.

ae kiquenet
July 31, 2008

# re: Detecting Text Encoding for StreamReader

Hi mister,

Using the GetPreamble() method of each Encoding type to compare headers is more generic, I think. What about this?

Thanks
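
A rough sketch of what the GetPreamble() idea could look like. The candidate list, its ordering and the names `PreambleSniffer`, `DetectFromBytes` and `DetectFromFile` are assumptions for illustration, not code from the post.

```csharp
using System;
using System.IO;
using System.Text;

public static class PreambleSniffer
{
    // Longer preambles come first so UTF-32 LE (FF FE 00 00)
    // is tested before UTF-16 LE (FF FE).
    private static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(true, true),  // UTF-32 big endian
        Encoding.UTF32,                 // UTF-32 little endian
        Encoding.UTF8,
        Encoding.BigEndianUnicode,      // UTF-16 big endian
        Encoding.Unicode                // UTF-16 little endian
    };

    // Compares the start of the data against each candidate's preamble.
    public static Encoding DetectFromBytes(byte[] head, Encoding fallback)
    {
        foreach (Encoding candidate in Candidates)
        {
            byte[] preamble = candidate.GetPreamble();
            if (preamble.Length == 0 || head.Length < preamble.Length)
                continue;

            bool match = true;
            for (int i = 0; i < preamble.Length; i++)
            {
                if (head[i] != preamble[i]) { match = false; break; }
            }
            if (match)
                return candidate;
        }
        return fallback;
    }

    // Reads up to four bytes from the file and runs the comparison on them.
    public static Encoding DetectFromFile(string srcFile, Encoding fallback)
    {
        byte[] head = new byte[4];
        int read;
        using (var file = new FileStream(srcFile, FileMode.Open, FileAccess.Read))
        {
            read = file.Read(head, 0, head.Length);
        }
        Array.Resize(ref head, read);
        return DetectFromBytes(head, fallback);
    }
}
```

A call such as `PreambleSniffer.DetectFromFile(srcFile, Encoding.Default)` returns the fallback when no BOM is present. One ambiguity remains: a UTF-16 little endian file whose first character after the BOM is U+0000 looks like UTF-32 little endian.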

Ritesh Totlani
September 03, 2008

# re: Detecting Text Encoding for StreamReader

Hi,
I tried to find the encoding of a UTF-7 encoded file, but it is not working properly. For other files it is not giving any problems. Kindly get back to me if you find a solution for UTF-7 files.

seminda Rajapaksha
September 05, 2008

# re: Detecting Text Encoding for StreamReader

I tried this, but there is a possibility that this GetFileEncoding method fails, because there can be more combinations of Unicode characters. I think this works for some files but not for all files.

More Details:
http://en.wikipedia.org/wiki/UTF-8

Reza
January 27, 2009

# re: Detecting Text Encoding for StreamReader

I used it for German umlaut and it works perfect.
thanks very much.

espinete
May 19, 2009

# re: Detecting Text Encoding for StreamReader

Mister, what is jsmin(); ??

Thanks

espinete
May 19, 2009

# re: Detecting Text Encoding for StreamReader

More information


Byte order mark    Description
EF BB BF           UTF-8
FF FE              UTF-16, little endian
FE FF              UTF-16, big endian
FF FE 00 00        UTF-32, little endian
00 00 FE FF        UTF-32, big endian

Note: Microsoft uses UTF-16, little endian byte order

How can I detect a UTF-8 file without a BOM?

Thanks

espinete
May 19, 2009

# re: Detecting Text Encoding for StreamReader

//http://www.mindspring.com/~markus.scherer/unicode/bomsig/tn-bomsig-1-20051026.html
//Table 1: Unicode Signature Byte Sequences
//Byte Sequence Encoding
//FE FF UTF-16BE

//FF FE (not followed by 00 00) UTF-16LE

//00 00 FE FF UTF-32BE

//FF FE 00 00 UTF-32LE

//EF BB BF UTF-8

//0E FE FF SCSU

//FB EE 28 BOCU-1 (U+FEFF must be removed after conversion)

//2B 2F 76 38 2D or
//2B 2F 76 38 or
//2B 2F 76 39 or
//2B 2F 76 2B or
//2B 2F 76 2F UTF-7 (only the first sequence can be removed before conversion; otherwise U+FEFF must be removed after conversion)

//DD 73 66 73 UTF-EBCDIC



several questions about Encoding ...

For example,

1.

in Visual Studio, in source code

string s = "text in source file in vs";

By default, what encoding has the string "text in source file in vs" ?? UTF-16 ?


2.
If you create a text file in Visual Studio, is it created with UTF-8 encoding?

If you add an existing text file in Visual Studio, does it retain its existing encoding?

And if you make changes and save them, is the encoding maintained?

3. Can I detect the encoding of a file in 100% of cases, for example, a text file like File1.txt?

4. Which encoding is used if I read a file using File.ReadAllBytes(filepath), which returns an array of bytes? This method has no other overloads.

4b. If I get an array of bytes (byte[]) that corresponds to the data of a text file,

Can I detect its encoding?


5. If Visual Studio adds files as a resource (Binary type), can I detect the encoding of that file?

internal static byte[] file1 {
    get {
        object obj = ResourceManager.GetObject("file1", resourceCulture);
        return ((byte[])(obj));
    }
}

6. What about the BOM? Do text editors handle it or not (such as Notepad, UltraEdit ... and Visual Studio)?

The key question is: if I have a filepath string or a byte[], I need to be able to detect the encoding of the file.

Thanks in advance, greetings, regards !!!

Orlando
June 30, 2009

# re: Detecting Text Encoding for StreamReader

Good article Rick.

I agree with Will, your code has a bug.

This:

else if (buffer[0] == 0xfe && buffer[1] == 0xff)
        enc = Encoding.Unicode;


should read:

else if (buffer[0] == 0xff && buffer[1] == 0xfe)
        enc = Encoding.Unicode; // utf-16le


You are interpreting the BOM for utf-16be as if it were utf-16le.

In addition to the correction you could make your code more robust by adding this branch to figure utf-16be:

else if (buffer[0] == 0xfe && buffer[1] == 0xff)
        enc = Encoding.BigEndianUnicode; // utf-16be

Farhan
July 08, 2009

# re: Detecting Text Encoding for StreamReader

Hi , I just want to know how do you detect encoding for a UTF-8 file without BOM.

Enzo
July 20, 2009

# re: Detecting Text Encoding for StreamReader

Farhan,

There is no such thing. You may be talking about a Code Page. For Code Page files you either need to be told or you need to use some statistical heuristic to determine the encoding of the file.

Izzy
February 01, 2010

# re: Detecting Text Encoding for StreamReader

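// In C... load the whole text into a char array and pass it to the function
// BOOL, TRUE and FALSE come from <windows.h>
#include <string.h>
#include <windows.h>

BOOL is_utf8(const char *str)
{
    int c=0, b=0;
    int i;
    int bits=0;
    int len = (int)strlen(str);
 
    for(i=0; i<len; i++)
    {
        c = (unsigned char)str[i];  // cast avoids negative values for bytes >= 0x80
        if(c >= 128)
        {
            if((c >= 254)) return FALSE;
            else if(c >= 252) bits=6;
            else if(c >= 248) bits=5;
            else if(c >= 240) bits=4;
            else if(c >= 224) bits=3;
            else if(c >= 192) bits=2;
            else return FALSE;  // a lone continuation byte cannot start a sequence
 
            if((i+bits) > len) return FALSE;
            while(bits > 1)
            {
                i++;
                b = (unsigned char)str[i];
                if(b < 128 || b > 191) return FALSE;  // must be a 10xxxxxx continuation byte
                bits--;
            }
        }
    }
    return TRUE;
}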

Daniel Pepermans
May 04, 2010

# re: Detecting Text Encoding for StreamReader

In response to: "Hi , I just want to know how do you detect encoding for a UTF-8 file without BOM."

I am working with SQL script files and I check bytes 1, 3 and 5 for 0x0.

This is probably not perfect but seems to be working okay.

I found this out because SQL Management Studio can script to Unicode (with BOM). If you then edit the script in Textpad the format stays the same (Unicode) but the BOM is removed (this is the default option in Textpad - Configure/Preferences/Document Classes).

Now I use code similar to the above to read a BOM and if none is found but the first 3 characters contain 0x0 in the second byte I can assume that the character set is using 2 bytes per character.
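
As a sketch of the check described here, assuming mostly ASCII or Latin text such as SQL scripts. The names `Utf16Heuristic` and `LooksLikeUtf16LeWithoutBom` are hypothetical, and the extra test that the even-indexed bytes are non-zero is my own addition to cut down on false positives.

```csharp
public static class Utf16Heuristic
{
    // For ASCII-range text stored as UTF-16 little endian, the second byte
    // of every character is 0x00, so bytes 1, 3 and 5 are zero.
    public static bool LooksLikeUtf16LeWithoutBom(byte[] head)
    {
        if (head.Length < 6)
            return false;

        return head[1] == 0 && head[3] == 0 && head[5] == 0
            && head[0] != 0 && head[2] != 0 && head[4] != 0;
    }
}
```

The check says nothing about text outside the Latin range, where the high bytes are not zero, so it only makes sense as a fallback after a BOM check has found nothing.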

RRR
August 11, 2010

# re: Detecting Text Encoding for StreamReader

I couldn't come up with anything more elegant than this. And this is not exactly elegant...

string line;
bool firstLine = true;
 
using (StreamReader fileOriginal = new StreamReader(originalFileName))
{
  line = fileOriginal.ReadLine(); // force StreamReader to determine the encoding (retrievable through fileOriginal.CurrentEncoding)
  firstLine = true;
 
  using (StreamWriter fileModified = new StreamWriter(modifiedFileName, false, fileOriginal.CurrentEncoding))
  {
    while (firstLine || ((line = fileOriginal.ReadLine()) != null))
    {
      firstLine = false; // I know, it gets set again and again. Maybe you prefer an "if" here.
 
      // do stuff here
    }
  } // StreamWriter
} // StreamReader

SharK
August 17, 2010

# re: Detecting Text Encoding for StreamReader

If you prefer an "out of the box" solution: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx

MC In the Burgh
September 30, 2010

# re: Detecting Text Encoding for StreamReader

hi,
great blog post, it helps me understand this topic more.
I am trying to open an excel document and then figure out the encoding used for the file and I pretty much am doing what you explain but when I put a trace on my program and translate the Character Numeric values to text using my handy chart I realize that it just grabs double quotes and then the first 4 characters in the topmost and leftmost cell of the excel sheet that is visible to the human eye.

In other words, there was no encoding information at the start of the file, just the text that everyone can see if the file when they open it. Just wondering in this situation do you (or anyone reading) have any good ideas on how to figure out the encoding of a file programmatically? thanks

MC

Timo
October 22, 2010

# UnauthorizedAccessException

Hi,

thanks for the code snippet!

Note that you might want to specify FileAccess.Read when creating the FileStream to prevent an UnauthorizedAccessException on files that are read-only:

FileStream file = new FileStream(srcFile, FileMode.Open, FileAccess.Read);

Scott
November 28, 2012

# re: Detecting Text Encoding for StreamReader

A few years on, Rick, but this is absolutely brilliant - thanks a lot.
I had exactly this issue where the StreamReader could not detect the correct encoding and this sorted it for me.

Adam Law
February 08, 2014

# re: Detecting Text Encoding for StreamReader

There is no easy answer to this ...

I had to modify https://github.com/dalimian/fconvert to get a reasonable solution. Most other solutions to the encoding issue online (like the one above) seem to not work well (eg determining the difference between UTF-8 and windows-1252)

https://github.com/dalimian/fconvert applied the best heuristic solution I have seen. If the file is a chimera ... you can play around with fconvert to provide the best solution.

Jim Casement
May 05, 2016

# re: Detecting Text Encoding for StreamReader

Great Post! Still relevant and cured my troubles when processing files from Japan.
I owe you many beers.

West Wind  © Rick Strahl, West Wind Technologies, 2005 - 2024