Detecting Text Encoding for StreamReader
I keep running into issues with auto-detection of file encodings when using StreamReader. StreamReader supports byte order mark detection, and in most cases that seems to work OK, but if you deal with a variety of different encodings for input files, the default detection comes up short.
I posted a JavaScript Minifier application yesterday and somebody correctly pointed out that the text encoding was incorrect. It turns out part of the problem is the code I snatched from Douglas Crockford's original C# minifier code, but there's also an issue with some of the code I added to provide string translations.
StreamReader() specifically has an overload that's supposed to detect byte order marks and sniff the document's encoding based on them. It actually works, but only if the content is encoded as UTF-8/16/32 - ie. when it actually has a byte order mark. It doesn't fall back to Encoding.Default if it can't find a byte order mark - the default without a byte order mark is UTF-8, which usually results in invalid text parsing. For my converter this translated into problems when the source JavaScript files were not encoded as UTF-8, but it worked fine with any of the UTF-xx encodings, which is why I missed this.
There are a few other oddities. For example, Encoding.UTF8 is configured in such a way that when you write out to a StreamWriter it will always write out the byte order mark unless you explicitly create a new instance with the constructor that disables it (ie. new UTF8Encoding(false)). This can really bite you if you're writing XML into an XmlWriter through a StreamWriter, since Encoding.UTF8 is the default. HTTP output should never include a BOM - it's meant only as a content marker at the start of files.
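To make the difference concrete, here's a minimal sketch (the file names are just illustrative):

```csharp
using System.IO;
using System.Text;

public class BomDemo
{
    public static void Run()
    {
        // Encoding.UTF8 emits a 3-byte BOM (EF BB BF) at the start of the file
        File.WriteAllText("with-bom.txt", "hello", Encoding.UTF8);

        // new UTF8Encoding(false) suppresses the BOM - this is what you want
        // for HTTP output or anywhere a BOM would get in the way
        File.WriteAllText("no-bom.txt", "hello", new UTF8Encoding(false));
    }
}
```

After running this, the first file is 8 bytes (BOM plus "hello") while the second is just the 5 text bytes.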
So anyway, every time I run into this I play around for a bit trying different encodings - usually combinations of Encoding.Default, Encoding.UTF8 and Encoding.Unicode - none of which work consistently in all cases. What's really needed is a way to sniff the byte order marks and, depending on which one is present, pass the appropriate Encoding to the StreamReader's constructor.
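One partial option the framework does offer, for what it's worth: StreamReader has a constructor that takes both a fallback encoding and the detectEncodingFromByteOrderMarks flag. If a BOM is present it wins; otherwise the encoding you pass in is used. It doesn't tell you the encoding up front (only via CurrentEncoding after the first read), which is why an explicit sniffing routine is still handy. A sketch:

```csharp
using System.IO;
using System.Text;

public class ReaderFallbackDemo
{
    public static string ReadWithFallback(string path, Encoding fallback)
    {
        // true = detect the encoding from a byte order mark if one exists;
        // if there's no BOM the fallback encoding passed in is used instead
        using (StreamReader reader = new StreamReader(path, fallback, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```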
Since I couldn't find anything in the framework that does this, I ended up creating a short routine that sniffs the file's encoding, which looks like this:
/// <summary>
/// Detects the byte order mark of a file and returns
/// an appropriate encoding for the file.
/// </summary>
/// <param name="srcFile"></param>
/// <returns></returns>
public static Encoding GetFileEncoding(string srcFile)
{
    // *** Default to Encoding.Default (ANSI code page) if no BOM is found
    Encoding enc = Encoding.Default;

    // *** Detect byte order mark if any - otherwise assume default
    byte[] buffer = new byte[5];
    using (FileStream file = new FileStream(srcFile, FileMode.Open, FileAccess.Read))
        file.Read(buffer, 0, 5);

    if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
        enc = Encoding.UTF8;
    else if (buffer[0] == 0xff && buffer[1] == 0xfe && buffer[2] == 0 && buffer[3] == 0)
        enc = Encoding.UTF32;                // UTF-32, little endian (check before UTF-16 LE)
    else if (buffer[0] == 0xff && buffer[1] == 0xfe)
        enc = Encoding.Unicode;              // UTF-16, little endian
    else if (buffer[0] == 0xfe && buffer[1] == 0xff)
        enc = Encoding.BigEndianUnicode;     // UTF-16, big endian
    else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
        enc = new UTF32Encoding(true, true); // UTF-32, big endian
    else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
        enc = Encoding.UTF7;
    return enc;
}
/// <summary>
/// Opens a stream reader with the appropriate text encoding applied.
/// </summary>
/// <param name="srcFile"></param>
public static StreamReader OpenStreamReaderWithEncoding(string srcFile)
{
Encoding enc = GetFileEncoding(srcFile);
return new StreamReader(srcFile, enc);
}
This seems to do the trick with the various encodings I threw at it. The file to file conversion uses a StreamReader for input and a StreamWriter for output, which looks like this:
/// <summary>
/// Minifies a source file into a target file.
/// </summary>
/// <param name="srcFile"></param>
/// <param name="dstFile"></param>
public void Minify(string srcFile, string dstFile)
{
    Encoding enc = StringUtilities.GetFileEncoding(srcFile);
    using (sr = new StreamReader(srcFile, enc))
    {
        using (sw = new StreamWriter(dstFile, false, enc))
        {
            jsmin();
        }
    }
}
This detects the original encoding, opens the input file, and then writes the output back with the same encoding, which is what you'd expect. The only catch is that if for some reason the file is UTF-8 (or 16/32) encoded and there's no BOM, the routine will fall back - potentially incorrectly - to the default ANSI encoding. I suppose that's reasonable, since that's the most likely scenario for source code files generated with Microsoft tools anyway.
When dealing with string values only, it's best to use Unicode encoding. There's another little tweak I had to make to the minifier, related to the string processing, which has a similar issue:
/// <summary>
/// Minifies an input JavaScript string and converts it to a compressed
/// JavaScript string.
/// </summary>
/// <param name="src"></param>
/// <returns></returns>
public string MinifyString(string src)
{
    MemoryStream srcStream = new MemoryStream(Encoding.Unicode.GetBytes(src));
    MemoryStream tgStream = new MemoryStream(8192);
    using (sr = new StreamReader(srcStream, Encoding.Unicode))
    {
        // BOM-less Unicode so the returned string doesn't start with U+FEFF
        using (sw = new StreamWriter(tgStream, new UnicodeEncoding(false, false)))
        {
            jsmin();
        }
    }
    return Encoding.Unicode.GetString(tgStream.ToArray());
}
Notice that when using strings as input it's best to use Unicode encoding, since .NET strings are always Unicode internally (unless a specific encoding was applied). The original code I used skipped the Encoding.Unicode on the reader and writer, which also caused formatting issues with extended characters.
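Since .NET strings are UTF-16 internally, Encoding.Unicode round-trips any string losslessly, which is exactly what you want in a string-to-string conversion like this. A quick sketch (class name is illustrative):

```csharp
using System.Text;

public class StringRoundTrip
{
    public static string RoundTrip(string src)
    {
        // .NET strings are UTF-16 internally, so Encoding.Unicode (UTF-16LE)
        // converts to bytes and back without losing any characters
        byte[] bytes = Encoding.Unicode.GetBytes(src);
        return Encoding.Unicode.GetString(bytes);
    }
}
```

Try the same round trip through Encoding.ASCII with extended characters in the input and you'll see them replaced with '?' - which is essentially the bug the original code had.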
Encodings are a confusing topic even once you get your head around how an encoding relates to the binary signature (the actual bytes) of your text. This is especially true for streams in .NET, because many of the text-based streams already apply default encodings, and because streams are often passed to other components that also expose encodings (like an XmlReader, for example).
Hopefully a routine like the above (and this entry <g>) will jog my memory 'next time'.
The Voices of Reason
# re: Detecting Text Encoding for StreamReader
I didn't get out Reflector to check what the detectEncoding parameter does, but from what I can see it only detects documents with a BOM. No BOM (or omitting the detection parameter) and it'll use the default UTF-8 encoding.
# re: Detecting Text Encoding for StreamReader
// Change sr to TextReader and sw to TextWriter
public string MinifyString(string src)
{
    using (sr = new StringReader(src))
    using (sw = new StringWriter()) // No need to nest the using statements
    {
        jsmin();
        return sw.ToString();
    }
}
# re: Detecting Text Encoding for StreamReader
Hmmm... actually taking another look casting all of those StreamReader/Writer to TextReader/Writer does the trick on the class:
public string MinifyString(string src)
{
    using (sr = (TextReader)new StringReader(src))
    {
        using (sw = (TextWriter)new StringWriter())
        {
            jsmin();
            return sw.ToString();
        }
    }
}
# re: Detecting Text Encoding for StreamReader
My code is shown below:
if (buffer[0] == 0xFE && buffer[1] == 0xFF)
{
// 1201 unicodeFFFE Unicode (Big-Endian)
enc = Encoding.GetEncoding(1201);
}
if (buffer[0] == 0xFF && buffer[1] == 0xFE)
{
// 1200 utf-16 Unicode
enc = Encoding.GetEncoding(1200);
}
Using the GetPreamble() method of each Encoding type to compare headers would be more generic, I think. What about this?
Thanks
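A sketch of that idea - compare the file header against each candidate encoding's GetPreamble() bytes, longest preamble first so UTF-32's FF FE 00 00 is matched before UTF-16's FF FE (method name and candidate list are illustrative):

```csharp
using System.IO;
using System.Text;

public static class PreambleSniffer
{
    // Candidate encodings, longest preamble first
    static readonly Encoding[] Candidates =
    {
        Encoding.UTF32,                  // FF FE 00 00 (UTF-32 LE)
        new UTF32Encoding(true, true),   // 00 00 FE FF (UTF-32 BE)
        Encoding.UTF8,                   // EF BB BF
        Encoding.Unicode,                // FF FE (UTF-16 LE)
        Encoding.BigEndianUnicode        // FE FF (UTF-16 BE)
    };

    public static Encoding Detect(string path, Encoding fallback)
    {
        byte[] header = new byte[4];
        int read;
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
            read = fs.Read(header, 0, header.Length);

        foreach (Encoding enc in Candidates)
        {
            byte[] preamble = enc.GetPreamble();
            if (preamble.Length == 0 || preamble.Length > read)
                continue;
            bool match = true;
            for (int i = 0; i < preamble.Length; i++)
                if (header[i] != preamble[i]) { match = false; break; }
            if (match)
                return enc;
        }
        return fallback;
    }
}
```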
# re: Detecting Text Encoding for StreamReader
I tried to find the encoding of a UTF-7 encoded file, but it is not working properly. For other files it is not giving any problems. Kindly get back to me if you find any solution for the UTF-7 type of file.
# re: Detecting Text Encoding for StreamReader
More Details:
http://en.wikipedia.org/wiki/UTF-8
# re: Detecting Text Encoding for StreamReader
thanks very much.
# re: Detecting Text Encoding for StreamReader
Thanks
# re: Detecting Text Encoding for StreamReader
Byte order mark Description
EF BB BF UTF-8
FF FE UTF-16, little endian
FE FF UTF-16, big endian
FF FE 00 00 UTF-32, little endian
00 00 FE FF UTF-32, big-endian
Note: Microsoft uses UTF-16, little endian byte order
How can I detect a UTF-8 file that has no BOM?
Thanks
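One common heuristic for the BOM-less case: attempt a strict UTF-8 decode and fall back to the ANSI code page if it throws. This is only a sketch and not foolproof - plain ASCII and some ANSI byte sequences also happen to be valid UTF-8:

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8Heuristic
{
    public static Encoding GuessUtf8OrDefault(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        try
        {
            // throwOnInvalidBytes: true makes the decoder reject byte
            // sequences that are not well-formed UTF-8
            new UTF8Encoding(false, true).GetString(bytes);
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException)
        {
            return Encoding.Default; // ANSI code page
        }
    }
}
```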
# re: Detecting Text Encoding for StreamReader
//Table 1: Unicode Signature Byte Sequences
//Byte Sequence Encoding
//FE FF UTF-16BE
//FF FE (not followed by 00 00) UTF-16LE
//00 00 FE FF UTF-32BE
//FF FE 00 00 UTF-32LE
//EF BB BF UTF-8
//0E FE FF SCSU
//FB EE 28 BOCU-1 (U+FEFF must be removed after conversion)
//2B 2F 76 38 2D or
//2B 2F 76 38 or
//2B 2F 76 39 or
//2B 2F 76 2B or
//2B 2F 76 2F UTF-7 (only the first sequence can be removed before conversion; otherwise U+FEFF must be removed after conversion)
//DD 73 66 73 UTF-EBCDIC
Several questions about encoding...
For example:
1. In Visual Studio, in source code:
string s = "text in source file in vs";
By default, what encoding does the string "text in source file in vs" have? UTF-16?
2. If you create a text file in Visual Studio, does it create it with UTF-8 encoding? If you add an existing text file in Visual Studio, does it retain its existing encoding? And if you make changes and save them, does the encoding remain the same?
3. Can I detect the encoding of a file, for example a text file like File1.txt, in 100% of cases?
4. Which encoding is used if I read a file using File.ReadAllBytes(filepath), which returns an array of bytes? This method has no other overload.
4b. If you get an array of bytes (byte[]) that corresponds to the data of one text file, can I detect its encoding?
5. If Visual Studio adds a file as a resource (Binary type), can I detect the encoding of that file?
internal static byte[] file1 {
    get {
        object obj = ResourceManager.GetObject("file1", resourceCulture);
        return ((byte[])(obj));
    }
}
6. What about the BOM? Do text editors handle it or not? (such as Notepad, UltraEdit... and Visual Studio?)
The key question is: given a file path string or a byte[], I need to be able to detect what encoding the file uses.
Thanks in advance, greetings, regards!
# re: Detecting Text Encoding for StreamReader
I agree with Will, your code has a bug.
This:
else if (buffer[0] == 0xfe && buffer[1] == 0xff) enc = Encoding.Unicode;
should read:
else if (buffer[0] == 0xff && buffer[1] == 0xfe) enc = Encoding.Unicode; // utf-16le
You are interpreting the BOM for utf-16be as if it were utf-16le.
In addition to the correction, you could make your code more robust by adding this branch to detect utf-16be:
else if (buffer[0] == 0xfe && buffer[1] == 0xff) enc = Encoding.BigEndianUnicode; // utf-16be
# re: Detecting Text Encoding for StreamReader
There is no such thing. You may be talking about a code page. For code page files you either need to be told the encoding or you need to use some statistical heuristic to determine it.
# re: Detecting Text Encoding for StreamReader
// In C... load the whole text into a char array and pass it to the function
BOOL is_utf8(const char *str)
{
    int c = 0, b = 0;
    int i;
    int bits = 0;
    int len = strlen(str);
    for (i = 0; i < len; i++)
    {
        c = (unsigned char)str[i]; // cast: plain char may be signed
        if (c > 127)
        {
            if (c >= 254) return FALSE;
            else if (c >= 252) bits = 6;
            else if (c >= 248) bits = 5;
            else if (c >= 240) bits = 4;
            else if (c >= 224) bits = 3;
            else if (c >= 192) bits = 2;
            else return FALSE;
            if ((i + bits) > len) return FALSE;
            while (bits > 1)
            {
                i++;
                b = (unsigned char)str[i];
                if (b < 128 || b > 191) return FALSE;
                bits--;
            }
        }
    }
    return TRUE;
}
# re: Detecting Text Encoding for StreamReader
I am working with SQL script files and I check bytes 1, 3 and 5 for 0x0.
This is probably not perfect but seems to be working okay.
I found this out because SQL Management Studio can script to Unicode (with BOM). If you then edit the script in Textpad the format stays the same (Unicode) but the BOM is removed (this is the default option in Textpad - Configure/Preferences/Document Classes).
Now I use code similar to the above to read a BOM and if none is found but the first 3 characters contain 0x0 in the second byte I can assume that the character set is using 2 bytes per character.
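That zero-byte check could be sketched roughly like this (class and method names are illustrative; the heuristic assumes ASCII-range text, as is typical for SQL scripts):

```csharp
using System.IO;

public static class SqlScriptSniffer
{
    // Returns true when bytes 1, 3 and 5 (the high bytes of the first
    // three UTF-16LE code units) are all zero - a strong hint that a
    // BOM-less file is UTF-16 little-endian ASCII-range text.
    public static bool LooksLikeUtf16LE(string path)
    {
        byte[] header = new byte[6];
        int read;
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
            read = fs.Read(header, 0, header.Length);
        return read == 6 &&
               header[1] == 0 && header[3] == 0 && header[5] == 0;
    }
}
```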
# re: Detecting Text Encoding for StreamReader
bool firstLine = true;
using (StreamReader fileOriginal = new StreamReader(originalFileName))
{
    // force StreamReader to determine the encoding
    // (retrievable through fileOriginal.CurrentEncoding)
    line = fileOriginal.ReadLine();
    firstLine = true;
    using (StreamWriter fileModified = new StreamWriter(modifiedFileName, false, fileOriginal.CurrentEncoding))
    {
        while (firstLine || ((line = fileOriginal.ReadLine()) != null))
        {
            firstLine = false; // I know, it gets set again and again. Maybe you prefer an "if" here.
            // do stuff here
        }
    }
} // StreamReader
# re: Detecting Text Encoding for StreamReader
Great blog post, it helps me understand this topic more.
I am trying to open an Excel document and figure out the encoding used for the file. I'm pretty much doing what you explain, but when I put a trace on my program and translate the numeric character values to text using my handy chart, I realize that it just grabs double quotes and then the first 4 characters of the topmost, leftmost cell of the Excel sheet that is visible to the human eye.
In other words, there was no encoding information at the start of the file, just the text that everyone can see in the file when they open it. Just wondering, in this situation do you (or anyone reading) have any good ideas on how to figure out the encoding of a file programmatically? Thanks
MC
# UnauthorizedAccessException
thanks for the code snippet!
Note that you might specify FileAccess.Read mode when creating the FileStream to prevent an UnauthorizedAccessException on files that are read-only:
FileStream file = new FileStream(srcFile, FileMode.Open, FileAccess.Read);
# re: Detecting Text Encoding for StreamReader
I had exactly this issue where the StreamReader could not detect the correct encoding and this sorted it for me.
# re: Detecting Text Encoding for StreamReader
I had to modify https://github.com/dalimian/fconvert to get a reasonable solution. Most other solutions to the encoding issue online (like the one above) don't seem to work well (e.g. telling the difference between UTF-8 and windows-1252).
https://github.com/dalimian/fconvert applied the best heuristic solution I have seen. If the file is a chimera ... you can play around with fconvert to provide the best solution.
# re: Detecting Text Encoding for StreamReader
I owe you many beers.