Background and motivation
This proposal adds a new StreamReader property, CurrentBOM, which reports as a boolean whether a byte order mark (BOM) is present. It is needed because CurrentEncoding is not designed to detect whether a BOM exists in UTF-encoded files.
In addition, approving this proposal would spare the .NET and PowerShell (PWSH) communities the workarounds they currently need when processing UTF files at scale.
Specifically:
- In C#, callers read files line by line just to detect a BOM. This is wasteful compared with the proposed property, and slower than ReadToEnd().
- In PWSH scripts that call into .NET, callers are forced into workarounds (e.g. reading the file twice, or reading it once and converting bytes in memory for a one-off manual check).
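The manual check these workarounds amount to can be sketched in C# as follows. This is only an illustration under stated assumptions: the class name BomSniffer and both HasBom overloads are hypothetical and not part of .NET, and the BOM table covers UTF-8, UTF-16, and UTF-32 only.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of .NET: a sketch of the manual BOM check.
public static class BomSniffer
{
    // Byte order marks for the common Unicode encodings.
    private static readonly byte[][] KnownBoms =
    {
        new byte[] { 0x00, 0x00, 0xFE, 0xFF }, // UTF-32 BE
        new byte[] { 0xFF, 0xFE, 0x00, 0x00 }, // UTF-32 LE
        new byte[] { 0xEF, 0xBB, 0xBF },       // UTF-8
        new byte[] { 0xFE, 0xFF },             // UTF-16 BE
        new byte[] { 0xFF, 0xFE },             // UTF-16 LE
    };

    // Returns true when the first `count` bytes of `head` start with a known BOM.
    public static bool HasBom(byte[] head, int count)
    {
        foreach (byte[] bom in KnownBoms)
        {
            if (count < bom.Length)
            {
                continue; // not enough bytes to hold this BOM
            }
            bool match = true;
            for (int i = 0; i < bom.Length; i++)
            {
                if (head[i] != bom[i])
                {
                    match = false;
                    break;
                }
            }
            if (match)
            {
                return true;
            }
        }
        return false;
    }

    // Reads at most four bytes from the start of the file and checks them.
    public static bool HasBom(string path)
    {
        byte[] head = new byte[4];
        int total = 0;
        using (FileStream fs = File.OpenRead(path))
        {
            int n;
            while (total < head.Length &&
                   (n = fs.Read(head, total, head.Length - total)) > 0)
            {
                total += n;
            }
        }
        return HasBom(head, total);
    }
}
```

A caller would still have to open the file a second time with StreamReader to read the text, which is the duplicated work the proposal aims to remove.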
The following PWSH code demonstrates the problem:
# Case 1: an empty file written with a UTF-8 BOM.
$bompath = "$env:TEMP\bom.txt"
$utf8Bom = [System.Text.UTF8Encoding]::new($true)
[System.IO.File]::WriteAllText($bompath, '', $utf8Bom)
# Manual check: compare the leading bytes with the UTF-8 BOM (EF BB BF).
$bytes = [System.IO.File]::ReadAllBytes($bompath)
if ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
    Write-Host 'BOM DETECTED'
} else {
    Write-Host 'NO BOM DETECTED'
}
# StreamReader check: preamble length of the reader's current encoding.
$sr = New-Object System.IO.StreamReader($bompath, $false)
$encoding = $sr.CurrentEncoding
$encoding.GetPreamble().Length
$sr.Dispose()

# Case 2: an empty file written without a BOM.
$nobompath = "$env:TEMP\bom-no.txt"
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)
[System.IO.File]::WriteAllText($nobompath, '', $utf8NoBom)
$bytes = [System.IO.File]::ReadAllBytes($nobompath)
if ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
    Write-Host 'BOM DETECTED'
} else {
    Write-Host 'NO BOM DETECTED'
}
$sr = New-Object System.IO.StreamReader($nobompath, $false)
$encoding = $sr.CurrentEncoding
$encoding.GetPreamble().Length
$sr.Dispose()
What is expected:
BOM DETECTED
3
NO BOM DETECTED
0
What you get:
BOM DETECTED
3
NO BOM DETECTED
3
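The unexpected 3 in the second case appears to come from the encoding object rather than from the file. GetPreamble() reports the BOM an encoding instance would emit when writing, and the StreamReader constructor used above is documented to start from a UTF-8 encoding, whose preamble is three bytes. A minimal C# sketch of that distinction follows; the PreambleDemo class and its PreambleLength method are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

// Hypothetical demo class: shows that the preamble length is a property of
// the encoding instance, independent of any file content.
public static class PreambleDemo
{
    // Length of the BOM this encoding instance would emit when writing.
    public static int PreambleLength(Encoding encoding)
    {
        return encoding.GetPreamble().Length;
    }
}
```

PreambleDemo.PreambleLength(Encoding.UTF8) returns 3, while PreambleDemo.PreambleLength(new UTF8Encoding(false)) returns 0, and no file is read in either call.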
API Proposal
namespace System.IO;
public partial class StreamReader
{
    public bool CurrentBOM { get; }
}
API Usage
using System;
using System.IO;
using System.Text;
class Test
{
    public static void Main()
    {
        string path = @"c:\temp\MyTest.txt";
        try
        {
            if (File.Exists(path))
            {
                File.Delete(path);
            }
            // Use UTF-16 encoding
            using (StreamWriter sw = new StreamWriter(path, false, new UnicodeEncoding()))
            {
                sw.WriteLine("My test");
                sw.WriteLine("text.");
            }
            using (StreamReader sr = new StreamReader(path, true))
            {
                while (sr.Peek() >= 0)
                {
                    Console.Write((char)sr.Read());
                }
                // Test for BOM after reading, or at least after the first read.
                Console.WriteLine("BOM present: {0}.", sr.CurrentBOM);
            }
        }
        catch (Exception e)
        {
            Console.WriteLine("The process failed: {0}", e.ToString());
        }
    }
}
Alternative Designs
One could instead amend the CurrentEncoding property so that GetPreamble() reflects what was actually detected in the stream (i.e. correct endianness, number of bytes, and exact byte sequence). This would be a breaking change, since behavior that is currently inaccurate would become accurate and precise, but making the logic accurate and precise is a reasonable goal.
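As a rough sketch of what this alternative would mean for UTF-8, the reader could hand back an encoding instance whose preamble matches what was observed in the stream. The PreambleAwareEncoding class and its ForUtf8 method below are hypothetical; only the UTF8Encoding constructor is an existing .NET API.

```csharp
using System.Text;

// Hypothetical helper illustrating the alternative design for UTF-8 only.
public static class PreambleAwareEncoding
{
    // Returns a UTF-8 encoding whose GetPreamble() is three bytes long when
    // a BOM was seen at the start of the stream, and empty otherwise.
    public static Encoding ForUtf8(bool bomSeen)
    {
        return new UTF8Encoding(encoderShouldEmitUTF8Identifier: bomSeen);
    }
}
```

One consequence to weigh: code that passes CurrentEncoding to a StreamWriter would then stop emitting a BOM for inputs that had none, which is one concrete form the breaking change could take.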
Risks
None identified.