A Small Complaint About Seek and Read on File Streams

The Problem

To read a block of data starting at a given position in a file, we would typically write something like this:
Read File Stream Content

// Requires: using System; using System.IO;
private static string ReadContent(string fileName, int position, int length)
{
    if (!File.Exists(fileName))
    {
        throw new FileNotFoundException("The specified file is not found : " + fileName);
    }
    using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (StreamReader reader = new StreamReader(stream))
    {
        // Position the underlying stream before the reader has buffered anything.
        reader.BaseStream.Seek(position, SeekOrigin.Begin);
        char[] buffer = new char[length];
        // Read may return fewer chars than requested, for example near the end of the file.
        int read = reader.Read(buffer, 0, length);
        return new string(buffer, 0, read);
    }
}
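This code is straightforward and easy to follow. To read several such segments from the same file, we would typically write the following (it takes multiple positions and the corresponding lengths to read; the parameter list is for illustration only):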
Read Content With Seeking

private static string[] ReadContents(string fileName, int[] positions, int[] lengths)
{
    if (!File.Exists(fileName))
    {
        throw new FileNotFoundException("The specified file is not found : " + fileName);
    }
    using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (StreamReader reader = new StreamReader(stream))
    {
        string[] contents = new string[positions.Length];
        for (int i = 0; i < positions.Length; i++)
        {
            reader.BaseStream.Seek(positions[i], SeekOrigin.Begin);
            char[] buffer = new char[lengths[i]];
            // Read may return fewer chars than requested.
            int read = reader.Read(buffer, 0, lengths[i]);
            contents[i] = new string(buffer, 0, read);
        }
        return contents;
    }
}
This also looks fine. But a small test program produces a surprising result:
Test App

static void Main(string[] args)
{
    string fileName = @"text.txt";
    using (FileStream stream = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None))
    using (StreamWriter writer = new StreamWriter(stream))
    {
        writer.Write("ABCDEFGHIJKLMNOPQ");
    }
    Console.WriteLine(ReadContent(fileName, 4, 2));
    Console.WriteLine(ReadContent(fileName, 10, 2));
    Console.WriteLine(ReadContent(fileName, 7, 2));
    Console.WriteLine();
    string[] contents = ReadContents(fileName, new int[] { 4, 10, 7 }, new int[] { 2, 2, 2 });
    foreach (var item in contents)
    {
        Console.WriteLine(item);
    }
    Console.ReadKey();
}
The output is:

EF
KL
HI

EF
GH
IJ

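So when we reposition within the same stream, the class library API does not return the content we expect. It looks as if, once the first Seek has happened on a file stream object, every later Seek has no effect! Why?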

Analysis

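In fact, for performance reasons StreamReader maintains its own internal byte buffer. If no buffer size is specified when the StreamReader is constructed, the buffer defaults to 1 KB. The separate 4 KB default is the buffer size of the FileStream that StreamReader creates for itself when it is given a file path. Every Read call is served, directly or indirectly, from this buffer.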
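The buffering can be observed directly. The following is a minimal sketch, assuming a small ASCII file like the one in the test program; the class name BufferDemo, the helper PositionAfterRead, and the temp file are illustrative and not part of the original code. It reads two characters and then reports where the underlying stream ended up.

```csharp
using System;
using System.IO;

public static class BufferDemo
{
    // Returns the position of the underlying stream right after reading
    // `count` chars through a StreamReader that was positioned at `offset`.
    public static long PositionAfterRead(string fileName, int offset, int count)
    {
        using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader reader = new StreamReader(stream))
        {
            reader.BaseStream.Seek(offset, SeekOrigin.Begin);
            char[] buffer = new char[count];
            reader.Read(buffer, 0, count);
            // The reader has pulled a whole buffer's worth of bytes,
            // not just the bytes behind the chars it returned.
            return reader.BaseStream.Position;
        }
    }

    public static void Main()
    {
        string fileName = Path.GetTempFileName();
        File.WriteAllText(fileName, "ABCDEFGHIJKLMNOPQ"); // 17 bytes, no BOM
        Console.WriteLine(PositionAfterRead(fileName, 4, 2));
        File.Delete(fileName);
    }
}
```

For the 17-byte file this should print 17 rather than 6: the reader consumed everything from offset 4 to the end of the file in order to return two characters.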
Buffer Size

// Using a 1K byte buffer and a 4K FileStream buffer works out pretty well
// perf-wise.  On even a 40 MB text file, any perf loss by using a 4K
// buffer is negated by the win of allocating a smaller byte[], which
// saves construction time.  This does break adaptive buffering,
// but this is slightly faster.
internal const int DefaultBufferSize = 1024;  // Byte buffer size
private const int DefaultFileStreamBufferSize = 4096;
private const int MinBufferSize = 128;
Read Buffer

// This version has a perf optimization to decode data DIRECTLY into the
// user's buffer, bypassing StreamWriter's own buffer.
// This gives a > 20% perf improvement for our encodings across the board,
// but only when asking for at least the number of characters that one
// buffer's worth of bytes could produce.
// This optimization, if run, will break SwitchEncoding, so we must not do
// this on the first call to ReadBuffer.
private int ReadBuffer(char[] userBuffer, int userOffset, int desiredChars, out bool readToUserBuffer) {
    charLen = 0;
    charPos = 0;
    if (!_checkPreamble)
        byteLen = 0;
    int charsRead = 0;
    // As a perf optimization, we can decode characters DIRECTLY into a
    // user's char[].  We absolutely must not write more characters
    // into the user's buffer than they asked for.  Calculating
    // encoding.GetMaxCharCount(byteLen) each time is potentially very
    // expensive - instead, cache the number of chars a full buffer's
    // worth of data may produce.  Yes, this makes the perf optimization
    // less aggressive, in that all reads that asked for fewer than AND
    // returned fewer than _maxCharsPerBuffer chars won't get the user
    // buffer optimization.  This affects reads where the end of the
    // Stream comes in the middle somewhere, and when you ask for
    // fewer chars than than your buffer could produce.
    readToUserBuffer = desiredChars >= _maxCharsPerBuffer;
    do {
        if (_checkPreamble) {
            BCLDebug.Assert(bytePos <= _preamble.Length, "possible bug in _compressPreamble.  Are two threads using this StreamReader at the same time?");
            int len = stream.Read(byteBuffer, bytePos, byteBuffer.Length - bytePos);
            BCLDebug.Assert(len >= 0, "Stream.Read returned a negative number!  This is a bug in your stream class.");
            if (len == 0) {
                // EOF but we might have buffered bytes from previous
                // attempts to detecting preamble that needs to decoded now
                if (byteLen > 0) {
                    if (readToUserBuffer) {
                        charsRead += decoder.GetChars(byteBuffer, 0, byteLen, userBuffer, userOffset + charsRead);
                        charLen = 0;  // StreamReader's buffer is empty.
                    }
                    else {
                        charsRead = decoder.GetChars(byteBuffer, 0, byteLen, charBuffer, charsRead);
                        charLen += charsRead;  // Number of chars in StreamReader's buffer.
                    }
                }
                return charsRead;
            }
            byteLen += len;
        }
        else {
            BCLDebug.Assert(bytePos == 0, "bytePos can be non zero only when we are trying to _checkPreamble.  Are two threads using this StreamReader at the same time?");
            byteLen = stream.Read(byteBuffer, 0, byteBuffer.Length);
            BCLDebug.Assert(byteLen >= 0, "Stream.Read returned a negative number!  This is a bug in your stream class.");
            if (byteLen == 0)  // EOF
                return charsRead;
        }
        // _isBlocked == whether we read fewer bytes than we asked for.
        // Note we must check it here because CompressBuffer or
        // DetectEncoding will ---- with byteLen.
        _isBlocked = (byteLen < byteBuffer.Length);
        // Check for preamble before detect encoding. This is not to override the
        // user suppplied Encoding for the one we implicitly detect. The user could
        // customize the encoding which we will loose, such as ThrowOnError on UTF8
        // Note: we don't need to recompute readToUserBuffer optimization as IsPreamble
        // doesn't change the encoding or affect _maxCharsPerBuffer
        if (IsPreamble())
            continue;
        // On the first call to ReadBuffer, if we're supposed to detect the encoding, do it.
        if (_detectEncoding && byteLen >= 2) {
            DetectEncoding();
            // DetectEncoding changes some buffer state.  Recompute this.
            readToUserBuffer = desiredChars >= _maxCharsPerBuffer;
        }
        charPos = 0;
        if (readToUserBuffer) {
            charsRead += decoder.GetChars(byteBuffer, 0, byteLen, userBuffer, userOffset + charsRead);
            charLen = 0;  // StreamReader's buffer is empty.
        }
        else {
            charsRead = decoder.GetChars(byteBuffer, 0, byteLen, charBuffer, charsRead);
            charLen += charsRead;  // Number of chars in StreamReader's buffer.
        }
    } while (charsRead == 0);
    _isBlocked &= charsRead < desiredChars;
    //Console.WriteLine("ReadBuffer: charsRead: "+charsRead+"  readToUserBuffer: "+readToUserBuffer);
    return charsRead;
}
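So the problem comes down to this: when BaseStream.Seek is called the second time, the reader's buffer is not refilled! The second Read is therefore served from the bytes that were buffered after the first Seek, that is, the data that follows the first Seek position (up to one buffer's worth; in the test program, the entire rest of the 17-byte file). The position requested by the second Seek no longer matches the start of that buffer, and may not be inside the buffer at all.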

To have the buffer refreshed for a later Seek, you must call DiscardBufferedData() explicitly:
Code Snippet

// DiscardBufferedData tells StreamReader to throw away its internal
// buffer contents.  This is useful if the user needs to seek on the
// underlying stream to a known location then wants the StreamReader
// to start reading from this new point.  This method should be called
// very sparingly, if ever, since it can lead to very poor performance.
// However, it may be the only way of handling some scenarios where
// users need to re-read the contents of a StreamReader a second time.
public void DiscardBufferedData() {
    byteLen = 0;
    charLen = 0;
    charPos = 0;
    decoder = encoding.GetDecoder();
    _isBlocked = false;
}
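Applied to the multi-segment reader from the beginning of the article, the fix might look like the sketch below. The class name SeekFix and the temp file are illustrative; the only substantive change is the DiscardBufferedData() call after each Seek.

```csharp
using System;
using System.IO;

public static class SeekFix
{
    public static string[] ReadContents(string fileName, int[] positions, int[] lengths)
    {
        using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader reader = new StreamReader(stream))
        {
            string[] contents = new string[positions.Length];
            for (int i = 0; i < positions.Length; i++)
            {
                reader.BaseStream.Seek(positions[i], SeekOrigin.Begin);
                // Drop whatever the reader buffered before this Seek so the
                // next Read refills the buffer from the new stream position.
                reader.DiscardBufferedData();
                char[] buffer = new char[lengths[i]];
                int read = reader.Read(buffer, 0, lengths[i]);
                contents[i] = new string(buffer, 0, read);
            }
            return contents;
        }
    }

    public static void Main()
    {
        string fileName = Path.GetTempFileName();
        File.WriteAllText(fileName, "ABCDEFGHIJKLMNOPQ");
        foreach (string item in ReadContents(fileName, new int[] { 4, 10, 7 }, new int[] { 2, 2, 2 }))
        {
            Console.WriteLine(item); // EF, KL, HI
        }
        File.Delete(fileName);
    }
}
```

Note that the positions are byte offsets into the file. With a multi-byte encoding they have to fall on character boundaries, which this sketch does not check.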
A Small Complaint

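I remember that the book "Framework Design" discusses some regrets in the design of the .NET class library; I don't know whether this counts as one. I consider myself at least an experienced hand, yet my first reaction on hitting this problem was that it was very strange. When I read the source, it struck me as tricky and smelly. The library designers evidently second-guessed, uncharitably, what programmers intend and what they are capable of. They believed they had found a clever balance between performance and usability, but in practice the result is an ambiguous API that plainly produces wrong results. Granted, statistically speaking, reads tend to cluster in nearby locations; in other words, buffered content is quite likely to be what gets read next. But performance must always rest on correctness. The regrettable part of this API is that it ignores the need to Seek more than once.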

Let's speculate about how it could have been designed.

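If the goal is a large, full-featured design, the buffer can stay, but relying on BaseStream's Seek alone is clearly not enough. StreamReader, or its base class TextReader, should offer a Seek API that wraps both the repositioning of BaseStream and the repositioning of the buffered data. Would such an API be friendlier to programmers? I think so; at the very least it would not cause misunderstanding.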
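As a rough illustration of the first option, here is a sketch of a wrapper that exposes Seek and Read together, so callers never touch BaseStream directly. SeekableTextReader is a hypothetical type invented for this example, not something in the .NET class library, and it simply discards the buffer on every Seek rather than repositioning within it.

```csharp
using System;
using System.IO;

// Hypothetical wrapper, not part of the .NET class library.
public sealed class SeekableTextReader : IDisposable
{
    private readonly StreamReader _reader;

    public SeekableTextReader(Stream stream)
    {
        _reader = new StreamReader(stream);
    }

    // Moves the underlying stream and drops stale buffered data in one step,
    // so the two can never get out of sync.
    public void Seek(long byteOffset)
    {
        _reader.BaseStream.Seek(byteOffset, SeekOrigin.Begin);
        _reader.DiscardBufferedData();
    }

    // Reads up to `count` chars from the current position.
    public string Read(int count)
    {
        char[] buffer = new char[count];
        int read = _reader.Read(buffer, 0, count);
        return new string(buffer, 0, read);
    }

    // Disposing the StreamReader also closes the underlying stream.
    public void Dispose()
    {
        _reader.Dispose();
    }
}

public static class Program
{
    public static void Main()
    {
        string fileName = Path.GetTempFileName();
        File.WriteAllText(fileName, "ABCDEFGHIJKLMNOPQ");
        using (SeekableTextReader reader = new SeekableTextReader(File.OpenRead(fileName)))
        {
            reader.Seek(10);
            Console.WriteLine(reader.Read(2)); // KL
            reader.Seek(7);
            Console.WriteLine(reader.Read(2)); // HI
        }
        File.Delete(fileName);
    }
}
```

A real implementation could be smarter and keep the buffer when the target offset already lies inside it; that bookkeeping is left out here.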

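If the goal is a small, focused design, the buffering mechanism can be removed entirely and replaced by a buffer supplied by the programmer, leaving it entirely to the user of the language to decide whether to implement their own buffering. Such a language or class library would be just as robust, and just as acceptable to programmers.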
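The second option can be approximated today by skipping StreamReader and decoding bytes from the stream yourself. The sketch below assumes the caller knows the encoding and works in byte offsets and byte counts; RawRead and its parameter names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class RawRead
{
    // Reads `byteCount` bytes starting at `byteOffset` into the caller's
    // buffer and decodes them. No reader-side cache is involved, so every
    // Seek takes effect immediately.
    public static string ReadContent(Stream stream, long byteOffset, byte[] buffer, int byteCount, Encoding encoding)
    {
        if (byteCount > buffer.Length)
        {
            throw new ArgumentException("The buffer is smaller than byteCount.");
        }
        stream.Seek(byteOffset, SeekOrigin.Begin);
        int total = 0;
        // Stream.Read may return fewer bytes than requested, so loop until
        // the request is satisfied or the end of the stream is reached.
        while (total < byteCount)
        {
            int n = stream.Read(buffer, total, byteCount - total);
            if (n == 0)
            {
                break;
            }
            total += n;
        }
        return encoding.GetString(buffer, 0, total);
    }

    public static void Main()
    {
        string fileName = Path.GetTempFileName();
        File.WriteAllText(fileName, "ABCDEFGHIJKLMNOPQ");
        byte[] buffer = new byte[16]; // caller-owned, reused across reads
        using (FileStream stream = File.OpenRead(fileName))
        {
            Console.WriteLine(ReadContent(stream, 4, buffer, 2, Encoding.UTF8));  // EF
            Console.WriteLine(ReadContent(stream, 10, buffer, 2, Encoding.UTF8)); // KL
            Console.WriteLine(ReadContent(stream, 7, buffer, 2, Encoding.UTF8));  // HI
        }
        File.Delete(fileName);
    }
}
```

The trade-off is that the caller now has to think in bytes: for multi-byte encodings a segment must start and end on character boundaries, which StreamReader would otherwise handle.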

Summary

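Recently the discussions in the cnblogs community about the C# language itself and the .NET class library have been deep and heated. Personally I think debate is what drives every language forward. I wanted to say something, and this little example came to mind. As programmers, we may care neither about whether a pattern is supported by the language nor about whether it is supported by the library. I only hope that class library design has a little less of the cleverness shown in this example and a little more plain solidity.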
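Author: Jeffrey Sun
Source: http://sun.cnblogs.com/
This article is provided "as is" with no warranties, and confers no rights. Copyright belongs to the author. Reposting is welcome, but unless the author agrees otherwise this notice must be retained and a link to the original article must appear prominently on the page; otherwise the author reserves the right to pursue legal liability.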
Original article: https://www.cnblogs.com/sun/p/1775311.html
