• Reading Unicode files with fgetws (zz.IS2120@BG57IV3)


    //z 2012-11-22 18:48:32 IS2120@BG57IV3.T2690489747.K[T4,L45,R0,V24]



    Keywords: fgetws, _fgetts, reading Chinese, garbled text, Unicode, double-byte, multibyte

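    I recently had to read a Unicode file for some extra processing. Reading it with fgetws and printing the result with WriteConsole did not display the Chinese text correctly. After a fair amount of searching I found the cause: the file must not be opened in text mode. It has to be opened in binary mode; only then does fgetws treat the file as Unicode, that is, it reads two bytes at a time (one wchar_t on Windows) and performs no conversion.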

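    When the file is opened in text mode, fgetws assumes the input stream consists of multibyte characters, so it performs an MBCS-to-Unicode conversion. If you read a text-mode file with fgetws, you will find that what is clearly a single Chinese character ends up split across two elements of your buffer, as shown below, when the value we actually expect is 0x540d.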
    buf[0] = 0x000d
    buf[1] = 0x0054

    This is also why the data, handed straight to WriteConsole after reading, cannot be displayed as Chinese.
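
    The split in the buf[] values above can be reproduced with plain arithmetic. The following is a minimal, portable sketch rather than the CRT's actual code: decode_as_utf16le and widen_each_byte are hypothetical helper names, and the text-mode path is simplified to "widen every byte", which matches the real MBCS conversion only for single-byte characters such as 0x0D and 0x54.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Binary-mode view: every two bytes form one little-endian UTF-16 code unit.
std::vector<std::uint16_t> decode_as_utf16le(const std::vector<std::uint8_t>& bytes)
{
    assert(bytes.size() % 2 == 0 && "UTF-16 data has an even byte count");
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
        units.push_back(static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8)));
    return units;
}

// Text-mode view (simplified assumption): every byte is treated as its own
// single-byte character and widened on its own.
std::vector<std::uint16_t> widen_each_byte(const std::vector<std::uint8_t>& bytes)
{
    std::vector<std::uint16_t> units;
    for (std::uint8_t b : bytes)
        units.push_back(b);
    return units;
}
```

    Given the two file bytes 0x0D 0x54, decode_as_utf16le yields the single unit 0x540D, while widen_each_byte yields 0x000D followed by 0x0054, the same pair seen in buf[0] and buf[1].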

    So to process a Unicode file, open it in binary mode. For a detailed explanation, see the MSDN article
    Unicode™ Stream I/O in Text and Binary Modes.

    ex:

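    #include <windows.h>
    #include <tchar.h>
    #include <stdio.h>

    #define MAX_BUF_SIZE 1024  // assumed size; the original does not define it

    // Build with UNICODE and _UNICODE defined so that _fgetts maps to fgetws.
    int _tmain()
    {
        FILE *inputfp = NULL;
        _TCHAR Msgbuf[MAX_BUF_SIZE] = { 0 };
        const _TCHAR *pfilename = _T("MY UNICODE FILE");

        // Don't open a Unicode file in text mode if you want fgetws to read it.
        errno_t err = _tfopen_s(&inputfp, pfilename, _T("rb"));
        if (err != 0 || inputfp == NULL)
            return 1;  // could not open the file

        if (_fgetts(Msgbuf, MAX_BUF_SIZE, inputfp) != NULL)
        {
            DWORD dwCharWritten(0UL);
            WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE),
                         Msgbuf,
                         static_cast<DWORD>(_tcsclen(Msgbuf)),
                         &dwCharWritten,
                         NULL);
        }
        fclose(inputfp);
        return 0;
    }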

    [Keyword]
    _fgetts unicode file

    When a Unicode stream I/O routine (such as fwprintf, fwscanf, fgetwc, fputwc, fgetws, or fputws) operates on a file that is open in text mode (the default), two kinds of character conversions take place:

    • Unicode-to-MBCS or MBCS-to-Unicode conversion. When a Unicode stream-I/O function operates in text mode, the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).
    • Carriage return – linefeed (CR-LF) translation. This translation occurs before the MBCS – Unicode conversion (for Unicode stream input functions) and after the Unicode – MBCS conversion (for Unicode stream output functions). During input, each carriage return – linefeed combination is translated into a single linefeed character. During output, each linefeed character is translated into a carriage return – linefeed combination.
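
    The order of the two input-side steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the CRT implementation: translate_crlf_input, widen_ascii_bytes and text_mode_read are hypothetical names, and the mbtowc step is replaced by simple widening, which is only valid for single-byte (ASCII) input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Step 1: CR-LF translation on raw bytes; each 0x0D 0x0A pair becomes 0x0A.
std::vector<std::uint8_t> translate_crlf_input(const std::vector<std::uint8_t>& in)
{
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == 0x0D && i + 1 < in.size() && in[i + 1] == 0x0A)
            continue;  // drop the CR; the LF is copied on the next iteration
        out.push_back(in[i]);
    }
    return out;
}

// Step 2 (simplified stand-in for mbtowc): widen each single-byte character.
std::vector<std::uint16_t> widen_ascii_bytes(const std::vector<std::uint8_t>& in)
{
    return std::vector<std::uint16_t>(in.begin(), in.end());
}

// Text-mode input: translate CR-LF first, then convert to wide characters.
std::vector<std::uint16_t> text_mode_read(const std::vector<std::uint8_t>& bytes)
{
    const std::vector<std::uint8_t> translated = translate_crlf_input(bytes);
    assert(translated.size() <= bytes.size());  // translation never grows the data
    return widen_ascii_bytes(translated);
}
```

    Because step 1 works on bytes, it has no notion of 16-bit code units, which is one more reason a two-bytes-per-character file should not go through this path.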

    However, when a Unicode stream-I/O function operates in binary mode, the file is assumed to be Unicode, and no CR-LF translation or character conversion occurs during input or output. Use the _setmode( _fileno( stdin ), _O_BINARY ); instruction in order to correctly use wcin on a UNICODE text file.
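
    To make the binary-mode behaviour concrete, here is a portable sketch of reading one line of two-byte code units from a stream opened with "rb". It assumes the file is little-endian UTF-16; read_utf16le_line and strip_bom are hypothetical helpers written for this illustration, not CRT functions. As in binary mode, no CR-LF translation is done, and a byte-order mark at the start of the file is returned as ordinary data unless it is removed explicitly.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Reads one line of UTF-16LE code units from a stream opened in binary mode.
// Returns false when nothing could be read. A u'\r' before u'\n' is kept,
// because no CR-LF translation takes place.
bool read_utf16le_line(std::FILE* fp, std::u16string& line)
{
    assert(fp != nullptr);
    line.clear();
    unsigned char pair[2];
    while (std::fread(pair, 1, 2, fp) == 2) {
        const char16_t unit = static_cast<char16_t>(pair[0] | (pair[1] << 8));
        line.push_back(unit);
        if (unit == u'\n')
            break;  // end of line reached; the newline stays in the buffer
    }
    return !line.empty();
}

// Drops a leading byte-order mark (U+FEFF), which a raw read leaves in place.
void strip_bom(std::u16string& line)
{
    if (!line.empty() && line[0] == 0xFEFF)
        line.erase(0, 1);
}
```

    A caller would open the file with std::fopen(path, "rb"), call read_utf16le_line in a loop, and call strip_bom on the first line only.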


  • Original article: https://www.cnblogs.com/IS2120/p/6745798.html