• Tab Separated Values (TSV): a format for tabular data exchange


    Tab Separated Values (TSV): a format for tabular data exchange

    Tab Separated Values (TSV): a format for tabular data exchange

    TSV is a very simple textual data format which allows tabular data
    to be exhanged between applications that use different internal data
    formats. This document briefly explains this standardized format
    and gives simple examples of using it.

    Content:

    The TSV format

    The Tab Separated Values has been
    officially registered
    as an Internet media type
    (MIME type) under the name
    text/tab-separated-values. Note that
    major media type text means that the data is in textual
    format which can be viewed and edited using a normal text editor; in practice
    however this is seldom done - instead, the format is programmatically written
    and read.

    The abbreviation TSV (or tsv) is commonly used, and usually the file name
    suffix for TSV files is ".tsv".

    The registry entry is very short, since the format is very simple, but the
    description is paraphrased here in plain English (as opposite to the partly formalized
    notation in the registration):

    • A file in TSV format consists of lines.
    • Each line contain fields separated
      from each other by TAB characters
      (horizontal tab, HT, Ascii control code 9).
    • "Field" means here just any string of characters, excluding TABs.
      The point is simply that
      TABs divide a line into pieces, components.
    • Each line must contain the same number of fields.
    • The first line contains the names for the fields (on all lines), i.e.
      column headers.

    The last rule could be taken just as allowing the
    labeling of columns that way. Mostly if you wish to include just tabular data
    as such, it probably works OK. But it's usually recommendable to use some labeling
    there.

    Since TAB is used as a separator between fields, a field cannot contain a TAB.
    However, TABs usually don't appear in data items that you wish to tabulate,
    so this is seldom a restriction. There are various other formats which are very
    similar to TSV but use a different separator, such as
    Comma Separated Values
    (CSV).
    Note that commas, spaces, and other characters often used
    as separators in such formats
    appear rather often in data to be tabulated, at least in
    header fields. Special conventions can be made to allow the inclusion of
    the separator into data fields, but the extreme simplicity of the data
    format is then lost.

    The example given in the registration is the following,
    where <TAB> denotes a TAB character:

    Name<TAB>Age<TAB>Address
    Paul<TAB>23<TAB>1115 W Franklin
    Bessy the Cow<TAB>5<TAB>Big Farm Way
    Zeke<TAB>45<TAB>W Main St

    Not to be shown as is!

    Although TSV is a text format,
    a TSV file is not expected to appear in a nice tabular format
    when displayed or printed as such, e.g. by an editor. In special cases it might do that,
    if tab stops are set suitably in the environment and if the fields in a column
    have roughly the same width. But even the simple example probably fails
    when embedded into an HTML document (inside a
    PRE element, which is the best shot):

    Name	Age	Address
    Paul	23	1115 W Franklin
    Bessy the Cow	5	Big Farm Way
    Zeke	45	W Main St

    Using TABs in HTML is unreliable, but more fundamentally, tabbing with TAB is
    in itself not structural. It just takes to some predefined tab stop; typically the
    tab stops are set at 1st, 9th, 17th, etc. character position. This means that
    the data would look OK by accident only. And when using TSV, you are not
    supposed to try to "fix" things by using multiple TABs; that
    would mean that there are empty fields between
    the TABs, defeating the very idea of TSV.

    What happens if you ask your Web browser to show a TSV file, e.g. by following a
    link to a file containing our example? It really depends on
    many things; it's a matter of
    handling media types, which can be a bit confusing.
    It most essentially depends on the settings of your browser, but
    also on how the server sends it. But in any case, you as a user have
    at least as one option the possibility of saving the file
    onto local disk, to be processed separately using some suitable program (e.g.
    MS Excel).
    So TSV is one way of making tabular data available on a network in a universally
    accessible way, though it may require some skills on the user side to know what
    could be done with the data and do something useful with it.

    Importing TSV data to MS Excel

    We could give a TSV file as input to a
    spreadsheet application, like MS Excel.
    Such programs generally accept TSV format data.
    The data can then be viewed, printed, and manipulated in tabular format.
    And you can save the data in MS Excel's "native" format (which cannot
    be processed by other programs unless they have been specifically written
    to process it) or in TSV format, or in a few other formats.
    Our simple example data would get displayed roughly as follows:

    The data appears in columns labeled A, B, C, and rows labeled 1, 2, 3, 4 in framed cells. The cells that contain numbers only are right-aligned, other cells left-aligned. Cell (A,3) appears as truncated to "Bessy the".

    Cells with long content may look truncated. This is
    Excel's feature and often useful when manipulating spreadsheets.
    The presentation, both on screen and on paper, can be tuned as needed,
    using various tools in Excel.

    When you have Excel in use, you can open the the File menu,
    then select Open, and pick up the file to be opened. For finding the file,
    you may need to set value of the "Files of type:" setting (which is
    effectively a file type filter) to "All files (*.*)". (The details
    may vary depending on the version and configuration of the program.)

    But you can make things easier by setting things up so that your system
    (we're assuming some flavor of Windows here) automatically recognizes
    .tsv files as something to be opened in Excel, when you click on their
    icons. Assuming there is no such association for .tsv in your system yet,
    you'll be prompted for one when you first click on the icon of a .tsv file.
    When prompted for "Description of '.tsv' files:", you could reply e.g.
    "text/tab-separated-values". Then pick up Excel from the list of programs
    and click OK.

    When TSV data is imported into Excel, some data
    transformation and interpretation may take place. For example,
    some strings might be interpreted as dates and converted to a different
    date format, and Excel generally treats a string of digits as a number
    and right-justifies it. In such cases, "opening by clicking"
    could be inadequate, since it might not give you the option of
    specifying how the data is to be interpreted. Explicitly opening
    the file in Excel normally lets you use "Text import wizard" where you
    can specify the interpretation of each column.

    Generating TSV data

    For example, if you have a program of your own (say, a Fortran or C program,
    or a Perl script) that writes out some tabular data, and you'd like to be
    able to open (or let others open) the data in Excel, the TSV approach is rather
    simple. You just need to know how to write data in TSV format, which is a lot
    easier than trying to generate MS Excel format. Usually you
    can just use TAB characters in a string used to format the output, but see next
    paragraph as regards to editors; in C and Perl for example, you can alternatively use
    the notation \t inside a character constant or string constant.

    And, of course, you could type TSV data "by hand", using a text editor.
    Note that typing the TAB character may require some extra trick, since
    many editors process it as command-like, instead of entering it into data
    as a character. For example, in some modes Emacs turns TABs to sequences of
    spaces, which won't do in TSV; to prevent this, type control-Q before hitting
    the TAB key.

    As an example, here is
    Fortran 77 code that writes a matrix is TSV format:

          parameter(m=42)
          parameter(n=5)
          real a(m,n)
    ... code that writes data to matrix a ...
          do 200 i=1,m
     200     print 210, (a(i,j),j=1,n)
     210     format(F10.6,4('	',F10.6))

    The code is not particularly elegant, since the number of
    columns is hard-wired into the format statement, which
    causes the first number to printed as such and 4 more with a tab character
    before each of them. Between the apostrophes (single quotation marks), there
    is exactly one TAB character and nothing more; the appearance on your
    browser may vary.

    Exporting data in TSV format from Excel

    When you save a spreadsheet in Excel (using Save in the File menu,
    or using control-S), it normally uses by default its own internal data format.
    You can override this by using Save As... (in the File menu)
    instead and selecting the desired format. Among the available formats,
    you can pick up "Text (Tab delimited) (*.txt)", which means TSV.
    (Here, too, you may encounter some variation between versions of Excel.)
    Note that the default file name suffix will be .txt when; you may
    wish to override this, reserving .txt for plain text files.

    Saving as TSV generally loses formatting information.
    Quite often this is desirable, since if you e.g. need to E-mail some data
    or put a table onto the Web, information about
    the specific font faces, cell widths, and other
    presentational details are not needed and can cause problems.
    Note that
    Excel's own "Save As HTML" includes a horrendous amount of
    such data
    . There is software called
    Office 2000 HTML Filter 2.0
    available from Microsoft for free, but it fixes only part of the
    problem (for Excel files - for Word files, it's pretty good).

    You could use this possibility for the job of making data
    available on the Internet, or in an intranet. If you have the data in Excel
    format, it will be less universally accessible than e.g. TSV format. Although there are
    Excel viewers available free of charge for Windows, not everyone
    wants to install such software just to casually view some tabular data,
    and not everyone uses Windows at all. And even if people have Excel installed,
    they might be unwilling to open documents from unknown origin, in fear of
    macro viruses for example. So TSV might be a good idea at least as an alternative.
    You could always make the data available both in Excel format and as TSV, especially
    since saving as TSV may remove some formatting information.

    Converting from TSV format to HTML table format

    For putting data onto the Web in particular, you could additionally
    convert the data from TSV format into an
    HTML table and
    make it a separate HTML document, or insert it into an HTML document, e.g. as follows:

    Name Age Address
    Paul231115 W Franklin
    Bessy the Cow5Big Farm Way
    Zeke45W Main St

    Since TSV is a very simple format, and since the basic structure of HTML
    tables is very simple, such a conversion is rather trivial as soon as you
    have any programming tool. At the simplest, you could simply add the
    <table> tag before the data
    and the
    </table> tag after it, and
    insert
    <tr><td> at the start of each data line,
    and replace each tab with a <td>.
    That way, each line would be turned to a table row and each field to a table cell.
    However, this would misrepresent the first line, which is a line of header fields,
    at least by TSV definition, so it would be better to use th rather than
    td there. There are some additional considerations which make a
    somewhat more advanced strategy better. The following Perl code
    (also available as a separate file)
    tries to produce nice HTML markup and uses very simple heuristics:
    it assumes that any field which contains only digits
    and spaces
    should be right-aligned in a cell.

    print <<END;
    <table border="1" cellspacing="0" cellpadding="4">
    <thead>
    <tr><th>
    END
    chomp ($_ = <STDIN>);
    s?\t?</th><th>?g;
    print "$_</th></tr>\n";
    print "</tr>\n</thead>\n<tbody>\n";
    while (<>) {
        print "<tr>\n";
        my @fields = split('\t');
        for $cell(@fields) {
    	if($cell =~ /^(\d|\s)+$/) {
    	    print "<td align=\"right\">$cell</td>"; }
    	else {
    	    print "<td>$cell</td>"; } }
        print "</tr>\n"; }
    print "</tr>\n</tbody>\n</table>\n";

    The code creates just the table markup, so the result can be simply
    inserted into an existing HTML document. The example table above was created that way.

    You don't need to know
    Perl
    in order to use that program.
    (Of course you'd need some understanding of Perl in order to modify
    the program for your needs. But you might also consider editing the result
    using an editor, if you just wish to e.g. add an align="left"
    attribute into the first <tr> tag, to make the header cells
    left-aligned.)
    But you need a Perl interpreter, and you
    know how to run it. (Typically, you would, after saving the program under the
    name tsv2h.pl in a file, and having TSV data in a file tsv.tsv,
    go to command prompt level and
    just give a command like perl tsv2h.pl <tsv.tsv >tsvex.html
    to get the converted data into file tsv.html.)
    There are
    free Perl interpreters available for virtually every platform.
    For Win32, many people recommend ActivePerl.

  • 相关阅读:
    springboot---Shiro
    spring MVC 使用 modelAndView.setViewName("forward:*.action") 发送重定向
    数据库负载均衡、读写分离技术
    redis详解
    分布式--锁
    springboot---数据整合篇
    hadoop安装单机
    NIO系列之MINA
    JAVA中IO流模型BIO,NIO,AIO
    CPU-bound(计算密集型) 和I/O bound(I/O密集型)
  • 原文地址:https://www.cnblogs.com/lexus/p/2520761.html
Copyright © 2020-2023  润新知