Description

Command line utilities for tabular data files.

Package Information

Version 1.0.13 (2017-Jan-21)
Homepage https://github.com/eBay/tsv-utils-dlang
Repository https://github.com/eBay/tsv-utils-dlang
License BSL-1.0
Copyright Copyright (c) 2015-2017, eBay Software Foundation
Authors Jon Degenhardt
Registered by Jon Degenhardt
Sub packages
./common/
[directory based sub package]
./csv2tsv/
[directory based sub package]
./number-lines/
[directory based sub package]
./tsv-append/
[directory based sub package]
./tsv-filter/
[directory based sub package]
./tsv-join/
[directory based sub package]
./tsv-sample/
[directory based sub package]
./tsv-select/
[directory based sub package]
./tsv-summarize/
[directory based sub package]
./tsv-uniq/
[directory based sub package]
Dependencies

tsv-utils-dlang:csv2tsv

tsv-utils-dlang:tsv-filter

tsv-utils-dlang:tsv-sample

tsv-utils-dlang:common

tsv-utils-dlang:number-lines

tsv-utils-dlang:tsv-summarize

tsv-utils-dlang:tsv-join

tsv-utils-dlang:tsv-select

tsv-utils-dlang:tsv-uniq

tsv-utils-dlang:tsv-append

Installation

To use this package, put the following dependency into your project's dependencies section:

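The exact form depends on whether the project uses dub.json or dub.sdl. A typical entry for this release would look like the following (the version constraint shown is only illustrative):

    dub.json:  "tsv-utils-dlang": "~>1.0.13"
    dub.sdl:   dependency "tsv-utils-dlang" version="~>1.0.13"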

Readme

Command line utilities for tabular data files

This is a set of command line utilities for working with tab-separated value files. They were originally developed in Perl and used for day-to-day work in a large scale data mining environment. One of the tools was re-written in D as an exercise exploring the language. Significant performance gains and agreeable programmer characteristics soon led to writing the other utilities in D as well.

The tools have been made available in the hope they will benefit others needing similar tools or who are considering D as a programming language.

Information on the D programming language is available at: http://dlang.org/.


The tools

These tools were developed for working with reasonably large data files: larger than ideal for loading entirely into memory in an application like R, but not so big as to necessitate moving to Hadoop or similar distributed compute environments. They work like traditional Unix command line utilities such as cut, sort, grep, etc., and are intended to complement those tools. Each tool is a standalone executable. They follow common Unix conventions for pipeline programs: data is read from files or standard input, and results are written to standard output. The field separator defaults to TAB, but any character can be used. Input and output are UTF-8, and all operations are Unicode ready, including regular expression matching (tsv-filter). Documentation is available for each tool by invoking it with the --help option. If reading the code, look for the helpText variable near the top of the file.

The rest of this section contains a short description of each tool. There is more detail in the tool reference section towards the end of the file.

  • tsv-filter - Filter data file rows via numeric and string comparisons.
  • tsv-select - Keep a subset of the columns (fields) in the input.
  • tsv-summarize - Aggregate field values, summarizing across the entire file or grouped by key.
  • tsv-join - Join lines from multiple files using fields as a key.
  • tsv-append - Concatenate TSV files. Header aware; supports source file tracking.
  • tsv-uniq - Filter out duplicate lines using fields as a key.
  • tsv-sample - Uniform and weighted random sampling or permutation of input lines.
  • csv2tsv - Convert CSV files to TSV.
  • number-lines - Number the input lines.
  • Useful bash aliases
  • Other toolkits

tsv-filter

Outputs select lines by making numeric and string comparisons against individual fields. Multiple comparisons can be specified in a single call. A variety of numeric and string comparison operators are available as well as regular expressions. Example:

$ tsv-filter --ge 3:100 --le 3:200 --str-eq 4:red file.tsv

This outputs lines where field 3 satisfies (100 <= fieldval <= 200) and field 4 matches 'red'.

tsv-filter is the most widely applicable of the tools, as dataset pruning is a common task. It is stream oriented, so it can handle arbitrarily large files. It is quite fast, faster than other tools the author has tried. This makes it ideal for preparing data for applications like R and Pandas. It is also convenient for quickly answering simple questions about a dataset. For example, to count the number of records with a non-zero value in field 3, use the command:

$ tsv-filter --ne 3:0 file.tsv | wc -l

See the tsv-filter reference for details.

tsv-select

A version of the Unix cut utility with the additional ability to re-order the fields. It also helps with header lines by keeping only the header from the first file (--header option). The following command writes fields [4, 2, 9] from a pair of files to stdout:

$ tsv-select -f 4,2,9 file1.tsv file2.tsv

Reordering fields and managing headers are useful enhancements over cut. However, much of the motivation for writing it was to explore the D programming language and provide a comparison point against other common approaches to this task. The code for tsv-select is a bit more liberal with comments pointing out D programming constructs than the code for the other tools.

See the tsv-select reference for details.

tsv-summarize

tsv-summarize runs aggregation operations on fields. For example, generating the sum or median of a field's values. Summarization calculations can be run across the entire input or can be grouped by key fields. As an example, consider the file data.tsv:

color   weight
red     6
red     5
blue    15
red     4
blue    10

Calculations of the sum and mean of the weight column are shown below. The first command runs the calculations on all values. The second groups them by color.

$ tsv-summarize --header --sum 2 --mean 2 data.tsv
weight_sum  weight_mean
40          8

$ tsv-summarize --header --group-by 1 --sum 2 --mean 2 data.tsv
color  weight_sum  weight_mean
red    15          5
blue   25          12.5

Multiple fields can be used as the --group-by key. The file's sort order does not matter; there is no need to sort in the --group-by order first.

See the tsv-summarize reference for the list of statistical and other aggregation operations available.

tsv-join

Joins lines from multiple files based on a common key. One file, the 'filter' file, contains the records (lines) being matched. The other input files are scanned for matching records. Matching records are written to standard output, along with any designated fields from the filter file. In database parlance this is a hash semi-join. Example:

$ tsv-join --filter-file filter.tsv --key-fields 1,3 --append-fields 5,6 data.tsv

This reads filter.tsv, creating a lookup table keyed on fields 1 and 3. data.tsv is read, lines with a matching key are written to standard output with fields 5 and 6 from filter.tsv appended. This is a form of inner-join. Outer-joins and anti-joins can also be done.

Common uses for tsv-join are to join related datasets or to filter one dataset based on another. Filter file entries are kept in memory, which limits the ultimate size that can be handled effectively. The author has found that filter files up to about 10 million lines are processed effectively, but performance starts to degrade after that.

See the tsv-join reference for details.

tsv-append

tsv-append concatenates multiple TSV files, similar to the Unix cat utility. It is header aware, writing the header from only the first file. It also supports source tracking, adding a column indicating the original file to each row.

Concatenation with header support is useful when preparing data for traditional Unix utilities like sort and sed or applications that read a single file.

Source tracking is useful when creating long/narrow form tabular data. This format is used by many statistics and data mining packages. (See Wide & Long Data - Stanford University or Hadley Wickham's Tidy data for more info.)

In this scenario, files have been used to capture related data sets, the difference between data sets being a condition represented by the file. For example, results from different variants of an experiment might each be recorded in their own files. Retaining the source file as an output column preserves the condition represented by the file. The source values default to the file names, but this can be customized.
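For example, the command below (file names are hypothetical) concatenates results from two experiment variants, keeping a single header and adding a source column via the -H and -t options described in the tsv-append reference:

$ tsv-append -H -t results-variant1.tsv results-variant2.tsv > results-all.tsv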

See the tsv-append reference for the complete list of options available.

tsv-uniq

Similar in spirit to the Unix uniq tool, tsv-uniq filters a dataset so there is only one copy of each line. tsv-uniq goes beyond Unix uniq in a couple ways. First, data does not need to be sorted. Second, equivalence is based on a subset of fields rather than the full line. tsv-uniq can also be run in an 'equivalence class identification' mode, where equivalent entries are marked with a unique id rather than being filtered. An example uniq'ing a file on fields 2 and 3:

$ tsv-uniq -f 2,3 data.tsv

tsv-uniq operates on the entire line when no fields are specified. This is a useful alternative to the traditional sort -u or sort | uniq paradigms for identifying unique lines in unsorted files, as it is often quite a bit faster.
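For example, the following two commands produce the same set of unique lines (though in different order), with tsv-uniq avoiding the sort step (data.txt is a placeholder file name):

$ sort data.txt | uniq
$ tsv-uniq data.txt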

As with tsv-join, this uses an in-memory lookup table to record unique entries. This ultimately limits the data sizes that can be processed. The author has found that datasets with up to about 10 million unique entries work fine, but performance degrades after that.

See the tsv-uniq reference for details.

tsv-sample

For uniform random sampling, the GNU shuf program is quite good and widely available. For weighted random sampling the choices are limited, especially when working with large files. This is where tsv-sample is useful. It implements weighted reservoir sampling, with the weights taken from a field in the input data. Uniform random sampling is supported as well. Performance is good; it works quite well on large files. See the tsv-sample reference for details.

csv2tsv

Sometimes you have a CSV file. This program does what you expect: convert CSV data to TSV. Example:

$ csv2tsv data.csv > data.tsv

CSV files come in different formats. See the csv2tsv reference for details of how this tool operates and the format variations handled.

number-lines

A simpler version of the Unix 'nl' program. It prepends a line number to each line read from files or standard input. This tool was written primarily as an example of a simple command line tool. The code structure it uses is the same as that followed by all the other tools. Example:

$ number-lines myfile.txt

See the number-lines reference for details.

Useful bash aliases

Any number of convenient utilities can be created using shell facilities. A couple are given below. One of the most useful is tsv-header, which shows the field number for each column name in the header. This is very useful when using numeric field indexes.

  • tsv-header <file> - Outputs the column numbers and fields names for the file header (first line).
  • tsv-sort [options] [file...] - Runs sort, but with field separator set to TAB. Convenient when sorting on specific fields.

If you are using a bash shell, add the definitions below to .bashrc or another init file. Similar aliases can be created for shells other than bash.

tsv-header () { head -n 1 "$@" | tr $'\t' '\n' | nl ; }
tsv-sort () { sort -t $'\t' "$@" ; }
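Example usage of these aliases, assuming a hypothetical data.tsv file:

$ tsv-header data.tsv          # List column numbers and names from the header
$ tsv-sort -k2,2n data.tsv     # Sort numerically on field 2, TAB as separator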

Other toolkits

There are a number of toolkits that have similar or related functionality. Several are listed below. Those handling CSV files handle TSV files as well:

  • csvkit - CSV tools, written in Python.
  • csvtk - CSV tools, written in Go.
  • GNU datamash - Performs numeric, textual and statistical operations on TSV files. Written in C.
  • dplyr - Tools for tabular data in R storage formats. Runs in an R environment, code is in C++.
  • miller - CSV and JSON tools, written in C.
  • tsvutils - TSV tools, especially rich in format converters. Written in Python.
  • xsv - CSV tools, written in Rust.

The different toolkits are certainly worth investigating if you work with tabular data files. Several have quite extensive feature sets. Each toolkit has its own strengths, your workflow and preferences are likely to fit some toolkits better than others.

File format is perhaps the most important dimension. CSV files cannot be processed reliably by traditional Unix tools, so CSV toolkits naturally extend further into this space. However, this tends to increase the complexity of the tools when working with TSV files.

Tradeoffs between file formats are a topic of their own. The tsvutils README has a nice discussion of the rationale for using TSV files. Note that many numeric CSV data sets use comma as a separator but don't use CSV escapes. Such data sets can be processed reliably by Unix tools and this toolset by setting the delimiter character.
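As a minimal sketch, a comma-delimited numeric file without CSV escapes can be filtered directly by changing the delimiter (the file name and field number are illustrative):

$ tsv-filter --delimiter , --ge 3:100 data.csv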

An even broader list of tools can be found here: Structured text tools.

Installation

Download a D compiler (http://dlang.org/download.html). These tools have been tested with the DMD and LDC compilers, on Mac OSX and Linux. Use DMD version 2.070 or later, LDC version 1.0.0 or later.

Clone this repository, select a compiler, and run make from the top level directory:

$ git clone https://github.com/eBay/tsv-utils-dlang.git
$ cd tsv-utils-dlang
$ make         # For LDC: make DCOMPILER=ldc2

Executables are written to tsv-utils-dlang/bin; place this directory or the executables in the PATH. The compiler defaults to DMD; this can be changed on the make command line (e.g. make DCOMPILER=ldc2). DMD is the reference compiler, but LDC produces faster executables. (For some tools LDC is quite a bit faster than DMD.)

DUB, the D Package Manager, can also be used to install and build the executables. It is also possible to run build commands manually; see BUILD_COMMANDS for details.

Install using DUB

If you are already a D user you likely use DUB, the D package manager. DUB comes packaged with DMD starting with DMD 2.072. You can install and build using DUB as follows:

$ dub fetch tsv-utils-dlang
$ dub run tsv-utils-dlang    # For LDC: dub run tsv-utils-dlang -- --compiler=ldc

The dub run command compiles all the tools. The executables are written to a DUB package repository directory, for example: ~/.dub/packages/tsv-utils-dlang-1.0.8/bin. Add the executables to the PATH. Installation to a DUB package repository is not always the most convenient. As an alternative, clone the repository and run dub from the source directory. This puts the executables in the tsv-utils-dlang/bin directory:

$ git clone https://github.com/eBay/tsv-utils-dlang.git
$ dub add-local tsv-utils-dlang
$ cd tsv-utils-dlang
$ dub run      # For LDC: dub run -- --compiler=ldc2

See the Building and makefile section for more information.

The code


Code structure

There is a directory for each tool, plus one directory for shared code (common). The tools all have a similar structure. Code is typically in one file, e.g. tsv-uniq.d. Functionality is broken into three pieces:

  • A class managing command line options. e.g. tsvUniqOptions.
  • A function reading input and processing each line. e.g. tsvUniq.
  • A main routine putting it all together.

Documentation for each tool is found near the top of the main file, both in the help text and the option documentation.

The simplest tool is number-lines. It is useful as an illustration of the code outline followed by the other tools. tsv-select and tsv-uniq also have straightforward functionality, but employ a few more D programming concepts. tsv-select uses templates and compile-time programming in a somewhat less common way; it may be clearer after gaining some familiarity with D templates. A non-templatized version of the source code is included for comparison.

tsv-append has a simple code structure. It's one of the newer tools. Its only additional complexity is that it writes to an 'output range' rather than directly to standard output. This enables better encapsulation for unit testing. tsv-sample, another new tool, is written in a similar fashion. The code is only a bit more complicated, but the algorithm is much more interesting.

tsv-join and tsv-filter also have relatively straightforward functionality, but support more use cases resulting in more code. tsv-filter in particular has more elaborate setup steps that take a bit more time to understand. tsv-filter uses several features like delegates (closures) and regular expressions not used in the other tools.

tsv-summarize is one of the more recent tools. It uses a more object oriented style than the other tools, which makes it relatively easy to add new operations. It also makes quite extensive use of built-in unit tests.

The common directory has code shared by the tools. At present this is very limited: one helper class written as a template. In addition to being an example of a simple template, it also makes use of D ranges, a very useful sequence abstraction, and built-in unit tests.

New tools can be added by creating a new directory and a source tree following the same pattern as one of the existing tools.

Coding philosophy

The tools were written in part to explore D for use in a data science environment. Data mining environments have custom data and application needs. This leads to custom tools, which in turn raises the productivity vs execution speed question. This trade-off is exemplified by interpreted languages like Python on the one hand and system languages like C/C++ on the other. The D programming language occupies an interesting point on this spectrum. D's programmer experience is somewhere in the middle ground between interpreted languages and C/C++, but run-time performance is closer to C/C++. Execution speed is a very practical consideration in data mining environments: it increases the dataset sizes that can be handled on the researcher's own machine, without needing to switch to a distributed compute environment. There is additional value in having data science practitioners program these tools quickly, themselves, without needing to invest time in low-level programming.

These tools were implemented with these trade-offs in mind. The code was deliberately kept at a reasonably high level. The obvious built-in facilities were used, notably the standard library. A certain amount of performance optimization was done to explore this dimension of D programming, but low-level optimizations were generally avoided. Indeed, there are options likely to improve performance, notably:

  • Custom I/O buffer management, including reading entire files into memory.
  • Custom hash tables rather than built-in associative arrays.
  • Avoiding garbage collection.

A useful aspect of D is that additional optimizations can be made as the need arises. Coding of these tools did utilize several optimizations that might not have been done in an initial effort. These include:

  • The helper class in the common directory. This is an optimization for processing only the first N fields needed for the particular invocation of the tool.
  • The template expansion done in tsv-select.
  • Reusing arrays on every input line, without re-allocating. Some programmers would do this naturally on the first attempt; for others it would be a second pass optimization.

Building and makefile

Make setup

The makefile setup is very simplistic. It works reasonably in this case because the tools are small and have a very simple code structure, but it is not a setup that will scale to more complex programs. make can be run from the top level directory or from the individual tool directories. Available commands:

  • make release (default) - This builds the tools in release mode. Executables go in the bin directory.
  • make debug - Makes debug versions of the tools (with a .dbg extension).
  • make clean - Deletes executables and intermediate files.
  • make test - Makes debug versions of the tools and runs all tests.
  • make test-release - Makes release versions of the tools and runs all tests.
  • make test-nobuild - Runs tests against the current app builds. This is useful when using DUB to build.

Builds can be customized by changing the settings in makedefs.mk. The most basic customization is the compiler choice; this is controlled by the DCOMPILER variable.

DUB package setup

A parallel build setup was created using DUB packages. This was done to better align with the D ecosystem. However, at present DUB does not have first class support for multiple executables, and this setup pushes the boundaries of what works smoothly. That said, the setup appears to work well. One specific piece of functionality not supported is the test capabilities. However, after building with DUB, tests can be run using the makefile setup. Here's an example:

$ cd tsv-utils-dlang
$ dub run
$ dub test tsv-utils-dlang:common
$ make test-nobuild

Unit tests

D has an excellent facility for adding unit tests right with the code. The common utility functions and the more recent tools take advantage of built-in unit tests. However, the earlier tools do not, and instead use a more traditional approach: invoking the command line executables and diffing the output against a "gold" result set. The more recent tools use both built-in unit tests and tests against the executable. This includes csv2tsv, tsv-summarize, tsv-append, and tsv-sample. The built-in unit tests are much nicer, and also have the advantage of being naturally cross-platform. The command line executable tests assume a Unix shell.

Tests for the command line executables are in the tests directory of each tool. Overall the tests cover a fair number of cases and are quite useful checks when modifying the code. They may also be helpful as examples of command line tool invocations. See the tests.sh file in each test directory, and the test makefile target in makeapp.mk.

The unit test built into the common code (common/src/tsvutil.d) illustrates a useful interaction with templates: it is quite easy and natural to unit test template instantiations that might not occur naturally in the application being written along with the template.

Performance

Performance is a key motivation for writing tools like this in D rather than an interpreted language like Python or Perl. It is also a consideration in choosing between D and C/C++.

The tools created don't by themselves enable a proper benchmark comparison. Equivalent tools written in the other languages would be needed for that. Still, there were a couple of benchmarks that could be done to get a high level view of performance. These are given in this section.

Overall the D programs did well. Not as fast as a highly optimized C/C++ program, but meaningfully better than Python and Perl. Perl in particular fared quite poorly in these comparisons.

Perhaps the most surprising result is the poor performance of the utilities shipped with the Mac (cut, etc). It's worth installing the latest GNU coreutils versions if you are running on a Mac. (MacPorts and Homebrew are popular package managers that can install GNU tools.)

tsv-select performance

tsv-select is a variation on Unix cut, so cut is a reasonable comparison. Another popular option for this task is awk, which can be used to reorder fields. Simple versions of cut can be written easily in Python and Perl, which is what was done for these tests. (Code is in the benchmarks directory.) Timings were run on both a Macbook Pro (2.8 GHz Intel I7, 16GB ram, flash storage) and a Linux server (Ubuntu, Intel Xeon, 6 cores). They were run against a 2.5GB TSV file with 78 million lines, 11 fields per line. Most fields contained numeric data. These runs use cut -f 1,4,7 or the equivalent. Each program was run several times and the best time recorded.

Macbook Pro (2.8 GHz Intel I7, 16GB ram, flash storage); File: 78M lines, 11 fields, 2.5GB:

| Tool                  | version        | time (seconds) |
| --------------------- | -------------- | -------------: |
| cut (GNU)             | 8.25           | 17.4           |
| tsv-select (D)        | ldc 1.0        | 31.4           |
| mawk (M. Brennan Awk) | 1.3.4 20150503 | 51.1           |
| cut (Mac built-in)    |                | 81.8           |
| gawk (GNU awk)        | 4.1.3          | 97.4           |
| python                | 2.7.10         | 144.1          |
| perl                  | 5.22.1         | 231.3          |
| awk (Mac built-in)    | 20070501       | 247.3          |

Linux server (Ubuntu, Intel Xeon, 6 cores); File: 78M lines, 11 fields, 2.5GB:

| Tool                  | version        | time (seconds) |
| --------------------- | -------------- | -------------: |
| cut (GNU)             | 8.25           | 19.8           |
| tsv-select (D)        | ldc 1.0        | 29.7           |
| mawk (M. Brennan Awk) | 1.3.3 Nov 1996 | 51.3           |
| gawk (GNU awk)        | 3.1.8          | 128.1          |
| python                | 2.7.3          | 179.4          |
| perl                  | 5.14.2         | 665.0          |

GNU cut is best viewed as baseline for a well optimized program, rather than a C/C++ vs D comparison point. D's performance for this tool seems quite reasonable. The D version also handily beat the version of cut shipped with the Mac, also a C program, but clearly not as well optimized.

tsv-filter performance

tsv-filter can be compared to Awk, and the author already had a Perl version of tsv-filter. These measurements were run against four Google ngram files: 256 million lines, 4 fields, 5GB. Same compute boxes as for the tsv-select tests. The tsv-filter and awk/gawk/mawk invocations:

$ cat <ngram-files> | tsv-filter --ge 4:50 > /dev/null
$ cat <ngram-files> | awk  -F'\t' '{ if ($4 >= 50) print $0 }' > /dev/null

Each line in the file has statistics for an ngram in a single year. The above commands return all lines where the ngram-year pair occurs in at least 50 books.

Macbook Pro (2.8 GHz Intel I7, 16GB ram, flash storage); File: 256M lines, 4 fields, 4GB:

| Tool                  | version        | time (seconds) |
| --------------------- | -------------- | -------------: |
| tsv-filter (D)        | ldc 1.0        | 33.5           |
| mawk (M. Brennan Awk) | 1.3.4 20150503 | 52.0           |
| gawk (GNU awk)        | 4.1.3          | 103.4          |
| awk (Mac built-in)    | 20070501       | 314.2          |
| tsv-filter (Perl)     |                | 1075.6         |

Linux server (Ubuntu, Intel Xeon, 6 cores); File: 256M lines, 4 fields, 4GB:

| Tool                  | version        | time (seconds) |
| --------------------- | -------------- | -------------: |
| tsv-filter (D)        | ldc 1.0        | 34.2           |
| mawk (M. Brennan Awk) | 1.3.3 Nov 1996 | 72.9           |
| gawk (GNU awk)        | 3.1.8          | 215.4          |
| tsv-filter (Perl)     | 5.14.2         | 1255.2         |

Relative performance of the tools

Runs against a 4.5 million line, 279 MB file were used to get a relative comparison of the tools. The original file was a CSV file, allowing inclusion of csv2tsv. The TSV file generated was used in the other runs. Running time of routines filtering data is dependent on the amount of output, so different output sizes were used. tsv-join depends on the size of the filter file; a file the same size as the output was used in these tests. Performance of these tools also depends on the options selected, so actuals will vary.

Macbook Pro (2.8 GHz Intel I7, 16GB ram, flash storage); File: 4.46M lines, 8 fields, 279MB:

| Tool         | Records output | Time (seconds) |
| ------------ | -------------: | -------------: |
| tsv-filter   | 513788         | 0.76           |
| cut (GNU)    | 4465613        | 1.16           |
| number-lines | 4465613        | 1.21           |
| tsv-filter   | 4125057        | 1.25           |
| tsv-uniq     | 65537          | 1.56           |
| tsv-join     | 65537          | 1.61           |
| tsv-select   | 4465613        | 1.81           |
| tsv-uniq     | 4465613        | 4.34           |
| csv2tsv      | 4465613        | 6.49           |
| tsv-join     | 4465613        | 7.51           |

Performance of tsv-filter looks especially good, even when outputting a large number of records. It's not far off GNU cut. tsv-join and tsv-uniq are fast, but show an impact when larger hash tables are needed (4.5M entries in the slower cases). csv2tsv is a bit slower than the other tools for reasons that are not clear. It uses mechanisms not used in the other tools.

Tool reference

This section provides more detailed documentation about the different tools as well as examples. Material for the individual tools is also available via the --help option.

Common options and behavior

Information in this section applies to all the tools.

Specifying options

Multi-letter options are specified with a double dash. Single letter options can be specified with a single dash or double dash. For example:

$ tsv-uniq -h      # Valid
$ tsv-uniq --h     # Valid
$ tsv-uniq --help  # Valid
$ tsv-uniq -help   # Invalid.

Help (-h, --help, --help-verbose)

All tools print help if given the -h or --help option. Many of the tools provide more details with the --help-verbose option.

Field indices

Field indices are one-upped integers, following Unix conventions. Some tools use zero to represent the entire line (tsv-join, tsv-uniq).

UTF-8 input

These tools assume data is UTF-8 encoded.

File format and alternate delimiters (--delimiter)

Any character can be used as a delimiter; TAB is the default. However, there is no escaping for including the delimiter character or newlines within a field. This differs from the CSV format, which provides an escaping mechanism. In practice the lack of an escaping mechanism is not a meaningful limitation for data oriented files.

Aside from a header line, all lines are expected to have data. There is no comment mechanism and no special handling for blank lines. Tools taking field indices as arguments expect the specified fields to be available on every line.

Headers (-H, --header)

Most tools handle the first line of files as a header when given the -H or --header option. For example, tsv-filter passes the header through without filtering it. When --header is used, all files and stdin are assumed to have header lines. Only one header line is written to stdout. If multiple files are being processed, header lines from subsequent files are discarded.

Multiple files and standard input

Tools can read from any number of files and from standard input. As per typical Unix behavior, a single dash represents standard input when included in a list of files. Terminate non-file arguments with a double dash (--) when using a single dash in this fashion. Example:

$ head -n 1000 file-c.tsv | tsv-filter --eq 2:1000 -- file-a.tsv file-b.tsv - > out.tsv

The above passes file-a.tsv, file-b.tsv, and the first 1000 lines of file-c.tsv to tsv-filter and writes the results to out.tsv.

tsv-filter reference

Synopsis: tsv-filter [options] [file...]

Filter lines of tab-delimited files via comparison tests against fields. Multiple tests can be specified; by default they are evaluated as an AND clause. Lines satisfying the tests are written to standard output.

General options:

  • --help - Print help.
  • --help-verbose - Print detailed help.
  • --help-options - Print the options list by itself.
  • --H|header - Treat the first line of each file as a header.
  • --d|delimiter CHR - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.)
  • --or - Evaluate tests as an OR rather than an AND. This applies globally.
  • --v|invert - Invert the filter, printing lines that do not match. This applies globally.

Tests:

Empty and blank field tests:

  • --empty FIELD - True if field is empty (no characters)
  • --not-empty FIELD - True if field is not empty.
  • --blank FIELD - True if field is empty or all whitespace.
  • --not-blank FIELD - True if field contains a non-whitespace character.

Numeric type tests:

  • --is-numeric FIELD - True if the field can be interpreted as a number.
  • --is-finite FIELD - True if the field can be interpreted as a number, and it is not NaN or infinity.
  • --is-nan FIELD - True if the field is NaN (including: "nan", "NaN", "NAN").
  • --is-infinity FIELD - True if the field is infinity (including: "inf", "INF", "-inf", "-INF")

Numeric comparisons:

  • --le FIELD:NUM - FIELD <= NUM (numeric).
  • --lt FIELD:NUM - FIELD < NUM (numeric).
  • --ge FIELD:NUM - FIELD >= NUM (numeric).
  • --gt FIELD:NUM - FIELD > NUM (numeric).
  • --eq FIELD:NUM - FIELD == NUM (numeric).
  • --ne FIELD:NUM - FIELD != NUM (numeric).

String comparisons:

  • --str-le FIELD:STR - FIELD <= STR (string).
  • --str-lt FIELD:STR - FIELD < STR (string).
  • --str-ge FIELD:STR - FIELD >= STR (string).
  • --str-gt FIELD:STR - FIELD > STR (string).
  • --str-eq FIELD:STR - FIELD == STR (string).
  • --istr-eq FIELD:STR - FIELD == STR (string, case-insensitive).
  • --str-ne FIELD:STR - FIELD != STR (string).
  • --istr-ne FIELD:STR - FIELD != STR (string, case-insensitive).
  • --str-in-fld FIELD:STR - FIELD contains STR (substring search).
  • --istr-in-fld FIELD:STR - FIELD contains STR (substring search, case-insensitive).
  • --str-not-in-fld FIELD:STR - FIELD does not contain STR (substring search).
  • --istr-not-in-fld FIELD:STR - FIELD does not contain STR (substring search, case-insensitive).

Regular expression tests:

  • --regex FIELD:REGEX - FIELD matches regular expression.
  • --iregex FIELD:REGEX - FIELD matches regular expression, case-insensitive.
  • --not-regex FIELD:REGEX - FIELD does not match regular expression.
  • --not-iregex FIELD:REGEX - FIELD does not match regular expression, case-insensitive.

Field to field comparisons:

  • --ff-le FIELD1:FIELD2 - FIELD1 <= FIELD2 (numeric).
  • --ff-lt FIELD1:FIELD2 - FIELD1 < FIELD2 (numeric).
  • --ff-ge FIELD1:FIELD2 - FIELD1 >= FIELD2 (numeric).
  • --ff-gt FIELD1:FIELD2 - FIELD1 > FIELD2 (numeric).
  • --ff-eq FIELD1:FIELD2 - FIELD1 == FIELD2 (numeric).
  • --ff-ne FIELD1:FIELD2 - FIELD1 != FIELD2 (numeric).
  • --ff-str-eq FIELD1:FIELD2 - FIELD1 == FIELD2 (string).
  • --ff-istr-eq FIELD1:FIELD2 - FIELD1 == FIELD2 (string, case-insensitive).
  • --ff-str-ne FIELD1:FIELD2 - FIELD1 != FIELD2 (string).
  • --ff-istr-ne FIELD1:FIELD2 - FIELD1 != FIELD2 (string, case-insensitive).
  • --ff-absdiff-le FIELD1:FIELD2:NUM - abs(FIELD1 - FIELD2) <= NUM
  • --ff-absdiff-gt FIELD1:FIELD2:NUM - abs(FIELD1 - FIELD2) > NUM
  • --ff-reldiff-le FIELD1:FIELD2:NUM - abs(FIELD1 - FIELD2) / min(abs(FIELD1), abs(FIELD2)) <= NUM
  • --ff-reldiff-gt FIELD1:FIELD2:NUM - abs(FIELD1 - FIELD2) / min(abs(FIELD1), abs(FIELD2)) > NUM

Examples:

Basic comparisons:

$ # Field 2 non-zero
$ tsv-filter --ne 2:0 data.tsv

$ # Field 1 == 0 and Field 2 >= 100, first line is a header.
$ tsv-filter --header --eq 1:0 --ge 2:100 data.tsv

$ # Field 1 == -1 or Field 1 > 100
$ tsv-filter --or --eq 1:-1 --gt 1:100

$ # Field 3 is foo, Field 4 contains bar
$ tsv-filter --header --str-eq 3:foo --str-in-fld 4:bar data.tsv

$ # Field 3 == field 4 (numeric test)
$ tsv-filter --header --ff-eq 3:4 data.tsv
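A couple of additional illustrations using the field-to-field and global options listed above (the data file and field numbers are hypothetical):

$ # Fields 2 and 3 are within 1% of each other
$ tsv-filter --ff-reldiff-le 2:3:0.01 data.tsv

$ # Invert the filter: print lines where field 4 is not 'red'
$ tsv-filter --invert --str-eq 4:red data.tsv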

Regular expressions:

The regular expression syntax is that defined by D (http://dlang.org/phobos/std_regex.html). However, basic syntax is rather standard, and forms commonly used with other tools usually work as expected. This includes Unicode character classes.

$ # Field 2 has a sequence with two a's, one or more digits, then 2 a's.
$ tsv-filter --regex '2:aa[0-9]+aa' data.tsv

$ # Same thing, except the field starts and ends with the two a's.
$ tsv-filter --regex '2:^aa[0-9]+aa$' data.tsv

$ # Field 2 is a sequence of "word" characters with two or more embedded whitespace sequences
$ tsv-filter --regex '2:^\w+\s+(\w+\s+)+\w+$' data.tsv

$ # Field 2 containing at least one cyrillic character.
$ tsv-filter --regex '2:\p{Cyrillic}' data.tsv

Short-circuiting expressions:

Numeric tests like --gt (greater-than) assume field values can be interpreted as numbers. An error occurs if the field cannot be parsed as a number, halting the program. This can be avoided by including a test ensuring the field is recognizable as a number. For example:

$ # Ensure field 2 is a number before testing for greater-than 10.
$ tsv-filter --is-numeric 2 --gt 2:10 data.tsv

$ # Ensure field 2 is a number, not NaN or infinity before testing for greater-than 10.
$ tsv-filter --is-finite 2 --gt 2:10 data.tsv

The above tests work because tsv-filter short-circuits evaluation, only running as many tests as necessary to filter each line. Tests are run in the order listed on the command line. In the first example, if --is-numeric 2 is false, the remaining tests do not get run.

tsv-select reference

Synopsis: tsv-select -f n[,n...] [options] [file...]

tsv-select reads files or standard input and writes specified fields to standard output in the order listed. Similar to 'cut' with the ability to reorder fields. Fields can be listed more than once, and fields not listed can be output using the --rest option. When working with multiple files, the --header option can be used to retain only the header from the first file.

Options:

  • --h|help - Print help.
  • --H|header - Treat the first line of each file as a header.
  • --f|fields n[,n...] - (Required) Fields to extract. Fields are output in the order listed.
  • --r|rest none|first|last - Location for remaining fields. Default: none
  • --d|delimiter CHR - Character to use as field delimiter. Default: TAB. (Single byte UTF-8 characters only.)

Examples:

$ # Output fields 2 and 1, in that order
$ tsv-select -f 2,1 data.tsv

$ # Move field 1 to the end of the line
$ tsv-select -f 1 --rest first data.tsv

$ # Move fields 7 and 3 to the start of the line
$ tsv-select -f 7,3 --rest last data.tsv

tsv-summarize reference

Synopsis: tsv-summarize [options] file [file...]

tsv-summarize reads tabular data files (tab-separated by default), tracks field values for each unique key, and runs summarization algorithms. Consider the file data.tsv:

make    color   time
ford    blue    131
chevy   green   124
ford    red     128
bmw     black   118
bmw     black   126
ford    blue    122

The min and average times for each make are generated by the command:

$ tsv-summarize --header --group-by 1 --min 3 --mean 3 data.tsv

This produces:

make   time_min time_mean
ford   122      127
chevy  124      124
bmw    118      122

Using --group-by 1,2 will group by both 'make' and 'color', as shown below. Omitting --group-by entirely summarizes fields for the full file.
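A minimal sketch of grouping by both fields, using the same data.tsv:

$ tsv-summarize --header --group-by 1,2 --mean 3 data.tsv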

The program tries to generate useful headers, but custom headers can be specified. Example (using the -H and -g shortcuts for --header and --group-by):

$ tsv-summarize -H -g 1 --min 3:fastest --mean 3:average data.tsv

Most operators take custom headers in a similar way, generally following:

--<operator-name> FIELD[:header]

Operators can be specified multiple times. They can also take multiple fields (though not when a custom header is specified). Example:

--median 2,3,4

Summarization operators available are:

   count       min        mean       stddev
   retain      max        median     unique-count
   first       range      mad        mode
   last        sum        var        values

Calculations hold onto the minimum data needed while reading data. A few operations like median keep all data values in memory. These operations will start to encounter performance issues as available memory becomes scarce. The size that can be handled effectively is machine dependent, but often quite large files can be handled. Operations requiring numeric entries will signal an error and terminate processing if a non-numeric entry is found.

Options:

  • --h|help - Print help.
  • --help-verbose - Print detailed help.
  • --g|group-by n[,n...] - Fields to use as key.
  • --H|header - Treat the first line of each file as a header.
  • --w|write-header - Write an output header even if there is no input header.
  • --d|delimiter CHR - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.)
  • --v|values-delimiter CHR - Values delimiter. Default: vertical bar (|). (Single byte UTF-8 characters only.)
  • --p|float-precision NUM - 'Precision' to use printing floating point numbers. Affects the number of digits printed and exponent use. Default: 12

Operators:

  • --count - Count occurrences of each unique key.
  • --count-header STR - Count occurrences of each unique key, use header STR.
  • --retain n[,n...] - Retain one copy of the field.
  • --first n[,n...][:STR] - First value seen.
  • --last n[,n...][:STR] - Last value seen.
  • --min n[,n...][:STR] - Min value. (Numeric fields only.)
  • --max n[,n...][:STR] - Max value. (Numeric fields only.)
  • --range n[,n...][:STR] - Difference between min and max values. (Numeric fields only.)
  • --sum n[,n...][:STR] - Sum of the values. (Numeric fields only.)
  • --mean n[,n...][:STR] - Mean (average). (Numeric fields only.)
  • --median n[,n...][:STR] - Median value. (Numeric fields only. Reads all values into memory.)
  • --mad n[,n...][:STR] - Median absolute deviation from the median. Raw value, not scaled. (Numeric fields only. Reads all values into memory.)
  • --var n[,n...][:STR] - Variance. (Sample variance, numeric fields only).
  • --stdev n[,n...][:STR] - Standard deviation. (Sample st.dev, numeric fields only).
  • --unique-count n[,n...][:STR] - Number of unique values. (Reads all values into memory.)
  • --mode n[,n...][:STR] - Mode. The most frequent value. (Reads all values into memory.)
  • --values n[,n...][:STR] - All the values, separated by --v|values-delimiter. (Reads all values into memory.)

tsv-join reference

Synopsis: tsv-join --filter-file file [options] file [file...]

tsv-join matches input lines against lines from a 'filter' file. The match is based on exact match comparison of one or more 'key' fields. Fields are TAB delimited by default. Matching lines are written to standard output, along with any additional fields from the filter file that have been specified.

Options:

  • --h|help - Print help.
  • --help-verbose - Print detailed help.
  • --f|filter-file FILE - (Required) File with records to use as a filter.
  • --k|key-fields n[,n...] - Fields to use as join key. Default: 0 (entire line).
  • --d|data-fields n[,n...] - Data record fields to use as join key, if different than --key-fields.
  • --a|append-fields n[,n...] - Filter fields to append to matched records.
  • --H|header - Treat the first line of each file as a header.
  • --p|prefix STR - String to use as a prefix for --append-fields when writing a header line.
  • --w|write-all STR - Output all data records. STR is the --append-fields value when writing unmatched records. This is an outer join.
  • --e|exclude - Exclude matching records. This is an anti-join.
  • --delimiter CHR - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.)
  • --z|allow-duplicate-keys - Allow duplicate keys with different append values (last entry wins). Default behavior is that this is an error.

Examples:

Filter one file based on another, using the full line as the key.

$ # Output lines in data.txt that appear in filter.txt
$ tsv-join -f filter.txt data.txt

$ # Output lines in data.txt that do not appear in filter.txt
$ tsv-join -f filter.txt --exclude data.txt

Filter multiple files, using fields 2 & 3 as the filter key.

$ tsv-join -f filter.tsv --key-fields 2,3 data1.tsv data2.tsv data3.tsv

Same as previous, except use field 4 & 5 from the data files.

$ tsv-join -f filter.tsv --key-fields 2,3 --data-fields 4,5 data1.tsv data2.tsv data3.tsv

Append a field from the filter file to matched records.

$ tsv-join -f filter.tsv --key-fields 1 --append-fields 2 data.tsv

Write out all records from the data file, but when there is no match, write the 'append fields' as NULL. This is an outer join.

$ tsv-join -f filter.tsv --key-fields 1 --append-fields 2 --write-all NULL data.tsv

Managing headers: Often it's useful to join a field from one data file to another, where the data fields are related and the headers have the same name in both files. They can be kept distinct by adding a prefix to the filter file header. Example:

$ tsv-join -f run1.tsv --header --key-fields 1 --append-fields 2 --prefix run1_ run2.tsv

tsv-append reference

Synopsis: tsv-append [options] [file...]

tsv-append concatenates multiple TSV files, similar to the Unix 'cat' utility. Unlike 'cat', it is header aware ('--H|header'), writing the header from only the first file. It also supports source tracking, adding a column indicating the original file to each row. Results are written to standard output.

Concatenation with header support is useful when preparing data for traditional Unix utilities like 'sort' and 'sed' or applications that read a single file.

Source tracking is useful when creating long/narrow form tabular data, a format used by many statistics and data mining packages. In this scenario, files have been used to capture related data sets, the difference between data sets being a condition represented by the file. For example, results from different variants of an experiment might each be recorded in their own files. Retaining the source file as an output column preserves the condition represented by the file.

The file-name (without extension) is used as the source value. This can be customized using the --f|file option.

Example: Header processing:

$ tsv-append -H file1.tsv file2.tsv file3.tsv

Example: Header processing and source tracking:

$ tsv-append -H -t file1.tsv file2.tsv file3.tsv

Example: Source tracking with custom values:

$ tsv-append -H -s test_id -f test1=file1.tsv -f test2=file2.tsv

Options:

  • --h|help - Print help.
  • --help-verbose - Print detailed help.
  • --H|header - Treat the first line of each file as a header.
  • --t|track-source - Track the source file. Adds a column with the source name.
  • --s|source-header STR - Use STR as the header for the source column. Implies --H|header and --t|track-source. Default: 'file'
  • --f|file STR=FILE - Read file FILE, using STR as the 'source' value. Implies --t|track-source.
  • --d|delimiter CHR - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.)

tsv-uniq reference

tsv-uniq identifies equivalent lines in tab-separated value files. Input is read line by line, recording a key based on one or more of the fields. Two lines are equivalent if they have the same key. When operating in 'uniq' mode, the first time a key is seen the line is written to standard output, but subsequent lines are discarded. This is similar to the Unix 'uniq' program, but based on individual fields and without requiring sorted data.

The alternative to 'uniq' mode is 'equiv-class' identification. In this mode, all lines are written to standard output, but with a new field added marking equivalent entries with an ID. The ID is simply a one-upped counter.

Synopsis: tsv-uniq [options] [file...]

Options:

  • --h|help - Print help.
  • --help-verbose - Print detailed help.
  • --H|header - Treat the first line of each file as a header.
  • --f|fields n[,n...] - Fields to use as the key. Default: 0 (entire line).
  • --i|ignore-case - Ignore case when comparing keys.
  • --e|equiv - Output equiv class IDs rather than uniq'ing entries.
  • --equiv-header STR - Use STR as the equiv-id field header. Applies when using '--header --equiv'. Default: 'equiv_id'.
  • --equiv-start INT - Use INT as the first equiv-id. Default: 1.
  • --d|delimiter CHR - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.)

Examples:

$ # Uniq a file, using the full line as the key
$ tsv-uniq data.txt

$ # Same as above, but case-insensitive
$ tsv-uniq --ignore-case data.txt

$ # Unique a file based on one field
$ tsv-uniq -f 1 data.tsv

$ # Unique a file based on two fields
$ tsv-uniq -f 1,2 data.tsv

$ # Output all the lines, generating an ID for each unique entry
$ tsv-uniq -f 1,2 --equiv data.tsv

$ # Generate uniq IDs, but account for headers
$ tsv-uniq -f 1,2 --equiv --header data.tsv

tsv-sample reference

Synopsis: tsv-sample [options] [file...]

tsv-sample randomizes or samples input lines. By default, all lines are output in random order. --n|num can be used to limit the sample size produced. A weighted random sample is generated using the --f|field option, which identifies the field containing weights. Sampling is without replacement.

Weighted random sampling is done using an algorithm described by Efraimidis and Spirakis. Weights should be positive values representing the relative weight of the entry in the collection. Negative values are not meaningful and are given the value zero. However, any positive real values can be used. Lines are output ordered by the randomized weight that was assigned. This means, for example, that a smaller sample can be produced by taking the first N lines of output. For more info on the sampling approach see:

  • Wikipedia: https://en.wikipedia.org/wiki/Reservoir_sampling
  • "Weighted Random Sampling over Data Streams", Pavlos S. Efraimidis (https://arxiv.org/abs/1012.0256)

The implementation uses reservoir sampling. All lines output must be held in memory. Memory needed for large inputs can be reduced significantly by using a sample size. Both tsv-sample -n <num> and tsv-sample | head -n <num> produce the same results, but the former is faster.

Each run produces a different randomization. This can be changed using --s|static-seed. This uses the same initial seed each run to produce consistent randomization orders. The random seed can also be specified using --v|seed-value. This takes a non-zero, 32-bit positive integer. (A zero value is a no-op and ignored.)
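A few illustrative invocations (the file name and weight field are hypothetical), using the options listed below:

$ # Random permutation of all input lines
$ tsv-sample data.tsv

$ # Uniform random sample of 1000 lines
$ tsv-sample -n 1000 data.tsv

$ # Weighted random sample of 1000 lines, weights taken from field 3
$ tsv-sample -n 1000 -f 3 data.tsv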

Options:

  • --help-verbose - Print more detailed help.
  • --H|header - Treat the first line of each file as a header.
  • --n|num NUM - Number of lines to output. All lines are output if not provided or zero.
  • --f|field NUM - Field containing weights. All lines get equal weight if not provided or zero.
  • --p|print-random - Output the random values that were assigned.
  • --s|static-seed - Use the same random seed every run.
  • --v|seed-value NUM - Sets the initial random seed. Use a non-zero, 32 bit positive integer. Zero is a no-op.
  • --d|delimiter CHR - Field delimiter.
  • --h|help - This help information.

csv2tsv reference

Synopsis: csv2tsv [options] [file...]

csv2tsv converts CSV (comma-separated) text to TSV (tab-separated) format. Records are read from files or standard input, converted records are written to standard output.

Both formats represent tabular data, each record on its own line, fields separated by a delimiter character. The key difference is that CSV uses escape sequences to represent newlines and field separators in the data, whereas TSV disallows these characters in the data. The most common field delimiters are comma for CSV and tab for TSV, but any character can be used.

Conversion to TSV is done by removing CSV escape syntax, changing field delimiters, and replacing newlines and field delimiters in the data. By default, newlines and field delimiters in the data are replaced by spaces. Most details are customizable.

There is no single spec for CSV, any number of variants can be found. The escape syntax is common enough: fields containing newlines or field delimiters are placed in double quotes. Inside a quoted field, a double quote is represented by a pair of double quotes. As with field separators, the quoting character is customizable.
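As an illustration of the escape handling just described (the data is made up), a CSV record such as:

field one,"has, embedded comma","has ""embedded"" quotes"

becomes three TAB-separated fields with the quoting removed (<TAB> stands for the tab character):

field one<TAB>has, embedded comma<TAB>has "embedded" quotes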

Behaviors of this program that often vary between CSV implementations:

  • Newlines are supported in quoted fields.
  • Double quotes are permitted in a non-quoted field. However, a field starting with a quote must follow quoting rules.
  • Each record can have a different number of fields.
  • The three common forms of newlines are supported: CR, CRLF, LF.
  • A newline will be added if the file does not end with one.
  • No whitespace trimming is done.

This program does not validate CSV correctness, but will terminate with an error upon reaching an inconsistent state. Improperly terminated quoted fields are the primary cause.

UTF-8 input is assumed. Convert other encodings prior to invoking this tool.

Options:

  • --h|help - Print help.
  • --help-verbose - Print detailed help.
  • --H|header - Treat the first line of each file as a header. Only the header of the first file is output.
  • --q|quote CHR - Quoting character in CSV data. Default: double-quote (")
  • --c|csv-delim CHR - Field delimiter in CSV data. Default: comma (,).
  • --t|tsv-delim CHR - Field delimiter in TSV data. Default: TAB
  • --r|replacement STR - Replacement for newline and TSV field delimiters found in CSV input. Default: Space.

number-lines reference

Synopsis: number-lines [options] [file...]

number-lines reads from files or standard input and writes each line to standard output preceded by a line number. It is a simplified version of the Unix 'nl' program. It supports one feature 'nl' does not: the ability to treat the first line of files as a header. This is useful when working with tab-separated-value files. If header processing is used, a header line is written for the first file, and the header lines are dropped from any subsequent files.

Options:

  • --h|help - Print help.
  • --H|header - Treat the first line of each file as a header. The first input file's header is output, subsequent file headers are discarded.
  • --s|header-string STR - String to use as the header for the line number field. Implies --header. Default: 'line'.
  • --n|start-number NUM - Number to use for the first line. Default: 1.
  • --d|delimiter CHR - Character appended to line number, preceding the rest of the line. Default: TAB (Single byte UTF-8 characters only.)

Examples:

$ # Number lines in a file
$ number-lines file.tsv

$ # Number lines from multiple files. Treat the first line each file as a header.
$ number-lines --header data*.tsv

Available versions

1.0.13 1.0.12 1.0.11 1.0.10 1.0.9 1.0.8 1.0.7 1.0.6 1.0.5 1.0.4 1.0.3 1.0.2 1.0.1 1.0.0 ~master