eBay's TSV utilities. Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
This package provides sub packages which can be used individually:
tsv-utils:common - Routines used by applications in eBay's TSV Utilities project.
tsv-utils:csv2tsv - Convert comma-separated values data to tab-separated values format (CSV to TSV).
tsv-utils:keep-header - Execute a unix command in a header aware fashion.
tsv-utils:number-lines - Number lines
tsv-utils:tsv-append - Concatenate TSV files. Header aware, with support for source file tracking.
tsv-utils:tsv-filter - Filter lines in a tab-separated value file.
tsv-utils:tsv-join - Join lines in tab-separated value files.
tsv-utils:tsv-pretty - Print TSV data aligned for easier reading on consoles and traditional command-line environments.
tsv-utils:tsv-sample - Randomize or sample lines from input data. Several sampling methods are available, including simple random sampling, weighted random sampling, Bernoulli sampling, and distinct sampling.
tsv-utils:tsv-select - Output select columns from TSV files.
tsv-utils:tsv-summarize - Run aggregation and summarization operations on fields from TSV files.
tsv-utils:tsv-uniq - Output unique lines in TSV files using a subset of fields.
Command line utilities for tabular data files
This is a set of command line utilities for manipulating large tabular data files: files of numeric and text data commonly found in machine learning, data mining, and similar environments. The tools cover filtering, sampling, statistics, joins, and more.
These tools are especially useful when working with large data sets. They run faster than other tools providing similar functionality, often by significant margins. See Performance Studies for comparisons with other tools.
File an issue if you have problems, questions or suggestions.
In this README:
- Tools reference - Detailed documentation.
- Releases - Prebuilt binaries and release notes.
- Tips and tricks - Simpler and faster command line tool use.
- Performance Studies - Benchmarks against similar tools and other performance studies.
- Comparing TSV and CSV formats
- Building with Link Time Optimization (LTO) and Profile Guided Optimization (PGO)
- About the code (see also: tsv-utils code documentation)
- Other toolkits
Talks and blog posts:
- Faster Command Line Tools in D. May 24, 2017. A blog post showing a few ways to optimize performance in command line tools. Many of the ideas in the post were identified while developing the TSV Utilities.
- Experimenting with Link Time Optimization. Dec 14, 2017. A presentation at the Silicon Valley D Meetup describing experiments using LTO based on eBay's TSV Utilities.
- Exploring D via Benchmarking of eBay's TSV Utilities. May 2, 2018. A presentation at DConf 2018 describing performance benchmark studies conducted using eBay's TSV Utilities (slides here).
These tools perform data manipulation and statistical calculations on tab delimited data. They are intended for large files: larger than is ideal for loading entirely in memory in an application like R, but not so big as to necessitate moving to Hadoop or similar distributed compute environments. The features supported are useful both for standalone analysis and for preparing data for use in R, Pandas, and similar toolkits.
The tools work like traditional Unix command line utilities such as awk, and are intended to complement these tools. Each tool is a standalone executable. They follow common Unix conventions for pipeline programs. Data is read from files or standard input, and results are written to standard output. The field separator defaults to TAB, but any character can be used. Input and output is UTF-8, and all operations are Unicode ready, including regular expression match (tsv-filter). Documentation is available for each tool by invoking it with the --help option. TSV format is similar to CSV; see Comparing TSV and CSV formats for the differences.
The rest of this section contains descriptions of each tool, in the order listed below. Full documentation is available in the tool reference.
- tsv-filter - Filter lines using numeric, string and regular expression comparisons against individual fields. (This description also provides an introduction to features found throughout the toolkit.)
- tsv-select - Keep a subset of columns (fields). Like cut, but with field reordering.
- tsv-uniq - Filter out duplicate lines using either the full line or individual fields as a key.
- tsv-summarize - Summary statistics on selected fields, against the full data set or grouped by key.
- tsv-sample - Sample input lines or randomize their order. A number of sampling methods are available.
- tsv-join - Join lines from multiple files using fields as a key.
- tsv-pretty - Print TSV data aligned for easier reading on the command-line.
- csv2tsv - Convert CSV files to TSV.
- tsv-append - Concatenate TSV files. Header-aware; supports source file tracking.
- number-lines - Number the input lines.
- keep-header - Run a shell command in a header-aware fashion.
tsv-filter filters lines by running tests against individual fields. Multiple tests can be specified in a single call. A variety of numeric and string comparison tests are available, including regular expressions.
Consider a file having 4 fields: id, color, year, count. Using tsv-pretty to view the first few lines:
$ tsv-pretty data.tsv | head -n 5
 id  color   year  count
100  green   1982    173
101  red     1935    756
102  red     2008   1303
103  yellow  1873    180
The following command finds all entries where 'year' (field 3) is 2008:
$ tsv-filter -H --eq 3:2008 data.tsv
The --eq operator performs a numeric equality test. String comparisons are also available. The following command finds entries where 'color' (field 2) is "red":
$ tsv-filter -H --str-eq 2:red data.tsv
Fields are identified by a 1-up field number, same as traditional Unix tools. The -H option preserves the header line.
Multiple tests can be specified. The following command finds red entries with years between 1850 and 1950:
$ tsv-filter -H --str-eq 2:red --ge 3:1850 --lt 3:1950 data.tsv
Viewing the first few results produced by this command:
$ tsv-filter -H --str-eq 2:red --ge 3:1850 --lt 3:1950 data.tsv | tsv-pretty | head -n 5
 id  color  year  count
101  red    1935    756
106  red    1883   1156
111  red    1907   1792
114  red    1931   1412
Files can be placed anywhere on the command line. Data will be read from standard input if a file is not specified. The following commands are equivalent:
$ tsv-filter -H --str-eq 2:red --ge 3:1850 --lt 3:1950 data.tsv
$ tsv-filter data.tsv -H --str-eq 2:red --ge 3:1850 --lt 3:1950
$ cat data.tsv | tsv-filter -H --str-eq 2:red --ge 3:1850 --lt 3:1950
Multiple files can be provided. Only the header line from the first file will be kept when the -H option is used:
$ tsv-filter -H data1.tsv data2.tsv data3.tsv --str-eq 2:red --ge 3:1850 --lt 3:1950
$ tsv-filter -H *.tsv --str-eq 2:red --ge 3:1850 --lt 3:1950
Numeric comparisons are among the most useful tests. Numeric operators include:
- Equality: --eq, --ne (equal, not-equal).
- Relational: --lt, --le, --gt, --ge (less-than, less-equal, greater-than, greater-equal).
Several filters are available to help with invalid entries. Assume there is a messier version of the 4-field file where some fields are not filled in. The following command will filter out all lines with an empty value in any of the four fields:
$ tsv-filter -H messy.tsv --not-empty 1-4
The above command uses a "field list" to specify running the test on each of fields 1-4. The test can be "inverted" to see the lines that were filtered out:
$ tsv-filter -H messy.tsv --invert --not-empty 1-4 | head -n 5 | tsv-pretty
 id  color   year  count
116          1982     11
118  yellow          143
123  red              65
126                   79
There are several filters for testing characteristics of numeric data. The most useful are:
- --is-numeric - Test if the data in a field can be interpreted as a number.
- --is-finite - Test if the data in a field can be interpreted as a number, but not NaN (not-a-number) or infinity. This is useful when working with data where floating point calculations may have produced NaN or infinity values.
By default, all tests specified must be satisfied for a line to pass a filter. This can be changed using the --or option. For example, the following command finds records where 'count' (field 4) is less than 100 or greater than 1000:
$ tsv-filter -H --or --lt 4:100 --gt 4:1000 data.tsv | head -n 5 | tsv-pretty
 id  color  year  count
102  red    2008   1303
105  green  1982     16
106  red    1883   1156
107  white  1982      0
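The difference between the default behavior (all tests must pass) and --or (any test may pass) can be illustrated with a short Python sketch. This is not how tsv-filter is implemented; the function names and the tuple representation of a test are hypothetical, chosen only to mirror the command above, and every tested field is assumed to hold a valid number.

```python
import operator

# Each test is (field_number, comparison, value). Field numbers are 1-based,
# matching the numbering used on the tsv-filter command line.
def line_passes(fields, tests, use_or=False):
    """Return True if the split line satisfies the tests.

    By default every test must hold (AND); use_or=True requires only one.
    Assumes every tested field contains a valid number.
    """
    results = (compare(float(fields[num - 1]), value)
               for num, compare, value in tests)
    return any(results) if use_or else all(results)

def filter_lines(lines, tests, use_or=False, has_header=True):
    """Yield the header (if any) followed by the data lines that pass."""
    lines = iter(lines)
    if has_header:
        header = next(lines, None)
        if header is not None:
            yield header
    for line in lines:
        if line_passes(line.split("\t"), tests, use_or):
            yield line

rows = [
    "id\tcolor\tyear\tcount",
    "102\tred\t2008\t1303",
    "103\tyellow\t1873\t180",
    "105\tgreen\t1982\t16",
]
# Similar in spirit to: tsv-filter -H --or --lt 4:100 --gt 4:1000
tests = [(4, operator.lt, 100), (4, operator.gt, 1000)]
kept = list(filter_lines(rows, tests, use_or=True))
```

With use_or=True the sketch keeps the rows for ids 102 and 105. With the default AND behavior no row can pass, since a count cannot be both below 100 and above 1000.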
A number of string and regular expression tests are available. These include:
- Partial match
- Relational operators
- Case insensitive tests
- Regular expressions
- Field length
The earlier --not-empty example uses a "field list". Field lists specify a set of fields and can be used with most operators. For example, the following command ensures that fields 1-3 and 7 are less-than 100:
$ tsv-filter -H --lt 1-3,7:100 file.tsv
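As a rough illustration of how a field list such as 1-3,7 expands, here is a minimal Python sketch. The function names are hypothetical and the parsing rules are an assumption covering only the forms shown in this README (single fields and ascending ranges); the tools' own parser may accept more.

```python
def parse_field_list(spec):
    """Expand a field list such as "1-3,7" into [1, 2, 3, 7].

    Only single fields and ascending ranges are handled in this sketch.
    """
    fields = []
    for part in spec.split(","):
        if "-" in part:
            first, last = (int(n) for n in part.split("-"))
            if first > last:
                raise ValueError("descending ranges are not handled in this sketch")
            fields.extend(range(first, last + 1))
        else:
            fields.append(int(part))
    if any(n < 1 for n in fields):
        raise ValueError("field numbers start at 1")
    return fields

def parse_test_arg(arg):
    """Split an operator argument like "1-3,7:100" into (fields, value)."""
    field_spec, _, value = arg.rpartition(":")
    return parse_field_list(field_spec), float(value)
```

Under these assumptions, parse_test_arg("1-3,7:100") yields the fields [1, 2, 3, 7] and the value 100.0, meaning the same comparison is applied to each listed field.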
Bash completion is especially helpful with tsv-filter. It allows quickly seeing and selecting from the different operators available. See bash completion on the Tips and tricks page for setup information.
tsv-filter is perhaps the most broadly applicable of the TSV Utilities tools, as dataset pruning is such a common task. It is stream oriented, so it can handle arbitrarily large files. It is fast, quite a bit faster than other tools the author has tried. (See the "Numeric row filter" and "Regular expression row filter" tests in the 2018 Benchmark Summary.)
tsv-filter is ideal for preparing data for applications like R and Pandas. It is also convenient for quickly answering simple questions about a dataset. For example, to count the number of records with a non-zero value in field 4, use the command:
$ tsv-filter --ne 4:0 file.tsv | wc -l
See the tsv-filter reference for more details and the full list of operators.
tsv-select is a version of the Unix cut utility with the additional ability to re-order the fields. It also helps with header lines by keeping only the header from the first file (--header option). The following command writes fields [4, 2, 9, 10, 11] from a pair of files to stdout:
$ tsv-select -f 4,2,9-11 file1.tsv file2.tsv
See the tsv-select reference for details.
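The reordering behavior is the main difference from cut, which always emits fields in file order. A minimal Python sketch of the idea (the function name is hypothetical and this is not the tool's implementation):

```python
def select_fields(line, field_numbers, delimiter="\t"):
    """Return the requested 1-based fields, in the order requested."""
    fields = line.rstrip("\n").split(delimiter)
    return delimiter.join(fields[n - 1] for n in field_numbers)

# A line with eleven fields named f1..f11, for illustration only.
line = "\t".join(f"f{i}" for i in range(1, 12))
picked = select_fields(line, [4, 2, 9, 10, 11])
```

Here picked holds f4, f2, f9, f10, f11 in that order, matching the field order given in the example command.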
Similar in spirit to the Unix uniq tool, tsv-uniq filters a dataset so there is only one copy of each unique line. tsv-uniq goes beyond Unix uniq in a couple of ways. First, data does not need to be sorted. Second, equivalence can be based on a subset of fields rather than the full line.
tsv-uniq can also be run in 'equivalence class identification' mode, where lines with equivalent keys are marked with a unique id rather than filtered out. Another variant is 'number' mode, which generates line numbers grouped by the key.
An example uniq'ing a file on fields 2 and 3:
$ tsv-uniq -f 2,3 data.tsv
tsv-uniq operates on the entire line when no fields are specified. This is a useful alternative to the traditional sort -u or sort | uniq paradigms for identifying unique lines in unsorted files, as it is quite a bit faster, especially when there are many duplicate lines. As a bonus, order of the input lines is retained.
An in-memory lookup table is used to record unique entries. This ultimately limits the data sizes that can be processed. The author has found that datasets with up to about 10 million unique entries work fine, but performance starts to degrade after that. Even then it remains faster than the alternatives.
See the tsv-uniq reference for details.
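The in-memory lookup table approach explains both why input does not need to be sorted and why memory grows with the number of unique keys. A minimal Python sketch of the idea, using a set as the lookup table (names are hypothetical; the tool's actual data structures are not described here):

```python
def unique_lines(lines, key_fields=None, delimiter="\t"):
    """Yield the first line seen for each key, preserving input order.

    key_fields is a list of 1-based field numbers; None means the whole line.
    """
    seen = set()  # grows with the number of unique keys
    for line in lines:
        if key_fields is None:
            key = line
        else:
            fields = line.split(delimiter)
            key = tuple(fields[n - 1] for n in key_fields)
        if key not in seen:
            seen.add(key)
            yield line

data = ["a\t1\tx", "b\t1\tx", "a\t2\ty", "a\t1\tx"]
by_fields_2_3 = list(unique_lines(data, key_fields=[2, 3]))
by_full_line = list(unique_lines(data))
```

Keyed on fields 2 and 3, the second line is dropped because it shares the key (1, x) with the first. Keyed on the full line, only the exact duplicate at the end is dropped.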
tsv-summarize performs statistical calculations on fields, for example generating the sum or median of a field's values. Calculations can be run across the entire input or can be grouped by key fields. Consider the file data.tsv:
color   weight
red     6
red     5
blue    15
red     4
blue    10
Calculations of the sum and mean of the weight column are shown below. The first command runs calculations on all values. The second groups them by color.
$ tsv-summarize --header --sum 2 --mean 2 data.tsv
weight_sum  weight_mean
40          8
$ tsv-summarize --header --group-by 1 --sum 2 --mean 2 data.tsv
color  weight_sum  weight_mean
red    15          5
blue   25          12.5
Multiple fields can be used as the --group-by key. The file's sort order does not matter; there is no need to sort in the --group-by order first.
See the tsv-summarize reference for the list of statistical and other aggregation operations available.
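Grouping without sorting works because running totals can be kept per key as lines stream by. The following Python sketch reproduces the grouped sum and mean from the example data; the function name is hypothetical and this is only an illustration of the approach, not the tool's code.

```python
def summarize_by_key(rows, key_index, value_index):
    """Return (key, sum, mean) per key, in first-seen key order.

    rows are sequences of strings; indexes here are 0-based.
    """
    totals = {}  # key -> (running sum, count); dicts keep insertion order
    for row in rows:
        key = row[key_index]
        value = float(row[value_index])
        total, count = totals.get(key, (0.0, 0))
        totals[key] = (total + value, count + 1)
    return [(key, total, total / count)
            for key, (total, count) in totals.items()]

rows = [("red", "6"), ("red", "5"), ("blue", "15"), ("red", "4"), ("blue", "10")]
summary = summarize_by_key(rows, key_index=0, value_index=1)
# summary == [("red", 15.0, 5.0), ("blue", 25.0, 12.5)]
```

Statistics such as the median need all values for a key rather than a running total, so memory use depends on which operations are requested.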
tsv-sample randomizes line order (shuffling) or selects random subsets of lines (sampling) from input data. Several methods are available, including shuffling, simple random sampling, weighted random sampling, Bernoulli sampling, and distinct sampling. Data can be read from files or standard input. These sampling methods are made available through several modes of operation:
- Shuffling - The default mode of operation. All lines are read in and written out in random order. All orderings are equally likely.
- Simple random sampling (--n|num N) - A random sample of N lines is selected and written out in random order. The --i|inorder option preserves the original input order.
- Weighted random sampling (--w|weight-field F) - A weighted random sample of N lines is selected using weights from a field on each line. Output is in weighted selection order unless the --i|inorder option is used. Omitting --n|num outputs all lines in weighted selection order (weighted shuffling).
- Sampling with replacement (--n|num N) - All lines are read in, then lines are randomly selected one at a time and written out. Lines can be selected multiple times. Output continues until N samples have been output.
- Bernoulli sampling (--p|prob P) - A streaming form of sampling. Lines are read one at a time and selected for output using probability P. For example, -p 0.1 specifies that 10% of lines should be included in the sample.
- Distinct sampling (--p|prob P) - Another streaming form of sampling. However, instead of each line being subject to an independent selection choice, lines are selected based on a key contained in each line. A portion of keys are randomly selected for output, with probability P. Every line containing a selected key is included in the output. Consider a query log with records consisting of <user, query, clicked-url> triples. It may be desirable to sample records for one percent of the users, but include all records for the selected users.
tsv-sample is designed for large data sets. Streaming algorithms make immediate decisions on each line. They do not accumulate memory and can run on infinite length input streams. Both shuffling and sampling with replacement read in the entire dataset and are limited by available memory. Simple and weighted random sampling use reservoir sampling and only need to hold the specified sample size (--n|num) in memory. By default, a new random order is generated every run, but options are available for using the same randomization order over multiple runs. The random values assigned to each line can be printed, either to observe the behavior or to run custom algorithms on the results.
See the tsv-sample reference for further details.
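To make the memory characteristics concrete, here is a Python sketch of two of the techniques named above. The reservoir function is the textbook "Algorithm R", used here as a stand-in; the variant tsv-sample actually uses is not specified in this README and may differ. Function names and the rng parameter are hypothetical.

```python
import random

def reservoir_sample(lines, n, rng=None):
    """Uniform random sample of up to n items, holding only n in memory.

    Textbook Algorithm R; output order is reservoir order, not shuffled.
    """
    rng = rng or random.Random()
    reservoir = []
    for index, line in enumerate(lines):
        if index < n:
            reservoir.append(line)
        else:
            # Item at position index replaces a slot with probability n/(index+1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < n:
                reservoir[slot] = line
    return reservoir

def bernoulli_sample(lines, prob, rng=None):
    """Streaming selection: each line is kept independently with probability prob."""
    rng = rng or random.Random()
    for line in lines:
        if rng.random() < prob:
            yield line

sample = reservoir_sample(range(1000), 10, rng=random.Random(1))
```

The reservoir version never holds more than n items, and the Bernoulli version holds none, which is why the latter can run on unbounded input streams. Passing a seeded random.Random gives repeatable results across runs.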
tsv-join joins lines from multiple files based on a common key. One file, the 'filter' file, contains the records (lines) being matched. The other input files are scanned for matching records. Matching records are written to standard output, along with any designated fields from the filter file. In database parlance this is a hash semi-join. Example:
$ tsv-join --filter-file filter.tsv --key-fields 1,3 --append-fields 5,6 data.tsv
First, filter.tsv is read, creating a lookup table keyed on fields 1 and 3. Then data.tsv is read, and lines with a matching key are written to standard output with fields 5 and 6 from filter.tsv appended. This is a form of inner-join. Outer-joins and anti-joins can also be done.
Common uses for tsv-join are to join related datasets or to filter one dataset based on another. Filter file entries are kept in memory, which limits the ultimate size that can be handled effectively. The author has found that filter files up to about 10 million lines are processed effectively, but performance starts to degrade after that.
See the tsv-join reference for details.
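The two-pass structure described above (build a lookup table from the filter file, then stream the data file) can be sketched in Python as follows. The function name is hypothetical, and keeping the first filter entry for a duplicated key is an assumption made for this sketch rather than documented tool behavior.

```python
def join_with_filter(filter_rows, data_rows, key_fields, append_fields=()):
    """Yield data rows whose key appears in the filter rows.

    key_fields and append_fields are 1-based field numbers. Values of
    append_fields from the matching filter row are appended to each output row.
    """
    lookup = {}
    for row in filter_rows:
        key = tuple(row[n - 1] for n in key_fields)
        # Assumption for this sketch: the first entry for a key wins.
        lookup.setdefault(key, [row[n - 1] for n in append_fields])
    for row in data_rows:  # streamed; only the filter side is held in memory
        key = tuple(row[n - 1] for n in key_fields)
        if key in lookup:
            yield list(row) + lookup[key]

filter_rows = [["a", "red"], ["b", "blue"]]
data_rows = [["a", "1"], ["c", "2"], ["b", "3"], ["a", "4"]]
joined = list(join_with_filter(filter_rows, data_rows, key_fields=[1], append_fields=[2]))
```

In this small example the row keyed on c is dropped because it has no match in the filter rows, and the remaining rows gain the color from the filter side.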
tsv-pretty prints TSV data in an aligned format for better readability when working on the command-line. Text columns are left aligned, numeric columns are right aligned. Floats are aligned on the decimal point and precision can be specified. Header lines are detected automatically. If desired, the header line can be repeated at regular intervals. An example, first printed without formatting:
$ cat sample.tsv
Color	Count	Ht	Wt
Brown	106	202.2	1.5
Canary Yellow	7	106	0.761
Chartreuse	1139	77.02	6.22
Fluorescent Orange	422	1141.7	7.921
Grey	19	140.3	1.03
Now with tsv-pretty, using header underlining and float formatting:
$ tsv-pretty -u -f sample.tsv
Color               Count       Ht     Wt
-----               -----       --     --
Brown                 106   202.20  1.500
Canary Yellow           7   106.00  0.761
Chartreuse           1139    77.02  6.220
Fluorescent Orange    422  1141.70  7.921
Grey                   19   140.30  1.030
See the tsv-pretty reference for details.
csv2tsv does what you expect: convert CSV data to TSV. Example:
$ csv2tsv data.csv > data.tsv
A strict delimited format like TSV has many advantages for data processing over an escape oriented format like CSV. However, CSV is a very popular data interchange format and the default export format for many database and spreadsheet programs. Converting CSV files to TSV allows them to be processed reliably by both this toolkit and standard Unix utilities.
Note that many CSV files do not use escapes, and in fact follow a strict delimited format using comma as the delimiter. Such files can be processed reliably by this toolkit and Unix tools by specifying the delimiter character. However, when there is doubt, using a csv2tsv converter adds reliability.
A csv2tsv converter often has a second benefit: regularizing newlines. CSV files are often exported using Windows newline conventions. csv2tsv converts all newlines to Unix format.
See Comparing TSV and CSV formats for more information on CSV escapes and other differences between CSV and TSV formats.
There are many variations of CSV file format. See the csv2tsv reference for details of the format variations supported by this tool.
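To show what removing CSV escapes and regularizing newlines involves, here is a Python sketch built on the standard csv module. It is not csv2tsv's implementation. Replacing TABs and newlines found inside fields with a single space is an assumption of this sketch; consult the csv2tsv reference for the tool's actual handling and options.

```python
import csv
import io

def clean_field(field, replacement=" "):
    """Replace characters that would break a strict TSV record."""
    for ch in ("\r\n", "\n", "\r", "\t"):  # "\r\n" first so it becomes one replacement
        field = field.replace(ch, replacement)
    return field

def csv_to_tsv(csv_text, replacement=" "):
    """Convert CSV text to TSV text with Unix newlines."""
    # newline="" leaves line endings untouched so the csv module can
    # interpret quoted, multi-line fields itself.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    return "".join(
        "\t".join(clean_field(f, replacement) for f in record) + "\n"
        for record in reader
    )

sample = 'id,note\r\n1,"a, b"\r\n2,"line1\nline2"\r\n3,"say ""hi"""\r\n'
converted = csv_to_tsv(sample)
```

The sample input uses Windows line endings, a quoted comma, a quoted newline, and doubled quotes. After conversion each record is one line with fields separated by TAB and no CSV quoting left.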
tsv-append concatenates multiple TSV files, similar to the Unix cat utility. It is header-aware, writing the header from only the first file. It also supports source tracking, adding a column indicating the original file to each row.
Concatenation with header support is useful when preparing data for traditional Unix utilities like sed or applications that read a single file.
Source tracking is useful when creating long/narrow form tabular data. This format is used by many statistics and data mining packages. (See Wide & Long Data - Stanford University or Hadley Wickham's Tidy data for more info.)
In this scenario, files have been used to capture related data sets, the difference between data sets being a condition represented by the file. For example, results from different variants of an experiment might each be recorded in their own files. Retaining the source file as an output column preserves the condition represented by the file. The source values default to the file names, but this can be customized.
See the tsv-append reference for the complete list of options available.
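The combination of header-aware concatenation and source tracking can be sketched in Python as below. The function name is hypothetical, and placing the source column first with the header "file" is a choice made for this sketch; see the tsv-append reference for the tool's actual output layout.

```python
def append_with_source(files, has_header=True, source_header="file"):
    """Concatenate line lists, keeping one header and tagging each row.

    files maps a source label (for example a file name) to that source's
    lines, header first. The label is prepended as a new first column.
    """
    output = []
    for position, (label, lines) in enumerate(files.items()):
        body = lines
        if has_header and lines:
            if position == 0:
                output.append(source_header + "\t" + lines[0])
            body = lines[1:]  # headers after the first are dropped
        output.extend(label + "\t" + line for line in body)
    return output

files = {
    "run1": ["x\ty", "1\t2"],
    "run2": ["x\ty", "3\t4"],
}
combined = append_with_source(files)
```

The result is in long/narrow form: one header line, then every data row from both sources with its source label in the added column.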
number-lines is a simpler version of the Unix nl program. It prepends a line number to each line read from files or standard input. This tool was written primarily as an example of a simple command line tool. The code structure it uses is the same as followed by all the other tools. Example:
$ number-lines myfile.txt
Despite its original purpose as a code sample, number-lines turns out to be quite convenient. It is often useful to add a unique row ID to a file, and this tool does this in a manner that maintains proper TSV formatting.
See the number-lines reference for details.
keep-header is a convenience utility that runs Unix commands in a header-aware fashion. It is especially useful with sort, which puts the header line wherever it falls in the sort order. Using keep-header, the header line retains its position as the first line. For example:
$ keep-header myfile.txt -- sort
It is also useful with sed and similar tools, when the header line should be excluded from the command's action.
Multiple files can be provided; only the header from the first is retained. The command is executed as specified, so additional command options can be provided. See the keep-header reference for more information.
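The idea behind keep-header is simply to set the first line aside, apply the command to the rest, and put the header back on top. A minimal Python sketch of that idea, with a Python callable standing in for the Unix command (the function name is hypothetical):

```python
def keep_header(lines, command):
    """Apply command to all lines except the first, which stays first."""
    lines = list(lines)
    if not lines:
        return []
    return [lines[0]] + list(command(lines[1:]))

rows = ["name\tn", "b\t2", "a\t1"]
header_first = keep_header(rows, sorted)
plain_sort = sorted(rows)
```

With a plain sort the header ends up after the a row, since it is sorted like any other line. With keep_header it stays on the first line and only the data rows are sorted.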
Obtaining and installation
There are several ways to obtain the tools: prebuilt binaries; building from source code; and installing using the DUB package manager. The tools have been tested on Linux and Mac OS X. They have not been tested on Windows, but there are no obvious impediments to running on Windows as well.
Prebuilt binaries are available for Linux and Mac; these can be found on the Github releases page. Download and unpack the tar.gz file. Executables are in the bin directory. Add the bin directory or individual tools to the PATH environment variable. As an example, the 1.4.4 releases for Linux and MacOS can be downloaded and unpacked with these commands:
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.4.4/tsv-utils-v1.4.4_linux-x86_64_ldc2.tar.gz | tar xz
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.4.4/tsv-utils-v1.4.4_osx-x86_64_ldc2.tar.gz | tar xz
See the Github releases page for the latest release.
For some distributions a package can be installed directly:
| Distribution | Command             |
| ------------ | ------------------- |
| Arch Linux   | pacaur -S tsv-utils |
Note: The distributions above are not updated as frequently as the [Github releases](https://github.com/eBay/tsv-utils/releases) page.
Build from source files
Download a D compiler. These tools have been tested with the DMD and LDC compilers, on Mac OSX and Linux. Use DMD version 2.076.1 or later, LDC version 1.6.0 or later.
Clone this repository, select a compiler, and run make from the top level directory:
$ git clone https://github.com/eBay/tsv-utils.git
$ cd tsv-utils
$ make         # For LDC: make DCOMPILER=ldc2
Executables are written to tsv-utils/bin; place this directory or the executables in the PATH. The compiler defaults to DMD; this can be changed on the make command line (e.g. make DCOMPILER=ldc2). DMD is the reference compiler, but LDC produces faster executables. (For some tools LDC is quite a bit faster than DMD.)
The makefile supports other typical development tasks such as unit tests and code coverage reports. See Building and makefile for more details.
For fastest performance, use LDC with Link Time Optimization (LTO) and Profile Guided Optimization (PGO) enabled:
$ git clone https://github.com/eBay/tsv-utils.git
$ cd tsv-utils
$ make DCOMPILER=ldc2 LDC_LTO_RUNTIME=1 LDC_PGO=2
$ # Run the test suite
$ make test-nobuild DCOMPILER=ldc2
The above requires LDC 1.9.0 or later. See Building with Link Time Optimization for more information. The prebuilt binaries are built using LTO and PGO, but these must be explicitly enabled when building from source. LTO and PGO are still early stage technologies, and issues may surface in some system configurations. Running the test suite (shown above) is a good way to detect issues that may arise.
Install using DUB
If you are a D user you likely use DUB, the D package manager. DUB comes packaged with DMD starting with DMD 2.072. You can install and build using DUB as follows (replace 1.3.2 with the current version):
$ dub fetch tsv-utils --cache=local
$ cd tsv-utils-1.3.2/tsv-utils
$ dub run    # For LDC: dub run -- --compiler=ldc2
The dub run command compiles all the tools. The executables are written to tsv-utils/bin. Add this directory or individual executables to the PATH.
See Building and makefile for more information about the DUB setup.
The applications can be built with LTO and PGO when source code is fetched by DUB. However, the DUB build system does not support this; make must be used instead. See Building with Link Time Optimization.
There are a number of simple ways to improve the utility of these tools; these are listed on the Tips and tricks page. Bash aliases, Unix sort command customization, and bash completion are especially useful.
- Registered by Jon Degenhardt
- 1.5.0 released 2 days ago
- Copyright (c) 2015-2020, eBay Inc.
- Versions: 1.5.0 (2020-Feb-16), 1.4.4 (2019-Sep-23), 1.4.3 (2019-Aug-19), 1.4.2 (2019-Jun-14), 1.4.1 (2019-Apr-07)