"This tool provides better filtering options for CSV files. User can add unlimited amount of filters in multiple columns. E.g. MS Excel provides only 2 options for filtering a single column."



"regexxer is a nifty GUI search/replace tool featuring Perl-style regular expressions. If you need project-wide substitution and you’re tired of hacking sed command lines together, then you should definitely give it a try."




"CSVfix is a command-line stream editor specifically designed to deal with CSV data. With it, you can:

Convert fixed format, multi-line and DSV files to CSV
Reorder, remove, split and merge fields
Convert case, trim leading & trailing spaces
Search for specific content using regular expressions
Filter out duplicate data or data on exclusion lists
Perform sed/perl style editing
Enrich with data from other sources
Add sequence numbers and file source information
Split large CSV files into smaller files based on field contents
Perform arithmetic calculations on individual fields
Validate CSV data against a collection of validation rules
Convert from CSV to fixed format, XML, SQL, DSV
Summarise CSV data, calculating averages, modes, frequencies etc."
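To see why a dedicated tool helps, here is just one item from that list, trimming leading and trailing spaces, done with plain sed. It already takes three expressions, and it assumes simple unquoted CSV (it would mangle quoted fields that contain commas):

```shell
# Trim spaces around every comma-separated field:
#   1) spaces around commas  2) leading spaces  3) trailing spaces
printf ' a , b ,c \n' | sed 's/ *, */,/g; s/^ *//; s/ *$//'
# -> a,b,c
```

CSVfix does this (and the quoting) for you, which is the point of a CSV-aware stream editor.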




csvtool is a handy command-line tool from Merjis for handling CSV files from shell scripts. It should be available in most *nix package repositories.



GNU Diffutils

"GNU Diffutils is a package of several programs related to finding differences between files.

Computer users often find occasion to ask how two files differ. Perhaps one file is a newer version of the other file. Or maybe the two files started out as identical copies but were changed by different people.

You can use the diff command to show differences between two files, or each corresponding file in two directories. diff outputs differences between files line by line in any of several formats, selectable by command line options. This set of differences is often called a ‘diff’ or ‘patch’. For files that are identical, diff normally produces no output; for binary (non-text) files, diff normally reports only that they are different.

You can use the cmp command to show the offsets and line numbers where two files differ. cmp can also show all the characters that differ between the two files, side by side.

You can use the diff3 command to show differences among three files. When two people have made independent changes to a common original, diff3 can report the differences between the original and the two changed versions, and can produce a merged file that contains both persons' changes together with warnings about conflicts.

You can use the sdiff command to merge two files interactively."
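A quick taste of the two most common commands (the file contents are invented for the demo):

```shell
printf 'alpha\nbeta\ngamma\n' > old.txt
printf 'alpha\nBETA\ngamma\n' > new.txt

# Unified diff: a few lines of context plus -/+ markers on changed lines.
diff -u old.txt new.txt

# cmp reports the byte offset and line number of the first difference.
cmp old.txt new.txt
```

One script-writing gotcha: both commands exit with status 1 when the files differ, so under `set -e` a difference will abort your script unless you handle it.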



note on data sets

I will no longer track data sets on this blog. They are far too numerous, too easy to find, and beyond the narrow scope I had originally defined for myself.



"As a priority Open Government Initiative for President Obama's administration, Data.gov increases the ability of the public to easily find, download, and use datasets that are generated and held by the Federal Government. Data.gov provides descriptions of the Federal datasets (metadata), information about how to access the datasets, and tools that leverage government datasets. The data catalogs will continue to grow as datasets are added. Federal, Executive Branch data are included in the first version of Data.gov."


The World Bank Open Data

"The World Bank's Open Data initiative is intended to provide all users with access to World Bank data. The data catalog is a listing of available World Bank data sources. This listing will continue to be updated as additional data resources are added. These resources include databases, pre-formatted tables and reports. Each of the listings includes a description of the data source and a direct link to that source. Where possible, the databases are linked directly to a selection screen to allow users to select the countries, indicators, and years they would like to search. Those search results can be exported in different formats. Users can also choose to download the entire database directly from the catalog."


Transparency Data

"Transparency Data is a central source for all federal and state campaign contributions made in the last twenty years. Here you can begin your search, find the information you need and then download records of what a candidate has received, what an individual has given, and how much companies and their employees have given."



PDF to Excel

"Unlike most PDF-to-Excel converters, we successfully detect all tables and discard non-tabular content, leaving you with a clean, easy-to-use XLS file..."



TheDataWeb and DataFerrett

"TheDataWeb brings together under one umbrella demographic, economic, environmental, health, (and more) datasets that are usually separated by geography and/or organization. TheDataWeb is the infrastructure for intelligent browsing and accessing data across the Internet. TheDataWeb provides access across the Internet to demographic, economic, environmental, health, and other databases housed in different systems in different agencies and organizations. TheDataWeb is a collection of systems and software that provide data query and extract capabilities, as well as data analysis and visualization tools, i.e., the DataFerrett."


"DataFerrett is a unique data mining and extraction tool. DataFerrett allows you to select a databasket full of variables and then recode those variables as you need. You can then develop and customize tables. Selecting your results in your table you can create a chart or graph for a visual presentation into an html page. Save your data in the databasket and save your table for continued reuse. DataFerrett helps you locate and retrieve the data you need across the Internet to your desktop or system, regardless of where the data resides..."

  • lets you receive data in the form in which you need it (whether it be extracted to an ascii, SAS, SPSS, Excel/Access file); or
  • lets you move seamlessly between query, analysis, and visualization of data in one package;
  • lets data providers share their data easier, and manage their own online data.

"DataFerrett can run from your internet browser or an application that installed, puts an icon on your desktop."


Log Parser 2.2

"Log parser is a powerful, versatile tool that provides universal query access to text-based data such as log files, XML files and CSV files, as well as key data sources on the Windows® operating system such as the Event Log, the Registry, the file system, and Active Directory®. You tell Log Parser what information you need and how you want it processed. The results of your query can be custom-formatted in text based output, or they can be persisted to more specialty targets like SQL, SYSLOG, or a chart.

Most software is designed to accomplish a limited number of specific tasks. Log Parser is different... the number of ways it can be used is limited only by the needs and imagination of the user. The world is your database with Log Parser."

Microsoft Log Parser 2.2 download


Research Pipeline Datasets

"Just a little interface between you and the world’s online datasets, journals, scientific software, and other resources. Research Pipeline is a collaborative website focused on organizing the world's data. Come see how the world finds data!"


egrep for Linguists (PDF)

"The following pages are intended as a starting point for the empirically inclined linguist who wants to make acquaintance with some basic Unix tricks useful in e.g. corpus studies. Everything on the following pages is (perhaps more accurately!) described elsewhere [2, 4, 8, 9]. However, with the exception of [2], the Unix literature is not written for linguists, and the examples of the diļ¬€erent commands are often far fetched from a linguistic text processing perspective..."


NYC Data Mine

Raw Data Catalog
NYC.gov features a raw data catalog with machine-readable data in formats such as XLS, XML, CSV, and RSS.



(G)awk cheat sheet (PDF)

Printable, downloadable cheat sheet for gawk (4-page PDF)
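The flavor of one-liner the cheat sheet covers, here summing a numeric column (field number and data invented for the demo):

```shell
# Sum field 2 of comma-separated input; END runs after the last line.
printf 'a,3\nb,4\nc,5\n' | awk -F, '{ sum += $2 } END { print sum }'
# -> 12
```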


"Recipe for a Language
  • 1 part egrep
  • 1 part snobol
  • 2 parts ed
  • 3 parts C
  • Blend all parts well using lex and yacc. Document minimally and release.
  • After eight years, add another part egrep and two more parts C. Document very well and release."



"BEWARE! Boring man page description on the way:

Sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). While in some ways similar to an editor which permits scripted edits (such as ed), sed works by making only one pass over the input(s), and is consequently more efficient. But it is sed's ability to filter text in a pipeline which particularly distinguishes it from other types of editors.


Note 1: This page was completely generated by a sed script.
Note 2: This page is a valid sed script. You can copy & paste it and run it with sed -f."
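That pipeline filtering is the everyday use: sed sits mid-pipe and transforms each line as it streams past. A small sketch (the log lines and format are invented):

```shell
# Print only ERROR lines, replacing the timestamp prefix with a label.
# -n suppresses default output; the trailing p flag prints only lines
# where the substitution actually matched.
printf '2024-01-01 INFO ok\n2024-01-02 ERROR disk full\n' |
  sed -n 's/^[0-9-]* ERROR /ERROR: /p'
# -> ERROR: disk full
```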


What is the Google Fusion Tables API?

The Google Fusion Tables API enables programmatic access to Google Fusion Tables content. It is an extension of Google's existing structured data capabilities for developers. Here are some of the things you can do:
  • Upload: Here's where it all starts: populating a table in Google Fusion Tables with data from spreadsheets or .CSV files.
  • Query and download: The Google Fusion Tables API is built on top of a subset of the SQL querying language. By referencing data values in SQL-like query expressions, you can find the data you need, then download it for use by your application. Your app can do any desired processing on the data, such as computing aggregates or feeding into a visualization gadget.
  • Sync: As you add or change data in the tables in your offline repository, you can ensure the most up-to-date version is available to the world by synchronizing those changes up to Google Fusion Tables.