html2exel - A Chrome browser extension

"Fed up with copy and paste not working properly from website tables. Need something to increase your data entry productivity or data collection processing speed? Ever want to download a listing of all of the open jobs on a career center? When enabled, this extension will display a save menu above each minimally nested HTML table within the loaded page. Pages containing very structured tabular data are not supported. Currently the extension only supports CSV, however XLSX and Excel 97 formats are in development."




"csvdiff is a perl script to compare/diff two (comma) seperated files with each other. The part that is different to standard diff is, that you'll get the number of the record where the difference occurs and the field/column which is different. The separator can be set to the value you want it to, not just comma. Also you can to provide a third file which contains the columnnames in one(!) line separated by your separator. If you do so, columnnames are shown if a difference is found..."



GNU Recutils

"GNU Recutils is a set of tools and libraries to access human-editable, text-based databases called recfiles. The data is stored as a sequence of records, each record containing an arbitrary number of named fields.

Some advanced capabilities usually found in other data storage systems are supported: data types, data integrity (keys, mandatory fields, etc) as well as the ability of records to refer to other records (sort of foreign keys). Despite its simplicity, recfiles can be used to store medium-sized databases. See the manual for more information about the Rec format.

The GNU recutils suite comprises:

  • A texinfo manual describing the Rec format and the software.
  • A C library (librec) providing a rich set of functions to access rec files.
  • A set of C utilities (recinf, recsel, recins, recdel, recset, recfix, recfmt, csv2rec and mdb2rec) that can be used in shell scripts and in the command line to operate on rec files.
  • A set of conversion utilities (mdb2rec, csv2rec) to convert data from other formats to rec files.
  • An emacs mode (rec-mode).

A video with a talk introducing the program can be found here."
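
For a feel of what "a sequence of records, each containing named fields" looks like in practice, the sketch below parses a small recfile-like text in Python. It is only an illustration and not a replacement for librec: it handles `Name: value` fields, blank-line record separators and `#` comment lines, and ignores the rest of the format (such as multi-line values). The sample data and the function name `parse_rec` are invented.

```python
SAMPLE = """\
# A tiny contact list (invented sample data)
%rec: Contact
%key: Id

Id: 1
Name: Ada
Email: ada@example.org
Email: ada@example.net

Id: 2
Name: Bob
"""


def parse_rec(text):
    """Return a list of records; each record is a list of (field, value) pairs.

    Pairs are used instead of a dict because a field name may
    appear more than once in the same record.
    """
    records, current = [], []
    for line in text.splitlines():
        if line.startswith("#"):      # comment line
            continue
        if not line.strip():          # blank line closes the current record
            if current:
                records.append(current)
                current = []
            continue
        name, _, value = line.partition(":")
        current.append((name.strip(), value.strip()))
    if current:
        records.append(current)
    return records


records = parse_rec(SAMPLE)
```

The first parsed record here is the record descriptor (the `%rec` and `%key` fields), followed by the two data records.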




"cfv is a utility to both test and create .sfv, .csv, .crc, .md5(sfv-like), md5sum, bsd md5, sha1sum, and .torrent files. These files are commonly used to ensure the correct retrieval or storage of data.

cfv is written in Python, and as such should run on all platforms Python supports. Currently, it has been verified to work on Linux, FreeBSD, OpenBSD, NetBSD, Solaris, Mac OS X, and Windows."

  • supports testing and creating of .sfv, .csv(2, 3, and 4 field variants), .crc, sfvmd5(sfv file using md5 instead of crc32), md5sum, bsd md5, sha1sum, and BitTorrent file formats
  • test-only support for PAR and PAR2 files
  • automatic checksum file naming ability in create mode
  • recursive operation
  • show unverified files option
  • ignore case and fix path separator options for cross platform use
  • transparent gzip support for checksum files
  • configurable renaming of bad files (with testing against previous bad files, to save only unique differing copies)
  • searching for/fixing of misnamed files
  • raw listing of files of specified type (bad, missing, etc)
  • test suite to ensure correct operation
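
As an illustration of what testing a checksum file involves, here is a minimal Python sketch for the .sfv case only. It is not taken from cfv. It assumes the common .sfv layout of one `filename CRC32-in-hex` entry per line with `;` starting a comment line, and the names `crc32_of` and `verify_sfv` are hypothetical.

```python
import os
import zlib


def crc32_of(path, chunk_size=65536):
    """Compute the CRC32 of a file without loading it into memory at once."""
    crc = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def verify_sfv(sfv_path):
    """Return {filename: "ok" | "bad" | "missing"} for every entry in an .sfv file."""
    base = os.path.dirname(os.path.abspath(sfv_path))
    results = {}
    with open(sfv_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith(";"):  # skip blanks and comments
                continue
            # The checksum is the last whitespace-separated token, so
            # file names containing spaces still work.
            name, expected = line.rsplit(None, 1)
            path = os.path.join(base, name)
            if not os.path.exists(path):
                results[name] = "missing"
            elif crc32_of(path) == int(expected, 16):
                results[name] = "ok"
            else:
                results[name] = "bad"
    return results
```

File names are resolved relative to the directory that holds the .sfv file. A malformed line with no checksum raises `ValueError` in this sketch rather than being reported.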



DSPL: Dataset Publishing Language

"DSPL is the Dataset Publishing Language, a representation language for the data and metadata of datasets. Datasets described in this format can be processed by Google and visualized in the Google Public Data Explorer.

  • Use existing data: Just add an XML metadata file to your existing CSV data files
  • Powerful visualizations: Unleash the full capabilities of the Google Public Data Explorer, including the animated bar chart, motion chart, and map visualization
  • Linkable concepts: Link to concepts in other datasets or create your own that others can use
  • Multi-language: Create datasets with metadata in any combination of languages
  • Geo-enabled: Make your data mappable by adding latitude and longitude data to your concept definitions. For even easier mapping, link to Google's canonical geographic concepts.
  • Fully open: Freely use the DSPL format in your own applications"



Data Wrangler

"Though data analysis tools continue to improve, analysts still expend an inordinate amount of time and effort manipulating data and assessing data quality issues. Such "data wrangling" regularly involves reformatting data values or layout, correcting erroneous or missing values, and integrating multiple data sources. These transforms are often difficult to specify and difficult to reuse across analysis tasks, teams, and tools. In response, we introduce Wrangler, an interactive system for creating data transformations. Wrangler combines direct manipulation of visualized data with automatic inference of relevant transforms, enabling analysts to iteratively explore the space of applicable operations and preview their effects. Wrangler leverages semantic data types (e.g., geographic locations, dates, classification codes) to aid validation and type conversion. Interactive histories support review, refinement, and annotation of transformation scripts. User study results show that Wrangler significantly reduces specification time and promotes the use of robust, auditable transforms instead of manual editing."

Wrangler Demo Video from Stanford Visualization Group on Vimeo.



Zoho Reports

"Zoho Reports is an Online Reporting and Business Intelligence service. The features offered by Zoho Reports specializes on providing in-depth, powerful and flexible reporting and analytical capabilities. It contains an in-built database grid inside, is optimized for reporting and analytical querying than just serving as a real-time online transactional database.

Having said the above, Zoho Reports still offers features like manual data addition, data upload, relational modelling, SQL support, collaboration, API etc., along with specialized reporting features which could be found suitable as an Online database for some of your application requirements. Users are requested to exercise judgement to decide on the suitability of Zoho Reports in such scenarios. You can import tabular data from the following file formats:

  • Excel Spreadsheets (.xls)
  • CSV (Comma Separated Values)
  • TSV (Tab Separated Values)
  • Any tabular data in text file format
  • HTML files
  • MS Access (.mdb) files
  • Web URL which generates data in CSV format
  • Zipped files in any of the above file formats (except .mdb files)

You can also copy-paste data from all the above file formats as well as from spreadsheet (Microsoft Office Excel, OpenOffice Calc, StarOffice) files to import the data into Zoho Reports..."



The 70 Online Databases that Define Our Planet

"If you want to simulate the Earth, you'll need data on the climate, health, finance, economics, traffic and lots more. Here's where to find it..."


Note: Although I stated I would no longer be posting data sets, this one was just too comprehensive and interesting to ignore.

Common Format and MIME Type for Comma-Separated Values (CSV) Files

"The comma separated values format (CSV) has been used for exchanging and converting data between various spreadsheet programs for quite some time. Surprisingly, while this format is very common, it has never been formally documented. Additionally, while the IANA MIME registration tree includes a registration for "text/tab-separated-values" type, no MIME types have ever been registered with IANA for CSV. At the same time, various programs and operating systems have begun to use different MIME types for this format. This RFC documents the format of comma separated values (CSV) files and formally registers the "text/csv" MIME type for CSV in accordance with RFC 2048..."



"Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase."

"Google Refine is NOT for entering new data one cell at a time. It is NOT for doing accounting. Google Refine is for applying transformations over many existing cells in bulk, for the purpose of cleaning up the data, extending it with more data from other sources, and getting it to some form that other tools can consume..."