PDF to Excel

"Unlike most PDF-to-Excel converters, we successfully detect all tables and discard non-tabular content, leaving you with a clean, easy-to-use XLS file..."



TheDataWeb and DataFerrett

"TheDataWeb brings together under one umbrella demographic, economic, environmental, health, (and more) datasets that are usually separated by geography and/or organization. TheDataWeb is the infrastructure for intelligent browsing and accessing data across the Internet. TheDataWeb provides access across the Internet to demographic, economic, environmental, health, and other databases housed in different systems in different agencies and organizations. TheDataWeb is a collection of systems and software that provide data query and extract capabilities, as well as data analysis and visualization tools, i.e., the DataFerrett."


"DataFerrett is a unique data mining and extraction tool. DataFerrett allows you to select a databasket full of variables and then recode those variables as you need. You can then develop and customize tables. Selecting your results in your table you can create a chart or graph for a visual presentation into an html page. Save your data in the databasket and save your table for continued reuse. DataFerrett helps you locate and retrieve the data you need across the Internet to your desktop or system, regardless of where the data resides..."

  • lets you receive data in the form in which you need it (whether extracted to an ASCII, SAS, SPSS, or Excel/Access file);
  • lets you move seamlessly between query, analysis, and visualization of data in one package; and
  • lets data providers share their data more easily and manage their own online data.

"DataFerrett can run from your internet browser or as an installed application that puts an icon on your desktop."


Log Parser 2.2

"Log Parser is a powerful, versatile tool that provides universal query access to text-based data such as log files, XML files and CSV files, as well as key data sources on the Windows® operating system such as the Event Log, the Registry, the file system, and Active Directory®. You tell Log Parser what information you need and how you want it processed. The results of your query can be custom-formatted in text-based output, or they can be persisted to more specialty targets like SQL, SYSLOG, or a chart.

Most software is designed to accomplish a limited number of specific tasks. Log Parser is different... the number of ways it can be used is limited only by the needs and imagination of the user. The world is your database with Log Parser."

Microsoft Log Parser 2.2 download
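
To give a feel for the "query your log files with SQL" idea, here is a minimal Python sketch. It is not Log Parser and does not use Log Parser's query syntax; it only imitates the concept by loading CSV text into an in-memory SQLite table. The sample data, the table name `log`, and the helper `query_csv` are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a real CSV log file.
SAMPLE_LOG = """date,status,path
2010-01-01,200,/index.html
2010-01-01,404,/missing.html
2010-01-02,200,/index.html
2010-01-02,500,/report.cgi
"""


def query_csv(csv_text, sql):
    """Load CSV text into an in-memory SQLite table named 'log' and run sql."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    try:
        # Column names come from the CSV header; values are stored as text.
        columns = ", ".join('"{}"'.format(name) for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute("CREATE TABLE log ({})".format(columns))
        conn.executemany(
            "INSERT INTO log VALUES ({})".format(placeholders), data
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Count requests per HTTP status code.
    sql = "SELECT status, COUNT(*) FROM log GROUP BY status ORDER BY status"
    for status, hits in query_csv(SAMPLE_LOG, sql):
        print(status, hits)
```

The point of the sketch is only that once text data sits behind a SQL interface, grouping and counting become one-line queries.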


Research Pipeline Datasets

"Just a little interface between you and the world’s online datasets, journals, scientific software, and other resources. Research Pipeline is a collaborative website focused on organizing the world's data. Come see how the world finds data!"


egrep for Linguists (PDF)

"The following pages are intended as a starting point for the empirically inclined linguist who wants to make acquaintance with some basic Unix tricks useful in e.g. corpus studies. Everything on the following pages is (perhaps more accurately!) described elsewhere [2, 4, 8, 9]. However, with the exception of [2], the Unix literature is not written for linguists, and the examples of the different commands are often far fetched from a linguistic text processing perspective..."
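
For readers without a Unix shell handy, the same kind of corpus search can be approximated in Python. This is a rough sketch and is not taken from the PDF: the mini-corpus, the pattern, and the helper names `grep` and `count_matches` are invented for illustration, and Python's `re` syntax differs from egrep's in some details.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; a real study would read lines from a text file.
CORPUS = [
    "The linguist was walking to the library.",
    "She is reading and taking notes.",
    "Nothing interesting happened.",
    "They were singing loudly.",
]

# A form of "be" followed by a word ending in -ing (a crude progressive finder).
PATTERN = r"\b(?:is|was|were)\s+\w+ing\b"


def grep(pattern, lines):
    """Return the lines that contain a match, like egrep's default behaviour."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def count_matches(pattern, lines):
    """Count each matched string, roughly like 'egrep -o ... | sort | uniq -c'."""
    regex = re.compile(pattern)
    return Counter(
        m.group(0).lower() for line in lines for m in regex.finditer(line)
    )


if __name__ == "__main__":
    for line in grep(PATTERN, CORPUS):
        print(line)
    for match, n in sorted(count_matches(PATTERN, CORPUS).items()):
        print(n, match)
```

Note that the word-boundary anchors keep "Nothing interesting" from counting as a hit, which is the sort of false positive a looser pattern such as `\w+ing` would produce.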


NYC Data Mine

Raw Data Catalog
NYC.gov features a raw data catalog with machine readable data (such as XLS, XML, CSV, and RSS).



(G)awk cheat sheet (PDF)

Printable, downloadable cheat sheet for gawk (4 page PDF)


"Recipe for a Language
  • 1 part egrep
  • 1 part snobol
  • 2 parts ed
  • 3 parts C
  • Blend all parts well using lex and yacc. Document minimally and release.
  • After eight years, add another part egrep and two more parts C. Document very well and release."



"BEWARE! Boring man page description on the way:

Sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). While in some ways similar to an editor which permits scripted edits (such as ed), sed works by making only one pass over the input(s), and is consequently more efficient. But it is sed's ability to filter text in a pipeline which particularly distinguishes it from other types of editors.


Note 1: This page was completely generated by a sed script.
Note 2: This page is a valid sed script. You can copy&paste and run it with sed -f."
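
The one-pass, filter-in-a-pipeline model described in the man page excerpt can be sketched in a few lines of Python. This is only an illustration of the idea, not a reimplementation of sed: the function name `stream_edit` and its parameters are hypothetical, and it uses Python's regular expression syntax rather than sed's.

```python
import re


def stream_edit(lines, pattern, replacement, delete=None):
    """Yield edited lines in a single pass over the input.

    Roughly comparable to: sed -e '/DELETE/d' -e 's/PATTERN/REPLACEMENT/g'
    Lines matching 'delete' are dropped; every other line has all
    occurrences of 'pattern' replaced.
    """
    sub = re.compile(pattern)
    drop = re.compile(delete) if delete else None
    for line in lines:
        if drop and drop.search(line):
            continue
        yield sub.sub(replacement, line)


if __name__ == "__main__":
    # Hypothetical input; any iterable of lines works, including an open file.
    sample = ["# a comment\n", "colour and colour\n", "plain line\n"]
    for out in stream_edit(sample, r"colour", "color", delete=r"^#"):
        print(out, end="")
```

Because `stream_edit` is a generator, each line is read, transformed, and emitted before the next one is touched, so it never needs the whole input in memory. Passing `sys.stdin` as `lines` would turn it into a pipeline filter.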


What is the Google Fusion Tables API?

The Google Fusion Tables API enables programmatic access to Google Fusion Tables content. It is an extension of Google's existing structured data capabilities for developers. Here are some of the things you can do:
  • Upload: Here's where it all starts: populating a table in Google Fusion Tables with data from spreadsheets or .CSV files.
  • Query and download: The Google Fusion Tables API is built on top of a subset of the SQL query language. By referencing data values in SQL-like query expressions, you can find the data you need, then download it for use by your application. Your app can do any desired processing on the data, such as computing aggregates or feeding into a visualization gadget.
  • Sync: As you add or change data in the tables in your offline repository, you can ensure the most up-to-date version is available to the world by synchronizing those changes up to Google Fusion Tables.