20091231

CsvMode for Emacs


"CsvMode, csv-mode, is a package by FrancisJWright for editing comma separated value files (.csv)... The commands include sorting numerically or alphabetically on a particular field, cutting and pasting columns of fields, padding to align fields, or removing padding."

http://www.emacswiki.org/emacs/CsvMode

Working with CSV files in Vim

[Scripts for Vim]

CSV files (comma-separated values) are often used to save tables of data in plain text. Following are some useful techniques for working with CSV files. You can:

  • Highlight all text in any column.
  • View fields (convert csv text to columns or separate lines).
  • Navigate using the HJKL keys to go left, down, up, right by cell (hjkl work as normal).
  • Search for text in a specific column.
  • Sort lines by column.
  • Delete a column.
  • Specify a delimiter other than comma.

http://vim.wikia.com/wiki/Working_with_CSV_files

20091230

CRUSH-Tools

CRUSH (Custom Reporting Utilities for SHell) is a collection of tools for processing delimited-text data from the command line or in shell scripts.

For help getting started using CRUSH, or to see a demo of what it can do, try the CrushTutorial. For a list of the utilities provided in CRUSH and links to their documentation, see the UserDocs. Or see ApplicationDevelopmentWithCrush for a detailed look at writing applications using the CRUSH toolkit.

Join the CRUSH discussion group at http://groups.google.com/group/crush-tools

http://code.google.com/p/crush-tools/

Data.gov Catalogs

"Use the Data.gov catalog below to access U.S. Federal Executive Branch datasets. Click on the name of a dataset to view additional metadata for that dataset. By accessing the data catalogs, you agree to the Data Policy. Data.gov offers data in three ways: through the "raw" data catalog, using tools and through the geodata catalog. The "Raw" Data Catalog provides an instant download of machine readable, platform-independent datasets while the Tools Catalog provides hyperlinks which may lead to agency tools or agency web pages that allow you to mine datasets."

http://www.data.gov/catalog

20091226

Awk Channel Wiki

This wiki is maintained by regulars from the #awk channel on irc.freenode.net. #awk is a small, low-traffic channel, and is no different from other channels of its size on IRC, i.e.:
  • Don't wait to see whether someone is awake before asking your question; we will answer it when we see it.
  • Be patient; we do answer most questions.
  • The best way to ask a question on #awk is often to use a PasteBin containing a sample of your input data and a sample of the output you expect.
  • If #awk doesn't answer, consider asking in comp.lang.awk.

http://awk.freeshell.org/

"Data Munging With Perl" by David Cross (PDF)

"... In other words, munging data. It’s a dirty job, but someone has to do it.

If that someone is you, you’re definitely holding the right book. In the following pages, Dave will show you dozens of useful ways to get those everyday data manipulation chores done better, faster, and more reliably. Whether you deal with fixed-format data, or binary, or SQL databases, or CSV, or HTML/XML, or some bizarre proprietary format that was obviously made up on a drunken bet, there’s help right here..."


Data Munging With Perl.pdf (2744k)

Regex Dictionary

"The Regex Dictionary is a searchable online dictionary, based on The American Heritage Dictionary of the English Language, 4th edition, that returns matches based on strings —defined here as a series of characters and metacharacters— rather than on whole words, while optionally grouping results by their part of speech ... regexes are regular expressions, a set of characters, metacharacters, and operators that define a string or group of strings in a search pattern. The Regex Dictionary uses Perl's regular expression syntax..."

http://www.visca.com/regexdict/

20091214

Unix Utilities for Windows

"The Cygwin tools are ports of the popular GNU development tools for Microsoft Windows. They run thanks to the Cygwin library which provides the UNIX system calls and environment these programs expect.

With these tools installed, it is possible to write Win32 console or GUI applications that make use of the standard Microsoft Win32 API and/or the Cygwin API. As a result, it is possible to easily port many significant Unix programs without the need for extensive changes to the source code. This includes configuring and building most of the available GNU software (including the packages included with the Cygwin development tools themselves). Even if the development tools are of little to no use to you, you may have interest in the many standard Unix utilities provided with the package. They can be used both from the bash shell (provided) or from the standard Windows command shell."


http://www.cygwin.com/

"The GnuWin32 project provides Win32-versions of GNU tools, or tools with a similar open source licence. The ports are native ports, that is they rely only on libraries provided with any standard 32-bits MS-Windows operating system, such as MS-Windows 95 / 98 / ME / NT / 2000 / XP / 2003 / Vista. Native ports do not rely on some kind of Unix emulation, such as CygWin or Msys, so that there is no need to install additional emulation libraries. At present, all developments have been done under MS-Windows-XP, using the Mingw port of the GNU C and C++ (GCC) compilers. Utilities and libraries provided by GnuWin32, are used and distributed with packages such as GNU Emacs and KDE-Windows..."

http://gnuwin32.sourceforge.net/

Editing files with the ed text editor from scripts

"Unlike sed, ed is really a file editor. If you try to change file contents with sed, and the file is open elsewhere and read by some process, you will find out that GNU sed and its -i option of course does not edit in-file. There are circumstances where you may need that, either editing active and open files or not having GNU sed or some other sed with “in-place” option available.

Why ed?
  • maybe your sed doesn't support in-place edit
  • maybe you need to be as portable as possible
  • maybe you need to really edit in-file (and not create a new file like GNU sed)
  • last but not least: standard ed has very good editing and addressing possibilities, compared to standard sed

Don't get me wrong, this is not meant as anti-sed article! It's just meant to show you another way that may do the job..."

http://bash-hackers.org/wiki/doku.php?id=howto:edit-ed
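
As a concrete illustration of the approach the article describes, here is a minimal sketch of driving ed from a script via a here-document (the file name and contents are invented for the example):

```shell
# Create a throwaway sample file to edit.
printf 'alpha\nbeta\ngamma\n' > /tmp/ed_demo.txt

# ed reads commands from stdin; -s suppresses the byte-count chatter.
# 2s/.../ substitutes on line 2, 'w' writes the file back, 'q' quits.
ed -s /tmp/ed_demo.txt <<'EOF'
2s/beta/BETA/
w
q
EOF

edited=$(sed -n '2p' /tmp/ed_demo.txt)
```

Unlike GNU sed's -i option, the 'w' command writes back to the very file ed opened, which is what the article means by editing in-file.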

Regular Expression Basic Syntax References

A quick regex cheatsheet at: http://www.regular-expressions.info/reference.html
And here is a printable PDF: regular-expressions-cheat-sheet-v2.pdf
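
For a quick taste of the basics any such cheat sheet covers, here is a small sketch using grep -E (the sample words are made up):

```shell
printf 'cat\ncut\ncart\n' > /tmp/regex_demo.txt

# . matches any single character, so c.t matches "cat" and "cut".
dot_matches=$(grep -Ec 'c.t' /tmp/regex_demo.txt)

# ^ anchors the match at the start of the line.
anchored=$(grep -Ec '^ca' /tmp/regex_demo.txt)
```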

The Comma Separated Value (CSV) File Format

"The CSV ("Comma Separated Value") file format is often used to exchange data between disparate applications. The file format, as it is used in Microsoft Excel, has become a pseudo standard throughout the industry, even among non-Microsoft platforms.

As is the case with most exchange formats since XML, CSV files have become somewhat of a legacy format. New applications that wish to include an export format will generally use XML today (though there may be exceptions). In legacy systems though (pre-XML), CSV files had indeed become a de facto industry standard. Just as there are still billions of lines of CoBOL code in use today that need to be maintained, support for a legacy standard such as CSV is likely to be required long after it has stopped being implemented in new designs..."


http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
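
A quick sketch of why the article's fuller treatment matters: naively splitting on commas with awk breaks down as soon as an Excel-style quoted field contains a comma. The sample rows below are invented.

```shell
# A two-row sample file (contents invented for illustration).
printf 'alice,"Springfield, IL"\nbob,Austin\n' > /tmp/csv_demo.csv

# Naive comma splitting works on the unquoted row...
plain=$(awk -F',' 'NR==2 {print $2}' /tmp/csv_demo.csv)

# ...but splits the quoted field on row 1 in the middle, which is
# exactly the quoting behavior the article documents.
broken=$(awk -F',' 'NR==1 {print $2}' /tmp/csv_demo.csv)
```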

20091213

Greppin' in the GNU World Lab

"Computers are good at storing and processing data. Much of the data you store on them is in the form of plain-text files, especially in Linux. Text files are not just things like recipes for peanut butter cookies; they may contain important configuration files, email, amateur political commentary, calendar entries, even grocery shopping lists.

When a computer processes data, it often provides some sort of plain-text feedback, hopefully the information you wanted. Sometimes though, the computer provides text about an error, crash output, even information you didn't want. In any case, there's usually a lot of text to sort through on a computer!

Being human, you tend to take in certain pieces of information and ignore others; if you didn't filter information, you would otherwise be swamped with things you didn't care about or need to know. Wouldn't it be great if your computer, which is supposed to make your life easier, were able to help you in the same way by showing you only the things you want to see? Well, you're about to learn how to make that happen!

In Linux, you can use the commands grep, sort, wc, and their associated flags to find and display text in ways that are meaningful and useful to you..."


http://code.google.com/edu/tools101/linux/grep.html
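
A minimal sketch of the grep, sort, and wc combination the lab describes, using an invented grocery-list file:

```shell
# An invented grocery list to filter.
printf 'milk\nbread\npeanut butter\nmilk\napples\n' > /tmp/grocery.txt

# grep -c counts the lines that match a pattern.
milk_lines=$(grep -c 'milk' /tmp/grocery.txt)

# sort -u sorts and drops duplicates; wc -l counts what remains.
unique_items=$(sort -u /tmp/grocery.txt | wc -l | tr -d ' ')
```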

20091212

"sed & awk, Second Edition" by Dale Dougherty, Arnold Robbins


"sed & awk describes two text processing programs that are mainstays of the UNIX programmer's toolbox. sed is a "stream editor" for editing streams of text that might be too large to edit as a single file, or that might be generated on the fly as part of a larger data processing step. The most common operation done with sed is substitution, replacing one block of text with another. awk is a complete programming language. Unlike many conventional languages, awk is "data driven" -- you specify what kind of data you are interested in and the operations to be performed when that data is found. awk does many things for you, including automatically opening and closing data files, reading records, breaking the records up into fields, and counting the records. While awk provides the features of most conventional programming languages, it also includes some unconventional features, such as extended regular expression matching and associative arrays.

sed & awk describes both programs in detail and includes a chapter of example sed and awk scripts. This edition covers features of sed and awk that are mandated by the POSIX standard. This most notably affects awk, where POSIX standardized a new variable, CONVFMT, and new functions, toupper() and tolower(). The CONVFMT variable specifies the conversion format to use when converting numbers to strings (awk used to use OFMT for this purpose). The toupper() and tolower() functions each take a (presumably mixed case) string argument and return a new version of the string with all letters translated to the corresponding case.

In addition, this edition covers GNU sed, newly available since the first edition. It also updates the first edition coverage of Bell Labs nawk and GNU awk (gawk), covers mawk, an additional freely available implementation of awk, and briefly discusses three commercial versions of awk, MKS awk, Thompson Automation awk (tawk), and Videosoft (VSAwk)."

http://oreilly.com/catalog/9781565922259
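
A minimal sketch of the two features singled out above: sed's most common operation, substitution, and awk's POSIX-mandated toupper() function.

```shell
line='hello world'

# sed's bread and butter: replace one block of text with another.
subbed=$(printf '%s\n' "$line" | sed 's/world/there/')

# POSIX awk's toupper() returns the string translated to upper case.
upper=$(printf '%s\n' "$line" | awk '{ print toupper($0) }')
```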

Daily Dose of Excel

There is a lot of CSV specific information at the always entertaining "Daily Dose of Excel" blog.

http://www.dicks-blog.com/

Don’t MAWK AWK - the fastest and most elegant big data munging language!

"When one of these newfangled “Big Data” sets comes your way, the very first thing you have to do is data munging: shuffling around file formats, renaming fields and the like. Once you’re dealing with hundreds of megabytes of data, even simple operations can take plenty of time..."

http://anyall.org/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/

Flat File Checker

"Flat File Checker (FlaFi) is a simple tool for validation of structured data stored in flat files (*.txt, *.csv, etc)..."

http://www.flat-file.net/

20091210

"Unix for Poets" by Kenneth Ward Church (PDF)

"Text is available like never before. Data collection efforts such as the Association for Computational Linguistics’ Data Collection Initiative (ACL/DCI), the Consortium for Lexical Research (CLR), the European Corpus Initiative (ECI), ICAME, the British National Corpus (BNC), the Linguistic Data Consortium (LDC), Electronic Dictionary Research (EDR) and many others have done a wonderful job in acquiring and distributing dictionaries and corpora. In addition, there are vast quantities of so-called Information Super Highway Roadkill: email, bboards, faxes. We now have access to billions and billions of words, and even more pixels.

What can we do with it all? Now that data collection efforts have done such a wonderful service to the community, many researchers have more data than they know what to do with. Electronic bboards are beginning to fill up with requests for word frequency counts, ngram statistics, and so on. Many researchers believe that they don’t have sufficient computing resources to do these things for themselves. Over the years, I’ve spent a fair bit of time designing and coding a set of fancy corpus tools for very large corpora (eg, billions of words), but for a mere million words or so, it really isn’t worth the effort. You can almost certainly do it yourself, even on a modest PC. People used to do these kinds of calculations on a PDP-11, which is much more modest in almost every respect than whatever computing resources you are currently using..."


UnixforPoets.pdf
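
The kind of do-it-yourself word-frequency count the paper advocates can be built from standard tools alone; the sample sentence here is invented:

```shell
printf 'the cat sat on the mat the end\n' > /tmp/poets_demo.txt

# Classic pipeline: break the text into one word per line, sort so
# duplicates are adjacent, count them, then order by frequency.
tr -sc 'A-Za-z' '\n' < /tmp/poets_demo.txt | sort | uniq -c | sort -rn > /tmp/poets_freq.txt

# The most frequent word ends up on the first line.
top_word=$(awk 'NR==1 { print $2 }' /tmp/poets_freq.txt)
```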

20091209

"The AWK Programming Language" by Aho, Kernighan & Weinberger

"Originally developed by Alfred Aho, Brian Kernighan, and Peter Weinberger in 1977, AWK is a pattern-matching language for writing short programs to perform common data-manipulation tasks. In 1985, a new version of the language was developed, incorporating additional features such as multiple input files, dynamic regular expressions, and user-defined functions. This new version is available for both Unix and MS-DOS. This is the first book on AWK. It begins with a tutorial that shows how easy AWK is to use. The tutorial is followed by a comprehensive manual for the new version of AWK. Subsequent chapters illustrate the language by a range of useful applications, such as:

  • Retrieving, transforming, reducing, and validating data
  • Managing small, personal databases
  • Text processing
  • Little languages
  • Experimenting with algorithms

The examples illustrate the book's three themes: showing how to use AWK well, demonstrating AWK's versatility, and explaining how common computing operations are done. In addition, the book contains two appendixes: a summary of the language, and answers to selected exercises."

Be sure to read the customer reviews at amazon.com...

http://plan9.bell-labs.com/cm/cs/awkbook/
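
A tiny example in the book's pattern-action spirit, selecting and summing data (the input rows are invented):

```shell
printf 'apples 3\npears 5\napples 4\n' > /tmp/fruit_demo.txt

# Pattern: rows whose first field is "apples".
# Action: accumulate the second field; the END block prints the total.
apple_total=$(awk '$1 == "apples" { sum += $2 } END { print sum }' /tmp/fruit_demo.txt)
```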

lawker - The AWK code locker

This repository holds all the code (and web pages) for the awk community portal, http://awk.info.

The repository includes single functions, groups of related functions, preprocessors, and language variants.

http://code.google.com/p/lawker/

CSVed

"CSVed is an easy and powerful CSV file editor, you can manipulate any CSV file, separated with any separator."



http://csved.sjfrancke.nl/

Google Fusion Tables

  • Get started with an interesting data set from the Table Gallery.
  • Upload data tables from spreadsheets or csv files. During our labs release, we can support up to 100MB per table, and up to 250MB per user. You can export your data as csv too.
  • See the data on a map or as a chart immediately. Columns with locations are interpreted automatically, and you can adjust them directly on a map if necessary. Use filter and aggregate tools for more selective visualizations.
  • Just enter the email addresses of the people with whom you want to share a table and send them an invitation.
  • When another table has information about the same entities, merge the tables together to see all information in one place. When any data table is updated, the merged table will show the latest value too.
  • Multiple people can view and comment on the data. Discussions display people's comments and any changes to the data over time.
  • Choose only the columns you want to share with others. Save as a linked table with its own share permissions that will always show your current data values.
  • During data import, you can specify attribution for the data. The attribution will appear even when your data is merged into other tables.
  • Now that you've got that nice map or chart of your data, you can embed it in a web page or blog post. It will always display the latest data values for your table and allow you to communicate your story more easily.
  • FAQ

http://tables.googlelabs.com/Home

comp.lang.awk

Activity - Low
Description - The AWK programming language.
Language - English
Categories - Computers > Programming
Access - Public > Usenet
FAQ - http://www.faqs.org/faqs/computer-lang/awk/faq/

http://groups.google.com/group/comp.lang.awk/topics?pli=1