19 Jan 2010

Adventures in awk

I’ve been working on a literature review lately so I’ve been sifting through tons of articles.  In my search, I’ve come across a few bibliographies of papers on specific topics which were compiled by researchers interested in those topics.  When the bibliography is reasonably small (in the case of a very narrowly defined topic), it’s usually fastest to sift through it by hand to find articles that might be of interest.  However, the most recent bibliography I found contains 8342 papers.  I am definitely not about to print that out and go through it by hand.

This bibliography is available as an EndNote library file and as a Rich Text Format (rtf) document.  Apparently, if you have a recent version of EndNote installed you can use its search features to sift through the data.  However, the file won’t open in EndNote 6, which is what I have on my laptop.  Zotero wouldn’t import it either.  I tried using the built-in search capabilities in Word and even jEdit on the rtf but nothing could provide me with what I wanted.  Basically, I wanted the ability to export entries that matched a given search criterion to a separate file.  Presumably, there are programs that can do this for you, but I don’t know of them and don’t have them installed.

In the end, I decided to convert the rtf to a simple txt file.  This put each entry on its own line.  With each entry occupying a single line in the text file, I just needed some way to search for a given term and then output each line that contains that term.  I have used sed and awk a little bit in the past and I knew that there must be some way to do that with either or both of those, so I looked into their syntax online.  I found this awk tutorial and, using the examples there, I was able to put together a command that does exactly what I needed:

awk '/term/ {print $0}' < bibliography.txt > term.txt

where “term” is replaced by whatever term you want to match.  You can further automate this by putting a bunch of these commands into a shell script or writing a little Perl program that will take a command line argument and insert it as “term” in the command.
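A minimal sketch of the shell-script idea, assuming a POSIX shell and awk. The function name `search_bib` and the three sample entries are made up for illustration; only the awk pattern match comes from the command above. Each term's matches go to a file named after the term, so a term containing a slash would need extra handling.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): write every line of a
# bibliography file that matches a term to "<term>.txt".
search_bib() {
    term=$1
    bib=$2
    # -v hands the term to awk as a variable; "$0 ~ term" treats it as a
    # regular expression, just like /term/ in the one-liner.
    awk -v term="$term" '$0 ~ term { print $0 }' "$bib" > "$term.txt"
}

# Tiny made-up bibliography, one entry per line.
printf '%s\n' \
    'Smith, J. (2001). Usability of mobile maps.' \
    'Jones, A. (2005). Eye tracking in reading research.' \
    'Lee, K. (2008). Mobile eye tracking methods.' > sample_bib.txt

# One output file per search term.
for t in mobile tracking; do
    search_bib "$t" sample_bib.txt
done
```

The match is case-sensitive, as in the original command, so "mobile" does not pick up the entry that only contains "Mobile".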

Now I have a list of papers related to all the terms I’m interested in and I’ve got a fast way to search for further terms in the future if I need to.  The approach is a little “awk”ward (groan!) because I have to run it in Linux and I use Windows most of the time.  I no longer have Linux installed as a virtual machine on my laptop and I don’t even have Cygwin installed anymore.  So, I had to upload my text file to one of my research group’s Linux servers, run the scripts, and download the results back to my laptop.  Once I figured out what I needed to do, it took me less than half an hour, though, so even if it’s kludgy, it’s still a lot faster than reading through the bibliography manually.

UPDATE: I just realized that all I did was replicate the functionality of grep using awk.  That is, I could achieve the same result with the following code:

grep term < bibliography.txt > term.txt

Additionally, it turns out that you can produce this functionality with sed as well, using the following code:

sed -n 's/term/&/p' < bibliography.txt > term.txt
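One way to convince yourself that the three commands are interchangeable for a plain-word term is to run them on the same input and compare the results. This is a sketch on made-up sample data; the file names are placeholders. Note that grep defaults to basic regular expressions while awk uses extended ones, so terms containing characters such as `+` or `|` may not behave identically.

```shell
#!/bin/sh
# Made-up sample data standing in for the real bibliography.
printf '%s\n' \
    'alpha term one' \
    'beta two' \
    'gamma term three' > bibliography_sample.txt

# The three commands from the post, each writing to its own file.
awk '/term/ {print $0}' < bibliography_sample.txt > awk_out.txt
grep term < bibliography_sample.txt > grep_out.txt
sed -n 's/term/&/p' < bibliography_sample.txt > sed_out.txt

# cmp -s is silent and exits 0 only when the two files are byte-identical.
if cmp -s awk_out.txt grep_out.txt && cmp -s awk_out.txt sed_out.txt; then
    echo "all three outputs match"
else
    echo "outputs differ"
fi
```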

I guess I missed the fact that I could use grep because I started thinking about using sed or awk before I converted the file to plain text.  Each entry was spread over multiple lines so I was thinking about needing something fairly sophisticated.  I know that grep can do regular expressions but my first thought went to sed and awk, which are like one logical unit in my brain because of the O’Reilly books that cover both.
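For the multi-line case, awk can also treat a whole entry as one record. The sketch below assumes that entries are separated by blank lines, which may not be how the original rtf was laid out; the sample entries and file names are invented for illustration.

```shell
#!/bin/sh
# Made-up sample: two entries, each spread over two lines,
# separated by a blank line (an assumption about the layout).
printf '%s\n' \
    'Smith, J. (2001).' \
    'Usability of mobile maps.' \
    '' \
    'Jones, A. (2005).' \
    'Eye tracking in reading research.' > multiline_sample.txt

# RS="" puts awk in paragraph mode: each blank-line-separated block is one
# record, so the pattern is tested against the whole entry at once.
# ORS="\n\n" keeps a blank line after each printed entry.
awk 'BEGIN { RS = ""; ORS = "\n\n" } /mobile/' multiline_sample.txt > multiline_matches.txt
```

With this layout, a match anywhere in an entry prints the entire entry, not just the matching line.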
