Monday, July 21, 2008

Reading ... Processing file paragraph by paragraph

There are many situations in which you may want to process a text file paragraph by paragraph ...

Here is one such example: I wanted to delete from a text file every paragraph that contained a particular pattern, for instance all paragraphs that have text like 'copyright protected by blah blah'.

The first thing is to learn how to read a text file paragraph by paragraph. For that, we first need to open the file (named web_extract.txt) for reading:
open (FILE, '<', "web_extract.txt") or die "Unable to open web_extract.txt: $!\n";

This is how you can read the opened file line by line:
while (my $line = <FILE>) {
    # ... do something with $line ...
}


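But to read a file in paragraph mode, you have to set the special variable $/ (the input record separator) to the empty string. Look at the code below:
my @paragraphs;
{
    local $/ = '';           # empty string means paragraph mode
    @paragraphs = <FILE>;    # each element is one whole paragraph
    chomp @paragraphs;       # strip the trailing newlines from each paragraph
}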


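This reads the opened file in paragraph mode: each run of one or more blank lines ends a paragraph.

Don't worry about the enclosing block. It is only there so that local can restore $/ to its previous value when the block ends, which keeps any later reads in line-by-line mode.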
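
To see the block at work, here is a small sketch you can run on its own. It assumes nothing about web_extract.txt: the sample text is made up, and it is read through an in-memory filehandle ($in) instead of FILE so that no file is needed on disk.

```perl
use strict;
use warnings;

# Made-up sample text standing in for the contents of web_extract.txt.
my $text = "first line\nsecond line\n\n\nthird line\n\n"
         . "copyright protected by blah blah\n";

# Read from an in-memory filehandle so the sketch needs no file on disk.
open(my $in, '<', \$text) or die "Unable to open in-memory file: $!\n";

my @paragraphs;
{
    local $/ = '';          # paragraph mode, but only inside this block
    @paragraphs = <$in>;    # one element per paragraph
    chomp @paragraphs;      # drop the trailing newlines
}
close($in);

# Outside the block, $/ is back to its default value, a single newline.
print scalar(@paragraphs), " paragraphs read\n";
print "separator restored\n" if $/ eq "\n";
```

Note that the two blank lines after "second line" still count as a single paragraph break.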

Now the array @paragraphs holds one paragraph per element. You can loop over it and push the elements that do not match your pattern onto a new array (@filtered_paragraphs), then print that new array to the same file or to another one. Since chomp removed the trailing newlines, remember to put a blank line back between paragraphs when you print them.
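
The filtering step could look roughly like the sketch below. The sample paragraphs and the pattern are only illustrations, and the output goes to STDOUT to keep the example harmless; to write to a file, point $out at a handle opened with '>' instead.

```perl
use strict;
use warnings;

# Made-up paragraphs, as they would look after the paragraph-mode read.
my @paragraphs = (
    "Some useful text\nspread over two lines",
    "This page is copyright protected by blah blah",
    "More useful text",
);

# Keep only the paragraphs that do NOT match the pattern.
my @filtered_paragraphs;
foreach my $paragraph (@paragraphs) {
    push @filtered_paragraphs, $paragraph
        unless $paragraph =~ /copyright protected by/;
}

# STDOUT is used here as an assumption; swap in a file handle to save the result.
my $out = \*STDOUT;

# chomp removed the separators, so join with a blank line between paragraphs.
print {$out} join("\n\n", @filtered_paragraphs), "\n";
```

The same filter can be written in one line with grep: my @filtered_paragraphs = grep { !/copyright protected by/ } @paragraphs;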

Done!!!
