Wednesday, June 22, 2011

Simple Perl script and regular expressions

Since I've started posting "Before and After" not only to FFN but also to the Fanfiction Mailing List (FFML), The Fanfiction Forum, and the SpaceBattles creative writing forum, I've had to juggle a number of formats. I still use latex2rtf for the FFN upload, but I've had to develop techniques for the other formats. For FFML, I send out plain ASCII text (so no macrons over letters, no fancy ellipses, and no italics except those denoted by underscores). For the fora, I have to wrap italics in [i] and [/i], replace the LaTeX primitives that denote accented characters with their Unicode equivalents, and so on. Done by hand, this is tedious find-and-replace work, but Perl's capacity to handle regular expressions makes it easy to automate. For a primer on regular expressions, see wiki, but the basic idea is this: a statement in Perl using regular expressions might look like

$str =~ s|``|\"|g;

where $str is the existing string to be altered. The syntax is basically this: =~ applies the operation to $str, changing it in place; s means substitute (search and replace); | (or almost any other character) is the delimiter; `` is what we're searching for; \" is the replacement, a double quote (escaped because " usually starts a literal string; the backslash isn't strictly required inside a substitution, but it does no harm); and g means do this everywhere in the string, not just at the first match. The delimiter appears three times, and all three have to be the same character (in other words, you can use whatever character you like for the delimiter, as long as you use it consistently).
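To make the delimiter and g points concrete, here is a minimal sketch. The sample sentence and the variable names are made up for illustration and aren't part of my actual script. It runs the same substitution with three different delimiters, then once more without g:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text; any string containing `` would do.
my $sample = "``Hello,'' she said. ``Goodbye,'' he said.";

# Three spellings of the same substitution; only the delimiter differs.
# Each line copies $sample into a new variable, then edits the copy.
(my $with_pipe  = $sample) =~ s|``|"|g;
(my $with_slash = $sample) =~ s/``/"/g;
(my $with_bang  = $sample) =~ s!``!"!g;

# Without g, only the first match is replaced.
(my $first_only = $sample) =~ s|``|"|;

print "$with_pipe\n";
print "$with_slash\n";
print "$with_bang\n";
print "$first_only\n";
```

The first three lines of output should be identical, with both opening quotes converted; the last line should still have the second `` untouched.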

The script I use now is below. It's written for use on my Linux partition, but the gist of it ought to be portable to Windows.

#!/usr/bin/perl -w
use utf8; #allows for utf characters in input and output
use Getopt::Std; # allows parsing the -b flag to convert to bbcode instead
my %options=();
getopts('b',\%options);

if($#ARGV != 0) #0 means 1 argument (not a flag); -1 means 0 arguments, etc.
{
        print(
"strip-tex [-b] tex_file.tex
        renders a latex subdocument into plain text or bbcode

OPTIONS:
        -b: renders in bbcode
        default (no flags): renders in plain text
                (does not auto-wrap; use par on result)\n");
        exit;
}
binmode STDOUT, ":utf8";
open(FILE, $ARGV[0]) or die("Open failed\n"); #kills the script if failure
@lines = <FILE>; #each entry in array is a single line (paragraph)
$str = "";
$length = scalar(@lines); #the number of lines (paragraphs)
for my $i (0 .. $length-1)
{
        #common statements (not in if...) are for both formats
        $str = $lines[$i];
        $str =~ s|``|\"|g;
        $str =~ s|\\qs| |g;
        if ($options{b})
        {
                #\x{####} denotes a unicode character by its hex code
                $str =~ s|\\ellip\{\}|\x{2026}|g; #ellipsis (braces escaped for newer perls)
                $str =~ s|---|\x{2014}|g; #em dash
                $str =~ s|\\=o|\x{014D}|g; #macron o
                $str =~ s|\\=O|\x{014C}|g; #macron O
                $str =~ s|\\=u|\x{016B}|g; #macron u
                $str =~ s|\\=U|\x{016A}|g; #macron U

                #this while checks for nested \emph{} or other commands
                #latex allows nested emphasis, but bbcode doesn't
                #allow nested [i]
                while($str =~ s!\\([^{]*)\{([^{}]*)\}!\[i\]$2\[/i\]!g)
                {
                        #each pass converts the innermost commands; the loop
                        #ends once a pass finds nothing left to replace
                }
        }
        else
        {
                #plain text portion
                $str =~ s|\\ellip\{\}'|...'|g; #difference in ellipsis
                $str =~ s|\\ellip\{\}|... |g;  #handling for better breaks
                $str =~ s|---|--|g;
                $str =~ s|\\=||g;
                while($str =~ s!\\([^{]*)\{([^{}]*)\}!_$2_!g)
                {
                        #same inside-out loop as the bbcode branch
                }
        }
        $str =~ s|''|\"|g;
        $str =~ s|`|\'|g;
        print $str; #hence, output is to stdout; redirect to desired output
}
close FILE;
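
To see the nested-command loop on its own, here is a minimal sketch. The helper convert_commands and the sample sentence are hypothetical and not part of the script above; the helper just wraps the argument of every \command{...} in whatever markers it is given.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the script above: wraps the argument of
# every \command{...} in the given opening and closing markers.
sub convert_commands {
    my ($text, $open, $close) = @_;
    # Each pass rewrites only the innermost commands (their arguments contain
    # no braces), so repeating until nothing matches unwinds the nesting.
    1 while $text =~ s!\\[^{]*\{([^{}]*)\}!$open$1$close!g;
    return $text;
}

# Made-up sample line with one level of nesting.
my $line = 'She \emph{really \emph{did} mean} it.';
print convert_commands($line, '_', '_'), "\n";      # plain-text style
print convert_commands($line, '[i]', '[/i]'), "\n"; # bbcode style
```

Run as-is, this should print

She _really _did_ mean_ it.
She [i]really [i]did[/i] mean[/i] it.

so the nesting carries through to the output, and anything nested may still need flattening by hand if the target format objects.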

2 comments:

Adam said...

use utf8; probably does not do what you think. It allows you to type utf8 characters in the source code of the perl file.

So you probably don't need it...

Muphrid said...

That's a good catch; thanks for the tip.