Russian romanization via ruby annotation

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Tue, 27 Mar 2012 23:28:05 +0200
To: www-archive@w3.org
Message-ID: <o1a4n75qnikg0nmct4cgl3pon7ncl1d903@hive.bjoern.hoehrmann.de>
So,

  The other day I wondered how I would go about learning the Russian
language. I have enjoyed formal education in High German, the Mooring
dialect of the North Frisian language, English, Latin, and Danish; the
main problem is of course the different script.

I could not find a good online resource suited for how I've wired my
brain over the years; I wouldn't do well with some lookup table that
lists the members of the alphabet and offers some pronunciation hints.
I would rather like something where I can get immediate results that I
care about, something visual that can easily be connected to things I
already know: understanding loanwords better, for instance, or the
Russian spelling of already familiar terms like UNESCO.

So the general idea was that I would like to have something where I can
enter some Russian text and get a reasonably phonetic romanization of
it, with a direct connection to the individual letters instead of some
bulk romanization of the complete text. So I figured I could create
that myself.

Ruby annotation was the obvious solution, and as it turns out, fellow
Usenet participants Alan J. Flavell and Andreas Prilop have worked on
something like this before; in particular

  http://www.unics.uni-hannover.de/nhtcapri/ruby-annotation.var

was helpful, with suitable fall-back style sheets for browsers that do
not handle Ruby annotations natively. Similarly, good old CPAN came to
the rescue to provide romanization rules, here in the form of the module
<http://search.cpan.org/dist/Lingua-Translit/>.
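
For illustration, per-letter ruby markup of the kind described above
could be generated roughly as follows. This is only a minimal sketch and
not the code behind any of the pages mentioned here; the function names
and the { base, text } pair shape are made up for the example, and it
assumes HTML-style <ruby>/<rt>/<rp> markup.

```javascript
// Escape the characters that are special in HTML text content.
function escapeHtml(s) {
  const map = { "&": "&amp;", "<": "&lt;", ">": "&gt;" };
  return s.replace(/[&<>]/g, (c) => map[c]);
}

// Wrap each { base, text } pair in its own <ruby> element, so that every
// letter carries its own annotation. The <rp> elements supply fallback
// parentheses for browsers without ruby support.
function toRubyMarkup(pairs) {
  return pairs
    .map(({ base, text }) =>
      "<ruby>" + escapeHtml(base) +
      "<rp>(</rp><rt>" + escapeHtml(text) + "</rt><rp>)</rp></ruby>")
    .join("");
}

// Hypothetical input: the letter/romanization pairs are assumed to come
// from a separate romanization step.
const sampleMarkup = toRubyMarkup([
  { base: "д", text: "d" },
  { base: "а", text: "a" },
]);
console.log(sampleMarkup);
```

Keeping one <ruby> element per letter is what preserves the direct
letter-to-romanization connection; annotating whole words at once would
lose it.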

That module unfortunately relies on regular expression features that
are unsupported in JS, and I wanted something that works in the
browser. So I made a crude script that, based on the raw data files,
turns the Perl regular expressions into something JS can handle (using
brute force, as there are no good conversion utilities that I know of).
I used roughly this script:

  #!perl -w
  use strict;
  use warnings;
  use XML::LibXML;
  use Set::IntSpan;
  use Encode;
  use JSON;
  use Attribute::Memoize;
  
  die "Usage: $0 rules.xml\n" unless @ARGV;
  
  my $doc = XML::LibXML->load_xml(
    location     => $ARGV[0],
    load_ext_dtd => 0,
  );
  
  my @rules = $doc->findnodes('//rule');
  
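  # Turn a Perl regex into a JS character class by brute force: test
  # every BMP code point as a one-character string against the regex.
  sub simplify_regex : Memoize {
    my $regex = shift;
  
    no warnings 'utf8';
  
    # Surrogates are always put into the class and never passed to chr().
    # The (?:...) wrapper keeps an empty $regex from triggering Perl's
    # "empty pattern reuses the last successful pattern" special case.
    my $spans = Set::IntSpan->new(
      0xD800 .. 0xDFFF,
      grep { chr($_) =~ /(?:$regex)/; } 0x0000 .. 0xD7FF, 0xE000 .. 0xFFFF
    );
  
    # Emit each span as \uXXXX-\uXXXX, which JS regular expressions accept.
    return
        "["
      . join("", map { sprintf "\\u%04X-\\u%04X", @$_ } $spans->spans)
      . "]";
  }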
  
  my @list;
  
  foreach my $rule (@rules) {
    my $from   = $rule->findvalue('from');
    my $to     = $rule->findvalue('to');
    my $before = $rule->findvalue('.//before') || "";
    my $after  = $rule->findvalue('.//after') || "";
  
    push @list, {
      from   => $from,
      to     => $to,
      before => simplify_regex($before) . (length($before) ? '' : '?'),
      after  => simplify_regex($after) . (length($after) ? '' : '?')
    };
  
  }
  
  print JSON->new->ascii(1)->pretty(1)->encode(\@list);

Specifically I used the rules for DIN 1460, the German "industrial norm"
1460 for Russian, as provided by Lingua::Translit in XML form, and I
came up with <http://www.websitedev.de/temp/din1460.html> where you can
enter Russian text and get properly DIN 1460 annotated output. This has
helped me quite a bit to develop a sense for the alphabet and the
language so far.
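
As a rough idea of how a rule list in the shape printed by the Perl
script (from, to, before, after) might be consumed in the browser, here
is a hedged sketch. It is not the code used on the page above. The
sample rules are made up and are not the DIN 1460 data, the function
names are hypothetical, and "first matching rule wins" is an assumption.

```javascript
// Made-up rules in the shape the Perl script prints; NOT the DIN 1460 data.
// "[\\u0000-\\uFFFF]?" is what an empty context is assumed to turn into.
const sampleRules = [
  // "а" (U+0430) becomes "A" only when the next character is "б" (U+0431).
  { from: "а", to: "A", before: "[\\u0000-\\uFFFF]?", after: "[\\u0431-\\u0431]" },
  { from: "а", to: "a", before: "[\\u0000-\\uFFFF]?", after: "[\\u0000-\\uFFFF]?" },
  { from: "б", to: "b", before: "[\\u0000-\\uFFFF]?", after: "[\\u0000-\\uFFFF]?" },
];

// Compile the context classes once: `before` has to match at the end of
// the text preceding a position, `after` at the start of what follows.
function compileRules(rules) {
  return rules.map((r) => ({
    from: r.from,
    to: r.to,
    before: new RegExp("(?:" + r.before + ")$"),
    after: new RegExp("^(?:" + r.after + ")"),
  }));
}

// Walk the input; at each position take the first rule whose `from` and
// contexts match, otherwise pass the character through unannotated.
function annotate(text, rules) {
  const out = [];
  let i = 0;
  while (i < text.length) {
    const rule = rules.find((r) =>
      text.startsWith(r.from, i) &&
      r.before.test(text.slice(0, i)) &&
      r.after.test(text.slice(i + r.from.length)));
    if (rule) {
      out.push({ base: rule.from, text: rule.to });
      i += rule.from.length;
    } else {
      out.push({ base: text[i], text: "" });
      i += 1;
    }
  }
  return out;
}

console.log(annotate("аба", compileRules(sampleRules)));
```

The result is a list of letter/romanization pairs, which keeps the
per-letter connection needed for the ruby markup.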

Note that Lingua::Translit might behave differently in some edge cases.
I did not have the time or inclination to check for that in detail, but
the edge cases that I did test seemed to work out okay. The issue is
largely that the Perl module changes the whole input for each rule,
while my code applies all rules at any given position; since some of
the rules depend on what comes "after" the current position, there may
be differences due to that. That is not the only difference, though.
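
The kind of difference described above can be shown with a toy example.
This is a sketch under stated assumptions, not a reproduction of either
implementation: the two rules are invented, use Latin letters, and have
only an "after" context; the function names are hypothetical.

```javascript
// Toy rules, purely illustrative (not from any real table):
// "x" always becomes "y"; "a" becomes "A" only when followed by "x".
// `from` and `after` are plain letters here, so no regex escaping is done.
const toyRules = [
  { from: "x", to: "y", after: "" },
  { from: "a", to: "A", after: "x" },
];

// Strategy 1: apply each rule to the whole string in turn, so later
// rules see the output of earlier ones.
function applyRuleByRule(text, rules) {
  for (const r of rules) {
    const lookahead = r.after ? "(?=" + r.after + ")" : "";
    text = text.replace(new RegExp(r.from + lookahead, "g"), r.to);
  }
  return text;
}

// Strategy 2: at each position, test all rules against the original
// text, so contexts always see unmodified input.
function applyPositionByPosition(text, rules) {
  let out = "";
  let i = 0;
  while (i < text.length) {
    const r = rules.find((rule) =>
      text.startsWith(rule.from, i) &&
      text.startsWith(rule.after, i + rule.from.length));
    out += r ? r.to : text[i];
    i += r ? r.from.length : 1;
  }
  return out;
}

console.log(applyRuleByRule("ax", toyRules));         // "ay"
console.log(applyPositionByPosition("ax", toyRules)); // "Ay"
```

In the first strategy the "x" has already been rewritten by the time the
second rule looks for it, so the context no longer matches; in the
second strategy the context is checked against the original text.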

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
Received on Tuesday, 27 March 2012 21:28:29 GMT