- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Tue, 27 Mar 2012 23:28:05 +0200
- To: www-archive@w3.org
So, the other day I wondered how I would go about learning the Russian language. I have enjoyed formal education in High German, the Mooring dialect of the North Frisian language, English, Latin, and Danish; the main problem is of course the different script. I could not find a good online resource suited to how I've wired my brain over the years. I wouldn't do well with some lookup table that lists the members of the alphabet and offers some pronunciation hints; I would rather like something where I can get immediate results that I care about, something visual that can easily be connected to things I already know: understanding loanwords better, for instance, or the Russian spelling of already familiar terms like UNESCO.

So the general idea was that I would like to have something where I can enter some Russian text and get some reasonably phonetic romanization of the text, with a direct connection to individual letters instead of some bulk romanization of the complete text. So I figured I can create that myself.

Ruby annotation was the obvious solution, and as it turns out fellow Usenet participants Alan J. Flavell and Andreas Prilop worked on something like this before; in particular http://www.unics.uni-hannover.de/nhtcapri/ruby-annotation.var was helpful, with suitable fall-back style sheets for browsers that do not handle Ruby annotations natively.

Similarly, good old CPAN came to the rescue to provide romanization rules, here in the form of the module <http://search.cpan.org/dist/Lingua-Translit/>. That module unfortunately relies on regex features unsupported in JS, and I wanted something that works in the browser, so I made a crude script that turns the Perl regular expressions into something JS ones can handle, based on the raw data files (using brute force, as there are no good conversion utilities that I know of).
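As an aside on the annotation part: the per-letter idea can be illustrated with a minimal JavaScript sketch. This is only an assumption about how such markup might be produced, not the code behind any page mentioned in this message. The names `SAMPLE_TABLE`, `escapeHtml` and `annotate` are made up for the example, and the table is a hand-picked sample, not a complete rule set.

```javascript
// Hypothetical sample table for illustration only; not a complete rule set.
const SAMPLE_TABLE = new Map([
  ["Ю", "Ju"], ["Н", "N"], ["Е", "E"], ["С", "S"], ["К", "K"], ["О", "O"],
]);

// Escape characters that are special in HTML text content.
function escapeHtml(text) {
  return text.replace(/[&<>"]/g, (c) =>
    ({ "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;" }[c]));
}

// Wrap every letter found in the table in its own <ruby> element, so that
// each romanization is attached to exactly the letter it belongs to.
// The <rp> elements provide parentheses for browsers without Ruby support.
function annotate(text, table) {
  let html = "";
  for (const ch of text) {
    const reading = table.get(ch);
    html += reading === undefined
      ? escapeHtml(ch)
      : `<ruby>${escapeHtml(ch)}<rp>(</rp><rt>${escapeHtml(reading)}</rt><rp>)</rp></ruby>`;
  }
  return html;
}

// "ЮНЕСКО" is the Russian spelling of UNESCO.
console.log(annotate("ЮНЕСКО", SAMPLE_TABLE));
```

Characters that are not in the table pass through unannotated, so mixed input still renders.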
I used roughly this script:

    #!perl -w
    use strict;
    use warnings;
    use XML::LibXML;
    use Set::IntSpan;
    use Encode;
    use JSON;
    use Attribute::Memoize;

    die "Usage: $0 rules.xml\n" unless @ARGV;

    # Load the Lingua::Translit rule file without fetching the DTD.
    my $doc = XML::LibXML->load_xml(
      location => $ARGV[0],
      load_ext_dtd => 0,
    );

    my @rules = $doc->findnodes('//rule');

    # Brute force: test every BMP code point against the Perl regex and
    # turn the matches into a JS character class of \uXXXX-\uXXXX ranges.
    # The surrogate range is always included in the class.
    sub simplify_regex : Memoize {
      my $regex = shift;
      no warnings 'utf8';
      my $spans = Set::IntSpan->new(
        0xD800 .. 0xDFFF,
        grep {
          chr($_) =~ /$regex/;
        } 0x0000 .. 0xD7FF, 0xE000 .. 0xFFFF
      );

      return "[" . join("", map {
        sprintf "\\u%04X-\\u%04X", @$_
      } $spans->spans) . "]";
    }

    my @list;

    foreach my $rule (@rules) {
      my $from   = $rule->findvalue('from');
      my $to     = $rule->findvalue('to');
      my $before = $rule->findvalue('.//before') || "";
      my $after  = $rule->findvalue('.//after') || "";

      # A rule without context gets a match-anything class marked optional.
      push @list, {
        from   => $from,
        to     => $to,
        before => simplify_regex($before) . (length($before) ? '' : '?'),
        after  => simplify_regex($after) . (length($after) ? '' : '?')
      };
    }

    print JSON->new->ascii(1)->pretty(1)->encode(\@list);

Specifically, I used the rules for DIN 1460, the German "industrial norm" 1460 for Russian, as provided by Lingua::Translit in XML form, and I came up with <http://www.websitedev.de/temp/din1460.html>, where you can enter Russian text and get properly DIN 1460-annotated output. This has helped me quite a bit to develop a sense for the alphabet and the language so far.

Note that Lingua::Translit might behave differently in some edge cases. I did not have the time or inclination to check for that in detail, but the edge cases that I did test seemed to work out okay. The issue is largely that the Perl module changes the whole input for each rule, while my code applies all rules to any given position; since some of the rules are based on what comes "after" the current position, there may be differences due to that. That's not the only difference, though.
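To make that difference concrete, here is a minimal JavaScript sketch of the two strategies. It is a simplified model under stated assumptions, not the code of the page above and not the Lingua::Translit implementation. The rules are made up for the demonstration (they are not DIN 1460), `from` is assumed to be a literal string, the first matching rule is assumed to win, and the names `RULES`, `contextMatches`, `applyPerPosition` and `applySequentially` are invented here.

```javascript
// Made-up rules in the shape of the JSON printed by the Perl script; they
// are NOT the DIN 1460 rules. ANY stands for "no context required".
const ANY = "[\\u0000-\\uFFFF]?";
const RULES = [
  { from: "и", to: "i", before: ANY, after: ANY },
  { from: "ш", to: "SCH", before: ANY, after: "[\\u0438-\\u0438]" }, // only if "и" follows
  { from: "ш", to: "\u0161", before: ANY, after: ANY },              // "š" otherwise
];

// Test one neighbouring character ("" at a text boundary) against a class.
function contextMatches(cls, ch) {
  if (cls.endsWith("?")) return true; // optional context always holds
  return ch !== "" && new RegExp("^" + cls + "$").test(ch);
}

// Per-position strategy: at each index, try the rules in order against the
// ORIGINAL text and apply the first one whose `from` and context fit.
function applyPerPosition(text, rules) {
  let out = "";
  let i = 0;
  while (i < text.length) {
    const rule = rules.find((r) =>
      r.from.length > 0 &&
      text.startsWith(r.from, i) &&
      contextMatches(r.before, i > 0 ? text[i - 1] : "") &&
      contextMatches(r.after, text.charAt(i + r.from.length)));
    if (rule) {
      out += rule.to;
      i += rule.from.length;
    } else {
      out += text[i]; // no rule applies, copy the character unchanged
      i += 1;
    }
  }
  return out;
}

// Whole-input strategy: each rule rewrites the entire string before the
// next rule runs, so later rules see already-romanized neighbours.
function applySequentially(text, rules) {
  let current = text;
  for (const r of rules) {
    current = applyPerPosition(current, [r]);
  }
  return current;
}

console.log(applyPerPosition("ши", RULES));  // "SCHi"
console.log(applySequentially("ши", RULES)); // "ši"
```

In the sequential run the first rule has already turned "и" into "i" by the time the context rule looks at what comes after "ш", so that rule no longer fires; in the per-position run it still sees the original "и".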
regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Tuesday, 27 March 2012 21:28:29 UTC