- From: <bugzilla@jessica.w3.org>
- Date: Thu, 01 Mar 2012 08:28:06 +0000
- To: public-i18n-core@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=16166

--- Comment #1 from Simon Pieters <simonp@opera.com> 2012-03-01 08:28:04 UTC ---
I ran a search on the dotnetdotcom.org data.

$ grep -aPo "<[^>]+xml:lang[^>]+>" web200904 > xmllang.txt

I removed all line breaks in xmllang.txt and then replaced all ">" with ">\n".

68202 tags have xml:lang (but potentially also lang).

I then ran this Python script to filter out lines that have a lang attribute:

#!/usr/bin/python
import re
f = open('xmllang.txt', 'r')
o = open('onlyxmllang.txt', 'a')
for line in f:
    # skip tags that also carry a plain lang attribute
    if re.search(r'\slang\s?=', line):
        continue
    o.write(line)
f.close()
o.close()

10245 tags have xml:lang but not lang.

What are those tags?

#!/usr/bin/python
import re
f = open('onlyxmllang.txt', 'r')
tags = {}
for line in f:
    # the tag name is everything after "<" up to the first whitespace
    tag = re.match(r'<([^\s]+)', line).group(1)
    if tag in tags:
        tags[tag] = tags[tag] + 1
    else:
        tags[tag] = 1
f.close()
o = open('onlyxmllangtags.txt', 'a')
for tag in tags:
    o.write(tag + ': ' + str(tags[tag]) + '\n')
o.close()

feed: 5
rdf:RDFxmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"xmlns:dc="http://purl.org/dc/elements/1.1/"xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"xmlns="http://purl.org/rss/1.0/"xmlns:foaf="http://xmlns.com/foaf/0.1/"xmlns:content="http://purl.org/rss/1.0/modules/content/"xml:lang="ja">: 1
!--: 5
h2: 16
h3: 1
dc:title: 5
blink: 1
meta: 190
htmlxmlns=: 6
rdf:li: 5
dc:publisher: 4
!DOCTYPE: 4
dc:subject: 2
span: 163
img: 24
caption: 1
li: 27
content: 5
",: 2
HTML: 59
th: 1
xs:documentation: 811
input: 5
!--<rdf:RDF: 10
Segment: 27
dcterms:isPartOf: 4
body: 93
rdf:RDF: 7
head: 5
acronym: 35
?php++require_once: 1
td: 16
link: 17
abbr: 90
address: 1
em: 3
strong: 1
table: 1
!--<html: 1
rss: 1
a: 105
i: 2
title: 1
html: 7965
summary: 1
htmlxml:lang="fr": 3
p: 430
META: 24
div: 58
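
For readers who want to repeat the measurement, the filtering and counting steps described in the comment could be combined into a single pass. This is only a sketch, not the script that was actually run: the function name count_xml_lang_only and the sample lines are made up for illustration, and collections.Counter stands in for the hand-rolled dictionary. The two regular expressions are the ones quoted in the comment.

```python
import re
from collections import Counter

# Same patterns as in the comment's scripts.
HAS_LANG = re.compile(r'\slang\s?=')   # a plain lang attribute
TAG_NAME = re.compile(r'<([^\s]+)')    # "<" up to the first whitespace


def count_xml_lang_only(lines):
    """Count tag names on lines that have no plain lang attribute.

    Hypothetical helper: each line is assumed to hold one tag that
    already contains xml:lang, as in xmllang.txt.
    """
    counts = Counter()
    for line in lines:
        if HAS_LANG.search(line):
            continue  # tag also has lang, so leave it out
        match = TAG_NAME.match(line)
        if match:  # ignore lines that do not start with "<"
            counts[match.group(1)] += 1
    return counts


# Made-up sample input, not taken from the real data set.
sample = [
    '<html xml:lang="en">\n',
    '<html xml:lang="en" lang="en">\n',
    '<p xml:lang="fr">\n',
    '<html xml:lang="de">\n',
]
result = count_xml_lang_only(sample)
for tag, n in result.most_common():
    print(tag + ': ' + str(n))
```

Because the tag name is taken as everything up to the first whitespace, markup with no space after the element name is counted under an odd-looking key, which would explain entries such as htmlxmlns= in the list.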
Received on Thursday, 1 March 2012 08:28:08 UTC