Re: Final CFP: In-Use Track ISWC 2013

I have a simple problem: how do I extract meaningful information from
a PDF, for instance citation data? I would be happy if I could extract
citation data with 70% accuracy; so far we have tried a lot of tools
and got very poor results. I would also like to know how I could get
at the content of the PDF, jailbreak it so that I can make effective
use of the content.
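Even the most basic first step, just getting the raw text out, gives
back a flat stream of characters with no logical structure. A rough
sketch of the kind of thing I mean (it assumes the pdfminer.six
library; the file name is only a placeholder):

    # Pull the raw text out of a PDF as a starting point for citation
    # extraction. Assumes pdfminer.six is installed
    # (pip install pdfminer.six); "paper.pdf" is a placeholder name.
    from pdfminer.high_level import extract_text

    text = extract_text("paper.pdf")

    # The result is flat text: no sections, no reference list, no
    # per-citation structure.
    print(text[:500])

Everything after that (finding the reference list, splitting out
authors, titles, years) is where the real work starts.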

I don't have anything against PDF as such; I would be happy just to
have an open PDF, something that sets the content free.

On Thu, May 2, 2013 at 10:41 PM, Norman Gray <norman@astro.gla.ac.uk> wrote:
>
> Sarven and all, hello.
>
> On 2013 May 2, at 18:38, Sarven Capadisli <info@csarven.ca> wrote:
>
>>> _What_ sucks on the web?  Certainly not PDF.
>>
>> HTML/Web, PDF/Desktop?
>
> PDF/Web, HTML/Desktop?  I'm not sure what you're trying to say here.
>
>>> Thus HTML can do some unimportant things better than PDF,
>>
>> Web pages. It will never take off.
>
> No no, the web is massively successful.  HTML is a really clever hypertext format which is successful because it lets a number of things go wrong (it doesn't guarantee link integrity, links are all one-way, there's minimal text metadata, and so on).  These deficiencies are seriously smart things to use to create a global hypertext.  Web pages have taken off in a big way.
>
> It does not follow that HTML-based hypertext solves all text problems.  In particular, there is nothing in the above set of clever properties which makes HTML obviously ideal for communicating long-form textual arguments.
>
> And what is this 'desktop' of which you speak?  PDF is for making posters, presentations, on-screen documents, and on-tablet documents -- lots of very distinct layout problems there.  In the last case, you can even transfer the things to paper and read them in the bath, if you want.
>
>
>> but what it
>>> can't do, which _is_ important, is make things readable.  The visual
>>> appearance -- that is, the typesetting -- of rendered HTML is almost
>>> universally bad, from the point of view of reading extended pieces.
>>> I haven't (I admit) yet experimented with reading extended text on a
>>> tablet, but I'd be surprised if that made a major difference.
>>
>> I think you are conflating the job of HTML with CSS. Also, I think you are conflating readability with legibility as far as the typesetting goes. Again, that's something CSS handles provided that suitable fonts are in use.
>
> CSS can help make HTML pages more readable.  Myself, I usually put quite a lot of effort into the CSS which accompanies web pages I write.  But it takes a lot of effort to produce good CSS, and the case you're aiming to optimise is the case of a normal-length web-page (under 1000 words, say), with relatively small investments on the part of the reader.
>
> Distributing PDF, you have easy and precise control over fonts, layout, and overall design (or rather, you in principle have access to a style which is carefully designed).  This makes it easy to produce something which is easy to read for thousands of words.
>
> But this is to some extent irrelevant, because I think we're now talking about a non-problem:
>
>>> Also, HTML is not the same as linked data; there's no 'dog food' here
>>> for us to eat.
>>
>> That's quite a generalization there? So, I would argue that "HTML" is more about eating dogfood in the Linked Data mailing list than parading on PDF. We are trying to build things one step at a time; HTML today, a URI that it can sit on tomorrow. Additional machine-friendly stuff the day after.
>
> What, seriously, is the connection between HTML and linked-data?  If there is a deep connection, then HTML articles represent the linked-data community's dog-food, and it should be eaten.
>
> But there is no such deep connection.
>
> Certainly, HTML is one of the representations which a LD system will offer, because a data provider needs to produce a readily and flexibly rendered human-readable representation of the item data being named/offered.   That's a completely different thing from an article.
>
> In another message in this thread, Alexander Garcia Castro says:
>
>> I am right now struggling with a task as simple as getting citation data
>> from PDFs. I don't want to say that the PDF is all bad, but... come on,
>> it had a place in the time when the desktop was king. Now we need to make
>> effective use of content; the reality is simply that content is locked
>> up in PDFs.
>
> Sure: there are weaknesses in the way that article metadata is currently incorporated in PDFs.  DOIs, ORCIDs, arXiv identifiers, all of the 'Beyond PDF' experiments, and so on are all attempts to join the various dots here, and they are rapidly getting better.
>
> Until we really get AI that can read the paper for us, there's nothing 'locked up in PDFs' that's more than (I exaggerate only slightly) a regular expression away.
>
> All the best,
>
> Norman
>
>
> --
> Norman Gray  :  http://nxg.me.uk
> SUPA School of Physics and Astronomy, University of Glasgow, UK
>
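
On the "regular expression away" point: once the text is actually out
of the PDF, a pattern match will indeed pick up things like DOIs and
arXiv identifiers. A rough sketch (the patterns and the sample string
are only illustrative):

    import re

    # Rough, illustrative patterns for DOIs and arXiv identifiers.
    # In practice "text" would come from a PDF-to-text step like the
    # pdfminer sketch above; a one-line sample stands in here.
    text = "See Smith et al., doi:10.1000/xyz123 and arXiv:1304.6778v1."

    DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
    ARXIV_RE = re.compile(r'\barXiv:\d{4}\.\d{4,5}(?:v\d+)?', re.IGNORECASE)

    print(DOI_RE.findall(text))    # ['10.1000/xyz123']
    print(ARXIV_RE.findall(text))  # ['arXiv:1304.6778v1']

What that does not give me is the structured citation data (authors,
titles, venues) I was asking about at the top of this message.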



-- 
Alexander Garcia
http://www.alexandergarcia.name/
http://www.usefilm.com/photographer/75943.html
http://www.linkedin.com/in/alexgarciac

Received on Thursday, 2 May 2013 20:47:39 UTC