[maemo-users] pdf reading?

From: Martin Collins maemou at mkcollins.org
Date: Thu May 28 07:14:07 EEST 2009
On 5/25/09, Marius Gedminas <marius at pov.lt> wrote:
> On Mon, May 25, 2009 at 12:33:42PM -0600, Martin Collins wrote:
>
> I think you mean pdftohtml.  At least, on my Ubuntu system poppler-utils
> has pdftotext and pdftohtml, without any actual digits in the name.

Yes, you're right.

>> mail the pdf to your gmail account, view in HTML, then save.
>
> How well does this work in practice?

Not very well. You get an .mht file that looks somewhat like the pdf,
minus any images. All the elements use absolute positioning, so the
text may not wrap too well. You need to use the basic HTML version of
gmail to get the 'view in HTML' option. The standard 'view' just gives
you an image of the pdf.

> I'm unhappy with the results I get from pdftotext: it even loses
> paragraph breaks.  pdftohtml, which I never tried before, is a bit
> better, but it considers every line to be a separate paragraph.

What you get from any of these methods will depend to a large extent
on the pdf and how it was created. Some manual intervention will
usually be necessary to get optimal results. With something like a
novel, take the converted HTML and run it through tidy. Then, in a good
text editor, search and replace the bold and italic tags (and any other
formatting you want to keep) with some non-HTML but equivalent
construct. Strip the remaining HTML with htmltotext or similar. Finally,
in the editor, replace your formatting constructs with HTML again, add
valid headers etc. and you're done.
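
For anyone who would rather script the protect/strip/restore steps than
do them by hand in an editor, here is a rough Python sketch. It is only
an illustration, not what I actually run: the [[B]]/[[I]] placeholder
markers and the function names are made up, the tag stripping is a crude
regex rather than a real HTML parser, and it assumes the converter marks
line breaks with <br>.

```python
import html
import re

# Hypothetical placeholder markers (an assumption for this sketch); any
# strings that cannot occur in the book text would do.
PLACEHOLDERS = {
    "b": ("[[B]]", "[[/B]]"),
    "i": ("[[I]]", "[[/I]]"),
}


def protect_formatting(markup):
    """Swap <b>/<i> tags for non-HTML placeholders so they survive stripping."""
    for tag, (open_mark, close_mark) in PLACEHOLDERS.items():
        # Matches <b> or <b attr="..."> but not <br> or <body>.
        markup = re.sub(rf"<{tag}(\s[^>]*)?>", open_mark, markup, flags=re.IGNORECASE)
        markup = re.sub(rf"</{tag}\s*>", close_mark, markup, flags=re.IGNORECASE)
    return markup


def strip_tags(markup):
    """Crudely remove all remaining tags, keeping <br> as a newline."""
    markup = re.sub(r"<br\s*/?>", "\n", markup, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", markup)
    return html.unescape(text)


def restore_formatting(text):
    """Escape the plain text again and turn the placeholders back into tags."""
    out = html.escape(text, quote=False)
    for tag, (open_mark, close_mark) in PLACEHOLDERS.items():
        out = out.replace(open_mark, f"<{tag}>").replace(close_mark, f"</{tag}>")
    return out


if __name__ == "__main__":
    sample = '<p class="x">It was <b>very</b> <i>late</i> &amp; dark.<br>Next line</p>'
    plain = strip_tags(protect_formatting(sample))
    print(plain)
    print(restore_formatting(plain))
```

You would still need to wrap the result in paragraph tags and add the
headers yourself; that part depends too much on the individual book.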

You can fix the split paragraphs in vim by recording keystroke macros
to join any line beginning with [a-z] to the one above, and any line
ending in [,a-z] to the one below.
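
The same two joining rules can be written outside vim as well. Below is
a minimal Python sketch of them, assuming the text has already been
split into lines and that blank lines mark real paragraph breaks; the
function name is made up for the example. It does not try to repair
words hyphenated across a line break.

```python
import re


def join_split_paragraphs(lines):
    """Join a line to the one above if it begins with [a-z],
    or if the line above ends in [,a-z]."""
    joined = []
    for line in lines:
        line = line.strip()
        if not line:
            # Keep blank lines as paragraph separators.
            joined.append("")
            continue
        prev = joined[-1] if joined else ""
        if prev and (re.match(r"[a-z]", line) or re.search(r"[,a-z]$", prev)):
            joined[-1] = prev + " " + line
        else:
            joined.append(line)
    return joined


if __name__ == "__main__":
    text = [
        "It was a dark and stormy",
        "night, and the rain fell,",
        "Steadily on the roof.",
        "",
        "A new paragraph begins",
        "here.",
    ]
    for paragraph in join_split_paragraphs(text):
        print(paragraph)
```

Like the macros, this will wrongly merge short lines such as chapter
titles that happen to end in a lowercase letter, so check the output.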

It sounds involved, but once you have a process, each book only takes a
few minutes unless the pdf is really borked. With some knowledge of
sed, awk, perl and/or vim the process can be largely automated.

BTW, I just discovered evince will rotate too. The option is under the
edit menu for some reason...

Martin
