
PDF text doc into Word - use OCR?

Old 12 February 2004, 03:20 PM
  #1  
Brendan Hughes
Scooby Regular
Thread Starter
 
Join Date: Oct 2000
Location: same time, different place
Posts: 11,313

Got a 40-page text document in PDF, the original is unavailable. I need to get it into HTML, which I would expect to do by converting it to Word first.

Rather than printing it and scanning it, then having to convert all the Is to 1s and Ss to 5s etc etc, can I just somehow open it with, say, Omnipage and do an OCR straight off? Is that possible / easy?

Asking you as although I'm pretty sure we have OmniPage here (no idea what version), I don't think it's used much - the guy just feeds the scanner and hits the button.

Many thanks for tips

Brendan
Old 12 February 2004, 03:29 PM
  #2  
mj
Scooby Regular
 
Join Date: Apr 2002
Location: The poliotical wing of Chip Sengravy.
Posts: 6,129

Without wanting to state the obvious, can you not use the "text select tool"/Ctrl + A, then copy and paste into a blank Word doc?

Might work, though I think it depends on how the PDF was created, i.e. from a scanned image or not.
Old 12 February 2004, 03:33 PM
  #3  
jjones
Scooby Regular
 
Join Date: Apr 1999
Posts: 4,410

There is software to do this. Saw it in Smiths at lunchtime, can't remember the name tho!! Prolly a "free" download available.
Old 12 February 2004, 03:40 PM
  #4  
Roland_STI
Scooby Regular
 
Join Date: Jan 2004
Posts: 62

OCR software will still have trouble with I/1 and S/5 etc. As above, open the .pdf in Adobe Acrobat Reader (free download) and copy all the text to whatever program you wish. If you truly want to edit the .pdf file, get a copy of Adobe Acrobat Writer to edit it.

You say its destination is HTML. You know you can link to the .pdf file, and when clicked most people will view the file from within Internet Explorer?

Oh, and I hope you weren't serious when you mentioned writing HTML via Word.
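
On the I/1 and S/5 point: one rough way to speed up proofreading OCR output is to list every word that mixes letters and digits, since that is where those swaps tend to show up. The sketch below is only an illustration in Python; the function name `flag_suspect_words` is made up for this example, and it will also flag legitimate tokens such as product names with version numbers, so treat the result as a review list, not a set of corrections.

```python
import re

# A "suspect" word contains at least one letter AND at least one digit,
# e.g. "5tate" (S read as 5) or "Il1inois" (l read as 1).
SUSPECT = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")

def flag_suspect_words(text: str) -> list[str]:
    """Return the words in `text` that mix letters and digits, in order."""
    return SUSPECT.findall(text)

if __name__ == "__main__":
    sample = "The 5tate of Il1inois in 2004"
    for word in flag_suspect_words(sample):
        print(word)
```

Pure numbers such as years are left alone because they contain no letters.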
Old 12 February 2004, 04:30 PM
  #5  
Brendan Hughes
Scooby Regular
Thread Starter
 
Join Date: Oct 2000
Location: same time, different place
Posts: 11,313

Hmm...

OK. I've done the text select tool copy/paste before, but IIRC I had to do it page by page. Being accurate, I can now tell you I have two docs of 57 and 46 pages each, and each one should be done in two languages. So 200-odd pages at one page a time would be, um, irritating. Hence my Holy Grail of A Quicker Way, doing the whole document in one hit with OCR. I've a suspicion that the copy/paste from Acrobat screws up the text formatting also.

Roland, it has to go into HTML for the free text search to work in the database it'll be uploaded to. And I use Word as it has the best text cleanup facilities - DW4 soon gets rid of the crap that Mr Gates inserts as payment.
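
If the goal is just searchable HTML, the Word step can in principle be skipped once the text is extracted. The following is a minimal sketch, assuming the extracted text has blank lines between paragraphs; `paragraphs_to_html` is a hypothetical helper name, and the page skeleton is the bare minimum, not whatever the target database actually expects.

```python
from html import escape

def paragraphs_to_html(text: str, title: str = "Untitled") -> str:
    """Wrap blank-line-separated paragraphs in <p> tags inside a bare HTML page."""
    # Split on blank lines and drop empty chunks.
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    # escape() turns &, < and > into entities so the text cannot break the markup.
    body = "\n".join(f"<p>{escape(b)}</p>" for b in blocks)
    return (
        "<!DOCTYPE html>\n"
        f"<html>\n<head>\n<title>{escape(title)}</title>\n</head>\n"
        f"<body>\n{body}\n</body>\n</html>\n"
    )

if __name__ == "__main__":
    print(paragraphs_to_html("First paragraph.\n\nSecond paragraph.", "Demo"))
```

Output produced this way carries none of the extra markup that Word's own HTML export inserts, so there is nothing to clean up afterwards.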
Old 12 February 2004, 04:46 PM
  #6  
boxst
Scooby Regular
 
Join Date: Nov 1998
Posts: 11,905

Hello

A quick scan on Google:

http://www.gohtm.com/convert_html.asp

http://www.softpedia.com/public/cat/4/2/4-2-47.shtml

http://www.xelerate.biz/index.php

Steve.
Old 12 February 2004, 05:19 PM
  #7  
GaryK
Scooby Regular
 
Join Date: Sep 1999
Location: Bedfordshire
Posts: 4,037

Not sure if this will work in Acrobat Reader 5.0 (free d/l), but in the full Acrobat 5.0 you can simply do Edit | Select All | Copy, then fire up Word, do a paste and job done! What you probably won't be able to do in Reader is export the images, if there are any in the PDF, to separate image files.

cheers

Gary
Old 12 February 2004, 06:04 PM
  #8  
lightning101
Scooby Regular
 
Join Date: Oct 2004
Location: Never do names esp. Joey, spaz or Mong
Posts: 39,688

export all pages as jpegs and run them through the OCR prog - works perfect.
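
For anyone wanting to script the "export pages as JPEGs, then OCR" route, here is a hedged Python sketch. It swaps in the open-source command-line tools `pdftoppm` (from Poppler) and `tesseract` in place of Acrobat and OmniPage, which is an assumption and not what the post used; both must already be installed. The function names and the `pages` work directory are hypothetical.

```python
import subprocess
from pathlib import Path

def rasterise_command(pdf: str, prefix: str, dpi: int = 300) -> list[str]:
    """Build the pdftoppm call that writes one JPEG per page, named from `prefix`."""
    return ["pdftoppm", "-jpeg", "-r", str(dpi), pdf, prefix]

def ocr_command(image: str, lang: str = "eng") -> list[str]:
    """Build the tesseract call; tesseract adds ".txt" to the output base itself."""
    out_base = str(Path(image).with_suffix(""))
    return ["tesseract", image, out_base, "-l", lang]

def ocr_pdf(pdf: str, workdir: str = "pages", lang: str = "eng") -> None:
    """Rasterise every page of `pdf`, then OCR each page image in order."""
    Path(workdir).mkdir(exist_ok=True)
    prefix = str(Path(workdir) / "page")
    subprocess.run(rasterise_command(pdf, prefix), check=True)
    for image in sorted(Path(workdir).glob("page-*.jpg")):
        subprocess.run(ocr_command(str(image), lang), check=True)
```

Calling `ocr_pdf("document.pdf", lang="fra")` would leave one text file per page next to the images. The `lang` argument matters for documents in more than one language, since the matching tesseract language data has to be installed.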
Old 12 February 2004, 08:04 PM
  #9  
Brendan Hughes
Scooby Regular
Thread Starter
 
Join Date: Oct 2000
Location: same time, different place
Posts: 11,313

A virtual beer is winging its way to GaryK IF you can solve the word wrap problem for me. That method shoves everything down the left side of the page, which I can handle, but with no gaps between paragraphs. (When there's a para gap it's easy: I just use Word's auto-replace to delete all the single para marks but keep the doubles, thus getting the word wrap.) Acrobat 5's text selection "table/formatted text" keeps the word wrap (though with some dodgy justification) but only grabs a page at a time (musta been that I was thinking of earlier). Is there any way I can get the "Select All" command to highlight it in "Table/formatted text" mode?
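
The "delete the single para marks but keep the doubles" trick can be sketched outside Word too. This is a minimal Python illustration with a made-up function name, `unwrap_paragraphs`; like the Word auto-replace, it only works when the pasted text has a blank line between paragraphs.

```python
import re

def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines, keeping blank-line paragraph breaks."""
    # Normalise Windows/Mac line endings so only "\n" remains.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A blank line (possibly containing spaces) marks a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside a paragraph, every remaining line break becomes a single space.
    joined = [" ".join(line.strip() for line in p.split("\n")) for p in paragraphs]
    return "\n\n".join(joined)

if __name__ == "__main__":
    print(unwrap_paragraphs("one\ntwo\n\nthree\nfour"))
```

When the paste has no gaps at all, every line break looks the same and this approach has nothing to go on, which is exactly the problem described above.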
Old 12 February 2004, 08:52 PM
  #10  
lightning101
Scooby Regular
 
Join Date: Oct 2004
Location: Never do names esp. Joey, spaz or Mong
Posts: 39,688

http://www.convertzone.com/

This will do it for you.

I can taste the virtual beer already (LOL)
Old 12 February 2004, 09:35 PM
  #11  
Fig
Scooby Regular
 
Join Date: Aug 2002
Location: not forgetting 20,000 posts from last time ;)
Posts: 5,806

The latest version of OmniPage Pro (Office) has an add-in for MS Word that opens PDF files and converts them to Word docs.
Old 13 February 2004, 08:25 AM
  #12  
GaryK
Scooby Regular
 
Join Date: Sep 1999
Location: Bedfordshire
Posts: 4,037

Brendan,

How about a case of virtual wine? Actually it will all be safe, because I can't see any quick and easy way of doing it TBH.

good luck with it anyway!

Gary
Old 13 February 2004, 09:28 AM
  #13  
Brendan Hughes
Scooby Regular
Thread Starter
 
Join Date: Oct 2000
Location: same time, different place
Posts: 11,313

Coz I can't afford the virtual postage on an entire virtual crate. Reading more carefully, it seems you'd have to share it with mj anyway!

I'm going to see if our sysadmin will let me use the free progs suggested above (too tight to pay for any of them), and see what version of OmniPage we have.

Otherwise, I approach our secretary with a big smile and compliment her hair...