
pdf>excel

Acrobat has a really bizarre way of formatting text that makes it unusable for other applications.

The only way that I've been able to do this type of thing is to save the pdf file as a tif and then use an OCR program to convert it to text, then copy and paste into Excel.
 
That's a good solution... You could buy the

Try downloading the trial version of Adobe Acrobat Professional and see if you can pull out the text that way. I don't know if you can. Acrobat works by compressing images (and sometimes text) to vectors. This makes the files very small and very clear, and zoom works very well. The problem could be that the PDF contains no actual text, just images of text.

Then you would have to output to an image and run it through an OCR program, as stated above.

http://www.adobe.com/products/acrobatpro/

However, when you're done with the OCR, you will have another issue: processing the text into a PDF. Usually this requires programming to convert the text to a format that can be imported into Excel. This takes some cleanup logic and can be difficult. I pull it off using a few different tools, each one accomplishing something...

Then, when it's finally as good as it gets, you have to go in and edit it yourself.

Mike

PS: I looked at the Excel sheet. I could probably pull it off, but good luck; it is NOT going to be easy. The OCR software is not going to be easy either. You will have errors and would have to double-check every number. I recommend that you either contact the people who made the PDF, or start typing.
 
GreenJelly said:
That's a good solution... You could buy the

Try downloading the trial version of Adobe Acrobat Professional and see if you can pull out the text that way. I don't know if you can. Acrobat works by compressing images (and sometimes text) to vectors. This makes the files very small and very clear, and zoom works very well. The problem could be that the PDF contains no actual text, just images of text.

Actually, I use the Professional version extensively. Because Acrobat places text as individual text objects in a specific tab order, however, direct export as a table is not possible. Acrobat does not convert object types; it keeps each object as it was in the source file: text, vector-based graphic, or image. Problems sometimes arise when folks simply place an image (scan) of a document in a pdf, but this does not apply to the document linked by the OP.

GreenJelly said:
However, when you're done with the OCR, you will have another issue: processing the text into a PDF. Usually this requires programming to convert the text to a format that can be imported into Excel. This takes some cleanup logic and can be difficult. I pull it off using a few different tools, each one accomplishing something... Then, when it's finally as good as it gets, you have to go in and edit it yourself.

Converting text to a pdf is as simple as using the free Acrobat printer, but the OP wishes to extract a table to Excel. Any table, be it html, a Word document, or a plain text file using separated values (tab, comma, space), can be imported directly into Excel. Advanced OCR programs such as Readiris Pro 9.0 will recognize and format text into tables for you.
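
For anyone who would rather script the "separated values" step, here is a rough Python sketch. It assumes the OCR output is plain text with columns separated by a tab or by two or more spaces; the function name and the sample figures are made up for illustration and are not from the OP's document.

```python
import csv
import io
import re

def text_table_to_csv(text):
    """Turn whitespace-aligned rows of text into CSV text that Excel can open."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        # Assumption: a column break is a run of 2+ whitespace characters or a single tab.
        writer.writerow(re.split(r"\s{2,}|\t", line))
    return out.getvalue()

# Made-up sample standing in for OCR output.
sample = "Income  Tax\n1,000   50\n2,000   110\n"
csv_text = text_table_to_csv(sample)
print(csv_text)
```

Values that contain a comma, such as 1,000, get quoted by the csv module, so they should stay in a single cell when the file is opened in Excel.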


GreenJelly said:
PS: I looked at the Excel sheet. I could probably pull it off, but good luck; it is NOT going to be easy. The OCR software is not going to be easy either. You will have errors and would have to double-check every number. I recommend that you either contact the people who made the PDF, or start typing.

Actually, since Acrobat Professional (needed to export to tif) can save an exact image of the font at resolutions of up to 2400dpi (although 600dpi will suffice), the image quality is essentially perfect. Advanced OCR programs can read the rendered numbers with 100% accuracy, provided that the source document was text (as was the case with the OP's link).
 
My problem is with the size of the font, and the font itself (and it doesn't zoom well; it looks like it may have been a scan->pdf)... If he is unable to change that, the OCR programs will have a tough time. Anyway, it depends on his need for accuracy. The problem is that if he uses OCR and it reads an 8 as a 0, then the task has failed.

We don't know if there is text underlying the document. If there is, then he can export it. However, you then have to filter it and put it in a form, like you mentioned, that can be imported into Excel. You can paste it into Word and do a bunch of Replaces to handle most of this.
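
The "bunch of Replaces" idea can also be done as a short script instead of in Word. This is only a sketch: the three patterns below (page-number lines, thousands separators, runs of spaces) are my guesses at typical cleanup, not something taken from the actual PDF, and the sample text is invented.

```python
import re

# (pattern, replacement) pairs applied in order, like successive Find/Replace passes.
# All three patterns are assumptions about what the exported text might contain.
REPLACEMENTS = [
    (r"(?m)^[ \t]*Page \d+[ \t]*$\n?", ""),  # drop hypothetical "Page N" lines
    (r"(?<=\d),(?=\d{3}\b)", ""),            # strip thousands separators: 1,000 -> 1000
    (r" {2,}", "\t"),                        # turn runs of spaces into one tab
]

def clean(text):
    """Apply every replacement pass to the text and return the result."""
    for pattern, repl in REPLACEMENTS:
        text = re.sub(pattern, repl, text)
    return text

# Invented sample standing in for text exported from the PDF.
raw = "Income   Tax\n1,000    50\nPage 3\n12,500   1,110\n"
cleaned = clean(raw)
print(cleaned)
```

The result is tab-separated, which pastes straight into Excel columns. The order matters: the thousands separators are removed before the spaces are turned into tabs.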

The Ultimate Solution is to find out who made the PDF and get the original document... Which was probably made in Excel... hehe.

All of these solutions take time. I've done conversions like this and it's no short project. If everything works well I am done in 2 hours, but that's rare. Life always throws curveballs, and those 2 hours can turn into two days.

I wish the guy luck... I think it would be best to try to extract the text with the Pro version. If that's not possible, then TRY OCR. If there are a lot of errors, then you must step up and retype it.

Luckily it's only 10-15 pages (I forget how many exactly). If you can get a fast typist to do it, they could get it done in 4 hours or less. Then, if it's important enough, get a second person to type in the same thing and run a check on the two Excel sheets.

In fact, I'm leaning toward re-typing it from the get-go. You will probably have to deal with a lot to get any of the other approaches going. The only reason I would do it another way would be if there are sets of these and you have to develop a process to do 10+ of them.
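
The "run a check on the two Excel sheets" step could look something like the Python sketch below. It assumes both typists' sheets have been saved as CSV; the function name and the sample data are invented for illustration.

```python
import csv
import io

def diff_tables(csv_a, csv_b):
    """Return (row, column, value_a, value_b) for every cell where the two tables differ."""
    rows_a = list(csv.reader(io.StringIO(csv_a)))
    rows_b = list(csv.reader(io.StringIO(csv_b)))
    mismatches = []
    for r in range(max(len(rows_a), len(rows_b))):
        row_a = rows_a[r] if r < len(rows_a) else []
        row_b = rows_b[r] if r < len(rows_b) else []
        for c in range(max(len(row_a), len(row_b))):
            # A missing cell is reported as None so extra rows or columns also show up.
            a = row_a[c].strip() if c < len(row_a) else None
            b = row_b[c].strip() if c < len(row_b) else None
            if a != b:
                mismatches.append((r + 1, c + 1, a, b))  # 1-based, like Excel
    return mismatches

# Invented sample: the first typist entered 118 where the second entered 110.
typist_1 = "Income,Tax\n1000,50\n2000,118\n"
typist_2 = "Income,Tax\n1000,50\n2000,110\n"
mismatches = diff_tables(typist_1, typist_2)
print(mismatches)
```

Any cell it flags still has to be checked against the PDF by eye, since the script cannot tell which typist was right. It also cannot catch the case where both typists make the same mistake.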

This is the BEST solution in my mind, due to the small amount of data we are dealing with.

PDFs are the worst for extracting data from... There may be metadata on this that you can pull off... I wonder how Google pulls the text from that.

Mike
 
GreenJelly said:
My problem is with the size of the font, and the font itself (and it doesn't zoom well; it looks like it may have been a scan->pdf)...

With Acrobat Pro, you can easily tell the difference. As I mentioned, the link provided by the OP is text. Using my technique, he can be done in about 15 minutes.

I speak from experience, as I have to do this every year for many tables produced by the IRS, since we use the same tables in our local tax department. Our procedure is pdf>tif>ocr>excel>csv>AS/400. It is 100% accurate.
 
Cool... I was thinking MetaData...

I betcha the text is in the header of that file, if you can get to it...

Mike
 
GreenJelly said:
Cool... I was thinking MetaData...

I betcha the text is in the header of that file, if you can get to it...

Mike

Actually, the text is encapsulated in the file, just like in Illustrator (remember eps?). It's only the manner in which Acrobat formats the text that makes it unusable as a table.
 
But I remember skimming detailed discussions about adding metadata to the PDF standard so that search engines could get the content of a PDF without downloading the whole file.

Unfortunately, I didn't need this information at the time, but now I clearly remember those documents. If this metadata exists, it should contain a text version of the entire document with little or no formatting. The metadata may be (probably is) compressed, and it can be read without pulling the whole file, so it would be near the top.

The problem, and a great solution for both of you, is to figure out whether this data exists and how to access it. Then the solution and process for both of your problems becomes infinitely easier.

Mike
 