#1
search multiple PDFs for common text
Hi,

I have about 100 PDF files that I would like to search for common words ("word", for instance).

I'd prefer not to write a script, but I can if that's the only option.

I'm using Windows 10, so the scripting language would preferably be PowerShell.

Any assistance is appreciated and, if that's okay, will be acknowledged in any public use of it.

--
dale - http://www.dalekelly.org/
Not a professional opinion unless specified.
#2
"dale" wrote
| I have about 100 PDF files I would like to search for common words.
|

It's easy with something like Agent Ransack. Just put the PDFs in a folder and search. No script needed.

But there is one big caveat: not all text in PDFs is in there as text. A book, for instance, might be composed of the text of the book. Or it could be composed of scans of the book's pages. The latter will have no actual text. You'd have to extract the pages and run them through OCR to get the text.
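One rough way to find out which files fall into the "scans only" group is to run a text extractor over each PDF and see whether anything comes out. The sketch below assumes Poppler's pdftotext is installed and on the PATH; the function names are made up for illustration. A scanned PDF that already carries an OCR text layer will count as having text.

```shell
#!/usr/bin/env bash
# Sketch: list PDFs that appear to have no extractable text layer.
# Assumption: Poppler's pdftotext is installed and on PATH.

# Succeeds if standard input contains at least one non-whitespace character.
# Form feeds (which pdftotext emits between pages) count as whitespace.
has_visible_text() {
    grep -q '[^[:space:]]'
}

# Prints every PDF in the given directory (default: current directory)
# for which pdftotext yields no visible text; these are probably scans.
list_image_only_pdfs() {
    local dir=${1:-.} f
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found on PATH" >&2
        return 1
    fi
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue   # no PDFs: the pattern stayed literal
        # "-" as the output name makes pdftotext write to standard output.
        if ! pdftotext "$f" - 2>/dev/null | has_visible_text; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling `list_image_only_pdfs /path/to/pdfs` would print the files that probably need OCR before any text search can find anything in them.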
#3
On 5/9/2018 5:32 PM, Mayayana wrote:
> "dale" wrote
> | I have about 100 PDF files I would like to search for common words.
>
> It's easy with something like Agent Ransack. Just put the PDFs in a folder and search. No script needed.
> [...]
> You'd have to extract the pages and run them through OCR to get the text.

Please remember that there are two basic kinds of PDF documents. One is a text-based document in PDF format. The other is an image in PDF format. If the image is of a text document, it will look like a text-based document, but it cannot be searched. If the image document has been OCR'ed and the OCR text has been included in the PDF, the document will be searchable.

So, depending on the source of the documents you wish to search, your goal may be attainable, but if the PDFs are image documents from a scanner, they will not be searchable.

--
2018: The year we learn to play the great game of Euchre
#4
On 5/9/2018 5:32 PM, Mayayana wrote:
> It's easy with something like Agent Ransack. Just put the PDFs in a folder and search. No script needed.
> [...]
> You'd have to extract the pages and run them through OCR to get the text.

These aren't books, though they might be encoded the same way. I am able to copy/paste the text from the one I tried, using the MS Edge browser as the viewer.

I looked over the Agent Ransack specs, and it seems like it can work.

It isn't available in the Microsoft Store, so I don't know yet if I want to install it. The "about us" looks impressive.

Thanks much
#5
On 5/9/2018 5:47 PM, Keith Nuttle wrote:
> Please remember that there are two basic kinds of PDF documents. One is a text-based document in PDF format. The other is an image in PDF format.
> [...]
> if the PDFs are image documents from a scanner, they will not be searchable.

I can copy/paste from one of the PDF files using the MS Edge browser as the viewer, and I can search one with Edge's "find on page" function.
#6
dale wrote:
> I have about 100 PDF files I would like to search for common words.
> [...]
> using Windows 10, so the scripting language would be preferably powershell

Acrobat Exchange (which may be called just "Acrobat" today) had an inverted search indexer, which could accept hundreds of documents and index them into a common database file. I have at least one CD, distributed as advertising, with one of those indexes covering every PDF on the CD. Very nice. The search could then be carried out instantly, and when you clicked a single line in the search results, the document in question would open.

My copy of Acrobat Exchange 4 or so had a feature like that: the inverted search indexer.

This could also work... as long as Windows 10 had a search provider for PDF. Since Windows 10 doesn't have a "thumbnailer" capability for PDF unless you install some Acrobat software, what are the odds that PDF files will get indexed by the built-in Federated Search in Windows 10?

The problem is, Windows cannot "add a file type" for content search unless a search provider knows how to open the file and extract words from it.

There are tools such as the open-source pdf2text or pdftotext that might work (script-level detection of text).

Note that there can be differences in the quality of the tools. For example, any bozo can extract a single text string.

    bozo

However, tools like LibreOffice have on occasion resorted to micro-positioning (overriding the font metrics and pretending they're smarter than font people) when they save out in PDF format. What happens when pdftotext sees

    b o z o

Does that get converted to four one-letter words? Or is the tool clever enough to realize that is "bozo"?

If the letters are arranged like this, no open-source software will do a good job. The baseline has to be smooth.

    b   z
      o   o

Note that PDF documents can contain strings positioned on spline curves. Do not expect an open-source tool to extract those. Probably Adobe knows how to extract such a thing, but other tools will be hit and miss. Text which has a purely horizontal or vertical orientation (as a string), with a smooth baseline, might well be extracted as you would expect.

So, yes, you might be able to buy software to do this. I still, on occasion (for experiments), try to get that old inverted search indexer to do stuff for me.

You could also try mechanically concatenating all 100 documents into one document, and then using the sequential text search that exists in MSEdge. It takes MSEdge roughly 7.5 minutes to search a 36,300-page document.

   Paul
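If an extractor does turn a micro-positioned word into "b o z o", one possible workaround on the search side is to allow optional whitespace between the letters of the search term. This is only a sketch: the function names are invented, the word is assumed to contain only letters and digits (no regex metacharacters), and grep works line by line, so letters that end up on different lines (the uneven-baseline case) are still missed.

```shell
#!/usr/bin/env bash
# Sketch: search extracted text for a word even when the extractor
# has inserted spaces between its letters ("b o z o").
# Assumption: the word consists of letters and digits only.

# Prints an extended regex with optional whitespace between the letters.
spaced_pattern() {
    local word=$1 pattern='' i
    for (( i = 0; i < ${#word}; i++ )); do
        if (( i > 0 )); then
            pattern+='[[:space:]]*'
        fi
        pattern+=${word:i:1}
    done
    printf '%s\n' "$pattern"
}

# Lists the given files that contain the word, spaced out or not,
# ignoring case.
grep_spaced() {
    local word=$1
    shift
    grep -E -i -l -e "$(spaced_pattern "$word")" -- "$@"
}
```

For example, `grep_spaced bozo *.txt` would list the text files containing "bozo" or "b o z o". The pattern is deliberately loose, so it can also hit across word boundaries (it matches inside "jumbo zoo"), which means the results need a second look.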
#7
On 5/9/2018 6:27 PM, Paul wrote:
> Acrobat Exchange (which may be called just "Acrobat" today) had an inverted search indexer, which could accept hundreds of documents and index them into a common database file.
> [...]
> You could also try mechanically concatenating all 100 documents into one document, and then using the sequential text search that exists in MSEdge.

Thanks much, will research.

I found these:

https://helpx.adobe.com/acrobat/usin...hing-pdfs.html
https://helpx.adobe.com/acrobat/usin...ng_pdf_indexes

I don't think I need an index, although it would be nice. This means I don't need "Pro".

This shows pricing; "Pro" is only $2 more on a monthly basis:

https://acrobat.adobe.com/us/en/acrobat/pricing.html
#8
"dale" wrote
| looked over the Agent Ransack specs, seems like it can work
|
| isn't available in Microsoft store so I don't know yet if I want to
| install it, "about us" looks impressive
|

I've used it for years to replace the very limited Windows search functionality. Some people like a program called Everything, but I've never tried that.

Another option would be to use something like Sumatra, or any basic PDF reader, to export the content as a text file. Then you'd have better access. Though in my experience, nothing exports text perfectly. There's usually some "noise", like an "h" that ends up as "1n". Things like that. Mistakes based on character shape.
#9
dale wrote:
> I have about 100 PDF files I would like to search for common words.
> [...]

There are some more breadcrumbs in this thread.

https://answers.microsoft.com/en-us/...6-3247614e202b

The "reader search handler" is apparently a means of getting Windows Search to include the content of PDF files.

https://filestore.community.support....c-184b8e852900

   Paul
#10
On 5/9/2018 8:01 PM, Paul wrote:
> [...]
> The "reader search handler" is apparently a means of getting Windows Search to include the content of PDF files.

Total Commander is a great substitute for Windows Explorer. It has a lot of plugins, including one to search PDF files.
#11
mike wrote:
> Total Commander is a great substitute for Windows Explorer. It has a lot of plugins, including one to search PDF files.

Weird. I just checked my PDF entry and I have a "reader search handler". It looks like I had Acrobat Reader installed (as part of an experiment to get PDF thumbnails working) and removed it, and the "reader search handler" seems to have stuck around. When I found a reasonably unique keyword and tried a search in File Explorer, the only file that popped up in the search results was the PDF file in question. So it looks like mine have been indexed in Win10, purely by accident/side effect.

*******

For the Total Commander plugin, is that an indexer or just a "real time search"?

I think a fun test would be to find a file prepared in Illustrator, where the text is on a path, and see if it detects the text string properly.

   Paul
#12
On 5/9/2018 9:49 PM, Paul wrote:
> For the Total Commander plugin, is that an indexer or just a "real time search"?
>
> I think a fun test would be to find a file prepared in Illustrator, where the text is on a path, and see if it detects the text string properly.

I have no experience. A standard install of TC won't search inside a PDF, but the plugin does. I just installed the plugin to see what it would do. I put in a keyword to search, and the PDF files containing the keyword popped up in the list. I searched a directory of PDFs to keep it simple. It seems to be able to restrict the search to PDF, but I didn't try it. It's a keeper.
#13
On 5/9/2018 11:01 PM, Paul wrote:
> [...]
> The "reader search handler" is apparently a means of getting Windows Search to include the content of PDF files.

Thanks much, Paul.
#14
On 5/9/2018 5:14 PM, dale wrote:
> I have about 100 PDF files I would like to search for common words.
> [...]

I was able to do this in Windows File Explorer: open a folder that has the PDF files, then use the search field in the upper right-hand corner.

Duh... thanks to my accountant sister.

I don't know all the file types it will handle, but it does search a Mozilla Thunderbird email file (.eml) that just happened to be in the same folder.
#15
On 09/05/18 22:14, dale wrote:
> I have about 100 PDF files I would like to search for common words. "word" for instance
> prefer not to write a script, but can if the only option
> using Windows 10, so the scripting language would be preferably powershell

That sounds like gross overkill for something this simple.

pdftotext is a command-line utility that ships with Poppler and extracts all text from a PDF, so installing Poppler would be the first thing to try. It appears to be available for Linux, Mac, and Windows.

Then you can type as follows (this is Linux Bash syntax; I assume Windows has something similar in the Command shell):

    for f in *.pdf; do pdftotext "$f"; done

which will create a .txt file for every .pdf, so you can use them in whatever search program you have.

///Peter
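Once the .txt files exist, the search itself can stay in the same shell. The following is a minimal sketch, assuming Poppler's pdftotext and a standard grep are available; `convert_pdfs` and `search_texts` are illustrative names, not part of any tool mentioned in the thread.

```shell
#!/usr/bin/env bash
# Sketch: convert PDFs to text, then list the text files containing a word.
# Assumptions: pdftotext (Poppler) and grep are on PATH.

# Creates NAME.txt next to each NAME.pdf in the given directory
# (default: current directory).
convert_pdfs() {
    local dir=${1:-.} f
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue   # no PDFs: the pattern stayed literal
        pdftotext "$f" "${f%.pdf}.txt"
    done
}

# Prints the .txt files in the directory that contain the word as a whole
# word, ignoring case. The word is treated as a fixed string, not a regex.
search_texts() {
    local word=$1 dir=${2:-.}
    local files=("$dir"/*.txt)
    [ -e "${files[0]}" ] || return 1   # nothing to search
    grep -i -w -l -F -e "$word" -- "${files[@]}"
}
```

Typical use would be `convert_pdfs . && search_texts word .`, which prints one matching .txt file per line; the PDF of the same base name is the document to open. Files that come out empty are likely scans without a text layer.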