#1
Can I copy text not meant for copying?
When webpages and PDF files don't permit selecting and copying text, is there a way around that? Not counting Print Screen, which I use sometimes but which won't help when the place I want to copy to only allows text.

Thanks
#2
| When webpages and pdf files don't permit selecting and copying text,
| is there a way around that? Not counting print screen, which I use
| sometimes, but won't help when the place I want to copy to only allows
| text.

Those are different issues.

A webpage can do things like blocking the right-click menu, but only if you enable script. A webmaster can also do things like using an image of text. I recently saw a page where the author had gone to great lengths to block image copying by loading the images in a Flash program. (Without script and Flash enabled there are no pictures on the page!) If there's text there you should be able to get it by disabling script. You can also view the source code to get at the text. And in most browsers you can view the page with no style, which makes selecting the right text easier in some cases.

PDFs are different. Adobe designed the PDF format to allow for a number of restrictions. Text copying can be blocked. A password can be required. Etc. Those restrictions are actually just "flags" in the PDF file. There's not really any kind of lock. But most software respects the flags. So, if you have a PDF with a text-copying restriction the only option is to get software that will bypass it. I think there is such software, but not for free.

It's an odd issue. Since you have a right to the file you have a right to access the text, but Adobe has tried to mimic white collar procedure in order to impart a sense of solidity to digital files. In doing that they've done their best to render a PDF as an immutable file that mimics a printed page, and is actually designed just to get business docs transported via PC and printer rather than via postal mail.

Unfortunately, people often restrict PDFs for no good reason. (I once downloaded a state auto accident report form that I had to file in triplicate, and the editing function was blocked!)

In some cases a PDF is actually a collection of scanned book pages. In that case there isn't any text. Your only option is to run it through OCR software. But actually, these days OCR software is quite good, and usually comes free with a scanner.

There's a command line PDF extractor named XPDF. I wrote a convenient wrapper for it here:

http://www.jsware.net/jsware/pdfconv.php5

With that you can extract text and images. As I note on that page, Sumatra PDF can also extract text and does a better job. XPDF is outdated. But Sumatra doesn't extract images.

Both XPDF and Sumatra can be recompiled to ignore restriction flags with a very small code edit. They're both OSS. But both authors have chosen to respect the restriction flags in their compiled builds.
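As a rough illustration of the "flags" point above: in the published PDF specification the permissions are stored as a single integer (the /P entry of the encryption dictionary), and each restriction is one bit in it. The sketch below is a minimal Python example that assumes you already have that integer in hand; the function name and the sample value are made up for illustration and do not come from any particular tool.

```python
# Bit values for the user-access permissions stored in the /P entry of a
# PDF encryption dictionary (the spec numbers bits from 1, so "bit 5"
# has the value 1 << 4).
PERMISSION_BITS = {
    "print": 1 << 2,      # bit 3
    "modify": 1 << 3,     # bit 4
    "copy_text": 1 << 4,  # bit 5
    "annotate": 1 << 5,   # bit 6
}


def decode_permissions(p_value):
    """Return a dict mapping each permission name to True or False."""
    # /P is written as a signed 32-bit integer; mask it to read the raw bits.
    bits = p_value & 0xFFFFFFFF
    return {name: bool(bits & mask) for name, mask in PERMISSION_BITS.items()}


# Hypothetical /P value: printing allowed, everything else denied.
sample_p = -60
print(decode_permissions(sample_p))
```

A viewer that "respects the flag" is simply checking a bit like `copy_text` before it lets you select or copy anything.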
#3
micky wrote:
> When webpages and pdf files don't permit selecting and copying text,
> is there a way around that? Not counting print screen, which I use
> sometimes, but won't help when the place I want to copy to only allows
> text. Thanks

This company came up with a solution for PDF years ago.

http://www.elcomsoft.com/apdfpr.html

There are a couple of aspects to PDF security. One is documents protected with a password. The other is that stupid "copy" setting. I'm convinced that, a lot of the time, the author of the document has left the "copy" setting at its default instead of thinking it through.

In the past, there were a few recipes that involved using a third party application to "launder" the PDF (just open it and save again). As time passes, those kinds of holes get plugged, so you can't expect a recipe you found from 2005 to still work in 2012. You might get lucky and find a modern recipe like that, or you might be forced to go with something like Elcomsoft to have a fair chance.

When the "copy" setting first came out, I used to "launder" documents with a modified copy of GhostScript, but PDFs have come a long way since then.

If you have access to the source code for a PDF engine (an engine that supports all the features of the language), then I would expect at least the "no copy" feature could be turned off. (That's because that feature relies, to a large extent, on the "honor system". That no software writer will turn it off.)

Some of the other features might be a bit harder to crack (if the whole document is password protected, then it's a decryption problem, and not something that simply modifying the source is going to fix). Then, it might depend on the "hardness" of the encryption algorithm. Or known weaknesses.

Paul
#4
| because that feature relies, to a large extent, on the
| "honor system". That no software writer will turn it off.

Interesting, isn't it, that everyone's all for honor.... unless you're willing to pay $100. Then you have a right to access any PDF.
#5
On Thu, 08 Mar 2012 08:44:12 -0500, in microsoft.public.windowsxp.general, micky wrote:
> When webpages and pdf files don't permit selecting and copying text,
> is there a way around that? Not counting print screen, which I use
> sometimes, but won't help when the place I want to copy to only allows
> text. Thanks

There are a few things that 'sometimes work'. One of them is that sometimes you can press CTRL A to select the entire page, then copy it with CTRL C, though you will then have to snip away the parts you don't need when you paste it.

jim
#6
On Thu, 8 Mar 2012 09:50:15 -0500, in microsoft.public.windowsxp.general, "Mayayana" wrote:
> | because that feature relies, to a large extent, on the
> | "honor system". That no software writer will turn it off.
>
> Interesting, isn't it, that everyone's all for honor.... unless you're
> willing to pay $100. Then you have a right to access any PDF.

I recently had someone who wanted to copy from a historical opinion file -- author dead, etc. -- and I really saw no reason why it was 'copy-protected'.

My first thought was that the options were simply a set of flags, so I looked for the header format of a PDF file, thinking it would give the location of the flags there. I had a hex change in mind. I never found the format, but while looking I realized that if I were writing it, I would prevent that type of change with a simple hash check on the field or something..........

jim
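On the hex-edit idea: as far as the published PDF specification describes the standard (password-based) security handler, there is no separate hash check on the permission field, but the permission integer is one of the inputs used to derive the key that decrypts the document. Flipping a bit in the file therefore changes the key a reader computes. The sketch below is a simplified Python version of the oldest (revision 2, 40-bit) derivation; the owner entry, file ID, and permission values are all made up, and later revisions add further steps that are not shown.

```python
import hashlib
import struct

# 32-byte padding string defined by the PDF specification.
PAD = bytes.fromhex(
    "28BF4E5E4E758A4164004E56FFFA0108"
    "2E2E00B6D0683E802F0CA9FE6453697A"
)


def derive_key_rev2(user_password, owner_entry, p_value, file_id):
    """Simplified revision-2 key derivation (40-bit key)."""
    # Pad or truncate the user password to exactly 32 bytes.
    padded = (user_password + PAD)[:32]
    md5 = hashlib.md5()
    md5.update(padded)
    md5.update(owner_entry)
    # The permission flags go into the hash as 4 little-endian bytes.
    md5.update(struct.pack("<I", p_value & 0xFFFFFFFF))
    md5.update(file_id)
    return md5.digest()[:5]


# Hypothetical values, for illustration only.
owner_entry = bytes(32)
file_id = b"hypothetical-id!"
original_p = -60               # copy bit cleared
edited_p = original_p | 0x10   # copy bit "hex-edited" on

key_original = derive_key_rev2(b"", owner_entry, original_p, file_id)
key_edited = derive_key_rev2(b"", owner_entry, edited_p, file_id)
print(key_original.hex(), key_edited.hex())
```

Under these assumptions the two keys differ, which would explain why a plain hex edit tends to break the file, while software that reads the flags unchanged and then simply ignores them has no such problem.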
#7
On Thu, 08 Mar 2012 17:39:55 -0500, jim wrote:
> There are a few things that 'sometimes work'. One of them is that
> sometimes you can do a CTRL A and select the entire page, then copy
> it with CTRL C,

I can only speak from my own experience, but I cannot remember a time when Ctrl-A didn't work (and I use it a lot).

Ken Blake, Microsoft MVP
#8
Ken Blake, MVP wrote:
> On Thu, 08 Mar 2012 17:39:55 -0500, jim wrote:
>> There are a few things that 'sometimes work'. One of them is that
>> sometimes you can do a CTRL A and select the entire page, then copy
>> it with CTRL C,
>
> I can only speak from my own experience, but I can not remember a
> time when Ctrl-A didn't work (and I use it a lot).

Ctrl-A can always copy just the text, and not the image? I'll have to try that again sometime. I have used Ctrl-A to copy all the text in a text document, but don't recall trying it on a PDF file.
#9
On Thu, 8 Mar 2012 17:25:35 -0700, "Bill in Co" wrote:
> Ken Blake, MVP wrote:
>> On Thu, 08 Mar 2012 17:39:55 -0500, jim wrote:
>>> There are a few things that 'sometimes work'. One of them is that
>>> sometimes you can do a CTRL A and select the entire page, then copy
>>> it with CTRL C,

Good idea. I'll try it. (I should have thought of that. :-( )

>> I can only speak from my own experience, but I can not remember a
>> time when Ctrl-A didn't work (and I use it a lot).

Good to hear!

> Ctrl-A can always copy just the text, and not the image? I'll have to
> try that again sometime. I have used Ctrl-A to copy all the text in a
> text document, but don't recall trying it on a PDF file.

Me neither, but I'll try it there too. And for once, I remember which pages and files do this. (In a few days, I may post the URL if this doesn't work.)
#10
micky wrote:
> When webpages and pdf files don't permit selecting and copying text,
> is there a way around that? Not counting print screen, which I use
> sometimes, but won't help when the place I want to copy to only allows
> text. Thanks

Web pages and PDFs are two different things.

The PDFs are likely pictures of print rather than actual print. Your best bet is OCR. Irfan has one that works somewhat. There are others. If you have a scanner, it should have a separate OCR ability.

Web pages are often set to no copy, no anything. Print screen and OCR them.
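For pages where the text really is there and only selection is blocked by script (the case described earlier in the thread), the saved page source can be reduced to plain text without any OCR. Here is a minimal Python sketch using only the standard library; the class name, function name, and sample HTML are invented for illustration, and it will not help when the text lives inside an image or a Flash object.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def page_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up page that blocks the context menu and text selection via script.
sample = (
    '<body oncontextmenu="return false">'
    "<script>document.onselectstart = function () { return false; };</script>"
    "<p>Text you could not select.</p>"
    "</body>"
)
print(page_text(sample))
```

In practice you would save the page from the browser (or use View Source) and pass the file contents to `page_text`.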
#11
On Thu, 08 Mar 2012 22:58:21 -0500, micky wrote:
> On Thu, 8 Mar 2012 17:25:35 -0700, "Bill in Co" wrote:
>> Ken Blake, MVP wrote:
>>> On Thu, 08 Mar 2012 17:39:55 -0500, jim wrote:
>>>> There are a few things that 'sometimes work'. One of them is that
>>>> sometimes you can do a CTRL A and select the entire page, then copy
>>>> it with CTRL C,
>
> Good idea. I'll try it. (I should have thought of that. :-( )
>
>>> I can only speak from my own experience, but I can not remember a
>>> time when Ctrl-A didn't work (and I use it a lot).
>
> Good to hear!
>
>> Ctrl-A can always copy just the text, and not the image? I'll have to
>> try that again sometime. I have used Ctrl-A to copy all the text in a
>> text document, but don't recall trying it on a PDF file.
>
> Me neither, but I'll try it there too. And for once, I remember which
> pages and files do this. (In a few days, I may post the url if this
> doesn't work.)

It didn't work!

This is the webpage:
http://www.tropicana.com/#/trop_prod...anaPurePremium
I wanted to get the 3 lines of black text above the 5 bottles***.

On a PDF file, it lets me select all -- it's even in the drop-down menu -- but it doesn't let me copy or paste it. And copy is not in the drop-down menu. It will take me a while to find the URL for this file from a stock broker, because now I'm working with a downloaded copy. I plan to post again.

***I want to be able to do this if possible regardless, but FYI the immediate cause was this: I haven't finished investigating this brand, and it's the first one I thought of, but I learned recently that some or most orange juice labeled "not from concentrate" (especially, probably, mass market brands, not expensive boutique brands) may not have been concentrated, but they use juice stored up to a year, after the oxygen has been removed from it, etc.

http://www2.macleans.ca/2009/05/19/f...ss/#more-57383
This URL is almost 3 years old, but I heard this on the radio? news like it was recent.

http://civileats.com/2009/05/06/fres...uice-in-boxes/
This is from the same month, but may only be about juice in boxes.

http://articles.mercola.com/sites/ar...e-oranges.aspx
This one is from last August, though I don't trust "health" companies that advertise.
#12
micky wrote:
> It didn't work!
>
> This is the webpage:
> http://www.tropicana.com/#/trop_prod...anaPurePremium
> I wanted to get the 3 lines of black text above the 5 bottles***.

I have two web browsers. One with Adobe Flash installed, and one without.

The "5 bottles" only appear in the Flash based version of the webpage. The text in this case is in a Flash image, and is not text "you can wipe over".

The non-Flash equipped browser shows a quite different page. I was able to copy this text from the non-Flash page. The page is entirely different, with different text. I got this via copy/paste of the non-Flash page (with anything needing Unicode removed).

"We're committed to using the best fruit to give you the great tasting juices you love and the nutrition your body needs. Each 59oz container of Tropicana Pure Premium has 16 fresh-picked oranges squeezed into it and an 8oz glass gives you 100% vitamin C to help you maintain a healthy immune system."

The claim of 100% vitamin C, I guess that means your glass is filled to the rim with dried Ascorbic Acid crystals :-) Linus Pauling would be overjoyed.

http://en.wikipedia.org/wiki/Ascorbic_acid

> On a pdf file, it lets me select all -- it's even in the drop down
> menu --, but it doesn't let me copy it/paste it. And copy is not in
> the drop down menu. It will take me a while to find the url for this
> file from a stock broker, because now I'm working with a downloaded
> copy. I plan to post again.

When you have a PDF to play with, post the URL, so we can try our own bags of tricks :-) I haven't tried "copy busting" in a while.

Paul
#13
On Fri, 09 Mar 2012 00:21:32 -0500, in microsoft.public.windowsxp.general, Paul wrote:
> When you have a PDF to play with, post the URL, so we can try our own
> bags of tricks :-) I haven't tried "copy busting" in a while.
>
> Paul

I recently tried to bust one for a friend, trying three different utilities. One was XPDF, using command line options in pdftotext.exe which *claimed* to bypass restrictions and failed spectacularly (I may have done it wrong).

I have a query out for that URL now and will post it if/when I get it.

jim
#14
jim wrote:
> On Fri, 09 Mar 2012 00:21:32 -0500, in
> microsoft.public.windowsxp.general, Paul wrote:
>> When you have a PDF to play with, post the URL, so we can try our own
>> bags of tricks :-) I haven't tried "copy busting" in a while.
>>
>> Paul
>
> I recently tried to bust one for a friend -- trying three different
> utilities, one was XPDF using command line options in pdftotext.exe
> which *claimed* to bypass restrictions and failed spectacularly (i may
> have done it wrong).
>
> I have a query out for that URL now and will post it if/when i get it.
>
> jim

If you do "properties", take a look at the security settings while viewing the document in Acrobat Reader. Perhaps there is something there to explain why it can't be busted. Maybe Adobe had enough time to re-think how to fix the "honor" system...

Paul
#15
| I recently tried to bust one for a friend -- trying three different
| utilities, one was XPDF using command line options in pdftotext.exe which
| *claimed* to bypass restrictions and failed spectacularly

I wrote about that in my earlier post. XPDF claims no such thing. In fact, the author has specifically written an explanation saying that he doesn't feel right about bypassing restrictions.

http://www.foolabs.com/xpdf/cracking.html

XPDF is also outdated, and never worked all that well in the first place. It actually only requires a very small edit to make pdftotext.exe ignore restrictions. In pdftotext.c one just needs to comment out the permission check:

    // check for copy permission
    /*
    if (!doc->okToCopy()) {
      error(-1, "Copying of text from this document is not allowed.");
      exitCode = 3;
      goto err2;
    }
    */

Unfortunately, one also needs to be capable of recompiling the software.

I looked around, at one point, for a program that ignores restrictions and found that it seems to be mostly a commercial thing. If you don't mind paying, you can have the functionality. But for some reason the OSS people "respect" the design of PDFs, which is unfortunate since, as Paul said, most copy-protected PDFs seem to be that way simply because the author wasn't paying attention to the settings.
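The check quoted above sets exit code 3 when copying is not allowed, so a stock pdftotext build at least reports why it refused. Below is a small Python sketch of a wrapper that turns that exit code into a readable message. It assumes a `pdftotext` executable is on the PATH; the function name and the `runner` parameter are hypothetical, added only so the logic can be tried without the real program; and the meanings listed for codes other than 3 are as I recall them from the xpdf man page rather than anything stated in this thread.

```python
import subprocess

# Exit codes as documented for xpdf's pdftotext; code 3 corresponds to
# the permission check quoted in the post above.
EXIT_MEANINGS = {
    0: "ok",
    1: "could not open the PDF",
    2: "could not open the output file",
    3: "blocked by PDF permissions",
    99: "other error",
}


def run_pdftotext(pdf_path, txt_path, runner=subprocess.run):
    """Run pdftotext and return a short description of the result."""
    result = runner(["pdftotext", pdf_path, txt_path])
    return EXIT_MEANINGS.get(result.returncode, "unknown exit code")


# Example call (needs pdftotext installed; the file names are placeholders):
# print(run_pdftotext("report.pdf", "report.txt"))
```

If the result is "blocked by PDF permissions", the file has real text in it and only the flag is in the way; any other failure points to a different problem, such as a scanned document with no text layer.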