Can I copy text not meant for copying?

**micky[_2_]** · March 8th 12, 01:44 PM posted to microsoft.public.windowsxp.general

When webpages and pdf files don't permit selecting and copying text,
is there a way around that? Not counting print screen, which I use
sometimes, but won't help when the place I want to copy to only allows
text.

Thanks

**Mayayana** · March 8th 12, 02:21 PM posted to microsoft.public.windowsxp.general

| When webpages and pdf files don't permit selecting and copying text,
| is there a way around that? Not counting print screen, which I use
| sometimes, but won't help when the place I want to copy to only allows
| text.
|

Those are different issues. A webpage can do things
like blocking the right-click menu, but only if you enable
script.

A webmaster can also do things like using an image
of text. I recently saw a page where the author had
gone to great lengths to block image copying by
loading the images in a Flash program. (Without script
and Flash enabled there are no pictures on the page!)

If there's text there you should be able to get it
by disabling script. You can also view the source code
to get at the text. And in most browsers you can view
with no style, which makes the right selection easier
in some cases.
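
As a rough illustration of the "view the source" route, here is a minimal Python sketch that pulls the text out of saved page source while skipping script and style blocks. The class and function names are made up for this example, and it assumes the text is really in the HTML (not in an image or Flash).

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text from page source, skipping <script> and <style> bodies."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html_source):
    """Return the text content of html_source, one chunk per line."""
    parser = VisibleTextExtractor()
    parser.feed(html_source)
    parser.close()
    return "\n".join(parser.chunks)
```

Save the page source to a file, read it into a string, and pass it to `extract_text`. Script that blocks right-click or selection never runs here, because nothing is executed.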

PDFs are different. Adobe designed the PDF format to
allow for a number of restrictions. Text copying can be
blocked. A password can be required. Etc. Those
restrictions are actually just "flags" in the PDF file. There's
not really any kind of lock. But most software respects
the flag. So, if you have a PDF with a text-copying
restriction the only option is to get software that will
bypass it. I think there is such software, but not for free.
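
To make the "flags" idea concrete: in the PDF specification's standard security handler, the permissions are stored as one signed 32-bit integer (the `/P` entry of the encryption dictionary), and each restriction is a single bit. The sketch below decodes such a value. The bit positions follow the permissions table in the PDF 1.7 specification as I understand it; the function names are invented for this example, and the upper bits (9 to 12) only have meaning in later revisions of the handler.

```python
# Bit positions (1 = lowest-order bit) from the PDF 1.7 table of
# user access permissions for the standard security handler.
PERMISSION_BITS = {
    3: "print",
    4: "modify contents",
    5: "copy or extract text and graphics",
    6: "add or modify annotations",
    9: "fill in form fields",
    10: "extract for accessibility",
    11: "assemble document",
    12: "print at high quality",
}


def decode_permissions(p_value):
    """Return {permission name: allowed?} for a signed 32-bit /P integer."""
    unsigned = p_value & 0xFFFFFFFF  # reinterpret the signed value as 32 bits
    return {
        name: bool(unsigned & (1 << (bit - 1)))
        for bit, name in PERMISSION_BITS.items()
    }


def may_copy(p_value):
    """True if the copy/extract bit (bit 5) is set."""
    return decode_permissions(p_value)["copy or extract text and graphics"]
```

For example, `may_copy(-44)` is true, while `may_copy(-3904)` is false because that value clears every permission bit in the table. A viewer that honors the flag simply checks this bit before enabling Copy.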

It's an odd issue. Since you have a right to the file you
have a right to access the text, but Adobe has tried to
mimic white collar procedure in order to impart a sense
of solidity to digital files. In doing that they've done their
best to render a PDF as an immutable file that mimics a
printed page, and is actually designed just to get business
docs transported via PC and printer rather than via postal
mail.
Unfortunately, people often restrict PDFs for no good
reason. (I once downloaded a state auto accident report
form that I had to file in triplicate, and the editing function
was blocked!)

In some cases a PDF is actually a collection of scanned
book pages. In that case there isn't any text. Your only
option is to run it through OCR software. But actually, these
days OCR software is quite good, and usually comes free
with a scanner.

There's a command line PDF extractor named XPDF.
I wrote a convenient wrapper for it here:
http://www.jsware.net/jsware/pdfconv.php5

With that you can extract text and images. As I
note on that page, Sumatra PDF can also extract
text and does a better job. XPDF is outdated.
But Sumatra doesn't extract images.

Both XPDF and Sumatra can be recompiled to
ignore restriction flags with a very small code edit.
They're both OSS. But both authors have chosen
to respect the restriction flags in their compile.

**Paul** · March 8th 12, 02:21 PM posted to microsoft.public.windowsxp.general

micky wrote:
> When webpages and pdf files don't permit selecting and copying text,
> is there a way around that? Not counting print screen, which I use
> sometimes, but won't help when the place I want to copy to only allows
> text.
>
> Thanks

This company came up with a solution for PDF years ago.

http://www.elcomsoft.com/apdfpr.html

There are a couple of aspects to PDF security. One is documents
protected with a password. The other is that stupid "copy"
setting. I'm convinced that, a lot of the time, the author of
the document has left the "copy" setting at its default instead
of thinking it through.

In the past, there were a few recipes that involved using a
third-party application to "launder" the PDF (just open it and
save it again). As time passes, those kinds of holes get plugged, so
you can't expect a recipe from 2005 to still work
in 2012. You might get lucky and find a modern recipe like that,
or you might be forced to go with something like Elcomsoft
to have a fair chance.

When the "copy" setting first came out, I used to "launder"
documents with a modified copy of GhostScript, but PDFs
have come a long way since then. If you have access to the
source code for a PDF engine (an engine that supports all
the features of the language), then I would expect at
least the "no copy" feature could be turned off. (That's
because that feature relies, to a large extent, on the
"honor system". That no software writer will turn it off.)

Some of the other features might be a bit harder to crack
(if the whole document is password protected, then it's a
decryption problem, and not something that simply modifying
the source is going to fix). Then, it might depend on the
"hardness" of the encryption algorithm. Or known weaknesses.

Paul

**Mayayana** · March 8th 12, 02:50 PM posted to microsoft.public.windowsxp.general

| because that feature relies, to a large extent, on the
| "honor system". That no software writer will turn it off.
|

Interesting, isn't it, that everyone's all for honor....
unless you're willing to pay $100. Then you have a
right to access any PDF.

**jim** · March 8th 12, 10:39 PM posted to microsoft.public.windowsxp.general

On Thu, 08 Mar 2012 08:44:12 -0500, in
microsoft.public.windowsxp.general, micky wrote:

> When webpages and pdf files don't permit selecting and copying text,
> is there a way around that? Not counting print screen, which I use
> sometimes, but won't help when the place I want to copy to only allows
> text.
>
> Thanks

There are a few things that 'sometimes work'. One of them is that
sometimes you can do a CTRL A and select the entire page, then copy it
with CTRL C, though you will then have to snip away the parts you don't
need when you paste it.

jim

**jim** · March 8th 12, 10:46 PM posted to microsoft.public.windowsxp.general

On Thu, 8 Mar 2012 09:50:15 -0500, in microsoft.public.windowsxp.general,
"Mayayana" wrote:

> | because that feature relies, to a large extent, on the
> | "honor system". That no software writer will turn it off.
> |
>
> Interesting, isn't it, that everyone's all for honor....
> unless you're willing to pay $100. Then you have a
> right to access any PDF.

I recently had someone who wanted to copy from a historical opinion file
-- author dead, etc. -- and I really saw no reason why it was
'copy-protected'. My first thought was that the options were
simply a set of flags, so I looked for the header format of a PDF
file, thinking it would give their location. I had a hex edit in
mind. I never found the format, but while looking I realized that if
I were writing it, I would prevent that type of change with a simple hash
check on the field or something.

jim
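
For anyone who wants to look for the flags jim describes: as far as I understand the PDF specification, they are not in the file header but in the `/P` entry of the encryption dictionary, which normally sits unencrypted in the file. The sketch below is a heuristic that scans the raw bytes for that entry near `/Filter /Standard`. The function name and the search window are assumptions made for this example, and it is not a real PDF parser.

```python
import re

# An integer /P entry such as b"/P -3904". The lookahead rejects
# indirect references like b"/P 12 0 R", which are a different kind of /P.
P_ENTRY = re.compile(rb"/P\s+(-?\d+)\b(?!\s+\d+\s+R)")
STANDARD_HANDLER = re.compile(rb"/Filter\s*/Standard")


def find_permission_value(pdf_bytes, window=2048):
    """Return the /P integer found near '/Filter /Standard', or None."""
    handler = STANDARD_HANDLER.search(pdf_bytes)
    if handler is None:
        return None  # no standard security handler visible in the raw bytes
    start = max(0, handler.start() - window)
    nearby = pdf_bytes[start:handler.end() + window]
    match = P_ENTRY.search(nearby)
    return int(match.group(1)) if match else None
```

Reading the value is harmless, but jim's guess about a check is close to the mark: the specification feeds the `/P` value into the derivation of the encryption key, so changing the number with a hex editor would generally leave the document undecryptable rather than unlocked.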

**Ken Blake, MVP[_4_]** · March 9th 12, 12:07 AM posted to microsoft.public.windowsxp.general

On Thu, 08 Mar 2012 17:39:55 -0500, jim wrote:

> There are a few things that 'sometimes work'. One of them is that
> sometimes you can do a CTRL A and select the entire page, then copy it
> with CTRL C,

I can only speak from my own experience, but I can not remember a time
when Ctrl-A didn't work (and I use it a lot).

Ken Blake, Microsoft MVP

**Bill in Co** · March 9th 12, 12:25 AM posted to microsoft.public.windowsxp.general

Ken Blake, MVP wrote:
> On Thu, 08 Mar 2012 17:39:55 -0500, jim wrote:
>
>> There are a few things that 'sometimes work'. One of them is that
>> sometimes you can do a CTRL A and select the entire page, then copy it
>> with CTRL C,
>
> I can only speak from my own experience, but I can not remember a time
> when Ctrl-A didn't work (and I use it a lot).

Ctrl-A can always copy just the text, and not the image? I'll have to try
that again sometime. I have used Ctrl-A to copy all the text in a text
document, but don't recall trying it on a PDF file.

**micky[_2_]** · March 9th 12, 03:58 AM posted to microsoft.public.windowsxp.general

On Thu, 8 Mar 2012 17:25:35 -0700, "Bill in Co"
wrote:

> Ken Blake, MVP wrote:
>> On Thu, 08 Mar 2012 17:39:55 -0500, jim wrote:
>>
>>> There are a few things that 'sometimes work'. One of them is that
>>> sometimes you can do a CTRL A and select the entire page, then copy it
>>> with CTRL C,

Good idea. I'll try it. (I should have thought of that. :-( )

>> I can only speak from my own experience, but I can not remember a time
>> when Ctrl-A didn't work (and I use it a lot).

Good to hear!

> Ctrl-A can always copy just the text, and not the image? I'll have to try
> that again sometime. I have used Ctrl-A to copy all the text in a text
> document, but don't recall trying it on a PDF file.

Me neither, but I'll try it there too.

And for once, I remember which pages and files do this. (In a few
days, I may post the url if this doesn't work.)

**Paul in Houston TX** · March 9th 12, 04:09 AM posted to microsoft.public.windowsxp.general

micky wrote:
> When webpages and pdf files don't permit selecting and copying text,
> is there a way around that? Not counting print screen, which I use
> sometimes, but won't help when the place I want to copy to only allows
> text.
>
> Thanks

Web pages and PDFs are two different things.
The PDFs are likely pictures of print rather than
actual text. Your best bet is OCR.
IrfanView has an OCR plugin that works somewhat. There are others.
If you have a scanner, it should have come with separate OCR software.
Web pages are often set to no copy, no anything.
Print screen and OCR them.

**micky[_2_]** · March 9th 12, 04:33 AM posted to microsoft.public.windowsxp.general

On Thu, 08 Mar 2012 22:58:21 -0500, micky
wrote:

> On Thu, 8 Mar 2012 17:25:35 -0700, "Bill in Co"
> wrote:
>
>> Ken Blake, MVP wrote:
>>> On Thu, 08 Mar 2012 17:39:55 -0500, jim wrote:
>>>
>>>> There are a few things that 'sometimes work'. One of them is that
>>>> sometimes you can do a CTRL A and select the entire page, then copy it
>>>> with CTRL C,
>
> Good idea. I'll try it. (I should have thought of that. :-( )
>
>>> I can only speak from my own experience, but I can not remember a time
>>> when Ctrl-A didn't work (and I use it a lot).
>
> Good to hear!
>
>> Ctrl-A can always copy just the text, and not the image? I'll have to try
>> that again sometime. I have used Ctrl-A to copy all the text in a text
>> document, but don't recall trying it on a PDF file.
>
> Me neither, but I'll try it there too.
>
> And for once, I remember which pages and files do this. (In a few
> days, I may post the url if this doesn't work.)

It didn't work! This is the webpage:

http://www.tropicana.com/#/trop_prod...anaPurePremium

I wanted to get the 3 lines of black text above the 5 bottles***.

On a pdf file, it lets me select all -- it's even in the drop down
menu --, but it doesn't let me copy it/paste it. And copy is not in
the drop down menu. It will take me a while to find the url for this
file from a stock broker, because now I'm working with a downloaded
copy. I plan to post again.

***I want to be able to do this if possible regardless, but FYI the
immediate cause was:

I haven't finished investigating this brand, and it's the first one I
thought of, but I learned recently that some or most orange juice
labeled "not from concentrate" (especially probably mass market
brands, not expensive boutique brands) may not have been
concentrated, but they use juice stored up to a year, after the oxygen
has been removed from it, etc.

http://www2.macleans.ca/2009/05/19/f...ss/#more-57383
This URL is almost 3 years old, but I heard this on the radio (?) news
as if it were recent.

http://civileats.com/2009/05/06/fres...uice-in-boxes/
This is from the same month, but may only be about juice in boxes.

http://articles.mercola.com/sites/ar...e-oranges.aspx
This one is from last August, though I don't trust "health" companies
that advertise.

**Paul** · March 9th 12, 05:21 AM posted to microsoft.public.windowsxp.general

micky wrote:

> It didn't work! This is the webpage:
>
> http://www.tropicana.com/#/trop_prod...anaPurePremium
>
> I wanted to get the 3 lines of black text above the 5 bottles***.

I have two web browsers, one with Adobe Flash installed and one without.
The "5 bottles" only appear in the Flash-based version of the webpage.
The text in this case is in a Flash image, and is not text "you can wipe over".

The browser without Flash shows a quite different page, with
different text. I was able to copy this text from the non-Flash page
via copy/paste (with anything needing Unicode removed).

"We're committed to using the best fruit to give you the great tasting juices
you love and the nutrition your body needs. Each 59oz container of Tropicana
Pure Premium has 16 fresh-picked oranges squeezed into it and an 8oz glass
gives you 100% vitamin C to help you maintain a healthy immune system."
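
The "anything needing Unicode removed" step can be done by hand, or with something like the following sketch. The function name is made up here, and simply dropping characters is only one possible policy; replacing them with ASCII look-alikes would be another.

```python
def strip_non_ascii(text):
    """Drop every character outside 7-bit ASCII (e.g. trademark signs, curly quotes)."""
    return text.encode("ascii", errors="ignore").decode("ascii")
```

Note that dropped characters leave their neighbouring spaces behind, so the result may contain doubled spaces.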

The claim of 100% vitamin C, I guess that means your glass is filled to the
rim with dried Ascorbic Acid crystals :-) Linus Pauling would be overjoyed.

http://en.wikipedia.org/wiki/Ascorbic_acid

> On a pdf file, it lets me select all -- it's even in the drop down
> menu --, but it doesn't let me copy it/paste it. And copy is not in
> the drop down menu. It will take me a while to find the url for this
> file from a stock broker, because now I'm working with a downloaded
> copy. I plan to post again.

When you have a PDF to play with, post the URL, so we can try our
own bags of tricks :-) I haven't tried "copy busting" in a while.

Paul

**jim** · March 10th 12, 01:23 PM posted to microsoft.public.windowsxp.general

On Fri, 09 Mar 2012 00:21:32 -0500, in
microsoft.public.windowsxp.general, Paul wrote:

> When you have a PDF to play with, post the URL, so we can try our
> own bags of tricks :-) I haven't tried "copy busting" in a while.
>
> Paul

I recently tried to bust one for a friend -- trying three different
utilities, one was XPDF using command line options in pdftotext.exe which
*claimed* to bypass restrictions and failed spectacularly (i may have
done it wrong). I have a query out for that URL now and will post it
if/when i get it.

jim

**Paul** · March 10th 12, 02:08 PM posted to microsoft.public.windowsxp.general

jim wrote:
> On Fri, 09 Mar 2012 00:21:32 -0500, in
> microsoft.public.windowsxp.general, Paul wrote:
>
>> When you have a PDF to play with, post the URL, so we can try our
>> own bags of tricks :-) I haven't tried "copy busting" in a while.
>>
>> Paul
>
> I recently tried to bust one for a friend -- trying three different
> utilities, one was XPDF using command line options in pdftotext.exe which
> *claimed* to bypass restrictions and failed spectacularly (i may have
> done it wrong). I have a query out for that URL now and will post it
> if/when i get it.
>
> jim

If you do "properties", take a look at the security settings,
while viewing the document in Acrobat Reader. Perhaps there
is something there to explain why it can't be busted. Maybe
Adobe had enough time to re-think how to fix the "honor" system...

Paul

**Mayayana** · March 10th 12, 02:09 PM posted to microsoft.public.windowsxp.general

| I recently tried to bust one for a friend -- trying three different
| utilities, one was XPDF using command line options in pdftotext.exe which
| *claimed* to bypass restrictions and failed spectacularly

I wrote about that in my earlier post. XPDF claims
no such thing. In fact, the author has specifically
written an explanation saying that he doesn't feel
right about bypassing restrictions.

http://www.foolabs.com/xpdf/cracking.html

XPDF is also outdated, and never worked all that
well in the first place. It actually only requires a
very small edit to make pdftotext.exe ignore restrictions:

In pdftotext.cc (the XPDF source is C++) one just needs to comment out the
permission check:

    // check for copy permission
    /*
    if (!doc->okToCopy()) {
      error(-1, "Copying of text from this document is not allowed.");
      exitCode = 3;
      goto err2;
    }
    */

Unfortunately, one also needs to be capable of
recompiling the software.

I looked around, at one point, for a program that
ignores restrictions and found that it seems to be
mostly a commercial thing. If you don't mind paying,
you can have the functionality. But for some reason
the OSS people "respect" the design of PDFs, which is
unfortunate since, as Paul said, most copy-protected
PDFs seem to be that way simply because the author
wasn't paying attention to the settings.
