#1
How to find and delete CR/LF quickly.
I solved the immediate problem by editing the text in Agent and rocking two of my fingers back and forth between End and Delete, so there is no rush about the answer here, but I still think it's worth pursuing.

This is one of a number of PDF files, or webpages, which have CR/LF after every word!!! Sometimes between syllables, and sometimes between letters, like the t and he of the word "the". Why do they do that?

http://www.merck.com/product/usa/pi_...fosamax_pi.pdf

Is there a simpler way to remove them than going word by word?

I have Notepad++, for example, and if I turn something on in View, it gives a reverse image of CR and LF, but I can't manage to copy either of them to put them in the Find box.

I have EditPad Lite, but I can't even figure out how to show the CR/LF (maybe because it's meant for writing code?). Eudora has Find but not Replace.

FTR, I'm not even planning on republishing a copyrighted file. I just want to copy 3 paragraphs of it for my own notes, in a text, non-PDF file.
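For anyone who just wants to see where the CR and LF characters sit in pasted text, a minimal Python sketch follows. The function name `show_line_endings` and the `[CR]`/`[LF]` tags are made up for illustration; they are not a feature of any of the editors mentioned above.

```python
def show_line_endings(text: str) -> str:
    """Return the text with CR and LF made visible as [CR] and [LF] tags."""
    # Tag carriage returns first, then tag line feeds while keeping the
    # real line break so the output still wraps where the input did.
    return text.replace("\r", "[CR]").replace("\n", "[LF]\n")


# Hypothetical sample: a CR/LF pair splitting the word "the".
sample = "t\r\nhe word"
print(show_line_endings(sample))
```

When reading from a file, opening it with `open(path, newline="")` keeps the original line endings instead of letting Python translate them.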
#2
Micky wrote:
> This is one of a number of pdf files, or webpages, which has CR/LF after every word!!! [...] Is there a simpler way to remove them than going word by word?

What is it that you are trying to do? I don't see control characters except at EOL, and I can easily copy and paste into Notepad, Word, etc.
#3
Micky wrote:
> [...] Is there a simpler way to remove them than going word by word? I have notepad++ for example, and if I turn something on in View, it gives a reverse image of CR and LF, but I can't manage to copy either of them to put them in the Find box.

I'm a bit confused about what could be causing all the line feeds midway through the text, but in any case I know that Notepad++ can remove line feeds. It's a function of the "find and replace" system.

I have a file with instructions about how to use it to remove line feeds from line-wrapped text files, but unfortunately my XP PC is midway through resurrection after a HDD failure at the moment, and it's a bit of trouble to get to those notes. So I can't tell you the exact parameters for the "find" box, but they are in the Notepad++ documentation (something like "\n" means line break). I had to do two passes with "find and replace" to fix the line wrapping, but you might be able to do everything in one.

If you happen to have LibreOffice installed, pasting the text into that and then choosing an option called something like autocorrect or autoformat may fix the problem. It is another way to remove line wrapping from text files; I don't know if it would be smart enough to fix the problem you're having, though.
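The two-pass idea for unwrapping text can be sketched outside an editor too. The Python below is only one way it might be done, not the contents of the notes mentioned above: it assumes that a blank line marks a real paragraph break and that every other line break is unwanted wrapping. The name `unwrap` and the placeholder character are illustrative choices.

```python
import re


def unwrap(text: str) -> str:
    """Join wrapped lines into paragraphs, keeping blank-line paragraph breaks."""
    # Normalise Windows (CR LF) and lone CR line endings to plain LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Pass 1: protect real paragraph breaks (two or more newlines in a row)
    # behind a placeholder that should not occur in ordinary text.
    marker = "\x00"
    text = re.sub(r"\n{2,}", marker, text)
    # Pass 2: every remaining newline is line wrapping, so turn it into a space.
    text = text.replace("\n", " ")
    # Restore the paragraph breaks.
    return text.replace(marker, "\n\n")


wrapped = "These highlights do not\r\ninclude all the information\r\n\r\nSee full prescribing\r\ninformation"
print(unwrap(wrapped))
```

This will not repair breaks that fall inside a word; replacing those newlines with nothing instead of a space would need some way of telling the two cases apart.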
#4
On Fri, 04 Dec 2015 00:07:46 -0600, in microsoft.public.windowsxp.general Paul in Houston TX wrote:
> What is it that you are trying to do?

Get rid of the Carriage Return and Line Feed on each line.

> I don't see control characters except at EOL and can easily copy and paste into notepad, word, etc.

I can copy and paste it too, but it's hard to read! When I try to do a few paragraphs, it looks like what's below. For you, did it not look like this:

HIGHLIGHTS OF PRESCR IBING INFORMA TION These highlights do not include all the information needed to use FOSA MA X safely a nd effectively .. See full prescribin g information for FOSA MA X .. FOSA MA X ® (alendronate sodium) tablets , for oral use FOSA MA X ® (alendronat e sodium) oral solution Initial U.S. A pproval: 1995 --------------------------- RECENT MA JOR CHA NGES --------------------------- Wa rnings and Precautions (5.4) 2/2015 ---------------------------- INDICA TIONS A ND USA GE ---------------------------- FOSAMAX is a bisphosphonate indicated for: ? Treatment and prevention of osteoporosis in postmenopausal wom en (1 ..1, 1.2 ) ? Treatment to increase bone mass in men with osteoporosis (1 ..3) ? Treatment of glucocorticoid - induced osteoporosis (1.4) ? Treatment of Paget's disease of bone (1.5) L imitations of use: O ptimal duration of use has not been determined. F or patients at low - risk for fracture , c onsider drug discontinuation after 3 to 5 years of use .. (1.6) ----------------------- DOSA GE A ND A DMINISTRA TION ------------------------ ? Treatment of osteoporosis in postmenopausal wom en and in men: 10 mg daily or 70 mg (tablet or oral solution) once weekly .. (2.1, 2.3) ? Prevention of ost eoporosis in
#5
On 04/12/2015 03:16, Micky wrote:
> This is one of a number of pdf files, or webpages, which has CR/LF after every word!!! [...] Is there a simpler way to remove them than going word by word?

Yes: use a decent browser and/or save the file and open it in a PDF reader. No problems here:

http://s12.postimg.org/kte5edl7h/2015_12_04_0745.png

Also use a decent email client such as Microsoft Outlook or Mozilla Thunderbird. Nothing else can compete with these two.

Good luck.
#7
On Fri, 04 Dec 2015 03:02:45 -0500, in microsoft.public.windowsxp.general Micky wrote:
>> I had to do two passes with "find and replace" to fix the line wrapping, but you might be able to do everything in one.
> Replacing All the \r with nil put it all in one paragraph. So it takes 2 actions but that's only one more than 1.

Boy am I slow. One can just Replace All \n\r with nil and you're done. If it's \r\n instead, it shows when you Show symbols / End of Line. One action.
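The same one-action replacement can be sketched in Python. On Windows the pair is normally CR followed by LF, i.e. `\r\n`, so the pattern below tries that first and then falls back to a lone CR or LF. The function name `strip_line_breaks` is an illustrative choice, and the regular expression is an assumption about what is wanted, not a Notepad++ setting.

```python
import re


def strip_line_breaks(text: str, replacement: str = "") -> str:
    """Replace every CR LF pair, lone CR, or lone LF with `replacement`."""
    # The alternation is tried left to right, so a CR LF pair is consumed
    # as one unit rather than as two separate line breaks.
    return re.sub(r"\r\n|\r|\n", replacement, text)


print(strip_line_breaks("t\r\nhe"))             # break inside a word: join with nothing
print(strip_line_breaks("after\r\nevery", " ")) # break between words: join with a space
```

Whether to replace with nothing or with a space depends on whether the break fell inside a word or between words, which this sketch leaves to the caller.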
#8
On Fri, 4 Dec 2015 06:36:09 +0000 (UTC), in microsoft.public.windowsxp.general Computer Nerd Kev wrote:
> So I can't tell you the exact parameters for the "find" box, but they are in the notepad++ documentation (something like "\n" means line break).

https://notepad-plus-plus.org/resources.html has Word and PDF versions of a Cheat Sheet at the bottom of the page. However, it doesn't list \n, \r or the other backslash-letter sequences! It may list everything else, though.
#9
On 12/4/15 1:02 AM, Micky wrote:
> This is one of a number of pdf files, or webpages, which has CR/LF after every word!!! Sometimes between syllables, and sometimes between letters like the t and he of the word "the". Why do they do that?

I wonder if they do this to make it harder to copy their files. There are other ways, however. I have a manual for a 50cc motorscooter which says in the properties that you can't copy from it, and you can't.

Try this, Micky, but no promises...

Download and install a PDF printer driver.
Open the PDF file.
Select Print.
Choose the new PDF printer driver and save the new PDF file.
Open the new file and see if you can copy and paste from the new file.

I had a friend with the same problem as you. I had him install Nitro PDF Reader 3, which includes a PDF printer driver. We followed the above instructions, and we could copy and paste from the new document. I would think other PDF printer drivers may give the same results.

--
Ken
#10
Ken Springer wrote:
> Try this, Micky, but no promises... [...] Open the new file and see if you can copy paste from the new file. [...] I would think other PDF printer drivers may give the same results.

The document in question may be using Unicode, in an effort to capture some pretty unique-looking symbols. They have the trademark symbol. They also have what looks like a bullet symbol, but if you zoom in, it's a clock icon, all done with a character set.

The tool flow looks ancient. Looking at Document Properties shows the details. That doesn't mean the document is "all right" in terms of its construction - it may be abusing character sets or locale settings in some way. The author of the document may be compensating for some problem by doing it that way.

When I copy and paste, I'm *not* seeing the same symptoms. I see character sequences my current locale can't handle, like where the clock symbol appears. But there is no abnormal sequence after each word. So something else is happening there.

And so far, no "easy" method of fixing it has worked. Acrobat actually crashed on me once, when I attempted to "Select All" and copy. I still have more testing to do, once I can get at my VM on the other machine.

Some Acrobat setups resist printing from PDF to PDF, as that is viewed as an attempt to override document permissions. Sometimes I go PDF - PostScript - PDF, but the print driver can put blockages in there too. Sometimes it requires deleting about 12 lines of stuff placed in the PostScript .ps on purpose to prevent re-distillation. Lots of fun, for a simple requirement of being able to Copy/Paste.

What Micky is seeing is the "normal" abuse of word placement by desktop publishing tools.

At one time, a DTP tool would just dump a string, and rely on the very nice bearing and glyph rules in the font itself on the printer.

   (My dog has fleas) print

Later, tools started messing with the spacing between words.

   (My) print
   moveto 23
   (dog) print
   moveto 79
   (has) print
   moveto 106

and the idea is to "enhance" or "override" what the font handling in the printer (or downstream tool) might do. Now, if you copy/paste from that, the buffer could get some sort of strange character in it. They could even insert an "em-space" on the end of each string, for added peril.

Now, LibreOffice does stuff like this

   (M) print
   moveto 7
   (y) print
   moveto 10
   (d) print
   moveto 16

and attempts to place each character manually. Which overrides the "fi" ligature or attempts at kerning in the font.

https://en.wikipedia.org/wiki/Kerning
https://en.wikipedia.org/wiki/Typographic_ligature

The LibreOffice approach is a disaster when the details of the font at the viewing end are slightly different. Now, suddenly, the spacing looks all wrong, as the screen view applies its own font spacing to "M", say.

   "M "   what LibreOffice assumes the "air space" looks like
   "M "   what something on your computer is using

Then when you look at the doc on the screen, it looks like this. Sometimes the edges of letters even overlap. Because each is rendered separately, and the printer font layout rules have been completely overridden.

   M yd og h asf l eas

These are the perils when DTP tools mess around. Back in the old days...

   (My dog has fleas) print

could certainly foul up, but you could then blame the font in the printer ROM as being "poor quality", whereas the ham-fisted "spacing nitwits" in current tools basically just ruin it from the get-go.

It's a wonder you can copy and paste this stuff at all.

Paul
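The point about per-word and per-character placement can be illustrated with a toy model of what a viewer might do when text is copied. This is a hypothetical sketch, not how Acrobat, Foxit, or any other reader actually works: it assumes the viewer only knows each glyph's position and width, and guesses that a break belongs wherever the gap to the previous glyph exceeds a threshold. All positions, widths, and thresholds are invented numbers.

```python
from typing import List, Tuple

# (character, x position, advance width) - invented values for illustration.
Glyph = Tuple[str, float, float]


def extract_text(glyphs: List[Glyph], gap_threshold: float) -> str:
    """Rebuild a string from placed glyphs, inserting a space at large gaps."""
    out = []
    prev_end = None
    for ch, x, width in glyphs:
        # A gap wider than the threshold is taken to be a word boundary.
        if prev_end is not None and x - prev_end > gap_threshold:
            out.append(" ")
        out.append(ch)
        prev_end = x + width
    return "".join(out)


# "FOSAMAX" with small extra gaps before the M and the X, as a tool that
# places characters by hand might produce.
glyphs = [
    ("F", 0.0, 6.0), ("O", 6.0, 7.0), ("S", 13.0, 6.0), ("A", 19.0, 7.0),
    ("M", 26.5, 8.0), ("A", 34.5, 7.0), ("X", 42.0, 7.0),
]

print(extract_text(glyphs, gap_threshold=1.0))  # tolerant viewer
print(extract_text(glyphs, gap_threshold=0.3))  # strict viewer
```

With these made-up numbers the tolerant threshold gives back the whole word, while the strict one splits it in the same places as the pasted sample earlier in the thread, which would be consistent with different readers giving different results for the same file.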
#11
Paul wrote:
> What Micky is seeing, is the "normal" abuse of word placement by desktop publishing tools. [...] It's a wonder you can copy and paste this stuff, at all.

The PDF document is for this stuff. Some of the articles in the reference section are dated 1996.

https://en.wikipedia.org/wiki/Alendronic_acid

The document is an "MS Word For Windows (OLE)" document, perhaps prepared back then. There are some tricks associated with the handling of the Symbol font back then (for the "bullet" symbol in the PDF), and it's possible one of the rectangles in the copy buffer is an F0. The document font list includes Symbol.

http://blogs.msdn.com/b/murrays/arch...ord-s-rtf.aspx

But the rest of it handles reasonably well in a copy/paste. No sign of CRLF sprinkled all over indiscriminately. I can't copy/paste in Acrobat 6, but the new Acrobat DC handles it OK (installed in my VM).

Paul
#12
Micky wrote:
> I can copy and paste it too, but it's hard to read! When I try to do a few paragraphs, it looks like what's below. For you, did it not look like this: HIGHLIGHTS OF PRESCR IBING INFORMA TION [...]

I use Foxit 2.2.2129. This is what I see and paste:

HIGHLIGHTS OF PRESCRIBING INFORMATION These highlights do not include all the information needed to use FOSAMAX safely and effectively. See full prescribing information for FOSAMAX. FOSAMAX® (alendronate sodium) tablets, for oral use FOSAMAX® (alendronate sodium) oral solution Initial U.S. Approval: 1995 ---------------------------RECENT MAJOR CHANGES --------------------------- Warnings and Precautions (5.4) 2/2015 ----------------------------INDICATIONS AND USAGE---------------------------- FOSAMAX is a bisphosphonate indicated for:  Treatment and prevention of osteoporosis in postmenopausal women (1.1, 1.2)  Treatment to increase bone mass in men with osteoporosis (1.3)  Treatment of glucocorticoid-induced osteoporosis (1.4)  Treatment of Paget's disease of bone (1.5) Etc.
#13
Paul wrote:
> Now, LibreOffice does stuff like this [...] and attempts to place each character manually. [...] Then when you look at the doc on the screen, it looks like this. Sometimes the edges of letters even overlap. Because each is rendered separately, and the printer font layout rules have been completely overridden.
>
>    M yd og h asf l eas

Ahh, so that's what causes that to happen! Thanks for the info, Paul.
#14
Micky writes:
> []
> I found \n in the search box. That's LF and replacing it with nothing got rid of it. There is Replace All, which I haven't seen iirc since SPF. That's very nice. There is also \r \t \0 \x and maybe more, one of which must mean CR.
> []

Those sound like high-level-language (like C) escape sequences. From when I was studying such (over 30 years ago!):

\n means newline; how this is actually implemented depends on the system you're working under - it can be CRLF, LFCR, just LF, just CR, or possibly even other sequences.

\r (IIRR!) specifically means the CR character - ASCII code 13 decimal or D hexadecimal (or 15 octal). (IIRR there's an escape sequence for the other one too: logic would dictate that it's \l, but I can't remember.)

\t means the tab character (ASCII 9).

\0 means the null character (ASCII 0); used in many high-level languages to mark the end of a character string, so there needs to be a way of referring to it if you want to.

\\ means the "\" character itself.

--
J. P. Gilliver
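The character codes behind these sequences are easy to check. The short Python sketch below prints them; Python spells these escapes the same way C does, and in both the LF character itself is written `\n` (decimal 10), so there is no separate `\l`. That an editor's search box uses the same values is an assumption here, though it matches the report above that `\n` found the LF. The names `ESCAPES` and `escape_codes` are illustrative.

```python
# Escape sequences as Python spells them; the values are plain ASCII codes.
ESCAPES = {
    r"\n": "\n",  # LF, line feed
    r"\r": "\r",  # CR, carriage return
    r"\t": "\t",  # horizontal tab
    r"\0": "\0",  # null character
    r"\\": "\\",  # the backslash itself
}


def escape_codes() -> dict:
    """Map each escape sequence's spelling to its numeric character code."""
    return {name: ord(ch) for name, ch in ESCAPES.items()}


for name, code in escape_codes().items():
    print(f"{name:<3} decimal {code:>3}  hex {code:02X}  octal {code:o}")
```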
#15
On Fri, 4 Dec 2015 23:10:28 +0000, in microsoft.public.windowsxp.general "J. P. Gilliver (John)" wrote:
> Those sound like high-level-language (like C) escape sequences. [...] \\ means the "\" character itself.

A post I actually understand the first time I read it. And which I should be able to remember. Thanks. All the more important since the Cheat Sheet omits mention of these sequences.