#1
Finding first field duplicate lines in a sorted text file without uniq or awk or col using only vim - is it possible?
This is a vim-only question where I simply ask if you _already_ have an answer to this problem of whether it's even possible, using just the freeware vim, to find first-field duplicates.

How can I find the duplicate domains in a text file of the format:

  http://domain1.com (space separated random data)
  http://domain1.com (more space separated random data)
  http://domain2.com (perhaps different space separated random data)

where I want to identify that domain1.com is duplicated in that file up to the first space? What comes after the space need only be concatenated later so that, in the end, domain1.com is not only listed just once, but the one instance of domain1.com has all the data assigned by multiple lines in the original file.

I have no problem sorting the whole file inside of vim:
  :%!sort

I have no problem finding all fully duplicated lines inside of vim:
  :%s/^\(.*\)\(\n\1\)\+$/\1/gec

Nor do I have a problem finding duplicate field 1 outside of vim on Linux:
  $ awk '!seen[$1]++' filename.txt
  $ sort -u -t' ' -k1,1 filename.txt

For example, if the delimiter were a comma, I would add that as shown below:
  $ awk -F',' '!seen[$1]++' filename.txt

In this case, the delimiter is a space and the first field is always a domain of the format http://domainname.com, but anything can come after that space after the domain, since the file is generated using a script. All I want to do is combine duplicate data that is found for the same domain.

It's actually a Windows vim question on a dual-boot desktop where I'm constantly booting to a Linux VM to run the sort, uniq, and then awk, but where I'd rather just find a vim command that works on both platforms which can identify lines based on a duplicate first space-separated field. That's why "uniq", "awk", "sed", "grep", and "col -b" aren't available: I don't want to install Cygwin since I can already dual-boot to Linux without Cygwin.

The Windows PowerShell can _count_ the number of fully duplicate lines:
  Get-Content .\filename.txt | Group-Object | Where-Object { $_.Count -gt 1 } | Select Name, Count

And PowerShell can remove fully duplicate lines:
  Get-Content filename.txt | Sort-Object -Unique | Set-Content output.txt

But I haven't figured out yet how PowerShell looks at the first field only in order to find partially duplicate lines based on field 1 only.
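For comparison, the whole desired transformation (group lines by first field, concatenate the rest) can be sketched in a few lines of Python, which runs unchanged on both Windows and Linux. This is only a sketch; `merge_by_first_field` is a name invented here, not something from the thread:

```python
# A hedged sketch (not from the thread): group lines by their first
# space-separated field and concatenate the remaining data.
from collections import OrderedDict

def merge_by_first_field(lines):
    merged = OrderedDict()  # keep first-seen domain order
    for line in lines:
        parts = line.rstrip("\n").split(" ", 1)
        domain = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        if domain in merged:
            merged[domain] = (merged[domain] + " " + rest).rstrip()
        else:
            merged[domain] = rest
    return [(d + " " + r).rstrip() for d, r in merged.items()]

print(merge_by_first_field([
    "http://domain1.com (stuff)",
    "http://domain2.com (junk)",
    "http://domain1.com (more stuff)",
]))
# ['http://domain1.com (stuff) (more stuff)', 'http://domain2.com (junk)']
```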
#2
Finding first field duplicate lines in a sorted text file without uniq or awk or col using only vim - is it possible?
On 2019-06-05, Arlen G. Holder wrote:
> This is a vim-only question where I simply ask if you _already_ have an
> answer to this problem of whether it's even possible, using just the
> freeware vim, to find first-field duplicates?

The macro language is Turing complete, so yes, almost certainly possible.

--
When I tried casting out nines I made a hash of it.
#3
Finding first field duplicate lines in a sorted text file without uniq or awk or col using only vim - is it possible?
In alt.os.linux Arlen G. Holder wrote:
> This is a vim-only question where I simply ask if you _already_ have an
> answer to this problem of whether it's even possible, using just the
> freeware vim, to find first-field duplicates?
> [snip]
> But I haven't figured out yet how PowerShell looks at the first field
> only in order to find partially duplicate lines based on field 1 only.

Try this:

  anon@lowtide:~/code/vim/work$ cat doit.vim
  let @q='0jvf dkJ0'
  anon@lowtide:~/code/vim/work$ cat domains
  http://domain1.com a b c
  http://domain2.com lll mmm nnn
  http://domain1.com x y z
  http://domain1.com k l m
  http://domain2.com aaa bbb ccc
  http://domain2.com sss ttt uuu
  http://domain1.com m n o
  anon@lowtide:~/code/vim/work$ vi domains
  anon@lowtide:~/code/vim/work$ cat domains
  http://domain1.com a b c k l m m n o x y z
  http://domain2.com aaa bbb ccc lll mmm nnn sss ttt uuu
  anon@lowtide:~/code/vim/work$

You may be able to fiddle with it to get it to work on a range. For now, just run the macro with a count, where the count is one less than the number of occurrences of a domain.

Sort the file with vim's internal sort:
  :sort u

Source the script:
  :so doit.vim

Put your cursor on the first line for a domain, then either repeatedly hit @q until done with that domain or guess a count. For instance, if you think the count is 10, run 9@q. Drop down to the next domain with 'j', and repeat.

https://streamable.com/g653l
#4
Finding first field duplicate lines in a sorted text file without uniq or awk or col using only vim - is it possible?
In alt.os.linux Arlen G. Holder wrote:
> This is a vim-only question where I simply ask if you _already_ have an
> answer to this problem of whether it's even possible, using just the
> freeware vim, to find first-field duplicates?
> [snip]

Did a bit more with this. Just source the following file, and it should take care of everything. The variable dc (domain count) is set to 100. Just set it to something you know is large enough for the number of different domains in the file.

  anon@lowtide:~/code/vim/work$ cat doitall.vim
  function! C(blah)
    redir => cnt
    silent exe "%s#" . a:blah . "##gn"
    redir END
    let res = strpart(cnt, 1, stridx(cnt, " "))
    let i = 0
    while i < res - 1
      normal! @q
      let i += 1
    endwhile
  endfunction

  function! All()
    let i = 0
    while i < g:dc
      normal! "ayWspace
      call C(getreg("a"))
      normal! j
      let i += 1
    endwhile
  endfunction

  let dc = 100
  let @q='0jvf dkJ0'
  sort u
  normal! gg
  call All()
  anon@lowtide:~/code/vim/work$
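As a model of what the script above is doing (count the occurrences of the domain under the cursor, then run the join macro count-1 times), here is a rough Python equivalent. `join_group` is a name invented here, and the sketch assumes every line has data after the domain:

```python
# Rough Python model of the script's strategy (names invented here):
# count occurrences of the first field, then join count-1 following lines.
def join_group(lines, i):
    domain = lines[i].split(" ", 1)[0]
    count = sum(1 for l in lines if l.split(" ", 1)[0] == domain)
    for _ in range(count - 1):        # the "run @q count-1 times" loop
        nxt = lines.pop(i + 1)        # duplicate line just below
        tail = nxt.split(" ", 1)[1]   # drop its first field (assumes data)
        lines[i] += " " + tail        # vim's J join
    return lines

data = sorted([
    "http://domain1.com a b c",
    "http://domain1.com x y z",
    "http://domain2.com lll",
])
print(join_group(data, 0))
# ['http://domain1.com a b c x y z', 'http://domain2.com lll']
```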
#5
Finding first field duplicate lines in a sorted text file without uniq or awk or col using only vim - is it possible?
On Wed, 12 Jun 2019 23:29:06 +0000 (UTC), owl wrote:
> Did a bit more with this. Just source the following file, and it should
> take care of everything.

Hi owl,

Thank you for that help, since this is a problem asked ALL OVER the net, and only solved on Linux, but never solved (to my knowledge) on Windows! I realize this is an extremely difficult problem, so I will explain what happened, step by step, when I tested each solution.

Here's a test file:

  C:\> type test.txt
  http://domain1.com (stuff, stuff, stuff)
  http://domain1.com (more stuff, more stuff, more stuff)
  http://domain2.com (junk, junk, junk)
  http://domain2.com (even more junk, even more junk, even more junk)
  http://domain3.com (stuff, stuff, stuff)
  http://domain4.com (junk, junk, junk)

Here is the doit.vim file that is sourced inside that test file:

  C:\> type doit.vim
  let @q='0jvf dkJ0'

Here's a graphical representation of what happened, in 4 steps:
  1. https://i.postimg.cc/zXpyxPZD/sort01.jpg
  2. https://i.postimg.cc/XvBX7jZX/sort02.jpg
  3. https://i.postimg.cc/HxGnrq5z/sort03.jpg
  4. https://i.postimg.cc/FRjR3TV4/sort04.jpg

What we wanted to have happen was this end result:

  http://domain1.com (stuff, stuff, stuff)(more stuff, more stuff, more stuff)
  http://domain2.com (junk, junk, junk)(even more junk, even more junk, even more junk)
  http://domain3.com (stuff, stuff, stuff)
  http://domain4.com (junk, junk, junk)

Unfortunately, what happened was this end result instead:

  http://domain1.com (stuff, stuff, stuff)(more stuff, more stuff, more stuff)
  http://domain2.com (junk, junk, junk)(even more junk, even more junk, even more junk)
  http://domain3.com (stuff, stuff, stuff)(junk, junk, junk)

which mixed the results from domain4 into domain3, wiping out domain4.

In words, here is what happened in Windows after I sourced the "doit.vim" file: when I place the cursor on any given line and then press and hold Shift, press @, and then quickly tap the "q", it seems to work, but it really doesn't, because it's not discriminating between lines that have different domains (which it should ignore, since the only lines to be acted upon are those with the same domain in the first field).

I see there is a second file, so I'll try that next, but first I wanted to let you know that I understand this is a difficult problem, so I appreciate that you attempted a solution, where I repeat, I don't see ANY Windows freeware solution on the entire Internet, even though this problem has been asked before. I still need to figure out how it does what it does, but first I want to test it out and let you know the results.
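The failure described above (domain4 being folded into domain3) happens because the raw @q macro never compares first fields. A guarded version of the join, sketched here in Python rather than vim, shows the check the macro lacks; the function name is invented for illustration:

```python
# Sketch of the check the raw @q macro lacks: only join a line upward
# when its first space-separated field matches the line above.
def guarded_join(lines):
    out = []
    for line in lines:
        first = line.split(" ", 1)[0]
        if out and out[-1].split(" ", 1)[0] == first:
            tail = line.split(" ", 1)[1] if " " in line else ""
            out[-1] = (out[-1] + " " + tail).rstrip()
        else:
            out.append(line)
    return out

# domain3 and domain4 differ, so nothing is merged:
print(guarded_join([
    "http://domain3.com (stuff, stuff, stuff)",
    "http://domain4.com (junk, junk, junk)",
]))
```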
#6
Finding first field duplicate lines in a sorted text file without uniq or awk or col using only vim - is it possible?
On Tue, 11 Jun 2019 16:14:15 +0000 (UTC), owl wrote:
> sort the file with vim's internal sort:
>   :sort u

Hi owl,

Most of the time I use Windows' sort, as in:
  :%!sort

But the VIM sort should work as well, I agree:
  :sort u

For the second test, keeping that "sort u" unique option in mind, I realized I needed to duplicate some data to test the sort-unique option inside of VIM (which I generally don't use because the Windows sort doesn't have a unique switch). Hence, for the second test, I started with this file:

  C:\> type test2.txt
  http://domain1.com (this, is, random, data)
  http://domain1.com (this, is, to, be, concatonated)
  http://domain2.com (this, is, duplicated, random, data)
  http://domain2.com (this, is, duplicated, random, data)
  http://domain2.com (this, is, more, random, data, to, be, concatonated)
  http://domain3.com (this, is, data, to not, be, concatonated)
  http://domain4.com (this, is, data, to, not, be, concatonated)

Then I sourced the doitall.vim file, which worked PERFECTLY!
  :source doitall.vim

Here's what happened in screenshots:
  1. https://i.postimg.cc/Hsyj0sWm/sort05.jpg
  2. https://i.postimg.cc/V5QfZns8/sort06.jpg

This is the resulting file after running the doitall.vim script:

  C:\> type test2.txt
  http://domain1.com (this, is, random, data) (this, is, to, be, concatonated)
  http://domain2.com (this, is, duplicated, random, data) (this, is, more, random, data, to, be, concatonated)
  http://domain3.com (this, is, data, to not, be, concatonated)
  http://domain4.com (this, is, data, to, not, be, concatonated)

That's perfect!

I changed the doitall script to handle thousands of lines by changing the dc variable from 100 to 10000, and then ran it on the original real file. It worked, but with an error of:

  Error detected while processing function All[4]..C:
  E486: Pattern not found: http://domain_abc.com^@
  Press ENTER or type command to continue

where it turned out that there was one domain with no data:
  http://domain_abc.com

which was an error in the data that my prior checks had not caught. That line should have been of the format:
  http://domain_abc.com (some, data, after, the, domain)

So that was my fault for not catching that one error in the file. But other than that, in just a minute or three, it reduced a text file containing about 10,000 lines to about 8,000 lines, where I need to do two things to move forward from here:
  a. I need to check for errors in the result, and,
  b. I need to figure out HOW this thing worked!

As a sanity check, I sourced doitall.vim _again_ on the data file it had just fixed, and let it crunch for two or three minutes, where what happened was not what I had expected. It's difficult to do a diff in Windows, so I'll just explain with line counts, where the tests showed line counts of:
  1. The first file was 10,631 lines
  2. The doitall.vim resulting file was 8,010 lines
  3. A second run of doitall.vim resulted in 7,998 lines
  4. A third run of doitall.vim resulted in 7,986 lines

As another sanity check, I created a second file containing only the domains, where I stripped out all the data using VIM:
  :%s/ .*//

which removed everything after a space, including the space, to leave me with just a file of the domains themselves. Then I ran a Windows VIM "sort u" on that file, which reduced:
  1. The original file contained 10,631 lines
  2. The VIM "sort u" reduced that to 7,975 lines
  3. Subsequent VIM "sort u" commands didn't change the number of lines

That means there are 7,975 unique domains in the first field, which is pretty close to what the doitall.vim resulted in, but where I need to check the few differences that remain from that count. I suspect the differences are due to the uneven use of tabs and spaces, where multiple spaces can occur, which I will clean up first, and then re-run the tests so that the original file has no tabs and only single spaces.
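That domain-count cross-check (strip everything after the first space, sort unique, count the lines) can be mirrored outside vim. A minimal sketch, assuming single-space delimiters:

```python
# Mirror of the sanity check: :%s/ .*// followed by :sort u, then counting.
def count_unique_domains(lines):
    return len({line.split(" ", 1)[0] for line in lines})

print(count_unique_domains([
    "http://domain1.com a",
    "http://domain1.com b",
    "http://domain2.com c",
]))  # 2
```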
#7
Finding first field duplicate lines in a sorted text file without uniq or awk or col using only vim - is it possible?
In alt.os.linux Arlen G. Holder wrote:
> [snip]
> I suspect the differences are due to the uneven use of tabs and spaces,
> where multiple spaces can occur, which I will clean up first, and then
> re-run the tests so that the original file has no tabs and only single
> spaces.

Change this line of the script:

  normal! "ayWspace

to:

  normal! "ayW

That "space" was a holdover from a different test and is not needed. It ends up causing the cursor to walk to the right on each iteration of the loop and may be causing a problem. Other than that, I'd have to see a diff to determine what it is tripping over on those subsequent runs where you get different counts.
#8
Finding first field duplicate lines in a sorted text file without uniq or awk or col using only vim - is it possible?
In alt.os.linux Arlen G. Holder wrote:
> [snip]
> Which mixed the results from domain4 into domain3, wiping out domain4.

If you want to watch it in slow motion, just type the keystrokes that make up the macro. What happened with domain3 and domain4 is that you didn't need to run the macro on domain3, because it only had one record. In other words, it was as if you had already run it and just needed to go on to the next domain. If you accidentally do this at any point, you can just hit undo ('u') and move to the next line.
#9
Finding first field duplicate lines in a sorted text file without uniq or awk or col using only vim - is it possible?
On Thu, 13 Jun 2019 12:46:38 +0000 (UTC), owl wrote:
> If you accidentally do this at any point, you can just hit undo ('u')
> and move to the next line.

Hi owl,

Thanks for all your dedicated help and explanations. I need to do two things before I can report back anything useful:
  1. I need to figure out HOW your magic works, and,
  2. I need to figure out WHY repetitive runs still found stuff.

Tonight when I get home I will change that line you suggested FROM:
  normal! "ayWspace
TO:
  normal! "ayW

And then I will source doitall.vim on a copy of the original file. I'll then repetitively source doitall.vim, where the "expected" output is that the second, third, and fourth runs, etc., should do nothing. In my tests yesterday, successive runs resulted in:
  1. 7,962 lines
  2. 7,951 lines
  3. 7,940 lines
  etc.

If, after making the fix you suggested, doitall.vim still removes about a dozen lines out of about 8 thousand lines, then I need to look up the best way to do a Windows diff on a large file:

  C:\> fc test_a_7951.txt test_a_7940.txt
  C:\> fc /c /lb10000 /n "test_a_7951.txt" "test_a_7940.txt"
  C:\> findstr /V test_a_7951.txt test_a_7940.txt > diff.txt
  etc.

Thanks for your purposefully helpful expert advice, where this thread is a CLASSIC for how Usenet is a public potluck, where the value we bring to the table to share can benefit everyone. Bear in mind, NOWHERE on the net have I _ever_ seen a solution to this problem for Windows users without resorting to Excel or Cygwin or some other non-vim freeware tool, where LOTS of people have asked the same question, where they usually want to filter based on a given field.
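If the Windows fc/findstr output stays unhelpful, the "in file1 but not file2" set is small enough to compute with a few lines of Python, which runs the same on both platforms. A sketch, assuming one record per line:

```python
# "What's in file1 that is not in file2", one record per line.
def lines_only_in_first(a_lines, b_lines):
    b = set(b_lines)
    return [l for l in a_lines if l not in b]

print(lines_only_in_first(["x", "y", "z"], ["x", "z"]))  # ['y']
```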
#10
Finding first field duplicate lines in a sorted text file without uniq or awk or col using only vim - is it possible?
On Thu, 13 Jun 2019 20:53:19 -0000 (UTC), Arlen G. Holder wrote:
> Bear in mind, NOWHERE on the net have I _ever_ seen a solution to this
> problem for Windows users without resorting to Excel or Cygwin or some
> other non-vim freeware tool [snip]

Hi owl,

I'm actually thoroughly confused, where my biggest problem is that I never needed a UNIX-like compare on Windows before, where the Windows diff is atrocious in output so far (using 'fc' or 'findstr' anyway), so let me just say that there's something fishy going on, since each time I run the recently modified doitall.vim, the subsequent file is a dozen lines shorter, every single time:

  o Original file: a_10706.txt
  o First source of doitall.vim = b_8015.txt
  o Second source of doitall.vim = c_8003.txt
  o Third source of doitall.vim = d_7991.txt
  o Fourth source of doitall.vim = e_7979.txt
  o Fifth source of doitall.vim = f_7967.txt

Notice _every_ file is 12 lines shorter than the prior file, when all I did was source the doitall.vim with the count set to 100000. You'd think a simple freaking DIFF would exist on Windows to find what those dozen lines are, and I'm sure it does, but the diffs I'm running literally suck for that simple purpose:

  C:\> fc test_a_7951.txt test_a_7940.txt
  C:\> fc /c /lb10000 /n "test_a_7951.txt" "test_a_7940.txt"
  C:\> findstr /V test_a_7951.txt test_a_7940.txt > diff.txt

So I have to diverge for a moment to find a freaking decent diff on Windows that simply tells me what's in file1 that is not in file2, which should only be a dozen lines, but all the diffs above show much more crap than just that. I do appreciate your help, as nobody has solved this to my knowledge, where I should be able to debug what the heck is different each time it's run. What I _expect_ is for the subsequent runs to do nothing, but each file is a dozen lines shorter.
#11
Finding first field duplicate lines in a sorted text file without uniq or awk or col using only vim - is it possible?
On Fri, 14 Jun 2019 17:51:07 -0000 (UTC), Arlen G. Holder wrote:
> So I have to diverge for a moment to find a freaking decent diff on
> Windows that simply tells me what's in file1 that is not in file2,
> which should only be a dozen lines, but all the diffs above show much
> more crap than just that.

Hi owl,

Even though the "diff" capability of Windows appears to be atrocious (or I just haven't found a reasonable diff command on Windows), I was able to narrow down "most" (if not all) of the differences which, so far, were all due to errors in the data-set syntax. For example, I had a domain with no data after it:

  http://domain1.xyz

And, for example, I had a domain twice on the same line:

  http://domain1.biz (stuff ... more stuff ... )http://domain1.biz (stuff)

And, for some reason, the sort order was wrong due to a capitalized domain:

  http://Domain1.edu (stuff)

And, at some point, I had a few lines missing the dot, with a dash instead:

  http://Domain1-org (stuff)

Your script works fine, it seems, if the data is fine, so I need to clean up the data a bit for things that I should have caught with vim regular expressions. My fault, not yours. I'll keep testing, as this has never been accomplished on the net as far as I can tell, which is a big deal ... and which I appreciate that you added to the Usenet potluck by bringing something of value to the table.

LATER, when I've confirmed it's working perfectly, I'll try to figure out HOW it works, as it's magic at the moment. I do like that it crunches for about a minute or two and then dings the computer for a minute or two, and that's how I know it's done.
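Errors like these can be flagged before running the script. Here is a rough Python lint, where the regular expression is only an assumption about what a well-formed line looks like (it will not catch every problem, e.g. a doubled URL on one line):

```python
import re

# Assumed shape: lowercase http URL with a dot in the host, one space,
# then at least one character of data. This pattern is an assumption,
# not something specified in the thread.
LINE_RE = re.compile(r"^http://[a-z0-9.-]+\.[a-z]+ \S.*$")

def bad_lines(lines):
    return [(n, l) for n, l in enumerate(lines, 1) if not LINE_RE.match(l)]

sample = [
    "http://domain1.com (stuff)",
    "http://domain1.xyz",            # no data after the domain
    "http://Domain1.edu (stuff)",    # capitalized domain
    "http://domain1-org (stuff)",    # dash instead of dot
]
print([n for n, _ in bad_lines(sample)])  # [2, 3, 4]
```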
#12
Finding first field duplicate lines in a sorted text file without uniq or awk or col using only vim - is it possible?
On Fri, 14 Jun 2019 19:40:55 -0000 (UTC), Arlen G. Holder wrote:
> LATER - when I've confirmed it's working perfectly, I'll try to figure
> out HOW it works - as it's magic at the moment. I do like that it
> crunches for about a minute or two and then dings the computer for a
> minute or two, and that's how I know it's done.

Hi owl,

Bingo! You are a genius!
o It works fine now!

Since I haven't tried to figure out the script yet, I simply tried to figure out which data it was barfing on, where _all_ the barfs were due to the data not being what it was supposed to be (i.e., the data wasn't what I thought it was, as it contained errors, which I had to clean up one by one). The Windows diff commands only kind of sort of told me where to look. The remaining data errors were similar to the prior errors, e.g.:

  http://domain1-net (stuff)
instead of:
  http://domain1.net (stuff)

Woo hoo! All of which are now fixed in the original file, which you could never have anticipated nor known about.

The current original database has 10,725 lines. The first pass of doitall.vim dropped that to 8,043 lines. All subsequent runs kept the file length at 8,043 lines. Which, of course, is the current new file size for the auto-generated db.

Thank you for solving the problem, where now I have to figure out the magic!

  function! C(blah)
    redir => cnt
    silent exe "%s#" . a:blah . "##gn"
    redir END
    let res = strpart(cnt, 1, stridx(cnt, " "))
    let i = 0
    while i < res - 1
      normal! @q
      let i += 1
    endwhile
  endfunction

  function! All()
    let i = 0
    while i < g:dc
      normal! "ayW
      call C(getreg("a"))
      normal! j
      let i += 1
    endwhile
  endfunction

  let dc = 30000
  let @q='0jvf dkJ0'
  sort u
  normal! gg
  call All()
#13
Finding first field duplicate lines in a sorted text file without uniq or awk or col using only vim - is it possible?
In alt.os.linux Arlen G. Holder wrote:
On Fri, 14 Jun 2019 19:40:55 -0000 (UTC), Arlen G. Holder wrote: LATER - when I've confirmed it's working perfectly, I'll try to figure out HOW it works - as it's magic at the moment. I do like that it crunches for about a minute or two and then dings the computer for a minute or two, and that's how I know it's done. Hi owl, Bingo! You are a genius! o It works fine now! Since I haven't tried to figure out the script yet, I simply tried to figure out which data it was barfing on, where _all_ the barfs were due to the data not being what it was supposed to be (i.e., the data wasn't what I thought it was - as it contained errors - which I had to clean up one by one). The Windows diff commands only kind of sort of told me where to look. The remaining data errors were similar to the prior errors, e.g., http://domain1-net (stuff) instead of http://domain1.net (stuff Woo hoo! All of which are now fixed in the original file, which you could never have anticipated nor known about. The current original database has 10,725 lines. The first pass of doitall.vim dropped that to 8,043 lines. All subsequent runs kept the file length at 8,043 lines. Which, of course, is the current new file size for the auto-generated db. Thank you for solving the problem, where now I have to figure out the magic! function! C(blah) redir = cnt silent exe "%s#" . a:blah . "##gn" redir END let res = strpart(cnt, 1, stridx(cnt, " ")) let i = 0 while i res - 1 normal! @q let i += 1 endwhile endfunction function! All() let i = 0 while i g:dc normal! "ayW call C(getreg("a")) normal! j let i += 1 endwhile endfunction let dc = 30000 let @q='0jvf dkJ0' sort u normal! 
The "magic" is in the macro, which is held in the "q" register:

  0jvf dkJ0

Some speed can probably be gained by re-writing that as:

  0jdWkJ0

So replace

  let @q='0jvf dkJ0'

with

  let @q='0jdWkJ0'

Explanation of '0jvf dkJ0':

  '0':  go to first char on the line
  'j':  step down to next line
  'v':  visual char select
  'f ': find a space
  'd':  delete (selection)
  'k':  go up to previous line
  'J':  join this line to the next
  '0':  go to first char on the line

Explanation of the alternative '0jdWkJ0':

  '0':  go to first char on the line
  'j':  step down to next line
  'dW': delete WORD
  'k':  go up to previous line
  'J':  join this line to the next
  '0':  go to first char on the line

The helper functions are just to get the counts right.

Function All() uses the total domain count (in the dc variable) to iterate a loop where, for each domain, it yanks the domain text into register "a":

  normal! "ayW

and passes that text to function C().

Function C() then gets the count of that text using this method:

  s/domain text//gn

by concatenating the argument text:

  silent exe "%s#" . a:blah . "##gn"

'#' delimiters are used, since the domain text will contain a couple of '/'. This would normally echo the count to the screen in the message box. We silence that message and also copy it to the "cnt" variable with "redir => cnt".

We want to extract the first string of that text, which represents the count, e.g.:

  12 matches on 6 lines

We want the "12".

  let res = strpart(cnt, 1, stridx(cnt, " "))

(For some reason, there is a null byte at the beginning, so we have to start at byte 1 instead of byte 0.)

Now we have the count in the "res" variable. We then loop, running the macro in register "q" one less time than the number of matches, so starting with i = 0 we loop while i < res - 1.

C() itself is called in a loop from All(), once for each domain. C() puts everything for a domain on one line. After C() returns, the loop runs "normal! j", which steps down to the next domain and runs the next iteration.
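The single step that the '0jdWkJ0' macro performs (pull the data of the next duplicate line up onto the current line, dropping the repeated domain) can be modeled outside vim. A minimal Python sketch of that one step, for illustration only - the helper name and list representation are my own, not from the script:

```python
def join_duplicate_line(lines, i):
    """Model one run of the '0jdWkJ0' macro with the cursor on line i:
    delete the first space-delimited WORD (the repeated domain) from
    line i+1, then join what is left onto line i."""
    nxt = lines.pop(i + 1)                              # 'j': move to next line
    rest = nxt.split(" ", 1)[1] if " " in nxt else ""   # 'dW': drop first WORD
    lines[i] = (lines[i] + " " + rest).rstrip()         # 'kJ': join onto line i
    return lines

lines = [
    "http://domain1.com data-a",
    "http://domain1.com data-b",
    "http://domain2.com data-c",
]
lines = join_duplicate_line(lines, 0)
# lines[0] now carries both payloads: "http://domain1.com data-a data-b"
```

Note that vim's J inserts the joining space itself; the model does the same by hand.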
#14
In alt.os.linux owl wrote:
Earlier I wrote that function C() gets the count by concatenating the argument text:

  silent exe "%s#" . a:blah . "##gn"

We should change that match count line to anchor the pattern at the start of the line:

from:

  silent exe "%s#" . a:blah . "##gn"

to:

  silent exe "%s#^" . a:blah . "##gn"
#15
On Sat, 15 Jun 2019 13:31:33 +0000 (UTC), owl wrote:
Some speed can probably be gained by re-writing that as:

Hi owl,

Thank you for adding value to the Usenet potluck, to share with all. Thanks for doing the explanation for me, where I'm no slouch in vim commands, but that "magic" was way above my capabilities.

Did you already have that available, or did you expressly code it to resolve the problem set? I ask because that's a LOT of effort to code it, where NOBODY has ever solved this problem (to my knowledge), for Windows inside of the vim freeware.

It does take a few minutes (three or four) to run on 10,000 lines, but that's fine considering how long it would take to do the job manually.

I did find another error in my input data, which is generated, so I have to fix the generation script more so than the data itself. This was in the file:
  http://domain1.xyz(stuff)
instead of this:
  http://domain1.xyz (stuff)

Which doesn't break your script, but your script considers them two different lines, understandably, since the space is supposed to be the delimiter between the domain and the stuff. I've since fixed that, so it's not a problem.

I'm curious: if you wrote that script, did you hand-do it first inside of vim and then copy the hand code to the script? It's the most involved vim script I've ever seen, which is why I ask.

  function! C(blah)
    redir => cnt
    silent exe "%s#^" . a:blah . "##gn"
    redir END
    let res = strpart(cnt, 1, stridx(cnt, " "))
    let i = 0
    while i < res - 1
      normal! @q
      let i += 1
    endwhile
  endfunction

  function! All()
    let i = 0
    while i < g:dc
      normal! "ayW
      call C(getreg("a"))
      normal! j
      let i += 1
    endwhile
  endfunction

  let dc = 30000
  let @q='0jdWkJ0'
  sort u
  normal! gg
  call All()
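The strpart()/stridx() step in C() just pulls the leading number out of vim's report line (e.g. "12 matches on 6 lines"). The same extraction can be sketched in Python; treating the stray leading byte that the thread mentions as a NUL is my assumption here:

```python
# Redirected report text, with an assumed stray leading byte at position 0.
msg = "\x0012 matches on 6 lines"

# Mirrors: let res = strpart(cnt, 1, stridx(cnt, " "))
# i.e. skip byte 0, then take everything up to the first space.
first_space = msg.index(" ")
res = int(msg[1:first_space])
# res holds the match count, 12
```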