Finding first field duplicate lines in a sorted text file without uniq or awk or col using only vim - is it possible?



 
 
  #1  
Old June 5th 19, 11:30 PM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
Arlen G. Holder

This is a vim-only question where I simply ask whether it's even
possible, using just the freeware vim, to find first-field duplicates -
and whether you _already_ have an answer.

How can I find the duplicate domains in a text file of the format
http://domain1.com (space separated random data)
http://domain1.com (more space separated random data)
http://domain2.com (perhaps different space separated random data)

I want to identify that domain1.com is duplicated in that file, up to
the first space. What comes after the space need only be concatenated
later, so that, in the end, domain1.com is not only listed just once,
but that one instance of domain1.com carries all the data assigned to it
by the multiple lines in the original file.

I have no problem sorting the whole file inside of vim
:%!sort

I have no problem finding all fully duplicated lines inside of vim
:%s/^\(.*\)\(\n\1\)\+$/\1/gec

Nor do I have a problem finding duplicate field 1 outside of vim on Linux
$ awk '!seen[$1]++' filename.txt
$ sort -u -t' ' -k1,1 filename.txt

For example, if the delimiter were a comma, I would add that as shown below
$ awk -F',' '!seen[$1]++' filename.txt

In this case, the delimiter is a space and the first field is always a
domain of the format http://domainname.com but anything can be after that
space after the domain, since the file is generated using a script.

All I want to do is combine duplicate data that is found for the same
domain.

It's actually a Windows vim question on a dual boot desktop, where I'm
constantly booting to a Linux VM to run sort, uniq, and then awk, but
where I'd rather just find a vim command that works on both platforms
and can identify lines based on a duplicate first space-separated field.

That's why "uniq", "awk", "sed", "grep", and "col -b" aren't available:
I don't want to install Cygwin, since I can already dual boot to Linux
without it.

The Windows PowerShell can _count_ the number of fully duplicate lines
Get-Content .\filename.txt | Group-Object | Where-Object { $_.Count -gt 1 } | Select Name, Count
And PowerShell can remove fully duplicate lines:
Get-Content filename.txt | Sort-Object -Unique | Set-Content output.txt

But I haven't yet figured out how to make PowerShell look at the first
field only, in order to find partially duplicate lines based on field 1.
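
My best guess so far - untested, so treat it as a sketch - is that
Group-Object can take a script block as the grouping key, which would
group on field 1 and then let the leftovers be joined back together:

Get-Content .\filename.txt |
  Group-Object { ($_ -split ' ')[0] } |
  ForEach-Object {
    # rejoin everything after the first space from each line in the group
    $rest = $_.Group | ForEach-Object { ($_ -split ' ', 2)[1] }
    "$($_.Name) $($rest -join ' ')"
  } |
  Set-Content output.txt

That assumes a single-space delimiter and that every line has something
after the domain.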
  #2  
Old June 6th 19, 06:18 AM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
Jasen Betts

On 2019-06-05, Arlen G. Holder wrote:
This is a vim-only question where I simply ask whether it's even
possible, using just the freeware vim, to find first-field duplicates -
and whether you _already_ have an answer.


The macro language is Turing complete, so yes, almost certainly possible.

--
When I tried casting out nines I made a hash of it.
  #3  
Old June 11th 19, 05:14 PM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
owl

In alt.os.linux Arlen G. Holder wrote:
This is a vim-only question where I simply ask whether it's even
possible, using just the freeware vim, to find first-field duplicates -
and whether you _already_ have an answer.
...
All I want to do is combine duplicate data that is found for the same
domain.


Try this:

anon@lowtide:~/code/vim/work$ cat doit.vim
let @q='0jvf dkJ0'
anon@lowtide:~/code/vim/work$ cat domains
http://domain1.com a b c
http://domain2.com lll mmm nnn
http://domain1.com x y z
http://domain1.com k l m
http://domain2.com aaa bbb ccc
http://domain2.com sss ttt uuu
http://domain1.com m n o
anon@lowtide:~/code/vim/work$ vi domains
anon@lowtide:~/code/vim/work$ cat domains
http://domain1.com a b c k l m m n o x y z
http://domain2.com aaa bbb ccc lll mmm nnn sss ttt uuu
anon@lowtide:~/code/vim/work$

You may be able to fiddle with it to get it to work on a range. For now,
just run the macro with a count, where the count is one less than the
number of occurrences of a domain.

sort the file with vim's internal sort:
:sort u

source the script:
:so doit.vim

Put your cursor on the first line for a domain, then either
repeatedly hit @q until done with that domain or guess a count.
For instance, if you think the count is 10, run 9@q.

Drop down to the next domain with 'j', and repeat.

https://streamable.com/g653l


  #4  
Old June 13th 19, 12:29 AM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
owl

In alt.os.linux Arlen G. Holder wrote:
This is a vim-only question where I simply ask whether it's even
possible, using just the freeware vim, to find first-field duplicates -
and whether you _already_ have an answer.
...


Did a bit more with this. Just source the following file, and it
should take care of everything. The variable dc (domain count)
is set to 100. Just set it to something you know is large enough
for the number of different domains in the file.

anon@lowtide:~/code/vim/work$ cat doitall.vim
function! C(blah)
  " count how many times the domain in a:blah occurs, by running a
  " counting-only substitute and capturing its message
  redir => cnt
  silent exe "%s#" . a:blah . "##gn"
  redir END
  " the message reads like "12 matches on 6 lines"; grab the leading count
  let res = strpart(cnt, 1, stridx(cnt, " "))
  let i = 0
  " run the join macro one time fewer than the number of matches
  while i < res - 1
    normal! @q
    let i += 1
  endwhile
endfunction

function! All()
  let i = 0
  while i < g:dc
    " yank the first WORD (the domain) into register a
    normal! "ayWspace
    call C(getreg("a"))
    normal! j
    let i += 1
  endwhile
endfunction

let dc = 100
let @q='0jvf dkJ0'
sort u
normal! gg
call All()
anon@lowtide:~/code/vim/work$
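
To run it, open the data file and do :so doitall.vim, or kick it off
from the shell with something like:

vim -c 'source doitall.vim' domains

and then :w to write out the result.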


  #5  
Old June 13th 19, 09:09 AM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
Arlen G. Holder

On Wed, 12 Jun 2019 23:29:06 +0000 (UTC), owl wrote:

Did a bit more with this. Just source the following file, and it
should take care of everything.


Hi owl,

Thank you for that help since it is a problem asked ALL OVER the net,
and only solved on Linux, but never solved (to my knowledge) on Windows!

I realize this is an extremely difficult problem so I will
explain what happened, step by step, when I tested each solution:

Here's a test file
C:\ type test.txt
http://domain1.com (stuff, stuff, stuff)
http://domain1.com (more stuff, more stuff, more stuff)
http://domain2.com (junk, junk, junk)
http://domain2.com (even more junk, even more junk, even more junk)
http://domain3.com (stuff, stuff, stuff)
http://domain4.com (junk, junk, junk)

Here is the doit.vim file that is sourced inside that test file:
C:\ type doit.vim
let @q='0jvf dkJ0'

Here's a graphical representation of what happened, in 4 steps:
1. https://i.postimg.cc/zXpyxPZD/sort01.jpg
2. https://i.postimg.cc/XvBX7jZX/sort02.jpg
3. https://i.postimg.cc/HxGnrq5z/sort03.jpg
4. https://i.postimg.cc/FRjR3TV4/sort04.jpg

What we wanted to happen was this end result:
http://domain1.com (stuff, stuff, stuff)(more stuff, more stuff, more stuff)
http://domain2.com (junk, junk, junk)(even more junk, even more junk, even more junk)
http://domain3.com (stuff, stuff, stuff)
http://domain4.com (junk, junk, junk)

Unfortunately, what happened was this end result instead:
http://domain1.com (stuff, stuff, stuff)(more stuff, more stuff, more stuff)
http://domain2.com (junk, junk, junk)(even more junk, even more junk, even more junk)
http://domain3.com (stuff, stuff, stuff)(junk, junk, junk)

Which mixed the results from domain4 into domain3, wiping out domain4.

In words, here's what happened on Windows after I sourced the "doit.vim"
file: I placed the cursor on a given line and ran @q (press and hold
Shift for the @, then quickly tap q). It seems to work, but it really
doesn't, because it's not discriminating between lines with different
domains (which it should ignore, since the only lines to be acted upon
are those sharing the same domain in the first field).

I see there is a second file, so I'll try that next, but first I
wanted to let you know that I understand this is a difficult problem
so I appreciate that you attempted a solution, where I repeat,
I don't see ANY Windows freeware solution on the entire Internet,
even though this problem has been asked before.

I still need to figure out how it does what it does, but first
I want to test it out and let you know the results.
  #6  
Old June 13th 19, 10:00 AM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
Arlen G. Holder

On Tue, 11 Jun 2019 16:14:15 +0000 (UTC), owl wrote:

sort the file with vim's internal sort:
:sort u


Hi owl,

Most of the time I use Windows' sort, as in:
:%!sort
But the VIM sort should work as well, I agree:
:sort u

For the second test, keeping that vim "sort u" unique option in mind,
I realized I needed to duplicate some data to test the unique sort
option inside of vim (which I generally don't use, because the Windows
sort doesn't have a unique switch).

Hence, for the second test, I started with this file:
C:\ type test2.txt
http://domain1.com (this, is, random, data)
http://domain1.com (this, is, to, be, concatonated)
http://domain2.com (this, is, duplicated, random, data)
http://domain2.com (this, is, duplicated, random, data)
http://domain2.com (this, is, more, random, data, to, be, concatonated)
http://domain3.com (this, is, data, to not, be, concatonated)
http://domain4.com (this, is, data, to, not, be, concatonated)

Then I sourced the doitall.vim file, which worked PERFECTLY!
:source doitall.vim

Here's what happened in screenshots:
1. https://i.postimg.cc/Hsyj0sWm/sort05.jpg
2. https://i.postimg.cc/V5QfZns8/sort06.jpg

This is the resulting file after running the doitall.vim script:
C:\ type test2.txt
http://domain1.com (this, is, random, data) (this, is, to, be, concatonated)
http://domain2.com (this, is, duplicated, random, data) (this, is, more, random, data, to, be, concatonated)
http://domain3.com (this, is, data, to not, be, concatonated)
http://domain4.com (this, is, data, to, not, be, concatonated)

That's perfect!

I changed the doitall script to handle thousands of lines
by changing the DC variable from 100 to 10000,
and then ran it on the original real file.

It worked but with an error of...
Error detected while processing function All[4]..C:
E486: Pattern not found: http://domain_abc.com^@
Press ENTER or type command to continue

Where it turned out that there was one domain with no data
http://domain_abc.com
Which was an error in the data that my prior checks had not caught.
That line should have been of the format:
http://domain_abc.com (some, data, after, the, domain)
So that was my fault for not catching that one error in the file.

But other than that, in just a minute or three, it reduced a text file
containing about 10,000 lines to about 8,000 lines, where I need to
do two things to move forward from here:
a. I need to check for errors in the result, and,
b. I need to figure out HOW this thing worked!

As a sanity check, I sourced the doitall.vim _again_ in the data file
it just fixed, and let it crunch for two or three minutes,
where what happened was not what I had expected.

It's difficult to do a diff in Windows, so I'll just explain with line
counts, where the successive runs showed:
1. The first file was 10,631 lines
2. The doitall.vim resulting file was 8,010 lines
3. A second run of doitall.vim resulted in 7,998 lines
4. A third run of doitall.vim resulted in 7,986 lines

As another sanity check, I created a second file containing only
the domains where I stripped out all the data using VIM
:%s/ .*//
Which removed everything after a space, including the space
to leave me with just a file of the domains themselves. Then
I ran vim's "sort u" (on Windows) on that file:
1. The original file contained 10,631 lines
2. The VIM "sort u" reduced that to 7,975 lines
3. Subsequent VIM "sort u" commands didn't change the number of lines

That means there are 7975 unique domains in the first field,
which is pretty close to what the doitall.vim resulted in, but where
I need to check the few differences that remain from that count.

I suspect the differences are due to the uneven use of tabs and
spaces, where multiple spaces can occur, which I will clean up
first, and then re-run the tests so that the original file has no
tabs and only single spaces.
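
Presumably two substitutions inside of vim will do for that cleanup,
first turning tabs into spaces and then squeezing runs of spaces down
to one (assuming single spaces are all the script expects):
:%s/\t/ /g
:%s/  \+/ /g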
  #7  
Old June 13th 19, 01:33 PM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
owl

In alt.os.linux Arlen G. Holder wrote:
...
It's difficult to do a diff in Windows, so I'll just explain with line
counts, where the successive runs showed:
1. The first file was 10,631 lines
2. The doitall.vim resulting file was 8,010 lines
3. A second run of doitall.vim resulted in 7,998 lines
4. A third run of doitall.vim resulted in 7,986 lines
...
I suspect the differences are due to the uneven use of tabs and
spaces, where multiple spaces can occur, which I will clean up
first, and then re-run the tests so that the original file has no
tabs and only single spaces.


Change this line of the script:
normal! "ayWspace
to
normal! "ayW

That "space" was a holdover from a different test and is not needed.
It ends up causing the cursor to walk to the right on each iteration
of the loop and may be causing a problem. Other than that, I'd have to
see a diff to determine what it is tripping over on those subsequent
runs where you get different counts.

  #8  
Old June 13th 19, 01:46 PM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
owl

In alt.os.linux Arlen G. Holder wrote:
...
Unfortunately, what happened was this end result instead:
http://domain1.com (stuff, stuff, stuff)(more stuff, more stuff, more stuff)
http://domain2.com (junk, junk, junk)(even more junk, even more junk, even more junk)
http://domain3.com (stuff, stuff, stuff)(junk, junk, junk)

Which mixed the results from domain4 into domain3, wiping out domain4.


If you want to watch it in slow motion, just type the keystrokes
that make up the macro.

What happened with domain3 and domain4 is that you didn't need to run
the macro on domain3, because it only had one record. In other words,
it was as if you had already run it and just needed to go on to the
next domain. If you accidentally do this at any point, you can just
hit undo ('u') and move to the next line.
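
By the way, instead of guessing a count, you can run @q once and then
just keep tapping @@ (which repeats the last register you executed)
until that domain's lines are all joined.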


  #9  
Old June 13th 19, 09:53 PM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
Arlen G. Holder

On Thu, 13 Jun 2019 12:46:38 +0000 (UTC), owl wrote:

What happened with domain3 and domain4 is that you didn't need to run
the macro on domain3, because it only had one record. In other words,
it was as if you had already run it and just needed to go on to the
next domain.


Hi owl,
Thanks for all your dedicated help and explanations.
I need to do two things before I can report back anything useful.
1. I need to figure out HOW your magic works, and,
2. I need to figure out WHY repetitive runs still found stuff.

Tonight when I get home I will change that line you suggested
FROM: normal! "ayWspace
TO: normal! "ayW

And then I will source doitall.vim on a copy of the original file.
I'll then repetitively source doitall.vim, where the "expected" output is
that the second, third, and fourth runs, etc., should do nothing.

In my tests yesterday, successive runs resulted in...
1. 7,962 lines
2. 7,951 lines
3. 7,940 lines
etc.

If, after making the fix you suggested, doitall.vim still removes about a
dozen lines out of about 8 thousand lines, then I need to look up the best
way to do a Windows diff on a large file.
C:\ fc test_a_7951.txt test_a_7940.txt
C:\ fc /c /lb10000 /n "test_a_7951.txt" "test_a_7940.txt"
C:\ findstr /V test_a_7951.txt test_a_7940.txt > diff.txt
etc.

Thanks for your purposefully helpful expert advice, where this thread is a
CLASSIC for how Usenet is a public potluck, where the value we bring to the
table to share can benefit everyone.

Bear in mind, NOWHERE on the net have I _ever_ seen a solution to this
problem for Windows users without resorting to Excel or Cygwin or some
other non-vim freeware tool - where LOTS of people have asked the same
question, where they usually want to filter based on a given field.
  #10  
Old June 14th 19, 06:51 PM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
Arlen G. Holder

On Thu, 13 Jun 2019 20:53:19 -0000 (UTC), Arlen G. Holder wrote:

Bear in mind, NOWHERE on the net have I _ever_ seen a solution to this
problem for Windows users without resorting to Excel or Cygwin or some
other non-vim freeware tool - where LOTS of people have asked the same
question, where they usually want to filter based on a given field.


Hi owl,
I'm actually thoroughly confused. My biggest problem is that I never
needed a UNIX-like compare on Windows before, and the Windows diff
output is atrocious so far (using 'fc' or "findstr" anyway). So let me
just say that there's something fishy going on, since each time I run
the recently modified doitall.vim, the resulting file is a dozen lines
shorter, every single time.
o Original file: a_10706.txt
o First source of doitall.vim = b_8015.txt
o Second source of doitall.vim = c_8003.txt
o Third source of doitall.vim = d_7991.txt
o Fourth source of doitall.vim = e_7979.txt
o Fifth source of doitall.vim = f_7967.txt

Notice _every_ file is 12 lines shorter than the prior file when all I did
was source the doitall.vim with the count set to 100000.

You'd think a simple freaking DIFF would exist on Windows to find what
those dozen lines are, and I'm sure it does - but the diff's I'm running
literally suck for that simple purpose.
C:\ fc test_a_7951.txt test_a_7940.txt
C:\ fc /c /lb10000 /n "test_a_7951.txt" "test_a_7940.txt"
C:\ findstr /V test_a_7951.txt test_a_7940.txt > diff.txt

So I have to diverge for a moment to find a freaking decent diff on Windows
that simply tells me what's in file1 that is not in file2, which should
only be a dozen lines, but all the diffs above show much more crap than
just that.
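
Come to think of it, since vim is the one tool I have on both platforms,
maybe its built-in diff mode is worth a try here too, e.g.:
C:\ vim -d test_a_7951.txt test_a_7940.txt
which should highlight just the lines that differ, side by side.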

I do appreciate your help as nobody has solved this to my knowledge, where
I should be able to debug what the heck is different each time it's run.

What I _expect_ is for the subsequent runs to do nothing, but each file is
a dozen lines shorter.
  #11  
Old June 14th 19, 08:40 PM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
Arlen G. Holder

On Fri, 14 Jun 2019 17:51:07 -0000 (UTC), Arlen G. Holder wrote:

So I have to diverge for a moment to find a freaking decent diff on Windows
that simply tells me what's in file1 that is not in file2, which should
only be a dozen lines, but all the diffs above show much more crap than
just that.


Hi owl,

Even though the "diff" capability of Windows appears to be atrocious
(or I just haven't found a reasonable diff command on Windows), I was
able to narrow down "most" (if not all) of the differences, which, so
far, were all due to errors in the data set syntax. For example, I had
a domain with no data after it:
http://domain1.xyz

And, for example, I had a domain twice in the same line
http://domain1.biz (stuff ... more stuff ... )http://domain1.biz (stuff)

And, for some reason, the sort order was wrong due to a capitalized domain
http://Domain1.edu (stuff)

And, at some point, I had a few files missing the dot with a dash instead
http://Domain1-org (stuff)

Your script works fine, it seems, if the data is fine, so I need to clean
up the data a bit for things that I should have caught with vim regular
expressions. My fault - not yours.
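
(The capitalized domain sorting oddly makes sense in hindsight, since
vim's :sort compares case sensitively by default, so "Domain1" sorts
apart from "domain1"; if I ever want case folded while deduplicating,
it looks like ":sort iu" would do that.)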

I'll keep testing as this has never been accomplished on the net as far as
I can tell - which is a big deal ... and which I appreciate that you added
to the Usenet potluck by bringing something of value to the table.

LATER - when I've confirmed it's working perfectly, I'll try to figure out
HOW it works - as it's magic at the moment. I do like that it crunches for
about a minute or two and then dings the computer for a minute or two, and
that's how I know it's done.
  #12  
Old June 15th 19, 01:48 AM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
Arlen G. Holder

On Fri, 14 Jun 2019 19:40:55 -0000 (UTC), Arlen G. Holder wrote:

LATER - when I've confirmed it's working perfectly, I'll try to figure out
HOW it works - as it's magic at the moment. I do like that it crunches for
about a minute or two and then dings the computer for a minute or two, and
that's how I know it's done.


Hi owl,
Bingo! You are a genius!
o It works fine now!

Since I haven't tried to figure out the script yet, I simply tried to
figure out which data it was barfing on, where _all_ the barfs were due to
the data not being what it was supposed to be (i.e., the data wasn't what I
thought it was - as it contained errors - which I had to clean up one by
one).

The Windows diff commands only kind of sort of told me where to look.

The remaining data errors were similar to the prior errors, e.g.,
http://domain1-net (stuff)
instead of
http://domain1.net (stuff)

Woo hoo! All of which are now fixed in the original file, which you could
never have anticipated nor known about.

The current original database has 10,725 lines.
The first pass of doitall.vim dropped that to 8,043 lines.
All subsequent runs kept the file length at 8,043 lines.

Which, of course, is the current new file size for the auto-generated db.
Thank you for solving the problem, where now I have to figure out the
magic!
function! C(blah)
  redir => cnt
  silent exe "%s#" . a:blah . "##gn"
  redir END
  let res = strpart(cnt, 1, stridx(cnt, " "))
  let i = 0
  while i < res - 1
    normal! @q
    let i += 1
  endwhile
endfunction

function! All()
  let i = 0
  while i < g:dc
    normal! "ayW
    call C(getreg("a"))
    normal! j
    let i += 1
  endwhile
endfunction

let dc = 30000
let @q='0jvf dkJ0'
sort u
normal! gg
call All()
  #13  
Old June 15th 19, 02:31 PM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
owl

In alt.os.linux Arlen G. Holder wrote:
...
Bingo! You are a genius!
o It works fine now!
...
Thank you for solving the problem, where now I have to figure out the
magic!


The "magic" is in the macro, which is held in the "q" register:
0jvf dkJ0
Some speed can probably be gained by re-writing that as:
0jdWkJ0

So replace
let @q='0jvf dkJ0'
with
let @q='0jdWkJ0'

Explanation of '0jvf dkJ0':
'0': go to first char on the line
'j': step down to next line
'v': visual char select
'f ': find a space
'd': delete (selection)
'k': go up to previous line
'J': join this line to next
'0': go to first char on the line.

Explanation of alternative '0jdWkJ0':
'0': go to first char on the line
'j': step down to next line
'dW': delete WORD
'k': go up to previous line
'J': join this line to next
'0': go to first char on the line.
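
So, on your earlier test data, one run of @q with the cursor on the
first line turns this:

http://domain1.com (stuff, stuff, stuff)
http://domain1.com (more stuff, more stuff, more stuff)

into this ('J' joins with a single space):

http://domain1.com (stuff, stuff, stuff) (more stuff, more stuff, more stuff)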

The helper functions are just to get the counts right. Function All()
uses the total domain count (in dc variable) to iterate a loop where,
for each domain, it yanks the domain text into register "a":

normal! "ayW

and passes that text to function C(). Function C() then gets count of
that text using this method:

s/domain text//gn

by concatenating the argument text:

silent exe "%s#" . a:blah . "##gn"

'#' delimiters are used, since the domain text will contain a couple '/'.
This would normally echo the count to the screen in the message box.
We silence that message and also copy it to the "cnt" variable with
"redir = cnt". We want to extract the first string of that text,
which represents the count, e.g.:

"12 matches on 6 lines"

We want the "12".

let res = strpart(cnt, 1, stridx(cnt, " "))
(For some reason, there is a null byte at the beginning, so we have to
start with byte 1 instead of byte 0).

Now we have the count in the "res" variable.
We then loop, running the macro in register "q" one time fewer than the
number of matches, so starting with i = 0 we loop while i < res - 1.

C() itself is called in a loop from All(), for each domain. C() puts
everything for a domain on one line. After C() returns, the loop runs
"normal! j", which steps down to the next domain and runs the next
iteration.
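
By the way, now that you can see how it works: the count bookkeeping
(and the dc guess) isn't strictly needed. The same join can be done by
walking the sorted buffer once and joining whenever the first WORD
repeats. A sketch - call it JoinDups(), only lightly tested, and it
assumes every line has a non-blank first field:

function! JoinDups()
  let i = 1
  while i < line('$')
    " compare the first whitespace-delimited field of this line and the next
    if split(getline(i))[0] ==# split(getline(i + 1))[0]
      " append the next line's data (minus its domain), then delete that line
      call setline(i, getline(i) . ' ' . substitute(getline(i + 1), '^\S\+\s*', '', ''))
      execute (i + 1) . 'delete _'
    else
      let i += 1
    endif
  endwhile
endfunction

sort u
call JoinDups()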

  #14  
Old June 15th 19, 03:03 PM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
owl

In alt.os.linux owl wrote:
...
silent exe "%s#" . a:blah . "##gn"
...


We should change the match count line

from:

silent exe "%s#" . a:blah . "##gn"

to:

silent exe "%s#^" . a:blah . "##gn"

Anchoring the pattern to the start of the line keeps the count from also
matching a domain that happens to appear in the middle of a line (as with
the doubled-domain lines among your data errors).

  #15  
Old June 16th 19, 03:41 AM posted to alt.os.linux,alt.comp.os.windows-10,alt.comp.freeware
Arlen G. Holder

On Sat, 15 Jun 2019 13:31:33 +0000 (UTC), owl wrote:

Some speed can probably be gained by re-writing that as:
0jdWkJ0


Hi owl,

Thank you for adding value to the Usenet potluck, to share with all.

Thanks for doing the explanation for me, where I'm no slouch in vim
commands, but that "magic" was way above my capabilities.

Did you already have that available, or did you expressly code it to
resolve the problem set? I ask because that's a LOT of effort to code it,
where NOBODY has ever solved this problem (to my knowledge), for Windows
inside of the vim freeware.

It does take a few minutes (three or four minutes) to run on 10,000 lines,
but that's fine considering how long it would take to manually do the job.

I did find another error in my input data, which is generated so I have to
fix the generation script more so than the data itself.

This was in the file:
http://domain1.xyz(stuff)
instead of this:
http://domain1.xyz (stuff)

Which doesn't cause your script a problem, but where your script considers
them two different lines, understandably, since the space is supposed to be
the delimiter between the domain and the stuff.

I've since fixed that so it's not a problem.

I'm curious: if you wrote that script, did you first work it out by hand
inside of vim and then copy those commands into the script? It's the most
involved vim script I've ever seen, which is why I ask.

function! C(blah)
  redir => cnt
  silent exe "%s#^" . a:blah . "##gn"
  redir END
  let res = strpart(cnt, 1, stridx(cnt, " "))
  let i = 0
  while i < res - 1
    normal! @q
    let i += 1
  endwhile
endfunction

function! All()
  let i = 0
  while i < g:dc
    normal! "ayW
    call C(getreg("a"))
    normal! j
    let i += 1
  endwhile
endfunction

let dc = 30000
let @q='0jdWkJ0'
sort u
normal! gg
call All()
 



