How to add text to a large file?
OK, smarty-pants developers, here's another question for you.
How can I filter out duplicate entries in my text file now that it's far too big to fit into Excel?
Stefan
P.S. Without having to purchase a new development environment
Get a copy of gawk from http://gnuwin32.sourceforge.net/packages/gawk.htm
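Create a file with Notepad: foo.awk
# Remember each distinct line in the order it first appears.
{
    if (data[$0]++ == 0)
        lines[++count] = $0
}
# After the last input line, print the unique lines.
END {
    for (i = 1; i <= count; i++)
        print lines[i]
}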
Create a test file : foo.txt
fred
jim
fred
eric
jim
bert
At a DOS command prompt in an appropriate directory:
gawk -f foo.awk foo.txt > foo.new
Examine foo.new
Voila!
(v. quick too!)
EDIT: but don't know about 950,000 lines - should be OK!
Last edited by MartinM; Nov 3, 2004 at 10:55 AM.
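On the 950,000-line worry: the script above holds every unique line in memory twice (once as a key in data, once in lines) until the END block runs. A streaming variant, sketched below as a shell session, should roughly halve that and prints each line as soon as it is first seen. The array name seen is just a label picked for this example, and the sample data is the foo.txt test file from the post. On Windows, substitute gawk for awk and put double quotes around the program.

```sh
# Build the same six-line test file used in the post.
printf 'fred\njim\nfred\neric\njim\nbert\n' > foo.txt

# Print a line only the first time it is seen; no END block is needed.
# seen[$0]++ is 0 (false) on first sight, so !seen[$0]++ is true exactly once.
awk '!seen[$0]++' foo.txt > foo.new

cat foo.new
```

The output order matches the order of first appearance, the same as the original foo.awk.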
In Unix, use the sort command, then use 'my' dedupe Perl script (it removes duplicate entries and puts them in a separate file in case you need them).
Sorry, not at work this week so can't send you the script.
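Since the script itself was never posted, here is a rough sketch of the same idea (keep one copy of each line, set the repeats aside in a second file) done in awk instead. It is not the Perl script mentioned above; the file names unique.txt and dupes.txt and the variable names are made up for the example. Unlike the sort-based route it needs no sorting and leaves the lines in their original order.

```sh
# Sample input with two repeated entries.
printf 'fred\njim\nfred\neric\njim\nbert\n' > foo.txt

# First occurrences go to standard output; repeats go to the file named in dupes.
awk -v dupes="dupes.txt" '
    seen[$0]++ { print > dupes; next }   # already seen: log it and skip
    { print }                            # first sighting: keep it
' foo.txt > unique.txt

cat unique.txt
```

With this input, unique.txt should hold fred, jim, eric, bert and dupes.txt should hold fred, jim. If the input has no repeats, dupes.txt is never created.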
Back again 
The VB script is a pile of cr@p with more than a few hundred entries. Started importing over 950,000 yesterday and it was still going this morning.
I can import via XML, but I need to get my list into that format.
The gawk script from Martin does the job of filtering out duplicates in my list, and the Perl script from Steve/Mark_A does the job of adding the XML text to the beginning of lines.
What I need to do is modify the Perl script to add </fpc4:Str> to the end of each line.
TIA
Stefan

Using Mark's DOS-friendly version:
perl -nle "print 'http://' . $_ . '</fpc4:Str>'" inputfile.txt > outputfile.txt
Or if you just wanted to append the string to each line:
perl -nle "print $_ . '</fpc4:Str>'" inputfile.txt > outputfile.txt
gawk version....
foo.awk
# Append the closing tag to every non-empty line.
{
    if (length($0) > 0)
        print $0 "</fpc4:Str>"
}
gawk -f foo.awk foo.txt > foo.new
...or...
combine the dedupe with adding the string
foo.awk
{
    if (data[$0]++ == 0)
        lines[++count] = $0
}
END {
    # Print each unique line followed by the closing tag.
    for (i = 1; i <= count; i++)
        print lines[i] "</fpc4:Str>"
}
...but it's nice to see we're now using the proper tools (gawk, Perl, etc.) for the job rather than databases, spreadsheets and miscellaneous programming languages (flame suit on...)
Totally agree, Martin. Doing it in 1 or 2 lines with a tool that is perfect for simple text manipulation is the way to go; you shouldn't have to write an app for that! Trouble is, I know bugger all about gawk or Perl, and it's just as easy for me to write a Delphi app in literally 2 minutes to tear through text files.
Originally Posted by GaryK
...I know bugger all about gawk or Perl, and it's just as easy for me to write a Delphi app in literally 2 minutes to tear through text files...
...and it takes about an hour of playing with simple things to get the hang of it once and for all. Well worth the investment in my view: it's just another string for your bow when it comes to text processing.
Martin,
Just wanted to resurrect this, as I have had a cursory look at gawk and it seems to fall short in one area I really need for text file processing: working with comma-separated data. Or maybe I am missing something!
In Delphi I can just issue a:
AStringList.CommaText := <line from file>;
and bingo, all my comma-separated columns are turned into a list, so to access column 5 (the list is indexed from zero) I can just do
sColumn := AStringList[4];
I cannot see any way that gawk parses a line into columns, which is all I ever need to do.
Cheers
Gary
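For what it's worth, awk does split every input line into columns on its own; the separator just defaults to whitespace rather than a comma. A minimal sketch, assuming plain comma-separated data with no quoted fields that themselves contain commas (the file names cols.csv and col5.txt and the sample rows are invented for the example):

```sh
# Two sample rows of six comma-separated columns.
printf 'a,b,c,d,e,f\n1,2,3,4,5,6\n' > cols.csv

# -F, sets the field separator to a comma.
# Fields count from 1, so $5 is the fifth column (AStringList[4] in Delphi terms).
awk -F, '{ print $5 }' cols.csv > col5.txt

cat col5.txt
```

This should print e and then 5. The built-in split($0, cols, ",") does the same job into a named array if you would rather index a list. Neither form understands quoting the way CommaText does, so a field like "Smith, John" would be cut in two and needs extra handling.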