How to add text to a large file?

Nov 3, 2004 | 10:27 AM
  #31  
ozzy
Thread Starter
Scooby Regular
Joined: Nov 1999
Posts: 10,504
Likes: 1
From: Scotland, UK

OK, smarty-pants developers, here's another question for you.

How can I filter out duplicate entries in my text file, now that it's far too big to fit into Excel?

Stefan

P.S. Without having to purchase a new development environment.

Nov 3, 2004 | 10:30 AM
  #32  
IWatkins
Scooby Regular
Joined: Mar 2000
Posts: 4,531
Likes: 0
From: Gloucestershire, home of the lawnmower.

Could do it in Delphi in about a minute.

Nov 3, 2004 | 10:43 AM
  #33  
ozzy
Thread Starter

I could probably do it on a Cray supercomputer in less time, but I don't have one of those either.

Nov 3, 2004 | 10:43 AM
  #34  
MartinM
Scooby Regular
Joined: Jun 1999
Posts: 1,496
Likes: 0

Get a copy of gawk from http://gnuwin32.sourceforge.net/packages/gawk.htm

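Create a file with Notepad: foo.awk
# Keep each line only the first time it is seen
{
    if (data[$0]++ == 0)
        lines[++count] = $0
}

# Once all input is read, print the unique lines in their original order
END {
    for (i = 1; i <= count; i++)
        print lines[i]
}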

Create a test file : foo.txt
fred
jim
fred
eric
jim
bert

At a DOS command prompt in an appropriate directory:
gawk -f foo.awk foo.txt > foo.new

Examine foo.new

Voila!
(v. quick too!)
EDIT: but don't know about 950,000 lines - should be OK!
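
For anyone without gawk to hand, the same first-occurrence-wins idea can be sketched in Python. This is only an illustration and not part of the script above: the name dedupe_lines is made up for the example, and it assumes the unique lines fit in memory (as the awk version does too).

```python
def dedupe_lines(lines):
    """Return lines with later duplicates dropped, keeping first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:  # same test as data[$0]++ == 0 in the awk script
            seen.add(line)
            unique.append(line)
    return unique


if __name__ == "__main__":
    # Same sample data as foo.txt
    sample = ["fred", "jim", "fred", "eric", "jim", "bert"]
    print(dedupe_lines(sample))  # ['fred', 'jim', 'eric', 'bert']
```

The set does the "have I seen this line?" check and the list remembers the order, which is the job the two awk arrays share between them.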

Last edited by MartinM; Nov 3, 2004 at 10:55 AM.

Nov 3, 2004 | 11:07 AM
  #35  
Dracoro
Scooby Regular
Joined: Sep 2001
Posts: 10,261
Likes: 0
From: A powerslide near you

In Unix, use the sort command, then use 'my' dedupe Perl script (it removes duplicate entries and puts them in a separate file in case you need them). Sorry, not at work this week so can't send you the script.

Nov 3, 2004 | 11:08 AM
  #36  
Dracoro

Or bung the file into Access (if you have it of course!) and dedupe using that.

Nov 3, 2004 | 11:14 AM
  #37  
ozzy
Thread Starter

OK thanks guys. Gawk has done the trick.

Stefan

Nov 3, 2004 | 11:39 AM
  #38  
stevencotton
Scooby Regular
Joined: Jan 2001
Posts: 2,710
Likes: 1
From: behind twin turbos

Under Unix you could just 'sort -u' without having to resort to writing a script.
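
One difference worth noting: sort -u returns the unique lines in sorted order, whereas the gawk script earlier in the thread keeps them in their original order. A rough Python sketch of the sort-then-unique behaviour follows. The name sorted_unique is made up for the example, and Python's plain string ordering may not match the collation your sort command uses.

```python
def sorted_unique(lines):
    """Return the distinct lines in ascending order, roughly like `sort -u`."""
    # set() drops the duplicates, sorted() puts what is left in order
    return sorted(set(lines))


if __name__ == "__main__":
    sample = ["fred", "jim", "fred", "eric", "jim", "bert"]
    print(sorted_unique(sample))  # ['bert', 'eric', 'fred', 'jim']
```

If the order of the entries matters, the first-seen approach is the one to use; if it doesn't, sorting is the shorter route.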

Nov 3, 2004 | 12:56 PM
  #39  
Stueyb
Scooby Regular
Joined: May 2002
Posts: 1,893
Likes: 0

would be an interesting file to the pervs though

Nov 3, 2004 | 01:45 PM
  #40  
MartinM

Originally Posted by ozzy
OK thanks guys. Gawk has done the trick.

Stefan

Invoice on the way....

Nov 4, 2004 | 11:49 AM
  #41  
ozzy
Thread Starter

Back again

The VB script is a pile of cr@p with more than a few hundred entries. Started importing over 950,000 yesterday and it was still going this morning.

I can import via XML, but I need to get my list into that format.

The Gawk script from Martin does the job of filtering out duplicates in my list, and the Perl script from Steve/Mark_A does the job of adding the XML text to the beginning of lines.

What I need to do is modify the Perl script to add </fpc4:Str> to the end of each line.

TIA
Stefan

Nov 4, 2004 | 01:09 PM
  #42  
stevencotton

Using Mark's DOS-friendly version:

perl -nle "print 'http://' . $_ . '</fpc4:Str>'" inputfile.txt > outputfile.txt

Or if you just wanted to append the string to each line:

perl -nle "print $_ . '</fpc4:Str>'" inputfile.txt > outputfile.txt
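
If Perl isn't installed either, the second one-liner (append a string to every line) might look roughly like this in Python. It is just a sketch: append_suffix is a name invented for the example, and like perl -l it strips the line ending before adding the string.

```python
def append_suffix(lines, suffix="</fpc4:Str>"):
    """Yield each line with its line ending removed and suffix appended."""
    for line in lines:
        yield line.rstrip("\r\n") + suffix


if __name__ == "__main__":
    sample = ["fred\n", "jim\n"]
    for out in append_suffix(sample):
        print(out)
```

Because it is a generator, it can be fed an open file object and will only hold one line at a time, which matters with a list of around 950,000 entries.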

Nov 4, 2004 | 02:03 PM
  #43  
MartinM

gawk version....

foo.awk
# Append the closing tag to every non-empty line
{
    if (length($0) > 0)
        print $0 "</fpc4:Str>"
}

gawk -f foo.awk foo.txt > foo.new

...or...

combine the dedupe with adding the string
foo.awk
{
    if (data[$0]++ == 0)
        lines[++count] = $0
}

END {
    for (i = 1; i <= count; i++)
        print lines[i] "</fpc4:Str>"
}

...but it's nice to see we're now using the proper tools (gawk, perl etc) for the job rather than databases, spreadsheets and miscellaneous programming languages (flame suit on...)

Nov 4, 2004 | 02:31 PM
  #44  
ozzy
Thread Starter

I'll admit, they are the proper tools. Just not as idiot-proof as opening the files up in a spreadsheet.

Thanks for the help lads

Nov 4, 2004 | 03:11 PM
  #45  
GaryK
Scooby Regular
Joined: Sep 1999
Posts: 4,037
Likes: 0
From: Bedfordshire

Totally agree, Martin. Doing it in 1 or 2 lines with a tool that is perfect for simple text manipulation is the way to go; you shouldn't have to write an app for that! Trouble is, I know bugger all about gawk or Perl; it's just as easy for me to write a Delphi app in literally 2 minutes to tear through text files.

Nov 4, 2004 | 04:33 PM
  #46  
MartinM

Originally Posted by GaryK
...I know bugger all about gawk or Perl, it's just as easy for me to write a Delphi app in literally 2 minutes to tear through text files...

I knew nothing about gawk (even its existence) until 10 days ago. The help file that comes with it is really good (...where I got the dedupe code from) and it takes about an hour of playing with simple things to get the hang of it once and for all.

Well worth the investment in my view - it's just another string to your bow when it comes to text processing....

Nov 10, 2004 | 03:18 PM
  #47  
GaryK

Martin,

Just wanted to resurrect this as I have had a cursory look at gawk, and it seems to fall short in the one area I really need for text file processing: working with comma-separated data. Or maybe I am missing something!

In Delphi I can just issue a:

AStringList.CommaText := <line from file>;

and bingo, all my comma-separated columns are turned into a list, so to access column 5 (the list is indexed from zero) I can just do

sColumn := AStringList[4];

I cannot see any way that gawk parses a line into columns, which is all I ever need to do.

Cheers

Gary
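
As far as I know, awk does split every line into fields ($1, $2 and so on) and the separator can be set with the -F option, so something along the lines of gawk -F, "{ print $5 }" foo.txt should print the fifth column. A plain comma split will not cope with quoted fields that contain commas, though. For comparison, here is a minimal Python sketch of the same column lookup using the standard csv module, which does handle quoted fields. The name get_column is made up for the example.

```python
import csv


def get_column(line, index):
    """Split one comma-separated line and return the field at a zero-based index."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list
    fields = next(csv.reader([line]))
    return fields[index]


if __name__ == "__main__":
    # Column 5 is index 4, the same as AStringList[4] in the Delphi version
    print(get_column("a,b,c,d,e,f", 4))  # e
```

The zero-based index matches the Delphi string list, whereas awk's $5 counts from one.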

All times are GMT +1.