Counting char/words in HTML files?
Thread Starter
Scooby Regular
Joined: Oct 2000
Posts: 11,314
Likes: 4
From: same time, different place
I've got a database of texts (laws) on the internet, uploaded in HTML format. There are about 500 laws, some of them half a page, some are 40 pages.
Now I have to estimate the content of text in the database, in order to estimate how much it would cost to translate it.
Is there an easy way of doing a word/character count? I can go back to each file and open it in Word, and select Tools - Word Count, but a) it's a bit of a pain, and b) I risk somehow corrupting the original files (they've all had the Microsoft coding cleaned out) if I close and Save Changes.
Netscape 4.7 has "page info - content length" but this gives double the characters that Word does - my guess is it counts the coding as well, so that won't be accurate. MSIE 5.0 only has the file size, which will again count the coding.
DreamWeaver 4 doesn't seem to have anything for estimating content.
Ideally I would like to do this using a browser, though I can grudgingly go back to the individual files if I really have to - in which case I would feel safer using DW rather than Word.
Any suggestions pleeeeeaze?
Brendan
You'll have to write something to do this automatically (obviously). How do you define a word? Is it just the content (laws) you want counted, or all of the HTML?
I'd use Perl with LWP to fetch the pages, then something like HTML::Parser to strip the text out of the HTML. Keep in mind that parsing HTML isn't like parsing XML: most people's HTML is extremely substandard.
Steve.
Copy the text from the HTML page and paste it into Notepad or, failing that, WordPad. This will strip all the formatting.
Then copy the Notepad/WordPad text into Word, where you can view the word count and other stats.
This works (if you have Perl installed). You can give it a list of URLs instead of just one like I did for testing. You could just build a list of all possible pages (with find or dir /s or something) and feed that in instead.
Keep in mind, I'm splitting on whitespace, which may or may not be what you need, but it will give you a rough idea for pricing purposes. It'll only count pages that come back with a 200 OK, so error responses such as a 404 won't be counted (LWP follows redirects for GET requests by default).
Well, it's worth what you paid for it
#!/usr/local/bin/perl -w
use strict;

my $wordcount = 0;

{
    package MyParser;
    use base qw(HTML::Parser);

    # Called by HTML::Parser for every run of text between tags
    sub text {
        my ($self, $origtext, $is_cdata) = @_;
        # Skip runs that contain no word characters at all
        return if $origtext !~ /\w/;
        # split ' ' splits on whitespace and ignores leading whitespace
        my @words = split ' ', $origtext;
        $wordcount += scalar @words;
    }
}

package main;
use LWP::UserAgent;
use HTTP::Request;
use Data::Dumper;

## You'll have a list of URLs here
##
my $url = 'http://www.scoobynet.co.uk/bbs/';

## Make the HTTP request
##
my $ua = LWP::UserAgent->new();
my $req = HTTP::Request->new(GET => $url);
my $res = $ua->request($req);

## Only parse 200 responses
##
die "Bad response from server, code was ", $res->code, "\n"
    if ($res->code != 200);

my $html = $res->content;
my $p = MyParser->new;
$p->parse($html);
$p->eof;    # flush any text the parser is still buffering
print "Total words in $url: $wordcount\n";
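Since the goal is a word and character count over about 500 files, a variant that reads local HTML files rather than URLs may be handy. The following is only a sketch under some assumptions: the file names are passed on the command line, the script name count_html.pl and the helper count_text are made up for illustration, "characters" means non-whitespace characters, and files are read as raw bytes (so a multi-byte encoding such as UTF-8 would inflate the character count). Files are opened read-only, so the originals are not modified.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Return (word count, non-whitespace character count) for one HTML string.
sub count_text {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the handler text with entities such as &amp; decoded
        text_h => [ sub { $text .= shift() . ' ' }, 'dtext' ],
    );
    $parser->unbroken_text(1);                   # deliver each text run in one piece
    $parser->ignore_elements(qw(script style));  # skip code that is not content
    $parser->parse($html);
    $parser->eof;

    my @words = split ' ', $text;                # whitespace-separated "words"
    my $chars = () = $text =~ /\S/g;             # characters excluding whitespace
    return (scalar @words, $chars);
}

# Hypothetical usage: perl count_html.pl law1.html law2.html ...
my ($total_words, $total_chars) = (0, 0);
for my $file (@ARGV) {
    open my $fh, '<', $file or do { warn "Cannot read $file: $!\n"; next };
    my $html = do { local $/; <$fh> };           # slurp the whole file
    close $fh;

    my ($words, $chars) = count_text($html);
    print "$file: $words words, $chars characters\n";
    $total_words += $words;
    $total_chars += $chars;
}
print "Total: $total_words words, $total_chars characters\n";
```

Each text run is joined with a space before counting, so a word split by an inline tag (for example H<b>ello</b>) counts as two words; for a translation estimate that error should be small.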
Fosters - thanks, but with 500 texts to repeat I'd prefer just to risk opening the original Word files
Steve - er, bloody hell, thanks - it sure looks impressive. I'm a user not a programmer so I'll have to pass this to our IT dept, when they've finished sorting out our current server migration problem - I've never heard of Perl but am going to hope that we have it. I think I was hoping for a browser or DW plug-in - let's see how we get on with this!
Brendan
Oh, LOL at Data::Dumper!

Yeah, you can actually remove the Data::Dumper line. I think I only used it for dumping $res; the script works without it.
You'll need LWP and HTML::Parser installed too. Perl is free, as are all the modules.
Steve.
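One way to see whether those modules are already present is to try loading them. This is a rough sketch; the helper name module_available is made up for illustration, and the module list simply mirrors the ones the script uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return true if the named module can be loaded by this Perl.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Foo::Bar -> Foo/Bar.pm
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(LWP::UserAgent HTTP::Request HTML::Parser)) {
    printf "%-16s %s\n", $module,
        module_available($module) ? 'installed' : 'NOT installed';
}
```

Anything reported as missing can usually be installed from CPAN.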
