Counting char/words in HTML files?
#1
Scooby Regular
Thread Starter
Join Date: Oct 2000
Location: same time, different place
Posts: 11,313
Likes: 0
Received 4 Likes on 2 Posts
I've got a database of texts (laws) on the internet, uploaded in HTML format. There are about 500 laws, some of them half a page, some are 40 pages.
Now I have to estimate the content of text in the database, in order to estimate how much it would cost to translate it.
Is there an easy way of doing a word/character count? I can go back to each file and open it in Word, and select Tools - Word Count, but a) it's a bit of a pain, and b) I risk somehow corrupting the original files (they've all had the Microsoft coding cleaned out) if I close and Save Changes.
Netscape 4.7 has "page info - content length" but this gives double the characters that Word does - my guess is it counts the coding as well, so that won't be accurate. MSIE 5.0 only has the file size, which will again count the coding.
Dreamweaver 4 doesn't seem to have anything for estimating content.
Ideally I would like to do this using a browser, though I can grudgingly go back to the individual files if I really have to - in which case I would feel safer using DW rather than Word.
Any suggestions pleeeeeaze?
Brendan
#2
Scooby Regular
You'll have to write something to do this automatically (obviously). How do you define a word? Is it just the content (laws) you want counted, or all of the HTML?
I'd use Perl and LWP, then something like HTML::Parser to strip the text out of the HTML. Keep in mind parsing HTML isn't like parsing XML - most people's HTML is extremely substandard.
Steve.
#3
Scooby Regular
Join Date: Jul 2000
Location: Islington
Posts: 2,145
Likes: 0
Received 0 Likes on 0 Posts
Copy the text from the HTML page and paste it into Notepad (failing that, WordPad) - this will lose all the formatting.
Then copy the Notepad/WordPad text into Word, where you can view the word count and other stats.
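The same copy-strip-count idea can be sketched as a few lines of Perl that work directly on a saved .html file, with no browser or Word involved. The crude regex tag-strip below is only a rough approximation (a real parser copes better with messy markup), and the command-line handling is just an example:

```perl
#!/usr/local/bin/perl -w
use strict;

# Crude tag-strip word count for one saved HTML file. Good enough
# for a rough pricing estimate; HTML::Parser is safer on messy markup.
sub count_words_in_html {
    my ($html) = @_;
    $html =~ s/<script\b.*?<\/script>/ /gis;   # drop script blocks
    $html =~ s/<style\b.*?<\/style>/ /gis;     # drop stylesheets
    $html =~ s/<[^>]*>/ /gs;                   # strip remaining tags
    my @words = split ' ', $html;              # split on whitespace
    return scalar @words;
}

if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<', $file or die "Can't open $file: $!\n";
    my $html = do { local $/; <$fh> };         # slurp the whole file
    close $fh;
    print "Words in $file: ", count_words_in_html($html), "\n";
}
```

Run it as `perl countwords.pl somefile.html` - the count is approximate, but should be close enough for pricing.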
#4
Scooby Regular
This works (if you have Perl installed). You can give it a list of URLs instead of just one, like I did for testing. You could just build a list of all possible pages (with find or dir /s or something) and feed that in instead.
Keep in mind, I'm splitting on whitespace, which may or may not be what you need, but it will give you a rough idea for pricing purposes. It'll only count pages that are returned with a 200 OK, so things like redirects won't be counted.
Well, it's worth what you paid for it
#!/usr/local/bin/perl -w
use strict;

my $wordcount = 0;

{
    package MyParser;
    use base qw(HTML::Parser);

    # Called by HTML::Parser for each run of text between tags.
    sub text {
        my ($self, $origtext, $is_cdata) = @_;
        return if $origtext =~ /^\W*$/;     # skip whitespace/punctuation-only runs
        my @words = split ' ', $origtext;   # split on whitespace
        $wordcount += scalar @words;
    }
}

package main;
use LWP::UserAgent;
use HTTP::Request;
use Data::Dumper;   # only used while debugging - safe to remove

## You'll have a list of URLs here
##
my $url = 'http://www.scoobynet.co.uk/bbs/';

## Make the HTTP request
##
my $ua = LWP::UserAgent->new();
my $req = HTTP::Request->new(GET => $url);

## Only parse 200 responses
##
my $res = $ua->request($req);
die "Bad response from server, code was ", $res->code, "\n"
    if ($res->code != 200);

my $html = $res->content;
my $p = MyParser->new;
$p->parse($html);
print "Total words in $url: $wordcount\n";
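To run this over the whole database rather than one page, the list-of-URLs idea could be sketched like this - reading one URL per line from a plain text file (urls.txt is a made-up name) and looping the same request-and-parse code over each:

```perl
#!/usr/local/bin/perl -w
use strict;

# Build a URL list from a plain text file - one URL per line,
# blank lines and '#' comments skipped.
sub read_url_list {
    my ($file) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!\n";
    my @urls;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+//;                  # trim leading whitespace
        $line =~ s/\s+$//;                  # trim trailing whitespace
        next if $line eq '' or $line =~ /^#/;
        push @urls, $line;
    }
    close $fh;
    return @urls;
}

# Then, instead of the single $url above, loop over the list:
#   for my $url (read_url_list('urls.txt')) {
#       ...same LWP request and MyParser->parse as before,
#       adding each page's words to the running total...
#   }
```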
#6
Scooby Regular
Thread Starter
Join Date: Oct 2000
Location: same time, different place
Posts: 11,313
Likes: 0
Received 4 Likes on 2 Posts
Fosters - thanks, but with 500 texts to repeat that for, I'd prefer just to risk opening the original Word files.
Steve - er, bloody hell, thanks - it sure looks impressive. I'm a user not a programmer so I'll have to pass this to our IT dept, when they've finished sorting out our current server migration problem - I've never heard of Perl but am going to hope that we have it. I think I was hoping for a browser or DW plug-in - let's see how we get on with this!
Brendan
Oh, and LOL at the forum mangling Data::Dumper into "Data:umper"!
#7
Scooby Regular
Yeah, you can actually remove the Data::Dumper line - I only used it for dumping $res while testing, I think, so it's not needed for this to work.
You'll need LWP and HTML::Parser installed too. Perl is free, as are all the modules.
Steve.