Computer & Technology Related: post here for help and discussion of computing and related technology (Internet, TVs, phones, consoles, computers, tablets and any other gadgets).

Counting char/words in HTML files?

Old 14 November 2002, 10:58 AM
  #1  
Brendan Hughes
Scooby Regular
Thread Starter
 
Join Date: Oct 2000
Location: same time, different place
Posts: 11,313

I've got a database of texts (laws) on the internet, uploaded in HTML format. There are about 500 laws, some of them half a page, some are 40 pages.

Now I have to estimate the content of text in the database, in order to estimate how much it would cost to translate it.

Is there an easy way of doing a word/character count? I can go back to each file and open it in Word, and select Tools - Word Count, but a) it's a bit of a pain, and b) I risk somehow corrupting the original files (they've all had the Microsoft coding cleaned out) if I close and Save Changes.
Netscape 4.7 has "page info - content length" but this gives double the characters that Word does - my guess is it counts the coding as well, so that won't be accurate. MSIE 5.0 only has the file size, which will again count the coding.
DreamWeaver 4 doesn't seem to have anything for estimating content.

Ideally I would like to do this using a browser, though I can grudgingly go back to the individual files if I really have to - in which case I would feel safer using DW rather than Word.

Any suggestions pleeeeeaze?

Brendan
Old 14 November 2002, 01:40 PM
  #2  
stevencotton
Scooby Regular
 
Join Date: Jan 2001
Location: behind twin turbos
Posts: 2,710

You'll have to write something to do this automatically (obviously). How do you define a word? Is it just the content (laws) you want counted, or all of the HTML?

I'd use Perl with LWP, then something like HTML::Parser to strip the text out of the HTML. Keep in mind that parsing HTML isn't like parsing XML; most people's HTML is extremely substandard.

Steve.
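Steve's "how do you define a word?" question matters here, because word counts and character counts give very different numbers and translation work is usually priced per word or per thousand characters. Once the text is out of the HTML, wc shows both. A quick sketch (sample.txt is just a stand-in file for illustration):

```shell
# Word vs character counts of plain text: worth knowing both numbers
# before asking a translation agency for a quote.
printf 'The quick brown fox\n' > sample.txt
wc -w < sample.txt   # words
wc -m < sample.txt   # characters (the trailing newline is included)
```

The same two flags work on any plain-text extract, whatever tool produced it.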
Old 14 November 2002, 01:42 PM
  #3  
Fosters
Scooby Regular
 
Join Date: Jul 2000
Location: Islington
Posts: 2,145

Copy the text from the HTML page and paste it into Notepad (failing that, WordPad); this will lose all the formatting.
Then copy the Notepad/WordPad text into Word, where you can view the word count and other stats.

Old 14 November 2002, 02:07 PM
  #4  
stevencotton
Scooby Regular
 
Join Date: Jan 2001
Location: behind twin turbos
Posts: 2,710

This works (if you have Perl installed). You can give it a list of URLs instead of just the one I used for testing. You could build a list of all possible pages (with find or dir /s or something) and feed that in instead.

Keep in mind I'm splitting on whitespace, which may or may not be what you need, but it will give you a rough idea for pricing purposes. It'll only count pages that are returned with a 200 OK, so things like redirects won't be counted.

Well, it's worth what you paid for it


#!/usr/local/bin/perl -w

use strict;

my $wordcount = 0;

{
package MyParser;
use base qw(HTML::Parser);

## Called by HTML::Parser for every run of text between tags.
sub text {
my ($self, $origtext, $is_cdata) = @_;
return if $origtext =~ /^\s*$/;      # skip whitespace-only runs
my @words = split(' ', $origtext);   # split(' ') ignores leading whitespace
$wordcount += scalar @words;
}
}

package main;

use LWP::UserAgent;
use HTTP::Request;
use Data::Dumper;

## You'll have a list of URLs here
##
my $url = 'http://www.scoobynet.co.uk/bbs/';

## Make the HTTP request
##
my $ua = LWP::UserAgent->new();
my $req = HTTP::Request->new(GET => $url);

## Only parse 200 responses
##
my $res = $ua->request($req);
die "Bad response from server, code was ", $res->code, "\n"
if ($res->code != 200);

my $html = $res->content;

my $p = MyParser->new;
$p->parse($html);
$p->eof;

print "Total words in $url: $wordcount\n";
Old 14 November 2002, 02:08 PM
  #5  
stevencotton
Scooby Regular
 
Join Date: Jan 2001
Location: behind twin turbos
Posts: 2,710

Hehe, stupid UBB code turned the Data::Dumper line into a smiley
Old 14 November 2002, 03:47 PM
  #6  
Brendan Hughes
Scooby Regular
Thread Starter
 
Join Date: Oct 2000
Location: same time, different place
Posts: 11,313

Fosters - thanks, but with 500 texts to go through I'd rather just risk opening the original Word files.

Steve - er, bloody hell, thanks - it sure looks impressive. I'm a user not a programmer so I'll have to pass this to our IT dept, when they've finished sorting out our current server migration problem - I've never heard of Perl but am going to hope that we have it. I think I was hoping for a browser or DW plug-in - let's see how we get on with this!

Brendan

Oh, LOL at the smiley mangling Data::Dumper!
Old 14 November 2002, 03:56 PM
  #7  
stevencotton
Scooby Regular
 
Join Date: Jan 2001
Location: behind twin turbos
Posts: 2,710

Yeah, you can actually remove the Data::Dumper line, since I only used it for dumping $res, I think; it's not needed for the script to work.

You'll need LWP and HTML::Parser installed too. Perl is free, as are all the modules.

Steve.
Old 14 November 2002, 04:22 PM
  #8  
orbv
Scooby Regular
 
Join Date: Apr 2001
Location: Hants
Posts: 1,103

$ man wc
Old 14 November 2002, 04:22 PM
  #9  
stevencotton
Scooby Regular
 
Join Date: Jan 2001
Location: behind twin turbos
Posts: 2,710

wc doesn't work over HTTP, of course, and will count all the HTML markup too.


[Edited by stevencotton - 11/14/2002 4:24:20 PM]
Old 14 November 2002, 05:22 PM
  #10  
orbv
Scooby Regular
 
Join Date: Apr 2001
Location: Hants
Posts: 1,103

$ man wc
$ man wget
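Put together, orbv's two hints cover the local-file case as well — a rough sketch, assuming a UNIX box with the usual text tools (the sed tag-strip is crude, not a real HTML parser, but close enough for a pricing estimate):

```shell
#!/bin/sh
# Strip tags from every .html file under the current directory and
# total the remaining words. sed 's/<[^>]*>//g' is a crude tag-stripper:
# fine for an estimate, but tags that span lines will leak through.
total=0
for f in $(find . -name '*.html'); do
    n=$(sed -e 's/<[^>]*>//g' "$f" | wc -w)
    total=$((total + n))
done
echo "Total words: $total"
```

For pages fetched over HTTP rather than read from disk, the same idea works with wget writing to stdout, e.g. `wget -q -O - "$url" | sed -e 's/<[^>]*>//g' | wc -w`.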
Old 14 November 2002, 11:11 PM
  #11  
stevencotton
Scooby Regular
 
Join Date: Jan 2001
Location: behind twin turbos
Posts: 2,710

Not everyone uses UNIX, unfortunately.