Computer & Technology Related: post here for help and discussion of computing and related technology (Internet, TVs, phones, consoles, computers, tablets and any other gadgets).

Counting char/words in HTML files?

Old 14 November 2002, 10:58 AM
  #1  
Brendan Hughes
Scooby Regular
Thread Starter
 
Join Date: Oct 2000
Location: same time, different place
Posts: 11,313

I've got a database of texts (laws) on the internet, uploaded in HTML format. There are about 500 laws, some of them half a page, some are 40 pages.

Now I have to estimate the content of text in the database, in order to estimate how much it would cost to translate it.

Is there an easy way of doing a word/character count? I can go back to each file and open it in Word, and select Tools - Word Count, but a) it's a bit of a pain, and b) I risk somehow corrupting the original files (they've all had the Microsoft coding cleaned out) if I close and Save Changes.
Netscape 4.7 has "page info - content length" but this gives double the characters that Word does - my guess is it counts the coding as well, so that won't be accurate. MSIE 5.0 only has the file size, which will again count the coding.
DreamWeaver 4 doesn't seem to have anything for estimating content.

Ideally I would like to do this using a browser, though I can grudgingly go back to the individual files if I really have to - in which case I would feel safer using DW rather than Word.

Any suggestions pleeeeeaze?

Brendan
Old 14 November 2002, 01:40 PM
  #2  
stevencotton
Scooby Regular
 
Join Date: Jan 2001
Location: behind twin turbos
Posts: 2,710

You'll have to write something to do this automatically (obviously). How do you define a word? Is it just the content (laws) you want counted, or all of the HTML?

I'd use Perl with LWP, then something like HTML::Parser to strip the text out of the HTML. Keep in mind that parsing HTML isn't like parsing XML; most people's HTML is extremely substandard.

Steve.
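Steve's "how do you define a word?" question matters here, because word counts and character counts give very different numbers and translation work is usually priced per word or per thousand characters. Once the text is out of the HTML, wc shows both. A quick sketch (sample.txt is just a stand-in file for illustration):

```shell
# Word vs character counts of plain text: worth knowing both numbers
# before asking a translation agency for a quote.
printf 'The quick brown fox\n' > sample.txt
wc -w < sample.txt   # words
wc -m < sample.txt   # characters (the trailing newline is included)
```

The same two flags work on any plain-text extract, whatever tool produced it.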
Old 14 November 2002, 01:42 PM
  #3  
Fosters
Scooby Regular
 
Join Date: Jul 2000
Location: Islington
Posts: 2,145

Copy the text from the HTML page and paste it into Notepad (failing that, WordPad); this will lose all the formatting.
Then copy the Notepad/WordPad text into Word, where you can view the word count and other stats.

Old 14 November 2002, 02:07 PM
  #4  
stevencotton
Scooby Regular
 
Join Date: Jan 2001
Location: behind twin turbos
Posts: 2,710

This works (if you have Perl installed). You can give it a list of URLs instead of just the one I used for testing. You could build a list of all possible pages (with find or dir /s or something) and feed that in instead.

Keep in mind I'm splitting on whitespace, which may or may not be what you need, but it will give you a rough idea for pricing purposes. It'll only count pages that are returned with a 200 OK, so things like redirects won't be counted.

Well, it's worth what you paid for it


#!/usr/local/bin/perl -w

use strict;

my $wordcount = 0;

{
package MyParser;
use base qw(HTML::Parser);

## Called by HTML::Parser for every run of text between tags.
sub text {
my ($self, $origtext, $is_cdata) = @_;
return if $origtext =~ /^\s*$/;      # skip whitespace-only runs
my @words = split(' ', $origtext);   # split(' ') ignores leading whitespace
$wordcount += scalar @words;
}
}

package main;

use LWP::UserAgent;
use HTTP::Request;
use Data::Dumper;

## You'll have a list of URLs here
##
my $url = 'http://www.scoobynet.co.uk/bbs/';

## Make the HTTP request
##
my $ua = LWP::UserAgent->new();
my $req = HTTP::Request->new(GET => $url);

## Only parse 200 responses
##
my $res = $ua->request($req);
die "Bad response from server, code was ", $res->code, "\n"
if ($res->code != 200);

my $html = $res->content;

my $p = MyParser->new;
$p->parse($html);
$p->eof;

print "Total words in $url: $wordcount\n";
Old 14 November 2002, 02:08 PM
  #5  
stevencotton
Scooby Regular
 
Join Date: Jan 2001
Location: behind twin turbos
Posts: 2,710

Hehe, stupid UBB code turned the Data::Dumper line into a smiley
Old 14 November 2002, 03:47 PM
  #6  
Brendan Hughes
Scooby Regular
Thread Starter
 
Join Date: Oct 2000
Location: same time, different place
Posts: 11,313

Fosters - thanks, but with 500 texts to go through I'd rather just risk opening the original Word files.

Steve - er, bloody hell, thanks - it sure looks impressive. I'm a user not a programmer so I'll have to pass this to our IT dept, when they've finished sorting out our current server migration problem - I've never heard of Perl but am going to hope that we have it. I think I was hoping for a browser or DW plug-in - let's see how we get on with this!

Brendan

Oh, LOL at the smiley mangling Data::Dumper!
Old 14 November 2002, 03:56 PM
  #7  
stevencotton
Scooby Regular
 
Join Date: Jan 2001
Location: behind twin turbos
Posts: 2,710

Yeah, you can actually remove the Data::Dumper line, since I only used it for dumping $res, I think; it's not needed for the script to work.

You'll need LWP and HTML::Parser installed too. Perl is free, as are all the modules.

Steve.
Old 14 November 2002, 04:22 PM
  #8  
orbv
Scooby Regular
 
Join Date: Apr 2001
Location: Hants
Posts: 1,103

$ man wc
Old 14 November 2002, 04:22 PM
  #9  
stevencotton
Scooby Regular
 
Join Date: Jan 2001
Location: behind twin turbos
Posts: 2,710

wc doesn't work over HTTP, of course, and will count all the HTML markup too.


[Edited by stevencotton - 11/14/2002 4:24:20 PM]
Old 14 November 2002, 05:22 PM
  #10  
orbv
Scooby Regular
 
Join Date: Apr 2001
Location: Hants
Posts: 1,103

$ man wc
$ man wget
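Put together, orbv's two hints cover the local-file case as well — a rough sketch, assuming a UNIX box with the usual text tools (the sed tag-strip is crude, not a real HTML parser, but close enough for a pricing estimate):

```shell
#!/bin/sh
# Strip tags from every .html file under the current directory and
# total the remaining words. sed 's/<[^>]*>//g' is a crude tag-stripper:
# fine for an estimate, but tags that span lines will leak through.
total=0
for f in $(find . -name '*.html'); do
    n=$(sed -e 's/<[^>]*>//g' "$f" | wc -w)
    total=$((total + n))
done
echo "Total words: $total"
```

For pages fetched over HTTP rather than read from disk, the same idea works with wget writing to stdout, e.g. `wget -q -O - "$url" | sed -e 's/<[^>]*>//g' | wc -w`.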
Old 14 November 2002, 11:11 PM
  #11  
stevencotton
Scooby Regular
 
Join Date: Jan 2001
Location: behind twin turbos
Posts: 2,710

Not everyone uses UNIX, unfortunately.