
Big, big, big file processing

Old 18 April 2002, 11:26 PM
  #1  
IWatkins
Scooby Regular
Thread Starter
 
 
Join Date: Mar 2000
Location: Gloucestershire, home of the lawnmower.
Posts: 4,531

Got a project kicking off soon where I will need to process 250,000 files (approx 200KB each) every 6 hours, 24/7/365.

The actual processing is fairly simple, but obviously we are talking an awful lot of disk I/O here. Files will be opened, a few bytes read from them (to get header info.) then closed again until needed later. The header info will be written to a catalogue file.
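A minimal sketch of that open-read-close cataloguing loop, in modern Python for illustration (the 64-byte header size and the CSV catalogue format are assumptions, not part of the original spec):

```python
import csv
import os
import tempfile

HEADER_BYTES = 64  # assumed: the useful header fits in the first 64 bytes

def catalogue_files(paths, catalogue_path, header_bytes=HEADER_BYTES):
    """Read only the first few bytes of each file and append one row per
    file to a CSV catalogue: (filename, size, header-as-hex)."""
    with open(catalogue_path, "a", newline="") as cat:
        writer = csv.writer(cat)
        for path in paths:
            with open(path, "rb") as f:
                header = f.read(header_bytes)  # a few bytes only; the rest is never touched
            writer.writerow([os.path.basename(path), os.path.getsize(path), header.hex()])

# tiny demonstration with two throwaway 200-byte files
tmp = tempfile.mkdtemp()
files = []
for i in range(2):
    p = os.path.join(tmp, f"file{i}.dat")
    with open(p, "wb") as f:
        f.write(bytes([i]) * 200)
    files.append(p)

cat = os.path.join(tmp, "catalogue.csv")
catalogue_files(files, cat)
with open(cat) as f:
    rows = f.read().splitlines()
```

The point of the pattern is that each file is held open only long enough to pull the header, so the disk never has to stream the full 200KB per file.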

Anybody had this sort of experience and can give me an idea of what sort of spec machine we are talking about here? Must be a Wintel solution.

I ask as all my experience is on desktop/workstation level PCs or Cray superdoopercomputers, and nothing in between.

I'm guessing some sort of server (Dell is a preference) with a bloody quick disk system. Needs to have redundancy etc.

Any ideas? I haven't a clue where to start until the actual code is done, but I need to spec it up now.

Cheers

Ian
Old 19 April 2002, 03:53 AM
  #2  
Shaun
Scooby Regular
Support Scoobynet!
 
 
Join Date: Mar 2000
Location: 5 beats 4 - RS3 Rulez!!!
Posts: 8,617

Ian,

Dell dual Pentium server. I have an example spec that I recently placed an order for through work with Dell. The server is being used as a Citrix server. It's the same spec as the council is using for its teleworking Citrix server, so it will be plenty fast enough for what you need, and the price is good as well. I am not in work until Tuesday, so drop me an email at work shaunfennings@warwickshire.gov.uk and I will reply on Tuesday.

If you want more explanation in the meantime, email me at shaun@scoobynet.co.uk

Regards,
Shaun.
Old 19 April 2002, 09:38 AM
  #3  
ChrisB
Moderator
 
 
Join Date: Dec 1998
Location: Staffs
Posts: 23,573

Are the files replaced once processed, Ian? i.e. does the disk space requirement keep increasing or stay the same?

Choosing the right RAID config will pay off on a setup like that. If you can afford the 'waste' of disk space, then a RAID 0 + 1 setup will probably give the best results.

From Adaptec:

"Dual level raid, combines multiple mirrored drives (RAID 1) with data striping (RAID 0) into a single array. Provides highest performance with data protection."

15k RPM HDs are probably worth the extra cost here. Choose a decent 64-bit PCI RAID card with a fair amount of battery-backed cache (64MB+). Look for redundancy options on...

NICs
PSUs
Fans

As an example Compaq spec...

ML370 G2
1 or 2 1.4GHz P3 CPUs with 512KB cache
1GB RAM
5 x 36GB 15k Hot Swap HDs (4 live, 1 hot spare), 1 free drive bay
Smart Array RAID 5302 with 64MB Cache
Redundant 10/100 NICs (maybe Gigabit, depends on your backbone)
Redundant PSUs
Redundant Fan Option Kit
Old 19 April 2002, 09:47 AM
  #4  
MrDeference
Scooby Regular
 
 
Join Date: Mar 2002
Posts: 337

I don't know what work you do, so this may be a bit of an insult - if so, I apologise.

It is more to do with how the code is written than anything else.

Target NT, use asynchronous I/O completion, and a thread pool.
NT uses memory-mapped I/O, so the disk subsystem will only read the part of the file required (not load the whole file).
Pay attention to cache coherency (try to help the disk cache the next file by keeping locality high; this is also a property of the thing that generates the 250,000 files in the first place).

Pay special attention to things like - do you want to update the info if the file you are going to read hasn't changed since last iteration? What is the locking scheme for the files (can you catalog them whilst they are being externally updated)?

A poorly written solution is going to kill anything, and a good one will run happily on a normal spec PC, particularly if you have a quick set of SCSI disks and heep-um-big-memory.
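As a rough illustration of the thread-pool idea above (in modern Python rather than era-appropriate NT code; the 64-byte header read and worker count are assumptions):

```python
import concurrent.futures
import os
import tempfile

def read_header(path, n=64):
    # open, read n bytes, close - each task touches the disk only briefly
    with open(path, "rb") as f:
        return os.path.basename(path), f.read(n)

def catalogue_parallel(paths, workers=8):
    # a pool of worker threads keeps several reads in flight at once,
    # which lets the OS and RAID controller reorder and batch the I/O
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(read_header, paths))

# demonstration with ten small throwaway files
tmp = tempfile.mkdtemp()
paths = []
for i in range(10):
    p = os.path.join(tmp, f"f{i}.dat")
    with open(p, "wb") as fh:
        fh.write(b"HDR%02d" % i + b"\0" * 100)
    paths.append(p)

headers = catalogue_parallel(paths)
```

Submitting the files in directory order (rather than randomly) is one cheap way to keep the locality mentioned above high.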
Old 19 April 2002, 09:49 AM
  #5  
MrDeference
Scooby Regular
 
 
Join Date: Mar 2002
Posts: 337

Anyway, why does it have to be done every 6 hours? Can't you do it constantly?
Old 19 April 2002, 10:49 AM
  #6  
IWatkins
Scooby Regular
Thread Starter
 
 
Join Date: Mar 2000
Location: Gloucestershire, home of the lawnmower.
Posts: 4,531

Thanks for all that guys.

I should have added the following:

1. The 250,000 files will be delivered over a fast network to the machine every 6 hours, i.e. a new batch of 250,000 files arrives every 6 hours as one big delivery.

2. The 250,000 files from each delivery will be kept for 96 hours and then deleted, i.e. from 4 days after start-up, the number of files on the system will remain approximately constant.

3. After each file has been catalogued on arrival (hence reading some header info from each file), they will remain relatively untouched. I would estimate that maybe a couple of hundred files will be "looked at" a day, but which ones will depend on the users, of which there will be very few (fewer than 20).
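A sketch of how the 96-hour retention in point 2 might be handled, assuming each 6-hourly delivery lands in its own directory (a layout not stated in the thread, just an assumption for illustration):

```python
import os
import shutil
import tempfile
import time

RETENTION_HOURS = 96

def prune_deliveries(root, now=None, retention_hours=RETENTION_HOURS):
    """Delete any delivery directory under `root` whose modification time
    is older than the retention window. Returns the names removed."""
    now = time.time() if now is None else now
    cutoff = now - retention_hours * 3600
    removed = []
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if os.path.isdir(path) and os.path.getmtime(path) < cutoff:
            shutil.rmtree(path)   # drop the whole expired delivery in one go
            removed.append(name)
    return removed

# demonstration: one stale delivery, one fresh
root = tempfile.mkdtemp()
old = os.path.join(root, "delivery_old"); os.mkdir(old)
os.utime(old, (time.time() - 100 * 3600,) * 2)  # backdate to 100h, past the window
new = os.path.join(root, "delivery_new"); os.mkdir(new)
removed = prune_deliveries(root)
```

Deleting a whole delivery directory at once is far cheaper than unlinking 250,000 files individually out of a shared pool.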

Cheers

Ian
Old 19 April 2002, 11:04 AM
  #7  
David_Wallis
Scooby Regular
 
 
Join Date: Nov 2001
Location: Leeds - It was 562.4bhp@28psi on Optimax, How much closer to 600 with race fuel and a bigger turbo?
Posts: 15,239

I would agree with the spec ChrisB said... very good RAID card... would definitely say 15k disks at a minimum... depends on your budget... could even consider Fibre Channel. What sort of reliability are you wanting from the box? Are you running Win2K?

What is doing the processing of the files? Is it a custom app/script? i.e. as MrDeference says, single-threaded or multi-threaded... if your app is cr@p then the box is irrelevant. Pay attention to how you set up things like your virus scanner, etc.

Edited to say that the new DL370s look cr@p, as they look like a video.

David

[Edited by David_Wallis - 4/19/2002 11:05:54 AM]
Old 19 April 2002, 11:15 AM
  #8  
ozzy
Scooby Regular
 
 
Join Date: Nov 1999
Location: Scotland, UK
Posts: 10,504

Ian,

How 'fast' is your fast network? Gigabit or 100Mbps using SmartTrunking (or similar)?

Transferring 250,000 files @ 200K is (by my calculations) nearly 48GB, so hopefully the thing generating the files is just as quick.
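The arithmetic checks out, and also puts a rough number on the transfer time for one delivery at two plausible line rates (ignoring protocol overhead):

```python
files = 250_000
size_bytes = 200 * 1024          # 200 KB per file
total_bytes = files * size_bytes
total_gib = total_bytes / 2**30  # ~47.7 GiB, i.e. "nearly 48GB"

# best-case wire time for one 6-hourly delivery
secs_at_gigabit = total_bytes * 8 / 1e9    # ~410 s, under 7 minutes
secs_at_100mbps = total_bytes * 8 / 100e6  # ~4,096 s, over an hour
```

So on 100Mbps the delivery alone eats a fifth of each 6-hour window, which is why the network speed question matters.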

As David mentioned, be careful with the virus engine you run on the box. I've seen these cause disk I/O to suffer badly even on the biggest servers.

Stefan
Old 19 April 2002, 11:17 AM
  #9  
MrDeference
Scooby Regular
 
 
Join Date: Mar 2002
Posts: 337

Don't know much about the speed of fast networking, but wouldn't it be fair to say that you will make sure the write speed of the disks is faster than the network throughput?
So you would expect the developers to write a listener receiving the network data (hopefully a compressed stream), which parses the files and writes them to disk, simultaneously creating the catalogue.
Wouldn't that change the onus from disk speed to processor/code speed?
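A toy version of such a listener, assuming a hypothetical length-prefixed wire format (name length, name, data length, data — not anything specified in the thread), parsed here from an in-memory stream rather than a real socket:

```python
import io
import os
import struct
import tempfile

def ingest_stream(stream, dest_dir, catalogue):
    """Parse a length-prefixed record stream (name_len:u16, name,
    data_len:u32, data), writing each file to disk and appending its
    header bytes to the catalogue in the same pass."""
    count = 0
    while True:
        raw = stream.read(2)
        if not raw:
            break  # end of stream
        (name_len,) = struct.unpack(">H", raw)
        name = stream.read(name_len).decode()
        (data_len,) = struct.unpack(">I", stream.read(4))
        data = stream.read(data_len)
        with open(os.path.join(dest_dir, name), "wb") as f:
            f.write(data)
        catalogue.append((name, data[:16]))  # catalogue built while writing
        count += 1
    return count

def record(name, data):
    # build one wire record for the demonstration
    return struct.pack(">H", len(name)) + name.encode() + struct.pack(">I", len(data)) + data

stream = io.BytesIO(record("a.dat", b"HDRA" + b"\0" * 50) + record("b.dat", b"HDRB"))
dest = tempfile.mkdtemp()
catalogue = []
n = ingest_stream(stream, dest, catalogue)
```

Cataloguing during ingest means the header bytes are read while they are still in memory, so the 250,000 files never need a second pass over the disk.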
Old 19 April 2002, 12:04 PM
  #10  
ozzy
Scooby Regular
 
 
Join Date: Nov 1999
Location: Scotland, UK
Posts: 10,504

Ian,

You'll also need a lot of disks, so look either at systems with good internal storage or (as already suggested) a SAN solution.

In 96hrs, you'll get 16 deliveries of your 48GB of data, so you need about 3/4 of a terabyte of storage.
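The storage arithmetic, spelled out (the final doubling only applies if the mirrored RAID 1+0 layout suggested earlier in the thread is chosen):

```python
delivery_gib = 250_000 * 200 * 1024 / 2**30  # ~47.7 GiB per 6-hourly delivery
deliveries_kept = 96 // 6                    # 16 deliveries inside the 96h window
usable_gib = delivery_gib * deliveries_kept  # ~763 GiB, roughly 3/4 of a TiB

# RAID 1+0 mirrors everything, so raw capacity must be double the usable figure
raw_gib = usable_gib * 2
```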

Personally, I'd recommend more smaller disks rather than fewer larger ones. That way, depending on the RAID level, you can maximise the data that's striped across the disks.

Stefan