Big, big, big file processing
#1
Scooby Regular
Thread Starter
Join Date: Mar 2000
Location: Gloucestershire, home of the lawnmower.
Posts: 4,531
Got a project kicking off soon where I will need to process 250,000 files (approx. 200KB each) every 6 hours, 24/7/365.
The actual processing is fairly simple, but obviously we are talking an awful lot of disk I/O here. Files will be opened, a few bytes read from them (to get header info.) then closed again until needed later. The header info will be written to a catalogue file.
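The cataloguing step described above could be sketched roughly like this. The 64-byte header size and the CSV catalogue format are assumptions for illustration; the real file format isn't specified in the thread:

```python
import csv
import os

HEADER_BYTES = 64  # assumed header size; the real value depends on the file format


def catalogue_files(src_dir, catalogue_path):
    """Read the first HEADER_BYTES of each file and append a row to the catalogue."""
    with open(catalogue_path, "a", newline="") as cat:
        writer = csv.writer(cat)
        for name in os.listdir(src_dir):
            path = os.path.join(src_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                header = f.read(HEADER_BYTES)  # only a few bytes read, then closed
            writer.writerow([name, len(header), header[:8].hex()])
```

The key property is that each file is opened, a small read issued, and the handle closed immediately, so the working set per file is tiny even though the file count is huge.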
Anybody had this sort of experience and can give me an idea of what sort of spec machine we are talking about here ? Must be Wintel solution.
I ask as all my experience is on desktop/workstation level PCs or Cray superdoopercomputers, and nothing in between.
I'm guessing some sort of server (Dell is a preference) with a bloody quick disk system. Needs to have redundancy etc.
Any ideas, as I haven't a clue where to start really until the actual code is done, but I need to spec. it up now really.
Cheers
Ian
#2
Scooby Regular
Ian,
Dell dual Pentium server. I have an example spec that I recently placed an order for through work with Dell. The server is being used as a Citrix server. It's the same spec as the council is using for its teleworking Citrix server, so it will be plenty fast enough for what you need, and the price is good as well. I am not in work until Tuesday, so drop me an email at work, shaunfennings@warwickshire.gov.uk, and I will reply on Tuesday.
If you want more explanation in the meantime, email me at shaun@scoobynet.co.uk
Regards,
Shaun.
#3
Are the files replaced once processed, Ian? I.e. does the disk space requirement keep increasing, or stay the same?
Choosing the right RAID config will pay off on a setup like that. If you can afford the 'waste' of disk space, then a RAID 0 + 1 setup will probably give the best results.
From Adaptec:
"Dual level raid, combines multiple mirrored drives (RAID 1) with data striping (RAID 0) into a single array. Provides highest performance with data protection."
15k RPM HDs are probably worth the extra cost here. Choose a decent 64-bit PCI RAID card with a fair amount of battery-backed cache (64MB+). Look for redundancy options on...
NICs
PSUs
Fans
As an example Compaq spec...
ML370 G2
1 or 2 1.4GHz P3 CPUs with 512KB cache
1GB RAM
5 x 36GB 15k Hot Swap HDs (4 live, 1 hot spare), 1 free drive bay
Smart Array RAID 5302 with 64MB Cache
Redundant 10/100 NICs (maybe Gigabit, depends on your backbone)
Redundant PSUs
Redundant Fan Option Kit
#4
I don't know what work you do, so this may be a bit of an insult - if so, I apologise.
It is more to do with how the code is written than anything else.
Target NT, use I/O Asynchronous completion, and a thread pool.
NT uses Memory mapped IO, so the disk subsystem will only read the part of the file required (not load the whole file).
Pay attention to cache coherency (try to help the disk cache the next file by keeping locality high - this is also a property of the thing that generates the 250,000 files in the first place).
Pay special attention to things like - do you want to update the info if the file you are going to read hasn't changed since last iteration? What is the locking scheme for the files (can you catalog them whilst they are being externally updated)?
A poorly written solution is going to kill anything, and a good one will run happily on a normal spec PC, particularly if you have a quick set of SCSI disks and heep-um-big-memory.
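The thread-pool idea above is NT-specific (I/O completion ports), but the shape of it can be sketched portably. This uses Python's `concurrent.futures` rather than the Win32 API, so it's an illustration of the pattern, not of the NT mechanism itself; the 64-byte header size is again an assumption:

```python
import os
from concurrent.futures import ThreadPoolExecutor

HEADER_BYTES = 64  # assumed header size


def read_header(path):
    # Each worker opens the file, reads only the header bytes, then closes it.
    with open(path, "rb") as f:
        return path, f.read(HEADER_BYTES)


def catalogue_parallel(paths, workers=8):
    # A pool of worker threads keeps several reads in flight at once,
    # so the disk queue stays busy while individual threads block on I/O.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(read_header, paths))
```

The point is the one made in the post: keeping multiple small reads outstanding is what lets the disk subsystem reorder and batch them, and that comes from how the code is written, not from the hardware spec.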
#6
Scooby Regular
Thread Starter
Thanks for all that guys.
I should have added the following:
1. The 250,000 files will be delivered over a fast network to the machine every 6 hours. I.e. new 250,000 files every 6 hours coming as one big delivery.
2. The 250,000 files from each delivery will be kept for 96 hours and then deleted. I.e. after 4 days from starting up, the number of files on the system will remain approx constant.
3. After each file has been catalogued on arrival (hence the reading some header info. from each file) they will remain relatively untouched. I would estimate that maybe a couple of hundred files will be "looked" at a day. But which ones will depend on the users of which there will only be a very few (less than 20).
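Point 2 above (keep each delivery for 96 hours, then delete) is the kind of thing worth automating from day one. A minimal sketch of an age-based purge, assuming the file modification time marks the delivery time:

```python
import os
import time

RETENTION_SECONDS = 96 * 3600  # files older than 96 hours are purged


def purge_expired(root, now=None):
    """Delete files whose modification time is older than the retention window."""
    now = time.time() if now is None else now
    removed = 0
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > RETENTION_SECONDS:
            os.remove(path)
            removed += 1
    return removed
```

Deleting 250,000 files is itself a burst of metadata I/O, so scheduling the purge away from the delivery window would be sensible.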
Cheers
Ian
#7
Scooby Regular
Join Date: Nov 2001
Location: Leeds - It was 562.4bhp@28psi on Optimax, How much closer to 600 with race fuel and a bigger turbo?
Posts: 15,239
I would agree with the spec chrisb said... very good RAID card... would definitely say 15k disks at a minimum... depends on your budget... could even consider fibre channel. What sort of reliability from the box are you wanting? Are you running Win2k?
What is doing the processing of the files? Is it a custom app/script? I.e., as mrdeference says, single-threaded or multi-threaded... as if your app is cr@p then the box is irrelevant. Pay attention to how you set up things like your virus scanner, etc.
edited to say that the new dl370's look cr@p as they look like a video
David
[Edited by David_Wallis - 4/19/2002 11:05:54 AM]
#8
Scooby Regular
Ian,
How 'fast' is your fast network? Gigabit or 100Mbps using SmartTrunking (or similar)?
Transferring 250,000 files @ 200KB is (by my calculations) nearly 48GB, so hopefully the thing generating the files is just as quick.
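The back-of-envelope version of that calculation, with a rough transfer-time estimate added. The 70% link-efficiency figure is an assumption; real throughput depends on protocol overhead and how fast the disks can absorb writes:

```python
# Size of one delivery: 250,000 files of ~200KB each.
files = 250_000
file_kb = 200
total_gb = files * file_kb / (1024 * 1024)  # ~47.7 GB per delivery
total_bytes = files * file_kb * 1024


def transfer_seconds(nbytes, link_mbps, efficiency=0.7):
    # Assumes the link sustains `efficiency` of its nominal rate
    # (an assumption; real-world figures vary widely).
    return nbytes * 8 / (link_mbps * 1_000_000 * efficiency)
```

On those assumptions, a 100Mbps link takes well over an hour and a half per delivery, while gigabit brings it under ten minutes, which matters a lot against a 6-hour delivery cycle.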
As David mentioned, be careful with the virus engine you run on the box. I've seen these cause disk I/O to suffer badly even on the biggest servers.
Stefan
#9
Don't know much about the speed of fast networking, but wouldn't it be fair to say that you will make sure the write speed of the disks is faster than the network throughput?
So, you would expect the developers to write a listener receiving the network data (hopefully a compressed stream), which parses the files and writes them to the disk, simultaneously creating the catalog.
Wouldn't that change the onus from disk speed to processor / code speed.
#10
Scooby Regular
Ian,
You'll also need a lot of disks, so look either at systems with good internal storage or (as already suggested) a SAN solution.
In 96hrs, you'll get 16 deliveries of your 48GB of data, so you need 3/4 of a terabyte of storage.
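The steady-state sizing works out like this. The doubling for RAID 0+1 follows from mirroring the whole stripe set; hot spares and growth headroom would come on top:

```python
# Steady-state storage: a ~48GB delivery every 6 hours, each kept 96 hours.
delivery_gb = 48
retention_hours = 96
delivery_interval_hours = 6

live_gb = delivery_gb * (retention_hours // delivery_interval_hours)  # 16 deliveries live

# RAID 0+1 mirrors everything, so raw capacity must be double the live data
# (before hot spares or any headroom for growth):
raw_raid10_gb = live_gb * 2
```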
Personally, I'd recommend more smaller disks rather than fewer larger ones. That way, depending on the RAID level, you can maximise the data that's striped across the disks.
Stefan