Subscribe Add to Technorati Favorites

Tuesday, August 5, 2008

When data stored becomes very large ...

There will be situations at which the data you store in memory (variables) is huge, & the script cribbs & might throw "Out of memory" error (happens if the available memory is very low) & exit.

This had happened to me when extracting large amount of data & storing it into a data structure. The actual reason for this was not the data structure itself, but because of an array in which I used to read the entire files' contents (~35MB/file). The issue was solved once I started reading the file line-by-line.

That was just a part of the topic I wanted to write here... Actually, in the process of lowering the memory usage I had used some new (at least, its new for me) techniques of storing numerical data & that's what I want to write here.

This is the data structure that I had used:
%hash = {
"key1" => {
"kkey1" => "vvalue1"
...
},
...
};

So that's the pictorial of my data structure. In this I had to store some numbers that had to be paired (there was a relationship b/w them) & also for these numbers I had to calculate the average & median. The number-pairs found to be larger than 999, so I had to use a glue character b/w then number-pair & store as "kkey1" & store the number of occurrences as the "vvalue1".

Since it was average & median to be calculated... I wrote these two neat subroutines:

sub get_average {
my (%input_cnt) = @_;
my $count = 0;
my $sum = 0;
foreach my $input (keys %input_cnt) {
$sum += ($input*$input_cnt{$input});
}
foreach my $cnt (values %input_cnt) {
$count += $cnt;
}
my $average = $sum/$count;
return $average;
}

sub get_median {
my (%inputs_cnt) = @_;
my @sorted = sort { $a <=> $b } keys %inputs_cnt;
my %sorted_vals_index = (); # sorted_vals_index{delay} => 'low_index:high_index'
my $index = 0;
foreach my $ele (@sorted) {
my $h_i = $index + ($inputs_cnt{$ele}-1);
$sorted_vals_index{$ele} = "$index:$h_i";
$index = $h_i+1;
}
my $mid_index = ($index-1)/2;
foreach my $ele (keys %sorted_vals_index) {
my ($l_i, $h_i) = ($sorted_vals_index{$ele} =~ /(.+?):(.+)/);
if (($mid_index >= $l_i) && ($mid_index <= $h_i)) {
return $ele;
}
}
}

Each subroutine takes a hash, where the key is the number & value is the number of occurrences.

So the memory used became very less, since I avoided storing duplicate number-pairs.

0 comments: