So what we want to do is create a histogram of the data in a large file. That brings up the point that sometimes you may want to generate a histogram on a data set so large that you could not hold it all in memory at one time.

I'll work with a file that is ~500 MB, which I created using this code, but the approach should work no matter the file size. The file has 30 million numbers in a tab-separated format. Below, I am assuming that we know the max, the min, and the number of data points; I think it is necessary to know in advance the bins you are going to use. The read-and-bin method I outline below then reads the file in chunks, bins each chunk against that fixed bin specification, and accumulates the counts, so that only one chunk is ever in memory at a time. Since the size of this test set is still manageable, here is what the result of the Histogram command on the full data looks like, for comparison: (histogram image not reproduced here)
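The generation snippet referred to above is not included in this excerpt; the following is a hedged reconstruction, not the original code. The file name data.tsv, the ten-numbers-per-row layout, and the value range are my assumptions:

    (* Hypothetical reconstruction of the test-file generation: write 30
       chunks of 10^5 rows x 10 columns = 30 million random reals,
       tab-separated, so the whole set is never in memory at once.
       At full precision this comes to roughly 500 MB. *)
    file = OpenWrite["data.tsv"];
    Do[
      WriteString[file,
        StringRiffle[
          Map[ToString[#, InputForm] &, RandomReal[{0, 100}, {10^5, 10}], {2}],
          "\n", "\t"],
        "\n"],
      {30}];
    Close[file];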
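Here is a minimal sketch of the read-and-bin loop itself, assuming the data.tsv file above, a chunk size of 10^6 numbers, and a known range of 0 to 100 split into 100 unit-width bins (all of these specifics are placeholders, not values from the original post):

    (* Stream the file in chunks and accumulate fixed-width bin counts;
       only one chunk of 10^6 numbers is in memory at a time. *)
    min = 0; max = 100; dx = 1;   (* assumed known in advance *)
    counts = ConstantArray[0, (max - min)/dx];
    stream = OpenRead["data.tsv"];
    Module[{chunk},
      While[(chunk = ReadList[stream, Real, 10^6]) =!= {},
        counts += BinCounts[chunk, {min, max, dx}]]];
    Close[stream];

    (* Same shape as a HistogramList result: {bin edges, bin counts} *)
    {Range[min, max, dx], counts}

The pair {Range[min, max, dx], counts} has the same shape as what HistogramList returns, so you can draw the result with BarChart[counts] without ever holding the raw data in memory.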
As a test, I just so happen to have a 4 GB data file lying around, so I thought I might give this a try. My data file has one floating-point number per line, and the number of lines can be counted with wc -l datafile.dat. Just trying to do a histogram on the data directly,

    hlist = HistogramList[ReadList["datafile.dat", Real]] // AbsoluteTiming

did not get anywhere. Using ReadList on its own at least worked,

    temp = ReadList["datafile.dat", Real] // AbsoluteTiming

but it took almost 15 minutes, and MemoryInUse[] returned 6393810904 (about 6.4 GB). That takes longer than it took to clean my office, so I quit that. Using the method from Sjoerd's answer I was able to generate the list of bins and bin counts much faster than the read-and-bin method outlined above, but on my data and my machine it used up so much RAM that the machine wasn't usable: it took all of my 8 GB of RAM and eventually crashed the kernel.
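One more sketch, not from the original post: the read-and-bin loop needs the min and max up front, and for a file like this you may not know them. A single streaming pass finds them, along with the count of values, which should match the wc -l result here since there is one number per line. The chunk size of 10^6 is again an arbitrary choice:

    (* One streaming pass over the file to find min, max, and the number
       of values, holding only one chunk in memory at a time. *)
    stream = OpenRead["datafile.dat"];
    {lo, hi, n} = {Infinity, -Infinity, 0};
    Module[{chunk},
      While[(chunk = ReadList[stream, Real, 10^6]) =!= {},
        lo = Min[lo, chunk];
        hi = Max[hi, chunk];
        n += Length[chunk]]];
    Close[stream];
    {lo, hi, n}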