
Anything that's hard to put into words is hard to put into Google. What are the right keywords if I want to learn about 18th century British aristocratic slang? What if I have a picture of someone and I want to know who it is? How do I tell Google to count the number of web pages that are written in Chinese?

We've all lived with Google for so long that most of us can't even conceive of other methods of information retrieval. But as computer scientists and librarians will tell you, boolean keyword search is not the end-all. There are other classic search techniques, such as latent semantic analysis, which tries to return results that are "conceptually similar" to the user's query even if the relevant documents don't contain any of the search terms. These sorts of applications might be a huge advance over keyword search, but large-scale search experiments are, at the moment, prohibitively expensive.
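As a rough illustration of what "conceptually similar" retrieval means, here is a minimal latent semantic analysis sketch. It is nobody's production search code, just the textbook recipe (TF-IDF followed by a truncated SVD, here via scikit-learn) run on a tiny invented corpus.

```python
# Latent semantic analysis in miniature: project a TF-IDF document-term
# matrix into a low-dimensional "concept" space with a truncated SVD, then
# rank documents by cosine similarity to the query in that space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Central banks raised interest rates to fight inflation.",
    "Borrowing costs climbed as lenders demanded higher yields.",
    "The sovereign debt crisis rattled European bond markets.",
    "A recipe for sourdough bread with a long, cold ferment.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)          # documents x terms, TF-IDF weighted

# Keep only the strongest co-occurrence patterns. A real corpus would use a
# few hundred dimensions, not two.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(tfidf)

# The query shares no words with the "borrowing costs" document, but in
# concept space the two can still end up close together.
query = lsa.transform(vectorizer.transform(["interest rates"]))
scores = cosine_similarity(query, doc_vectors)[0]

for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:+.2f}  {doc}")
```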

I also believe that full-scale maps of the online world are important: I would like to know which web sites act as bridges between languages, and I want tools to track the source of statements made online. The problem is that the web is really big, and only a few companies have invested in the hardware and software required to index all of it. A full crawl of the web is expensive and valuable, and all of the companies who have one (Google, Yahoo, Bing, Ask, SEOmoz) have so far chosen to keep their databases private. Essentially, there is a natural monopoly here. We would like a thousand garage-scale search ventures to bloom in the best Silicon Valley tradition, but it's just too expensive to get into the business.

DotBot is the only open web index project I am aware of. They are crawling the entire web and making the results available for download via BitTorrent, because "Currently, only a select few corporations have access to an index of the world wide web. We believe the internet should be open to everyone. Our intention is to change that."

Bravo! However, a web crawl is a truly enormous file. The first part of the DotBot index, with just 600,000 pages, clocks in at 3.2 gigabytes. Extrapolating to the more than 44 billion pages crawled so far, I estimate that they currently have 234 terabytes of data. At today's storage prices of about $100 per terabyte, it would cost $24,000 just to store the file. Real-world use also requires backups, redundancy, and maintenance, all of which push data center costs to something closer to $1000 per terabyte. And this says nothing of trying to download a web crawl over the network: it turns out that sending hard drives in the mail is still the fastest and cheapest way to move big data.
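The back-of-the-envelope arithmetic is easy to redo; the sketch below only plugs in the figures quoted above (3.2 GB for 600,000 pages, 44 billion pages, $100 and $1000 per terabyte), so any small rounding differences are mine.

```python
# Redo the extrapolation from the paragraph above.
sample_bytes = 3.2e9        # first chunk of the DotBot index: 3.2 GB...
sample_pages = 600_000      # ...covering 600,000 pages
total_pages = 44e9          # pages crawled so far

bytes_per_page = sample_bytes / sample_pages       # about 5.3 KB per page
total_tb = total_pages * bytes_per_page / 1e12     # about 234 TB

drive_cost = total_tb * 100         # bare drives at ~$100/TB (roughly the $24,000 above)
datacenter_cost = total_tb * 1000   # backups, redundancy, upkeep at ~$1000/TB

print(f"{bytes_per_page / 1e3:.1f} KB/page, {total_tb:,.1f} TB total")
print(f"~${drive_cost:,.0f} in bare drives, ~${datacenter_cost:,.0f} as a maintained data center")
```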
Full web indices are just too big to play with casually; there will always be a very small number of them. I think the solution to this is to turn web indices and other large quasi-public datasets into infrastructure: a few large companies collect the data and run the servers, and other companies buy fine-grained access at market rates.
