Abusing a search engine
I've been writing a specialized web crawler for Aardvark, our blog
recommender. The crawler is fairly straightforward - pull blog feeds,
add them to the Aura DataStore, and look for URLs to other blogs. When
the crawler finds a new URL, it has to decide whether or not it should
crawl that URL to look for blogs. The first part of this decision is to
check whether we've already crawled the URL. This is simple enough for
a small crawler - when a URL is crawled, toss it into a HashSet, and
whenever it is time to crawl a new URL, first check the HashSet. If the
URL is already there, it has already been crawled and can be skipped.
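For a small crawl, that check is just an in-memory set membership test,
along the lines of this sketch (the class and method names here are
made up for illustration):

    import java.util.HashSet;
    import java.util.Set;

    /** In-memory "have we crawled this URL?" check - fine until the set outgrows RAM. */
    public class CrawledUrlSet {

        private final Set<String> urls = new HashSet<String>();

        /** remembers that a URL has been crawled */
        public void add(String url) {
            urls.add(url);
        }

        /** true if the URL has already been crawled and can be skipped */
        public boolean contains(String url) {
            return urls.contains(url);
        }
    }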
However, this gets to be a bit trickier when the crawler wants to keep
track of millions of URLs, because the HashSet will no longer
comfortably fit in memory. So what to do? We could keep the URLs in a
sorted file that we could quickly search through, or we could use a
database to store the URLs and construct a query to see if a URL is
already in the database. Or we can just use our search engine. As the
old saying goes, when you have a hammer, every problem looks like a
nail - and since we have a search engine, to us everything looks like a
search problem. And so, with our trusty search engine in hand, we
created a persistent hash set that can easily deal with millions and
millions of members. And the code was surprisingly easy to write. Here
are the core functions:
    /** adds a string to the persistent set */
    public void add(String s) {
        if (!contains(s)) {
            inMemoryCache.add(s);
            if (inMemoryCache.size() > MAX_MEMORY_CACHE_SIZE) {
                // Flush the cache to the index: each cached string becomes
                // a document of its own, with no fields.
                SimpleIndexer indexer = searchEngine.getSimpleIndexer();
                for (String cachedString : inMemoryCache) {
                    indexer.startDocument(cachedString);
                    indexer.endDocument();
                }
                indexer.finish();
                inMemoryCache.clear();
            }
        }
    }

    /** tests to see if an item is in the set */
    public boolean contains(String s) {
        if (inMemoryCache.contains(s)) {
            return true;
        } else {
            return searchEngine.isIndexed(s);
        }
    }

The code was surprisingly fast too. With a million items in the set, a
'contains' operation took a couple of hundred microseconds to execute.
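In the crawler itself, the check before fetching a new URL then boils
down to something like this (the names here are illustrative, not from
the actual crawler code):

    // Skip URLs we've seen before; otherwise crawl them and remember them.
    if (!crawledUrls.contains(url)) {
        crawl(url);
        crawledUrls.add(url);
    }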
So, this is one way you can abuse a search engine (the other way is to make it crawl ;)
Update: Steve has blogged about the rest of the story.
If you only have add and contains, you should consider using a bloom filter:
http://en.wikipedia.org/wiki/Bloom_filter
Posted by Erik Frey on April 23, 2008 at 09:34 AM EDT #
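A Bloom filter along the lines of that suggestion could look roughly
like the sketch below; the class name, sizing, and hash derivation are
illustrative, not from the post or the comment:

    import java.util.BitSet;

    /** A minimal Bloom filter: compact add/contains with a small false-positive rate. */
    public class UrlBloomFilter {

        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        public UrlBloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        /** derives the i-th bit position from two base hashes (double hashing) */
        private int bitIndex(String s, int i) {
            int h1 = s.hashCode();
            int h2 = Integer.reverse(h1) ^ 0x9e3779b9;
            return Math.floorMod(h1 + i * h2, numBits);
        }

        /** sets the k bits for this string */
        public void add(String s) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(bitIndex(s, i));
            }
        }

        /** false means definitely never added; true means probably added */
        public boolean contains(String s) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(bitIndex(s, i))) {
                    return false;
                }
            }
            return true;
        }
    }

The memory cost is a fixed bit array instead of the URLs themselves;
the price is that 'contains' can occasionally report true for a URL
that was never added, which for a crawler just means skipping the odd
URL it could have crawled.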