I've been writing a specialized web crawler for Aardvark, our blog recommender. The crawler is fairly straightforward - pull blog feeds, add them to the Aura DataStore, and look for URLs to other blogs. When the crawler finds a new URL, it has to decide whether or not it should crawl that URL to look for blogs. The first part of this decision is to check whether we've already crawled the URL.

This is simple enough for a small crawler - when a URL is crawled, toss it into a HashSet, and whenever it is time to crawl a new URL, first check the HashSet. If the URL is already there, it has already been crawled, so it can be skipped. However, this gets trickier when the crawler needs to keep track of millions of URLs, because the HashSet will no longer comfortably fit in memory. So what to do? We could keep the URLs in a sorted file that we could quickly search through, or we could use a database to store the URLs and construct a query to see if a URL is already in the database. Or we can just use our search engine.

As the old saying goes, when you have a hammer, every problem looks like a nail - and since we have a search engine, to us, everything looks like a search problem. And so with our trusty search engine in hand, we created a persistent hash set that can easily deal with millions and millions of members. And the code was surprisingly easy to write. Here are the core functions:
    /** Adds a string to the persistent set. */
    public void add(String s) {
        if (!contains(s)) {
            inMemoryCache.add(s);
            // When the in-memory cache gets too big, flush it to the
            // search engine, indexing one empty document per string.
            if (inMemoryCache.size() > MAX_MEMORY_CACHE_SIZE) {
                SimpleIndexer indexer = searchEngine.getSimpleIndexer();
                for (String cachedString : inMemoryCache) {
                    indexer.startDocument(cachedString);
                    indexer.endDocument();
                }
                indexer.finish();
                inMemoryCache.clear();
            }
        }
    }

    /** Tests to see if an item is in the set. */
    public boolean contains(String s) {
        // Check the unflushed in-memory items first,
        // then fall back to the search index.
        if (inMemoryCache.contains(s)) {
            return true;
        } else {
            return searchEngine.isIndexed(s);
        }
    }
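To give a sense of how the crawler consumes this, here's a sketch of the check-then-crawl loop (illustrative only - PersistentStringSet and fetchBlogLinks are made-up names, not the actual Aardvark code):

    // Sketch of how the crawler might consult the set. PersistentStringSet
    // and fetchBlogLinks are illustrative names, not the real Aardvark code.
    void crawl(String url, PersistentStringSet crawled) {
        if (crawled.contains(url)) {
            return;              // already crawled, skip it
        }
        crawled.add(url);        // remember the URL before fetching it
        for (String link : fetchBlogLinks(url)) {
            crawl(link, crawled);
        }
    }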
The code was surprisingly fast too. With a million items in the set, a 'contains' operation took a couple of hundred microseconds to execute.
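If you want to spot-check a number like that yourself, a crude measurement looks something like this (a sketch only - 'set' is an instance of the class above, already populated, and a real benchmark would need warm-up and many iterations):

    // Crude single-call timing. A serious benchmark would warm up the
    // JVM and average over many lookups; this just gives a ballpark.
    long start = System.nanoTime();
    boolean found = set.contains("http://example.com/blog/");
    long micros = (System.nanoTime() - start) / 1000;
    System.out.println("contains() took " + micros + " us, found=" + found);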

So, this is one way you can abuse a search engine (the other way is to make it crawl ;)

Update: Steve has blogged about the rest of the story.

Comments:

If you only have add and contains, you should consider using a bloom filter:

http://en.wikipedia.org/wiki/Bloom_filter

Posted by Erik Frey on April 23, 2008 at 09:34 AM EDT #
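For the curious, the suggestion above might look something like this with Guava's BloomFilter (a sketch - the expected size and false-positive rate are arbitrary, and unlike the search-engine set, a plain Bloom filter lives in memory and can return false positives, so it trades exactness for space):

    import java.nio.charset.StandardCharsets;
    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    public class BloomSetDemo {
        public static void main(String[] args) {
            // Sizing is arbitrary; a Bloom filter can report false
            // positives, so a hit means "probably crawled", never
            // "definitely crawled".
            BloomFilter<String> crawled = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8),
                    10_000_000,   // expected number of URLs
                    0.01);        // acceptable false-positive rate

            crawled.put("http://example.com/blog/");
            System.out.println(crawled.mightContain("http://example.com/blog/"));
        }
    }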
