Creating a Client-Side Search Engine With Gears

Brad Neuberg, Gears Team, July 2008

Summary

This article describes how Gears can be used to create a client-side search engine plugged right into your web page. Learn how to add this functionality to your own web site, then dive deep to see how Gears and the Dojo toolkit were combined to create this client-side search engine. Readers should have experience with JavaScript and a basic understanding of Gears.

Introduction

Did you know that you can use Gears to do fast, client-side searching of data, similar to a client-side search engine? Gears bundles Full-Text Search (FTS) abilities right into its local SQLite database. MySpace, for example, uses this feature with their MySpace Mail application, downloading all of a user's messages for fast, client-side search. Because all of the data is local, you can do nifty things like search over the data in real time as the user types, something that is much harder if you have to query over the network to a server to do the searching.

Would you like to add the same kind of fast, local searching to your own web page and web applications? This article introduces you to PubTools Search and the Gears features that power it, namely Full-Text Search and Workers. PubTools Search is an open source JavaScript library that drops a client-side search engine right into your page. You configure it with basic HTML plus a list of URLs to index. Once loaded, a search form that uses the local Gears full-text search abilities will appear in your page to quickly and locally search over the documents in real time as a user types into the search field.

Please note that PubTools Search is not an official Google project or Gears API; it is a project I created on my own to teach and help developers. The Gears team does not support this project. However, email me if you have questions, concerns, or patches while working with the code.

Give PubTools Search a try right now with the demo embedded into this page below! The search form below uses PubTools Search to grab and index several free, public domain books from Project Gutenberg, including books from Goethe, Descartes, Emma Goldman, and more. Type the word 'history' and notice that results are returned instantaneously as you type, since everything is happening locally. If for some reason the search form does not appear below you can run it here.

PubTools Search also serves as an educational tool and reference source code for developers who want to learn how to use Gears Workers and FTS together in their own web applications, along with best practices for performance and reliability when working with them.

This article covers the following:

By the end of this article you should have a better grasp of Gears, PubTools Search, and Dojo, including how to use the Gears Full-Text Search and Workers in your own web applications for fast, client-side search.

Full-Text Search and Gears Workers

Gears bundles a local relational database that web sites can tap into and use. While a relational database can store and query the data that is present, traditionally a database cannot do partial matching of documents in an efficient way. For example, I couldn't ask to return all the rows from a database that partially match the word 'orange'. This is the role of a search index, which builds up a fast way to match and find all documents that have some term. SQLite, using the fts2 module bundled with Gears, can create special 'virtual' tables that are in fact backed by search indexes so that you can quickly find matches for search terms using special SQL. This is known as Full-Text Search (FTS).

Full-Text Search (FTS) in Gears allows you to create special tables in the local relational database. When you insert data into these tables, the data is indexed in such a way that searching over all the data to find full or partial matches is very fast. In essence, Gears and FTS give you the ability to roll your own client-side search engines that can work with very specific kinds of data, such as your corporate directory, a corpus of documents, and more.
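As a rough sketch of what working with such a table can look like, the following creates an fts2-backed table, inserts a document, and queries it with MATCH. The table name docs, its columns, and the helper function names are invented for illustration and are not the PubTools Search schema; db is assumed to be a Gears Database object that has already been opened.

```javascript
// Hypothetical helpers; `db` is assumed to be an opened Gears Database,
// i.e. google.gears.factory.create('beta.database') followed by db.open().

// Create a virtual table whose columns are backed by an fts2 search index.
function createDocsTable(db) {
  db.execute('CREATE VIRTUAL TABLE docs USING fts2(url, title, content)');
}

// Insert one document; fts2 indexes the text as part of the insert.
function addDoc(db, url, title, content) {
  db.execute('INSERT INTO docs (url, title, content) VALUES (?, ?, ?)',
             [url, title, content]);
}

// Return the URL and title of every document whose content matches the query.
function findDocs(db, query) {
  var rs = db.execute('SELECT url, title FROM docs WHERE content MATCH ?',
                      [query]);
  var results = [];
  while (rs.isValidRow()) {
    results.push({ url: rs.field(0), title: rs.field(1) });
    rs.next();
  }
  rs.close();  // result sets should be closed once you are done with them
  return results;
}
```

The bound parameters (the ? placeholders) keep the search terms out of the SQL string itself, which avoids having to escape quotes in user input by hand.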

Writing to the database and searching over the FTS table can be quite intense, since it is hitting the hard drive at regular intervals. In traditional web applications without Gears everything JavaScript does has to run on the same thread as the web browser. If JavaScript does something intensive, the browser itself slows to a crawl and freezes. Gears Workers are a way to run JavaScript on threads separate from the browser's user-interface, allowing you to do considerable work while keeping the browser responsive. As you will see in this article, PubTools Search uses Workers to do all of its FTS database operations, ensuring the browser stays fast.
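To make the idea concrete, here is a minimal sketch of handing a job to a Gears Worker and receiving its reply. The helper name runOnWorker and the message contents are invented for this example; workerPool is assumed to be the object returned by google.gears.factory.create('beta.workerpool'). This is not how PubTools Search structures its workers, only an illustration of the message-passing style.

```javascript
// Source for the worker, kept as a string because createWorker takes the
// worker's JavaScript as text. Inside a worker, the pool is reached through
// google.gears.workerPool.
var workerSource =
    'google.gears.workerPool.onmessage = function(a, b, message) {' +
    '  var reply = "done: " + message.body;' +
    '  google.gears.workerPool.sendMessage(reply, message.sender);' +
    '};';

// Hypothetical helper: start a worker, send it one job, and call onReply
// with whatever the worker sends back.
function runOnWorker(workerPool, job, onReply) {
  // Register the handler before creating the worker so no reply is missed.
  workerPool.onmessage = function(a, b, message) {
    onReply(message.body);
  };
  var workerId = workerPool.createWorker(workerSource);
  workerPool.sendMessage(job, workerId);
  return workerId;
}
```

Workers share no state with the page; everything crosses the boundary as messages, which is why both sides are written as onmessage handlers.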

Using PubTools Search

In general, Gears tends to give you lower-level primitives, such as FTS and Workers. It is the job of developers to tie these together using JavaScript to create higher-level applications that use Gears. If you're just getting started with web development and Gears, this can be a lot to absorb all at once. In the first part of this article I'll show you how you can use PubTools Search to get some of the great abilities of Gears into your page without having to delve into JavaScript, just by sprinkling a little bit of HTML into your page. In the second part I'll delve into the internals of PubTools Search so you can use these Gears features inside your web application itself.

To use PubTools Search in your web page, first download the following files and put them on your web server. PubTools Search is under an Apache v2 license so you can safely use this in commercial projects. You can also download the PubTools Search ZIP file and find these files in the src/ directory.

Second, add the PubTools Search CSS and JavaScript to your web page in the HEAD portion of your HTML document:

<link rel="stylesheet" type="text/css" href="pubtools-search.css"></link>

<script type="text/javascript" src="pubtools-util.js"></script>
<script type="text/javascript" src="pubtools-search.js"></script>

Next, create a file named search.txt that is in the same directory as your HTML page. This file will contain a list of URLs that you want to index locally and search over. The URLs are relative to the HTML file. Here is the search.txt file used in the example PubTools Search form embedded into this article above:

version=0.6
../resources/descartes.txt
../resources/goethe.txt
../resources/goldman.txt
../resources/machiavelli.txt
../resources/montaigne.txt
../resources/kafka.html
../resources/plato.html

The first line must have a version, such as version=0.6. Whenever you add, modify, or remove a document, change this version so that PubTools Search knows to reindex the documents. After the version line, list the URLs to index, one per line.
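The format is simple enough that a parser for it fits in a few lines. The sketch below is not the SearchManifest code from PubTools Search; the function name parseSearchFile and the shape of the returned object are assumptions made for illustration.

```javascript
// Hypothetical parser for the search.txt format described above: a
// version=... line first, then one URL per line.
function parseSearchFile(text) {
  var lines = text.split(/\r?\n/);
  var version = null;
  var urls = [];

  for (var i = 0; i < lines.length; i++) {
    var line = lines[i].replace(/^\s+|\s+$/g, '');  // trim whitespace
    if (!line) {
      continue;  // skip blank lines
    }
    if (version === null) {
      // The first non-blank line has to carry the version.
      var match = /^version=(.+)$/.exec(line);
      if (!match) {
        throw new Error('First line must look like version=0.6');
      }
      version = match[1];
    } else {
      urls.push(line);
    }
  }

  if (version === null) {
    throw new Error('Search file is empty');
  }
  return { version: version, urls: urls };
}
```

Note that the version is kept as a string; it only ever needs to be compared for equality against the previously stored value.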

The client-side search can only index HTML, text, and XML files. A good tip is to make sure any HTML documents you want to index have a TITLE element in the HEAD; this will be the title of the document that is returned for search results. For XML and text files we use the URL as the title.

Finally, add a DIV to your page with the following ID:

<div id='st-widget'></div>

When PubTools Search loads it will put the search UI box into this DIV.

That's it, you're done! You can see sample HTML here. You can also override the filename for the list of search URLs and the ID of the search widget; see the README file for details.

Dissecting PubTools Search

Let's jump right in and delve into the code for PubTools Search. First, all of the source is in a single file that you can refer to, pubtools-search.js. Here are the JavaScript classes and their responsibilities in PubTools Search:

Here is a diagram of the classes and their methods. Private methods have an _ appended to the end. Note that some methods are left out to prevent clutter. Looking over the diagram and seeing the method names gives an overall impression of the PubTools Search system:

Let's run through the steps involved in the two most interesting phases, indexing and searching.

Indexing

I'll describe the indexing steps from a high level, but if you want to see the control flow in depth you can view the diagram below. In the diagram, you will see a small 'Async!' label if the operation happens asynchronously with a callback; an icon of tools if the action happens on a Gears Worker; and an icon of a disk if we are working with the local database and FTS table.

Indexing involves the following high-level steps; as I describe the steps I've hyperlinked them right into the methods involved so you can study the source yourself and follow along. First, the SearchTool class determines our database name and then sets up the schema for our local database. When the page is finished loading, we then initialize the UI class and have it embed itself into the page.

Next, we kick off downloading our search file, generally named search.txt, that has the URLs to index and the version of all these files by using the SearchManifest class. Note that while Gears has a JSON-based manifest file that is used by the Gears LocalServer, reusing this for PubTools Search was not appropriate because a Gears LocalServer manifest can have many files that you would not want to index into a local search engine, such as JavaScript, CSS, etc. The SearchManifest class fetches the search file and then parses it into a form we can use. We then process and extract the version number from the file and compare it to what we might have stored in the local database. We don't want to re-index all of our documents every time the page loads; we only want to do so when the document list has changed. If the version hasn't changed, then we are done and don't have to index; otherwise we store the new version into our Gears Database and continue.
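The version comparison described above can be sketched as follows. The meta table, its single column, and the function name needsReindex are invented for this example and do not come from the PubTools Search source; db is assumed to be an opened Gears Database.

```javascript
// Hypothetical helper. Returns true (and records the new version) when the
// documents must be reindexed, or false when the stored version matches.
function needsReindex(db, manifestVersion) {
  db.execute('CREATE TABLE IF NOT EXISTS meta (version TEXT)');

  // Read back whatever version we stored on a previous page load, if any.
  var rs = db.execute('SELECT version FROM meta');
  var stored = rs.isValidRow() ? rs.field(0) : null;
  rs.close();

  if (stored === manifestVersion) {
    return false;  // nothing changed, so skip indexing
  }

  // Remember the new version so the next page load can skip indexing.
  db.execute('DELETE FROM meta');
  db.execute('INSERT INTO meta (version) VALUES (?)', [manifestVersion]);
  return true;
}
```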

Now that we have a list of URLs to download, we can instantiate the Documents class. This class initially tries to determine the MIME type of all the URLs we have. This is for two reasons. First, we don't want to accidentally download and try to work with a binary file, such as a Microsoft Word file. Second, we will need the MIME type later on when doing snippet and title extraction. We determine the MIME type and do filtering by issuing an HTTP HEAD request to the server. We filter out any URL that is not an XML, HTML, or text document. Once we've got our final list of URLs to work with we can download the documents.
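The filtering decision itself can be expressed as a small function over the Content-Type header returned by the HEAD request. This is a sketch under assumptions: the function names and the exact list of accepted MIME types are mine, not taken from the PubTools Search source.

```javascript
// Hypothetical check: given a Content-Type header value, decide whether the
// document is text, HTML, or XML and therefore safe to index.
function isIndexable(contentType) {
  if (!contentType) {
    return false;
  }
  // Drop parameters such as "; charset=utf-8" and normalize the case.
  var mimeType = contentType.split(';')[0]
      .replace(/^\s+|\s+$/g, '')
      .toLowerCase();

  return mimeType === 'text/plain' ||
         mimeType === 'text/html' ||
         mimeType === 'text/xml' ||
         mimeType === 'application/xml' ||
         mimeType === 'application/xhtml+xml';
}

// Keep only the entries whose mimeType passes the check above.
function filterIndexable(entries) {
  var keep = [];
  for (var i = 0; i < entries.length; i++) {
    if (isIndexable(entries[i].mimeType)) {
      keep.push(entries[i]);
    }
  }
  return keep;
}
```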

At this point we are armed with the data we need to do the real work, indexing. We instantiate the Indexer class and pass it all of our documents' contents, their URLs, and their MIME types to index. The Indexer does two big things: extract and determine a good title for each document so we can use this in search results, and save the document into a Full-Text Search table in our local database. Since getting a title and doing the save can be computationally expensive, we do them on a Gears Worker to keep the browser responsive and keep things fast. At this point we are done with indexing.
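Title extraction, following the rule stated earlier (the TITLE element for HTML, the URL for text and XML), might look roughly like this. The function name and the regular expression are assumptions for illustration; the real Indexer may well do this differently.

```javascript
// Hypothetical title extraction: use the TITLE element for HTML documents
// and fall back to the URL for everything else.
function extractTitle(url, mimeType, content) {
  if (mimeType === 'text/html') {
    // [\s\S] matches across line breaks, unlike the dot.
    var match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(content);
    if (match) {
      // Collapse runs of whitespace, then trim the ends.
      var title = match[1].replace(/\s+/g, ' ').replace(/^ | $/g, '');
      if (title) {
        return title;
      }
    }
  }
  return url;
}
```

A regular expression is used here instead of a DOM parser because Gears Workers have no access to the page's DOM.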

Searching

Let's look at the searching side now. For low-level control details you can see the following diagram:

When a user types into the search field, we kick off a search. The hard work behind searching is handled by the Searcher class. Just like indexing, we do almost everything on a Gears Worker to keep the browser responsive. First, we search our local database's Full-Text Search table for results, then generate snippets based on the query string before returning the results. These results get sent to the UI class and printed to the page.
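One simple way to generate a snippet is to show the text surrounding the first occurrence of the search term. The sketch below is only that: a simplified stand-in, not the getSnippet_ method from the Searcher class. The function name, the radius parameter, and the choice to use only the first query term are all assumptions.

```javascript
// Hypothetical snippet generator: return the text around the first
// occurrence of the first query term, with "..." marking truncated ends.
function makeSnippet(query, content, radius) {
  radius = radius || 40;  // characters of context on each side
  var term = query.replace(/^\s+|\s+$/g, '').split(/\s+/)[0];
  if (!term) {
    return '';
  }

  var pos = content.toLowerCase().indexOf(term.toLowerCase());
  if (pos === -1) {
    // Term not found verbatim; fall back to the start of the document.
    return content.substring(0, radius * 2);
  }

  var start = Math.max(0, pos - radius);
  var end = Math.min(content.length, pos + term.length + radius);
  return (start > 0 ? '...' : '') +
         content.substring(start, end) +
         (end < content.length ? '...' : '');
}
```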

Dojo & Code Snippets

Now that you've seen how PubTools Search works from a high level, we can begin to move into viewing actual source code snippets. First, I'll need to give a quick introduction to Dojo, an open source JavaScript library that PubTools Search uses as a utility library, since many of the code snippets use Dojo routines.

Dojo is a popular open source JavaScript and Ajax toolkit. It actually consists of three major pieces:

PubTools Search uses just the Dojo Core piece. Dojo includes a special build system that can be used to do all sorts of nifty optimizations. PubTools Search bundles a small, special build of Dojo that re-namespaces the dojo object to be pu, for PubUtils. This was done so that pages that include PubTools Search and are already using Dojo won't get code collisions. For example, instead of calling dojo.hitch(), the PubTools Search code calls pu.hitch(). The build of Dojo that PubTools Search uses is named pubtools-util.js.

Let's look at some of the Dojo functions that PubTools Search uses in the context of actual code that PubTools Search uses. This will give you a chance to both get familiar with Dojo as well as learn how PubTools Search ties Gears' functionality together.

dojo.xhrGet and dojo.forEach

Let's take a look at doing XMLHttpRequests with Dojo's xhrGet function. We will look at the Documents.download_ function as reference:

download_: function(downloadMe) {
  var idx = new Indexer(downloadMe.length);

  pu.forEach(downloadMe, function(entry) {
    var url = entry.url;
    var mimeType = entry.mimeType;

    ui.tickProgress();

    pu.xhrGet({
      url: url,

      load: function(data) {
        ui.tickProgress();
        idx.index(url, mimeType, data);

        return data;
      },

      error: pu.hitch(this, function(err) {
        searchTools.handleError(err);

        return err;
      })
    });
  });
}

First, we initialize our Indexer class. downloadMe is an array of entries to download, each carrying a url and its mimeType, already filtered down to just the MIME types we can work with. pu.forEach is a useful Dojo function that takes an array, loops over it, and calls the given function once per element, handing it that element to work with. forEach can help to make your code tighter and more readable in some cases. In the code above we use this to get each entry.

When we get an entry to work with, we can call pu.xhrGet to fetch the given URL. Dojo's xhrGet function is straightforward to work with; it takes an object literal of arguments, including the url to load; a load function that is called when the values have been returned; and an error function that will be called if there is an error. In the download_ function above, once we get the results back we pass them to the Indexer, which stores them for later indexing. We used to index each document as it came in; however, performance testing showed that it is much faster to index a large set of documents in one shot using SQLite transactions rather than individually.
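Batching the inserts into a single transaction can be sketched like this. The function name saveAll and the docs table with its url, title, and content columns are assumptions made for the example; db is assumed to be an opened Gears Database.

```javascript
// Hypothetical batch save: wrap all inserts in one transaction so SQLite
// commits to disk once instead of once per document.
function saveAll(db, documents) {
  db.execute('BEGIN');
  try {
    for (var i = 0; i < documents.length; i++) {
      var doc = documents[i];
      db.execute('INSERT INTO docs (url, title, content) VALUES (?, ?, ?)',
                 [doc.url, doc.title, doc.content]);
    }
    db.execute('COMMIT');
  } catch (err) {
    // Undo the partial batch so the index is never left half-written.
    db.execute('ROLLBACK');
    throw err;
  }
}
```

Without the explicit BEGIN and COMMIT, each INSERT runs in its own implicit transaction, and every one of them has to be flushed to disk separately.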

dojo.declare

JavaScript is a prototype-based programming language that can be used to emulate object-oriented programming. While this is fine for smaller projects, sometimes using a simpler syntax than the standard JavaScript prototype-based notation can make your code a bit more readable and maintainable. Dojo's declare method makes it easy to declare a class. Here is the class definition for the Searcher class in PubTools Search:

  pu.declare('Searcher', null, {
      search: function(query, callback) {
      },

      escapeString_: function(str) {
      },

      getSnippet_: function(query, mimeType, content) {     
      }
  });

The declare function takes a class name, in this case 'Searcher'; an optional super-class, null in the code above since we don't subclass; and finally an object literal of functions to add to this class. You can also have a special method named constructor that will be run when an instance of the class is created. Once you've defined your class, instantiating an instance of it uses the standard JavaScript new keyword:

  var s = new Searcher();

dojo.hitch

In JavaScript programming closures are your friend. Explaining closures is beyond the scope of this article, but their use can be illustrated with the following example.

When developers first encounter JavaScript and need to create an event listener while doing JavaScript object-oriented programming, they typically do something like the following:

  // define a class named MyClass
  function MyClass(msg) {
    this.msg = msg;
    
    // have the doSomething() method get called when a button is clicked
    var button = document.getElementById('myButton');
    button.onclick = this.doSomething;
  }
  
  MyClass.prototype.doSomething = function() {
    alert('Hello World! Our message is ' + this.msg);
  }
  
  // create an instance of MyClass
  var instance = new MyClass('Gears rocks');

This won't work! In JavaScript, functions are first class citizens and aren't necessarily 'bound' to an instance; they can be passed around by themselves. When the button is clicked, the doSomething function is called, but as a handler of the button rather than as a method of our instance. Inside of doSomething, we try to print out the message we passed in with this.msg. However, this doesn't refer to our instance! Instead, it refers to the button element that was clicked, which has no msg property, and we see 'Hello World! Our message is undefined' printed out. More advanced JavaScript programmers then do something like this to make sure that doSomething is attached to our instance:

  // define a class named MyClass
  function MyClass(msg) {
    this.msg = msg;
    
    // have the doSomething() method get called when a button is clicked
    var button = document.getElementById('myButton');
      
    // make a closure to capture our instance, then call the method when
    // the button is clicked
    var self = this;
    button.onclick = function() {
      self.doSomething();
    }
  }
  
  MyClass.prototype.doSomething = function() {
    alert('Hello World! Our message is ' + this.msg);
  }
  
  // create an instance of MyClass
  var instance = new MyClass('Gears rocks');

We've used a closure to 'capture' the instance we are working with; then, when the button is clicked the doSomething method is called and the value of this.msg is valid and prints out correctly.

This pattern shows up all the time in JavaScript, especially when you are working with asynchronous tasks that involve callbacks such as the network and Gears Workers; if you aren't careful your code can get ugly with lots of self variables and tricky bugs. Dojo provides a convenient hitch method that makes this pattern cleaner and your code more compact. Here is a code snippet from the PubTools Search Documents class that uses hitch:

pu.declare('Documents', null, {
    constructor: function(urls) {
      this.filter_(urls, pu.hitch(this, this.download_));
    },

    filter_: function(urls, callback) {
      // filter the URLs here asynchronously
      // ...
      callback(filteredURLs)
    },

    download_: function(downloadMe) {
      this.doSomething_();
    }
});

Remember that the Documents class is tasked with taking a list of URLs, filtering them down and getting their MIME types, and then downloading the documents' contents for indexing. Both filtering and downloading the documents are asynchronous tasks involving the network, so they need callbacks. Imagine if the line in our constructor was the following:

constructor: function(urls) {
  this.filter_(urls, this.download_); // bad!
}

The filter_ function takes some URLs to filter and a callback that will be called when we are done. In the above incorrect code snippet, when the filtering is done the download_ method is run. However, just like our event handling code, the download_ method will no longer run in the context of our Documents instance! If download_ tries to reference variables and functions on this, such as this.doSomething_(), then they will be undefined and the call will fail. The Dojo hitch method makes it easy to bind a given instance and function together, returning another function that can be safely used in a callback or event handler. Here is the correct code:

constructor: function(urls) {
  this.filter_(urls, pu.hitch(this, this.download_));
},

this.download_ and this, referring to the Documents instance, are now hitched. When filtering is done it will correctly call the download_ method and everything will still be running in the context of the Documents instance.
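To see why this works, it helps to look at how little a function like hitch needs to do. The version below is a simplified stand-in written for this explanation, not Dojo's implementation (Dojo's hitch also accepts a method name given as a string, for example); the Greeter class is likewise invented for the example.

```javascript
// Simplified stand-in for hitch: return a new function that always calls
// `method` with `scope` as its this value.
function hitch(scope, method) {
  return function() {
    return method.apply(scope, arguments);
  };
}

// A small class to try it with.
function Greeter(msg) {
  this.msg = msg;
}

Greeter.prototype.greet = function(name) {
  return 'Hello ' + name + '! Our message is ' + this.msg;
};

var greeter = new Greeter('Gears rocks');

// Safe to pass around as a callback: the instance travels with it.
var callback = hitch(greeter, greeter.greet);
```

However callback ends up being invoked, whether as an event handler, from a timer, or by another object, greet still sees greeter as its this.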

dojo.byId & dojo.query

Most JavaScript toolkits provide convenience functions for quickly getting DOM elements on the page, so that you don't have to constantly litter your code with the DOM standard's verbose document.getElementById method. Dojo provides a similar method, dojo.byId. Here we are getting the PubTools Search DIV that a developer provides in their HTML so we can fill in the UI:

var w = pu.byId('st-widget');

Recently a new practice has developed of binding your JavaScript's behavior to the page using CSS Selectors rather than lots of DOM manipulation code, which can be very verbose. JavaScript toolkit authors have made large performance gains in the engines that power this technique, which makes it viable for large-scale projects, and the gains in productivity and code maintainability are large. In jQuery, for example, this is the standard way you work with the page.

Dojo includes similar functionality with dojo.query. Let's look at an example of how this is used in PubTools Search. In PubTools Search you can use a LINK tag in your HTML with a special rel name to override the default location and name of the search manifest:

<link rel="search.urls" href="/some/file/path/search_me.txt"></link>

When we start up we want to see if this LINK tag is defined and get its value. Doing so using old W3C DOM manipulation would have involved getting all the LINK tags on the page (document.getElementsByTagName('link')); looping over all of them looking at the rel value; and then grabbing the one that might match and getting its value. This kind of older DOM code can get ugly and verbose fast. Here is how we do it using dojo.query in the SearchTools class:

getSearchManifestURL_: function() {
  var url = 'search.txt';
  var results = pu.query("link[rel='search.urls']");
  if (results.length) {
    url = results[0].getAttribute('href');
  }
  
  return url;
}

Instead of lots of looping, we just use a simple CSS Selector, link[rel='search.urls'], which will match any LINK tags that have a rel attribute equal to search.urls. If it's present, we just grab its value and we are done.
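For comparison, here is roughly what the older, loop-based DOM version described above would look like. This is written for illustration and does not appear in PubTools Search; the doc parameter stands in for the page's document object so that the function can be tried outside a browser.

```javascript
// The same lookup using only plain DOM calls. In a page you would pass in
// the global document object as `doc`.
function getSearchManifestURL(doc) {
  var url = 'search.txt';  // default when no LINK tag overrides it

  var links = doc.getElementsByTagName('link');
  for (var i = 0; i < links.length; i++) {
    if (links[i].getAttribute('rel') === 'search.urls') {
      url = links[i].getAttribute('href');
      break;  // the first matching LINK tag wins
    }
  }

  return url;
}
```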

We've only scratched the surface of dojo.query and writing JavaScript code based on the CSS Selector idiom; see the chapter on this subject in The Book of Dojo for more details.

Tips, Tricks, & Best Practices

Now that you've seen some snippets of PubTools Search in conjunction with Dojo, let's look at some general Tips, Tricks, and Best Practices when using Gears. All of these are used in PubTools Search so you can always consult the source to get more details.

Conclusion

At this point you should have greater knowledge of Gears, Full-Text Search, Workers, PubTools Search, and Dojo. I look forward to seeing how you put these pieces together in your own applications! Email me if you have questions or have done something nifty with the pieces provided in this article.

Resources

Special Thanks

Special thanks to Thibaud Lopez Schneider for great bug-testing, optimization, and many, many patches. Also thanks to Scott Hess for the awesome fts2 module that does full-text search in SQLite.