Yahoo, Google and MSN hold a huge lead in search engine technology over open source alternatives. These search giants are battling among themselves to be a computer user’s default search site.
Where can a computer user go to find an adequate open source alternative to mainstream engines? Choices appear to be limited. A few established open source projects provide corporate IT managers some additional choices; however, a new offering from the founder of Wikipedia may soon change the search engine landscape.
Finding essential information with the fewest keyword refinements is a challenge for both searcher and search engine company. Searching for information online and within local storage drives is an integral part of the workflow.
The need for an open search engine tool with the ability to catalog and retrieve data stored within the user’s network as well as find information on the Internet holds potential for innovation from open source projects. However, few alternatives exist today in open source search engine technology.
“The difference in using Google or Yahoo is the ability for searching inside my firewall or searching privately. You can buy a proprietary product [for intranet searching], but very few open source search engines are in use,” David Christian, chief technical officer of Mindbridge, told LinuxInsider. Mindbridge is a provider of business process outsourcing (BPO) services.
Some critics of existing search engine products say there is a growing need for alternatives to the proprietary search companies and the big business associated with sponsored information and ad revenue from search results. A few innovators are conducting a quest for new search engines and an alternative to the influences of ranking done by proprietary search platforms.
Take the experience of Matt Burkhardt, chief executive officer of Impari Systems, as an example of the growing user need for new search engine options. Impari Systems is a startup focusing on bringing open source software to schools.
Burkhardt is unhappy with his efforts to get his information displayed on Google news feeds. He put out two press releases only to find that soon after posting, they disappeared. Even worse, his notices seemed to be replaced with competing information that was two years old.
That experience and others convinced Burkhardt that search is broken on the Internet. He is hoping that something better comes along.
“Existing open source caters to [a] vertical market. We need something more mainstream,” he told LinuxInsider.
Search engines such as Google, Yahoo and MSN differ in their methodologies and search algorithms. Search engine technology is mostly secret, given the proprietary nature of their platforms.
Preferences for one search engine over another sometimes reach fanatic status, as users rely on a favorite search platform to find content. One of the leading search product alternatives, according to Mindbridge’s Christian, is Apache Lucene.
Most open source searching involves a component embedded into a larger project, he noted. Similarly, most of the open source projects using full text search are built with Lucene as the basis.
These alternative open source search projects include both desktop technologies and server-side technologies, alone or in combination, he explained.
The Lucene Model
Apache Lucene is an open source, full-featured text search engine library written in Java that is compatible with cross-platform searching. It is available for free download.
Its June update adds a payloads package for query mechanisms. The new version can boost a search term’s relevancy score based on the value of the payload located at that term.
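The idea behind payload-based boosting can be illustrated with a toy example. The Python sketch below is not Lucene code and does not use Lucene's API; the ToyIndex class and its methods are hypothetical names invented for illustration. It assumes each indexed term occurrence carries a numeric payload, and that a document's score for a term is simply the sum of those payloads.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical in-memory inverted index; each posting carries a payload."""

    def __init__(self):
        # term -> list of (doc_id, payload) postings
        self.postings = defaultdict(list)

    def add(self, doc_id, tokens):
        """Index a document given (term, payload) pairs for its tokens."""
        for term, payload in tokens:
            self.postings[term.lower()].append((doc_id, payload))

    def search(self, term):
        """Rank documents by the summed payloads found at the query term."""
        scores = defaultdict(float)
        for doc_id, payload in self.postings.get(term.lower(), []):
            scores[doc_id] += payload
        # Highest score first; ties broken by document id for stable output.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = ToyIndex()
# Assumption: a payload of 3.0 marks a term that appeared in a title,
# while 1.0 marks an ordinary body occurrence.
index.add("doc-a", [("lucene", 1.0), ("search", 1.0)])
index.add("doc-b", [("lucene", 3.0)])
results = index.search("lucene")
print(results)
```

Here the document whose occurrence of the term carries the larger payload ranks first, even though both documents contain the term exactly once.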
Lucene is now able to use “point-in-time” searching over NFS (network file system) structures. It also has a new API (application programming interface) for pre-analyzed fields.
A Starting Point
Using the Lucene platform as a basis for new open source search products may offer more choices. It is capable of integrating current technology.
“From a programmer’s perspective, Apache Lucene has a robust API and .net and Java compatibility. Lucene is the basis for a number of search platforms,” said Christian.
The .NET Framework is a software component developed by Microsoft that is included in the Microsoft Windows operating system. It provides a large library of pre-coded instructions. Java is a programming language developed by Sun Microsystems.
Developing new search engine strategies, for both Internet and intranet use, runs the risk of other problems for potential users, warned Christian.
For example, one problem with using an alternative search product is that components may not talk to all data containers. Another problem is that most people are not good at managing metadata (mechanisms that help define the structure of various document types).
“We need to search multiple indexes and return results in a cohesive fashion. We see some companies just beginning to explore this. We need a search vehicle that will pull everything together,” Christian said.
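One way to picture what Christian describes is a small merging step that sits on top of several independent indexes. The Python sketch below is only an illustration under stated assumptions: the merge_results function and the sample source names are hypothetical, and dividing each score by the top score from its own index is just one simple way to make scores from different indexes roughly comparable.

```python
def merge_results(result_lists, limit=10):
    """Merge ranked hits from several indexes into one ranked list.

    result_lists maps a source name to a list of (doc_id, score) hits.
    Scores are divided by the best score from the same source, so each
    source's top hit gets 1.0 before the lists are combined.
    """
    merged = {}
    for source, hits in result_lists.items():
        if not hits:
            continue
        top = max(score for _, score in hits)
        for doc_id, score in hits:
            merged[(source, doc_id)] = score / top if top > 0 else 0.0
    # Highest normalized score first; ties broken by source and doc id.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return [(source, doc_id, score) for (source, doc_id), score in ranked[:limit]]


# Hypothetical hits from an intranet index and a Web index.
combined = merge_results({
    "intranet": [("policy.doc", 8.0), ("memo.txt", 2.0)],
    "web": [("example.org/a", 0.5), ("example.org/b", 0.25)],
})
for row in combined:
    print(row)
```

Without the normalization step, the intranet index would dominate the merged list simply because it reports scores on a larger scale.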
Perhaps one of the most promising new open source search offerings will become available by the end of this year from Wikia, which recently completed a purchase of the Grub Web crawler tool from LookSmart.
Jimmy Wales, Wikia chairman and Wikipedia founder, told LinuxInsider he will release the Grub code, proprietary until now, as open source.
Grub is a Web crawler that creates an index of the World Wide Web by borrowing the processing power donated by volunteer computers, similar to the SETI@home project, which looks for extraterrestrial life. This will allow Wales to jumpstart his new search product without having to develop his own computer network to crawl the Web to build and maintain a catalog of content.
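The volunteer-computing model can be pictured as a coordinator that hands out batches of URLs and collects what the volunteers report back. The Python sketch below is a simplified illustration, not Grub's actual protocol or code; the CrawlCoordinator class, its methods and the sample URLs are all hypothetical, and no real network requests are made.

```python
from collections import deque


class CrawlCoordinator:
    """Hypothetical coordinator that shares crawl work among volunteers."""

    def __init__(self, seed_urls):
        self.pending = deque(seed_urls)  # URLs waiting to be crawled
        self.assigned = {}               # url -> volunteer currently holding it
        self.index = {}                  # url -> summary reported by a volunteer

    def next_batch(self, volunteer_id, size=2):
        """Hand the volunteer up to `size` URLs from the pending queue."""
        batch = []
        while self.pending and len(batch) < size:
            url = self.pending.popleft()
            self.assigned[url] = volunteer_id
            batch.append(url)
        return batch

    def report(self, volunteer_id, url, summary):
        """Accept a result only from the volunteer the URL was assigned to."""
        if self.assigned.get(url) != volunteer_id:
            return False
        del self.assigned[url]
        self.index[url] = summary
        return True


coordinator = CrawlCoordinator(
    ["http://example.org/1", "http://example.org/2", "http://example.org/3"]
)
batch_a = coordinator.next_batch("volunteer-a")
batch_b = coordinator.next_batch("volunteer-b")
for url in batch_a:
    coordinator.report("volunteer-a", url, "page summary from volunteer-a")
print(batch_a, batch_b, sorted(coordinator.index))
```

The central site only tracks assignments and stores results; the fetching work itself is done on the volunteers' machines, which is what lets a project avoid building its own crawling network.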
“We plan to build all the software needed for free licensing for searching. I want to make all content available license free. Nothing like this exists today,” Wales said.
Wales’ plan for a new open source based search engine calls for an expansion of previous open source efforts begun by projects such as Lucene. His goal is to create an open and transparent search tool that does not mask its methodologies and search algorithms.
“There were several open source search projects. They were a start. Some of the pieces have existed. Now we are trying to give it full support,” he said.
Wales plans to release some form of a very rough first cut of his new search offering by the first of the year. He will use an ad-based model for the Web site but is not sure about the rest of the business model yet.