Lucene, LingPipe, and Gate are popular open source tools for building powerful search applications. Building Search Applications describes the Lucene functions used to build search engines, including indexing, searching, ranking, and spelling correction.

With this book you will learn to: Extract tokens from text using custom tokenizers and analyzers from Lucene, LingPipe, and Gate. Construct a search engine index with an optional backend database to manage large document collections. Explore the wide range of Lucene queries to search an index, understand the ranking algorithm for a query, and suggest spelling corrections. Find the names of people, places, and other entities in text using LingPipe and Gate. Categorize documents by topic using classifiers and build groups of self-organized documents using clustering algorithms from LingPipe. Create a Web crawler to scan the Web, an intranet, or a desktop using Nutch. Track the sentiment of articles published on the Web with LingPipe.

All the examples we have seen so far have used plain text files or hard-coded text strings. Nonetheless, it is more common to find formatted text ... We select the text filter for a document using a simple look-up in a table (see Figure 2.4). If files are named using a standard file naming ... using proprietary Microsoft formats such as .DOC and .PPT. The Apache Jakarta Poor Obfuscation Implementation (POI) project was initiated to build a set of Java APIs to read and write MS Office files.
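
The table look-up mentioned in the excerpt can be illustrated with a minimal Java sketch. It assumes the filter is chosen from the file extension; the class name, the filter names, and the plain-text fallback are hypothetical and are not taken from the book or from Figure 2.4.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class FilterLookup {
    // Hypothetical extension-to-filter table; the book's actual table is not reproduced here.
    private static final Map<String, String> FILTERS = new HashMap<>();
    static {
        FILTERS.put("txt", "PlainTextFilter");
        FILTERS.put("doc", "WordFilter");
        FILTERS.put("ppt", "PowerPointFilter");
    }

    /** Returns the filter name for a file, falling back to plain text (an assumption). */
    public static String filterFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "PlainTextFilter"; // no usable extension
        }
        // Normalize case so that .DOC and .doc map to the same filter.
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return FILTERS.getOrDefault(ext, "PlainTextFilter");
    }

    public static void main(String[] args) {
        System.out.println(filterFor("report.DOC"));
        System.out.println(filterFor("notes.txt"));
    }
}
```

In a real application the map values would more likely be filter objects (for example, one wrapping a POI-based extractor for .DOC files) than names; strings are used here only to keep the sketch free of external dependencies.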

Author: Manu Konchady - 2008


