Before any document set can be made searchable, it needs to be processed—a procedure known as indexing. AIDA’s Indexer component takes care of the preprocessing (the conversion, tokenization, and possibly normalization) of the text of each document as well as the subsequent index generation. It is flexible and can be easily configured through a configuration file. For example, different fields can be extracted from each document type, such as title, document name, authors, or the entire contents.
The currently supported document formats are Microsoft Word, Portable Document Format (PDF), MedLine, and plain text. The so-called “DocumentHandlers”, which handle the actual conversion of each source file, are loaded at runtime, so a handler for any other (proprietary) document format can be created and used without rebuilding the Indexer. Because the Indexer is built on Lucene, many options and languages are available for stemming, tokenization, normalization, and stop word removal. All of these can be set on a per-field, per-document type, or per-index basis in the configuration file.
An index can currently be constructed from the command line, through a SOAP webservice (limited to one document per call), or with the Taverna plugin.
Installation
There are two ways of configuring the installation of the Indexer component, cf. AIDA - retrieval. The only configuration-specific parameter for the Indexer is:
- manager.port - Tomcat’s port number.
The Indexer uses the environment variable INDEXDIR to locate the directory that contains all Lucene index directories, for example:
/INDEXDIR/LUCENE_INDEX_DIR1
/INDEXDIR/LUCENE_INDEX_DIR2
/INDEXDIR/LUCENE_INDEX_DIR3
/INDEXDIR/LUCENE_INDEX_DIR4
...
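To confirm that INDEXDIR points at such a directory, a small check like the one below can help. It is only a sketch: list_indexes is a hypothetical helper, not part of AIDA, and it simply prints the name of every subdirectory without verifying that each one is a valid Lucene index.

```shell
#!/bin/sh
# Hypothetical helper (not part of AIDA): print the name of every
# subdirectory of the given index root, one per line.
list_indexes() {
    root="$1"
    if [ -z "$root" ] || [ ! -d "$root" ]; then
        echo "INDEXDIR is unset or not a directory: '$root'" >&2
        return 1
    fi
    for dir in "$root"/*/; do
        # Skip the unexpanded pattern when there are no subdirectories.
        [ -d "$dir" ] || continue
        basename "$dir"
    done
}
```

Calling list_indexes "$INDEXDIR" should then print names such as LUCENE_INDEX_DIR1, and it returns a non-zero status if the variable is empty or the directory does not exist.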
When (re)starting Tomcat, make sure the INDEXDIR environment variable is set. The easiest way is to modify catalina.(bat|sh), so that Tomcat always starts with the correct environment variable. Alternatively, you can set it in the operating system (OS). How to do this differs per OS; for example, on Unix/Linux (bash):
export INDEXDIR=PATH_TO_INDEXDIR
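As a variation on editing catalina.sh itself, the stock Tomcat startup script sources a file named bin/setenv.sh when it exists, so the variable can live there instead. The fragment below is a sketch under that assumption; check that your Tomcat version supports setenv.sh, and replace the PATH_TO_INDEXDIR placeholder with your own directory.

```shell
# Sketch of $CATALINA_HOME/bin/setenv.sh (assumed to be sourced by
# catalina.sh at startup). PATH_TO_INDEXDIR is a placeholder.
INDEXDIR="PATH_TO_INDEXDIR"
export INDEXDIR
```

Keeping the setting in a separate file means it is not lost when catalina.sh is replaced during a Tomcat upgrade.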
Then, to install, type ‘ant jar’ in the AIDA/Search/Indexer/ folder. If all went well, it should say something like:
[echo] To run this application from the command line without Ant, try:
[echo] java -jar "AIDA/Search/Indexer/dist/Indexer.jar"
Now you can use either the webservice or the command line to index your documents.
The Actual Indexing
To build an index, you should first edit the file indexconfig.xml. At the top of that file there are three configuration options you need to set:
- <Name>My_index</Name> - the name of the index to create; it is created as a subdirectory of the INDEXDIR defined earlier,
- <DataPath>datadir</DataPath> - the directory that contains the files you want to index, and
- <IndexOverwrite>true</IndexOverwrite> - a boolean flag that indicates whether an already existing index should be overwritten.
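Before starting a long indexing run, it can be worth checking these three settings. The sketch below is not part of AIDA: xml_value and check_index_config are hypothetical helpers, and they assume that each element sits on a single line, that a relative DataPath is resolved against the current working directory, and that IndexOverwrite is written as true or false.

```shell
#!/bin/sh
# Hypothetical helper: print the text of the first <TAG>...</TAG>
# element that appears on a single line of FILE.
xml_value() {
    sed -n "s:.*<$2>\(.*\)</$2>.*:\1:p" "$1" | head -n 1
}

# Hypothetical helper: sanity-check the three top-level options.
check_index_config() {
    config="$1"
    name=$(xml_value "$config" Name)
    data=$(xml_value "$config" DataPath)
    overwrite=$(xml_value "$config" IndexOverwrite)

    if [ -z "$name" ]; then
        echo "missing <Name> in $config" >&2
        return 1
    fi
    # Assumption: a relative DataPath is relative to the current directory.
    if [ ! -d "$data" ]; then
        echo "DataPath is not a directory: '$data'" >&2
        return 1
    fi
    case "$overwrite" in
        true|false) ;;
        *) echo "IndexOverwrite should be true or false: '$overwrite'" >&2
           return 1 ;;
    esac
    echo "index=$name data=$data overwrite=$overwrite"
}
```

Running check_index_config indexconfig.xml prints a one-line summary when all three values look sane, and an error message otherwise.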
Furthermore, you can define how certain filetypes should be handled. In the provided code, there are 4 distinct document handlers:
- <DocType FileType="medline"> - MedLine files,
- <DocType FileType="txt"> - text files,
- <DocType FileType="pdf"> - PDF files, and
- <DocType FileType="msword"> - Microsoft Word files.
You can control how files are handled according to their extension. For example, to tell the Indexer to handle files with the extension .conf as text files, add the following lines to indexconfig.xml:
<DocType FileType="txt">
  <FileExtension>conf</FileExtension>
  <Field Name="content">
    <Index>TOKENIZED</Index>
    <Store>true</Store>
    <Termvector>YES</Termvector>
    <Description>"content"</Description>
  </Field>
</DocType>
Custom Filetypes
(Description will follow. If you need/want/like help with this, send an e-mail to emeij (at) science.uva.nl.)