Class TrecContentSource
java.lang.Object
org.apache.lucene.benchmark.byTask.feeds.ContentItemsSource
org.apache.lucene.benchmark.byTask.feeds.ContentSource
org.apache.lucene.benchmark.byTask.feeds.TrecContentSource
- All Implemented Interfaces:
Closeable,AutoCloseable
Implements a ContentSource over the TREC collection.
Supports the following configuration parameters (on top of ContentSource):
- work.dir - specifies the working directory. Required if "docs.dir" denotes a relative path (default=work).
- docs.dir - specifies the directory where the TREC files reside. Can be set to a relative path if "work.dir" is also specified (default=trec).
- trec.doc.parser - specifies the TrecDocParser class to use for parsing the TREC documents content (default=TrecGov2Parser).
- html.parser - specifies the HTMLParser class to use for parsing the HTML parts of the TREC documents content (default=DemoHTMLParser).
- content.source.encoding - if not specified, ISO-8859-1 is used.
- content.source.excludeIteration - if true, do not append iteration number to docname
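The defaults and the relative-path rule described above can be illustrated with a minimal sketch that uses only the JDK. The class name TrecConfigSketch and its helper methods are hypothetical and are not part of the benchmark API; the sketch only mirrors the documented behaviour (docs.dir resolved against work.dir when relative, ISO-8859-1 when no encoding is given) and is not taken from the TrecContentSource source.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, for illustration only; not part of Lucene.
public class TrecConfigSketch {

    /** Resolves docs.dir, using work.dir as the base when docs.dir is relative. */
    public static Path resolveDocsDir(Properties props) {
        Path workDir = Paths.get(props.getProperty("work.dir", "work"));
        Path docsDir = Paths.get(props.getProperty("docs.dir", "trec"));
        return docsDir.isAbsolute() ? docsDir : workDir.resolve(docsDir);
    }

    /** Returns the configured encoding, or ISO-8859-1 when none is specified. */
    public static Charset resolveEncoding(Properties props) {
        String name = props.getProperty("content.source.encoding");
        return name == null ? StandardCharsets.ISO_8859_1 : Charset.forName(name);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("work.dir", "bench-work");
        props.setProperty("docs.dir", "gov2"); // relative, so work.dir is needed
        System.out.println(resolveDocsDir(props));
        System.out.println(resolveEncoding(props));
    }
}
```

With an absolute docs.dir the work.dir value is ignored in this sketch, which matches the statement that work.dir is only required for relative paths.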
Field Summary
Fields
- static final String DOC
- static final String DOCNO
- static final String NEW_LINE - separator between lines in the buffer
- static final String TERMINATING_DOC
- static final String TERMINATING_DOCNO
Fields inherited from class org.apache.lucene.benchmark.byTask.feeds.ContentItemsSource
encoding, forever, logStep, verbose
Constructor Summary
Constructors
- TrecContentSource()
Method Summary
- void close() - Called when reading from this content source is no longer required.
- DocData getNextDocData(DocData docData) - Returns the next DocData from the content source.
- void resetInputs() - Resets the input for this content source, so that the test would behave as if it was just started, input-wise.
- void setConfig(Config config) - Sets the Config for this content source.
Methods inherited from class org.apache.lucene.benchmark.byTask.feeds.ContentItemsSource
addBytes, addItem, collectFiles, getBytesCount, getConfig, getItemsCount, getTotalBytesCount, getTotalItemsCount, printStatistics, shouldLog
Field Details
- DOCNO
- TERMINATING_DOCNO
- DOC
- TERMINATING_DOC
- NEW_LINE - separator between lines in the buffer
Constructor Details
TrecContentSource
public TrecContentSource()
Method Details
parseDate
close
Description copied from class: ContentItemsSource
Called when reading from this content source is no longer required.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Specified by:
close in class ContentItemsSource
- Throws:
IOException
getNextDocData
Description copied from class: ContentSource
Returns the next DocData from the content source. Implementations must account for multi-threading, as multiple threads can call this method simultaneously.
- Specified by:
getNextDocData in class ContentSource
- Throws:
NoMoreDataException
IOException
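One common way to meet the multi-threading requirement is to guard the shared reader state with a lock, so that each item is handed to exactly one caller. The sketch below shows that idea with JDK classes only. SharedSource and drain are hypothetical names introduced for illustration; this is not the TrecContentSource implementation, and other strategies (for example per-thread inputs) are equally valid.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration; not part of Lucene.
public class SynchronizedSourceSketch {

    /** Stand-in for a content source whose input position is shared by all callers. */
    static final class SharedSource {
        private final Iterator<String> items;

        SharedSource(List<String> docs) {
            this.items = docs.iterator();
        }

        /** Returns the next item, or null when exhausted; the lock guards the shared iterator. */
        synchronized String next() {
            return items.hasNext() ? items.next() : null;
        }
    }

    /** Drains the source from several threads and returns the distinct items seen. */
    public static Set<String> drain(List<String> docs, int threads) {
        SharedSource source = new SharedSource(docs);
        Set<String> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                String doc;
                while ((doc = source.next()) != null) {
                    seen.add(doc);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("doc-1", "doc-2", "doc-3");
        System.out.println(drain(docs, 4).size());
    }
}
```

Without the synchronized keyword, two threads could both pass the hasNext() check and then race on next(), which is the kind of failure the requirement is meant to prevent.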
resetInputs
Description copied from class: ContentItemsSource
Resets the input for this content source, so that the test would behave as if it was just started, input-wise.
NOTE: the default implementation resets the number of bytes and items generated since the last reset, so it's important to call super.resetInputs in case you override this method.
- Overrides:
resetInputs in class ContentItemsSource
- Throws:
IOException
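The override rule in the note above can be shown with a small, self-contained sketch. CountingSource and FileBackedSource are hypothetical stand-ins written for this example; they only imitate the documented contract (the base class resets its counters, and an override must delegate to it) and do not reproduce the real ContentItemsSource code.

```java
// Hypothetical illustration; not part of Lucene.
public class ResetInputsSketch {

    /** Stand-in for a base class that counts items generated since the last reset. */
    static class CountingSource {
        private long itemsSinceReset;

        void addItem() {
            itemsSinceReset++;
        }

        long getItemsCount() {
            return itemsSinceReset;
        }

        /** Clears the counter accumulated since the last reset. */
        void resetInputs() {
            itemsSinceReset = 0;
        }
    }

    /** Stand-in subclass that also keeps its own read position. */
    static class FileBackedSource extends CountingSource {
        private int position;

        void read() {
            position++;
            addItem();
        }

        int getPosition() {
            return position;
        }

        @Override
        void resetInputs() {
            super.resetInputs(); // keeps the base-class counter in sync
            position = 0;        // subclass-specific state
        }
    }

    /** Reads twice, resets, and returns {itemsCount, position} after the reset. */
    public static long[] demo() {
        FileBackedSource source = new FileBackedSource();
        source.read();
        source.read();
        source.resetInputs();
        return new long[] { source.getItemsCount(), source.getPosition() };
    }

    public static void main(String[] args) {
        long[] state = demo();
        System.out.println(state[0] + " " + state[1]);
    }
}
```

If the super.resetInputs() call were dropped, the position would return to zero while the item counter kept its old value, so statistics for the next round would be skewed.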
setConfig
Description copied from class: ContentItemsSource
Sets the Config for this content source. If you override this method, you must call super.setConfig.
- Overrides:
setConfig in class ContentItemsSource