Discussion:
Lucene performance: benchmarktemplate.xml
Glen Newton
2008-04-15 16:25:23 UTC
<benchmark>
<ul>
<p>
<b>Hardware Environment</b><br/>
<li><i>Dedicated machine for indexing</i>: yes</li>
<li><i>CPU</i>: Dual processor dual core Xeon CPU 3.00GHz;
hyperthreading ON for 8 virtual cores</li>
<li><i>RAM</i>: 8GB</li>
<li><i>Drive configuration</i>: Dell EMC AX150 storage array fibre
channel</li>
</p>
<p>
<b>Software environment</b><br/>
<li><i>Lucene Version</i>: 2.3.1</li>
<li><i>Java Version</i>: Java(TM) SE Runtime Environment (build
1.6.0_02-b05)</li>
<li><i>Java VM</i>: Java HotSpot(TM) 64-Bit Server VM (build
1.6.0_02-b05, mixed mode)</li>
<li><i>OS Version</i>: Linux OpenSUSE 10.2 (64-bit X86-64)</li>
<li><i>Location of index</i>: Filesystem, on attached storage</li>
</p>
<p>
<b>Lucene indexing variables</b><br/>
<li><i>Number of source documents</i>: 6,404,464</li>
<li><i>Total filesize of source documents</i>: 141GB; Note that this
is only the full-text: the metadata (title, author(s), abstract,
keywords, journal name) are in addition to this</li>
<li><i>Average filesize of source documents</i>:
22KB + metadata (see above)</li>
<li><i>Source documents storage location</i>: Filesystem</li>
<li><i>File type of source documents</i>: text (PDFs converted to
text then gzipped)</li>
<li><i>Parser(s) used, if any</i>: None, but the files were gzipped and had to
be un-gzipped by the Java application that also did the indexing</li>
<li><i>Analyzer(s) used</i>: StandardAnalyzer</li>
<li><i>Number of fields per document</i>: 24</li>
<li><i>Type of fields</i>: all text; 20 stored; 3 indexed and
tokenized with term vectors (full-text [not stored], title, abstract);
10 stored with no parsing</li>
<li><i>Index persistence</i>: FSDirectory</li>
<li><i>Index size</i>: 83GB</li>
<li><i>Number of terms</i>: 143,298,010</li>
</p>
<p>
<b>Figures</b><br/>
<li><i>Time taken (in ms/s as an average of at least 3 indexing
runs)</i>: 20.5 hours</li>
<li><i>Time taken / 1000 docs indexed</i>: 11.5 seconds </li>
<li><i>Memory consumption</i>: -Xms4000m -Xmx6000m</li>
<li><i>Query speed</i>: average time a query takes, type
of queries (e.g. simple one-term query, phrase query),
not measuring any overhead outside Lucene</li>
</p>
<p>
<b>Notes</b><br/>
<li><i>Notes</i>:
<ul>
<li>
These are journal articles, so the additional fields besides the
full-text are bibliographic metadata, such as title, authors,
abstract, keywords, journal name, volume, issue, start page, year.
</li>
<li>Java command line directives: -XX:+AggressiveOpts
-XX:+ScavengeBeforeFullGC -XX:-UseParallelGC -server -Xms4000m
-Xmx6000m
</li>
<li>Highly multithreaded and pipelined architecture using
java.util.concurrent.ThreadPoolExecutor
</li>
<li>Reading files from the filesystem and un-gzipping them are both multithreaded
</li>
<li>Eight separate parallel IndexWriters are fed by the pipeline
(creation of Document objects occurs in parallel with 64 threads), and
their indexes are merged at the end into a single index. Each parallel
IndexWriter had a slightly different RAM_BUFFER_SIZE_MB (64, 67, 70, 73,
76, 79, 83, 85 MB respectively), so that the flushes would not all happen
at the same time.
</li>
<li>
Contact: glen DOT newton AT nrc-cnrc DOT gc DOT ca
</li>

</ul>
</li>
</p>
</ul>
</benchmark>
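The reported figures can be cross-checked with a little arithmetic. The sketch below is not part of the original post; the class and method names are made up for illustration, and it only uses the numbers stated above (20.5 hours, 6,404,464 documents, 141GB of full text). It works out to roughly 11.5 seconds per 1000 documents, which agrees with the posted figure, or about 87 documents per second and a little under 7GB of source text per hour.

```java
// Cross-checks the throughput figures reported in the benchmark above.
// All names here are hypothetical; only the three input numbers come from the post.
final class BenchmarkFigures {

    /** Seconds of wall-clock time spent per 1000 documents. */
    static double secondsPer1000Docs(double hours, long docs) {
        return hours * 3600.0 / (docs / 1000.0);
    }

    /** Documents indexed per second of wall-clock time. */
    static double docsPerSecond(double hours, long docs) {
        return docs / (hours * 3600.0);
    }

    /** Gigabytes of source text consumed per hour. */
    static double gigabytesPerHour(double hours, double sourceGb) {
        return sourceGb / hours;
    }

    public static void main(String[] args) {
        double hours = 20.5;      // reported total indexing time
        long docs = 6_404_464L;   // reported number of source documents
        double sourceGb = 141.0;  // reported full-text size, metadata excluded

        System.out.printf("s per 1000 docs: %.2f%n", secondsPer1000Docs(hours, docs));
        System.out.printf("docs per second: %.1f%n", docsPerSecond(hours, docs));
        System.out.printf("GB per hour:     %.2f%n", gigabytesPerHour(hours, sourceGb));
    }
}
```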
Cass Costello
2008-04-16 02:40:02 UTC
I just did that so I could read it. :) I'll leave it up until Glen resends
or posts it somewhere...
http://www.casscostello.com/?page_id=28
Hi Glen.
Can you resend this in plain text,
or put the HTML up on a server somewhere and point to it with a brief
summary in the post?
I'd love to read it, but all those tags are making me go blind.
Glen Newton
2008-04-16 13:40:11 UTC
Cass,
Thanks for converting it. I've posted it to my blog:
http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html

Sorry for the XML tags: I guess I followed the instructions on the
Lucene performance benchmarks page too literally ("Post these figures
to the lucene-user mailing list using this template.").

Sorry if it hurt your eyes! :-)

-Glen
Michael McCandless
2008-04-16 14:28:47 UTC
These are great results! Thanks for posting.

I'd be curious if you'd get better indexing throughput by using a
single IndexWriter, fed by all 8 indexing threads, with an 8X bigger
RAM buffer, instead of 8 IndexWriters that merge in the end.

How long does that final merge take now?

Also, 64 threads doing document construction seems too high? You may
be losing some performance to the cost of thread context switching.

Did you use autoCommit=false? I think it should help since you have
so many stored fields and some term vectors.

Mike
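For readers who have not seen this kind of layout, here is a rough sketch of the shape being discussed: a pool of worker threads un-gzips the input and hands each document to one of several writers. It is not Glen's code. `StubWriter` is a hypothetical stand-in that only counts documents (a real run would use Lucene's IndexWriter instead), the round-robin routing is an assumption, and the buffer sizes are carried along purely to mirror the staggered values from the benchmark notes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GunzipPipelineSketch {

    /** Hypothetical stand-in for an index writer: it only counts what it receives. */
    static final class StubWriter {
        final int ramBufferMb; // recorded for illustration; the stub does not use it
        private int docs;

        StubWriter(int ramBufferMb) { this.ramBufferMb = ramBufferMb; }
        synchronized void add(String text) { docs++; }
        synchronized int count() { return docs; }
    }

    /** Compresses text so the demo has gzipped input to work on. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Inflates one gzipped document back into text. */
    static String gunzip(byte[] compressed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Un-gzips every input on a worker pool and spreads the results over the writers round-robin. */
    static List<StubWriter> run(List<byte[]> inputs, int workerThreads, int[] ramBuffersMb) {
        List<StubWriter> writers = new ArrayList<>();
        for (int mb : ramBuffersMb) {
            writers.add(new StubWriter(mb));
        }
        ExecutorService pool = Executors.newFixedThreadPool(workerThreads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < inputs.size(); i++) {
                final byte[] input = inputs.get(i);
                final StubWriter target = writers.get(i % writers.size());
                Runnable job = () -> target.add(gunzip(input));
                pending.add(pool.submit(job));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface any worker failure
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return writers;
    }

    public static void main(String[] args) {
        List<byte[]> inputs = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            inputs.add(gzip("full text of article " + i));
        }
        int[] staggered = {64, 67, 70, 73, 76, 79, 83, 85}; // values from the benchmark notes
        for (StubWriter w : run(inputs, 64, staggered)) {
            System.out.println(w.ramBufferMb + " MB writer received " + w.count() + " docs");
        }
    }
}
```

The single-writer variant suggested above would amount to passing a one-element buffer array here; whether that is faster in a real index is exactly the open question in this thread.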
Glen Newton
2008-04-16 17:51:53 UTC
Post by Michael McCandless
These are great results! Thanks for posting.
Thanks!
Post by Michael McCandless
I'd be curious if you'd get better indexing throughput by using a single
IndexWriter, fed by all 8 indexing threads, with an 8X bigger RAM buffer,
instead of 8 IndexWriters that merge in the end.
While I am new to this list, I have been trying different
combinations/configurations for this over the last 18 months. To
answer your question: no, not on our multi-core machine. But I
haven't tried your suggested scenario with >= v2.2, so it is possible
that it might be better.
Post by Michael McCandless
How long does that final merge take now?
I don't have that timed. I will alter the app to record this.
Post by Michael McCandless
Also, 64 threads doing document construction seems too high? You may be
losing some performance to the cost of thread context switching.
I did performance tests on 8, 16, 24, 32, 48, 64, 72, 96, and 128 threads. 64 was
the sweet spot for this particular configuration.
Post by Michael McCandless
Did you use autoCommit=false? I think it should help since you have so
many stored fields and some term vectors.
Damn, I am using the defaults (autoCommit=true). I will re-run and
post the results!

Thanks, :-)

-glen
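A thread-count sweep like the one described above can be set up with a small timing harness. The following is only a sketch under assumptions: `buildDocument` is a hypothetical CPU-bound placeholder, not the real document-construction code, and the timings it prints say nothing about Lucene itself. The thread counts in `main` are the ones listed in the post.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ThreadSweepSketch {

    /** Hypothetical CPU-bound stand-in for building one document. */
    static long buildDocument(int seed) {
        long h = seed;
        for (int i = 0; i < 20_000; i++) {
            h = h * 31 + i;
        }
        return h;
    }

    /** Runs the given number of tasks on a fixed pool and returns the elapsed nanoseconds. */
    static long timeRun(int threads, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final int seed = i;
                Callable<Long> job = () -> buildDocument(seed);
                results.add(pool.submit(job));
            }
            for (Future<Long> r : results) {
                r.get(); // wait for every task before stopping the clock
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return System.nanoTime() - start;
    }

    /** Times the same workload once per thread count, keeping the input order. */
    static Map<Integer, Long> sweep(int[] threadCounts, int tasks) {
        Map<Integer, Long> elapsed = new LinkedHashMap<>();
        for (int threads : threadCounts) {
            elapsed.put(threads, timeRun(threads, tasks));
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int[] counts = {8, 16, 24, 32, 48, 64, 72, 96, 128}; // thread counts from the post
        for (Map.Entry<Integer, Long> e : sweep(counts, 5_000).entrySet()) {
            System.out.printf("%3d threads: %d ms%n", e.getKey(), e.getValue() / 1_000_000);
        }
    }
}
```

A single pass like this is noisy because of JIT warm-up; repeating each setting several times and discarding the first run would give steadier numbers.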
Anshum
2008-04-18 04:01:13 UTC
Hi Glen,

I am not too clear about this, but isn't there a limit on the memory
the JVM can use? I thought the limit was 1.3 GB resident and 2 GB in
total. You mentioned the memory consumption as
-Xms4000m -Xmx6000m.
Could someone please help me understand this?

--
Anshum
Glen Newton
2008-04-18 15:05:46 UTC
Hi Anshum,

A reasonable question. Answer: 64 bit architecture running 64 bit Java
VM. It is great! :-)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_02-b05, mixed mode)
OS Version: Linux OpenSUSE 10.2 (64-bit X86-64)
If you have any other questions, please let me know. :-)

-Glen
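One way to see what heap a given JVM will actually allow is to ask it. This is a small sketch, not something from the thread; the class name is made up, and it relies only on standard `Runtime` and system-property calls. Started with `-Xms4000m -Xmx6000m` on a 64-bit VM, the reported maximum should come out near 6000 MB, though the exact value can differ somewhat between JVMs and collectors. A 32-bit VM would typically refuse to start with those flags.

```java
// Prints which VM is running and how much heap it is allowed to use.
// The class name is hypothetical; the calls are standard Java SE APIs.
final class HeapReport {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("JVM:      " + System.getProperty("java.vm.name"));
        System.out.println("os.arch:  " + System.getProperty("os.arch"));
        System.out.println("max heap: " + maxHeapMb() + " MB");
    }
}
```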
Hi Glenn,
I am not too clear about it, but isn't there a limit to the memory
consumption specified for the JVM? The limit being 1.3Gigs of resident
-Xms4000m -Xmx6000m.
Could someone please help me with the same.
--
Anshum
Post by Glen Newton
Post by Michael McCandless
These are great results! Thanks for posting.
Thanks!
Post by Michael McCandless
I'd be curious if you'd get better indexing throughput by using a single
IndexWriter, fed by all 8 indexing threads, with an 8X bigger RAM buffer,
instead of 8 IndexWriters that merge in the end.
While I am new to this list, I have been trying different
combinations/configurations for this over the last 18 months. To
answer your questions: no, not on our multi-core machine. But I
haven't tried your suggested scenario with >= v2.2, so it is possible
that it might be better.
Post by Michael McCandless
How long does that final merge take now?
I don't have that timed. I will alter the app to record this.
Post by Michael McCandless
Also, 64 threads doing document construction seems too high? You may be
losing some performance to the cost of thread context switching.
I did performance tests on 8,16,24,32,48,64,72,96,128 threads. 64 was
the sweet spot, for this particular configuration.
Post by Michael McCandless
Did you use autoCommit=false? I think it should help since you have so
many stored fields and some term vectors.
Damn, I am using the defaults (autoCommit=true). I will re-run and
post the results!
Thanks, :-)
-glen
Post by Michael McCandless
Mike
Post by Glen Newton
Cass,
http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html
Post by Glen Newton
Sorry for the XML tags: I guess I followed the instructions on the
Lucene performance benchmarks page too literally ("Post these figures
to the lucene-user mailing list using this template.").
Sorry if it hurt your eyes! :-)
-Glen
Post by Cass Costello
I just did that so I could read it. :) I'll leave it up until Glen resends
or posts it somewhere...
http://www.casscostello.com/?page_id=28
Hi Glen.
can you resend this in plain text?
or put the HTML up on a server somewhere and point to it with a brief
summary in the post?
I'd love to look and read it, all those tags are making me go blind.
Post by Glen Newton
<benchmark>
<ul>
<p>
<b>Hardware Environment</b><br/>
<li><i>Dedicated machine for indexing</i>: yes</li>
<li><i>CPU</i>: Dual processor dual core Xeon CPU 3.00GHz;
hyperthreading ON for 8 virtual cores</li>
<li><i>RAM</i>: 8GB</li>
<li><i>Drive configuration</i>: Dell EMC AX150 storage array fibre
channel</li>
</p>
<p>
<b>Software environment</b><br/>
<li><i>Lucene Version</i>: 2.3.1</li>
<li><i>Java Version</i>: Java(TM) SE Runtime Environment (build
1.6.0_02-b05)</li>
<li><i>Java VM</i>: Java HotSpot(TM) 64-Bit Server VM (build
1.6.0_02-b05, mixed mode)</li>
<li><i>OS Version</i>: Linux OpenSUSE 10.2 (64-bit X86-64)</li>
<li><i>Location of index</i>: Filesystem, on attached storage</li>
</p>
<p>
<b>Lucene indexing variables</b><br/>
<li><i>Number of source documents</i>: 6,404,464</li>
<li><i>Total filesize of source documents</i>: 141GB; Note that this
is only the full-text: the metadata (title, author(s), abstract,
keywords, journal name) are in addition to this</li>
<li><i>Average filesize of source documents</i>:
22KB + metadata (see above)</li>
<li><i>Source documents storage location</i>: Where are the documents
being indexed located?
Filesystem</li>
<li><i>File type of source documents</i>: text (PDFs converted to
text then gzipped)</li>
<li><i>Parser(s) used, if any</i>: None, but files GZIPed & had to
be un-gziped by Java application which also did indexing</li>
<li><i>Analyzer(s) used</i>: StandardAnalyzer</li>
<li><i>Number of fields per document</i>: 24</li>
<li><i>Type of fields</i>: all text; 20 stored; 3 of indexed
tokenized with term vector (full-text [not stored], title, abstract);
10 stored with no parsing; </li>
<li><i>Index persistence</i>: FSDirectory</li>
<li><i>Index size</i>: 83GB</li>
<li><i>Number of terms</i>: 143,298,010</li>
</p>
<p>
<b>Figures</b><br/>
<li><i>Time taken (in ms/s as an average of at least 3 indexing
runs)</i>: 20.5 hours</li>
<li><i>Time taken / 1000 docs indexed</i>: 11.5 seconds </li>
<li><i>Memory consumption</i>: -Xms4000m -Xmx6000m</li>
<li><i>Query speed</i>: average time a query takes, type
of queries (e.g. simple one-term query, phrase query),
not measuring any overhead outside Lucene</li>
</p>
<p>
<b>Notes</b><br/>
<ul>
<li>
These are journal articles, so the additional fields besides the
full-text are bibliographic metadata, such as title, authors,
abstract, keywords, journal name, volume, issue, start page, year.
</li>
<li>Java command line directives: -XX:+AggressiveOpts
-XX:+ScavengeBeforeFullGC -XX:-UseParallelGC -server -Xms4000m
-Xmx6000m
</li>
<li>Highly multithreaded & pipelined architecture using
java.util.concurrent.ThreadPoolExecutor
</li>
<li>File system file reading and Un-gzip performed multithreaded
</li>
<li>Eight separate parallel IndexWriters are fed by the pipeline
(creation of Document objects occurs in parallel with 64 threads),
merged at end into single index. Each parallel index had slightly
different RAM_BUFFER_SIZE_MB (64, 67, 70, 73, 76, 79, 83, 85 MB
respectively), so that flushing wouldn't all happen at the same time.
</li>
<li>
Contact: glen DOT newton AT nrc-cnrc DOT gc DOT ca
</li>
</ul>
</li>
</p>
</ul>
</benchmark>
Ian Holsman
2008-04-16 00:18:10 UTC
Hi Glen.
can you resend this in plain text?
or put the HTML up on a server somewhere and point to it with a brief
summary in the post?
I'd love to look and read it, all those tags are making me go blind.
Post by Glen Newton
<benchmark>
<ul>
<p>
<b>Hardware Environment</b><br/>
<li><i>Dedicated machine for indexing</i>: yes</li>
<li><i>CPU</i>: Dual processor dual core Xeon CPU 3.00GHz;
hyperthreading ON for 8 virtual cores</li>
<li><i>RAM</i>: 8GB</li>
<li><i>Drive configuration</i>: Dell EMC AX150 storage array fibre
channel</li>
</p>
<p>
<b>Software environment</b><br/>
<li><i>Lucene Version</i>: 2.3.1</li>
<li><i>Java Version</i>: Java(TM) SE Runtime Environment (build
1.6.0_02-b05)</li>
<li><i>Java VM</i>: Java HotSpot(TM) 64-Bit Server VM (build
1.6.0_02-b05, mixed mode)</li>
<li><i>OS Version</i>: Linux OpenSUSE 10.2 (64-bit X86-64)</li>
<li><i>Location of index</i>: Filesystem, on attached storage</li>
</p>
<p>
<b>Lucene indexing variables</b><br/>
<li><i>Number of source documents</i>: 6,404,464</li>
<li><i>Total filesize of source documents</i>: 141GB; Note that this
is only the full-text: the metadata (title, author(s), abstract,
keywords, journal name) are in addition to this</li>
<li><i>Average filesize of source documents</i>:
22KB + metadata (see above)</li>
<li><i>Source documents storage location</i>: Where are the documents
being indexed located?
Filesystem</li>
<li><i>File type of source documents</i>: text (PDFs converted to
text then gzipped)</li>
<li><i>Parser(s) used, if any</i>: None, but files GZIPed & had to
be un-gziped by Java application which also did indexing</li>
<li><i>Analyzer(s) used</i>: StandardAnalyzer</li>
<li><i>Number of fields per document</i>: 24</li>
<li><i>Type of fields</i>: all text; 20 stored; 3 of indexed
tokenized with term vector (full-text [not stored], title, abstract);
10 stored with no parsing; </li>
<li><i>Index persistence</i>: FSDirectory</li>
<li><i>Index size</i>: 83GB</li>
<li><i>Number of terms</i>: 143,298,010</li>
</p>
<p>
<b>Figures</b><br/>
<li><i>Time taken (in ms/s as an average of at least 3 indexing
runs)</i>: 20.5 hours</li>
<li><i>Time taken / 1000 docs indexed</i>: 11.5 seconds </li>
<li><i>Memory consumption</i>: -Xms4000m -Xmx6000m</li>
<li><i>Query speed</i>: average time a query takes, type
of queries (e.g. simple one-term query, phrase query),
not measuring any overhead outside Lucene</li>
</p>
<p>
<b>Notes</b><br/>
<ul>
<li>
These are journal articles, so the additional fields besides the
full-text are bibliographic metadata, such as title, authors,
abstract, keywords, journal name, volume, issue, start page, year.
</li>
<li>Java command line directives: -XX:+AggressiveOpts
-XX:+ScavengeBeforeFullGC -XX:-UseParallelGC -server -Xms4000m
-Xmx6000m
</li>
<li>Highly multithreaded & pipelined architecture using
java.util.concurrent.ThreadPoolExecutor
</li>
<li>File system file reading and Un-gzip performed multithreaded
</li>
<li>Eight separate parallel IndexWriters are fed by the pipeline
(creation of Document objects occurs in parallel with 64 threads),
merged at end into single index. Each parallel index had slightly
different RAM_BUFFER_SIZE_MB (64, 67, 70, 73, 76, 79, 83, 85 MB
respectively), so that flushing wouldn't all happen at the same time.
</li>
<li>
Contact: glen DOT newton AT nrc-cnrc DOT gc DOT ca
</li>
</ul>
</li>
</p>
</ul>
</benchmark>
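As a quick sanity check on the figures quoted above, the per-1000-document time follows from the total run time and the document count: 20.5 hours is 73,800 seconds, and 73,800 / 6,404.464 comes to roughly 11.5 seconds per 1000 documents, or about 87 documents per second. The sketch below just repeats that arithmetic; the class and method names are made up for illustration.

```java
// Hypothetical helper; it only re-derives numbers already given in the benchmark.
class ThroughputCheck {

    /** Seconds spent per 1000 documents, given total hours and document count. */
    static double secondsPerThousandDocs(double hours, long docs) {
        return (hours * 3600.0) / (docs / 1000.0);
    }

    /** Documents indexed per second, given total hours and document count. */
    static double docsPerSecond(double hours, long docs) {
        return docs / (hours * 3600.0);
    }

    public static void main(String[] args) {
        double hours = 20.5;      // "Time taken" from the benchmark
        long docs = 6_404_464L;   // "Number of source documents"
        System.out.printf("%.2f s per 1000 docs%n", secondsPerThousandDocs(hours, docs));
        System.out.printf("%.2f docs per second%n", docsPerSecond(hours, docs));
    }
}
```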