Discussion:
force deletes - terms enum still has deleted terms?
Rob Audenaerde
2018-09-28 12:40:41 UTC
Hi all,

We build an FST on the terms of our index by iterating the terms of the
readers for our fields, like this:

    for (final LeafReaderContext ctx : leaves) {
        final LeafReader leafReader = ctx.reader();

        for (final String indexField : indexFields) {
            final Terms terms = leafReader.terms(indexField);
            // If the field does not exist in this reader, then we get null, so check for that.
            if (terms != null) {
                final TermsEnum termsEnum = terms.iterator();

However, building the FST sometimes seems to find terms that come from
deleted documents. Judging by the javadocs, this is what we should expect.

So, now we switched the IndexWriter to a config with a TieredMergePolicy
with: setForceMergeDeletesPctAllowed(0).

When calling indexWriter.forceMergeDeletes(true) we expect that there will
be no more deletes. However, the deleted terms still sometimes appear. We
use the DirectoryReader.openIfChanged() to refresh the reader before
iterating the terms.

Are we forgetting something?

Thanks in advance.
Rob Audenaerde
Erick Erickson
2018-09-28 14:48:11 UTC
You might be hitting a rounding error. When this happens, how many
deleted documents are there in the remaining segments? 1?

The calculation for whether to merge the segment is:

    double pctDeletes = 100. * ((double) deleted_docs_in_segment /
        (double) doc_count_in_segment_including_deleted_docs);
    if (pctDeletes > forceMergeDeletesPctAllowed) {merge the segment}
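
To make the arithmetic concrete, here is a minimal standalone sketch of the check described above. It is not Lucene's actual code: the class and method names (ForceMergeDeletesCheck, pctDeletes, shouldMerge) are hypothetical, and only the formula and the strict "greater than" comparison are taken from this message.

```java
// Illustrative only: names are hypothetical, not part of Lucene's API.
final class ForceMergeDeletesCheck {

    // Percentage of deleted docs in a segment. maxDoc is the segment's
    // document count including deleted docs.
    public static double pctDeletes(int deletedDocs, int maxDoc) {
        return 100.0 * ((double) deletedDocs / (double) maxDoc);
    }

    // The segment is selected only when the percentage is strictly
    // greater than the allowed threshold.
    public static boolean shouldMerge(int deletedDocs, int maxDoc,
                                      double forceMergeDeletesPctAllowed) {
        return pctDeletes(deletedDocs, maxDoc) > forceMergeDeletesPctAllowed;
    }

    public static void main(String[] args) {
        // One deleted doc in a million, threshold 0.
        System.out.println(shouldMerge(1, 1_000_000, 0.0));
        // No deleted docs, threshold 0.
        System.out.println(shouldMerge(0, 1_000_000, 0.0));
        // 5% deleted, threshold 10.
        System.out.println(shouldMerge(5, 100, 10.0));
    }
}
```

In this simplified version, a threshold of 0 selects any segment that has at least one deleted document, and a segment with no deletes is never selected. Whether the real merge policy behaves identically in your Lucene version is something to confirm against its source.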

At any rate, calling IndexWriter.forceMerge instead (which goes through the
merge policy's findForcedMerges) will purge all deleted docs no matter what.

NOTE: as of 7.5, the behavior has changed in that both of these
methods will respect the maximum segment size by default. Prior to
7.5, either of these could produce a single segment for all the
segments that were merged (all of them in forceMerge, all with > n%
deleted docs in forceMergeDeletes). If you require a single segment to
result, you can specify the maxSegmentCount as 1.

See LUCENE-7976 for all the gory details of this change if you're curious.

Best,
Erick