Discussion:
Using Lucene as a database... good idea or bad idea?
John Evans
2008-07-29 01:53:08 UTC
Hi All,

I have successfully used Lucene in the "traditional" way to provide
full-text search for various websites. Now I am tasked with developing a
data-store to back a web crawler. The crawler can be configured to retrieve
arbitrary fields from arbitrary pages, so the result is that each document
may have a random assortment of fields. It seems like Lucene may be a
natural fit for this scenario, since you can obviously add arbitrary fields
to each document and you can store the actual data in the index. I've
done some research to make sure that it would meet all of our individual
requirements (that we can iterate over documents, update (delete/replace)
documents, etc.) and everything looks good. I've also seen a couple of
references around the net to other people trying similar things... however,
I know it's not meant to be used this way, so I thought I would post here
and ask for guidance. Has anyone done something similar? Is there any
specific reason to think this is a bad idea?

The one thing that I am least certain about is how well it will scale. We
may reach the point where we have tens of millions of documents, and a high
percentage of those documents may be relatively large (10k-50k each). We
would NOT actually expect or need Lucene's normal extremely fast text
search times for this, but we would need reasonable times for adding new
documents to the index, retrieving documents by ID (for iterating over all
documents), optimizing the index after a series of changes, etc.
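(The operations described above - arbitrary stored fields per document, replace-by-ID, iteration over all documents, periodic optimize - can be sketched against the Lucene 2.3-era Java API current for this thread. The "id" field, the index path, and the field values below are illustrative assumptions, not a tested implementation:)

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class CrawlStore {
    public static void main(String[] args) throws IOException {
        // Each crawled page becomes a Document carrying whatever fields
        // the crawler happened to extract, plus an untokenized unique id.
        Document doc = new Document();
        doc.add(new Field("id", "http://example.com/page",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("title", "Example page",
                Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("body", "...page text...",
                Field.Store.YES, Field.Index.TOKENIZED));

        IndexWriter writer = new IndexWriter("/data/crawl-index",
                new StandardAnalyzer());
        // updateDocument is the delete-by-term-then-add primitive,
        // i.e. replace the existing document with this id.
        writer.updateDocument(new Term("id", "http://example.com/page"), doc);
        writer.optimize();  // occasionally, after a batch of changes
        writer.close();

        // Iterating over every stored document in the index:
        IndexReader reader = IndexReader.open("/data/crawl-index");
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (!reader.isDeleted(i)) {
                Document d = reader.document(i);
                // process d.get("id"), d.get("title"), ...
            }
        }
        reader.close();
    }
}
```

(Note that for updateDocument to match, the "id" field must be stored untokenized so the Term compares against the exact value.)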

Any advice/input/theories anyone can contribute would be greatly
appreciated.

Thanks,
-
John
Hasan Diwan
2008-07-29 04:55:48 UTC
Check the Nutch or Solr projects, both of which are subprojects of Lucene.
Feel free to drop me a line if you run into difficulties.

Ian Lea
2008-07-29 08:48:20 UTC
John


I think it's a great idea, and do exactly this to store 5 million+
documents with info that takes way too long to get out of our
Oracle database (think days). Not as many docs as you are talking
about, and less data for each doc, but I wouldn't have any concerns
about scaling. There are certainly Lucene indexes out there bigger
than what you propose. You can compress the stored data to save some
space. Run times for optimization might get interesting, but see
recent threads for suggestions on that. And since you are not too
concerned about performance you may not need to optimize much, or even
at all.
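(Ian's suggestion to compress the stored data can be sketched with stdlib java.util.zip; Lucene 2.x also offers Field.Store.COMPRESS, which does this for you when storing a field. The class name and sample content below are illustrative:)

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class StoredFieldCompression {
    // Compress page content before storing it as a binary field.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        gz.write(raw);
        gz.close();  // finishes the gzip stream and flushes to bos
        return bos.toByteArray();
    }

    // Decompress after reading the stored field back out of the index.
    static byte[] gunzip(byte[] packed) throws IOException {
        GZIPInputStream gz =
                new GZIPInputStream(new ByteArrayInputStream(packed));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = gz.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Repetitive HTML, like crawled pages, compresses well.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) {
            sb.append("<p>crawled page content line ").append(i).append("</p>\n");
        }
        byte[] page = sb.toString().getBytes("UTF-8");
        byte[] packed = gzip(page);
        System.out.println(page.length + " bytes -> " + packed.length + " bytes");
        if (!java.util.Arrays.equals(gunzip(packed), page)) {
            throw new IllegalStateException("round-trip failed");
        }
    }
}
```

(The trade-off is the usual one: CPU spent at index and retrieval time in exchange for smaller stored-field files on disk.)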

Of course you need to remember that this is not a DBMS solution in the
sense of transactions, recovery, etc. but I'm sure you are already
aware of that.


--
Ian.
Ganesh - yahoo
2008-07-29 09:34:03 UTC
Hello all,

I am also interested in this. I want to archive the content of documents
using Lucene.

Is it a good idea to use Lucene as a storage engine?

Regards
Ganesh

Grant Ingersoll
2008-07-29 12:31:29 UTC
I think the answer is that it can be done, and probably quite well. I also
think it's informative that Nutch does not use Lucene for this
function, as I understand it, but that shouldn't stop you either. You
might also have a look at Apache Jackrabbit, which uses Lucene
underneath as a content repository.

-Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S)
2008-07-29 12:43:27 UTC
The way I see it, search solutions (on whatever scale) have three components:
data aggregation, indexing/searching, and presentation of results. I
thought Lucene did the second part only.

So I do not quite follow: why should Lucene be used as a datastore?

Nagesh
Ian Lea
2008-07-29 13:21:35 UTC
I don't think that anyone in this thread has said "should", just
"could" - it is a valid option (IMHO). Personally, I use it as a
store for Lucene-related data because I know and like and trust it,
because it is already there for this project (so no need to introduce
another software dependency), and because it is blindingly fast.


--
Ian.


Grant Ingersoll
2008-07-29 13:56:13 UTC
Agreed, no one is saying "should". Additionally, Lucene can be faster
for a number of things like storage when databases are overkill (i.e.
you don't need transactions, complex joins, etc.). After all, even the
lookup of a file can be viewed as a "search", even if it is just for
a single unique key and doesn't require any fuzziness.
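(Grant's single-unique-key "search" is, in the 2.x-era API, just a term-dictionary probe via TermDocs rather than a scored query. A minimal sketch, assuming the untokenized "id" field used earlier in the thread:)

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class LookupById {
    // Exact-match lookup on an untokenized unique field: no scoring,
    // no fuzziness, just a walk to one posting in the term dictionary.
    static Document byId(IndexReader reader, String id) throws IOException {
        TermDocs td = reader.termDocs(new Term("id", id));
        try {
            return td.next() ? reader.document(td.doc()) : null;
        } finally {
            td.close();
        }
    }
}
```

(This is typically cheaper than running a TermQuery through IndexSearcher, since no scoring or hit collection is involved.)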
ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S)
2008-07-29 14:02:10 UTC
Hi Ian,
Yes, I see that we are discussing an "option" here.

But, as I said before (the three parts to a search-based solution), I do
not know (but would like to know) how Lucene (Java only - not Nutch, Solr,
etc.) can be used as a datastore.

Basically, I am not able to connect "database" and Lucene Java. :)

Nagesh
A***@equifax.com
2008-07-29 14:07:48 UTC
Look at the Compass wrapper for Lucene...

Regards,
Aravind R Yarram
Enabling Technologies
Equifax Information Services LLC
1525 Windward Concourse, J42E
Alpharetta, GA 30005
desk: 770 740 6951
email: ***@equifax.com



"ಚಟಗೇಶ್ ಞುಬ್ರಹ್ಮಣ್ಯ (Nagesh S)" <***@gmail.com>
07/29/2008 10:02 AM
Please respond to
java-***@lucene.apache.org


To
java-***@lucene.apache.org
cc

Subject
Re: Using lucene as a database... good idea or bad idea?






Hi Ian,
Yes, I see that we are discussing an "option" here.

But, as I said before (the three parts to search-based solution), I do not
know (but, would like to know) how Lucene (java only - not Nutch, Solr,
etc.) can be used as a datastore.

Basically, I am not able to connect "database" and Lucene java. :)

Nagesh
Post by Ian Lea
I don't think that anyone in this thread has said "should", just
"could" - it is a valid option (IMHO). Personally, I use it as a
store for lucene related data because I know and like and trust it, it
is already there for this project so no need to introduce another
software dependency, and because it is blindingly fast.
--
Ian.
On Tue, Jul 29, 2008 at 1:43 PM, ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S) wrote:
Post by ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S)
The way I see it, search solutions (on whatever scale) have three
components - data aggregation, indexing/searching and presentation of results. I
thought Lucene did the second part only.
So, I do not quite follow: why should Lucene be used as a datastore?
Nagesh
Post by Grant Ingersoll
I think the answer is it can be done and probably quite well. I also
think it's informative that Nutch does not use Lucene for this function, as I
understand it, but that shouldn't stop you either. You might also have a
look at Apache Jackrabbit, which uses Lucene underneath as a content
repository.
-Grant
Post by Ganesh - yahoo
Hello all,
I am also interested in this. I want to archive the content of the
document using Lucene.
Is it a good idea to use Lucene as a storage engine?
Regards
Ganesh
Sent: Tuesday, July 29, 2008 2:18 PM
Subject: Re: Using lucene as a database... good idea or bad idea?
Post by Ian Lea
John,
I think it's a great idea, and do exactly this to store 5 million+
documents with info that it takes way too long to get out of our
Oracle database (think days). Not as many docs as you are talking
about, and less data for each doc, but I wouldn't have any concerns
about scaling. There are certainly lucene indexes out there bigger
than what you propose. You can compress the stored data to save some
space. Run times for optimization might get interesting but see
recent threads for suggestions on that. And since you are not too
concerned about performance you may not need to optimize much, or even
at all.
Of course you need to remember that this is not a DBMS solution in the
sense of transactions, recovery, etc. but I'm sure you are already
aware of that.
--
Ian.
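Ian's approach - large records kept as compressed stored fields alongside an ID - can be sketched with the Lucene 2.x-era API of the time. The field names, index path, and analyzer choice here are illustrative assumptions, not details from the thread:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Sketch: each record becomes a Document whose payload is a
// compressed stored field, fetchable later by its untokenized ID.
public class StoreExample {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(new File("store-index"),
                new StandardAnalyzer(), true);

        Document doc = new Document();
        // Untokenized key field so the record can be looked up by exact ID.
        doc.add(new Field("id", "row-42",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Store.COMPRESS trades CPU for disk space on large stored values;
        // Index.NO because the payload is storage-only, not searched.
        doc.add(new Field("data", "...large record payload...",
                Field.Store.COMPRESS, Field.Index.NO));

        writer.addDocument(doc);
        writer.close();
    }
}
```

In the Lucene 2.x line, `Field.Store.COMPRESS` gzip-compresses the stored value transparently; retrieval returns the uncompressed text.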
Grant Ingersoll
2008-07-29 14:25:31 UTC
Permalink
Don't connect "database" (i.e. SQL, transactions, etc.) and Lucene.
Connect data storage with simple, fast lookup and Lucene.

One field is the key (e.g. the filename); the other field is a binary,
stored Field containing the contents of the file. Of course, there
are other ways of slicing and dicing, such that one can search (in the
fuzzy sense) the content and the key by adding tokenization, etc.
This is the more traditional model for Lucene.

Also, have a look at Apache Jackrabbit. It is a content repository
that is implemented with Lucene.

-Grant
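Grant's two-field model can be sketched roughly as follows, again with the Lucene 2.x-era API. The field names, file name, index path, and the `loadBytes` helper are illustrative assumptions:

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Sketch: one untokenized key field plus one binary stored field,
// giving key-value storage with simple, fast lookup.
public class LuceneKeyValue {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("kv-index",
                new StandardAnalyzer(), true);

        Document doc = new Document();
        doc.add(new Field("key", "report.pdf",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Binary stored field holding the raw file contents.
        doc.add(new Field("contents", loadBytes("report.pdf"), Field.Store.YES));
        // updateDocument = delete-by-term plus add, i.e. "replace by key".
        writer.updateDocument(new Term("key", "report.pdf"), doc);
        writer.close();

        // Exact-match lookup by key.
        IndexSearcher searcher = new IndexSearcher("kv-index");
        Hits hits = searcher.search(new TermQuery(new Term("key", "report.pdf")));
        if (hits.length() > 0) {
            byte[] data = hits.doc(0).getField("contents").binaryValue();
            System.out.println(data.length + " bytes retrieved");
        }
        searcher.close();
    }

    // Helper (assumed, not from the thread): read a file into a byte array.
    private static byte[] loadBytes(String path) throws IOException {
        FileInputStream in = new FileInputStream(path);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) {
            out.write(buf, 0, n);
        }
        in.close();
        return out.toByteArray();
    }
}
```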

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S)
2008-07-29 14:28:14 UTC
Permalink
"Don't connect "database" (i.e. SQL, transactions, etc.) and Lucene.
Connect data storage with simple, fast lookup and Lucene."
Thanks, Grant for the clarification. I see now.

Nagesh
Karsten F.
2008-07-31 08:54:29 UTC
Permalink
Hi Grant,

you mentioned Jackrabbit as an example of storing data in Lucene.
I did not find anything like that in the source code. I found
"LocalFileSystem" and "DatabaseFileSystem".
(I did find Lucene used for indexing and searching.)

Have I overlooked something?

Best regards
Karsten
--
View this message in context: http://www.nabble.com/Using-lucene-as-a-database...-good-idea-or-bad-idea--tp18703473p18750334.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Grant Ingersoll
2008-07-31 09:52:23 UTC
Permalink
Hmmm, I thought it did. Can't say I've studied the code though, so
I'll take your word for it.

Never mind on the Jackrabbit suggestion :-)

Cheers,
Grant
Ganesh - yahoo
2008-07-31 11:58:57 UTC
Permalink
Which one would be best to use as a storage server: Lucene or Jackrabbit?

My requirement is to provide support to
1) Archive the documents
2) Do full text search on the documents.
3) Back up the index store and archive store. [periodical basis]
4) Remove the documents after a certain period [retention policy]

Could Lucene be used as an archival store? Most people on this mailing
list said 'yes'. If so, would using a separate database to archive the data and a
separate database to index it be the better option, or should one database be
used as both archive and index?

One more idea from this list is to use Jackrabbit / JDBM / MySQL to archive
the data. Which would be the best?

I am in the design phase and I have time to explore and prototype any other
products. Please do suggest a good one.

Regards
Ganesh


Karsten F.
2008-07-31 13:12:27 UTC
Permalink
Hi Ganesh,

Nobody in this thread said that Lucene is a good storage server -
only that it could be used as one (Grant: "Connect data storage with
simple, fast lookup and Lucene").

I don't know about automatic retention.
But for the rest of your feature list I suggest taking a deep look at
- Jackrabbit (standard JCR (JSR-170) implementation; I like the WebDAV support)
- DSpace (real, working content repository software, with good permissions
management)

Both use Lucene for searching.

Best regards
Karsten
Post by Ganesh - yahoo
which one will be the best to use as storage server. Lucene or Jackrabbit.
My requirement is to provide support to
1) Archive the documents
2) Do full text search on the documents.
3) Do backup the index store and archive store. [periodical basis]
4) Remove the documents after certain period [rentention policy]
Whether Lucene could be used as archival store. Most of them in this mailing
list said 'yes'. If so going for separate database to archive the data and
separate database to index it, will be better option or one database to be
used as archive and index.
One more idea from this list is to use Jackrabbit / JDBM / My SQL to archive
the data. Which will be the best?
I am in desiging phase and i have time to explore and prototype any other
products. Please do suggest me a good one.
Regards
Ganesh
--
View this message in context: http://www.nabble.com/Using-lucene-as-a-database...-good-idea-or-bad-idea--tp18703473p18754258.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Andy Liu
2008-07-31 14:46:34 UTC
Permalink
If essentially all you need is key-value storage, Berkeley DB for Java works
well. Lookup by ID is fast, and it supports iterating through documents,
secondary keys, updates, etc.

Lucene would work relatively well for this, although inserting documents
might not be as fast, because segments need to be merged and data ends up
getting copied over again at certain points. So if you're running a batch
process with a lot of inserts, you might get better throughput with BDB as
opposed to Lucene, but, of course, benchmark to confirm ;)

Andy
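Andy's suggestion might look roughly like this with Berkeley DB Java Edition. The environment directory, database name, and keys are illustrative assumptions:

```java
import java.io.File;

import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

// Sketch: plain key-value storage with Berkeley DB Java Edition.
public class BdbExample {
    public static void main(String[] args) throws Exception {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("bdb-env"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "docs", dbConfig);

        // put/get by key; full iteration would use a Cursor instead.
        db.put(null, new DatabaseEntry("doc-1".getBytes("UTF-8")),
                new DatabaseEntry("page contents".getBytes("UTF-8")));

        DatabaseEntry value = new DatabaseEntry();
        OperationStatus status = db.get(null,
                new DatabaseEntry("doc-1".getBytes("UTF-8")),
                value, LockMode.DEFAULT);
        if (status == OperationStatus.SUCCESS) {
            System.out.println(new String(value.getData(), "UTF-8"));
        }

        db.close();
        env.close();
    }
}
```

Unlike Lucene, inserts here do not trigger segment merges, which is why batch-insert throughput may be better, as Andy notes.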

Ganesh - yahoo
2008-08-01 05:44:40 UTC
Permalink
Thanks Andy and Karsten.

Marcus Herou
2008-08-03 15:07:43 UTC
Permalink
And for the heck of it, I implemented a BerkeleyDB-backed "java.util.Map" storage as
well.

http://dev.tailsweep.com/svn/abstractcache/trunk/src/main/java/org/tailsweep/abstractcache/disk/sleepycat/BerkelyDbCache.java

Kindly

//Marcus
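Marcus's linked class is not reproduced here; as a rough, self-contained illustration of the same idea - exposing a key-value store through java.util.Map - a wrapper might look like this (the backing HashMap stands in for a real BerkeleyDB Database):

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Rough illustration in the spirit of the linked BerkelyDbCache,
// not the actual code: a Map facade over a key-value store.
public class StoreBackedMap extends AbstractMap<String, byte[]> {
    private final Map<String, byte[]> backing = new HashMap<String, byte[]>();

    public byte[] put(String key, byte[] value) {
        // A real implementation would serialize to DatabaseEntry and db.put().
        return backing.put(key, value);
    }

    public byte[] get(Object key) {
        return backing.get(key);
    }

    public Set<Map.Entry<String, byte[]>> entrySet() {
        return backing.entrySet();
    }

    public static void main(String[] args) {
        Map<String, byte[]> store = new StoreBackedMap();
        store.put("doc-1", "hello".getBytes());
        System.out.println(new String(store.get("doc-1"))); // prints "hello"
        System.out.println(store.size()); // prints "1"
    }
}
```

The appeal of this facade is that application code depends only on `java.util.Map`, so the persistence engine can be swapped without touching callers.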
Post by Ganesh - yahoo
Thanks Andy and Karsten.
Sent: Thursday, July 31, 2008 8:16 PM
Subject: Re: Using lucene as a database... good idea or bad idea?
Post by Andy Liu
If essentially all you need is key-value storage, Berkeley DB for Java works
well. Lookup by ID is fast, and it can iterate through documents, supports
secondary keys, updates, etc.
Lucene would work relatively well for this, although inserting documents
might not be as fast, because segments need to be merged and data ends up
getting copied over again at certain points. So if you're running a batch
process with a lot of inserts, you might get better throughput with BDB as
opposed to Lucene, but, of course, benchmark to confirm ;)
Andy
On Thu, Jul 31, 2008 at 9:12 AM, Karsten F.
Post by Karsten F.
Hi Ganesh,
in this thread nobody said that Lucene is a good storage server.
Only that "it could be used as a storage server" (Grant: connect data storage
with simple, fast lookup and Lucene..)
I don't know about automatic retention.
But for the rest of your feature list I suggest taking a deep look at
- Jackrabbit (the standard JCR (JSR-170) implementation; I like the WebDAV
support)
- DSpace (real working content-repository software, with good permissions
management)
Both use Lucene for searching.
Best regards
Karsten
Post by Ganesh - yahoo
Which one will be the best to use as a storage server: Lucene or
Jackrabbit?
My requirement is to provide support to
1) Archive the documents
2) Do full-text search on the documents
3) Back up the index store and archive store [on a periodic basis]
4) Remove the documents after a certain period [retention policy]
Whether Lucene could be used as an archival store: most people on this
mailing list said 'yes'. If so, is a separate database to archive the data
and a separate database to index it the better option, or should one
database be used as both archive and index?
One more idea from this list is to use Jackrabbit / JDBM / MySQL to
archive the data. Which will be the best?
I am in the designing phase and I have time to explore and prototype any
other products. Please do suggest me a good one.
Regards
Ganesh
--
http://www.nabble.com/Using-lucene-as-a-database...-good-idea-or-bad-idea--tp18703473p18754258.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
***@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
Marcus Herou
2008-08-03 15:08:40 UTC
Permalink
Ah well, some test classes could be appropriate:
http://dev.tailsweep.com/svn/abstractcache/trunk/src/test/java/org/tailsweep/abstractcache/test/
Post by Marcus Herou
[quoted message snipped]
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
***@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
Marcus Herou
2008-08-03 13:23:41 UTC
Permalink
Hi.

I have actually created a solution where I use Lucene as a tuple (key/value)
storage. It was surprisingly fast, and it has more balanced read/write
performance compared to BTrees, though I think it is a little slower at
retrieving a value per key than a tree.

Look here for hints:
http://dev.tailsweep.com/svn/abstractcache/trunk/src/main/java/org/tailsweep/abstractcache/disk/lucene/LuceneCache.java

Kindly

//Marcus
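For readers landing here later: the core of the key/value trick is to index the key as an untokenized field and keep the value as a stored-only field, with updates done as a delete-plus-add on the key term. A rough sketch against the Lucene 2.x API of this era (not Marcus's actual LuceneCache code; the class and method split below are invented for illustration, and the API names shifted in later Lucene releases):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class LuceneKeyValueSketch {

    // put(key, value): updateDocument deletes any existing document with
    // the same key term, then adds the new one: a key/value upsert.
    static void put(String indexDir, String key, String value) throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer());
        Document doc = new Document();
        // key: indexed but not tokenized, so TermQuery matches it exactly
        doc.add(new Field("key", key, Field.Store.YES, Field.Index.UN_TOKENIZED));
        // value: stored only, never indexed
        doc.add(new Field("value", value, Field.Store.YES, Field.Index.NO));
        writer.updateDocument(new Term("key", key), doc);
        writer.close();
    }

    // get(key): exact-match term lookup on the key field.
    static String get(String indexDir, String key) throws Exception {
        IndexSearcher searcher = new IndexSearcher(indexDir);
        Hits hits = searcher.search(new TermQuery(new Term("key", key)));
        String value = hits.length() > 0 ? hits.doc(0).get("value") : null;
        searcher.close();
        return value;
    }
}
```

Opening a writer per put is of course far too slow in practice; a real cache keeps one writer open and batches commits, which is also why Andy's point about segment merges dominating insert throughput matters here.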
Post by Andy Liu
[quoted message snipped]
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
***@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
Matthew Hall
2008-07-29 19:28:52 UTC
Permalink
Yeah, we do the same thing here for indexes of up to 57M documents
(rows), and that's just one part of our implementation.

It takes quite a bit of wrangling to use Lucene in this manner, but
we've found it to be utterly worthwhile.

Matt
Post by Ian Lea
John
I think it's a great idea, and do exactly this to store 5 million+
documents with info that it takes way too long to get out of our
Oracle database (think days). Not as many docs as you are talking
about, and less data for each doc, but I wouldn't have any concerns
about scaling. There are certainly Lucene indexes out there bigger
than what you propose. You can compress the stored data to save some
space. Run times for optimization might get interesting, but see
recent threads for suggestions on that. And since you are not too
concerned about performance, you may not need to optimize much, or
even at all.
Of course you need to remember that this is not a DBMS solution in the
sense of transactions, recovery, etc., but I'm sure you are already
aware of that.
--
Ian.
Post by John Evans
[original message quoted in full; snipped]
--
Matthew Hall
Software Engineer
Mouse Genome Informatics
***@informatics.jax.org
(207) 288-6012
Bill Janssen
2008-07-29 16:39:24 UTC
Permalink
I do this with uplib (http://uplib.parc.com/) with fair success.
Originally I thought I'd need Lucene plus a relational database to
store metadata about the documents for metadata searches. So far,
though, I've been able to store the metadata in Lucene and use the
same Lucene DB for both metadata and content.

Bill
Chris Lu
2008-07-29 21:37:42 UTC
Permalink
It surely is possible. AFAIK, LinkedIn uses Lucene to store some data.

But a Lucene index is, in a sense, similar to a database index. Both are data
structures for a specialized and limited query execution path.

So this depends on your application's queries and on how you create the
Lucene index. The normal usage you listed sounds reasonable.
But you may also need to think about maintenance. In case the index somehow
becomes corrupted, you may also consider storing the data in a database,
which is easier to manipulate manually.
--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!
Post by John Evans
[original message quoted in full; snipped]
John Evans
2008-07-30 17:47:48 UTC
Permalink
Hi All,

Thanks for all of the feedback. Largely as a result of the responses I've
received from the mailing list, Lucene has made its way onto our short
list of possible solutions. I'm not sure what the timeframe is for
implementing a prototype and testing it, but I will try to report back with
the results when/if it happens.

Thanks again,
-
John
Post by Chris Lu
[quoted message snipped]
Marcelo Ochoa
2008-07-30 19:41:52 UTC
Permalink
Hi John:
Have you tried, or do you know about, Lucene Domain Index for the Oracle
database?
http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
If you are using Oracle 10g/11g, it is completely integrated into the
Oracle memory space, like Oracle Text but based on Lucene.
No network round trip is involved at indexing/querying time, and the
Lucene store is replaced by BLOB database storage.
You can also query your Oracle text store directly via SQL with two new
operators, lcontains and lscore, which are based on Lucene and integrated
directly with the Oracle execution plan.
Best regards, Marcelo.
Post by John Evans
[original message quoted in full; snipped]
--
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
______________
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java &
Web Services"
http://www.amazon.com/gp/product/1555583296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/
Jason Rutherglen
2008-07-30 20:44:49 UTC
Permalink
A possible open-source solution using a page-based database would be to
store the documents in http://jdbm.sourceforge.net/, which offers BTree,
Hash, and raw page-based access. One would use a primary-key-style
persistent ID to look up the document data from JDBM.

It would be a good Lucene project to implement, and I think a good solution
for Ocean (LUCENE-1313). Storing documents in Lucene is fine, but for a
realtime search index with many documents being deleted, a lot of garbage
builds up, and frequent merging of document files becomes IO intensive.

One open question with JDBM (and I am not sure how other page-based
systems handle this) is whether it can load individual fields directly from
disk rather than loading the entire page into RAM first. Maybe it does not
matter.
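Jason's suggestion boils down to: serialize each crawled document's arbitrary field map into one byte record, then key it by a persistent ID in a BTree. A self-contained sketch of that record layout, with an in-memory TreeMap standing in for JDBM's persistent BTree (the actual JDBM API differs, and the RecordCodec class here is a name invented for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class RecordCodec {

    // Serialize an arbitrary field map into one byte[] record, the way a
    // page-based store would keep it under a single record id.
    // Note: writeUTF caps each string at 64 KB, which fits the 10k-50k
    // documents discussed in this thread.
    static byte[] encode(Map<String, String> fields) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(fields.size());
        for (Map.Entry<String, String> e : fields.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
        out.flush();
        return bos.toByteArray();
    }

    static Map<String, String> decode(byte[] record) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
        int n = in.readInt();
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (int i = 0; i < n; i++) {
            fields.put(in.readUTF(), in.readUTF());
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        // TreeMap stands in for the persistent BTree; the long key is the
        // crawler's document id.
        SortedMap<Long, byte[]> store = new TreeMap<Long, byte[]>();
        Map<String, String> doc = new LinkedHashMap<String, String>();
        doc.put("url", "http://example.com/");
        doc.put("title", "Example");
        store.put(42L, encode(doc));
        System.out.println(decode(store.get(42L)).get("title")); // prints "Example"
    }
}
```

Because each document is one opaque record, reading any single field still pulls the whole record (and its page) off disk, which is exactly the per-field loading concern raised above.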
Post by John Evans
[original message quoted in full; snipped]
Marcus Herou
2008-08-03 12:49:21 UTC
Permalink
Hi. Don't use JDBM if you want a BTree; it is too slow... Been there.

Use Xindice's Filer class, org.apache.xindice.core.filer.Filer, and
instantiate an org.apache.xindice.core.filer.BTreeFiler, which is a whole
lot faster.

JDBM and Xindice are comparable in reading speed, but surely not in writing.

Check out my implementation of the BTree:
http://dev.tailsweep.com/svn/abstractcache/trunk/src/main/java/org/tailsweep/abstractcache/disk/xindice/TreeCache.java

Kindly

//Marcus



On Wed, Jul 30, 2008 at 10:44 PM, Jason Rutherglen <
Post by Jason Rutherglen
[quoted message snipped]
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
***@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/