koha MARC-DB structure
Fast access
a MARC record can have a LOT of details. Most of them are useful only once a year or less. And most will never be useful for an OPAC user. That's why we must keep data in 2 parts : a MARC-DB and a non-MARC-DB. The non-MARC-DB can be the 1.1.2 one (or a derivative).
marc_biblio
The marc_biblio is the entry table for all biblio.
Fields :
marc_Subfield_Table
Stores, for each value of each subfield of each tag of a biblio the data line. The marc_word table is used for searches. When a line is more than 255 char long, a link to a blob record in the blob_table is used and the value is in this blob. The char is empty.
Fields :
marc_word
This table contains the datas used for searches. The idea is explained in the next chapter. This table stores one record for each word in the marc section.
Fields :
Words longer than 255 char are ignored (in french, the longest word is 26 char long ) Words present in the stopwords table are not stored in the MARC-DB
DISCUSSION about MARC searches The most important thing about MARC tables, is that searches must be as well optimized as possible. Anyway, a search is always a headache in MARC, as the simple solution would be to have a biblio stored in a single line. As it's not possible to built a flat table with all the possible MARC tags, we must use a complex algorithm to do a search.
Suppose we must do a select with :
If we have 10 tables, a select should be : select biblio_id from 2xx_subfield_table as 2xx,
1xx_subfield_table as 1xx, 0xx_subfield_table as 0xx
where (tag=“200” and 2xx.Mark=“a” and (2xx.value like ”%lord%” and 2xx.value like ”%ring%”)) and (tag=“101” and 1xx.mark=“b” and 1xx.value like ”%Tolkien%”) and (tag=“001” and 0xx.mark=“a” and 0xx.value=“123456”) and (0xx.Bib_Id=1xx.Bib_Id and 0xx.bib_Id=2xx.Bib_id)
This select is quite complex, but in 1 select, we can find what we want. The only problem is this select uses 3 tables, 2 indexed joins (the last line), an indexed select on table 0xx and a partially indexed select on tables 1xx and 2xx. So, on an important library, this could be not fast… (this proves we MUST have bib_id as common key in all tables : with differents id, we would have need )
With the tag_word field, we can do only indexed searches. The search algorythm could search the Bib_id. If we want to search (A or B) and not(C), we can do 2 indexed select in arrays, merge both arrays, do a selected select for C and delete Bib_id found from the previous array.