marcdbstructure [Koha Developer Wiki]

koha MARC-DB structure Fast access a MARC record can have a LOT of details. Most of them are useful only once a year or less. And most will never be useful for an OPAC user. That's why we must keep data in 2 parts : a MARC-DB and a non-MARC-DB. The non-MARC-DB can be the 1.1.2 one (or a derivative).

marc_biblio The marc_biblio is the entry table for all biblio. Fields :

Bib_code
Date_created
Date modified
Origin_code : this code is used to store the differents origins from the biblio records. It might be useful in the case of exchanges between libraries.

marc_Subfield_Table Stores, for each value of each subfield of each tag of a biblio the data line. The marc_word table is used for searches. When a line is more than 255 char long, a link to a blob record in the blob_table is used and the value is in this blob. The char is empty. Fields :

Bib_code
Tag_number char(3)
Tagorder
Subfield_code char(1)
subfield_order
Tag_value varchar(255) (used only if length(value) <255)
Blob_link_id longint (used only if length(value) >255)

marc_word This table contains the datas used for searches. The idea is explained in the next chapter. This table stores one record for each word in the marc section. Fields :

Bib_code
Tag_number char(3)
Subfield_id char(1)
Word varchar(255)

Words longer than 255 char are ignored (in french, the longest word is 26 char long ;-) ) Words present in the stopwords table are not stored in the MARC-DB

DISCUSSION about MARC searches The most important thing about MARC tables, is that searches must be as well optimized as possible. Anyway, a search is always a headache in MARC, as the simple solution would be to have a biblio stored in a single line. As it's not possible to built a flat table with all the possible MARC tags, we must use a complex algorithm to do a search.

Suppose we must do a select with :

tag=200, $a contains “lord” and “ring”
tag=101, $b contains “Tolkien”
tag=001, $a equals “123456”

If we have 10 tables, a select should be : select biblio_id from 2xx_subfield_table as 2xx,

1xx_subfield_table as 1xx,
0xx_subfield_table as 0xx

where (tag=“200” and 2xx.Mark=“a” and (2xx.value like ”%lord%” and 2xx.value like ”%ring%”)) and (tag=“101” and 1xx.mark=“b” and 1xx.value like ”%Tolkien%”) and (tag=“001” and 0xx.mark=“a” and 0xx.value=“123456”) and (0xx.Bib_Id=1xx.Bib_Id and 0xx.bib_Id=2xx.Bib_id)

This select is quite complex, but in 1 select, we can find what we want. The only problem is this select uses 3 tables, 2 indexed joins (the last line), an indexed select on table 0xx and a partially indexed select on tables 1xx and 2xx. So, on an important library, this could be not fast… (this proves we MUST have bib_id as common key in all tables : with differents id, we would have need )

With the tag_word field, we can do only indexed searches. The search algorythm could search the Bib_id. If we want to search (A or B) and not(C), we can do 2 indexed select in arrays, merge both arrays, do a selected select for C and delete Bib_id found from the previous array.