Koha 3.0 No Zebra
Accepted limits for noZebra koha 3.0
Storing records in SQL
Searching something
- Handling CCL querying
- Sorting results
Stop words

Koha 3.0 No Zebra

Koha 3.0 will be quite hard to set up and administer, and small libraries will probably prefer to stay with a SQL-only Koha.

This page records my first thoughts on a Koha 3.0 without Zebra (SQL only).

Zebra introduces two major things: indexes and the CCL query language. Indexes group MARC fields/subfields to make them searchable. CCL helps users query the catalogue. A SQL-only version of Koha should keep both features, as they are worthwhile improvements. Such a version must also be part of the official release and be as transparent and easy to maintain as possible. Modifying rel_3_0 deeply and keeping 2 different CVS branches is therefore ruled out.

Fortunately, all Zebra-related functions live in only 2 places:
- storing MARC records is done in a single function, “zebraop” (in Biblio.pm). At the moment it is not a “true” Zebra store: it only notes in the SQL table zebraqueue that a cron script must push the biblio/authority into Zebra.
- retrieving MARC records is also done in a single function, “getRecords” (in Search.pm).

Accepted limits for noZebra koha 3.0

We agree that noZebra Koha 3.0 is for small libraries, and that it can lack some features of the full Koha 3.0. For example, true CCL queries can contain “proximity” statements; noZebra Koha 3.0 probably won't support them. The limits are listed here:

no ranking features
no proximity feature
limited “order by” possibilities (title, author, callnumber, publicationyear, publishercode)
no () in CCL queries
no more than 4 AND/OR/NOT (X or Y or Z or T or U won't work)

Storing records in SQL

Koha 3.0NZ (NZ = No Zebra, not New Zealand :-) ) will let the user define as many indexes as needed. An index consists of:
- a 'name' (like 'title', 'author', …)
- a list of MARC subfields attached to the index name ('700*','701*','702*','200f' for UNIMARC author: 200$f and all 700, 701, 702 subfields)

The index will be stored as an inverted index in a SQL table. Each row of the inverted index will contain the name of the index, the indexed word, and the list of biblionumbers containing that word, separated by commas (,). For example:
TI HUGO 1021,54,2578,1021,1257

If a word occurs more than once in a given biblio, its biblionumber can appear that many times in the list.
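A minimal sketch of such a table, using Python's built-in sqlite3 module purely for illustration (Koha itself is written in Perl). The table name inverted_index and the column names are assumptions, not an actual Koha schema.

```python
import sqlite3

def create_index_table(conn):
    # Hypothetical schema: one row per (index name, word) pair,
    # with the posting list kept as a comma-separated string.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inverted_index ("
        " indexname TEXT NOT NULL,"
        " word TEXT NOT NULL,"
        " biblionumbers TEXT NOT NULL DEFAULT '',"
        " PRIMARY KEY (indexname, word))"
    )

def read_postings(conn, indexname, word):
    # Return the list of biblionumbers for one word, duplicates included.
    row = conn.execute(
        "SELECT biblionumbers FROM inverted_index"
        " WHERE indexname = ? AND word = ?",
        (indexname, word),
    ).fetchone()
    if row is None or row[0] == "":
        return []
    return [int(n) for n in row[0].split(",")]

conn = sqlite3.connect(":memory:")
create_index_table(conn)
# The example row from the text above.
conn.execute(
    "INSERT INTO inverted_index VALUES (?, ?, ?)",
    ("TI", "HUGO", "1021,54,2578,1021,1257"),
)
postings = read_postings(conn, "TI", "HUGO")
```

Biblio 1021 appears twice in the returned list because the word occurs twice in that record.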

adding something

It's quite easy to add something with this structure: for every field/subfield in a given index, and for every word it contains, find the row for that word (or create a new one) and append the biblionumber to the list.
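The add step could be sketched as follows, with an in-memory dict standing in for the SQL table. The author definition is the one given above; the title definition ('200a'), the tokenizer and the (tag, text) record layout are assumptions made for the example.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase

# Index name -> list of MARC subfield patterns feeding that index.
INDEX_DEFS = {
    "author": ["700*", "701*", "702*", "200f"],
    "title": ["200a"],  # illustrative assumption
}

def tokenize(text):
    # Split a subfield value into upper-cased words (assumed normalisation).
    return re.findall(r"\w+", text.upper())

def add_biblio(index, biblionumber, record):
    # record is a list of (subfield tag, text) pairs, e.g. ("200f", "Victor Hugo").
    for tag, text in record:
        for indexname, patterns in INDEX_DEFS.items():
            if any(fnmatchcase(tag, pattern) for pattern in patterns):
                for word in tokenize(text):
                    # Find the row for this word (or create it) and append.
                    index[(indexname, word)].append(biblionumber)

index = defaultdict(list)
add_biblio(index, 54, [("200a", "Les Misérables"), ("200f", "Victor Hugo")])
add_biblio(index, 1021, [("200a", "Hugo, Hugo"), ("700a", "Hugo")])
```

Here index[("title", "HUGO")] holds 1021 twice, since the word occurs twice in that title.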

modifying something

It's a little trickier, as it requires deleting all entries in the inverted list that no longer apply (and adding the new ones).

The easiest solution (CPU-consuming, but we don't care: it's for libraries with a small catalogue, and updates are not that frequent) will probably be to:
- read the previous biblio
- delete all its index entries
- create them again.

deleting something

With the deleted biblio in hand, we can remove it from the indexes: read the row for each of its words, strip the biblionumber with a regexp, and update the database.
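Both the delete step and the "delete, then create again" approach to modification could look like the sketch below. It uses an in-memory dict keyed by (index name, word) as a stand-in for the SQL table and filters a list instead of applying a regexp to the stored string; the function names are hypothetical.

```python
def remove_biblio(index, biblionumber, old_entries):
    # old_entries: the (indexname, word) pairs produced by the previous
    # version of the biblio, so only the rows it touched are read.
    for key in set(old_entries):
        remaining = [n for n in index.get(key, []) if n != biblionumber]
        if remaining:
            index[key] = remaining
        else:
            index.pop(key, None)  # drop rows whose posting list is now empty

def modify_biblio(index, biblionumber, old_entries, new_entries):
    # Simplest strategy: delete every old entry, then create them all again.
    remove_biblio(index, biblionumber, old_entries)
    for key in new_entries:
        index.setdefault(key, []).append(biblionumber)

index = {
    ("author", "HUGO"): [54, 1021],
    ("title", "PARIS"): [1021],
    ("title", "HUGO"): [1021, 1021],
}
modify_biblio(
    index,
    1021,
    old_entries=[("author", "HUGO"), ("title", "PARIS"),
                 ("title", "HUGO"), ("title", "HUGO")],
    new_entries=[("author", "HUGO"), ("title", "LONDON")],
)
```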

Searching something

Searching with the previous DB is quite easy: if a user wants books from “hugo”, we just have to retrieve the author/hugo entry in the inverted index, and we have the result.

OK, but:
- how to handle CCL querying (ti=VICTOR HUGO and au=Paris)?
- how to sort the results (as the inverted list is unordered)?

Handling CCL querying

The best (only?) solution is to write a basic CCL parser. By “basic” I mean something simple to write, and thus without some CCL features. For example, our goal could be to understand queries made of an index, an operator, and an operand, repeated and separated by and/or/not (up to 4 of them), the operand being optionally enclosed in double quotes.

For example, we could manage:
- author=HUGO
- author=“victor hugo”
- author=“hugo” NOT title=“paris”
- publisher=“pub1” OR publisher=“pub2”
- publicationyear >= 2000

Such a parser will probably not be that easy to write (and I'm searching CPAN for something that could help us).
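To give an idea of the scale involved, here is a rough sketch of such a basic parser. It accepts only straight double quotes, has no parentheses, and returns a flat list of terms. The function name, the tuple layout and the reading of the limit as "at most 4 terms" (so that X or Y or Z or T or U is rejected) are assumptions.

```python
import re

# index name, comparison operator, then a quoted or bare operand
TERM = re.compile(r'\s*(\w+)\s*(>=|<=|=|>|<)\s*("([^"]*)"|\S+)')
BOOL = re.compile(r'\s*(and|or|not)\b', re.IGNORECASE)

def parse_ccl(query, max_terms=4):
    # Returns a list of (connector, index, operator, operand) tuples;
    # connector is None for the first term, else "AND", "OR" or "NOT".
    terms, pos, connector = [], 0, None
    while True:
        m = TERM.match(query, pos)
        if not m:
            raise ValueError("expected index, operator and operand at %d" % pos)
        operand = m.group(4) if m.group(4) is not None else m.group(3)
        terms.append((connector, m.group(1).lower(), m.group(2), operand))
        if len(terms) > max_terms:
            raise ValueError("too many terms for the basic parser")
        pos = m.end()
        if not query[pos:].strip():
            return terms
        b = BOOL.match(query, pos)
        if not b:
            raise ValueError("expected AND, OR or NOT at %d" % pos)
        connector = b.group(1).upper()
        pos = b.end()

parsed = parse_ccl('author="hugo" NOT title="paris"')
```

In this sketch, parsed is a two-item list: the author term with no connector, then the title term with the NOT connector.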

Once it is written, however, it should not be too hard to query the SQL indexes to find what we need. For publisher=“pub1” OR publisher=“pub2”, for example, it will be:
* read the publisher/pub1 row and create an array of biblionumbers
* read the publisher/pub2 row and create an array of biblionumbers
* take the union of both arrays (as it's an “OR”; for an AND, intersect them, and for a NOT, remove the second array's values from array1)

Then we would have the list of biblionumbers.
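The combination step maps directly onto set operations. The sketch below evaluates terms strictly left to right, which matches the "no parentheses" limit. It handles only single-word equality lookups, and the dict layout and function names are assumptions.

```python
def lookup(index, indexname, word):
    # Read one row of the inverted index as a set of biblionumbers.
    return set(index.get((indexname, word.upper()), []))

def evaluate(index, terms):
    # terms: list of (connector, indexname, word); connector is None
    # for the first term, then "AND", "OR" or "NOT".
    result = set()
    for connector, indexname, word in terms:
        hits = lookup(index, indexname, word)
        if connector is None:
            result = hits
        elif connector == "OR":
            result |= hits   # union
        elif connector == "AND":
            result &= hits   # intersection
        elif connector == "NOT":
            result -= hits   # exclude from what we already have
        else:
            raise ValueError("unknown connector: %r" % connector)
    return result

index = {
    ("publisher", "PUB1"): [1, 2, 2],
    ("publisher", "PUB2"): [2, 3],
    ("title", "PARIS"): [3],
}
either = evaluate(index, [(None, "publisher", "pub1"), ("OR", "publisher", "pub2")])
```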

Sorting results

The biblionumber array consists of unordered numbers, so the difficulty is sorting the result. The easiest solution would be to read all the biblios and then sort them, which is quite CPU-consuming.

A better solution is to assume that most of the time results will be ordered by title, and to embed sorting information in the index. How? When storing a biblionumber in the inverted index, we could store the title alongside it, so that we get both at the same time and sorting is faster. However, titles can be long, so the inverted index could become large. I therefore suggest limiting the stored title to its first 10 letters. That should be enough for a perfect sort 99% of the time, with a small misordering only where two titles first differ at the 11th letter or later. As 3.0NZ catalogues should be small, I bet it will in fact be OK 99.99% of the time. Another advantage of this method is that it lets us remove words that are useless for ordering, depending on the MARC flavour (as MARC21 and UNIMARC handle them differently).

Thus, for the books “Perl in action”, “Programming Perl” and “Java OO programming”, the inverted index would be:
Title / Perl / 1004-Perl in ac;257-Programmin;257-Programmin;
Title / Action / 1004-Perl in ac;
Title / Programming / 257-Programmin;753-Java OO pr;
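A short sketch of how such entries could be written and then sorted without reading the biblios. The biblionumber-title and ";" layout follows the example above; the function names, the case-insensitive comparison and the removal of duplicates are assumptions.

```python
def make_posting(biblionumber, title, key_length=10):
    # Keep only the first key_length characters of the title as sort key.
    return "%d-%s" % (biblionumber, title[:key_length])

def sorted_biblionumbers(postings):
    # postings: one row of the index, e.g. "257-Programmin;753-Java OO pr;"
    entries = set()
    for item in postings.split(";"):
        if item:
            # The biblionumber never contains "-", so split on the first one.
            number, _, key = item.partition("-")
            entries.add((key.lower(), int(number)))
    # Sort on the truncated title, then return just the biblionumbers.
    return [number for key, number in sorted(entries)]

entry = make_posting(1004, "Perl in action")
ordered = sorted_biblionumbers("257-Programmin;753-Java OO pr;")
```

With the "Programming" row from the example, biblio 753 ("Java OO pr") sorts before biblio 257 ("Programmin").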

Stop words

The stop word list will have to be reintroduced for 3.0NZ. Stop words will still be stored in the inverted index and will be removed only at search time: if a word is added to or removed from the stop word list, nothing will have to be rebuilt.
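In other words, the filtering happens on the query side only, roughly as in this sketch. The stop word list and the function name are purely illustrative.

```python
# Illustrative stop word list; the real list would be configurable.
STOP_WORDS = {"THE", "OF", "IN", "A"}

def search_words(query, stop_words=STOP_WORDS):
    # Drop stop words from the user's query before looking words up;
    # the index itself still contains every word, stop words included.
    return [w for w in query.upper().split() if w not in stop_words]

words = search_words("Perl in action")
```

Changing the stop word list only changes which words get looked up, so the stored index stays valid.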