Features requested for future versions of Koha, and the version that should contain them: 2.2.x or 2.4.0

When a feature does not require a DB change, it goes into 2.2.x. When a feature requires an important DB change, it goes into 2.4.0. When a feature requires a minor DB change, it depends ;-)

Feature | Version | Notes
Parameters / systempreferences | 2.2.x | see the proposal at http://katipo.co.nz/gallery/koha-admin
Stats | 2.2.x | Stats module funded by EMN in France, developed by Paul Poulain ⇒ DONE in 2.2.2
User card generator | 2.4.0 | see http://biblio.fisica.unlp.edu.ar/sitio/librarian/kohapf/dos/en
Calendar use | 2.4.0 | see http://biblio.fisica.unlp.edu.ar/sitio/librarian/kohapf/tres/en
User account, receipt print | 2.4.0 | see http://biblio.fisica.unlp.edu.ar/sitio/librarian/kohapf/cuatro/en
Suspensions and fees | 2.4.0 | see http://biblio.fisica.unlp.edu.ar/sitio/librarian/kohapf/cinco/en
Ranking | 2.4.0 | We could add the possibility for borrowers to rank books and store comments; rankings and comments could be read by other borrowers
Dev Docs | 2.4.0 | Every module perldoc'd, following coding standards, and releases following release standards
OPAC CSS changes | 2.4.0 | Add a feature to change the stylesheet dynamically (X stylesheets for a given theme). The systemprefs hold the default CSS for the theme, and an identified user can define their own, and also define a default language ⇒ DONE IN 2.2.1
Different timeout for OPAC and librarian | 2.2.x | Have a different timeout for the OPAC and for the librarian interface
Image for itemtypes | 2.2.x or 2.4.0 | Attach an image to each itemtype, shown in the OPAC
UNICODE support | 2.2.x | see the discussion on the koha-dev mailing list: http://sourceforge.net/mailarchive/message.php?msg_id=9764230
List chooser for subject & other fields (OPAC) | 2.2.x | Add a "…" button that opens a popup to help the user select an existing subject. Extend this popup to other fields eventually (authors…). Deal with authorities for that (rejected form, accepted form…)

Possibly a project for some brave soul: building a set of helper features for small/personal catalog users. [i.e. simplifying the hunting down of MARC data from an ISBN is one thing that other small-catalog systems do well enough.] More of a user-friendliness thing.

Koha for big libraries

The DB scheme has some mistakes or limits that prevent Koha from reaching high performance in large-scale libraries. This chapter tries to explain them and find solutions.

Catalogue

The structure with marc_word as the index has limits. MySQL full-text search has limitations we can't accept (MySQL-dependent, problems with extended characters…), which is why the marc_word table was introduced in 2.0: it contains one line for each tag/subfield/word in the MARC biblio. It works fine, but has far too many lines for a large library. Another problem: when searching for 2 words that are each heavily used, say 10,000 and 15,000 times, the 25,000 lines have to be read before joining them, perhaps yielding only 3 results. Far too much work once again.

The solution would be to use an "inverted index": keep only one line per word, holding the list of every bibid containing that word. For example, if the word "Koha" appears in bibid #2, #150 and #1532, we would have a single line: word: Koha, bibidlist: 2,150,1532. This scheme is much better from a performance point of view:

  • The number of lines in the table tends to be limited, as there is only a limited number of words in a given language (yes, I know that in German you have a virtually infinite number of words…)
  • A search on 2 words used 10,000 and 15,000 times requires only 2 SQL reads, plus creating 2 arrays in Perl and joining them. Quite an easy task for Perl (see the sketch below).

But it gives another improvement: ranking becomes easier, since we could count how many times a word is used and rank the results with that. Some tests on a French DB (the biggest I have: 45,000 biblios) show the table size cut by a factor of 8.
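As a rough illustration of the lookup described above, here is a minimal Perl sketch. The marc_inverted table name and its word/bibidlist columns are assumptions for this proposal (they are not part of the current 2.x schema), and the connection details are placeholders.

<code perl>
#!/usr/bin/perl
# Sketch of a two-word search against a hypothetical inverted-index table
# `marc_inverted` (word, bibidlist) -- table and column names are illustrative.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( "DBI:mysql:database=koha", "kohaadmin", "password",
    { RaiseError => 1 } );

# One SQL read per word: fetch the comma-separated list of bibids.
sub bibids_for_word {
    my ($word) = @_;
    my ($list) = $dbh->selectrow_array(
        "SELECT bibidlist FROM marc_inverted WHERE word = ?", undef, $word );
    return $list ? split( /,/, $list ) : ();
}

# Join the two bibid lists in Perl (hash lookup, no further SQL reads).
my @first  = bibids_for_word("koha");
my @second = bibids_for_word("library");
my %seen   = map { $_ => 1 } @first;
my @hits   = grep { $seen{$_} } @second;

print scalar(@hits), " matching biblios: @hits\n";
</code>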

(added by JF) I like the marc_words idea, but it's very hard to update a list of items in a field (possibly requiring full-text searching anyway?). I have an idea for it that may improve ease of updating as well as solve another perf problem. Here's how it works: we have marc_words with one instance of each word per tagsubfield and a 'wordid' (so 'word', 'tagsubfield' and 'wordid', maybe even 'count' for relevance searching if a subfield contains the word more than once). Then we have marc_words_index with 'wordid', 'bibid', and 'display'. 'display' is very important: it contains all the data we want to show on the OPAC initial results page: title, subtitle, copyrightdate, copies held, etc. In fact, perhaps 'display' should be broken up into a number of columns, but the idea is that we only need two queries for all searches: one to grab 'wordid' and one to pull out the data for every bibid on the list. So instead of performing many queries to build our search results page, we only do two. We could even have several 'sort by' columns: one for title, one for author, one for relevance, etc.
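A sketch of that two-query flow, assuming the proposed marc_words (word, tagsubfield, wordid) and marc_words_index (wordid, bibid, display) tables exist exactly as JF describes them; neither exists in the current 2.x schema in this form.

<code perl>
#!/usr/bin/perl
# Two-query search flow: (1) resolve the word to its wordid(s),
# (2) pull bibid + ready-to-display data in a single second query.
# Tables marc_words / marc_words_index are the proposed ones, not current Koha.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( "DBI:mysql:database=koha", "kohaadmin", "password",
    { RaiseError => 1 } );

my $word = "koha";

# Query 1: grab every wordid matching the search term.
my $wordids = $dbh->selectcol_arrayref(
    "SELECT wordid FROM marc_words WHERE word = ?", undef, $word );

exit 0 unless @$wordids;

# Query 2: pull the pre-built display data for every matching bibid.
my $placeholders = join ",", ("?") x @$wordids;
my $results = $dbh->selectall_arrayref(
    "SELECT bibid, display FROM marc_words_index
      WHERE wordid IN ($placeholders)
      ORDER BY display",
    { Slice => {} }, @$wordids );

printf "%s => %s\n", $_->{bibid}, $_->{display} for @$results;
</code>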

Sorting results

Sorting (in 2.2.x) can be requested by title, author and some other orders. With the change described above, we could also sort by ranking; ranking would not require any extra query (the ordering is a by-product of the previous SQL queries). For now, all other sort orders require 1 SQL read for each result line. That can be a lot when there are many results (note: we have to fetch all results before sorting and limiting the display to 20/50/100). Some solutions to improve sorting time (from simple to complex):

  • Limit the results to 200. If there are more than 200 results, return "too many results, do a more precise search". Simple, and could probably be handled in 2.2.3 easily (see the sketch after this list).
  • Save the result set somewhere (server or client side). If the user wants to change the page, the result is already available and there is no need to compute it again (as is the case for now).
  • If we consider that sorting will usually be done on ranking or title (and sometimes on another field), then we could add the title to the previous table: we would then get the title while reading the words, and could sort by title without another SQL read. This solution would require some tests to see how MySQL handles very long TEXT fields (as 10,000 bibids + 10,000 words would make a big field).
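Here is a minimal sketch of the first option (the 200-result cut-off). The marc_word column names are taken from the description above, and the cut-off value, function name and return structure are purely illustrative.

<code perl>
# Guard against huge result sets before sorting: if more than the cut-off,
# ask the user to refine the search instead of fetching every row.
# Illustrative only; column names assumed from the marc_word description above.
my $MAX_RESULTS = 200;

sub search_with_limit {
    my ( $dbh, $word ) = @_;

    # Cheap COUNT first: one indexed read, no sorting work yet.
    my ($count) = $dbh->selectrow_array(
        "SELECT COUNT(*) FROM marc_word WHERE word = ?", undef, $word );

    if ( $count > $MAX_RESULTS ) {
        return { too_many => 1, count => $count };
    }

    # Small enough: fetch everything, sort in Perl, hand back to the template.
    my $rows = $dbh->selectall_arrayref(
        "SELECT bibid FROM marc_word WHERE word = ?",
        { Slice => {} }, $word );
    return { too_many => 0, results => $rows };
}
</code>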

Circulation

For libraries with high circulation rates, the performance of the issues table can be poor. The main idea would be to divide this table in 2: active and non-active issues (i.e. books already returned). Maybe we just need to delete all issues that are more than X months old. That should not affect the stats module too much, as the statistics table contains everything that is needed. Needs some more investigation.
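A minimal sketch of the split/clean-up idea, assuming an old_issues archive table (not in 2.x, it would have the same structure as issues) has been created first; the returndate column is assumed from the 2.x schema and the 12-month window is arbitrary.

<code perl>
# Sketch: move issues returned more than X months ago out of the live table.
# `old_issues` is a hypothetical archive table with the same structure as
# `issues`; the cut-off would likely become a systempreference.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( "DBI:mysql:database=koha", "kohaadmin", "password",
    { RaiseError => 1 } );

my $months = 12;    # arbitrary cut-off for this sketch

# Copy the old, returned issues into the archive table...
$dbh->do(
    "INSERT INTO old_issues
     SELECT * FROM issues
      WHERE returndate IS NOT NULL
        AND returndate < DATE_SUB(CURDATE(), INTERVAL ? MONTH)",
    undef, $months );

# ...then remove them from the live table so circulation queries stay small.
$dbh->do(
    "DELETE FROM issues
      WHERE returndate IS NOT NULL
        AND returndate < DATE_SUB(CURDATE(), INTERVAL ? MONTH)",
    undef, $months );
</code>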

Biblios display formatting

(by Ernie) Our librarians have been requesting different display formatting depending on the itemtype or on the content of some MARC field. Here at http://biblio.ort.edu.uy we've made some improvements, but they are basically hardcoded. The librarians' request also covers the display formatting of the search results list. We've been thinking about it, and we believe the way to achieve this is to always load the MARC records instead of the biblio+biblioitem+items tables. I know this is a perf downgrade, but by how much? Typically I only have to load 20 MARC records for the search results page. If we gather the human resources, we are planning to develop the following feature:

  • Create display categories, with two of them predefined: one for search results and the other for the _normal_ display, plus a big default in any case.
  • Define some kind of equation based on MARC fields which lets the system decide which itemtype a biblio belongs to; this could be a system preference.
  • Then I imagine some kind of matrix addressed by the itemtype and the display category.

The value of the matrix at this point may be a text string which is basically some sort of formatting code (better than ISBD). Then, for example, when the search results page executes, the system reads the MARC records, calculates the itemtype, and assumes the search results display category. Now the system has the formatting codes it needs and runs the corresponding code on each record to generate HTML. With this feature librarians could define display categories and write their own display format for each itemtype: very laborious, but very flexible, and last but not least it's independent of the developers. Following this idea, in the normal display, and based on the itemtype calculated from the MARC record, I could also generate links to display the record in any other format available for that specific record. One of the main problems is defining the grammar of the display code; it must be simple but powerful. We could translate the MARC record to XML and apply XSL code to format it, but that is very time-consuming and I don't believe librarians would understand XSL.
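To make the idea concrete, here is a toy sketch of the (itemtype, display category) matrix with a made-up {tag$subfield} placeholder syntax; the itemtype keys, category names and formatting grammar are all assumptions, not an existing Koha feature.

<code perl>
# Toy version of the proposed display matrix: one formatting string per
# (itemtype, display category), with {tag$subfield} placeholders expanded
# from a MARC::Record. Placeholder syntax and keys are illustrative only.
use strict;
use warnings;
use MARC::Record;

my %display_matrix = (
    BOOK => {
        results => '<b>{245$a}</b> / {100$a} ({260$c})',
        normal  => '<h2>{245$a}</h2><p>{100$a}, {260$b}, {260$c}</p>',
    },
    # default used when no itemtype-specific entry exists
    DEFAULT => { results => '{245$a}', normal => '{245$a}' },
);

sub render_biblio {
    my ( $record, $itemtype, $category ) = @_;
    my $fmt = $display_matrix{$itemtype}{$category}
        || $display_matrix{DEFAULT}{$category};
    # Replace each {tag$subfield} placeholder with the record's value.
    $fmt =~ s/\{(\d{3})\$(\w)\}/ $record->subfield($1, $2) || '' /ge;
    return $fmt;
}

# Usage (with $record a MARC::Record and the itemtype computed elsewhere):
# my $html = render_biblio( $record, 'BOOK', 'results' );
</code>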

 