Zebra Programmer Guide

Here are some commands that you may find useful if you're managing a Zebra installation with Koha.

Counting Records

You can find out how many records are in your database by searching the IR-Explain-1 database from the Z39.50 client prompt:

 Z> base IR-Explain-1
 Z> form sutrs
 Z> f @attr exp1 1=1 databaseinfo
 Sent searchRequest.
 Received SearchResponse.
 Search was a success.
 Number of hits: 4, setno 1
 SearchResult-1: databaseinfo(4)
 records returned: 0
 Elapsed: 0.069880
 Z> s
 Sent presentRequest (1+1).
 Records: 1
 [IR-Explain-1]Record type: SUTRS
 explain:
    databaseInfo: DatabaseInfo
      commonInfo:
        dateAdded: 20020911101011
        dateChanged: 20020911101011
        languageCode: EN
      accessinfo:
        unitSystems:
          string: ISO
        attributeSetIds:
          oid: 1.2.840.10003.3.5
          oid: 1.2.840.10003.3.1
          oid: 1.2.840.10003.3.1000.81.2
        schemas:
         oid: 1.2.840.10003.13.2
      name: gils
      userFee: 0
      available: 1
      recordCount:
        recordCountActual: 48
      zebraInfo:
        recordBytes: 123562
 Elapsed: 0.068221
 Z> s
 Sent presentRequest (2+1).
 Records: 1
 [IR-Explain-1]Record type: SUTRS
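
If you capture that SUTRS output in a script, the count can be pulled out with a few lines of Python. This is only a sketch: the function name record_counts is made up for illustration, and it assumes the output keeps the "name:" / "recordCountActual:" layout shown above.

```python
import re

def record_counts(sutrs_text):
    """Return {database name: recordCountActual} parsed from SUTRS explain output."""
    counts = {}
    name = None
    for line in sutrs_text.splitlines():
        m = re.match(r"\s*name:\s*(\S+)", line)
        if m:
            # Remember the database name until its record count shows up.
            name = m.group(1)
            continue
        m = re.match(r"\s*recordCountActual:\s*(\d+)", line)
        if m and name is not None:
            counts[name] = int(m.group(1))
            name = None
    return counts

# Excerpt of the explain record shown above.
sample = """\
      name: gils
      userFee: 0
      available: 1
      recordCount:
        recordCountActual: 48
      zebraInfo:
        recordBytes: 123562
"""

print(record_counts(sample))
```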

Field weighting

Field weighting is possible, and the zebra documentation now covers it.

A) It is possible to apply dynamic ranking on only parts of the PQF query:

            @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer

This searches for all documents which have the term 'Utah' in the body of text and the term 'Springer' in the publisher field, and sorts them by the relevance ranking computed on the body-of-text index only.

B) Ranking weights may be used to pass a value to a ranking algorithm, using the non-standard BIB-1 attribute type 9.

This allows one field of a query to use one value while another field uses a different one. For example, we can search for utah in the @attr 1=4 index with weight 30, and for city in the @attr 1=1010 index with weight 20:

       @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city

The default weight is given as sqrt(1000) ~ 34 (note that sqrt(1000) is actually closer to 31.6), as the Z39.50 standard prescribes that the top score is 1000 and the bottom score is 0, encoded in integers.

In general, it is better to rank down (i.e. use weights smaller than 34) than up, because the total rank is truncated at 1000 if it exceeds that value.

So, if you observe strange ranking behaviour, try with smaller ranking weights.

The following:

 @attr 2=102 @or @attr 9=30 @attr 4=1 @attr 6=3 @attr 1=4 it @attr 9=20 @attr 1=1016 it

will get all the records with 'It' as the exact title first, followed by the rest of the records with 'it' somewhere in the record.

 @attr 2=102 @or @attr 9=30 @attr 4=1 @attr 6=3 @attr 1=4 it @attr 9=20 @attr 1=4 it

This also worked: it pulls up all the results with 'It' as the exact title, followed by those with 'It' somewhere in the title.
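
Queries like the two above can be assembled in the application layer rather than typed by hand. Here is a minimal sketch, assuming you build the PQF as a plain string: the helper names weighted_pqf and pqf_term are hypothetical, and the quoting rule (double quotes around terms containing whitespace) is a simplification you should check against your own data.

```python
def pqf_term(term):
    """Quote a term for PQF if it contains whitespace (simplified rule)."""
    if any(ch.isspace() for ch in term):
        return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return term

def weighted_pqf(term, fields):
    """OR the same term across several indexes, each with its own weight.

    fields is a list of (attributes, weight) pairs, e.g.
    ("@attr 4=1 @attr 6=3 @attr 1=4", 30).
    """
    leaves = [f"@attr 9={weight} {attrs} {pqf_term(term)}"
              for attrs, weight in fields]
    # @or is a binary operator in PQF, so chain it for more than two leaves.
    query = leaves[0]
    for leaf in leaves[1:]:
        query = f"@or {query} {leaf}"
    # @attr 2=102 asks for relevance ranking on the whole expression.
    return f"@attr 2=102 {query}"

print(weighted_pqf("it", [("@attr 4=1 @attr 6=3 @attr 1=4", 30),
                          ("@attr 1=1016", 20)]))
```

With these arguments the helper reproduces the first 'it' query above, so the weights can live in a small table in your application instead of being scattered across hand-written PQF strings.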

Query Manipulation

How can I map a single search query (like title) to a pqf string like the above?

I tried doing something with CCL, but it won't take more than one of each attribute as an argument.

You cannot ask zebra to do complicated query manipulation; this is the job of the client application layer on top of zebra.

(You can't get a relational DB to rewrite SQL queries before executing them either, so I feel we are in very good company with this design choice.)

The YAZ layer (which does all the protocol and PQF parsing work for zebra) also includes CQL and CCL parsers with a simple configuration mechanism. This allows translating CCL to PQF, _assuming_ that the _structure_ of the boolean operator tree is not altered.

You want to rewrite, for example, a boolean query tree like this

 @attr 1=4 it   (or the CCL/CQL equivalent  "title = it"  )

to this one

 @attr 2=102 @or @attr 9=30  @attr 6=3 @attr 1=4 it @attr 9=20 @attr 1=4 it

which has a decidedly different structure, i.e., multiple leaf nodes with added attributes.

I have two solution suggestions:

A) Write your own CCL parser and program it to output PQF with all the bells and whistles you need. There are CQL parsers available in almost any programming language, so it should not be too hard to use one of them as a starting point.

B) Deploy any tree-to-tree transformation you know to do the job. The only honest transformation language of this kind I know of is XSLT, so one can do the following trick:

1) transform PQF to PQF-XML
2) transform PQF-XML to PQF-XML by XSLT
3) transform PQF-XML back to PQF

This gives you total freedom to do any transformation, including adding '@or' nodes.

You might want to experiment with this on the command line, just to get an idea. Step into the yaz/utils directory of the YAZ CVS checkout/tarball, where

./yaz-xmlquery -p '@and @attr 1=1016 @attr 4=2 @attr 6=3 the @attr 1=4 fish'

generates

<?xml version="1.0"?>
<query>
<rpn set="Bib-1">
  <operator type="and">
    <apt>
      <attr type="6" value="3"/>
      <attr type="4" value="2"/>
      <attr type="1" value="1016"/>
      <term type="general">the</term>
    </apt>
    <apt>
      <attr type="1" value="4"/>
      <term type="general">fish</term>
    </apt>
  </operator>
</rpn>
</query>

Then, there is an XSLT stylesheet provided as an example. Use any XSLT processor, for example xsltproc, to apply it. (It does nothing by default, unless you enable the commented-out transformation rules, but then you can do virtually anything. Open the file and study it a bit.)

xsltproc pqf2pqf.xsl test.xml > test2.xml

and converting the result back to PQF then gives

./yaz-xmlquery -x test2.xml

RPN @attrset Bib-1 @and @attr 1=1016 @attr 4=2 @attr 6=3 the @attr 1=4 fish

Now, making your transformations is just an exercise in XSLT template programming.

Of course, you need to do the three transformation steps in your application layer, and there you need access to the PQF ←→ XML translation, as well as a decent XSLT engine.
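
To give a feel for what such a tree rewrite does, here is a rough sketch of the middle step in Python, with the standard xml.etree.ElementTree module standing in for XSLT. It is not the pqf2pqf.xsl stylesheet: the function names and the choice of weights 30/20 are assumptions made for illustration. Every title search (attribute 1=4) is replaced by an '@or' of two weighted copies.

```python
import copy
import xml.etree.ElementTree as ET

PQF_XML = """<query>
<rpn set="Bib-1">
  <operator type="and">
    <apt>
      <attr type="6" value="3"/>
      <attr type="4" value="2"/>
      <attr type="1" value="1016"/>
      <term type="general">the</term>
    </apt>
    <apt>
      <attr type="1" value="4"/>
      <term type="general">fish</term>
    </apt>
  </operator>
</rpn>
</query>"""

def weighted_copy(apt, weight, extra=()):
    """Copy an <apt> node and put a weight attribute (type 9) in front."""
    node = copy.deepcopy(apt)
    new_attrs = [("9", str(weight)), *extra]
    for pos, (attr_type, value) in enumerate(new_attrs):
        node.insert(pos, ET.Element("attr", {"type": attr_type, "value": value}))
    return node

def expand_title_searches(root):
    """Replace each <apt> using attribute 1=4 by an 'or' of two weighted copies."""
    # Collect the targets first so the tree is not modified while iterating.
    targets = [(parent, apt)
               for parent in root.iter()
               for apt in parent.findall("apt")
               if apt.find("attr[@type='1'][@value='4']") is not None]
    for parent, apt in targets:
        or_node = ET.Element("operator", {"type": "or"})
        or_node.append(weighted_copy(apt, 30, extra=[("6", "3")]))
        or_node.append(weighted_copy(apt, 20))
        position = list(parent).index(apt)
        parent.remove(apt)
        parent.insert(position, or_node)
    return root

root = expand_title_searches(ET.fromstring(PQF_XML))
print(ET.tostring(root, encoding="unicode"))
```

The printed XML would still have to go back through the PQF-XML to PQF conversion (yaz-xmlquery -x in the walkthrough above) before it can be sent to zebra.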

Multiple databases, indexes, ports ... etc.

You have in principle three different options, which all have their own advantages and disadvantages.

1) For each Koha installation, you use a new hardware box, that is, a new IP address, and a new Zebra process and register area on its own disc.

This is not really what you want, but it of course gives the best performance and independence of services.

2) You use one hardware box, possibly with different disc drives, and designate a file system layout such as the following:

/usr/lib/zebra/koha-afognak/zebra
/usr/lib/zebra/koha-afognak/data

/usr/lib/zebra/koha-wipo/zebra
/usr/lib/zebra/koha-wipo/data

where you place zebra.cfg, *.att and whatever other config files you need in the /usr/lib/zebra/koha-*/zebra dirs, and your XML-or-whatever data (if you use file system data storage) in the /usr/lib/zebra/koha-*/data dirs.

If you wish - put all common configuration files in a common area, like

/usr/lib/zebra/koha/

and refer to them from the local zebra.cfg files.

Then you write a start-stop script to automagically awake one zebra instance for each koha installation. We do this in our Keystone PHP web portal application with great success, and I can send you a script you might want to use as a starting point.

You either start zebra on a local socket - for example

/usr/lib/zebra/koha-wipo/zebra/socket

(this is the way we run multiple Keystone installations on one box)

or you invent a naming scheme for mapping from koha instance to a port number, and start each zebra server on its own port.
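
As a sketch of what such a start script has to work out for each instance, the following Python builds the zebrasrv command line for the directory layout above. The port table and function names are hypothetical, and the zebrasrv invocation (-c for the config file, followed by a unix: or tcp:@: listener address) is an assumption to verify against your installed version. Nothing is started here; the commands are only printed.

```python
from pathlib import PurePosixPath

ZEBRA_ROOT = PurePosixPath("/usr/lib/zebra")

# Hypothetical instance-to-port table; any stable naming scheme will do.
PORTS = {"afognak": 9001, "wipo": 9002}

def listener(instance, use_socket=True):
    """Return the listener address for one koha instance."""
    if use_socket:
        return f"unix:{ZEBRA_ROOT / f'koha-{instance}' / 'zebra' / 'socket'}"
    return f"tcp:@:{PORTS[instance]}"

def zebrasrv_command(instance, use_socket=True):
    """Build the argument list that would start zebra for one instance."""
    config = ZEBRA_ROOT / f"koha-{instance}" / "zebra" / "zebra.cfg"
    return ["zebrasrv", "-c", str(config), listener(instance, use_socket)]

for name in sorted(PORTS):
    print(" ".join(zebrasrv_command(name, use_socket=False)))
```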

You could even mount different discs on the different filesystem dirs, so that one DB does not take disc performance from another (but you obviously still have to share memory and CPUs, so this scheme works best on fast multi-CPU boxes with plenty of RAM and multiple SCSI discs). Do this only if you need faster disc performance.

Advantages: easy to administer and easy to debug; each zebra server is only concerned with its own data, and can thus be started, stopped and maintained individually. No complex backup problems either, as each data or register area can be backed up individually.

The natural choice, unless:

3) You use one box, one zebra process, and different logical database names, all listening on the same port/socket and using the same register area.

Advantages: all you need to remember is one IP address/port.

Disadvantages: clients need to specify the correct database name, and cannot rely on the default one. All DBs are up or down at the same time, with no possibility of taking just one down. Reconfiguring one means restarting all of them. Backup is a bit of a mess, as the register areas are shared. If you need different indexing rules for the same XML files (i.e. different versions of the same *.abs or *.att files) you are definitely toast.

So, this schema is only good if:

- you must have one and only one socket/port
- all DBs are configured identically, i.e. *.abs and *.att (and all other config files) can be shared.

This will seldom be the case, I believe.

So I can have two zebra servers running, each on a different port, but each with several databases. Now, let's throw shadow indexes into the mix … what happens during a commit when you've got one zebrasrv running several databases … does the commit affect them all (i.e., do you only need one commit for the whole server?) or does it just affect the single database?

Running multiple databases in one zebra server means running multiple databases in the same register files, so, yes, one commit means one commit per server/register files, i.e. all databases are committed at the same time.

CCL Embedded Sorting

In the documentation I don't see a way to add the sorting 7=1 / 7=2 options using a CCL query … is there any way to do it?

You could make a qualifier called 'titleascending' which sets 7=1 4=1, and do a search like

au=ferraro OR titleascending=0
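
To see roughly what PQF such a search turns into, here is a toy expansion in Python. It is not the YAZ CCL parser: the qualifier table is hypothetical (the 'au' mapping to Bib-1 use attribute 1003 is an assumption, 'titleascending' uses the attributes suggested above), and only a flat OR of qualifier=term clauses is handled.

```python
# Hypothetical qualifier table mapping CCL qualifiers to PQF attributes.
QUALIFIERS = {
    "au": "@attr 1=1003",                     # assumption: author index
    "titleascending": "@attr 7=1 @attr 4=1",  # sort qualifier from the text
}

def expand_or(clauses):
    """OR together 'qualifier=term' clauses, expanding each qualifier to PQF."""
    leaves = []
    for clause in clauses:
        qualifier, term = clause.split("=", 1)
        leaves.append(f"{QUALIFIERS[qualifier.strip()]} {term.strip()}")
    # @or is binary in PQF, so chain it across the clauses.
    query = leaves[0]
    for leaf in leaves[1:]:
        query = f"@or {query} {leaf}"
    return query

print(expand_or(["au=ferraro", "titleascending=0"]))
```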

Migrating from ABS to DOM style indexing

How do we convert from ABS to DOM style indexing?

If it's something XPath-like, such as

 melm 020$a      ISBN:w,Identifier-standard:w

or

 elm 245/?/a             title           !:w,!:p

the straightforward translation is to make an XSLT template which mimics these rules. Something like:

<xsl:template match="m:datafield[@tag='245']">
  <z:index name="title:w title:p title:s any:w">
    <xsl:value-of select="m:subfield[@code='a']"/>
  </z:index>
</xsl:template>

in XSLT (assuming that we are indexing MARCXML records, of course ..)

One could also use an xsl:for-each loop

<xsl:template match="m:datafield[@tag='245']">
  <xsl:for-each select="m:subfield[@code='a']">
    <z:index name="title:w title:p title:s any:w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:for-each>
</xsl:template>

if one has many 245 datafields without any subfield with code a.

Have a look at the very simple

zebra/test/xslt/dom-index-element.xsl

and test the idea by running

xsltproc zebra/test/xslt/dom-index-element.xsl mymarcxml-record.xml
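
If you want to double-check which values a template like the for-each version above would hand to the indexer, a few lines of Python can emulate it. This is only a stand-in for the XSLT, under the assumption that the records are MARCXML in the usual http://www.loc.gov/MARC21/slim namespace; the sample record and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"m": "http://www.loc.gov/MARC21/slim"}
INDEXES = "title:w title:p title:s any:w"

# Invented sample record, just enough structure to exercise the rule.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Example title</subfield>
    <subfield code="c">Example author</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="c">No subfield a here</subfield>
  </datafield>
</record>"""

def title_index_entries(marcxml):
    """Return (index names, value) pairs for every 245$a in a MARCXML record."""
    root = ET.fromstring(marcxml)
    entries = []
    for field in root.iter("{http://www.loc.gov/MARC21/slim}datafield"):
        if field.get("tag") != "245":
            continue
        # One entry per subfield a; fields without one produce nothing.
        for sub in field.findall("m:subfield[@code='a']", MARC_NS):
            entries.append((INDEXES, sub.text or ""))
    return entries

print(title_index_entries(SAMPLE))
```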