Importing data into Koha from a relational database

After doing some initial testing, manually entering items into the catalog, using circulation and the OPAC, and getting somewhat familiar with Koha, I decided it was time to import lots and lots of records. To do this, I used a copy of the book database at http://iblist.com/. It is available under an open license for non-commercial use.

So, I had about 20,000 books to import, and this is what I discovered during the process:

Conversion process

The recommended way to import data into Koha is by converting it to MARC binary records, then importing with the supplied bulkmarcimport.pl script. I tried this, but later decided to use MARC21slim XML instead of the older binary MARC, due to character-encoding problems. Modifying the import script was trivial, and doing so saved me a lot of trouble.

My original translation path was as follows:

  1. put iblist.com data into a MySQL database
  2. query the database, translate, and save the results in a LoC MODS XML file (http://www.loc.gov/standards/mods/)
  3. convert MODS to MARC21slim XML using a LoC stylesheet (http://www.loc.gov/standards/marcxml/)
  4. convert MARC21slim to binary MARC using Perl's MARC::Record and MARC::File::XML modules
  5. import the binary MARC file into Koha with bulkmarcimport.pl
  6. adjust Koha's MARC bindings to fit the data into Koha's internal formats

After discovering some encoding problems, I modified the process a bit. Instead of converting to binary MARC, I imported the MARC XML directly. This was faster, easier, and less error-prone.

During the process, the bulk of my time was spent writing a script to query the database and write the results in MODS format. This involved learning far more about MARC than I had expected, and modifying the MODS schema and stylesheet to accommodate MARC fields the LoC left out. I discovered quite a bit about Koha's MARC mappings during the process too, and had to modify some of Koha's templates slightly to display certain fields. I also ran into a couple of limitations in Koha 2.0.0, such as its lack of support for most repeating MARC fields.

The Koha users mailing list was very helpful throughout this process, especially with respect to finding the correct MARC fields to put data into.

Specific issues encountered

  • Some of the book data fields (book number in a series, for example) do not work correctly after importing with bulkmarcimport.pl. I had to run the rebuildnonmarc.pl script afterward to fix things.
  • iblist data uses non-ASCII encodings, which are handled poorly (or not at all) by Perl's MARC::Record, so creating binary MARC files was not feasible without a lot of effort. (MARC::File::XML works fine.)
  • iblist “original language” data is not in ISO format, and had to be mapped appropriately (to ISO 639-2/B codes)
  • It's not always clear how to escape data to fit in XML. The input data used several character encodings, ranging from ASCII to CP437 to ISO 8859-1 to HTML entities, and contained numerous encoding errors. I ignored most of this, but had to work around HTML/XML-encoded characters in order to make the output valid XML. Mapping & to &amp; and <> to &lt;&gt; is easy enough, but some of the data does not follow the format of ampersand, symbol, semicolon (&symbol;), and it can be difficult to tell whether any remaining &'s should be converted.
  • Koha doesn't seem to have a facility for subtitles. For example,
  • iblist does not indicate primary and secondary authors; it is unclear whether to use 100 or 700 MARC tags for each
  • MARC has no facility for detailed author info; a long bio will get stored many times (once per book) if imported. Koha has no built-in facility for providing this sort of info in a more normalized manner, though it would not be terribly difficult to add.
  • iblist book types do not map at all onto MARC media types. (Novel, screenplay, etc. all become MARC's basic “text” type.)
  • It's unclear whether the iblist “type” and genre can both fit into MARC. Both seem to be mapped to the same tag, but this may be a misunderstanding on my part. I want to store “Novel” as a media type, and “Science Fiction” as a genre, for example.
  • iblist has various data errors here and there, such as having both “France” and “French” as languages
  • MODS does not provide ways to generate all MARC tags, such as 9xx or x9x tags, or even things like the series a book is a part of… I had to add this to the stylesheet.
  • iblist generally has lots of extra data which doesn't fit into MARC
  • Koha does not display all MARC info, and by default did not even display some of its own (properly mapped) data fields (series title) in the OPAC book view. I was able to fix some of this by modifying the display templates.
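Two of the issues above, the stray ampersands and the language names, can be sketched in a few lines of Python. This is only one possible approach and not necessarily what my scripts do: rather than deciding which &'s to leave alone, it decodes anything shaped like a well-formed entity into a plain character first, then escapes whatever is left. The function names and the small language table are hypothetical, and the table is only a sample; a real one would need to cover every value found in the iblist data.

```python
import html
import re
from xml.sax.saxutils import escape

# Anything shaped like a complete entity: &name;  &#233;  &#xE9;
ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def normalize_text(raw):
    """Decode well-formed entities, then escape the result for XML.

    Unknown names such as "&foo;" are left as they are by html.unescape,
    so their ampersand is escaped like any other bare "&".
    """
    decoded = ENTITY.sub(lambda m: html.unescape(m.group(0)), raw)
    return escape(decoded)  # maps & < > to &amp; &lt; &gt;


# Sample mapping from iblist language names to ISO 639-2/B codes.
LANGUAGE_CODES = {
    "english": "eng",
    "french": "fre",
    "france": "fre",  # works around the "France" vs "French" data error
    "german": "ger",
    "spanish": "spa",
}


def language_code(name):
    """Return the ISO 639-2/B code for a language name, or None if unknown."""
    return LANGUAGE_CODES.get(name.strip().lower())
```

For example, normalize_text("AT&T") gives "AT&amp;T", while an already-encoded "&amp;" is decoded and re-escaped, so it is not doubled. Returning None for unknown languages leaves it to the caller to log the value and extend the table by hand.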

Results

After building translation tools, I found myself with a Koha database containing over 20,000 books. I am not yet finished with the process, but things are going reasonably well. I still need to figure out how to create in-stock items during the process, so I can later test circulation.

I intend to release the tools and scripts I used for this process, if anyone is interested. For now, contact me (ToyKeeper) if you would like copies, or more details.

 
tk_import.txt · Last modified: 2006/04/04 09:23 (external edit)
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported