Started by Joshua Ferraro <jmf AT liblime DOT com>, extended by the whole community.
This document attempts to raise awareness of character set issues so that Koha developers and administrators can understand how best to proceed with development as well as setup and configuration of Koha systems.
MARC21 records can 'legally' have only two encodings: MARC-8 or UTF-8. The encoding is set in position 9 of the leader (LEADER / 09). MARC-8 is not recognized in modern web browsers, and since Koha is a web-based system, if you are using MARC21 records the encoding MUST be UTF-8. This means that the records should be pre-processed before entering your Koha system (in whatever way they enter). Some of this is handled internally within Koha, but don't leave it to chance: if you're migrating MARC21 data into Koha, expect to spend a significant amount of time dealing with properly pre-processing and storing your data in Koha.
Conversion from MARC-8 to UTF-8 for MARC21 records is handled in Koha with the MARC::* suite of Perl modules. There are significant issues with properly configuring your system (with the proper SAX parsers, etc.) and there are also some questions raised about whether this suite is handling all character set / encoding issues correctly. For some details, please refer to the following posts:
http://www.nntp.perl.org/group/perl.perl4lib/2369
http://lists.nongnu.org/archive/html/koha-devel/2006-07/msg00000.html
One thing to remember is that LEADER / 09 is used in MARC::* to determine the encoding of a given record. This means that if it's not set correctly, you will very likely mangle any records you are importing/exporting.
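As an illustration, here is a minimal pre-processing sketch (the file name records.mrc is made up, and a real migration needs more error handling) that checks LEADER / 09 and converts MARC-8 subfield data to UTF-8 with MARC::Charset before the records go anywhere near Koha:

use MARC::Batch;
use MARC::Field;
use MARC::Charset qw(marc8_to_utf8);

my $batch = MARC::Batch->new( 'USMARC', 'records.mrc' );   # hypothetical input file
while ( my $record = $batch->next() ) {
    my $leader = $record->leader();
    next if substr( $leader, 9, 1 ) eq 'a';                 # 'a' = already flagged as UCS/Unicode
    for my $field ( $record->fields() ) {
        next if $field->is_control_field();
        my @subfields;
        for my $sf ( $field->subfields() ) {
            push @subfields, $sf->[0], marc8_to_utf8( $sf->[1] );
        }
        $field->replace_with( MARC::Field->new( $field->tag(),
            $field->indicator(1), $field->indicator(2), @subfields ) );
    }
    substr( $leader, 9, 1 ) = 'a';                          # mark the record as UTF-8
    $record->leader($leader);
    # write the converted record out through a ':utf8' filehandle (see the MARC::File notes below)
}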
Be sure to set your system locales up correctly to use UTF-8. You can test your locale settings by running:
$ locale
or
$ echo $LANG
en_US.UTF-8
If it's not en_US.UTF-8 (or the UTF-8 locale for your language), be sure to reconfigure your locales; a plain en_US means the system is configured for iso-8859-1/latin1. On Debian, you can configure locales thusly:
$ sudo dpkg-reconfigure locales
Then, you'll need to quit your shell session and log back in again to check the default.
NOTE: on some systems, the root user won't have the locale set properly; use a non-root user when working with Koha, and the 'sudo' command if you need elevated permissions.
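If, after reconfiguring, your non-root user still doesn't get a UTF-8 default, a simple fix (assuming a Bourne-style shell) is to set it explicitly in that user's shell profile:

# in ~/.bashrc or ~/.profile of the user that runs Koha
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8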
Here's a decent reference describing all the bases to cover when you want your system to be UTF-8:
http://gentoo-wiki.com/HOWTO_Make_your_system_use_unicode/utf-8
OSX (10.3 and 10.4) with Firefox (up to at least version 1.5) does not properly display UTF-8 decomposed (combining) characters in the Arial font. It does, however, display them correctly in the Lucida Grande font. For catalogs that have combining characters, make sure that your default fonts include Lucida Grande.
http://www.mezzoblue.com/archives/2005/07/25/glyphs/
http://www.dionysia.org/html/entities/latin1.html
MJR ran a test. For client computers, it seems that the important things are that unicode fonts are correctly configured (see above for one example, or ask on the mailing list) and that the page containing the form is served in the correct encoding (UTF-8?). A keymap that lets the user type the characters they want is also a good idea. Firefox, at least, doesn't seem to care about the X locale. The IRC log of the test is here:
Be sure to have these lines in your httpd.conf:
AddCharset UTF-8 .utf8
AddDefaultCharset UTF-8
MySQL version 4.1 is the absolute minimum if you want to handle encoding correctly.
Please refer to the MySQL Manual Chapter 10: http://dev.mysql.com/doc/refman/4.1/en/charset.html
You will probably have to edit your my.cnf to set some variables so that the server will use utf8 by default. Even standard packages like the one provided by Debian Sarge have the variables set to use latin1 by default. Make sure you have the following in the [mysqld] section of your my.cnf:
init-connect = 'SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_general_ci
character_set_client=utf8
Connect to mysql as a non-root user and type:
show variables;
Or:
show variables where variable_name LIKE "character_set%" OR variable_name LIKE "collation%";
NOTE: The root user won't show the variables correctly so connect as a non-superadmin user to check the values. From the MySQL site:
You might be tempted to start the server with the --init_connect="SET NAMES 'utf8'" option to cause SET NAMES to be executed automatically for each client that connects. However, this will yield inconsistent results because the init_connect value is not executed for users who have the SUPER privilege.
NOTE: The only way I could get kohaadmin to use utf8 for everything that it was supposed to was by adding 'skip-character-set-client-handshake' to my.cnf. I hadn't tested root before setting that, but now they're both correct. – CM
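If you end up needing the same workaround, the line goes in the [mysqld] section alongside the settings above (and mysqld must be restarted afterwards):

[mysqld]
# ignore the character set the client negotiates and always use the server settings
skip-character-set-client-handshake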
Check to make sure the following are set to utf8:
| character_set_client     | utf8                        |
| character_set_connection | utf8                        |
| character_set_database   | utf8                        |
| character_set_results    | utf8                        |
| character_set_server     | utf8                        |
| character_set_system     | utf8                        |
| character_sets_dir       | /usr/share/mysql/charsets/  |
| collation_connection     | utf8_general_ci             |
| collation_database       | utf8_general_ci             |
| collation_server         | utf8_general_ci             |
You must create your Koha database after you set the character set defaults; otherwise the database could be created with the wrong defaults.
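If you prefer not to rely on the server defaults at all, you can also spell the character set out when creating the database (the database name koha here is only an example):

CREATE DATABASE koha DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;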
Important: On Debian Sarge moving from mysql-server to mysql-server-4.1 modifies existing tables and in some cases seems to reduce the length of fields, causing strange behaviour in Koha. For example, on our system it truncated the 'ip' field in the 'sessions' table to 5 characters, which meant that you had to log in every time you clicked on a link in the intranet.
If you are moving from a MySQL 4.0 database to 4.1, you need to pay special attention to how you deal with your charsets. If you are storing utf-8 data in MySQL 4.0 but your table types are set to latin1, you will need to convert the columns to blob or binary before changing the table type; otherwise MySQL will attempt a conversion and you will end up with double-encoded utf8:
http://dev.mysql.com/doc/refman/4.1/en/charset-conversion.html
Also, if you are storing marc-8 encoded data in a latin1 type database, you probably need to do the same thing: convert the columns to type blob, export your records from marc_subfield_table into a MARC file, process the file to change everything to utf8, change the table character set in MySQL, and then re-import.
mysql string functions: http://mysql.he.net/doc/refman/5.0/en/string-functions.html
The word Univerzalitás is a unicode combining form. When you copy/paste it into a text editor or use a keyboard to type it, it is most likely going to be the non-combining form: Univerzalitás. (In the combining form, the accented a is the pair Hex 61 + Hex 0301; in the non-combining form it is the single character Hex 00e1.)
Non-combining form: http://www.fileformat.info/info/unicode/char/00e1/index.htm
Combining form: http://www.fileformat.info/info/unicode/char/61/index.htm http://www.fileformat.info/info/unicode/char/0301/index.htm
Univerzalitás Univerzalitás
It seems that the utf8_general_ci collation doesn't support equality for those two forms. However, utf8_unicode_ci seems to work. If you have combining characters in your data, you may want to go with statements like:
ALTER TABLE marc_word MODIFY word VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
and be sure to add init-connect = 'SET collation_connection = utf8_unicode_ci' to your my.cnf
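If you want to check the collation behaviour yourself, a quick sanity check (the hex literals are just the UTF-8 bytes of the two forms of the accented a) is:

-- precomposed a-acute (0xC3A1) versus a + combining acute accent (0x61 0xCC81)
SELECT CONVERT(UNHEX('C3A1') USING utf8) = CONVERT(UNHEX('61CC81') USING utf8) COLLATE utf8_unicode_ci AS unicode_ci,
       CONVERT(UNHEX('C3A1') USING utf8) = CONVERT(UNHEX('61CC81') USING utf8) COLLATE utf8_general_ci AS general_ci;

With utf8_unicode_ci the comparison should return 1; with utf8_general_ci it returns 0.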
The SHOW CHARACTER SET command shows all available character sets.
The output from SHOW COLLATION lists all available collations.
SHOW FULL COLUMNS FROM items; will show the collation for each column in the items table
You can modify the mysql dbh thusly:
sub dbh {
    my $self = shift;
    my $sth;
    if (defined($context->{"dbh"})) {
        $sth = $context->{"dbh"}->prepare("select 1");
        return $context->{"dbh"} if (defined($sth->execute));
    }
    # No database handle or it died. Create one.
    $context->{"dbh"} = &_new_dbh();
    $sth = $context->{"dbh"}->prepare("SET NAMES 'utf8'");
    $sth->execute;
    return $context->{"dbh"};
}
(I'll add this to CVS once I confirm it's working properly – JF)
http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t.html
http://textsnippets.com/posts/show/84 (probably not the best way)
mysqlhotcopy is the only safe way to back your database up if you are using utf-8 data in a latin1 database. It will take an exact snapshot of the data without changing anything. However, the disadvantage is that mysqlhotcopy isn't always the right tool for the job of upgrading from an older version of mysql to a newer version.
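A typical invocation (the database name and target directory are examples) looks like the following; note that mysqlhotcopy only handles MyISAM tables:

$ mysqlhotcopy --user=root --password=xxxxx koha /var/backups/mysql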
If you're like me, you have a MARC21 library with utf-8 encoded records, but your database was incorrectly set to use latin1 at the server, database, table and column level. This may put you in a spot where you cannot search using utf-8 characters typed into the search box (for instance, search for Péru, München, información, Világszervezet). Here is the solution I've used with some success:
-- Convert the database default encoding to UTF-8 and default collate to utf8
ALTER DATABASE kohawipofinal DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

-- change the table encoding type (varchar -> binary -> varchar) for MARC data
alter table marc_subfield_table modify subfieldvalue BINARY(255);
alter table marc_subfield_table modify subfieldvalue VARCHAR(255) CHARACTER SET utf8;

-- change the table encoding type (text -> blob -> text) for Koha tables
alter table biblio modify title BLOB;
alter table biblio modify title TEXT CHARACTER SET utf8;
alter table biblio modify author BLOB;
alter table biblio modify author TEXT CHARACTER SET utf8;
alter table biblio modify unititle BLOB;
alter table biblio modify unititle TEXT CHARACTER SET utf8;
alter table biblio modify notes BLOB;
alter table biblio modify notes TEXT CHARACTER SET utf8;
alter table biblio modify seriestitle BLOB;
alter table biblio modify seriestitle TEXT CHARACTER SET utf8;
alter table biblio modify abstract BLOB;
alter table biblio modify abstract TEXT CHARACTER SET utf8;
alter table biblioitems modify abstract BLOB;
alter table biblioitems modify abstract TEXT CHARACTER SET utf8;
Here are some links to good references for perl encoding issues:
http://www.ahinea.com/en/tech/perl-unicode-struggle.html
http://search.cpan.org/~jhi/perl-5.8.0/pod/perluniintro.pod
1. The XML::SAX::PurePerl parser cannot properly handle combining characters. I've reported this bug here:
http://rt.cpan.org/Public/Bug/Display.html?id=19543
At the suggestion of several people, I tried replacing my default system parser with Expat, which caused another problem:
2. Handing valid UTF-8 encoded XML to new_from_xml() sometimes causes the entire record to be destroyed when using XML::SAX::Expat as the parser (with PurePerl, these records seem to work). It fails with the error:
not well-formed (invalid token) at line 23, column 43, byte 937 at /usr/lib/perl5/XML/Parser.pm line 187
I haven't been able to track down the cause of this bug. I eventually found a workaround that didn't result in the above error but instead silently mangled the resulting binary MARC record on the way out:
3. Using incompatible versions of XML::SAX::LibXML and libxml2 will, in some cases, cause binary MARC records to be mangled when passed through new_from_xml(). The solution here is to make sure you're running compatible versions of XML::SAX::LibXML and libxml2. I run Debian Sarge, and simply using the package maintainer's versions fixed the bug. It's unclear to me why the binary MARC would be mangled; this may indicate a problem with MARC::*, but I haven't had time to track it down, and since installing compatible versions of the parser back-end solves the problem, I can only assume it's the fault of the incompatible parsers.
Issues #3 and #4 above can be replicated by running the following batch of records through the roundtrip.pl script mentioned above:
http://liblime.com/public/several.mrc
If you want to test #2, try running this record through roundtrip.pl:
http://liblime.com/public/combiningchar.mrc
BTW: you can change your default SAX parser by editing the .ini file … mine is located in /usr/local/share/perl/5.8.4/XML/SAX/ParserDetails.ini
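For reference, ParserDetails.ini contains one section per installed SAX parser, and XML::SAX::ParserFactory picks the last entry as the default; a typical file looks roughly like this (the exact modules and feature lines vary by installation):

[XML::SAX::PurePerl]
http://xml.org/sax/features/namespaces = 1

[XML::SAX::Expat]
http://xml.org/sax/features/namespaces = 1

[XML::LibXML::SAX]
http://xml.org/sax/features/namespaces = 1

To change the default, reorder (or remove) entries so that the parser you want ends up last.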
So the bottom line is, if you want to use MARC::File::XML in any serious application, you've got to use compatible versions of the libxml2 parser and XML::SAX::LibXML. Check the README in the perl package for documentation on which are compatible…
DBD::mysql 4.004 or later required
This version of DBD::mysql is required in order to have proper handling of UTF-8 strings.
Use YAML::Syck for serialization
When using mysql and a sufficiently recent version of DBD::mysql that actually respects the mysql_enable_utf8 option, setting the current branch to a name that includes a diacritic causes the session to fail and forces the user to the login screen upon clicking another link.
This turned out to be due to the default (Data::Dumper) serialization used by CGI::Session, which truncated the session object at the diacritic. By enabling that module's YAML serialization option, the problem was resolved.
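A rough sketch of what that looks like (the DSN and arguments here are illustrative, not Koha's exact call, and $sessionID and $dbh are assumed to exist already); 'serializer:yaml' tells CGI::Session to use CGI::Session::Serialize::yaml, which prefers YAML::Syck when it is installed:

use CGI::Session;
my $session = CGI::Session->new( "driver:mysql;serializer:yaml", $sessionID, { Handle => $dbh } );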
http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
Movable Type uses the Perl modules DBI and DBD::mysql to access the MySQL database. And guess what? They don't have any Unicode support. In fact, forget marking the UTF-8 flag properly; according to this, DBD::mysql doesn't even preserve the UTF-8 flag when it's already there.
Wait for Unicode support in DBI/DBD::mysql, which might be a long time, since nobody is sure whether it should be provided by the database-independent interface DBI, by the MySQL driver DBD::mysql, or by both together in some way.
Use decode_utf8 on every output from the database. This is not very easy to do. http://perldoc.perl.org/Encode.html#PERL-ENCODING-API
Use a patch which blesses all database data (yes, that includes the binary fields) as UTF-8 based on a flag you set when connecting to the database.
http://lists.mysql.com/perl/3563 (one patch, version 2.9000008)
http://rt.cpan.org/Public/Bug/Display.html?id=17829 (patch version 3.0000000006)
http://dysphoria.net/2006/02/05/utf-8-a-go-go/ (another)
http://perl7.ru/lib/UTF8DBI.pm
Here's one that seems to indicate that it's best to grab DBI from CPAN:
http://www.codecomments.com/archive237-2006-4-786695.html
DBD::mysql will just pass everything through unaltered. So if you use UTF-8 as the connection charset, you have to encode('utf-8', ...) all queries and parameters, unless you are sure that they are either plain ASCII or already have the UTF-8 bit set. And you will get raw UTF-8 strings back, which you have to decode() explicitly. However, I notice that on Debian Sarge (on which I did my testing), libdbd-mysql-perl depends on libmysqlclient12. So there may be a problem with mixing releases (the server is 4.1, but libmysqlclient12 belongs to 4.0, which doesn't know about UTF-8).
After some checking, DBD::mysql's database handler attribute mysql_enable_utf8 should take care of setting Perl's utf8 flag. However, this option is not available until version 3.008_1 of DBD::mysql, and per the ChangeLog is not fully functional until version 4.004. Therefore, it looks like Koha should require at least that version of DBD::mysql.
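In practice that means connecting along these lines (the DSN and credentials are placeholders); with mysql_enable_utf8 set, data comes back with Perl's utf8 flag already on, so the manual decode() step described above is no longer needed:

use DBI;
my $dbh = DBI->connect(
    "DBI:mysql:database=koha;host=localhost",    # placeholder DSN
    "kohaadmin", "secret",                       # placeholder credentials
    { RaiseError => 1, mysql_enable_utf8 => 1 }, # needs DBD::mysql >= 4.004 to behave reliably
);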
Debian Etch's stable libdbd-mysql-perl package is only version 3.008. There is a lenny testing package now at version 4.006, but it has requirements of newer libc6 and perl packages. Use 4.005 from CPAN (or newer, if you can swing it).
Coming soon …
There is an important trick here when dealing with utf8 files: don't forget to open your files with ':utf8' if you want correct behaviour. That's VERY important, because if you read or save a MARC record (iso2709 file) without this flag, you will end up with a mismatch between the leader size and the data size, as accented characters are encoded on 2 bytes in utf8 and 1 byte in MARC-8, and the leader size will have been calculated accordingly by MARC::Record.
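For example (file names invented, and assuming $record already holds UTF-8 data with LEADER / 09 set to 'a'):

open my $in,  '<:utf8', 'import.mrc' or die "import.mrc: $!";   # reading
open my $out, '>:utf8', 'export.mrc' or die "export.mrc: $!";   # writing
print {$out} $record->as_usmarc();   # the ':utf8' layer keeps the bytes written consistent with the leader
close $out;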
: add info
Since version 2.0.20, Zebra is Unicode compliant. Zebra can support indexing and searching in any Unicode script, such as Arabic, Chinese, Hebrew, Japanese and Korean. Zebra gets this support via the ICU libraries (“International Components for Unicode”), a development project sponsored, supported, and used by IBM. ICU consists of a set of open source program libraries that help applications process Unicode data in any script.