Chapter 14. Multilanguage support in eXo JCR RDB backend

Whenever a relational database is used to store multilingual text data for the eXo Java Content Repository, the configuration must be adapted to support UTF-8 encoding. This chapter gives short how-to instructions, with examples, for several supported RDBMSs.

The configuration file you have to modify: .../webapps/portal/WEB-INF/conf/jcr/repository-configuration.xml

Note

The jdbcjcr datasource used in the examples can be configured via the InitialContextInitializer component.

14.2. Oracle

To run a multilanguage JCR on an Oracle backend, the database must use a Unicode character set. The only property to modify is NLS_CHARACTERSET; the other Oracle globalization parameters have no impact.

We have tested NLS_CHARACTERSET = AL32UTF8 and it works well for many European and Asian languages.

Example of database configuration (used for JCR testing):

NLS_LANGUAGE             AMERICAN
NLS_TERRITORY            AMERICA
NLS_CURRENCY             $
NLS_ISO_CURRENCY         AMERICA
NLS_NUMERIC_CHARACTERS   .,
NLS_CHARACTERSET         AL32UTF8
NLS_CALENDAR             GREGORIAN
NLS_DATE_FORMAT          DD-MON-RR
NLS_DATE_LANGUAGE        AMERICAN
NLS_SORT                 BINARY
NLS_TIME_FORMAT          HH.MI.SSXFF AM
NLS_TIMESTAMP_FORMAT     DD-MON-RR HH.MI.SSXFF AM
NLS_TIME_TZ_FORMAT       HH.MI.SSXFF AM TZR
NLS_TIMESTAMP_TZ_FORMAT  DD-MON-RR HH.MI.SSXFF AM TZR
NLS_DUAL_CURRENCY        $
NLS_COMP                 BINARY
NLS_LENGTH_SEMANTICS     BYTE
NLS_NCHAR_CONV_EXCP      FALSE
NLS_NCHAR_CHARACTERSET   AL16UTF16
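
To check which character set an existing database uses, the two character-set parameters from the listing above can be read from the NLS_DATABASE_PARAMETERS data dictionary view. This is a minimal sketch rather than part of the tested setup; it assumes a session that can read that view.

```sql
-- Show the database character set and the national character set.
-- For multilingual JCR data, NLS_CHARACTERSET should be a Unicode
-- character set such as AL32UTF8.
SELECT parameter, value
  FROM nls_database_parameters
 WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');
```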

Warning

JCR 1.12.x doesn't use NVARCHAR columns, so the value of the NLS_NCHAR_CHARACTERSET parameter does not matter for JCR.

Create the database with Unicode encoding and use the oracle dialect for the workspace container:

<workspace name="collaboration">
  <container class="org.exoplatform.services.jcr.impl.storage.jdbc.JDBCWorkspaceDataContainer">
    <properties>
      <property name="source-name" value="jdbcjcr" />
      <property name="dialect" value="oracle" />
      <property name="multi-db" value="false" />
      <property name="max-buffer-size" value="200k" />
      <property name="swap-directory" value="target/temp/swap/ws" />
    </properties>
    .....

14.3. DB2

DB2 Universal Database (DB2 UDB) supports UTF-8 and UTF-16/UCS-2. When a Unicode database is created, CHAR, VARCHAR, and LONG VARCHAR data are stored in UTF-8 form. This is sufficient for JCR multilingual support.

Example of UTF-8 database creation:

DB2 CREATE DATABASE dbname USING CODESET UTF-8 TERRITORY US

Create the database with UTF-8 encoding and use the db2 dialect for the workspace container on DB2 v.9 and higher:

<workspace name="collaboration">
  <container class="org.exoplatform.services.jcr.impl.storage.jdbc.JDBCWorkspaceDataContainer">
    <properties>
      <property name="source-name" value="jdbcjcr" />
      <property name="dialect" value="db2" />
      <property name="multi-db" value="false" />
      <property name="max-buffer-size" value="200k" />
      <property name="swap-directory" value="target/temp/swap/ws" />
    </properties>
    .....

Note

To support DB2 v.8.x, change the "dialect" property to db2v8.

14.4. MySQL

The JCR MySQL backend requires the special MySQL-UTF8 dialect for internationalization support. However, the database default charset should be latin1 so that the limited index space is used effectively (1000 bytes for the MyISAM engine, 767 bytes for InnoDB). If the database default charset is multibyte, JCR database initialization fails with an error about index creation. In other words, JCR can work with any single-byte database default charset, provided that the MySQL server supports UTF8. Only the latin1 database default charset has been tested.
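
The charset requirement above can be sketched as follows. This is only an illustration: the database name jcrdb is hypothetical, and the statements assume an account with the privilege to create databases.

```sql
-- Create the JCR database with a single-byte default charset (latin1),
-- so that index keys stay within the MyISAM/InnoDB size limits.
-- "jcrdb" is a placeholder name, not one used by eXo JCR itself.
CREATE DATABASE jcrdb DEFAULT CHARACTER SET latin1;

-- Confirm the default charset that was recorded for the new database.
SELECT schema_name, default_character_set_name
  FROM information_schema.SCHEMATA
 WHERE schema_name = 'jcrdb';
```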

Example workspace container entry in the repository configuration:

<workspace name="collaboration">
  <container class="org.exoplatform.services.jcr.impl.storage.jdbc.JDBCWorkspaceDataContainer">
    <properties>
      <property name="source-name" value="jdbcjcr" />
      <property name="dialect" value="mysql-utf8" />
      <property name="multi-db" value="false" />
      <property name="max-buffer-size" value="200k" />
      <property name="swap-directory" value="target/temp/swap/ws" />
    </properties>
    .....

14.5. PostgreSQL

On a PostgreSQL backend, multilingual support can be enabled in two ways:

Using the locale features of the operating system to provide locale-specific collation order, number formatting, translated messages, and other aspects. UTF-8 is the default on many Linux distributions, so this approach can be useful there.
Using one of the character sets defined in the PostgreSQL server, including multibyte character sets, to store text in any language, with character set translation between client and server. We recommend the UTF-8 database charset: it allows any-to-any conversion and makes the issue transparent to JCR.
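
As a rough sketch of the recommended UTF-8 setup, the database can be created with an explicit encoding. The database name jcrdb is hypothetical, and the TEMPLATE clause is an assumption that only matters when the default template database uses a different encoding.

```sql
-- Create the JCR database with UTF-8 encoding.
-- "jcrdb" is a placeholder name; template0 is used so that the
-- requested encoding can differ from that of template1.
CREATE DATABASE jcrdb WITH ENCODING 'UTF8' TEMPLATE template0;

-- Confirm the encoding of the new database.
SELECT datname, pg_encoding_to_char(encoding) AS encoding
  FROM pg_database
 WHERE datname = 'jcrdb';
```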

Create the database with UTF-8 encoding and use the pgsql dialect for the workspace container:

<workspace name="collaboration">
  <container class="org.exoplatform.services.jcr.impl.storage.jdbc.JDBCWorkspaceDataContainer">
    <properties>
      <property name="source-name" value="jdbcjcr" />
      <property name="dialect" value="pgsql" />
      <property name="multi-db" value="false" />
      <property name="max-buffer-size" value="200k" />
      <property name="swap-directory" value="target/temp/swap/ws" />
    </properties>
    .....