E.44. unaccent

Note: The following description applies both to Postgres-XC and PostgreSQL if not described explicitly.

unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes. It is a filtering dictionary: its output is always passed to the next dictionary (if any), whereas a normal dictionary ends processing of a token once it recognizes it. This allows accent-insensitive processing for full text search.

The current implementation of unaccent cannot be used as a normalizing dictionary for the thesaurus dictionary.

E.44.1. Configuration

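An unaccent dictionary accepts the following option:

RULES is the base name of the file containing the list of translation rules. This file must be stored in $SHAREDIR/tsearch_data/ (where $SHAREDIR means the PostgreSQL installation's shared-data directory). Its name must end in .rules (which is not to be included in the RULES parameter).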

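The rules file has the following format:

Each line represents a pair, consisting of a character with accent followed by a character without accent. The first is translated into the second. For example,

À        A
Á        A
Â        A
Ã        A
Ä        A
Å        A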

A more complete example, which is directly useful for most European languages, can be found in unaccent.rules, which is installed in $SHAREDIR/tsearch_data/ when the unaccent module is installed.

E.44.2. Usage

Installing the unaccent extension creates a text search template unaccent and a dictionary unaccent based on it. The unaccent dictionary has the default parameter setting RULES='unaccent', which makes it immediately usable with the standard unaccent.rules file. If you wish, you can alter the parameter, for example

mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');

or create new dictionaries based on the template.
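
As a minimal sketch of the second option, the statements below create a separate dictionary from the unaccent template instead of altering the installed one. The dictionary name my_unaccent is hypothetical, and the sketch assumes that a file my_rules.rules already exists in $SHAREDIR/tsearch_data/.

```sql
-- Hypothetical dictionary name; assumes my_rules.rules exists
-- in $SHAREDIR/tsearch_data/.
CREATE TEXT SEARCH DICTIONARY my_unaccent (
    TEMPLATE = unaccent,
    RULES = 'my_rules'
);

-- Check the new dictionary the same way as the built-in one.
SELECT ts_lexize('my_unaccent', 'Hôtel');
```

Leaving the stock unaccent dictionary unchanged means that text search configurations already referring to it keep their current behavior.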

To test the dictionary, you can try:

mydb=# select ts_lexize('unaccent','Hôtel');
 ts_lexize
-----------
 {Hotel}
(1 row)

Here is an example showing how to insert the unaccent dictionary into a text search configuration:

mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# select to_tsvector('fr','Hôtels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4
(1 row)

mydb=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
 ?column?
----------
 t
(1 row)

mydb=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 <b>Hôtel</b> de la Mer
(1 row)

E.44.3. Functions

The unaccent() function removes accents (diacritic signs) from a given string. It is essentially a wrapper around the unaccent dictionary that can also be used outside normal text search contexts.

unaccent([dictionary, ] string) returns text

For example:

SELECT unaccent('unaccent', 'Hôtel');
SELECT unaccent('Hôtel');
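
Because unaccent() works on plain strings, it can also support accent-insensitive comparisons outside full text search. The following is only a sketch: the inline row set places and its column name are hypothetical, and it assumes the unaccent extension is installed in the current database with the standard rules file.

```sql
-- Hypothetical sample rows, supplied inline so no table is needed.
SELECT name
FROM (VALUES ('Hôtel de la Mer'),
             ('Hotel Central'),
             ('Café Bleu')) AS places(name)
-- Strip accents from both sides, then compare case-insensitively.
WHERE unaccent(name) ILIKE unaccent('%hôtel%');
```

Under those assumptions the query should match both hotel rows, whether or not the stored name or the search pattern carries the accent.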