Trigram search
Trigram search is a method of searching for text when the exact syntax or spelling of the target object is not precisely known[1] or when queries may be regular expressions.[2] It finds objects which match the maximum number of three consecutive character strings (i.e. trigrams) in the entered search terms, which are generally near matches.[3] Two strings with many shared trigrams can be expected to be very similar.[4] Trigrams also allow for efficiently creating search engine indexes for searches that are regular expressions or match the text inexactly. Indexes can significantly accelerate searches.[5][6] A threshold for number of trigram matches can be specified as a cutoff point, after which a result is no longer considered a match.[4]
Using trigrams for accelerating searches is a technique used in some systems for code searching, in situations in which queries that are regular expressions may be useful,[5][2][7] in search engines such as Elasticsearch,[8] as well as in databases such as PostgreSQL.[4]
Examples
Consider the string "alice". The trigrams of the string would be "ali", "lic", and "ice," not including spaces.[5] Searching for this string in a database with a trigram-based index would involve finding which objects contain as many of the three trigrams as possible.
As a concrete example of using trigram search to search for a regular expression query, consider searching for the string ab[cd]e
, where the brackets denote that the third character in the string being searched for could be c
or d
. In this situation, one could query the index for objects that have the two trigrams abc
and bce
or the two trigrams abd
and bde
. Thus, finding this query would involve no string matching, and could just query the index directly, which can be faster in practice.[2]
References
- Hardarson, Omar (1997). "Interactive coding of economic activity using trigram search in BLAISE III" (PDF). International Blaise User Group. Note: This article discusses trigram search as a way of efficiently encoding certain kinds of economic data, and finds that this technique is particularly useful when the users of the system don't have a lot of context for the structure of the data.
- Cox, Russ (January 2012). "Regular Expression Matching with a Trigram Index or How Google Code Search Worked".
- Adams, Elizabeth; Meltzer, Arnold (1 March 1993). "Trigrams as index element in full text retrieval: Observations and experimental results". Proceedings of the 1993 ACM conference on Computer science - CSC '93. pp. 433–439. doi:10.1145/170791.170891. ISBN 0897915585. S2CID 16701550.
- "F.33. pg_trgm". PostgreSQL Documentation. 2022-05-12. Retrieved 2022-05-28.
- "Fast Search Using PostgreSQL Trigram Text Indexes". GitLab. 2016-03-18. Retrieved 2022-05-28.
- Zobel, Justin; Moffat, Alistair; Sacks-Davis, Ron (1993). "Searching Large Lexicons for Partially Specified Terms using Compressed Inverted Files" (PDF). Conference on Very Large Databases (VLDB). Note: This research paper does not use the term "trigram search" but does seem to be the first instance in the literature of using n-grams as indices, and is cited in the Russ Cox article as the first instance of the structure of a reverse index based on trigrams. The paper also cited successful performance results from using this style of search.
- "Big Grep". resources.sei.cmu.edu. Retrieved 2022-06-12. Also called BigGrep. Uses n-grams (not always 3),
- "N-gram tokenizer | Elasticsearch Guide [8.2] | Elastic". www.elastic.co. Retrieved 2022-05-28.