n-gram
n-gram is a series of n adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome. They are collected from a text or speech corpus. If Latin numerical prefixes are used, then n-gram of size 1 is called a "unigram", size 2 a "bigram" (or, less commonly, a "digram") etc. If, instead of the Latin ones, the English cardinal numbers are furtherly used, then they are called "four-gram", "five-gram", etc. Similarly, using Greek numerical prefixes such as "monomer", "dimer", "trimer", "tetramer", "pentamer", etc., or English cardinal numbers, "one-mer", "two-mer", "three-mer", etc. are used in computational biology, for polymers or oligomers of a known size, called k-mers. When the items are words, n-grams may also be called shingles.[1]
Examples
Field | Unit | Sample sequence | 1-gram sequence | 2-gram sequence | 3-gram sequence |
---|---|---|---|---|---|
Vernacular name | unigram | bigram | trigram | ||
Order of resulting Markov model | 0 | 1 | 2 | ||
Protein sequencing | amino acid | ... Cys-Gly-Leu-Ser-Trp ... | ..., Cys, Gly, Leu, Ser, Trp, ... | ..., Cys-Gly, Gly-Leu, Leu-Ser, Ser-Trp, ... | ..., Cys-Gly-Leu, Gly-Leu-Ser, Leu-Ser-Trp, ... |
DNA sequencing | base pair | ...AGCTTCGA... | ..., A, G, C, T, T, C, G, A, ... | ..., AG, GC, CT, TT, TC, CG, GA, ... | ..., AGC, GCT, CTT, TTC, TCG, CGA, ... |
Language model | character | ...to_be_or_not_to_be... | ..., t, o, _, b, e, _, o, r, _, n, o, t, _, t, o, _, b, e, ... | ..., to, o_, _b, be, e_, _o, or, r_, _n, no, ot, t_, _t, to, o_, _b, be, ... | ..., to_, o_b, _be, be_, e_o, _or, or_, r_n, _no, not, ot_, t_t, _to, to_, o_b, _be, ... |
Word n-gram language model | word | ... to be or not to be ... | ..., to, be, or, not, to, be, ... | ..., to be, be or, or not, not to, to be, ... | ..., to be or, be or not, or not to, not to be, ... |
Figure 1 shows several example sequences and the corresponding 1-gram, 2-gram and 3-gram sequences.
Here are further examples; these are word-level 3-grams and 4-grams (and counts of the number of times they appeared) from the Google n-gram corpus.[2]
3-grams
- ceramics collectables collectibles (55)
- ceramics collectables fine (130)
- ceramics collected by (52)
- ceramics collectible pottery (50)
- ceramics collectibles cooking (45)
4-grams
- serve as the incoming (92)
- serve as the incubator (99)
- serve as the independent (794)
- serve as the index (223)
- serve as the indication (72)
- serve as the indicator (120)
References
- Broder, Andrei Z.; Glassman, Steven C.; Manasse, Mark S.; Zweig, Geoffrey (1997). "Syntactic clustering of the web". Computer Networks and ISDN Systems. 29 (8): 1157–1166. doi:10.1016/s0169-7552(97)00031-7. S2CID 9022773.
- Alex Franz and Thorsten Brants (2006). "All Our N-gram are Belong to You". Google Research Blog. Archived from the original on 17 October 2006. Retrieved 16 December 2011.
Further reading
- Christopher D. Manning, Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press: 1999. ISBN 0-262-13360-1.
- White, Owen; Dunning, Ted; Sutton, Granger; Adams, Mark; Venter, J.Craig; Fields, Chris (1993). "A quality control algorithm for dna sequencing projects". Nucleic Acids Research. 21 (16): 3829–3838. doi:10.1093/nar/21.16.3829. PMC 309901. PMID 8367301.
- Frederick J. Damerau, Markov Models and Linguistic Theory. Mouton. The Hague, 1971.
- Figueroa, Alejandro; Atkinson, John (2012). "Contextual Language Models For Ranking Answers To Natural Language Definition Questions". Computational Intelligence. 28 (4): 528–548. doi:10.1111/j.1467-8640.2012.00426.x. S2CID 27378409.
- Brocardo, Marcelo Luiz; Issa Traore; Sherif Saad; Isaac Woungang (2013). Authorship Verification for Short Messages Using Stylometry (PDF). IEEE Intl. Conference on Computer, Information and Telecommunication Systems (CITS).
External links
- Ngram Extractor: Gives weight of n-gram based on their frequency.
- Google's Google Books n-gram viewer and Web n-grams database (September 2006)
- STATOPERATOR N-grams Project Weighted n-gram viewer for every domain in Alexa Top 1M
- 1,000,000 most frequent 2,3,4,5-grams from the 425 million word Corpus of Contemporary American English
- Peachnote's music ngram viewer
- Stochastic Language Models (n-Gram) Specification (W3C)
- Michael Collins's notes on n-Gram Language Models
- OpenRefine: Clustering In Depth