In statistics, significance is a measure that describes the probability of a systematic relationship between variables; in text analysis, these variables are parts of a text (e.g., words). Significance expresses whether an apparent connection could be purely coincidental or very probably exists.
Depending on the object of investigation, different formulas are used for the calculation, most of which originate from computational linguistics. Significance measures are meant to separate important from unimportant co-occurrences. To do so, they relate statistical parameters such as corpus size, the frequency of the individual words, and the frequency of their co-occurrence.
One of the simplest significance measures is a frequency-sorted co-occurrence list, i.e., a ranking by how often two words co-occur in the entire corpus. A disadvantage of frequency-sorted lists is that, according to Zipf's law, one of the founding observations of quantitative linguistics, very many words occur very rarely. Consequently, a threshold greater than 1, i.e., requiring multiple co-occurrences of a word pair, filters out two-thirds or more of the co-occurrences. Calculated with the eAQUA tools, the figures for selected corpora are as follows:
Corpus | Number of co-occurrences | Co-occurrences with frequency 1 | Share in percent |
---|---|---|---|
BTL 1) | 137,486,214 | 110,876,836 | 80.65 |
MPL 2) | 580,247,568 | 398,935,822 | 68.75 |
Perseus Shakespeare 3) | 6,746,602 | 5,027,170 | 74.51 |
TLG 4) | 355,021,014 | 258,961,566 | 72.94 |
As the overview shows, a large share of the co-occurrences found are low-frequency. Filtering out the important ones requires calculation methods, some of which are presented here.
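
The frequency threshold described above can be sketched in a few lines of Python. This is a minimal illustration, not the eAQUA implementation: the toy corpus, the function names, and the choice to count each unordered word pair once per sentence are all assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count in how many sentences each unordered word pair occurs together."""
    counts = Counter()
    for sentence in sentences:
        # Each word is counted once per sentence; sorting makes pairs canonical.
        words = sorted(set(sentence.split()))
        counts.update(combinations(words, 2))
    return counts


def share_of_singletons(counts):
    """Fraction of word pairs that co-occur exactly once."""
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


def filter_by_threshold(counts, min_freq=2):
    """Keep only word pairs that co-occur at least `min_freq` times."""
    return {pair: c for pair, c in counts.items() if c >= min_freq}


# Hypothetical toy corpus, one sentence per string.
toy_corpus = [
    "the king is dead",
    "long live the king",
    "the queen is alive",
]

counts = cooccurrence_counts(toy_corpus)
kept = filter_by_threshold(counts, min_freq=2)
print(f"{share_of_singletons(counts):.1%} of the pairs occur only once")
print(sorted(kept))
```

Even in this tiny corpus most pairs occur only once, so a threshold of 2 removes the bulk of the list before any significance measure is applied.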
The Dice coefficient (also Sørensen-Dice coefficient, named after the botanists Thorvald Sørensen and Lee Raymond Dice) expresses the similarity of two terms as a number between 0 and 1. The calculation is based on so-called N-grams: a term or a text is divided into fragments of equal size, which can be letters, phonemes, whole words, or similar units. The number of N-grams present in both terms is determined and set in relation to the total number of N-grams. It is calculated according to the formula

$$d(a,b) = \frac{2\,|T(a) \cap T(b)|}{|T(a)| + |T(b)|}$$

where $T(a) \cap T(b)$ is the intersection of the N-grams of both terms and $|T(a)|$ or $|T(b)|$ is the number of N-grams formed per term.
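Example 1: term a = Tür, term b = Tor | |
---|---|
Bigram | Trigram |
a = { §T, Tü, ür, r§ } b = { §T, To, or, r§ } | a = { §§T, §Tü, Tür, ür§, r§§ } b = { §§T, §To, Tor, or§, r§§ } |
Common: { §T, r§ }; Dice coefficient: 2·2 / (4 + 4) = 0.5 | Common: { §§T, r§§ }; Dice coefficient: 2·2 / (5 + 5) = 0.4 |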
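Example 2: term a = Spiegel, term b = Spargel | |
---|---|
Bigram | Trigram |
a = { §S, Sp, pi, ie, eg, ge, el, l§ } b = { §S, Sp, pa, ar, rg, ge, el, l§ } | a = { §§S, §Sp, Spi, pie, ieg, ege, gel, el§, l§§ } b = { §§S, §Sp, Spa, par, arg, rge, gel, el§, l§§ } |
Common: { §S, Sp, ge, el, l§ }; Dice coefficient: 2·5 / (8 + 8) = 0.625 | Common: { §§S, §Sp, gel, el§, l§§ }; Dice coefficient: 2·5 / (9 + 9) ≈ 0.556 |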
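
The two examples can be reproduced with a short Python sketch. It assumes that a term is padded with N−1 boundary characters (§) on each side, as in the tables above, and that the N-grams of a term are treated as a set; the function names are chosen for illustration only.

```python
def padded_ngrams(term, n, pad="§"):
    """Return the set of character N-grams of `term`,
    padded with n-1 boundary characters on each side."""
    padded = pad * (n - 1) + term + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient of two terms based on their character N-grams."""
    grams_a = padded_ngrams(a, n)
    grams_b = padded_ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


for a, b in [("Tür", "Tor"), ("Spiegel", "Spargel")]:
    print(a, b, "bigram:", round(dice(a, b, 2), 3),
          "trigram:", round(dice(a, b, 3), 3))
```

Because sets are used, an N-gram that occurs twice within one term is counted only once; for the terms above this makes no difference.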
When evaluating co-occurrences, the Dice coefficient can be applied by relating the frequencies of the words:

$$sig_{Dice}(a,b) = \frac{2\,n_{ab}}{n_a + n_b}$$

Here, $n_a$ and $n_b$ are the frequencies of the two terms, and $n_{ab}$ is the number of their joint occurrences. This calculation yields a relatively simple scale. The more frequently the two terms are used together, the closer the value comes to 1. If the two terms occur only together, the highest significance, 1, is reached; how often this co-occurrence is found in the corpus is irrelevant. This results in an important property of the Dice coefficient: co-occurrences that rarely occur together, where one word is high-frequency and the other is low-frequency, are scored as nonsignificant.
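
A minimal sketch of this frequency-based variant, assuming the form 2·n_ab / (n_a + n_b), illustrates both properties mentioned above. The parameter names and all counts are hypothetical.

```python
def dice_significance(n_a, n_b, n_ab):
    """Dice coefficient from word frequencies n_a, n_b and joint frequency n_ab."""
    if n_ab > min(n_a, n_b):
        raise ValueError("joint frequency cannot exceed a word frequency")
    return 2 * n_ab / (n_a + n_b)


# Two words that occur only together score 1, whether the pair is rare or frequent.
print(dice_significance(5, 5, 5))
print(dice_significance(5000, 5000, 5000))

# A high-frequency word paired with a low-frequency word scores close to 0,
# even if the rare word never occurs without the frequent one.
print(dice_significance(10_000, 10, 10))
```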
The Jaccard coefficient (named after the botanist Paul Jaccard) expresses the similarity of two terms as a number between 0 and 1. In text mining, the calculation is based on so-called N-grams: a term or a text is broken down into fragments of equal size, which can be letters, phonemes, whole words, or similar units. The number of N-grams present in both terms is determined and set in relation to the total number of N-grams. It is calculated according to the formula

$$J(a,b) = \frac{|T(a) \cap T(b)|}{|T(a)| + |T(b)| - |T(a) \cap T(b)|}$$

where $T(a) \cap T(b)$ is the intersection of the N-grams of both terms and $|T(a)|$ or $|T(b)|$ is the number of N-grams formed per term.
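Example 1: term a = Tür, term b = Tor | |
---|---|
Bigram | Trigram |
a = { §T, Tü, ür, r§ } b = { §T, To, or, r§ } | a = { §§T, §Tü, Tür, ür§, r§§ } b = { §§T, §To, Tor, or§, r§§ } |
Common: { §T, r§ }; Jaccard coefficient: 2 / (4 + 4 − 2) ≈ 0.333 | Common: { §§T, r§§ }; Jaccard coefficient: 2 / (5 + 5 − 2) = 0.25 |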
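Example 2: term a = Spiegel, term b = Spargel | |
---|---|
Bigram | Trigram |
a = { §S, Sp, pi, ie, eg, ge, el, l§ } b = { §S, Sp, pa, ar, rg, ge, el, l§ } | a = { §§S, §Sp, Spi, pie, ieg, ege, gel, el§, l§§ } b = { §§S, §Sp, Spa, par, arg, rge, gel, el§, l§§ } |
Common: { §S, Sp, ge, el, l§ }; Jaccard coefficient: 5 / (8 + 8 − 5) ≈ 0.455 | Common: { §§S, §Sp, gel, el§, l§§ }; Jaccard coefficient: 5 / (9 + 9 − 5) ≈ 0.385 |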
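
For the Jaccard coefficient, the same two examples can be checked with a small Python sketch. It assumes padding with N−1 boundary characters (§) per side, as in the tables above, and computes the coefficient as intersection size divided by union size; the function names are illustrative.

```python
def padded_ngrams(term, n, pad="§"):
    """Return the set of character N-grams of `term`,
    padded with n-1 boundary characters on each side."""
    padded = pad * (n - 1) + term + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b, n=2):
    """Jaccard coefficient of two terms: shared N-grams / all distinct N-grams."""
    grams_a = padded_ngrams(a, n)
    grams_b = padded_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


for a, b in [("Tür", "Tor"), ("Spiegel", "Spargel")]:
    print(a, b, "bigram:", round(jaccard(a, b, 2), 3),
          "trigram:", round(jaccard(a, b, 3), 3))
```

The size of the union equals the sum of both set sizes minus the size of the intersection, so this agrees with counting the N-grams per term and subtracting the shared ones.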
For the evaluation of co-occurrences, the Jaccard coefficient is similar to the Dice coefficient. Both calculate the significance value in a similar way, and the relative order of the co-occurrences remains the same; only the absolute significance values differ. A model calculation with a mean word frequency of 100 looks as follows:
Frequency of a | Frequency of b | Joint frequency | Dice | Jaccard |
---|---|---|---|---|
100 | 100 | 1 | 0.01 | 0.005 |
100 | 100 | 10 | 0.1 | 0.05 |
100 | 100 | 50 | 0.5 | 0.33 |
100 | 100 | 90 | 0.9 | 0.82 |
100 | 100 | 100 | 1 | 1 |
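
The model calculation can be recomputed with the sketch below. It assumes the frequency-based forms 2·n_ab / (n_a + n_b) for Dice and n_ab / (n_a + n_b − n_ab) for Jaccard; the function names are illustrative. Under these assumptions the two values are linked by J = D / (2 − D), which increases with D, so both measures rank co-occurrences in the same order.

```python
import math


def dice_significance(n_a, n_b, n_ab):
    """Dice coefficient from word frequencies and joint frequency."""
    return 2 * n_ab / (n_a + n_b)


def jaccard_significance(n_a, n_b, n_ab):
    """Jaccard coefficient from word frequencies and joint frequency."""
    return n_ab / (n_a + n_b - n_ab)


rows = []
for n_ab in (1, 10, 50, 90, 100):
    d = dice_significance(100, 100, n_ab)
    j = jaccard_significance(100, 100, n_ab)
    # The relation J = D / (2 - D) holds for every row.
    assert math.isclose(j, d / (2 - d))
    rows.append((n_ab, d, j))
    print(f"n_ab={n_ab:>3}  Dice={d:.2f}  Jaccard={j:.3f}")
```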
One approach to computing significant co-occurrences is based on the Poisson distribution (named after the mathematician Siméon Denis Poisson), a discrete probability distribution with the probability mass function

$$P_\lambda(k) = \frac{\lambda^k}{k!}\,e^{-\lambda}$$
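On the basis of the Poisson distribution, Quasthoff / Wolff 5) give the Poisson measure with the formula

$$sig(a,b) = \frac{\lambda - k\log\lambda + \log k!}{\log n}, \qquad \lambda = \frac{n_a\,n_b}{n}$$

which has been used, for example, to compute corpora in the Vocabulary Portal. The two relevant factors are $n$ (the number of sentences in the corpus) and $k$ (the frequency of co-occurrence, also called $n_{ab}$); $n_a$ and $n_b$ are the frequencies of the two words.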
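After a conversion (Stirling's approximation $\log k! \approx k\log k - k$) and the basic assumption that $k$ is large compared to the expected co-occurrence frequency $\lambda = n_a n_b / n$, we get the following calculation:

$$sig(a,b) \approx \frac{k\,(\log k - \log\lambda - 1)}{\log n} = \frac{k\log\frac{k}{\lambda} - k}{\log n}$$

Thus, apart from the normalization by $\log n$, the Poisson measure can be reduced to the difference between local mutual information, $k\log\frac{k}{\lambda}$, and frequency, $k$.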
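
To see how close this reduction is numerically, the following sketch compares a Poisson measure of the form (λ − k·ln λ + ln k!) / ln n, with λ = n_a·n_b / n, against "local mutual information minus frequency", also divided by ln n. The exact formula, the use of natural logarithms, the normalization of the reduced form, and the example counts are assumptions made for illustration; they are not taken from the eAQUA tools.

```python
import math


def poisson_significance(n, n_a, n_b, k):
    """Poisson measure, assuming (lambda - k*ln(lambda) + ln(k!)) / ln(n)."""
    lam = n_a * n_b / n                       # expected co-occurrence frequency
    log_k_factorial = math.lgamma(k + 1)      # ln(k!) without computing k!
    return (lam - k * math.log(lam) + log_k_factorial) / math.log(n)


def reduced_significance(n, n_a, n_b, k):
    """Reduced form: (local mutual information - frequency) / ln(n)."""
    lam = n_a * n_b / n
    local_mutual_information = k * math.log(k / lam)
    return (local_mutual_information - k) / math.log(n)


# Hypothetical counts: 1,000,000 sentences, two words with 1,000 and 2,000
# occurrences that share 200 sentences (expected by chance: 2).
exact = poisson_significance(1_000_000, 1_000, 2_000, 200)
reduced = reduced_significance(1_000_000, 1_000, 2_000, 200)
print(f"exact={exact:.2f}  reduced={reduced:.2f}")
```

For these counts the two values differ by about one percent; the gap grows when k is small or close to λ, where Stirling's approximation and the dropped λ term matter more.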
According to Dunning 6), one of the most popular significance measures in the analysis of large text corpora is the log-likelihood measure, which is based on the binomial distribution, one of the most important discrete probability distributions.
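Dunning finally arrives at the following formula when calculating the log-likelihood, where $\lambda$ denotes the likelihood ratio:

$$-2\log\lambda = 2\,\bigl[\log L(p_1,k_1,n_1) + \log L(p_2,k_2,n_2) - \log L(p,k_1,n_1) - \log L(p,k_2,n_2)\bigr]$$

under the condition

$$L(p,k,n) = p^k\,(1-p)^{n-k}, \qquad p_1 = \frac{k_1}{n_1}, \qquad p_2 = \frac{k_2}{n_2}, \qquad p = \frac{k_1+k_2}{n_1+n_2}$$

With $k_1 = n_{ab}$, $n_1 = n_a$, $k_2 = n_b - n_{ab}$ and $n_2 = n - n_a$, where $n$ is the number of sentences, $n_a$ and $n_b$ are the frequencies of the two words, and $n_{ab}$ is the frequency of their co-occurrence, the log-likelihood measure can thus be derived as follows:

$$\begin{aligned} lgl(a,b) = 2\,\bigl[\, & n\log n - n_a\log n_a - n_b\log n_b + n_{ab}\log n_{ab} \\ & + (n - n_a - n_b + n_{ab})\log(n - n_a - n_b + n_{ab}) \\ & + (n_a - n_{ab})\log(n_a - n_{ab}) + (n_b - n_{ab})\log(n_b - n_{ab}) \\ & - (n - n_a)\log(n - n_a) - (n - n_b)\log(n - n_b) \,\bigr] \end{aligned}$$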
Characteristic of the log-likelihood measure, in contrast to the Poisson measure for example, is that it treats significantly frequent and significantly rare events equally. In the digitized TLG data (version TLG-E), for instance, about 1.3 million co-occurrences in about 73.8 million words occur only once and are nevertheless assigned an lgl value of slightly more than 30. A similarly large value of 34.553 is assigned, for example, to καὶ and τὸ, which were counted together 14,311 times.
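
Why a pair that occurs only once can still reach a value around 30 can be illustrated with a sketch of the log-likelihood measure. It assumes the common expanded form in terms of the corpus size n, the word frequencies n_a and n_b, and the joint frequency n_ab, with natural logarithms and 0·log 0 taken as 0. The counts are hypothetical and not taken from the TLG.

```python
import math


def xlogx(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood(n, n_a, n_b, n_ab):
    """Log-likelihood significance from corpus size, word and joint frequencies."""
    return 2 * (
        xlogx(n) - xlogx(n_a) - xlogx(n_b) + xlogx(n_ab)
        + xlogx(n - n_a - n_b + n_ab)
        + xlogx(n_a - n_ab) + xlogx(n_b - n_ab)
        - xlogx(n - n_a) - xlogx(n - n_b)
    )


n = 1_000_000  # hypothetical number of sentences

# Two words that each occur once, in the same sentence.
hapax_pair = log_likelihood(n, 1, 1, 1)
print(round(hapax_pair, 2), "vs", round(2 * (math.log(n) + 1), 2))

# Two words whose joint frequency equals the chance expectation n_a * n_b / n.
independent_pair = log_likelihood(10_000, 100, 100, 1)
print(round(independent_pair, 6))
```

For a pair of words that each occur once and do so together, the value is approximately 2·(ln n + 1), so it depends only on the corpus size and reaches about 30 once the corpus has on the order of a million sentences. A pair whose joint frequency matches the chance expectation scores 0.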