Post on 21-Jun-2015
Medium Information Quantity
Vitalie Scurtu
History
● Shannon - "A Mathematical Theory of Communication" - 1948
● Lossless compression (Shannon-Fano, Adaptive Huffman)
● Today it is used in cryptography, in the analysis of DNA sequences (http://pnylab.com/pny/papers/cdna/cdna/index.html), and in Natural Language Processing applications
What is information quantity?
The information quantity of a phenomenon depends on its frequency
● Low information quantity
○ It rains in London
○ The economy is in crisis
○ Berlusconi went with Escorts
Low information quantity: I am telling you things you have already heard many times, nothing new
● High information quantity
○ Today it snows in Rome
○ Dentists on strike
High information quantity: I am telling you things you have never or rarely heard, a lot of new information
Entropy and the medium information quantity
● The formula of information quantity
○ H(m) = -Σ p(m) * log p(m)
○ V(m) = (1/n) Σ (1 - p(m))
■ p(m) - probability that m will happen
H(m) - Entropy or information quantity (IQ)
V(m) - Medium information quantity (MIQ)
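The two formulas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: p(m) is taken to be the relative frequency of word m in a reference corpus, both sums run over every token of the document, the logarithm is natural, and words missing from the corpus are skipped. The function names and the toy corpus are hypothetical and do not come from the slides.

```python
import math
from collections import Counter


def word_probabilities(corpus_tokens):
    """Relative frequency of each word in a reference corpus (assumed source of p(m))."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def information_quantity(doc_tokens, probs):
    """H = -sum of p(m) * log p(m) over the tokens of the document."""
    # Assumption: tokens unseen in the corpus are ignored.
    return -sum(probs[t] * math.log(probs[t]) for t in doc_tokens if t in probs)


def medium_information_quantity(doc_tokens, probs):
    """V = (1/n) * sum of (1 - p(m)) over the n known tokens of the document."""
    known = [t for t in doc_tokens if t in probs]
    if not known:
        return 0.0
    return sum(1.0 - probs[t] for t in known) / len(known)


# Toy corpus and document, for illustration only.
corpus = "it rains in london it rains in london today it snows in rome".split()
probs = word_probabilities(corpus)
doc = "it snows in rome".split()

print("H:", information_quantity(doc, probs))
print("V:", medium_information_quantity(doc, probs))
```

Under these assumptions, repeating the same document twice doubles H but leaves V unchanged, because V is an average over tokens while H is a sum.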
Probability coefficients
lim p(x) -> 1
x=1..n y=p(x)
Logarithm coefficients
lim log(x) -> -∞ as x -> 0
x=0..1 y=log(x)
Coefficients of Shannon Entropy
-p(x) * log(p(x))
● Very likely words: in, the, is, has, of
● Very unlikely words: APT,
x=p(1..n) y=x*log(x)*-1
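The shape of the curve y = -x * log(x) can be checked numerically. The sketch below assumes the natural logarithm and uses a hypothetical helper name, `entropy_term`; the sample probabilities are arbitrary. It illustrates that both very likely and very unlikely outcomes contribute little to the sum, while the largest contribution comes from probabilities in between.

```python
import math


def entropy_term(p):
    """Contribution -p * log(p) of a single outcome with probability p."""
    # By convention the term is 0 at p = 0; at p = 1 it is 0 because log(1) = 0.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p)


# Arbitrary sample probabilities, from very unlikely to certain.
for p in (0.0, 0.001, 0.1, 1 / math.e, 0.9, 1.0):
    print(f"p={p:.3f}  -p*log(p)={entropy_term(p):.4f}")
```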
Distribution of documents by entropy
Zipf distribution (long tail)
Few documents at the upper extreme and many at the lower extreme
● MIN=0
● MAX=1700 (no limit)
x=doc(1..n) y=H(doc(1..n))
Distribution of documents by medium information quantity
Gaussian distribution
Few documents at the extremes and the majority around the middle values
MIN=0
MAX=1.0
x=doc(1...n) y=V(doc(1...n))
Entropy depends on the text length
Correlation between entropy and text length: 0.99 - the highest correlation, the two are almost identical
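One way to see why this happens is a small simulation. The sketch below is not the experiment behind the slides: it generates synthetic documents of random length from a Zipf-like toy vocabulary, assumes p(m) is the relative frequency of a word across all generated documents, and computes the Pearson correlation by hand. All names and parameters are illustrative, and the numbers it prints will not reproduce the figures reported here.

```python
import math
import random
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)  # fixed seed so the toy run is repeatable
vocabulary = [f"w{i}" for i in range(200)]
weights = [1.0 / (rank + 1) for rank in range(len(vocabulary))]  # Zipf-like toy weights

# Synthetic documents of varying length (not the corpus used in the slides).
docs = [
    random.choices(vocabulary, weights=weights, k=random.randint(20, 500))
    for _ in range(300)
]

# Assumed p(m): relative frequency of each word over all synthetic documents.
counts = Counter(token for doc in docs for token in doc)
total = sum(counts.values())
p = {word: c / total for word, c in counts.items()}

lengths = [len(doc) for doc in docs]
entropy = [-sum(p[t] * math.log(p[t]) for t in doc) for doc in docs]  # sum over tokens
miq = [sum(1.0 - p[t] for t in doc) / len(doc) for doc in docs]       # mean over tokens

print("corr(length, entropy):", pearson(lengths, entropy))
print("corr(length, MIQ):    ", pearson(lengths, miq))
```

In this toy setup the entropy is a sum of positive terms, one per token, so it grows with the number of tokens, while the medium information quantity is an average and has no built-in dependence on length.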
Conclusions
● MIQ has a very low correlation with text length: 0.05, vs. 0.985 for entropy
● Correlation of MIQ vs. IQ is 0.57
● Entropy depends on text length; MIQ does not depend on text length, therefore it can find anomalies
● MIQ: information about text style
● MIQ compensates for IQ
The End
Questions? Email scurtu19@gmail.com