Science Fair Project Encyclopedia
Latent semantic analysis
LSA is a preprocessing step, used before the classification or search of documents. The purpose of LSA is to make documents easier to classify and search. LSA is meant to solve two fundamental problems in natural language processing: synonymy and polysemy. In synonymy, different writers use different words to describe the same idea. Thus, a person issuing a query in a search engine may use a different word from the one that appears in a relevant document, and so fail to retrieve that document. In polysemy, the same word can have multiple meanings, so a searcher can retrieve unwanted documents that use the word in an alternate sense.
LSA starts with a document-term matrix, a sparse matrix whose rows correspond to documents and whose columns correspond to terms (typically stemmed words that appear in the documents). The values of the matrix are typically tf-idf weights: they are proportional to the number of times each term appears in each document, with rare terms upweighted to reflect their relative importance.
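The construction above can be sketched in a few lines of plain Python. The toy corpus and the whitespace tokenization are illustrative assumptions; a real pipeline would stem terms, drop stop words, and store the matrix sparsely.

```python
import math

# Toy corpus (hypothetical); real input would be stemmed, with stop words removed.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

def tf_idf(doc_tokens, term):
    """Term frequency times inverse document frequency:
    terms occurring in fewer documents get a larger weight."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in tokenized if term in d)
    idf = math.log(n_docs / df) if df else 0.0
    return tf * idf

# Rows correspond to documents, columns to terms.
matrix = [[tf_idf(doc, term) for term in vocab] for doc in tokenized]
```

Note that a term appearing in every document gets idf = log(1) = 0, so ubiquitous words contribute nothing, which is exactly the upweighting of rare terms described above.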
LSA then finds a low-rank approximation to the document-term matrix, through the use of singular value decomposition (SVD). In LSA, this SVD is truncated, so that each document and term is represented by a vector of much lower dimensionality than the total number of words in the vocabulary. Thus, when a user issues a query, it is mapped into this low-dimensional space and compared to documents in that same space.
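The truncation step can be sketched with numpy. The small matrix is a made-up example; by the Eckart–Young theorem, discarding the smallest singular values yields the best rank-k approximation in the Frobenius norm.

```python
import numpy as np

# Toy document-term matrix (rows = documents, columns = terms);
# a real matrix would hold tf-idf weights and be far larger and sparser.
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Full SVD: X = U @ diag(s) @ Vt, with s in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values and their vectors.
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each document is now a k-dimensional vector (rows of U_k * s_k),
# each term a k-dimensional vector (rows of Vt_k.T * s_k).
doc_vectors = U_k * s_k
term_vectors = Vt_k.T * s_k

# The rank-k approximation to the original matrix.
X_k = U_k @ np.diag(s_k) @ Vt_k
```

The Frobenius-norm error of this approximation equals the norm of the discarded singular values, which is what makes the truncated SVD optimal among rank-k matrices.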
Because it uses a low-dimensional representation for terms and documents, LSA must capture meaning in documents, rather than simply recording which terms occur. Thus, documents and terms with similar meaning lie close together in the low-dimensional space. This can mitigate polysemy (by using more than one word in the query to disambiguate in the low-dimensional space) and synonymy (because synonymous words map to nearby points in the low-dimensional space).
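A minimal sketch of the effect, under assumed data: the vocabulary and documents below are hypothetical, chosen so that "car" and "automobile" co-occur. A query containing only "automobile" is folded into the reduced space (q @ Vt_k.T / s_k, the standard fold-in formula) and still matches the documents that use "car".

```python
import numpy as np

# Hypothetical vocabulary: ["car", "automobile", "flower", "petal"].
X = np.array([
    [2.0, 1.0, 0.0, 0.0],   # document about cars
    [1.0, 2.0, 0.0, 0.0],   # document about cars, different wording
    [0.0, 0.0, 2.0, 1.0],   # document about flowers
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

def fold_in(query_vec):
    """Map a term-space query vector into the k-dimensional LSA space."""
    return query_vec @ Vt_k.T / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Query uses only "automobile", a word absent from the first document's
# surface text weightings in general; it still lands near the car documents.
q = fold_in(np.array([0.0, 1.0, 0.0, 0.0]))
sims = [cosine(q, d) for d in U_k]
```

Here `sims` ranks both car documents above the flower document, illustrating how synonymous terms map to nearby directions in the reduced space.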
Recently, LSA has come under criticism because its implicit probabilistic model does not match the observed data: LSA effectively assumes that words and documents follow a joint Gaussian model. However, Gaussian models can generate negative values, and it is impossible to have a negative number of words in a document. Thus, a newer alternative is probabilistic latent semantic analysis, based on a multinomial model, which is reported to give better results than standard LSA. However, LSA still remains a standard algorithm in information retrieval.
External links and references
- Introduction to Latent Semantic Analysis, by T. K. Landauer, P. W. Foltz, & D. Laham, Discourse Processes, 25, 259-284 (1998).
- Indexing by Latent Semantic Analysis, by S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman, Journal of the American Society for Information Science, 41(6), 391-407 (1990).
- Probabilistic Latent Semantic Analysis, by T. Hofmann, Proc. Uncertainty in Artificial Intelligence (1999).
The contents of this article are licensed from www.wikipedia.org under the GNU Free Documentation License.