
Web Information Retrieval | Vector Space Model

Last Updated : 30 Jul, 2024

A search engine typically responds to a query with a ranked list of relevant documents. The purpose of this article is to describe a first approach to finding relevant documents with respect to a given query. In the Vector Space Model (VSM), each document or query is an N-dimensional vector, where N is the number of distinct terms over all the documents and queries. The i-th component of a vector contains the score of the i-th term for that vector. 

The main score functions are based on Term-Frequency (tf) and Inverse-Document-Frequency (idf). 

Term-Frequency and Inverse-Document Frequency - 
The Term-Frequency ($tf_{i, j}$) is computed with respect to the i-th term and the j-th document: $$ tf_{i, j} = \frac{n_{i, j}}{\sum_k n_{k, j}} $$ where $n_{i, j}$ is the number of occurrences of the i-th term in the j-th document. 
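As a minimal sketch of the formula above, the following Python snippet computes the term frequency of every term in a single document. The function name `term_frequencies` and the whitespace tokenization are assumptions made for illustration and are not part of the original article.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return tf for every term of one document: occurrences / total tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}  # an empty document has no term frequencies
    return {term: n / total for term, n in counts.items()}


# Hypothetical document, tokenized by splitting on whitespace.
doc = "the cat sat on the mat".split()
tf = term_frequencies(doc)
print(tf["the"])  # "the" occurs 2 times out of 6 tokens
```

By construction, the term frequencies of one document sum to 1, which makes documents of different lengths comparable.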

The idea is that if a document contains many occurrences of a given term, it probably deals with that topic. 
The Inverse-Document-Frequency ($idf_i$) takes into consideration the i-th term and all the documents in the collection: $$ idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|} $$ where $|D|$ is the number of documents in the collection and the denominator is the number of documents that contain the term $t_i$. 
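The idf formula can be sketched in a few lines of Python. The function name `inverse_document_frequencies`, the toy collection, and the choice of the natural logarithm are assumptions (the article writes "log" without stating a base; a different base only rescales all scores by a constant).

```python
import math


def inverse_document_frequencies(documents):
    """Map each term to log(|D| / number of documents containing the term)."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each document at most once per term
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical collection of three tokenized documents.
docs = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
idf = inverse_document_frequencies(docs)
print(idf["the"], idf["cat"], idf["dog"])
```

In this toy collection "the" appears in every document and gets an idf of zero, while "dog", which appears in a single document, gets the highest value.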

The intuition is that rare terms are more important than common ones: if a term is present in only one document, it likely characterizes that document. 
The final score $w_{i, j}$ for the i-th term in the j-th document is a simple product: $w_{i, j} = tf_{i, j} \cdot idf_i$. Since a document or query contains only a subset of all the distinct terms in the collection, the term frequency is zero for a large number of terms, so a sparse vector representation is needed to keep the space requirements manageable. 
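One possible sparse representation is a dictionary per document that stores only the non-zero weights. The sketch below combines tf and idf into $w_{i, j}$ under that assumption; the function name `tfidf_vectors`, the toy collection, and the natural logarithm are illustrative choices, not something the article prescribes.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse vector (dict term -> weight) per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = sum(counts.values())
        vector = {}
        for term, n in counts.items():
            weight = (n / total) * math.log(n_docs / doc_freq[term])
            if weight != 0.0:  # store only the non-zero entries
                vector[term] = weight
        vectors.append(vector)
    return vectors


# Hypothetical collection of three tokenized documents.
docs = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Terms that are absent from a document never enter its dictionary, and a term that occurs in every document (here "the") is dropped as well because its idf is zero.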

Cosine Similarity - 
In order to compute the similarity between two vectors a and b (document/query, but also document/document), the cosine similarity is used: \begin{equation} \cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}} \end{equation} 
This formula computes the cosine of the angle between the two vectors; dividing by the norms makes the result independent of vector length. If the vectors point in similar directions, the angle is small, the cosine is close to 1, and the relevance is high. 
It can be shown that, for vectors normalized to unit length, ranking by cosine similarity is equivalent to ranking by Euclidean distance, since $\|\mathbf{a} - \mathbf{b}\|^2 = 2 - 2\cos(\mathbf{a}, \mathbf{b})$. 
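The following sketch computes the cosine similarity of two sparse vectors stored as dictionaries and checks numerically how it relates to the Euclidean distance between the unit-length versions of the same vectors. All function names, the example weights, and the convention of returning 0 for an all-zero vector are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dict term -> weight)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def normalize(v):
    """Scale a non-zero sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {term: w / norm for term, w in v.items()}


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    terms = set(a) | set(b)
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms)


# Hypothetical query and document weights.
query = {"cat": 1.0, "mat": 1.0}
doc = {"cat": 2.0, "sat": 1.0, "mat": 2.0}

cos = cosine_similarity(query, doc)
dist2 = squared_distance(normalize(query), normalize(doc))
print(cos, dist2, 2 - 2 * cos)  # the last two values agree up to rounding
```

Because the squared distance between unit vectors equals 2 minus twice the cosine, sorting documents by decreasing cosine or by increasing distance yields the same order.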

Improvements - 
There is a subtle problem with vector normalization: a short document that covers a single topic can be favored at the expense of a long document that deals with several topics, because the normalization does not take the length of the document into consideration. 

The idea of pivoted normalization is to make documents shorter than an empirical value (the pivoted length $l_p$) less relevant and longer documents more relevant (figure: Pivoted Normalization). 
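The article does not give a formula for pivoted normalization. A commonly used linear form replaces the original normalization factor with a value tilted around the pivot: (1 - slope) * pivot + slope * length. The sketch below assumes that form; the function name `pivoted_norm`, the slope of 0.75, and the pivot of 100 are hypothetical values chosen only for illustration.

```python
def pivoted_norm(length, pivot, slope=0.75):
    """Pivoted normalization factor: tilt the original factor around the pivot."""
    return (1.0 - slope) * pivot + slope * length


pivot = 100.0  # hypothetical pivoted length l_p

short_factor = pivoted_norm(50.0, pivot)   # shorter than the pivot
long_factor = pivoted_norm(200.0, pivot)   # longer than the pivot

print(short_factor)  # larger than 50, so the short document's score is divided by more
print(long_factor)   # smaller than 200, so the long document's score is divided by less
```

A document whose length equals the pivot keeps its original factor. With a slope between 0 and 1, scores of shorter documents are reduced and scores of longer documents are increased relative to plain length normalization.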

 



A major issue not addressed by the VSM is synonymy: semantic relatedness between terms is captured neither by the term frequency nor by the inverse document frequency. To address this problem, the Generalized Vector Space Model (GVSM) was introduced.

The Vector Space Model (VSM) is a widely used information retrieval model that represents documents as vectors in a high-dimensional space, where each dimension corresponds to a term in the vocabulary. The VSM is based on the assumption that the meaning of a document can be inferred from the distribution of its terms, and that documents with similar content will have similar term distributions.

To apply the VSM, first a collection of documents is preprocessed by tokenizing, stemming, and removing stop words. Then, a term-document matrix is constructed, where each row represents a term and each column represents a document. The matrix contains the frequency of each term in each document, or some variant of it (e.g., term frequency-inverse document frequency, TF-IDF).
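The preprocessing and matrix construction described above can be sketched as follows. The stop-word list is a tiny illustrative set, stemming is left out to keep the example free of external libraries, and the names `preprocess` and `term_document_matrix` are assumptions rather than an established API.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "and", "the", "on", "of", "in", "is"}


def preprocess(text):
    """Lowercase, tokenize on letters and digits, and drop stop words (no stemming)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def term_document_matrix(texts):
    """Return (vocabulary, matrix): one row per term, one column per document."""
    counts = [Counter(preprocess(text)) for text in texts]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


# Hypothetical collection of three raw documents.
texts = ["The cat sat on the mat.", "The dog sat.", "A cat and a dog."]
vocabulary, matrix = term_document_matrix(texts)

for term, row in zip(vocabulary, matrix):
    print(term, row)
```

The matrix here holds raw counts; replacing each count with a tf-idf weight gives the weighted variant mentioned in the text.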

The query is also preprocessed and represented as a vector in the same space as the documents. Then, a similarity score is computed between the query vector and each document vector using a cosine similarity measure. Documents are ranked based on their similarity score to the query, and the top-ranked documents are returned as the search results.
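The ranking procedure can be sketched end to end in a few lines. This is a simplified illustration under several assumptions: whitespace tokenization, no stop-word removal or stemming, the natural logarithm for idf, and query terms that do not occur in the collection are ignored because their idf is undefined. The names `tfidf`, `cosine`, and `rank` are hypothetical.

```python
import math
from collections import Counter


def tfidf(tokens, idf):
    """Sparse tf-idf vector for one token list; terms unknown to idf are skipped."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    return {t: (n / total) * idf[t] for t, n in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either one is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank(query, documents):
    """Return (score, document index) pairs sorted by decreasing cosine similarity."""
    tokenized = [d.lower().split() for d in documents]
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log(len(tokenized) / df) for t, df in doc_freq.items()}
    doc_vectors = [tfidf(tokens, idf) for tokens in tokenized]
    query_vector = tfidf(query.lower().split(), idf)
    scores = [(cosine(query_vector, v), i) for i, v in enumerate(doc_vectors)]
    # Ties are broken by document index to keep the order deterministic.
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical collection and query.
documents = [
    "the cat sat on the mat",
    "the dog chased the ball",
    "a cat and a dog played",
]
results = rank("cat mat", documents)
for score, index in results:
    print(round(score, 3), documents[index])
```

For the query "cat mat", the first document matches both terms and ranks first, the third matches only "cat", and the second shares no term with the query and scores zero.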

The VSM has many advantages, such as its simplicity, effectiveness, and ability to handle large collections of documents. However, it also has some limitations, such as the "bag of words" assumption, which ignores word order and context, and the problem of term sparsity, where many terms occur in only a few documents. These limitations can be addressed using more sophisticated models, such as probabilistic models or neural models, that take into account the semantic relationships between words and documents.
 

Advantages:

Access to vast amounts of information: Web Information Retrieval (WIR) provides access to a vast amount of information available on the internet, making it a valuable resource for research, decision-making, and entertainment.

Easy to use: WIR is user-friendly, with simple and intuitive search interfaces that allow users to enter keywords and retrieve relevant information quickly.

Customizable: WIR allows users to customize their search results by using filters, sorting options, and other features to refine their search criteria.

Speed: WIR provides rapid search results, with most queries being answered in seconds or less.

Disadvantages:

Quality of information: The quality of information retrieved by WIR can vary greatly, with some sources being unreliable, outdated, or biased.

Privacy concerns: WIR raises privacy concerns, as search engines and websites may collect personal information about users, such as their search history and online behavior.

Over-reliance on algorithms: WIR relies heavily on algorithms, which may not always produce accurate results or may be susceptible to manipulation.

Search overload: With the vast amount of information available on the internet, WIR can be overwhelming, leading to information overload and difficulty in finding the most relevant information.


AngeloCatalani