SEO Articles

From Corpora to Matching


Making effective use of the Internet is increasingly about creating better and more intelligent applications and search engines. Here is a brief introduction to how search engines work:

01) Define the corpus, search space/data;
02) Separate the corpus into documents;
03) Generate features for each document;
04) Generate a representation of each document;
05) Study the feature/vector space;
06) Cluster documents;
07) Reduce dimensionality;
08) Accept input queries;
09) Compute the cosine of the angle between the query vector and each document vector;
10) Find the document (column) vector that best matches the query;
11) Output results to the user in some way.

Each document in a corpus (database) is described by a set of keywords called index terms. We assign weights to index terms according to their relevance (frequency of occurrence, for instance). This is how we create the index that we can then search.

Corpus preparation:
Web pages of interest are analysed and cleaned by removing hypertext tags and any other markup. Pages are then broken down into documents, and each document is scanned for words/terms of interest: those which make a document unique, not standard words.
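
The cleaning step above can be sketched with the Python standard library. This is only an illustration, not the method of any particular search engine: the class name TextExtractor, the helpers clean_page and tokenize, and the sample page are all assumptions made for the example.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the contents of script/style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def clean_page(html):
    """Return the visible text of a page with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def tokenize(text):
    """Lower-case the text and keep runs of letters as candidate terms."""
    return re.findall(r"[a-z]+", text.lower())


# Hypothetical sample page, used only to exercise the functions.
page = ("<html><body><h1>River Wandle</h1><p>Mills on the <b>river</b>.</p>"
        "<script>var x = 1;</script></body></html>")
tokens = tokenize(clean_page(page))
```

The tokens still contain standard words such as "on" and "the"; removing those belongs to the term-extraction step.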

Extract terms of interest:
Bear in mind that terms of interest must be invariant, that is, characteristic of a document, not generic and easy to find in any corpus/document. The idea is to find a signature per document.
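
One simple way to approximate "characteristic, not generic" is to drop a stop-word list and any term that appears in too many documents. The sketch below assumes exactly that; the stop-word list, the 50% threshold and the function name terms_of_interest are illustrative choices, not something prescribed by the article.

```python
from collections import Counter

# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "on", "in", "is", "to"}


def terms_of_interest(docs, max_doc_fraction=0.5):
    """Keep terms that are not stop words and occur in at most
    max_doc_fraction of the documents.

    docs is a list of token lists, one per document.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    limit = max_doc_fraction * len(docs)
    return sorted(
        term for term, df in doc_freq.items()
        if term not in STOP_WORDS and df <= limit
    )


# Hypothetical toy corpus.
docs = [
    ["the", "river", "mill", "river"],
    ["the", "museum", "history", "river"],
    ["the", "mill", "wheel"],
    ["the", "history", "of", "printing"],
]
vocab = terms_of_interest(docs)
```

Here "the" is rejected twice over: it is a stop word and it occurs in every document, so it cannot help to tell documents apart.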

Build term-by-document matrix:
The search space is defined by N dimensions, one per chosen term/feature, and each document is a point in this N-term space. This allows conceptual/semantic searches.

Each document becomes a column vector and each row represents a term, so each row records the frequency of one term across the analysed corpus. At first we simply build the matrix by counting the terms for each document.
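
As a minimal sketch of the counting step, assuming documents are already tokenised and the vocabulary has been chosen (the function name term_document_matrix and the toy data are made up for the example):

```python
def term_document_matrix(docs, vocab):
    """Build a matrix with one row per term and one column per document.

    Entry [i][j] is the number of times vocab[i] occurs in docs[j].
    Tokens that are not in the vocabulary are ignored.
    """
    index = {term: i for i, term in enumerate(vocab)}
    matrix = [[0] * len(docs) for _ in vocab]
    for j, tokens in enumerate(docs):
        for token in tokens:
            i = index.get(token)
            if i is not None:
                matrix[i][j] += 1
    return matrix


# Hypothetical vocabulary and documents.
vocab = ["history", "mill", "river"]
docs = [
    ["river", "mill", "river"],
    ["museum", "history", "river"],
    ["mill", "wheel"],
]
A = term_document_matrix(docs, vocab)
# Rows: history, mill, river. Columns: the three documents.
```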

Compress the matrix:
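Most entries of a term-by-document matrix are zero, so it is stored in compressed form. There are two basic techniques/methods: Compressed Row Storage (scans the matrix row by row) and Compressed Column Storage (scans the matrix column by column). Both use three arrays: the non-zero values, their column (or row) indices, and pointers marking where each row (or column) starts.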
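
The row-wise variant can be sketched as follows. The function names to_csr and csr_row and the array names values, col_index and row_ptr are assumptions for the example; libraries use their own naming.

```python
def to_csr(matrix):
    """Compress a dense matrix row by row into three arrays.

    values    : the non-zero entries, in row-major order
    col_index : the column of each entry in values
    row_ptr   : row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    """
    values, col_index, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_index.append(j)
        row_ptr.append(len(values))
    return values, col_index, row_ptr


def csr_row(values, col_index, row_ptr, i, n_cols):
    """Rebuild dense row i from the three arrays."""
    row = [0] * n_cols
    for k in range(row_ptr[i], row_ptr[i + 1]):
        row[col_index[k]] = values[k]
    return row


A = [[0, 1, 0], [1, 0, 1], [2, 1, 0]]
values, col_index, row_ptr = to_csr(A)
```

Compressed column storage is the same idea applied to the transposed matrix, which suits this application well because each document is a column.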

Normalise the matrix:
Normalisation implies transforming column vectors to unit vectors, i.e. vectors of unit length.

Unit document vectors contain the frequency of terms; the normalisation is applied because the semantic content of a document is generally determined by the relative frequency of terms.
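
A minimal sketch of column normalisation, assuming the Euclidean (L2) length is the one meant by "unit length"; the function name normalise_columns is made up for the example:

```python
import math


def normalise_columns(matrix):
    """Scale every column (document vector) to Euclidean length 1.

    Columns that are entirely zero are left as zeros.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        norm = math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        if norm == 0:
            continue
        for i in range(n_rows):
            out[i][j] = matrix[i][j] / norm
    return out


A = [[0, 1, 0], [1, 0, 1], [2, 1, 0]]
U = normalise_columns(A)
```

After this step a long document and a short one with the same proportions of terms map to the same vector, which is the point of using relative rather than absolute frequency.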

Singular Value Decomposition:
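This factors the term-by-document matrix A into three matrices, A = U S V^T. U and V hold the left and right singular vectors, which are the eigenvectors of A A^T and A^T A respectively: the new dimensions. (For a symmetric positive semi-definite matrix the two are identical.) The third matrix, S, is diagonal and holds the singular values, the square roots of the corresponding eigenvalues, which measure the spread of the corpus along these new dimensions.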
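
For reference, the standard (thin) form of the decomposition can be written out as below. The symbols are assumptions introduced here: A is the t-by-d term-by-document matrix (t terms, d documents), r = min(t, d), and A_k is the rank-k approximation commonly used to reduce dimensionality.

```latex
\begin{align}
A &= U S V^{T}, \qquad U^{T}U = I_r, \quad V^{T}V = I_r, \\
S &= \operatorname{diag}(\sigma_1, \dots, \sigma_r), \qquad
     \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0, \\
A A^{T} &= U S^{2} U^{T}, \qquad A^{T} A = V S^{2} V^{T}, \\
A_k &= \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}, \qquad k \le r .
\end{align}
```

The third line is why the singular vectors can be read as eigenvectors: the columns of U are eigenvectors of A A^T, the columns of V are eigenvectors of A^T A, and both share the eigenvalues σ_i².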

A geometric interpretation:
The corpus is first formatted and stemmed, and is then stored in a compact term-by-document matrix. Each column of this matrix is then normalised so that it reflects the relative frequency of terms in a document.

The term-by-document matrix is then decomposed to calculate eigenvalues and eigenvectors. The eigenvectors represent a new Cartesian coordinate frame spanning the same search space, but they indicate the most important dimensions/axes along which documents mainly lie. The eigenvalues quantify the spread of documents along these new axes/eigenvectors.
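
To make the geometric picture concrete, the most important axis can be found with power iteration on the matrix A A^T. This is a swapped-in, dependency-free technique chosen for illustration, not a full decomposition; in practice a numerical library would compute all axes at once. The function name dominant_axis and the iteration count are assumptions.

```python
import math


def dominant_axis(A, iterations=200):
    """Power iteration on G = A * A^T.

    Returns (eigenvalue, unit eigenvector) for the largest eigenvalue of G,
    i.e. the axis in term space along which the documents spread the most.
    Assumes the starting vector is not orthogonal to that axis, which holds
    for non-zero matrices of non-negative counts.
    """
    n_rows, n_cols = len(A), len(A[0])
    G = [[sum(A[i][k] * A[j][k] for k in range(n_cols))
          for j in range(n_rows)] for i in range(n_rows)]
    v = [1.0 / math.sqrt(n_rows)] * n_rows
    for _ in range(iterations):
        w = [sum(G[i][j] * v[j] for j in range(n_rows)) for i in range(n_rows)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T G v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[i] * sum(G[i][j] * v[j] for j in range(n_rows))
        for i in range(n_rows)
    )
    return eigenvalue, v


# Toy matrix: the first term is spread twice as widely as the second.
eigenvalue, axis = dominant_axis([[2, 0], [0, 1]])
```

The square root of the returned eigenvalue is the largest singular value of A.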

Queries:
Queries must be based on the features/terms defined within the term-by-document matrix. Matching in a vector space such as this is implemented by multiplying the query vector by the term-by-document matrix, i.e. matching a query vector q against the documents (columns) of the matrix.
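
A minimal sketch of query matching by cosine similarity, assuming a whitespace-separated query and a small hand-made matrix; the names cosine, query_vector and rank_documents are hypothetical:

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def query_vector(text, vocab):
    """Express a query in the same term space as the matrix rows."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocab]


def rank_documents(A, q):
    """Return (document index, cosine) pairs, best match first."""
    n_docs = len(A[0])
    scores = []
    for j in range(n_docs):
        column = [row[j] for row in A]
        scores.append((j, cosine(q, column)))
    return sorted(scores, key=lambda s: s[1], reverse=True)


vocab = ["history", "mill", "river"]
A = [[0, 1, 0],   # history
     [1, 0, 1],   # mill
     [2, 1, 0]]   # river
q = query_vector("river mill", vocab)
ranking = rank_documents(A, q)
```

If the columns and the query are already unit vectors, the cosine reduces to the plain matrix product described above, since every norm in the denominator equals 1.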

I am the website administrator of the Wandle industrial museum (http://www.wandle.org), established in 1983 by local people determined to ensure that the history of the valley was no longer neglected and to enhance awareness of its heritage for the use and benefit of the community.


© 2006 SEO Articles. A Real Web Talk Presentation.