Www.nlsearch.com Classification at Northern Light Presentation to Access 98 October 4, 1998.
www.nlsearch.com
Classification at Northern Light
Presentation to Access 98
October 4, 1998
“This year, the World Wide Web has arrived as a serious supplier of ‘serious’ online information.”
Sue Feldman, “Web Search Services in 1998: Trends and Challenges,” Searcher Magazine, June 1998
Search engines are being held to higher standards
All users want freshness and manageable results sets
Professional information seekers want
– high relevance and high quality content first
– good descriptive information for all results
– precision searching
– text and tables
Web search environment
constant growth in all dimensions (pages, countries, languages, file formats)
constantly increasing traffic
continuous onslaught of spam
Practical considerations for search engines
significant engineering time spent counteracting spam
constantly adding disk space: 3 terabytes at Northern Light
crawler efficiency: must balance new page discovery with known-page re-crawl
You step in the stream, but the water has moved on.
This page is not here.
Search engines: limitations
lack the higher quality sources not found on the Web
no concept of classification as found in library systems
like an index of every word on every page in every book in your library
– with no subject catalog
Northern Light’s fundamental goals
Combine Web data with quality information not on the Web in a single integrated search
Make results set manageable for user (already a problem; worse after non-Web data is added)
Research Engine: Content as of Oct 98
Web
– 96,000,000 pages
Special Collection
– 3,600,000+ full-text documents
– 4,600 journals, magazines, books, trusted reference works, etc.
Mixes free (Web) and fee-based (Special Collection) content
Relevancy ranking still critical
Engines continue to improve their ranking algorithms
All seem to agree that relevancy ranking alone is not enough to manage results lists of the size commonly seen now
Techniques for taming results sets
abridge the database (Excite, Lycos, Infoseek)
re-sort by popularity (HotBot/Direct Hit)
suggest further refinement steps to user (AltaVista Refine)
sort based on number of inbound links (Infoseek…?)
sort by classification metadata (Northern Light)
Research Engine: Classification
classify the Web according to the same standards found in journal literature
sort results for user, based on this classification
work with the user to refine the question (reference interview approach)
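The sorting step in the list above can be sketched as follows. This is a minimal illustration, assuming each result already carries a subject label from the classifier; the `Result` type, its field names, the `group_into_folders` function, and the sample data are all hypothetical and do not describe Northern Light's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    score: float   # relevancy score, higher is better
    subject: str   # subject label assigned by the classifier (hypothetical field)


def group_into_folders(results, max_folders=3):
    """Group relevancy-ranked results into subject folders.

    The largest folders are kept; everything else goes to 'all others...'.
    """
    by_subject = defaultdict(list)
    for r in results:
        by_subject[r.subject].append(r)

    # Rank folders by how many results they hold (ties broken by name).
    ranked = sorted(by_subject, key=lambda s: (-len(by_subject[s]), s))
    folders = {}
    for subject in ranked[:max_folders]:
        folders[subject] = sorted(by_subject[subject], key=lambda r: -r.score)
    leftovers = [r for s in ranked[max_folders:] for r in by_subject[s]]
    if leftovers:
        folders["all others..."] = sorted(leftovers, key=lambda r: -r.score)
    return folders


# Made-up sample data for illustration only.
results = [
    Result("Red Sox schedule", 0.91, "Boston Red Sox"),
    Result("Red Sox roster", 0.88, "Boston Red Sox"),
    Result("Rules of baseball", 0.95, "Baseball rules"),
    Result("Yankees history", 0.80, "New York Yankees"),
]
folders = group_into_folders(results, max_folders=2)
for name, items in folders.items():
    print(name, [r.title for r in items])
```

Within each folder the relevancy order is preserved, so classification narrows the list while ranking still decides what appears first.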
Relevancy ranking has its limits
Library patron: “I need some baseball information.”
Librarian: “OK. Here are 41,536 books and sources about baseball, relevancy ranked.”
Good general sources may be ranked on top, but the user probably had something more specific in mind...
Reference librarian approach: work with the user to refine the question
“I need some baseball information.”
“OK. Tell me more. Do you want general info, teams and players, recent news...?”
“Um... team info”
“OK. Red Sox, Yankees, ...?”
“Red Sox.”
Classification helps organize results
shows aspects of a topic (‘baseball’, ‘diagnostic tests’)
disambiguates queries (‘what is balance’)
sometimes answers questions directly (‘12th President’)
Search Current News
Computer networks
Local area networks
Modems
Cable modems
all others...
Special Collection
Personal computers
Computer caches
Buses (computer)
Health care software
Software industry
Circuit design
Special Collection documents
Commercial sites
Sociology of the family
Employee assistance programs
Neurology
Online banking
Helicopters
Martial arts
Chinese philosophy
all others...
1. WHAT IS BALANCE?
84% - Articles & General info: WHAT IS BALANCE? Back to New Evangelicanism Reports. Back to the Way of Life Home Page Way of Life Literature Online Catalog You Can Own…
11/09/97
Personal Page: http://www.dsinclair.com/~dcloud/fbns/whatisbalance.htm
2. Emotional Stability is Balance
77% - Articles & General info: Emotional Stability is Balance Emotional Stability is Balance - 1 He is unbalanced - 2 She’s not on an even keel - 3 They’re upset…
03/24/95
Educational site: http://cogsci.berkeley.edu/metaphors/EmotionalStabilityIsBalance.html
3. What is balance?
73% - Biographical sources: “What is balance?” This is an ongoing, soul-searching, head-scratching question that my husband, Don, and I ponder on a regular bases….
07/01/96
Exceptional parent (magazine): Available at Northern Light
Subject classification of Web documents
exists for sites in Web directories (Yahoo, Looksmart, The Mining Co)
exists behind CGI interfaces
doesn’t exist at the document level
except where supplied by the page creator
Cost of document classification
Original cataloging of book: $37
Creating a journal article abstract: $1.50
Deriving subject headings from journal abstract: $.20
Abstract plus subject headings ($1.70 per document) for 95,000,000 Web documents = $161.5 million
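The total on this slide can be checked with a few lines of arithmetic. The sketch below assumes the figure combines the abstract cost and the subject-heading cost for every document, which is the reading that reproduces $161.5 million exactly; the variable names are illustrative.

```python
# Per-document costs quoted on the slide, held in cents to avoid float rounding.
ABSTRACT_CENTS = 150          # creating a journal article abstract: $1.50
SUBJECT_HEADINGS_CENTS = 20   # deriving subject headings from the abstract: $0.20
WEB_DOCUMENTS = 95_000_000

# Assumption: both steps are needed for every Web document.
per_document_cents = ABSTRACT_CENTS + SUBJECT_HEADINGS_CENTS
total_dollars = per_document_cents * WEB_DOCUMENTS // 100

print(f"${per_document_cents / 100:.2f} per document")
print(f"${total_dollars / 1e6:.1f} million in total")   # prints "$161.5 million in total"
```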
Metadata manufacturing
Automatically determine document’s subject, type, source and language metadata
Controlled vocabularies interoperate with classifier system
System classifies pages
Fraction of a cent per document
NL’s controlled vocabularies
Editorially developed
Hierarchical in form (graph)
Exist for subjects, types, and sources
NL’s subject vocabulary
Subject scope is unlimited (as in LC, Dewey, Yahoo)
Major points of reference were DDC, LC Subject headings, UMI subject headings, and subject-specialized classification schemes
Unique, selective conflation of these
Mapping NL’s vocabulary to content partners’ vocabularies adds freshness and completeness
20,000 concepts; 200,000-300,000 concept equivalents
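One way to picture a hierarchical vocabulary in which each concept carries many equivalents is the small data structure below. It is a sketch under stated assumptions: the `Concept` and `Vocabulary` classes are hypothetical, the parent links allow more than one broader concept (a graph rather than a strict tree), and the sample hierarchy is invented for illustration rather than taken from NL's vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    parents: list = field(default_factory=list)     # broader concepts
    equivalents: set = field(default_factory=set)   # synonyms and mapped terms


class Vocabulary:
    def __init__(self):
        self.concepts = {}
        self.lookup = {}  # lowercased term -> preferred concept name

    def add(self, name, parents=(), equivalents=()):
        self.concepts[name] = Concept(name, list(parents), set(equivalents))
        for term in (name, *equivalents):
            self.lookup[term.lower()] = name

    def resolve(self, term):
        """Return the preferred concept for a term, or None if unknown."""
        return self.lookup.get(term.lower())

    def ancestors(self, name):
        """All broader concepts, following every parent link."""
        seen = []
        stack = list(self.concepts[name].parents)
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.append(parent)
                stack.extend(self.concepts[parent].parents)
        return seen


# Invented sample hierarchy, for illustration only.
vocab = Vocabulary()
vocab.add("Computer networks")
vocab.add("Local area networks", parents=["Computer networks"], equivalents=["LANs"])
vocab.add("Modems", parents=["Computer networks"])
vocab.add("Cable modems", parents=["Modems"])

print(vocab.resolve("lans"))             # Local area networks
print(vocab.ancestors("Cable modems"))   # ['Modems', 'Computer networks']
```

With roughly ten or more equivalents per concept, as the counts above imply, most of the lookup table would consist of equivalents rather than preferred terms.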
Subject classification process
Three main techniques:
– mapping
– automatic classification
– editorial classification of whole web sites
Mapping
Indexing vocabularies of content partners are normalized with NL vocabularies
Excellent source of new terms; helps maintain freshness and ensure complete coverage of a topic
All terms become synonyms, equivalents of NL terms and are used in automatic classification... creating a ‘network effect’ of subject knowledge
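The mapping step described on this slide might look like the sketch below. It assumes editors have already decided which NL concept each partner term corresponds to; the function name `build_equivalents`, the partner names, and all terms are hypothetical.

```python
def build_equivalents(nl_terms, partner_mappings):
    """Merge partner indexing terms into a term -> NL concept lookup.

    partner_mappings has the shape {partner name: {partner term: NL concept}}.
    Partner terms that point at a concept NL does not have yet are returned
    separately, as candidate new terms for editorial review.
    """
    lookup = {term.lower(): term for term in nl_terms}
    unmapped = []
    for partner, mapping in partner_mappings.items():
        for partner_term, nl_concept in mapping.items():
            if nl_concept in nl_terms:
                lookup[partner_term.lower()] = nl_concept
            else:
                unmapped.append((partner, partner_term))
    return lookup, unmapped


# Invented sample data, for illustration only.
nl_terms = {"Online banking", "Helicopters"}
partner_mappings = {
    "partner_a": {"Internet banking": "Online banking", "Rotorcraft": "Helicopters"},
    "partner_b": {"Home banking": "Online banking", "Tiltrotors": "Tiltrotor aircraft"},
}
lookup, unmapped = build_equivalents(nl_terms, partner_mappings)
print(lookup["home banking"])   # Online banking
print(unmapped)                 # [('partner_b', 'Tiltrotors')]
```

Every partner adds equivalents for the same concepts, which is where the 'network effect' comes from: a document using any partner's wording can be matched to the single NL term.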
Partner vocabularies mapped to date
journal aggregators: UMI, IAC, Ethnic News Watch, Responsive Database Services
news databases: AP News, Comtex Newswires, Newsbytes
others: U.S. Pharmacopeia, American Banker, Engineering News Record
Automatic classification
based on words contained in document
uses Term Frequency/Inverse Document Frequency methods
document must have a strong degree of ‘aboutness’ to be classified
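A toy version of word-based classification with TF-IDF weighting and an 'aboutness' cutoff is sketched below. It is not Northern Light's classifier: the scoring formula, the 0.1 threshold, the single-word concept terms, and the sample corpus are all assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def idf_table(corpus):
    """Inverse document frequency for every word in a list of documents."""
    n = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(tokenize(doc)))
    return {word: math.log(n / df) for word, df in doc_freq.items()}


def classify(text, concept_terms, idf, threshold=0.1):
    """Return concepts whose summed TF-IDF score clears the threshold."""
    words = tokenize(text)
    if not words:
        return {}
    tf = Counter(words)
    scores = {}
    for concept, terms in concept_terms.items():
        score = sum(tf[t] / len(words) * idf.get(t, 0.0) for t in terms)
        if score >= threshold:   # hypothetical 'aboutness' cutoff
            scores[concept] = score
    return scores


# Invented corpus and concept terms, for illustration only.
corpus = [
    "Cable modems bring fast internet access to the home",
    "The Red Sox won the baseball game at Fenway",
    "A modem connects the computer to the network",
    "Baseball fans read the news about the game",
]
concept_terms = {
    "Modems": ["modem", "modems", "cable"],
    "Baseball": ["baseball", "sox", "fenway"],
}
idf = idf_table(corpus)

strong = classify(corpus[1], concept_terms, idf)
weak = classify(
    "The news report mentioned baseball only once among many other unrelated topics today",
    concept_terms,
    idf,
)
print(sorted(strong))   # ['Baseball']
print(weak)             # {}
```

The second document mentions baseball in passing, so its score stays under the cutoff and it receives no subject, which is the behavior the 'aboutness' requirement describes.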
NL’s type classification
This scheme too is hierarchical, e.g.
Reviews
– Book reviews
– Movie reviews
– Product reviews
classification process based on words and structure of document
Librarians at Northern Light
Build and maintain controlled vocabulary
Map vocabularies of new partners
Continually tune classification performance
Help design and test user interface
Mine and classify whole web sites
Edit databases
Database editing
Classification used to slice NL database into “vertical search engines”
Since Feb 98, we’ve released
– 17 subject search engines on NL Power Search
– 26 industry databases (for NL; also on Netscape Netcenter)
– 5 personal finance databases (for Doubleclick)
– music industry database (with Billboard magazine)
– construction industry database (with Engineering News Record)
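Slicing a classified database into a vertical search engine amounts to keeping the documents whose subjects fall under a chosen branch of the vocabulary. The sketch below assumes documents are plain dictionaries with a `subjects` list and that the hierarchy is given as a parent-to-children map; the function names, URLs, and concepts are hypothetical.

```python
def descendants(children, root):
    """Return root plus every narrower concept beneath it."""
    found, stack = set(), [root]
    while stack:
        concept = stack.pop()
        if concept not in found:
            found.add(concept)
            stack.extend(children.get(concept, []))
    return found


def vertical_slice(documents, children, root):
    """Keep documents with at least one subject inside the chosen subtree."""
    wanted = descendants(children, root)
    return [doc for doc in documents if wanted & set(doc["subjects"])]


# Invented hierarchy and documents, for illustration only.
children = {
    "Music industry": ["Record labels", "Concert tours"],
    "Construction industry": ["Bridges"],
}
documents = [
    {"url": "http://example.com/a", "subjects": ["Record labels"]},
    {"url": "http://example.com/b", "subjects": ["Bridges"]},
    {"url": "http://example.com/c", "subjects": ["Concert tours", "Bridges"]},
]
music = vertical_slice(documents, children, "Music industry")
print([doc["url"] for doc in music])   # ['http://example.com/a', 'http://example.com/c']
```

A document classified under more than one subject can appear in several vertical databases, as the third sample document would.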
Automatic classification is still a fledgling technology, however...
it has proved practical for classifying close to 100 million web pages
it is remarkably accurate, given the breadth of concept space it covers
it is responsive to tuning
it is effective in managing results sets for users
Joyce Ward
Director, Content Classification
Northern Light Technology LLC
222 Third St.
Cambridge, MA
[email protected]