Introduction to Data Abstraction, Algorithms and Data Structures
Introduction to Data Structures
Overview
Java you need for the Project
Search Engine and Data Structures
THIS
Code Structure
On the Data Structure front
  Dictionaries (Dictionary Structures)
  Java Collections
  Linked List
  Queue
[c] Vamshi Ambati 2
Java you will need for the Project
Core Programming + I/O and Files
OOP
  Inheritance
  Packages
  Encapsulation
Java API
  Collections
What is a Search Engine?
A sophisticated tool for finding information on the web
An index for the World Wide Web
Analogous to the index of a textbook
Just imagine a world without search engines!
Why Index in the first place?
Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
A sorted list always helps
Permits binary search: about log2(n) probes into the list
log2(1 billion) ~ 30
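The probe count can be checked with a small sketch. The class and method names below are illustrative and not part of the project framework. The code runs a plain binary search over the sorted animal list and counts how many entries it inspects, then computes the worst case for a list of one billion items as floor(log2(n)) + 1.

```java
public class BinarySearchProbes {
    // Number of probes made by the most recent call to search().
    public static int probes;

    // Classic binary search over a sorted array; returns the index of key, or -1.
    public static int search(String[] sorted, String key) {
        probes = 0;
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            probes++;
            int cmp = key.compareTo(sorted[mid]);
            if (cmp == 0) return mid;
            if (cmp < 0) hi = mid - 1; else lo = mid + 1;
        }
        return -1;
    }

    // Worst-case number of probes for n >= 1 items: floor(log2(n)) + 1,
    // which equals the number of bits needed to write n in binary.
    public static int worstCaseProbes(long n) {
        return 64 - Long.numberOfLeadingZeros(n);
    }

    public static void main(String[] args) {
        String[] words = {"ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"};
        int index = search(words, "pig");
        System.out.println("pig found at index " + index + " after " + probes + " probes");
        System.out.println("worst case for 1 billion items: " + worstCaseProbes(1_000_000_000L));
    }
}
```

On the unsorted list the only option is a linear scan, which may have to inspect every entry.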
How search engines work
Search engines maintain data about web sites in their databases.
They use programs (often referred to as "spiders" or "robots") to collect information.
The information is then indexed by the search engine.
The search engine lets users look up words, or combinations of words, found in the index.
Inverted Files
A file is a list of words and this file contains words at various positions. Each entry of the word is associated with a position.
FILE
POS 1 10 20 30 36
INVERTED FILE
a (1, 4, 24, …)
entry (17, …)
file (2, 10)
contains (11, …)
position (25, …)
positions (15, …)
word (20, …)
words (6, 12, …)
..
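A minimal sketch of building such an inverted file, assuming positions count words starting from 1 (the slide does not say how positions are measured) and that words are lowercased with punctuation dropped. The class and method names are made up for illustration, and java.util collections are used only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedFileDemo {
    // Maps each word to the 1-based word positions at which it occurs in the text.
    public static Map<String, List<Integer>> invert(String text) {
        Map<String, List<Integer>> index = new TreeMap<>(); // TreeMap keeps the words sorted
        int position = 0;
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue; // split may yield an empty leading token
            position++;
            index.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
        }
        return index;
    }

    public static void main(String[] args) {
        String file = "A file is a list of words and this file contains words at various positions. "
                    + "Each entry of the word is associated with a position.";
        Map<String, List<Integer>> index = invert(file);
        for (Map.Entry<String, List<Integer>> e : index.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Looking up a word is then a single map access that returns every position at which the word occurs.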
Inverted Files for Multiple Documents
LEXICON
WORD        NDOCS  PTR
jezebel     20
jezer       3
jezerit     1
jeziah      1
jeziel      1
jezliah     1
jezoar      1
jezrahliah  1
jezreel     39

WORD INDEX
DOCID  OCCUR  POS 1  POS 2  . . .
107    4      322 354 381 405
232    6      15 195 248 1897 1951 2192
677    1      481
713    3      42 312 802

34     6      1 118 2087 3922 3981 5002
44     3      215 2291 3010
56     4      5 22 134 992

566    3      203 245 287
67     1      132
. . .

“jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .
A comprehensive form of Inverted Index
SOURCE: http://www.searchtools.com/slides/bestsearch/bls-24.html
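The lexicon and word index can be modelled as nested maps. The sketch below is one possible layout and not the structure used by any particular engine. The documents, IDs and names are hypothetical, and java.util collections are used only for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PositionalIndex {
    // word -> (docId -> 1-based word positions of that word in the document)
    private final Map<String, Map<Integer, List<Integer>>> lexicon = new TreeMap<>();

    public void addDocument(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            position++;
            lexicon.computeIfAbsent(token, w -> new TreeMap<>())
                   .computeIfAbsent(docId, d -> new ArrayList<>())
                   .add(position);
        }
    }

    // NDOCS column: number of documents that contain the word.
    public int ndocs(String word) {
        Map<Integer, List<Integer>> postings = lexicon.get(word);
        return postings == null ? 0 : postings.size();
    }

    // POS columns: positions of the word in one document (OCCUR is the list size).
    public List<Integer> positions(String word, int docId) {
        Map<Integer, List<Integer>> postings = lexicon.get(word);
        if (postings == null) return Collections.emptyList();
        return postings.getOrDefault(docId, Collections.emptyList());
    }

    public static void main(String[] args) {
        PositionalIndex index = new PositionalIndex();
        index.addDocument(1, "the cat saw the dog"); // hypothetical documents
        index.addDocument(2, "the dog ran");
        System.out.println("dog: NDOCS = " + index.ndocs("dog"));
        System.out.println("the in doc 1: " + index.positions("the", 1));
    }
}
```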
THIS
Search engine for the website http://www.hinduonnet.com/
Website of the newspaper The Hindu
Not for the entire web
Results are confined to only one web site
Index Structure for our Project (THIS)
[Figure: index words, each pointing to a list of "URL :: number" entries]
Words:
India
ManMohan
Cricket
Bollywo
Sharukh
Sachin
…
Lists:
http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2004091500081100.htm&date=2004/09/15/&prd=bl :: 4
http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2002102700140200.htm&date=2002/10/27/&prd=mag :: 7
…
http://www.hindu.com/2004/10/09/stories/2004100904051900.htm :: 23
http://www.hindu.com/2004/10/09/stories/2004100910970300.htm :: 3
…
http://www.hinduonnet.com/thehindu/gallery/0166/016606.htm :: 2
http://www.hinduonnet.com/thehindu/gallery/0048/004807.htm :: 1
…
Search Engine Differences
Coverage (What part of the web do they really cover?)
Crawling algorithms
  Frequency of crawl
  Depth of visits
    http://www.msitprogram.net/ Depth 0
    http://www.msitprogram.net/admissions.html/ Depth 1
Indexing policies
  Data Structures
  Representation
Search interfaces
Ranking
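Crawl depth can be illustrated with a breadth-first traversal that stops following links at a chosen depth. The sketch below works on an in-memory link table and does no network fetching. The third URL and all names are hypothetical, and java.util is used only for brevity.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DepthLimitedCrawl {
    // Breadth-first traversal of a link graph; returns every reached URL with its depth.
    public static Map<String, Integer> crawl(Map<String, List<String>> links, String seed, int maxDepth) {
        Map<String, Integer> depthOf = new LinkedHashMap<>(); // also serves as the visited set
        Deque<String> queue = new ArrayDeque<>();
        depthOf.put(seed, 0);
        queue.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            int depth = depthOf.get(url);
            if (depth == maxDepth) continue; // do not follow links beyond the limit
            for (String next : links.getOrDefault(url, Collections.emptyList())) {
                if (!depthOf.containsKey(next)) {
                    depthOf.put(next, depth + 1);
                    queue.add(next);
                }
            }
        }
        return depthOf;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("http://www.msitprogram.net/",
                  Arrays.asList("http://www.msitprogram.net/admissions.html/"));
        // Hypothetical deeper page, present only to show the depth cut-off.
        links.put("http://www.msitprogram.net/admissions.html/",
                  Arrays.asList("http://www.msitprogram.net/example-deeper-page.html"));
        System.out.println(crawl(links, "http://www.msitprogram.net/", 1));
    }
}
```

With a maximum depth of 1, the seed page and the admissions page are visited and the deeper page is skipped.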
[Figure: search engine architecture. Components: TheWeb, Spider, Parser, URLList, Indexer, Index, Query, ResultSet, FinalResult, ResultPage. Labeled operations: crawl, parse, getNextUrl, addUrls, addPage, store, retrieve, Sort by Rank, makePage.]
Where do our data structures and algorithms fit in?
[Figure: the same architecture diagram, annotated with the data structures and algorithms used: Queue, Priority Queue, Hashtable, BinaryTree, LinkedList, MergeSort & InsertionSort.]
Code Structure (THIS)
[Figure: class diagram. Classes: PageElement, PageImg, PageHref, PageWord, Spider, WebSpider, Queue, PageLexer, HttpTokenizer, URLTextReader, CrawlerDriver, SearchDriver, DictionaryDriver, DictionaryInterface, TreeDictionary, ListDictionary, HashDictionary, Indexer, Index, Query. Other labels: Crawl, Parse, addPage, Index, Save, Restore. Legend: Inheritance, Uses, Calls.]
Dictionary Structures (Lexicon)
A Dictionary is an unordered container that holds key-element pairs
An Ordered Dictionary keeps its elements in sorted order
Keys are unique, but the values can be anything
Dictionary ADT
size(): Return the number of items in D. Output: Integer
isEmpty(): Test whether D is empty. Output: Boolean
elements(): Return the elements stored in D. Output: iterator of elements (objects)
keys(): Return the keys stored in D. Output: iterator of keys (objects)
findElement(k): If D contains an item with key == k, return the element of that item, else return NO_SUCH_KEY. Output: Object
findAllElements(k): Output: iterator of elements with key k
insertItem(k,e): Insert an item with element e and key k into D.
removeElement(k): Remove an item with key == k and return it. If there is no such element, return NO_SUCH_KEY. Output: Object (element)
removeAllElements(k): Remove from D the items with key == k. Output: iterator of elements
Also see the Java Standard API for Dictionary http://java.sun.com/j2se/1.4.2/docs/api/java/util/Dictionary.html
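A minimal sketch of part of this ADT, backed by a singly linked list of key-element nodes. The class name is made up, and elements(), keys() and removeAllElements(k) are left out for brevity. The sketch does not enforce unique keys, which is an assumption made here so that findAllElements(k) can return more than one element. The java.util classes are used only to hand back an iterator, not for storage.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class LinkedDictionary {
    // Sentinel returned when a key is absent, as in the ADT description.
    public static final Object NO_SUCH_KEY = new Object();

    private static final class Node {
        final Object key;
        final Object element;
        Node next;
        Node(Object key, Object element, Node next) {
            this.key = key; this.element = element; this.next = next;
        }
    }

    private Node head;
    private int size;

    public int size() { return size; }
    public boolean isEmpty() { return size == 0; }

    // Inserts at the front of the list in constant time.
    public void insertItem(Object k, Object e) {
        head = new Node(k, e, head);
        size++;
    }

    // Linear scan; returns the first matching element or NO_SUCH_KEY.
    public Object findElement(Object k) {
        for (Node n = head; n != null; n = n.next) {
            if (n.key.equals(k)) return n.element;
        }
        return NO_SUCH_KEY;
    }

    public Iterator<Object> findAllElements(Object k) {
        List<Object> found = new ArrayList<>();
        for (Node n = head; n != null; n = n.next) {
            if (n.key.equals(k)) found.add(n.element);
        }
        return found.iterator();
    }

    // Unlinks the first node with key k and returns its element, or NO_SUCH_KEY.
    public Object removeElement(Object k) {
        Node prev = null;
        for (Node n = head; n != null; prev = n, n = n.next) {
            if (n.key.equals(k)) {
                if (prev == null) head = n.next; else prev.next = n.next;
                size--;
                return n.element;
            }
        }
        return NO_SUCH_KEY;
    }

    public static void main(String[] args) {
        LinkedDictionary d = new LinkedDictionary();
        d.insertItem("india", "page1");
        d.insertItem("cricket", "page2");
        System.out.println(d.findElement("india"));
        System.out.println(d.findElement("sachin") == NO_SUCH_KEY);
    }
}
```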
Dictionary ADT in THIS Project
size(): Return the number of items in D. Output: Integer
isEmpty(): Test whether D is empty. Output: Boolean
getKeys(): Return all the keys of the elements stored in D. Output: String array (ideally it should be a Vector!)
getValue(k): If D contains an item with key == k, return the element of that item, else return NULL. Output: Object
insertItem(k,e): Insert an item with element e and key k into D.
remove(k): Remove an item with key k from D.
We have customized the Dictionary a bit, as we will be inserting only elements of the type <String,Object>.
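One way these operations could be realized is a hash table with separate chaining, written without java.util. This is only a sketch: the class name, the fixed table size and the choice to overwrite the element when a key is inserted twice are assumptions, and the project's own HashDictionary may differ.

```java
public class HashDictionarySketch {
    private static final class Entry {
        final String key;
        Object value;
        Entry next;
        Entry(String key, Object value, Entry next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    private final Entry[] buckets = new Entry[16]; // fixed size, no resizing in this sketch
    private int size;

    // Maps a key to a bucket; the mask clears the sign bit of the hash code.
    private int indexFor(String k) {
        return (k.hashCode() & 0x7fffffff) % buckets.length;
    }

    public int size() { return size; }
    public boolean isEmpty() { return size == 0; }

    public void insertItem(String k, Object e) {
        int i = indexFor(k);
        for (Entry x = buckets[i]; x != null; x = x.next) {
            if (x.key.equals(k)) { x.value = e; return; } // assumption: overwrite to keep keys unique
        }
        buckets[i] = new Entry(k, e, buckets[i]);
        size++;
    }

    // Returns the element stored under k, or null if there is none.
    public Object getValue(String k) {
        for (Entry x = buckets[indexFor(k)]; x != null; x = x.next) {
            if (x.key.equals(k)) return x.value;
        }
        return null;
    }

    public void remove(String k) {
        int i = indexFor(k);
        Entry prev = null;
        for (Entry x = buckets[i]; x != null; prev = x, x = x.next) {
            if (x.key.equals(k)) {
                if (prev == null) buckets[i] = x.next; else prev.next = x.next;
                size--;
                return;
            }
        }
    }

    // Collects the keys of all buckets into a String array (order is unspecified).
    public String[] getKeys() {
        String[] keys = new String[size];
        int n = 0;
        for (Entry bucket : buckets) {
            for (Entry x = bucket; x != null; x = x.next) keys[n++] = x.key;
        }
        return keys;
    }

    public static void main(String[] args) {
        HashDictionarySketch d = new HashDictionarySketch();
        d.insertItem("India", Integer.valueOf(4));
        d.insertItem("Cricket", Integer.valueOf(7));
        System.out.println(d.getValue("India") + " " + d.size());
    }
}
```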
Java Collections
java.util.* (a quite helpful library)
Has implementations for most of the data structures
They make life really easy
You may not use the built-in data structures unless specified (e.g. Task1 Tasklet-A)
Use them for non-data-structural purposes: Collections, e.g. Arrays, Vectors, Iterators, Lists, Sets, etc.
You will definitely be using "Iterator" at least, as you will be dealing with many objects at a time!
http://java.sun.com/j2se/1.4.2/docs/api/java/util/Iterator.html
See: http://java.sun.com/docs/books/tutorial/collections/
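A short sketch of walking a collection with an Iterator. The class and method names are illustrative. The code uses generics, which are newer than the 1.4.2 API linked above; with that older API the elements would be cast from Object.

```java
import java.util.Iterator;
import java.util.Vector;

public class IteratorDemo {
    // Joins all elements with ", " by walking the collection through its Iterator.
    public static String join(Vector<String> items) {
        StringBuilder sb = new StringBuilder();
        Iterator<String> it = items.iterator();
        while (it.hasNext()) {
            sb.append(it.next());
            if (it.hasNext()) sb.append(", ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Vector<String> words = new Vector<>();
        words.add("India");
        words.add("Cricket");
        words.add("Sachin");
        System.out.println(join(words));
    }
}
```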
Other Data Structures
Queue
LinkedList
Beware! There are no pointers in Java
However, there are "references"
Learn more about references in Java
Do not use the java.util package for data structures or sorting algorithms! You are expected to code them yourself
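A minimal sketch of a queue built from linked nodes, using only object references and nothing from java.util. The class and method names are illustrative, and the Queue class in the project framework may expose a different interface.

```java
public class LinkedQueue {
    private static final class Node {
        final Object item;
        Node next; // a reference to the following node, or null at the end
        Node(Object item) { this.item = item; }
    }

    private Node head; // next node to be removed
    private Node tail; // most recently added node
    private int size;

    public boolean isEmpty() { return size == 0; }
    public int size() { return size; }

    // Appends an item at the tail in constant time.
    public void enqueue(Object item) {
        Node n = new Node(item);
        if (tail == null) head = n; else tail.next = n;
        tail = n;
        size++;
    }

    // Removes and returns the item at the head (first in, first out).
    public Object dequeue() {
        if (head == null) throw new IllegalStateException("queue is empty");
        Object item = head.item;
        head = head.next;
        if (head == null) tail = null; // the queue became empty
        size--;
        return item;
    }

    public static void main(String[] args) {
        LinkedQueue urls = new LinkedQueue();
        urls.enqueue("http://www.hinduonnet.com/");
        urls.enqueue("http://www.hindu.com/");
        System.out.println(urls.dequeue());
        System.out.println(urls.size());
    }
}
```

Assigning one Node variable to another copies the reference, not the node, which is why relinking head and tail is enough to add or remove elements.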
Summary
Learn data structures by implementing
THIS
Mini version of a real search engine
Framework is provided
More details in the next video