Profiling Web Archives IIPC GA 2015
Profiling Web Archives
Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University, Norfolk, Virginia - 23529
Herbert Van de Sompel
Los Alamos National Laboratory, Los Alamos, NM
David S. H. Rosenthal
Stanford University Libraries, Stanford, CA
Memento Aggregator
Aggregates ~20 archives and counting
Only a few archives return good results for any query
Time, network, and resource wastage
Query routing can be helpful
Long Tail Matters
400B+ web pages at IA do not cover everything
Top three archives after IA produce full TimeMap 52% of the time
Targeted crawls
Special focus archives
Restricted resources
Private archives
PortugueseWebArchive (@PT_WebArchive), 1:12 PM, 5 Jan 2015:
"The Portuguese Web Archive and Memento unveil the first homepage of the Smithsonian Institution from May 1995... fb.me/3VAo6gEba"
Jason Scott (@textfiles), 2:37 PM, 22 Apr 2015:
"Dennis Ritchie's Homepage has been deleted: cm.belllabs.com/cm/cs/who/dmr/ and the site has a robots.txt that blocks it from the Wayback."
A Client Request
Canonical URL
Accept-Datetime (optional)
Accept-Language (optional)
    GET /timegate/http://www.cnn.com/ HTTP/1.1
    Host: mementoweb.org
    Accept: text/html,application/xhtml+xml;q=0.9,image/webp,*/*;q=0.8
    Accept-Encoding: gzip, deflate, sdch
    Accept-Datetime: Sat, 16 Jun 2012 00:00:00 GMT
    Accept-Language: en-US,en;q=0.8
    Cache-Control: max-age=0
    If-Modified-Since: Thu, 23 Apr 2015 16:51:50 GMT
    If-None-Match: "7ff8-5146718929580"
    Connection: keep-alive
    Cookie: __unam=34c3c7d-14ce917ce62-43c38e5e-7...
    User-Agent: Mozilla/5.0 Linux x86_64 Chrome/42.0.2311.90...
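As a rough illustration of the request above, the following Python sketch builds (but does not send) a TimeGate request carrying the optional Accept-Datetime and Accept-Language headers. The TimeGate prefix is copied from the slide; the function name `build_timegate_request` and its parameters are hypothetical names chosen for this example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# TimeGate prefix taken from the request on the slide; treat it as an assumption.
TIMEGATE_PREFIX = "http://mementoweb.org/timegate/"


def build_timegate_request(uri_r, when=None, language=None):
    """Build (without sending) a TimeGate request for the original URI `uri_r`."""
    headers = {}
    if when is not None:
        # Accept-Datetime uses the HTTP date format and is always expressed in GMT.
        utc_time = when.astimezone(timezone.utc)
        headers["Accept-Datetime"] = format_datetime(utc_time, usegmt=True)
    if language is not None:
        headers["Accept-Language"] = language
    return Request(TIMEGATE_PREFIX + uri_r, headers=headers)


req = build_timegate_request(
    "http://www.cnn.com/",
    when=datetime(2012, 6, 16, tzinfo=timezone.utc),
    language="en-US,en;q=0.8",
)
print(req.full_url)
print(req.header_items())
```

Passing the resulting object to `urllib.request.urlopen` would perform the actual lookup; that step is left out here so the sketch runs without network access.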
An Archive Response
Canonical URL (known)
Memento-Datetime
Original Content-Language (optional)
    HTTP/1.1 200 OK
    Server: Tengine/2.0.3
    Date: Sun, 26 Apr 2015 00:25:57 GMT
    Content-Type: text/html;charset=utf-8
    Content-Length: 85945
    Connection: keep-alive
    set-cookie: wayback_server=37; Domain=archive.org; Path=/; Expires=Tue, 26-May-15 00:25:57 GMT;
    Memento-Datetime: Sat, 25 Apr 2015 13:38:16 GMT
    Link: ; rel="original", ; rel="timemap"; type="application/link-format",
    X-Archive-Guessed-Charset: UTF-8
    X-Archive-Orig-via: 1.1 varnish, 1.1 varnish, 1.1 varnish
    X-Archive-Orig-content-language: en
    X-Archive-Orig-x-content-type-options: nosniff
    X-Archive-Orig-vary: Accept-Encoding, Cookie
    X-Archive-Orig-content-type: text/html; charset=UTF-8
    X-Archive-Orig-cache-control: private, s-maxage=0, max-age=0, must-revalidate
    X-Archive-Orig-server: Apache
A CDX Snippet
Canonical URL
Memento Datetime
    cnn.com/ 20080226193757 http://www.cnn.com/ text/html 200 2Q4OZSVKPZMUF36UN6FBXFNGDKARPA7N - - 10368892 ARCHIVEIT-1022-MOLLYASTRID-CASTRORESI-20080226193719-00000-crawling10.us.archive.org.arc.gz
    cnn.com/ 20090314024036 http://www.cnn.com/ text/html 200 4PVCGT22VVTDJ3GXIJEUZ3JOJ4HBZYB4 - - 13282500 ARCHIVEIT-1023-20090314024015-00058-crawling105.us.archive.org.warc.gz
    cnn.com/ 20090314024036 http://www.cnn.com/ text/html 200 4PVCGT22VVTDJ3GXIJEUZ3JOJ4HBZYB4 - - 24595360 ARCHIVEIT-1023-20090314024003-00057-crawling105.us.archive.org.arc.gz
    i.cdn.travel.cnn.com/ 20130102083554 http://i.cdn.travel.cnn.com/ text/html 200 D2MJUR62VJ5D6CNL5PUDQFEW4GIRGIX2 - - 33901711 ARCHIVEIT-1023-QUARTERLY-UGGVZU-20130102080252-00007-wbgrp-crawl063.us.archive.org-6682.warc.gz
    i.cdn.travel.cnn.com/ 20130404172913 http://i.cdn.travel.cnn.com/ text/html 200 JZKL7HGGBN73BUXFISEJLM7YNAXE7MTI - - 274508081 ARCHIVEIT-1023-QUARTERLY-5885-20130404074716948-00002-wbgrp-crawl067.us.archive.org-6443.warc.gz
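To show how the canonical URL and Memento datetime can be pulled out of such records, here is a minimal Python sketch that splits one CDX line into fields. The ten field names are an assumption inferred from the columns visible in the snippet; real CDX files declare their column layout in a header line, which this sketch ignores.

```python
from collections import namedtuple
from datetime import datetime

# Field names are assumed from the ten space-separated columns in the snippet.
CdxRecord = namedtuple(
    "CdxRecord",
    "urlkey timestamp original mimetype status digest redirect meta offset filename",
)


def parse_cdx_line(line):
    """Split one CDX line into a CdxRecord, converting the offset to an int."""
    fields = line.split()
    if len(fields) != len(CdxRecord._fields):
        raise ValueError("expected %d fields, got %d" % (len(CdxRecord._fields), len(fields)))
    record = CdxRecord(*fields)
    return record._replace(offset=int(record.offset))


def memento_datetime(record):
    """Interpret the 14-digit timestamp column as a datetime."""
    return datetime.strptime(record.timestamp, "%Y%m%d%H%M%S")


line = (
    "cnn.com/ 20080226193757 http://www.cnn.com/ text/html 200 "
    "2Q4OZSVKPZMUF36UN6FBXFNGDKARPA7N - - 10368892 "
    "ARCHIVEIT-1022-MOLLYASTRID-CASTRORESI-20080226193719-00000-crawling10.us.archive.org.arc.gz"
)
record = parse_cdx_line(line)
print(record.original, memento_datetime(record))
```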
Complete URI-R Profiling
Sanderson et al. created a URI-R profile for various archives
Extracted every URI-R from all the CDX files
Gained complete knowledge of the holdings of the participating archives
Profiles were huge
Difficult to keep up-to-date
Misses URI-Rs added to the archive later
TLD-only Profiling
AlSum et al. created a TLD profile for various archives
Collected statistics about various archives on various TLDs
Lightweight profiles
Lots of false positives
All the ".com" queries will be routed to an archive that has only a few URI-Rs with the ".com" TLD
Middle Ground
Partial URI-Rs, such as:
  Registered domain name
  Complete domain name (along with any sub-domains)
  Complete domain name and first few path segments
  Registered domain name and counts of other segments such as sub-domain, path, and query parameter
Combining the above with other attributes such as Content-Language and Memento-Datetime
Archive Profile
High-level digest of an archive
Predicts presence of mementos of a URI-R in an archive
Provides various statistics about the holdings
Small in size
Publicly available
Easy to update and partially patch
Useful for Memento query routing and other things
Structure
    Archive metadata
    Statistics:
        Profile types:
            Keys:
                Frequency measurements
Profile types
URI-R based
  Complete URI-R
  TLD only
  URI-R hashes, such as:
    Only first few segments of the URI-R (Sub-URI)
    Registered domain name along with counts of other segments (Segment-Digest)
Language
Datetime
Many more...
Keys
Depend on the profile type
Control the balance between profile size and details
    URI-R: "bbc.co.uk/images/Logo.png?height=80&width=200"
    TLD: ".uk"
    Sub-URI: "uk)/", "uk,co)/", "uk,co,bbc)/", "uk,co,bbc)/images"
    Seg-Digest: "0/bbc.co.uk/4"
    Language: "en-GB"
    Datetime: "201403" # YYYYMM
Frequency Measurements
Can have the same structure for all profile types
Flexible to choose the attribute set to be included
Affects the profile complexity
Predicts the presence of the mementos of a URI-R
    "uk,co,bbc)/":
        urim:
            max: 2
            min: 1
            total: 128
        urir: 115
Horizontal and Vertical Holdings
    "uk,co,bbc)/":
        urim:
            max: 100
            min: 100
            total: 100
        urir: 1
    "uk,co,bbc)/":
        urim:
            max: 1
            min: 1
            total: 100
        urir: 100
    "uk,co,bbc)/":
        urim:
            max: 20
            min: 5
            total: 100
        urir: 10
Sample Profile
    ---
    "@context": "https://oduwsdl.github.io/context/archprofile.jsonld"
    "@id": "http://www.webarchive.org.uk/ukwa/"
    about:
      accesspoint: "http://www.webarchive.org.uk/wayback/"
      mechanism: "http://oduwsdl.github.io/terms/mechanism#cdx"
      name: "UKWA 1996 Collection"
      profile_updated: "2015-01-20T17:25:30Z"
      suburi_class: "http://oduwsdl.github.io/terms/suburi#H3P1"
      more_meta_data: "..."
    stats:
      language:
        "en-US":
          urim: {max: 13, min: 1, total: 47529}
          urir: 25621
        "more_languages": "..."
      suburi:
        "uk)/":
          urim: {max: 8, min: 1, total: 932432}
          urir: 867817
        "uk,co)/":
          urim: {max: 8, min: 1, total: 410979}
          urir: 378686
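The urim and urir measurements in the profile can be read as simple counts over the mementos that fall under one key. The Python sketch below computes them from a list with one entry per memento; the function name `frequency_stats` and the input shape are assumptions made for illustration, not the authors' code.

```python
from collections import Counter


def frequency_stats(uri_rs):
    """Summarize the mementos observed under a single profile key.

    `uri_rs` holds one entry per memento (URI-M), naming its original URI-R.
    """
    per_urir = Counter(uri_rs)
    if not per_urir:
        raise ValueError("at least one memento is required")
    counts = list(per_urir.values())
    return {
        # Mementos per URI-R: largest, smallest, and overall total.
        "urim": {"max": max(counts), "min": min(counts), "total": sum(counts)},
        # Number of distinct original resources under the key.
        "urir": len(per_urir),
    }


# One page captured 100 times versus 100 pages captured once each.
deep = frequency_stats(["bbc.co.uk/"] * 100)
broad = frequency_stats(["bbc.co.uk/p%d" % i for i in range(100)])
print(deep)
print(broad)
```

Both inputs hold 100 mementos in total, but the max, min, and urir values separate an archive that holds many captures of few pages from one that holds few captures of many pages.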
URI-R Based Profiles
URI-R preprocessing
  Canonicalize
  Apply SURT
  Split segments
  Extract registered domain
  Count segments (sub-domain, path, query params)
Generate all Sub-URIs
  Incrementally add segments from left to right
  Only up to the configured maximum number of host and path segments
Create Segment-Digest with registered domain
  Prefix sub-domain count
  Suffix path and query params count
Key Generation
    https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f
Intermediate Values
    {
      canonical_url: "bbc.co.uk/images/Logo.png?height=80&width=200",
      surt_url: "uk,co,bbc)/images/Logo.png?height=80&width=200",
      reg_domain: "bbc.co.uk",
      path_initial: "i",
      subdomain_count: 1,
      path_count: 2,
      query_params_count: 2
    }
Sub-URI (H3P1)
    ["uk)/", "uk,co)/", "uk,co,bbc)/", "uk,co,bbc)/images"]
SegDigest (include_path_initial)
    "1/bbc.co.uk/i4"
Implementation
GitHub: /oduwsdl/suburi_generator
  A Python module to generate Sub-URIs from SURT
GitHub: /oduwsdl/archive_profiler
  Various profile generation scripts
Canonicalization
Remove "http(s)", "www", and the fragment of a URI
Downcase the hostname
Remove some known query params, e.g., "jsessionid"
Sort query params by keys, then by values (secondary)
    URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"
    Canonicalize(URL)
    # => "bbc.co.uk/images/Logo.png?height=80&width=200"
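The canonicalization rules listed above can be sketched in Python with the standard library. This is an illustrative approximation, not the code from the authors' repositories: the function name `canonicalize`, the `DROPPED_PARAMS` set (the slide names only "jsessionid"), and the decision to drop the port are all assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed set of session-like parameters to drop; the slide names only "jsessionid".
DROPPED_PARAMS = {"jsessionid"}


def canonicalize(url):
    """Strip scheme, leading "www.", and fragment; lowercase host; sort query params."""
    parts = urlsplit(url)
    # urlsplit().hostname is already lowercased and excludes any port.
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in DROPPED_PARAMS
    ]
    # Tuples sort by key first and by value second.
    params.sort()
    # Note: urlencode re-encodes values, so percent-encoding may be normalized.
    query = urlencode(params)
    path = parts.path or "/"
    return host + path + ("?" + query if query else "")


print(canonicalize("https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"))
```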
Sort-friendly URI Reordering Transform (SURT)
Take canonical URL as input
Join hostname segments by commas in reverse order
Separate hostname and path by closing parenthesis
    Can_URL = "bbc.co.uk/images/Logo.png?height=80&width=200"
    SURT(Can_URL)
    # => "uk,co,bbc)/images/Logo.png?height=80&width=200"
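A minimal Python sketch of the transform described above follows. It assumes the input is already canonical (no scheme, lowercased host) and that the path begins at the first slash; the function name `surt` is chosen for this example only.

```python
def surt(canonical_url):
    """Reverse the hostname segments and join host and path with ")/"."""
    # Everything before the first "/" is treated as the hostname.
    host, _, rest = canonical_url.partition("/")
    reversed_host = ",".join(reversed(host.split(".")))
    return reversed_host + ")/" + rest


print(surt("bbc.co.uk/images/Logo.png?height=80&width=200"))
```

Because the most general hostname segment comes first, sorting SURT strings groups all URLs of a domain and its sub-domains next to each other.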
Sub-URI
Take SURT URL as input
Incrementally add segments from left to right, one by one
Stop when the hostname or path segment limit set by the policy is reached
Return the list of all Sub-URIs
    SURT_URL = "uk,co,bbc)/images/Logo.png?height=80&width=200"
    SubURI(SURT_URL, policy="H3P1")
    # => ["uk)/",
    #     "uk,co)/",
    #     "uk,co,bbc)/",
    #     "uk,co,bbc)/images"]
URL to Sub-URI
    URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"
    Can_URL = Canonicalize(URL)
    # => "bbc.co.uk/images/Logo.png?height=80&width=200"
    SURT_URL = SURT(Can_URL)
    # => "uk,co,bbc)/images/Logo.png?height=80&width=200"
    Sub_URIs = SubURI(SURT_URL, policy="H3P1")
    # => ["uk)/",
    #     "uk,co)/",
    #     "uk,co,bbc)/",
    #     "uk,co,bbc)/images"]
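The Sub-URI step can be sketched in Python as below. The policy string is read as "H" plus a host segment limit and "P" plus a path segment limit, with "x" meaning no limit; this reading, the function name `sub_uris`, and the choice to add path segments only once the whole hostname fits within the host limit are assumptions, not confirmed behavior of the suburi_generator module.

```python
import re


def sub_uris(surt_url, policy="H3P1"):
    """List Sub-URI keys for a SURT URL under an HnPm policy ("x" means unlimited)."""
    match = re.fullmatch(r"H(\d+|x)P(\d+|x)", policy)
    if match is None:
        raise ValueError("unsupported policy: %r" % policy)
    max_host, max_path = (None if g == "x" else int(g) for g in match.groups())

    host, _, rest = surt_url.partition(")/")
    host_segs = host.split(",")
    # Query parameters are ignored; only path segments extend the key.
    path_segs = [seg for seg in rest.split("?", 1)[0].split("/") if seg]

    host_limit = len(host_segs) if max_host is None else min(max_host, len(host_segs))
    keys = [",".join(host_segs[:i]) + ")/" for i in range(1, host_limit + 1)]

    # Assumption: path segments are appended only to the complete hostname.
    if host_limit == len(host_segs):
        path_limit = len(path_segs) if max_path is None else min(max_path, len(path_segs))
        full_host = ",".join(host_segs) + ")/"
        keys += [full_host + "/".join(path_segs[:j]) for j in range(1, path_limit + 1)]
    return keys


print(sub_uris("uk,co,bbc)/images/Logo.png?height=80&width=200", policy="H3P1"))
```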
Segment Count Digest
Extract registered domain name and initial letter of path
Count sub-domain and trailing (path + query) segments
Serialize as follows:
    {subdomain_count}/{reg_domain}/{path_initial}?{trailing_count}
    URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"
    Segments = Segmentize(URL)
    # => {reg_domain: "bbc.co.uk",
    #     path_initial: "i",
    #     subdomain_count: 1,
    #     path_count: 2,
    #     query_params_count: 2,
    #     trailing_count: 4}
    SegDigest(Segments, policy="exclude_path_initial")
    # => "1/bbc.co.uk/4"
    SegDigest(Segments, policy="include_path_initial")
    # => "1/bbc.co.uk/i4"
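Given the segment counts shown above, the digest string itself is a small formatting step. The Python sketch below starts from an already computed segments dictionary, because extracting the registered domain needs a public suffix list that is outside the scope of this example; the function name `seg_digest` and the boolean flag are hypothetical stand-ins for the policy names on the slide.

```python
def seg_digest(segments, include_path_initial=False):
    """Serialize segment counts as {subdomain_count}/{reg_domain}/{initial}{trailing}."""
    # Trailing count combines path segments and query parameters.
    trailing = segments["path_count"] + segments["query_params_count"]
    initial = segments["path_initial"] if include_path_initial else ""
    return "%d/%s/%s%d" % (
        segments["subdomain_count"],
        segments["reg_domain"],
        initial,
        trailing,
    )


# Values copied from the slide's example.
segments = {
    "reg_domain": "bbc.co.uk",
    "path_initial": "i",
    "subdomain_count": 1,
    "path_count": 2,
    "query_params_count": 2,
}
print(seg_digest(segments))
print(seg_digest(segments, include_path_initial=True))
```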
JSON Serialization
Can have complex nested data structure
JSON-LD for linked data
No partial key lookup
Unsuitable for text processing tools
Allows processing only when fully loaded
A single malformed character makes it unparsable
Difficult to patch
    {
      "suburi": {
        "uk)/": {
          "urim": {"max": 8, "min": 1, "total": 932432},
          "urir": 867817
        },
        "uk,co)/": {
          "urim": {"max": 8, "min": 1, "total": 410979},
          "urir": 378686
        },
        "uk,co,bbc)/": {
          "urim": {"max": 2, "min": 1, "total": 128
CDX-JSON Serialization
Fusion of CDX and JSON file formats
A key followed by a strict single-line JSON value
Unlike CDX, values can have arbitrary attributes
Text processing tool friendly
No single root node or single document restrictions
Enables binary search
Enables partial key lookup
Error resilient
    @context "https://oduwsdl.github.io/contexts/archiveprofile.jsonld"
    @id "http://www.webarchive.org.uk/ukwa/"
    @about {"name": "UKWA 1996 Collection", "type": "suburi#H3P1", "...":
    uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817},
    uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686
    uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115},
    uk,co,bbc)/images {"urim": {"max": 1, "min": 1, "total": 3}, "urir":
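To illustrate why a key followed by single-line JSON supports binary search, here is a hedged Python sketch of a lookup over sorted lines. It assumes each record is exactly `key`, one space, then a complete JSON value with no trailing comma, and that the lines are sorted as plain strings; the helper names are invented for this example.

```python
import bisect
import json


def parse_line(line):
    """Split a record into its key and its decoded JSON value."""
    key, _, value = line.partition(" ")
    return key, json.loads(value)


def lookup(sorted_lines, key):
    """Binary-search sorted records for an exact key; return its value or None."""
    probe = key + " "
    index = bisect.bisect_left(sorted_lines, probe)
    if index < len(sorted_lines) and sorted_lines[index].startswith(probe):
        return parse_line(sorted_lines[index])[1]
    return None


# Values copied from the slides; the records are completed here for illustration.
lines = sorted([
    'uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817}',
    'uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686}',
    'uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115}',
])
print(lookup(lines, "uk,co,bbc)/"))
print(lookup(lines, "com,cnn)/"))
```

Only the matched line is decoded, so a malformed record elsewhere in the file does not prevent the lookup from succeeding.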
Merging
Only process new data to periodically update for freshness
Parallel processing
Difficult to keep detailed measures with absolute values
Derived simple heuristic measures to predict presence of mementos
Merging Example
Base Profile
    com,cnn)/ {"urir_sum": 30, "sources": 1},
    uk,co,bbc)/ {"urir_sum": 20, "sources": 1}
New Profile
    com,cnn)/ {"urir_sum": 10, "sources": 1},
    com,usatoday)/ {"urir_sum": 5, "sources": 1}
Merged Profile
    com,cnn)/ {"urir_sum": 40, "sources": 2},
    uk,co,bbc)/ {"urir_sum": 20, "sources": 1},
    com,usatoday)/ {"urir_sum": 5, "sources": 1}
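The merge in this example amounts to adding the `urir_sum` and `sources` values key by key. A minimal Python sketch, with profiles held as plain dictionaries (an assumption about the in-memory shape, not the authors' implementation), could look like this:

```python
def merge_profiles(base, new):
    """Return a merged profile where shared keys have their counters summed."""
    # Copy the inner dicts so neither input profile is modified.
    merged = {key: dict(stats) for key, stats in base.items()}
    for key, stats in new.items():
        if key in merged:
            merged[key]["urir_sum"] += stats["urir_sum"]
            merged[key]["sources"] += stats["sources"]
        else:
            merged[key] = dict(stats)
    return merged


base = {
    "com,cnn)/": {"urir_sum": 30, "sources": 1},
    "uk,co,bbc)/": {"urir_sum": 20, "sources": 1},
}
new = {
    "com,cnn)/": {"urir_sum": 10, "sources": 1},
    "com,usatoday)/": {"urir_sum": 5, "sources": 1},
}
merged = merge_profiles(base, new)
print(merged)
```

Because the operation is a key-wise sum, profiles built in parallel from separate batches can be merged in any order.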
Sample Query Sets
Sample         Size      In Archive-It   In UKWA
DMOZ           100,000   4,042           1,896
MementoProxy   100,000   4,222           193
IAWayback      100,000   3,999           275
Evaluation
Relate CDX Size, URI-M, URI-R, and Sub-URI
Analyze profile growth
Estimate Relative Cost
Evaluate Routing Precision vs. Relative Cost
$\text{Relative Cost} = \frac{|\text{Keys in the Profile}|}{|\text{URI-R in the Archive}|}$
$\text{Routing Precision} = \frac{|\text{URI-R Present in the Archive}|}{|\text{URI-R Predicted by the Profile in Archive}|}$
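The two ratios can be computed directly from counts and sets, as in the Python sketch below. The function names and the use of sets of query URIs are assumptions for illustration; the precision function intersects the two sets, which equals the slide's ratio whenever the profile produces no false negatives.

```python
def relative_cost(profile_key_count, archive_urir_count):
    """Number of keys in the profile divided by the number of URI-Rs in the archive."""
    if archive_urir_count <= 0:
        raise ValueError("the archive must hold at least one URI-R")
    return profile_key_count / archive_urir_count


def routing_precision(predicted, present):
    """Fraction of URI-Rs routed to the archive that are actually held there."""
    predicted = set(predicted)
    if not predicted:
        raise ValueError("the profile predicted no URI-Rs")
    return len(predicted & set(present)) / len(predicted)


# Hypothetical toy numbers, not results from the evaluation.
cost = relative_cost(5, 100)
precision = routing_precision({"a", "b", "c", "d", "e"}, {"a"})
print(cost, precision)
```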
UKWA Dataset
Yearly data as separate collections
Average CDX line size: 275 bytes
URI-M/URI-R ratio: 2.46
Accumulated URI-R Growth (UKWA)
Successive yearly data was merged
Follows Heaps' Law
$C_r = K C_m^{\beta}$, with $K = 3.897$ and $\beta = 0.892$
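As a worked illustration of the fitted curve, the Python sketch below evaluates a Heaps' Law estimate with the constants from the slide. It assumes the fit reads as the URI-R count growing as K times the URI-M count raised to the power β; that reading of the slide's equation, and the function name, are assumptions.

```python
# Constants reported on the slide for the UKWA dataset.
K = 3.897
BETA = 0.892


def heaps_estimate(urim_count, k=K, beta=BETA):
    """Estimate the number of unique URI-Rs for a given number of URI-Ms."""
    return k * urim_count ** beta


# With beta below 1, doubling the mementos less than doubles the unique URI-Rs.
for urim_count in (1_000_000, 2_000_000):
    print(urim_count, round(heaps_estimate(urim_count)))
```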
Sub-URI Key Growth (UKWA)
Slope of the fit line is the Relative Cost for the profile policy
Complete URI-R profile has Relative Cost 1
Search Precision of Various Profiles
Search precision relative to the TLD-only profile:
  Double for H3P0
  Fivefold for HxP1
Segment-Digest is as good as H3P0
Relative Cost vs. Search Precision
Up to 22% routing precision with <5% Relative Cost
<0.3% of sample URIs from MementoProxy and IAWayback logs are present in UKWA
Shallow crawling of UKWA results in higher cost
Relative Profile Cost (UKWA)
Profile   Cost      Profile   Cost      Profile   Cost
H1P0      3.2e-06   H3P2      0.26823   HxP2      0.38313
H2P0      0.00027   H3P3      0.37343   HxP3      0.53928
H2P1      0.00059   H4P0      0.01348   HxP4      0.63889
H2P2      0.00099   H5P0      0.01388   HxP5      0.71568
H3P0      0.00862   HxP0      0.01401   HxPx      0.83107
H3P1      0.11864   HxP1      0.16349   URIR      1.00000
Future Work
Generating sample URI sets
Profiling via sampling
Language profiles
Evaluation of combination profiles such as Sub-URI along with Datetime
Profiles for usage other than Memento routing, such as:
  Media-type profiles (e.g., images, pdf, audio, etc.)
  Site classification based profiles (e.g., news, wiki, social media, blog, etc.)
Conclusions
Generated profiles with different policies for two archives
Examined cost-accuracy trade-offs of various profiles
Related CDX Size, URI-M, URI-R, and Sub-URI
Gained up to 22% routing precision with <5% relative cost without any false negatives
<5% of the queried URIs are present in each of the individual archives
Implementation code is available at:
  GitHub: /oduwsdl/suburi_generator
  GitHub: /oduwsdl/archive_profiler