Exploring Large Digital Library Collections using a Map-based Visualisation
Dr Mark Hall
Research Seminar, Department of Computing, Edge Hill University
7.11.2013
The information access problem
http://www.flickr.com/photos/carlcollins/199792939/
http://www.flickr.com/photos/dolescum/3567687501/
http://www.flickr.com/photos/archivesnz/8759939806/
The information access problem
http://www.flickr.com/photos/brokenthoughts/122096903/
The information access problem
• Search works
  • If you know what you are looking for
  • If you know what the right keywords are for the collection
  • If you are looking for a specific thing
• Search does not work
  • If you don’t know what you are looking for
  • If you don’t know what the right keywords are
  • If you are looking for an overview of a topic
  • If you want to find out what kind of things a collection contains
The information access problem
• Mass digitisation has created a scaling problem
• Europeana – The European Digital Library
  • > 24 million records
• The UK National Archives
  • > 11 million records
• The British Library
  • > 56 million records
Alternative access methodologies
• Recommendation
• Faceted search
• Visualisations
http://www.flickr.com/photos/[email protected]/7196130228/
Spatialisation
• Turn a higher-dimensional semantic space into a two-dimensional representation
• Map similarity in the higher-dimensional space to distance in the two-dimensional space
• Provides a visual overview of the topics in a collection
• People readily understand the distance – similarity metaphor
Spatialisation
• A number of algorithms exist
  • Multi-Dimensional Scaling
  • Self-Organising Maps
• Issues
  • Computationally complex
  • Semantic overload
  • Interpretation problems
http://lazarus.elte.hu/cet/publications/13-ormeling7.jpg
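As a rough illustration of the similarity-to-distance idea behind Multi-Dimensional Scaling, the sketch below places a handful of vectors on a plane by iteratively reducing the mismatch between map distance and dissimilarity. It is not the implementation used in the talk: the choice of cosine distance, the gradient-descent solver, and the names `cosine_distance` and `mds_2d` are all assumptions made for this example.

```python
import math
import random


def cosine_distance(a, b):
    """Dissimilarity in the high-dimensional space: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: an empty vector is maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def mds_2d(vectors, iterations=500, learning_rate=0.05, seed=0):
    """Place each vector on a 2D plane so map distance tracks dissimilarity."""
    rng = random.Random(seed)
    n = len(vectors)
    # Target distances: pairwise dissimilarities in the original space.
    target = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
              for i in range(n)]
    # Start from random positions and move points to reduce the mismatch.
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    for _ in range(iterations):
        for i in range(n):
            grad_x = grad_y = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                error = dist - target[i][j]
                grad_x += error * dx / dist
                grad_y += error * dy / dist
            pos[i][0] -= learning_rate * grad_x
            pos[i][1] -= learning_rate * grad_y
    return pos


# Two similar vectors and one unrelated vector (made-up data).
positions = mds_2d([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
```

Every update loops over all pairs of points, so the cost grows quadratically with the number of objects, which is one way to see the "computationally complex" issue listed above.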
Potential solution
• Use hierarchical structures to overcome the issues
[Figure: example topic hierarchy. Root: "Everything"; second level: Technology, Agriculture, Arts, Culture; third level: Art, Craft, Design, Visual arts; fourth level: Artisans, Crochet, Watchmaker]
• Each topic can be processed independently
• Structure can be used to provide visual summaries
Hierarchical spatialisation algorithm
1. Pre-processing
   1. Tree pruning
   2. Item pruning
   3. Vectorisation
2. Spatialisation
   1. Initial spatialisation
   2. Final positioning
3. Post-processing
Pre-processing
• Ensures that the hierarchy is compatible with the core algorithm
  • Hierarchy must be a full tree
  • Items must only be assigned to leaf topics in the tree
• Ensures that all items & topics have the necessary pre-calculated data for the spatialisation
Tree pruning
• Transforms the hierarchy from a directed acyclic graph (DAG) into a tree
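The slide does not say which parent is kept when a topic has several. The sketch below shows one possible way to turn a DAG into a tree, assuming a purely illustrative rule: keep the alphabetically first parent. The function name `prune_to_tree`, the input format, and the example relationships are hypothetical.

```python
from collections import defaultdict


def prune_to_tree(parents_of, root):
    """Keep exactly one parent per topic so that the DAG becomes a tree.

    parents_of maps each topic to the list of its parent topics.
    Returns a mapping from each parent topic to its list of children.
    """
    children_of = defaultdict(list)
    for topic in sorted(parents_of):
        if topic == root:
            continue
        # Hypothetical rule: keep the alphabetically first parent.
        kept_parent = sorted(parents_of[topic])[0]
        children_of[kept_parent].append(topic)
    return dict(children_of)


# "Craft" has two parents in this made-up DAG; only one survives.
dag = {
    "Everything": [],
    "Arts": ["Everything"],
    "Culture": ["Everything"],
    "Craft": ["Arts", "Culture"],
}
tree = prune_to_tree(dag, "Everything")
```

A real implementation would probably pick the parent on semantic grounds, for example the one most similar to the topic, rather than by name.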
Item pruning
• Ensures that items are only assigned to leaf topics
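How items attached to non-leaf topics are handled is not stated on the slide. As a minimal sketch, assuming such assignments are simply dropped, item pruning could look like the following. The name `prune_items` and both input mappings are hypothetical.

```python
def prune_items(children_of, topics_of_item):
    """Keep only those item-to-topic assignments that point at leaf topics.

    children_of maps a topic to its child topics (missing or empty = leaf).
    topics_of_item maps an item identifier to the topics it is assigned to.
    """
    pruned = {}
    for item, topics in topics_of_item.items():
        # Assumption: assignments to non-leaf topics are dropped, not moved.
        pruned[item] = [t for t in topics if not children_of.get(t)]
    return pruned


children_of = {"Arts": ["Craft"], "Craft": []}
assignments = {"item-1": ["Arts", "Craft"], "item-2": ["Craft"]}
leaf_only = prune_items(children_of, assignments)
```

An alternative that loses no assignments would be to move such items into a leaf below the topic. The slide leaves open which of the two the original algorithm does.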
Vectorisation
• Each object to spatialise with MDS must be defined via a vector
  • Extract keywords from titles and descriptions of items
  • Filter out keywords that appear fewer than 5 times in the collection or in more than half the documents
  • From the keywords, use TF-IDF (term frequency – inverse document frequency) to create the vectors
• Items
  • Use the item’s keywords
• Topics
  • Use the keywords of all items that belong to the topic or to one of its child topics
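$$\mathrm{tf}(t, d) = \frac{f(t)}{|d|}$$

$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$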
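The vectorisation step can be sketched in a few lines of Python. The example below assumes that keywords have already been extracted into one list per item, and it reads the term frequency as the keyword count divided by the number of keywords in the item. The function names `filter_keywords` and `tfidf_vectors` are hypothetical. The thresholds follow the slide: at least 5 occurrences, and present in at most half the documents.

```python
import math
from collections import Counter


def filter_keywords(docs, min_count=5, max_doc_fraction=0.5):
    """Drop keywords that are too rare or too widespread in the collection."""
    total = Counter(k for doc in docs for k in doc)
    doc_freq = Counter(k for doc in docs for k in set(doc))
    keep = {k for k in total
            if total[k] >= min_count
            and doc_freq[k] <= max_doc_fraction * len(docs)}
    return [[k for k in doc if k in keep] for doc in docs]


def tfidf_vectors(docs):
    """Return the vocabulary and one TF-IDF vector per document."""
    vocab = sorted({k for doc in docs for k in doc})
    doc_freq = Counter(k for doc in docs for k in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        if not doc:
            # A document with no remaining keywords gets an all-zero vector.
            vectors.append([0.0] * len(vocab))
            continue
        vectors.append([
            (counts[t] / len(doc)) * math.log(len(docs) / doc_freq[t])
            for t in vocab
        ])
    return vocab, vectors


vocab, vectors = tfidf_vectors([["map", "map"], ["tree"]])
```

For a topic vector, the same function could be applied to the concatenated keyword lists of all items below that topic, in line with the "Topics" bullet above.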
Core spatialisation
• Hierarchy is spatialised bottom-up
• Parent topic is spatialised after all its children have been spatialised
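The bottom-up rule corresponds to a post-order traversal of the topic tree. A minimal sketch, assuming the tree is stored as a mapping from each topic to its children (the name `bottom_up_order` is hypothetical):

```python
def bottom_up_order(children_of, root):
    """List topics so that every topic appears after all of its children."""
    order = []

    def visit(topic):
        for child in children_of.get(topic, []):
            visit(child)
        order.append(topic)  # emitted only once all children are done

    visit(root)
    return order


tree = {"Everything": ["Arts", "Culture"], "Arts": ["Craft"]}
order = bottom_up_order(tree, "Everything")
# order == ["Craft", "Arts", "Culture", "Everything"]
```

Processing topics in this order guarantees that the layout of each child is available by the time its parent is spatialised.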
Core spatialisation
[Figure: three panels showing the initial spatialisation, the neighbourhood graph, and the final, compact spatialisation; annotation: "Degenerate MDS"]
Parallelisation
• Use the inverse tree as an activation graph
[Figure: example topic hierarchy. Root: "Everything"; second level: Technology, Agriculture, Arts, Culture; third level: Art, Craft, Design, Visual arts; fourth level: Artisans, Crochet, Watchmaker]
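One way to read "inverse tree as an activation graph" is that leaf topics start immediately and each finished topic activates its parent once all of its siblings are also done. The sketch below illustrates that scheduling idea with a thread pool. The function `spatialise_in_parallel`, its `spatialise` callback, and the use of threads are assumptions for this example (the original may well use separate processes), and error handling is left out.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def spatialise_in_parallel(children_of, root, spatialise, max_workers=4):
    """Run spatialise(topic) for every topic, children before parents."""
    parent_of = {c: p for p, cs in children_of.items() for c in cs}
    topics = set(parent_of) | {root}
    # Number of children each topic is still waiting for.
    pending = {t: len(children_of.get(t, [])) for t in topics}
    lock = threading.Lock()
    done = threading.Event()
    finished = []

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        def run(topic):
            spatialise(topic)
            with lock:
                finished.append(topic)
                if topic == root:
                    done.set()
                    return
                parent = parent_of[topic]
                pending[parent] -= 1
                ready = pending[parent] == 0
            if ready:
                # The last child to finish activates the parent.
                pool.submit(run, parent)

        # Leaves have no pending children and can start straight away.
        for topic in topics:
            if pending[topic] == 0:
                pool.submit(run, topic)
        done.wait()
    return finished


tree = {"Everything": ["Arts", "Culture"], "Arts": ["Craft", "Design"]}
completion_order = spatialise_in_parallel(tree, "Everything", lambda topic: None)
```

Independent subtrees never wait for each other here. Only the step from a set of children to their parent is a synchronisation point.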
Parallelisation
• Enables the algorithm to scale to large data-sets
  • 500 000 items processed in ~16 hours on a multi-core desktop processor
• Limited by the shared map storage backend
Placement
• Due to the parallel nature of the algorithm, topic areas will overlap
Post-processing
• Re-calculate boundaries to make the map visually attractive
Semantic map
Semantic map
• Provides overview and exploration support
• Hierarchy provides overview labels at higher zoom levels
• Interaction follows the widely adopted Google Maps pattern (zoom / pan)
• At lower zoom levels allows interaction with individual items
• Provides a natural interface for touch-based devices
Semantic map
• Algorithm written in Python
• Data stored in PostgreSQL + PostGIS database
• Individual tiles rendered using
  • Mapnik – for the actual rendering
  • TileLite – for caching and serving
• Web-based user interface provided via Leaflet
Where next?
• Evaluation
• Support continuous updates to the map
• Create more “natural” boundaries
Thank you
Questions?
See a demo at http://explorer.paths-project.eu