  • Exploring Large Digital Library Collections using a Map-based Visualisation

    Dr Mark Hall

    Research Seminar, Department of Computing, Edge Hill University


  • The information access problem

    [Images: http://www.flickr.com/photos/carlcollins/199792939/ and http://www.flickr.com/photos/archivesnz/8759939806]


  • The information access problem

    • Search works
      • If you know what you are looking for
      • If you know what the right keywords are for the collection
      • If you are looking for a specific thing

    • Search does not work
      • If you don’t know what you are looking for
      • If you don’t know what the right keywords are
      • If you are looking for an overview of a topic
      • If you want to find out what kind of things a collection contains


  • The information access problem

    • Mass digitisation has created a scaling problem

    • Europeana – The European Digital Library
      • > 24 million records

    • The UK National Archives
      • > 11 million records

    • The British Library
      • > 56 million records

  • Alternative access methodologies

    • Recommendation

    • Faceted search

    [Image: http://www.flickr.com/photos/[email protected]/7196130228/]

  • Spatialisation

    • Turn a higher-dimensional semantic space into a two-dimensional representation

    • Map similarity in the higher-dimensional space into distance in the two-dimensional space

    • Provides a visual overview over the topics in a collection

    • People readily understand the distance – similarity metaphor
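
To make the similarity-to-distance mapping concrete, here is a minimal sketch of classical multi-dimensional scaling in plain Python. The slides do not say which MDS variant or library the project uses, so the function name `classical_mds`, the power-iteration eigen-solver and the toy distance matrix are all assumptions made for illustration. It also assumes the input distances are (close to) Euclidean.

```python
import math

def classical_mds(dist, dims=2, iterations=200):
    """Place items in `dims` dimensions so that distances approximate `dist`.

    `dist` is a symmetric matrix of pairwise dissimilarities (illustrative
    input; how the real system derives distances is not stated).
    """
    n = len(dist)
    squared = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in squared]  # equals column means (symmetric)
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances to obtain an inner-product matrix.
    gram = [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        vec = [float(i + 1) for i in range(n)]  # deterministic start vector
        eigenvalue = 0.0
        for _ in range(iterations):  # power iteration for the top eigenpair
            product = [sum(g * v for g, v in zip(row, vec)) for row in gram]
            norm = math.sqrt(sum(x * x for x in product))
            if norm < 1e-12:
                eigenvalue = 0.0
                break
            vec = [x / norm for x in product]
            eigenvalue = sum(v * sum(g * w for g, w in zip(row, vec))
                             for v, row in zip(vec, gram))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][axis] = vec[i] * scale
        # Deflate so the next pass finds the next-largest eigenpair.
        gram = [[gram[i][j] - eigenvalue * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return coords

# Toy data: items 0 and 1 are similar (distance 1), item 2 is far from both.
distances = [[0, 1, 4], [1, 0, 4], [4, 4, 0]]
coords = classical_mds(distances)
```

In the resulting layout the two similar items sit close together and the third lies far away, which is the distance-similarity metaphor the slide describes.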

  • Spatialisation

    • A number of algorithms exist
      • Multi-Dimensional Scaling (MDS)
      • Self-Organising Maps

    • Issues
      • Computationally expensive
      • Semantic overload
      • Interpretation problems

    [Image: http://lazarus.elte.hu/cet/publications/13-ormeling7.jpg]

  • Potential solution

    • Use hierarchical structures to overcome the issues

    [Figure: example topic hierarchy. Top level: Technology, Agriculture, Arts, Culture. Second level: Art, Craft, Design, Visual arts. Third level: Artisans, Crochet, Watchmaker.]

    • Each topic can be processed independently

    • Structure can be used to provide visual summaries

  • Hierarchical spatialisation algorithm

    1. Pre-processing
       1. Tree pruning
       2. Item pruning
       3. Vectorisation

    2. Spatialisation
       1. Initial spatialisation
       2. Final positioning

    3. Post-processing

  • Pre-processing

    • Ensures that the hierarchy is compatible with the core algorithm
      • Hierarchy must be a full tree
      • Items must only be assigned to leaf topics in the tree

    • Ensures that all items & topics have the necessary pre-calculated data for the spatialisation

  • Tree pruning

    • Transforms the hierarchy from a directed acyclic graph (DAG) into a tree
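
The slides do not say how a topic with several parents is resolved. The sketch below shows one possible rule, keeping only the first listed parent, purely as an assumption. The function name `prune_to_tree` and the parent links in the toy data are hypothetical; only the topic names come from the slides.

```python
def prune_to_tree(parents):
    """Keep a single parent per topic so the hierarchy becomes a tree.

    `parents` maps each topic to the list of its parent topics (empty for
    the root). Keeping the first listed parent is an arbitrary, assumed rule.
    """
    return {topic: (candidates[0] if candidates else None)
            for topic, candidates in parents.items()}

# Toy DAG: the parent links are invented for illustration.
dag = {
    "Arts": [],
    "Craft": ["Arts"],
    "Design": ["Arts"],
    "Crochet": ["Craft", "Design"],  # two parents, so not yet a tree
}
tree = prune_to_tree(dag)
```

A real implementation would probably choose the parent by some measure of fit, such as topic size or similarity, rather than by list order.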

  • Item pruning

    • Ensures that items are only assigned to leaf topics
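
How items attached to inner topics are handled is not stated in the slides. As one hedged possibility, the sketch below moves such items into a synthetic leaf child of the inner topic. The function name `prune_items`, the "(other)" naming scheme and the toy data are assumptions.

```python
def prune_items(children, item_topics):
    """Move items attached to inner topics into a synthetic leaf child.

    `children` maps topic -> list of child topics; `item_topics` maps
    item -> topic. Both structures are illustrative assumptions.
    """
    children = {topic: list(kids) for topic, kids in children.items()}
    pruned = {}
    for item, topic in item_topics.items():
        if children.get(topic):  # topic has children, so it is not a leaf
            leaf = f"{topic} (other)"
            if leaf not in children[topic]:
                children[topic].append(leaf)
                children[leaf] = []
            topic = leaf
        pruned[item] = topic
    return children, pruned

# Toy data: "toolbox" is attached to the inner topic "Craft".
hierarchy = {"Craft": ["Crochet"], "Crochet": []}
items = {"doily": "Crochet", "toolbox": "Craft"}
new_hierarchy, new_items = prune_items(hierarchy, items)
```

An alternative under the same constraint would be to reassign each such item to the most similar existing leaf topic.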

  • Vectorisation

    • Each object to be spatialised with MDS must be defined by a vector
      • Extract keywords from the titles and descriptions of items
      • Remove keywords that appear fewer than 5 times in the collection or in more than half the documents
      • Use TF-IDF (term frequency – inverse document frequency) over the remaining keywords to create the vectors

    • Items
      • Use the item’s keywords

    • Topics
      • Use the keywords of all items that belong to the topic or to one of its child topics

    $\mathrm{tf}(t, d) = f(t)$

    $\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$

    $\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$
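
The keyword filtering and TF-IDF weighting described above can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the project's code: the name `tfidf_vectors` is hypothetical, "appears fewer than 5 times" is read as total occurrences across the collection, and the term frequency is taken to be the raw count of the keyword in the document.

```python
import math
from collections import Counter

def tfidf_vectors(documents, min_count=5, max_doc_share=0.5):
    """Build TF-IDF vectors from keyword lists.

    `documents` is a list of keyword lists, one per item or topic.
    Returns the retained vocabulary and one vector per document.
    """
    total = Counter(k for doc in documents for k in doc)          # occurrences
    doc_freq = Counter(k for doc in documents for k in set(doc))  # documents
    vocabulary = sorted(
        k for k in total
        if total[k] >= min_count
        and doc_freq[k] <= max_doc_share * len(documents)
    )
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            counts[k] * math.log(len(documents) / doc_freq[k])  # tf * idf
            for k in vocabulary
        ])
    return vocabulary, vectors

# Toy collection; min_count is lowered so the tiny example keeps some keywords.
docs = [
    ["map", "library", "map"],
    ["library", "archive"],
    ["map", "archive", "photo"],
    ["library"],
]
vocabulary, vectors = tfidf_vectors(docs, min_count=2)
```

For a topic, the same function could be applied to the combined keyword list of all items in the topic and its child topics, as the slide describes.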

  • Core spatialisation

    • Hierarchy is spatialised bottom-up
      • Parent topic is spatialised after all its children have been spatialised
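
The bottom-up rule corresponds to a post-order traversal of the topic tree. The sketch below computes such an order iteratively. The function name `bottom_up_order` and the parent-child links in the toy tree are assumptions for illustration.

```python
def bottom_up_order(children, root):
    """Return topics so that every topic appears after all of its children."""
    order = []
    stack = [(root, False)]
    while stack:
        topic, expanded = stack.pop()
        if expanded:
            # All children were emitted before this entry is popped again.
            order.append(topic)
        else:
            stack.append((topic, True))
            for child in reversed(children.get(topic, [])):
                stack.append((child, False))
    return order

# Toy tree: topic names from the slides, links invented for illustration.
topic_tree = {"Arts": ["Craft", "Design"], "Craft": ["Crochet", "Watchmaker"]}
order = bottom_up_order(topic_tree, "Arts")
```

Processing topics in this order guarantees that a parent's spatialisation can use the finished layouts of its children.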

  • Core spatialisation

    [Figure panels: Initial spatialisation (Degenerate MDS); Neighbourhood graph; Final, compact spatialisation]

  • Parallelisation

    • Use the inverse tree as an activation graph

    [Figure: example topic hierarchy. Top level: Technology, Agriculture, Arts, Culture. Second level: Art, Craft, Design, Visual arts. Third level: Artisans, Crochet, Watchmaker.]

  • Parallelisation

    • Enables the algorithm to scale to large data-sets
      • 500 000 items processed in ~16 hours on a multi-core desktop processor

    • Limited by the shared map storage backend
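
One reading of "inverse tree as an activation graph" is that each finished child notifies its parent, and a parent becomes runnable once its last child has finished. The sketch below implements that reading with a thread pool. The function name `spatialise_in_parallel`, the `spatialise` callback and the toy tree are assumptions. A CPU-bound implementation in Python would more likely use processes than threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def spatialise_in_parallel(children, root, spatialise, workers=4):
    """Run `spatialise(topic)` for every topic, children before parents."""
    parent = {kid: topic for topic, kids in children.items() for kid in kids}
    # Number of children each topic is still waiting for.
    pending = {topic: len(children.get(topic, [])) for topic in [root, *parent]}
    leaves = [topic for topic, count in pending.items() if count == 0]
    lock = threading.Lock()
    finished = threading.Event()
    errors = []

    def run(topic):
        try:
            spatialise(topic)
        except Exception as exc:  # stop waiting if a task fails
            errors.append(exc)
            finished.set()
            return
        if topic == root:
            finished.set()
            return
        with lock:
            pending[parent[topic]] -= 1
            activated = pending[parent[topic]] == 0
        if activated:  # last child finished: the parent can now run
            pool.submit(run, parent[topic])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for topic in leaves:
            pool.submit(run, topic)
        finished.wait()
    if errors:
        raise errors[0]

# Toy tree: topic names from the slides, links invented for illustration.
tree_children = {"Arts": ["Craft", "Design"], "Craft": ["Crochet", "Watchmaker"]}
processed = []
spatialise_in_parallel(tree_children, "Arts", processed.append)
```

Independent subtrees run concurrently, while the counter per parent preserves the bottom-up ordering the core algorithm requires.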

  • Placement

    • Due to the parallel nature of the algorithm, topic areas will overlap

  • Post-processing

    • Re-calculate boundaries to make the map visually attractive

  • Semantic map

    • Provides general support for overview and exploration

    • Hierarchy provides overview labels at higher zoom levels

    • Interaction follows the widely adopted Google Maps pattern (zoom / pan)

    • At lower zoom levels allows interaction with individual items

    • Provides a natural interface for touch-based devices

  • Semantic map

    • Algorithm written in Python

    • Data stored in PostgreSQL + PostGIS database

    • Individual tiles rendered using
      • Mapnik – for the actual rendering
      • TileLite – for caching and serving

    • Web-based user interface provided via Leaflet

  • Where next?

    • Evaluation

    • Support continuous updates to the map

    • Create more “natural” boundaries

  • Thank you. Questions?

    See a demo at http://explorer.paths-project.eu