Methods for Exploiting Academic Hyperlinks

Mike Thelwall

Statistical Cybermetrics Research Group

University of Wolverhampton, UK

The Problem

To map patterns of communication between researchers in a country based upon university web sites

Patterns of communication are also mapped based upon journal citations or journal title words
  Provides useful information about the structure and evolution of research fields
  Can identify previously unknown field connections

Web analysis could illustrate wider and more current patterns

Data collection

Web crawler
AltaVista advanced queries, e.g. host:wlv.ac.uk AND link:gla.ac.uk
AllTheWeb advanced queries
Google
  Does not support the same level of Boolean querying
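As a rough illustration of the query format above, the following Python sketch builds one host:/link: query string for every ordered pair of university domains. The function names are hypothetical, and the sketch only constructs the strings; it does not contact any search engine.

```python
from itertools import permutations


def link_query(source_domain: str, target_domain: str) -> str:
    """Build a query for pages hosted on source_domain that link to target_domain."""
    return f"host:{source_domain} AND link:{target_domain}"


def all_pair_queries(domains):
    """Return one query per ordered (source, target) pair of distinct domains."""
    return {(s, t): link_query(s, t) for s, t in permutations(domains, 2)}


queries = all_pair_queries(["wlv.ac.uk", "gla.ac.uk"])
for pair, query in queries.items():
    print(pair, query)
```

Each query would then be submitted to a search engine that supports this syntax, and the reported hit count recorded as the link count for that pair.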

Types of link count

Direct link counts
  Inter-site links only

Co-inlink counts
  B and C are co-inlinked

Co-outlink counts
  D and E are co-outlinked

[Diagram: pages A to F, with B and C co-inlinked and D and E co-outlinked]
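The three count types can be sketched in Python as follows. The edge list is an assumption made for illustration (A links to both B and C, and D and E both link to F); the original diagram's exact arrows are not recoverable from the transcript, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Assumed example links as (source, target) pairs.
links = [("A", "B"), ("A", "C"), ("D", "F"), ("E", "F")]


def direct_link_counts(links):
    """Count links for each (source, target) pair, skipping self-links."""
    return Counter((s, t) for s, t in links if s != t)


def co_inlink_counts(links):
    """For each unordered pair of nodes, count the sources that link to both."""
    targets_of = defaultdict(set)
    for s, t in links:
        targets_of[s].add(t)
    counts = Counter()
    for targets in targets_of.values():
        for pair in combinations(sorted(targets), 2):
            counts[pair] += 1
    return counts


def co_outlink_counts(links):
    """For each unordered pair of nodes, count the targets that both link to."""
    # Reversing every link turns shared targets into shared sources.
    return co_inlink_counts([(t, s) for s, t in links])


print(co_inlink_counts(links))
print(co_outlink_counts(links))
```

The nodes here can be pages, or sites if the links have first been aggregated to site level.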

Alternative Document Models

Domain ADM
  Count links between domains (ignoring multiple links) instead of pages

[Diagram: pages P1 to P6 grouped under the domains www.scit.wlv.ac.uk and www.dcs.gla.ac.uk]

Alternative Document Models

Directory ADM
  Counts links between directories
  Estimated using URL slashes

University ADM
  Counts links between entire university Web sites
  Too extreme for most purposes

ADMs reduce the impact of replicated links
  E.g. a subsite of 1000 pages linking to another university home page in its navigation bar
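A minimal Python sketch of the ADM idea, under stated assumptions: each URL is mapped to its page, directory, domain or university unit, and each distinct (source unit, target unit) pair is counted once. The function names, the example URLs and the rule that a UK university is identified by the last three labels of its host name (e.g. wlv.ac.uk) are illustrative assumptions, not taken from the original slides.

```python
from urllib.parse import urlsplit


def adm_unit(url: str, model: str) -> str:
    """Map a URL to the unit it belongs to under the given document model."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path or "/"
    if model == "page":
        return host + path
    if model == "directory":
        # Keep everything up to and including the last slash of the path.
        return host + path.rsplit("/", 1)[0] + "/"
    if model == "domain":
        return host
    if model == "university":
        # Assumption: the last three labels identify a UK university site.
        return ".".join(host.split(".")[-3:])
    raise ValueError(f"unknown model: {model}")


def count_links(links, model: str) -> int:
    """Count distinct links between different units, ignoring repeats."""
    pairs = {(adm_unit(s, model), adm_unit(t, model)) for s, t in links}
    return sum(1 for s, t in pairs if s != t)


# Hypothetical subsite: 1000 pages in two directories, each linking to
# another university's home page from a navigation bar.
links = [
    (f"http://www.scit.wlv.ac.uk/sub/{'a' if i % 2 else 'b'}/p{i}.html",
     "http://www.gla.ac.uk/")
    for i in range(1000)
]

for model in ("page", "directory", "domain", "university"):
    print(model, count_links(links, model))
```

Under the page model the replicated navigation link is counted 1000 times, whereas the coarser models collapse it to one or two links.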

Some Inter-University Hyperlink Patterns

For the UK and Europe

Citation-Style Hyperlink Analysis

Citation counts are known to be reasonable indicators of research quality, but is the same true for inlink counts?

Counts of links to universities within a country can correlate significantly with measures of research productivity

The significance of this result is in giving ‘permission’ to investigate the use of inter-university links for researching scholarly communication
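One way such a correlation could be checked is with a rank correlation between inlink counts and a productivity measure. The sketch below implements Spearman's coefficient in plain Python; the choice of Spearman is an assumption (the slides do not name the statistic used), and the numbers are invented placeholders, not real university data.

```python
def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values for four hypothetical universities.
inlink_counts = [120, 450, 300, 80]
productivity = [1.0, 3.5, 2.0, 0.5]
print(spearman(inlink_counts, productivity))
```

A rank-based statistic is a cautious choice here because link counts tend to be highly skewed.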

Most links are only loosely related to research

90% of links between UK university sites have some connection with scholarly activity, including teaching and research
  But less than 1% are equivalent to citations

So link counts do not measure research dissemination but are more a natural by-product of scholarly activity
  Cannot use link counts to assess research
  Can use link counts to track an aspect of communication

Links to UK universities against their research productivity

The reason for the strong correlation is the quantity of Web publication, not its quality

This is different to citation analysis

Universities tend to link to neighbours

Universities cluster geographically

Language is a factor in international interlinking

English is the dominant language for Web sites in the Western EU

In a typical country, 50% of pages are in the national language(s) and 50% in English

Non-English-speaking countries extensively interlink in English

{Research with Rong Tang & Liz Price}

Can map patterns of international communication

Counts of links between EU universities in Swedish are represented by arrow thickness.

Counts of links between EU universities in French are represented by arrow thickness.

Which language???

Which language???

Linking patterns vary enormously by discipline

No evidence of a significant geographic trend

Disciplinary differences in the extent of interlinking: e.g., History Web use is very low, Chemistry is very high

Individual research projects can have an enormous impact upon individual departments
  E.g. Arts web sites are often for specific exhibitions or for digital media projects

Links not frequent enough to reliably reveal patterns of interdisciplinarity

Clustering using links

Background: Power laws in Academic Webs

Academic Webs have a topology dominated by power laws, including
  Counts of links to pages (inlink counts)
  Counts of links from pages (outlink counts)
  Groups of interconnected pages
    Directed component sizes
    Undirected component sizes

Power laws mean that clustering connected components will not yield useful results
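To make the component-size distribution concrete, the following Python sketch finds undirected connected components with a breadth-first search and tabulates how often each size occurs. The function name and the toy edge list are illustrative assumptions; on real crawl data the resulting frequency table is what would be plotted on log-log axes to look for a power law.

```python
from collections import Counter, defaultdict, deque


def undirected_component_sizes(links):
    """Return the size of each connected component, ignoring link direction."""
    neighbours = defaultdict(set)
    for source, target in links:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen = set()
    sizes = []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node covers one component.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        sizes.append(size)
    return sizes


# Toy links: one component of three pages and two components of two pages.
links = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "g")]
size_frequency = Counter(undirected_component_sizes(links))
print(sorted(size_frequency.items()))
```

If sizes follow a power law, a few components are very large and most are tiny, which is why taking connected components as clusters is uninformative.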

Page Outlinks

Topological component sizes

Community Identification Algorithm

Can apply to page, directory and domain models

Gives complementary results: a “layered approach”

[Figure: frequency against community size on logarithmic axes, directory model, k = 32]

[Figure: frequency against community size on logarithmic axes, page model, k = 32]

Stretching links further: co-inlinks, co-outlinks

For the UK academic Web, about 42% of domains connected by links alone host similar disciplines, and about 43% of those connected by links, co-inlinks and co-outlinks

But over 100 times more domains are colinked or coupled than are directly linked

Links in any form are less than 50% reliable as indicators of subject similarity

Summary

Studies of the relatively restricted subdomain of university web sites produce directly useful results

For Web IR, they also
  Help refine methodologies
  Help build intuition