Using socially authored content to provide new routes through content archives
Using socially authored content to provide new routes through existing content archives
Rob Lee : [email protected]
Using commons authored content to add new paths into, through and out of your content archives
Socially authored content (user-generated content) has been with us for about 8 years now. Time magazine cover from 2006. It's mainstream and only going to increase - how can we use it with our content?
Wikipedia - the primary example of community-created content, the largest commons-based project on the web, with over 2 million entries on things from Axolotl to Zebu.
Wikipedia supports wilfing - What Was I Looking For ?
Wikipedia has lots of links between things, these things have relationships, not sure what these relationships are OR how strong they are but they exist
http://ben.pixelmachine.org/articles/2008/04/25/mapping-the-distraction-that-is-wikipedia
“Mapping the distraction that is Wikipedia”
Ben Hughes' post on pixelmachine, "Mapping the distraction that is Wikipedia" - he wrote a Greasemonkey script with a back-end app that tracked his Wikipedia usage and created a map so he could see how he moved through Wikipedia. Demonstrates the linked web of information.
A brief interlude
Innovation Labs+
Lots of inline hyperlinks. Supports wilfing - interesting to visualise these in a different way.
First 20 internal page links from the Wikipedia page for Ruby on Rails - we can see MVC/DHH/MIT/Ruby. Lots of signal, some noise -> provides context.
MIT Licence
MySQL AB
Web App
David Heinemeier Hansson
X-Platform
MVC
37 Signals
Web App Framework
March 14, 2007
Ruby Programming Language
Open Source
Don’t Repeat Yourself
Web Design
Web App Framework
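As a sketch of the idea above: a minimal extractor that pulls the first internal article links out of a Wikipedia page's HTML. The `sample` snippet is a hand-written stand-in for the real Ruby on Rails article, and the namespace filter (skipping `File:`, `Category:` and friends) is deliberately naive.

```python
import re
from urllib.parse import unquote

def internal_links(html, limit=20):
    """Return up to `limit` distinct internal Wikipedia article links,
    skipping non-article namespaces (File:, Category:, Help:, ...)."""
    links = []
    for match in re.finditer(r'href="/wiki/([^"#]+)"', html):
        target = match.group(1)
        if ":" in target:          # crude namespace filter
            continue
        title = unquote(target).replace("_", " ")
        if title not in links:
            links.append(title)
        if len(links) == limit:
            break
    return links

# Hypothetical snippet standing in for the Ruby on Rails article HTML.
sample = (
    '<a href="/wiki/Model%E2%80%93view%E2%80%93controller">MVC</a> '
    '<a href="/wiki/David_Heinemeier_Hansson">DHH</a> '
    '<a href="/wiki/File:Rails_logo.png">logo</a> '
    '<a href="/wiki/MIT_License">MIT</a>'
)
print(internal_links(sample))
```

A real run would fetch the article HTML first; the extraction step is the same either way.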
Organic Farming
Sewage Sludge
Japan
Human Waste
Farmers’ Market
Food Additive
European Union
United States
Ionizing Radiation
Fertilizer
Organic Farm
Pesticide
New York Times
Organic Certification
Antibiotic
Organic Food
background?
Menston
West Riding of Yorkshire
River Wharfe
Metropolitan Borough
Burley-in-Wharfedale
The Chevin
Sailing
West Yorkshire
Pool-in-Wharfedale
England
Angling
1944
Wharfedale
City of Leeds
Birdwatching
Otley
Great - we can generate these relationships for anything that has a representation in Wikipedia - over x million things
Circles around: comparison of web app frameworks, interview link, DHH says no to use of Rails logo, Railscasts.
It's not just the internal links that are useful - there is typically an 'interesting' set of external links too. Some are functional, but others are picked by users and thus likely to be interesting.
How can we start to utilise this structure - how can we relate it to our content?
How do we relate it to our commons content? We need to relate Wikipedia data to our content, but most archives have poor classification/descriptive data - journalists make poor librarians. How can we improve the classification of content in our archives? How can we find out what it's about?
Various automated techniques are available, including semantic analysis and term extraction. We cheated - we used YTE (Yahoo Term Extraction). Now we can say what characterises a story and start to relate it to Wikipedia.
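Since YTE is a black box here, a crude stand-in for the term-extraction step might look like the following - plain frequency counting with a stopword list, nothing like a real extraction service, but enough to show where a story's characteristic terms come from. The story text is invented for illustration.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it",
             "for", "on", "that", "with", "as", "was", "at", "by"}

def extract_terms(text, top_n=5):
    """Very crude keyword extraction: most frequent non-stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words
                     if w not in STOPWORDS and len(w) > 2)
    return [term for term, _ in counts.most_common(top_n)]

story = ("Organic farming avoids synthetic pesticide use. Organic "
         "certification bodies inspect each organic farm, and organic "
         "produce is often sold at a farmers' market.")
print(extract_terms(story, top_n=3))
```

Each extracted term then becomes a candidate lookup into Wikipedia.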
Too much information - we've established a set of relationships, but how do we pick out something useful AND relevant/interesting from all these articles? How do we rank information, and what information do we want to extract and use? We concentrated on external links, as the BBC's external links at the time were very functional (chosen by journalists without much time?) and not very interesting - they didn't really support wilfing.
How does this page support wilfing? How can we make it better?
CCTV example -> Nottinghamshire Police, Guardian.
How do we rank external links? Some links will be quite functional and some will be 'interesting' - users can easily tell the difference in Wikipedia, but how can we automate it?
We looked at Google and Technorati buzz - but we need to consider both popularity and context when recommending links.
Use another third-party service to rank external URLs. We tried Google rank and Technorati buzz, but they didn't necessarily rank the interesting links highly (e.g. try searching for google.com on Google and see how many hits there are). del.icio.us is different, though: users provide both context and ranking, via tagging and the act of bookmarking a URL. So for each external link from all the Wikipedia pages, we can see how many of the originally extracted tags match the tags for that URL - high match = high rank. This tells us how relevant the URL is to the story in question, and the fact it's been bookmarked at all means it's interesting to someone.
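The del.icio.us ranking idea above could be sketched like this, with bookmark tags hard-coded rather than fetched from the (long-gone) del.icio.us API; the URLs and tags are invented for illustration.

```python
def tag_overlap_score(story_terms, url_tags):
    """Fraction of a URL's bookmark tags that match the story's
    extracted terms: high overlap = relevant, and the fact someone
    bookmarked it at all suggests it's interesting."""
    story = {t.lower() for t in story_terms}
    tags = [t.lower() for t in url_tags]
    if not tags:
        return 0.0
    return sum(1 for t in tags if t in story) / len(tags)

def rank_links(story_terms, tagged_urls):
    """Order candidate external links by tag overlap, best first."""
    return sorted(tagged_urls,
                  key=lambda item: tag_overlap_score(story_terms, item[1]),
                  reverse=True)

# Hypothetical bookmark data for a CCTV surveillance story.
story_terms = ["cctv", "surveillance", "police", "privacy"]
candidates = [
    ("http://example.org/camera-shop", ["shopping", "cameras"]),
    ("http://example.org/cctv-report", ["cctv", "surveillance", "privacy"]),
]
best = rank_links(story_terms, candidates)[0][0]
print(best)
```

The scoring function is the whole trick: popularity alone (hit counts) would rank the camera shop; tag context does not.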
highlight muddyboots internet links section
A story marked up with links generated by our system, 'MuddyBoots' - tramping new paths through the BBC News Archive! This technique can be used for internal links as well - you just need to be a big media org with lots of links in del.icio.us and Wikipedia (BBC News has over 83,000 in Wikipedia at last check).
Recent story - the journalist has added good related links this time. Interestingly, the MuddyBoots system has 'recommended' many of the same links independently - nice verification of the method.
Problems: real-time performance - it currently takes about two minutes to generate links for a story, due to API access and data-set sizes (we have a local mirror of the Wikipedia DB, but del.icio.us queries are expensive). Story classification is sometimes incorrect. It doesn't always work - sometimes the coverage just isn't there in del.icio.us (not such a biggie). Disambiguation can be a big problem, causing false positives - the main issue. Further problems: language and geographical relevance.
Problem with language here - most of the links are in Vietnamese. Other problems also occur, such as links to more local geography.
Wikipedia has a LOT of data - many different things could be used as measures of interestingness. Other metrics that could define 'interesting' and be used in our process: Wikipedia traffic stats, maybe grouping recommendations by IP, locked articles, or Ma.gnolia instead of del.icio.us (for speed - and it includes a rating system).
A more semantic approach can help: if we can say this is a person or a company or a place, rather than just a piece of text, then we can be more certain about what it is and thus less ambiguous. It would be good if we had a single point of reference for this 'thing'.
Apple Apple ?
How about using something that has semantics built in?
geonames.org
Wikipedia is great for humans, not so great for computers. Fortunately, projects exist that make it easier for computers too: DBpedia and Freebase both expose Wikipedia in a more semantic way. DBpedia has information about more than 2.18 million things, including over 80,000 people.
It's also possible to link in other data sources such as GeoNames or MusicBrainz, so we can find out more information than was previously possible using Wikipedia alone.
For a given term we can find out a) whether it's ambiguous and b) what the term could possibly mean. It's possible to use this information to help determine which thing the term actually refers to in the original text.
So if our story is about Apple computers, then it's a fair bet it mentions iPhone, iTunes, Wozniak, Steve Jobs - all of which can be found on that page, but are unlikely to be found on the page about apple the fruit. We can then say with a much greater degree of confidence that this is the 'Apple' referenced in our story.
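A toy version of that co-occurrence test: each candidate sense carries a set of related terms (which, in a real system, would come from the Wikipedia/DBpedia page for that sense rather than a hand-written table), and the sense whose terms show up most often in the story wins. The sense table and story text here are invented assumptions.

```python
# Hand-written stand-in for the related terms harvested from each
# candidate sense's Wikipedia/DBpedia page.
SENSES = {
    "Apple Inc.": {"iphone", "itunes", "wozniak", "steve jobs", "mac"},
    "Apple (fruit)": {"orchard", "cider", "pectin", "tree", "pie"},
}

def disambiguate(story_text, senses=SENSES):
    """Pick the sense whose related terms appear most often in the story."""
    text = story_text.lower()
    scores = {sense: sum(1 for term in related if term in text)
              for sense, related in senses.items()}
    return max(scores, key=scores.get)

story = ("Apple unveiled a new iPhone today; Steve Jobs demonstrated "
         "the updated iTunes store on stage.")
print(disambiguate(story))  # → "Apple Inc."
```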
We can also say that this is a company (or, if it was Steve Jobs for example, a person). How about we use this information to mark up our original content with hCard microformats? Suddenly we've created a way for semantic aggregators to know our story is really about Apple the company - we can drive more targeted, better-quality search traffic to our site. New routes into the content.
We can also create better relationships between our own content: if one story is about Apple the company, then we can recommend another that has been semantically marked up the same way - great for large content archives with diverse material, e.g. news. We can better define what is related.
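A minimal example of the hCard idea - generating markup for an organisation so aggregators can see the story is about the company rather than the fruit. The class names are standard hCard (`vcard`, `fn`, `org`, `url`); the helper function itself is hypothetical.

```python
def hcard_org(name, url):
    """Render a minimal hCard for an organisation. Putting `fn` and
    `org` on the same element marks the card as an organisation
    rather than a person."""
    return ('<span class="vcard">'
            '<a class="fn org url" href="{url}">{name}</a>'
            '</span>').format(url=url, name=name)

markup = hcard_org("Apple Inc.", "http://www.apple.com/")
print(markup)
```

The same disambiguated entity could equally drive "related stories" lookups, since both stories now point at the same thing.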
We've just created a simple OpenCalais from commons-sourced data.
How accurate is Wikipedia? What are the problems with using commons-sourced data? A Nature study (2005) found Wikipedia comparable in accuracy to Encyclopaedia Britannica (contested). Stern magazine compared the German-language version of Wikipedia (50 articles) and found it more accurate than the encyclopedia it was compared against.
What mechanisms are there to prevent gaming/incorrect data? Locked articles, featured article candidates, a formal peer review process, and FlaggedRevs coming soon.
In summary: there's lots of commons data out there, and we need to look at different ways we can use it. Hopefully I've demonstrated a few and given you some ideas for revitalising your content archives.
Photos:
http://www.flickr.com/photos/drydens/2320501752/
http://www.flickr.com/photos/laughingsquid/260374487/
http://www.flickr.com/photos/bekathwia/2120050762/
http://www.flickr.com/photos/25094278@N02/2368981952/
http://www.flickr.com/photos/pmtorrone/1570649523
http://www.flickr.com/photos/kightp/1562395268
http://www.flickr.com/photos/valter/604187726/
http://flickr.com/photos/speedye/1489563371/
Rob Lee:[email protected]