Big Data Brighton | Big Data in Academia | Jan 2013
January 2013 at
University of Brighton
http://meetup.com/Big-Data-Brighton
Agenda
• Miltos Petridis, Professor of Computer Science, University of Brighton
• Dr Patricia Roberts, Senior Lecturer & Researcher in database design, development and management, University of Brighton - Structured vs Unstructured Data: why structure matters.
• Simon Wibberley, PhD student in computational linguistics at the Text Analytics Group at the University of Sussex. Real-time text stream analysis, event detection, and entity recognition. Event detection on Twitter.
• Kevin Long, Teradata - Summary and Business context
Big Data
“A new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-speed capture, discovery and/or analysis” [1]
New investment initiatives are coming, such as in the US in 2012:
“more than $200 million in new funding through six agencies and departments to improve the nation’s ability to extract knowledge and insights from large and complex collections of digital data” [2]
Knowledge and insights... hmm
Before companies rush to use the technologies they should be asking some questions:
• Can we make any assumptions about the quality of the data we are using?
• Is there a significant difference between structured and unstructured data?
• Can the underlying structure of the data affect what you can do with it?
In this brief talk, I will be examining these questions with reference to my research and recent trends
Can we make any assumptions about the quality of the data we are using?
• One of the problems with the recent explosion in the amount of data is that some data (particularly data collected from social networking sites) is of dubious quality
– A straw poll of my students found that 1 in 5 deliberately enter incorrect data about themselves online to protect their identity
• We might not have any assurance that the data is true or that it is correctly linked to metadata
– Is data typed?
– Is the data related to other data? How is it related?
– Are relationships between data and its meaning being lost?
A view of different data models [3]
Is there a significant difference between structured and unstructured data?
• How is data structured?
• Does the underlying data model matter?
• What are the options for a data model?
• Over the years many models of data have evolved and most are still in use
• Data models used give insights into assumptions about the semantics of the data
Finding meaning from ‘flat’ data
• A problem with ‘flat’ or unstructured data representations is that they have traditionally been difficult to aggregate and present to users in a way that users can understand
• In contrast, structured data can be summarised easily and its structure represents the meaning of data within an organization
• Data analytics are changing this by presenting accessible information from ‘flat’ data
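To make the contrast between structured and ‘flat’ data concrete, here is a minimal sketch. The order records, the region names, and both helper functions are hypothetical and do not come from the talk. The structured version sums a named field directly; the flat version has to guess the structure back out of free text with a pattern that could easily be wrong for other inputs.

```python
import re
from collections import defaultdict

# Hypothetical structured records: every value sits in a named field.
orders = [
    {"region": "South", "amount": 120.0},
    {"region": "North", "amount": 80.0},
    {"region": "South", "amount": 45.5},
]

# The same facts as hypothetical flat text.
flat_notes = [
    "South office took an order worth 120.0",
    "order of 80.0 came in from North",
    "another South order, 45.5",
]


def total_by_region(rows):
    """Sum the 'amount' field for each 'region' in structured rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


def total_by_region_flat(notes, regions):
    """Recover the same totals from free text by guessing at its structure.

    Assumes each note names exactly one known region and contains exactly
    one number; notes that break either assumption are skipped.
    """
    totals = defaultdict(float)
    for note in notes:
        named = [r for r in regions if r in note]
        numbers = re.findall(r"\d+(?:\.\d+)?", note)
        if len(named) == 1 and len(numbers) == 1:
            totals[named[0]] += float(numbers[0])
    return dict(totals)


print(total_by_region(orders))
print(total_by_region_flat(flat_notes, ["North", "South"]))
```

Both calls give the same totals here, but only because the flat notes happen to fit the assumptions written into the second function.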
Can the underlying structure of the data affect what you can do with it?
• The short answer from my research is ‘YES’
• How it affects what you can do with the data is the long answer
– It is really easy to store a piece of data but retrieving it (intact with its meaning and its relationships to other data) is more difficult
– When ‘Big Data’ technologies are used to extract knowledge and insights from the data we should be sure that the technology is not introducing new problems
Impedance mismatch problems
• Moving data from one paradigm to another often causes the meaning to be lost
• Can cause problems for developers who move data from one paradigm to another
• Also a problem for end users who may lose the connections
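One small illustration of this kind of loss, as a sketch only: the Post and Author classes and the helper names are hypothetical, and JSON stands in for any schema-less ‘flat’ format. A typed date and a link to another object go in; plain strings and dictionaries come back.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class Author:
    name: str


@dataclass
class Post:
    author: Author  # a relationship to another object
    posted: date    # a typed value
    text: str


def to_flat_json(post):
    """Serialise a Post to JSON text; the date is written as a string."""
    return json.dumps(asdict(post), default=str)


def from_flat_json(payload):
    """Read the JSON back with no schema to guide the reconstruction."""
    return json.loads(payload)


original = Post(Author("A. Student"), date(2013, 1, 15), "hello")
restored = from_flat_json(to_flat_json(original))

# The values survive, but the types and the object relationship do not.
print(type(original.posted).__name__, "->", type(restored["posted"]).__name__)
print(type(original.author).__name__, "->", type(restored["author"]).__name__)
```

Getting the original objects back would need extra knowledge (a schema or hand-written conversion code) that the flat representation itself does not carry.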
A way forward
• Working out goals in your data management
• Understanding the structure of the data you are using, wherever it comes from
• Getting assurance about the quality of the data
• Then having confidence that the knowledge and insights are based on firm foundations
Thank you
Any questions?
References
1. Carter, P. (2011). Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO. SAS White Paper, IDC Go-to-Market Services.
2. Gianchandani, E. (2012). Obama administration unveils $200m big data R&D initiative. The Computing Community Consortium (CCC) Blog.
3. Angles, R. and Gutierrez, C. (2008). Survey of graph database models. ACM Comput. Surv. 40, 1, Article 1 (February 2008).
What are Events? We just don’t know.
Event Categories
Constrained Unconstrained
Well Reported
Poorly Reported
Interesting
Relatively Easy Interesting
Very Tricky
Algorithms
• Query Driven
– Volume / rate analysis of matching data
– Addresses constrained event type
• Data Driven
– Mine stream for interesting data
– Addresses unconstrained event type
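As a rough sketch of what query-driven volume / rate analysis could look like (this is not the speaker's implementation; the window size, the history length, and the threshold are assumed values chosen for illustration): count the items that match a query in fixed time windows, then flag any window whose count is far above the recent average.

```python
from collections import Counter
from statistics import mean, pstdev


def counts_per_window(timestamps, window_seconds):
    """Count matching items in consecutive fixed-size windows.

    `timestamps` are non-negative seconds from the start of the stream.
    """
    if not timestamps:
        return []
    counts = Counter(int(t // window_seconds) for t in timestamps)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def bursty_windows(counts, history=5, threshold=3.0):
    """Return indices of windows whose count exceeds the mean of the
    previous `history` windows by more than `threshold` standard deviations.
    """
    flagged = []
    for i in range(history, len(counts)):
        past = counts[i - history:i]
        limit = mean(past) + threshold * pstdev(past)
        if counts[i] > limit:
            flagged.append(i)
    return flagged


# Hypothetical per-window counts of tweets matching one query.
example_counts = [2, 3, 2, 3, 2, 20, 3]
print(bursty_windows(example_counts))
```

This only addresses the constrained case, where the query is known in advance. When the history is perfectly flat the standard deviation is zero, so any increase is flagged; a real system would probably want a minimum count as well.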
GB Dressage Gold
London Riots
Event Characterisation
• Fill in unknowns
• Self explanatory for (very) constrained events
• Select representative / well formed Tweet[s]
• Term relevance / clustering
• Topic analysis
• Geo-location / Entity extraction
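One simple way to select a representative Tweet from a set already grouped under one event is to score each Tweet by how common its terms are across the group. The sketch below is an assumption made for illustration, not the method used in the talk; the tokeniser, the scoring rule, and the example Tweets are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def most_representative(tweets):
    """Return the tweet with the highest average document frequency
    over its distinct terms, i.e. the one built from the most widely
    shared vocabulary in the group.
    """
    doc_freq = Counter()
    for tweet in tweets:
        doc_freq.update(set(tokenize(tweet)))

    def score(tweet):
        terms = set(tokenize(tweet))
        if not terms:
            return 0.0
        return sum(doc_freq[t] for t in terms) / len(terms)

    return max(tweets, key=score)


# Hypothetical Tweets grouped under one event.
cluster = [
    "Gold for GB in the dressage",
    "GB win dressage gold!",
    "What a day, need coffee",
]
print(most_representative(cluster))
```

A plain average like this favours short Tweets made of common words, so in practice it would likely be combined with stop-word removal and some check that the Tweet is well formed.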
CASM
• Centre for the Analysis of Social Media
• Collaboration between DEMOS and TAG
• Applying text analytics to social media to answer sociological questions
• OSI funded EU sentiment analysis pilot project
http://www.demos.co.uk/projects/casm/
Ethics
Narrow Broad
Anonymous
Identity Preserving
Stasi
Judiciary
Me!
Social Science
Reffin, J (2012)