Towards a Data Network for Integrated Social Science Research

Towards a Data Network for Integrated Social Science Research Micah Altman Harvard University Archival Director, Henry A. Murray Research Archive Associate Director, Harvard-MIT Data Center Senior Research Scientist, Institute for Quantitative Social Sciences E: [email protected] W: http://maltman.hmdc.harvard.edu / [Presented at the DLF Meeting 2008]

Transcript of Towards a Data Network for Integrated Social Science Research

Page 1: Towards a Data Network for Integrated Social Science Research

Towards a Data Network for Integrated Social Science Research

Micah Altman, Harvard University

Archival Director, Henry A. Murray Research Archive
Associate Director, Harvard-MIT Data Center
Senior Research Scientist, Institute for Quantitative Social Sciences

E: [email protected] W: http://maltman.hmdc.harvard.edu/

[Presented at the DLF Meeting 2008]

Page 2: Towards a Data Network for Integrated Social Science Research

This Talk

Why is Access to Social Science Data Important?
What are Challenges to Integrated Access?
Social Science and Cyberinfrastructure
  Google ++ (--?)
  Dataverse Network (DVN): Virtual Archiving
  Data Preservation Alliance for Social Sciences (Data-PASS): Replicated Institutional Preservation
  The Social Science Research Computing Environment (RCE): Social Science & Research Workflows
Conclusions

Micah Altman, Senior Research Scientist

Introduction Access Challenges Google++

DVN Data-PASS RCE Conclusions

[Soc. Sci. Data Networks, DLF 2008](Page 2)

Page 3: Towards a Data Network for Integrated Social Science Research

Related Work

Articles
  M. Altman and G. King, “A Proposed Standard for the Scholarly Citation of Quantitative Data”, D-Lib, 13, 3/4 (March/April), 2007.
  M. Altman, et al., “Data Preservation Alliance for the Social Sciences: A Model for Collaboration”, Proceedings of DigCcurr07, Chapel Hill, April 2007.
  G. King, “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing”, Sociological Methods and Research, 32, 2 (November, 2007): 173–199.
  M. Altman, “A Fingerprint Method for Verification of Scientific Data”, in Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007), Springer Verlag. Forthcoming 2008.

Collaborators & Co-conspirators
  Margaret Adams, Ken Bollen, Cavan Capps, Jonathan Crabtree, Darrell Donakowski, Myron Gutmann, Gary King, Lois Timms-Ferrarra, Marc Maynard, Amy Pienta

Research Support
  Thanks to the Library of Congress (PA#NDP03-1), the National Institute on Aging (P01 AG17625-01), the National Science Foundation (SES-0318275, IIS-9874747), the Harvard University Library, the Institute for Quantitative Social Science, the Harvard-MIT Data Center, and the Murray Research Archive.


Page 4: Towards a Data Network for Integrated Social Science Research

What is Digital Social-Science Data?

DIGITAL
  Optical: DVD, CD
  Magnetic: Tapes, ‘Floppies’
  Paper: cards, tapes

SOCIAL SCIENCE
  Social: class, crime, social movements, culture, folklore, family
  Economic: wealth, prosperity, labor, business, equity
  Psychology: cognition, attitudes, stereotypes
  Politics: justice, democracy, public policy, public administration, international conflict

DATA
  Raw measurements
  Numeric tables
  Administrative records (& email)
  Video and audio interviews, transcripts (& blogs)
  Digital objects (web sites, interactive databases)


Page 5: Towards a Data Network for Integrated Social Science Research

Data Access is the Key to Science

Science is not (only) about being scientific
Scientific progress requires community: competition and collaboration in the pursuit of common goals
Without access to the same materials: no community exists

… data is the nucleus of collaboration.

The value of an article that can’t be replicated: ?
Scholarly articles are summaries, not the actual research results
But:
  Data access is spotty by field, and finding the data is still hard
  Hard for journal editors to verify
  If you find it, how do you know it’s the same?
  Replication projects show: most published articles in social science cannot be replicated

… data is necessary for replication and verification


Page 6: Towards a Data Network for Integrated Social Science Research

Data Access is a Key to Democracy

Statistics = state-istics
The state tax authority: counting people, estimating wealth
Reformers use data to assess the performance of the state
Science informs public policy continually
In modern democracy: the public needs a direct source of information


Page 7: Towards a Data Network for Integrated Social Science Research

How Data Is Lost

Data Intentionally Discarded
  “It was just too long ago, I generally keep data for something like 10 years beyond the last time I do something with them.”
  “Destroyed, in accord with APA 5-year post-publication rule.”

Unintentional Hardware Problems
  “Some data were collected, but the data file was lost in a technical malfunction.”

Destroyed for Confidentiality Reasons
  “The material…was considered sensitive data. Institutional review boards.. required us to promise to destroy the data after a certain period of time...”

Acts of Nature
  “The data from the studies were on punched cards that were destroyed in a flood in the department in the early 80s.”

Discarded or Lost in a Move
  “As I retired …. Unfortunately, I simply didn’t have the room to store these data sets at my house.”

Obsolescence
  “Speech recordings stored on a LISP Machine…, an experimental computer which is long obsolete.”

Simply Lost
  “For all I know, they are on a [University] server, but it has been literally years and years since the research was done, and my files are long gone.”


Page 8: Towards a Data Network for Integrated Social Science Research

Challenges to Research and Policy
  Legal Challenges
  Technical Privacy Challenges
  Data Deluge
  New Forms of Research


Page 9: Towards a Data Network for Integrated Social Science Research

Legal Requirements

[Diagram of legal requirements affecting data access. Areas shown: Personal Information; Open Access; Intellectual Property; Contract. Specific items shown: HIPAA, FERPA, 45 CFR 26, invasion of privacy, defamation, FOIA, state FOI, DMCA, copyright, trademark, patent, trade secret, contracts, licenses, click-wrap agreements. Stakeholders shown: sponsor interests, publisher interests, contributor interests, consumer interest.]

Page 10: Towards a Data Network for Integrated Social Science Research

Technical Privacy Challenges

Some challenging findings…
  Large, sparse datasets can “leak” private information when correlated with external data, even when significantly sub-sampled, perturbed, etc. [Narayanan and Shmatikov 2008]
  Repeated release of perturbation-masked geospatial point data leaks increasing amounts of information. Combining it with aggregation masking does not help. [Zimmerman and Pavlik 2008]
  It is possible to identify other relationships in a network if you can generate seemingly innocuous relationships in the same network. [Backstrom et al. 2007]
  Pseudonymous communication can be linked through textual analysis. [Tomkins et al. 2004]
  K-anonymized data are still vulnerable if homogeneous, or if the attacker has enough background knowledge. L-diversity is offered as a replacement. [Machanavajjhala et al. 2007]

Additional anonymization challenges for geospatial data
  Very fine-grained location, versus the multi-state aggregation mask required by HIPAA and large social science surveys
  Background knowledge very likely: easy to integrate with other datasets; some data points may be directly observable
  Sequences of locations even more challenging: may cross aggregation units; repetitive, temporally correlated; induces unique networks
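The homogeneity problem with k-anonymized data noted above can be illustrated with a minimal sketch. The records, field names, and helper functions below are hypothetical and exist only for illustration; "l" here is the simplest (distinct-values) reading of l-diversity, not necessarily the variant used in the cited work.

```python
from collections import defaultdict


def group_by_quasi_identifiers(rows, quasi_ids):
    """Group rows by their combination of quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_ids)
        groups[key].append(row)
    return groups


def k_anonymity(rows, quasi_ids):
    """Size of the smallest group: the table is k-anonymous for this k."""
    groups = group_by_quasi_identifiers(rows, quasi_ids)
    return min(len(g) for g in groups.values())


def distinct_l_diversity(rows, quasi_ids, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    groups = group_by_quasi_identifiers(rows, quasi_ids)
    return min(len({r[sensitive] for r in g}) for g in groups.values())


# Hypothetical, already-generalized records (illustrative only).
records = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "024**", "age": "40-49", "diagnosis": "asthma"},
    {"zip": "024**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "024**", "age": "40-49", "diagnosis": "diabetes"},
]

k = k_anonymity(records, ["zip", "age"])
l = distinct_l_diversity(records, ["zip", "age"], "diagnosis")
print(f"k = {k}, l = {l}")
```

In this toy table every group has three members, yet the first group shares a single diagnosis, so anyone known to be in that group is exposed even though the table is 3-anonymous.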


Page 11: Towards a Data Network for Integrated Social Science Research

Management of Legal Risks

Embedding all sensitive data access in a digital library can greatly improve subject privacy:
  Authentication, vetting, and access control
  Standardized license terms governing analysis (derived from metadata and data characteristics)
  Models can be run on-line without access to raw data
  Monitoring and auditing of data use
  Limit sequence of analyses by a user, in some cases (for promising results, see [Dwork et al. 2006])

Licensing and Intellectual Property Protections
  Standard license terms and metadata
  Click-through agreements, vetting workflows
  Authentication, auditing, logging
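One way to picture "models run on-line without access to raw data" combined with a limit on a user's sequence of analyses is a query server that adds noise to each answer and refuses further queries once a per-user budget is spent. The sketch below is an assumption-laden illustration in the spirit of noise-calibration approaches; the class name, the budget accounting, and the parameter `epsilon` are choices made here, not details given in the talk.

```python
import random


class BudgetedQueryServer:
    """Answers count queries with Laplace noise until a budget is spent."""

    def __init__(self, values, total_epsilon, seed=None):
        self._values = list(values)      # raw data never leaves the server
        self._remaining = total_epsilon  # hypothetical per-user budget
        self._rng = random.Random(seed)

    def _laplace(self, scale):
        # The difference of two i.i.d. exponentials is Laplace-distributed.
        rate = 1 / scale
        return self._rng.expovariate(rate) - self._rng.expovariate(rate)

    def noisy_count(self, predicate, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise PermissionError("privacy budget exhausted for this user")
        self._remaining -= epsilon
        true_count = sum(1 for v in self._values if predicate(v))
        # A count changes by at most 1 when one record changes,
        # so the noise scale is 1 / epsilon.
        return true_count + self._laplace(1 / epsilon)

    @property
    def remaining(self):
        return self._remaining


ages = [23, 35, 41, 52, 67]  # illustrative data only
server = BudgetedQueryServer(ages, total_epsilon=1.0, seed=0)
first = server.noisy_count(lambda a: a > 40, epsilon=0.5)
second = server.noisy_count(lambda a: a < 30, epsilon=0.5)
try:
    server.noisy_count(lambda a: True, epsilon=0.5)
    refused = False
except PermissionError:
    refused = True
print(refused)
```

The user only ever sees noisy aggregates, and the third request is refused because the two earlier queries used up the whole budget.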


Page 12: Towards a Data Network for Integrated Social Science Research

Long-Term Social Science Data Needs*…

Social science –> human activities and perceptions
Computational capacity of human brain: 10^14 – 10^19?
Future storage of a human history: 10^30 bytes/person?
Compare to 10^10 bytes for a long high-res fMRI session

* Or, “what are you thinking?”

Page 13: Towards a Data Network for Integrated Social Science Research

Social Science Data Deluge*…

Collective holdings of all U.S. numeric social science data in all major data archives and government repositories: estimated at tens of TB
“Ambient” data increasingly becoming subject of social science research. Data deluge (2002 annual volumes):
  Web (surface): 167 TB
  Radio: 3,500 TB
  Television: 69,000 TB
  Web (deep): 92,000 TB
  Email (originals): 441,000 TB
  Telephone: 18,000,000 TB

* Or, “what are you thinking?”

Page 14: Towards a Data Network for Integrated Social Science Research

Research Infrastructure Challenges

Social science challenges…
  Few definitive answers
  Complex conceptual primitives
  Complex theories of behavior
  Reliance on observational data
  Specification uncertainty
  Changing evidence base (blogs, video, continuously recorded behavioral data)

Some trends
  Compute-intensive inferential statistics
  Specification searches
  Sensitivity analyses
  Curse of dimensionality
  Data explosion
  Changing evidence base
  Agent-based models


Page 15: Towards a Data Network for Integrated Social Science Research

Why Infrastructure for Data?

Accessibility:
  Most large data sets: in public archives
  Most data in published articles: not accessible, results not replicable without the original author
  Most data sets from federal grants: not publicly available

Problems even with professional archives:
  Data in different archives have different identifiers
  Archives change identifiers, links
  Changes to data are made; identifiers are reused or removed; old data are lost

Data sets are not like books
  Static data files (even if on the web): unreadable after a few years
  When storage methods change: some data sets are lost; others have altered content!

Why not a single centralized infrastructure?
  Single point of failure
  Data is heterogeneous in format, origin, size, effort needed to collect or analyze, IRB access rules, etc.
  Data producers want credit, control, and visibility

Requirements
  Recognition: for data producers, distributors, related publishers
  Rule-based Public Distribution
  Authorization: fulfill requirements the author originally met
  Validation: check that data exists, without authorization
  Persistence: decades from now...
  Verification: meaning of data remains unchanged, even as formats and computer systems change
  Ease of Use: researchers are not archivists
  Standardize and Document
  Legal Protections: IRB, intellectual property


Page 16: Towards a Data Network for Integrated Social Science Research

Emerging Technologies* Social Science Data
  Google++
  Virtual-Hosted archives
  Workflow systems
  Data networks

* Plus Ça Change, Plus C'est Fou

Page 17: Towards a Data Network for Integrated Social Science Research

Google++ (--?)

[Figure: a series of images joined by “+” signs, ending in “= ?”*]

* Can you count how many ’s are in this picture?

Privacy? Law? Preservation? Analysis?

Page 18: Towards a Data Network for Integrated Social Science Research

Virtual Archiving: The Dataverse Network*

An Open-Source, Federated, Web 2.0 Data Network

Gateway to over 20,000 social science studies (world’s largest catalog)
Web Virtual Hosting 2.0 Service
Federated access to other networks
Unified access to major U.S. research data archives, government data
Open service – endowed hosting
Open source – GPL-Affero-3

Discovery Services
  Simple & fielded search
  Virtual collection browsing

Management
  Ingest
  Curation & review
  Virtual hosting and administration

Metadata delivery
  Descriptive and structural
  Provenance (chain-of-custody metadata)
  Human and OAI interfaces

Preservation
  Standards based
  Reformatting
  Universal Numeric Fingerprints

Enhanced Delivery
  Replication
  Layered analysis services

To date: 132 Dataverses; 23,058 Studies; 576,387 Files (April 28, 2008)
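The idea behind a numeric fingerprint, a checksum of what the data mean rather than of how a file happens to store them, can be sketched as follows. This is a simplified illustration and not the published Universal Numeric Fingerprint algorithm: the normalization rule, the choice of SHA-256, and the function names are assumptions made for the example.

```python
import hashlib


def normalize(value, digits=7):
    """Render a value as canonical text, independent of storage format."""
    if isinstance(value, (int, float)):
        # Fixed significant digits in exponent notation, so that
        # 1, 1.0 and 1.0000000001 all normalize to the same text.
        return format(float(value), f".{digits - 1}e")
    return str(value)


def fingerprint(column, digits=7):
    """Hash the normalized values of one column, in order."""
    h = hashlib.sha256()
    for value in column:
        h.update(normalize(value, digits).encode("utf-8"))
        h.update(b"\n")  # delimiter so ["1", "2"] and ["12"] differ
    return h.hexdigest()


# The same measurements as read from two hypothetical file formats,
# plus a version in which one value was changed.
from_text_file = [1.0, 2.5, 3.0]
from_binary_file = [1, 2.5000000001, 3]
altered = [1, 2.6, 3]

same = fingerprint(from_text_file) == fingerprint(from_binary_file)
changed = fingerprint(from_text_file) != fingerprint(altered)
print(same, changed)
```

Because the hash is computed over normalized values, reformatting a file leaves the fingerprint unchanged, while an edit to the data changes it.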


Page 19: Towards a Data Network for Integrated Social Science Research

DVN Screenshots

http://dvn.iq.harvard.edu/

Page 20: Towards a Data Network for Integrated Social Science Research

Some Dataverse Uses
  Future researchers: discovery; linking; forward citation; verification; analysis
  Journals, for replication
  Authors, for their own data
  Teachers, in-depth analysis
  Sections of scholarly organizations, to organize existing data
  Granting agencies
  Research centers
  Archives
  Major research projects
  Academic departments, universities, centers, libraries


Page 21: Towards a Data Network for Integrated Social Science Research

DVN: Data Citations

Citations are a traditional formal mechanism to link together intellectual works
Citations glue together: Regulations, Publications, and Evidence
But, lack of rules for citing numeric data:
  No consistency in practice
  No fixed rules for copyeditors
  Sometimes in the list of references; sometimes a casual mention in the text
  Sometimes the archive is noted
  Sometimes a version number exists
  Sometimes the version number is listed (if it exists)
  Archive numbers are sometimes given, if they exist
  Sometimes the author is noted
  Date of creation is sometimes given
  URLs often given, rarely persist
  Dates of access: protect the researcher, do not help find the data
  The data may not be available publicly
  The data may no longer exist


Page 22: Towards a Data Network for Integrated Social Science Research

A Unified Citation Standard for Quantitative Data


Page 23: Towards a Data Network for Integrated Social Science Research

DVN: What’s New

Timeline
  Version 1.0 (release): Dec. 2007
  Version 1.1: March 2008
  Version 1.2: April 2008

New Stuff
  OAI enhancements: export custom sets (1.2); import DC, FGDC (1.1) as well as DDI
  Data services: zip delivery of remote files (1.2); plain-text and tab-delimited exports (1.2)
  Java 6 support (1.2)
  Workflow support enhancements
    Terms of use on login, upload, and download, configurable at network, dataverse, and study level (1.1, 1.2)
    Enhanced workflows for account requests, password recovery, non-privileged (“drop box”) submissions, submissions review (1.1, 1.2)
  Network admin UI enhancements
    JHOVE validation of individual studies (1.2)
    Batch ingest (1.2)
  Numerous other performance, end-user, curator, and network UI enhancements

Future: 2.0 (summer)
  Data services: save analyses to R, additional formats
  GUI for assigning geographic bounding box for study
  Support harvesting of DVN through LOCKSS
  Export multiple citation formats
  And many more features scheduled, including Open Journal integration and GenePattern workflow integration

See: http://thedata.org/software/releases
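For readers unfamiliar with OAI harvesting, the request a harvester sends for a custom set can be sketched as below. The verb and the `metadataPrefix` and `set` arguments come from the OAI-PMH protocol itself; the base URL and the set name are placeholders, not actual DVN endpoints, and the helper function is hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # restrict the harvest to one set
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint and set name, for illustration only.
url = list_records_url("http://example.org/oai", "oai_dc",
                       set_spec="my_custom_set")
print(url)
```

A harvester would fetch this URL and parse the XML response; `oai_dc` (Dublin Core) is the metadata format every OAI-PMH repository is required to support.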


Page 24: Towards a Data Network for Integrated Social Science Research

DataPASS


Page 25: Towards a Data Network for Integrated Social Science Research

Collaboration for Preservation

Partnership Agreements
  Agreement to establish good practice
  Preservation copies of data collected
  Transfer Protocol: in case of archival failure

Cooperating Operations
  Central database of leads for acquisition
  Development of shared procedures
  Review of acquisitions

Joint “Not-bad” practices
  Identification & selection
  Metadata
  Security
  Confidentiality

Shared Catalog
  Unified Discovery
  Content exchange
  Layered Services

"Nothing new that is really interesting comes without collaboration" -- James Watson

Page 26: Towards a Data Network for Integrated Social Science Research

Data Rescued Examples

U.S. Information Agency Surveys
  Directly informed U.S. foreign policy through surveys of foreign public opinion
  Previously, only surveys from 1970-1990 were held in the National Archives
  Collaboration between NARA and Roper to create a much more complete series spanning 1950-1990
  Surveys conducted in Europe, Latin America, and Asian countries
  Recent subjects include nuclear arms control, US-Soviet relations, the US strike on Libya, the Soviet invasion of Afghanistan, economic matters, terrorism, economic summits, drug trafficking, democratization, and conflicts in El Salvador and Nicaragua

Longitudinal Study of Personality Development, by Jack and Jeanne Humphrey Block
  The most intensive study of human personality development in existence
  Thirty-year longitudinal study
  Mixed methods: quantitative, audio, video
  More than 100 instruments, and 1000’s of measures (variables)
  Resulted in more than 100 publications
  (Also shows how whiny kids are more likely to grow up to be conservatives.)

National Network of State Polls
  Diverse membership of 50 members in 38 states
  Covers a tremendous range of local and national issues
  Data imminently at risk


Page 27: Towards a Data Network for Integrated Social Science Research

Selected Topics & Sponsors

Topics: Political activity, political activism, voting behavior, protest activity, voter registration, fundraising, political alienation, relationship to the Black community, feminism, racial identity, attitudes toward abortion, attitudes toward federal programs; television viewing habits, effects of having children on the marriage, giving too much/little independence, discipline, overscheduling, overprotecting, measuring levels of success in teaching values, self-control, good citizenship, good money habits, religion, worries that parents have of the future facing their children; problems facing parents and children, from drugs, sex, and violence to the lack of various family and religious values; daycare, mothers working, childrearing, taxes, government spending, morals, children’s issues, economy, jobs, education, crime, health care, social security, local school administration, standardized testing, impact of poor scores on teachers, higher academic standards needed, too much/little homework, summer school, teachers, administrators, quality of academics, discipline matters, class size, level of science and math skills taught, Shakespeare, life skills, athletics, citizenship; role of the US in the world and assessing US performance, terrorism, war in Iraq, respondent-identified level of understanding of foreign affairs, US and foreign aid, assisting emerging democracies, enhancing national security, image of the US abroad; seriousness of welfare problems (abuse, fraud, generational, etc.); assessing list of remedies (limit duration, require job training, provide day care, unannounced visits, business tax breaks for hiring recipients, penalize recipients who have more children, etc.); profiling welfare recipients (e.g. more likely to be better/worse parents, lazy or hardworking, from troubled families); defining the American Ideal, how to teach kids what it means to be American, national identity, appreciation of freedoms in the US, importance of voting, ashamed of nation's history of racism, job US does in teaching immigrant children, bilingualism, fly an American flag; the meaning of the rights the Constitution guarantees, assessing the level of appreciation of those rights in the US and how it is perceived by the international community; aging; money managers; union organizations, employers, and labor market institutions; tort law reforms; crime and urbanization; law and social control; natural disasters; awareness of self

Sponsors: NSF, NIH, The Danforth Foundation, The Ford Foundation, The David and Lucile Packard Foundation, Ewing Marion Kauffman Foundation, State Farm Insurance, Ronald McDonald House Charities, Advertising Council, American Federation of Teachers, the Annenberg Institute, the George Gund Foundation, the National School Boards Association, U.S. Department of Education, GE Foundation, Nellie Mae Education Foundation, Wallace Foundation, Bill & Melinda Gates Foundation, Pew Charitable Trusts, National Constitution Center, Alliance for Aging Research, American Federation for Aging Research, the MacArthur Foundation, NIMH


Page 28: Towards a Data Network for Integrated Social Science Research


Page 29: Towards a Data Network for Integrated Social Science Research


Page 30: Towards a Data Network for Integrated Social Science Research

Replication as Institutional Insurance

Schema driven: capture inter-archival preservation commitments
Asymmetric: resource commitments proportional to holdings
Versioned: versioned data and citations
Integration: LOCKSS + Archival Replication Schema + DVN technology + archival workflows

Data-PASS Syndicated Storage Project

External Causes of Preservation Failure
  Third party attacks
  Institutional funding
  Change in legal regimes

Quis custodiet ipsos custodes?
  Unintentional curatorial modification
  Loss of institutional knowledge & skills
  Intentional removal
  Change in institutional mission
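The “asymmetric” property, resource commitments proportional to holdings, amounts to simple arithmetic, sketched below. The archive names, sizes, and the idea of splitting a fixed storage pool are hypothetical; the slide does not give the actual formula used by the project.

```python
def proportional_commitments(holdings_gb, pool_gb):
    """Split a shared storage pool in proportion to each archive's holdings."""
    total = sum(holdings_gb.values())
    return {name: pool_gb * size / total
            for name, size in holdings_gb.items()}


# Hypothetical archives and sizes, in GB.
holdings = {"archive_a": 6000, "archive_b": 3000, "archive_c": 1000}
commitments = proportional_commitments(holdings, pool_gb=20000)
print(commitments)
```

An archive holding 60% of the content commits 60% of the pooled storage, so small partners are not asked to match the resources of large ones.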


Page 31: Towards a Data Network for Integrated Social Science Research

Workflow Systems*

Emerging tools for integration of research process in natural sciences
Orchestrate data collection, transformation, analysis
Examples: Taverna, Kepler, GenePattern, VisTrails
Most are science and grid-oriented
Address different parts of the scholarly work lifecycle
Not focused on social science tasks

* Or “life on the grid”


Page 32: Towards a Data Network for Integrated Social Science Research

Intersection of DL and Workflows

GenePattern
  Genomics workflow system
  Supports construction of complex reproducible data analysis pipelines
  Targeted to local operations, but can make use of some job queueing systems (LSF, SGE)
  http://www.broad.mit.edu/cancer/software/genepattern/

Integration project
  Extends coverage of total research lifecycle
  DVN will store GenePattern analyses as they evolve
  When analyses are published, dissemination, preservation and reuse should be seamless
  Funded project in early planning stage


Page 33: Towards a Data Network for Integrated Social Science Research

New Social Science From Social Science “Research Computing Environment”

Project
  Assess need for high performance computing among social scientists at Harvard
  Prototype interfaces to make grid computing usable by social scientists

Examples
  Harvesting and analysis of blogs for virtual political opinion surveys
  Continuous collection of C-SPAN, real-time subject coding, continuous dissemination
  Cell phone data: movement, proximity to others, social network analysis
  Participative goals-based redistricting
  Agent-based models of emerging institutions
  fMRI analyses of reaction to political and social scenarios

Modal Features**
  Analyses emerge through exploration and interactions
  Data collection from non-experimental, non-instrumental sources
  Increasing scale of data
  Compute limited
  Data confidentiality
  High-level analysis tools
  Remote collaboration is part of projects

** Meta-features of social science: messy data + an abundance of plausible models


Page 34: Towards a Data Network for Integrated Social Science Research

Mind the Gaps
  No tool covers entire scholarly research lifecycle
  Most tools immature
  Poor integration across most tools
  Many tools for hard science do not meet social science needs for non-experimental messy data (“strange sensors”), confidentiality, complex inferential methods …
  Decoupling of dissemination, formal publication, citation, peer-review
  No tools integrate comprehensive, standard, flexible control over privacy, intellectual property

[Figure: tool categories mapped against stages of the research lifecycle. Stage labels: design, publishing, dissemination, preservation, reuse, collection, processing, integration, analysis. Tool categories: cati / capi; sweave / statdocs; citations / identifiers; Google-__________; data archives, hosting, networks; general digital libraries and repositories; workflow systems.]


Page 35: Towards a Data Network for Integrated Social Science Research

For More Information


Dataverse Network Project: http://TheData.Org

Data-PASS Alliance: http://www.icpsr.umich.edu/DATAPASS/

Contact me:

http://maltman.hmdc.harvard.edu/ <[email protected]>
