Helen Berman Data Curation Interview

DATA CURATION INSIGHTS

HELEN BERMAN PROFESSOR

DIRECTOR, RCSB PROTEIN DATA BANK

BIG Big Data Public Private Forum

“So what are the changes? A lot more data, higher complexity and there is an acceptance for the methodology that we use.”

Introduction

Helen M. Berman is the director of the RCSB Protein Data Bank, one of the member organizations of the Worldwide Protein Data Bank, and a Board of Governors Professor of Chemistry and Chemical Biology at Rutgers University. She is a structural biologist whose work includes structural analysis of protein-nucleic acid complexes and the role of water in molecular interactions. She is also the founder and director of the Nucleic Acid Database, and leads the Protein Structure Initiative Structural Genomics Knowledge Base.

Edward Curry: Please tell us a little about yourself: your history in the industry, your role at the organisation, and your past experience.

Helen Berman: I have been involved in one way or another with the establishment and running of the Protein Data Bank for more than 40 years, since its founding. From the beginning, in 1966, we had a firm conviction in the value of such a big data resource, even though we couldn’t prove it yet. In 1971, the PDB was established at Brookhaven. I was spending a lot of time trying to convince people that it was a good idea to collect these data and archive them. Then we fast forward to 1998. In the 90s there were several calls for proposals to manage the PDB, and in 1998 I won the cooperative agreement to run the PDB in America. Once that happened, it became very obvious that there was a worldwide interest in handling the data, not just in America. We formed a consortium with EBI and the group in Osaka, Japan, to form a worldwide PDB, and the purpose of that was to agree on standards for the data as well as methods for processing, curating and distributing the data. So we basically formed this organisation ten years ago so that we have worldwide agreement on all of this. The PDB began in 1971 with about seven structures; we have about 95,000 now and we will have 100,000 by next year. So this is structured data determined by structural biologists by X-ray, NMR, electron microscopy, and other methods as they evolve. Lately, I have mostly been involved with standardisation and curation of the data.

Edward Curry: What is the biggest change going on in your

industry at this time? Is Big Data playing a role in it?

Helen Berman: Structure determination has definitely been transformed the most, as traditional X-ray crystallography has been joined by many new developments. This demands new ways of describing and representing the data.

The big changes right now in structural biology are that the structures are getting bigger and more complicated and that more methods are being used to solve them. After years of discussion, there is now tremendous acceptance of our idea that this should all be done by defining our terms. There used to be this idea that if you do that you are going to get in the way of creativity: just collect the data and don’t worry about it. But we always worried about the metadata, before other people worried about it. The good news is that now everybody we are working with has appreciated that, and they really have cooperated in every possible way to allow us to do that. That’s made a huge difference.

So what are the changes? A lot more data, higher complexity and there is an acceptance for the methodology that we use.

Edward Curry: How has your job changed in the past 12 months?

How do you expect it to change in the next 3–5 years?

Helen Berman: In the last 12 months there has been an enormous growth of the community. More people in the community have now accepted the way in which we have done things. We have driven the concept of the data dictionary very hard, and now that’s well accepted. The software developers are using this dictionary, so that's a big thing. It took a long time to get that accepted; we started creating these dictionaries about 20 years ago. So that's the biggest change: the acceptance. The way in which my job will change in the next three to five years is that we can do more: we can work on more complicated structures and figure out how to handle them. The part about convincing the community is no longer necessary.

Edward Curry: What does data curation mean to you in your

organisation?

Helen Berman: We base all our data curation on our data dictionaries, where every term is defined. Our data dictionary now has 5,000 terms, and we keep refining it so that it is very clear, and we expand it as new methods come in. All of the tools that we have developed, including the new software we are building with the Worldwide PDB, are completely based on this dictionary. The other issue is that we validate all the data based on community-recognized standards. The way we do that is, we have task forces that consist of experts in data validation; they come up with recommendations on how the data should be validated, and then we implement them. Again, it's all community driven. In the wwPDB, the members review the curation procedures to see whether or not they are giving us the best representation of the data. We are constantly reviewing the entire archive to see whether or not it is consistent and, if we notice that there are inconsistencies, we go back to see whether our validation or curation procedures can be improved so that we can get the data into better shape.
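As an illustration of the dictionary-driven validation described in this answer, the following is a minimal Python sketch, not the wwPDB's actual software. It assumes a tiny dictionary in which every term carries a definition and a validity check; the term names (`method`, `resolution`, `chain_count`) and the allowed values are hypothetical and are not taken from the real PDB dictionary.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Term:
    """One dictionary entry: a name, a definition, and a validity check."""
    name: str
    definition: str
    is_valid: Callable[[Any], bool]


def _positive_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool) and value > 0


# Hypothetical terms invented for this sketch (not the real PDB dictionary).
TERMS = [
    Term("method", "experimental method used to determine the structure",
         lambda v: v in {"X-RAY", "NMR", "EM"}),
    Term("resolution", "resolution of the experimental data, in angstroms",
         _positive_number),
    Term("chain_count", "number of chains in the model (no upper limit)",
         lambda v: isinstance(v, int) and not isinstance(v, bool) and v >= 1),
]
DICTIONARY: Dict[str, Term] = {term.name: term for term in TERMS}


def validate(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the entry passed."""
    problems: List[str] = []
    for item, value in entry.items():
        term = DICTIONARY.get(item)
        if term is None:
            # Anything not defined in the dictionary is rejected outright.
            problems.append(f"{item}: not defined in the dictionary")
        elif not term.is_valid(value):
            problems.append(f"{item}: {value!r} is not a valid value")
    return problems
```

Calling `validate({"method": "EM", "chain_count": 1300})` returns an empty list, while an item that the dictionary does not define is reported as a problem. Supporting a new method or data item then means adding a term to the dictionary rather than changing the validation code.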

Edward Curry: What data do you curate? What is the size of the dataset? How many users are involved in curating the data?

Helen Berman: We curate three-dimensional coordinates of biological macromolecules whose structures have been determined using established methods. Right now those methods are X-ray, NMR and EM. We look at the coordinates, and we also look at the supporting data underneath them. In the case of X-ray that could be structure factors, for example. We look at the maps from EM, we look at restraints and chemical shifts from NMR, and we see whether the model matches the data. We have 100,000 structures, about 750,000 files and about 300 GB in storage. There are about 20 annotators worldwide who are working on curating the data. There are probably another ten or twelve people involved in software development. Those are the people who are actually processing the data, and there are thousands of structural biologists who are submitting the data.

Edward Curry: What are the uses of the curated data? What

value does it give?

Helen Berman: It is used by researchers in molecular, structural, and computational biology and in pharmacology. The drug industry makes heavy use of it. It supports teaching biology students. We have more than 300,000,000 downloads of coordinate data in a year. The value of the data is that it helps to give insight into biological mechanisms and function. We look to see who uses the data, and we have noticed that mathematicians, statisticians and computer scientists use it to explore new methodologies for their research. That is not biological research, but research that handles complex data, because each data file has about 500 data items associated with it, not counting the coordinates (the metadata is about 500 items). The dictionary has about 5,000 terms; of those, around 500 are collected. As people become willing to deposit more data, we have all the terms in place, so when they finally decide they want to give us certain kinds of data we can take it. Everything is built on this very structured framework.

Edward Curry: What processes and technologies do you use

for data curation? Any comments on the performance or

design of the technologies?

Helen Berman: The three centers that I have talked about had two different data processing pipelines. We used the same algorithms to process the data, but different programs. About five years ago, we engaged in a project to create a common tool for deposition and annotation. Now we all use the same everything: a workflow manager, a web interface and a modular system. It is completely portable and extensible. Now we have three data centers, and with this methodology we could have ten data centers or 100. We’ve created a dictionary-driven deposition and annotation system with a workflow manager that has all the rules and all the experience that we’ve gained in the many years we have been doing this, and we have a pipeline for the data processing. Most of it is done computationally, and the role of the annotator or curator is to check at the end of the pipeline whether the final data make sense. As time goes on, we have more educated curators in the annotation step, because the things they have to look at are at a higher level: we let computers handle all of the routine stuff, and we free the curators to look at the actual meaning of the structure and to see whether or not it makes sense. The efficiency of the whole thing gets way better, and the number of structures per annotator has gone up. We haven’t increased our number of annotators in the last ten years, even though the structures have gone up from 2,000/year to 10,000/year; because we have better tools, we can do that.

We have created the software but we have borrowed from all the

experts in the field, reusing existing software and expertise in the

field.
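The division of labour described in this answer, with routine checks done computationally and the annotator looking at the result at the end, can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the wwPDB deposition and annotation system: the item names (`resolution`, `model_data_fit`), the threshold `MIN_FIT` and the idea of queueing only flagged entries are all hypothetical simplifications.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Deposition:
    """A deposited entry plus the flags raised by the automated steps."""
    entry_id: str
    items: Dict[str, float]
    flags: List[str] = field(default_factory=list)


# A pipeline step inspects a deposition and may append flags to it.
Step = Callable[[Deposition], None]

REQUIRED_ITEMS = ("resolution", "model_data_fit")  # hypothetical item names
MIN_FIT = 0.5  # arbitrary threshold chosen only for this sketch


def check_required_items(dep: Deposition) -> None:
    for name in REQUIRED_ITEMS:
        if name not in dep.items:
            dep.flags.append(f"missing item: {name}")


def check_model_fit(dep: Deposition) -> None:
    fit = dep.items.get("model_data_fit")
    if fit is not None and fit < MIN_FIT:
        dep.flags.append(f"model fits data poorly: {fit}")


# Modular: adding a check means adding a function to this list.
PIPELINE: List[Step] = [check_required_items, check_model_fit]


def run_pipeline(dep: Deposition, steps: List[Step] = PIPELINE) -> Deposition:
    """Run every automated step in order and return the annotated deposition."""
    for step in steps:
        step(dep)
    return dep


def review_queue(deps: List[Deposition]) -> List[str]:
    """Entries the human annotator should look at first (those with flags)."""
    return [dep.entry_id for dep in deps if dep.flags]
```

Because each step is an independent function, the same list of steps could be run unchanged at any number of data centers, which is the portability point made above.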

Edward Curry: What is your understanding of the term "Big

Data"?

Helen Berman: From my point of view, Big Data as it is defined now includes datasets that are very large, complex and extremely noisy, where the signal is very low, and the job of the Big Data experts is to figure out ways to extract information from this very noisy data. Imagine the X-ray crystallography pipeline that way: all the data comes out of the synchrotrons, this massive data, and we have no definition of terms, no methodologies, no algorithms developed, and then you throw it at a computer scientist and tell him, “You figure out what this means.” Then you would have to develop all kinds of methodologies to figure out the signal, which is the structure in our case. In the history of structural biology and X-ray crystallography, a few hundred years ago, people saw crystals and asked: why are they shiny? Why do they have straight edges? They then developed the whole thing called crystal systems based on just looking at these crystals, and it turns out that everything they developed by just looking at the morphology of the crystal is absolutely correct. Then, as time went on, they developed something called the International Tables for Crystallography, which organises the details of what are called space groups, which define these crystals, and there are 230 of them. So we have these tables on how crystals can be organized, and no one has ever demonstrated anything other than what these tables have shown. In the 20th century, people decided they wanted to look at the structures of proteins, so they started doing X-ray diffraction, and over time they developed methodologies for taking diffraction patterns from crystals.

So the diffraction patterns come from X-rays, and then people developed algorithms that return a structure from this data, which is very big. Over time, they figured out how to refine those structures. Before the PDB ever existed, the methods of going from the raw data to the crystal structures were well determined. Then the PDB came along and, because of people’s backgrounds (there were a lot of crystallographers), the same thing happened: let’s define every term, let’s figure out exactly how we collect the data and get the metadata exactly right.

Over time, this process became more sophisticated. What we have now in the current PDB is extraordinarily well-curated data, because you have all those very compulsive people who have spent their entire careers figuring out how you get the structure and how you curate the data. So our signal is very high, and now we are at the point of fussing about the signal. Do we have the standard deviation right for all the atoms? How do we describe the error properly? So when I try to understand why we are not Big Data, what I see is that we are not Big Data because we’ve had a whole group of people focusing all their attention on defining the terms, in a way that I’m not aware any other field has done.

The issue is the science, the technology and the community. The community has a mindset that says we must define our terms, we must be precise. Everybody who contributes to the PDB understands that. Compare that with some newer fields where there are huge amounts of data: if you ask somebody in such a field, “How did you get to that data point?”, how many communities will be able to tell you? Not very many. That is the issue that I see. I actually think that there is huge value in studying that. How do you get a community to cooperate? We are told we are the gold standard for databases in biology, and that is really built on the backs of people who had this idea that you should define everything. One of the people I talked to basically said: “The problem with you guys is that you solved the problem, so you are not Big Data.” And I said: maybe somebody should pay attention to what we did as a community, take the lessons from that community and apply them. There is the technology that was applied in this process, and there is the sociological perspective on how to build this community consensus. How do you get a community to actually cooperate? That wasn’t easy. There were times when people were fighting and screaming about all sorts of things, but in the end I think we succeeded in doing something from which another community could take the lesson and apply it, collapsing it down into a year or two instead of 40 years.

Edward Curry: What influences do you think "Big Data" will have on the future of data curation? What are the technological demands of curation in the "Big Data" context?

Helen Berman: What we focused on was figuring out how to develop ontologies and data models to describe the data. We spent a lot of time on that. It is a very laborious process. The other way is using natural language processing (NLP) and similar methods like pattern recognition and similarity functions to try to pull data out. I think that what really has to happen is to marry those two things. Don’t just do NLP, but come up with a way to marry the laborious dictionary and ontology development with the NLP, and cross those two things; from my point of view that is what has to happen. I hope I am wrong, but I don’t think that you can just throw modern computer science methodologies at this data without somehow capturing the controlled vocabularies and ontologies. I see those two communities not working together enough. It’s a big challenge. If you are going to solve Big Data problems you have to match the domain expertise with the computer science expertise.
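One very small way to picture "marrying" a curated vocabulary with similarity functions is sketched below in Python. It is an illustration under stated assumptions rather than a method described in the interview: the synonym table is hypothetical, and Python's standard `difflib` string similarity stands in for real NLP. Free text is mapped onto a controlled term when it is close enough to a curated synonym, and otherwise left unresolved for a human curator.

```python
import difflib
from typing import Optional

# Hypothetical controlled vocabulary: curated synonym -> canonical term.
SYNONYMS = {
    "x-ray diffraction": "X-RAY DIFFRACTION",
    "x-ray crystallography": "X-RAY DIFFRACTION",
    "nuclear magnetic resonance": "NMR",
    "nmr spectroscopy": "NMR",
    "electron microscopy": "ELECTRON MICROSCOPY",
    "cryo-electron microscopy": "ELECTRON MICROSCOPY",
}


def to_controlled_term(text: str, cutoff: float = 0.8) -> Optional[str]:
    """Map free text to a canonical term, or None if nothing is close enough."""
    # Normalise case and whitespace before comparing.
    key = " ".join(text.lower().split())
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Fall back to string similarity against the curated synonyms.
    close = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[close[0]] if close else None
```

Here the misspelling `"xray crystalography"` still resolves to the canonical term, while an unrelated phrase returns `None` and would go to a curator. The similarity function only proposes matches; the curated vocabulary decides which answers are allowed.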

It would be important to bring a sociologist into the analysis of the problem as well. A sociologist can take a look at some successful and some unsuccessful cases.

In terms of curation challenges, for example, we’ve got the structure of a large HIV capsid, a huge structure with 1,300 chains in it, and we had to figure out how to represent it in a mechanism that wasn’t used to handling this. Because we had been thinking about this when we created our dictionary, we had set it to have no limit on atoms or chains. In the end we did it. We are going to have more structures that are determined by five different methods, and we are going to have to put it all together and figure out what the error limits are. A structure that is very small and has a large data/parameter ratio will be a lot easier to represent accurately than other structures, the huge macromolecular machines that we are going to need to handle. That’s where our biggest challenge is, and that’s why it’s fun. That’s modern science.

Edward Curry: What data curation technologies will cope with

"Big Data"?

Helen Berman: As I said before, we need to work out how to bridge the two communities (natural language processing and ontology engineering) better. We also need to be able to handle large volumes of data and to process them. Being able to decide how much data, and which data, we need to collect is very important. I think you need to think ahead every time: what am I going to get if I collect this data? We don’t do formal crowdsourcing. You want to make sure that what you are getting is the best possible curation. The crowdsourcing we do is through these task forces: whenever we have a problem to solve, we bring in experts, sit down and talk for a couple of days, and say, here are the questions we have, how do we deal with them? So we bring in people in that way, but we don’t actually bring in people to process the data. Even with a very well trained annotator with good tools, it requires a lot of surveillance.

About the BIG Project

The BIG project aims to create a collaborative platform to address the challenges and discuss the opportunities offered by technologies for processing large volumes of data (Big Data) and their impact on the new economy. BIG’s contributions will be crucial for industry, the scientific community, policy makers and the general public, since the management of large amounts of data plays an increasingly important role in society and in the current economy.

CONTACT

http://big-project.eu

TECHNICAL WORKING GROUPS AND INDUSTRY

Project co-funded by the European Commission within the 7th Framework Programme (Grant Agreement No. 318062)

Collaborative Project

in Information and Communication Technologies

2012 — 2014

Project Coordinator

Jose Maria Cavanillas de San Segundo

Research & Innovation Director

Atos Spain Corporation

Albarracín 25

28037 Madrid, Spain

Phone: +34 912148609

Fax: +34 917543252

Strategic Director

Prof. Wolfgang Wahlster

CEO and Scientific Director Deutsches Forschungszentrum für Künstliche Intelligenz

66123 Saarbrücken, Germany

Phone: +49 681 85775 5252 or 5251

Fax: +49 681 85775 5383


Data Curation Working Group

Dr. Edward Curry

Digital Enterprise Research Institute

National University of Ireland, Galway

