Helen Berman Data Curation Interview
DATA CURATION INSIGHTS
HELEN BERMAN PROFESSOR
DIRECTOR, RCSB PROTEIN DATA BANK
BIG Big Data Public Private Forum
“So what are the changes? A lot more data, higher complexity and there is an acceptance for the methodology that we use.”
Introduction
Helen M. Berman is the director of the RCSB Protein Data Bank—
one of the member organizations of the Worldwide Protein Data
Bank and a Board of Governors Professor of Chemistry and
Chemical Biology at Rutgers University. A structural biologist, her
work includes structural analysis of protein-nucleic acid
complexes, and the role of water in molecular interactions. She is
also the founder and director of the Nucleic Acid Database, and
leads the Protein Structure Initiative Structural Genomics
Knowledge Base.
Edward Curry: Please tell us a little about yourself and your
history in the industry, role at the organisation, past
experience?
Helen Berman: I have been involved in one way or the other with
the establishment and running of the Protein Data Bank for more
than 40 years since its founding. From the beginning, in 1966, we
had a firm conviction in the value of such a big data resource,
even though we couldn’t prove it yet. In 1971, the PDB was
established at Brookhaven. I was spending a lot of time trying to
convince people that this was a good idea, to collect this data and
archive them. And then, we fast forward to 1998. In the 90’s there
were several calls for proposals to manage the PDB and in 1998 I
won the cooperative agreement to run the PDB in America and
then, once that happened, it became very obvious that there was a
worldwide interest in handling the data, not just in America. We
then formed a consortium with EBI and the group in Osaka, Japan,
to create a worldwide PDB, and the purpose of that was to agree on
standards for the data as well as methods for processing,
curating and distributing the data. So, we basically formed this
organisation ten years ago so that we have worldwide agreement
on all of this. The PDB began in 1971 with about seven structures,
we will have 100,000 by next year and we have about 95,000 now.
So this is structured data determined by structural biologists by X-ray
crystallography, NMR, electron microscopy, and other methods as they
evolve. Lately, I have mostly been involved with standardisation
and curation of the data.
Edward Curry: What is the biggest change going on in your
industry at this time? Is Big Data playing a role in it?
Helen Berman: Structure determination has definitely been
transformed the most, as the traditional X-Ray crystallography has
been joined by many new developments. This demands new ways
of describing and representing the data.
The big changes right now in structural biology are that the
structures are getting bigger and more complicated and that more
methods are being used to solve the structures. After years of
discussion, there is now a tremendous acceptance for our idea that
this should all be done by defining our terms. So there used to be this
idea that, if you do that you are going to get in the way of creativity,
just collect the data and don’t worry about it. But we always worried
about the metadata before other people worried about it. But now the
good news is that everybody we are working with has appreciated that
and they really have cooperated in every possible way to allow us to
do that. That’s made a huge difference.
So what are the changes? A lot more data, higher complexity and
there is an acceptance for the methodology that we use.
Edward Curry: How has your job changed in the past 12 months?
How do you expect it to change in the next 3–5 years?
Helen Berman: In the last 12 months there has been an enormous
growth of the community. More people in the community have now
accepted the way in which we have done things. So we have driven
very hard the concept of the data dictionary and now that’s well
accepted. The software developers are using this dictionary, so that's
a big thing. It took a long time to get that to be accepted. We started
creating these dictionaries about 20 years ago. So that's the biggest
change: it is the acceptance. The way in which my job will change in
the next three to five years is that we can work on more complicated
structures, figure out how to handle them. We can do more, we can
work on more complicated structures. That part of convincing the
community is no longer necessary.
Edward Curry: What does data curation mean to you in your
organisation?
Helen Berman: We base all our data curation on our data
dictionaries, which is where every term is defined. Our data dictionary
now has 5,000 terms and we keep refining it so that it is very
clear, and we expand it as new methods come in. All of the tools that
we have developed are built on it and, with the Worldwide PDB, we are
writing new software that is completely based on this dictionary. The other issue
is that we validate all the data based on community recognized
standards. So the way we do that is, we have task forces that
consist of experts in data validation and they come up with
recommendations on how the data should be validated and then we
implement it. Again, it's all community driven. In the wwPDB, the
members review the curation procedures to see whether or not they
are giving us the best representation of the data so we are
constantly reviewing the entire archive to see whether or not it is
consistent and, if we notice that there are inconsistencies, then we
go back to see whether our validation procedures or curation
procedures can be improved so that we can get data in a better
shape.
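As an illustration of the dictionary-driven approach described in this answer, the following is a minimal sketch in Python, not the wwPDB software. The three terms, their names and their rules are hypothetical and stand in for a real dictionary with thousands of entries. The point is only that every value is checked against an explicit definition, and that an undefined term is itself reported as a problem.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class TermDefinition:
    """One dictionary entry: what a term means and which values are legal."""
    description: str
    check: Callable[[Any], bool]


# Hypothetical three-term dictionary (illustrative names and rules only).
DICTIONARY: Dict[str, TermDefinition] = {
    "method": TermDefinition(
        "Experimental method used to determine the structure",
        lambda v: v in {"X-RAY", "NMR", "EM"},
    ),
    "resolution_angstrom": TermDefinition(
        "Reported resolution, a positive number",
        lambda v: isinstance(v, (int, float)) and v > 0,
    ),
    "chain_count": TermDefinition(
        "Number of polymer chains, at least one",
        lambda v: isinstance(v, int) and v >= 1,
    ),
}


def validate(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems: List[str] = []
    for name, value in record.items():
        definition = DICTIONARY.get(name)
        if definition is None:
            problems.append(f"{name}: not defined in the dictionary")
        elif not definition.check(value):
            problems.append(f"{name}: value {value!r} violates the definition")
    return problems
```

Calling `validate({"method": "NMR", "chain_count": 2})` returns an empty list, while a record that uses an undefined term or an illegal value returns one message per problem.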
Edward Curry: What data do you curate? What is the size of
dataset? How many users are involved in curating the data?
Helen Berman: We curate three dimensional coordinates of
biological macromolecules, whose structures have been determined
using established methods. Right now those methods are X-rays,
NMR and EM. We look at the coordinates, and we also look at the
supporting data underneath that data. In the case of X-ray that
could be structure factors, for example. We look at the maps from
EM and at the restraints and chemical shifts from NMR, and we
check whether the model matches the data. We have 100,000
structures and about 750,000 files and about 300 GB in storage.
There are about 20 annotators worldwide who are working on
curating the data. There are probably another ten or twelve people
involved in software development. Those are the people who
actually process the data, and there are thousands of structural
biologists who submit the data.
Edward Curry: What are the uses of the curated data? What
value does it give?
Helen Berman: It is used by the researchers in molecular,
structural, and computational biology and in pharmacology. The
drug industry makes heavy use of it. It supports teaching biology
students. We have more than 300,000,000 downloads of coordinate
data in a year. The value of the data is that it gives insight into
biological mechanisms and function. We look to see who uses the
data and we noticed that mathematicians, statisticians and computer
scientists use it to explore new methodologies for their research. It is not
biological research, but research that handles complex data,
because each data file has about 500 data items associated with
it, not counting the coordinates (the metadata is about 500 items). The
dictionary has about 5,000 terms; of those, around 500 are collected.
As people are willing to deposit the data, we have all the terms in
place so when they finally decide they want to give us certain kinds
of data we can do it. Everything is built on this very structured
framework.
Edward Curry: What processes and technologies do you use
for data curation? Any comments on the performance or
design of the technologies?
Helen Berman: The three centers that I have talked about had two
different data processing pipelines. We use the same algorithms to
process the data, but different programs. About five years ago, we
engaged in a project to create a common tool for deposition and
annotation. We all use the same everything. We use a workflow
manager, a web interface and a modular system that is completely
portable and extensible. Now we have three data centers and we
can have ten data centers with this methodology or 100 data
centers. We’ve created a dictionary-driven deposition annotation
system with a workflow manager that has all the rules and all the
experience that we’ve gained in the many years we have been
doing it and we have a pipeline for the data processing. Most of it is
done computationally, and the role of the annotator or the curator is
to check at the end of the pipeline whether the final data makes
sense. As time goes on, in the annotation step, we have more
educated curators because the things that they have to look at are at a
higher level: we let computers handle all of the routine work, and
we leave the curators free to look at the actual meaning of the structure and
to see whether or not that makes sense. And then the efficiency of
the whole thing gets way better. So the number of structures per
annotator has gone up. So we haven’t increased our number of
annotators in the last ten years, even though the structures have
gone up from 2,000/year to 10,000/year, but because we have
better tools we can do that.
We have created the software but we have borrowed from all the
experts in the field, reusing existing software and expertise in the
field.
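The division of labour described in this answer, where automated steps do the routine checking and the annotator reads the result at the end, can be sketched roughly as follows. This is an assumed, simplified design in Python and not the actual wwPDB deposition and annotation system. The check functions, field names and report layout are all hypothetical.

```python
from typing import Callable, Dict, List

Entry = Dict[str, object]
Check = Callable[[Entry], List[str]]  # each check returns human-readable warnings


def check_method(entry: Entry) -> List[str]:
    # Routine check: the experimental method must be one of the accepted values.
    if entry.get("method") not in {"X-RAY", "NMR", "EM"}:
        return ["unknown experimental method"]
    return []


def check_has_chains(entry: Entry) -> List[str]:
    # Routine check: a deposited model needs at least one chain.
    chains = entry.get("chains")
    if not isinstance(chains, list) or not chains:
        return ["no chains in model"]
    return []


def run_pipeline(entry: Entry, checks: List[Check]) -> Dict[str, object]:
    """Run every automated check and build the report the annotator reads last."""
    warnings: List[str] = []
    for check in checks:
        warnings.extend(check(entry))
    return {
        "id": entry.get("id"),
        "warnings": warnings,
        "needs_attention": bool(warnings),
    }


# Adding a new check means appending a function here; nothing else changes.
CHECKS: List[Check] = [check_method, check_has_chains]
```

Because the checks are plain functions in a list, the same pipeline could in principle be run unchanged at any number of sites, which is the portability property mentioned above.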
Edward Curry: What is your understanding of the term "Big
Data"?
Helen Berman: From my point of view, Big Data as it is defined
now, includes datasets that are very large, complex and extremely
noisy where the signal is very low and the job of the Big Data
experts is to figure out ways to extract information from this very
noisy data. Regarding the X-ray crystallography pipeline, imagine that all the
data comes out of the synchrotrons, this massive data, and we had
no definition of terms, no methodologies, no algorithms developed,
and then you throw it at a computer scientist and tell him,
you figure out what this means. Then you would have to
develop all kinds of methodologies to figure out the signal, which is
the structure in our case. In the history of structural biology and X-ray
crystallography, a few hundred years ago, people saw crystals,
and asked: why are they shining? Why do they have
straight edges? They then developed the whole thing called crystal
systems based on just looking at these crystals, and it turns out that
everything they developed by just looking at the morphology of the
crystal is absolutely correct. Then as time went on, they developed
something called the International Tables for Crystallography, which
organises the details of what are called space groups that define
these crystals, and there are 230 of them. So then we have these
tables on how crystals can be organized and no one has ever
demonstrated anything other than what these tables have shown.
In the 20th century, people decided they want to look at the
structures of proteins so they started doing X-ray diffraction and
over time they developed methodologies for taking diffraction
patterns from crystals.
So the diffraction patterns come from X-rays and then people
developed algorithms which return structure from this data, which is
very big. Over time, they figured out how to refine those structures.
Before the PDB ever existed, the methods of going from the raw
data to the crystal structures were well determined. Then the PDB
came along and, because of the people’s background (there were a lot of
crystallographers), the same thing happened: let’s define every
term, let’s figure out exactly how we collect the data and get the
metadata exactly right.
Over time, this process became more sophisticated. The current
PDB, what we have now is extraordinarily well-curated data.
Because you have all those very compulsive people that have spent
their entire careers figuring out how do you get the structure and
how you curate the data. So our signal is very high and now we are
up to the point of fussing about the signal. So do we have the
standard deviation right for all the atoms? How do we describe the
error properly? So from what I see when I try to understand why we
are not Big Data, we are not Big Data because we’ve had a whole
group of people focusing all their attention on defining the terms in
a way that I’m not aware any other field has done.
The issue is the science, the technology and the community. The
community has a mindset that says we must define our terms, we
must be precise. Everybody who contributes to the PDB
understands that. If you compare that with some newer fields where
there are huge amounts of data, and you ask somebody in such a field,
“how did you get to that data point?”, how many communities will do
that? Not very many. That is the issue that I see. I actually think
that there is huge value in studying that. How do you get a community to
cooperate? Because we are told we are the gold standard for
databases in biology. It is really built on the backs of people who
had this idea that you should define everything. What happened is
basically one of the persons that I talked to said: “the problem with
you guys is that you solved the problem, so you are not Big Data”.
And then I say: maybe somebody would have to pay attention to
what we did as a community, and take the lessons from that
community and apply it. There is the technology that was applied in
this process and there is the sociological perspective on how to
build this community consensus. How do you get a community to
actually cooperate? That wasn’t easy. There were times when
people were fighting and screaming about all sorts of things, but in
the end I think we succeeded in doing something from which
another community could take the lesson and apply it,
collapsing it down into a year or two instead of 40 years.
Edward Curry: What influences do you think "Big Data" will
have on future of data curation? What are the technological
demands of curation in the "Big Data" context?
Helen Berman: What we focused on was figuring out how to
develop ontologies and data models to describe the data. We spent
a lot of time on that. That is a very laborious process. And the
other way is using natural language processing (NLP) and similar
methods like pattern recognition and similarity functions to try to pull
data out. I think that what really has to happen is to marry those two
things. Don’t just do NLP, but do come up with a way to marry the
sort of laborious dictionary and ontology development with the NLP,
and cross those two things. From my point of view, that is
what has to happen. I hope I am wrong but I don’t think that you can
just throw modern computer science methodologies at this data
without somehow capturing the controlled vocabularies and
ontologies. I see those two communities not working together
enough. It’s a big challenge. If you are going to solve Big Data
problems you have to match the domain expertise with the
computer science expertise.
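One very small way to picture the combination argued for here is text extraction that can only return terms from a curated vocabulary. The sketch below is in Python and makes several assumptions. The vocabulary and its synonyms are hypothetical, and plain pattern matching stands in for a real NLP component. It is meant to show the shape of the idea rather than a usable extractor.

```python
import re
from typing import Dict, List

# Hypothetical controlled vocabulary: canonical term -> free-text synonyms.
VOCABULARY: Dict[str, List[str]] = {
    "X-RAY DIFFRACTION": ["x-ray crystallography", "x-ray diffraction"],
    "NMR": ["nmr", "nuclear magnetic resonance"],
    "ELECTRON MICROSCOPY": ["cryo-em", "electron microscopy"],
}


def find_terms(text: str) -> List[str]:
    """Return the canonical terms whose synonyms occur in the text.

    Matching is case-insensitive and only accepts whole words, so the
    output can never contain a term that the vocabulary does not define.
    """
    lowered = text.lower()
    found: List[str] = []
    for canonical, synonyms in VOCABULARY.items():
        for synonym in synonyms:
            pattern = r"(?<!\w)" + re.escape(synonym) + r"(?!\w)"
            if re.search(pattern, lowered):
                found.append(canonical)
                break  # one hit per canonical term is enough
    return found
```

The design choice worth noting is that the domain experts own `VOCABULARY` while the matching logic can be replaced by something more sophisticated without changing what counts as a valid answer.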
It would be important to bring a sociologist into the analysis of the
problem as well. A sociologist can take a look at some successful
and some unsuccessful cases.
In terms of curation challenges, for example, we’ve got the structure
of a large HIV capsid, a huge structure with 1,300 chains in it, and
we had to figure out how to represent it in a mechanism that wasn’t
used to handling this. Because we had been thinking about this
when we created our dictionary, we set it to have no limit on atoms
or chains. In the end we did it. We are going to have more
structures that are determined by five different methods,
and we are going to have to put it all together and figure out
what the error limits are. A structure that is very small and has a
large data/parameter ratio will be a lot easier to represent accurately
than other structures that are going to be huge macromolecular
machines that we are going to be able to handle. That’s where our
biggest challenge is, and that’s why it’s fun. That’s modern science.
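The remark about setting no limit on atoms or chains can be pictured with a data model whose containers simply grow. The following Python sketch is illustrative only and is not the PDB's actual data model. The class names, the identifier scheme and the atoms-per-chain figure are made up, and only the chain count of 1,300 comes from the interview.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chain:
    chain_id: str    # free-length string, so identifiers never run out
    atom_count: int


@dataclass
class Structure:
    name: str
    chains: List[Chain] = field(default_factory=list)  # no preset maximum

    def total_atoms(self) -> int:
        return sum(chain.atom_count for chain in self.chains)


def build_demo_assembly(chain_total: int, atoms_per_chain: int) -> Structure:
    """Build a made-up assembly with sequentially numbered chain identifiers."""
    chains = [Chain(f"C{i}", atoms_per_chain) for i in range(1, chain_total + 1)]
    return Structure("demo assembly", chains)
```

For example, `build_demo_assembly(1300, 10)` produces a structure with 1,300 chains without any special handling, because nothing in the model assumes an upper bound.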
Edward Curry: What data curation technologies will cope with
"Big Data"?
Helen Berman: As I said before, we need to bridge the two communities
better (natural language processing and ontology engineering). We
also need to be able to handle large volumes of data and be able to
process that. Being able to decide how much data and
which data we need to collect is very important. I think you need to
think ahead every time: what am I going to get if I collect this data?
We don’t do formal crowdsourcing. You want to make sure that
what you are getting is the best possible curation. The way we do
crowdsourcing is by having these task forces: whenever we have a
problem to solve, we bring in experts and we sit down and talk for a
couple of days and we say, here are the questions we have, how do
we deal with them? So we bring in people in that way, but we don’t
actually bring in people to process the data. Even with a very well
trained annotator with good tools, it requires a lot of surveillance.
About the BIG Project
The BIG project aims to create a collaborative platform to address the
challenges and discuss the opportunities offered by technologies that provide
treatment of large volumes of data (Big Data) and its impact in the new
economy. BIG’s contributions will be crucial for the industry, the
scientific community, policy makers and the general public, since the
management of large amounts of data plays an increasingly important role in
society and in the current economy.
CONTACT
http://big-project.eu
Project co-funded by the European Commission within the 7th Framework Programme (Grant Agreement No. 318062)
Collaborative Project
in Information and Communication Technologies
2012 — 2014
General contact
Project Coordinator
Jose Maria Cavanillas de San Segundo
Research & Innovation Director
Atos Spain Corporation
Albarracín 25
28037 Madrid, Spain
Phone: +34 912148609
Fax: +34 917543252
Email: [email protected]
Strategic Director
Prof. Wolfgang Wahlster
CEO and Scientific Director Deutsches Forschungszentrum für Künstliche Intelligenz
66123 Saarbrücken, Germany
Phone: +49 681 85775 5252 or 5251
Fax: +49 681 85775 5383
Email: [email protected]
Data Curation Working Group
Dr. Edward Curry
Digital Enterprise Research Institute
National University of Ireland, Galway
Email: [email protected]