Chapter 5 Paper Outline
8/2/2019 Chapter 5 Paper Outline
Outline
Introduction
I came across the top 10 largest databases in the world.
I'm not surprised that the top two are from our own government.
What did surprise me was that Google was #7, despite the wealth of information it
has.
Overall, the size of all of these databases is astounding.
#1: The Library of Congress
They have 130 million documents altogether.
The text data, if digitized, would total approximately 20 TB.
They have 5 million digital documents.
10,000 items are being added to the database each day.
I did a search on Vietnam, and came across the 10,000-item limit.
The newest document I came across in my search was an article from 1991.
I was given 5 minutes for a search session before I had to renew it.
#2: The CIA
The overall size of the database is unknown, due to the number of classified files
that it contains.
However, portions of it are available to the public, such as The World Factbook
and the Freedom of Information Act Electronic Reading Room.
The Electronic Reading Room makes some (potentially sensitive) government
documents available to the public.
I did a search on Africa, and was able to come up with 98 items (available in both
GIF and PDF formats).
The database contains statistics on more than 250 countries and entities.
#3: Amazon.com
Database contains 42 TB of data.
Maintains extensive records on its customers.
The database gathers and keeps massive amounts of intimate information
about its millions of shoppers, including their religion, sexual orientation, ethnicity
and income.
Combines information disclosed voluntarily by customers with facts gleaned from
public databases.
This gives Amazon more detailed information about its customers than any other
retailer.
#4: YouTube
In 2006, it was projected to have 45 TB of data.
Database is open for people who want to access it, which I find kind of astonishing.
You must request special developer and client keys before accessing the Data API.
Estimating the size of YouTube's database is particularly difficult due to the
varying sizes and lengths of each video.
Geared toward developers with experience programming in server-side languages, the
Data API includes pre-built client libraries that simplify the development task.
#5: ChoicePoint
ChoicePoint's database of 17 billion public records is used for background checks,
insurance applications and tenant screening.
Contains information on 250 million people.
Database contains 250 TB of personal data.
Data is mostly being sold to the highest bidders, which include our government.
Much of the company's business is governed by the Fair Credit Reporting
Act.
#6: Sprint
Has 53 million subscribers.
Database holds 2.85 trillion data insertions (the most in the world).
365 million call detail records processed per day.
Phone information has been leaked out of it, though.
Large telecommunication companies like Sprint are notorious for having immense
databases to keep track of all of the calls taking place on their network.
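To get a feel for the daily figure above, here is a minimal sketch that converts call detail records per day into an average per-second rate. The function name is my own, and the result assumes the load is spread evenly over 24 hours, which is almost certainly not true of real call traffic.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day

def records_per_second(records_per_day):
    """Average rate, assuming records arrive evenly across the day."""
    return records_per_day / SECONDS_PER_DAY

# Figure from the outline: 365 million call detail records per day.
rate = records_per_second(365_000_000)
print(f"about {rate:,.0f} records per second")
```

Under that even-load assumption, the database would be absorbing a few thousand records every second.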
#7: Google
Google's database contains all of the words that are used in search terms.
A crawler visits a page, copies the content and follows the links from that page to
the pages linked to it, repeating this process over and over until it has crawled
billions of pages on the web.
Like the CIA's database, the size of Google's database is unknown (due to it being
locked in a vault).
Google searches account for more than 50% of all internet searches.
Database contains virtual profiles of a countless number of users.
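The crawling loop described above (visit a page, copy its content, follow its links, repeat) can be sketched roughly as follows. This is only an illustration, not Google's actual crawler: the tiny in-memory "web", the `crawl` function and the `max_pages` cap are all made up for the example, and no real network requests are made.

```python
from collections import deque

# Hypothetical toy "web": page name -> (content, outgoing links).
TOY_WEB = {
    "a": ("page a text", ["b", "c"]),
    "b": ("page b text", ["c", "d"]),
    "c": ("page c text", ["a"]),
    "d": ("page d text", []),
}

def crawl(start, web, max_pages=1000):
    """Visit pages breadth-first, copying content and following links."""
    index = {}              # copied content, keyed by page
    queue = deque([start])  # pages waiting to be visited
    seen = {start}          # pages already queued, so loops are not re-crawled
    while queue and len(index) < max_pages:
        page = queue.popleft()
        if page not in web:  # dead link: nothing to copy
            continue
        content, links = web[page]
        index[page] = content
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

print(list(crawl("a", TOY_WEB)))  # order in which pages were visited
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as "c" does to "a" here.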
#8: AT&T
Database contains 323 TB of data.
Database has 1.9 trillion phone call records.
AT&T is so meticulous with their records that they've maintained calling data from
decades ago -- long before the technology to store hundreds of terabytes of data ever
became available.
#9: NERSC
The NERSC database encompasses 2.8 PB of information and is operated by more
than 2,000 computational scientists.
The database is privy to a host of information including atomic energy research,
high energy physics experiments, simulations of the early universe and more.
What distinguishes the center is its success in creating an environment that makes
these resources effective for scientific research.
#10: The World Data Centre for Climate
Largest database in the world.
220 TB of web data.
110 TB of climate simulation data.
6 PB of additional data on magnetic tape.
Database is used on a computer that cost 35 million euros.
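The three size figures above use different units, so here is a small sketch that adds them up in terabytes. It assumes decimal units (1 PB = 1,000 TB), which the outline does not specify; with binary units the total would come out somewhat larger. The function name is my own.

```python
TB_PER_PB = 1000  # assumption: decimal units, 1 PB = 1,000 TB

def total_tb(web_tb, simulation_tb, tape_pb):
    """Combine the web, simulation and tape figures into one TB total."""
    return web_tb + simulation_tb + tape_pb * TB_PER_PB

# Figures from the outline: 220 TB web, 110 TB simulation, 6 PB tape.
wdcc_tb = total_tb(220, 110, 6)
print(wdcc_tb, "TB in total")
```

Under that assumption the total is 6,330 TB, with the tape archive making up nearly all of it.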
Conclusion