Archiving Web Content CE Course #285 Sheraton NY Hotel & Towers Sunday, June 8, 2003.
Archiving Web Content
CE Course #285
Sheraton NY Hotel & Towers
Sunday, June 8, 2003
Introductions
Meet today’s panelists
Today’s Panelists
Barry Abisch - The Journal News
Olivia Kobelt - Christian Science Monitor
Mark Stencel - Washingtonpost.com
Janine Yagielski - CNN.com
Agenda
Introductions & session overview
Technology
Workflow and processes
Brainstorming
Break
Brainstorming recap
Role of the librarian
Building the business case
Closing comments & session evaluation
Technology
Panelist: Janine Yagielski
Technology Overview
What can be archived?
Preparing content to be archived
Storing and serving archived content
Searching archived content
What can be archived?
Overview of file formats (handout)
Dynamic and static content
Archiving presentation as well as content
Archiving secondary information about online content (traffic information)
Challenges of changing technologies
Overview of File Formats (handout)
Text formats
Image/graphic file formats
Video formats
Other definitions
Static and Dynamic Content
Static Content: Content that, once posted, does not change.
Example: Simple story or information page
Dynamic Content: Constantly changing content
Example 1: Weather data, Stock Prices
Example 2: Election Results, Sports Scores (fixed end point)
Hybrid: Changes occasionally but does not have a predictable updating schedule or end point
Example: Top story with multiple and significant updates
Example: Home page or section page
Archiving Presentation and Content
CNN.com has built an internal system that archives the presentation of selected pages.
Pages archived: Home Page, US, World, Politics, International Edition
Retention: One week of pages
Frequency: Every 30 minutes
Method: Perl script
Size of archive: 55.4 MB
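The schedule on this slide (a snapshot every 30 minutes, kept for one week) can be sketched in a few lines. This is an illustration in Python, not the Perl script CNN.com used. The page short names, the file-name pattern and all function names are assumptions made for the example.

```python
from datetime import datetime, timedelta

SNAPSHOT_INTERVAL = timedelta(minutes=30)  # from the slide: every 30 minutes
RETENTION = timedelta(weeks=1)             # from the slide: one week of pages
# Hypothetical short names for the five archived pages.
PAGES = ["home", "us", "world", "politics", "edition"]


def snapshot_name(page: str, taken_at: datetime) -> str:
    """Build a sortable file name such as 'home-20030608-1430.html'."""
    return f"{page}-{taken_at:%Y%m%d-%H%M}.html"


def expired(taken_at: datetime, now: datetime) -> bool:
    """True when a snapshot has aged out of the one-week window."""
    return now - taken_at > RETENTION


def snapshots_per_page() -> int:
    """Number of snapshots of one page held by the rolling archive."""
    return RETENTION // SNAPSHOT_INTERVAL
```

With these settings the archive holds 336 snapshots per page. If the 55.4 MB figure covers all five pages, that is 1,680 snapshots at roughly 33 KB each.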
Archiving Secondary Information about Online Content (traffic)
CNN.com has an extensive Webstats reporting system that parses and archives information from Web server logs.
Simple statistics: Page Views, hits (back to 1996)
Advanced statistics: Unique users, time spent, IP address, OS, browser
Real Time Monitor: Tracks click-through rates of links on the Home and US pages; keeps one week of information on links; tracks average and peak for each link
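As a rough illustration of how simple statistics can be derived from Web server logs, the sketch below counts hits and page views. It assumes the logs are in Common Log Format and that a "page view" is a request for a path ending in "/", ".html" or ".shtml". The slides specify neither, so both are assumptions, as are the function name and the sample lines.

```python
import re
from collections import Counter

# Assumption: Common Log Format; the slides do not name the log format.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical definition of what counts as a page rather than an asset.
PAGE_SUFFIXES = ("/", ".html", ".shtml")


def summarize(lines):
    """Count hits (every parsed request) and page views (requests for pages)."""
    stats = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip malformed lines
        stats["hits"] += 1
        if match["path"].endswith(PAGE_SUFFIXES):
            stats["page_views"] += 1
    return stats


# Made-up sample lines for illustration only.
sample = [
    '10.0.0.1 - - [08/Jun/2003:10:00:00 -0400] "GET /US/ HTTP/1.0" 200 5120',
    '10.0.0.1 - - [08/Jun/2003:10:00:01 -0400] "GET /images/logo.gif HTTP/1.0" 200 900',
    'not a log line',
]
print(summarize(sample))
```

Advanced statistics such as unique users need more than this, for example grouping by IP address or cookie, which the sketch leaves out.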
Challenges of Changing Technology
Interdependencies of the Web make it difficult to maintain old content when optimizing for new content.
Examples: .shtml pages, Vivo video, some Shockwave, other antiquated multimedia technology based on plug-ins
Preparing Content to be Archived
Directory Structure/Database: Key to consistency and automation in subject-specific archives.
Example: cnn.com/2003/WORLD/meast/06/02/sprj.nitop.political.council/
Slug conventions: Provide an additional method of automating archiving.
Examples: sprj; sprj.nitop; .ap
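A consistent directory structure is what makes automation possible: the example URL above can be split into date, section, subsection and slug without looking at the story itself. The sketch below assumes the layout is always year/SECTION/subsection/month/day/slug, which is inferred from the single example on the slide. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class StoryPath:
    published: date
    section: str
    subsection: str
    slug: str


def parse_story_path(url: str) -> StoryPath:
    """Split a story URL into its archive fields (layout inferred from one example)."""
    # Drop the scheme if present and any trailing slash, leaving host + path segments.
    path = url.split("://", 1)[-1].strip("/")
    _host, year, section, subsection, month, day, slug = path.split("/")
    return StoryPath(date(int(year), int(month), int(day)), section, subsection, slug)


def slug_prefixes(slug: str) -> list:
    """Return every dotted prefix of a slug, shortest first."""
    parts = slug.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


story = parse_story_path("cnn.com/2003/WORLD/meast/06/02/sprj.nitop.political.council/")
print(story.published, story.section, story.subsection)
print(slug_prefixes(story.slug))
```

An archiving job could then collect every story whose slug has the prefix sprj.nitop. A suffix convention such as .ap could be checked the same way with `slug.endswith(".ap")`. The slides do not say what these slugs mean, so treating them as grouping keys is an assumption.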
Content Management System: Imposes and uses directory structure to prepare content for publication, syndication and, in some cases, archiving and searching.
Metadata in stories on publish:
<meta name="DESCRIPTION" content="A U.S. soldier was killed and five were wounded early Thursday in the Iraqi city of Fallujah, the U.S. Central Command announced -- the latest casualties in the city, which has become a center of resistance.">
<meta name="AUTHOR" content="">
<meta name="SECTION" content="WORLD">
<meta name="SUBSECTION" content="meast">
<meta name="DATE" content="2003-06-05 05:22:20">
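Metadata embedded this way can be read back by an archiving or search tool. The following is a minimal sketch using Python's standard `html.parser` module. The class name `MetaCollector` and the shortened sample page are made up for the example, and the tag names are the ones shown on the slide.

```python
from datetime import datetime
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a story page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = dict(attrs)
        name = fields.get("name")
        if name:
            # Normalize the key; treat a missing content attribute as empty.
            self.meta[name.upper()] = fields.get("content") or ""


# Shortened sample built from the tags shown on the slide.
sample_page = """
<html><head>
<meta name="AUTHOR" content="">
<meta name="SECTION" content="WORLD">
<meta name="SUBSECTION" content="meast">
<meta name="DATE" content="2003-06-05 05:22:20">
</head><body></body></html>
"""

collector = MetaCollector()
collector.feed(sample_page)
published = datetime.strptime(collector.meta["DATE"], "%Y-%m-%d %H:%M:%S")
print(collector.meta["SECTION"], collector.meta["SUBSECTION"], published)
```

Note that AUTHOR is empty in the slide's example, so a consumer of this metadata has to tolerate blank fields.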
XML (Extensible Markup Language): CNN.com produces an XML file with every story for site search. We also produce XML feeds of story headlines and other data sent to syndication partners.
Metadata and XML for Multimedia: CNN.com is looking into ways to insert metadata and produce XML feeds of non-traditional stories. Currently there is only an internal, manual process for archiving the location and subject of interactive (pop-up) content.
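To make the idea of one XML file per story concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The element names (`story`, `headline`, `url`, `section`, `published`) and the placeholder headline are invented for illustration. The slides do not show CNN.com's actual schema.

```python
import xml.etree.ElementTree as ET


def story_to_xml(headline: str, url: str, section: str, published: str) -> str:
    """Serialize one story's search fields as an XML string (hypothetical schema)."""
    story = ET.Element("story")
    ET.SubElement(story, "headline").text = headline
    ET.SubElement(story, "url").text = url
    ET.SubElement(story, "section").text = section
    ET.SubElement(story, "published").text = published
    return ET.tostring(story, encoding="unicode")


xml_text = story_to_xml(
    headline="Example headline",  # placeholder, not a real CNN.com headline
    url="cnn.com/2003/WORLD/meast/06/02/sprj.nitop.political.council/",
    section="WORLD",
    published="2003-06-05 05:22:20",
)
print(xml_text)

# A search indexer on the receiving side can read the fields back:
parsed = ET.fromstring(xml_text)
print(parsed.findtext("section"))
```

Because the same fields already exist as page metadata and in the directory structure, a file like this can be generated at publish time without extra editorial work.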
Storing and Serving Archived Content
Simple storage of content: Content servers; burn to CD; Web servers (internal and external, even if not served); tape backup
Serving to internal users: Image query; directory browsing on the inside Web servers; content purged from the outside servers remains available (AP, partner stories); limited space on internal Web server (36 GB)
Serving to All External Users
All unique URLs published on CNN.com from the launch of the site are still available, unless there was an editorial decision to remove or redirect a URL.
CNN video is hosted by AOL. Because of changes in hosting and in the capacity of video servers, not all previous video streams are available.
Web servers/NFS Server
Hardware: Sun and Intel (running Linux)
Cost: $10,000-$15,000 (Sun), $5,000 (Intel)
Capacity: Storage capacity expanded by adding additional hard drives. Serving capacity varies by content. HTML -- 25K hits/minute; images, style sheets -- 60-70K hits/minute
Video Servers
Hardware: Reconfigured, video-dedicated Web server
Cost: $1,500-$3,000
Capacity: Depends on length and size of video and disk space
Serving to Select Users
Registration: E-mail newsletters; new e-mail alerts; backend Oracle database; JSPs dynamically served
Subscription Video: RealNetworks handles CNN.com’s subscriber authentication
Searching Archived Content
Searching for internal users
Limited functionality for internal materials. Graphics image search. New publishing tools will incorporate a search of externally published content.
Searching for external users
Site Search: Run by AOL. The CMS produces and publishes XML files (restricted by IP) for every story. At set intervals, AOL picks up the XML files and uses them to produce CNN.com’s internal search results.
Web Search: Powered by Google. Sponsored links from Overture. Both sets of results are returned to CNN.com in XML feeds and published on a CNN.com template.
Video/multimedia search: Exploring
Workflow
Panelist: Olivia Kobelt
Workflow Overview
Types of web content – what do we archive?
Archiving old content
Internal vs. external archive
Making corrections/fixes
Searchability
Current workflow
Systems we use
Future vision
Brainstorming!
Break!
Be back in 15 minutes!
Brainstorming Recap!
Legal compliance vs. business user or need
Copyright – can you archive someone else’s content, partner content?
Talking to IT about what the requirements are
How do you approach gathering user requirements?
Who are users?
What are retention criteria? (date, size of files, originals/drafts/versioning, exclude search, business value)
Hierarchy starting at bottom with knowledge, corporate, business use/reuse, compliance, vital records
How to capture and keep the hybrid web pages?
What software applications are available?
Microfilm archiving?
What tools are available to automate the archival process?
Where do we begin? Seeking advice in relation to storage, retrieval, technology, etc.
What type of information/literature is available on the topic of archiving web data?
Selling the idea to management
Archiving “how it looked”
How did we do it? Examples of how a project was done.
Measure what people are trying to find in older files
Managing the customer service side of it
Role of the Librarian
Panelist: Barry Abisch
Librarian Role Overview
You are the expert.
What do you need?
What do readers need?
A news Web site has as much in common with a library as it does with a newspaper.
Become familiar with your newspaper’s Web site.
If it is politically correct, insist that you be consulted on all matters relating to both archiving and searching.
If you can’t insist, at least offer your services. Odds are, your online editor will welcome the offer.
Building A Business Case
Panelist: Mark Stencel
Business Case Overview
What’s worth saving
Making money
Indirect revenue
Costs and challenges
Getting credit
Does It Pay To Save?
Key points:
Your news organization can profit from its archive of original online content
Making money isn’t always profitable (your business case should account for the cost of doing business, not just revenue)
Original Content
Breaking news stories
Standing text (FAQs, online guides and primers)
Video/Audio
Photo Galleries
E-mail Newsletters
Interactive Discussions/Chats
Databases (listings, scores)
Making Money
Sponsorships (e.g., local visitor guides)
Resale (paid archives; research services, such as LexisNexis, Factiva; online reprint rights)
Note: Few good models for selling non-text content (video, audio, galleries)
Indirect Revenue
Promotion (can archived content attract more online users or even print or online subscribers?)
Registration (will users provide valuable e-mail addresses or other personal information in exchange for access to content?)
Costs and Challenges
Do systems, process, equipment or personnel cost more than you can make?
Rights Management (which content do you have legal rights to use, re-use, or re-sell online)
Content Management (publishing systems and file/directory management for keeping track of where your content is)
Costs and Challenges (cont’d.)
Fulfillment and Customer Service (supporting services you provide to the public or to partners)
Revenue Shares (accounting for your partner’s shares)
Coordinating With Parents or Siblings (do your plans fit in or conflict with the overall business goals/strategies of your chain?)
Costs and Challenges (cont’d.)
Hosting (server space, streaming)
Un-hosting (time and effort to delete or de-link content; automatically deleting content vs. selectively maintaining content)
Get Credit!
Make sure your department gets credit for any revenue it generates, not just the bill for the cost of providing money-making content and services.
Questions & Answers
Closing remarks
Please complete an evaluation form.
Suggested Resources
“The Archival Black Hole” by Scott Kirsner, Editor & Publisher, 9/19/98
“Archiving the Internet” by Brewster Kahle, Scientific American, 11/4/96
“Nothing But Net, Preserving the Internet, 1 Terabyte at a Time” by Bill Barnes, Slate.msn.com
“‘It Was Here a Minute Ago!’: Archiving the Net” by Susan E. Feldman, Searcher: The Magazine for Database Professionals
“SCC systems archiving billions of bytes at newspapers,” Newspapers & Technology, March 2000
http://www.archive.org