Gateway to Oklahoma History Case Study: Structured Data and Metadata Evaluation for Improved Image...


Gateway to Oklahoma History Case Study: Structured Data and Metadata Evaluation for Improved Image Findability on the Web

Figure 4: An Individual Record from the Gateway to Oklahoma History Website

Emily Kolvitz, University of Oklahoma, School of Library and Information Studies & Bynder LLC

RESEARCH METHODS

INTRODUCTION

Figure 3: Detailed Information on the types, quality and quantity of metadata from Gateway to Oklahoma History Image Resources

Image Source: http://gateway.okhistory.org/ark:/67531/metadc317314/

Image Source: http://gateway.okhistory.org/ark:/67531/metadc190366/

Image Source: http://gateway.okhistory.org/ark:/67531/metadc316961/

Image Source: http://gateway.okhistory.org/ark:/67531/metadc316903

Figure 2: Screen capture of Embedded Metadata

Figure 1: Structured Data Linter Tool Results

CONCLUSION

© Copyright Emily Kolvitz, Product Consultant at Bynder, LLC; MLIS from The University of Oklahoma

Emily Kolvitz, [email protected], +1 405 471 2570, https://www.theinformationprofessional.com

https://www.getbynder.com

FURTHER INFORMATION

ACKNOWLEDGEMENTS

I would like to thank the University of Oklahoma, which offered resources to make this project possible, and the Oklahoma History Center for allowing me to work on such a wonderful digitization project for the Gateway to Oklahoma History during my studies.

RESULTS

Image resource findability on the World Wide Web is still very much a land grab. For the Semantic Web to become a reality, online businesses and individuals have to get their hands dirty and face the fact that search engine giants are increasingly the go-to tool for information resource retrieval. “Increasingly, students use Web search engines such as Google to locate information resources rather than seek out library online catalogs or databases of scholarly journal articles” (Lippincott 2005). This puts the search engine giants in a unique position to dictate how the future of search will work on the Web and, therefore, your organization’s future presence (or lack thereof) on the Web.

Image search and retrieval is more difficult than text search and retrieval because access to image content depends largely on the context presented in and around the image resource. Schema.org and structured data did not gain traction on the Web until the big four search engines (Google, Bing, Yahoo, and Yandex) agreed that a standard was needed to pave the way forward. “On-page markup helps search engines understand the information on web pages and provide richer search results. A shared markup vocabulary makes it easier for webmasters to decide on a markup schema and get the maximum benefit for their efforts” (Schema.org 2014). Structured data is only one piece of the findability algorithm. Metadata near content, embedded within content, or listed in the alt text of an HTML document also tells machines something about the content of the record.

There are no guardians of the Web ensuring that structured data is applied to all records with equal attention and care, and there is no mandated standard requiring records on the Web to provide context for image resource findability. Most search engines do not crawl embedded XMP data or the invisible Web, leaving text near images, file names, and alt text in the HTML markup as the only context for image resources. Search algorithms for image retrieval change frequently (Kritzinger 2013), and social media sites and other organizations strip embedded data from images (Embedded Metadata Manifesto 2014). Embedded metadata provides context and provenance for image resources.
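To make the point about on-page context concrete, here is a minimal Python sketch that lists the two signals a crawler can read directly from an image tag: the file name and the alt text. It is an illustration only, not part of the original study; the class name `ImageContextParser`, the function `image_context`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urlparse


class ImageContextParser(HTMLParser):
    """Collect the file name and alt text of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        self.images.append({
            # Last path segment of the image URL, e.g. "0001234.jpg"
            "filename": PurePosixPath(urlparse(src).path).name,
            "alt": (attributes.get("alt") or "").strip(),
        })


def image_context(html_text):
    """Return one dict per <img> tag with its file name and alt text."""
    parser = ImageContextParser()
    parser.feed(html_text)
    parser.close()
    return parser.images


# Hypothetical markup, not copied from the Gateway to Oklahoma History site.
SAMPLE_HTML = """
<p>Main Street after the 1936 flood.</p>
<img src="/images/0001234.jpg" alt="">
<img src="/images/main-street-flood-1936.jpg" alt="Main Street under water, 1936">
"""

for image in image_context(SAMPLE_HTML):
    print(image)
```

In this invented sample, the first image offers a search engine almost nothing to work with, while the second carries a descriptive file name and alt text.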

Even with the dramatic adoption of structured data markup using Schema.org vocabularies, metadata opportunities remain on the Web. Reicks (2010) recommends embedded metadata as a strategy for online findability, showcasing applications such as PhotoShelter and LicenseStream that parse embedded data into structured data around images on the Web.

The one staple in the equation of Web findability is good-quality metadata under the hood of the website. In this case study, a methodology is applied to the Gateway to Oklahoma History’s website. The study can be generalized to organizations looking to benchmark their own findability maturity on the Web from an image-centric viewpoint.

The following research question informed this project:

What are the types and quality of structured data, XMP, and metadata records available for image resources appearing on the website?

Using the Structured Data Linter tool and Phil Harvey’s ExifTool, information was gathered to answer this research question. Image records on the Gateway to Oklahoma History’s website were investigated for the types, quality, and quantity of embedded metadata and structured data.
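One way such an ExifTool inventory could be tallied is sketched below in Python. The sketch assumes the tool was run with its `-json` and `-G` options, so that each tag key in the output has the form `Group:Tag`. The function names and the sample output are hypothetical and do not reproduce the data gathered in this study.

```python
import json
from collections import Counter


def count_tags_by_group(exiftool_json):
    """Count tags per metadata group for each file in ExifTool JSON output."""
    summary = {}
    for record in json.loads(exiftool_json):
        groups = Counter()
        for key in record:
            if ":" in key:  # keys look like "IPTC:Keywords" when -G is used
                groups[key.split(":", 1)[0]] += 1
        summary[record.get("SourceFile", "")] = dict(groups)
    return summary


def files_missing_groups(summary, wanted=("IPTC", "XMP")):
    """Return the files that carry none of the wanted metadata groups."""
    return [
        name for name, groups in summary.items()
        if not any(group in groups for group in wanted)
    ]


# Invented sample shaped like `exiftool -json -G original.tif thumbnail.jpg`.
SAMPLE_OUTPUT = """[
  {"SourceFile": "original.tif", "File:FileType": "TIFF",
   "IPTC:Keywords": ["Oklahoma", "history"],
   "IPTC:Caption-Abstract": "Sample caption",
   "XMP:Title": "Sample title"},
  {"SourceFile": "thumbnail.jpg", "File:FileType": "JPEG"}
]"""

summary = count_tags_by_group(SAMPLE_OUTPUT)
print(summary)
print(files_missing_groups(summary))
```

Counting by group keeps the comparison between an original file and its derivatives to a single glance at the summary.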

The Gateway to Oklahoma History’s website has a wealth of structured data and metadata pertaining to its image resources. Search queries built from structured data markup tags and/or embedded metadata yielded relevant and accurate results in a normal web search, but did not yield relevant or accurate image resources in an image search. Descriptive filenames, an important factor in image retrieval through web search engines, were not used for image resources.
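As an illustration of what a descriptive filename could look like, the Python sketch below derives one from a record title and keeps the record identifier as a suffix so that names stay unique. The function `descriptive_filename`, its parameters, and the sample title and identifier are hypothetical. This is one possible convention, not a scheme used by the Gateway.

```python
import re
import unicodedata


def descriptive_filename(title, identifier, extension="jpg", max_words=8):
    """Build a lowercase, hyphen-separated file name from a record title."""
    # Strip accents so the name stays plain ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Keep only letter and digit runs, limited to the first few words.
    words = re.findall(r"[a-z0-9]+", ascii_title.lower())[:max_words]
    return "-".join(words + [identifier]) + "." + extension


# Hypothetical title and identifier for illustration.
print(descriptive_filename("Main Street, Guthrie (1936)", "item0001"))
```

Keeping the identifier in the name preserves the link back to the catalog record while giving search engines readable words to index.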

Adding Schema.org tags to the on-page markup to accompany the structured data already present is another area for improvement. An interesting finding was that embedded metadata was found only on the largest, original version of each image resource and never on the smaller derivative images. The structured data in the on-page markup included Open Graph Protocol and Dublin Core. IPTC was the primary type of embedded metadata present for the image resources.
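For readers who want to check their own pages for the two kinds of on-page structured data named above, the following Python sketch collects Open Graph and Dublin Core `<meta>` tags from an HTML string. It is a simplified stand-in and not the Structured Data Linter used in this study. It assumes the common conventions of `property="og:..."` for Open Graph and `name="DC...."` or `name="DCTERMS...."` for Dublin Core. The names and sample markup are hypothetical.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect Open Graph and Dublin Core <meta> tags from a page."""

    def __init__(self):
        super().__init__()
        self.open_graph = {}
        self.dublin_core = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content") or ""
        prop = (attributes.get("property") or "").lower()
        name = (attributes.get("name") or "").lower()
        if prop.startswith("og:"):
            self.open_graph.setdefault(prop, []).append(content)
        if name.startswith(("dc.", "dcterms.")):
            self.dublin_core.setdefault(name, []).append(content)


def extract_meta(html_text):
    """Return (open_graph, dublin_core) dicts mapping names to value lists."""
    parser = MetaTagParser()
    parser.feed(html_text)
    parser.close()
    return parser.open_graph, parser.dublin_core


# Hypothetical markup, not copied from the Gateway to Oklahoma History site.
SAMPLE_HEAD = """
<head>
<meta property="og:title" content="Sample photograph">
<meta property="og:image" content="http://example.org/sample.jpg">
<meta name="DC.title" content="Sample photograph">
<meta name="DC.subject" content="Oklahoma">
<meta name="DC.subject" content="History">
<meta name="viewport" content="width=device-width">
</head>
"""

open_graph, dublin_core = extract_meta(SAMPLE_HEAD)
print(open_graph)
print(dublin_core)
```

Values are stored as lists because a field such as `DC.subject` is often repeated within one record.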

The results and methodology for this research can help GLAM institutions (Galleries, Libraries, Archives & Museums) by bringing awareness to the state of structured data and image resource findability for cultural heritage institutions on the Web. GLAMs must be active in the SEO space, support machine-readable language in the markup of their sites, and utilize Schema.org vocabularies and descriptive filenames for relevancy in search engine results.
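As a rough sketch of how existing descriptive fields might be expressed with a Schema.org vocabulary, the Python example below builds a JSON-LD description of type `ImageObject`. The field-to-property mapping is an assumption made for illustration and not an official crosswalk. The function name and sample record are hypothetical.

```python
import json


def image_object_jsonld(record):
    """Serialize a simple metadata record as Schema.org ImageObject JSON-LD."""
    # Assumed mapping from generic descriptive fields to Schema.org properties.
    mapping = {
        "title": "name",
        "description": "description",
        "creator": "creator",
        "date": "dateCreated",
        "subject": "keywords",
        "image_url": "contentUrl",
    }
    data = {"@context": "https://schema.org", "@type": "ImageObject"}
    for source_field, schema_property in mapping.items():
        value = record.get(source_field)
        if value:  # skip fields that are missing or empty
            data[schema_property] = value
    return json.dumps(data, indent=2, ensure_ascii=False)


# Hypothetical record for illustration.
SAMPLE_RECORD = {
    "title": "Sample photograph",
    "subject": ["Oklahoma", "History"],
    "date": "1936",
    "image_url": "http://example.org/main-street-flood-1936.jpg",
}

print(image_object_jsonld(SAMPLE_RECORD))
```

On a web page, JSON-LD of this kind is normally placed inside a `<script type="application/ld+json">` element, where it can sit alongside any Open Graph or Dublin Core tags already present.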

The Digital Library Federation, a program of the Council on Library and Information Resources, concludes that “Getting found means repository objects must be included in the indexes of major search engines because most students and faculty now begin their research with Internet search engines. Digital repositories created by libraries will be largely invisible to users if their contents are not indexed in these search engines” (Digital Library Federation 2014).

REFERENCES

Corlosquet, Stephane and Gregg Kellogg. “Structured Data Linter.” Last accessed August 1, 2015. http://linter.structured-data.org/

Digital Library Federation. “SEO for Digital Libraries.” Last accessed October 20, 2014. http://www.diglib.org/community/groups/seo-for-digital-libraries/

Harvey, Phil. “ExifTool by Phil Harvey.” Last accessed August 1, 2015. http://www.sno.phy.queensu.ca/~phil/exiftool/

International Business Times. 0006. “Bing, Google and Yahoo merge to make search easier with schema.org.” International Business Times, April.

IPTC (International Press Telecommunications Council). 2014. “Embedded Metadata Manifesto.” Last accessed November 20, 2014. http://www.embeddedmetadata.org/social-media-test-results.php (cited as Embedded Metadata Manifesto 2014).

Kritzinger, W. T. “Search Engine Optimization and Pay-per-Click Marketing Strategies.” Journal of Organizational Computing and Electronic Commerce, no. 3 (2013): 273-86.

Lippincott, Joan K. “Net Generation Students and Libraries.” EDUCAUSE (2005). Accessed November 19, 2014. http://www.educause.edu/research-and-publications/books/educating-net-generation/net-generation-students-and-libraries

Reicks, David. 2010. “Why Embedded Metadata Won’t Help Your SEO.” Last updated December 30, 2013. Last accessed November 23, 2014. http://www.controlledvocabulary.com/blog/embedded-metadata-wont-help-seo.html

Schema.org. 2015. “About Schema.org.” Last updated unknown. https://schema.org/docs/faq.html