iDesign - Recommender System (white paper)

Kevin Champion of Team Serendipity 1

Recommender System Application

University Library Materials

This report explores the design of a system to make recommendations of university library materials. These materials will consist mostly of books, but the idea is flexible enough to also handle journal articles, videos, and perhaps even web content. The primary notion is that the library is ordinarily thought of as a repository of materials. This, however, is really just its first-order service. A second-order service that a university library is poised to provide is embedded in the expertise of the university's librarians and faculty and their familiarity with the materials. With this in mind, a recommender system could be developed to utilize this rich set of knowledge to curate subsets of the overall library collections, which could then be used to make recommendations to users. A large number of these subsets from across the university could be interconnected and used to surface new content to users, enhance their experience, and break down artificial barriers created by different subject areas.

Domain Characteristics

The domain of university library resources is an interesting area to examine because the traditional focus of libraries is to provide access to the widest and most all-encompassing set of materials. However, this is far from their only function. Indeed, a valid argument can be made that simply purchasing materials and making them available is only one part of the process of making these materials truly accessible to users. A more holistic perspective on providing access requires that additional effort be placed on how the user is able to discover and interact with these materials. In this vein, accessibility is a much larger concept. For instance, just because a user is able to retrieve a particular book from the libraries' collections without being charged a fee does not ensure that the user will actually be able to discover that this book exists, know that the university libraries own it, and be able to find and retrieve the actual physical book. Over time, libraries have put considerable work into providing catalogs that are easy to search, along with finding aids that help users check whether their library system contains the resources they are interested in (e.g., Get It). Nonetheless, in almost all instances, users are still required to have a very good idea of what they are seeking before interfacing with the libraries' tools. This is to say that libraries have yet to discover meaningful and truly effective ways of helping users discover materials when they do not know what they need; in effect, there are still few tools that help users browse materials as they might the physical shelves of a bookstore.

Browsing is an important function that libraries and library users would benefit from in an online environment. Browsing enables serendipitous discoveries whereby users are able to find resources that they did not know they needed. In essence, browsing provides the antithetical experience to searching; users are not required to have a good idea of what they are looking for before engaging in the browsing experience. Engineering this sort of serendipitous discovery relies upon providing a highly relevant experience to users in much the same way that search depends upon relevance. One way that libraries have traditionally attempted to tackle the issue of relevance is through classification via the Library of Congress system, among others, and the use of metadata about the materials. A sort of pseudo-browsing experience is often enabled in library search systems whereby users are able to engage in faceted browsing once an initial search has been made. This functions much like many popular and highly usable online shopping sites whereby results can be narrowed and expanded by adding or removing categories or other metadata descriptors. While this is a proven method for interfacing with search results listings, it does not address the scenario in which the user does not know what to search for initially, and it does not utilize the expertise embedded in the collective “mindspace” of the university's librarians and faculty.

One method that deserves exploration for creating a truer browsing experience and utilizing the expertise in the university is the use of a recommendation framework to generate highly relevant resources for users. One activity that both professors and librarians are already doing is creating lists of resources tailored to specific classes, topic areas, or disciplines. Professors do this every time they develop a reading list for a particular course, whereas librarians do this when working with specific courses to help students with their research, when creating subject guides, and when curating targeted physical collections. Each of these lists is carefully curated by professionals who are experts in their domain. As such, an item that appears in one of these lists is distinguished from the items in the libraries' collections that appear in none of them. Furthermore, an item that shows up in more than one list is distinguished from an item that occurs in only one list. For this reason, an occurrence of an item in one of these lists can be thought of as a weighted vote for that item. Using this frame of thought, a recommendation system can be developed to take these weighted votes into account. The resulting recommendations can be presented to users in an interface tailored to browsing and focused on connecting highly relevant materials to each other.
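
The weighted-vote idea can be sketched in a few lines of Python. The list names, item identifiers, and per-list weights below are hypothetical; the paper does not say how weights would be assigned, so this sketch simply assumes one weight per list.

```python
from collections import defaultdict

# Hypothetical curated lists: name -> (weight, items).
# The weights are an assumption made only for illustration.
curated_lists = {
    "HIST 201 reading list": (1.0, ["book_a", "book_b", "book_c"]),
    "Urban history subject guide": (1.5, ["book_b", "book_d"]),
    "SOC 310 reading list": (1.0, ["book_b", "book_c"]),
}


def tally_votes(lists):
    """Sum the weight of every list in which an item appears."""
    votes = defaultdict(float)
    for weight, items in lists.values():
        for item in set(items):  # an item counts once per list
            votes[item] += weight
    return dict(votes)


votes = tally_votes(curated_lists)
# Highest vote total first; ties broken alphabetically.
ranked = sorted(votes.items(), key=lambda pair: (-pair[1], pair[0]))
```

Items that appear in no list never enter the tally, which mirrors the distinction drawn above between listed and unlisted items.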

Academic materials in the form of books and research papers have specific characteristics that make working with them different from working with popular books, commercial products, or more ordinary recommender domains such as movies or restaurants. The most salient of these differences is the sheer quantity of academic resources available. When attempting to find materials on a broad topic, users will most often run into problems of scale, in which the resources that exist in a particular domain outnumber what the user needs or can process by multiple orders of magnitude. In addition, while almost all of these materials have metadata, the metadata is often less useful than it is in more popular and commercial areas.

The ineffectiveness of metadata has a number of causes, but one of them is the highly specialized language used to describe resources in different subject areas. Over time, academia has developed an ever broadening set of fields and disciplines as scientific methodology has necessitated ever increasing levels of specialization. This creates a branching effect, which allows for specialization but can also tend to separate disciplines from each other. Since each discipline develops its own language to describe itself, highly related disciplines that end up on different branches sometimes lose connection to each other. In reality, however, a more appropriate metaphor is a web: even though academia has created branches that effectively put fields into their own silos, these fields often share many characteristics and ideas. One way of connecting these branches once more is to build a web by mapping resources. However, instead of using metadata to do this, there is an opportunity to use co-occurrence in lists to draw connections between distinct resources. This idea is not dissimilar to the groundbreaking “PageRank” algorithm developed by Google, which weights webpages by the links they receive from other webpages. In fact, many of the characteristics specific to academic resources are also found in webpages (scale, ineffective metadata). As a result, if we think of Google's results as an explicit form of recommendation engine, it is not a stretch to see the applicability of a recommender system to this idea of linking together academic resources.
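
As a rough illustration of drawing connections from co-occurrence rather than metadata, the following Python sketch counts how often two resources share a list. The list names and titles are hypothetical, and simple pair counting is only one possible way to build such a web.

```python
from collections import Counter
from itertools import combinations

# Hypothetical lists drawn from different disciplines.
lists = {
    "ecology_course": ["silent_spring", "sand_county", "limits_to_growth"],
    "economics_course": ["limits_to_growth", "wealth_of_nations"],
    "env_policy_guide": ["silent_spring", "limits_to_growth"],
}


def cooccurrence(lists):
    """Count, for every pair of items, the number of lists containing both."""
    pairs = Counter()
    for items in lists.values():
        for a, b in combinations(sorted(set(items)), 2):
            pairs[(a, b)] += 1
    return pairs


def neighbours(pairs, item):
    """Items linked to `item`, strongest co-occurrence first."""
    linked = Counter()
    for (a, b), count in pairs.items():
        if a == item:
            linked[b] += count
        elif b == item:
            linked[a] += count
    return linked.most_common()


pairs = cooccurrence(lists)
links = neighbours(pairs, "limits_to_growth")
```

In this toy data, one title ends up linked to resources from both the ecology and the economics lists, which is the kind of cross-branch connection described above.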

Design Dimensions

Note on privacy

Public and academic libraries place supreme importance on the privacy of their users. As a result, they do a number of things to ensure that users maintain privacy when using library resources and services. One of the most important mechanisms that libraries employ for maintaining privacy is to simply not track and store user behavior. What this means is that libraries intentionally purge information about what resources a particular user has checked out or viewed in the past. This makes it so that libraries are not put in an ethically compromised position if the government comes to them asking for information about a particular user; if they do not have any information they do not have to break this user's privacy.

This policy has implications for developing user interfaces and experiences. Since the libraries do not store this information, they cannot create an interface that allows a user to log in to her account and view her previous checkout history, for instance. It also has implications for the type of recommendation systems that can be built for library materials. Since user data is intentionally purged, user-user algorithms and collaborative filtering techniques that depend on individual user data are not possible, because we have to presume an environment without user-specific information. Users are unable to rate resources, and the library has no way of tying browsing behavior to an individual user over time. That said, there are some libraries that are developing opt-in systems to enable some of these features if users agree to the privacy implications. Nonetheless, this paper attempts to outline a system that can be effective using item-item algorithms and content filtering approaches in the university context without the need for user-specific data.

Content-based recommending

One way of building a recommender for library resources is to utilize the reading and resource lists created by professors and librarians. In terms of the recommender, each list can be treated like a user, and each item in a list can be considered a vote for that item. Since these lists are curated by domain experts, treating an item's presence in a list as a vote gives us confidence that the recommender will be working with resources that have a high degree of both quality and relevance. This technique results in a matrix of lists and items, where the lists are along one axis and the items are along the other. Using this matrix, a simple item-item recommender algorithm can be used to recommend resources related to the resource currently being viewed. Additionally, since there will be a much larger number of resources than lists, and since relatively few resources will be listed more than once (i.e., have more than one vote), this matrix will work best with algorithms that deal well with sparse ratings matrices. Consequently, an SVD algorithm can be used here to discover features of the items based on their votes and make recommendations based on these features.
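
A minimal sketch of this pipeline, assuming a tiny hypothetical set of lists: it builds the binary list-by-item vote matrix, extracts a few latent item features with a truncated SVD, and ranks items by cosine similarity in that feature space. To stay dependency-free, the SVD is approximated with power iteration on the item-by-item matrix; a real system would more likely call a numerical library, and all names here are illustrative.

```python
import math

# Hypothetical curated lists (the "users") and the items each one votes for.
lists = {
    "list_1": ["a", "b"],
    "list_2": ["a", "b", "c"],
    "list_3": ["d", "e"],
    "list_4": ["c", "d", "e"],
}
items = sorted({item for members in lists.values() for item in members})

# Binary vote matrix: one row per list, one column per item.
votes = [[1.0 if item in members else 0.0 for item in items]
         for members in lists.values()]


def gram(matrix):
    """Return the item-by-item matrix A^T A for a list-by-item matrix A."""
    columns = list(zip(*matrix))
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in columns]
            for ci in columns]


def dominant_eigenpair(g, iterations=500):
    """Power iteration on a symmetric positive semidefinite matrix."""
    n = len(g)
    v = [1.0 / (i + 1) for i in range(n)]  # fixed start keeps runs repeatable
    value = 0.0
    for _ in range(iterations):
        w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # nothing left to extract
            return 0.0, v
        v = [x / norm for x in w]
        value = norm
    return value, v


def item_features(matrix, k):
    """Top-k latent features per item (right singular vectors, scaled)."""
    g = gram(matrix)
    n = len(g)
    features = [[] for _ in range(n)]
    eigenvalues = []
    for _ in range(k):
        value, v = dominant_eigenpair(g)
        eigenvalues.append(value)
        for i in range(n):
            features[i].append(math.sqrt(value) * v[i])
            for j in range(n):
                g[i][j] -= value * v[i] * v[j]  # deflate the found component
    return eigenvalues, features


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0


eigenvalues, features = item_features(votes, k=2)


def similar_items(item, top_n=3):
    """Rank the other items by cosine similarity in the latent space."""
    target = features[items.index(item)]
    scored = [(other, cosine(target, features[i]))
              for i, other in enumerate(items) if other != item]
    return sorted(scored, key=lambda pair: -pair[1])[:top_n]
```

Calling `similar_items("a")` ranks "b" first (it shares both of a's lists) and the bridging item "c" next, ahead of items that never share a list with "a". Two features are used only because the example is tiny.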

Along with counting instances of resources in lists as votes, other content-based techniques can be used in other algorithms to develop interesting recommendations. There are a number of useful pieces of metadata that each resource is likely to have. Most of these forms of metadata will be useful for recommending similar items, but the following will be most effective: list metadata that details the course the list is used for and the subject area(s) the list items deal with, item subject data derived from the Library of Congress subject headings, full-text descriptions of items derived from abstracts and summary paragraphs, and additional subjects or “tags” applied to the items by the professors and librarians when they list them. Of these four types of metadata, three are essentially keywords that are used to categorize the items into some sort of taxonomy. As such, these can all be grouped together and used to create a content keyword frequency matrix, which can then be fed into a content-filtering SVD algorithm. The remaining type, the full-text descriptions, can be used by first running them through a natural language processing step in order to derive keywords from the descriptions. However, the resulting keywords should probably not be combined with the categorical keywords contained in the other metadata, because they come from an uncontrolled vocabulary and are not used for taxonomic purposes. All of the other metadata come from controlled vocabularies and will thus result in a more concentrated frequency matrix. Adding the natural language keywords would dilute this concentration, rendering the keyword matrix less effective. Instead, the abstract keywords can form their own content keyword frequency matrix, which can be fed into another SVD algorithm to generate similarities.
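
The controlled-vocabulary keyword frequency matrix might be assembled as in the sketch below. The field names (`lc_subjects`, `list_subjects`, `curator_tags`) and the sample records are hypothetical; keywords derived from abstracts would go through the same function in a separate call so that the two vocabularies stay apart.

```python
from collections import Counter

# Hypothetical controlled-vocabulary metadata; field names are illustrative.
item_metadata = {
    "book_a": {
        "lc_subjects": ["Urban ecology", "City planning"],
        "list_subjects": ["Urban ecology"],
        "curator_tags": ["Sustainability"],
    },
    "book_b": {
        "lc_subjects": ["City planning"],
        "list_subjects": ["Architecture"],
        "curator_tags": ["Sustainability"],
    },
    "article_c": {
        "lc_subjects": ["Urban ecology"],
        "list_subjects": [],
        "curator_tags": [],
    },
}


def keyword_frequency_matrix(metadata):
    """Return (vocabulary, rows) with one keyword-count row per item."""
    counts = {
        item: Counter(kw for keywords in fields.values() for kw in keywords)
        for item, fields in metadata.items()
    }
    vocabulary = sorted({kw for c in counts.values() for kw in c})
    rows = {item: [c[kw] for kw in vocabulary] for item, c in counts.items()}
    return vocabulary, rows


vocabulary, rows = keyword_frequency_matrix(item_metadata)
```

Each row can then be treated like a row of ratings and passed to whichever factorization routine the system uses.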

It must also be mentioned that by thinking of lists as users and the existence of items in lists as votes for those items, it is possible to utilize a pseudo-user-user algorithm. In this case the user-user algorithm might be better described as a list-list algorithm. By running the vote matrix through a list-list algorithm, it would be possible to generate recommendations of other lists similar to the list currently being viewed. The simple user-user algorithm would discover these similar lists by finding the lists that share the greatest number of co-listed resources. Since, as already discussed, this sort of overlap will be sparse (even though it will exist), an SVD algorithm would be more effective in this instance as well, because it could discover less direct relationships between lists, which will hopefully result in more relevant recommendations of similar lists.
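
The simple list-list baseline described above can be sketched as follows. The list names and resources are hypothetical, and the Jaccard ratio used as a tie-breaker is an added assumption, not something the design specifies.

```python
# Hypothetical lists, each represented as a set of resource identifiers.
lists = {
    "course_a": {"r1", "r2", "r3"},
    "course_b": {"r2", "r3", "r4"},
    "guide_c": {"r5"},
    "guide_d": {"r3", "r6", "r7", "r8"},
}


def similar_lists(lists, name):
    """Rank other lists by the number of resources shared with `name`."""
    base = lists[name]
    scored = []
    for other, members in lists.items():
        if other == name:
            continue
        shared = len(base & members)
        union = len(base | members)
        jaccard = shared / union if union else 0.0
        scored.append((other, shared, jaccard))
    # Most shared resources first, then higher Jaccard, then name.
    return sorted(scored, key=lambda row: (-row[1], -row[2], row[0]))


ranking = similar_lists(lists, "course_a")
```

With sparse overlap, many lists will tie at zero shared resources, which is the weakness that motivates the SVD alternative.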

Design recommendations

In order to create the most useful recommendation system, I recommend the use of most of the algorithmic content-based techniques mentioned above. In this vein, I think a hybrid system should be used to help surface related resources to users. When a user views a particular item in a particular list, the recommendation system will employ a number of techniques to feed recommendations into the interface.

The matrix of “votes” created from resources existing in lists will be fed into an SVD algorithm using ten features (during optimization, the number of features will be tuned to get the best results). The content keyword frequency matrix derived from subject classifications will also be fed into an SVD algorithm. Using the output features and weightings of each SVD calculation, a weighted feature combination technique will be employed to join the features of the two content-based SVDs. This combined approach is itself a hybrid of the “weighted” (Burke, 2002, p. 339) and “feature combination” (Burke, 2002, p. 341) hybridization techniques. It will work by first artificially inflating the weightings of the list-based SVD to give its results primacy, and then combining the features so that one set of recommendations is output. This approach of combining the SVDs will allow the resources' existence in lists to reveal relationships, but will also utilize the inherent taxonomic connections between resources that have been described by classification experts.
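
One plausible reading of this weighted feature combination is sketched below: the list-based feature vector of each item is scaled up by a weight and concatenated with its subject-based vector, and similarity is computed on the joined vector. The feature values and the `list_weight` parameter are hypothetical and stand in for real SVD output.

```python
import math

# Hypothetical per-item features, standing in for the output of two SVDs.
list_features = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
subject_features = {"a": [0.0, 1.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}


def combine(list_feats, subject_feats, list_weight):
    """Inflate the list-based features, then join both feature sets."""
    return {
        item: [list_weight * x for x in list_feats[item]]
        + list(subject_feats[item])
        for item in list_feats
    }


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0


def most_similar(features, item):
    """Other items ranked by cosine similarity on the combined features."""
    scored = [(other, cosine(features[item], vec))
              for other, vec in features.items() if other != item]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


list_first = most_similar(combine(list_features, subject_features, 2.0), "a")
subject_first = most_similar(combine(list_features, subject_features, 0.5), "a")
```

In this toy data, item "a" shares its lists with "b" but its subjects with "c", so a weight above 1 makes "b" the top recommendation and a weight below 1 makes it "c". That is the primacy effect the inflation is meant to produce.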

In addition to these two algorithms, a third will be run on the content keyword frequency matrix created from the full-text abstracts of each resource. This matrix will be fed into its own SVD and the weightings and features from it will be combined with the hybrid results which have already been combined. It will do this using one of two techniques depending on the interface to be employed: weighted combination or “mixed” combination (Burke, 2002, p. 341). If the desired interface requires only one set of recommendations then a weighted combination will occur whereby the full-text recommendations will be weighted as less important than the combined vote and subject keyword based recommendations. This is the case because the full-text derived recommendations do not take into account the professors' and librarians' expertise, which is a key element missing in current systems that this paper proposes will lead to better recommendations. In interfaces which can accommodate a more complex display, the full-text recommendations will be displayed in a separate location alongside the other recommendations using a “mixed” strategy.
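
The two display strategies can be contrasted with a short sketch. The scores, the `secondary_weight` value, and the function names are hypothetical; the only point taken from the text is that full-text results count for less in the weighted case and are shown separately in the mixed case.

```python
def rank(scores, top_n):
    """Item ids ordered by descending score, ties broken by id."""
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


def weighted_combination(primary, secondary, secondary_weight=0.3, top_n=3):
    """Single result set: full-text scores are down-weighted, then added."""
    merged = dict(primary)
    for item, score in secondary.items():
        merged[item] = merged.get(item, 0.0) + secondary_weight * score
    return rank(merged, top_n)


def mixed_combination(primary, secondary, top_n=3):
    """Two result sets, shown side by side in a richer interface."""
    return {
        "curated": rank(primary, top_n),
        "full_text": rank(secondary, top_n),
    }


# Hypothetical scores from the vote/subject hybrid and from the abstracts.
hybrid_scores = {"a": 0.9, "b": 0.6, "c": 0.2}
full_text_scores = {"c": 0.8, "d": 0.7}

single_box = weighted_combination(hybrid_scores, full_text_scores)
two_boxes = mixed_combination(hybrid_scores, full_text_scores)
```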

Lastly, a third type of recommendation will be used to recommend other lists similar to the one currently being viewed. For this set of recommendations, the original matrix of resources and lists will be sent to an SVD to discover features of the lists, which will lead to recommendations for other lists. These recommendations will be displayed in the interface apart from all the other recommendations, as they are distinct.

Performance

Performance in this system is not likely to be a problem because almost all of the computation can happen offline, before recommendations are needed. Since this system does not have to deal directly with user input via ratings or other usage metrics, it has little need to update in real time. The main event that would require the algorithms to be rerun is a list being added or edited in the system. Since this will happen only semi-frequently, most computation can happen offline without negatively impacting the user experience.
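
A minimal sketch of this offline pattern, under the assumption that recommendations are precomputed into a lookup table and rebuilt only when the set of lists changes. The class, the fingerprinting approach, and the stand-in `colisted` computation are all illustrative, not part of the proposed design.

```python
import hashlib
import json


def colisted(lists):
    """Stand-in for the expensive step: items that share a list with each item."""
    table = {}
    for members in lists.values():
        for item in members:
            table.setdefault(item, set()).update(m for m in members if m != item)
    return {item: sorted(others) for item, others in table.items()}


class RecommendationCache:
    """Rebuilds the precomputed table only when the lists have changed."""

    def __init__(self, compute):
        self._compute = compute
        self._fingerprint = None
        self._table = {}
        self.rebuilds = 0

    def refresh(self, lists):
        payload = json.dumps(lists, sort_keys=True).encode("utf-8")
        fingerprint = hashlib.sha256(payload).hexdigest()
        if fingerprint != self._fingerprint:
            self._table = self._compute(lists)
            self._fingerprint = fingerprint
            self.rebuilds += 1

    def lookup(self, item):
        """Cheap read path used while a page is being served."""
        return self._table.get(item, [])


lists = {"list_1": ["a", "b"], "list_2": ["b", "c"]}
cache = RecommendationCache(colisted)
cache.refresh(lists)
cache.refresh(lists)  # unchanged lists: no recomputation
```

Serving a page then involves only `lookup`, so the cost of the algorithms is paid when a librarian or professor edits a list rather than when a user browses.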

Interface

This recommendation system is geared primarily at developing a highly relevant and useful information architecture for academic library resources. As a result, the end goal of the recommender engine is not as simple as outputting a list of items that can be put into a widget-like box somewhere on an already existing library page. In fact, this system requires an entire web framework to be built, and requires that the recommendations from the system be tightly integrated with this framework in a usable interface.

A framework needs to be developed to house the lists that librarians and professors create. Each list will be housed at its own unique URL that librarians and professors can share with their users. Lists themselves will be part of a larger system that houses all of the lists. Each list will be characterized by a highly visual, fast, and interactive interface that encourages clicking and browsing. APIs will be used to pull in book covers and other images to represent each resource visually, and JavaScript will be used heavily to utilize the processing power of the browser and ensure a highly responsive level of interaction. When an individual resource is selected, a light-box will open displaying more information about that resource, along with the algorithms' recommendations of other resources. These recommendations will be displayed just as visually, by offering an image of the resource along with a title and perhaps an indication of which list(s) it appears in. Recommendations of other lists will be displayed in a sidebar on each main list page. Due to the interface's visual nature, there will be very few textual descriptions detailing why these recommended resources are being presented to the user. Instead, the interface will leave these details vague in the belief that users need not concern themselves with the specifics if they are finding interesting resources. While there are many alternatives for the specifics of how this user interface could be developed, making it highly visual and fast to interact with will be key pillars of its success.

In addition to the actual interface of lists of resources, this system can be integrated with the main library catalog. When a user views an item in the main catalog, an interface element can be added to the information about that item, which shows and links to the lists it has been listed in. This type of element has been pioneered in the commercial sector with features like Amazon's Listmania. In it, shoppers create lists of products, and when a shopper navigates to a product that has been listed, an element is added to the product page that displays the lists it belongs to and items related to it from those lists.

Drawbacks and pitfalls

While this recommendation system and user interface are conceptually sound, they are not practically implementable given the current state of most universities. In this section I will predominantly relate the situation at the University of Michigan, and make the broad assumption that it is characteristic of most universities. Even though the university has a huge amount of knowledge embedded in the course reading lists from professors, these lists are not made available to the public with any consistency. There are open initiatives such as Open.Michigan that attempt to release university courseware under an open license, but even these efforts are not sufficient for the type of system envisioned here. Even presuming that all courseware at the university could be published openly, this system would still require consistently formatted resources and machine-parseable reading lists. Given that the university is highly decentralized, the sort of standardization that would be required to accomplish this is not likely to occur easily. It is more conceivable that librarians could organize around a standardized way of developing and formatting these types of lists, but without the lists from professors' courses, the system would not contain enough resources to lead to useful recommendations.

In addition to issues with the practical prerequisites of this system, it would be quite challenging to get the user interface right so that this tool is a solution instead of just another set of webpages that adds to the already complicated and confusing library web presence. A big part of developing the user interface would also involve developing an administrative interface that makes it easy for librarians and professors to create and manage lists. This would be of prime importance because most of these professionals would not be technically knowledgeable, and most of them would not use such a system if it required a great deal of effort to set up. So, in addition to the end-user interface, the system would have to get the administrative interaction right so that it is simple and pleasant for librarians and professors to use. That said, if the administrative interface were successful, it could be used as an opportunity to add additional information to the system to make it more effective. For instance, as mentioned above, one possibility would be to allow librarians and professors to add keywords or tags to resources as they are creating the lists, which would presumably make recommendations even more effective.

Future possibilities

Along with the recommendations already mentioned, if this system were put in place and were successful, there would be a lot of opportunity to expand the system by implementing recommendations based on user input. This could conceivably be accomplished by offering users an opt-in system in which they could choose to allow the library to store certain usage and user-input data. If this were in place, the library could track browsing behavior and solicit user-supplied profile information, which could be used to create a keyword/subject profile of the user. This profile could then be used to help weight the recommendation engines so that items would be more relevant to that user's profile. In addition to these mechanisms, actual user ratings, reviews, and even user-contributed tagging could be added to the system if it developed a critical mass of usage. Individual items could solicit ratings from users, which could then be run through additional algorithms to modify the existing recommendations or create new ones.

Note on sources

This paper was developed in concert with a project I am doing to submit to the iDesign competition for the University of Michigan Libraries. Due to this, much of the domain specific information and knowledge within was ascertained from a series of interviews and discussions with University of Michigan librarians and library staff, along with staff of the Open.Michigan program. Also of note is that this paper outlines a number of aspects of the actual design I will be submitting for the iDesign competition.

References

Burke, R. (2002). Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4), 331-370.