Metadata based statistics for DSpace
Uploaded by bram-luyten
![Page 2: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/2.jpg)
OVERVIEW
1. Why DSpace statistics?
2. Usage event vs. Item metadata
3. Generating metadata based statistics
4. Linking metadata to usage events
5. Performance
6. Problem solved?
![Page 3: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/3.jpg)
1 - WHY DSPACE STATISTICS?
Statistics solution that knows DSpace:
Structure
“Which are the most downloaded bitstreams in a collection?”
Metadata
“Who are the most popular authors in terms of downloads?”
![Page 4: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/4.jpg)
USAGE EVENT VS. ITEM METADATA
2 types of metadata:
Usage event metadata
Additional information about the usage event
Item metadata
Additional information about the target of the usage event
![Page 5: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/5.jpg)
USAGE EVENT METADATA
Additional information about the usage event
Not related to repository
Also possible with other statistics solutions:
• IP address
• Country
• User Agent
• HTTP Referrer
• ...
![Page 6: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/6.jpg)
ITEM METADATA
Relate usage event to information stored in your repository.
Allows statistics queries based on item metadata.
→ Not possible with a statistics solution that is not tied to the repository.
![Page 7: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/7.jpg)
GENERATING METADATA BASED STATISTICS
How many downloads did author “Barnes, Douglas F.” get in the last year, grouped by month?
![Page 8: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/8.jpg)
![Page 9: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/9.jpg)
![Page 10: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/10.jpg)
![Page 11: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/11.jpg)
![Page 12: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/12.jpg)
![Page 13: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/13.jpg)
LINKING METADATA TO USAGE EVENTS
Solr query:
http://localhost:8080/solr/statistics/select?facet=true&facet.offset=0&facet.mincount=1&facet.sort=false&q=*:*&facet.limit=24&facet.field=dateYearMonth&facet.method=enum&fq=bundleName:ORIGINAL&fq=type:+0&fq=statistics_type:view&fq=-isBot:true&fq=-isInternal:true&fq=time:[2014-07-01T00:00:00.000Z+TO+2015-06-06T00:00:00.000Z]&fq=+(author_mtdt:Barnes,\+Douglas\+F.)+&wt=javabin&rows=0
![Page 14: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/14.jpg)
LINKING METADATA TO USAGE EVENTS
facet.field=dateYearMonth
group by the field dateYearMonth
fq=type:+0
only include bitstream downloads
fq=bundleName:ORIGINAL
only include files in bundle “ORIGINAL”
fq=-isBot:true
filter out all bot statistics
fq=-isInternal:true
filter out all internal statistics
fq=time:[2014-07-01+TO+2015-06-06]
only include statistics between Jul 1st 2014 and Jun 6th 2015
fq=+(author_mtdt:Barnes,\+Douglas\+F.)+
only include statistics for items by author Barnes, Douglas F.
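As a rough illustration, the filters listed here could be assembled with Python's standard library. This is only a sketch: the function name `build_stats_query` is hypothetical, the author filter uses a quoted phrase instead of the escaped spaces on the slide, and nothing here is taken from the DSpace code base.

```python
from urllib.parse import urlencode

# Base URL of the statistics core, as shown on the slide.
SOLR_SELECT = "http://localhost:8080/solr/statistics/select"


def build_stats_query(author, start, end):
    """Return a Solr URL that counts downloads per month for one author."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),                      # only facet counts are needed
        ("facet", "true"),
        ("facet.field", "dateYearMonth"),   # group by month
        ("facet.mincount", "1"),
        ("fq", "type:0"),                   # bitstream downloads only
        ("fq", "bundleName:ORIGINAL"),
        ("fq", "-isBot:true"),
        ("fq", "-isInternal:true"),
        ("fq", f"time:[{start} TO {end}]"),
        ("fq", f'author_mtdt:"{author}"'),  # assumption: quoted phrase match
    ]
    # urlencode takes care of percent-encoding spaces, brackets and quotes.
    return SOLR_SELECT + "?" + urlencode(params)


url = build_stats_query(
    "Barnes, Douglas F.",
    "2014-07-01T00:00:00.000Z",
    "2015-06-06T00:00:00.000Z",
)
print(url)
```

Passing the parameters as a list of pairs rather than a dict keeps the repeated `fq` entries, which a dict would collapse into one.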
![Page 15: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/15.jpg)
<response>
  <lst name="responseHeader"> ... </lst>
  <result name="response" numFound="164" start="0"></result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="dateYearMonth">
        <int name="2014-07">15</int>
        <int name="2014-08">19</int>
        <int name="2014-09">15</int>
        <int name="2014-10">10</int>
        <int name="2014-11">7</int>
        <int name="2014-12">13</int>
        <int name="2015-01">13</int>
        <int name="2015-02">15</int>
        <int name="2015-03">21</int>
        <int name="2015-04">22</int>
        <int name="2015-05">12</int>
        <int name="2015-06">2</int>
      </lst>
    </lst>
  </lst>
</response>
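A response of this shape can be reduced to a month-to-count mapping with a few lines of Python. The sketch below assumes the XML response format and uses an abbreviated copy of the response (three of the twelve months); the helper name `facet_counts` is made up for the example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the response above (three of the twelve months).
SAMPLE = """
<response>
  <result name="response" numFound="164" start="0"></result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="dateYearMonth">
        <int name="2014-07">15</int>
        <int name="2014-08">19</int>
        <int name="2014-09">15</int>
      </lst>
    </lst>
  </lst>
</response>
"""


def facet_counts(xml_text, field):
    """Map each facet value of `field` to its count."""
    root = ET.fromstring(xml_text)
    node = root.find(f".//lst[@name='facet_fields']/lst[@name='{field}']")
    if node is None:
        return {}  # the response has no facet for this field
    return {item.get("name"): int(item.text) for item in node.findall("int")}


per_month = facet_counts(SAMPLE, "dateYearMonth")
print(per_month)
```

In the full response the twelve monthly counts add up to 164, the same number reported in `numFound`.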
![Page 16: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/16.jpg)
LINKING METADATA TO USAGE EVENTS
In a vanilla DSpace installation:
• Usage statistics only contain bitstream IDs: no metadata
• The metadata is stored in the database
![Page 17: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/17.jpg)
PROPOSED SOLUTION
1. Query the database for bitstream IDs based on the author metadata
2. Use those IDs to query Solr for statistics
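The two steps could look roughly like the sketch below. It uses an in-memory SQLite table as a stand-in for the repository database; the table `bitstream_author`, its columns and the sample rows are hypothetical and much simpler than the real DSpace schema.

```python
import sqlite3

# Hypothetical, simplified schema: one row per (bitstream, author) pair.
# A real DSpace database is laid out differently.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE bitstream_author (bitstream_id INTEGER, author TEXT)")
db.executemany(
    "INSERT INTO bitstream_author VALUES (?, ?)",
    [(54, "Barnes, Douglas F."), (61, "Barnes, Douglas F."), (70, "Samad, Hussain A.")],
)


def bitstream_ids_for(author):
    """Step 1: look up the bitstream IDs for an author in the database."""
    rows = db.execute(
        "SELECT bitstream_id FROM bitstream_author"
        " WHERE author = ? ORDER BY bitstream_id",
        (author,),
    )
    return [row[0] for row in rows]


def solr_filter_for(ids):
    """Step 2: turn the IDs into a Solr filter on the `id` field."""
    if not ids:
        raise ValueError("no bitstreams found, nothing to query")
    return "id:(" + " OR ".join(str(i) for i in ids) + ")"


ids = bitstream_ids_for("Barnes, Douglas F.")
print(solr_filter_for(ids))
```

The filter grows by one term per bitstream, so an author with thousands of files produces a very long query.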
![Page 18: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/18.jpg)
PROPOSED SOLUTION: DOWNSIDES
• Two queries to answer one question
• The Solr query can get very long and inefficient to execute
• Inefficient but still possible
![Page 19: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/19.jpg)
PROPOSED SOLUTION: DOWNSIDES
What if we want to show the 10 authors with the most downloads?
• query the database for all authors
• query Solr to get the number of usage events for each author
• sort those counts, and return the 10 highest
![Page 20: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/20.jpg)
PROPOSED SOLUTION: DOWNSIDES
Very inefficient!
• do a lot of queries
• throw away most of the results: we only need top 10
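To make the cost concrete, here is a toy version of that approach in Python. The event list and the function `count_downloads`, which stands in for one Solr query per author, are invented for the illustration.

```python
# Toy stand-in for the statistics core: one author list per download event.
EVENTS = [
    ["Barnes, Douglas F.", "Samad, Hussain A."],
    ["Barnes, Douglas F."],
    ["Khandker, Shahidur R."],
]

queries_issued = 0


def count_downloads(author):
    """Stand-in for one Solr query that counts the events of a single author."""
    global queries_issued
    queries_issued += 1
    return sum(1 for authors in EVENTS if author in authors)


def top_authors_naive(all_authors, n=10):
    """One query per author, then sort and discard everything below the top n."""
    counts = {author: count_downloads(author) for author in all_authors}
    # Highest count first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


authors = sorted({a for event in EVENTS for a in event})
top = top_authors_naive(authors, n=2)
print(top, "queries:", queries_issued)
```

The number of queries equals the number of authors in the repository, regardless of how few results are kept.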
![Page 21: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/21.jpg)
SOLR FACETS
To do a facet query:
• specify “facet.field” along with the regular query
• results will be grouped by the values they have for that field
![Page 22: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/22.jpg)
SOLR FACETS: EXAMPLE
q=type:0&facet.field=owningItem
q=type:0
search for all usage events that are bitstream downloads
facet.field=owningItem
group these by item
count the # records in each group
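Conceptually, a facet is a filter followed by a group-and-count. The Python sketch below imitates that on a handful of in-memory records; it is not how Solr implements faceting, and the item IDs and the non-download type value are made up.

```python
from collections import Counter

# Toy usage events; `type` 0 marks a bitstream download, as in the slides.
# The other type value and the item IDs are made up for the example.
events = [
    {"type": 0, "owningItem": 1652},
    {"type": 0, "owningItem": 1652},
    {"type": 2, "owningItem": 1652},
    {"type": 0, "owningItem": 1700},
]


def facet(docs, query, field):
    """Filter docs with `query`, then count them per value of `field`."""
    matching = (doc for doc in docs if query(doc))
    return Counter(doc[field] for doc in matching)


counts = facet(events, lambda doc: doc["type"] == 0, "owningItem")
print(counts.most_common())
```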
![Page 23: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/23.jpg)
OUR SOLUTION
• Add item metadata to Solr.
• Use built-in filtering and grouping
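Once author metadata sits in each usage event, the top 10 authors question could be answered with a single facet query. The sketch below only builds the request URL; it assumes the `author_mtdt` field and the local Solr URL that appear elsewhere in these slides.

```python
from urllib.parse import urlencode

params = [
    ("q", "type:0"),                 # bitstream downloads
    ("rows", "0"),                   # no documents, only facet counts
    ("facet", "true"),
    ("facet.field", "author_mtdt"),  # metadata field copied into each usage event
    ("facet.sort", "count"),         # most frequent values first
    ("facet.limit", "10"),           # keep only the top 10
]
top_authors_url = "http://localhost:8080/solr/statistics/select?" + urlencode(params)
print(top_authors_url)
```

Sorting and truncation happen inside Solr, so only ten values come back over the wire.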
![Page 24: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/24.jpg)
CHALLENGE: SIZE OF THE SOLR CORE
That solution creates new challenges
Metadata is duplicated in every statistical record
that takes up a lot of space
and it needs to be kept in sync
![Page 25: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/25.jpg)
SIZE OF SINGLE USAGE EVENT
<doc>
  <str name="ip">177.21.194.80</str>
  <arr name="ip_search"><str>177.21.194.80</str></arr>
  <arr name="ip_ngram"><str>177.21.194.80</str></arr>
  <int name="type">0</int>
  <int name="id">54</int>
  <date name="time">2015-05-11T04:33:49.077Z</date>
  <str name="dateYearMonth">2015-05</str>
  <str name="dateYear">2015</str>
  <str name="continent">SA</str>
  <str name="countryCode">BR</str>
  <float name="latitude">-10.0</float>
  <float name="longitude">-55.0</float>
  <arr name="bundleName"><str>ORIGINAL</str></arr>
  <arr name="containerBitstream"><int>54</int></arr>
  <arr name="owningItem"><int>1652</int></arr>
  <arr name="containerItem"><int>1652</int></arr>
  <arr name="owningColl"><int>14</int></arr>
  <arr name="containerCollection"><int>14</int></arr>
  <arr name="owningComm"><int>1</int></arr>
  <arr name="containerCommunity"><int>1</int></arr>
  <str name="uid">60fe8ebb-b8a9-454c-8eef-3f9f800d1399</str>
  <bool name="isBot">false</bool>
  <bool name="isInternal">false</bool>
  <str name="statistics_type">view</str>
  <long name="_version_">1501767933804675072</long>
</doc>
25 elements
![Page 26: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/26.jpg)
SIZE OF SINGLE USAGE EVENT WITH METADATA
<doc>
  <str name="ip">177.21.194.80</str>
  ...
  <arr name="author_mtdt">
    <str>Khandker, Shahidur R.</str>
    <str>Barnes, Douglas F.</str>
    <str>Samad, Hussain A.</str>
  </arr>
  <arr name="subject_mtdt">
    <str>ACCESS TO LIGHTING</str>
    <str>ACCESS TO MODERN ENERGY</str>
    <str>AGRICULTURAL LAND</str>
    <str>AGRICULTURAL RESIDUE</str>
    <str>AIR CONDITIONERS</str>
    <str>AIR POLLUTION</str>
    <str>ALTERNATIVE ENERGY</str>
    <str>ALTERNATIVE SOURCES OF ENERGY</str>
    <str>APPROACH</str>
    <str>ATMOSPHERE</str>
    <str>AVAILABILITY</str>
    <str>BASIC ENERGY</str>
    <str>BIOMASS</str>
    <str>BIOMASS BURNING</str>
    <str>BIOMASS COLLECTION</str>
    <str>BIOMASS CONSUMPTION</str>
    <str>BIOMASS ENERGY</str>
    ...
    <str>WORLD ENERGY</str>
    <str>WORLD ENERGY OUTLOOK</str>
  </arr>
  ...
</doc>
3 authors
140 subjects
![Page 27: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/27.jpg)
KEEPING METADATA IN SYNC
When the metadata of an item changes
• a mistake was corrected
• extra info was added
the statistical records for that item need to be updated as well
![Page 28: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/28.jpg)
KEEPING METADATA IN SYNC
Item with 7,000 page visits and 5,000 downloads → that means updating 12,000 usage events.
• That takes time
• During that time, it takes longer to view other statistical reports
![Page 29: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/29.jpg)
PERFORMANCE
Size of single usage event
Metadata updates
Amount of events
Live search queries
![Page 30: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/30.jpg)
PERFORMANCE ENHANCEMENT: SYNCING
Keep the load created by syncing metadata into the statistics as low as possible:
→ only sync while Solr is idle
• interrupt the operation when a search request can’t be handled in time
• interrupt the operation when Solr’s memory usage nears its maximum
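One possible shape for such an interruptible sync is a batch loop that checks its stop conditions before every batch. In the Python sketch below the three callables (`update_batch`, `solr_is_idle`, `memory_near_limit`) are hypothetical hooks supplied by the caller, not DSpace or Solr APIs, and the batch size is arbitrary.

```python
def sync_metadata(pending, update_batch, solr_is_idle, memory_near_limit, batch_size=500):
    """Update usage events in batches; stop early when Solr is needed elsewhere.

    The three callables are hypothetical hooks supplied by the caller.
    Returns the events that still need to be synced.
    """
    remaining = list(pending)
    while remaining:
        if not solr_is_idle() or memory_near_limit():
            break  # give way to search traffic; resume in a later run
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        update_batch(batch)
    return remaining


# Usage with stand-in hooks: Solr becomes busy after two batches.
updated = []
idle_answers = iter([True, True, False])
left = sync_metadata(
    pending=range(12000),
    update_batch=updated.extend,
    solr_is_idle=lambda: next(idle_answers),
    memory_near_limit=lambda: False,
    batch_size=5000,
)
print(len(updated), "updated,", len(left), "left for the next run")
```

Returning the unfinished work lets a scheduler pick it up again once Solr is idle.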
![Page 31: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/31.jpg)
PERFORMANCE ENHANCEMENT: CACHING
Caching
store generated reports in a separate Solr core
retrieving them is very fast
invalidate cached reports after a set time (e.g. 24 hours)
![Page 32: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/32.jpg)
PERFORMANCE ENHANCEMENT: CACHING
Don’t delete expired cached reports
If a user requests a report whose cached copy has expired → show the outdated version
In the meantime → generate a new version
Automatically show new report when it’s done
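This serve-stale-then-refresh behaviour can be sketched in a few lines of Python. The class below keeps reports in a plain dictionary rather than a separate Solr core, and the names `ReportCache`, `needs_refresh` and `generate` are invented for the example; a real setup would run the refresh in the background.

```python
import time


class ReportCache:
    """Serve cached reports, including expired ones, and flag them for refresh."""

    def __init__(self, generate, max_age_seconds=24 * 3600, clock=time.time):
        self._generate = generate      # function that builds a report from its key
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}             # key -> (report, created_at)
        self.needs_refresh = set()     # keys whose report should be regenerated

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            # Cache miss: nothing to show yet, so generate synchronously.
            return self.refresh(key)
        report, created_at = entry
        if self._clock() - created_at > self._max_age:
            # Expired: still return the old report, but queue a refresh.
            self.needs_refresh.add(key)
        return report

    def refresh(self, key):
        report = self._generate(key)
        self._entries[key] = (report, self._clock())
        self.needs_refresh.discard(key)
        return report


# Usage with a fake clock so the expiry can be shown without waiting.
now = [0.0]
calls = []


def generate(key):
    calls.append(key)
    return f"{key} v{len(calls)}"


cache = ReportCache(generate, max_age_seconds=10, clock=lambda: now[0])
first = cache.get("top-authors")       # miss: generated on the spot
now[0] = 11
stale = cache.get("top-authors")       # expired: old version is still served
flagged = set(cache.needs_refresh)
fresh = cache.refresh("top-authors")   # what a background job would do
print(first, "|", stale, "|", fresh)
```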
![Page 33: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/33.jpg)
EXAMPLE: CACHE MISS
![Page 34: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/34.jpg)
EXAMPLE: CACHE MISS
![Page 35: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/35.jpg)
PROBLEM SOLVED?
Additional complexity
Number of usage events
keeps growing
Name variants
Different names for one author
![Page 36: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/36.jpg)
NAME VARIANTS USE CASE
“Who are the Most Popular Authors in terms of downloads?”
![Page 37: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/37.jpg)
https://openknowledge.worldbank.org/most-popular/author
![Page 38: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/38.jpg)
3 name variants:
Ferreira, Francisco H. G.
Ferreira, Francisco H.G.
Ferreira, Francisco
![Page 39: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/39.jpg)
![Page 40: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/40.jpg)
SOLUTION FOR NAME VARIANTS
include all name variants in Solr query:
author_mtdt:("Ferreira, Francisco H. G." OR "Ferreira, Francisco H.G." OR "Ferreira, Francisco")
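A filter like this could be generated from a list of known variants. The Python sketch below quotes each name as a phrase and groups all of them under one field; the helper name `variants_filter` is hypothetical, and exact phrase matching on `author_mtdt` is an assumption about how that field is indexed.

```python
def variants_filter(field, variants):
    """Build one Solr filter matching any of the given name variants exactly."""
    if not variants:
        raise ValueError("at least one name variant is required")
    quoted = []
    for name in variants:
        # Inside a quoted phrase only backslash and double quote need escaping.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    return f"{field}:(" + " OR ".join(quoted) + ")"


fq = variants_filter(
    "author_mtdt",
    ["Ferreira, Francisco H. G.", "Ferreira, Francisco H.G.", "Ferreira, Francisco"],
)
print(fq)
```

Grouping the variants inside one pair of parentheses makes the field name apply to every variant, not just the first.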
![Page 41: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/41.jpg)
ALTERNATIVE SOLUTION
If you have unique IDs (e.g. ORCID)
Index, and search for them instead
![Page 43: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/43.jpg)
Desktop view
Phone view
![Page 44: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/44.jpg)
Desktop view
Phone view
![Page 45: Metadata based statistics for DSpace](https://reader031.fdocuments.in/reader031/viewer/2022032003/55b6e67dbb61eb73688b46c4/html5/thumbnails/45.jpg)
Desktop view
Phone view