JCDL2015: How Well are Arabic Websites Archived?
![Page 1: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/1.jpg)
How Well Are Arabic Websites Archived?
Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle
Old Dominion University, Department of Computer Science
Norfolk, Virginia 23529 USA
JCDL 2015, Knoxville, TN
June 21-25, 2015
![Page 2: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/2.jpg)
Archived events on English sites vs. Arabic sites
![Page 3: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/3.jpg)
![Page 4: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/4.jpg)
![Page 5: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/5.jpg)
![Page 6: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/6.jpg)
![Page 7: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/7.jpg)
Archived events on English sites vs. Arabic sites
Search: Baltimore (one week old)
http://www.foxnews.com/us/2015/05/26/2-shot-dead-in-bloody-memorial-day-weekend-in-baltimore-capping-off-deadliest/
Search: Yemen Houthis (one week old)
http://www.yemenakhbar.com/yemen-news/178683.html
![Page 8: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/8.jpg)
English sports websites are archived more than Arabic ones
www.espn.go.com (English) vs. www.kooora.com (Arabic)
![Page 9: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/9.jpg)
English e-Marketing websites are archived more than Arabic ones
www.amazon.com (English) vs. www.haraj.com.sa (Arabic)
![Page 10: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/10.jpg)
English encyclopedia websites are archived more than Arabic ones
en.wikipedia.org (English) vs. ar.wikipedia.org (Arabic)
![Page 11: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/11.jpg)
Top ten languages on the Internet
World Language Map. Source: Quick Maps of the World immigration - http://www.allcountries.org/maps/world_language_maps.html
Source: Internet World Stats - http://www.internetworldstats.com/stats7.htm
![Page 12: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/12.jpg)
Arabic speaking Internet users

| # | Country | Population (2009) | Internet users (2009) | Penetration (2009) | Population (2013) | Internet users (2013) | Penetration (2013) |
|---|---------|-------------------|-----------------------|--------------------|-------------------|-----------------------|--------------------|
| 1 | Algeria | 34,178,188 | 4,100,000 | 12.00% | 38,813,722 | 6,404,264 | 16.50% |
| 2 | Bahrain | 728,709 | 402,900 | 55.30% | 1,314,089 | 1,182,680 | 90.00% |
| 3 | Comoros | 752,438 | 23,000 | 3.10% | 766,865 | 49,846 | 6.50% |
| 4 | Djibouti | 724,622 | 19,200 | 2.60% | 810,179 | 76,967 | 9.50% |
| 5 | Egypt | 78,866,635 | 16,636,000 | 21.10% | 86,895,099 | 43,065,211 | 49.60% |
| 6 | Iraq | 28,945,569 | 300,000 | 1.00% | 32,585,692 | 2,997,884 | 9.20% |
| 7 | Jordan | 6,269,285 | 1,595,200 | 25.40% | 6,528,061 | 2,885,403 | 44.20% |
| 8 | Kuwait | 2,692,526 | 1,000,000 | 37.10% | 2,742,711 | 2,069,650 | 75.50% |
| 9 | Lebanon | 4,017,095 | 945,000 | 23.50% | 4,136,895 | 2,916,511 | 70.50% |
| 10 | Libya | 6,324,357 | 323,000 | 5.10% | 6,244,174 | 1,030,289 | 16.50% |
| 11 | Mauritania | 3,129,486 | 60,000 | 1.90% | 3,516,806 | 218,042 | 6.20% |
| 12 | Morocco | 31,285,174 | 10,442,500 | 33.40% | 32,987,206 | 18,472,835 | 56.00% |
| 13 | Oman | 3,418,085 | 557,000 | 16.30% | 3,219,775 | 2,139,540 | 66.40% |
| 14 | Qatar | 833,285 | 436,000 | 52.30% | 2,123,160 | 1,811,055 | 85.30% |
| 15 | Saudi Arabia | 28,686,633 | 7,761,800 | 27.10% | 27,345,986 | 16,544,322 | 60.50% |
| 16 | Somalia | 9,832,017 | 102,000 | 1.00% | 10,428,043 | 156,420 | 1.50% |
| 17 | South Sudan | - | - | - | 11,562,695 | 100 | 0.00% |
| 18 | Sudan | 41,087,825 | 4,200,000 | 10.20% | 35,482,233 | 8,054,467 | 22.70% |
| 19 | Syria | 21,762,978 | 3,565,000 | 16.40% | 22,597,531 | 5,920,553 | 26.20% |
| 20 | Tunisia | 10,486,339 | 3,500,000 | 33.40% | 10,937,521 | 4,790,634 | 43.80% |
| 21 | UAE | 4,798,491 | 3,558,000 | 74.10% | 9,206,000 | 8,101,280 | 88.00% |
| 22 | Palestine | 2,461,267 | 355,500 | 14.40% | 2,731,052 | 1,512,273 | 55.40% |
| 23 | Yemen | 22,858,238 | 370,000 | 1.60% | 26,052,966 | 5,210,593 | 20.00% |
| | Arabic Total | 344,139,242 | 60,252,100 | 17.50% | 379,028,461 | 135,610,819 | 35.8% |
| | World Total | 6,767,805,208 | 1,802,330,457 | 26.6% | 7,181,858,619 | 2,802,478,934 | 39.0% |

Source: http://www.internetworldstats.com/stats19.htm
![Page 13: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/13.jpg)
![Page 14: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/14.jpg)
Arabic speaking Internet users
2009: Arabic Total = 17.5%, World Total = 26.6%
2013: Arabic Total = 35.8%, World Total = 39.0%
Source: http://www.internetworldstats.com/stats19.htm
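The penetration figures follow directly from the user and population totals in the table. As a minimal sketch (the function name `penetration` is ours, not from the slides), the four highlighted percentages can be recomputed like this:

```python
def penetration(internet_users, population):
    """Return Internet penetration as a percentage of the population."""
    return 100.0 * internet_users / population


# (Internet users, population) totals copied from the table above.
totals = {
    ("Arabic", 2009): (60_252_100, 344_139_242),
    ("World", 2009): (1_802_330_457, 6_767_805_208),
    ("Arabic", 2013): (135_610_819, 379_028_461),
    ("World", 2013): (2_802_478_934, 7_181_858_619),
}

for (region, year), (users, population) in totals.items():
    print(f"{region} {year}: {penetration(users, population):.1f}%")
```

Rounded to one decimal place, these reproduce the slide's 17.5%, 26.6%, 35.8%, and 39.0%.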
![Page 15: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/15.jpg)
Why are we doing this?
• The number of Arabic speaking Internet users has grown rapidly
• There has been previous work on the coverage of web archives
• Little has been done in terms of Arabic language content
![Page 16: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/16.jpg)
How Much of the Web Is Archived?
• Sample of URIs from four different sources (DMOZ, Delicious, Bitly, search engine indexes)
• The archival percentages ranged from 16% to 79%
2013, a follow-on study:
• Archival percentages had increased, ranging from 33% to 95%
• These studies were not focused on content from specific countries or content in specific languages
![Page 17: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/17.jpg)
A fair history of the Web? Examining country balance in the Internet Archive
• Examined country balance in the Internet Archive:

| Country | Domain | Archived |
|---------|--------|----------|
| US | .com | 92% |
| Taiwan | .com.tw | 73% |
| China | .com.cn | 58% |
| Singapore | .com.sg | 73% |

• This work focused on TLD rather than content language or location
![Page 18: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/18.jpg)
Characterization of National Web Domains
• Used 10 national web domains
  - 120 million pages
  - 24 countries
  - They studied page sizes, degrees, link-based scores, etc.
  - They found that depth and response codes were similar
• In this work, additional methods are required to determine if a site belongs to a particular country
![Page 19: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/19.jpg)
Characterizing a National Community Web
• Used a Portuguese dataset:
  - Sites under the (.pt) ccTLD
  - Sites under (.com, .net, .org, .tv) in the Portuguese language that have at least one incoming link from the (.pt) ccTLD
• They identify, collect, and characterize the Portuguese Web
![Page 20: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/20.jpg)
![Page 21: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/21.jpg)
![Page 22: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/22.jpg)
![Page 23: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/23.jpg)
How do we classify Arabic websites?
Four categories: GeoIP only, ccTLD only, Both, Neither
• GeoIP only: News, al-watan.com. ccTLD: not Arabic (.com). GeoIP: Arabic country (Qatar)
• ccTLD only: E-Marketing, haraj.com.sa. ccTLD: Arabic (.sa). GeoIP: not an Arabic country (Ireland)
• Both: Educational, uoh.edu.sa. ccTLD: Arabic (.sa). GeoIP: Arabic country (SA)
• Neither: News, alarabiya.net. ccTLD: not Arabic (.net). GeoIP: not an Arabic country (US)
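The four categories can be sketched as a small function. This is an illustration under assumptions, not the authors' code. The ccTLD set is built from the 23 countries in the Internet-users table. The GeoIP country is passed in as a two-letter code, because the slides do not say which GeoIP service was used. The name `classify` is hypothetical.

```python
# ccTLDs of the 23 countries listed in the Internet-users table (assumed list).
ARABIC_CCTLDS = {
    "dz", "bh", "km", "dj", "eg", "iq", "jo", "kw", "lb", "ly", "mr", "ma",
    "om", "qa", "sa", "so", "ss", "sd", "sy", "tn", "ae", "ps", "ye",
}


def classify(hostname, geoip_country):
    """Return 'Both', 'ccTLD only', 'GeoIP only', or 'Neither'.

    geoip_country is the two-letter country code of the hosting IP address,
    obtained from whatever GeoIP lookup is available. For these countries
    the country code and the ccTLD are the same two letters.
    """
    tld = hostname.lower().rstrip(".").rsplit(".", 1)[-1]
    has_cctld = tld in ARABIC_CCTLDS
    has_geoip = geoip_country.lower() in ARABIC_CCTLDS
    if has_cctld and has_geoip:
        return "Both"
    if has_cctld:
        return "ccTLD only"
    if has_geoip:
        return "GeoIP only"
    return "Neither"


# The four examples from the slide.
for host, country in [("al-watan.com", "QA"), ("haraj.com.sa", "IE"),
                      ("uoh.edu.sa", "SA"), ("alarabiya.net", "US")]:
    print(host, "->", classify(host, country))
```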
![Page 24: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/24.jpg)
Selecting seed URIs

| Name | Registered | Year | URI | Count |
|------|------------|------|-----|-------|
| DMOZ | US | 1999 | Dmoz.org/world/arabic | 4,086 |
| Raddadi | Saudi Arabia | 2000 | Raddadi.com | 3,271 |
| Star28 | Lebanon | 2004 | Star28.com | 8,386 |
| Total | | | | 15,743 |

• 15,092 unique seed URIs
• 11,014 URIs that existed on the live web
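The 15,743 listed URIs reduce to 15,092 unique seeds once the three directories are merged. The slides do not say how duplicates were detected. The sketch below assumes a simple rule: compare lowercased host plus path, and ignore the scheme, a trailing slash, and any query string. The names `normalize` and `unique_seeds` are hypothetical.

```python
from urllib.parse import urlsplit


def normalize(uri):
    """Reduce a URI to a comparison key: lowercased host plus path."""
    if "://" not in uri:
        uri = "http://" + uri  # directory entries may omit the scheme
    parts = urlsplit(uri)
    return parts.netloc.lower() + parts.path.rstrip("/")


def unique_seeds(*sources):
    """Merge several URI lists, keeping the first spelling of each key."""
    seen = {}
    for source in sources:
        for uri in source:
            seen.setdefault(normalize(uri), uri)
    return list(seen.values())


# Made-up sample lists, only to show the merge.
list_a = ["http://Raddadi.com/", "star28.com"]
list_b = ["raddadi.com", "http://www.star28.com"]
print(unique_seeds(list_a, list_b))
```

Under this rule `www.star28.com` and `star28.com` count as different seeds. Whether the authors treated them that way is not stated.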
![Page 25: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/25.jpg)
Determining a webpage language
• HTTP header Content-Language
• HTML title tag language
• Trigram method
• Language detection API client
![Page 26: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/26.jpg)
HTTP header Content-Language example #1
    > curl -I www.alquds.com
    HTTP/1.1 200 OK
    Server: nginx/1.6.2
    Date: Wed, 03 Jun 2015 19:11:31 GMT
    Content-Type: text/html; charset=utf-8
    Connection: keep-alive
    X-Powered-By: PHP/5.3.3
    X-Drupal-Cache: HIT
    Etag: "1433361507-0"
    Content-Language: ar
    …
![Page 27: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/27.jpg)
![Page 28: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/28.jpg)
![Page 29: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/29.jpg)
HTTP header Content-Language example #2
The response has no Content-Language header; the language is declared only in the lang attribute of the html element.
    > curl -I www.raddadi.com
    HTTP/1.1 200 OK
    Server: nginx/1.8.0
    Date: Sat, 06 Jun 2015 22:47:09 GMT
    Content-Type: text/html
    Connection: keep-alive
    …
    > curl www.raddadi.com
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html dir="rtl" xmlns="http://www.w3.org/1999/xhtml" xml:lang="ar" lang="ar" >
    <head>
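The two Content-Language examples can be combined into one check. The sketch below uses only the Python standard library. It assumes the headers are already available as a dictionary and the page body as a string. Falling back to the html `lang` attribute is our illustration of example #2, not necessarily the authors' procedure. `declared_language` is a hypothetical name.

```python
from html.parser import HTMLParser


class _LangAttrParser(HTMLParser):
    """Records the lang attribute of the first <html> start tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def declared_language(headers, html_text=""):
    """Return the Content-Language header value, else the html lang attribute."""
    for name, value in headers.items():
        if name.lower() == "content-language":  # header names are case-insensitive
            return value.strip()
    parser = _LangAttrParser()
    parser.feed(html_text)
    return parser.lang


# Example #1: the header is present.
print(declared_language({"Content-Type": "text/html; charset=utf-8",
                         "Content-Language": "ar"}))
# Example #2: no header, so the lang attribute is used.
page = '<html dir="rtl" xmlns="http://www.w3.org/1999/xhtml" xml:lang="ar" lang="ar" ><head>'
print(declared_language({"Content-Type": "text/html"}, page))
```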
![Page 30: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/30.jpg)
![Page 31: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/31.jpg)
![Page 32: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/32.jpg)
HTML title tag language
    > curl www.star28.com
    …
    <META name="Copyright" content="© 2011 www.star28.com">
    <META name="DISTRIBUTION" content="GLOBAL">
    <META name="REVISIT-AFTER" content="1 DAYS">
    <TITLE> دليل العرب الشامل </TITLE>
    <META name="description" content=" دليل للمواقع العربية و أفضل المواقع العالمية, يحدث باستمرار ">
    <META name="keywords" content=" دليل مواقع, تجارة, اخبار, اسلام, كومبيوتر, علوم, منتديات, رياضة, جافا سكربت, العاب, توظيف, زواج, تعليم, سياحة, تلفزيون, صحف ">
    …
Then we use the guess-language Python library to determine the language of the title
https://code.google.com/p/guess-language/
![Page 33: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/33.jpg)
![Page 34: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/34.jpg)
HTML title tag language example #1
    # Fetch the page quietly, keep the text after "<title>", and save it
    > curl -s www.gulfup.com | grep -io "<title>[^<]*" | tail -c+8 > gulfup_title.txt
    > python
    >>> myfile = open("gulfup_title.txt", "r")
    >>> data = myfile.read()
    >>> from guess_language import guess_language
    >>> guess_language(data)
    'ar'
https://code.google.com/p/guess-language/
![Page 35: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/35.jpg)
![Page 36: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/36.jpg)
![Page 37: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/37.jpg)
HTML title tag language example #2
    # Fetch the page quietly, keep the text after "<title>", and save it
    > curl -s www.cnn.com | grep -io "<title>[^<]*" | tail -c+8 > cnn_title.txt
    > python
    >>> myfile = open("cnn_title.txt", "r")
    >>> data = myfile.read()
    >>> from guess_language import guess_language
    >>> guess_language(data)
    'en'
https://code.google.com/p/guess-language/
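The title extraction done above with `grep` and `tail` can also be sketched in pure Python. The parser below uses only the standard library. The `arabic_letter_ratio` helper is a crude stand-in of our own that counts letters in the basic Arabic Unicode block. It is not the guess-language library and not the method used in the slides.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text found inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the page title with surrounding whitespace collapsed."""
    parser = _TitleParser()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.parts).split())


def arabic_letter_ratio(text):
    """Fraction of letters that fall in the basic Arabic block (U+0600 to U+06FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF") / len(letters)


title = extract_title("<html><head><TITLE> دليل العرب الشامل </TITLE></head></html>")
print(title, arabic_letter_ratio(title))
```

In the workflow shown in the slides, the extracted title would be passed to `guess_language` instead of the ratio helper.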
![Page 38: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/38.jpg)
![Page 39: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/39.jpg)
Trigram method
• Built in C++ and wrapped as a Python module
• Identification is performed through basic trigram lookups paired with Unicode character set recognition
• Accuracy is high even for short sample texts
https://github.com/decultured/Python-Language-Detector
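To give a feel for what a trigram lookup does, here is a toy sketch. It is not the Python-Language-Detector implementation: the two "training" sentences, the overlap score, and the function names are all assumptions for illustration, and a real detector uses large per-language trigram tables.

```python
from collections import Counter


def trigrams(text):
    """Count overlapping three-character sequences in whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def build_profiles(samples):
    """Map each language code to the trigram counts of its sample text."""
    return {lang: trigrams(sample) for lang, sample in samples.items()}


def identify(text, profiles):
    """Return the language whose profile shares the most trigrams with text."""
    grams = trigrams(text)

    def score(profile):
        return sum(count for gram, count in grams.items() if gram in profile)

    return max(profiles, key=lambda lang: score(profiles[lang]))


# Tiny illustrative samples; far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the news of the world",
    "ar": "دليل للمواقع العربية وأفضل المواقع العالمية يحدث باستمرار",
}
profiles = build_profiles(SAMPLES)
print(identify("the world news", profiles))
print(identify("دليل المواقع العربية", profiles))
```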
![Page 40: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/40.jpg)
![Page 41: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/41.jpg)
Trigram method example #1
    > curl www.raddadi.com > raddadi.txt
    > python
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(open("raddadi.txt"), "html.parser")
    >>> # Drop script and style elements so only visible text remains
    >>> for script in soup(["script", "style"]):
    ...     script.extract()
    ...
    >>> text = soup.get_text()
    >>> # Strip each line, split it into phrases, and drop empty chunks
    >>> lines = (line.strip() for line in text.splitlines())
    >>> chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    >>> text = '\n'.join(chunk for chunk in chunks if chunk)
    >>> import sys
    >>> sys.path.append('languageDetector')
    >>> import languageIdentifier
    >>> languageIdentifier.load("languageDetector/trigrams/")
    >>> print(languageIdentifier.identify(text, 300, 300))
    ar
https://github.com/decultured/Python-Language-Detector
![Page 42: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/42.jpg)
![Page 43: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/43.jpg)
![Page 44: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/44.jpg)
Trigram method example #2
    > curl www.cnn.com > cnn.txt
    > python
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(open("cnn.txt"), "html.parser")
    >>> # Drop script and style elements so only visible text remains
    >>> for script in soup(["script", "style"]):
    ...     script.extract()
    ...
    >>> text = soup.get_text()
    >>> # Strip each line, split it into phrases, and drop empty chunks
    >>> lines = (line.strip() for line in text.splitlines())
    >>> chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    >>> text = '\n'.join(chunk for chunk in chunks if chunk)
    >>> import sys
    >>> sys.path.append('languageDetector')
    >>> import languageIdentifier
    >>> languageIdentifier.load("languageDetector/trigrams/")
    >>> print(languageIdentifier.identify(text, 300, 300))
    en
https://github.com/decultured/Python-Language-Detector
![Page 45: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/45.jpg)
![Page 46: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/46.jpg)
![Page 47: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/47.jpg)
Language detection API client
• Returns detected language codes and scores
• You have to set up your personal API key (http://detectlanguage.com)
• Example of output:
    {"data":{"detections":[{"language":"ar","isReliable":true,"confidence":9.54}]}}
• language: the language code
• isReliable: false means that the confidence is low
• confidence: depends on how much text you pass and how well it is identified
https://detectlanguage.com
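A response in the format shown above can be reduced to a single language code. The sketch below assumes the JSON layout from the example output. The rule of keeping only reliable detections and taking the highest confidence is our assumption, and `best_detection` is a hypothetical helper, not part of the API client.

```python
import json


def best_detection(response_text):
    """Return the language code of the most confident reliable detection, or None."""
    detections = json.loads(response_text)["data"]["detections"]
    reliable = [d for d in detections if d.get("isReliable")]
    if not reliable:
        return None
    return max(reliable, key=lambda d: d["confidence"])["language"]


# The example output from the slide.
sample = '{"data":{"detections":[{"language":"ar","isReliable":true,"confidence":9.54}]}}'
print(best_detection(sample))
```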
![Page 48: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/48.jpg)
![Page 49: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/49.jpg)
Language detection API client example #1
    > curl www.raddadi.com > raddadi.txt
    > python
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(open("raddadi.txt"), "html.parser")
    >>> # Drop script and style elements so only visible text remains
    >>> for script in soup(["script", "style"]):
    ...     script.extract()
    ...
    >>> text = soup.get_text()
    >>> lines = (line.strip() for line in text.splitlines())
    >>> chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    >>> text = '\n'.join(chunk for chunk in chunks if chunk)
    >>> import detectlanguage
    >>> detectlanguage.configuration.api_key = "YOUR API KEY"
    >>> detectlanguage.detect(text)
    {"data":{"detections":[{"language":"ar","isReliable":true,"confidence":8.32},{"language":"tk","isReliable":false,"confidence":0.01}]}}
https://detectlanguage.com
![Page 50: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/50.jpg)
![Page 51: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/51.jpg)
![Page 52: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/52.jpg)
Language detection API client example #2
    > curl www.cnn.com > cnn.txt
    > python
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(open("cnn.txt"), "html.parser")
    >>> # Drop script and style elements so only visible text remains
    >>> for script in soup(["script", "style"]):
    ...     script.extract()
    ...
    >>> text = soup.get_text()
    >>> lines = (line.strip() for line in text.splitlines())
    >>> chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    >>> text = '\n'.join(chunk for chunk in chunks if chunk)
    >>> import detectlanguage
    >>> detectlanguage.configuration.api_key = "YOUR API KEY"
    >>> detectlanguage.detect(text)
    {"data":{"detections":[{"language":"en","isReliable":true,"confidence":6.14}]}}
https://detectlanguage.com
![Page 53: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/53.jpg)
![Page 54: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/54.jpg)
Language test intersection testing for Arabic language
~41%
![Page 55: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/55.jpg)
~38%
![Page 56: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/56.jpg)
~36%
![Page 57: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/57.jpg)
~39%
![Page 58: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/58.jpg)
872
~8%
![Page 59: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/59.jpg)
Total Arabic = 7,976
![Page 60: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/60.jpg)
Crawling Arabic seed URIs
Unique: 663,443
![Page 61: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/61.jpg)
![Page 62: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/62.jpg)
![Page 63: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/63.jpg)
Total Arabic URIs dataset = 7,976 + 292,670 = 300,646
![Page 64: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/64.jpg)
17,536 unique domains

| Rank | Domain | URIs | GeoIP | Category |
| --- | --- | --- | --- | --- |
| 1 | Alarab.net | 284 | US | News |
| 2 | Aljarida.com | 248 | US | News |
| 3 | Arabic.cnn.com | 245 | US | News |
| 4 | Alarabiya.net | 231 | US | News |
| 5 | Ar.wikipedia.org | 230 | US | Encyclopedia |
| 6 | Aljazeera.net | 213 | US | News |
| 7 | Moheet.com | 142 | US | News |
| 8 | Facebook.com | 133 | US | Social |
| 9 | Al-sharq.com | 132 | US | Middle East Portal |
| 10 | Lakii.com | 123 | US | General Portal |
| 17 | Kuwaitclub.com.kw | 71 | Kuwait | Sport |
![Page 65: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/65.jpg)
First Arabic GeoIP location is at rank 17
![Page 66: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/66.jpg)
6 of the top 10 unique domains are news websites
![Page 67: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/67.jpg)
Popular Western pages are in the top unique domains
![Page 68: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/68.jpg)
| TLD | Percent |
| --- | --- |
| com | 57.97% |
| net | 15.07% |
| org | 6.40% |
| gov.sa | 1.94% |
| info | 1.68% |
| edu.sa | 1.27% |
| ws | 1.16% |
| org.sa | 0.97% |
| com.sa | 0.80% |
| gov.eg | 0.80% |
| Other | 11.94% |

Almost 58% are .com
![Page 69: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/69.jpg)
![Page 70: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/70.jpg)
Small percentage of Arabic TLDs
![Page 71: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/71.jpg)
Small percentage of Arabic TLDs

| TLD | Country | Percent |
| --- | --- | --- |
| .sa | Saudi Arabia | 5.33% |
| .eg | Egypt | 2.00% |
| .jo | Jordan | 2.00% |
| .ae | United Arab Emirates | 1.06% |
| .kw | Kuwait | 0.82% |
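A check for an Arabic ccTLD could look roughly like the sketch below. It is only an illustration: the set contains just the five ccTLDs listed on the slide, the function name `has_arabic_cctld` is hypothetical, and the URI is assumed to include a scheme so that the hostname can be parsed.

```python
from urllib.parse import urlsplit

# Illustrative subset only: the five ccTLDs named on the slide.
ARABIC_CCTLDS = {"sa", "eg", "jo", "ae", "kw"}


def has_arabic_cctld(uri: str) -> bool:
    """Return True if the URI's hostname ends in one of the listed ccTLDs."""
    # urlsplit only finds a hostname when the URI has a scheme (http://...).
    host = urlsplit(uri).hostname or ""
    last_label = host.rsplit(".", 1)[-1].lower()
    return last_label in ARABIC_CCTLDS


print(has_arabic_cctld("http://www.kuwaitclub.com.kw/"))  # hostname ends in .kw
print(has_arabic_cctld("http://aljazeera.net/"))          # generic TLD
```

Only the last hostname label is inspected, so gov.sa, edu.sa and com.sa all count as .sa.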
![Page 72: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/72.jpg)
![Page 73: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/73.jpg)
| Path Depth | Example | Percent |
| --- | --- | --- |
| 0 | Example.com | 17.30% |
| 1 | Example.com/a | 40.42% |
| 2 | Example.com/a/b | 24.45% |
| 3 | Example.com/a/b/c | 10.81% |
| 4+ | Example.com/a/b/c/d | 7.02% |

More than 57% are of depth 0 or 1
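Path depth as illustrated in the table can be computed by counting non-empty path segments. This is a minimal sketch; the function name `path_depth` is hypothetical, and how the authors treated trailing slashes or query strings is not stated, so ignoring them here is an assumption.

```python
from urllib.parse import urlsplit


def path_depth(uri: str) -> int:
    """Count non-empty path segments: example.com -> 0, example.com/a/b -> 2."""
    path = urlsplit(uri).path
    # Splitting "/a/b" on "/" yields ["", "a", "b"]; empty pieces are dropped,
    # so a trailing slash does not add a level.
    return len([segment for segment in path.split("/") if segment])


for uri in ("http://example.com", "http://example.com/a", "http://example.com/a/b/c"):
    print(uri, path_depth(uri))
```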
![Page 74: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/74.jpg)
![Page 75: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/75.jpg)
53.77% of Arabic URIs are archived
• January-March 2015
• ODU CS Memento Aggregator
Median=16
![Page 76: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/76.jpg)
| URI-R | Mementos | Category |
| --- | --- | --- |
| gulfup.com | 10,987 | File Sharing |
| masrawy.com | 9,144 | Egyptian portal |
| arabic.cnn.com | 9,022 | News |
| aljazeera.net | 8,906 | News |
| maktoob.yahoo.com | 8,478 | Search Engine |
| shorooknews.com | 7,548 | News |
| arabnews.com | 6,274 | News |
| bbc.co.uk/arabic | 6,268 | News |
| ahram.org.eg | 5,347 | News |
| google.com.sa | 4,968 | Search Engine |

Most of the top archived URI-Rs are news websites
![Page 77: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/77.jpg)
![Page 78: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/78.jpg)
Archiving has accelerated since 2011
![Page 79: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/79.jpg)
March 2015
![Page 80: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/80.jpg)
Two methods to determine the presence in each archive
1. Percent of URI-Rs (original resources) present in each archive
   e.g. http://aljazeera.net
2. Percent of URI-Ms (mementos, i.e. archived copies) present in each archive
   e.g. http://wayback.archive-it.org/all/20070727215420/http://www.aljazeera.net/
   e.g. http://web.archive.org/web/20150618104846/http://aljazeera.net/
![Page 81: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/81.jpg)
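Presence in each archive example

| URI-R | Internet Archive | Archive.today | Webcitation | Total |
| --- | --- | --- | --- | --- |
| URI-R1 | 2 | 0 | 0 | 2 |
| URI-R2 | 2 | 0 | 0 | 2 |
| URI-R3 | 1 | 1 | 0 | 2 |
| URI-R4 | 1 | 1 | 0 | 2 |
| URI-R5 | 0 | 1 | 1 | 2 |
| Total | 6 | 3 | 1 | 10 |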
![Page 82: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/82.jpg)
1- Percent of URI-Rs present in each archive

| Archive | URI-Rs present / total | Percentage |
| --- | --- | --- |
| Internet Archive | 4/5 = 0.8 | 80% |
| Archive.today | 3/5 = 0.6 | 60% |
| Webcitation | 1/5 = 0.2 | 20% |
| Total | | 160% |

The total exceeds 100% because one URI-R can be present in more than one archive.
![Page 83: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/83.jpg)
2- Percent of URI-Ms present in each archive

| Archive | URI-Ms in archive / total | Percentage |
| --- | --- | --- |
| Internet Archive | 6/10 = 0.6 | 60% |
| Archive.today | 3/10 = 0.3 | 30% |
| Webcitation | 1/10 = 0.1 | 10% |
| Total | | 100% |
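The two methods can be reproduced on the five-URI example with a few lines of Python. The memento counts are copied from the example table; the function names `percent_uri_rs` and `percent_uri_ms` are hypothetical and this is a sketch, not the authors' code.

```python
# Memento counts per URI-R and archive, copied from the example table.
counts = {
    "URI-R1": {"Internet Archive": 2, "Archive.today": 0, "Webcitation": 0},
    "URI-R2": {"Internet Archive": 2, "Archive.today": 0, "Webcitation": 0},
    "URI-R3": {"Internet Archive": 1, "Archive.today": 1, "Webcitation": 0},
    "URI-R4": {"Internet Archive": 1, "Archive.today": 1, "Webcitation": 0},
    "URI-R5": {"Internet Archive": 0, "Archive.today": 1, "Webcitation": 1},
}
archives = ["Internet Archive", "Archive.today", "Webcitation"]


def percent_uri_rs(counts, archive):
    """Method 1: share of URI-Rs with at least one memento in the archive."""
    present = sum(1 for row in counts.values() if row[archive] > 0)
    return 100 * present / len(counts)


def percent_uri_ms(counts, archive):
    """Method 2: share of all mementos (URI-Ms) held by the archive."""
    total = sum(sum(row.values()) for row in counts.values())
    in_archive = sum(row[archive] for row in counts.values())
    return 100 * in_archive / total


by_uri_r = {a: percent_uri_rs(counts, a) for a in archives}
by_uri_m = {a: percent_uri_ms(counts, a) for a in archives}
print(by_uri_r, sum(by_uri_r.values()))  # method 1 can sum to more than 100
print(by_uri_m, sum(by_uri_m.values()))  # method 2 always sums to 100
```

Method 1 counts each URI-R once per archive that holds it, so overlapping archives push the total past 100%. Method 2 partitions the mementos, so its percentages always sum to 100%.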
![Page 84: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/84.jpg)
Presence in each archive

1- Percent of URI-Rs present in each archive

| Archive | Percent |
| --- | --- |
| Internet Archive | 97.04% |
| Archive.today | 6.58% |
| Webcitation | 6.00% |
| Archive-It | 5.49% |
| British Library Archive | 1.06% |
| UK Parliament Web Archive | 0.88% |
| Icelandic Web Archive | 0.87% |
| UK National Archives | 0.62% |
| Proni | 0.21% |
| Stanford | 0.11% |
| Total | 118.86% |

2- Percent of URI-Ms present in each archive

| Archive | Percent |
| --- | --- |
| Internet Archive | 72.87% |
| Archive-It | 21.26% |
| Archive.today | 2.14% |
| Webcitation | 2.08% |
| Icelandic Web Archive | 1.17% |
| British Library Archive | 0.29% |
| UK Parliament Web Archive | 0.10% |
| Proni | 0.05% |
| UK National Archives | 0.04% |
| Stanford | <0.01% |
| Total | 100% |
![Page 85: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/85.jpg)
![Page 86: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/86.jpg)
![Page 87: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/87.jpg)
Average archiving period (days)
Average archiving period = (LM - FM) / number of mementos (LM = last memento, FM = first memento)
16,732 URIs have only one memento
Median=48 days
![Page 88: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/88.jpg)
Values less than 1 indicate that the URI is archived multiple times per day
The larger the period, the more irregularly the URI was captured by the archives
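The average archiving period can be sketched as follows. LM and FM are taken here to be the datetimes of the last and first memento, which is an assumption about the slide's abbreviations. The function name is hypothetical, and returning 0 for a URI with a single memento is a choice made for this sketch only.

```python
from datetime import datetime


def average_archiving_period(memento_datetimes):
    """Return (LM - FM) / number of mementos, expressed in days."""
    if not memento_datetimes:
        raise ValueError("at least one memento is required")
    first = min(memento_datetimes)  # FM
    last = max(memento_datetimes)   # LM
    span_days = (last - first).total_seconds() / 86400
    return span_days / len(memento_datetimes)


# Three captures spread over 60 days.
sparse = [datetime(2015, 1, 1), datetime(2015, 1, 31), datetime(2015, 3, 2)]
# Four captures within twelve hours: the result drops below 1 day.
dense = [datetime(2015, 3, 1, hour) for hour in (0, 4, 8, 12)]

print(average_archiving_period(sparse))
print(average_archiving_period(dense))
```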
![Page 89: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/89.jpg)
Creation date for archived Arabic URIs
Source: http://ws-dl.blogspot.com/2014/11/2014-11-14-carbon-dating-web-version-20.html
We used CarbonDate for creation date estimate
![Page 90: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/90.jpg)
18 years
![Page 91: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/91.jpg)
2013 is the most frequent year
![Page 92: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/92.jpg)
Top GeoIP locations

| Location | Percent |
| --- | --- |
| United States | 57.97% |
| Arabic Countries | 10.53% |
| Germany | 9.75% |
| Netherlands | 5.29% |
| France | 4.37% |
| Canada | 3.31% |
| United Kingdom | 3.07% |
| Other | 5.71% |
![Page 93: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/93.jpg)
![Page 94: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/94.jpg)
| Arabic country | Percent |
| --- | --- |
| Saudi Arabia | 4.75% |
| Egypt | 1.97% |
| Jordan | 1.42% |
| Kuwait | 0.71% |
| United Arab Emirates | 0.67% |
![Page 95: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/95.jpg)
![Page 96: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/96.jpg)
Status of Arabic seed URIs

Seed Data Set

| (Live, Indexed, Archived) | Percent |
| --- | --- |
| (1, 1, 1) | 43.34% |
| (1, 1, 0) | 25.59% |
| (1, 0, 1) | 15.27% |
| (1, 0, 0) | 15.76% |
![Page 97: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/97.jpg)
(1, 1, 1): (Good) discovered and saved
![Page 98: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/98.jpg)
(1, 0, 0): (Bad) undiscovered and not saved
![Page 99: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/99.jpg)
31% were not indexed by Google (15.27% + 15.76%)
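The 31% figure is the sum of the two rows whose Indexed flag is 0. A small sketch of that tally, with the percentages copied from the seed data set table and a hypothetical helper name `percent_where`:

```python
# Percentages from the seed data set table, keyed by (live, indexed, archived).
status_percent = {
    (1, 1, 1): 43.34,
    (1, 1, 0): 25.59,
    (1, 0, 1): 15.27,
    (1, 0, 0): 15.76,
}


def percent_where(status_percent, indexed=None, archived=None):
    """Sum the rows whose flags match the given values (None means 'any')."""
    total = 0.0
    for (_live, idx, arch), pct in status_percent.items():
        if indexed is not None and idx != indexed:
            continue
        if archived is not None and arch != archived:
            continue
        total += pct
    return round(total, 2)


not_indexed = percent_where(status_percent, indexed=0)
print(not_indexed)  # rows (1, 0, 1) and (1, 0, 0)
```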
![Page 100: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/100.jpg)
Difference between creation date and first memento
18% have an estimated creation date more than 1 year before the first memento was archived
19.48% of the URIs have an estimated creation date that is the same as the first memento date
![Page 101: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/101.jpg)
| Seed Data Set | Arabic | Archived | Indexed |
| --- | --- | --- | --- |
| DMOZ | 34.43% | 95.52% | 82.13% |
| Raddadi | 19.88% | 45.44% | 65.83% |
| Star28 | 45.69% | 41.54% | 65.23% |

DMOZ URIs are more likely to be found and archived
![Page 102: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/102.jpg)
![Page 103: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/103.jpg)
![Page 104: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/104.jpg)
Full Data Set

| Category | Total | Archived |
| --- | --- | --- |
| Arabic | 33.18% | 33.56% |
| Neither | 66.82% | 65.22% |

| Category | Total | Archived |
| --- | --- | --- |
| AR ccTLD | 14.84% | 28.09% |
| AR GeoIP | 10.53% | 13.11% |
| AR both | 7.81% | 59.50% |
| Neither | 66.82% | 65.22% |

URIs hosted in Western countries are more likely to be archived
![Page 105: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/105.jpg)
![Page 106: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/106.jpg)
Seed Data Set

| Category | Total | Indexed |
| --- | --- | --- |
| Arabic | 15.01% | 78.29% |
| Neither | 84.99% | 65.22% |

| Category | Total | Indexed |
| --- | --- | --- |
| AR ccTLD | 6.61% | 76.09% |
| AR GeoIP | 2.37% | 73.54% |
| AR both | 6.03% | 85.24% |
| Neither | 84.99% | 67.09% |

URIs that had some Arabic location had a higher indexing rate
![Page 107: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/107.jpg)
![Page 108: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/108.jpg)
The spread of mementos was not affected by location or ccTLD
➢ Kolmogorov-Smirnov test

| Category | Mean |
| --- | --- |
| Ar GeoIP | 0.5010 |
| Ar ccTLD | 0.5013 |
| Both | 0.5016 |
| Neither | 0.5005 |

| Comparison | D-Value | P-Value |
| --- | --- | --- |
| Ar ccTLD vs. neither | 0.017 | <0.002 |
| Ar GeoIP vs. neither | 0.014 | <0.002 |
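The D-value reported by a two-sample Kolmogorov-Smirnov test is the largest gap between the two empirical distribution functions. The sketch below computes only that statistic with the standard library; the sample values are made up for illustration and are not the study's data. A library routine such as `scipy.stats.ks_2samp` would also report a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS D: the maximum distance between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value that occurs in either sample.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical spread values for two categories.
group_one = [0.42, 0.48, 0.50, 0.53, 0.57]
group_two = [0.41, 0.47, 0.51, 0.52, 0.58]
print(ks_statistic(group_one, group_two))
```

A D-value near 0, such as the 0.017 and 0.014 on the slide, means the two distributions are almost identical in shape.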
![Page 109: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/109.jpg)
Just because a webpage is older does not mean that it is archived more
Because of low historical archiving rates
![Page 110: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/110.jpg)
We look at the last three years
![Page 111: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/111.jpg)
![Page 112: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/112.jpg)
In the last three years, the older the resource is, the more mementos it has
![Page 113: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/113.jpg)
| Path Depth | Full Data Set: Total | Full Data Set: Archived | Seed Data Set: Total | Seed Data Set: Indexed |
| --- | --- | --- | --- | --- |
| 0 | 17.30% | 86.29% | 86.05% | 74.60% |
| 1 | 40.42% | 53.49% | 9.77% | 38.91% |
| 2 | 24.45% | 45.57% | 3.72% | 17.85% |
| 3+ | 17.83% | 34.24% | 0.50% | 57.50% |

Top level URIs are more likely to be archived and indexed
![Page 114: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/114.jpg)
![Page 115: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/115.jpg)
![Page 116: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/116.jpg)
Summary of collection methods
• Collected URIs from three Arabic directories (7,976):
  ➢ DMOZ
  ➢ Raddadi.com
  ➢ Star28.com
• Crawl seed dataset (1,299,671)
• Check if they are unique (663,443)
• Check if they are live (482,905)
• Check for Arabic language (300,646)
![Page 117: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/117.jpg)
Findings
• Our Arabic language dataset was not largely located in Arabic countries
  ➢ Only 14.84% had an Arabic ccTLD
  ➢ Only 10.53% had a GeoIP in an Arabic country
  ➢ Popular Western domains (e.g., cnn.com, wikipedia.org) appeared in the top 10
• Arabic webpages are not particularly well archived or indexed
  ➢ 46% were not archived
  ➢ 31% were not indexed by Google
• An Arabic webpage is more likely to be...
  ➢ indexed if it is present in a directory
  ➢ archived if it is present in DMOZ
  ➢ archived if it has neither Arabic GeoIP nor Arabic ccTLD

For right now, if you want your Arabic language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ
![Page 118: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/118.jpg)
![Page 119: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/119.jpg)
Backup Slides
![Page 120: JCDL2015: How Well are Arabic Websites Archived?](https://reader038.fdocuments.in/reader038/viewer/2022110315/55cd8a25bb61ebc2448b45f7/html5/thumbnails/120.jpg)
GeoIP Location
• We obtained the IP addresses of the hostnames using nslookup (which uses DNS to convert the hostname to its IP address)
• We used the MaxMind GeoLite2 database to determine location from the IP address (which tests at 99.8% accuracy at the country level)
http://dev.maxmind.com/geoip/geoip2/geolite2/
http://dev.maxmind.com/faq/how-accurate-are-the-geoip-databases/
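The two steps above (hostname to IP, then IP to country) can be sketched in Python. This is not the authors' tooling: `socket.gethostbyname` stands in for nslookup, and the country lookup is passed in as a caller-supplied function so that the sketch does not depend on a particular database file. The names `resolve_ip` and `country_for_host` are hypothetical.

```python
import socket


def resolve_ip(hostname, resolver=socket.gethostbyname):
    """DNS lookup: hostname -> IPv4 address string."""
    return resolver(hostname)


def country_for_host(hostname, country_lookup, resolver=socket.gethostbyname):
    """Resolve the hostname, then map its IP to a country code.

    country_lookup is any function taking an IP string and returning a
    country code. With the third-party geoip2 package and a downloaded
    GeoLite2 country database (both assumptions here), it could be:
        reader = geoip2.database.Reader("GeoLite2-Country.mmdb")
        country_lookup = lambda ip: reader.country(ip).country.iso_code
    """
    return country_lookup(resolve_ip(hostname, resolver))


# Offline demonstration with stand-in lookup tables (made-up values).
fake_dns = {"example.test": "192.0.2.10"}.get
fake_geo = {"192.0.2.10": "KW"}.get
print(country_for_host("example.test", fake_geo, resolver=fake_dns))
```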