Accessing and Extracting Data from Internet Using SAS Group Presentation… · Accessing and...

104
Outline Introduction Web Elements Tools Examples Accessing and Extracting Data from Internet Using SAS George Zhu, Sunita Ghosh Alberta Health Services - Cancer Care Oct 26, 2011 Edmonton SAS User Group (eSUG) Meeting George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Transcript of Accessing and Extracting Data from Internet Using SAS Group Presentation… · Accessing and...

Outline Introduction Web Elements Tools Examples

Accessing and Extracting Data from InternetUsing SAS

George Zhu, Sunita Ghosh

Alberta Health Services - Cancer Care

Oct 26, 2011Edmonton SAS User Group (eSUG) Meeting

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

1 Introduction

2 Web ElementsURLsHTML

3 ToolsSAS FunctionsSAS StatementscURLPerl/LWP

4 ExamplesExample 1: Download .csv fileExample 2: Get the list of eSUG presentationsExample 3: Find out all registered clinical trialsExample 4: Get and plot today’s temperatureExample 5: Download City of Edmonton job postings

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

Introduction

Suppose you want to:Download financial time series from FRED, Yahoofinancial, OANDA, etc.Obtain a list of clinical trials conducted in Canada from thewww.clinicaltrials.gov websiteDownload presentations from eSUG webpage, or allProceedings collected in Lex Jansen’s websiteMonitor career websites and notify you if suitable positionsare availableGet Twitter feeds on some hot topics and save them indata sets for further analysis or text mining.And many more...

Question: Is it possible to do these using SAS?

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

Introduction

Suppose you want to:Download financial time series from FRED, Yahoofinancial, OANDA, etc.Obtain a list of clinical trials conducted in Canada from thewww.clinicaltrials.gov websiteDownload presentations from eSUG webpage, or allProceedings collected in Lex Jansen’s websiteMonitor career websites and notify you if suitable positionsare availableGet Twitter feeds on some hot topics and save them indata sets for further analysis or text mining.And many more...

Question: Is it possible to do these using SAS?

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

Introduction

Suppose you want to:Download financial time series from FRED, Yahoofinancial, OANDA, etc.Obtain a list of clinical trials conducted in Canada from thewww.clinicaltrials.gov websiteDownload presentations from eSUG webpage, or allProceedings collected in Lex Jansen’s websiteMonitor career websites and notify you if suitable positionsare availableGet Twitter feeds on some hot topics and save them indata sets for further analysis or text mining.And many more...

Question: Is it possible to do these using SAS?

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

Introduction

Suppose you want to:Download financial time series from FRED, Yahoofinancial, OANDA, etc.Obtain a list of clinical trials conducted in Canada from thewww.clinicaltrials.gov websiteDownload presentations from eSUG webpage, or allProceedings collected in Lex Jansen’s websiteMonitor career websites and notify you if suitable positionsare availableGet Twitter feeds on some hot topics and save them indata sets for further analysis or text mining.And many more...

Question: Is it possible to do these using SAS?

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

Introduction

Suppose you want to:Download financial time series from FRED, Yahoofinancial, OANDA, etc.Obtain a list of clinical trials conducted in Canada from thewww.clinicaltrials.gov websiteDownload presentations from eSUG webpage, or allProceedings collected in Lex Jansen’s websiteMonitor career websites and notify you if suitable positionsare availableGet Twitter feeds on some hot topics and save them indata sets for further analysis or text mining.And many more...

Question: Is it possible to do these using SAS?

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

Introduction

Suppose you want to:Download financial time series from FRED, Yahoofinancial, OANDA, etc.Obtain a list of clinical trials conducted in Canada from thewww.clinicaltrials.gov websiteDownload presentations from eSUG webpage, or allProceedings collected in Lex Jansen’s websiteMonitor career websites and notify you if suitable positionsare availableGet Twitter feeds on some hot topics and save them indata sets for further analysis or text mining.And many more...

Question: Is it possible to do these using SAS?

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

Introduction

Suppose you want to:Download financial time series from FRED, Yahoofinancial, OANDA, etc.Obtain a list of clinical trials conducted in Canada from thewww.clinicaltrials.gov websiteDownload presentations from eSUG webpage, or allProceedings collected in Lex Jansen’s websiteMonitor career websites and notify you if suitable positionsare availableGet Twitter feeds on some hot topics and save them indata sets for further analysis or text mining.And many more...

Question: Is it possible to do these using SAS?

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

Introduction

Suppose you want to:Download financial time series from FRED, Yahoofinancial, OANDA, etc.Obtain a list of clinical trials conducted in Canada from thewww.clinicaltrials.gov websiteDownload presentations from eSUG webpage, or allProceedings collected in Lex Jansen’s websiteMonitor career websites and notify you if suitable positionsare availableGet Twitter feeds on some hot topics and save them indata sets for further analysis or text mining.And many more...

Question: Is it possible to do these using SAS?

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

Introduction

Yes!Mostly, but not for all web accessing with SAS only

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

Introduction

Yes!Mostly, but not for all web accessing with SAS only

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

Introduction

SAS is capable for web miningSAS has its own web access tools: filename url,filename socket, etc.SAS has data extraction tools: character string functionsand powerful Perl Regular Expression functions and callprocessSAS provides two mechanisms to integrate specializedweb access programs to get the work done: X statement,filename pipe statement

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

Introduction

SAS is capable for web miningSAS has its own web access tools: filename url,filename socket, etc.SAS has data extraction tools: character string functionsand powerful Perl Regular Expression functions and callprocessSAS provides two mechanisms to integrate specializedweb access programs to get the work done: X statement,filename pipe statement

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

Introduction

SAS is capable for web miningSAS has its own web access tools: filename url,filename socket, etc.SAS has data extraction tools: character string functionsand powerful Perl Regular Expression functions and callprocessSAS provides two mechanisms to integrate specializedweb access programs to get the work done: X statement,filename pipe statement

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples

Introduction

SAS is capable for web miningSAS has its own web access tools: filename url,filename socket, etc.SAS has data extraction tools: character string functionsand powerful Perl Regular Expression functions and callprocessSAS provides two mechanisms to integrate specializedweb access programs to get the work done: X statement,filename pipe statement

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

Elements of Web Accessing

Web Elements

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

Elements of Web Accessing

For Web accessing, we need to know some basic elementsabout internet:

Where is the information located: URL (UniversalResource Locator)How the information is organized: HTML (HypertextMarkup Language)How to communicate (send and receive) with the serverhaving the information: HTTP (Hypertext TransferProtocol), HTTPS (secured HTTP), FTP, and moreRequest header and response header: These are critialinformation for more complicated websites, such ascookies, response status codes, etc.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

Elements of Web Accessing

For Web accessing, we need to know some basic elementsabout internet:

Where is the information located: URL (UniversalResource Locator)How the information is organized: HTML (HypertextMarkup Language)How to communicate (send and receive) with the serverhaving the information: HTTP (Hypertext TransferProtocol), HTTPS (secured HTTP), FTP, and moreRequest header and response header: These are critialinformation for more complicated websites, such ascookies, response status codes, etc.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

Elements of Web Accessing

For Web accessing, we need to know some basic elementsabout internet:

Where is the information located: URL (UniversalResource Locator)How the information is organized: HTML (HypertextMarkup Language)How to communicate (send and receive) with the serverhaving the information: HTTP (Hypertext TransferProtocol), HTTPS (secured HTTP), FTP, and moreRequest header and response header: These are critialinformation for more complicated websites, such ascookies, response status codes, etc.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

Elements of Web Accessing

For Web accessing, we need to know some basic elementsabout internet:

Where is the information located: URL (UniversalResource Locator)How the information is organized: HTML (HypertextMarkup Language)How to communicate (send and receive) with the serverhaving the information: HTTP (Hypertext TransferProtocol), HTTPS (secured HTTP), FTP, and moreRequest header and response header: These are critialinformation for more complicated websites, such ascookies, response status codes, etc.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

Elements of Web Accessing

For Web accessing, we need to know some basic elementsabout internet:

Where is the information located: URL (UniversalResource Locator)How the information is organized: HTML (HypertextMarkup Language)How to communicate (send and receive) with the serverhaving the information: HTTP (Hypertext TransferProtocol), HTTPS (secured HTTP), FTP, and moreRequest header and response header: These are critialinformation for more complicated websites, such ascookies, response status codes, etc.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

This is the address we type or displayed in the address barin the web browser, such as:http://www.sas.com/offices/NA/canada/en/edmonton.html

http://maps.google.com/maps?q=edmonton

Usually has 3 components:Transfer Protocol: http, https, ftp, etc.IP Address or hostname: www.sas.com andmaps.google.comThe path and file name of the webpage on the host (webserver): /offices/NA/canada/en (path), edmonton.html(filename).The third part can also be a function name (maps) andpairs of parameter=value (q=edmonton).

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

This is the address we type or displayed in the address barin the web browser, such as:http://www.sas.com/offices/NA/canada/en/edmonton.html

http://maps.google.com/maps?q=edmonton

Usually has 3 components:Transfer Protocol: http, https, ftp, etc.IP Address or hostname: www.sas.com andmaps.google.comThe path and file name of the webpage on the host (webserver): /offices/NA/canada/en (path), edmonton.html(filename).The third part can also be a function name (maps) andpairs of parameter=value (q=edmonton).

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

This is the address we type or displayed in the address barin the web browser, such as:http://www.sas.com/offices/NA/canada/en/edmonton.html

http://maps.google.com/maps?q=edmonton

Usually has 3 components:Transfer Protocol: http, https, ftp, etc.IP Address or hostname: www.sas.com andmaps.google.comThe path and file name of the webpage on the host (webserver): /offices/NA/canada/en (path), edmonton.html(filename).The third part can also be a function name (maps) andpairs of parameter=value (q=edmonton).

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

This is the address we type or displayed in the address barin the web browser, such as:http://www.sas.com/offices/NA/canada/en/edmonton.html

http://maps.google.com/maps?q=edmonton

Usually has 3 components:Transfer Protocol: http, https, ftp, etc.IP Address or hostname: www.sas.com andmaps.google.comThe path and file name of the webpage on the host (webserver): /offices/NA/canada/en (path), edmonton.html(filename).The third part can also be a function name (maps) andpairs of parameter=value (q=edmonton).

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

This is the address we type or displayed in the address barin the web browser, such as:http://www.sas.com/offices/NA/canada/en/edmonton.html

http://maps.google.com/maps?q=edmonton

Usually has 3 components:Transfer Protocol: http, https, ftp, etc.IP Address or hostname: www.sas.com andmaps.google.comThe path and file name of the webpage on the host (webserver): /offices/NA/canada/en (path), edmonton.html(filename).The third part can also be a function name (maps) andpairs of parameter=value (q=edmonton).

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

This is the address we type or displayed in the address barin the web browser, such as:http://www.sas.com/offices/NA/canada/en/edmonton.html

http://maps.google.com/maps?q=edmonton

Usually has 3 components:Transfer Protocol: http, https, ftp, etc.IP Address or hostname: www.sas.com andmaps.google.comThe path and file name of the webpage on the host (webserver): /offices/NA/canada/en (path), edmonton.html(filename).The third part can also be a function name (maps) andpairs of parameter=value (q=edmonton).

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

Two types of URLs: Static and DynamicA Static webpage is the html file that is already stored inthe web server. The file Edmonton.html is alreadyphysically stored in the SAS server, under the specifiedpath.http://www.sas.com/offices/NA/canada/en/edmonton.html

A Dynamic webpage is a webpage that is dynamicallygenerated, depending on what information the serverreceived.http://maps.google.com/maps?q=edmonton

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

Two types of URLs: Static and DynamicA Static webpage is the html file that is already stored inthe web server. The file Edmonton.html is alreadyphysically stored in the SAS server, under the specifiedpath.http://www.sas.com/offices/NA/canada/en/edmonton.html

A Dynamic webpage is a webpage that is dynamicallygenerated, depending on what information the serverreceived.http://maps.google.com/maps?q=edmonton

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

Two types of URLs: Static and DynamicA Static webpage is the html file that is already stored inthe web server. The file Edmonton.html is alreadyphysically stored in the SAS server, under the specifiedpath.http://www.sas.com/offices/NA/canada/en/edmonton.html

A Dynamic webpage is a webpage that is dynamicallygenerated, depending on what information the serverreceived.http://maps.google.com/maps?q=edmonton

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

For all static webpages and some dynamic webpages, theurl (when you click the link) is usually given with <ahref="url"> in the HTML file, so you can easily extractthe next urlFor many dynamic webpages, you need to determinefunction name and what parameter/value pairs are neededin order to get the required results.The function name and parameters are usually specified inthe current HTML with tags like: <formsubmit=[function]>, <input name=[param],value=["value"]>.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

For all static webpages and some dynamic webpages, theurl (when you click the link) is usually given with <ahref="url"> in the HTML file, so you can easily extractthe next urlFor many dynamic webpages, you need to determinefunction name and what parameter/value pairs are neededin order to get the required results.The function name and parameters are usually specified inthe current HTML with tags like: <formsubmit=[function]>, <input name=[param],value=["value"]>.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

For all static webpages and some dynamic webpages, theurl (when you click the link) is usually given with <ahref="url"> in the HTML file, so you can easily extractthe next urlFor many dynamic webpages, you need to determinefunction name and what parameter/value pairs are neededin order to get the required results.The function name and parameters are usually specified inthe current HTML with tags like: <formsubmit=[function]>, <input name=[param],value=["value"]>.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

URLs can not contain space or any special characters (like#.,()"’;:, etc.). These need to be encodedSAS provides the function urlencode for this purpose. Also,the function urldecode will do the oppositeIf there are more than one pair of parameter=value in thedynamic url, a & is used to separate the pairsThese special characters (especially &) will cause troublewhen constructing URLs in a SAS macro. Need to usemacro quoting functions (like %str, %quote, %nrquote,%superq, etc.) for the URL

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

URLs can not contain space or any special characters (like#.,()"’;:, etc.). These need to be encodedSAS provides the function urlencode for this purpose. Also,the function urldecode will do the oppositeIf there are more than one pair of parameter=value in thedynamic url, a & is used to separate the pairsThese special characters (especially &) will cause troublewhen constructing URLs in a SAS macro. Need to usemacro quoting functions (like %str, %quote, %nrquote,%superq, etc.) for the URL

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

URLs can not contain space or any special characters (like#.,()"’;:, etc.). These need to be encodedSAS provides the function urlencode for this purpose. Also,the function urldecode will do the oppositeIf there are more than one pair of parameter=value in thedynamic url, a & is used to separate the pairsThese special characters (especially &) will cause troublewhen constructing URLs in a SAS macro. Need to usemacro quoting functions (like %str, %quote, %nrquote,%superq, etc.) for the URL

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

URLs can not contain space or any special characters (like#.,()"’;:, etc.). These need to be encodedSAS provides the function urlencode for this purpose. Also,the function urldecode will do the oppositeIf there are more than one pair of parameter=value in thedynamic url, a & is used to separate the pairsThese special characters (especially &) will cause troublewhen constructing URLs in a SAS macro. Need to usemacro quoting functions (like %str, %quote, %nrquote,%superq, etc.) for the URL

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

Two methods of sending the required parameters to the server:GET: through the static or dynmaic url - parameters can beseen on the address barPOST: through the request header - parameters notdisplayed in the address barPOST method can be used to transfer large amount ofparameters or sensitive data (such as password)In SAS, most GET requests can be realized withFILENAME URL statementFor POST requests, you can only use FILENAME Socketstatement - very complicated! - not recommended. Useother tools (like cURL)

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

Two methods of sending the required parameters to the server:GET: through the static or dynmaic url - parameters can beseen on the address barPOST: through the request header - parameters notdisplayed in the address barPOST method can be used to transfer large amount ofparameters or sensitive data (such as password)In SAS, most GET requests can be realized withFILENAME URL statementFor POST requests, you can only use FILENAME Socketstatement - very complicated! - not recommended. Useother tools (like cURL)

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

Two methods of sending the required parameters to the server:GET: through the static or dynmaic url - parameters can beseen on the address barPOST: through the request header - parameters notdisplayed in the address barPOST method can be used to transfer large amount ofparameters or sensitive data (such as password)In SAS, most GET requests can be realized withFILENAME URL statementFor POST requests, you can only use FILENAME Socketstatement - very complicated! - not recommended. Useother tools (like cURL)

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

Two methods of sending the required parameters to the server:GET: through the static or dynmaic url - parameters can beseen on the address barPOST: through the request header - parameters notdisplayed in the address barPOST method can be used to transfer large amount ofparameters or sensitive data (such as password)In SAS, most GET requests can be realized withFILENAME URL statementFor POST requests, you can only use FILENAME Socketstatement - very complicated! - not recommended. Useother tools (like cURL)

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

Two methods of sending the required parameters to the server:GET: through the static or dynmaic url - parameters can beseen on the address barPOST: through the request header - parameters notdisplayed in the address barPOST method can be used to transfer large amount ofparameters or sensitive data (such as password)In SAS, most GET requests can be realized withFILENAME URL statementFor POST requests, you can only use FILENAME Socketstatement - very complicated! - not recommended. Useother tools (like cURL)

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

URLs

Two methods of sending the required parameters to the server:GET: through the static or dynmaic url - parameters can beseen on the address barPOST: through the request header - parameters notdisplayed in the address barPOST method can be used to transfer large amount ofparameters or sensitive data (such as password)In SAS, most GET requests can be realized withFILENAME URL statementFor POST requests, you can only use FILENAME Socketstatement - very complicated! - not recommended. Useother tools (like cURL)

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

HTML

The response from the server is usually in the form ofHTML fileIn web browser, you can right-click and select View pagesource to see the webpage in HTML codesHTML file is a text file, so it can be read into a SAS dataset.Each line is stored as one record in SAS data set.In SAS, use the INFILE statement with varying length toread each lineBe aware that SAS has a length limit of 32767 - charactersafter this limit in a line can’t be read into SAS.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

HTML

The response from the server is usually in the form ofHTML fileIn web browser, you can right-click and select View pagesource to see the webpage in HTML codesHTML file is a text file, so it can be read into a SAS dataset.Each line is stored as one record in SAS data set.In SAS, use the INFILE statement with varying length toread each lineBe aware that SAS has a length limit of 32767 - charactersafter this limit in a line can’t be read into SAS.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

HTML

The response from the server is usually in the form ofHTML fileIn web browser, you can right-click and select View pagesource to see the webpage in HTML codesHTML file is a text file, so it can be read into a SAS dataset.Each line is stored as one record in SAS data set.In SAS, use the INFILE statement with varying length toread each lineBe aware that SAS has a length limit of 32767 - charactersafter this limit in a line can’t be read into SAS.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

HTML

The response from the server is usually in the form ofHTML fileIn web browser, you can right-click and select View pagesource to see the webpage in HTML codesHTML file is a text file, so it can be read into a SAS dataset.Each line is stored as one record in SAS data set.In SAS, use the INFILE statement with varying length toread each lineBe aware that SAS has a length limit of 32767 - charactersafter this limit in a line can’t be read into SAS.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

HTML

The response from the server is usually in the form ofHTML fileIn web browser, you can right-click and select View pagesource to see the webpage in HTML codesHTML file is a text file, so it can be read into a SAS dataset.Each line is stored as one record in SAS data set.In SAS, use the INFILE statement with varying length toread each lineBe aware that SAS has a length limit of 32767 - charactersafter this limit in a line can’t be read into SAS.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

HTML

The response from the server is usually in the form ofHTML fileIn web browser, you can right-click and select View pagesource to see the webpage in HTML codesHTML file is a text file, so it can be read into a SAS dataset.Each line is stored as one record in SAS data set.In SAS, use the INFILE statement with varying length toread each lineBe aware that SAS has a length limit of 32767 - charactersafter this limit in a line can’t be read into SAS.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

HTML

HTML file consists pairs of tags and actual information.Tags are used to instruct the web browser how to displaythe response on the screenWe can use Tags to locate the required information andhow to extract them, for example, where is the URL for nextpage, where is the data table and what values need to beextracted.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

HTML

HTML file consists pairs of tags and actual information.Tags are used to instruct the web browser how to displaythe response on the screenWe can use Tags to locate the required information andhow to extract them, for example, where is the URL for nextpage, where is the data table and what values need to beextracted.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

HTML

HTML file consists pairs of tags and actual information.Tags are used to instruct the web browser how to displaythe response on the screenWe can use Tags to locate the required information andhow to extract them, for example, where is the URL for nextpage, where is the data table and what values need to beextracted.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples URLs HTML

HTML

Examples of tags (normally come in pairs):<a href=...>: for specifying a link (an URL)<div id=...>: for identifying a division or a section in anhtml document<table>, <th>, <tr>, <td>: identify the table, tablehead, table row, table data<form ...method="post">, <input..name=**value=**>: specify what parameters (name) and valuesare needed to be sent to the server, and what method tosend the request (get or post)

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

Web Accessing Tools

Web Accessing Tools

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

Tools for Web Accessing

Tools (software) we use for getting information from Internet:Web browsers: Internet Explorer, Firefox, Google Chrome,Safari, etc. - For browsing and clicking, not for automation.Command line programs (cURL), LWP package for Perlprograming language - Very powerful web accessing,basically can replicate any web browser functionalities.SAS filename statements: URL, Socket, FTP, EMail - basicweb accessing for specific functionalities, may not enoughfor accessing all websites.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

Tools for Web Accessing

Tools (software) we use for getting information from Internet:Web browsers: Internet Explorer, Firefox, Google Chrome,Safari, etc. - For browsing and clicking, not for automation.Command line programs (cURL), LWP package for Perlprograming language - Very powerful web accessing,basically can replicate any web browser functionalities.SAS filename statements: URL, Socket, FTP, EMail - basicweb accessing for specific functionalities, may not enoughfor accessing all websites.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

Tools for Web Accessing

Tools (software) we use for getting information from Internet:Web browsers: Internet Explorer, Firefox, Google Chrome,Safari, etc. - For browsing and clicking, not for automation.Command line programs (cURL), LWP package for Perlprograming language - Very powerful web accessing,basically can replicate any web browser functionalities.SAS filename statements: URL, Socket, FTP, EMail - basicweb accessing for specific functionalities, may not enoughfor accessing all websites.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

Tools for Web Accessing

Tools (software) we use for getting information from Internet:Web browsers: Internet Explorer, Firefox, Google Chrome,Safari, etc. - For browsing and clicking, not for automation.Command line programs (cURL), LWP package for Perlprograming language - Very powerful web accessing,basically can replicate any web browser functionalities.SAS filename statements: URL, Socket, FTP, EMail - basicweb accessing for specific functionalities, may not enoughfor accessing all websites.

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS functions for text extraction

SAS string functions: FIND(), INDEX(), SCAN(), etc.These functions are useful for locating the targetinformationPerl Regular Expression functions (Version 9):prxparse(), prxmatch(), prxchange(), and theprx-call routines: call prxchange(), callprxmatch(), call prxnext(). These are verypowerful for exacting the informationExample: Extract the URLs in the record (variable name:line):url=prxchange(’s/ˆ.*?href="(.*?)".*?$/$1/’,-1,line);

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS functions for text extraction

SAS string functions: FIND(), INDEX(), SCAN(), etc.These functions are useful for locating the targetinformationPerl Regular Expression functions (Version 9):prxparse(), prxmatch(), prxchange(), and theprx-call routines: call prxchange(), callprxmatch(), call prxnext(). These are verypowerful for exacting the informationExample: Extract the URLs in the record (variable name:line):url=prxchange(’s/ˆ.*?href="(.*?)".*?$/$1/’,-1,line);

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS functions for text extraction

SAS string functions: FIND(), INDEX(), SCAN(), etc.These functions are useful for locating the targetinformationPerl Regular Expression functions (Version 9):prxparse(), prxmatch(), prxchange(), and theprx-call routines: call prxchange(), callprxmatch(), call prxnext(). These are verypowerful for exacting the informationExample: Extract the URLs in the record (variable name:line):url=prxchange(’s/ˆ.*?href="(.*?)".*?$/$1/’,-1,line);

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS functions for text extraction

SAS string functions: FIND(), INDEX(), SCAN(), etc.These functions are useful for locating the targetinformationPerl Regular Expression functions (Version 9):prxparse(), prxmatch(), prxchange(), and theprx-call routines: call prxchange(), callprxmatch(), call prxnext(). These are verypowerful for exacting the informationExample: Extract the URLs in the record (variable name:line):url=prxchange(’s/ˆ.*?href="(.*?)".*?$/$1/’,-1,line);

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS Web Access Statements

SAS provides two main statements for web access:FILENAME URL: for the GET request method (static ordynamic url)filename eSUG url "http://www.sas.com/offices/NA/canada/en/edmonton.html";

Before 9.2, FILENAME URL did not support HTTPSprotocol completely, now in 9.2, it seems to support HTTPSfor secured transferFILENAME Socket: two-way commnuication, for morecomplicated web requests (like POST method, cookies,referer, etc.)Other statements:

FILENAME EMAIL - for accessing and sending emailsFILENAME FTP - for transfering files with the web server

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS Web Access Statements

SAS provides two main statements for web access:FILENAME URL: for the GET request method (static ordynamic url)filename eSUG url "http://www.sas.com/offices/NA/canada/en/edmonton.html";

Before 9.2, FILENAME URL did not support HTTPSprotocol completely, now in 9.2, it seems to support HTTPSfor secured transferFILENAME Socket: two-way commnuication, for morecomplicated web requests (like POST method, cookies,referer, etc.)Other statements:

FILENAME EMAIL - for accessing and sending emailsFILENAME FTP - for transfering files with the web server

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS Web Access Statements

SAS provides two main statements for web access:FILENAME URL: for the GET request method (static ordynamic url)filename eSUG url "http://www.sas.com/offices/NA/canada/en/edmonton.html";

Before 9.2, FILENAME URL did not support HTTPSprotocol completely, now in 9.2, it seems to support HTTPSfor secured transferFILENAME Socket: two-way commnuication, for morecomplicated web requests (like POST method, cookies,referer, etc.)Other statements:

FILENAME EMAIL - for accessing and sending emailsFILENAME FTP - for transfering files with the web server

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS Web Access Statements

SAS provides two main statements for web access:FILENAME URL: for the GET request method (static ordynamic url)filename eSUG url "http://www.sas.com/offices/NA/canada/en/edmonton.html";

Before 9.2, FILENAME URL did not support HTTPSprotocol completely, now in 9.2, it seems to support HTTPSfor secured transferFILENAME Socket: two-way commnuication, for morecomplicated web requests (like POST method, cookies,referer, etc.)Other statements:

FILENAME EMAIL - for accessing and sending emailsFILENAME FTP - for transfering files with the web server

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS Web Access Statements

SAS provides two main statements for web access:FILENAME URL: for the GET request method (static ordynamic url)filename eSUG url "http://www.sas.com/offices/NA/canada/en/edmonton.html";

Before 9.2, FILENAME URL did not support HTTPSprotocol completely, now in 9.2, it seems to support HTTPSfor secured transferFILENAME Socket: two-way commnuication, for morecomplicated web requests (like POST method, cookies,referer, etc.)Other statements:

FILENAME EMAIL - for accessing and sending emailsFILENAME FTP - for transfering files with the web server

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS Web Access Statements

SAS provides two main statements for web access:FILENAME URL: for the GET request method (static ordynamic url)filename eSUG url "http://www.sas.com/offices/NA/canada/en/edmonton.html";

Before 9.2, FILENAME URL did not support HTTPSprotocol completely, now in 9.2, it seems to support HTTPSfor secured transferFILENAME Socket: two-way commnuication, for morecomplicated web requests (like POST method, cookies,referer, etc.)Other statements:

FILENAME EMAIL - for accessing and sending emailsFILENAME FTP - for transfering files with the web server

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS Web Access Statements

SAS provides two main statements for web access:FILENAME URL: for the GET request method (static ordynamic url)filename eSUG url "http://www.sas.com/offices/NA/canada/en/edmonton.html";

Before 9.2, FILENAME URL did not support HTTPSprotocol completely, now in 9.2, it seems to support HTTPSfor secured transferFILENAME Socket: two-way commnuication, for morecomplicated web requests (like POST method, cookies,referer, etc.)Other statements:

FILENAME EMAIL - for accessing and sending emailsFILENAME FTP - for transfering files with the web server

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS Web Access Statements

Two SAS mechanisms for extending SAS functions with othersoftware:

FILENAME PIPE statement: the results from thisstatement are treated as the inputs to the SAS data set.For example:filename fileref pipe "<DOS Command>";

X statement: leave SAS temperately and run the externalprogram, and then return to SAS after the execution:X "<DOS Command>";

The X statement does not provide direct interactionbetween SAS and the external programThe PIPE statement feeds the results directly to SAS

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS Web Access Statements

Two SAS mechanisms for extending SAS functions with othersoftware:

FILENAME PIPE statement: the results from thisstatement are treated as the inputs to the SAS data set.For example:filename fileref pipe "<DOS Command>";

X statement: leave SAS temperately and run the externalprogram, and then return to SAS after the execution:X "<DOS Command>";

The X statement does not provide direct interactionbetween SAS and the external programThe PIPE statement feeds the results directly to SAS

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS Web Access Statements

Two SAS mechanisms for extending SAS functions with othersoftware:

FILENAME PIPE statement: the results from thisstatement are treated as the inputs to the SAS data set.For example:filename fileref pipe "<DOS Command>";

X statement: leave SAS temperately and run the externalprogram, and then return to SAS after the execution:X "<DOS Command>";

The X statement does not provide direct interactionbetween SAS and the external programThe PIPE statement feeds the results directly to SAS

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS Web Access Statements

Two SAS mechanisms for extending SAS functions with othersoftware:

FILENAME PIPE statement: the results from thisstatement are treated as the inputs to the SAS data set.For example:filename fileref pipe "<DOS Command>";

X statement: leave SAS temperately and run the externalprogram, and then return to SAS after the execution:X "<DOS Command>";

The X statement does not provide direct interactionbetween SAS and the external programThe PIPE statement feeds the results directly to SAS

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

SAS Web Access Statements

Two SAS mechanisms for extending SAS functions with othersoftware:

FILENAME PIPE statement: the results from thisstatement are treated as the inputs to the SAS data set.For example:filename fileref pipe "<DOS Command>";

X statement: leave SAS temperately and run the externalprogram, and then return to SAS after the execution:X "<DOS Command>";

The X statement does not provide direct interactionbetween SAS and the external programThe PIPE statement feeds the results directly to SAS

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

cURL Command Line Program

cURL is a command line tool for transferring data with URLsyntaxlibCurl is a C library, and cURL is binary program byintegrating all the functions in libCurllibCurl has been implemented in many programminglanguages and softwareBasically anything you can do with a web browser can bereplicated with cURL, including clicking, downloading files(any file, like music, video, pdf, etc) and uploading.Most importantly, it is FREE. It can be downloaded from:http://curl.haxx.se/

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

cURL Command Line Program

cURL is a command line tool for transferring data with URLsyntaxlibCurl is a C library, and cURL is binary program byintegrating all the functions in libCurllibCurl has been implemented in many programminglanguages and softwareBasically anything you can do with a web browser can bereplicated with cURL, including clicking, downloading files(any file, like music, video, pdf, etc) and uploading.Most importantly, it is FREE. It can be downloaded from:http://curl.haxx.se/

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

cURL Command Line Program

cURL is a command line tool for transferring data with URLsyntaxlibCurl is a C library, and cURL is binary program byintegrating all the functions in libCurllibCurl has been implemented in many programminglanguages and softwareBasically anything you can do with a web browser can bereplicated with cURL, including clicking, downloading files(any file, like music, video, pdf, etc) and uploading.Most importantly, it is FREE. It can be downloaded from:http://curl.haxx.se/

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

cURL Command Line Program

cURL is a command line tool for transferring data with URLsyntaxlibCurl is a C library, and cURL is binary program byintegrating all the functions in libCurllibCurl has been implemented in many programminglanguages and softwareBasically anything you can do with a web browser can bereplicated with cURL, including clicking, downloading files(any file, like music, video, pdf, etc) and uploading.Most importantly, it is FREE. It can be downloaded from:http://curl.haxx.se/

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

cURL Command Line Program

cURL is a command line tool for transferring data with URLsyntaxlibCurl is a C library, and cURL is binary program byintegrating all the functions in libCurllibCurl has been implemented in many programminglanguages and softwareBasically anything you can do with a web browser can bereplicated with cURL, including clicking, downloading files(any file, like music, video, pdf, etc) and uploading.Most importantly, it is FREE. It can be downloaded from:http://curl.haxx.se/

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

cURL Command Line Program

Basic usage of cURLCURL --options urls

cURL has lots of options, with these options it can dealwith virtually any requests with any websites.with FILENAME PIPE statement:

filename eSUG pipe "CURL --options urls";

with X statement:x "CURL --options urls";

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

cURL Command Line Program

Basic usage of cURLCURL --options urls

cURL has lots of options, with these options it can dealwith virtually any requests with any websites.with FILENAME PIPE statement:

filename eSUG pipe "CURL --options urls";

with X statement:x "CURL --options urls";

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

cURL Command Line Program

Basic usage of cURLCURL --options urls

cURL has lots of options, with these options it can dealwith virtually any requests with any websites.with FILENAME PIPE statement:

filename eSUG pipe "CURL --options urls";

with X statement:x "CURL --options urls";

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

the LWP package with Perl

Perl programming language is very powerful and flexible inparsing information from HTML or XML filesThe LWP (library for WWW in Perl) package is a webaccessing package in PerlGo to the website http://www.perl.org/ for moreinformation about Perl and its packagesThe web accessing and information parsing with LWP ismuch faster than with SASYou can write a Perl script and use FILENAME PIPEstatement or X statement to let SAS run the Perl script andreturn the results to SAS for further processing

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

the LWP package with Perl

Perl programming language is very powerful and flexible inparsing information from HTML or XML filesThe LWP (library for WWW in Perl) package is a webaccessing package in PerlGo to the website http://www.perl.org/ for moreinformation about Perl and its packagesThe web accessing and information parsing with LWP ismuch faster than with SASYou can write a Perl script and use FILENAME PIPEstatement or X statement to let SAS run the Perl script andreturn the results to SAS for further processing

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

the LWP package with Perl

Perl programming language is very powerful and flexible inparsing information from HTML or XML filesThe LWP (library for WWW in Perl) package is a webaccessing package in PerlGo to the website http://www.perl.org/ for moreinformation about Perl and its packagesThe web accessing and information parsing with LWP ismuch faster than with SASYou can write a Perl script and use FILENAME PIPEstatement or X statement to let SAS run the Perl script andreturn the results to SAS for further processing

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

the LWP package with Perl

Perl programming language is very powerful and flexible inparsing information from HTML or XML filesThe LWP (library for WWW in Perl) package is a webaccessing package in PerlGo to the website http://www.perl.org/ for moreinformation about Perl and its packagesThe web accessing and information parsing with LWP ismuch faster than with SASYou can write a Perl script and use FILENAME PIPEstatement or X statement to let SAS run the Perl script andreturn the results to SAS for further processing

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples SAS Functions SAS Statements cURL Perl/LWP

the LWP package with Perl

Perl programming language is very powerful and flexible inparsing information from HTML or XML filesThe LWP (library for WWW in Perl) package is a webaccessing package in PerlGo to the website http://www.perl.org/ for moreinformation about Perl and its packagesThe web accessing and information parsing with LWP ismuch faster than with SASYou can write a Perl script and use FILENAME PIPEstatement or X statement to let SAS run the Perl script andreturn the results to SAS for further processing

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Examples

Examples

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 1: Download .csv file

Download Moody’s seasoned AAA coporate bond yield from FRED website:

The return data is in .csv format, so no data extraction is needed, just like reading a local .csv file

SAS Codes

filename DAAA url"http://research.stlouisfed.org/fred2/series/DAAA/downloaddata/DAAA.csv";

data BY_AAA(drop=dd);length dd $10;format date date9.;infile DAAA dlm="," dsd;input dd$ value;date=input(dd,yymmdd10.);if not missing(date) then output;

run;filename DAAA clear;

Resulting Data Set

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 2: Get the list of eSUG presentations

Download the list of all presentations in eSUG webpage

All the information is in one webpage, no need to look for url for next page

All presentations are in .pdf files, so for data extraction we search the string within quotation marks andends with.pdf

SAS Codes

filename eSUG url "http://www.sas.com/offices/NA/canada/en/edmonton.html";data eSUG_achive(keep=pdffile);

length pdffile $200;

infile eSUG length=len lrecl=32767;input line $varying32767. len;

*all the file names end with .pdf;if find(line,".pdf") then do;

*get the string ending with .pdf and enclosed by quotation marks;pdffile=prxchange(’s/ˆ.*?"(.*?\.pdf)".*$/$1/i’,-1,line);output;

end;run;filename eSUG clear;

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 2: Get the list of eSUG presentations

Resulting Data Set: eSUG Archive

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 2: Get the list of eSUG presentationsHow about down load all the presentations?

FILENAME URL does not work - can’t save files

use cURL command line program with the SAS X statement

cURL option−−output 〈filename〉 can save the result to a file specified by 〈filename〉

SAS Codes

proc sql noprint;select pdffile into :pdffiles separated by "|" from eSUG_achive;

quit;

options noxwait;

%macro downLoadFiles();%let eSUGFolder=H:\eSUG\PDFs;

%let i=1;%let pdf1=%scan(&pdffiles.,&i.,"|");%do %while (%quote(&pdf1.)˜=);

*I only need the file name, don’t need the path name;%let filename=%scan(&pdf1.,-1,"/");

*the --output option saves the file;x "curl --output &eSUGFolder./&filename. &pdf1.";%let i=%eval(&i.+1);%let pdf1=%scan(&pdffiles.,&i.,"|");

%end;%mend downLoadFiles;%downLoadFiles;

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 3: Find out all clinical trialsLocate and extract information in the html using the tagsFind the url for next webpage

SAS Codes

%macro GetTrials(startlink=, out=Trials);%let Continue=Yes;%let NextLink=&startlink.;%let i=0;%do %while(&Continue.=Yes);

filename Trials url "http://www.clinicaltrial.gov/%superq(NextLink)" lrecl=8192;data _Trials_(drop=recordline record nextpage);

length Status $25 Study $200 link $200;retain Rank Status Study link;retain recordLine 0; =1 means this line may contain needed information;

infile Trials length=len;input record $varying8192. len;

if prxmatch(’/>Study<\/th>/i’,record) then recordLine=1;if recordLine=1 and find(record,"/div") then recordLine=0;

if (recordLine=1) then do; *now get the related information;if prxmatch(’/<span.*>.*?<\/span>\s*$/’,record) then

status=prxchange(’s/ˆ.*?<span.*>(.*?)<\/span>\s*$/$1/’,-1,record);if find(record,"href") then do;Study=prxchange(’s/ˆ.*>(.*?)<.*$/$1/’,-1,record);link=prxchange(’s/ˆ.*?href="(.*?)".*$/$1/’,-1,record);

end;if find(record,"</table>") then do;rank=input(scan(link,-1,"="),5.0);

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 3: Find out all clinical trialsSAS Codes (continued)

output;end;

end;if find(record,"Next Page") then do;

if find(record,"href") then do;nextpage=tranwrd(prxchange(’s/ˆ.*?href="(.*?)".*?$/$1/’,-1,record),’&amp;’,’&’);call symput("nextlink",nextpage);

end;else call symput("Continue","No");

end;run;filename Trials clear;data _Trials_;

set _Trials_ end=last;if (last˜=1) then output;

run;%if (&i.=0) %then %do;

data &out.;set _Trials_;

run;%end;%else %do;

data &out.;set &out. _Trials_;

run;%end;%let i=%eval(&i.+1);

%end;%mend GetTrials;

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 3: Find out all clinical trials

This macro accepts a start link and then keep looking forall the trials in next pagesTo get all trials in Alberta, use the start link ofct2/results?state1=NA%3ACA%3AABNote: NA%3ACA%3AAB is the encoded string of NA:CA:AB

SAS Codes

%GetTrials(startlink=%str(ct2/results?state1=NA%3ACA%3AAB),out=AB_Trials);

For all trials in Canada:SAS Codes

%GetTrials(startlink=%str(ct2/results?state1=NA%3ACA),out=CA_Trials);

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 4: Get and plot today’s temperatureFrom the weather underground website: www.wunderground.com

Although the url link can be specified easily, it is not the final link, so SAS URL statement can not get therequired results

cURL option−−location can track down to the final link and get the resulting text file (basically a .csv file)

Use FILENAME PIPE statement to get the return from cURL directly into SAS

SAS Codes

*There are 3 weather stations in Edmonton: IABEDMON10 IABEDMON12 IABEDMON13;%let WStation=IABEDMON13; *This is Edmonton Downtown weather station;

*get today’s date and obtain day, month and year. These are used to construct the url;%let Day=%substr(&sysdate9.,1,2);%let Month=%sysfunc(month(%sysfunc(today())));%let Year=%substr(&sysdate9.,6,4);%let EdmWeather="http://www.wunderground.com/weatherstation/WXDailyHistory.asp?

ID=&WStation.%str(&)month=&month.%str(&)day=&Day.%str(&)year=&Year.%str(&)format=1";

filename Weather pipe "curl --location %superq(EdmWeather)";*FILENAME URL doesn’t work;data weather(drop=line);

infile Weather lrecl=2000 length=len;input line $varying2000. len;if find(line,",") then do;

time=input(scan(line,1,",","m"),anydtdtm21.); format time datetime12.;Temp=input(scan(line,2,",","m"),5.0);Dewpoint=input(scan(line,3,",","m"),5.0);if not missing(time) then output;

end;run;filename Weather clear;

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 4: Get and plot today’s temperatureSAS Codes for Plotting the Temperature

symbol1 i=j;legend1 label=(h=2 "Weather") value=(h=2);axis1 value=(h=2);title1 "Temperature on &sysdate9.";proc gplot data=weather;

plot (temp Dewpoint)*time/haxis=axis1 vaxis=axis1 overlay legend=legend1;run; quit;

Resulting Temperature Plot

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 5: Download City of Edmonton job postingsMost web accessing use POST method - Use Perl/LWP for web accessing and also for data extractionThen in SAS use filename pipe statement to run the Perl program and get the extracted data into a data set

SAS Codes

filename EdJobs pipe "C:\Perl\bin\perl C:\esug\edmjobs.pl";data Edm_Jobs;

*Totally 16 variables, 14 are read in as character strings. Declare their length;length JobCode $10 Title $100 Description $10000 Qualification $10000 Salary $500;length Hours $500 PDate $20 CDate $20 JobType1 $100 JobType2 $100 Union $100;length Department $100 WorkLocation $200 WorkAddress $200;infile EdJobs lrecl=32767 length=len;

*use multiple input because the final result from Perl is printed in multiple lines;input JobCode $varying10. len;input Title $varying100. len;input JobNum;input Description $varying10000. len;input Qualification $varying10000. len;input Salary $varying500. len;input Hours $varying500. len;input PDate $varying20. len;input CDate $varying20. len;input NumOfOpening;input JobType1 $varying100. len;input JobType2 $varying100. len;input Union $varying100. len;input Department $varying100. len;input WorkLocation $varying200. len;input WorkAddress $varying200. len;

run;filename EdJobs clear;

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 5: Download City of Edmonton job postings

Perl program: edmjobs.pl

#=====================================================================================# Perl program to download the job postings from City of Edmonton# Written by George Zhu ([email protected])# for the Edmonton SAS User Group (eSUG) meeting, Fall 2011# Usage:# 1. Stand Alone:# C:\Perl\bin\perl C:\esug\edmjobs.pl >[output file name]# 2. With SAS, use filename pipe statement:# filename EdJobs pipe "C:\Perl\bin\perl C:\esug\edmjobs.pl"# then in the data set step, use infile statement:# infile EdJobs lrecl=32767 length=len;# Note: change the folder name to where the program edmjobs.pl is located.#=====================================================================================use strict;use LWP::UserAgent;use XML::Parser;use URI::Escape;

my $EdJobs=’https://edmonton.taleo.net/careersection/2/moresearch.ftl’;my $EdBrowser=LWP::UserAgent->new();$EdBrowser->agent(’Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2’);

my $EdResponse=$EdBrowser->get($EdJobs);my $EdPage1=$EdResponse->content() if $EdResponse->is_success();

## obtain the form items using the XML Parser;my %inputAttrs; #for the POST data for next inquery;my $EdParse=XML::Parser->new(Handlers=>{Start=>&Input_start,});

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 5: Download City of Edmonton job postings

Perl program: edmjobs.pl - continued

sub Input_start{

my ($expat,$element,%attrs)=@_;if($element=˜/input/i) # look for the input tag{

my ($param,$pmval);while(my($key,$value)=each(%attrs)){

if ($key=˜/name/i) {$param=$value;}if ($key=˜/value/i) {$value=˜tr/ /+/; $pmval=$value;}

}$inputAttrs{$param}=$pmval;

}}

$EdParse->parse($EdPage1);my %AttrsPage1=%inputAttrs;my %AttrsPost1=%inputAttrs;

## total number of postings;my $nJobs=$1

if $AttrsPage1{"initialHistory"}=˜/listRequisition\.nbElements!\%7C!(\d+?)!\%7C!/i;## Number of postings per page;my $listSize=$1

if $AttrsPage1{"initialHistory"}=˜/listRequisition\.size!\%7C!(\d+?)!\%7C!/i;

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 5: Download City of Edmonton job postings

Perl program: edmjobs.pl - continued

## calculate number of pages;my $nPages=$nJobs/$listSize;$nPages=int($nPages+1) if ($nPages>int($nPages));

## Obtain the pages of the postings, first set the parameters;$AttrsPage1{"countryPanelErrorDrawer.state"}="false";$AttrsPage1{"errorMessageDrawer.state"}="false";$AttrsPage1{"ftlcompclass"}="PagerComponent";$AttrsPage1{"ftlcallback"}="ftlPager_processResponse";$AttrsPage1{"ftlcallback"}="ftlPager_processResponse";$AttrsPage1{"ftlcompid"}="rlPager";$AttrsPage1{"ftlinterfaceid"}="requisitionListInterface";

my $pp=1;my $i=0;my @JobCodes;my $jobDetails=’https://edmonton.taleo.net/careersection/2/jobdetail.ftl’;while ($pp<=$nPages){

$AttrsPage1{"rlPager.currentPage"}="$pp";my $JobPage=$EdBrowser->post(jobDetails,\%AttrsPage1, ’Referer’=>EdJobs,);my @fillList=split("\’,\’",$1) if $JobPage->content()=˜/api\.fillList(.*?)\;/i;shift(@fillList);shift(@fillList);

my $nLine=3;my $listSize=@fillList;

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 5: Download City of Edmonton job postings

Perl program: edmjobs.pl - continued

while($nLine<$listSize){

$JobCodes[$i]=$fillList[$nLine];$nLine+=33;$i++;

}$pp++;

}$AttrsPost1{"ftlcompid"}="actOpenRequisitionDescription";$AttrsPost1{"ftlinterfaceid"}="requisitionListInterface";

foreach my $code (@JobCodes){

$AttrsPost1{"actOpenRequisitionDescription.requisitionNo"}="$code";# Use the POST method to get the detailed info about a postingmy $Job1=$EdBrowser->post(jobDetails,\%AttrsPost1,’Referer’=>EdJobs,);

$EdParse->parse($Job1->content());$inputAttrs{"initialHistory"}=˜tr/+/ /;my @infor=split("!\%7C!",$inputAttrs{"initialHistory"});my $description=uri_unescape($infor[15]);$description=˜s/(<.*?>)|(!\*!)//g;$description=˜s/(\&nbsp\;)/ /g;$description=˜s/(\&\#39\;)/\’/g;$description=˜s/(\n)/ /g;

my $qualification=uri_unescape($infor[16]);$qualification=˜s/(<.*?>)|(!\*!)//g;

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS

Outline Introduction Web Elements Tools Examples Example 1 Example 2 Example 3 Example 4 Example 5

Example 5: Download City of Edmonton job postings

Perl program: edmjobs.pl - continued

$qualification=˜s/(\&nbsp\;)/ /g;$qualification=˜s/(\&\#39\;)/\’/g;

my $salary=$1 if $qualification=˜s/Salary Range: (.*?)\n//i;my $Hours=$1 if $qualification=˜s/Hours of Work: (.*?)\n//i;$qualification=˜s/(\n)/ /g;

## these are the extracted information from a job posting;print $code,"\n"; #jobcodeprint $infor[12],"\n"; #titleprint $infor[13],"\n"; #Job Numberprint $description,"\n"; #Descriptionprint $qualification,"\n"; #Qualificationprint $salary,"\n"; #Salary Rangeprint $Hours,"\n"; #Hours of Workprint $infor[20],"\n"; #Posting Dateprint $infor[22],"\n"; #Closing Dateprint $infor[23],"\n"; #Number of openingprint $infor[25],"\n"; #Job type 1print $infor[26],"\n"; #Job type 2print $infor[27],"\n"; #Unionprint $infor[29],"\n"; #Departmentprint $infor[32],"\n"; #Work locationprint $infor[33],"\n"; #Work location address#print "=================================\n\n";

}#### End of program ####

George Zhu & Sunita Ghosh (AHS - Cancer Care) Accessing and Extracting Data from Internet Using SAS