Novartis
TransWorld Group
Part III: Technical Report
DSBA 6100
David Farthing
Paramita Ghosh
Lyndsay Richter
Kaitlyn Schroeder
Raja Sekhar Potluri
December 16, 2015
Table of Contents
II. Technical Report
a. Executive Summary
b. Appendix I: Bash Script and Hadoop
c. Appendix II: Java Code for Parsing XML
d. Appendix III: Regression Model
e. Appendix IV: Topic Modeling using Mallet
f. Appendix V: Patent Data + FDA Orange Book
g. Appendix VI: Sentiment Analysis with Alchemy API
Technical Report Executive Summary
DATA PRE-PROCESSING
The first step in organizing the patent data was to use bash shell scripting to combine the patent files into a single document with one patent per line. To do this, we replaced the opening XML tag of each patent with a unique non-English word that would not appear in any patent. We then removed all newline characters from the entire file. The final step was to separate the patents again by replacing each occurrence of the special non-English word with a newline character. Code for these steps can be found in Appendix I.
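The sentinel trick described above can be sketched in Python. This is a minimal illustration, not the production script: the sentinel word `ghosh01` is taken from the Appendix I script, and the regular expression for the opening XML declaration is an assumption about the file layout.

```python
import re

SENTINEL = "ghosh01"  # unique non-English token, as in the Appendix I script

def one_patent_per_line(raw):
    # Step 1: replace each patent's opening XML declaration with the sentinel
    marked = re.sub(r"<\?xml[^>]*\?>", SENTINEL, raw)
    # Step 2: flatten every newline so the whole file becomes one long string
    flat = marked.replace("\n", " ")
    # Step 3: restore one-patent-per-line by turning sentinels back into newlines
    return flat.replace(SENTINEL, "\n").strip()
```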
The second step was to extract only the patents belonging to Novartis or one of its subsidiaries. We used bash shell scripting and Perl to find and extract those lines, saving them to a separate file for further processing (see Appendix I).
The final step was to extract the specific fields we needed from each patent, which required parsing the XML tags. Eventually we discovered Apache Pig, whose Pig Latin language (together with the Piggybank XMLLoader and XPath functions) can parse XML (see Appendix I, point 3), but not before one group member wrote Java code that parsed each patent file one character at a time and extracted the relevant data (see Appendix II).
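For comparison, a standard XML parser can pull the same kinds of fields in a few lines. The sketch below uses Python's standard library `xml.etree` on a fabricated two-field patent record (the record is invented for illustration and is not taken from the USPTO data); the element paths match those used in the Pig script of Appendix I.

```python
import xml.etree.ElementTree as ET

# A fabricated, heavily trimmed patent record for illustration only
record = """<us-patent-grant>
  <us-bibliographic-data-grant>
    <publication-reference><document-id>
      <doc-number>07000001</doc-number><date>20060401</date>
    </document-id></publication-reference>
    <invention-title>Example title</invention-title>
  </us-bibliographic-data-grant>
</us-patent-grant>"""

root = ET.fromstring(record)
# findtext() walks the element path and returns the text of the first match
doc_number = root.findtext(".//publication-reference/document-id/doc-number")
title = root.findtext(".//invention-title")
```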
STATISTICAL MODELING
Each patent had a lag time between the date it was filed and the date it was granted. A linear regression model estimates the mean lag time for each patent type, so we could determine which patent types should receive more resources in the research and development phase. A sample of the lag times and patent variable codes is given in Appendix III.
We ran a simple linear regression in SPSS, a statistical package from IBM. The results are given in Appendix III in two tables, Coefficients and ANOVA. In the Coefficients table, the p-value (in the column labeled “Sig.”) is less than .05 for each variable, so each variable's relationship to lag time is significant. The patents labeled “Chemistry” served as the baseline (constant) category, giving a basic regression equation for mean lag time of:
lag_time = 4.792 + (b * patent_type)
where b is the coefficient for each patent type. The resulting equation for each patent type is given in Appendix III.
As seen in the ANOVA table, the p-value was less than alpha = .05, so we rejected the null hypothesis and concluded that at least one of the parameters differs from zero. In other words, the data provide sufficient evidence that mean lag time varies among the seven major patent types held by Novartis.
A bar graph of the mean lag times for each patent type shows that the longest lag times occur for patents of type surgery, chemistry and drug, while the shortest occur for optics, synthetic compounds and prosthesis. This suggests that putting more research effort into optics, synthetics and prostheses would increase profits in the short term, since products in those categories can be brought to market more quickly than the other types.
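As a check, the fitted mean lag times implied by the Coefficients table can be recomputed directly. This is arithmetic on the reported SPSS output (coefficient values copied from Appendix III), not a re-run of the model:

```python
BASELINE = 4.792  # intercept: "Chemistry" patents serve as the baseline category

# Slope coefficients for the dummy-coded patent types (Appendix III)
COEFFICIENTS = {
    "drug": -0.719,
    "optics": -1.888,
    "synthetic": -1.847,
    "organic": -0.857,
    "surgery": 1.025,
    "prosthesis": -1.470,
    "chemistry": 0.0,  # baseline: coefficient is zero by construction
}

# Fitted mean lag time (years) for each patent type
mean_lag = {t: round(BASELINE + b, 3) for t, b in COEFFICIENTS.items()}
```

Ranking `mean_lag` reproduces the ordering in the bar chart: surgery has the longest fitted lag and optics the shortest.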
TOPIC MODELING
The patent titles were used to divide the patents into ten topics using the MALLET topic-modeling tool. This revealed that Novartis produces the most patents in HIV-related and surgery-related areas. The complete steps are given in Appendix IV.
MERGING USPTO AND FDA DATASETS
To analyze patents for drug ingredients by expiration date in terms of their brand-name equivalents, we merged and manipulated two large datasets: our patent data and FDA data. The complete steps are listed in Appendix V.
SENTIMENT ANALYSIS
We used Alchemy API to analyze online news articles published after the January 28, 2015 announcement of the three-way asset swap between Novartis, GSK, and Eli Lilly & Co. The results are shown in Appendix VI.
Appendix I
Bash Script and Hadoop
1. The following Bash script was used to unzip the XML patent files and move them to the home folder:
#!/bin/bash
#Copy Files
cd dsba-6100/patentData2000_2015/patBiblio2000_1h2015
for f in *2006*
do
cp "$f" /users/pghosh2/
done
cd
unzip '*.zip'
mv /users/pghosh2/*.xml /users/pghosh2/Nova_hdfs_unzipped/2006
cp -r /users/pghosh2/Nova_hdfs_unzipped/2006/* /users/pghosh2/Nova_hdfs_unzipped/2006_hadoop
hadoop fs -copyFromLocal /users/pghosh2/Nova_hdfs_unzipped/2006_hadoop/*.* /user/pghosh2/In_2006
2. The following bash script was used to extract the patent records of Novartis and competitor companies and to put each record on a single row (shown here with the orgname pattern for Merck; the pattern was changed for each company):
#!/bin/bash
#extract_nova
FILES="/users/pghosh2/Nova_hdfs_unzipped/all/*"
for f in $FILES
do
filename=$(basename "$f")
sed 's/.*xml.*/ghosh01/' "$f" | perl -p -e 's/\n/ /' | sed 's/us-patent-grant>/&\n/g' | grep -i "orgname>Merck" > "$filename"
done
3. The following Pig script was used to extract particular fields from the resulting files into CSV format:
REGISTER /users/pghosh2/pig/lib/piggybank.jar
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
A= LOAD '$input' using org.apache.pig.piggybank.storage.XMLLoader('us-patent-grant') as (x:chararray);
B= FOREACH A GENERATE XPath(x,'us-patent-grant/us-bibliographic-data-grant/assignees/assignee/addressbook/orgname'),XPath(x,'us-patent-grant/us-bibliographic-data-grant/publication-reference/document-id/doc-number'),XPath(x,'us-patent-grant/us-bibliographic-data-grant/invention-title'),XPath(x,'us-patent-grant/us-bibliographic-data-grant/application-reference/document-id/date'),XPath(x,'us-patent-grant/us-bibliographic-data-grant/publication-reference/document-id/date'),XPath(x,'us-patent-grant/us-bibliographic-data-grant/classification-national/main-classification'),XPath(x,'us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/length-of-grant'),XPath(x,'us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/us-term-extension'),XPath(x,'us-patent-grant/abstract');
STORE B INTO '/user/pghosh2/result_alcon.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage();
4. The Pig script is called by the following shell script to iterate over all the ipgb files in a folder:
#!/bin/bash
#extract_nova
FILES="/user/pghosh2/2006/*"
for f in $FILES
do
pig -f xml_parse.pig -param input=$f
done
The following data fields were extracted from the XML:
Company name,
patent number,
title,
filed date,
granted date,
main classification code,
length of grant,
term extension,
abstract.
The following data fields were later added or calculated:
1. Patent class (utility/design/reissue),
2. Lag time between filing and granting the patent, in years (granted date - filed date),
3. Filed date, split into ‘Filed month’, ‘Filed day’ and ‘Filed year’ using Excel formulas for ease of later manipulation,
4. Granted date, split into ‘Granted month’, ‘Granted day’ and ‘Granted year’ using Excel formulas for ease of later manipulation,
5. main_class_number, extracted from the main classification code with an Excel formula,
6. main_class_name, taken from the USPTO website and populated manually for each record,
7. Length of grant, which was not present in the XML files for utility records, so it was populated manually for all such records; the length of grant for utility patents is always 20 years,
8. Probable expiry, calculated from the granted date, length of grant and US term extension with Excel formulas.
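The derived fields in steps 2, 3, 4 and 8 were computed with Excel formulas; an equivalent sketch in Python is shown below. It assumes the USPTO dates are YYYYMMDD strings (as in the XML) and ignores the February 29 edge case in the expiry calculation.

```python
from datetime import date, timedelta

def parse_uspto_date(d):
    """Parse a YYYYMMDD date string (also yields the month/day/year split)."""
    return date(int(d[:4]), int(d[4:6]), int(d[6:8]))

def lag_time_years(filed, granted):
    """Lag between filing and grant in years (granted date - filed date)."""
    days = (parse_uspto_date(granted) - parse_uspto_date(filed)).days
    return round(days / 365.25, 2)

def probable_expiry(granted, length_of_grant=20, term_extension_days=0):
    """Granted date + length of grant (20 years for utility) + US term extension."""
    g = parse_uspto_date(granted)
    return g.replace(year=g.year + length_of_grant) + timedelta(days=term_extension_days)
```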
Appendix II
Java Code for Parsing XML
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.StringTokenizer;

public class XMLParser {
public static String removeComma(String s) {
StringBuffer r = new StringBuffer( s.length() );
r.setLength( s.length() );
int current = 0;
for (int i = 0; i < s.length(); i++) {
char cur = s.charAt(i);
if (cur != ',') r.setCharAt( current++, cur );
}
return r.toString();
}
public static void main(String[] args) throws IOException {
Scanner keyboard = new Scanner(System.in);
System.out.print("Enter name of file to parse: ");
String filename = keyboard.nextLine();
//take filename and create one for cleaned output
StringTokenizer st = new StringTokenizer(filename,".");
String outfile = st.nextElement()+"_parsed.txt";
//Scanner to read from file containing URL data
Scanner input = new Scanner(new File(filename));
//File object to write data of URLs embedded in HTML code
PrintWriter output = new PrintWriter(new FileWriter(outfile));
//convert entire file to ArrayList of strings
ArrayList<String> doc = new ArrayList<>();
while(input.hasNext()){
doc.add(input.nextLine().trim());
}
int index = 0; //index of ArrayList of strings
while(index < doc.size()){
//StringBuilder objects to hold data we are looking for in each document
StringBuilder company = new StringBuilder();
StringBuilder assigneeLast = new StringBuilder();
StringBuilder assigneeFirst = new StringBuilder();
StringBuilder yearGranted = new StringBuilder();
StringBuilder yearApplied = new StringBuilder();
StringBuilder patentClass = new StringBuilder();
StringBuilder patentNumber = new StringBuilder();
StringBuilder patentTitle = new StringBuilder();
StringBuilder patentAbstract = new StringBuilder();
//looking through document for company
String s = doc.get(index);
int i = 0;
/* we need pointers to the first '<' character and the last '>'
character so we can replace the entire subsequence with something
*/
int start = 0;
int end = 0;
//this flag will turn true if we are inside a tag
boolean insideTag = false;
//this flag will turn true when we want to append a field
boolean addField = false;
//this flag will turn true when the assignee tag has been found
boolean foundAssignee = false;
/* loop through each character in s and if its a tag don't append
it to the StringBuilder. Also if it's a tag before some data
we want store it in the correct variable
*/
//looking through document for patent number
s = doc.get(index);
i = 0;
start = 0;
end = 0;
insideTag = false;
addField = false;
boolean foundPatentNum = false;
while(i < s.length()){
if (!insideTag && s.charAt(i) == '<'){
insideTag = true;
addField = false;
start = i+1;
}
else if (insideTag && s.charAt(i) == '>'){
insideTag = false;
end = i;
}
else if (addField && s.charAt(i) != '<'){
if(patentNumber.length() < 8) patentNumber.append(s.charAt(i));
}
else if (addField && s.charAt(i) == '<'){
addField = false;
insideTag = true;
start = i+1;
}
if(end!=0 && end-start > 0){
if(s.substring(start,end).equals("publication-reference")) foundPatentNum = true;
if(foundPatentNum && s.substring(start,end).equals("doc-number")) { addField = true;}
}
i++;
}//end parsing document
[Note: Each patent file “string” had to be parsed a number of different times, once for each field of data we needed. That code is repeated and nearly identical, so we left it out of this report for brevity]
//remove commas from all fields so they don't get treated as separate columns in a csv file
company = new StringBuilder(removeComma(company.toString()));
patentTitle = new StringBuilder(removeComma(patentTitle.toString()));
patentAbstract = new StringBuilder(removeComma(patentAbstract.toString()));
assigneeFirst = new StringBuilder(removeComma(assigneeFirst.toString()));
assigneeLast = new StringBuilder(removeComma(assigneeLast.toString()));
//write the fields in order to the outputfile separated by commas
output.print(company.toString()+",");
output.print(assigneeFirst.toString()+ " " + assigneeLast.toString()+",");
output.print(yearGranted.toString()+",");
output.print(yearApplied.toString()+",");
output.print(patentClass.toString()+",");
output.print(patentNumber.toString()+",");
output.print(patentTitle.toString()+",");
output.println(patentAbstract.toString());
index++;
}//end while loop through all documents in file
output.close();
}//end main
}
APPENDIX III: Regression Model
Top Novartis Patent Classes
Patent Main Class | Variable Code | Number of Patents
Chemistry: molecular biology and microbiology | 1 | 114
Drug, bio-affecting and body treating compositions | 2 | 1042
Optics: eye examining, vision testing and correcting | 3 | 61
Synthetic resins or natural rubbers | 4 | 69
Organic compounds | 5 | 225
Surgery | 6 | 64
Prosthesis | 7 | 32
Sample Data Excerpt
Lag Time (Years) | Patent Variable
3.19 | 1
3.9 | 1
4.06 | 1
3.71 | 1
3.34 | 2
1.9 | 2
2.99 | 2
4.58 | 2
4.36 | 2
2.72 | 2
SPSS Regression Results
Regression Equations for Each Patent Type (from Coefficients Table)
lag_time = 4.792 + (-0.719 * 1) = 4.073 (drug)
lag_time = 4.792 + (-1.888 * 1) = 2.904 (optics)
lag_time = 4.792 + (-1.847 * 1) = 2.945 (synthetic)
lag_time = 4.792 + (-0.857 * 1) = 3.935 (organic)
lag_time = 4.792 + (1.025 * 1) = 5.817 (surgery)
lag_time = 4.792 + (-1.470 * 1) = 3.322 (prosthesis)
lag_time = 4.792 (chemistry)
Bar Chart of Mean Lag Times for Patent Types
Analysis of Variance (Significance of Entire Model)
APPENDIX IV:
Topic Modeling using Mallet
Table A: Screenshot of the data file used as input to MALLET
Table B: Commands for topic modeling
Table C: Topic model and associated keywords
From the topic-proportion document, we took the topic with the highest proportion for each patent and made a new document, which we used to make graphs in Tableau.
Table D: Topic proportion document screenshot
Table E: Document used for the Tableau graph
Table F: Patent distribution by topic over time
APPENDIX V:
Combining Patent Data with FDA Drug Codes and Exclusivity Data
The following steps were taken to produce the comprehensive files:
1. Download Orange Book Data Files (compressed) from http://www.fda.gov/Drugs/InformationOnDrugs/ucm129689.htm.
2. The compressed file obtained from this link is shared as EOBZIP_2015_07.zip.
3. The text files obtained after extracting the zip file were converted to CSV using Notepad++ (replace all ‘,’ with ‘|’, then replace all ‘~’ with ‘,’).
4. The resulting CSV files are product.csv, patent.csv and exclusivity.csv.
5. These files were imported into SAS, where the data manipulation tasks were performed:
- all Novartis, Alcon and Sandoz rows were filtered out of product.csv;
- the resulting data is shared in the file PRODUCT_NOVARTIS.csv.
6. From PRODUCT_NOVARTIS and the exclusivity table, all applications with exclusivity data were filtered out. The resultant dataset is PRODUCT_EXCLUSIVITY_NOVARTIS.csv.
7. PRODUCT_EXCLUSIVITY_NOVARTIS.csv was imported into MySQL and the data grouped by Appl_No, Product_No and Exclusivity_Code using the following query:
select * from product_exclusivity_novartis
group by Appl_No,Product_No,Exclusivity_Code;
8. The resultant data set is shared in the file Novartis_product_exclusivity.csv.
9. From PRODUCT_NOVARTIS and the patent table, all applications with patent data were filtered out. The resultant dataset is PRODUCT_PATENT.csv.
10. PRODUCT_PATENT.csv was imported into MySQL and the data grouped by Appl_No, Product_No and Patent_No using the following query:
select * from product_patent
group by Appl_No,Product_No,Patent_No;
11. The resultant dataset is shared in the file Novartis_product_patent.csv.
12. From this dataset, all patents expiring in 2016 through 2018 were filtered out using the query:
select * from product_patent
group by Appl_No,Product_No,Patent_No
having (Patent_Expire_Date_Text like '%2016'
or Patent_Expire_Date_Text like '%2017'
or Patent_Expire_Date_Text like '%2018');
13. The resultant dataset is shared in the file Novartis_ProdPat_ExpireBy2017.csv.
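For readers without MySQL, the de-duplication (steps 7 and 10) and the expiry filter (step 12) can be sketched in Python. The column names follow the Orange Book patent file; the simplified CSV layout and date format in the sketch are assumptions for illustration.

```python
import csv

def patents_expiring_soon(path, years=("2016", "2017", "2018")):
    """Keep one row per (Appl_No, Product_No, Patent_No), mimicking the
    GROUP BY, then keep rows whose expiry text ends with a target year."""
    seen, kept = set(), []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["Appl_No"], row["Product_No"], row["Patent_No"])
            if key in seen:
                continue  # duplicate combination, as collapsed by GROUP BY
            seen.add(key)
            if row["Patent_Expire_Date_Text"].endswith(years):
                kept.append(row)
    return kept
```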
We repeated the process for competitors Merck and GlaxoSmithKline.
APPENDIX VI:
Sentiment Analysis with Alchemy API
1. PMLive.com
http://www.pmlive.com/blogs/the_editors/archive/2023/march/gsk_and_novartis_asset_swap_a_tale_of_two_companies_677593
2. The Wall Street Journal
http://www.wsj.com/articles/gsk-cuts-return-to-shareholders-from-novartis-asset-swap-1430911793
3. The Telegraph
http://www.telegraph.co.uk/finance/newsbysector/pharmaceuticalsandchemicals/10791444/GSK-Novartis-deal-is-a-game-changer-for-MandA.html
4. Reuters
http://www.reuters.com/article/us-novartis-gsk-assetswaps-idUSBREA3M1KX20140423#CCzHcuqW3ruU0YqZ.97
5. The Guardian
http://www.theguardian.com/business/2014/apr/22/glaxsmithkline-novartis-joint-venture-assets
6. The Financial Times
http://www.ft.com/cms/s/8dafdef0-c574-11e4-bd6b-00144feab7de,Authorised=false.html?siteedition=uk&_i_location=http%3A%2F%2Fwww.ft.com%2Fcms%2Fs%2F0%2F8dafdef0-c574-11e4-bd6b-00144feab7de.html%3Fsiteedition%3Duk&_i_referer=&classification=conditional_standard&iab=barrier-app#axzz3sFYoawTr
7. Barron’s
http://www.barrons.com/articles/novartis-looks-much-healthier-1430533899
8. FiercePharma
http://www.fiercepharma.com/story/gsk-deals-done-now-novartis-has-deliver-cancer-growth-pledge/2015-03-02
9. The Irish Times
http://www.irishtimes.com/business/health-pharma/novartis-gsk-complete-deals-to-reshape-both-drugmakers-1.2122899