
Novartis

TransWorld Group

Part III: Technical Report

DSBA 6100

David Farthing

Paramita Ghosh

Lyndsay Richter

Kaitlyn Schroeder

Raja Sekhar Potluri

December 16, 2015

Table of Contents

III. Technical Report

a. Executive Summary

b. Appendix I: Bash Script and Hadoop

c. Appendix II: Java Code for Parsing XML

d. Appendix III: Regression Model

e. Appendix IV: Topic Modeling Using MALLET

f. Appendix V: Patent Data + FDA Orange Book

g. Appendix VI: Sentiment Analysis with Alchemy API

Technical Report Executive Summary

DATA PRE-PROCESSING

The first step in organizing the patent data was to use Bash shell scripting to collapse each patent onto a single line of one document. To do this, we replaced the opening XML tag of each patent with a unique non-English word that would not appear in any patent, then removed all newline characters from the entire file. The final step was to separate the patents again by replacing each occurrence of the special non-English word with a newline character. Code for these steps can be found in Appendix I.
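A minimal Python sketch of these three steps, for illustration only (the marker word "ghosh01" is borrowed from the Appendix I script; any token guaranteed not to appear in a patent works):

```python
# Sketch of the one-patent-per-line transformation described above.
# MARKER stands in for the unique non-English word used in practice.
MARKER = "ghosh01"
XML_DECL = '<?xml version="1.0" encoding="UTF-8"?>'

def one_patent_per_line(text: str) -> str:
    # Step 1: each patent begins with its own XML declaration;
    # replace that opening tag with the marker.
    flattened = text.replace(XML_DECL, MARKER)
    # Step 2: remove all newline characters from the entire file.
    flattened = flattened.replace("\n", " ")
    # Step 3: turn each marker back into a newline, so each patent
    # occupies exactly one line.
    return flattened.replace(MARKER, "\n")
```

The real files were processed with sed and perl (Appendix I); this sketch only makes the logic of the transformation explicit.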

The second step in processing the patent data was to extract only the patents belonging to Novartis or one of its subsidiaries. We used Bash shell scripting and Perl to find and extract those particular lines, saving them to a separate file for further processing (see Appendix I).

The final step in processing the patent data was to extract the specific information we needed from each patent, which required parsing the XML tags. Eventually, we discovered Apache Pig and its language for parsing XML, Pig Latin (see Appendix I, point 3), but not before one group member wrote Java code that parsed each patent file one character at a time and extracted the relevant data (see Appendix II).

STATISTICAL MODELING

Each individual patent had a lag time between the time it was filed and the time it was granted. A linear regression model estimates the mean lag time of patents given the patent type, so we could determine which patent types should receive more resources in the research and development phase. A sample of the data of lag times and patent variable codes is given in Appendix III.

We ran a simple linear regression in SPSS, a statistical package by IBM. The results of the regression are given in Appendix III in two tables, Coefficients and ANOVA. In the Coefficients table, the p-value (given in the column labeled “sig.”) is less than .05 in each case, so each variable's relationship to lag time is significant. The patents labeled “Chemistry” were used as a baseline or constant coefficient, giving a basic regression equation for mean lag time as: lag_time = 4.792 + (b * patent_type)

where b is the coefficient for each patent type. This gave regression equations for each patent type which are given in Appendix III.

As seen in the ANOVA table, the p-value was less than alpha=.05, so we rejected the null hypothesis and concluded that at least one of the parameters differed from zero. In other words, we were able to conclude that the data provide sufficient evidence that the mean lag time does vary among the seven major patent types held by Novartis.

A bar graph of the mean lag times for each patent type shows that the greatest lag times occur for patents of type surgery, chemistry, and drug, while the shortest lag times are for optics, synthetic compounds, and prosthesis. This suggests that putting more effort into research in the categories of optics, synthetics, and prostheses will increase profits in the short term, since products in those categories can be brought to market more quickly than the other types.
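The fitted equations in Appendix III can be reproduced with a short sketch (coefficients copied from the SPSS Coefficients table; this block is an illustration added for clarity, not part of the original analysis):

```python
# Predicted mean lag time per patent type, from the SPSS output in
# Appendix III. "chemistry" is the baseline, so its coefficient is zero
# and the intercept alone is its predicted mean lag time.
INTERCEPT = 4.792
COEF = {
    "chemistry": 0.0,
    "drug": -0.719,
    "optics": -1.888,
    "synthetic": -1.847,
    "organic": -0.857,
    "surgery": 1.025,
    "prosthesis": -1.470,
}

def predicted_lag(patent_type: str) -> float:
    # lag_time = intercept + coefficient for the dummy variable of this type
    return round(INTERCEPT + COEF[patent_type], 3)
```

For example, predicted_lag("surgery") recovers the 5.817-year figure in Appendix III, the longest of the seven types.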

TOPIC MODELING

The titles of the patents were used to identify ten topics into which the patents were divided using the MALLET tool. This helped us discover that Novartis produces the most patents in HIV-related and surgery-related areas. The complete steps are given in Appendix IV.

MERGING USPTO AND FDA DATASETS

To analyze patents for drug ingredients by expiration date in terms of their brand name equivalent, we had to merge and manipulate two large datasets, our patent data and FDA data. The complete steps are listed in Appendix V.

SENTIMENT ANALYSIS

We used Alchemy API to analyze online news articles published after the January 28, 2015 announcement of the three-way asset swap between Novartis, GSK, and Eli Lilly & Co. The results are shown in Appendix VI.

Appendix I

Bash Script and Hadoop

1. The following Bash script was used to unzip the XML patent files and move them to the home folder:

#!/bin/bash
#Copy Files
cd dsba-6100/patentData2000_2015/patBiblio2000_1h2015
for f in *2006*
do
    cp "$f" /users/pghosh2/
done
cd
unzip '*.zip'
mv /users/pghosh2/*.xml /users/pghosh2/Nova_hdfs_unzipped/2006
cp -r /users/pghosh2/Nova_hdfs_unzipped/2006/* /users/pghosh2/Nova_hdfs_unzipped/2006_hadoop
hadoop fs -copyFromLocal /users/pghosh2/Nova_hdfs_unzipped/2006_hadoop/*.* /user/pghosh2/In_2006

2. The following Bash script was used to extract the patent records of Novartis and its competitor companies, formatting one record per row (this instance greps for Merck):

#!/bin/bash
#extract_nova
FILES="/users/pghosh2/Nova_hdfs_unzipped/all/*"
for f in $FILES
do
    filename=$(basename "$f")
    sed 's/.*xml.*/ghosh01/' "$f" | perl -p -e 's/\n/ /' | sed 's/us-patent-grant>/&\n/g' | grep -i "orgname>Merck" > "$filename"
done

3. The following Pig script was used to extract particular data fields from the resulting files into CSV format:

REGISTER /users/pghosh2/pig/lib/piggybank.jar

DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

A = LOAD '$input' USING org.apache.pig.piggybank.storage.XMLLoader('us-patent-grant') AS (x:chararray);

B = FOREACH A GENERATE
    XPath(x, 'us-patent-grant/us-bibliographic-data-grant/assignees/assignee/addressbook/orgname'),
    XPath(x, 'us-patent-grant/us-bibliographic-data-grant/publication-reference/document-id/doc-number'),
    XPath(x, 'us-patent-grant/us-bibliographic-data-grant/invention-title'),
    XPath(x, 'us-patent-grant/us-bibliographic-data-grant/application-reference/document-id/date'),
    XPath(x, 'us-patent-grant/us-bibliographic-data-grant/publication-reference/document-id/date'),
    XPath(x, 'us-patent-grant/us-bibliographic-data-grant/classification-national/main-classification'),
    XPath(x, 'us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/length-of-grant'),
    XPath(x, 'us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/us-term-extension'),
    XPath(x, 'us-patent-grant/abstract');

STORE B INTO '/user/pghosh2/result_alcon.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage();

4. The Pig script is called by the following shell script to iterate over all of the ipgb files in a folder:

#!/bin/bash
#extract_nova
FILES="/user/pghosh2/2006/*"
for f in $FILES
do
    pig -f xml_parse.pig -param input="$f"
done
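For readers without a Pig installation, the same field extraction can be sketched in Python with the standard library's ElementTree (this is an added illustration; the paths below are a subset of the XPaths in the Pig script, written relative to the us-patent-grant root):

```python
# Minimal sketch of the XPath-based field extraction performed by the
# Pig script, applied to one patent's XML string.
import xml.etree.ElementTree as ET

# Field name -> path relative to the <us-patent-grant> root element.
FIELDS = {
    "orgname": "us-bibliographic-data-grant/assignees/assignee/addressbook/orgname",
    "doc_number": "us-bibliographic-data-grant/publication-reference/document-id/doc-number",
    "title": "us-bibliographic-data-grant/invention-title",
    "filed": "us-bibliographic-data-grant/application-reference/document-id/date",
    "granted": "us-bibliographic-data-grant/publication-reference/document-id/date",
}

def extract_fields(patent_xml: str) -> dict:
    # Parse one patent; the root element is <us-patent-grant>.
    root = ET.fromstring(patent_xml)
    # findtext returns None for missing elements; normalize to "".
    return {name: (root.findtext(path) or "") for name, path in FIELDS.items()}
```

This mirrors the GENERATE clause above, producing one dictionary (one CSV row) per patent.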

The following data fields were extracted from the XML: company name, patent number, title, filed date, granted date, main classification code, length of grant, term extension, and abstract.

The following data fields were later added or calculated:

1. Patent class (utility/design/reissue).

2. Lag time between filing and granting the patent, in years (granted date - filed date).

3. Filed date, split into ‘Filed month’, ‘Filed day’, and ‘Filed year’ using Excel formulas for ease of later manipulation.

4. Granted date, split into ‘Granted month’, ‘Granted day’, and ‘Granted year’ using Excel formulas for ease of later manipulation.

5. main_class_number, extracted from the main classification code with an Excel formula.

6. main_class_name, taken from the USPTO website and populated manually for each record.

7. Length of grant, which was not present in the XML files for utility records and so was populated manually for all such records; the length of grant for utility patents is always 20 years.

8. Probable expiry, calculated from the granted date, length of grant, and US term extension with Excel formulas.
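The derived-field calculations in steps 2 and 8 can be sketched as follows (an added illustration mirroring the Excel formulas; dates in the extracted CSV are YYYYMMDD strings):

```python
from datetime import date, timedelta

def parse_uspto_date(s: str) -> date:
    # Dates in the extracted CSV are YYYYMMDD strings, e.g. "20060613".
    return date(int(s[:4]), int(s[4:6]), int(s[6:8]))

def lag_years(filed: str, granted: str) -> float:
    # Step 2: lag between filing and grant, in years.
    days = (parse_uspto_date(granted) - parse_uspto_date(filed)).days
    return round(days / 365.25, 2)

def probable_expiry(granted: str, length_of_grant_years: int,
                    extension_days: int) -> date:
    # Step 8: granted date + length of grant + US term extension.
    # (A February 29 grant date would need special handling here.)
    d = parse_uspto_date(granted)
    return d.replace(year=d.year + length_of_grant_years) + timedelta(days=extension_days)
```

For a utility patent, length_of_grant_years is 20 (step 7).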

Appendix II

Java Code for Parsing XML

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.StringTokenizer;

public class XMLParser {

    public static String removeComma(String s) {
        StringBuffer r = new StringBuffer(s.length());
        r.setLength(s.length());
        int current = 0;
        for (int i = 0; i < s.length(); i++) {
            char cur = s.charAt(i);
            if (cur != ',') r.setCharAt(current++, cur);
        }
        r.setLength(current); // trim the unused slots left where commas were removed
        return r.toString();
    }

    public static void main(String[] args) throws IOException {
        Scanner keyboard = new Scanner(System.in);
        System.out.print("Enter name of file to parse: ");
        String filename = keyboard.nextLine();

        // take the filename and create one for the cleaned output
        StringTokenizer st = new StringTokenizer(filename, ".");
        String outfile = st.nextElement() + "_parsed.txt";

        // Scanner to read from the file containing the patent data
        Scanner input = new Scanner(new File(filename));

        // PrintWriter to write the extracted fields as CSV
        PrintWriter output = new PrintWriter(new FileWriter(outfile));

        // convert the entire file to an ArrayList of strings, one patent per line
        ArrayList<String> doc = new ArrayList<>();
        while (input.hasNext()) {
            doc.add(input.nextLine().trim());
        }

        int index = 0; // index into the ArrayList of patents
        while (index < doc.size()) {
            // StringBuilder objects to hold the data we are looking for in each document
            StringBuilder company = new StringBuilder();
            StringBuilder assigneeLast = new StringBuilder();
            StringBuilder assigneeFirst = new StringBuilder();
            StringBuilder yearGranted = new StringBuilder();
            StringBuilder yearApplied = new StringBuilder();
            StringBuilder patentClass = new StringBuilder();
            StringBuilder patentNumber = new StringBuilder();
            StringBuilder patentTitle = new StringBuilder();
            StringBuilder patentAbstract = new StringBuilder();

            // looking through the document for the company
            String s = doc.get(index);
            int i = 0;
            /* we need pointers to the first '<' character and the last '>'
               character so we can replace the entire subsequence with something */
            int start = 0;
            int end = 0;
            // this flag will turn true if we are inside a tag
            boolean insideTag = false;
            // this flag will turn true when we want to append a field
            boolean addField = false;
            // this flag will turn true when the assignee tag has been found
            boolean foundAssignee = false;
            /* loop through each character in s and, if it is part of a tag,
               don't append it to the StringBuilder. Also, if it is a tag that
               precedes data we want, store what follows in the correct variable */

            // looking through the document for the patent number
            s = doc.get(index);
            i = 0;
            start = 0;
            end = 0;
            insideTag = false;
            addField = false;
            boolean foundPatentNum = false;
            while (i < s.length()) {
                if (!insideTag && s.charAt(i) == '<') {
                    insideTag = true;
                    addField = false;
                    start = i + 1;
                } else if (insideTag && s.charAt(i) == '>') {
                    insideTag = false;
                    end = i;
                } else if (addField && s.charAt(i) != '<') {
                    if (patentNumber.length() < 8) patentNumber.append(s.charAt(i));
                } else if (addField && s.charAt(i) == '<') {
                    addField = false;
                    insideTag = true;
                    start = i + 1;
                }
                if (end != 0 && end - start > 0) {
                    if (s.substring(start, end).equals("publication-reference")) foundPatentNum = true;
                    if (foundPatentNum && s.substring(start, end).equals("doc-number")) { addField = true; }
                }
                i++;
            } // end parsing document

            [Note: Each patent file “string” had to be parsed a number of different times, once for each field of data we needed. That code is repeated and nearly identical, so we left it out of this report for brevity.]

            // remove commas from all fields so they don't get treated as separate columns in a CSV file
            company = new StringBuilder(removeComma(company.toString()));
            patentTitle = new StringBuilder(removeComma(patentTitle.toString()));
            patentAbstract = new StringBuilder(removeComma(patentAbstract.toString()));
            assigneeFirst = new StringBuilder(removeComma(assigneeFirst.toString()));
            assigneeLast = new StringBuilder(removeComma(assigneeLast.toString()));

            // write the fields, in order, to the output file, separated by commas
            output.print(company.toString() + ",");
            output.print(assigneeFirst.toString() + " " + assigneeLast.toString() + ",");
            output.print(yearGranted.toString() + ",");
            output.print(yearApplied.toString() + ",");
            output.print(patentClass.toString() + ",");
            output.print(patentNumber.toString() + ",");
            output.print(patentTitle.toString() + ",");
            output.println(patentAbstract.toString());

            index++;
        } // end while loop through all documents in file

        output.close();
    } // end main
}

APPENDIX III: Regression Model

Top Novartis Patent Classes

Patent Main Class                                        Variable Code   Number of Patents
Chemistry: molecular biology and microbiology                  1                114
Drug, bio-affecting and body treating compositions             2               1042
Optics: eye examining, vision testing and correcting           3                 61
Synthetic resins or natural rubbers                            4                 69
Organic compounds                                              5                225
Surgery                                                        6                 64
Prosthesis                                                     7                 32

Sample Data Excerpt

Lag Time (Years)   Patent Variable
3.19               1
3.9                1
4.06               1
3.71               1
3.34               2
1.9                2
2.99               2
4.58               2
4.36               2
2.72               2

SPSS Regression Results

Regression Equations for Each Patent Type (from Coefficients Table)

lag_time = 4.792 + (-0.719 * 1) = 4.073 (drug)

lag_time = 4.792 + (-1.888 * 1) = 2.904 (optics)

lag_time = 4.792 + (-1.847 * 1) = 2.945 (synthetic)

lag_time = 4.792 + (-0.857 * 1) = 3.935 (organic)

lag_time = 4.792 + (1.025 * 1) = 5.817 (surgery)

lag_time = 4.792 + (-1.470 * 1) = 3.322 (prosthesis)

lag_time = 4.792 (chemistry)

Bar Chart of Mean Lag Times for Patent Types

Analysis of Variance (Significance of Entire Model)

APPENDIX IV:

Topic Modeling Using MALLET

Table A: Screenshot of data file used as input in mallet

Table B: Commands for topic modeling:

Table C: Topic model and associated keywords:

From the topic-proportions document, we took the topic with the highest proportion for each patent and made a new document, which we used to build graphs in Tableau.
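This per-patent reduction can be sketched as follows (an added illustration; it assumes MALLET's older doc-topics output format, in which each line holds a document id, a document name, and then alternating topic/proportion pairs):

```python
# For one line of MALLET's doc-topics output, return the topic with the
# highest proportion. Assumed line layout (whitespace-separated):
#   <doc id> <doc name> <topic> <proportion> <topic> <proportion> ...
def dominant_topic(doc_topics_line: str):
    parts = doc_topics_line.split()
    pairs = parts[2:]  # skip doc id and doc name
    best_topic, best_prop = None, -1.0
    for i in range(0, len(pairs), 2):
        topic, prop = int(pairs[i]), float(pairs[i + 1])
        if prop > best_prop:
            best_topic, best_prop = topic, prop
    return best_topic, best_prop
```

Applying this to every line of the topic-proportions document yields the one-topic-per-patent file used for the Tableau graphs.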

Table D: Topic proportion document screenshot

Table E: Document used for tableau graph

Table F: Patent distribution as per topics with timeline

APPENDIX V:

Combining Patent Data with FDA Drug Codes and Exclusivity Data

The following steps were taken to produce the comprehensive files:

1. Download Orange Book Data Files (compressed) from http://www.fda.gov/Drugs/InformationOnDrugs/ucm129689.htm.

2. The compressed file obtained from this link is shared as EOBZIP_2015_07.zip.

3. The text files obtained after extracting the zip file were converted to CSV using Notepad++ (replace all ‘,’ with ‘|’, then replace all ‘~’ with ‘,’).

4. The resulting CSV files are product.csv, patent.csv, and exclusivity.csv.

5. These files were imported into SAS, where data manipulation tasks were performed: all Novartis, Alcon, and Sandoz records were extracted from product.csv, and the resulting data are shared in the file PRODUCT_NOVARTIS.csv.

6. From PRODUCT_NOVARTIS and the exclusivity table, all applications with exclusivity data were extracted. The resulting dataset is PRODUCT_EXCLUSIVITY_NOVARTIS.csv.

7. PRODUCT_EXCLUSIVITY_NOVARTIS.csv was imported into MySQL and the data grouped by Appl_No, Product_No, and Exclusivity_Code using the following query:

select * from product_exclusivity_novartis
group by Appl_No, Product_No, Exclusivity_Code;

8. The resultant data set is shared in the file Novartis_product_exclusivity.csv.

9. From PRODUCT_NOVARTIS and the patent table, all applications with patent data were extracted. The resulting dataset is PRODUCT_PATENT.csv.

10. PRODUCT_PATENT.csv was imported into MySQL and the data grouped by Appl_No, Product_No, and Patent_No using the following query:

select * from product_patent
group by Appl_No, Product_No, Patent_No;

11. The resultant dataset is shared in the file Novartis_product_patent.csv.

12. From this dataset, all patents expiring in 2016 through 2018 were selected using the query:

select * from product_patent
group by Appl_No, Product_No, Patent_No
having (Patent_Expire_Date_Text like '%2016'
    or Patent_Expire_Date_Text like '%2017'
    or Patent_Expire_Date_Text like '%2018');

13. The resulting dataset is shared in the file Novartis_ProdPat_ExpireBy2017.csv.

We repeated the process for competitors Merck and GlaxoSmithKline.
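The core of the SAS/MySQL workflow above (filter by applicant, then join product and patent records on application and product number) can be sketched in plain Python. This is an added illustration: the "Applicant" column name is our assumption about the Orange Book product file, and the rows would come from csv.DictReader over the converted CSV files.

```python
# Sketch of the filter-and-join performed in SAS/MySQL above.
# product_rows and patent_rows are iterables of dicts, e.g. produced by
# csv.DictReader over product.csv and patent.csv.
def join_products_patents(product_rows, patent_rows,
                          companies=("NOVARTIS", "ALCON", "SANDOZ")):
    # Step 5: keep only Novartis-family applicants
    # ("Applicant" column name assumed).
    products = [r for r in product_rows
                if any(c in r["Applicant"].upper() for c in companies)]
    # Steps 9-10: group patent records by (Appl_No, Product_No) ...
    patents_by_key = {}
    for p in patent_rows:
        patents_by_key.setdefault((p["Appl_No"], p["Product_No"]), []).append(p)
    # ... and join them to the filtered products.
    joined = []
    for r in products:
        for p in patents_by_key.get((r["Appl_No"], r["Product_No"]), []):
            joined.append({**r, **p})
    return joined
```

A final filter on the patent expiry column (as in step 12) then yields the patents expiring in 2016 through 2018.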

APPENDIX VI:

Sentiment Analysis with Alchemy API

1. PMLive.com

http://www.pmlive.com/blogs/the_editors/archive/2023/march/gsk_and_novartis_asset_swap_a_tale_of_two_companies_677593

2. The Wall Street Journal

http://www.wsj.com/articles/gsk-cuts-return-to-shareholders-from-novartis-asset-swap-1430911793

3. The Telegraph

http://www.telegraph.co.uk/finance/newsbysector/pharmaceuticalsandchemicals/10791444/GSK-Novartis-deal-is-a-game-changer-for-MandA.html

4. Reuters

http://www.reuters.com/article/us-novartis-gsk-assetswaps-idUSBREA3M1KX20140423#CCzHcuqW3ruU0YqZ.97

5. The Guardian

http://www.theguardian.com/business/2014/apr/22/glaxsmithkline-novartis-joint-venture-assets

6. The Financial Times

http://www.ft.com/cms/s/8dafdef0-c574-11e4-bd6b-00144feab7de,Authorised=false.html?siteedition=uk&_i_location=http%3A%2F%2Fwww.ft.com%2Fcms%2Fs%2F0%2F8dafdef0-c574-11e4-bd6b-00144feab7de.html%3Fsiteedition%3Duk&_i_referer=&classification=conditional_standard&iab=barrier-app#axzz3sFYoawTr

7. Barron’s

http://www.barrons.com/articles/novartis-looks-much-healthier-1430533899

8. FiercePharma

http://www.fiercepharma.com/story/gsk-deals-done-now-novartis-has-deliver-cancer-growth-pledge/2015-03-02

9. The Irish Times

http://www.irishtimes.com/business/health-pharma/novartis-gsk-complete-deals-to-reshape-both-drugmakers-1.2122899
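Each article URL above was submitted to Alchemy API's document-sentiment service. The call can be sketched as follows; note that AlchemyAPI has since been retired by IBM, so the endpoint path, parameter names, and sample response below are recollections of the then-current REST API and should be treated as assumptions.

```python
# Sketch of one AlchemyAPI URL-sentiment call (no network access here;
# we only build the request URL and parse a sample JSON response).
import json
import urllib.parse

# Endpoint path as we recall it; an assumption, not verified documentation.
ENDPOINT = "http://access.alchemyapi.com/calls/url/URLGetTextSentiment"

def build_request(api_key: str, article_url: str) -> str:
    params = {"apikey": api_key, "url": article_url, "outputMode": "json"}
    return ENDPOINT + "?" + urllib.parse.urlencode(params)

def parse_sentiment(response_text: str):
    # The JSON response carries a docSentiment object with a type
    # (positive/negative/neutral) and, except for neutral, a score.
    doc = json.loads(response_text)
    s = doc["docSentiment"]
    return s["type"], float(s.get("score", 0.0))
```

Running build_request for each of the nine URLs and collecting the (type, score) pairs produced the results shown in Appendix VI.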
