Open Data and Data Formats

TCPD Summer School 2018 (c) Ashoka University

Data-based analysis: caveats

Quality: Your analysis is only as good as your data

Correlation vs. Causation

Prone to filter bubbles

Some questions are imprecise: Qualitative and quantitative techniques need to co-exist

A lot of data is boring/hard to use — signal vs. noise

https://www.nytimes.com/2014/04/07/opinion/eight-no-nine-problems-with-big-data.html

Data lifecycle

[Figure: data lifecycle. Graphic: Jeff Heer]

Data organization

Databases

How is it stored?

Organization (encoding, structure, meaning…) or “Schema”

a.k.a. code book

How is it queried?

Query language, visualization, interfaces, etc.

Relational Databases

One way to store data: As a table of rows and columns

Pioneered by IBM (E.F. Codd in 1970)

Subsequently used by Oracle in a project for the CIA in the 1970s

Developed into a huge industry over 40 years

Relational Databases

Sets of tables (each like a sheet in a spreadsheet)

Each table has rows and columns

Each table has a “schema” (the set of columns and their meanings); all rows follow the same structure

Efficient at retrieving rows quickly based on the values in some of their columns (using keys and indexes), or computing aggregates

Popular query language: Structured Query Language (SQL)

Like a spreadsheet (but no formulae)

A CSV file is a simple way to store a small(ish) database

https://data.gov.in/sites/default/files/NDSAP_Implementation_Guidelines-2.1.pdf

SQL

Query language for databases, e.g.

SELECT CAND_NAME, YEAR, POSITION, VOTES

FROM CANDIDATES

WHERE PC_NUMBER = 543;

(analogous to filtering rows in Excel)
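
The query above can be tried from Python with the standard-library sqlite3 module. This is a minimal sketch: the in-memory table and its rows are made-up placeholders, and the ORDER BY clause is an addition for readable output. Only the column names and the shape of the query come from the slide.

```python
import sqlite3

# In-memory database; the table layout follows the column names in the query above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CANDIDATES ("
    "CAND_NAME TEXT, YEAR INTEGER, PC_NUMBER INTEGER, "
    "POSITION INTEGER, VOTES INTEGER)"
)

# Placeholder rows (not real election results).
rows = [
    ("CANDIDATE A", 2014, 543, 1, 52000),
    ("CANDIDATE B", 2014, 543, 2, 41000),
    ("CANDIDATE C", 2014, 1, 1, 60000),
]
conn.executemany("INSERT INTO CANDIDATES VALUES (?, ?, ?, ?, ?)", rows)

# The query from the slide, with the constituency number passed as a parameter.
query = (
    "SELECT CAND_NAME, YEAR, POSITION, VOTES "
    "FROM CANDIDATES WHERE PC_NUMBER = ? ORDER BY POSITION"
)
result = conn.execute(query, (543,)).fetchall()
for row in result:
    print(row)
conn.close()
```

Passing the value as a parameter instead of pasting it into the SQL string keeps the query reusable for other constituencies.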

Other kinds of databases

Graph based

https://www.flickr.com/photos/caseorganic/4935757995

Spatial databases (efficient at quick spatial queries)

Unstructured databases (e.g. plain text)

Image databases

RDF databases

etc.

Unstructured Data Processing

Key Techniques (NLP/Information extraction):

Parsing: identifying the parts of speech in a (well-formed) sentence

Entity recognition: identifying names of people, places, organizations, etc.

Disambiguation: Which “Ashoka” does Ashoka refer to?

Sentiment analysis: What feeling does a sentence convey about something?

Topic modeling: Splitting a group of documents by topic

Word embeddings: Finding co-occurring or semantically similar words in text

Question answering, e.g. IBM Watson

(Use languages like Python or Java to access these functions)

Open Data

https://data.gov.in/sites/default/files/NDSAP.pdf

NDSAP guidelines

RTI Act, Section 4(2)

“It shall be a constant endeavour of every public authority … to provide as much information suo motu to the public at regular intervals through various means of communications, including internet, so that the public have minimum resort to the use of this Act to obtain information.”

Ministries/Departments will upload at least 5 “high-value” datasets on data.gov.in …

All datasets are to be updated regularly every quarter

Also see NDSAP (2012)

https://data.gov.in/sites/default/files/NDSAP_Implementation_Guidelines-2.1.pdf

5-Star Open Data

http://5stardata.info

RDF files

<http://dbpedia.org/resource/Rajiv_Gandhi> <http://dbpedia.org/property/office> <http://dbpedia.org/resource/Prime_Minister_of_India> .

<http://dbpedia.org/resource/Rajiv_Gandhi> <http://dbpedia.org/property/president> <http://dbpedia.org/resource/R._Venkataraman> .

<http://dbpedia.org/resource/Rajiv_Gandhi> <http://dbpedia.org/property/termStart> "1984-10-31"^^<http://www.w3.org/2001/XMLSchema#date> .

<http://dbpedia.org/resource/Rajiv_Gandhi> <http://dbpedia.org/property/termEnd> "1989-12-02"^^<http://www.w3.org/2001/XMLSchema#date> .

<http://dbpedia.org/resource/Rajiv_Gandhi> <http://dbpedia.org/property/predecessor> <http://dbpedia.org/resource/Indira_Gandhi> .

<http://dbpedia.org/resource/Rajiv_Gandhi> <http://dbpedia.org/property/successor> <http://dbpedia.org/resource/V._P._Singh> .

SPARQL

Query language for RDF databases, e.g.

SELECT ?name

WHERE {

?name <http://dbpedia.org/property/office> <http://dbpedia.org/resource/Prime_Minister_of_India> .

}
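
To show what such a query does, here is a minimal sketch in plain Python that matches a triple pattern against a few of the triples from the RDF slide. It is not a SPARQL engine. The helper `match` is hypothetical, and the URIs are shortened to their local names for readability.

```python
# Triples from the RDF slide, with URIs shortened to local names.
triples = [
    ("Rajiv_Gandhi", "office", "Prime_Minister_of_India"),
    ("Rajiv_Gandhi", "president", "R._Venkataraman"),
    ("Rajiv_Gandhi", "predecessor", "Indira_Gandhi"),
    ("Rajiv_Gandhi", "successor", "V._P._Singh"),
]


def match(pattern, data):
    """Return the triples that agree with `pattern`; None acts as a variable."""
    return [
        t for t in data
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


# Rough equivalent of: SELECT ?name WHERE { ?name office Prime_Minister_of_India . }
names = [s for s, _, _ in match((None, "office", "Prime_Minister_of_India"), triples)]
print(names)
```

The variable `?name` in the query binds to the subject of every matching triple, which is what the list comprehension collects.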

Lok Dhaba

tcpd.ashoka.edu.in

Indian Elections Data

eci.gov.in

ECI Data: Problems

Legacy data mostly in PDF format

No standard data schema

Very hard for researchers to work with

No data quality checks

http://tcpd.ashoka.edu.in

Quality checks

Example issues found on ECI data:
- Same party, seat, year but multiple candidates
- Inconsistent PC Type Information (Gen/SC/ST)
- Multiple parties with the same code
- Inconsistent or missing Sex field
- Uncontested constituencies are completely missing
- Inconsistency of elected winners w.r.t. Lok Sabha records
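
The first issue in the list (same party, seat, and year but multiple candidates) can be detected with a simple grouping. The sketch below assumes a hypothetical row layout of (year, constituency number, party, candidate), and the rows are placeholders, not ECI records.

```python
from collections import Counter

# Placeholder rows: (year, constituency number, party, candidate). Not real data.
records = [
    (2014, 12, "P1", "CANDIDATE A"),
    (2014, 12, "P1", "CANDIDATE B"),  # same party, seat, and year as the row above
    (2014, 12, "P2", "CANDIDATE C"),
    (2014, 13, "P1", "CANDIDATE D"),
]

# Count candidates per (year, seat, party) and flag groups with more than one.
counts = Counter((year, seat, party) for year, seat, party, _ in records)
duplicates = sorted(key for key, n in counts.items() if n > 1)
print(duplicates)
```

In practice, independent candidates would probably need to be excluded from such a check, since several independents can legitimately contest the same seat.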

http://eci.nic.in/eci_main1/ElectionStatistics.aspx

http://164.100.47.194/Loksabha/Members/lokprev.aspx

Incumbency visualization

http://shivangitikekar.com/portfolio/karnataka_viz.html

HD Works

hdworks.org

- Encourages citizen participation (crowdsourcing)
- Provides feedback to city officials
- Increases transparency
- Allows citizens to subscribe to areas of works of interest
- Geo-enabled

To be done: ties to budget spending
To be done: ties to tender documents

What do we know about the functioning of our Parliament? And, more importantly, what concerns do our MPs raise in the House?

Parliamentary Procedure

Governed by Article 118 of the Indian Constitution:

“Each House of Parliament may make rules for regulating, subject to the provisions of this Constitution, its procedure and the conduct of its business”

The Question Hour

- Questions are tools through which parliamentarians ensure administrative accountability to the people.
- Not subject to party whips
- Can demand oral or written answers from Ministers
- An MP can submit a maximum of 10 questions per day; a maximum of 230 questions are admitted

The Parliamentary Secretariat receives far more questions than this, so the questions to be answered are selected by random ballot.

Data Available on Questions

- Ministry to which question is asked
- House, Starred/Unstarred
- Date on which question is tabled
- Title of question
- Members asking question
- Text of Question

Methods to Understand Content

- Ministry data is insufficient: the Secretariat decides which Ministry will take questions, and on how many days, in each session.
- Questions often span more than one theme
- The key is to categorize each question by understanding its text

Step 1: Preparing the Dataset

- Scraped the metadata of questions from the Lok Sabha website using programs in Python and R
- Scraped the text of the question and answer for each question
- Combined this with member data from the ECI data prepared by TCPD (candidate type, gender, religion, constituency, etc.)

Step 2: Extracting Topical Information

- Required a powerful tool to understand the context of each question
- ‘Informative words’ in a question
- Is it sufficient to search for occurrences of ‘Scheduled Caste’ and treat the matching questions as the dataset of all caste questions?

Word Embeddings

- Each word is mapped to a vector of real numbers, which is useful for calculating word similarities
- Word2vec (Mikolov et al., Google, 2013): the representation of a word's meaning is determined by the contexts in which it appears
- Preserves semantic and syntactic relationships and captures the meaning of words

[Figure: result for the word ‘frog’ using GloVe]
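
Word similarity on embeddings is commonly measured with cosine similarity. The sketch below uses toy 3-dimensional vectors invented for illustration; real embeddings are learned from text and have many more dimensions. The helper names `cosine` and `most_similar` are hypothetical.

```python
import math

# Toy vectors, invented for illustration only.
vectors = {
    "frog": [0.9, 0.1, 0.0],
    "toad": [0.8, 0.2, 0.1],
    "budget": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word):
    """Rank all other words by cosine similarity to `word`, most similar first."""
    others = [w for w in vectors if w != word]
    return sorted(others, key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)


print(most_similar("frog"))
```

Words whose vectors point in nearly the same direction score close to 1, and unrelated words score close to 0.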

Method for Extracting Themes

- Train the word2vec algorithm on the questions from the dataset
- Determine word similarity: get ~500 words similar to the anchor word
- For example: get words similar to ‘woman’, ‘minority’, ‘caste’ etc.
- Prepare a curated list of 50-80 words

Anchor word (theme): words (curated using word2vec)

woman: woman, widowed, mothers, ladies, female, creches, maternity, sabla, janani, pregnant, girl

education: educational, learning, elementary, syllabi, teacher, secondary, rte, shiksha, cabe, pupil, literacy, madarsas, aicte, ncert
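
Once a curated word list exists for each theme, questions can be tagged by checking for overlap with those lists. The slides do not say exactly how the lists were applied, so the simple word-overlap rule below is an assumption. The word lists are abbreviated from the table above, and the question text is made up.

```python
import re

# Curated word lists, abbreviated from the table above.
themes = {
    "woman": {"woman", "widowed", "mothers", "ladies", "female",
              "maternity", "pregnant", "girl"},
    "education": {"educational", "learning", "elementary", "teacher",
                  "secondary", "pupil", "literacy"},
}


def tag_themes(question):
    """Return the themes whose curated words appear in the question text."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return sorted(theme for theme, vocab in themes.items() if words & vocab)


# Made-up question text, for illustration only.
example = "Whether maternity benefits are extended to every teacher in secondary schools"
print(tag_themes(example))
```

A question can carry several themes at once, which matches the earlier observation that questions often span more than one theme.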

Method for Extracting Topics

- What if we don’t know the topics that might appear in a corpus?
- Topic modelling: the LDA topic model clusters the words falling under the same topic (for n topics)
- Input is the training data (question text) and the number of topics required; output is as follows:

Topic: scholarships
Word list (generated by the topic model): students scholarship amount scheduled scheme scholarships government coaching post matric post-matric details belonging proposes increase schemes caste fixed obc tribe

Topic: crimes/atrocities
Word list (generated by the topic model): act cases atrocities prevention scheduled courts special sc/st castes protection rights registered pending set state-wise implementation law legal disposal provisions

Working with Data: A few tips

Primary vs. derived data

Data should be broken down into “primary” and “derived” data, e.g.

primary data: gender of a person
derived data: number of women in the dataset

primary data: candidate contesting in an AC
derived data: number of candidates contesting in an AC

- Primary data is assumed to be correct
- Derived data is computed from primary data by an automated process (like a formula)
- If primary data is correct, derived data is correct (if the automated process is correct)
- If primary data changes, derived data can be updated automatically
- Not all primary source data is primary data (often sources will provide redundant or derived data)
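
The two examples above can be sketched as a function that recomputes derived data from the primary rows. The field names (`ac`, `name`, `sex`) and the rows themselves are placeholders chosen for this illustration.

```python
from collections import Counter

# Primary data: one row per candidate. Placeholder values, not real records.
candidates = [
    {"ac": 1, "name": "CANDIDATE A", "sex": "F"},
    {"ac": 1, "name": "CANDIDATE B", "sex": "M"},
    {"ac": 2, "name": "CANDIDATE C", "sex": "F"},
]


def derive(rows):
    """Recompute derived data from the primary rows."""
    return {
        "women": sum(1 for r in rows if r["sex"] == "F"),
        "candidates_per_ac": dict(Counter(r["ac"] for r in rows)),
    }


derived = derive(candidates)
print(derived)

# When primary data changes, the derived values are simply recomputed.
candidates.append({"ac": 2, "name": "CANDIDATE D", "sex": "M"})
updated = derive(candidates)
print(updated)
```

Because the derived values are never edited by hand, they cannot drift out of step with the primary rows.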

Data quality checks

Data should not be blindly trusted

- Think: What could be wrong?
- How can I catch anomalies?
- Write consistency checks on the values in fields, the relationships between fields in a row, etc.
- If derived data is present, check its correctness
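
A minimal sketch of the three kinds of checks listed above: one on a field value, one on a relationship between fields, and one on derived data. The field names, the allowed values for `sex`, and the rows are all assumptions made for this illustration.

```python
# Placeholder rows; field names are assumptions for this sketch.
rows = [
    {"name": "CANDIDATE A", "sex": "F", "votes": 500, "total_votes": 900},
    {"name": "CANDIDATE B", "sex": "X", "votes": 400, "total_votes": 900},
    {"name": "CANDIDATE C", "sex": "M", "votes": 100, "total_votes": 900},
]


def check(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    for r in rows:
        # Check on the value of a single field.
        if r["sex"] not in {"M", "F", "O"}:
            problems.append(f"{r['name']}: unexpected sex value {r['sex']!r}")
        # Check on the relationship between fields in a row.
        if r["votes"] > r["total_votes"]:
            problems.append(f"{r['name']}: votes exceed total votes")
    # Check on derived data: the stored total should equal the recomputed sum.
    recomputed = sum(r["votes"] for r in rows)
    if any(r["total_votes"] != recomputed for r in rows):
        problems.append(f"total_votes does not match recomputed sum {recomputed}")
    return problems


problems = check(rows)
for p in problems:
    print(p)
```

Returning a list of problems, instead of stopping at the first one, makes it easier to review all anomalies in a dataset at once.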

Data corrections

- Sometimes primary data may be wrong/inconsistent
- Corrections in primary data should be carried out through a script
- Script captures intention of update (e.g. change “ASAWARPUR” to “ASAVARPUR”)
- Script can be re-applied automatically if base data changes
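
One possible shape for such a correction script is sketched below. The correction from the slide is stored as data alongside a reason, and a hypothetical helper `apply_corrections` applies it to a copy of the rows. The field name, the row ids, and the reason text are placeholders.

```python
# Each correction records what to change and why; the reason text is a placeholder.
corrections = [
    {"field": "constituency", "old": "ASAWARPUR", "new": "ASAVARPUR",
     "reason": "spelling normalised (placeholder justification)"},
]


def apply_corrections(rows, corrections):
    """Return corrected copies of the rows, leaving the base data untouched."""
    fixed = []
    for row in rows:
        row = dict(row)  # copy, so the original row is not modified
        for c in corrections:
            if row.get(c["field"]) == c["old"]:
                row[c["field"]] = c["new"]
        fixed.append(row)
    return fixed


base = [{"id": 1, "constituency": "ASAWARPUR"}, {"id": 2, "constituency": "OTHER"}]
cleaned = apply_corrections(base, corrections)
print(cleaned)
```

Since the base data is left untouched, the same corrections can be re-run whenever a new version of the base data arrives.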

Data versioning

- Store all versions of data
- Tools like git can store many versions efficiently (use simple formats like CSV)
- Allows rollback if anything changes by mistake
- Allows branching and multiple people working on different sections of the data
- Git support is built into RStudio

Data provenance

- Source of data needs to be properly documented (and acknowledged)
- All updates should be controlled and documented carefully
- If you make a data correction, have it reviewed by someone else, and write a justification
- When providing data-based analysis, also offer primary data for reproducibility (many wrong conclusions have been made due to errors in spreadsheets!)

Schema

- Building a long-lived schema can be quite hard
- Assumptions change, new variables come in, etc.
- Think of the types of each column:
  - What are the allowed values?
  - What are the allowed ranges?
- What field/combination of fields uniquely identifies a row (i.e., is a “key”)?
  e.g. <Const#, Year of election> is not sufficient: the same year could have multiple (by)polls in the same const. Const. # also changes across delimitations
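
Whether a combination of fields really is a key can be tested directly on the data. In the sketch below the rows are invented, and `poll_no` is a hypothetical extra field that distinguishes a by-poll from the general election in the same constituency and year.

```python
from collections import Counter

# Placeholder rows: a general election and a by-poll in the same
# constituency and year (values invented for illustration).
rows = [
    {"const_no": 7, "year": 2014, "poll_no": 0, "winner": "CANDIDATE A"},
    {"const_no": 7, "year": 2014, "poll_no": 1, "winner": "CANDIDATE B"},
    {"const_no": 8, "year": 2014, "poll_no": 0, "winner": "CANDIDATE C"},
]


def is_key(rows, fields):
    """True if the combination of `fields` identifies every row uniquely."""
    counts = Counter(tuple(r[f] for f in fields) for r in rows)
    return all(n == 1 for n in counts.values())


print(is_key(rows, ["const_no", "year"]))             # the by-poll breaks uniqueness
print(is_key(rows, ["const_no", "year", "poll_no"]))  # the extra field restores it
```

Handling changes of constituency number across delimitations would need a further field, which this sketch leaves out.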

Types of variables

Nominal (a.k.a. categorical): discrete set of values (e.g. candidate caste, sex)

Ordinal: numeric, with ordering only (e.g. rank)

Quantitative: numeric, with math operations (e.g. number of votes)

Think carefully about which type of variable each column is
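
The variable type determines which summaries are meaningful. A small sketch with placeholder columns: counting for a nominal variable, the median for an ordinal one, and the mean for a quantitative one.

```python
from collections import Counter
from statistics import mean, median

# Placeholder values for one column of each type.
sex = ["F", "M", "M", "F", "M"]    # nominal: only counting makes sense
rank = [1, 2, 3, 4, 5]             # ordinal: ordering and median make sense
votes = [500, 300, 120, 60, 20]    # quantitative: arithmetic makes sense

most_common_sex = Counter(sex).most_common(1)[0][0]
median_rank = median(rank)
mean_votes = mean(votes)

print(most_common_sex, median_rank, mean_votes)
```

Taking the mean of a nominal column, for example of numeric party codes, would run without error but produce a meaningless number.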

http://eci.nic.in/eci_main1/ElectionStatistics.aspx