WebScraping Notices of Insolvency Proceedings with R

wissen.nutzen. WebScraping Notices of Insolvency Proceedings with R Using publicly available data to enhance survey response quality uRos2019, 20./21. May 2019, București © Federal Statistical Office of Germany (Destatis)

Transcript of WebScraping Notices of Insolvency Proceedings with R

Page 1:

wissen.nutzen.

WebScraping Notices of Insolvency Proceedings with R

Using publicly available data to enhance survey response quality

uRos2019, 20./21. May 2019, București

© Federal Statistical Office of Germany (Destatis)

Page 2:

Insolvency Proceedings and Official Statistics

Debtor

Insolvency Court

Insolvency Administrator

Official Statistics

Icons made by Eucalyp, Kiranshastry and Freepik from www.flaticon.com are licensed by CC 3.0 BY

Page 3:

Insolvency Proceedings and Official Statistics

Opening of Insolvency Proceedings

Closing of Insolvency Proceedings

Residual Debt Discharge Granted or not Granted

Residual Debt Discharge Revoked

Page 4:

Problem: Two Respondents on One Case

Page 5:

A Further Source: Publication Obligations of Insolvency Courts

Page 6:

www.insolvenzbekanntmachungen.de

Page 7:

Unrestricted Search

An unrestricted search ("- All insolvency courts -") of the announcements is possible, according to § 2 of the regulation on public announcements in insolvency proceedings on the Internet, only within two weeks after the first day of publication. After this period expires, only a detailed search is permitted.

Page 8:

Unrestricted Search

To obtain a complete picture of the opened proceedings, the search has to be run at least every two weeks.

Cron Job
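The two-week rule can be checked with a few lines of R. The sketch below is only an illustration, not the script used at Destatis: the function names, the start date and the inclusive 14-day boundary are assumptions. It tests whether every publication day is still inside the unrestricted-search window when the next scheduled run takes place.

```r
# Assumption: a notice stays in the unrestricted search for 14 days
# after its first day of publication (boundary treated as inclusive).
window_days <- 14

# TRUE where a notice first published on `first_published` is still
# visible in the unrestricted search on the date(s) given in `today`.
still_searchable <- function(first_published, today) {
  age <- as.numeric(today - first_published)
  age >= 0 & age <= window_days
}

# Check whether runs spaced `interval_days` apart see every publication day
# between the first and the last run (hypothetical start date).
all_days_covered <- function(interval_days, n_runs = 4,
                             start = as.Date("2019-04-01")) {
  runs <- start + interval_days * (seq_len(n_runs) - 1)
  days <- seq(min(runs), max(runs), by = "day")
  covered <- vapply(
    seq_along(days),
    function(i) any(still_searchable(days[i], today = runs)),
    logical(1)
  )
  all(covered)
}

all_days_covered(14)  # runs every two weeks
all_days_covered(21)  # runs every three weeks leave gaps
```

Under these assumptions a 14-day schedule covers every publication day, while a 21-day schedule misses notices published shortly after a run.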

Page 9:

Data Collection

Objective: Complete list of identifiers (case number, court) of opened proceedings, as well as of proceedings with statistically relevant decisions.

Approach: An R script that, within the two-week period, searches the notice texts in the unrestricted search for certain keywords and records the corresponding case and court identifiers.
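A minimal base-R sketch of this keyword search follows. The notice texts, the keyword and the identifier patterns are hypothetical; the real wording and layout on the portal may differ, and the original script is not shown in the slides.

```r
# Hypothetical notice texts; the real wording on the portal may differ.
notices <- c(
  "Amtsgericht Wiesbaden, Aktenzeichen: 10 IN 123/19. Das Insolvenzverfahren wird er\u00f6ffnet.",
  "Amtsgericht Mainz, Aktenzeichen: 5 IK 77/18. Termin zur Gl\u00e4ubigerversammlung.",
  "Amtsgericht Kassel, Aktenzeichen: 3 IN 9/19. Das Insolvenzverfahren wird er\u00f6ffnet."
)

# Return the first match of `pattern` in each element of `x`, or NA if none.
extract_first <- function(pattern, x) {
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

# Assumed keyword for an opening and assumed identifier formats.
opening_keyword <- "Insolvenzverfahren wird er\u00f6ffnet"
court_pattern   <- "Amtsgericht [^,]+"
case_pattern    <- "[0-9]+ (IN|IK|IE) [0-9]+/[0-9]{2}"

# Keep only notices that mention an opening, then note court and case number.
hits <- notices[grepl(opening_keyword, notices)]
openings <- data.frame(
  court = extract_first(court_pattern, hits),
  case_number = extract_first(case_pattern, hits),
  stringsAsFactors = FALSE
)
print(openings)
```

Only the identifiers leave this step; the notice texts themselves are not part of the result.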

Page 10:

Data Collection

First collection in March 2016. Since then, we have had a full picture of openings, plus monthly lists of relevant proceedings for which notifications should be received.

Currently about 3.7 million decisions have been analysed …

Page 11:

[Figure: Number of Court Publications by month, April to April, with year labels 2017, 2018 and 2019; y-axis from 0 to 120,000]

Page 12:

[Figure: Number of Opened Insolvency Proceedings by month, April 2016 to April 2019; y-axis from 0 to 12,000. Legend: Difference, Court Publications, Official Statistics]

Page 14:

What to do with the Data?

Main Goal:

Use data to get information about court decisions that should trigger a report by the insolvency administrators …

… so we can remind them that a report is due.

Page 15:

What to do with the Data?

Identification of relevant court decisions is done by applying a set of regular expressions to the full text of the decisions.

Resulting lists are reasonably short and can be matched with incoming reports.

Staff can use the information to contact insolvency administrators efficiently.
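The classification and matching steps described above could look roughly like this in base R. The patterns, type names and sample data are hypothetical stand-ins, not the set of regular expressions actually used.

```r
# Hypothetical patterns, one per statistically relevant decision type.
patterns <- c(
  opening    = "Insolvenzverfahren wird er\u00f6ffnet",
  closing    = "Insolvenzverfahren wird aufgehoben",
  discharge  = "Restschuldbefreiung wird (erteilt|versagt)",
  revocation = "Restschuldbefreiung wird widerrufen"
)

# Name of the first pattern that matches `text`, or NA if none matches.
classify <- function(text) {
  hits <- vapply(patterns, function(p) grepl(p, text), logical(1))
  if (any(hits)) names(hits)[which(hits)[1]] else NA_character_
}

# Hypothetical scraped decisions (texts shortened for the example).
decisions <- data.frame(
  court = c("Amtsgericht Wiesbaden", "Amtsgericht Mainz", "Amtsgericht Kassel"),
  case_number = c("10 IN 123/19", "5 IK 77/18", "3 IN 9/19"),
  text = c("Das Insolvenzverfahren wird er\u00f6ffnet.",
           "Die Restschuldbefreiung wird erteilt.",
           "Termin zur Gl\u00e4ubigerversammlung."),
  stringsAsFactors = FALSE
)
decisions$type <- vapply(decisions$text, classify, character(1),
                         USE.NAMES = FALSE)

# Decisions that should trigger a report by the insolvency administrator.
expected <- decisions[!is.na(decisions$type),
                      c("court", "case_number", "type")]

# Hypothetical reports already received by the statistical office.
received <- data.frame(
  court = "Amtsgericht Wiesbaden",
  case_number = "10 IN 123/19",
  type = "opening",
  stringsAsFactors = FALSE
)

# Expected reports without a matching incoming report: candidates for a reminder.
key <- function(d) paste(d$court, d$case_number, d$type, sep = "|")
to_remind <- expected[!(key(expected) %in% key(received)), ]
print(to_remind)
```

In this example only the discharge decision from Amtsgericht Mainz remains, which is the kind of short list staff could work through.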

Page 16:

What We Cannot Do

Privacy matters.

Court decisions contain personal data of debtors and insolvency administrators. Therefore, the full texts of court decisions are not stored.

Only case IDs and the type and time of each decision are stored.
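As an illustration of this storage rule, the sketch below reduces a scraped notice to the fields that are kept. The field names are hypothetical, and using the publication date as the time of the decision is an assumption made only for this example.

```r
# Hypothetical scraped notice, held in memory during one run only.
notice <- list(
  court = "Amtsgericht Wiesbaden",
  case_number = "10 IN 123/19",
  published = as.Date("2019-05-02"),
  text = "Das Insolvenzverfahren wird er\u00f6ffnet. Schuldner: ..."
)

# Build the record that is stored: identifiers, decision type and date.
# The full text is deliberately not copied into the record.
to_record <- function(notice, type) {
  data.frame(
    court = notice$court,
    case_number = notice$case_number,
    type = type,
    decided_on = notice$published,  # assumption: publication date as time
    stringsAsFactors = FALSE
  )
}

record <- to_record(notice, type = "opening")
rm(notice)  # drop the in-memory copy that still contains the full text
names(record)
```

Keeping the reduction in one small function makes it easy to see which fields can reach storage.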

Emojis by Noto Color Emoji

Page 17:

Pitfalls

Q: What if one day the service is relaunched?

A: A legal basis for direct data access needs to be formulated.

Picture by Albert Uderzo

Page 18:

Possible Improvements

Keyword search is currently done via regular expressions. Are there any smarter ways to analyze text?

Since we have collected a training set, could it be done with supervised learning approaches?

Page 19:

Contact

Joerg Feuerhake

Tel.: 0049 611 75 4116
