Structuring and restructuring OIM and Machine Learning
• Mark Goodhand
• Head of Research, CoreFiling, UK
• Chairman of XII Base Spec WG
Structuring: Generating new insights about the world
Restructuring: Recovering insights previously known
Restructuring
Further reading: Wired: How the CIA Used a Fake Sci-Fi Flick to Rescue Americans From Tehran
…
Restructuring
• Word
• Excel
• Custom XML
• Custom JSON
Restructuring
Further reading:
• H2G2
• Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Data Wrangling
• “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”
- Source: New York Times
Structuring
• “This potato has gone bad”
• “This company is about to go bust”
• “There is something unusual about this transaction”
• “You are about to be eaten by a tiger”
Structuring and restructuring through Machine Learning
• Speech recognition
• Optical character recognition
• Natural language processing
• Landmark detection
• …

• Classification
• Prediction
• Clustering
• Outlier detection
• Inference engines
Deep Learning
“Instead of programming a computer, you teach a computer to learn something and it does what you want.” - Eric Schmidt
Source: Infuse Your Business with Machine Learning
Automation is a boon to humanity …
• Millions of man hours saved
• Humans saved from boring, error-prone tasks
• Barriers to communication and collaboration are reduced
… but it has its limits
• Heuristics can fail
• Deep Learning models can fail too
• Human judgement is sometimes required
Airbnb & Facebook
• New listings & comments all the time
• Translation improves interactions
• Translations don’t need to be perfect
Company accounts
• Investors and regulators demand accurate figures
• Financial and legal penalties for bad data
• Human review and sign-off is essential
Form-based websites
• Relatively simple
• Slowly changing
• Can be model-driven
• Better translations by humans may be appropriate
Once you have structure, don’t lose it!
• Use open, internationally-recognised standards
• Share the information as freely as possible
• Preserve metadata
• Prefer text
Who, EventType, Date
mrg, DeathByTiger, 2018-02-01
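As a minimal sketch of why plain text keeps structure accessible, the sample record above can be read back into named fields with Python's standard `csv` module. This assumes the sample is a header row followed by one data row; the `read_events` helper is a hypothetical name used only for this illustration.

```python
import csv
import io

# The sample record from the slide, assumed to be a header row plus one data row.
SAMPLE = "Who, EventType, Date\nmrg, DeathByTiger, 2018-02-01\n"


def read_events(text):
    """Parse CSV text into a list of dicts keyed by the header names."""
    # skipinitialspace drops the blank that follows each comma in the sample.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return list(reader)


events = read_events(SAMPLE)
print(events[0]["Who"], events[0]["EventType"], events[0]["Date"])
```

Every field stays addressable by name, so nothing has to be re-inferred from layout later.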
Enjoy the power and beauty of your youth XBRL OIM
XBRL – The Good Parts
“In XBRL, there is a beautiful, elegant, highly expressive language that is buried under a steaming pile of good intentions and blunders. The best nature of XBRL is so effectively hidden that for many years the prevailing opinion of XBRL was that it was an unsightly, incompetent toy. Our intention with OIM is to expose the goodness in XBRL.”
The Good Parts
• Extensible dimensional model
• Strong types & validation
• Standards-based rendering
• Multi-language support
• Model-centric applications
The Bad Parts
• Complex typed domains
• Segment/scenario
• contextRef & unitRef
• Custom attributes
• XLink
• Tuples
OIM to the rescue
• Cut out the cruft
• Focus on the semantics
• Improve consistency
• Improve efficiency
OIM JSON
• Clearest, simplest expression of the model
• Streaming-friendly (mostly)
• Designed for easy access to relevant information
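To illustrate the "easy access" point, here is a minimal Python sketch. The document layout (a `facts` object whose entries carry a `value` and a `dimensions` map), the concept names `ex:Revenue` and `ex:Profit`, and the `facts_for_concept` helper are all assumptions invented for this example; this is not the normative OIM JSON syntax.

```python
import json

# Hypothetical, simplified report: each fact is a value plus its dimensions.
# The layout is an assumption for illustration, not the normative OIM JSON syntax.
REPORT = """
{
  "facts": {
    "f1": {"value": "1200", "dimensions": {"concept": "ex:Revenue", "period": "2017"}},
    "f2": {"value": "300", "dimensions": {"concept": "ex:Profit", "period": "2017"}},
    "f3": {"value": "1500", "dimensions": {"concept": "ex:Revenue", "period": "2018"}}
  }
}
"""


def facts_for_concept(report, concept):
    """Return {period: value} for every fact reporting the given concept."""
    return {
        fact["dimensions"]["period"]: fact["value"]
        for fact in report["facts"].values()
        if fact["dimensions"]["concept"] == concept
    }


report = json.loads(REPORT)
print(facts_for_concept(report, "ex:Revenue"))
```

Because each fact carries its own dimensions, a consumer can filter on them directly without resolving separate context or unit references.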
OIM CSV
• Efficient for large volumes of data
• Built on the W3C’s CSVW (Tabular Metadata) spec
• Debate over how far we go to cope with variation in input
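The idea of pairing a compact CSV file with a separate metadata description can be sketched roughly as follows. The `tableSchema`, `columns`, `name`, and `datatype` keys follow the CSVW vocabulary, but the sample data, the `CONVERTERS` table, and the `load_typed_rows` helper are hypothetical, and the sketch matches headers against `name` for brevity where a real CSVW processor would do considerably more. It is not the OIM CSV format itself.

```python
import csv
import io
import json
from datetime import date

# Metadata in the spirit of CSVW: column names and datatypes live outside the CSV.
METADATA = json.loads("""
{
  "tableSchema": {
    "columns": [
      {"name": "entity", "datatype": "string"},
      {"name": "period_end", "datatype": "date"},
      {"name": "revenue", "datatype": "integer"}
    ]
  }
}
""")

# Made-up sample data for illustration only.
CSV_TEXT = "entity,period_end,revenue\nACME,2017-12-31,1200\nACME,2018-12-31,1500\n"

# Hypothetical converter table covering only the datatypes used above.
CONVERTERS = {"string": str, "date": date.fromisoformat, "integer": int}


def load_typed_rows(csv_text, metadata):
    """Read CSV rows and convert each cell using the datatype in the metadata."""
    columns = metadata["tableSchema"]["columns"]
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    expected = [column["name"] for column in columns]
    if header != expected:
        raise ValueError(f"header {header} does not match metadata {expected}")
    return [
        {
            column["name"]: CONVERTERS[column["datatype"]](cell)
            for column, cell in zip(columns, row)
        }
        for row in reader
    ]


rows = load_typed_rows(CSV_TEXT, METADATA)
print(rows[1]["revenue"] - rows[0]["revenue"])
```

The strict header check is one possible answer to the "variation in input" question: reject anything that does not match the metadata rather than guess.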
The world is full of Dimensional Data
• Who
• What
• Where
• When
• … Why?
Expressed in a needless variety of formats
Share and enjoy