Structure Interpretation of Text Formats (SPLASH 2020 - OOPSLA)

Sun 15 - Sat 21 November 2020 Online Conference

Who

Sumit Gulwani, Vu Le, Arjun Radhakrishna, Ivan Radiček, Mohammad Raza

Track

SPLASH 2020 OOPSLA

Time Zone

The program is currently displayed in (GMT-06:00) Central Time (US & Canada).

When

Wed 18 Nov 2020 15:40 - 16:00 at SPLASH-I - W-5 Chair(s): Dan Barowy, Mohsen Lesani
Thu 19 Nov 2020 03:40 - 04:00 at SPLASH-I - W-5 Chair(s): Filip Křikava, Nengkun Yu

Abstract

Data repositories often consist of text files in a wide variety of standard formats, ad-hoc formats, as well as mixtures of formats where data in one format is embedded into a different format. It is therefore a significant challenge to parse these files into a structured tabular form, which is important to enable any downstream data processing.

We present Unravel, an extensible framework for structure interpretation of ad-hoc formats. Unravel can automatically, with no user input, extract tabular data from a diverse range of standard, ad-hoc and mixed format files. The framework is also easily extensible to add support for previously unseen formats, and also supports interactivity from the user in terms of examples to guide the system when specialized data extraction is desired. Our key insight is to allow arbitrary combination of extraction and parsing techniques through a concept called partial structures. Partial structures act as a common language through which the file structure can be shared and refined by different techniques. This makes Unravel more powerful than applying the individual techniques in parallel or sequentially. Further, with this rule-based extensible approach, we introduce the novel notion of re-interpretation where the variety of techniques supported by our system can be exploited to improve accuracy while optimizing for particular quality measures or restricted environments. On our benchmark of 617 text files gathered from a variety of sources, Unravel is able to extract the intended table in many more cases compared to state-of-the-art techniques.
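The idea of techniques sharing and refining a common structure can be illustrated with a toy sketch. This is not Unravel's implementation or API: the `Partial` class, the three rules, and the `refine` and `to_table` helpers are all hypothetical names invented for illustration, and the formats handled (delimited lines whose cells may hold `key=value` pairs) are an assumed example of a mixed format.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Partial:
    """Hypothetical partial structure: a span of text plus what is known about it."""
    text: str
    kind: str = "unknown"  # "unknown", "lines", "row", "cell", "pair", "key", "value"
    children: List["Partial"] = field(default_factory=list)


# A rule inspects a partial structure and returns a refined one, or None if it does not apply.
Rule = Callable[[Partial], Optional[Partial]]


def split_lines(p: Partial) -> Optional[Partial]:
    """Refine an unknown multi-line region into one unknown child per non-empty line."""
    if p.kind != "unknown" or "\n" not in p.text:
        return None
    lines = [Partial(line) for line in p.text.splitlines() if line.strip()]
    return Partial(p.text, "lines", lines)


def split_delimited(p: Partial) -> Optional[Partial]:
    """Refine an unknown region into a row of cells using the first delimiter found."""
    if p.kind != "unknown":
        return None
    for delim in (",", ";", "\t"):
        if delim in p.text:
            cells = [Partial(c.strip(), "cell") for c in p.text.split(delim)]
            return Partial(p.text, "row", cells)
    return None


def split_key_value(p: Partial) -> Optional[Partial]:
    """Refine a cell of the form key=value into a key child and a value child."""
    if p.kind != "cell" or "=" not in p.text:
        return None
    key, _, value = p.text.partition("=")
    return Partial(p.text, "pair", [Partial(key.strip(), "key"), Partial(value.strip(), "value")])


RULES: List[Rule] = [split_lines, split_delimited, split_key_value]


def refine(p: Partial, rules: List[Rule]) -> Partial:
    """Apply the first matching rule to this node, then refine its children recursively."""
    for rule in rules:
        refined = rule(p)
        if refined is not None:
            p = refined
            break
    p.children = [refine(child, rules) for child in p.children]
    return p


def to_table(p: Partial) -> List[List[str]]:
    """Collect every row in the structure; a pair cell contributes its value."""
    if p.kind == "row":
        return [[c.children[1].text if c.kind == "pair" else c.text for c in p.children]]
    rows: List[List[str]] = []
    for child in p.children:
        rows.extend(to_table(child))
    return rows


root = refine(Partial("id=1, name=ada\nid=2, name=bob\n"), RULES)
print(to_table(root))
```

In this sketch the `key=value` rule only fires on cells that the delimiter rule produced, which itself only fires on lines that the line-splitting rule produced. That dependency is the point of the illustration: no rule alone, and no fixed ordering of whole-file parsers, recovers the table, but rules that refine a shared structure do.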

Link to Publication

https://dl.acm.org/doi/pdf/10.1145/3428280

DOI

https://doi.org/10.1145/3428280

Sumit Gulwani

Microsoft

Vu Le

Microsoft

Arjun Radhakrishna

Microsoft

Ivan Radiček

Microsoft

Mohammad Raza

Microsoft

Session Program

Wed 18 Nov
Displayed time zone: Central Time (US & Canada)

15:00 - 16:20	W-5 (OOPSLA) at SPLASH-I +12h. Chair(s): Dan Barowy (Williams College), Mohsen Lesani (University of California at Riverside, USA)

15:00 20m Talk	A Model for Detecting Faults in Build Specifications. Thodoris Sotiropoulos (Athens University of Economics and Business), Stefanos Chaliasos (Athens University of Economics and Business), Dimitris Mitropoulos (Athens University of Economics and Business), Diomidis Spinellis (Athens University of Economics and Business)
15:20 20m Talk	Persistent Owicki-Gries Reasoning: A Program Logic for Reasoning about Persistent Programs on Intel-x86. Azalea Raad (Imperial College London), Ori Lahav (Tel Aviv University), Viktor Vafeiadis (MPI-SWS)
15:40 20m Talk	Structure Interpretation of Text Formats. Sumit Gulwani (Microsoft), Vu Le (Microsoft), Arjun Radhakrishna (Microsoft), Ivan Radiček (Microsoft), Mohammad Raza (Microsoft)
16:00 20m Talk	Statically Verified Refinements for Multiparty Protocols. Fangyi Zhou (Imperial College London), Francisco Ferreira (Imperial College London), Raymond Hu (University of Hertfordshire), Rumyana Neykova (Brunel University London), Nobuko Yoshida (Imperial College London)

Thu 19 Nov
Displayed time zone: Central Time (US & Canada)

03:00 - 04:20	W-5 (OOPSLA) at SPLASH-I. Chair(s): Filip Křikava (Czech Technical University), Nengkun Yu (University of Technology Sydney)

03:00 20m Talk	A Model for Detecting Faults in Build Specifications. Thodoris Sotiropoulos (Athens University of Economics and Business), Stefanos Chaliasos (Athens University of Economics and Business), Dimitris Mitropoulos (Athens University of Economics and Business), Diomidis Spinellis (Athens University of Economics and Business)
03:20 20m Talk	Persistent Owicki-Gries Reasoning: A Program Logic for Reasoning about Persistent Programs on Intel-x86. Azalea Raad (Imperial College London), Ori Lahav (Tel Aviv University), Viktor Vafeiadis (MPI-SWS)
03:40 20m Talk	Structure Interpretation of Text Formats. Sumit Gulwani (Microsoft), Vu Le (Microsoft), Arjun Radhakrishna (Microsoft), Ivan Radiček (Microsoft), Mohammad Raza (Microsoft)
04:00 20m Talk	Statically Verified Refinements for Multiparty Protocols. Fangyi Zhou (Imperial College London), Francisco Ferreira (Imperial College London), Raymond Hu (University of Hertfordshire), Rumyana Neykova (Brunel University London), Nobuko Yoshida (Imperial College London)