FinSBD-2 Shared Task 2020 : Sentence Boundary Detection in PDF Noisy Text in the Financial Domain

Yokohama, Japan
Event Date: Mar 13, 2020 - May 08, 2020
Submission Deadline: May 15, 2020


Sentences are basic units of the written language. Detecting the beginning and end of sentences, or sentence boundary detection (SBD), is the foundational first step in many Natural Language Processing (NLP) applications such as POS tagging; syntactic, semantic, and discourse parsing; information extraction; or machine translation.

Despite its important role in NLP, sentence boundary detection has so far not received enough attention. Previous research in the area has been confined to formal texts (news, European Parliament proceedings, etc.), where existing rule-based and machine learning approaches are extremely accurate as long as the data is perfectly clean. No sentence boundary detection research to date has addressed the problem in noisy text extracted automatically from machine-readable files (generally in PDF format), such as financial documents.

One type of financial document is the prospectus. Financial prospectuses are official PDF documents in which investment funds precisely describe their characteristics and investment modalities. The most important step in extracting any information from these files is to parse them into noisy unstructured text, clean that text, format the information (by adding several tags), and finally transform it into semi-structured text in which sentence and list boundaries are clearly marked.

These prospectuses also contain many visual demarcations, including bullets and numbering, that indicate a hierarchy of sections. They contain many sentence fragments and titles, not just complete sentences, and more often than not they contain punctuation errors. Lists are often used to present the dense information in a more readable format.
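
To illustrate why these characteristics are a problem, here is a minimal sketch of a naive punctuation-based splitter. It is not a system from the shared task: the function name `naive_split`, the splitting rule, and both sample texts are invented for illustration. The rule breaks after `.`, `!` or `?` when the next token starts with a capital letter, which works on a clean sentence pair but finds no boundary at all in a prospectus-like fragment made of a title, a list, and an unpunctuated sentence.

```python
import re

# Hypothetical naive rule: a boundary is whitespace that follows ., ! or ?
# and precedes an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def naive_split(text: str) -> list[str]:
    """Split text at sentence-final punctuation followed by a capital letter."""
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]


# Invented clean text: well-formed sentences with final punctuation.
clean = "The fund invests in equities. It may also hold bonds."

# Invented noisy text: a title, a list, and a sentence with no final period,
# separated only by line breaks as they might be after PDF extraction.
noisy = (
    "Investment Objective\n"
    "The fund may invest in:\n"
    "a) equities\n"
    "b) bonds\n"
    "Fees are charged annually"
)

print(naive_split(clean))       # two segments, one per sentence
print(len(naive_split(noisy)))  # a single segment: no boundary was found
```

On the noisy input the title, the list items, and the final sentence all end up in one segment, because none of them ends with the punctuation the rule relies on.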

Call for Papers

We invite submissions of research papers on all topics related to NLP for Financial Technology (FinTech) applications. One of the goals of this workshop is to foster collaboration between researchers and developers in computational linguistics and those in finance and economics. Original studies reporting joint work are therefore especially encouraged. Topics of interest include, but are not limited to:

  • Text-based Market Provisioning
  • NLP-based Investment Management
  • Crowdfunding Analysis with Text Data
  • Text-oriented Customer Preference Analysis
  • Insurance Application with Textual Information
  • NLP-based Know Your Customer (KYC) Approach
  • Applications or Systems for FinTech with NLP Methods

