
[GSoC] Week 7/8: Starting Neural Extraction

Published at 05:00 AM

In the past few weeks, I have contributed to DBpedia's extraction framework, which relies on infoboxes. But since infoboxes on Hindi Wikipedia carry very little information, or none at all (see below for a comparison), neural extraction becomes very important. So in the seventh and eighth weeks (8-19th July) of the GSoC coding period, my main aim was to start experimenting with a neural extraction pipeline for Hindi Wikipedia.


Pipeline overview

The expected pipeline for processing Hindi Wikipedia pages into structured data involves several key steps, each leveraging different tools and models. Here’s a brief explanation of the process:

  1. Hindi Wikipedia Pages to Plain Text Files: The pipeline begins by converting Hindi Wikipedia pages into plain text files, which serve as the raw data for further processing.

  2. Tokenization, POS Tagging, and NER:

    • Tokenization: The sentences from the text files are tokenized using the Stanza library, breaking down the text into individual words or tokens.
    • POS Tagging: The tokenized sentences are then POS-tagged using Stanza to identify the grammatical parts of speech (e.g., nouns, verbs) for each token.
    • NER Tagging: Named Entity Recognition (NER) is performed using the IndicNER model to identify and classify named entities (e.g., people, locations, organizations) within the text.
  3. Mention Detection: Sentences with mentions (i.e., entities that need to be tracked across the text) are identified by combining the results from POS tagging and tokenization in a rule-based manner (a minimal sketch of steps 2-3 follows this list).

  4. Coreference Resolution: Coreference resolution is applied using the TransMuCoRes model to link mentions of the same entity across different parts of the text, creating a unified representation of each entity.

  5. Triple Extraction (Subject, Relation, Object): From the coreference-resolved sentences, subject, relation, and object triples are extracted using mREBEL, IndIE, or LLM-augmented models. These triples represent the relationships between entities within the text.

  6. Entity Linking and Ontology Mapping:

    • The extracted subjects and objects are linked to their corresponding Wikidata entities using the mGENRE model (see the entity-linking sketch after the overview).
    • The relations are mapped to the DBpedia ontology, which standardizes the relationships within the knowledge graph.
  7. Final Integration: The linked entities and mapped relations are integrated into a coherent structure, creating a rich, interconnected dataset that can be used for various applications.
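
To make steps 2-3 concrete, here is a minimal sketch of how these pieces can be wired together. It assumes Stanza's Hindi ("hi") package and the ai4bharat/IndicNER checkpoint on Hugging Face, and it uses a toy proper-noun/pronoun rule as a stand-in for the actual mention-detection heuristics, so treat the model ids and the rule as placeholders rather than the final configuration:

```python
# Sketch of steps 2-3: tokenization + POS with Stanza, NER with IndicNER,
# and a toy rule-based mention check. Model ids are assumptions.
import stanza
from transformers import pipeline

# Download and build a Hindi pipeline for tokenization and POS tagging.
stanza.download("hi")
nlp = stanza.Pipeline("hi", processors="tokenize,pos")

# Token-classification pipeline for NER; aggregation_strategy merges subword pieces.
ner = pipeline(
    "token-classification",
    model="ai4bharat/IndicNER",
    aggregation_strategy="simple",
)

text = "महात्मा गांधी का जन्म पोरबंदर में हुआ था।"  # "Mahatma Gandhi was born in Porbandar."

doc = nlp(text)
for sentence in doc.sentences:
    tokens = [(word.text, word.upos) for word in sentence.words]
    entities = ner(sentence.text)

    # Toy mention-detection rule: keep the sentence if it contains a proper
    # noun or pronoun, or if the NER model found any entity in it.
    has_mention = any(upos in {"PROPN", "PRON"} for _, upos in tokens) or bool(entities)

    print(tokens)
    print([(e["word"], e["entity_group"]) for e in entities])
    print("mention sentence:", has_mention)
```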

This pipeline converts unstructured text from Hindi Wikipedia into structured knowledge, ready for inclusion in the DBpedia knowledge graph (KG).
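
To illustrate step 6, here is a rough entity-linking sketch with mGENRE, assuming the facebook/mgenre-wl checkpoint on Hugging Face (the checkpoint id and the example sentence are my assumptions). The mention to be linked is wrapped in [START] ... [END] markers, and the model generates language-qualified page titles that can then be resolved to Wikidata and DBpedia entries:

```python
# Sketch of step 6: entity linking with mGENRE.
# "facebook/mgenre-wl" is an assumed checkpoint, not necessarily the project's final choice.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/mgenre-wl")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wl").eval()

# The mention to link is wrapped in [START] ... [END].
sentence = "[START] महात्मा गांधी [END] का जन्म पोरबंदर में हुआ था।"

inputs = tokenizer([sentence], return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,
    num_return_sequences=5,
)

# Each candidate is a language-qualified page title, e.g. "Mahatma Gandhi >> en",
# which can then be resolved to a Wikidata QID and a DBpedia resource.
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

mGENRE is normally run with constrained beam search over a prefix trie of valid titles; the unconstrained beam search above is just enough to sanity-check the model on Hindi mentions.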

Project Update

Over these two weeks, I focused on developing some components of the pipeline:

Challenges/Solutions

During the implementation, I encountered a few hurdles; here is how I plan to address them:

Next Steps

Looking ahead, I have some clear action items that I’ll be focusing on:

The upcoming weeks will be crucial as I dive into these LLM-based approaches, and I’m excited about the potential advancements they could bring to the project.