Evolving Open Information Extraction for Portuguese employing Language Models

Student's full name

 

Bruno Souza Cabral

 

Title of the work

 

Evolving Open Information Extraction for Portuguese employing Language Models

 

Abstract

 

Open Information Extraction (OpenIE) is an important task in Computer Science focused on extracting structured information from text, typically as (argument 1, relation,
argument 2) triples, without requiring predefined target relations. OpenIE aims to extract valuable information for uses such as enhancing language understanding, populating knowledge bases, and text comprehension. The extraction of OpenIE relations from
Portuguese text presents substantial challenges, primarily due to its rich morphology,
frequent use of clitic pronouns, flexible word order, inflected nature, and other linguistic
peculiarities.
Deep Learning has significantly advanced OpenIE for the English language, with
sequence labeling being a common approach. Recently, a new approach, Generative Information Extraction, particularly leveraging generative Large Language Models (LLMs),
has emerged as a fruitful alternative. Generative techniques can take a sentence as input
and generate structured semantic representations.
Despite numerous OpenIE studies focusing on English, research on OpenIE for the
Portuguese language, particularly employing Deep Learning methods, remains limited.
Existing work often relies on datasets automatically translated from English. Moreover,
most Deep Learning approaches for OpenIE in Portuguese have adopted a multilingual
perspective, treating it as just one language among many in training datasets, thereby
often neglecting its unique linguistic characteristics.
This thesis investigates neural methods for the automated extraction of OpenIE relations from Portuguese texts. A core contribution is the development and curation of
diverse Portuguese OpenIE datasets to address data scarcity and enable robust evaluation. These include both manually annotated corpora and novel corpora generated using
LLMs. The study involves developing and evaluating a sequence labeling model and
assessing the performance of generative LLMs on these Portuguese datasets.
A comprehensive comparative analysis of these methods is conducted, focusing on
their efficacy in extracting OpenIE relations, including abstractive ones, from Portuguese
text. This research significantly contributes to the growing body of literature on the
application of Deep Learning techniques for OpenIE in the Portuguese language, addresses
critical resource gaps, and lays the foundation for further advancements in this field,
particularly in exploring generative and abstractive extraction capabilities.

 

Advisor

 

Daniela Barreiro Claro

 

Co-advisor

 

Marlo Vieira dos Santos e Souza

 

External Committee Member 1 (with affiliation)

 

Renata Vieira

 

Link to Lattes CV

 

http://lattes.cnpq.br/6218967777630412

 

External Committee Member 2 (with affiliation)

 

Marcos Garcia

 

Link to Lattes CV

 

https://citius.gal/team/marcos-garcia-gonzalez/

 

Internal Committee Member 1 or External Committee Member 3 (with affiliation)

 

Aline Marins Paes Carvalho

 

Link to Lattes CV

 

http://lattes.cnpq.br/0506389215528790

 

Internal Committee Member 2 or External Committee Member 4 (with affiliation)

 

Vládia Célia Monteiro Pinheiro

 

Link to Lattes CV

 

http://lattes.cnpq.br/2991281565518934

 

External Alternate Member 1 (with affiliation)

 

Helena de Medeiros Caseli

 

Link to Lattes CV

 

http://lattes.cnpq.br/6608582057810385

 

External Alternate Member 2 (with affiliation)

 

Pablo Gamallo

 

Link to Lattes CV

 

https://fegalaz.usc.es/~gamallo/

 

Internal Alternate Member 1 or External Alternate Member 3 (with affiliation)

 

Tatiane Rios

 

Link to Lattes CV

 

http://lattes.cnpq.br/0851148137941240

 

Internal Alternate Member 2 or External Alternate Member 4 (with affiliation)

 

Lais do Nascimento Salvador

 

Link to Lattes CV

 

http://lattes.cnpq.br/1972531466861737

 

Defense date

 

15 Sep, 2025

 

Defense time

 

8:00 AM

 

What are the main impacts of this work (social, technological, scientific, environmental)?

 

The main scientific impact of this work is advancing the state of the art in OpenIE for the Portuguese language, with the aim of fostering new LLMs for open information extraction. The social impact is an improvement in the distribution pipeline for the various Portuguese-language tasks that make use of OpenIE.

 

 

Defense date:
15/09/2025 - 08:00
Defense type:
Doctoral defense