
WikiDocsAligner: An Off-the-Shelf Wikipedia Documents Alignment Tool

Saad M. (first author; member of the Collaboration Group)
2017-01-01

Abstract

Wikipedia is an attractive source of comparable corpora in many languages. Most researchers develop their own scripts to perform the document alignment task, which takes time and effort. In this paper, we present WikiDocsAligner, an off-the-shelf tool for aligning Wikipedia articles. WikiDocsAligner does not require researchers to import or export interlanguage link databases. The user only needs to download the Wikipedia dumps (interlanguage links and articles) and provide them to the tool, which performs the alignment. The software can easily be used to align Wikipedia documents in any language pair. Finally, we use WikiDocsAligner to align comparable documents from Arabic Wikipedia and Egyptian Wikipedia, thereby shedding light on Wikipedia as a source of language resources for Arabic dialects. The resulting resources are useful because demand for Arabic and dialectal Arabic language resources has increased over the last decade.
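The alignment idea summarized in the abstract, pairing articles through interlanguage links, can be illustrated with a minimal Python sketch. This is not the WikiDocsAligner implementation, which this record does not describe. It assumes the interlanguage links have already been parsed into tuples that mirror the columns of MediaWiki's `langlinks` table (`ll_from`, `ll_lang`, `ll_title`); the function name `align_articles` and its parameters are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Assumed record shape, mirroring MediaWiki's langlinks table:
# (ll_from: source page id, ll_lang: target language code, ll_title: target title)
LangLink = Tuple[int, str, str]


def align_articles(
    langlinks: Iterable[LangLink],
    source_titles: Dict[int, str],
    target_titles: Set[str],
    target_lang: str,
) -> List[Tuple[str, str]]:
    """Pair source and target article titles joined by an interlanguage link.

    Hypothetical helper: a link is kept only if it points to `target_lang`
    and both of its endpoints exist in the supplied article collections.
    """
    pairs = []
    for page_id, lang, title in langlinks:
        if lang != target_lang:
            continue  # link points to some other language edition
        source_title = source_titles.get(page_id)
        if source_title is None or title not in target_titles:
            continue  # one side of the pair is missing from the article dump
        pairs.append((source_title, title))
    return pairs


if __name__ == "__main__":
    # Toy data with placeholder titles; "arz" is the Egyptian Arabic Wikipedia code.
    links = [(1, "arz", "B1"), (1, "fr", "F1"), (2, "arz", "B9")]
    print(align_articles(links, {1: "A1", 2: "A2"}, {"B1"}, "arz"))
```

Holding the source titles in a dictionary and the target titles in a set keeps each lookup constant-time, which matters when the link list is as large as a full Wikipedia dump.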
Year: 2017
ISBN: 9781509065387

Use this identifier to cite or link to this document: https://hdl.handle.net/11587/561289