Python - Corpora Access

Corpora (the plural of corpus) are collections of text documents; a single collection is called a corpus. One well-known example is the Gutenberg Corpus: Project Gutenberg hosts some 25,000 free electronic books at http://www.gutenberg.org/, and NLTK bundles a small selection of them as nltk.corpus.gutenberg. In the following example, we list the file identifiers in that selection. Every one is a plain-text file whose name ends in .txt. The corpus data must be installed first, for example with nltk.download('gutenberg').

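from nltk.corpus import gutenberg

# Requires the corpus data: run nltk.download('gutenberg') once beforehand.
# fileids() returns the names of all files in the corpus.
file_ids = gutenberg.fileids()
print(file_ids)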

When we run the above program, we get the following output:

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']
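
Because the file identifiers are ordinary strings, they can be narrowed down with standard string methods. The following is a minimal sketch, not part of NLTK: the list is copied from the output above so the snippet runs without the corpus data, and the helper name ids_by_author is a hypothetical choice made for illustration.

```python
# File identifiers copied from the output above, so no corpus download is needed.
file_ids = [
    'austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
    'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt',
    'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt',
    'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt',
    'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt',
    'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt',
]


def ids_by_author(ids, author):
    """Return the file IDs whose name starts with '<author>-' (hypothetical helper)."""
    prefix = author + "-"
    return [file_id for file_id in ids if file_id.startswith(prefix)]


# Keep only plain-text files; here every ID already ends in .txt.
text_ids = [file_id for file_id in file_ids if file_id.endswith(".txt")]
print(len(text_ids))
print(ids_by_author(file_ids, "shakespeare"))
```

With NLTK and the corpus installed, the hard-coded list could be replaced by the result of gutenberg.fileids().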

Accessing Raw Text

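We can read the raw text of any of these files with gutenberg.raw(), which returns the file's contents as a single string. The sent_tokenize function, also available in nltk, then splits that string into sentences. In the following example, we print the first two sentences of blake-poems.txt. Because the tokenizer breaks only at sentence-ending punctuation, a single sentence can span many lines of the poem. sent_tokenize relies on the Punkt tokenizer models, which must be downloaded separately (nltk.download('punkt'), or 'punkt_tab' in recent NLTK releases).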

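from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

# Read the whole file as one string.
sample = gutenberg.raw("blake-poems.txt")
# Split the raw text into a list of sentences.
sentences = sent_tokenize(sample)
# Print the first two sentences.
for sentence in sentences[:2]:
    print(sentence)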

When we run the above program, we get the following output:

[Poems by William Blake 1789]
 
SONGS OF INNOCENCE AND OF EXPERIENCE
and THE BOOK of THEL
 SONGS OF INNOCENCE
 
 
 INTRODUCTION
 
 Piping down the valleys wild,
   Piping songs of pleasant glee,
 On a cloud I saw a child,
   And he laughing said to me:
 
 "Pipe a song about a Lamb!"
So I piped with merry cheer.
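
The units printed above are whatever sent_tokenize treats as sentences, so they do not line up with the visual paragraphs of the poem. If paragraph-level units are wanted instead, one simple approach is to split the raw text on blank lines. The sketch below assumes that a "paragraph" means a block separated by one or more blank lines; split_paragraphs is a hypothetical helper, and the sample string is a short excerpt in the same layout as the output above so the snippet runs without the corpus data.

```python
import re

# A short excerpt in the same layout as the output above; with NLTK installed,
# the full text would come from gutenberg.raw("blake-poems.txt") instead.
sample = (
    "[Poems by William Blake 1789]\n"
    "\n"
    "SONGS OF INNOCENCE AND OF EXPERIENCE\n"
    "and THE BOOK of THEL\n"
    "\n"
    " Piping down the valleys wild,\n"
    "   Piping songs of pleasant glee,\n"
)


def split_paragraphs(text):
    """Split text on runs of blank lines (hypothetical helper)."""
    blocks = re.split(r"\n\s*\n", text)
    # Drop empty blocks and trim surrounding whitespace from each paragraph.
    return [block.strip() for block in blocks if block.strip()]


paragraphs = split_paragraphs(sample)
# Print the first two paragraphs, separated by a marker line.
for paragraph in paragraphs[:2]:
    print(paragraph)
    print("---")
```

NLTK's plain-text corpus readers also provide a paras() method, which returns paragraphs as nested lists of tokenized sentences rather than as raw strings.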