Read text from PDF, Microsoft Word, HTML, and plain text files (2024)

Syntax

str = extractFileText(filename)

str = extractFileText(filename,Name,Value)

Description

str = extractFileText(filename) reads the text data from a file as a string.

str = extractFileText(filename,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

Extract Text Data from Text File

Extract the text from sonnets.txt using extractFileText. The file sonnets.txt contains Shakespeare's sonnets in plain text.

str = extractFileText("sonnets.txt");

View the first sonnet.

i = strfind(str,"I");
ii = strfind(str,"II");
start = i(1);
fin = ii(1);
extractBetween(str,start,fin-1)

ans =
    "I
     From fairest creatures we desire increase,
     That thereby beauty's rose might never die,
     But as the riper should by time decease,
     His tender heir might bear his memory:
     But thou, contracted to thine own bright eyes,
     Feed'st thy light's flame with self-substantial fuel,
     Making a famine where abundance lies,
     Thy self thy foe, to thy sweet self too cruel:
     Thou that art now the world's fresh ornament,
     And only herald to the gaudy spring,
     Within thine own bud buriest thy content,
     And tender churl mak'st waste in niggarding:
     Pity the world, or else this glutton be,
     To eat the world's due, by the grave and thee.
     "

Extract Text Data from PDF

Extract the text from exampleSonnets.pdf using extractFileText. The file exampleSonnets.pdf contains Shakespeare's sonnets in a PDF file.

str = extractFileText("exampleSonnets.pdf");

View the second sonnet.

ii = strfind(str,"II");
iii = strfind(str,"III");
start = ii(1);
fin = iii(1);
extractBetween(str,start,fin-1)

ans =
    "II
     When forty winters shall besiege thy brow,
     And dig deep trenches in thy beauty's field,
     Thy youth's proud livery so gazed on now,
     Will be a tatter'd weed of small worth held:
     Then being asked, where all thy beauty lies,
     Where all the treasure of thy lusty days;
     To say, within thine own deep sunken eyes,
     Were an all-eating shame, and thriftless praise.
     How much more praise deserv'd thy beauty's use,
     If thou couldst answer 'This fair child of mine
     Shall sum my count, and make my old excuse,'
     Proving his beauty by succession thine!
     This were to be new made when thou art old,
     And see thy blood warm when thou feel'st it cold.
     "

Extract the text from pages 3, 5, and 7 of the PDF file.

pages = [3 5 7];
str = extractFileText("exampleSonnets.pdf", ...
    'Pages',pages);

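View the 10th sonnet. The first match for "X" is the last character of the heading "IX", so the extracted text starts there and includes sonnet IX before sonnet X.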

x = strfind(str,"X");
xi = strfind(str,"XI");
start = x(1);
fin = xi(1);
extractBetween(str,start,fin-1)

ans =
    "X
     Is it for fear to wet a widow's eye,
     That thou consum'st thy self in single life?
     Ah! if thou issueless shalt hap to die,
     The world will wail thee like a makeless wife;
     The world will be thy widow and still weep
     That thou no form of thee hast left behind,
     When every private widow well may keep
     By children's eyes, her husband's shape in mind:
     Look! what an unthrift in the world doth spend
     Shifts but his place, for still the world enjoys it;
     But beauty's waste hath in the world an end,
     And kept unused the user so destroys it.
     No love toward others in that bosom sits
     That on himself such murd'rous shame commits.
     X
     For shame! deny that thou bear'st love to any,
     Who for thy self art so unprovident.
     Grant, if thou wilt, thou art belov'd of many,
     But that thou none lov'st is most evident:
     For thou art so possess'd with murderous hate,
     That 'gainst thy self thou stick'st not to conspire,
     Seeking that beauteous roof to ruinate
     Which to repair should be thy chief desire.
     "

Import Text from Multiple Files Using a File Datastore

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify the read function to be extractFileText.

readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn);

Create an empty bag-of-words model.

bag = bagOfWords
bag =
  bagOfWords with properties:

          Counts: []
      Vocabulary: [1x0 string]
        NumWords: 0
    NumDocuments: 0

Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.

while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end

View the updated bag-of-words model.

bag
bag =
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: ["From" "fairest" "creatures" "we" "desire" "increase" "," "That" "thereby" "beauty's" "rose" "might" "never" "die" "But" "as" "the" "riper" "should" ... ] (1x276 string)
        NumWords: 276
    NumDocuments: 4

Extract Text from HTML

To extract text data directly from HTML code, use extractHTMLText and specify the HTML code as a string.

code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)
str = "THE SONNETS by William Shakespeare"

Input Arguments

filename - Name of file
string scalar | character vector | 1-by-1 cell array containing a character vector

Name of the file, specified as a string scalar, character vector, or a 1-by-1 cell array containing a character vector.

Data Types: string | char | cell

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'Pages',[1 3 5] specifies to read pages 1, 3, and 5 from a PDF file.
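
As a minimal sketch of the two calling styles described above, the following reads the same pages with both syntaxes. It assumes that exampleSonnets.pdf, the file used in the examples on this page, is on the MATLAB path and has at least five pages. The variable names strNew and strOld are illustrative only.

```matlab
% Assumption: exampleSonnets.pdf from the examples above is on the path.
filename = "exampleSonnets.pdf";

% R2021a and later: Name=Value syntax.
strNew = extractFileText(filename, Pages=[1 3 5]);

% Earlier releases (also accepted in later ones): quoted name, then value.
strOld = extractFileText(filename, 'Pages', [1 3 5]);

% Both calls request the same pages, so the two results should be equal.
tf = isequal(strNew, strOld);
```

The quoted form is the safer choice for code that must also run on releases before R2021a.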

Encoding - Character encoding
'auto' (default) | 'UTF-8' | 'ISO-8859-1' | 'windows-1251' | 'windows-1252' | ...

Character encoding to use, specified as the comma-separated pair consisting of 'Encoding' and a character vector or a string scalar. The character vector or string scalar must contain a standard character encoding scheme name such as the following.

"Big5"          "ISO-8859-1"     "windows-874"
"Big5-HKSCS"    "ISO-8859-2"     "windows-949"
"CP949"         "ISO-8859-3"     "windows-1250"
"EUC-KR"        "ISO-8859-4"     "windows-1251"
"EUC-JP"        "ISO-8859-5"     "windows-1252"
"EUC-TW"        "ISO-8859-6"     "windows-1253"
"GB18030"       "ISO-8859-7"     "windows-1254"
"GB2312"        "ISO-8859-8"     "windows-1255"
"GBK"           "ISO-8859-9"     "windows-1256"
"IBM866"        "ISO-8859-11"    "windows-1257"
"KOI8-R"        "ISO-8859-13"    "windows-1258"
"KOI8-U"        "ISO-8859-15"    "US-ASCII"
"Macintosh"     "UTF-8"
"Shift_JIS"

If you do not specify an encoding scheme, then the function uses heuristics to detect the encoding automatically. The heuristics depend on your locale. If they fail, then you must specify the encoding explicitly.

This option only applies when the input is a plain text file.

Data Types: char | string
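
When auto-detection is unreliable, naming the encoding is a straightforward workaround. The sketch below writes a small file in ISO-8859-1 and reads it back with the matching 'Encoding' value. The file name encodingDemo.txt and the sample text are made up for illustration.

```matlab
% Hypothetical demo file in the temporary folder, written as ISO-8859-1.
filename = fullfile(tempdir, "encodingDemo.txt");
fid = fopen(filename, "w", "n", "ISO-8859-1");
fprintf(fid, "%s", "café crème");
fclose(fid);

% Name the encoding explicitly instead of relying on auto-detection.
str = extractFileText(filename, 'Encoding', 'ISO-8859-1');
```

If the declared encoding does not match the file, accented characters typically come back garbled, which is a quick way to spot a wrong guess.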

ExtractionMethod - Extraction method
'tree' (default) | 'article' | 'all-text'

Extraction method, specified as the comma-separated pair consisting of 'ExtractionMethod' and one of the following:

Option        Description
'tree'        Analyze the DOM tree and text contents, then extract a block of paragraphs.
'article'     Detect article text and extract a block of paragraphs.
'all-text'    Extract all text in the HTML body, except for scripts and CSS styles.

This option supports HTML file input only.
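
To compare extraction methods on the same input, one option is to write a small HTML file and read it with two settings. This is a sketch only: the file name extractionDemo.html is hypothetical, and the HTML is the snippet from the extractHTMLText example above. On real pages with navigation bars and sidebars the methods are more likely to differ.

```matlab
% Hypothetical demo file in the temporary folder.
filename = fullfile(tempdir, "extractionDemo.html");
code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
fid = fopen(filename, "w", "n", "UTF-8");
fprintf(fid, "%s", code);
fclose(fid);

% Default method ('tree').
strTree = extractFileText(filename);

% Extract all text in the HTML body.
strAll = extractFileText(filename, 'ExtractionMethod', 'all-text');
```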

Password - Password to open PDF file
character vector | string scalar

Password to open the PDF file, specified as the comma-separated pair consisting of 'Password' and a character vector or a string scalar. This option only applies if the input file is a PDF.

Example: 'Password','skroWhtaM'

Data Types: char | string

Pages - Pages to read from PDF file
vector of positive integers

Pages to read from PDF file, specified as the comma-separated pair consisting of 'Pages' and a vector of positive integers. This option only applies if the input file is a PDF file. The function, by default, reads all pages from the PDF file.

Example: 'Pages',[1 3 5]

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

Tips

  • To read text directly from HTML code, use extractHTMLText.

  • To read text separated by lines in a text file, use readlines.
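
The difference between the two readers in the second tip can be sketched as follows, using a two-line file. The file name linesDemo.txt is hypothetical. Note that readlines requires R2020b or later.

```matlab
% Hypothetical two-line demo file in the temporary folder.
filename = fullfile(tempdir, "linesDemo.txt");
fid = fopen(filename, "w");
fprintf(fid, "first line\nsecond line");
fclose(fid);

% extractFileText returns the whole file as a single string scalar.
str = extractFileText(filename);

% readlines returns a string array with one element per line.
lines = readlines(filename);
```

Use readlines when later steps work line by line, and extractFileText when the file is treated as one document, for example before calling tokenizedDocument.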

Version History

Introduced in R2017b

See Also

pdfinfo | extractHTMLText | readPDFFormData | writeTextDocument | tokenizedDocument

Topics

  • Extract Text Data from Files
  • Prepare Text Data for Analysis
  • Create Simple Text Model for Classification

