Given a link to a filing document (e.g. a 10-K or 8-K) in plain-text (TXT) format, or the text itself, split the file into parts and items. This enables follow-up processing of a single section, e.g. just the Risk Factors. `item.name` and `part.name` are taken directly from the document, with no attempt to normalize them.
parse_text_filing(x, strip = TRUE, include.raw = FALSE, fix.errors = TRUE)
Argument | Description |
---|---|
x | URL to a filing text document, or the actual text |
strip | Should non-text elements be removed? Default: TRUE |
include.raw | Include unprocessed nodes in the result? Default: FALSE |
fix.errors | Try to fix document errors (e.g. missing part labels). WIP. Default: TRUE |
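A data frame with one row per paragraph and the following columns:

Column | Description |
---|---|
part.name | Detected name of the Part |
item.name | Detected name of the Item |
text | Text of the paragraph / node |

When include.raw = TRUE, an additional column holds the raw HTML of the node.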
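Because the result is an ordinary data frame, pulling out one section is a matter of filtering on `item.name`. The sketch below is only an illustration: the `filing` data frame is a made-up stand-in that mirrors the documented columns, not real output of `parse_text_filing()`, and the pattern used to find the Risk Factors item is an assumption that may need adjusting per document, since names are not normalized.

```r
# Hypothetical stand-in for a parse_text_filing() result; the rows are
# invented for illustration and only mirror the documented columns.
filing <- data.frame(
  part.name = c("", "PART I", "PART I", "PART I", "PART II"),
  item.name = c("", "Item 1. Business", "ITEM 1A. RISK FACTORS",
                "ITEM 1A. RISK FACTORS", "Item 5. Market Information"),
  text = c("FORM 10-K", "We make widgets.", "Demand may fall.",
           "Costs may rise.", "Our stock trades on an exchange."),
  stringsAsFactors = FALSE
)

# Item names are taken verbatim from the filing, so match on a
# case-insensitive pattern rather than an exact string.
is_risk <- grepl("^item\\s+1a\\b", filing$item.name,
                 ignore.case = TRUE, perl = TRUE)
risk_factors <- filing[is_risk, ]

# Join the section's paragraphs back into a single block of text.
risk_text <- paste(risk_factors$text, collapse = "\n\n")
```

With a real filing, `filing` would instead be the value returned by `parse_text_filing()`; the filtering step stays the same.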
NOTE: This has been tested on a range of documents, but formatting differences could cause failures. Please report an issue for any document that isn't parsed correctly.
FURTHER NOTE: Not all filings are well-formed; missing headings, bad spacing, and similar problems can all throw the parsing off!
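One way to spot a filing that was not parsed cleanly is to count paragraphs per detected part and item and check how many rows carry no item label. This is a rough sketch rather than a feature of the package: the `parsed` data frame is hypothetical sample data, and it assumes undetected names appear as empty strings, as in the example output on this page.

```r
# Hypothetical parse result; values are made up to show the check.
parsed <- data.frame(
  part.name = c("", "", "PART I", "PART I", "PART II"),
  item.name = c("", "", "Item 1. Business", "Item 1. Business",
                "Item 5. Market"),
  text = c("FORM 10-K", "(Mark One)", "We make widgets.",
           "We sell widgets.", "Our stock trades on an exchange."),
  stringsAsFactors = FALSE
)

# Paragraph counts per part/item combination, kept in document order.
labels <- paste(parsed$part.name, parsed$item.name, sep = " | ")
counts <- table(factor(labels, levels = unique(labels)))
print(counts)

# Share of paragraphs with no detected item. Some unlabelled rows are
# expected (cover page), but a very high share suggests missed headings.
unlabelled <- mean(parsed$item.name == "")
```

A filing where nearly every row is unlabelled, or where an expected item never appears in `counts`, is a good candidate for an issue report.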
# \donttest{
try(head(parse_text_filing(
  "https://www.sec.gov/Archives/edgar/data/37996/000003799602000015/v7.txt"
)))
#>   text
#> 1 UNITED STATES\n SECURITIES AND EXCHANGE COMMISSION\n Washington, D.C. 20549
#> 2 FORM 10-K
#> 3 (Mark One)
#> 4 X Annual report pursuant to Section 13 or 15(d) of the Securities\n------ Exchange Act of 1934 (No Fee Required)
#> 5 For the fiscal year ended December 31, 2001
#> 6 or
#>   part.name item.name
#> 1
#> 2
#> 3
#> 4
#> 5
#> 6
# }