Parse Text Filing — parse_text

Given a link to a filing document (e.g. a 10-K or 8-K) in TXT format, split the file into parts and items. This enables follow-up processing of a specific section, e.g. just the Risk Factors. `item.name` and `part.name` are taken directly from the document, with no attempt to normalize them.

parse_text_filing(x, strip = TRUE, include.raw = FALSE, fix.errors = TRUE)

Arguments

x	- URL to a filing text document, or the text itself
strip	- Should non-text elements be removed? Default: TRUE
include.raw	- Include unprocessed nodes in the result? Default: FALSE
fix.errors	- Try to fix document errors (e.g. missing part labels). Work in progress. Default: TRUE

Value

A data.frame with one row per paragraph and the following columns:

part.name: Detected name of the Part
item.name: Detected name of the Item
text: Text of the paragraph / node
raw*: Raw HTML of the node if include.raw = TRUE
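
Because `item.name` is not normalized, selecting a section usually means matching it loosely. The sketch below shows one way to pull out the Risk Factors paragraphs. To stay runnable offline, it uses a small hand-made data.frame (`filing`, with invented rows) in place of a real `parse_text_filing()` result, and it assumes the item heading contains the words "risk factors" in some capitalization.

```r
# Hypothetical stand-in for a parse_text_filing() result; the rows are
# invented for illustration and only mimic the documented columns.
filing <- data.frame(
  part.name = c("", "PART I", "PART I", "PART I", "PART II"),
  item.name = c("", "Item 1. Business", "ITEM 1A. RISK FACTORS",
                "ITEM 1A. RISK FACTORS", "Item 5. Market Information"),
  text = c("FORM 10-K", "We make widgets.", "Demand may fall.",
           "Costs may rise.", "Our stock trades on the NYSE."),
  stringsAsFactors = FALSE
)

# Names are taken verbatim from the document, so match case-insensitively
# on a distinctive fragment rather than on the exact heading.
is_risk <- grepl("risk factors", filing$item.name, ignore.case = TRUE)
risk_factors <- filing[is_risk, ]

# Collapse the matching paragraphs into one string for further processing.
risk_text <- paste(risk_factors$text, collapse = "\n\n")
```

With a real filing, `filing` would be the value returned by `parse_text_filing()`. The pattern may need adjusting for documents that label the section differently.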

Details

NOTE: This has been tested on a range of documents, but formatting differences could cause failures. Please report an issue for any document that isn't parsed correctly.

FURTHER NOTE: Not all filings are well formed: headings may be missing, spacing may be irregular, etc. Any of these can throw the parsing off!
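
A quick way to spot a document that may not have parsed cleanly is to look at how the labels are distributed. The following is a minimal sketch, not part of the package. It assumes that undetected labels appear as empty strings (as the printed example output suggests) and uses an invented data.frame in place of a real result. Rows before the first heading, such as the cover page, are expected to be unlabeled, so a nonzero count is not an error in itself.

```r
# Hypothetical stand-in for a parse_text_filing() result (invented rows).
filing <- data.frame(
  part.name = c("", "", "PART I", "", "PART II"),
  item.name = c("", "", "Item 1.", "Item 2.", "Item 5."),
  text = c("FORM 10-K", "(Mark One)", "Business...",
           "Properties...", "Market..."),
  stringsAsFactors = FALSE
)

# Treat both NA and "" as "no label detected" (an assumption).
is_blank <- function(x) is.na(x) | !nzchar(x)

# Paragraphs with neither a part nor an item, typically front matter.
n_unlabeled <- sum(is_blank(filing$part.name) & is_blank(filing$item.name))

# Items detected without an enclosing part may point to a missing part heading.
orphan_items <- unique(
  filing$item.name[is_blank(filing$part.name) & !is_blank(filing$item.name)]
)

# Items in document order, for comparison against the form's expected outline.
detected_items <- unique(filing$item.name[!is_blank(filing$item.name)])
```

If `orphan_items` is non-empty or `detected_items` skips sections you expect, the filing is a good candidate for an issue report.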

Examples

# \donttest{
try(head(parse_text_filing(
  "https://www.sec.gov/Archives/edgar/data/37996/000003799602000015/v7.txt"
)))
#>                                                                                                                            text
#> 1 UNITED STATES\n                       SECURITIES AND EXCHANGE COMMISSION\n                             Washington, D.C. 20549
#> 2                                                                                                                     FORM 10-K
#> 3                                                                                                                    (Mark One)
#> 4       X      Annual report pursuant to Section 13 or 15(d) of the Securities\n------   Exchange Act of 1934 (No Fee Required)
#> 5                                                                                   For the fiscal year ended December 31, 2001
#> 6                                                                                                                            or
#>   part.name item.name
#> 1                    
#> 2                    
#> 3                    
#> 4                    
#> 5                    
#> 6                    
# }