Given a link to a filing document (e.g. a 10-K or 8-K) in HTML, process the file into parts and items. This enables follow-up processing of a desired section, e.g. just the Risk Factors. `item.name` and `part.name` are taken directly from the document without any attempt to normalize them.
parse_filing(x, strip = TRUE, include.raw = FALSE, fix.errors = TRUE)
- x: URL to a filing HTML document, HTML text, or an xml_document
- strip: Should non-text elements be removed? Default: TRUE
- include.raw: Include unprocessed nodes in the result? Default: FALSE
- fix.errors: Try to fix document errors (e.g. missing part labels). Work in progress. Default: TRUE
A data frame with one row per paragraph and the following columns:
- part.name: Detected name of the Part
- item.name: Detected name of the Item
- text: Text of the paragraph / node
- Raw HTML of the node, included only if include.raw = TRUE
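Because the result has one row per paragraph, pulling out a single section comes down to filtering on `item.name`. The following is a minimal sketch in base R. It uses a hand-built data frame as a hypothetical stand-in for a `parse_filing()` result, so the item labels and text are illustrative assumptions and will differ from filing to filing.

```r
# Hypothetical stand-in for a parse_filing() result. The labels and text are
# invented for illustration; real values come from the filing itself.
filing <- data.frame(
  part.name = c("", "PART I", "PART I", "PART I", "PART II"),
  item.name = c("", "Item 1. Business", "Item 1A. Risk Factors",
                "Item 1A. Risk Factors", "Item 5. Market Information"),
  text = c("FORM 10-K", "We make widgets.", "Demand may fall.",
           "Costs may rise.", "Our stock trades on an exchange."),
  stringsAsFactors = FALSE
)

# Item names are not normalized, so match loosely and ignore case
# instead of comparing against one exact label.
is_risk <- grepl("risk factors", filing$item.name, ignore.case = TRUE)
risk_factors <- filing[is_risk, ]

# Collapse the matching paragraphs into a single block of text.
risk_text <- paste(risk_factors$text, collapse = "\n\n")
```

A loose pattern is used here because labels such as "Item 1A. Risk Factors" and "ITEM 1A - RISK FACTORS" can both occur across filings. Tighten the pattern if it picks up unrelated items.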
NOTE: This has been tested on a range of documents, but formatting differences could cause failures. Please report an issue for any document that isn't parsed correctly.
FURTHER NOTE: Not all filings are well formed - missing headings, bad spacing, etc. These can all throw the parsing off!
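One way to spot a filing that may have thrown the parser off is to check how many paragraphs ended up without an item label. The sketch below uses a hypothetical helper, `section_summary()`, which is not part of the package. It assumes that unlabeled rows carry either an empty string or `NA` in `item.name`, and it runs on a hand-built data frame standing in for a real result.

```r
# Hypothetical helper (not part of the package): summarise how well a
# parsed filing was labeled. Assumes unlabeled rows hold "" or NA.
section_summary <- function(filing) {
  item <- filing$item.name
  unlabeled <- is.na(item) | !nzchar(item)
  list(
    paragraphs = nrow(filing),          # total rows in the result
    unlabeled  = sum(unlabeled),        # rows with no detected item
    items      = unique(item[!unlabeled])  # distinct item labels found
  )
}

# Hand-built stand-in for a parse_filing() result, for illustration only.
filing <- data.frame(
  part.name = c("", "", "PART I", "PART I", NA),
  item.name = c("", "", "Item 1. Financial Statements",
                "Item 1. Financial Statements", NA),
  text = c("UNITED STATES", "FORM 10-Q", "Balance sheet.",
           "Income statement.", "Signature page."),
  stringsAsFactors = FALSE
)

check <- section_summary(filing)
```

Some unlabeled rows are expected, since cover-page paragraphs precede the first item. A result with very few distinct items, or a very high unlabeled count, is a hint that the document deserves a manual look.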
head(parse_filing(paste0('https://www.sec.gov/Archives/edgar/data/',
                         '712515/000071251517000010/ea12312016-q3fy1710qdoc.htm')), 6)
#>                                 text part.name item.name
#> 1                           Document
#> 2                  Table of Contents
#> 3                      UNITED STATES
#> 4 SECURITIES AND EXCHANGE COMMISSION
#> 5             WASHINGTON, D.C. 20549
#> 6                          FORM 10-Q