March 12th, 2013
As part of a project I'm working on in my free time, I needed to figure out corporate relationships. The SEC requires that all publicly held corporations file a list of their subsidiaries in their Form 10-K each year. So by scraping one section of the 10-K (called Exhibit 21.1), you can extract a list of the registrant's subsidiaries. The problem is that every company formats its 10-K differently, and the lack of uniformity makes scraping a lot harder. Moreover, the filings say nothing about privately held companies. Anyway, I did my best, and the scraper manages to extract a lot of information.
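To give a rough idea of what the scraping involves, here is a minimal sketch in Python. It is not the code from my project. It assumes the exhibit is an HTML page where each subsidiary sits in its own table row, with the name in the first cell and the jurisdiction in the second, which is only one of the many layouts you'll run into. The names `RowCollector`, `extract_subsidiaries`, and `sample` are made up for this illustration.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse line breaks, non-breaking spaces, and repeated spaces.
            text = " ".join("".join(self._cell).split())
            if text:
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def extract_subsidiaries(exhibit_html):
    """Return (name, jurisdiction) pairs from a table-style exhibit.

    Assumption: first cell is the subsidiary name, second cell (if any)
    is the jurisdiction. Rows that look like a column header are skipped.
    """
    parser = RowCollector()
    parser.feed(exhibit_html)
    parser.close()

    results = []
    for row in parser.rows:
        name = row[0]
        jurisdiction = row[1] if len(row) > 1 else None
        # Heuristic: a header row usually mentions "jurisdiction".
        if jurisdiction and "jurisdiction" in jurisdiction.lower():
            continue
        results.append((name, jurisdiction))
    return results


# Made-up exhibit text, only for demonstration.
sample = """
<table>
  <tr><td>Name of Subsidiary</td><td>Jurisdiction of Incorporation</td></tr>
  <tr><td>Acme Holdings&nbsp;LLC</td><td>Delaware</td></tr>
  <tr><td>Acme (UK)
      Limited</td><td>England and Wales</td></tr>
</table>
"""

for name, jurisdiction in extract_subsidiaries(sample):
    print(name, "|", jurisdiction)
```

A filing that lists subsidiaries as plain paragraphs or preformatted text would come back empty here, which is exactly the lack of uniformity that makes this hard. A real scraper needs a fallback for each layout it meets.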
I kept the project lightweight and independent of any storage backend, so I should be able to integrate it back into the larger project easily at a later date. I'm also hoping that others might find it useful. It's a little bit out there though, so who knows. It's up on GitHub here