In April 2021, Social Market Analytics, Inc. (SMA) released Global Machine Readable Filings (GMRF) in partnership with S&P Global Market Intelligence. Global Machine Readable Filings is the first product to provide parsed textual data from company Annual Reports, Quarterly Reports, Semi-Annual Reports and Financial Supplements for over 200 countries, across ~1.6M documents.

This blog will explore the details of this offering and some of its unique features.  For global filings there is no central repository of documents.  S&P has archived global filings for the last 20 years.  At SMA we used our patented document processing technology to turn these files into machine readable JSON.  The structure of each document is created by the individual company, so it is subject to change.  For example, Company A can change its formatting year over year. Since each company has its own format, comparing Company A with Company B is also challenging.  The original files come as PDFs with significant variations: “magazine style” layouts, scanned documents and tables without HTML tags.  Our machine learning algorithms have processed these files and successfully converted them to JSON.

Below are a few examples of parsing different file types. The SIKA 2018 Annual Report table of contents is below.

By reading the Table of Contents (TOC) we identify the primary headings.  In addition, we identify the subheadings under Strategic Target Markets & Risk Management, which are not available in the TOC, based on font and structure variations.

The next example is Credit Suisse Group 2017 AR.

The SMA parser dynamically identifies Items, parts and subsections.

We use our Generic document parser to read and identify parts.  Below is the strategy section broken into sub-parts.


These are just two filings.  As mentioned, there are 1.6 million documents in the archive, and updates are provided in real time through XpressFeed, Snowflake or the SMA Filings API.  To trial this product or learn more about other Social Market Analytics datasets please email us at


Smartphones, electric and hybrid cars, and X-rays have undoubtedly taken on an important role in today’s society. All three depend on a limited resource called Rare Earth Metals. The use of Rare Earth Elements (REE) does not stop there. REEs are also used for refining crude oil, building wind turbines, and making televisions and computer screens, and they are necessary for the production of many other important products. There is no doubt Rare Earth Metals play an important role in the world. From an investing standpoint, you might therefore wonder who is mining these crucial metals and who is using them. This blog answers those questions by using Machine Readable Filings (MRF) and the Social Market Analytics, Inc. (SMA) Complex Topic Model engine. The analysis is run across all 10-K, 10-Q, 8-K, 20-F, 6-K and 40-F filings from SEC Edgar.

Social Market Analytics, Inc. (SMA) has partnered with S&P Global Market Intelligence on ‘Machine Readable Filings’ (MRF). Machine Readable Filings is the first product to provide parsed textual data of SEC Edgar Regulatory Filings at the Total Document, Part, Item, Sub-section, and Notes level with historical baselines back to 2006. The Textual Data, Sentiment NLP, and Word Count values are provided by SMA’s patented technology. Extraneous information such as page numbers, images, and tables is removed. S&P Global and SMA have also developed the same structured data values for International Company Reports.

SEC regulatory filings are formal documents reported to the U.S. Securities and Exchange Commission (SEC) that contain important information about companies such as financial statements, forward looking statements, risks associated with the company, etc. Every U.S. publicly traded company is required to submit regulatory filings to the SEC.

Companies discuss their forward focus in these documents, and you can glean a lot about a company’s direction or risks by the topics they discuss in filings. With SMA’s topic modeling capabilities you can quickly scan documents for curated topics. If you are searching for companies investing in a specific topic, this functionality makes the process much more efficient. As an example, if you want to find companies with negative exposure to Covid-19 you can search for Covid-19 mentions in the Risk Factors section of a 10-K or 10-Q. If you would like to find a company’s strategy for handling their business during Covid-19, you may want to search in the Management Discussion & Analysis section.

Machine Readable Filings (MRF), used in conjunction with SMA’s Complex Topic Model engine, allows for the programmatic curation of Thematic Investing themes. Our topic modeling allows us to capture every relevant ‘Rare Earth’ term within SEC filings and exclude terms that are not relevant. For example, U.S. Rare Earth Minerals, Inc. mentioning its own name in a filing would not be considered relevant. SMA’s proprietary Complex Topic Model engine uses smart synonym and proximity search to make sure you capture all relevant information.
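
To make the exclusion idea concrete, here is a minimal sketch in Python. It is not SMA’s Complex Topic Model engine, which is proprietary and also uses synonym and proximity search. The function name count_topic_hits, the term list and the sample sentence are all assumptions made up for illustration. The sketch simply blanks out excluded phrases, such as a company’s own name, before counting term mentions.

```python
import re

def count_topic_hits(text, terms, exclude_phrases):
    """Count case-insensitive mentions of each term, ignoring excluded phrases.

    Hypothetical helper for illustration only; a real topic model would also
    handle synonyms and proximity rules.
    """
    cleaned = text
    # Blank out phrases that should not count, e.g. the filer's own name.
    for phrase in exclude_phrases:
        cleaned = re.sub(re.escape(phrase), " ", cleaned, flags=re.IGNORECASE)
    hits = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        hits[term] = len(re.findall(pattern, cleaned, flags=re.IGNORECASE))
    return hits

# Made-up sample text, not taken from a real filing.
sample = (
    "U.S. Rare Earth Minerals, Inc. sells rare earth coated steel. "
    "Demand for rare earth magnets using neodymium is rising."
)
hits = count_topic_hits(
    sample,
    terms=["rare earth", "neodymium"],
    exclude_phrases=["U.S. Rare Earth Minerals, Inc."],
)
print(hits)
```

In this sample the company name is not counted, so only the two genuine ‘rare earth’ mentions and the single neodymium mention register as hits.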

Example of the Dataset

The data is available through an API, CSV reports or other customizable options to be easily read. Below is a picture of part of the data set used for this blog. The circled context reads: “We manufacture and sell an array of plain surface prestressed steel materials and rare earth coated and zinc coated prestressed steel materials…”

In this blog we illustrate the power of SMA’s topic searching by identifying the companies with the most mentions of the topic ‘Rare Earth’ in their filings. The Topic Model data extraction across ~1.4 million documents going back to 2006 takes only a few minutes. The MRF textual data structure also enables the creation of baselines.

Full Rare Earth Universe

To answer who is mining and using these Rare Earth Materials, we first must look at the topic model on a macro level. Below is a chart illustrating mentions from each sector. A hit is defined as a single mention of a topic term within an SEC filing. The topic model contains over 40,000 hits from January 1, 2006 to May 19, 2021.

The chart demonstrates that 65% of the hits come from the Materials sector. Of the Materials sector hits, 24,768 (96%) are from the Metals and Mining industry. Using that information, we can determine that the large majority of the Materials sector companies are the suppliers. The rest of the sectors appear to be the consumers. The top two consumer sectors are Health Care (4,329 hits, 11% of total hits) and Information Technology (2,908 hits, 7% of total hits). We will look in depth at the largest supplier sector and the two largest consumer sectors.

Materials Sector

Zooming in on the suppliers, we see that the three most mentioned Rare Earth Metals are scandium with 5,191 hits, neodymium with 1,266 hits and yttrium with 1,237 hits. The most frequently hit term by far was ‘Rare Earth’ itself, with 11,654 hits. That these are the three metals suppliers talk about most could imply that they are the three most profitable or abundant.

We know which sector the suppliers are in and which metals they discuss the most. Which specific companies discuss these metals most frequently? The graph below shows the top 20 companies mentioning Rare Earth terms in their filings from January 1, 2006 to May 19, 2021. The bars are color coded by the metal discussed. Scandium International, a Canadian company, has the most mentions, and an overwhelming majority of them are about scandium. It might be inferred that this is because the company mentions its own name a lot in its filings, but by using SMA’s Complex Topic Model engine we are able to exclude the mentions of “Scandium International” and capture only the relevant mentions of scandium. Scandium International is very focused on scandium and discusses the metal frequently within its filings. Molycorp (last filed in 2016), although no longer publicly traded, is a close second and discusses a variety of metals. Third is Texas Mineral Resources Corp, which has the most mentions of any currently publicly traded U.S. stock.

The top two suppliers by mentions are a Canadian company and an American company that is no longer publicly traded. The next three companies are all trading below $5 a share. This could imply that the U.S. is not a major supplier of Rare Earth Metals. Many articles suggest that China holds the majority of Rare Earth Metals. The top companies in this chart support the view that the majority of the supply of Rare Earth Metals is not mined by American companies.

Consuming Sectors of Rare Earth

Now that we have looked in depth at the suppliers, we will shift to the consumers. The most frequently mentioned terms in these filings from January 1, 2006 to May 19, 2021 are yttrium with 800 hits, gadolinium with 708 hits and holmium with 695 hits.  There is a large drop to lutetium with 473 hits. This is vastly different from the suppliers. Scandium and neodymium are only the 13th and 8th most mentioned metals among the consumers. Both suppliers and consumers have yttrium among their top three metals; however, the difference in conversation across the rest of the metals could hint at a disconnect.

After looking at the most mentioned metals, we can turn to which sectors discuss which metals, as shown in the graph below. The Health Care sector discusses 6 of these metals more than any other sector, led by yttrium with 800 hits, gadolinium with 708 hits and holmium with 695 hits. These results are not surprising, as yttrium is used in cancer treatments, gadolinium is used in MRI machines and holmium is used in holmium lasers, which are used in surgeries. Information Technology talks about 5 different metals more than any other sector. The most frequently mentioned are ytterbium with 358 hits, erbium with 339 hits and yttrium with 238 hits. These metals are all used in alloys to strengthen or soften metals.

The companies with the most hits from January 1, 2006 to May 19, 2021 within the healthcare sector include Vivos Inc., Trimedyne, Inc. and Immunomedics, Inc. These companies mention yttrium and holmium the most. That could imply that these companies have large dependencies on those metals.

The chart below shows the top 20 companies within the IT sector. The top 3 companies are IPG Photonics, Western Digital and II-VI Incorporated. IPG Photonics develops fiber lasers, Western Digital manufactures hard drives and data center equipment, and II-VI Incorporated manufactures optical materials and semiconductors.

It might be surprising that large companies like Apple or Intel do not make the top 20 list for mentions of rare earth metals, considering these metals are used in building microchips and cell phones. That could mean that they are not dependent enough on rare earth elements to mention them in the Risk Factors or Business portions of their filings. In that sense, it is just as important to look at which companies are not mentioning Rare Earth Metals in their filings. A shortage in Rare Earth Metals may not drastically impact their business. The companies that do make the list, however, are dependent enough that a shortage or other event involving Rare Earth Metals could impede their business.


How does knowing who is and who is not mentioning Rare Earth Metals help from an investment standpoint? Investing in the suppliers could be beneficial if you believe the consumer businesses will grow. If the consumers’ businesses grow, demand will increase, which will allow these mining companies to sell more Rare Earth Metals at higher prices, likely increasing the value of these companies. Knowing which companies are the most and least dependent on Rare Earth Metals could give you an understanding of the risk in these companies if Rare Earth Metals become scarcer due to geopolitical risk, mining accidents, civil disputes or many other reasons. Although this data reflects only American, Canadian, and ADR SEC filings, it is still very powerful. SMA’s Complex Topic Model engine will soon include Global filings and Sustainability Reports.

In April 2021, SMA, in partnership with S&P Global Market Intelligence, released ‘Global Machine Readable Filings’, covering company Annual Reports (AR) from over 200 countries. SMA will be running a Rare Earth analysis on ARs shortly.

Today’s Macro and Technology Themes are unprecedented and cover a multitude of emerging “Themes” from Drones to ESG to Genome Research to Rare Earth, etc. The investment potential is unlimited. Thematic Investing is becoming a powerful vehicle across investment wrappers such as ETFs and Structured Products, as well as in portfolio holdings of Hedge Funds and Asset Managers. Prior to Machine Readable Filings (MRF), capturing data within Regulatory Filings and Company Reports was difficult to impossible. The MRF data is broad across Sectors and can include Global and/or U.S. listed Companies. The data can be run on any timeframe, including Monthly and Quarterly for portfolio reconstitution. As you can see, Social Market Analytics’ ability to read and classify documents is a powerful tool for your investment process. To learn how Social Market Analytics (SMA) can help you instantly search for Thematic Investing themes using Machine Readable Filings (MRF) SEC Edgar Filings and SMA’s Complex Topic Model engine, please


Machine Readable Filings (MRF) Extract the Significance of ESG in Future Earnings

August 4, 2020

Discussion of Environmental, Social and Governance (ESG) topics has grown exponentially in the last few years. Socially conscious investors believe ESG criteria better determine the future financial performance of companies. There is a growing number of ESG funds, ETFs and other products based on ESG. To determine whether the increase in ESG discussion has influenced company operations, SMA analyzed the breadth of ESG mentions in SEC Filings using ‘Machine Readable Filings’ (MRF).

Social Market Analytics, Inc. (SMA) has partnered with S&P Global Market Intelligence on ‘Machine Readable Filings’ (MRF). MRF is the first product to provide parsed textual data of SEC Edgar Regulatory Filings at the Item, Section, Sub-section, and Notes level with historical baselines back to 2006. Extraneous information such as page numbers, images, and tables is removed.  SMA is currently developing the same structured data on International Reports, which will be released in Q4 2020.

SEC filings are formal documents reported to the U.S. Securities and Exchange Commission (SEC) that contain important information about Companies, such as Management Discussion & Analysis and Risk Factors. Every U.S. publicly traded company is required to submit regulatory filings.

SMA’s proprietary Topic Modeling allows us to analyze mentions of ESG within MRF while filtering out irrelevant noise. The ESG Topic Model captures every mention of “ESG” within SEC filings, including related terms identified through statistical synonym capabilities, while filtering out mentions of “ESG” that do not refer to Environmental, Social and Governance. For example, mentions of an internal group called Energy Services Group, abbreviated as “ESG” in Halliburton Company’s regulatory filings, would not be included in the ESG Topic Model. The ESG Topic Model flags the filings matching the topic model criteria since 2006 and extracts the textual context of each mention of an “ESG”-related phrase to analyze who is talking about ESG and how they are talking about it.

The first step in analyzing the dataset was to see the trend since 2006. In order to see the growth in conversation about ESG, we charted the total number of unique documents each year, based on the year the document was published. The chart demonstrates exponential growth since 2006. The fewest documents mentioning ESG in a year was 12, in 2006, and the most is 481, in 2020 as of July 23. The total number of documents mentioning ESG since 2006 is 1,502.

When companies mention ESG in their filings, the mentions tend to be in 10-Ks, where companies typically go into the most detail. Of the 1,502 documents that mention ESG, 683 were 10-Ks or 10-K/As and 291 were 10-Qs or 10-Q/As. This is despite there being about 3x as many 10-Qs as 10-Ks in total, because 10-Qs are filed three times a year. 2020 is by far the largest year, and more companies will mention ESG in 10-Q filings reported over the next two quarters. Occasionally 8-K filings will mention ESG when there is a new ESG initiative or update.

In order to validate that the filings mentioning ESG are not coming from just a few companies, we calculated the number of unique companies mentioning ESG each year. The number of unique companies each year shows exponential growth parallel to that of total documents. ESG initiatives are spreading across many companies.  A total of 597 different companies have reported filings that mention ESG since 2006.  This validates that the documents mentioning ESG are not dominated by a few companies. It also makes sense that there is no bias towards any one company, because most of the documents mentioning ESG are 10-K filings, which companies release only once a year.

We then wanted to look at when companies mention ESG for the first time. Almost half (279/597) of the companies in our dataset were flagged by the topic model for the first time in 2020. Between 2006 and 2018, the number of companies mentioning ESG for the first time in their filings was between 10 and 36 per year. In 2019, the number of new companies mentioning ESG in their filings jumped to 90, almost tripling the highest year before. In 2020, that number more than tripled again. Still, only 597 companies have mentioned ESG to date, which is about 15% of publicly traded U.S. companies.
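
The first-mention count described above can be sketched in a few lines of Python. This is only an illustration: the function name first_mention_counts, the (company, year) input shape and the sample rows are assumptions, not the structure of the MRF feed.

```python
from collections import Counter

def first_mention_counts(flagged):
    """Count how many companies were flagged for the first time in each year.

    `flagged` is an iterable of (company_id, year) pairs, one per flagged
    document. Hypothetical input shape, for illustration only.
    """
    first_year = {}
    for company, year in flagged:
        # Keep the earliest year in which each company was flagged.
        if company not in first_year or year < first_year[company]:
            first_year[company] = year
    return Counter(first_year.values())

# Made-up rows: company A is flagged in 2019 and 2020 but counts once, in 2019.
rows = [("A", 2019), ("A", 2020), ("B", 2020), ("C", 2018), ("B", 2021)]
counts = first_mention_counts(rows)
print(sorted(counts.items()))
```

Counting unique companies per year works the same way, except that each (company, year) pair is counted once rather than only the earliest year.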

Since relatively few companies mention ESG in their filings, we then looked at the data by sector to see if there is a trend. The three largest sectors mentioning ESG are Financials, Energy and Industrials. The Financials sector primarily talks about products offered or investments in companies that are thought to place more value on ESG. Energy and Industrials companies may be required to follow environmentally friendly procedures within their business models. The two smallest sectors are Communication Services and Consumer Staples. These two sectors are less affected by economic cycles and could therefore be more resistant to economic trends.

After looking at the share of companies mentioning ESG by sector, we decided to see if there was a trend over time. The Financials sector has been the clear leader in the number of companies mentioning ESG in filings for the last 5 years. However, the second leading sector, Energy, had not been nearly as high until 2020. Before 2020, the Energy sector averaged fewer than 2 companies per year mentioning ESG in their filings and ranked only 7th in 2019, which indicates there may have been an increase in public or investor pressure to mention ESG in filings. Prior to the surge in the Energy sector, the Industrials and Real Estate sectors had been trading places over the last couple of years as the sectors with the second and third most companies mentioning ESG in their filings. Their growth was much less abrupt than that of the Energy sector.

ESG is a growing topic for both Companies and Investors and has been increasing exponentially within SEC filings. The number of companies mentioning ESG in their filings has been growing dramatically over the past four years. We expect this trend to continue as investors become more socially conscious in their investing. Companies tend to discuss ESG in their most detailed documents, either 10-Ks or 10-Qs. The sectors mentioning ESG the most are Financials, Energy, Industrials and Real Estate. The greatest recent surge in companies has come from the Energy sector. Utilizing MRF and the Topic Models built to analyze the dataset, Firms can get a better picture of which companies are truly ESG conscious.

As the breadth of companies widens and the depth of ESG mentions in Filings increases, we believe ESG will drive investor returns. We believe MRF will become an important tool for Asset Managers to evaluate ESG as part of their investment strategy. In future blogs we will explore the relative return of companies with and without ESG principles.

For more information about Social Market Analytics

By David Stolz

In April 2020, S&P Global Market Intelligence and Social Market Analytics, Inc. (SMA) launched ‘Machine Readable Filings’ (MRF), a sophisticated textual data offering which applies Parsing and Natural Language Processing to generate machine readable text extracted from SEC Regulatory Filings. Machine Readable Filings allows businesses and investors to incorporate more qualitative measures of company performance into their investment strategy by using machine readable text from full or individual sections of regulatory filings to enhance their analysis of companies. The parsed textual data allows firms to drill down on both historical and new filings in near real-time.  My last blog introduced the product and illustrated some basic return characteristics present in filing word counts.

This blog explores the predictive nature of filings using SMA’s patented NLP and machine learning. For our analysis we used all active securities with a price greater than 5 dollars. Our analysis starts in 2006.  Securities are broken into quintiles based on each factor.  These factors are samples of the extensive metrics that can be created with this data.  Quintiles are rebalanced monthly based on each company’s most recent filing. 10-Qs are compared to prior 10-Qs and 10-Ks are compared to prior 10-Ks.  These are not meant to be trading models. They illustrate the predictive power of the data and use as broad a universe as possible. Two interesting distributions are below: the distribution of word counts for 10-Ks (mean 36,000) and the distribution of average sentiment.  As you can see, companies try to keep the 10-K as upbeat as possible.
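
As a rough illustration of the quintile step, the sketch below ranks securities by a factor value and splits them into five equal-sized groups. The function name assign_quintiles, the tickers and the factor values are made up, and the exact bucketing rules used in the analysis (tie handling, uneven group sizes) are not described in this blog, so treat this as one reasonable assumption.

```python
def assign_quintiles(factor_by_ticker):
    """Map each ticker to a quintile from 1 (lowest factor) to 5 (highest).

    Hypothetical helper. Ties keep their input order because sorted() is
    stable; an empty input returns an empty dict.
    """
    ranked = sorted(factor_by_ticker, key=factor_by_ticker.get)
    n = len(ranked)
    # Integer arithmetic spreads the ranks 0..n-1 across buckets 1..5.
    return {ticker: (rank * 5) // n + 1 for rank, ticker in enumerate(ranked)}

# Made-up factor values for ten hypothetical tickers.
factors = {
    "AAA": -0.9, "BBB": -0.5, "CCC": -0.3, "DDD": -0.1, "EEE": 0.0,
    "FFF": 0.1, "GGG": 0.2, "HHH": 0.4, "III": 0.6, "JJJ": 0.8,
}
quintiles = assign_quintiles(factors)
print(quintiles)
```

In a monthly rebalance, a function like this would be re-run each month on the factor value from every company’s most recent filing.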

Our first factor is Change in Sentiment Hits. Sentiment hits are the number of times our NLP was able to identify a word or segment in a sentence: positive hits + negative hits + neutral hits.  The green line represents filings with the largest increase in sentiment hits, while the red line represents filings with the largest decrease in sentiment hits. Filings with large increases in sentiment hits tend to underperform their peers, and filings with large decreases in sentiment hits tend to outperform.
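
A minimal sketch of this factor in Python follows. The blog does not say whether the change is measured as a simple difference or as a percentage, so the difference used here is an assumption, as are the function names and the sample counts.

```python
def total_sentiment_hits(positive, negative, neutral):
    """Sentiment hits = positive hits + negative hits + neutral hits."""
    return positive + negative + neutral

def change_in_sentiment_hits(current, prior):
    """Difference in total hits between a filing and the prior filing of the same type.

    Assumes a simple difference; the blog does not specify the exact form.
    """
    return total_sentiment_hits(**current) - total_sentiment_hits(**prior)

# Made-up hit counts for a current and a prior 10-Q of one company.
current = {"positive": 120, "negative": 45, "neutral": 300}
prior = {"positive": 100, "negative": 60, "neutral": 280}
change = change_in_sentiment_hits(current, prior)
print(change)
```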

The quintile performance characteristics are below.   Although quintiles 2 and 3 are out of order, the average values for those quintiles are near zero.  Quintile 1 outperforms quintile 5 by 3% annualized.

The next factor we analyze is the percentage of the document that the parser hits.  Many filings are filled with general information that does not necessarily provide meaningful statements.   The green line represents filings with the highest percentage of sentiment hits in the document, while the red line represents filings with the lowest percentage of sentiment hits in the document. Filings with a higher percentage of sentiment hits tend to outperform their peers, and filings with a lower percentage tend to underperform.   Companies with documents containing more meaningful content outperform companies with documents containing less meaningful content by about 3.5% annualized.

Quintile 5 – Quintile 1 annualized is 3.5%

The third factor we explore is the change in negative hits.  Companies with increasing negative hits are discussing more negative information than in prior quarters, and they subsequently underperform.  The green line represents filings with the largest increase in negative hits, while the red line represents filings with the largest decrease in negative hits. Filings with a large increase in negative hits tend to underperform their peers, and filings with a large decrease in negative hits tend to outperform.

The last factor we explore is cumulative document sentiment.  Quintiles are based on the sum of all sentiment hits in the document.  Sentiment is more commonly analyzed by section.  We identify parts, sections and subsections in this product, providing a myriad of ways to analyze the data.  Even at the most aggregated level, sentiment is predictive.   Document length has a large impact on overall sentiment.  Z-scores of this factor are a good way to compare a document against prior documents.  As you can see in the chart, companies with more positive total document sentiment tend to outperform companies with more negative total sentiment.
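
One way to compute such a z-score is sketched below in Python, measuring how far a new document’s total sentiment sits from the same company’s prior documents. The function name, the choice of the sample standard deviation and the sample values are assumptions for illustration; the blog does not specify how the z-scores are built.

```python
import statistics

def sentiment_z_score(current, prior_values):
    """Z-score of the current document's sentiment against prior documents.

    Hypothetical helper. Needs at least two prior values, because it uses
    the sample standard deviation.
    """
    mean = statistics.mean(prior_values)
    spread = statistics.stdev(prior_values)
    if spread == 0:
        return 0.0  # all prior values identical, so no meaningful scale
    return (current - mean) / spread

# Made-up cumulative sentiment values for three prior 10-Ks and a new one.
z = sentiment_z_score(35, [10, 20, 30])
print(z)
```

Because the score is scaled by each company’s own history, it partly offsets the effect of document length noted above.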

Quintile 5 outperforms quintile 1 by 1.7 percent annualized.

There are many ways to analyze the MRF data set. Filings are parsed by Item, Section, and Sub-section back to 2006 for historical backtesting. This analysis looked only at 10-Ks and 10-Qs; ‘Machine Readable Filings’ (MRF) covers 20 types of SEC filings. This blog covers a small portion of the research. The U.S. SEC Edgar Data is live on the S&P Xpressfeed. International Reports will be released later in 2020. To learn more or to start a trial please


S&P Global and Social Market Analytics today launched Machine Readable Filings (MRF), a sophisticated new data offering which applies Parsing and Natural Language Processing to generate machine readable text extracted from SEC Regulatory Filings. Machine Readable Filings allows businesses and investors to incorporate more qualitative measures of company performance into their investment strategy by using machine readable text from full or individual sections of regulatory filings to enhance their analysis of companies. The parsed textual data allows firms to drill down on both historical and new filings in near real-time.

Machine Readable Filings features the following:

  • 23 Years of History
  • 48,000+ Companies
  • 20 Filing Types
  • 4 Million Documents

The product feed contains three levels of detail:

  • Parsed Filings
    • A normalized JSON of all financial documents. Parts, Items, and subsections are normalized
  • Document and Section Summaries
    • Summaries for each section and the whole document, with their respective changes over time, including
      • Word Counts
      • Numbers of Positive Words
      • Numbers of Negative Words
  • Sentiment Feed
    • Full Patented SMA Sentiment Feed and associated metrics

As regular readers will attest, my previous blogs have focused on NLP and parsing Twitter and StockTwits based messages. This blog breaks new ground for Social Market Analytics as our first piece featuring Machine Readable Filings. We processed all 10-Ks and 10-Qs mapped to a pricing source (~150,000 documents) and looked at subsequent returns based on two word count based factors.

The word count factors we explored were Raw Change and Magnitude Change. Raw Change is the difference between the number of words in a filing and the number of words in the most recent prior filing of the same type for the same company. Magnitude Change is the absolute value of Raw Change, so it does not account for the direction of change. Below are the formulas for the factors, where i represents the company, j represents the filing type (10-K or 10-Q), and k represents the period of the filing.

    RawChange(i, j, k) = WordCount(i, j, k) - WordCount(i, j, k-1)
    MagnitudeChange(i, j, k) = |RawChange(i, j, k)|
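
The two factors can also be sketched in Python. The record layout (company, filing_type, period, word_count), the function name and the sample word counts are hypothetical; only the definitions of Raw Change and Magnitude Change come from the text above.

```python
def word_count_changes(filings):
    """Compute Raw Change and Magnitude Change for each filing.

    `filings` is a list of dicts with keys company, filing_type, period
    (a sortable string such as "2019-12-31") and word_count. Hypothetical
    layout, for illustration only. The first filing of each
    (company, filing_type) pair has no prior filing and is skipped.
    """
    previous = {}
    results = []
    for f in sorted(filings, key=lambda f: f["period"]):
        key = (f["company"], f["filing_type"])
        if key in previous:
            raw = f["word_count"] - previous[key]
            results.append((f["company"], f["filing_type"], f["period"], raw, abs(raw)))
        previous[key] = f["word_count"]
    return results

# Made-up word counts for one company.
filings = [
    {"company": "A", "filing_type": "10-K", "period": "2018-12-31", "word_count": 40000},
    {"company": "A", "filing_type": "10-Q", "period": "2019-03-31", "word_count": 12000},
    {"company": "A", "filing_type": "10-Q", "period": "2019-06-30", "word_count": 11500},
    {"company": "A", "filing_type": "10-K", "period": "2019-12-31", "word_count": 43000},
]
changes = word_count_changes(filings)
for row in changes:
    print(row)
```

Keying on (company, filing type) ensures a 10-Q is only ever compared with the prior 10-Q, and a 10-K with the prior 10-K.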


First, we looked at the Management Discussion & Analysis (MD&A) section because this section has the largest variability across all companies. This section addresses the company’s performance in a qualitative manner. Each value is carried forward from the previous filing until a new filing is released or until the data is 3 months old. The chart below represents a Quintile plot of Magnitude Change of word count in the MD&A section from January 2010 to December 2019.



Quintile 1 contains filings with the least change in the MD&A section. The average Magnitude Change of word count in the MD&A section for this lowest quintile is 118 words. Quintile 5 represents the largest Magnitude Change of word count in the MD&A section. The average Magnitude Change of word count in this quintile is 3,073 words. This graph shows that in the MD&A section, smaller changes in word count tend to outperform the market and larger changes in word count tend to underperform the market. The hypothetical Long/Short of this variable (Q1 – Q5) is statistically significant at the 95% confidence level, meaning the average monthly return is significantly greater than 0%.
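
The blog does not state which test was used. A common choice for checking whether an average monthly return is greater than zero is a one-sample, one-sided t-test, sketched below in Python under that assumption. The function names and the sample returns are made up, and the 1.645 cutoff is the one-sided 5% critical value under a normal approximation, which is only appropriate for a reasonably long return series.

```python
import math
import statistics

# One-sided 5% critical value under a normal approximation (assumption:
# the series is long enough for this to be a fair stand-in for the t value).
CRITICAL_VALUE_95 = 1.645

def t_statistic(returns):
    """t-statistic for the null hypothesis that the mean return is zero."""
    n = len(returns)
    mean = statistics.mean(returns)
    standard_error = statistics.stdev(returns) / math.sqrt(n)
    return mean / standard_error

def is_significantly_positive(returns):
    """True if the mean return is significantly greater than zero at 95%."""
    return t_statistic(returns) > CRITICAL_VALUE_95

# Made-up monthly long/short (Q1 - Q5) returns.
monthly_long_short = [0.01, 0.02, 0.03]
t_stat = t_statistic(monthly_long_short)
print(t_stat, is_significantly_positive(monthly_long_short))
```

With only three observations, as in this toy input, the normal approximation is too generous; the sample exists only to show the arithmetic.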

Next we looked at how changes in word count across an entire document can impact future returns. The largest increases in the number of words in the total document are represented by the green line. The largest decreases in word count are represented by the red line.



As you can see, if there is a large increase in the number of words in the document, the stock subsequently underperforms its peers. On the other hand, if there is a large decrease in word count throughout the document, the stock tends to outperform its peers.

Although this analysis only includes the change in word count of the whole document and the MD&A section, other sections within regulatory filings can provide additional insights into a security’s future return. Furthermore, we expect additional insights to be uncovered using natural language processing to quantify the sentiment of the underlying text at the various levels of the document. These analyses and more will be explored by Social Market Analytics and S&P Global in the future.

S&P and SMA are excited about the launch of this new product. This is the first product to break out documents into component parts and provide a full historical analysis. To learn more about or schedule a trial please