Adding mistral ocr to eval
jakep-allenai committed Mar 6, 2025
1 parent 4053ea5 commit bdc0d75
Showing 12 changed files with 313 additions and 1 deletion.
1 change: 1 addition & 0 deletions olmocr/bench/convert.py
@@ -165,6 +165,7 @@ async def process_with_semaphore(task):
"marker": ("olmocr.bench.runners.run_marker", "run_marker"),
"mineru": ("olmocr.bench.runners.run_mineru", "run_mineru"),
"chatgpt": ("olmocr.bench.runners.run_chatgpt", "run_chatgpt"),
"mistral": ("olmocr.bench.runners.run_mistral", "run_mistral"),
"server": ("olmocr.bench.runners.run_server", "run_server"),
}
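
Each registry entry above pairs a method name with a `(module path, function name)` tuple. The rest of `convert.py` is not shown in this diff, so the following is only a minimal sketch of how such an entry could be resolved at runtime with `importlib`. The helper name `resolve_runner` and the `demo_registry` contents are hypothetical; a standard-library function stands in for a real runner so the snippet runs without any OCR dependencies.

```python
import importlib


def resolve_runner(registry, name):
    """Hypothetical helper: look up (module path, function name) and import the function."""
    module_path, func_name = registry[name]
    module = importlib.import_module(module_path)
    return getattr(module, func_name)


# Stand-in registry with the same shape as the one in convert.py.
# "os.path.basename" is used only so the example runs offline.
demo_registry = {"basename": ("os.path", "basename")}

runner = resolve_runner(demo_registry, "basename")
print(runner("/tmp/sample.pdf"))
```

Importing lazily by name means a runner's dependencies (here, the `mistralai` package) are only needed when that method is actually selected.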

42 changes: 42 additions & 0 deletions olmocr/bench/runners/run_mistral.py
@@ -0,0 +1,42 @@
import json
import os

from mistralai import Mistral


def run_mistral(pdf_path: str, page_num: int = 1) -> str:
    """
    Convert one page of a PDF file to markdown using the Mistral OCR API.
    https://docs.mistral.ai/capabilities/document/
    Args:
        pdf_path (str): The local path to the PDF file.
        page_num (int): 1-indexed page to return (defaults to the first page).
    Returns:
        str: The OCR result in markdown format.
    """
    api_key = os.environ["MISTRAL_API_KEY"]
    client = Mistral(api_key=api_key)

    # Upload the PDF so the OCR endpoint can fetch it by URL.
    with open(pdf_path, "rb") as pf:
        uploaded_pdf = client.files.upload(
            file={
                "file_name": os.path.basename(pdf_path),
                "content": pf,
            },
            purpose="ocr"
        )

    signed_url = client.files.get_signed_url(file_id=uploaded_pdf.id)

    ocr_response = client.ocr.process(
        model="mistral-ocr-2503",
        document={
            "type": "document_url",
            "document_url": signed_url.url,
        }
    )

    # Remove the uploaded copy once OCR has finished.
    client.files.delete(file_id=uploaded_pdf.id)

    # Return the requested page instead of always returning the first one.
    return ocr_response.pages[page_num - 1].markdown
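
In the runner above, the uploaded file is deleted only when the OCR call succeeds, so a failed request would leave the upload behind. A possible variation, sketched here under stated assumptions, moves the delete into a `finally` block. The helper name `ocr_with_cleanup` is hypothetical and not part of this commit, it assumes `page_num` is 1-indexed, and the `fake_client` below is a stub that mimics only the three client calls used in `run_mistral`, so the example runs without an API key.

```python
from types import SimpleNamespace


def ocr_with_cleanup(client, file_id, page_num=1):
    """Hypothetical helper: OCR an already-uploaded file and always delete it afterwards."""
    try:
        signed_url = client.files.get_signed_url(file_id=file_id)
        response = client.ocr.process(
            model="mistral-ocr-2503",
            document={"type": "document_url", "document_url": signed_url.url},
        )
        # page_num is assumed to be 1-indexed.
        return response.pages[page_num - 1].markdown
    finally:
        # Runs on success and on failure, so the upload is not left behind.
        client.files.delete(file_id=file_id)


# Offline stub standing in for the real client (an assumption for illustration only).
deleted = []
fake_client = SimpleNamespace(
    files=SimpleNamespace(
        get_signed_url=lambda file_id: SimpleNamespace(url=f"https://example.invalid/{file_id}"),
        delete=lambda file_id: deleted.append(file_id),
    ),
    ocr=SimpleNamespace(
        process=lambda model, document: SimpleNamespace(
            pages=[SimpleNamespace(markdown="# page one")]
        )
    ),
)

result = ocr_with_cleanup(fake_client, "file-123")
print(result, deleted)
```

Requesting a page that does not exist raises `IndexError`, and the stub still records the delete, which is the behavior the `finally` block is meant to guarantee.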
61 changes: 61 additions & 0 deletions olmocr/bench/sample_data/mistral/discoverworld_crazy_table4_1.md
@@ -0,0 +1,61 @@
Table 4: Baseline model performance on each of the three scoring metrics (task completion, task process, explanatory knowledge discovery) across all 24 DiscoveryWorld tasks. Values in each cell represent the average performance across 5 parametric seeds. Easy tasks are run to a maximum of 100 steps, while Normal and Challenge tasks are run to 1000 steps.

| | | ReACT | | | Plan+Execute | | | Hypothesizer | | |
| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| | Topic | Task | | | | | | | | |
| Proteomics | | Clustering | | | | | | | | |
| 1 | Easy | Simplified Clustering | 0.87 | 0.20 | 0.20 | 0.80 | 0.00 | 0.00 | 0.40 | 0.20 |
| 2 | Normal | Clustering (2D) | 0.88 | 0.40 | 0.40 | 0.68 | 0.20 | 0.00 | 0.93 | 0.40 | 0.40 |
| 3 | Challenge | Clustering (3D) | 0.88 | 0.40 | 0.60 | 0.58 | 0.20 | 0.00 | 0.93 | 0.40 | 0.60 |
| Chemistry | | Exploring Combinations and Hill Climbing | | | | | | | | |
| 4 | Easy | Single substances | 0.87 | 0.20 | 0.20 | 0.70 | 0.60 | 0.40 | 0.90 | 0.00 | 0.40 |
| 5 | Normal | Mix of 3 substances | 0.82 | 0.00 | 0.00 | 0.87 | 0.40 | 0.00 | 0.93 | 0.60 | 0.40 |
| 6 | Challenge | Mix of 4 substances | 0.90 | 0.40 | 0.00 | 0.90 | 0.40 | 0.00 | 0.82 | 0.00 | 0.00 |
| Archaeology | | Correlations | | | | | | | | |
| 7 | Easy | Simple instrument | 0.27 | 0.60 | 0.00 | 0.33 | 0.20 | 0.00 | 0.60 | 0.20 | 0.50 |
| 8 | Normal | Instrument Use | 0.72 | 0.40 | 0.30 | 0.78 | 0.00 | 0.00 | 0.64 | 0.40 | 0.40 |
| 9 | Challenge | Correlation | 0.46 | 0.20 | 0.00 | 0.46 | 0.00 | 0.05 | 0.55 | 0.20 | 0.05 |
| Reactor Lab | | Regression | | | | | | | | |
| 10 | Easy | Slope only | 0.42 | 0.00 | 0.40 | 0.44 | 0.00 | 0.10 | 0.38 | 0.00 | 0.20 |
| 11 | Normal | Linear regression | 0.44 | 0.00 | 0.20 | 0.49 | 0.00 | 0.00 | 0.51 | 0.00 | 0.00 |
| 12 | Challenge | Quadratic regression | 0.43 | 0.00 | 0.20 | 0.39 | 0.00 | 0.00 | 0.39 | 0.00 | 0.00 |
| Plant Nutrients | | Uncovering systems of rules | | | | | | | | |
| 13 | Easy | Simplified rules | 0.80 | 0.20 | 0.20 | 0.70 | 0.20 | 0.20 | 0.60 | 0.00 | 0.00 |
| 14 | Normal | Presence rules | 0.91 | 0.60 | 0.00 | 0.84 | 0.40 | 0.00 | 0.56 | 0.00 | 0.00 |
| 15 | Challenge | Logical Rules | 0.89 | 0.40 | 0.00 | 0.73 | 0.40 | 0.00 | 0.62 | 0.00 | 0.00 |
| Space Sick | | Open-ended discovery | | | | | | | | |
| 16 | Easy | Single instrument | 0.78 | 0.60 | 0.00 | 0.68 | 0.40 | 0.10 | 0.80 | 0.60 | 0.60 |
| 17 | Normal | Multiple instruments | 0.58 | 0.00 | 0.13 | 0.45 | 0.00 | 0.13 | 0.16 | 0.00 | 0.33 |
| 18 | Challenge | Novel instruments | 0.55 | 0.00 | 0.00 | 0.26 | 0.00 | 0.00 | 0.20 | 0.00 | 0.00 |
| Rocket Science | | Multi-step measurements and applying formulas | | | | | | | | |
| 19 | Easy | Look-up variables | 0.33 | 0.00 | 0.00 | 0.53 | 0.00 | 0.07 | 0.13 | 0.40 | 0.00 |
| 20 | Normal | Measure 2 variables | 0.51 | 0.00 | 0.05 | 0.34 | 0.00 | 0.00 | 0.11 | 0.00 | 0.00 |
| 21 | Challenge | Measure 5 variables | 0.43 | 0.00 | 0.00 | 0.15 | 0.00 | 0.00 | 0.22 | 0.00 | 0.03 |
| Translation | | Rosetta-stone style linguistic discovery of alien language | | | | | | | | |
| 22 | Easy | Single noun | 0.40 | 0.40 | 0.20 | 0.30 | 0.00 | 0.00 | 0.20 | 0.20 | 0.00 |
| 23 | Normal | Noun and verb | 0.20 | 0.00 | 0.00 | 0.68 | 0.40 | 0.00 | 0.88 | 0.40 | 0.00 |
| 24 | Challenge | Noun, adj., and verb | 0.49 | 0.00 | 0.00 | 0.55 | 0.20 | 0.05 | 0.15 | 0.00 | 0.00 |
| Average (Easy) | | | 0.59 | 0.38 | 0.25 | 0.56 | 0.18 | 0.11 | 0.56 | 0.28 | 0.34 |
| Average (Normal) | | | 0.63 | 0.18 | 0.14 | 0.64 | 0.18 | 0.02 | 0.58 | 0.23 | 0.19 |
| Average (Challenge) | | | 0.63 | 0.18 | 0.10 | 0.50 | 0.15 | 0.01 | 0.49 | 0.08 | 0.08 |

Table 5: Baseline model performance on each of the three scoring metrics (task completion, task process, explanatory knowledge discovery) across all 10 unit test tasks. Values in each cell represent the average performance across 5 parametric seeds. Unit tests tasks are run to a maximum of 100 steps.

| | | ReACT | | Plan+Execute | | Hypothesizer |
| :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| | Unit Test Topic | | | | | |
| 25 | Multi-turn dialog with an agent | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 26 | Measure an object with an instrument | 0.87 | 0.60 | 0.73 | 0.40 | 1.00 |
| 27 | Pick-and-place object | 0.90 | 0.80 | 0.80 | 0.60 | 1.00 |
| 28 | Pick-and-give object | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 29 | Read DiscoveryFeed posts | 1.00 | 1.00 | 0.90 | 0.80 | 1.00 |
| 30 | Move through doors | 0.58 | 0.20 | 0.25 | 0.00 | 0.90 |
| 31 | Using keys with doors | 0.69 | 0.20 | 0.54 | 0.00 | 0.69 |
| 32 | Navigate to a specific room in a house | 0.20 | 0.20 | 0.20 | 0.00 | 0.20 |
| 33 | Search an environment for an object | 0.80 | 0.80 | 0.60 | 0.60 | 0.80 |
| 34 | Interact with a moving agent | 0.60 | 0.20 | 0.53 | 0.00 | 0.53 |
| Average (Unit Tests) | | 0.76 | 0.60 | 0.66 | 0.44 | 0.77 |

# 4.2 Baseline Agent Models

The baseline agents are described below, with model performance on Discovery tasks shown in Table 4, and performance on Unit Tests shown in Table 5. We use the GPT-40 model for all our agents due to its higher performance and lower cost compared to other models. For space we provide
39 changes: 39 additions & 0 deletions olmocr/bench/sample_data/mistral/earnings_1.md
@@ -0,0 +1,39 @@
Table of Contents

# NVIDIA Corporation and Subsidiaries Notes to the Consolidated Financial Statements

(Continued)

## Recently Issued Accounting Pronouncement

## Recently Adopted Accounting Pronouncement

In November 2023, the Financial Accounting Standards Board, or FASB, issued a new accounting standard requiring disclosures of significant expenses in operating segments. We adopted this standard in our fiscal year 2025 annual report. Refer to Note 16 of the Notes to the Consolidated Financial Statements in Part IV, Item 15 of this Annual Report on Form 10-K for further information.

## Recent Accounting Pronouncements Not Yet Adopted

In December 2023, the FASB issued a new accounting standard which includes new and updated income tax disclosures, including disaggregation of information in the rate reconciliation and income taxes paid. We expect to adopt this standard in our fiscal year 2026 annual report. We do not expect the adoption of this standard to have a material impact on our Consolidated Financial Statements other than additional disclosures.

In November 2024, the FASB issued a new accounting standard requiring disclosures of certain additional expense information on an annual and interim basis, including, among other items, the amounts of purchases of inventory, employee compensation, depreciation and intangible asset amortization included within each income statement expense caption, as applicable. We expect to adopt this standard in our fiscal year 2028 annual report. We do not expect the adoption of this standard to have a material impact on our Consolidated Financial Statements other than additional disclosures.

## Note 2 - Business Combination

## Termination of the Arm Share Purchase Agreement

In February 2022, NVIDIA and SoftBank Group Corp, or SoftBank, announced the termination of the Share Purchase Agreement whereby NVIDIA would have acquired Arm from SoftBank. The parties agreed to terminate it due to significant regulatory challenges preventing the completion of the transaction. We recorded an acquisition termination cost of $\$ 1.4$ billion in fiscal year 2023 reflecting the write-off of the prepayment provided at signing.

## Note 3 - Stock-Based Compensation

Stock-based compensation expense is associated with RSUs, PSUs, market-based PSUs, and our ESPP.
Consolidated Statements of Income include stock-based compensation expense, net of amounts capitalized into inventory and subsequently recognized to cost of revenue, as follows:

| | Year Ended | | | |
| :--: | :--: | :--: | :--: | :--: |
| | Jan 26, 2025 | Jan 28, 2024 | Jan 29, 2023 | |
| | (In millions) | | | |
| Cost of revenue | \$ 178 | \$ 141 | \$ 138 | |
| Research and development | 3,423 | 2,532 | 1,892 | |
| Sales, general and administrative | 1,136 | 876 | 680 | |
| Total | \$ 4,737 | \$ 3,549 | \$ 2,710 | |

Stock-based compensation capitalized in inventories was not significant during fiscal years 2025, 2024, and 2023.
4 changes: 4 additions & 0 deletions olmocr/bench/sample_data/mistral/lincoln_letter_1.md
@@ -0,0 +1,4 @@
Executive Manajard,
Washington City,
Samantha 1801,
Major Seneral Metcticec, Commissionary of Sithanges is authorized and dricited to offed Englander Seneral Sromble, now a forienc of over in Port Dettemng, in exchange for Mayo lotite, who is held as a foriemer at Richmond). He is aler dricited to vend forward the offor of exchange by Stony in. Warfell, Sg. of Saltmone, under affap of truce, and ywe benn a fase to Cely bint. Abraham Sircotis
40 changes: 40 additions & 0 deletions olmocr/bench/sample_data/mistral/mathfuncs_1.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# The 20 Most Important Mathematical Equations

A journey through the most elegant and influential formulas in mathematics

## 1. Euler's Identity

$$
e^{i \pi}+1=0
$$

Connects five fundamental constants ( $\mathrm{e}, \mathrm{i}, \mathrm{n}, 1,0$ ), revealing the profound relationship between exponential functions and trigonometry.

## 3. The Fundamental Theorem of Calculus

$$
\int_{a}^{b} f(x) d x=F(b)-F(a)
$$

Establishes that differentiation and integration are inverse operations. If $F$ is an antiderivative of $f$, the definite integral equals $F(b)-F(a)$. Revolutionized mathematical problemsolving.

## 2. Pythagorean Theorem

$$
a^{2}+b^{2}=c^{2}
$$

In right triangles, the hypotenuse squared equals the sum of the squares of the other sides. Cornerstone of geometry with applications in navigation and architecture.

## 4. Maxwell's Equations

$$
\begin{gathered}
\nabla \cdot \mathbf{E}=\frac{\varrho}{\varepsilon_{0}} \\
\nabla \cdot \mathbf{B}=0 \\
\nabla \times \mathbf{E}=-\frac{\partial \mathbf{B}}{\partial t} \\
\nabla \times \mathbf{B}=\mu_{0} \mathbf{J}+\mu_{0} \varepsilon_{0} \frac{\partial \mathbf{E}}{\partial t}
\end{gathered}
$$

Unified electricity and magnetism as manifestations of the same force. Describes electromagnetic field behavior, predicting waves traveling at light speed. Enabled technologies from radio to smartphones.
17 changes: 17 additions & 0 deletions olmocr/bench/sample_data/mistral/mattsnotes_1.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
V-February Flow
Data Compenents:
Code:
The-stuck-v2
CodeText:
SE, whatever we're scraped
WebText:
H\& DCLn
DATA MEXES
$\left.\begin{array}{l}\text { N } 85 \% \text { Source Code } \\ \text { N } 10 \% \text { CodeText }\end{array}\right\}$ Deepseek
$\sim 5 \%$ Webster
$\sim 85 \%$ The-stuck-v2
$\sim 15 \%$ CodeText
$\sim 0 \%$ Webtext
Staroder
2
$\sim 100 \%$ Source Code JArctic
39 changes: 39 additions & 0 deletions olmocr/bench/sample_data/mistral/multi_column_miss_1.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
stakeholders has occurred in other nations, with groups and individuals refusing to risk being appropriated into the industry's public relations ambitions. It now looks like that with vigilance, tobacco control advocates can easily foment similar distaste in many areas of the business community. Our actions sought to denormalise the tobacco industry by disrupting its efforts to take its place alongside other industries-often with considerable social credit-in the hope that it might gain by association.

Tobacco industry posturing about its corporate responsibility can never hide the ugly consequences of its ongoing efforts to "work with all relevant stakeholders for the preservation of opportunities for informed adults to consume tobacco products"1 (translation: "we will build alliances with others who want to profit from tobacco use, to do all we can to counteract effective tobacco control"). BAT has $15.4 \%$ and Philip Morris $16.4 \%$ of the global cigarette market. ${ }^{8}$ With 4.9 million smokers currently dying from tobacco use each year,
and the industry unblinkingly concurring that its products are addictive, this leaves BAT to argue why it should not be held to be largely accountable for the annual deaths of some 754600 smokers, and Philip Morris some 803600 smokers.

## REFERENCES

1 British American Tobacco. Social Report. http://www.bat.com/204pp.
2. Wroe D. Tobacco ad campaign angers. M Ps. The Age (H allbourne) 2004, May 17 http://www.theage.com.au/articles/2004/05/16/ 1084646069771.html?oneclid=true.
3. Hirschhorn N. Corporate social responsibility and the tobacco industry: hope or hype? Tobacco Control 2004;13:447-53.
4. Ethical Corporation Axio 2004. Conference website. http:// www.ethicalcorp.com/azio2004/.
5 Chapman S, Shatenstein S. Extreme corporate makeover: tobacco companies, corporate responsibility and the corruption of "ethics". Globalink petition. http://petition.globalink.org/view.php?code=extreme.
6 Mackey J, Eriksen M. The tobacco oflas. Geneva: World Health Organization, 2002.

# INDUSTRY WATCH

## Corporate social responsibility and the tobacco industry: hope or hype?

N Hirschhorn

Tobacco Control 2004;13:447-453. doi: 10.1136/te.2003.006676
Corporate social responsibility (CSR) emerged from a realisation among transnational corporations of the need to account for and redress their adverse impact on society: specifically, on human rights, labour practices, and the environment. Two transnational tobacco companies have recently adopted CSR: Philip Morris, and British American Tobacco. This report explains the origins and theory behind CSR; examines internal company documents from Philip Morris showing the company's deliberations on the matter, and the company's perspective on its own behaviour; and reflects on whether marketing tobacco is antithetical to social responsibility.

Correspondence to: Dr Norbert Hirschhorn, N oatalanhe 6, A3 00600 Helsinki, Finland; [email protected]

Received
13 November 2003
Accepted 15 July 2004

Over the past three decades increasing pressure from non-governmental organisations (NGOs), governments and the United Nations, has required transnational corporations (TNCs) to examine and redress the adverse impact their businesses have on society and the environment. Many have responded by taking up what is known as "corporate social responsibility" (CSR); only recently have two major cigarette companies followed suit: Philip Morris (PM) and British American Tobacco (BAT). This report first provides the context and development of CSR; then, from internal company documents, examines how PM came to its own version. This paper examines whether a
tobacco company espousing CSR should be judged simply as a corporate entity along standards of business ethics, or as an irretrievably negative force in the realm of public health, thereby rendering CSR an oxymoron.

## CORPORATE SOCIAL RESPONSIBILITY: THE CONTEXT

The term "corporate social responsibility" is in vogue at the moment but as a concept it is vague and means different things to different people. ${ }^{1}$

Some writers on CSR trace its American roots to the 19th century when large industries engaged in philanthropy and established great public institutions, a form of "noblesse oblige". But the notion that corporations should be required to return more to society because of their impact on society was driven by pressures from the civil rights, peace, and environmental movements of the last half century. ${ }^{25}$ The unprecedented expansion of power and influence of TNCs over the past three decades has accelerated global trade and development, but also environmental damage and abuses of

Abbreviations: ASH, Action on Smoking and Health; BAT, British American Tobacco; CERES, Coalition for Environmentally Responsible Economies; CSR, corporate social responsibility; DISI, Dow Jones Sustainability Index; GCAC, Global Corporate Affairs Council; GRI, Global Reporting Initiative; MSA, Master Settlement Agreement; NGOs, non-governmental organisations; PM, Philip Morris; TNCs, transnational corporations; UNEP, United Nations Environment Program