Better convert script
jakep-allenai committed Mar 5, 2025
1 parent fa68c6b commit fb0a729
Showing 2 changed files with 63 additions and 23 deletions.
34 changes: 34 additions & 0 deletions olmocr/bench/sample_data/chatgpt45/earnings_1.md
@@ -0,0 +1,34 @@
NVIDIA Corporation and Subsidiaries
Notes to the Consolidated Financial Statements
(Continued)

Recently Issued Accounting Pronouncements

Recently Adopted Accounting Pronouncement

In November 2023, the Financial Accounting Standards Board, or FASB, issued a new accounting standard requiring disclosures of significant expenses in operating segments. We adopted this standard in our fiscal year 2025 annual report. Refer to Note 16 of the Notes to the Consolidated Financial Statements in Part IV, Item 15 of this Annual Report on Form 10-K for further information.

Recent Accounting Pronouncements Not Yet Adopted

In December 2023, the FASB issued a new accounting standard which includes new and updated income tax disclosures, including disaggregation of information in the rate reconciliation and income taxes paid. We expect to adopt this standard in our fiscal year 2026 annual report. We do not expect the adoption of this standard to have a material impact on our Consolidated Financial Statements other than additional disclosures.

In November 2024, the FASB issued a new accounting standard requiring disclosures of certain additional expense information on an annual and interim basis, including, among other items, the amounts of purchases of inventory, employee compensation, depreciation and intangible asset amortization included within each income statement expense caption, as applicable. We expect to adopt this standard in our fiscal year 2028 annual report. We do not expect the adoption of this standard to have a material impact on our Consolidated Financial Statements other than additional disclosures.

Note 2 - Business Combination

In February 2022, NVIDIA and SoftBank Group Corp., or SoftBank, announced the termination of the Share Purchase Agreement whereby NVIDIA would have acquired Arm from SoftBank. The parties agreed to terminate it due to significant regulatory challenges preventing the completion of the transaction. We recorded an acquisition termination cost of $1.4 billion in fiscal year 2023 reflecting the write-off of the prepayment provided at signing.

Note 3 - Stock-Based Compensation

Stock-based compensation expense is associated with RSUs, PSUs, market-based PSUs, and our ESPP.

Consolidated Statements of Income include stock-based compensation expense, net of amounts capitalized into inventory and subsequently recognized to cost of revenue, as follows:

| | Jan 26, 2025 | Jan 28, 2024 | Jan 29, 2023 |
|-------------------------------------|--------------|--------------|--------------|
| Cost of revenue | $178 | $141 | $138 |
| Research and development | 3,423 | 2,532 | 1,892 |
| Sales, general and administrative | 1,136 | 876 | 680 |
| Total | $4,737 | $3,549 | $2,710 |

Stock-based compensation capitalized in inventories was not significant during fiscal years 2025, 2024, and 2023.
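
As a quick sanity check on the transcribed table above, the three expense lines in each column should add up to the reported total. The following is a minimal sketch in bash, with the figures copied by hand from the table; the `check_total` helper is hypothetical and is not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of olmocr): verify that the three expense
# lines of one table column sum to the total reported in that column.
check_total() {
    local label=$1 cost=$2 rnd=$3 sga=$4 total=$5
    local sum=$(( cost + rnd + sga ))
    if (( sum == total )); then
        echo "$label: ok ($sum)"
    else
        echo "$label: mismatch (sum=$sum, reported=$total)"
        return 1
    fi
}

# Values are in the same units as the table, with "$" and "," stripped.
check_total "Jan 26, 2025" 178 3423 1136 4737
check_total "Jan 28, 2024" 141 2532 876 3549
check_total "Jan 29, 2023" 138 1892 680 2710
```

A check of this kind only catches digits that break the column sum; it says nothing about whether the surrounding text was transcribed correctly.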
52 changes: 29 additions & 23 deletions olmocr/bench/scripts/convert_all.sh
@@ -152,42 +152,48 @@ create_conda_env "olmocr" "3.11"
source $(conda info --base)/etc/profile.d/conda.sh
source activate olmocr

# # Run olmocr benchmarks
# echo "Running olmocr benchmarks..."
# python -m olmocr.bench.convert olmocr --repeats 5
# Run olmocr benchmarks
echo "Running olmocr benchmarks..."
python -m olmocr.bench.convert olmocr --repeats 5

# # Install marker-pdf and run benchmarks
# echo "Installing marker-pdf and running benchmarks..."
# pip install marker-pdf
# python -m olmocr.bench.convert marker
# Install marker-pdf and run benchmarks
echo "Installing marker-pdf and running benchmarks..."
pip install marker-pdf
python -m olmocr.bench.convert marker

# # Install verovio and run benchmarks
# echo "Installing verovio and running benchmarks..."
# pip install verovio
# python -m olmocr.bench.convert gotocr
# Install verovio and run benchmarks
echo "Installing verovio and running benchmarks..."
pip install verovio
python -m olmocr.bench.convert gotocr

# # Run chatgpt benchmarks
# echo "Running chatgpt benchmarks..."
# python -m olmocr.bench.convert chatgpt
# Run chatgpt benchmarks
echo "Running chatgpt benchmarks..."
python -m olmocr.bench.convert chatgpt
python -m olmocr.bench.convert chatgpt:name=chatgpt45:model=gpt-4.5-preview-2025-02-27

# Run raw server benchmarks with sglang server
# For each model, start server, run benchmark, then stop server

# Check port availability at script start
check_port || exit 1

# olmocr_base_temp0_1
start_sglang_server "allenai/olmOCR-7B-0225-preview" --chat-template qwen2-vl --mem-fraction-static 0.7
python -m olmocr.bench.convert server:name=olmocr_base_temp0_1:model=allenai/olmOCR-7B-0225-preview:temperature=0.1:prompt_template=fine_tune:response_template=json --repeats 5 --parallel 20
python -m olmocr.bench.convert server:name=olmocr_base_temp0_8:model=allenai/olmOCR-7B-0225-preview:temperature=0.8:prompt_template=fine_tune:response_template=json --repeats 5 --parallel 20
stop_sglang_server
# # olmocr_base_temp0_1
# start_sglang_server "allenai/olmOCR-7B-0225-preview" --chat-template qwen2-vl --mem-fraction-static 0.7
# python -m olmocr.bench.convert server:name=olmocr_base_temp0_1:model=allenai/olmOCR-7B-0225-preview:temperature=0.1:prompt_template=fine_tune:response_template=json --repeats 5 --parallel 20
# python -m olmocr.bench.convert server:name=olmocr_base_temp0_8:model=allenai/olmOCR-7B-0225-preview:temperature=0.8:prompt_template=fine_tune:response_template=json --repeats 5 --parallel 20
# stop_sglang_server

# qwen2_vl_7b
start_sglang_server "Qwen/Qwen2-VL-7B-Instruct" --chat-template qwen2-vl --mem-fraction-static 0.7
python -m olmocr.bench.convert server:name=qwen2_vl_7b:model=Qwen/Qwen2-VL-7B-Instruct:temperature=0.1:prompt_template=full:response_template=plain --repeats 5 --parallel 20
stop_sglang_server
# # qwen2_vl_7b
# start_sglang_server "Qwen/Qwen2-VL-7B-Instruct" --chat-template qwen2-vl --mem-fraction-static 0.7
# python -m olmocr.bench.convert server:name=qwen2_vl_7b:model=Qwen/Qwen2-VL-7B-Instruct:temperature=0.1:prompt_template=full:response_template=plain --repeats 5 --parallel 20
# stop_sglang_server

# qwen25_vl_7b
# needs to run in separate conda env for now, honestly it's broken and doesn't work right
create_conda_env "qwen25" "3.11"
source activate qwen25
pip install olmocr
pip install "sglang[all]>=0.4.3.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python transformers==4.48.3
start_sglang_server "Qwen/Qwen2.5-VL-7B-Instruct" --chat-template qwen2-vl --mem-fraction-static 0.7
python -m olmocr.bench.convert server:name=qwen25_vl_7b:model=Qwen/Qwen2.5-VL-7B-Instruct:temperature=0.1:prompt_template=full:response_template=plain --repeats 5 --parallel 20
stop_sglang_server
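
The converter invocations in this script pass arguments such as `chatgpt:name=chatgpt45:model=gpt-4.5-preview-2025-02-27`, which appear to be a method name followed by colon-separated `key=value` options. The sketch below shows one way such a string could be split in bash. The `parse_method_spec` function is hypothetical and only illustrates the argument shape; it is not how `olmocr.bench.convert` parses its arguments.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: split "method:key=value:key=value" into its parts.
# This illustrates the argument shape only; it is not olmocr's own parser.
parse_method_spec() {
    local spec=$1
    local -a parts
    # Split on ":"; values such as model names may contain "/" but not ":".
    IFS=':' read -r -a parts <<< "$spec"
    echo "method=${parts[0]}"
    local kv
    for kv in "${parts[@]:1}"; do
        # Everything before the first "=" is the key, the rest is the value.
        echo "option ${kv%%=*} = ${kv#*=}"
    done
}

parse_method_spec "chatgpt:name=chatgpt45:model=gpt-4.5-preview-2025-02-27"
```

Splitting on the first `=` only means a value that itself contains `=` stays intact, while a value containing `:` would be split incorrectly under this scheme.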