Better convert script

allenai · Mar 5, 2025 · fb0a729
1 parent fa68c6b

Showing 2 changed files with 63 additions and 23 deletions.
diff --git a/olmocr/bench/sample_data/chatgpt45/earnings_1.md b/olmocr/bench/sample_data/chatgpt45/earnings_1.md
@@ -0,0 +1,34 @@
+NVIDIA Corporation and Subsidiaries
+Notes to the Consolidated Financial Statements
+(Continued)
+
+Recently Issued Accounting Pronouncements
+
+Recently Adopted Accounting Pronouncement
+
+In November 2023, the Financial Accounting Standards Board, or FASB, issued a new accounting standard requiring disclosures of significant expenses in operating segments. We adopted this standard in our fiscal year 2025 annual report. Refer to Note 16 of the Notes to the Consolidated Financial Statements in Part IV, Item 15 of this Annual Report on Form 10-K for further information.
+
+Recent Accounting Pronouncements Not Yet Adopted
+
+In December 2023, the FASB issued a new accounting standard which includes new and updated income tax disclosures, including disaggregation of information in the rate reconciliation and income taxes paid. We expect to adopt this standard in our fiscal year 2026 annual report. We do not expect the adoption of this standard to have a material impact on our Consolidated Financial Statements other than additional disclosures.
+
+In November 2024, the FASB issued a new accounting standard requiring disclosures of certain additional expense information on an annual and interim basis, including, among other items, the amounts of purchases of inventory, employee compensation, depreciation and intangible asset amortization included within each income statement expense caption, as applicable. We expect to adopt this standard in our fiscal year 2028 annual report. We do not expect the adoption of this standard to have a material impact on our Consolidated Financial Statements other than additional disclosures.
+
+Note 2 - Business Combination
+
+In February 2022, NVIDIA and SoftBank Group Corp., or SoftBank, announced the termination of the Share Purchase Agreement whereby NVIDIA would have acquired Arm from SoftBank. The parties agreed to terminate it due to significant regulatory challenges preventing the completion of the transaction. We recorded an acquisition termination cost of $1.4 billion in fiscal year 2023 reflecting the write-off of the prepayment provided at signing.
+
+Note 3 - Stock-Based Compensation
+
+Stock-based compensation expense is associated with RSUs, PSUs, market-based PSUs, and our ESPP.
+
+Consolidated Statements of Income include stock-based compensation expense, net of amounts capitalized into inventory and subsequently recognized to cost of revenue, as follows:
+
+|                                     | Jan 26, 2025 | Jan 28, 2024 | Jan 29, 2023 |
+|-------------------------------------|--------------|--------------|--------------|
+| Cost of revenue                     | $178         | $141         | $138         |
+| Research and development            | 3,423        | 2,532        | 1,892        |
+| Sales, general and administrative   | 1,136        | 876          | 680          |
+| Total                               | $4,737       | $3,549       | $2,710       |
+
+Stock-based compensation capitalized in inventories was not significant during fiscal years 2025, 2024, and 2023.
diff --git a/olmocr/bench/scripts/convert_all.sh b/olmocr/bench/scripts/convert_all.sh
@@ -152,42 +152,48 @@ create_conda_env "olmocr" "3.11"
 source $(conda info --base)/etc/profile.d/conda.sh
 source activate olmocr
 
-# # Run olmocr benchmarks
-# echo "Running olmocr benchmarks..."
-# python -m olmocr.bench.convert olmocr --repeats 5
+# Run olmocr benchmarks
+echo "Running olmocr benchmarks..."
+python -m olmocr.bench.convert olmocr --repeats 5
 
-# # Install marker-pdf and run benchmarks
-# echo "Installing marker-pdf and running benchmarks..."
-# pip install marker-pdf
-# python -m olmocr.bench.convert marker
+# Install marker-pdf and run benchmarks
+echo "Installing marker-pdf and running benchmarks..."
+pip install marker-pdf
+python -m olmocr.bench.convert marker
 
-# # Install verovio and run benchmarks
-# echo "Installing verovio and running benchmarks..."
-# pip install verovio
-# python -m olmocr.bench.convert gotocr
+# Install verovio and run benchmarks
+echo "Installing verovio and running benchmarks..."
+pip install verovio
+python -m olmocr.bench.convert gotocr
 
-# # Run chatgpt benchmarks
-# echo "Running chatgpt benchmarks..."
-# python -m olmocr.bench.convert chatgpt
+# Run chatgpt benchmarks
+echo "Running chatgpt benchmarks..."
+python -m olmocr.bench.convert chatgpt
+python -m olmocr.bench.convert chatgpt:name=chatgpt45:model=gpt-4.5-preview-2025-02-27
 
 # Run raw server benchmarks with sglang server
 # For each model, start server, run benchmark, then stop server
 
 # Check port availability at script start
 check_port || exit 1
 
-# olmocr_base_temp0_1
-start_sglang_server "allenai/olmOCR-7B-0225-preview" --chat-template qwen2-vl --mem-fraction-static 0.7
-python -m olmocr.bench.convert server:name=olmocr_base_temp0_1:model=allenai/olmOCR-7B-0225-preview:temperature=0.1:prompt_template=fine_tune:response_template=json --repeats 5 --parallel 20
-python -m olmocr.bench.convert server:name=olmocr_base_temp0_8:model=allenai/olmOCR-7B-0225-preview:temperature=0.8:prompt_template=fine_tune:response_template=json --repeats 5 --parallel 20
-stop_sglang_server
+# # olmocr_base_temp0_1
+# start_sglang_server "allenai/olmOCR-7B-0225-preview" --chat-template qwen2-vl --mem-fraction-static 0.7
+# python -m olmocr.bench.convert server:name=olmocr_base_temp0_1:model=allenai/olmOCR-7B-0225-preview:temperature=0.1:prompt_template=fine_tune:response_template=json --repeats 5 --parallel 20
+# python -m olmocr.bench.convert server:name=olmocr_base_temp0_8:model=allenai/olmOCR-7B-0225-preview:temperature=0.8:prompt_template=fine_tune:response_template=json --repeats 5 --parallel 20
+# stop_sglang_server
 
-# qwen2_vl_7b
-start_sglang_server "Qwen/Qwen2-VL-7B-Instruct" --chat-template qwen2-vl --mem-fraction-static 0.7
-python -m olmocr.bench.convert server:name=qwen2_vl_7b:model=Qwen/Qwen2-VL-7B-Instruct:temperature=0.1:prompt_template=full:response_template=plain --repeats 5 --parallel 20
-stop_sglang_server
+# # qwen2_vl_7b
+# start_sglang_server "Qwen/Qwen2-VL-7B-Instruct" --chat-template qwen2-vl --mem-fraction-static 0.7
+# python -m olmocr.bench.convert server:name=qwen2_vl_7b:model=Qwen/Qwen2-VL-7B-Instruct:temperature=0.1:prompt_template=full:response_template=plain --repeats 5 --parallel 20
+# stop_sglang_server
 
 # qwen25_vl_7b
+# needs to run in separate conda env for now, honestly it's broken and doesn't work right
+create_conda_env "qwen25" "3.11"
+source activate qwen25
+pip install olmocr
+pip install "sglang[all]>=0.4.3.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python transformers==4.48.3
 start_sglang_server "Qwen/Qwen2.5-VL-7B-Instruct" --chat-template qwen2-vl --mem-fraction-static 0.7
 python -m olmocr.bench.convert server:name=qwen25_vl_7b:model=Qwen/Qwen2.5-VL-7B-Instruct:temperature=0.1:prompt_template=full:response_template=plain --repeats 5 --parallel 20
 stop_sglang_server
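
Note on the method strings passed to `olmocr.bench.convert` above: they appear to follow a colon-separated layout, with the method name first and `key=value` options after it (for example `chatgpt:name=chatgpt45:model=gpt-4.5-preview-2025-02-27`). The sketch below only illustrates that layout so the arguments are easier to read. `parse_method_spec` is a hypothetical helper, not part of olmocr, and the real parsing inside `olmocr.bench.convert` may differ.

```shell
# Hypothetical helper (not part of olmocr): split a method spec of the form
#   method:key=value:key=value
# into its method name and options, printing one line per field.
# Assumption: values never contain ':' (true for every spec in convert_all.sh).
parse_method_spec() {
    spec=$1
    method=${spec%%:*}          # text before the first ':'
    echo "method=$method"

    rest=${spec#"$method"}      # drop the method name...
    rest=${rest#:}              # ...and the ':' that follows it, if any
    while [ -n "$rest" ]; do
        kv=${rest%%:*}          # next key=value pair
        # Split on the first '=' only, so a value may itself contain '='.
        echo "option ${kv%%=*} = ${kv#*=}"
        case $rest in
            *:*) rest=${rest#*:} ;;   # more pairs remain
            *)   rest= ;;             # that was the last pair
        esac
    done
}

parse_method_spec "chatgpt:name=chatgpt45:model=gpt-4.5-preview-2025-02-27"
```

Under this reading, the `name=` option is the run label (it matches the `chatgpt45` sample-data directory added in this commit) and the remaining options are forwarded to the selected method. The helper uses plain POSIX parameter expansion, so its variables are global; inside the bash script they could be declared `local`.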