---
title: Haystack 2.9.0
description: Release notes for Haystack 2.9.0
featured_image: /images/release-notes.png
images: ["/images/release-notes.png"]
toc: True
date: 2025-01-14
last_updated: 2025-01-14
tags: ["Release Notes"]
link: https://github.com/deepset-ai/haystack/releases/tag/v2.9.0
---

## ⭐️ Highlights

### Tool Calling Support

We are introducing the [`Tool`](https://docs.haystack.deepset.ai/docs/tool), a simple and unified abstraction for representing tools in Haystack, and the [`ToolInvoker`](https://docs.haystack.deepset.ai/docs/toolinvoker), which executes tool calls prepared by LLMs. These features make it easy to integrate tool calling into your Haystack pipelines, enabling seamless interaction with tools when used with components like `OpenAIChatGenerator` and `HuggingFaceAPIChatGenerator`. Here's how you can use them:

```python
from haystack import Pipeline
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.tools.tool_invoker import ToolInvoker
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool


def dummy_weather_function(city: str):
    return f"The weather in {city} is 20 degrees."


tool = Tool(
    name="weather_tool",
    description="A tool to get the weather",
    function=dummy_weather_function,
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    }
)

pipeline = Pipeline()
pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini", tools=[tool]))
pipeline.add_component("tool_invoker", ToolInvoker(tools=[tool]))
pipeline.connect("llm.replies", "tool_invoker.messages")

message = ChatMessage.from_user("How is the weather in Berlin today?")
result = pipeline.run({"llm": {"messages": [message]}})
```

**Use Components as Tools**

Built on top of `Tool`, [`ComponentTool`](https://docs.haystack.deepset.ai/docs/componenttool) allows LLMs to interact directly with components like web search, document processing, or custom user components. It simplifies schema generation and type conversion, making it easy to expose complex component functionality to LLMs.

```python
from haystack.components.websearch import SerperDevWebSearch
from haystack.tools import ComponentTool
from haystack.utils import Secret

# Create a tool from the component
tool = ComponentTool(
    component=SerperDevWebSearch(api_key=Secret.from_env_var("SERPERDEV_API_KEY"), top_k=3),
    name="web_search",  # Optional: defaults to "serper_dev_web_search"
    description="Search the web for current information on any topic"  # Optional: defaults to component docstring
)
```

### New Splitting Method: `RecursiveDocumentSplitter`

`RecursiveDocumentSplitter` introduces a smarter way to split text. It uses a set of separators to divide text recursively, starting with the first separator. If chunks are still larger than the specified size, the splitter moves to the next separator in the list. This approach ensures efficient and granular text splitting for improved processing.

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

splitter = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
doc_chunks = splitter.run([Document(content="...")])
```

### ⚠️ Refactored `ChatMessage` dataclass

The `ChatMessage` dataclass has been refactored to improve flexibility and compatibility. As part of this update, the `content` attribute has been removed and replaced with a new `text` property for accessing the message's textual value. This change ensures future-proofing and better support for features like tool calls and their results. For details on the new API and migration steps, see the [ChatMessage documentation](https://docs.haystack.deepset.ai/docs/chatmessage). If you have any questions about this refactoring, feel free to let us know in [this GitHub discussion](https://github.com/deepset-ai/haystack/discussions/8721).
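
For illustration, a minimal sketch of the new accessor (the message content is just an example):

```python
from haystack.dataclasses import ChatMessage

message = ChatMessage.from_user("What is the capital of France?")

# The textual value is now accessed via the `text` property;
# the old `content` attribute has been removed.
print(message.text)
```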

## ⬆️ Upgrade Notes

- The refactoring of the `ChatMessage` dataclass includes some breaking changes involving `ChatMessage` creation and attribute access. If you have a `Pipeline` containing a `ChatPromptBuilder` that was serialized with a version of `haystack-ai` earlier than 2.9.0, deserialization may break. For detailed information about the changes and how to migrate, see the [ChatMessage documentation](https://docs.haystack.deepset.ai/docs/chatmessage).
- Removed the deprecated `converter` init argument from `PyPDFToDocument`. Use other init arguments instead, or create a custom component.
- The `SentenceWindowRetriever` output key `context_documents` now returns a `List[Document]` containing the retrieved documents and their context windows, ordered by `split_idx_start`.
- Updated the default value of `store_full_path` to `False` in converters.

## 🚀 New Features

- Introduced `ComponentTool`, a new tool that wraps Haystack components, allowing them to be used as tools by LLMs (via the various Chat Generators). `ComponentTool` supports automatic tool schema generation and input type conversion, and works with components whose `run` methods have the following input types:

  - Basic types (str, int, float, bool, dict)
  - Dataclasses (both simple and nested structures)
  - Lists of basic types (e.g., `List[str]`)
  - Lists of dataclasses (e.g., `List[Document]`)
  - Parameters with mixed types (e.g., `List[Document]`, str, etc.)

  Example usage:

  ```python
  from haystack import component, Pipeline
  from haystack.tools import ComponentTool
  from haystack.components.websearch import SerperDevWebSearch
  from haystack.utils import Secret
  from haystack.components.tools.tool_invoker import ToolInvoker
  from haystack.components.generators.chat import OpenAIChatGenerator
  from haystack.dataclasses import ChatMessage

  # Create a SerperDev search component
  search = SerperDevWebSearch(api_key=Secret.from_env_var("SERPERDEV_API_KEY"), top_k=3)

  # Create a tool from the component
  tool = ComponentTool(
      component=search,
      name="web_search",  # Optional: defaults to "serper_dev_web_search"
      description="Search the web for current information on any topic"  # Optional: defaults to component docstring
  )

  # Create a pipeline with OpenAIChatGenerator and ToolInvoker
  pipeline = Pipeline()
  pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini", tools=[tool]))
  pipeline.add_component("tool_invoker", ToolInvoker(tools=[tool]))

  # Connect the components
  pipeline.connect("llm.replies", "tool_invoker.messages")

  message = ChatMessage.from_user("Use the web search tool to find information about Nikola Tesla")

  # Run the pipeline
  result = pipeline.run({"llm": {"messages": [message]}})

  print(result)
  ```

- Added an `XLSXToDocument` converter that loads an Excel file using Pandas and openpyxl and, by default, converts each sheet into a separate `Document` in CSV format.
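
  A minimal usage sketch, assuming the converter is exposed from `haystack.components.converters` like the other converters; the file name is a placeholder:

  ```python
  from haystack.components.converters import XLSXToDocument

  converter = XLSXToDocument()
  # "spreadsheet.xlsx" is a placeholder file; each sheet becomes a separate Document in CSV format.
  result = converter.run(sources=["spreadsheet.xlsx"])
  print(result["documents"])
  ```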

- Added a new `store_full_path` parameter to the `__init__` methods of `PyPDFToDocument` and `AzureOCRDocumentConverter`. The default value is `True`, which stores the full file path in the metadata of the output documents. When set to `False`, only the file name is stored.
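
  For example, a short sketch of opting out of storing full paths:

  ```python
  from haystack.components.converters import PyPDFToDocument

  # Store only the file name (not the full path) in each output document's metadata.
  converter = PyPDFToDocument(store_full_path=False)
  ```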

- Added a new experimental component `ToolInvoker`. This component invokes tools based on tool calls prepared by Language Models and returns the results as a list of `ChatMessage` objects with the `tool` role.

- Added a `RecursiveDocumentSplitter`, which uses a set of separators to split text recursively. It attempts to divide the text using the first separator, and if the resulting chunks are still larger than the specified size, it moves on to the next separator in the list.

- Added a `create_tool_from_function` function to create a `Tool` instance from a function, with automatic generation of name, description, and parameters. Added a `tool` decorator to achieve the same result.
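
  A brief sketch, assuming both helpers are importable from the new `haystack.tools` package mentioned below:

  ```python
  from haystack.tools import create_tool_from_function, tool

  def get_weather(city: str) -> str:
      """Return a short weather report for a city."""
      return f"The weather in {city} is 20 degrees."

  # Explicit conversion: name, description, and parameters are derived from the function.
  weather_tool = create_tool_from_function(get_weather)

  # Decorator form, achieving the same result.
  @tool
  def get_time(timezone: str) -> str:
      """Return the current time in the given timezone."""
      return f"It is 12:00 in {timezone}."
  ```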

- Add support for Tools in the Hugging Face API Chat Generator.

- Changed the `ChatMessage` dataclass to support different types of content, including tool calls and tool call results.

- Add support for Tools in the OpenAI Chat Generator.

- Added a new `Tool` dataclass to represent a tool for which Language Models can prepare calls.

- Added the `StringJoiner` component to join strings from different components into a list of strings.
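
  A minimal sketch, assuming `StringJoiner` is exposed from `haystack.components.joiners` like the other joiner components; in a pipeline, each string would arrive from a separate upstream connection:

  ```python
  from haystack.components.joiners import StringJoiner

  joiner = StringJoiner()
  # Standalone call for illustration; in a pipeline, the variadic "strings" input
  # collects one string per connected component.
  result = joiner.run(strings=["answer from branch A", "answer from branch B"])
  print(result["strings"])
  ```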

## ⚡️ Enhancement Notes

- Added `default_headers` parameter to `AzureOpenAIDocumentEmbedder` and `AzureOpenAITextEmbedder`.

- Add `token` argument to `NamedEntityExtractor` to allow usage of private Hugging Face models.

- Add the `from_openai_dict_format` class method to the `ChatMessage` class. It allows you to create a `ChatMessage` from a dictionary in the format that OpenAI's Chat API expects.
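
  A short sketch of the conversion (the message content is illustrative):

  ```python
  from haystack.dataclasses import ChatMessage

  openai_format = {"role": "user", "content": "What is the capital of France?"}
  message = ChatMessage.from_openai_dict_format(openai_format)
  print(message.text)
  ```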

- Add a testing job to check that all packages can be imported successfully. This should help detect several issues, such as forgetting to use a forward reference for a type hint coming from a lazy import.

- `DocumentJoiner` methods `_concatenate()` and `_distribution_based_rank_fusion()` were converted to static methods.

- Improve serialization and deserialization of callables. We now allow serialization of class methods and static methods and explicitly prohibit serialization of instance methods, lambdas, and nested functions.

- Added new initialization parameters to the `PyPDFToDocument` component to customize the text extraction process from PDF files.

- Reorganized the document store test suite to isolate `dataframe` filter tests. This change prepares for potential future deprecation of the Document class's `dataframe` field.

- Move `Tool` to a new dedicated `tools` package. Refactor `Tool` serialization and deserialization to make it more flexible and include type information.

- The `NLTKDocumentSplitter` was merged into the `DocumentSplitter`, which now provides the same functionality. `split_by="sentence"` now uses custom sentence boundary detection based on the `nltk` library. The previous `sentence` behaviour can still be achieved with `split_by="period"`.
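
  A sketch of the merged behaviour (the sample text is illustrative; `warm_up()` is required for sentence splitting, as noted in the bug fixes below):

  ```python
  from haystack import Document
  from haystack.components.preprocessors import DocumentSplitter

  splitter = DocumentSplitter(split_by="sentence", split_length=2)
  splitter.warm_up()  # downloads the required nltk data

  doc = Document(content="Sentence one. Sentence two. Sentence three.")
  result = splitter.run(documents=[doc])
  print([d.content for d in result["documents"]])
  ```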

- Improved deserialization of callables by using `importlib` instead of `sys.modules`. This change allows importing local functions and classes that are not in `sys.modules` when deserializing callables.

- Changed `OpenAIDocumentEmbedder` to keep running if a batch fails to embed. Now, when OpenAI returns an error, we log that error and keep processing the following batches.

## ⚠️ Deprecation Notes

- The `NLTKDocumentSplitter` is deprecated and will be removed in the next release. The `DocumentSplitter` now supports its functionality instead.

- The `function` role and the `ChatMessage.from_function` class method have been deprecated and will be removed in Haystack 2.10.0. In the meantime, `ChatMessage.from_function` attempts to produce a valid tool message. For more information, see the documentation: <https://docs.haystack.deepset.ai/docs/chatmessage>

- The `SentenceWindowRetriever` output of `context_documents` changed. Instead of a `List[List[Document]]`, the output is a `List[Document]`, where the documents are ordered by their `split_idx_start` value.

## 🐛 Bug Fixes

- Added the missing stream MIME type assignment to the `LinkContentFetcher` for the single-URL scenario. Previously, pipelines that use `FileTypeRouter` could fail if they received a single URL as input.
- `OpenAIChatGenerator` no longer passes tools to the OpenAI client if none are provided. Previously, a null value was passed. This change improves compatibility with OpenAI-compatible APIs that do not support tools.
- `ByteStream` now truncates the data to 100 bytes in its string representation to avoid excessive log output.
- Made `HuggingFaceLocalChatGenerator` compatible with the new `ChatMessage` format by converting the messages to the format expected by Hugging Face, and serialized the `chat_template` parameter.
- Moved the NLTK download of `DocumentSplitter` and `NLTKDocumentSplitter` to `warm_up()`. This prevents calls to an external API during instantiation. If a `DocumentSplitter` or `NLTKDocumentSplitter` is used for sentence splitting outside of a pipeline, `warm_up()` now needs to be called before running the component.
- `PDFMinerToDocument` now creates documents with an `id` based on the converted text and metadata. Previously, `PDFMinerToDocument` did not consider the document's meta field when generating the document's `id`.
- Pinned the OpenAI client to >=1.56.1 to avoid issues related to changes in the httpx library.
- `PyPDFToDocument` now creates documents with an `id` based on the converted text and metadata. Previously, it did not take the metadata into account.
- Fixed issues with deserialization of components in multi-threaded environments.