WIP: Reimplementing search_dates
#945
Open: gavishpoddar wants to merge 44 commits into scrapinghub:master from gavishpoddar:search_dates
Changes from 34 commits
Commits
02220da Implimenting new search_dates (gavishpoddar)
f933d3a Fixing DATE_ORDER, implimenting deep_search, tests (gavishpoddar)
77727b5 Unproving _joint_parse with data_carry accurate_return_text, deep_se… (gavishpoddar)
e7f38e8 implementing _final_text_clean() (gavishpoddar)
962066c Simplifying text_clean and modifying tests (gavishpoddar)
624ac8e Implementing relative date (gavishpoddar)
42ca6f6 Fixing tests (gavishpoddar)
51749a2 secondary_split_implimentation (gavishpoddar)
f5e4635 positional args to keyword argument (gavishpoddar)
121b15f Micro fixes (gavishpoddar)
2cd93f0 Removing codes now part of #953 (gavishpoddar)
006d2a5 adding check_settings (gavishpoddar)
10404c9 implimenting double_punctuation_split (gavishpoddar)
22596e0 Updating docs and removing test (TMP) (gavishpoddar)
b799dfb cleaning code, adding tests, improving coverage (gavishpoddar)
42c984a Merge branch 'scrapinghub:master' into search_dates (gavishpoddar)
8fc5e0d Improving codecov (gavishpoddar)
74b6ec4 temporary commit to get diff (gavishpoddar)
56e0505 Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa… (gavishpoddar)
5a1b1c5 temporary file change for review (gavishpoddar)
aa2aa8f reverting the previous commit (gavishpoddar)
41eff6a improvements (gavishpoddar)
f65531b formatting code (gavishpoddar)
982fc08 formatting code (gavishpoddar)
3621b2d improvements in text filter (gavishpoddar)
8a9496b Merge branch 'scrapinghub:master' into search_dates (gavishpoddar)
45996b4 removing previous search_dates (gavishpoddar)
2ac88c6 Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa… (gavishpoddar)
5dabc62 adding test (gavishpoddar)
ab1778d fixing doc string (gavishpoddar)
14adf89 fixing doc string (gavishpoddar)
d57223a Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa… (gavishpoddar)
88afa30 updating xfail (gavishpoddar)
9209f3d updating tests (gavishpoddar)
85254e0 Apply suggestions from code review (gavishpoddar)
e4604e6 Merge branch 'master' into search_dates (gavishpoddar)
4f119dd Updates (gavishpoddar)
e6da4be Fixing upstraem merges (gavishpoddar)
f6116bf DateSearch -> DateSearchWithDetection (gavishpoddar)
0525cdc Merge branch 'scrapinghub:master' into search_dates (gavishpoddar)
96b91c0 updating test with xfail (gavishpoddar)
b9d12f3 Merge branch 'search_dates' of https://github.com/gavishpoddar/datepa… (gavishpoddar)
99e66c6 minor fixes (gavishpoddar)
2935aae Merge branch 'master' into search_dates (serhii73)
@@ -1,57 +1,119 @@
-from dateparser.search.search import DateSearchWithDetection
+from dateparser.search.search import DateSearch
 from dateparser.conf import apply_settings


-_search_with_detection = DateSearchWithDetection()
+_search_dates = DateSearch()


 @apply_settings
 def search_dates(text, languages=None, settings=None, add_detected_language=False):
     """Find all substrings of the given string which represent date and/or time and parse them.

-        :param text:
-            A string in a natural language which may contain date and/or time expressions.
-        :type text: str
-
-        :param languages:
-            A list of two letters language codes.e.g. ['en', 'es']. If languages are given, it will
-            not attempt to detect the language.
-        :type languages: list
-
-        :param settings:
-            Configure customized behavior using settings defined in :mod:`dateparser.conf.Settings`.
-        :type settings: dict
-
-        :param add_detected_language:
-            Indicates if we want the detected language returned in the tuple.
-        :type add_detected_language: bool
-
-        :return: Returns list of tuples containing:
-            substrings representing date and/or time, corresponding :mod:`datetime.datetime`
-            object and detected language if *add_detected_language* is True.
-            Returns None if no dates that can be parsed are found.
-        :rtype: list
-        :raises: ValueError - Unknown Language
-
-        >>> from dateparser.search import search_dates
-        >>> search_dates('The first artificial Earth satellite was launched on 4 October 1957.')
-        [('on 4 October 1957', datetime.datetime(1957, 10, 4, 0, 0))]
-
-        >>> search_dates('The first artificial Earth satellite was launched on 4 October 1957.',
-        >>> add_detected_language=True)
-        [('on 4 October 1957', datetime.datetime(1957, 10, 4, 0, 0), 'en')]
-
-        >>> search_dates("The client arrived to the office for the first time in March 3rd, 2004 "
-        >>> "and got serviced, after a couple of months, on May 6th 2004, the customer "
-        >>> "returned indicating a defect on the part")
-        [('in March 3rd, 2004 and', datetime.datetime(2004, 3, 3, 0, 0)),
-        ('on May 6th 2004', datetime.datetime(2004, 5, 6, 0, 0))]
-
-    """
-    result = _search_with_detection.search_dates(
+    :param text:
+        A string in a natural language which may contain the date and/or time expressions.
+    :type text: str
+
+    :param languages:
+        A list of two letters language codes.e.g. ['en', 'es']. If languages are given, it will
+        not attempt to detect the language.
+    :type languages: list
+
+    :param settings:
+        Configure customized behavior using settings defined in :mod:`dateparser.conf.Settings`.
+    :type settings: dict
+
+    :param add_detected_language:
+        Indicates if we want the detected language returned in the tuple.
+    :type add_detected_language: bool
+
+    :return: Returns tuples containing:
+        substrings representing date and/or time, corresponding :mod:`datetime.datetime`
+        object and detected language if *add_detected_language* is True.
+        Returns None if no dates that can be parsed are found.
+    :rtype: list
+    :raises: ValueError - Unknown Language
+
+    >>> from dateparser.search import search_dates
+    >>> search_dates('The first artificial Earth satellite was launched on 4 October 1957.')
+    [('on 4 October 1957', datetime.datetime(1957, 10, 4, 0, 0))]
+
+    >>> search_dates('The first artificial Earth satellite was launched on 4 October 1957.',
+    >>> add_detected_language=True)
+    [('on 4 October 1957', datetime.datetime(1957, 10, 4, 0, 0), 'en')]
+
+    >>> search_dates("The client arrived to the office for the first time in March 3rd, 2004 "
+    >>> "and got serviced, after a couple of months, on May 6th 2004, the customer "
+    >>> "returned indicating a defect on the part")
+    [('in March 3rd, 2004 and', datetime.datetime(2004, 3, 3, 0, 0)),
+    ('on May 6th 2004', datetime.datetime(2004, 5, 6, 0, 0))]
+
+    """
+
+    result = _search_dates.search_dates(
         text=text, languages=languages, settings=settings
     )
-    dates = result.get('Dates')
+
+    dates = result.get("Dates")
     if dates:
         if add_detected_language:
-            language = result.get('Language')
-            dates = [date + (language, ) for date in dates]
+            language = result.get("Language")
+            dates = [date + (language,) for date in dates]
         return dates
+
+
+@apply_settings
+def search_first_date(text, languages=None, settings=None, add_detected_language=False):
+    """Find first substring of the given string which represent date and/or time and parse it.
+
+    :param text:
+        A string in a natural language which may contain the date and/or time expression.
+    :type text: str
+
+    :param languages:
+        A list of two letters language codes.e.g. ['en', 'es']. If languages are given, it will
+        not attempt to detect the language.
+    :type languages: list
+
+    :param settings:
+        Configure customized behavior using settings defined in :mod:`dateparser.conf.Settings`.
+    :type settings: dict
+
+    :param add_detected_language:
+        Indicates if we want the detected language returned in the tuple.
+    :type add_detected_language: bool
+
+    :return: Returns tuples containing:
+        substrings representing date and/or time, corresponding :mod:`datetime.datetime`
+        object and detected language if *add_detected_language* is True.
+        Returns None if no dates that can be parsed are found.
+    :rtype: tuple
+    :raises: ValueError - Unknown Language
+
+    >>> from dateparser.search import search_first_date
+    >>> search_first_date('The first artificial Earth satellite was launched on 4 October 1957.')
+    ('on 4 October 1957', datetime.datetime(1957, 10, 4, 0, 0))
+
+    >>> from dateparser.search import search_first_date
+    >>> search_first_date('Caesar Augustus, also known as Octavian')
+    None
+
+    >>> search_first_date('The first artificial Earth satellite was launched on 4 October 1957.',
+    >>> add_detected_language=True)
+    ('on 4 October 1957', datetime.datetime(1957, 10, 4, 0, 0), 'en')
+
+    >>> search_first_date("The client arrived to the office for the first time in March 3rd, 2004 "
+    >>> "and got serviced, after a couple of months, on May 6th 2004, the customer "
+    >>> "returned indicating a defect on the part")
+    ('in March 3rd, 2004 and', datetime.datetime(2004, 3, 3, 0, 0))
+
+    """
+
+    result = _search_dates.search_dates(
+        text=text, languages=languages, limit_date_search_results=1, settings=settings
+    )
+    dates = result.get("Dates")
+    if dates:
+        if add_detected_language:
+            language = result.get("Language")
+            dates = [date + (language,) for date in dates]
+        return dates[0]
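
Both functions in the diff above end with the same post-processing of the search result. The following is a minimal sketch of that shared tail, assuming only what the diff shows: that the result is a dict with a "Dates" list of tuples and a "Language" entry. The helper name `attach_language` and its `first_only` flag are hypothetical and are not part of dateparser or of this PR.

```python
from datetime import datetime


def attach_language(result, add_detected_language=False, first_only=False):
    """Hypothetical helper mirroring the tail of search_dates / search_first_date."""
    dates = result.get("Dates")
    if not dates:
        # As in the diff: an empty or missing "Dates" entry yields None.
        return None
    if add_detected_language:
        language = result.get("Language")
        # Tuples are immutable, so each one is extended by concatenation.
        dates = [date + (language,) for date in dates]
    # search_first_date returns a single tuple, search_dates the whole list.
    return dates[0] if first_only else dates


# Hand-built sample shaped like the result dict used in the diff (an assumption).
sample = {
    "Language": "en",
    "Dates": [("on 4 October 1957", datetime(1957, 10, 4, 0, 0))],
}

print(attach_language(sample, add_detected_language=True))
print(attach_language(sample, first_only=True))
print(attach_language({"Language": "en", "Dates": []}))
```

Factoring the tail out this way would remove the duplication between the two functions, at the cost of one more private helper; whether that trade is worthwhile is a review question, not something the PR states.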
@@ -0,0 +1,49 @@
+from collections.abc import Set
+
+from dateparser.search.text_detection import FullTextLanguageDetector
+from dateparser.languages.loader import LocaleDataLoader
+
+
+class SearchLanguages:
+    def __init__(self):
+        self.loader = LocaleDataLoader()
+        self.available_language_map = self.loader.get_locale_map()
+        self.language = None
+
+    def get_current_language(self, language_shortname):
+        if self.language is None or self.language.shortname != language_shortname:
+            self.language = self.loader.get_locale(language_shortname)
+
+    def translate_objects(self, language_shortname, text, settings):
+        self.get_current_language(language_shortname)
+        result = self.language.translate_search(text, settings=settings)
+        return result
+
+    def detect_language(self, text, languages):
+        if isinstance(languages, (list, tuple, Set)):
+
+            if all([language in self.available_language_map for language in languages]):
+                languages = [
+                    self.available_language_map[language] for language in languages
+                ]
+            else:
+                unsupported_languages = set(languages) - set(
+                    self.available_language_map.keys()
+                )
+                raise ValueError(
+                    "Unknown language(s): %s"
+                    % ", ".join(map(repr, unsupported_languages))
+                )
+        elif languages is not None:
+            raise TypeError(
+                "languages argument must be a list (%r given)" % type(languages)
+            )
+
+        if languages:
+            self.language_detector = FullTextLanguageDetector(languages=languages)
+        else:
+            self.language_detector = FullTextLanguageDetector(
+                list(self.available_language_map.values())
+            )
+
+        return self.language_detector._best_language(text)
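
The argument checks at the top of `detect_language` can be read in isolation. Below is a rough stand-alone sketch of that validation, assuming the locale map behaves like a plain dict keyed by language short name. The function `validate_languages` and the sample map are hypothetical; the sketch also sorts the unknown codes so the error message is deterministic, which the code in the diff does not do.

```python
from collections.abc import Set


def validate_languages(languages, available_language_map):
    """Hypothetical stand-alone version of the checks in detect_language."""
    if languages is None:
        # No restriction given; the caller would fall back to all locales.
        return None
    if not isinstance(languages, (list, tuple, Set)):
        raise TypeError(
            "languages argument must be a list (%r given)" % type(languages)
        )
    unsupported = set(languages) - set(available_language_map)
    if unsupported:
        # Sorted here for a stable message; the diff joins the raw set.
        raise ValueError(
            "Unknown language(s): %s" % ", ".join(map(repr, sorted(unsupported)))
        )
    return [available_language_map[language] for language in languages]


# Stand-in map: short name -> placeholder locale object (an assumption).
locale_map = {"en": "<locale en>", "es": "<locale es>"}

print(validate_languages(["en", "es"], locale_map))
try:
    validate_languages(["en", "xx"], locale_map)
except ValueError as error:
    print(error)
```

Note that a plain string such as `"en"` is rejected with `TypeError` in both the diff and this sketch, since `str` is not a list, tuple, or set.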