Paperless-ngx error storing document (nltk tokenizer)

Error when importing a document into paperless-ngx

I could not find a ‘paperless-ngx’ tag, sorry about that.

My YunoHost server

Hardware: VPS bought online
YunoHost version: 11.2.30
I have access to my server: Through SSH | through the webadmin
Are you in a special context or did you make any particular changes to your instance?: no
If your request is related to an app, specify its name and version: paperless-ngx v2.11.6~ynh1

Description of my issue

Since the last update, I cannot add any new document. I get the error "The following error occurred while storing document ebay_lenovo_thinkpad_t14s_gen2.pdf after parsing":

nouveau_fichier.pdf: The following error occurred while storing document nouveau_fichier.pdf after parsing: 
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  For more information see:

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - PosixPath('/var/www/paperless-ngx/nltk_data')

Looking at /var/log/paperless-ngx/paperless-ngx-task-queue.log, I find little additional information:


[2024-09-02 10:26:31,301] [ERROR] [] Task documents.tasks.consume_file[56b68187-31b5-4968-abaa-a19b2246c60e] raised unexpected: ConsumerError("nouveau_fichier.pdf: The following error occurred while storing document nouveau_fichier.pdf after parsing: \n**********************************************************************\n  Resource \x1b[93mpunkt_tab\x1b[0m not found.\n  Please use the NLTK Downloader to obtain the resource:\n\n  \x1b[31m>>> import nltk\n  >>>'punkt_tab')\n  \x1b[0m\n  For more information see:\n\n  Attempted to load \x1b[93mtokenizers/punkt_tab/english/\x1b[0m\n\n  Searched in:\n    - PosixPath('/var/www/paperless-ngx/nltk_data')\n**********************************************************************\n")
Traceback (most recent call last):
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/asgiref/", line 327, in main_wrap
    raise exc_info[1]
  File "/var/www/paperless-ngx/src/documents/", line 670, in run
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/django/dispatch/", line 176, in send
    return [
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/django/dispatch/", line 177, in <listcomp>
    (receiver, receiver(signal=self, sender=sender, **named))
  File "/var/www/paperless-ngx/src/documents/signals/", line 95, in set_correspondent
    potential_correspondents = matching.match_correspondents(document, classifier)
  File "/var/www/paperless-ngx/src/documents/", line 37, in match_correspondents
    pred_id = classifier.predict_correspondent(document.content) if classifier else None
  File "/var/www/paperless-ngx/src/documents/", line 413, in predict_correspondent
    X = self.data_vectorizer.transform([self.preprocess_content(content)])
  File "/var/www/paperless-ngx/src/documents/", line 386, in preprocess_content
    words: list[str] = word_tokenize(
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/nltk/tokenize/", line 142, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/nltk/tokenize/", line 119, in sent_tokenize
    tokenizer = _get_punkt_tokenizer(language)
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/nltk/tokenize/", line 105, in _get_punkt_tokenizer
    return PunktTokenizer(language)
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/nltk/tokenize/", line 1744, in __init__
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/nltk/tokenize/", line 1749, in load_lang
    lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/nltk/", line 579, in find
    raise LookupError(resource_not_found)
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  For more information see:

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - PosixPath('/var/www/paperless-ngx/nltk_data')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/celery/app/", line 453, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/celery/app/", line 736, in __protected_call__
    return*args, **kwargs)
  File "/var/www/paperless-ngx/src/documents/", line 149, in consume_file
    msg =
  File "/var/www/paperless-ngx/src/documents/", line 733, in run
  File "/var/www/paperless-ngx/src/documents/", line 304, in _fail
    raise ConsumerError(f"{self.filename}: {log_message or message}") from exception
documents.consumer.ConsumerError: nouveau_fichier.pdf: The following error occurred while storing document nouveau_fichier.pdf after parsing:
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  For more information see:

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - PosixPath('/var/www/paperless-ngx/nltk_data')

I don’t really know where to start. Uninstall and reinstall? I’m a bit worried that I won’t manage to restore my backup.

Well, I solved my problem:

$ sudo yunohost app shell paperless-ngx
$ python
>>> import nltk
>>> nltk.download('punkt_tab')
[nltk_data] Downloading package punkt_tab to /var/www/paperless-
[nltk_data]     ngx/nltk_data...
[nltk_data]   Unzipping tokenizers/
>>> quit()

This may be useful to other users.
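
For anyone who would rather not type into the Python prompt, the same fix can be sketched as a small script. This is only a sketch under assumptions: the helper names `has_punkt_tab` and `ensure_punkt_tab` are made up for illustration, the directory check is a simplified stand-in for NLTK's own lookup, and the data directory is the one printed under "Searched in" in the error above. It should be run with the app's Python, for example from inside `sudo yunohost app shell paperless-ngx`, so that the download lands in the right place with the right owner.

```python
from pathlib import Path


def has_punkt_tab(data_dir, language="english"):
    """Return True if the punkt_tab tables for `language` look installed.

    Simplified check: it only tests that the unpacked directory
    tokenizers/punkt_tab/<language>/ exists under `data_dir`, which is
    the path the error message says NLTK attempted to load.
    """
    return (Path(data_dir) / "tokenizers" / "punkt_tab" / language).is_dir()


def ensure_punkt_tab(data_dir, language="english"):
    """Download punkt_tab into `data_dir` only if it seems to be missing."""
    if has_punkt_tab(data_dir, language):
        return True
    # Imported lazily so the check above works even without NLTK installed.
    import nltk

    # nltk.download returns True on success; download_dir selects the target.
    return bool(nltk.download("punkt_tab", download_dir=str(data_dir)))
```

Calling `ensure_punkt_tab("/var/www/paperless-ngx/nltk_data")` would then download the resource once and do nothing on later runs.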


Thank you very much for sharing this solution! I had also been stuck since yesterday…

Many thanks for sharing your solution! I don’t understand French, but the Python code works well!