RNC

API for Russian National Corpus


Installation

pip install rnc

Structure

A Corpus object contains a list of the obtained examples. There are two types of examples:

ru = rnc.MainCorpus(...)
ru.request_examples()

print(type(ru[0]))
>>> MainExample

Examples’ objects fields

Usage

import rnc

ru = rnc.MainCorpus(
    query='корпус', 
    p_count=5,
    file='filename.csv',
    marker=str.upper,
    **kwargs
)

ru.request_examples()
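
The `marker` argument in the call above is a function object (`str.upper`), not the result of calling one. As a rough illustration of why that matters, the sketch below uses a hypothetical `highlight` helper (not part of rnc) that applies a marker callable to the found word in a sentence. How rnc applies the marker internally is an assumption here.

```python
from typing import Callable


def highlight(text: str, word: str, marker: Callable[[str], str]) -> str:
    """Hypothetical stand-in: apply `marker` to every occurrence of `word`."""
    return text.replace(word, marker(word))


sentence = 'национальный корпус русского языка'

# Pass the function itself; it is called later, once the word is known.
upper_marked = highlight(sentence, 'корпус', str.upper)

# Any callable that takes a string and returns a string works the same way.
starred = highlight(sentence, 'корпус', lambda w: f'*{w}*')

# Calling str.upper() with no argument fails immediately with a TypeError.
try:
    str.upper()
except TypeError as error:
    call_error = error
```

A custom marker such as the lambda above is handy when the output should be highlighted with symbols rather than upper case.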

Corpora you can use.

Full query form

query = {
    'word1': {
        'gramm': 'acc', # grammar tags for lexgramm search
        'flags': 'bdot' # additional tags for lexgramm search
    },
    # a value can be a single string or a dict of params;
    # in the dict form each key may have any name and each value is a tag name (see the tags below)
    'word2': {
        'gramm': { 
            # the NAMES of these keys are arbitrary
            'pos (any name)': 'S' or ['S', 'A'], # one value or a list of values
            'case (any name)': 'acc' or ['acc', 'nom'],
        },
        'flags': {}, # same rules as for 'gramm'
        # distance between first and second words
        'min': 1,  
        'max': 3
    },  
}

corp = rnc.MainCorpus(
    query, 5, file='filename.csv', marker=str.upper, **kwargs)
corp.request_examples()
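
Because the query is an ordinary dict, it can also be assembled programmatically. The sketch below uses a hypothetical `word_params` helper (not part of rnc) to build the structure shown above. It also makes explicit that each tag value is either one string or a list of strings: an expression such as `'S' or ['S', 'A']` evaluates to just `'S'` in Python.

```python
from typing import Dict, List, Optional, Union

TagValue = Union[str, List[str]]
Tags = Union[str, Dict[str, TagValue]]


def word_params(gramm: Optional[Tags] = None,
                flags: Optional[Tags] = None,
                min_dist: Optional[int] = None,
                max_dist: Optional[int] = None) -> dict:
    """Hypothetical helper: collect the params of one word, skipping unset ones."""
    params = {'gramm': gramm, 'flags': flags, 'min': min_dist, 'max': max_dist}
    return {key: value for key, value in params.items() if value is not None}


query = {
    # string form: one tag per param
    'word1': word_params(gramm='acc', flags='bdot'),
    # dict form: arbitrary key names, a single tag or a list of tags as values
    'word2': word_params(
        gramm={'pos': ['S', 'A'], 'case': 'acc'},
        min_dist=1,
        max_dist=3,
    ),
}
```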

Lexgramm search params

String as a query

You can also pass the query as a string of the dictionary forms of the words, separated by spaces: query = 'get down' or query = 'я получить'. The distance between the words will then be the default.

Additional request params

These params are optional, you can ignore them. Here are the default values.

corp = rnc.ParallelCorpus(
    query=query, 
    p_count=5,
    file='filename.csv',
    marker=str.upper,
    
    dpp=5, # documents per page
    spd=10, # sentences per document (at most spd)
    text='lexgramm', # way to search: 'lexgramm' or 'lexform'
    out='normal', # output format: 'normal' or 'kwic'
    kwsz=5, # if out='kwic', number of words in context
    sort='i_grtagging', # way to sort the results, see HOWTO section below
    mycorp='', # see HOWTO section below
    lang=rnc.Languages.en,
    accent=0, # with accentology (1) or without (0), if it is available
)
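
To keep track of which optional params were overridden, one option is to hold the defaults listed above in a dict and merge overrides into it before passing them on as keyword arguments. This is only a sketch: `build_params` is a hypothetical helper, not part of rnc, and the allowed values for `text`, `out` and `accent` are taken from the listing above.

```python
# Default values as listed above (lang is omitted because it needs rnc.Languages).
DEFAULTS = {
    'dpp': 5,
    'spd': 10,
    'text': 'lexgramm',
    'out': 'normal',
    'kwsz': 5,
    'sort': 'i_grtagging',
    'mycorp': '',
    'accent': 0,
}

# Params with a closed set of values, taken from the listing above.
ALLOWED = {
    'text': {'lexgramm', 'lexform'},
    'out': {'normal', 'kwic'},
    'accent': {0, 1},
}


def build_params(**overrides) -> dict:
    """Hypothetical helper: merge overrides into the defaults and validate them."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f'Unknown params: {sorted(unknown)}')
    params = {**DEFAULTS, **overrides}
    for name, allowed in ALLOWED.items():
        if params[name] not in allowed:
            raise ValueError(f'{name} must be one of {sorted(allowed)}')
    return params


kwic_params = build_params(out='kwic', kwsz=3)
```

The resulting dict could then be unpacked into a corpus constructor, for example `rnc.ParallelCorpus(query, 5, **kwic_params)`.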

Sort keys

API can work with a local file too

ru = rnc.SpokenCorpus(file='local_database.csv') # it must exist
print(ru)

If the file exists, the API works with it. If the data list is not empty, you cannot request new examples.

When working with a file, the only argument the Corpus needs is the file name (file=...).

Working with corpora

corp = rnc.corpus_name(...) 

Magic methods:

for r in corp:
    print(r.left)
    print(r.src)


Set default values for all objects you create afterwards:
* `corpus_name.set_dpp(value)` – change the default `documents per page` value.
* `corpus_name.set_spd(value)` – change the default `sentences per document` value.
* `corpus_name.set_text(value)` – change the default search mode.
* `corpus_name.set_sort(value)` – change the default sort key.
* `corpus_name.set_min(value)` – change the default min distance between words.
* `corpus_name.set_max(value)` – change the default max distance between words.
* `corpus_name.set_restrict_show(value)` – change the default number of examples shown when printing. 
If it is `False`, the Corpus shows all examples. 
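
These setters act on the corpus class rather than on a single object. The general pattern can be sketched with a stand-in class. `DemoCorpus` below is purely illustrative and is not how rnc is implemented; in particular, whether rnc's setters affect already created objects is not stated here.

```python
from typing import Optional


class DemoCorpus:
    """Illustrative stand-in for a corpus class with class-level defaults."""

    _default_dpp = 5  # documents per page

    def __init__(self, query: str, dpp: Optional[int] = None) -> None:
        self.query = query
        # Fall back to the class-wide default when no value is passed.
        self.dpp = self._default_dpp if dpp is None else dpp

    @classmethod
    def set_dpp(cls, value: int) -> None:
        """Change the default used by objects created from now on."""
        cls._default_dpp = value


before = DemoCorpus('корпус')
DemoCorpus.set_dpp(20)
after = DemoCorpus('корпус')
explicit = DemoCorpus('корпус', dpp=3)  # an explicit value still wins
```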


### Corpora features
#### ParallelCorpus
* The query may be in either the original language or the language of 
  translation. 

#### MultilingualParaCorpus
* Working with files is not supported.
* Param `mycorp` is not required by default, but it may be passed, see 
  **HOWTO** section below.

#### MultimodalCorpus
* `corp.download_all()` – download all media files. **It is recommended** to use 
this method instead of `expl.download_file()`.
* `async corp.download_all_async()` – download all media files using the running event loop.
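
Since `download_all_async()` is a coroutine, it has to be awaited from inside an event loop. The sketch below shows only the calling pattern: `StubCorpus` and its behaviour are hypothetical, and just the method name comes from the list above.

```python
import asyncio


class StubCorpus:
    """Hypothetical stand-in that only mimics the shape of the async method."""

    def __init__(self) -> None:
        self.downloaded = False

    async def download_all_async(self) -> None:
        await asyncio.sleep(0)  # placeholder for real network I/O
        self.downloaded = True


async def main() -> StubCorpus:
    corp = StubCorpus()
    # Await the coroutine; without `await` the download would never run.
    await corp.download_all_async()
    return corp


corp = asyncio.run(main())
```

In code that is not already asynchronous, the synchronous `corp.download_all()` is the simpler choice.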


## Logger
* See all log messages
```python
rnc.set_stream_handler_level('debug')
```

ATTENTION

RIGHT:

ru = rnc.MainCorpus(...,  marker=str.upper)

WRONG:

ru = rnc.MainCorpus(..., marker=str.upper())

HOWTO

You can ask any question you want here.

How to set sort?

There are some sort keys:

  1. i_grtagging – by default.
  2. random – randomly.
  3. i_grauthor – by author.
  4. i_grcreated_inv – by creation date.
  5. i_grcreated – by creation date in reversed order.
  6. i_grbirthday_inv – by author’s birth date.
  7. i_grbirthday – by author’s birth date in reversed order.

Some of HTTP params.

How to set language in ParallelCorpus?

en = rnc.ParallelCorpus('get', 5, lang=rnc.Languages.en)

Languages the corpus supports:

  1. Armenian
  2. Bashkir
  3. Belarusian
  4. Bulgarian
  5. Buryatian
  6. Chinese
  7. Czech
  8. English
  9. Estonian
  10. Finnish
  11. French
  12. German
  13. Italian
  14. Latvian
  15. Lithuanian
  16. Polish
  17. Spanish
  18. Swedish
  19. Ukrainian

If you want to search in several languages at once, select them on the site to get a mycorp value and pass this param to the Corpus.

How to set subcorpus?

This means specifying the sample in which to search for the query.

There are default keys in rnc.mycorp (checked to work with MainCorpus) for Russian writers and poets:

Example:

ru = rnc.MainCorpus('нету', 1, mycorp=rnc.mycorp['Pushkin'])

OR

ru = rnc.MainCorpus('нету', 1, mycorp=rnc.mycorp.Pushkin)


Requirements

Licence

rnc is offered under MIT licence.

Source code

The project is hosted on GitHub


Please file an issue in the bug tracker if you have found a bug or have some suggestions to improve the library.