Developers Club geek daily blog

1 year, 10 months ago

From the translator


This is the last article in Herman Radtke's series on working with strings and memory in Rust, which I have been translating. It seemed to me the most useful one, and I originally wanted to start the translation with it. I then decided that the other articles in the series were needed too, to provide context and to introduce simpler but very important aspects of the language, without which this article loses its usefulness.


We have learned how to write a function that accepts String or &str as an argument. Now I want to show you how to write a function that returns String or &str. I also want to discuss why we might need that.

To start, let's write a function that removes all spaces from a given string. Our function might look something like this:

fn remove_spaces(input: &str) -> String {
   let mut buf = String::with_capacity(input.len());

   for c in input.chars() {
      if c != ' ' {
         buf.push(c);
      }
   }

   buf
}


This function allocates memory for a string buffer, walks through all the characters of the input string, and appends every non-space character to the buffer buf. Now a question: what if the input contains no spaces? Then the value of input will be exactly the same as buf. In that case it would be more efficient not to create buf at all. Instead, we would like to simply return the given input back to the caller. The type of input is &str, but our function returns String. We could change the type of input to String:

fn remove_spaces(input: String) -> String { ... }

But there are two problems. First, if input becomes a String, the caller has to move ownership of input into our function and will not be able to work with the same data afterwards. We should take ownership of input only if we really need it. Second, the caller may already have a &str, and then we force them to convert it to a String, which defeats our attempt to avoid allocating memory for buf.
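Both problems can be made concrete with a small sketch. The names remove_spaces_owned, call_with_borrowed and call_and_keep are hypothetical and exist only for illustration; they are not part of the original article.

```rust
// Hypothetical variant that takes ownership of its argument.
fn remove_spaces_owned(input: String) -> String {
    input.chars().filter(|&c| c != ' ').collect()
}

// Problem 2: a caller that only has a &str must allocate a String
// just to be able to call the function.
fn call_with_borrowed(text: &str) -> String {
    remove_spaces_owned(text.to_string()) // allocation forced on the caller
}

// Problem 1: a caller that still needs its String afterwards must clone it,
// because passing `text` directly would move it into the function.
fn call_and_keep(text: String) -> (String, String) {
    let cleaned = remove_spaces_owned(text.clone());
    (text, cleaned)
}
```

In both helpers an allocation happens before remove_spaces_owned even looks at the data, regardless of whether the string contains any spaces.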

Clone-on-write


What we really want is to be able to return our input string (&str) if it contains no spaces, and a new string (String) if it does contain spaces and we had to remove them. This is where the clone-on-write type Cow comes to the rescue. Cow lets us abstract over whether we own the data (Owned) or have only borrowed it (Borrowed). In our example, &str is a reference to an existing string, so it is borrowed data. If the string contains spaces, we need to allocate memory for a new String. The variable buf owns that string. Normally we would move ownership of buf by returning it to the caller. With Cow, we move ownership of buf into the Cow and then return the Cow.

use std::borrow::Cow;

fn remove_spaces<'a>(input: &'a str) -> Cow<'a, str> {
    if input.contains(' ') {
        let mut buf = String::with_capacity(input.len());

        for c in input.chars() {
            if c != ' ' {
                buf.push(c);
            }
        }

        return Cow::Owned(buf);
    }

    return Cow::Borrowed(input);
}

Our function checks whether the input argument contains at least one space, and only then allocates memory for a new buffer. If input contains no spaces, it is simply returned as is. We add a little runtime complexity to optimize how we work with memory. Note that our Cow type has the same lifetime as the &str. As we discussed earlier, the compiler needs to track the use of the &str reference to know when the memory can safely be freed (or the destructor called, if the type implements Drop).
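One way to observe which branch was taken is to inspect the returned variant. The sketch below repeats remove_spaces in a compact form so that it stands on its own; the helper allocated is a hypothetical name introduced only for this illustration.

```rust
use std::borrow::Cow;

// Compact restatement of the function discussed above.
fn remove_spaces<'a>(input: &'a str) -> Cow<'a, str> {
    if input.contains(' ') {
        Cow::Owned(input.chars().filter(|&c| c != ' ').collect())
    } else {
        Cow::Borrowed(input)
    }
}

// Hypothetical helper: reports whether the result holds a freshly
// allocated String rather than a borrow of the input.
fn allocated(result: &Cow<'_, str>) -> bool {
    matches!(result, Cow::Owned(_))
}
```

For an input without spaces, allocated returns false and the result points at the same bytes as the input (result.as_ptr() equals input.as_ptr()), so nothing was copied.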

The beauty of Cow is that it implements the Deref trait, so you can call non-mutating methods on it without even knowing whether a new buffer was allocated for the result. For example:

let s = remove_spaces("Herman Radtke");
println!("Length of string: {}", s.len());

If I need to mutate s, I can convert it into an owned variable with the into_owned() method. If the Cow contains borrowed data (the Borrowed variant), a memory allocation takes place. This approach lets us clone (that is, allocate memory) lazily, only when we really need to write to (or mutate) the variable.

Example of mutating a Cow::Borrowed:

let s = remove_spaces("Herman"); // s is wrapped in Cow::Borrowed
let len = s.len(); // a read-only method is called through Deref
let owned: String = s.into_owned(); // memory is allocated for a new String

Example of mutating a Cow::Owned:

let s = remove_spaces("Herman Radtke"); // s is wrapped in Cow::Owned
let len = s.len(); // a read-only method is called through Deref
let owned: String = s.into_owned(); // no allocation happens, we already have a String

The idea behind Cow is the following:

  • Postpone memory allocation for as long as possible. In the best case, we never allocate new memory at all.
  • Let the caller of our remove_spaces function not worry about memory allocation. Using the Cow is the same either way (whether new memory was allocated or not).
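Besides into_owned(), the standard library also offers Cow::to_mut(), which gives mutable access and clones borrowed data only at the moment of the first write. The following is a minimal sketch of that lazy behaviour; ensure_exclaimed is a hypothetical function that is not part of the original article.

```rust
use std::borrow::Cow;

// Hypothetical helper: appends an exclamation mark only if it is missing.
fn ensure_exclaimed(mut text: Cow<'_, str>) -> Cow<'_, str> {
    if !text.ends_with('!') {
        // to_mut() clones borrowed data into a String on first use and
        // hands back the existing String if the data is already owned.
        text.to_mut().push('!');
    }
    text
}
```

If the text already ends with an exclamation mark, a borrowed Cow stays borrowed and no memory is allocated; otherwise it becomes Cow::Owned at the point of the write.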

Using the Into trait


Earlier we talked about using the Into trait to convert &str to String. In the same way, we can use it to convert a &str or a String into the appropriate Cow variant. Calling .into() makes the compiler pick the right conversion automatically. Using .into() will not slow our code down at all; it is just a way to avoid spelling out Cow::Owned or Cow::Borrowed explicitly.

fn remove_spaces<'a>(input: &'a str) -> Cow<'a, str> {
    if input.contains(' ') {
        let mut buf = String::with_capacity(input.len());
        let v: Vec<char> = input.chars().collect();

        for c in v {
            if c != ' ' {
                buf.push(c);
            }
        }

        return buf.into();
    }
    return input.into();
}
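To see which variant .into() selects, the two conversions can be isolated. This is a small sketch; from_borrowed and from_owned are hypothetical names used only for illustration.

```rust
use std::borrow::Cow;

// The From<&str> implementation produces Cow::Borrowed.
fn from_borrowed(input: &str) -> Cow<'_, str> {
    input.into()
}

// The From<String> implementation produces Cow::Owned.
fn from_owned(input: String) -> Cow<'static, str> {
    input.into()
}
```

The choice is made at compile time from the type of the value that .into() is called on, which is why it adds no runtime cost.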

And finally, we can simplify our example a little by using iterators:

fn remove_spaces<'a>(input: &'a str) -> Cow<'a, str> {
    if input.contains(' ') {
        input
        .chars()
        .filter(|&x| x != ' ')
        .collect::<std::string::String>()
        .into()
    } else {
        input.into()
    }
}

Real-world uses of Cow


My example with removing spaces may seem a little contrived, but this strategy is also used in real code. The Rust core libraries have a function that converts bytes to a UTF-8 string lossily, without preserving invalid byte sequences, and a function that converts line endings from CRLF to LF. For both of these functions there is an optimal case where a &str can be returned, and a less optimal case that requires allocating memory for a String. Other examples that come to mind: encoding a string as valid XML/HTML, or correctly escaping special characters in an SQL query. In many cases the input data is already correctly encoded or escaped, and then it is better to simply return the input string as is. If the data needs to be changed, we have to allocate memory for a string buffer and return that instead.
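The byte conversion mentioned above is available in the standard library as String::from_utf8_lossy, which returns a Cow<str>. The sketch below wraps it and adds a simplified escaper built on the same pattern. Both decode and escape_html are hypothetical names, and the escaper handles only three characters, so it should be read as an illustration rather than a complete HTML encoder.

```rust
use std::borrow::Cow;

// Thin wrapper around the standard library function: valid UTF-8 is
// borrowed, while invalid bytes force the allocation of a new String.
fn decode(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

// Hypothetical, deliberately incomplete escaper following the same pattern.
fn escape_html(input: &str) -> Cow<'_, str> {
    // Fast path: nothing to escape, so return the input untouched.
    if !input.contains(|c: char| matches!(c, '&' | '<' | '>')) {
        return Cow::Borrowed(input);
    }
    // Slow path: allocate a buffer and write the escaped text into it.
    let mut buf = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => buf.push_str("&amp;"),
            '<' => buf.push_str("&lt;"),
            '>' => buf.push_str("&gt;"),
            _ => buf.push(c),
        }
    }
    Cow::Owned(buf)
}
```

In both functions, input that is already well-formed costs only a scan, and an allocation happens only when the output has to differ from the input.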

Why use String::with_capacity()?


While we are on the subject of efficient memory management, note that I used String::with_capacity() instead of String::new() when creating the string buffer. You can use String::new() instead of String::with_capacity(), but it is much more efficient to allocate all the memory the buffer needs up front than to reallocate it as we add new characters to the buffer.

A String is really a vector (Vec) of UTF-8 encoded bytes. When you call String::new(), Rust creates a vector of zero length. When we push the character a onto the string buffer, for example with buf.push('a'), Rust has to grow the vector's capacity. To do so, it allocates 2 bytes of memory. As we push more characters onto the buffer and exceed the allocated capacity, Rust doubles the size of the string by reallocating memory. It keeps growing the vector's capacity every time it is exceeded. The sequence of allocated capacities is: 0, 2, 4, 8, 16, 32, …, 2^n, where n is the number of times Rust found that the allocated capacity was exceeded. Reallocation is very slow (correction: kmc_v3 explained that it may not be as slow as I thought). Not only does Rust have to ask the kernel for new memory, it also has to copy the contents of the vector from the old memory region to the new one. Take a look at the source code of Vec::push to see the resizing logic.
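The growth of the buffer can be observed directly by recording the capacity after each push. This is a minimal sketch; capacity_steps is a hypothetical helper, and the exact numbers it reports depend on the Rust version, so no specific sequence is assumed here.

```rust
// Pushes `n` single-byte characters onto `buf` and records every
// distinct capacity observed, starting with the initial one.
fn capacity_steps(mut buf: String, n: usize) -> Vec<usize> {
    let mut steps = vec![buf.capacity()];
    for _ in 0..n {
        buf.push('a');
        // A change in capacity means the buffer was reallocated.
        if buf.capacity() != *steps.last().unwrap() {
            steps.push(buf.capacity());
        }
    }
    steps
}
```

Calling capacity_steps(String::new(), 1000) returns a list with several growing entries, one per reallocation, while capacity_steps(String::with_capacity(1000), 1000) returns a single entry because the buffer never has to grow.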

A clarification about reallocation from kmc_v3
It may not be so bad, because:

  • Any decent allocator requests memory from the OS in large chunks and then hands it out to its users.
  • Any decent multithreaded memory allocator also maintains per-thread caches, so access to it does not have to be synchronized all the time.
  • Very often the allocated memory can be grown in place, and in that case no data is copied. You may have allocated only 100 bytes, but if the next thousand bytes are free, the allocator will simply give them to you.
  • Even when copying is needed, it is a bytewise copy done with memcpy, with a completely predictable memory access pattern. So it is perhaps the most efficient way to move data from memory to memory. The system libc usually ships a memcpy optimized for your specific microarchitecture.
  • You can also "move" large allocations by remapping the MMU, in which case you need to copy no more than one page of data. However, changing page tables usually has a large fixed cost, so the method is suitable only for very large vectors. I am not sure whether jemalloc in Rust performs this optimization.

Resizing a std::vector in C++ can be very slow because move constructors have to be called individually for each element, and they can throw exceptions.

In general, we want to allocate new memory only when it is needed, and exactly as much as is needed. For short strings, such as remove_spaces("Herman Radtke"), the overhead of reallocation does not matter much. But what if I want to remove all the spaces from all the JavaScript files on my website? The overhead of reallocating the buffer will be much greater. When pushing data into a vector (a String or any other), it is very helpful to specify the amount of memory that will be needed when creating the vector. In the best case you know the required length in advance, so the vector's capacity can be set exactly. The comments in the Vec code warn about roughly the same thing.
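For remove_spaces the required length can in fact be computed in advance, at the price of one extra pass over the input. The variant below is a sketch of that idea, not the author's version; remove_spaces_exact is a hypothetical name.

```rust
use std::borrow::Cow;

// Hypothetical variant that sizes the buffer exactly.
fn remove_spaces_exact<'a>(input: &'a str) -> Cow<'a, str> {
    // The byte 0x20 never occurs inside a multi-byte UTF-8 sequence,
    // so counting bytes is enough to count spaces.
    let spaces = input.bytes().filter(|&b| b == b' ').count();
    if spaces == 0 {
        return Cow::Borrowed(input);
    }
    // Each space occupies exactly one byte, so the result needs
    // input.len() - spaces bytes.
    let mut buf = String::with_capacity(input.len() - spaces);
    buf.extend(input.chars().filter(|&c| c != ' '));
    Cow::Owned(buf)
}
```

Whether the extra pass pays off depends on the input: input.len() is always a safe upper bound, so exact sizing mostly helps when a large share of the input consists of spaces.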

What else to read?



This article is a translation of the original post at habrahabr.ru/post/274565/