Developers Club geek daily blog

3 years, 11 months ago
I am a scientist [more about that here]. A «proletarian of mental labour». A physicist by training. I have worked in the field of processing medical and biological information for 30+ years.
I have worked in R for exactly 10 years, having migrated to it after 15 years of close cooperation with Matlab. The original reason for moving to a different working platform was my own physical migration to the opposite side of the Earth, to Auckland, New Zealand. There, from the very first days, life pushed me into the embrace of R, and I have not yet had cause to regret it.

More and more often I notice flashes of interest in R on the professional Russian-language internet. There have been articles about it on this esteemed resource too. Below the cut is my first attempt at a Russian-language introduction to R: the first (verbal) part of a presentation I gave to colleagues at the Department of Animal Science, Iowa State University, three years ago.
(aside: as it turns out, it is hard to translate yourself...)

My experience of introducing R, or «I Love R»
In this post
  • What R is
  • Where it came from
  • What I love it for
  • What I use it for and how (examples)
  • Myths and truth

What R is

First of all, R is a system for statistical and other scientific computing that uses the S programming language.

S is a language written by statisticians for statisticians, in the words of its author, John Chambers. From the moment it appeared, the language was very well received and was tested by generations of rather demanding statistician users. It can be considered widely known and accepted in the world statistical community. A number of critical epidemiological, ecological and financial models were implemented in S and are still in use worldwide and in many fields. As a language, from my point of view as a «writing user», S was a rather pleasant alternative to the SAS language.

From my personal experience: I got acquainted with S and had my first lessons in it in the early 1990s from statisticians, experts in CART, with whom I crossed paths in the research projects of that time.

By many accounts (and, in my view, without much exaggeration) R is one of the most successful open source projects. It is distributed freely through dozens of mirrors worldwide under the GNU license.
The authors have flatly refused every proposal to commercialize the project, although today there is reason to believe that the number of installed copies of R in the world exceeds the combined number of copies of all other statistical analysis systems.

From the very beginning and to this day the project has earned my deepest respect (bordering on delight) for its stability, user support, code compatibility and so on, which I would sum up in one concept: culture.
However, that last sentence belongs rather to the following subsections.

Where S came from and how it relates to R

Wikipedia will undoubtedly give you far more text.
I will only note what I consider important for understanding the place of S and R in this world.

Bell Laboratories (a.k.a. Bell Labs, AT&T Bell Laboratories) is well known in the history of science and technology, and of IT in particular. Statistical research there was always taken quite seriously and was just as seriously supported by all available computing means (read: tons of Fortran and Lisp code).

What later became the S language arose in the 1970s on the initiative and under the direction of John Chambers, as a set of scripts that made it easier to "feed" data to the Fortran code. In other words, the cornerstones were interactive manipulation of data, compactness, ease of writing and reading the code, and decent output of tables and graphs to various devices.

The design of the language syntax provided for data structures of practically any complexity, along with means for describing specific statistical tasks and objects: statistical tests, models and so on.

In 1984 the language got its name and its own "Bible" (the book by Becker and Chambers, S: An Interactive Environment for Data Analysis and Graphics, was published), and began to include by default practically the complete «gentleman's set» of a statistician and "probabilist": distributions, random number generators, statistical tests, many standard statistical analyses, matrix operations and so on, not to mention a well-developed system of scientific graphics. Most importantly, it became available to users around the world for a fairly moderate price.

In 1988 (when another book, The New S Language, was published) it was reworked using OOP: everything became an object with fairly sensible default values, open to modification, with elements of self-documentation and so on.

At the same time the laboratories published the source code, and the Bell Labs S became free for students and for scientific use. All of this was somehow connected with the breakup of AT&T, but those details do not interest me much.

Commercial implementations of the S language existed and probably still exist. I have come across S-Plus and S2000. At various times they were supported by different companies, which mostly live (lived?) off the support of applications previously written in S. These post-Bell versions of S had a new version of the OOP engine, but for the ordinary user the change was almost painless as far as compatibility of historical code is concerned.

R is the only non-commercial implementation of the S language that is completely independent of the original Bell Labs one.

And, through a kind of agreement that is rare these days and in a way I cannot imagine, the developers of the current versions of commercial S and non-commercial R maintain practically complete compatibility and continuity between them.

And now R

Behind every significant phenomenon in this life there is some charismatic person. This could even serve as a definition of a phenomenon's significance.

In the case of R there are three such people.
I have already mentioned John Chambers.

Ross Ihaka, a student and then a research assistant at the Department of Statistics of the University of Auckland, chose as the subject of his dissertation (which he worked on at MIT, USA) the feasibility of building a virtual machine (VM) for statistical programming languages. Lisp (Common Lisp, CL) was chosen as the intermediate language, and a prototype VM that "understood" small subsets of SAS and S was implemented in it.
To finish the dissertation Ross returned to Auckland, where he soon met Robert Gentleman and got carried away by the R project.
Ross never did defend the dissertation, but he already holds degrees from several universities «for cumulative merit». Last year he was awarded the title and the post of Associate Professor at his home university.

Robert Gentleman, another statistician with a passion for programming, originally from Canada, was at the University of Auckland on a placement (he was then working in Australia) and suggested to Ross that they «write some little language».
According to the legend I heard from these "founding fathers", in barely a month, in a burst of mad enthusiasm, they rewrote in CL practically all the commands of S, including the powerful linear modelling library.

For R's computing engine, following the tradition of the prototype, the well-known, widely accepted and free BLAS library was chosen (with the option of using ATLAS and others that share the same interface).
Paul Murrell, one of Ross's closest friends and also a staff member of the University of Auckland, went out of his way and wrote from scratch (in C, it seems) a graphics engine that fully reproduces the functionality of the one in S.

The result was a free, full-featured package that instantly earned a place in teaching at the University of Auckland and fully matched the descriptions in Chambers's very detailed, high-quality books, which by tradition are published in soft covers with average print quality, but are cheap and accessible.
Some groups of activists in the GNU movement (GIS, for example) adopted R as a platform for scientific computing.

But R gained its truly widest popularity in bioinformatics, when Robert Gentleman, one of the "fathers", who was then involved in work for the company Affymetrix, duplicated all the functionality of the company's commercial software and launched (not alone, of course) the open source project Bioconductor. Bioconductor is now the undisputed leader of open source bioinformatics for all the "-omics" (genomics, proteomics, metabolomics etc.).

The common interface language for this riot of bioinformatic imagination became, naturally, R.

The circle closed when Chambers, the creator of the S language, retired and joined the group of active R developers as a full member.

What I love it for (list)

  1. Interactivity, «programming with data»: my favourite style of work
  2. An elegant (to my taste) language: I love lists, data frames, functional programming and lambda functions. Freedom of expression: the same task can be solved in ten ways (which softens the feeling of routine)
  3. «A sober view of this world»: it rarely "crashes" or "hangs", it has logical operations on missing data, run-time error handling (try-error), easy exchange with the system at the level of standard I/O, and so on
  4. A complete set of ready-to-use statistical procedures
  5. It is well documented and well maintained: compatibility, continuity, etc.
  6. It has gathered around itself a humanly pleasant professional community (forums, user conferences and so on)
  7. A well-documented interface to external libraries and functions in everything: Fortran, C, Java. Hence the sea of well-documented libraries on all aspects of statistics and data processing in practically every field of science, but with the main emphasis on bioinformatics/biostatistics; everything is updated regularly and correctly, provided the author is willing
  8. No mandatory GUI in «the basic set». Well, I am not a "mouse" person!
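
A few of the items above (data frames, anonymous functions, logic on missing data, try-error) can be illustrated with a minimal sketch in base R. The data frame `animals` and its column names are invented for illustration only and do not come from the original presentation.

```r
# A small data frame with one missing value (NA); names are hypothetical.
animals <- data.frame(
  id     = c("a1", "a2", "a3", "a4"),
  breed  = c("x", "y", "x", "y"),
  weight = c(410, NA, 395, 430)
)

# Functional style: apply an anonymous function to each breed's weights.
mean_by_breed <- sapply(
  split(animals$weight, animals$breed),
  function(w) mean(w, na.rm = TRUE)
)

# "The same task in ten ways": tapply() gives the same group means.
mean_by_breed_2 <- tapply(animals$weight, animals$breed, mean, na.rm = TRUE)

# Logical operations on missing data follow three-valued logic.
na_and_false <- NA & FALSE   # FALSE: the result does not depend on the NA
na_or_true   <- NA | TRUE    # TRUE, for the same reason
na_and_true  <- NA & TRUE    # NA: the answer really is unknown

# Run-time error handling: try() returns an object of class "try-error"
# instead of stopping the session.
result <- try(log("a"), silent = TRUE)
failed <- inherits(result, "try-error")
```

Typing `mean_by_breed` or `failed` at the prompt shows the values immediately, which is the interactive style of work the first item refers to.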

Outside the list: I simply enjoy the fact that my main working tool has … words fail me.
Which is, in fact, what I am trying to show in this article.

What I use it for and how (examples)

I started writing this section, but stopped.
Otherwise I would never finish.
Well, perhaps some other time.

Myths and truth

R is slow

R is "thin": for calculations it uses the BLAS/LAPACK/ATLAS libraries, so try writing something faster than these good old (mostly) Fortran «workhorses». All critical functions, which as a rule rely on vector operations, are implemented in C.
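
As a rough sketch of what this means in practice, the same computation can be written as an explicit loop and as a vectorized expression. The function names `sum_sq_loop` and `sum_sq_vec` are made up for this illustration, and the timings depend on the machine and on the BLAS that R is linked against, so no numbers are quoted here.

```r
set.seed(1)
x <- runif(1e5)

# Explicit loop: every iteration is executed by the R interpreter.
sum_sq_loop <- function(v) {
  total <- 0
  for (value in v) total <- total + value^2
  total
}

# Vectorized form: the squaring and the summation run in compiled code.
sum_sq_vec <- function(v) sum(v^2)

# Matrix products are passed on to the linked BLAS library.
a <- matrix(runif(200 * 200), nrow = 200)
gram <- crossprod(a)   # computes t(a) %*% a

# Compare elapsed times on your own machine.
loop_time <- system.time(sum_sq_loop(x))
vec_time  <- system.time(sum_sq_vec(x))
```

Both functions return the same value up to floating-point rounding; only the place where the work is done differs.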

R uses computing resources inefficiently, memory in particular

Yes, the developers admit to this sin. But a specialist's working time now costs many times more than any "hardware". Clear the toys off a modern work computer, and with most real data sets you will have no problems with R.

Free software cannot be reliable

It can: Fortran, Linux, C, Lisp, Java etc.

Instead of an Epilogue

As mentioned above, this post is essentially a translation of my presentation for a rather specific target audience, so I will briefly describe that audience.

Many "pure" IT people will have to deal with such people, because food production has long rivalled oil and other energy carriers in the capital it attracts and the profits it turns over. And the capacity of the bioinformatics market in medicine and pharmacology is limited, whatever one may do.

So, my audience is people with a basic education in genetics and breeding, veterinary science, and more rarely biology (mostly molecular). Men and women (more of the latter) who for 20-30-… years have been programming (!) in FORTRAN or VB, who deftly handle Excel tables of 100k rows/columns, who periodically "bring down" with their tasks (and their programming) a Linux compute cluster with 500+ cores and 12 TB of shared memory, and who from time to time demand that disk storage be extended by another ten terabytes.

The methodological basis is an explosive mixture of analysis of variance, as old as the world, with mixed models solved by nothing less than maximum likelihood, «brain-melting» Bayesian networks, etc.

The data are tables of anywhere from a few to tens of thousands of rows, sometimes with 1-5 columns of phenotypes, but more often with tens or hundreds of thousands ("K") of columns of variables that are weakly correlated with each other and with the phenotypes.

And they also have «a good tradition» of looking at everything in terms of kinship (it is genetics, after all). Kinship is traditionally represented as a relationship (pedigree) matrix of size, for example, 40 000 x 40 000 (that is for 40 000 animals). Or (so far, fortunately, only as a plan) 20 000 000 x 20 000 000, in order to "cover" with a single model all 20 million historical animals available in the database (DB2, if anyone is interested, and even Cobol has not yet been "cut out" everywhere...)
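
To get a feel for these sizes, here is a back-of-the-envelope sketch. It assumes the matrix is stored densely as 8-byte double-precision numbers, with no savings from symmetry or sparsity, which is not necessarily how such matrices are handled in practice. The helper `dense_matrix_bytes` is a hypothetical name introduced only for this illustration.

```r
# Bytes needed for an n x n matrix of doubles (8 bytes per element),
# ignoring any symmetry or sparsity savings.
dense_matrix_bytes <- function(n) n^2 * 8

gb_40k <- dense_matrix_bytes(40000) / 1e9    # size in gigabytes
pb_20m <- dense_matrix_bytes(20e6) / 1e15    # size in petabytes
```

Under that assumption the 40 000-animal matrix takes about 12.8 GB, and the 20-million-animal one about 3.2 PB.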

By desks piled high with literature on (simultaneously) Fortran, Java, C#, Scala, Octave and Linux for Dummies you can recognize recent bioinformatics graduates. But somehow many of them quickly leave science to become "coders".

However, I also know of a case of movement in the opposite direction. So R is still useful to many.

This article is a translation of the original post at