Spark local mode: processing big files on an ordinary laptop

2 years, 10 months ago
Hi all.
A new version of Spark came out on January 4, and write-ups already range from introductions to experience of using it in projects. Spark runs on most operating systems and can be started in local mode even on an ordinary laptop. Given how simple the Spark setup is in this case, it would be a sin not to use its main features. In this article we will look at how to quickly set up processing of a big file (larger than the computer's RAM) on a laptop using ordinary SQL queries. This lets even an untrained user run queries. Connecting an IPython (Jupyter) notebook additionally makes it possible to build full reports. The article walks through a simple example of processing a file; other examples in Python are here.

Drawing elliptic curves with SQL

2 years, 10 months ago
The advantage of the elliptic-curve approach over the integer factorization problem used in RSA, or the discrete logarithm problem used in the Diffie-Hellman algorithm and in DSS, is that it provides equivalent protection with a shorter key.

In general, the equation of an elliptic curve E over the field of real numbers R has the form:

    y^2+a1*x*y+a3*y = x^3+a2*x^2+a4*x+a6

or, in the case of a finite residue ring Z/n:

    y^2+a1*x*y+a3*y = x^3+a2*x^2+a4*x+a6 (mod n)
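
To make the modular form concrete, here is a minimal brute-force sketch in Python (not the SQL used in the article) that lists every pair (x, y) satisfying the congruence for a small modulus. The function name `points_mod_n` and the sample coefficients are illustrative assumptions, not taken from the source.

```python
def points_mod_n(n, a1, a2, a3, a4, a6):
    """List all (x, y) with 0 <= x, y < n satisfying
    y^2 + a1*x*y + a3*y = x^3 + a2*x^2 + a4*x + a6 (mod n).

    Brute force over all n*n pairs, so only suitable for small n.
    """
    points = []
    for x in range(n):
        rhs = (x**3 + a2 * x**2 + a4 * x + a6) % n
        for y in range(n):
            lhs = (y * y + a1 * x * y + a3 * y) % n
            if lhs == rhs:
                points.append((x, y))
    return points


# Hypothetical sample curve: y^2 = x^3 + x + 1 over the residues modulo 5.
pts = points_mod_n(5, a1=0, a2=0, a3=0, a4=1, a6=1)
print(pts)
```

The exhaustive search is quadratic in n; it is meant only to show what "points of the curve modulo n" means, not how such curves are handled at cryptographic sizes.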

Let us set ourselves the task of visualizing an elliptic curve.

Elliptic curve E over the field of real numbers R

If the elliptic curve E is considered over the field of real numbers R, plotting it can be described using only high-school algebra and geometry.

Arguments: N, a1, a2, a3, a4, a6, xmin, xmax
  1. Select the range [xmin, xmax] of the argument x
  2. Mark the required number of values x1, ..., xN on the selected range
  3. Substitute each of the values x1, ..., xN into the equation y^2+a1*x*y+a3*y = x^3+a2*x^2+a4*x+a6 to obtain an ordinary quadratic equation in y
  4. Find the roots of the quadratic equation in y
  5. If the quadratic equation in y has solutions, add two points to the plot
  6. Connect all the "upper" points on the plot with lines, and likewise all the "lower" points
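
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does this in SQL, the function name `curve_points` is hypothetical, the x values are assumed to be evenly spaced, and the final drawing step (connecting the points) is left to whatever plotting tool is at hand.

```python
import math


def curve_points(n, a1, a2, a3, a4, a6, xmin, xmax):
    """Sample y^2 + a1*x*y + a3*y = x^3 + a2*x^2 + a4*x + a6 on [xmin, xmax].

    Returns two lists of (x, y) points: the "upper" and the "lower" branch.
    """
    upper, lower = [], []
    for i in range(n):
        # Steps 1-2: n evenly spaced values of x on the chosen range.
        x = xmin + (xmax - xmin) * i / (n - 1) if n > 1 else xmin
        # Step 3: rewrite the equation as y^2 + b*y + c = 0.
        b = a1 * x + a3
        c = -(x**3 + a2 * x**2 + a4 * x + a6)
        # Step 4: solve the quadratic in y.
        disc = b * b - 4 * c
        if disc < 0:
            continue  # no real roots, so no points for this x
        root = math.sqrt(disc)
        # Step 5: two points per x (they coincide when disc == 0).
        upper.append((x, (-b + root) / 2))
        lower.append((x, (-b - root) / 2))
    return upper, lower


# Hypothetical sample curve: y^2 = x^3 - x + 1 sampled at 5 points on [-2, 2].
upper, lower = curve_points(5, a1=0, a2=0, a3=0, a4=-1, a6=1, xmin=-2.0, xmax=2.0)
print(upper)
print(lower)
```

Step 6 then amounts to drawing one polyline through `upper` and another through `lower`.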

New version of LinqTestable released: a library for testing database queries made through an ORM

2 years, 11 months ago
LinqTestable is a library that helps tests bridge the conceptual gap between OOP and a relational database, which arises because NULL behaves differently in the two paradigms. For example, the comparison NULL == NULL returns true in object-oriented languages but is not true in the relational model (it evaluates to NULL, which a WHERE clause treats as false). In addition, NULL.SomeField returns NULL in the relational model but throws NullReferenceException in C#. LinqTestable is intended to solve this problem.
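
The difference in NULL semantics is easy to reproduce. The sketch below is not LinqTestable and not C#: it uses Python's `None` as a stand-in for an object-language null and an in-memory SQLite database as a stand-in for a relational engine. The `person` table and its columns are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Relational side: NULL = NULL is neither true nor false, it is NULL.
sql_result = conn.execute("SELECT NULL = NULL").fetchone()[0]

# Object side: two null references compare equal.
left, right = None, None
object_result = (left == right)

# Consequence: rows whose join key is NULL never match each other in SQL.
conn.execute("CREATE TABLE person (name TEXT, manager TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("ann", None), ("bob", None)],
)
sql_matches = conn.execute(
    "SELECT COUNT(*) FROM person a JOIN person b ON a.manager = b.manager"
).fetchone()[0]

# The same join over in-memory objects matches every pair of rows.
rows = [("ann", None), ("bob", None)]
object_matches = sum(1 for a in rows for b in rows if a[1] == b[1])

print(sql_result, object_result, sql_matches, object_matches)
conn.close()
```

A query that is tested against in-memory collections can therefore pass while the same query behaves differently against a real database, which is the gap the library is described as addressing.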

Partitioning in PostgreSQL: What? Why? How?

2 years, 11 months ago
Unfortunately, not many people yet actively use table partitioning in PostgreSQL. In my opinion, Hubert Lubaczewski covers it very well in his writing. I offer you one more translation of his article!

Recently I noticed that I increasingly run into cases where partitioning could be used. And although in theory most people know it exists, in practice the feature is not well understood, and some are even rather afraid of it.

So I will try to explain, to the best of my knowledge and ability, what it is, why it should be used, and how to do it.

The story of a 42 GB msdb

2 years, 11 months ago
Recently I found a spare minute to look at why an old test server was slowing down so badly … I had nothing to do with it, but I was overcome by a sporting interest in figuring out what was wrong with it.

First I opened Resource Monitor and looked at the overall load. The sqlserv.exe process was loading the CPU to nearly 100% and had built up a large disk queue of over 300 … while a value above one is already considered a problem.

Analyzing the disk activity, I noticed continuous IO operations in msdb:


I looked at the size of msdb:

SELECT name, size = size * 8. / 1024, space_used = FILEPROPERTY(name, 'SpaceUsed') * 8. / 1024
FROM sys.database_files

and went into facepalm mode:

name         size           space_used
------------ -------------- ---------------
MSDBData     42626.000000   42410.374395
MSDBLog      459.125000     6.859375

The data file occupied 42 GB … After a short break, I started figuring out the reason for such an unhealthy msdb size and how to overcome the server's performance problems.

DataGrip 1.0 (formerly 0xDBE) released: a new IDE for SQL

2 years, 11 months ago
Hi! We have released an IDE for working with databases.

For a year and a half we developed 0xDBE under the Early Access Program (EAP). And now we have decided it is time to draw a line under that work. We thank everyone who tried 0xDBE on their projects and wrote to us; you helped a great deal. We will miss this name too.

The IDE is now called DataGrip.

Supported DBMSs

DataGrip is a universal IDE for working with MySQL, PostgreSQL, Oracle, SQL Server, Sybase, DB2, SQLite, HyperSQL, Apache Derby and H2.

Working with database objects and code generation

DataGrip provides tools for working with database objects. Whether you create or modify a table, or add or change a column, index, or key in an existing one, you can use the graphical interface. Such changes are accompanied by generation of the corresponding script: you can apply the changes to the database right away, or copy the generated DDL statement into the editor and work with the code directly.

How to work with timestamps in PostgreSQL?

2 years, 11 months ago
Working with timestamps in PostgreSQL is poorly covered in Russian-language specialist publications on the Internet and is a frequent source of problems for programmers. I present a translation of material by Hubert Lubaczewski, the author of a popular foreign blog. I hope the article will be useful to you!


From time to time, on IRC or on mailing lists, somebody asks questions that show a deep misunderstanding (or lack of understanding) of timestamps, especially those that take time zones into account. Since I have run into this before, let me explain what timestamps are, how to work with them, and which common pitfalls you may encounter.

Studying the graph-oriented DBMS Neo4j using the Wordnet lexical database as an example

2 years, 11 months ago
Neo4j is a NoSQL database oriented toward storing graphs. The highlight of the product is its declarative query language, Cypher.

Cypher borrowed keywords such as WHERE and ORDER BY from SQL, and syntax from languages as different as Python, Haskell, and SPARQL; the result is a language that lets you query graphs in a visual form resembling ASCII art. For example, I would express the title of this article as the graph (Neo4j)-[we study]->(Wordnet). And that is almost a ready-made database query!

7 mistakes of an ETL developer

2 years, 11 months ago
Data warehouse projects have long been part of the IT infrastructure of most large enterprises. ETL processes are part of these projects, yet developers sometimes make the same mistakes when designing and maintaining them. Some of these mistakes are described in this post.

XML, XPath and threefold grief with performance

2 years, 11 months ago
A trip to Dnipropetrovsk, chronic lack of sleep over the last couple of days, but a pleasant bonus on arriving in Kharkiv … Winter weather that motivates me to write something interesting …

I had long planned to talk about the "pitfalls" of working with XML and XQuery that can lead to tricky performance problems.

In short, anyone who often uses SQL Server and XQuery and likes to parse values out of XML is encouraged to read the following material …

