IslandSQL Episode 4: Expressions
This post was originally published on this site
In the last episode, we extended the IslandSQL grammar covering the complete
lock table statement. However, the support for expressions was very limited. It was not possible to use a date literal or
to_date function to determine a partition to be locked. Time to fix that. In this episode, we will have a closer look at some SQL expressions and how to deal with the complexity of the SQL language.
The full source code is available on GitHub and the binaries are on Maven Central.
Lock Table Partition
To lock a table partition we use the partition name like this:
lock table sales partition (sales_q2_2000) in exclusive mode nowait;
Or we let the Oracle Database determine the partition by passing the values of a partition key like this:
lock table sales partition for (date '2000-04-01') in exclusive mode nowait;
Both statements will lock the same partition of the table
sales. The latter is IMO better since it works also with system-generated partition names, which you might use with interval partitioning.
There are a lot of possibilities to specify the date. Here is a selection of sensible and less sensible alternatives:
lock table sales partition for (to_date('2000-04', 'YYYY-MM')) in exclusive mode nowait; lock table sales partition for (to_date('2000-092', 'YYYY-DDD')) in exclusive mode nowait; lock table sales partition for (timestamp '2000-04-01 08:42:42') in exclusive mode nowait; lock table sales partition for (add_months(trunc(to_date('2000-12-31', 'YYYY-MM-DD'), 'YYYY'), 3)) in exclusive mode nowait; lock table sales partition for (date '2000-01-01' + interval '3' month) in exclusive mode nowait; lock table sales partition for (timestamp '2000-12-31 18:00:00' + 1/4 + -9 * interval '1' month) in exclusive mode nowait;
All these alternatives lock the partition in the
sales table where the data for 1st April 2000 is stored. Using different SQL expressions, of course.
I mentioned in my last post that expressions are the most extensive part of the SQL grammar. However, there are ways to optimise the extent of the grammar. ANTLR 4 helps because it allows defining left recursive parts of a grammar naturally. See the expressions labelled with
binaryExpression in the excerpt of the parser in version 0.4.0 of the grammar below.
expression: expr=STRING # simpleExpressionStringLiteral | expr=NUMBER # simpleExpressionNumberLiteral | K_DATE expr=STRING # dateLiteral | K_TIMESTAMP expr=STRING # timestampLiteral | expr=intervalExpression # intervalLiteral | expr=sqlName # simpleExpressionName | LPAR exprs+=expression (COMMA exprs+=expression)* RPAR # expressionList | expr=caseExpression # caseExpr | operator=unaryOperator expr=expression # unaryExpression | expr=functionExpression # functionExpr | expr=AST # allColumnWildcardExpression | left=expression operator=(AST|SOL) right=expression # binaryExpression | left=expression ( operator=PLUS | operator=MINUS | operator=VERBAR VERBAR ) right=expression # binaryExpression | left=expression operator=K_COLLATE right=expression # binaryExpression | left=expression operator=PERIOD right=expression # binaryExpression ;
In line 13 we deal with multiplication and division. Both sides of the operation allow an expression. Unlike grammars based on ANTLR 3, it is no longer necessary to left-factor this left-recursion in ANTLR 4.
ANTLR 4 solves the ambiguity by prioritizing the alternatives in the order of the definition. As a result, multiplication/division has a higher priority than addition/subtraction and a lower priority than the function expression on line 11.
Left-factoring still might be helpful to optimize the runtime performance of the parser. However, it’s not necessary anymore. And I’m happy with a simpler grammar.
It would be a bad idea to handle each function like
to_char separately in the grammar. Why? It would be more work and we still would need a solution for custom functions. As a result, we generically implement functions. The most naïve grammar definition would look like this:
functionExpression: name=sqlName LPAR (params+=expression (COMMA params+=expression)*)? RPAR ;
Parameterless functions that do not allow parentheses like
sysdate will be treated as
simpleExpressionName. This rule handles parameterless functions that require parentheses like
sys_guid(). And it can handle functions with an unbounded number of parameters.
So what’s the problem? Why do I call this definition “naïve”? Because the following cases are not covered:
- named parameters, e.g.
dbms_utility.get_hash_value(name => 'text1', base => 0, hash_size => 16)
- parameter prefixes, e.g.
- parameter suffixes, e.g.
deterministicin approx_median or partition_by_clause/order_by_clause in approx_rank
- partial analytic clauses, e.g.
within groupin approx_percentile
- analytic clauses, e.g.
In most cases, it makes probably sense to handle them generically. However, there are cases where it’s probably better to define dedicated grammar rules for very special functions such as json_table or xml_table.
You find a less naïve implementation on GitHub. However, it is still incomplete. IMO a good way to assess completeness is to write tests. For function expressions, this means tests for every single function according to the SQL Language Reference. My tests currently cover
cardinality. The next function on the to-do list is cast, which contains the following uncovered grammar constructs:
- parameter prefix
multisetfor a subquery expression, which is also not yet covered
- parameter suffix
- parameter suffix
DEFAULT return_value ON CONVERSION ERROR
Shall we define a dedicated grammar rule for the
cast function? I guess yes. However, I’d probably implement
multiset as a unary operator and the
subquery as part of the
expression rule. For that, we need to implement the grammar for the complete
Knowing the Scope and the Limitations
A typical question I get is “Why are you writing a SQL grammar? There are already some available for ANTLR 4, right?”. That’s an excellent question. The ANLTR organisation on GitHub manages a repository with example grammars. You find the ones for SQL here.
Most of the 3rd party grammars cover just a (large) subset of the underlying languages. They define what they cover in ANTLR or by EBNF. However, they often do not define which versions they cover and they do not define what they don’t cover. As a result, you have to try if the grammar is sufficient for your use case. Furthermore, you have to assess if it is sufficient for future use cases and if it will cover the changes in newer versions. And of course, you have to decide if you are ready to extend/fix the grammar in areas where it does not meet your expectations. You will have to maintain a fork. This can become painful, especially if the grammar contains a lot of stuff you are not interested in and if the existing test suites are incomplete.
Furthermore, I’m not aware of an open-sourced grammar that covers the relevant portions for the Oracle Database and PostgreSQL. Yes, the goal is that IslandSQL supports the recent versions of both dialects as if they were a single dialect.
We develop the grammar iteratively. As a result, there are a lot of interim limitations like the one mentioned regarding the
SQL*Plus Substitution Variables
However, some limitations are supposed to be permanent. Like the one for substitution variables. Substitution variables can contain arbitrary text. They are replaced before the execution of a script. The IslandSQL grammar provides limited support for substitution variables. They can be used in places where a
sqlName is valid. This is basically everywhere you can use an
Here’s an example of a supported usage:
lock table &table_name in exclusive mode wait &seconds;
And here’s an example of an unsupported usage:
lock table dept in &lock_mode mode nowait;
The grammar expects certain keywords at the position of
&lock_mode. Here’s the excerpt of the grammar that should make that clear:
lockTableStatementUnterminated: K_LOCK K_TABLE objects+=lockTableObject (COMMA objects+=lockTableObject)* K_IN lockmode=lockMode K_MODE waitOption=lockTableWaitOption? ; lockMode: K_ROW K_SHARE # rowShareLockMode | K_ROW K_EXCLUSIVE # rowExclusiveLockMode | K_SHARE K_UPDATE # shareUpdateLockMode | K_SHARE # shareLockMode | K_SHARE K_ROW K_EXCLUSIVE # shareRowExclusiveLockMode | K_EXCLUSIVE # exclusiveLockMode ;
And the next excerpt shows how substitution variables are defined in the grammar.
sqlName: unquotedId | QUOTED_ID | substitionVariable ; substitionVariable: AMP AMP? name=substitionVariableName period=PERIOD? ; substitionVariableName: NUMBER | sqlName ;
A substitution variable starts with one or two
&. The name can be either a
1) or a
tableName). And a substitution variable can optionally end on
.. That’s it. This way we can provide limited support for SQL*Plus substitution variables.
The grammar evolves quite nicely. However, expressions are still incomplete. This will be covered with the full support of the
As I’m sure you’ve already found out for yourself, the version of the grammar matches the episodes in this blog post series. And these are the planned versions of IslandSQL with their main features:
- v0.5.0: Fully parse
selectstatement, complete expressions and conditions
- v0.6.0: Fully parse remaining DML statements (
- v0.7.0: PostgreSQL syntax compatibility of implemented statements
- v0.8.0: Fully parse PL/SQL block
- v0.9.0: Fully parse
createstatements of the Oracle Database (
- v0.10.0: pgPL/SQL syntax compatibility (
And after episode 10 the fun begins. We can start to provide value for database developers and others. A linter is one option. But there is more. Stay tuned.
The post IslandSQL Episode 4: Expressions appeared first on Philipp Salvisberg’s Blog.