OBDC, Excel and memory allocation problems

orr721 · August 31, 2017, 9:29am

Error: Couldn't allocate column buffer

Hi there! First time posting on a Rebol/Ren-C/Red forum, I hope this is the right place for my question. I have been testing the ODBC functionality in Ren-C build 2.102.0.3.1 from 10-Aug-2017/5:25:53 on Windows 10 Pro.

Is there any way how to increase the available memory for the ODBC query? Because I have some kind of memory allocation problem while running my queries..

>> insert conn-port "select * from [Sheet0$] where Meno = 'ALENA'"
** Error: Couldn't allocate column buffer!
** Where: insert-odbc --anonymous-- insert
** Near: reduce compose [(sql)] ??

The query should return 1 record from an XLSX file with 13 columns, where 2 of the columns contain strings longer than 1500 characters. When I run the select query without these two columns the query runs fine and I receive my data.

I am a bit confused with this problem because in Rebol2/View I can run the same query with all 100 records (without the where clause) in a single statement without any memory problems. Am I missing something or is it a bug in the ODBC Ren-C implementation? Thank you very much for an answer and for your excellent work.

P.S. running TRACE ON crashes the executable immediately..

gchiu · August 31, 2017, 10:21am

Can you try casting the result eg varchar(1500) ?

hostilefork · August 31, 2017, 10:01pm

Welcome! Yes, this is a good place to ask questions regarding Rebol. I was once a skeptic of the idea of making a forum (I preferred our chat, GitHub Issues, Trello, StackOverflow Q&A...) and wondered if it would just be a burden to have another "place". But I think it turned out to be a good medium for longer-attention-span discussions that don't fit elsewhere.

(Note that for Red, they do most of their support on Gitter... here is their list of rooms)

The ODBC support was an attempt by me to get the ball rolling by hacking on an R3-Alpha extension written by Christian Ensel so that it could work in a minimal sense. I don't know how true that extension was to Rebol2, as I have no real experience with historical ODBC in Rebol (I did not arrive on the scene until the R3 times).

So over the span of a week or so I managed to get something going, against MySQL's ODBC driver on Windows. I concocted a single test file to see what it could do:

https://github.com/metaeducation/ren-c/blob/master/tests/misc/odbc-test.reb

That's about as far as we've gotten, though it's better than nothing!!

So I don't presume to know what it should or should not do. It's mostly on those in the community who have a stake to come in and say what it would take for them to feel engaged and want to be involved. I've got a lot of questions about the PORT! design in the first place, and I also feel like there's a lot more STRING! use than I'd like to see for a dialect. One of Rebol's strengths is supposed to be DSLs, so it would seem a smoother and more integrated SQL using WORD!s and such would be an objective...

We could definitely use some more developers to hack on it with us. C programmers would be great but even writing tests and defining what it should do helps.

In any case, please feel free to introduce yourself and your stake in the project here!

orr721 · August 31, 2017, 11:01am

Can you try casting the result eg varchar(1500) ?

I am sorry, you have to help me a bit here, because I don't know how to do that.

Anyway I have checked the column properties and there is something strange with the 2 mentioned columns:

>> insert conn-port ['columns "Sheet0$"]
== [TABLE_CAT TABLE_SCHEM TABLE_NAME COLUMN_NAME DATA_TYPE TYPE_NAME COLUMN_SIZE BUFFER_LENGTH DECIMAL_DIGITS NUM_PREC_RADIX NULLABLE REMARKS COLUMN_DEF SQL_DATA_TYPE SQL_DATETIME_SUB CHAR_OCTET_LENGTH ORDINAL_POSITION IS_NULLABLE ORDINAL]

>> copy conn-port
== [["C:\exp.xlsx" _ "Sheet0$" "Priezvisko" -9 "VARCHAR" 255 510 _ _ 1 _ _ -9 _ 510 1 "YES" 1]
["C:\exp.xlsx" _ "Sheet0$" "Meno" -9 "VARCHAR" 255 510 _ _ 1 _ _ -9 _ 510 2 "YES" 2]

--snip--

["C:\exp.xlsx" _ "Sheet0$" "Nález" -1 _ 1073741824 1073741824 _ _ 1 _ _ -1 _ 1073741824 7 "YES" 7]
["C:\exp.xlsx" _ "Sheet0$" "Záver" -1 _ 1073741824 1073741824 _ _ 1 _ _ -1 _ 1073741824 8 "YES" 8]

 --snip--
        ]

The problematic columns are "Nález" and "Záver".

Thank you very much.

orr721 · August 31, 2017, 3:00pm

Whoa, after some googling I have found that this works:

insert conn-port "select Meno, {fn convert(Nález, SQL_VARCHAR)} as nalez, {fn convert(Záver, SQL_VARCHAR)} as zaver from [Sheet0$]"
== [Meno nalez zaver]

So thanks for the pointers. The question remains though, shouldn't it work automatically like in Rebol2/View?

gchiu · August 31, 2017, 11:12pm

Rebol2 was a mature product of many decades of work. We don't have the source code but it would seem some of its magic was not safe!

The error message suggested that the ODBC driver was not allocating enough memory for an unexpected result hence my suggestion that you cast the result first so that the driver would know to allocate the right amount of memory.

orr721 · August 31, 2017, 11:25pm

Well, unfortunately I was a bit too quick in making my conclusions. It works, but after the typecast the big columns are truncated to 255 chars.

This specific ODBC driver supports VARCHAR only for length 255 and no other types for larger strings are available (i.e. no LONGVARCHAR, BLOB, etc.).

>> insert conn-port ['types]
== [TYPE_NAME DATA_TYPE COLUMN_SIZE LITERAL_PREFIX LITERAL_SUFFIX CREATE_PARAMS NULLABLE CASE_SENSITIVE SEARCHABLE UNSIGNED_ATTRIBUTE FIXED_PREC_SCALE AUTO_UNIQUE_VALUE LOCAL_TYPE_NAME MINIMUM_SCALE MAXIMUM_SCALE SQL_DATA_TYPE SQL_DATETIME_SUB NUM_PREC_RADIX INTERVAL_PRECISION]                                                                                                          
                                                                                                                                
>> copy conn-port
== [["LOGICAL" -7 1 _ _ _ 0 0 2 _ 0 _ _ 0 0 -7 _ _ _] ["CURRENCY" 2 19 _ _ _ 1 0 2 0 1 0 _ 4 4 2 _ 10 _] ["NUMBER" 8 53 _ _ _ 1 0 2 0 0 0 _ _ _ 8 _ 2 _] ["VARCHAR" 12 255 "'" "'" _ 1 1 3 _ 0 _ _ _ _ 12 _ _ _] ["DATETIME" 93 19 "#" "#" _ 1 0 2 _ 0 _ _ 0 0 9 3 _ _]]

While the ODBC driver itself can work with the big columns without any problems (even though the datatype itself is "unsupported") I have not found a way how to tell the Ren-C ODBC module to accept it. Unfortunately I am not much of a C programmer and really can't fix any C code.

I have tested various SQL string operations and all work with the full column data, i.e. beyond the 255 char limit. So a potential work-around would be to split the big columns in SQL to several 255 char parts and then join them back in Ren-C. Or go back to Rebol2/View where it strangely enough doesn't have this problem.

Regarding myself, I use Rebol mostly for scripting short "glue" code to collect data from various sources, modify it and export it for further use in databases, making statistics, etc. I like it very much because it is small, multiplatform (I have used it on Solaris, NetBSD, MS Windows, Linux, MacOSX, even classic MacOS, ...), it is dynamic, easy to use, etc.. A real swiss army knife for any quick work. I have to admit I still fight with the parse rules for string processing so sometimes I use awk (which is an excellent small tool as well).

Thanks again!

gchiu · August 31, 2017, 11:44pm

If you're trying to do "SQL" on an Excel file, then I'd suggest you consider using Ashley's munge script which he has made compatible with r3/Ren-c.

orr721 · August 31, 2017, 11:52pm

Yeah, I have tried. But even after reinstalling MS Office, installing the dependencies, it doesn't work for me (I have tried yesterday). But I have to admit I do have a strange combination of Office 2003 and Office 2010 Starter on my Windows computer so I am not complaining.

ODBC seemed to me like a more universal, cleaner and faster way. Hey, and it mostly works!

hostilefork · September 1, 2017, 2:54am

I'm not that surprised if SQL statements issued to some random "database" willingly works with an overlong VARCHAR(255), because you're just throwing instructions over to the database; it does all the work on its side, and there's not any issue with whether ODBC understands what it's asking.

What I might find surprising would be if the type indicator is formally marked VARCHAR(255) in its specification, yet using some clients it returns longer than 255 and this is tolerated...and for instance Rebol2 is such a client? Are you sure?

It might be possible, though sketchy. The way ODBC works, is that you provide memory buffers of sufficient size for each column, in a process called binding. It is the case that for some types, if you provide too small a buffer, you can get back a status code indicating this... SQL_SUCCESS_WITH_INFO with SQLSTATE 01004 (Data truncated)

It says:

When data truncation occurs, the application can sometimes allocate a larger buffer and refetch the data; this is not true in all cases. (emphasis mine)

The routine that would be returning SQLSTATE 01004 would be, I guess, SQLFetch...here:

So the deal would be that it's possible to notice you've been lied to about the length and a field wound up truncated, update the binding, and then fetch again perhaps. Of course you're operating in an inconsistent world, where the length limit is a lie, and I'd certainly think it's something any given ODBC client or driver would consider undefined behavior.

If you've got proof it's being done then it has been done, though it would help if that proof came along with source code to see how they were doing it. Hence if you can point to an open source application which has the feature, that would be more useful.

All this said, I don't know how much work we can promise. We'd like to have more participants here, but we don't really have a huge developer "budget" for things beyond our core focus unless people are willing to contribute in kind.

orr721 · September 1, 2017, 7:25am

These are the column properties ODBC is reporting for the big columns (see my second post):

TABLE_CAT C:\exp.xlsx
TABLE_SCHEM
TABLE_NAME Sheet0$
COLUMN_NAME Nález
DATA_TYPE -1
TYPE_NAME
COLUMN_SIZE 1073741824
BUFFER_LENGTH 1073741824
DECIMAL_DIGITS
NUM_PREC_RADIX
NULLABLE 1
REMARKS
COLUMN_DEF
SQL_DATA_TYPE -1
SQL_DATETIME_SUB
CHAR_OCTET_LENGTH 1073741824
ORDINAL_POSITION 7
IS_NULLABLE YES
ORDINAL 7

Please notice TYPE_NAME (empty), COLUMN_SIZE, BUFFER_LENGTH, CHAR_OCTET_LENGTH (1073741824).

So ODBC wants Ren-C to allocate 1 GB of memory and this fails. However Rebol2/View can somehow deduce that this is some kind of a special type (by the SQL_DATA_TYPE property?) and allocates just enough for it to make it work.

Could it be that the datatype -1 means BLOB datatype? I have tried to search through Microsoft ODBC documentation but I have found nothing. They refer to types only by their names not by the constant values.

This is taken from Rebol2/View, so you believe me, (the big columns are number 7 and 8)..

>> insert conn-port "select * from [Sheet0$]"
>> rec: copy conn-port
== [[ ... ]]
>> reclen: copy [] foreach i rec [append reclen length? i/7] print unique sort reclen
446 478 487 495 504 510 537 570 572 575 578 584 588 624 630 645 650 651 656 658 661 663 673 678 679 691 728 730 733 744 756 76
3 801 806 809 819 821 822 828 839 844 879 887 888 907 908 927 950 969 982 996 1006 1012 1015 1033 1054 1064 1073 1074 1077 110
9 1139 1165 1179 1207 1245 1359 1384 1446 1574
>>

I will try to search for some other ODBC implementation where it works and will report. Any tips what would be the easiest to understand and to port later on? Ruby? Scheme?

Thanks.

orr721 · September 1, 2017, 4:36pm

I got it running in Python:

C:\TEMP\py>python
Python 3.5.0 (v3.5.0:374f501f4567, Sep 13 2015, 02:16:59) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyodbc
>>> cnxn = pyodbc.connect(r'DRIVER={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=C:\exp.xlsx;', autocommit=True)
>>> cur = cnxn.cursor()
>>> cur.execute("select * from [Sheet0$]")
<pyodbc.Cursor object at 0x03173E20>
>>> rows = cur.fetchall()
>>> for r in rows:
...     print(len(r[6]), ' ', end='')
...
733  887  656  510  663  828  763  1006  982  678  927  908  1245  1074  806  661  1207  744  1139  839  537  446  756  624  879  809  763  570  1165  733  1064  821  588  1033  584  691  673  730  1446  572  907  645  819  801  1384  487  1179  1109  575  950  888  1012  673  650  844  678  1574  1077  495  996  1015  630  1359  822  679  673  969  578  1054  658  1073  504

The ODBC library is called pyodbc and the source is at: https://github.com/mkleehammer/pyodbc

Thanks again for your consideration..

gchiu · September 1, 2017, 11:28pm

I'd report the code you used, the data so it can be reproduced etc on github as a bug in the meantime.

orr721 · September 3, 2017, 2:06am

I did some more digging..

The memory allocation problem is not Ren-C specific at all. There are several reports from people having the same problem:

in PHP: https://bugs.php.net/bug.php?id=68964,
in MySQL: https://forums.mysql.com/read.php?10,618038,618760,
in a ODBC bridge: http://www.easysoft.com/support/kb/kb00712.html).

According to Microsoft ODBC sqlext.h header file, the SQL_DATA_TYPE -1 corresponds to SQL_LONGVARCHAR. This data-type has no fixed length limit, hence the ODBC reported BUFFER_LENGTH and COLUMN_SIZE of 1073741824 (which equals 1 GB).
So i think the problematic code is the line number 834 of the ODBC_BindColumns function in the ren-c/src/extensions/odbc/mod-odbc.c file, where the buffer size is determined:

case SQL_CHAR:
case SQL_VARCHAR:
case SQL_LONGVARCHAR: // https://stackoverflow.com/a/9547441

--snip--
```
 c->c_type = SQL_C_WCHAR;
```
--snip--

c->buffer_size = sizeof(WCHAR) * (c->column_size + 1);
break;
The Microsoft recommended way how to handle this data-type is to use maximum allowed string size or a maximum known size of the source data. I have checked for Excel where the maximum cell size is 32K. It could be different (larger) for other ODBC databases. Actually that https://stackoverflow.com/a/9547441 reference talks about the 32K limit but as far as I can tell it is not enforced anywhere in the mod-odbc.c code.

So could some one able to compile Ren-C please devise a way how to cap the column_size to 32K and test it..

hostilefork · September 6, 2017, 12:14am

Great, thanks for looking at it and into the code itself...if people are willing to do more of that, I'm sure all this could be improved much faster! I've committed a change.

The Ren-C download page's HTML isn't getting updated right now, as @gchiu is in China and is firewalled out of the place where he runs the script. But the builds are being generated by Travis anyway, and are available by URL...if you just put the commit hash in. e.g. if you were downloading Windows 64-bit debug, the now-outdated last link that was at on the page is:

http://metaeducation.s3.amazonaws.com/travis-builds/0.3.40/r3-a0b5b94-debug.exe

But the shortened hash of the commit above (e6718d6) can be substituted:

http://metaeducation.s3.amazonaws.com/travis-builds/0.3.40/r3-e6718d6-debug.exe

(Builds should be available after all their Travis entries complete, which can take some time as the Mac builds often wait in a queue.)

Of course if people build the system themselves, then they can start making patches, and I'd argue we're doing a reasonable job of keeping it the kind of C you can read with one book's worth of knowledge.

orr721 · September 6, 2017, 3:09pm

Thanks a lot, it is working now.

I have even got it running under both the 32bit and 64bit versions (had to do some black magic first to have both versions of the ODBC drivers installed at the same time).

A minor nitpick is that the 64bit r3-e6718d6-debug.exe version reports a memory leak? after quitting:

>> close c
>> close p

>> quit
*** PG_Mem_Usage = 136436 ***
Memory accounting imbalance: Rebol internally tracks how much
memory it uses to know when to garbage collect, etc.  For
some reason this accounting did not balance to zero on exit.
Run under Valgrind with --leak-check=full --track-origins=yes
to find out why this is happening.

I can test further if you give me any instructions.

hostilefork · September 6, 2017, 5:08pm

I'm not surprised there are presently leaks in the ODBC. It really was just a first cut at taking Christian's code and getting it to build and do something.

I'm more surprised you've managed to accomplish productive work with it.

At this moment there are several higher priorities in the core, but if you're blocked by an issue please bring it up. If your code is open source, it would also be interesting to see what you're doing and perhaps suggest how you could take advantage of new features.

orr721 · September 6, 2017, 6:24pm

This warning is displayed after the r3 runtime quits.

I was not doing anything fancy really, just run the r3-... exe REPL, opened the ODBC connection, run a simple query, closed the connection and quit. It did not crash, freeze or anything.

And it is only displayed by the 64bit r3-...exe. Could it be because of the -debug compile? The 32bit runtime does not have this message.

hostilefork · September 6, 2017, 9:22pm

Yes: the memory balancing is only checked in the debug build, and program termination is the only time you can be sure the memory isn't being held onto for later use. A 32-bit debug build would have the same issue.

If you're only getting the message when you've run an ODBC query, then that's where the problem nearly certainly is.

gchiu · September 16, 2017, 9:34am

These are the listed ODBC bugs I found. If you've found another, please report it so that it can be eventually fixed.