ODBCOutput Vs Google Big Query - One Person's Experience

Alan Kirk
Site Admin
Posts: 6606
Joined: Sun May 11, 2008 2:30 am
OLAP Product: TM1
Version: PA2.0.9.18 Classic NO PAW!
Excel Version: 2013 and Office 365
Location: Sydney, Australia

ODBCOutput Vs Google Big Query - One Person's Experience

Post by Alan Kirk »

This is probably somewhat of a niche post since I don't know how widespread the use of Google Big Query (BQ) is in the wider TM1-osphere, much less how prevalent the need to write to it is.

However, I'll put my experience on the searchable record in the hope that it may save some future user from as much teeth grinding and hair pulling as I went through.

We were asked to send our forecast figures (created in TM1) to a Google Big Query table on a regular basis.

This is easy, right? We pull data from BQ all the time so the driver is in place, create a source view, a bit of ODBCOutput, and wham, bam, it's done ma'am.

Eeeeeh... not so much.

So, the ODBCOpen is naturally done in the Prolog, the ODBCOutput in the Data tab, and the ODBCClose in the Epilog.
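
In skeleton form it looked something like this. (A simplified sketch only; the DSN, table and column names here are made up, not the real ones.)

Code: Select all

# Prolog: open the connection once for the whole process
ODBCOpen('BQ_DSN', 'ServiceAccount', 'Password');

# Data: build and fire one INSERT per source row (this turned out to be the slow part)
sSQL = 'INSERT INTO `project.dataset.forecast` (Scenario, Month, Account, Amount) VALUES (';
sSQL = sSQL | '''' | vScenario | ''',''' | vMonth | ''',''' | vAccount | ''',' | NumberToString(vAmount) | ')';
# watch the decimal separator from NumberToString if your locale uses commas
ODBCOutput('BQ_DSN', sSQL);

# Epilog: close the connection
ODBCClose('BQ_DSN');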

Time to test it. Wow, this is taking a while. Time to go downstairs and get a coffee. I get back... sip, OK, it's done, but... looks at the time calc in my control cube... it took how long? I had this in debug mode so I was tracking each write. About 3 seconds per record. That smells like indexing, but BQ isn't my domain, I just feed into and out of it.

For a new year which doesn't have the future months populated yet, that's about 4,000 rows so... not great, but survivable.

The problem is that a full year forecast is several hundred thousand rows. I talk to the BQ guys.

"Maybe it's network traffic; can you shift things like the timestamp and data source name to an Update Where clause after the load has been done?"

OK, I think, not a bad idea. So I do.
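
In practice that just meant dropping those columns from the Insert and tacking something like this onto the Epilog. (Again, the table and column names here are made up.)

Code: Select all

# Epilog: stamp the constant columns on everything that was just loaded
sSQL = 'UPDATE `project.dataset.forecast` SET LoadTimestamp = CURRENT_TIMESTAMP(), DataSource = ''TM1 Forecast'' WHERE LoadTimestamp IS NULL';
ODBCOutput('BQ_DSN', sSQL);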

3 seconds per record again, although the Update Where, which updated every record that had been uploaded, took... 3 seconds in total.

Then things started getting worse. I played around with this a bit further, and randomly and occasionally would get:
"Forecast (Working)", "AUD","2023","M09","Redacted","Redacted","Redacted","Redacted","Redacted","Redacted","Redacted","Redacted","Redacted","Reportable Amount","AmountRedacted",Data Source line (301) Error: Data procedure line (80): Error executing SQL query: "..."
I haven't redacted the error. It was literally "...".

Not always for the same line. Not always for the same elements. And not always.

I talk to the BQ guys again and they suggest that maybe it has something to do with the long run time; "can you do it by month, with each call using a new connection?"

OK, worth a shot. I create a while loop in the master process that calls each month in turn.
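
The master process is nothing more sophisticated than this kind of thing (process and parameter names are made up, and the error handling is trimmed):

Code: Select all

# Master process Prolog: call the upload once per month
# (each call of the child process does its own ODBCOpen / ODBCClose)
iMonth = 1;
WHILE(iMonth <= 12);
  IF(iMonth < 10);
    sMonth = 'M0' | NumberToString(iMonth);
  ELSE;
    sMonth = 'M' | NumberToString(iMonth);
  ENDIF;
  iRet = ExecuteProcess('Push Forecast To BQ', 'pMonth', sMonth);
  # what to do when iRet isn't ProcessExitNormal() is a story for the next few paragraphs
  iMonth = iMonth + 1;
END;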

It still happens. Sometimes. Not all the time. Never on the same rows. Rarely in the same months. As soon as the error occurs the process drops like a brick back to the master one.

I try something else:
  • If the number of completed records (stored in a global variable) does not equal the expected number of records for that month, run a delete query to purge the incomplete month's data. (Incomplete data = rows where the Timestamp field Is Null, since the Epilog, where the timestamp was updated, was never reached on the first call.) Then
  • Have another crack at doing the month.
  • Quit if it still fails after 3 attempts. (A rough sketch of that wrapper follows below.)
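
Roughly like this, with made-up names again, the record counts passed back via global variables, and the master process holding its own connection for the purge:

Code: Select all

# Master process, after the first attempt at a month
nAttempt = 1;
WHILE(nAttempt < 3 & nDoneRecords <> nExpectedRecords);
  # purge the incomplete month; only a completed run ever gets its timestamp stamped
  sSQL = 'DELETE FROM `project.dataset.forecast` WHERE LoadTimestamp IS NULL AND Month = ''' | sMonth | '''';
  ODBCOutput('BQ_DSN', sSQL);
  # have another crack at the month
  ExecuteProcess('Push Forecast To BQ', 'pMonth', sMonth);
  nAttempt = nAttempt + 1;
END;
# if it still doesn't reconcile after 3 attempts, bail out and log an error (not shown)
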
Yaaay, that worked! Except... the row count isn't right. I have more rows in the table than I have in the cube. I pull the uploaded data from BQ and find that somehow, some of the rows were duplicated. Note: not as many as the number that had been written in the first attempt, so no, it's not that the delete command was failing. The duplicated rows weren't necessarily sequential in the data source either.

How is this happening? I have no idea. The BQ guys have no idea. Bear in mind that I'm still in debug mode here and have text logs showing EVERY SINGLE RELEVANT LINE, including the ODBCOutput so I can see which lines had been written. On each attempt the duplicated lines have been written once, and once only. Some of the duplicated lines had been written on the first attempt, some had not.

Also, I still have the time problem.

It occurs to me that since the Update statement was executing in 3 seconds for {mumble} thousand rows, maybe I should try to aggregate the data and send it in batches rather than one row at a time. So...
  • In Metadata I work out the number of non-skipped rows. (Skipped = "written to an invalid combination" or "rounds to zero at a couple of decimal places".)
  • I find that for this interface, I can comfortably fit 300 Value clauses into a string variable with a bit of room to spare, so I work out the number of blocks of 300 rows;
  • I work out the remainder from

    Code: Select all

    Total - (300n)
  • In the Data tab, if there are > 300 rows I start aggregating the values until I hit 300 rows, then commit that block to the database and repeat until I run out of blocks of 300.
  • I then aggregate the remainder, and punch that in as a single block. (The shape of this is sketched below.)
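
Stripped right down (and with made-up names yet again), it looks something like the following. The real process pre-counts the blocks in Metadata rather than just counting as it goes, but the idea is the same.

Code: Select all

# Prolog
sInsertHeader = 'INSERT INTO `project.dataset.forecast` (Scenario, Month, Account, Amount) VALUES ';
sBatch = '';
nInBatch = 0;

# Data (skipped rows never get this far)
sRow = '(''' | vScenario | ''',''' | vMonth | ''',''' | vAccount | ''',' | NumberToString(vAmount) | ')';
IF(sBatch @= '');
  sBatch = sRow;
ELSE;
  sBatch = sBatch | ',' | sRow;
ENDIF;
nInBatch = nInBatch + 1;
IF(nInBatch >= 300);
  # one ODBCOutput for the whole block of 300 VALUES clauses
  ODBCOutput('BQ_DSN', sInsertHeader | sBatch);
  sBatch = '';
  nInBatch = 0;
ENDIF;

# Epilog: punch in the remainder as a single block
IF(nInBatch > 0);
  ODBCOutput('BQ_DSN', sInsertHeader | sBatch);
ENDIF;
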
You guessed it, 1 row takes 3 seconds to commit. 300 rows takes... 3 seconds to commit. OK, so there goes the time problem.

And, incidentally, I haven't had any of the idiopathic command failures since I made that change either. Why? I can't be certain, but my suspicion falls into two areas:
  • The amount of time that the ODBC connection was being held open for. With the "bulked up" Outputs the full year output was taking only about 3 minutes per month. Line by line? Well over a quarter of an hour. That's plenty of time for unexpected timeouts (though the BQ guys checked the specs and said that there shouldn't be any), network gremlins, interference by other processes on the BQ side, whatever, to interfere with the upload. 3 minutes is a much smaller target window.
  • The speed of doing line by line. Yes, it was taking 3 seconds for the Output to happen but the connection was being hit with the next Output request pretty much instantaneously. Granted, it takes barely a second for the 300 rows to be compiled on the TM1 side, but at least that is some breather for the connection.
  • I reiterate that this is speculation on my part.
Lessons Learned
  • If you need to write to BQ via ODBC, don't do it one row at a time. Aggregate the Values clause into as many rows as you can get away with, and hit the sucker with them then.
  • Write common values with an Update Where clause at the end so that you can cram as much variable data as possible into the Insert Into statement.
  • Be assiduous about keeping the time that the ODBC connection is open to the absolute minimum that you can.
  • If you get an idiopathic error, don't assume that you can work around it by re-running part of the upload since you can't be sure that the "do over" will be "clean". Drop that sucker like it's radioactive, purge the uploaded data, and spit an error log saying "Sorry, it didn't work, try again". (But if you follow the first three points, this should very much be an exception, not a rule.)
  • Thus, your hair follicles will be safe, and you should save a few bucks on dental work.
"To them, equipment failure is terrifying. To me, it’s 'Tuesday.' "
-----------
Before posting, please check the documentation, the FAQ, the Search function and FOR THE LOVE OF GLUB the Request Guidelines.
Wim Gielis
MVP
Posts: 3120
Joined: Mon Dec 29, 2008 6:26 pm
OLAP Product: TM1, Jedox
Version: PAL 2.0.9.18
Excel Version: Microsoft 365
Location: Brussels, Belgium

Re: ODBCOutput Vs Google Big Query - One Person's Experience

Post by Wim Gielis »

Thanks for the detailed descriptions.
I have never used Google BQ but it appears that a flat file can be loaded: https://stackoverflow.com/questions/521 ... y-using-bq

Wouldn't that be quicker and safer?

At least I do a BULK INSERT into SQL tables (Epilog of a TI process) and it's almost instantaneous compared to several minutes with ODBCOutput in the Data tab (not bundled and about 10K lines in total).
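
The Epilog call itself is just one statement of this kind (the DSN, file path and table name are only examples):

Code: Select all

# Epilog, after the flat file has been written earlier in the process (e.g. with ASCIIOutput)
sSQL = 'BULK INSERT dbo.Staging_TM1 FROM ''D:\TM1\Export\data.csv'' WITH (FIELDTERMINATOR = '';'', ROWTERMINATOR = ''\n'')';
ODBCOutput('SQLConn', sSQL);
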
Best regards,

Wim Gielis

IBM Champion 2024
Excel Most Valuable Professional, 2011-2014
https://www.wimgielis.com ==> 121 TM1 articles and a lot of custom code
Newest blog article: Deleting elements quickly
Alan Kirk
Site Admin
Posts: 6606
Joined: Sun May 11, 2008 2:30 am
OLAP Product: TM1
Version: PA2.0.9.18 Classic NO PAW!
Excel Version: 2013 and Office 365
Location: Sydney, Australia

Re: ODBCOutput Vs Google Big Query - One Person's Experience

Post by Alan Kirk »

Wim Gielis wrote: Sat Sep 09, 2023 3:40 am Thanks for the detailed descriptions.
I have never used Google BQ but it appears that a flat file can be loaded: https://stackoverflow.com/questions/521 ... y-using-bq

Wouldn't that be quicker and safer?
Not really. We want to move completely away from using files if at all possible. Files need to be secured. Files can be read in a text editor if they fall into the wrong hands. Files take up disk space. You can blow away files after you've loaded them, but if you do that and the file was loaded in an incomplete state you've no way of proving that (other than by there being numbers missing). We're trying to get to a situation where everything we do is direct machine to machine. I know that doing an equivalent load (into an unindexed staging table) into SQL Server is considerably faster, and with certain stuff that I've been doing with SSIS of late, machine to machine has been giving me noticeably better performance than file to machine. (Though of course much will depend on the environment.)

This started with Payroll data (which we don't want to exist anywhere outside of the payroll system and TM1, or the ephemeral connection between them), but the policy has flowed on to other data.

The BQ guys weren't fans of that approach either, since they would need to schedule pulls of the data when it was there... which may not be on a predictable schedule. Their preference was that we push it when it's ready, and I can understand why.
Wim Gielis wrote: Sat Sep 09, 2023 3:40 am At least I do a BULK INSERT into SQL tables (Epilog of a TI process)
I'm pretty sure that BULK INSERT is specific to T-SQL, though other RDBMSs have something similar. (Though BQ is not an RDBMS, of course.) However, that command again comes back to using files, which we really don't want to do any more.

I don't think we're there yet but with network speeds being what they are these days, I can see files dying out for internal transfers. The problem of course is that there are a lot of institutions and government agencies (for example) which only supply data in files, so commands like BULK INSERT will never entirely disappear either.
"To them, equipment failure is terrifying. To me, it’s 'Tuesday.' "
-----------
Before posting, please check the documentation, the FAQ, the Search function and FOR THE LOVE OF GLUB the Request Guidelines.
WilliamSmith
Posts: 40
Joined: Tue Dec 13, 2022 8:54 pm
OLAP Product: TM1 / PA / PAx / PAW
Version: TM1 11
Excel Version: 365

Re: ODBCOutput Vs Google Big Query - One Person's Experience

Post by WilliamSmith »

Alan, I'm not a BQ expert, but have you tried this syntax? Or is this what you meant by batching inserts?

Code: Select all

INSERT INTO `your_project_id.your_dataset_id.your_table_id`
(column1, column2, column3)
VALUES
('value1', 'value2', 'value3'),
('value4', 'value5', 'value6'),
('value7', 'value8', 'value9');
Also, just thinking out loud, is there any way to run the SQL inserts in parallel instead of in series?

If not natively in TI, perhaps you can use another programming language like Python / Java / C# etc. to query TM1, then run your parallel inserts. Add error handling logic around your inserts so that if one fails, it retries X times, etc.
Alan Kirk
Site Admin
Posts: 6606
Joined: Sun May 11, 2008 2:30 am
OLAP Product: TM1
Version: PA2.0.9.18 Classic NO PAW!
Excel Version: 2013 and Office 365
Location: Sydney, Australia

Re: ODBCOutput Vs Google Big Query - One Person's Experience

Post by Alan Kirk »

WilliamSmith wrote: Sat Sep 09, 2023 4:44 am Alan, I'm not a BQ expert, but have you tried this syntax?
Or is this what you meant by batching inserts?
It is indeed what I meant.
WilliamSmith wrote: Sat Sep 09, 2023 4:44 am

Code: Select all

INSERT INTO `your_project_id.your_dataset_id.your_table_id`
(column1, column2, column3)
VALUES
('value1', 'value2', 'value3'),
('value4', 'value5', 'value6'),
('value7', 'value8', 'value9');
That's basically what I was saying above. I aggregate (up to) 300 of the VALUES lines before doing the INSERT INTO. Doing it that way instead of one set of values at a time overcame the time problem. The only thing to watch is that the size of the string doesn't get too large.

This works, so we don't need to worry about calling anything external to the model like a Python process. I prefer the solution with the fewest moving parts.
"To them, equipment failure is terrifying. To me, it’s 'Tuesday.' "
-----------
Before posting, please check the documentation, the FAQ, the Search function and FOR THE LOVE OF GLUB the Request Guidelines.
Drg
Regular Participant
Posts: 159
Joined: Fri Aug 12, 2016 10:02 am
OLAP Product: tm1
Version: 10.2.0 - 10.3.0
Excel Version: 2010

Re: ODBCOutput Vs Google Big Query - One Person's Experience

Post by Drg »

Alan Kirk wrote: Sat Sep 09, 2023 6:21 am
WilliamSmith wrote: Sat Sep 09, 2023 4:44 am Alan, I'm not a BQ expert, but have you tried this syntax?
Or is this what you meant by batching inserts?
It is indeed what I meant.
WilliamSmith wrote: Sat Sep 09, 2023 4:44 am

Code: Select all

INSERT INTO `your_project_id.your_dataset_id.your_table_id`
(column1, column2, column3)
VALUES
('value1', 'value2', 'value3'),
('value4', 'value5', 'value6'),
('value7', 'value8', 'value9');
That's basically what I was saying above. I aggregate (up to) 300 of the VALUES lines before doing the INSERT INTO. Doing it that way instead of one set of values at a time overcame the time problem. The only thing to watch is that the size of the string doesn't get too large.

This works, so we don't need to worry about calling anything external to the model like a Python process. I prefer the solution with the fewest number of moving parts.
I use the Java extension (deprecated, but it still works) to batch-update the table.
1 million lines: ODBCOutput (3-10 lines per batch) vs Java extension export (1,000 lines per batch)
200 sec vs 170 sec