Author Topic: Split and join strings  (Read 2793 times)

0 Members and 1 Guest are viewing this topic.

Offline luke

  • QB64 Developer
  • Seasoned Forum Regular
  • Posts: 284
Split and join strings
« on: February 15, 2019, 04:11:07 AM »
Given a string of words separated by spaces (or any other character), splits it into an array of the words. I've no doubt many people have written a version of this over the years and no doubt there's a million ways to do it, but I thought I'd put mine here so we have at least one version. There's also a join function that does the opposite array -> single string.

Code is hopefully reasonably self explanatory with comments and a little demo. Note, this is akin to Python/JavaScript split/join, PHP explode/implode.

Code: QB64: [Select]
  1. redim words$(0)
  2.  
  3. original$ = "The rain   in Spain  "
  4. print "Original string: "; original$
  5.  
  6. split original$, " ", words$()
  7.  
  8. print "Words:"
  9. for i = lbound(words$) to ubound(words$)
  10.     print words$(i)
  11.  
  12. print "Joined with commas: ";join$(words$(), ",")
  13.  
  14. 'Split in$ into pieces, chopping at every occurrence of delimiter$. Multiple consecutive occurrences
  15. 'of delimiter$ are treated as a single instance. The chopped pieces are stored in result$().
  16. '
  17. 'delimiter$ must be one character long.
  18. 'result$() must have been REDIMmed previously.
  19. sub split(in$, delimiter$, result$())
  20.     redim result$(-1)
  21.     start = 1
  22.     do
  23.         while mid$(in$, start, 1) = delimiter$
  24.             start = start + 1
  25.             if start > len(in$) then exit sub
  26.         wend
  27.         finish = instr(start, in$, delimiter$)
  28.         if finish = 0 then finish = len(in$) + 1
  29.         redim _preserve result$(0 to ubound(result$) + 1)
  30.         result$(ubound(result$)) = mid$(in$, start, finish - start)
  31.         start = finish + 1
  32.     loop while start <= len(in$)
  33.  
  34. 'Combine all elements of in$() into a single string with delimiter$ separating the elements.
  35. function join$(in$(), delimiter$)
  36.     result$ = in$(lbound(in$))
  37.     for i = lbound(in$) + 1 to ubound(in$)
  38.         result$ = result$ + delimiter$ + in$(i)
  39.     next i
  40.     join$ = result$

Offline RhoSigma

  • Seasoned Forum Regular
  • Posts: 456
  • GT v0.13 [sr]
Re: Split and join strings
« Reply #1 on: February 15, 2019, 04:29:43 AM »
Here are my two cents,

words/components can be seperated by arbitrary number of whitespace (TAB/SPACE) and "quoted" sections are taken as one word/component. It also is satisfied with any defined dynamic array, as it REDIMs it as needed.

The program has a couple of input lines at the top, just uncomment one and comment all other to try...

Code: QB64: [Select]
  1. '+---------------+---------------------------------------------------+
  2. '| ###### ###### |     .--. .         .-.                            |
  3. '| ##  ## ##   # |     |   )|        (   ) o                         |
  4. '| ##  ##  ##    |     |--' |--. .-.  `-.  .  .-...--.--. .-.        |
  5. '| ######   ##   |     |  \ |  |(   )(   ) | (   ||  |  |(   )       |
  6. '| ##      ##    |     '   `'  `-`-'  `-'-' `-`-`|'  '  `-`-'`-      |
  7. '| ##     ##   # |                            ._.'                   |
  8. '| ##     ###### |  Sources & Documents placed in the Public Domain. |
  9. '+---------------+---------------------------------------------------+
  10. '|                                                                   |
  11. '| === Parse.bas ===                                                 |
  12. '|                                                                   |
  13. '| == A simple test environment for my ParseLine&() function.        |
  14. '|                                                                   |
  15. '+-------------------------------------------------------------------+
  16. '| Done by RhoSigma, R.Heyder, provided AS IS, use at your own risk. |
  17. '| Find me in the QB64 Forum or mail to support@rhosigma-cw.net for  |
  18. '| any questions or suggestions. Thanx for your interest in my work. |
  19. '+-------------------------------------------------------------------+
  20.  
  21. 'REDIM a$(0)
  22. REDIM a$(3 TO 4)
  23. 'l$ =  "   " + CHR$(34)
  24. l$ = "Hello World - Greetings " + CHR$(34) + "to all" + CHR$(34) + " from Germany"
  25. 'l$ = "RectFill (sysViewTop%), 1, sysViewBot% - sysViewTop% + 1, sysPageWid%, fg%, bg%, ch$"
  26. 'l$ = "SUB RectFill (lin%, col%, hei%, wid%, fg%, bg%, ch$)"
  27. 'l$ = "       ABC   123" + CHR$(34) + " " + CHR$(34) + " X Y Z " + CHR$(34) + " " + CHR$(34) + CHR$(34) + "345  "
  28. 'l$ = "       ABC   123" + CHR$(34) + CHR$(34) + CHR$(34) + " X Y Z " + CHR$(34) + CHR$(34) + CHR$(34) + "345  "
  29. 'l$ = "     " + CHR$(34) + "  ABC  " + CHR$(34) + " 123 " + CHR$(34) + CHR$(34) + " " + CHR$(34) + CHR$(34) + "X Y Z" + CHR$(34) + CHR$(34)
  30. 'l$ = "--testfile=" + CHR$(34) + "C:\My Folder\My File.txt" + CHR$(34) + " --testmode --output=logfile.txt"
  31. 'l$ = "--testfile " + CHR$(34) + "C:\My Folder\My File.txt" + CHR$(34) + " --testmode --output logfile.txt"
  32.  
  33. WIDTH 80, 30
  34. PRINT "square brackets just used for better visualization ..."
  35. PRINT "given input to function:"
  36. PRINT "------------------------"
  37. PRINT " Line: ["; l$; "]"
  38. PRINT "Array: LBOUND ="; LBOUND(a$), "UBOUND ="; UBOUND(a$)
  39. PRINT "result of function call (new UBOUND or -1):"
  40. PRINT "-------------------------------------------"
  41. PRINT "Result:"; ParseLine&(l$, a$(), 0)
  42. PRINT "array dump:"
  43. PRINT "-----------"
  44. FOR x& = LBOUND(a$) TO UBOUND(a$)
  45.     PRINT "Index:"; x&, "Content: ["; a$(x&); "]"; TAB(65); "Length:"; LEN(a$(x&))
  46. NEXT x&
  47.  
  48. '---------------------------------------------------------------------
  49. 'Function:  Parsing a given input line into its single words/components.
  50. '
  51. 'Synopsis:  ub& = ParseLine& (inpLine$, outArray$(), minUB&)
  52. '
  53. 'Result:    ub& --> the UBOUND of the "outArray$()" array after parsing,
  54. '                   is negative (-1) if there was nothing to parse,
  55. '                   the given array remains unchanged in that case
  56. '
  57. 'Inputs:    inpLine$    --> the input line which you want to compute
  58. '           outArray$() --> a REDIMed (dynamic) string array of at least one
  59. '                           element, in which the words will be stored, this
  60. '                           array will be automatically REDIMed as needed,
  61. '                           the array does not need to be SHAREed
  62. '           minUB&      --> the minimum UBOUND the array shall have, use if
  63. '                           you expect a static number of components, so you
  64. '                           may waive to explicit UBOUND checks, unused
  65. '                           elements remain empty, note that the array may
  66. '                           still grow bigger, hence passing zero is ok, if
  67. '                           you don't need this feature
  68. '
  69. 'Notes:     Whitespaces (TAB/SPACE) will separate the components unless
  70. '           they appear in quoted sections, parts of the line enclosed
  71. '           by quotes will be handled as one word/component, but the
  72. '           quotes are removed before saving it in the array. For more
  73. '           specific informations see "Parsing rules" below.
  74. '            The argument array "outArray$()" will be REDIMed by this call,
  75. '           it will always keep the LBOUND but raise/reduce the UBOUND
  76. '           as needed for the number of words/components found, the final
  77. '           UBOUND is retured as result "ub&" of the function call.
  78. '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  79. 'Parsing rules:
  80. ' 1.) Whitespaces (TAB/SPACE) will regulary separate words/components
  81. ' 2.) opening quotes will also separate even without a leading whitespace,
  82. '     hence also finishing the the actual word/component just processed
  83. ' 3.) closing quotes (even without trailing whitespace) will finish the
  84. '     actual (quoted) word/component
  85. ' 4.) quoted sections may be empty ("") and produce an empty array entry
  86. '     in that case
  87. ' 5.) a open quoted section with no closing quote in the remaining line
  88. '     will be closed by end of line, hence it will take all trailing
  89. '     whitespaces as is
  90. ' 6.) with respect to 4. and 5., a opening quote as very last char in the
  91. '     input line will produce an additonal empty array entry (as it is
  92. '     in fact a empty quoted section)
  93. '
  94. 'Quoting rules:
  95. ' Very simple and logic, 1st quote opens, 2nd does close, 3rd will open
  96. ' again, 4th does close again and so on.
  97. '---------------------------------------------------------------------
  98. FUNCTION ParseLine& (inpLine$, outArray$(), minUB&)
  99. '--- so far return nothing ---
  100. ParseLine& = -1
  101. '--- init & check some runtime variables ---
  102. ilen& = LEN(inpLine$): icnt& = 1
  103. IF ilen& = 0 THEN EXIT FUNCTION
  104. oalb& = LBOUND(outArray$): oaub& = UBOUND(outArray$): ocnt& = oalb&
  105. '--- skip preceding whitespaces ---
  106. plSkipWhite:
  107. flag% = 0
  108. WHILE icnt& <= ilen& AND NOT flag%
  109.     ch% = ASC(inpLine$, icnt&)
  110.     IF ch% <> 9 AND ch% <> 32 THEN flag% = -1
  111.     icnt& = icnt& + 1
  112. IF NOT flag% THEN 'nothing else? - then exit
  113.     IF ocnt& > oalb& THEN
  114.         GOTO plEnd
  115.     ELSE
  116.         EXIT FUNCTION
  117.     END IF
  118. '--- redim to clear array on 1st word/component ---
  119. IF ocnt& = oalb& THEN REDIM outArray$(oalb& TO oaub&)
  120. '--- expand array, if required ---
  121. plNextArg:
  122. IF ocnt& > oaub& THEN
  123.     oaub& = oaub& + 10
  124.     REDIM _PRESERVE outArray$(oalb& TO oaub&)
  125. '--- get current word/component until next separator ---
  126. flag% = 0: quot% = 0
  127. WHILE icnt& <= ilen& AND NOT flag%
  128.     IF ch% = 34 AND NOT quot% THEN
  129.         quot% = -1
  130.     ELSEIF ch% = 34 AND quot% THEN
  131.         quot% = 0
  132.     END IF
  133.     IF ch% <> 34 THEN outArray$(ocnt&) = outArray$(ocnt&) + CHR$(ch%)
  134.     ch% = ASC(inpLine$, icnt&)
  135.     IF (NOT quot% AND (ch% = 9 OR ch% = 32 OR ch% = 34)) OR (quot% AND ch% = 34) THEN flag% = -1
  136.     icnt& = icnt& + 1
  137. ocnt& = ocnt& + 1
  138. '--- more words/components following? ---
  139. IF flag% AND ch% = 34 AND NOT quot% AND icnt& <= ilen& GOTO plNextArg
  140. IF flag% THEN
  141.     GOTO plSkipWhite
  142.     IF (NOT quot% AND ch% <> 9 AND ch% <> 32 AND ch% <> 34) OR (quot% AND ch% <> 34) THEN
  143.         outArray$(ocnt& - 1) = outArray$(ocnt& - 1) + CHR$(ch%)
  144.     END IF
  145. '--- final array size adjustment, then exit ---
  146. plEnd:
  147. IF ocnt& - 1 < minUB& THEN ocnt& = minUB& + 1
  148. REDIM _PRESERVE outArray$(oalb& TO (ocnt& - 1))
  149. ParseLine& = ocnt& - 1
  150.  
  151.  
« Last Edit: December 22, 2019, 03:26:33 PM by RhoSigma »
My Projects:   https://www.qb64.org/forum/index.php?topic=809
GuiTools - Another graphic UI framework, supports multiple UI forms/windows in one program.
Libraries - Image processing/Data buffering/MD5/SHA2/LZW etc.
Bonus - Screen Blankers, QB64/Notepad++ setup pack

Offline bplus

  • Forum Resident
  • Posts: 6843
  • b = b + ...
Re: Split and join strings
« Reply #2 on: February 15, 2019, 01:38:44 PM »
And my 2 cents:
My Split sub handles more than one char delimiters, comes in handy for Splitting a file into an array of lines with the delimiter set as Chr($13)+chr$(10) for a .bas for instance. It also handles the space delimiter specially by removing all double spaces before splitting.

Oh, also, I like to use long strings for arrays. For that you need to be able to leave empty spots open for strings eg see the days of week example.

Code: QB64: [Select]
  1. 'split test.bas for qb64 bplus 2018-05-07
  2. ' I think I want to replace my inefficient Wrd function
  3.  
  4. '2018-08-25 reworked for space delimiters and more variable declares
  5. '2019-02-15 add Luke's version to compare
  6. ntests = 5
  7. DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  8.  
  9. a(0) = ""
  10. d(0) = " "
  11. a(1) = " test test    test " 'good no error!
  12. d(1) = " "
  13. a(2) = " test"
  14. d(2) = " "
  15. a(3) = "3d,z6d,z1 10 #d,z5"
  16. d(3) = ",z"
  17. a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  18. d(4) = ", "
  19.  
  20. FOR test = 0 TO ntests - 1
  21.     PRINT: PRINT "splitting {"; a(test); "} with delimeter {"; d(test); "}"
  22.     REDIM myarr(0) AS STRING '<<<<< REDIM forces the creation of a dynamic/resizable array
  23.     Split a(test), d(test), myarr()
  24.     amax = UBOUND(myarr)
  25.     FOR i = 0 TO amax
  26.         PRINT i; ":"; myarr(i)
  27.     NEXT i
  28.     INPUT "press enter for next test... "; wate$
  29.  
  30. ' how about a quick file reader test?
  31. PRINT: INPUT "Press enter for file test, any other + enter quits! "; wate$
  32. IF LEN(wate$) THEN END
  33.  
  34. 'other wise continue
  35. OPEN "Split test.bas" FOR BINARY AS #1 '<<< this file name!!!
  36. ftext$ = SPACE$(LOF(1))
  37. GET #1, , ftext$
  38. Split ftext$, CHR$(13) + CHR$(10), myarr()
  39. FOR i = 0 TO UBOUND(myarr)
  40.     PRINT myarr(i)
  41.     IF i MOD 20 = 19 THEN PRINT: INPUT "press enter for more "; wate$
  42. PRINT "the end"
  43. END ' end program
  44.  
  45.  
  46. 'the space delimiter is such a special case perhaps I should develope a single split for that alone?
  47.  
  48.  
  49. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  50. SUB Split (mystr AS STRING, delim AS STRING, arr() AS STRING)
  51.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  52.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  53.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  54.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  55.  
  56.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  57.  
  58.     'special case if delim is space, probably want to remove all excess space
  59.     IF delim = " " THEN
  60.         copy = RTRIM$(LTRIM$(copy))
  61.         p = INSTR(copy, "  ")
  62.         WHILE p > 0
  63.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  64.             p = INSTR(copy, "  ")
  65.         WEND
  66.     END IF
  67.     curpos = 1
  68.     arrpos = 0
  69.     dpos = INSTR(curpos, copy, delim)
  70.     DO UNTIL dpos = 0
  71.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  72.         arrpos = arrpos + 1
  73.         REDIM _PRESERVE arr(arrpos + 1) AS STRING
  74.         curpos = dpos + LEN(delim)
  75.         dpos = INSTR(curpos, copy, delim)
  76.     LOOP
  77.     arr(arrpos) = MID$(copy, curpos)
  78.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  79.  
  80.  
  81. ' Luke 2019-02-15
  82. 'Split in$ into pieces, chopping at every occurrence of delimiter$. Multiple consecutive occurrences
  83. 'of delimiter$ are treated as a single instance. The chopped pieces are stored in result$().
  84. '
  85. 'delimiter$ must be one character long.
  86. 'result$() must have been REDIMmed previously.
  87. SUB Lsplit (in$, delimiter$, result$())
  88.     REDIM result$(-1)
  89.     start = 1
  90.     DO
  91.         WHILE MID$(in$, start, 1) = delimiter$
  92.             start = start + 1
  93.             IF start > LEN(in$) THEN EXIT SUB
  94.         WEND
  95.         finish = INSTR(start, in$, delimiter$)
  96.         IF finish = 0 THEN finish = LEN(in$) + 1
  97.         REDIM _PRESERVE result$(0 TO UBOUND(result$) + 1)
  98.         result$(UBOUND(result$)) = MID$(in$, start, finish - start)
  99.         start = finish + 1
  100.     LOOP WHILE start <= LEN(in$)
  101.  
  102. 'Combine all elements of in$() into a single string with delimiter$ separating the elements.
  103. FUNCTION join$ (in$(), delimiter$)
  104.     result$ = in$(LBOUND(in$))
  105.     FOR i = LBOUND(in$) + 1 TO UBOUND(in$)
  106.         result$ = result$ + delimiter$ + in$(i)
  107.     NEXT i
  108.     join$ = result$
  109.  
  110.  

I would like to test RhoSigma's but looks like a couple of specific delimiters are used.

I think I had a reason not to redim the arr inside the Split sub, can't recall it now. It does seem more convenient to handle it inside the sub.

Append: you only have to redim once outside the Split sub before the call to it, to let QB64 know you are using that name as a dynamic array.
« Last Edit: February 15, 2019, 03:16:31 PM by bplus »

Offline RhoSigma

  • Seasoned Forum Regular
  • Posts: 456
  • GT v0.13 [sr]
Re: Split and join strings
« Reply #3 on: February 15, 2019, 04:03:16 PM »
I would like to test RhoSigma's but looks like a couple of specific delimiters are used.

??? - I hope you doesn't speak about the CHR$(34) in the example lines, it's a quote (").

The ParseLine&() function is a rather simple (yet incomplete) command line parsing function, whitespace separates options/words, whitespace in this case means any number of TABs and/or SPACEs, and quoted parts are taken as is (ie. incl. whitespace) as you usually would need it on a command line for filenames which contain spaces or any chars with special meanings in the command interpreter suche as %()<>& etc..
My function does otherwise not allow for any user defined delimiters.
My Projects:   https://www.qb64.org/forum/index.php?topic=809
GuiTools - Another graphic UI framework, supports multiple UI forms/windows in one program.
Libraries - Image processing/Data buffering/MD5/SHA2/LZW etc.
Bonus - Screen Blankers, QB64/Notepad++ setup pack

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3572
    • Steve’s QB64 Archive Forum
Re: Split and join strings
« Reply #4 on: February 15, 2019, 08:09:23 PM »
Here's a split routine which I'll toss into the mess as well:

Code: QB64: [Select]
  1. CONST ntests = 5
  2. DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  3. REDIM results(0) AS STRING
  4.  
  5. a(0) = ""
  6. d(0) = " "
  7. a(1) = " test test    test " 'good no error!
  8. d(1) = " "
  9. a(2) = " test"
  10. d(2) = " "
  11. a(3) = "3d,z6d,z1 10 #d,z5"
  12. d(3) = ",z"
  13. a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  14. d(4) = ", "
  15.  
  16. FOR i = 0 TO ntests - 1
  17.     PRINT "Splitting: "; a(i)
  18.     SteveSplit a(i), d(i), results()
  19.     FOR j = 1 TO UBOUND(results)
  20.         PRINT j, results(j)
  21.     NEXT
  22.     SLEEP
  23.  
  24. SUB SteveSplit (text$, delimiter$, storage_array() AS STRING)
  25.     STATIC count AS LONG
  26.     count = count + 1
  27.     u = UBOUND(storage_array)
  28.     IF count > u THEN REDIM _PRESERVE storage_array(u + 1000) AS STRING
  29.     i = INSTR(text$, delimiter$)
  30.     IF i THEN
  31.         storage_array(count) = LEFT$(text$, i - 1)
  32.         SteveSplit MID$(text$, i + LEN(delimiter$)), delimiter$, storage_array()
  33.     ELSE
  34.         storage_array(count) = text$
  35.         REDIM _PRESERVE storage_array(count) AS STRING
  36.         count = 0
  37.     END IF

I was even nice and named it "SteveSplit" so folks can test for speed and compare results if they wish verses the other split routines.

One thing to note:  This *doesn't* strip off any extra leading/trailing spaces.  Why would it, if they're delimiters in our data?

Let's say we have the data of the following: 

1,2,,,5,6

When we INPUT it from a file, we get data of:
"1"
"2"
""
""
"5"
"6"

Those commas are valid delimiters of null strings.  If we're using a space as a delimiter, then shouldn't it also follow the same behavior of the comma?  If the user doesn't want to include/process null strings, then let them ignore them elsewhere in their code.  As far as our data is concerned, they're valid split points, in my opinion.
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3572
    • Steve’s QB64 Archive Forum
Re: Split and join strings
« Reply #5 on: February 16, 2019, 02:14:25 AM »
And a timed comparison of the three routines:

Code: QB64: [Select]
  1. CONST ntests = 5
  2. DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  3. REDIM results1(0) AS STRING
  4. REDIM results2(0) AS STRING
  5. REDIM results3(0) AS STRING
  6.  
  7. CONST Limit = 1000000
  8.  
  9. a(0) = ""
  10. d(0) = " "
  11. a(1) = " test test    test " 'good no error!
  12. d(1) = " "
  13. a(2) = " test"
  14. d(2) = " "
  15. a(3) = "3d,z6d,z1 10 #d,z5"
  16. d(3) = ",z"
  17. a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  18. d(4) = ", "
  19.  
  20. FOR i = 0 TO ntests - 1
  21.     t# = TIMER
  22.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  23.         SteveSplit a(i), d(i), results1()
  24.     NEXT
  25.     t1# = TIMER
  26.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  27.         REDIM results2(0) AS STRING
  28.         Split a(i), d(i), results2()
  29.     NEXT
  30.     t2# = TIMER
  31.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  32.         Lsplit a(i), d(i), results3()
  33.     NEXT
  34.     t3# = TIMER
  35.     PRINT "TEST #"; i; " -- Splitting: "; CHR$(34); a(i); CHR$(34); " with "; CHR$(34); d(i); CHR$(34)
  36.     PRINT USING "###.####     ###.####     ###.####"; t1# - t#, t2# - t1#, t3# - t2#
  37.  
  38.     FOR j = 1 TO UBOUND(results1)
  39.         PRINT j, CHR$(34); results1(j); CHR$(34),
  40.         IF j <= UBOUND(results2) + 1 THEN PRINT CHR$(34); results2(j - 1); CHR$(34),
  41.         IF j <= UBOUND(results3) + 1 THEN PRINT CHR$(34); results3(j - 1); CHR$(34),
  42.         PRINT
  43.     NEXT
  44.     SLEEP
  45.  
  46.  
  47.  
  48.  
  49. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  50. SUB Split (mystr AS STRING, delim AS STRING, arr() AS STRING)
  51.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  52.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  53.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  54.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  55.  
  56.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  57.  
  58.     'special case if delim is space, probably want to remove all excess space
  59.     IF delim = " " THEN
  60.         copy = RTRIM$(LTRIM$(copy))
  61.         p = INSTR(copy, "  ")
  62.         WHILE p > 0
  63.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  64.             p = INSTR(copy, "  ")
  65.         WEND
  66.     END IF
  67.     curpos = 1
  68.     arrpos = 0
  69.     dpos = INSTR(curpos, copy, delim)
  70.     DO UNTIL dpos = 0
  71.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  72.         arrpos = arrpos + 1
  73.         REDIM _PRESERVE arr(arrpos + 1) AS STRING
  74.         curpos = dpos + LEN(delim)
  75.         dpos = INSTR(curpos, copy, delim)
  76.     LOOP
  77.     arr(arrpos) = MID$(copy, curpos)
  78.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  79.  
  80.  
  81. ' Luke 2019-02-15
  82. 'Split in$ into pieces, chopping at every occurrence of delimiter$. Multiple consecutive occurrences
  83. 'of delimiter$ are treated as a single instance. The chopped pieces are stored in result$().
  84. '
  85. 'delimiter$ must be one character long.
  86. 'result$() must have been REDIMmed previously.
  87. SUB Lsplit (in$, delimiter$, result$())
  88.     REDIM result$(-1)
  89.     start = 1
  90.     DO
  91.         WHILE MID$(in$, start, 1) = delimiter$
  92.             start = start + 1
  93.             IF start > LEN(in$) THEN EXIT SUB
  94.         WEND
  95.         finish = INSTR(start, in$, delimiter$)
  96.         IF finish = 0 THEN finish = LEN(in$) + 1
  97.         REDIM _PRESERVE result$(0 TO UBOUND(result$) + 1)
  98.         result$(UBOUND(result$)) = MID$(in$, start, finish - start)
  99.         start = finish + 1
  100.     LOOP WHILE start <= LEN(in$)
  101.  
  102. 'Combine all elements of in$() into a single string with delimiter$ separating the elements.
  103. FUNCTION join$ (in$(), delimiter$)
  104.     result$ = in$(LBOUND(in$))
  105.     FOR i = LBOUND(in$) + 1 TO UBOUND(in$)
  106.         result$ = result$ + delimiter$ + in$(i)
  107.     NEXT i
  108.     join$ = result$
  109.  
  110.  
  111.  
  112.  
  113.  
  114.  
  115.  
  116.  
  117.  
  118.  
  119.  
  120.  
  121.  
  122.  
  123.  
  124.  
  125.  
  126.  
  127. SUB SteveSplit (text$, delimiter$, storage_array() AS STRING)
  128.     STATIC count AS LONG
  129.     count = count + 1
  130.     u = UBOUND(storage_array)
  131.     IF count > u THEN REDIM _PRESERVE storage_array(u + 1000) AS STRING
  132.     i = INSTR(text$, delimiter$)
  133.     IF i THEN
  134.         storage_array(count) = LEFT$(text$, i - 1)
  135.         SteveSplit MID$(text$, i + LEN(delimiter$)), delimiter$, storage_array()
  136.     ELSE
  137.         storage_array(count) = text$
  138.         REDIM _PRESERVE storage_array(count) AS STRING
  139.         count = 0
  140.     END IF

On my machine, the test speeds are as follows for:

               SteveSplit , Split (bplus), LSplit (Luke)
TEST #0: 0.1650 , 0.4395, 0.3296
TEST #1: 2.3076, 2.3076, 1.5386
TEST #2: 0.4399, 0.5493, 0.5493
TEST #3: 1.0439, 1.2637, 1.5381 (False results)
TEST #4: 1.9229, 2.1431, 2.6924 (False results)

In tests 0, 2, 3, 4, SteveSplit ran fastest. 
In test 1, LSplit was the fastest routine.

The difference in the speeds in Test #1 is a general philosophy of *how* we behave with the act of splitting.  For my routine, we generate several null strings, delimited by spaces.  For everyone else, the extra spaces are simply ignored and removed.

Luke's routine is set to only use a single character as a string delimiter, so for test 3 and 4, it produces false results as we have a 2-character delimiter.

******************

As to WHY my routine tosses out those extra null strings, it's to simply uniformly answer the question of, "How would we behave if if used a FIND AND REPLACE to change all the spaces to commas?"

Instead of a(1) = " test test    test ", what results would we expect if a(1) = ",test,test,,,,test," and we used a comma as a delimiter instead of a space?  Would we not then count the characters between each comma as as null-string?

I'd think so, and if two commas side-by-side designate a null-string result between them, then I feel like two spaces side-by-side should do the same for space delimited data.  This concept also allows me to perfectly reproduce the original data when using a join function, without automatically losing those spaces which may (or may not) have been important for formatting or data purposes.

If I don't want the extra spaces, all I need to do is ignore those null-strings. 



https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline bplus

  • Forum Resident
  • Posts: 6843
  • b = b + ...
Re: Split and join strings
« Reply #6 on: February 16, 2019, 09:26:49 AM »
Steve:
Quote
The difference in the speeds in Test #1 is a general philosophy of *how* we behave with the act of splitting.  For my routine, we generate several null strings, delimited by spaces.  For everyone else, the extra spaces are simply ignored and removed.

Luke's routine is set to only use a single character as a string delimiter, so for test 3 and 4, it produces false results as we have a 2-character delimiter.

******************

As to WHY my routine tosses out those extra null strings, it's to simply uniformly answer the question of, "How would we behave if if used a FIND AND REPLACE to change all the spaces to commas?"

Instead of a(1) = " test test    test ", what results would we expect if a(1) = ",test,test,,,,test," and we used a comma as a delimiter instead of a space?  Would we not then count the characters between each comma as as null-string?

I'd think so, and if two commas side-by-side designate a null-string result between them, then I feel like two spaces side-by-side should do the same for space delimited data.  This concept also allows me to perfectly reproduce the original data when using a join function, without automatically losing those spaces which may (or may not) have been important for formatting or data purposes.

If I don't want the extra spaces, all I need to do is ignore those null-strings. 

Well in defense of my routine, you have the best of both worlds:
1. If you want to ignore spaces and just use any amount of spaces to separate items then use a space delimiter,
2. otherwise use any other delimiter of any length!

There is a real need for option #1 like when use string a number of numbers together, you won't need to worry about trimming them which would otherwise load the array with unwanted null strings. Having to ignore unwanted null strings would likely pose a needless burden on an app that depends upon the order placement of the items. I have in mind my simple little interpreter that used no punctuation only spaces as a delimiter.

Once you have a good general Split routine working, it would be a snap to modify and optimize for the particular application.
Likely it would need one delimiter or 1 character only delimiters or it may go the n-spaces route. Piece of cake to drop the unneeded general options to speed the thing up for the app. For this reason, Steve your method of REDIM _PRESERVE in large batches of say 1000, instead of at every new item, that is a good idea that I will incorporate in my general Split routine.

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3572
    • Steve’s QB64 Archive Forum
Re: Split and join strings
« Reply #7 on: February 16, 2019, 10:28:33 AM »
Generally, if I want to strip out extra spaces, I’ll just run the string through a find-replace routine to change “  “ to “ “ before running the split routine.  It’s a fast process and saves time overall to do the whole process at once, instead of checking/trimming every loop inside the split routine.

And if you notice, my little routine only resizes the array larger when needed, and since REDIM _PRESERVE is a slow process, it’s better to oversized it by 1000 (or more) elements and then resize down when finished, than it is to resize it to the proper size for each word.  It’s very rare that I’ll ever REDIM _PRESERVE Array(Limit + 1).  REDIM _PRESERVE Array(Limit + 1000) is generally the smallest increment I’ll use in a routine, and then I’ll resize to free unneeded memory when done.  ;)
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline bplus

  • Forum Resident
  • Posts: 6843
  • b = b + ...
Re: Split and join strings
« Reply #8 on: February 16, 2019, 11:21:02 AM »
Generally, if I want to strip out extra spaces, I’ll just run the string through a find-replace routine to change “  “ to “ “ before running the split routine.  It’s a fast process and saves time overall to do the whole process at once, instead of checking/trimming every loop inside the split routine.

And if you notice, my little routine only resizes the array larger when needed, and since REDIM _PRESERVE is a slow process, it’s better to oversized it by 1000 (or more) elements and then resize down when finished, than it is to resize it to the proper size for each word.  It’s very rare that I’ll ever REDIM _PRESERVE Array(Limit + 1).  REDIM _PRESERVE Array(Limit + 1000) is generally the smallest increment I’ll use in a routine, and then I’ll resize to free unneeded memory when done.  ;)

Well Steve, I did strip out all extra spaces (if more than one) before running through the Split code (only if the delimiter was a space) and I did notice increasing the array by chucks instead of by 1 and already mentioned I was going to use that time saver tip!

One more interesting point about your method is that it is recursive which is a favorite technique I like to see employed. But I have heard that recursive techniques aren't as efficient ultimately as non recursive ones ie for every recursive routine there exists a more efficient non recursive one. (The proof of that might be interesting!) If true, then there is a faster version still needing to be revealed. :)
« Last Edit: February 16, 2019, 11:24:42 AM by bplus »

Offline Pete

  • Forum Resident
  • Posts: 2567
  • Cuz I sez so, varmint!
Re: Split and join strings
« Reply #9 on: February 16, 2019, 01:17:11 PM »
I have had to trim extra spaces a lot in html parsing. I made something simple years ago that I still use for that. It's so simple, I'll just code it here...

Code: QB64: [Select]
  1. WIDTH 120, 25
  2. a$ = "   This   is a test      of   eliminating     multiple         spaces    in    a text    line.   "
  3. DO UNTIL INSTR(a$, "  ") = 0
  4.     a$ = MID$(a$, 1, INSTR(a$, "  ") - 1) + MID$(a$, INSTR(a$, "  ") + 1)
  5. a$ = RTRIM$(LTRIM$(a$))
  6.  

I probably should have read previous posts before putting it up. It may have no relevance, but at least it prints text with an apolitical message!

Replacement, splitting, concatenating are all very useful. Good luck with these additions, as I assume this is more stuff for the tool box forum.

Pete

Offline bplus

  • Forum Resident
  • Posts: 6843
  • b = b + ...
Re: Split and join strings
« Reply #10 on: February 17, 2019, 12:44:45 PM »
Ha, ha, ha! Steve pulled a fast one! ;)

With such trivial tests his Split shines but check out a test with 10,000 items to Split and the new Split1000 sub:
Code: QB64: [Select]
  1. 'split test.bas for qb64 bplus 2018-05-07
  2. ' directly below is Steve's Timed test orig tests commented out
  3. ' 2019-02-17 modified by B+ with a new Split1000 sub and a SERIOUS String to Split!
  4.  
  5. '=================================================================== steve;s speed test
  6. CONST ntests = 6
  7. DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  8. REDIM results1(0) AS STRING
  9. REDIM results2(0) AS STRING
  10. REDIM results3(0) AS STRING
  11.  
  12. CONST Limit = 100
  13.  
  14. 'trivial tests just to test accuracy of split
  15. a(0) = ""
  16. d(0) = " "
  17. a(1) = " test test    test " 'good no error!
  18. d(1) = " "
  19. a(2) = " test"
  20. d(2) = " "
  21. a(3) = "3d,z6d,z1 10 #d,z5"
  22. d(3) = ",z"
  23. a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  24. d(4) = ", "
  25.  
  26. 'lets get a serious test in here!! test a 10,000 random number string
  27. FOR i = 1 TO 10000
  28.     s$ = s$ + STR$(RND)
  29. a(5) = s$
  30. d(5) = " "
  31.  
  32. FOR i = 0 TO ntests - 1
  33.     CLS
  34.     t# = TIMER
  35.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  36.         SteveSplit a(i), d(i), results1()
  37.     NEXT
  38.     t1# = TIMER
  39.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  40.         REDIM results2(0) AS STRING
  41.         Split1000 a(i), d(i), results2()
  42.     NEXT
  43.     t2# = TIMER
  44.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  45.         Lsplit a(i), d(i), results3()
  46.     NEXT
  47.     t3# = TIMER
  48.     PRINT "TEST #"; i; " -- Splitting: "; CHR$(34); MID$(a(i), 1, 80); CHR$(34); " with "; CHR$(34); d(i); CHR$(34)
  49.     PRINT: PRINT "Test names:", "SteveSplit", "Split1000", "Lsplit"
  50.     PRINT "Times:",
  51.     PRINT USING "###.####     ###.####     ###.####"; t1# - t#, t2# - t1#, t3# - t2#
  52.     PRINT: PRINT "First Items in Results arrays (up to 10):"
  53.     FOR j = 1 TO 10
  54.         p = 0
  55.         IF j <= UBOUND(results1) THEN p = 1: PRINT j, CHR$(34); MID$(results1(j), 1, 15); CHR$(34),
  56.         IF j <= UBOUND(results2) + 1 THEN p = 1: PRINT CHR$(34); results2(j - 1); CHR$(34),
  57.         IF j <= UBOUND(results3) + 1 THEN p = 1: PRINT CHR$(34); results3(j - 1); CHR$(34),
  58.         IF p THEN PRINT
  59.     NEXT
  60.     PRINT: INPUT "Press enter for next test... "; wate$
  61.  
  62.  
  63.  
  64. '' ================================= My Old Split test Code
  65. 'the space delimiter is such a special case perhaps I should develope a single split for that alone?
  66. ''2018-08-25 reworked for space delimiters and more variable declares
  67. ''2019-02-15 add Luke's version to compare
  68. 'ntests = 5
  69. 'DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  70.  
  71. 'a(0) = ""
  72. 'd(0) = " "
  73. 'a(1) = " test test    test " 'good no error!
  74. 'd(1) = " "
  75. 'a(2) = " test"
  76. 'd(2) = " "
  77. 'a(3) = "3d,z6d,z1 10 #d,z5"
  78. 'd(3) = ",z"
  79. 'a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  80. 'd(4) = ", "
  81. 'REDIM myarr(0) AS STRING '<<<<< REDIM forces the creation of a dynamic/resizable array
  82. 'FOR test = 0 TO ntests - 1
  83. '    PRINT: PRINT "splitting {"; a(test); "} with delimeter {"; d(test); "}"
  84. '    Split1000 a(test), d(test), myarr()
  85. '    amax = UBOUND(myarr)
  86. '    FOR i = 0 TO amax
  87. '        PRINT i; ":"; myarr(i)
  88. '    NEXT i
  89. '    INPUT "press enter for next test... "; wate$
  90. 'NEXT
  91.  
  92. '' how about a quick file reader test?
  93. 'PRINT: INPUT "Press enter for file test, any other + enter quits! "; wate$
  94. 'IF LEN(wate$) THEN END
  95. 'CLS
  96.  
  97. ''other wise continue
  98. 'OPEN "Split test.bas" FOR BINARY AS #1 '<<< this file name!!!
  99. 'ftext$ = SPACE$(LOF(1))
  100. 'GET #1, , ftext$
  101. 'CLOSE #1
  102. 'Split ftext$, CHR$(13) + CHR$(10), myarr()
  103. 'FOR i = 0 TO UBOUND(myarr)
  104. '    PRINT myarr(i)
  105. '    IF i MOD 20 = 19 THEN PRINT: INPUT "press enter for more "; wate$
  106. 'NEXT
  107. 'PRINT "the end"
  108. 'END ' end program
  109.  
  110.  
  111.  
  112. '
  113. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  114. SUB Split1000 (mystr AS STRING, delim AS STRING, arr() AS STRING)
  115.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  116.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  117.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  118.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  119.  
  120.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  121.  
  122.     'special case if delim is space, probably want to remove all excess space
  123.     IF delim = " " THEN
  124.         copy = RTRIM$(LTRIM$(copy))
  125.         p = INSTR(copy, "  ")
  126.         WHILE p > 0
  127.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  128.             p = INSTR(copy, "  ")
  129.         WEND
  130.     END IF
  131.     curpos = 1
  132.     arrpos = 0
  133.     dpos = INSTR(curpos, copy, delim)
  134.     DO UNTIL dpos = 0
  135.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  136.         arrpos = arrpos + 1
  137.         IF arrpos > UBOUND(arr) THEN REDIM _PRESERVE arr(UBOUND(arr) + 1000) AS STRING
  138.         curpos = dpos + LEN(delim)
  139.         dpos = INSTR(curpos, copy, delim)
  140.     LOOP
  141.     arr(arrpos) = MID$(copy, curpos)
  142.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  143.  
  144.  
  145. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  146. SUB Split (mystr AS STRING, delim AS STRING, arr() AS STRING)
  147.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  148.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  149.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  150.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  151.  
  152.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  153.  
  154.     'special case if delim is space, probably want to remove all excess space
  155.     IF delim = " " THEN
  156.         copy = RTRIM$(LTRIM$(copy))
  157.         p = INSTR(copy, "  ")
  158.         WHILE p > 0
  159.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  160.             p = INSTR(copy, "  ")
  161.         WEND
  162.     END IF
  163.     curpos = 1
  164.     arrpos = 0
  165.     dpos = INSTR(curpos, copy, delim)
  166.     DO UNTIL dpos = 0
  167.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  168.         arrpos = arrpos + 1
  169.         REDIM _PRESERVE arr(arrpos + 1000) AS STRING
  170.         curpos = dpos + LEN(delim)
  171.         dpos = INSTR(curpos, copy, delim)
  172.     LOOP
  173.     arr(arrpos) = MID$(copy, curpos)
  174.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  175.  
  176.  
  177. ' Luke 2019-02-15
  178. 'Split in$ into pieces, chopping at every occurrence of delimiter$. Multiple consecutive occurrences
  179. 'of delimiter$ are treated as a single instance. The chopped pieces are stored in result$().
  180. '
  181. 'delimiter$ must be one character long.
  182. 'result$() must have been REDIMmed previously.
  183. SUB Lsplit (in$, delimiter$, result$())
  184.     REDIM result$(-1)
  185.     start = 1
  186.     DO
  187.         WHILE MID$(in$, start, 1) = delimiter$
  188.             start = start + 1
  189.             IF start > LEN(in$) THEN EXIT SUB
  190.         WEND
  191.         finish = INSTR(start, in$, delimiter$)
  192.         IF finish = 0 THEN finish = LEN(in$) + 1
  193.         REDIM _PRESERVE result$(0 TO UBOUND(result$) + 1)
  194.         result$(UBOUND(result$)) = MID$(in$, start, finish - start)
  195.         start = finish + 1
  196.     LOOP WHILE start <= LEN(in$)
  197.  
  198. 'Combine all elements of in$() into a single string with delimiter$ separating the elements.
  199. FUNCTION join$ (in$(), delimiter$)
  200.     result$ = in$(LBOUND(in$))
  201.     FOR i = LBOUND(in$) + 1 TO UBOUND(in$)
  202.         result$ = result$ + delimiter$ + in$(i)
  203.     NEXT i
  204.     join$ = result$
  205.  
  206. SUB SteveSplit (text$, delimiter$, storage_array() AS STRING)
  207.     STATIC count AS LONG
  208.     count = count + 1
  209.     u = UBOUND(storage_array)
  210.     IF count > u THEN REDIM _PRESERVE storage_array(u + 1000) AS STRING
  211.     i = INSTR(text$, delimiter$)
  212.     IF i THEN
  213.         storage_array(count) = LEFT$(text$, i - 1)
  214.         SteveSplit MID$(text$, i + LEN(delimiter$)), delimiter$, storage_array()
  215.     ELSE
  216.         storage_array(count) = text$
  217.         REDIM _PRESERVE storage_array(count) AS STRING
  218.         count = 0
  219.     END IF
  220.  
  221.  

I knew his recursive method was curse worthy. ;-)))
« Last Edit: February 17, 2019, 12:58:02 PM by bplus »

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3572
    • Steve’s QB64 Archive Forum
Re: Split and join strings
« Reply #11 on: February 17, 2019, 01:49:10 PM »
25 seconds doesn’t seem right at all.  I’ll do some digging to see what’s up later and report back.  ;)
https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline SMcNeill

  • QB64 Developer
  • Forum Resident
  • Posts: 3572
    • Steve’s QB64 Archive Forum
Re: Split and join strings
« Reply #12 on: February 17, 2019, 02:46:48 PM »
Try this non-recursive version and see how it performs for you:

Code: QB64: [Select]
  1. 'split test.bas for qb64 bplus 2018-05-07
  2. ' directly below is Steve's Timed test orig tests commented out
  3. ' 2019-02-17 modified by B+ with a new Split1000 sub and a SERIOUS String to Split!
  4.  
  5. '=================================================================== steve;s speed test
  6. CONST ntests = 6
  7. DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  8. REDIM results1(0) AS STRING
  9. REDIM results2(0) AS STRING
  10. REDIM results3(0) AS STRING
  11.  
  12. CONST Limit = 100
  13. CONST NoNull = -1
  14.  
  15. 'trivial tests just to test accuracy of split
  16. a(0) = ""
  17. d(0) = " "
  18. a(1) = " test test    test " 'good no error!
  19. d(1) = " "
  20. a(2) = " test"
  21. d(2) = " "
  22. a(3) = "3d,z6d,z1 10 #d,z5"
  23. d(3) = ",z"
  24. a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  25. d(4) = ", "
  26.  
  27. 'lets get a serious test in here!! test a 10,000 random number string
  28. FOR i = 1 TO 10000
  29.     s$ = s$ + STR$(RND)
  30. a(5) = s$
  31. d(5) = " "
  32.  
  33. FOR i = 0 TO ntests - 1
  34.     CLS
  35.     t# = TIMER
  36.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  37.         SteveSplit2 a(i), d(i), results1(), NoNull
  38.     NEXT
  39.     t1# = TIMER
  40.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  41.         REDIM results2(0) AS STRING
  42.         Split1000 a(i), d(i), results2()
  43.     NEXT
  44.     t2# = TIMER
  45.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  46.         Lsplit a(i), d(i), results3()
  47.     NEXT
  48.     t3# = TIMER
  49.     PRINT "TEST #"; i; " -- Splitting: "; CHR$(34); MID$(a(i), 1, 80); CHR$(34); " with "; CHR$(34); d(i); CHR$(34)
  50.     PRINT: PRINT "Test names:", "SteveSplit", "Split1000", "Lsplit"
  51.     PRINT "Times:",
  52.     PRINT USING "###.####     ###.####     ###.####"; t1# - t#, t2# - t1#, t3# - t2#
  53.     PRINT: PRINT "First Items in Results arrays (up to 10):"
  54.     FOR j = 1 TO 10
  55.         p = 0
  56.         PRINT j,
  57.         IF j <= UBOUND(results1) THEN PRINT CHR$(34); MID$(results1(j), 1, 15); CHR$(34), ELSE PRINT ,
  58.         IF j <= UBOUND(results2) + 1 THEN PRINT CHR$(34); results2(j - 1); CHR$(34), ELSE PRINT ,
  59.         IF j <= UBOUND(results3) + 1 THEN PRINT CHR$(34); results3(j - 1); CHR$(34),
  60.         PRINT
  61.     NEXT
  62.     PRINT: INPUT "Press enter for next test... "; wate$
  63.  
  64.  
  65.  
  66.  
  67. '
  68. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  69. SUB Split1000 (mystr AS STRING, delim AS STRING, arr() AS STRING)
  70.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  71.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  72.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  73.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  74.  
  75.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  76.  
  77.     'special case if delim is space, probably want to remove all excess space
  78.     IF delim = " " THEN
  79.         copy = RTRIM$(LTRIM$(copy))
  80.         p = INSTR(copy, "  ")
  81.         WHILE p > 0
  82.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  83.             p = INSTR(copy, "  ")
  84.         WEND
  85.     END IF
  86.     curpos = 1
  87.     arrpos = 0
  88.     dpos = INSTR(curpos, copy, delim)
  89.     DO UNTIL dpos = 0
  90.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  91.         arrpos = arrpos + 1
  92.         IF arrpos > UBOUND(arr) THEN REDIM _PRESERVE arr(UBOUND(arr) + 1000) AS STRING
  93.         curpos = dpos + LEN(delim)
  94.         dpos = INSTR(curpos, copy, delim)
  95.     LOOP
  96.     arr(arrpos) = MID$(copy, curpos)
  97.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  98.  
  99.  
  100. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  101. SUB Split (mystr AS STRING, delim AS STRING, arr() AS STRING)
  102.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  103.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  104.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  105.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  106.  
  107.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  108.  
  109.     'special case if delim is space, probably want to remove all excess space
  110.     IF delim = " " THEN
  111.         copy = RTRIM$(LTRIM$(copy))
  112.         p = INSTR(copy, "  ")
  113.         WHILE p > 0
  114.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  115.             p = INSTR(copy, "  ")
  116.         WEND
  117.     END IF
  118.     curpos = 1
  119.     arrpos = 0
  120.     dpos = INSTR(curpos, copy, delim)
  121.     DO UNTIL dpos = 0
  122.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  123.         arrpos = arrpos + 1
  124.         REDIM _PRESERVE arr(arrpos + 1000) AS STRING
  125.         curpos = dpos + LEN(delim)
  126.         dpos = INSTR(curpos, copy, delim)
  127.     LOOP
  128.     arr(arrpos) = MID$(copy, curpos)
  129.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  130.  
  131.  
  132. ' Luke 2019-02-15
  133. 'Split in$ into pieces, chopping at every occurrence of delimiter$. Multiple consecutive occurrences
  134. 'of delimiter$ are treated as a single instance. The chopped pieces are stored in result$().
  135. '
  136. 'delimiter$ must be one character long.
  137. 'result$() must have been REDIMmed previously.
  138. SUB Lsplit (in$, delimiter$, result$())
  139.     REDIM result$(-1)
  140.     start = 1
  141.     DO
  142.         WHILE MID$(in$, start, 1) = delimiter$
  143.             start = start + 1
  144.             IF start > LEN(in$) THEN EXIT SUB
  145.         WEND
  146.         finish = INSTR(start, in$, delimiter$)
  147.         IF finish = 0 THEN finish = LEN(in$) + 1
  148.         REDIM _PRESERVE result$(0 TO UBOUND(result$) + 1)
  149.         result$(UBOUND(result$)) = MID$(in$, start, finish - start)
  150.         start = finish + 1
  151.     LOOP WHILE start <= LEN(in$)
  152.  
  153. 'Combine all elements of in$() into a single string with delimiter$ separating the elements.
  154. FUNCTION join$ (in$(), delimiter$)
  155.     result$ = in$(LBOUND(in$))
  156.     FOR i = LBOUND(in$) + 1 TO UBOUND(in$)
  157.         result$ = result$ + delimiter$ + in$(i)
  158.     NEXT i
  159.     join$ = result$
  160.  
  161. SUB SteveSplit (text$, delimiter$, storage_array() AS STRING)
  162.     STATIC count AS LONG
  163.     count = count + 1
  164.     u = UBOUND(storage_array)
  165.     IF count > u THEN REDIM _PRESERVE storage_array(u + 1000) AS STRING
  166.     i = INSTR(text$, delimiter$)
  167.     IF i THEN
  168.         storage_array(count) = LEFT$(text$, i - 1)
  169.         SteveSplit MID$(text$, i + LEN(delimiter$)), delimiter$, storage_array()
  170.     ELSE
  171.         storage_array(count) = text$
  172.         REDIM _PRESERVE storage_array(count) AS STRING
  173.         count = 0
  174.     END IF
  175.  
  176. SUB SteveSplit2 (text$, delimiter$, storage_array() AS STRING, Options AS INTEGER)
  177.     IF Options AND 1 THEN text$ = LTRIM$(text$)
  178.     IF Options AND 2 THEN text$ = RTRIM$(text$)
  179.     count = 1: oldi = 1
  180.     l = LEN(delimiter$)
  181.     u = UBOUND(storage_array)
  182.     IF u < 1 THEN REDIM _PRESERVE storage_array(1000) AS STRING
  183.     DO
  184.         i = INSTR(oldi, text$, delimiter$)
  185.         IF i THEN
  186.             length = i - oldi
  187.             u = UBOUND(storage_array)
  188.             storage_array(count) = MID$(text$, oldi, length)
  189.             IF (Options AND 4) AND (LEN(storage_array(count)) = 0) THEN
  190.                 count = count - 1 'remove null-strings.
  191.             END IF
  192.             oldi = i + l
  193.             i = oldi
  194.             count = count + 1
  195.             IF count > u THEN REDIM _PRESERVE storage_array(u + 1000) AS STRING
  196.         END IF
  197.     LOOP UNTIL i = 0
  198.     storage_array(count) = MID$(text$, oldi)
  199.     REDIM _PRESERVE storage_array(count) AS STRING
  200.  

Added feature:  This now has a simple flag which you can use to make null$ a non-acceptable result for your storage_array, so it'll behave exactly as the other routines.  Just change CONST NoNull = -1  to 0 and see how it toggles between the two methods.

(And both are the fastest splitters yet, so now it might not sound so bad to say, "I pulled a fast one."  ;D )

((And, if you like keeping the null-strings inside a split routine, as I generally do, you can remove those IF checks and make it even speedier. ))



https://github.com/SteveMcNeill/Steve64 — A github collection of all things Steve!

Offline bplus

  • Forum Resident
  • Posts: 6843
  • b = b + ...
Re: Split and join strings
« Reply #13 on: February 17, 2019, 03:13:10 PM »
Performs great Steve!

They are now running neck and neck on the big test and I agree without the IF check for options, SteveSplit2 should be faster yet.
So that means I might have another fix to speed up Split1000 because for my code, the delimiter check for space is a sort of Option check done before running the main part of the Split code.

« Last Edit: February 17, 2019, 03:15:16 PM by bplus »

Offline bplus

  • Forum Resident
  • Posts: 6843
  • b = b + ...
Re: Split and join strings
« Reply #14 on: February 17, 2019, 04:44:29 PM »
One tiny mod to Split1000 and another test that shows it out paces SteveSplit2 when delimiter isn't a space (which is what slows down Split1000 specially clear now that the limit is increased), probably because of the extra decision about Options in SteveSplit2.
Code: QB64: [Select]
  1. 'split test.bas for qb64 bplus 2018-05-07
  2. ' directly below is Steve's Timed test orig tests commented out
  3. ' 2019-02-17 modified by B+ with a new Split1000 sub and a SERIOUS String to Split!
  4.  
  5. '=================================================================== steve;s speed test
  6. CONST ntests = 7
  7. DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  8. REDIM results1(0) AS STRING
  9. REDIM results2(0) AS STRING
  10. REDIM results3(0) AS STRING
  11.  
  12. CONST Limit = 1000
  13. CONST NoNull = -1
  14.  
  15. 'trivial tests just to test accuracy of split
  16. a(0) = ""
  17. d(0) = " "
  18. a(1) = " test test    test " 'good no error!
  19. d(1) = " "
  20. a(2) = " test"
  21. d(2) = " "
  22. a(3) = "3d,z6d,z1 10 #d,z5"
  23. d(3) = ",z"
  24. a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  25. d(4) = ", "
  26.  
  27. 'lets get a serious test in here!! test a 10,000 random number string
  28. FOR i = 1 TO 10000
  29.     s$ = s$ + STR$(RND)
  30. a(5) = s$
  31. d(5) = " "
  32. FOR i = 1 TO 10000
  33.     IF i = 1 THEN s$ = STR$(RND * 1000 \ 1) ELSE s$ = s$ + "," + STR$(RND * 1000 \ 1)
  34. a(6) = s$
  35. d(6) = ", "
  36.  
  37.  
  38. FOR i = 0 TO ntests - 1
  39.     CLS
  40.     t# = TIMER
  41.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  42.         SteveSplit2 a(i), d(i), results1(), NoNull
  43.     NEXT
  44.     t1# = TIMER
  45.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  46.         REDIM results2(0) AS STRING
  47.         Split1000 a(i), d(i), results2()
  48.     NEXT
  49.     t2# = TIMER
  50.     FOR j = 1 TO Limit 'repeat the process multiple times so we can time it.
  51.         Lsplit a(i), d(i), results3()
  52.     NEXT
  53.     t3# = TIMER
  54.     PRINT "TEST #"; i; " -- Splitting: "; CHR$(34); MID$(a(i), 1, 80); CHR$(34); " with "; CHR$(34); d(i); CHR$(34)
  55.     PRINT: PRINT "Test names:", "SteveSplit2", "Split1000", "Lsplit"
  56.     PRINT "Times:",
  57.     PRINT USING "###.####     ###.####     ###.####"; t1# - t#, t2# - t1#, t3# - t2#
  58.     PRINT: PRINT "First Items in Results arrays (up to 10):"
  59.     FOR j = 1 TO 10
  60.         PRINT j,
  61.         IF j <= UBOUND(results1) THEN PRINT CHR$(34); MID$(results1(j), 1, 15); CHR$(34), ELSE PRINT "      ",
  62.         IF j <= UBOUND(results2) + 1 THEN PRINT CHR$(34); results2(j - 1); CHR$(34), ELSE PRINT "      ",
  63.         IF j <= UBOUND(results3) + 1 THEN PRINT CHR$(34); results3(j - 1); CHR$(34), ELSE PRINT "      ",
  64.         PRINT
  65.     NEXT
  66.     PRINT: INPUT "Press enter for next test... "; wate$
  67.  
  68.  
  69.  
  70. '' ================================= My Old Split test Code
  71. 'the space delimiter is such a special case perhaps I should develope a single split for that alone?
  72. ''2018-08-25 reworked for space delimiters and more variable declares
  73. ''2019-02-15 add Luke's version to compare
  74. 'ntests = 5
  75. 'DIM a(ntests - 1) AS STRING, d(ntests - 1) AS STRING
  76.  
  77. 'a(0) = ""
  78. 'd(0) = " "
  79. 'a(1) = " test test    test " 'good no error!
  80. 'd(1) = " "
  81. 'a(2) = " test"
  82. 'd(2) = " "
  83. 'a(3) = "3d,z6d,z1 10 #d,z5"
  84. 'd(3) = ",z"
  85. 'a(4) = "Monday, , Wednesday, THursday, Friday, , Sunday"
  86. 'd(4) = ", "
  87. 'REDIM myarr(0) AS STRING '<<<<< REDIM forces the creation of a dynamic/resizable array
  88. 'FOR test = 0 TO ntests - 1
  89. '    PRINT: PRINT "splitting {"; a(test); "} with delimeter {"; d(test); "}"
  90. '    Split1000 a(test), d(test), myarr()
  91. '    amax = UBOUND(myarr)
  92. '    FOR i = 0 TO amax
  93. '        PRINT i; ":"; myarr(i)
  94. '    NEXT i
  95. '    INPUT "press enter for next test... "; wate$
  96. 'NEXT
  97.  
  98. '' how about a quick file reader test?
  99. 'PRINT: INPUT "Press enter for file test, any other + enter quits! "; wate$
  100. 'IF LEN(wate$) THEN END
  101. 'CLS
  102.  
  103. ''other wise continue
  104. 'OPEN "Split test.bas" FOR BINARY AS #1 '<<< this file name!!!
  105. 'ftext$ = SPACE$(LOF(1))
  106. 'GET #1, , ftext$
  107. 'CLOSE #1
  108. 'Split ftext$, CHR$(13) + CHR$(10), myarr()
  109. 'FOR i = 0 TO UBOUND(myarr)
  110. '    PRINT myarr(i)
  111. '    IF i MOD 20 = 19 THEN PRINT: INPUT "press enter for more "; wate$
  112. 'NEXT
  113. 'PRINT "the end"
  114. 'END ' end program
  115.  
  116. '
  117. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  118. SUB Split1000 (mystr AS STRING, delim AS STRING, arr() AS STRING)
  119.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  120.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  121.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  122.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  123.  
  124.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  125.  
  126.     'special case if delim is space, probably want to remove all excess space
  127.     IF delim = " " THEN
  128.         copy = RTRIM$(LTRIM$(copy))
  129.         p = INSTR(copy, "  ")
  130.         WHILE p > 0
  131.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  132.             p = INSTR(copy, "  ")
  133.         WEND
  134.     END IF
  135.     curpos = 1
  136.     arrpos = 0
  137.     LD = LEN(delim) 'mod
  138.     dpos = INSTR(curpos, copy, delim)
  139.     DO UNTIL dpos = 0
  140.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  141.         arrpos = arrpos + 1
  142.         IF arrpos > UBOUND(arr) THEN REDIM _PRESERVE arr(UBOUND(arr) + 1000) AS STRING
  143.         curpos = dpos + LD
  144.         dpos = INSTR(curpos, copy, delim)
  145.     LOOP
  146.     arr(arrpos) = MID$(copy, curpos)
  147.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  148.  
  149.  
  150. 'notes: REDIM the array(0) to be loaded before calling Split '<<<<<<<<<<<<<<<<<<<<<<< IMPORTANT!!!!
  151. SUB Split (mystr AS STRING, delim AS STRING, arr() AS STRING)
  152.     ' bplus modifications of Galleon fix of Bulrush Split reply #13
  153.     ' http://xmaxw.[abandoned, outdated and now likely malicious qb64 dot net website - don’t go there]/forum/index.php?topic=1612.0
  154.     ' this sub further developed and tested here: \test\Strings\Split test.bas
  155.     DIM copy AS STRING, p AS LONG, curpos AS LONG, arrpos AS LONG, dpos AS LONG
  156.  
  157.     copy = mystr 'make copy since we are messing with mystr when the delimiter is a space
  158.  
  159.     'special case if delim is space, probably want to remove all excess space
  160.     IF delim = " " THEN
  161.         copy = RTRIM$(LTRIM$(copy))
  162.         p = INSTR(copy, "  ")
  163.         WHILE p > 0
  164.             copy = MID$(copy, 1, p - 1) + MID$(copy, p + 1)
  165.             p = INSTR(copy, "  ")
  166.         WEND
  167.     END IF
  168.     curpos = 1
  169.     arrpos = 0
  170.     dpos = INSTR(curpos, copy, delim)
  171.     DO UNTIL dpos = 0
  172.         arr(arrpos) = MID$(copy, curpos, dpos - curpos)
  173.         arrpos = arrpos + 1
  174.         REDIM _PRESERVE arr(arrpos + 1000) AS STRING
  175.         curpos = dpos + LEN(delim)
  176.         dpos = INSTR(curpos, copy, delim)
  177.     LOOP
  178.     arr(arrpos) = MID$(copy, curpos)
  179.     REDIM _PRESERVE arr(arrpos) AS STRING 'need this line? YES to get the ubound correct
  180.  
  181.  
  182. ' Luke 2019-02-15
  183. 'Split in$ into pieces, chopping at every occurrence of delimiter$. Multiple consecutive occurrences
  184. 'of delimiter$ are treated as a single instance. The chopped pieces are stored in result$().
  185. '
  186. 'delimiter$ must be one character long.
  187. 'result$() must have been REDIMmed previously.
  188. SUB Lsplit (in$, delimiter$, result$())
  189.     REDIM result$(-1)
  190.     start = 1
  191.     DO
  192.         WHILE MID$(in$, start, 1) = delimiter$
  193.             start = start + 1
  194.             IF start > LEN(in$) THEN EXIT SUB
  195.         WEND
  196.         finish = INSTR(start, in$, delimiter$)
  197.         IF finish = 0 THEN finish = LEN(in$) + 1
  198.         REDIM _PRESERVE result$(0 TO UBOUND(result$) + 1)
  199.         result$(UBOUND(result$)) = MID$(in$, start, finish - start)
  200.         start = finish + 1
  201.     LOOP WHILE start <= LEN(in$)
  202.  
  203. 'Combine all elements of in$() into a single string with delimiter$ separating the elements.
  204. FUNCTION join$ (in$(), delimiter$)
  205.     result$ = in$(LBOUND(in$))
  206.     FOR i = LBOUND(in$) + 1 TO UBOUND(in$)
  207.         result$ = result$ + delimiter$ + in$(i)
  208.     NEXT i
  209.     join$ = result$
  210.  
  211. SUB SteveSplit2 (text$, delimiter$, storage_array() AS STRING, Options AS INTEGER)
  212.     IF Options AND 1 THEN text$ = LTRIM$(text$)
  213.     IF Options AND 2 THEN text$ = RTRIM$(text$)
  214.     count = 1: oldi = 1
  215.     l = LEN(delimiter$)
  216.     u = UBOUND(storage_array)
  217.     IF u < 1 THEN REDIM _PRESERVE storage_array(1000) AS STRING
  218.     DO
  219.         i = INSTR(oldi, text$, delimiter$)
  220.         IF i THEN
  221.             length = i - oldi
  222.             u = UBOUND(storage_array)
  223.             storage_array(count) = MID$(text$, oldi, length)
  224.             IF (Options AND 4) AND (LEN(storage_array(count)) = 0) THEN
  225.                 count = count - 1 'remove null-strings.
  226.             END IF
  227.             oldi = i + l
  228.             i = oldi
  229.             count = count + 1
  230.             IF count > u THEN REDIM _PRESERVE storage_array(u + 1000) AS STRING
  231.         END IF
  232.     LOOP UNTIL i = 0
  233.     storage_array(count) = MID$(text$, oldi)
  234.     REDIM _PRESERVE storage_array(count) AS STRING
  235.  
  236. SUB SteveSplit (text$, delimiter$, storage_array() AS STRING)
  237.     STATIC count AS LONG
  238.     count = count + 1
  239.     u = UBOUND(storage_array)
  240.     IF count > u THEN REDIM _PRESERVE storage_array(u + 1000) AS STRING
  241.     i = INSTR(text$, delimiter$)
  242.     IF i THEN
  243.         storage_array(count) = LEFT$(text$, i - 1)
  244.         SteveSplit MID$(text$, i + LEN(delimiter$)), delimiter$, storage_array()
  245.     ELSE
  246.         storage_array(count) = text$
  247.         REDIM _PRESERVE storage_array(count) AS STRING
  248.         count = 0
  249.     END IF
  250.  
  251.  
« Last Edit: February 17, 2019, 04:50:21 PM by bplus »